June 2026 · 9 min read · System Design · Data · WebAssembly

Profiling a National Dataset in the Browser

A system-design walkthrough of an on-device data-quality engine that profiles all 435,029 NaPTAN stops with nothing leaving the device — and an honest look at where the design breaks at 100×.

Two of the systems I work on point at the same data from opposite ends. By day I help maintain NaPTAN, the UK Department for Transport's record of every public-transport access point in Great Britain — 435,029 stops, published openly, relied on by services I will never meet. Away from that, I rebuilt a slice of the work as something anyone can run: a data-quality engine that profiles the entire national dataset inside a browser tab, with nothing ever leaving the machine. This is the design of that second system, and an honest account of where it would buckle if you pushed it a hundred times harder.

The brief I set myself was deliberately awkward. Open the page; see the true shape of 435,029 records — types, null rates, cardinality, distributions, duplicates, staleness — in under a second; on a laptop; with no server to pay for and not a single row uploaded anywhere. Several of those constraints actively fight each other. The interesting engineering is in which ones you refuse to surrender.

The cheapest, most private, lowest-latency backend is the one you don't run.
[Figure: architecture diagram. Offline build step: NaPTAN export (435,029 rows, CSV/XML) → Build → Parquet (9 MB, ZSTD, columnar) → Static CDN (no backend to run, loaded once). Browser tab, nothing leaves the device: DuckDB-WASM (SQL over Parquet) → Aggregate dashboard (ad-hoc SQL, filters); User drops CSV/TSV (any file, at runtime) → Zig → WASM (single streaming pass) → Profile report (types, nulls, cardinality, dups; ~1,000,000 rows/second; exact counts, not estimates).]
On-device architecture — built offline into a static Parquet, profiled at runtime by two WebAssembly engines, with no data leaving the browser.

The constraints that shaped everything

I treat requirements as forcing functions, so I wrote them down as refusals rather than wishes.

Each of those is easy alone. Together they push the computation off the server and onto the one machine I don't control: the reader's.

Two engines, because there are two questions

There are really two distinct jobs hiding inside "profile this data," and trying to serve both with one tool is how you end up doing neither well.

The first question is "what is the shape of this known dataset?", and I answer it with DuckDB compiled to WebAssembly, running SQL over a columnar Parquet file. The national export is baked, once, into a 9 MB ZSTD-compressed Parquet file and served as a static asset. Columnar layout plus compression is the whole trick: 435,029 rows is 9 MB, not 90, and an ad-hoc question like "how many stops haven't changed in three years?" returns in milliseconds, because a column store reads only the columns the query actually touches.
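As a minimal sketch of the stale-stops question above, the query can be built as a string and handed to anything that runs SQL. The file name `naptan.parquet`, the `ModificationDateTime` column, and the `SqlRunner` interface are assumptions for illustration; the real build may name things differently.

```typescript
// Minimal structural type for "something that can run SQL".
// Assumption: a DuckDB-WASM connection can be passed in here, since it
// exposes an async query method; check the DuckDB-WASM docs for its real type.
interface SqlRunner {
  query(sql: string): Promise<unknown>;
}

// Build the "stops unchanged for N years" query.
// Hypothetical names: the Parquet file and the ModificationDateTime column.
function staleStopsSql(parquetFile: string, years: number): string {
  if (!Number.isInteger(years) || years <= 0) {
    throw new RangeError("years must be a positive integer");
  }
  const file = parquetFile.replace(/'/g, "''"); // escape single quotes
  return [
    "SELECT count(*) AS stale_stops",
    `FROM read_parquet('${file}')`,
    // Only this one column is referenced, so a column store can skip the rest.
    `WHERE CAST(ModificationDateTime AS TIMESTAMP) < current_date - INTERVAL ${years} YEAR`,
  ].join("\n");
}

// Run the query against whatever engine was supplied.
async function countStaleStops(db: SqlRunner, years = 3): Promise<unknown> {
  return db.query(staleStopsSql("naptan.parquet", years));
}
```

The query names a single column, which is the point of the columnar layout: the other columns in the file never need to be decoded. Making the Parquet file reachable by the engine (by URL or by registering it) is left out of this sketch.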

The second question is "profile this arbitrary file I just dropped in", and for that one there is nothing to pre-bake, because the input doesn't exist until runtime. So the profiler is a Zig program compiled to WebAssembly that streams any CSV or TSV in a single pass. Two engines, deliberately: the right tool for SQL over a known dataset is the wrong tool for a full profile of an unknown one.

Why a single streaming pass

The profiler infers, in one read of the bytes, every column's type, null rate, cardinality, value distribution, and any duplicate rows. The design rule that makes this fast is the same one that makes it honest: every statistic has to be computable incrementally, because the file is consumed once and never held whole in memory.
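To make "computable incrementally" concrete, here is a small TypeScript sketch of per-column and per-table state that is updated one row at a time. It is not the Zig profiler itself: the class names, the three-step type widening (integer, float, text), and the row key built with a unit-separator character are all assumptions chosen for illustration.

```typescript
type ColType = "empty" | "integer" | "float" | "text";

// Widening order: a column's inferred type only ever moves to a higher rank.
const RANK: Record<ColType, number> = { empty: 0, integer: 1, float: 2, text: 3 };

function classify(cell: string): ColType {
  if (cell === "") return "empty";
  if (/^[+-]?\d+$/.test(cell)) return "integer";
  if (/^[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?$/.test(cell)) return "float";
  return "text";
}

class ColumnProfile {
  rows = 0;
  nulls = 0;
  type: ColType = "empty";
  private counts = new Map<string, number>(); // exact value distribution

  update(cell: string): void {
    this.rows += 1;
    const t = classify(cell);
    if (t === "empty") {
      this.nulls += 1; // empty cells count as nulls in this sketch
      return;
    }
    if (RANK[t] > RANK[this.type]) this.type = t;
    this.counts.set(cell, (this.counts.get(cell) ?? 0) + 1);
  }

  get nullRate(): number {
    return this.rows === 0 ? 0 : this.nulls / this.rows;
  }

  get cardinality(): number {
    return this.counts.size;
  }
}

class TableProfile {
  columns: ColumnProfile[] = [];
  duplicateRows = 0;
  private seen = new Set<string>();

  // Called once per row; nothing here needs to look back at earlier rows.
  update(row: string[]): void {
    row.forEach((cell, i) => {
      let col = this.columns[i];
      if (col === undefined) {
        col = new ColumnProfile();
        this.columns[i] = col;
      }
      col.update(cell);
    });
    // Simplification: cells containing U+001F could collide under this key.
    const key = row.join("\u001f");
    if (this.seen.has(key)) this.duplicateRows += 1;
    else this.seen.add(key);
  }
}
```

Every statistic is a running value, so the input can be discarded as soon as each row is processed. Exact cardinality and duplicate detection still keep the distinct values seen so far, so memory grows with the number of distinct values rather than with the file size.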

Throughput lands around a million rows a second. That number is mostly the absence of things: no per-row allocation, no boxing of every cell into an object, no garbage collector waking up mid-file. Zig gives me a lean WASM binary over a linear memory I manage by hand, so the hot loop touches bytes, not objects.
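The "bytes, not objects" idea can be sketched in TypeScript as a scanner that walks each chunk byte by byte and keeps only a few counters and flags as state. This is an illustration of the technique, not the Zig code: `ByteScanner` and `asciiBytes` are hypothetical names, and skipping blank lines is a choice made for this sketch.

```typescript
const NEWLINE = 0x0a;
const CARRIAGE_RETURN = 0x0d;
const QUOTE = 0x22;

class ByteScanner {
  rows = 0;
  fields = 0;
  private inQuotes = false; // carried across chunk boundaries
  private rowOpen = false; // true once the current row has any content
  private readonly delimiter: number;

  constructor(delimiter: number) {
    this.delimiter = delimiter; // 0x2c for CSV, 0x09 for TSV
  }

  // Consume one chunk. No strings, arrays, or row objects are created.
  push(chunk: Uint8Array): void {
    for (let i = 0; i < chunk.length; i++) {
      const b = chunk[i];
      if (b === QUOTE) {
        // An escaped quote ("") toggles twice, so the state stays correct.
        this.inQuotes = !this.inQuotes;
        this.rowOpen = true;
      } else if (this.inQuotes) {
        this.rowOpen = true; // delimiters and newlines inside quotes are data
      } else if (b === this.delimiter) {
        this.fields += 1;
        this.rowOpen = true;
      } else if (b === NEWLINE) {
        if (this.rowOpen) {
          this.fields += 1;
          this.rows += 1;
        }
        this.rowOpen = false; // blank lines are skipped
      } else if (b !== CARRIAGE_RETURN) {
        this.rowOpen = true;
      }
    }
  }

  // Flush a final row that has no trailing newline.
  finish(): void {
    if (this.rowOpen) {
      this.fields += 1;
      this.rows += 1;
      this.rowOpen = false;
    }
  }
}

// Helper for examples only: encode an ASCII string as bytes.
function asciiBytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) out[i] = s.charCodeAt(i);
  return out;
}
```

Because the quote state lives on the scanner rather than in the loop, a chunk boundary can fall anywhere, including in the middle of a quoted field, without changing the result.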

Why ship the whole dataset to the client

The reflex is a backend — an endpoint that runs the query server-side and hands back a summary. I went the other way and shipped all 435,029 rows to the browser. The tradeoffs, stated plainly:

The price is a 9 MB first download and a hard ceiling. This is the correct design for this dataset and the wrong one the moment the data is private, too large to ship, or changing by the second. Naming the conditions under which your design is wrong is most of what separates a decision from a habit.

Where it breaks at 100×

The useful question in any review isn't whether a design works — it's where it stops working. Take this to 43 million rows, or to data I'm not allowed to ship at all:
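For a rough sense of scale at 100×, the figures already quoted can be multiplied out. This is back-of-envelope arithmetic only: it assumes the Parquet size grows linearly with row count and that streaming throughput stays flat at a million rows a second, neither of which has been measured at that size.

```typescript
const ROWS = 435_029; // stops in the national dataset
const PARQUET_MB = 9; // size of the baked Parquet file
const ROWS_PER_SECOND = 1_000_000; // approximate streaming throughput

function scale(factor: number) {
  const rows = ROWS * factor;
  return {
    rows,
    parquetMb: PARQUET_MB * factor, // assumes linear growth in file size
    streamingSeconds: rows / ROWS_PER_SECOND, // assumes flat throughput
  };
}

const today = scale(1); // 435,029 rows, 9 MB, about 0.44 s of streaming
const at100x = scale(100); // 43,502,900 rows, 900 MB, about 43.5 s of streaming
```

Under those assumptions, a first download of roughly 900 MB and a profile pass of around 43 seconds are the numbers the current design would have to live with.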

What I was actually designing for

Every choice here is the same choice, made repeatedly: push the work to where it's cheapest and most honest, and keep the architecture's shape stable across three orders of magnitude. The browser version isn't a toy demo — it's the complete national dataset, profiled in under a second, on hardware I don't own, for a running cost of zero. And when someone eventually needs the forty-three-million-row version, the answer isn't a new system. It's this system with a bigger substrate underneath — which is exactly the sentence you want to be able to say, calmly, in a design review.