Profiling a National Dataset in the Browser

Two of the systems I work on point at the same data from opposite ends. By day I help maintain NaPTAN, the UK Department for Transport's record of every public-transport access point in Great Britain — 435,029 stops, published openly, relied on by services I will never meet. Away from that, I rebuilt a slice of the work as something anyone can run: a data-quality engine that profiles the entire national dataset inside a browser tab, with nothing ever leaving the machine. This is the design of that second system, and an honest account of where it would buckle if you pushed it a hundred times harder.

The brief I set myself was deliberately awkward. Open the page; see the true shape of 435,029 records — types, null rates, cardinality, distributions, duplicates, staleness — in under a second; on a laptop; with no server to pay for and not a single row uploaded anywhere. Several of those constraints actively fight each other. The interesting engineering is in which ones you refuse to surrender.

The cheapest, most private, lowest-latency backend is the one you don't run.

On-device architecture — built offline into a static Parquet, profiled at runtime by two WebAssembly engines, with no data leaving the browser.

The constraints that shaped everything

I treat requirements as forcing functions, so I wrote them down as refusals rather than wishes.

Nothing is uploaded. The dataset is public, but the principle isn't: a tool people trust to inspect their data must never quietly become a way to exfiltrate it. Privacy here is a feature, not a footnote.
The whole dataset, never a sample. Sampling is the easy way out, and it hides precisely the long-tail defects that data quality exists to catch. A duplicate that appears once in 435,029 rows is invisible to a one-percent sample and fatal to a downstream join.
Sub-second to first insight. If the read on data health isn't close to instant, nobody runs it twice — and a check nobody runs is decoration.
No backend. Nothing to scale, secure, or bankrupt me. The cheapest, most private, lowest-latency backend is the one you don't run.

Each of those is easy alone. Together they push the computation off the server and onto the one machine I don't control: the reader's.

Two engines, because there are two questions

There are really two distinct jobs hiding inside "profile this data," and trying to serve both with one tool is how you end up doing neither well.

The first question is what is the shape of this known dataset — and I answer it with DuckDB compiled to WebAssembly, running SQL over a columnar Parquet file. The national export is baked, once, into a 9 MB ZSTD-compressed Parquet and served as a static asset. Columnar layout plus compression is the whole trick: 435,029 rows is 9 MB, not 90, and an ad-hoc question like how many stops haven't changed in three years returns in milliseconds, because a column store only ever reads the columns the query actually touches.

The second question is profile this arbitrary file I just dropped in — and that one DuckDB can't pre-bake for, because the input doesn't exist until runtime. So the profiler is a Zig program compiled to WebAssembly that streams any CSV or TSV in a single pass. Two engines, deliberately: the right tool for SQL over a known dataset is the wrong tool for a full profile of an unknown one.

Why a single streaming pass

The profiler infers, in one read of the bytes, every column's type, null rate, cardinality, value distribution, and any duplicate rows. The design rule that makes this fast is the same one that makes it honest: every statistic has to be computable incrementally, because the file is consumed once and never held whole in memory.

Types narrow as evidence arrives. Every column starts as the most specific type — integer — and gets demoted on contradiction: the first `12A` knocks it to string, with no going back. One pass, no second look.
Null rates, extremes, and counts are trivial running accumulators.
Cardinality and duplicates are the honest hard ones. Exact distinct counts mean hashing every value; exact duplicate detection means hashing every row. At national-dataset scale that still fits in memory, so I kept it exact — 0 duplicate ATCO identifiers across 435,029 stops is a fact, not an estimate, and for a correctness tool that distinction is the entire point.

Throughput lands around a million rows a second. That number is mostly the absence of things: no per-row allocation, no boxing of every cell into an object, no garbage collector waking up mid-file. Zig gives me a lean WASM binary over a linear memory I manage by hand, so the hot loop touches bytes, not objects.

Why ship the whole dataset to the client

The reflex is a backend — an endpoint that runs the query server-side and hands back a summary. I went the other way and shipped all 435,029 rows to the browser. The tradeoffs, stated plainly:

Cost: a 9 MB file on a CDN versus a query service to run, secure, patch, and scale. Free versus not-free, forever.
Latency: after the first load, everything is local. Every filter, search, and SQL query is a memory operation, not a network round trip.
Privacy: there is nothing to leak, because nothing is sent.

The price is a 9 MB first download and a hard ceiling. This is the correct design for this dataset and the wrong one the moment the data is private, too large to ship, or changing by the second. Naming the conditions under which your design is wrong is most of what separates a decision from a habit.

Where it breaks at 100×

The useful question in any review isn't whether a design works — it's where it stops working. Take this to 43 million rows, or to data I'm not allowed to ship at all:

The Parquet stops being shippable. You move the columnar store back behind an API — but you keep the shape. Parquet, columnar, DuckDB: the same engine, relocated to a server or pointed at object storage. Nothing about the query model changes.
Exact counts stop fitting in memory. You swap exact hashing for sketches — HyperLogLog for distinct counts, a Bloom filter or a sort-based pass for duplicates — trading a known, bounded error for bounded memory. And you surface that error in the UI, because a silent estimate is worse than an honest one.
The single pass becomes a distributed one. This is where the original design earns its keep: the same incremental accumulators that ran in one WASM loop become partial aggregates you merge across shards. Scaling out is a change of substrate, not a rewrite — because the statistics were built to be mergeable from the start.

What I was actually designing for

Every choice here is the same choice, made repeatedly: push the work to where it's cheapest and most honest, and keep the architecture's shape stable across three orders of magnitude. The browser version isn't a toy demo — it's the complete national dataset, profiled in under a second, on hardware I don't own, for a running cost of zero. And when someone eventually needs the forty-three-million-row version, the answer isn't a new system. It's this system with a bigger substrate underneath — which is exactly the sentence you want to be able to say, calmly, in a design review.