Index Theme
March 2025 7 min read DataPublic SectorArchitecture

Modernising a National Dataset Without Breaking Its Consumers

What maintaining the UK's NaPTAN public-transport dataset taught me about legacy modernisation, validation, and the discipline of public-sector delivery.

Most engineers want to build the thing nobody has built yet. I've come round to a different ambition: maintaining the thing everybody already depends on, and doing it so quietly that nobody has reason to notice. That is the work on NaPTAN — the National Public Transport Access Nodes dataset for the UK Department for Transport. It is the canonical list of every place you can board public transport in Great Britain: every bus stop, rail station, ferry terminal, tram platform. It is public, it is free, and it is load-bearing in a way that is easy to underestimate until you're the one holding it.

The weight of being a single source of truth

A national dataset doesn't ask to be impressive; it asks to be right, every day, for everyone who never thinks about it.

When a dataset is the source of truth, every mistake in it becomes someone else's bug. A bus stop with the wrong coordinates doesn't fail loudly in our systems; it fails in a journey planner three layers downstream, in an app a commuter is staring at in the rain, in a timetabling system that assumes a stop exists where it no longer does. We don't have a clean inventory of every consumer, and that is precisely the point — open public data means you are integrated with people you will never meet. You cannot ask them to upgrade. You cannot put a feature flag in their code. The only lever you have is your own discipline.

That reframes what "quality" means. In a product team, you can ship something 90% right and iterate, because your users and your telemetry tell you where the 10% hurts. With a national reference dataset, the 10% is the whole job. A stop that's been decommissioned but never marked inactive will silently poison routing for years. A duplicated identifier will quietly corrupt joins in systems you've never heard of. The interesting work isn't adding capability — it's defending correctness against entropy, because reality keeps changing underneath the data. Stops move. Operators rename them. Local authorities submit updates in good faith that contradict each other.

Validation is the product

The instinct of a lot of engineers is to treat validation as a gate you pass through on the way to the real work. On NaPTAN, validation is the real work. The schema is not the contract — the schema is the floor. A record can be perfectly schema-valid and still be nonsense: coordinates in the sea, a stop type that doesn't match its physical attributes, a status transition that's physically impossible. So the rules that matter are the semantic ones, and most of them encode hard-won domain knowledge that lives in people's heads and in the failures of years past.

I've learned to be ruthless about making that knowledge executable. Every time someone says "oh, that kind of record is always wrong," that sentence should become a test or a validation rule, not a tribal anecdote. The goal is that the dataset cannot regress in a way a human already understands, because a human shouldn't have to be in the loop to catch it again. This is unglamorous. It is also the difference between a dataset people trust and one they learn to work around — and once consumers start defensively cleaning your data on their side, you've lost the thing that made a national dataset worth having in the first place.

Backwards compatibility is a promise, not a preference

The other half of the discipline is modernisation without betrayal. Legacy formats here are not technical debt to be impatiently paid down — they're a public commitment. Somewhere there is a script written years ago by someone who has moved on, parsing a fixed-format export, and it has every right to keep working. So we modernise the way you renovate a building people still live in: you add, you don't subtract. New APIs and richer representations sit alongside the old outputs; the old outputs keep their exact shape, their exact quirks, sometimes including quirks I'd love to fix but can't, because a quirk that's been stable for a decade is now part of the contract.

That requires a particular temperament, and it's one public-sector delivery teaches you whether you like it or not. You move deliberately. You write the migration guide before you need it. You assume a long deprecation horizon measured in years, not sprints. You treat "this change is invisible to every existing consumer" as the headline success, not the boring caveat. CI/CD, automated checks, generated documentation — all of it exists to make caution cheap and repeatable, so that being careful doesn't mean being slow.

People sometimes ask what's hard about maintaining a list of bus stops. The honest answer is that the list is easy; the guarantee is hard. Anyone can produce data. Producing data that thousands of strangers can build on for a decade without ever having to question it — that's an engineering discipline, and I've come to think it's one of the most underrated ones we have.