Selected work
Lead Data Architect  ·  UK Department for Transport · via Thoughtworks  ·  United Kingdom  ·  2021 — present

NaPTAN — Great Britain's national stop register

Maintaining the canonical record of every public-transport access point in Great Britain — 435,029 stops consumed by a set of downstream systems no one can enumerate, where a breaking change has no rollback.

435,029 stops modelled, validated & published
0 breaking changes shipped downstream
Open: unbounded set of integrators served

Context

NaPTAN is the National Public Transport Access Nodes dataset — every bus stop, rail station entrance, ferry berth and tram platform in Great Britain, identified, located and classified. It is open data: journey planners, transit apps, local authorities and other government systems consume it directly, with no registration and no controllable list of integrators.

That openness is the whole job. You cannot survey who depends on a field before you change it, because you cannot see them. The dataset is a public API to the country's transport geography, and the contract is implicit, permanent, and held by strangers.

The hard part

The hard part is not the volume of records — it is that backwards-compatibility is an invariant, not a preference. A breaking change ships to consumers who never agreed to a migration window and cannot be reached to coordinate one. There is no rollback for data that has already been downloaded and trusted.

The second hard part is quality. A single bad record — an impossible coordinate, an invalid status transition, a stop quietly marked active years after it was removed — propagates into every downstream journey plan. Bad data fails quietly, in someone else's system, long after it left yours.

Architecture

Local authority submissions -> Semantic validation (the gate) -> Versioned store + API -> Unbounded consumers
Checked at the gate: impossible coords · status transitions · staleness
Submissions are validated against executable domain rules before anything is versioned and published. The gate sits before publication because there is no recall after it.
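As a rough illustration of what the gate checks, here is a minimal sketch of the three rule families named in the diagram. It is not the production code: the `Stop` fields, the status names, the bounding box and the five-year staleness threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Rough bounding box for Great Britain (illustrative values, not the real rule).
GB_LAT = (49.8, 60.9)
GB_LON = (-8.7, 1.8)

# Hypothetical status model: for each previous status, the statuses a
# submission is allowed to move to.
ALLOWED_TRANSITIONS = {
    "pending": {"pending", "active", "inactive"},
    "active": {"active", "inactive"},
    "inactive": {"inactive", "active"},
}

# Assumed threshold: an active stop nobody has verified in five years is suspect.
MAX_UNVERIFIED_AGE = timedelta(days=5 * 365)


@dataclass(frozen=True)
class Stop:
    stop_id: str
    lat: float
    lon: float
    status: str
    last_verified: date


def check_stop(new: Stop, previous: Optional[Stop], today: date) -> List[str]:
    """Return a list of human-readable violations for one submitted stop."""
    problems = []
    in_box = GB_LAT[0] <= new.lat <= GB_LAT[1] and GB_LON[0] <= new.lon <= GB_LON[1]
    if not in_box:
        problems.append(f"{new.stop_id}: coordinate outside Great Britain")
    if previous is not None:
        allowed = ALLOWED_TRANSITIONS.get(previous.status, set())
        if new.status not in allowed:
            problems.append(
                f"{new.stop_id}: status {previous.status} -> {new.status} not allowed"
            )
    if new.status == "active" and today - new.last_verified > MAX_UNVERIFIED_AGE:
        problems.append(f"{new.stop_id}: active but last verified {new.last_verified}")
    return problems
```

Returning every violation, rather than stopping at the first, lets a submitter fix a whole batch in one pass.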

Key decisions

Treat backwards-compatibility as an invariant
Why: Consumers can't be enumerated, so "we'll coordinate the migration" is not an option that exists. Every change has to be safe for systems you'll never meet.
Trade-off: Schema changes are additive; legacy shapes are carried far longer than feels comfortable, and deprecation is a multi-year conversation, not a release note.
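One way to hold the additive-only rule is to check it mechanically before a schema change is accepted. The sketch below assumes a schema can be reduced to a mapping of field name to type name; the field names are hypothetical and the check is deliberately simplistic.

```python
from typing import Dict, List


def breaking_changes(old_schema: Dict[str, str], new_schema: Dict[str, str]) -> List[str]:
    """List the ways new_schema would break a consumer built against old_schema.

    Added fields are allowed; removed or retyped fields are reported.
    """
    problems = []
    for name, old_type in old_schema.items():
        if name not in new_schema:
            problems.append(f"field removed: {name}")
        elif new_schema[name] != old_type:
            problems.append(f"field retyped: {name} ({old_type} -> {new_schema[name]})")
    return problems


# Hypothetical schemas for illustration only.
v1 = {"stop_id": "string", "lat": "float", "lon": "float"}
v2 = {**v1, "accessibility": "string"}       # additive change
v3 = {"stop_id": "string", "lat": "float"}   # drops a field

print(breaking_changes(v1, v2))  # []
print(breaking_changes(v1, v3))  # ['field removed: lon']
```

A check like this only catches structural breaks. An added field can still trip a consumer with a strict parser, which is part of why legacy shapes are carried for so long.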
Make domain rules executable as validation
Why: "A stop can't jump 200km overnight" is tribal knowledge until it's code. Encoded as rules, it stops bad data at submission instead of in a downstream incident.
Trade-off: Up-front cost to encode the rules, plus explicit allow-lists for the legitimate edge cases that look like errors.
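The "200km overnight" rule can be sketched as a great-circle distance check with an allow-list override. This is an illustration under assumptions, not the rule as deployed: the function names, the stop identifiers and the idea of passing the allow-list as a set are all hypothetical.

```python
from math import asin, cos, radians, sin, sqrt
from typing import FrozenSet, Optional, Tuple

EARTH_RADIUS_KM = 6371.0
MAX_MOVE_KM = 200.0  # threshold taken from the example rule in the text

Coord = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def check_move(
    stop_id: str,
    old: Coord,
    new: Coord,
    allow_list: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Return a violation message if the stop moved implausibly far, else None.

    Stops on the allow-list are legitimate edge cases that have been reviewed
    by a person and are exempt from the distance rule.
    """
    if stop_id in allow_list:
        return None
    moved = haversine_km(old[0], old[1], new[0], new[1])
    if moved > MAX_MOVE_KM:
        return f"{stop_id}: moved {moved:.0f} km between versions"
    return None
```

Keeping the allow-list as explicit data, separate from the rule, means each exception is visible and reviewable instead of being buried in a special case inside the check.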
Fail loudly before publication, never after
Why: A bad record published to an open dataset is unrecallable. The only safe place to be strict is the gate before publish.
Trade-off: Stricter gates can block well-intentioned submissions until they're corrected. That friction has to be designed for, not wished away.
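In code terms, failing loudly before publication can mean an all-or-nothing publish step: either every record in a submission passes, or nothing is versioned. The sketch below assumes a list stands in for the versioned store and that `validate` is any callable returning a list of violations per record; both are stand-ins for illustration.

```python
from typing import Callable, List, Sequence


class SubmissionRejected(Exception):
    """Raised when a submission fails validation; carries every violation found."""

    def __init__(self, violations: List[str]):
        super().__init__(f"{len(violations)} violation(s); nothing published")
        self.violations = violations


def publish(
    batch: Sequence[dict],
    store: List[list],
    validate: Callable[[dict], List[str]],
) -> int:
    """Append batch to store as a new version only if every record is valid.

    Returns the new version number. On any violation the store is left
    untouched and SubmissionRejected is raised.
    """
    violations = [msg for record in batch for msg in validate(record)]
    if violations:
        raise SubmissionRejected(violations)
    store.append(list(batch))
    return len(store)
```

The exception carries the full list of violations so that the friction of a rejected submission comes with enough detail to correct it.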

Related writing

I wrote more about the discipline behind this in "Modernising a National Dataset" and "Designing for the Hard Part".