Backfills Are Migrations With Worse Manners

A schema migration is a feared, ceremonial event. You write the up and the down, you stage it, you wrap it in a transaction, you have a rollback plan, you do it at 2am with a colleague watching. A backfill is the same operation pointed at history instead of structure, and most teams run it from a notebook on a Tuesday afternoon. The reverence is inverted. The migration changes what the next row can be; the backfill changes what the last million rows already were, and someone downstream has already believed them. The second one is more dangerous. We treat it as a chore.

I learned the weight of this the way you learn most things on a public dataset: by holding one that cannot be taken back. I help maintain NaPTAN, the UK Department for Transport's record of every public-transport access node in Great Britain — 435,029 records covering every place you can board a bus, train, tram or ferry. When a definition changes and you have to reprocess the back catalogue to match, you are not editing a table. You are altering facts that journey planners, timetabling systems, and apps written by people who have long since moved on have already read, cached, and acted upon. There is no migration window. There is no rollback that reaches into someone else's database.


A backfill rewrites a span of already-trusted history. Idempotency only buys a safe re-run. The real constraint is the downstream consumer that keeps reading and cannot be paused, so the read, not the write, must decide when the new history goes live.

A backfill is a migration that never announced itself

Backfills get casual treatment because they don't look like migrations. A migration arrives with a file in `migrations/`, a review, a sequence number. A backfill arrives as a Slack message — "can you re-run the pipeline for Q1, we found a bug in the dedup logic" — and it goes out the door without any of the apparatus we built specifically to make changes to data safe. It is a schema change in everything but ceremony, and it operates on the one part of the system we tell ourselves is settled: the past.

What makes it worse-mannered than a migration is the surface area. A forward-only schema change touches new writes. You can let it bake, watch the new rows, catch a problem before it spreads. A backfill rewrites a contiguous slab of history all at once, and it does so to data that has already been consumed. The blast radius is not "everything from now on." It is "everything that ever was, reinterpreted, and pushed to people who already drew conclusions from the old version." A dashboard that was correct yesterday is now quietly wrong about last March, and nobody fired an alert, because from the platform's point of view nothing failed. The job succeeded. That is the trap. A successful backfill and a catastrophic one look identical in the logs.

So the first discipline is to call it what it is. A backfill is a migration. It deserves a reviewed artefact, a documented intent, a stated blast radius, and a named person who owns the outcome: the same rituals you would never skip for an `ALTER TABLE` that touched a fraction of the rows. The 2026 discourse around data platforms has half-absorbed this. Thoughtworks' Technology Radar, Volume 34 moves Apache Iceberg to Adopt and tracks code-first semantic layers as a maturing technique, and the surrounding consensus has settled on a slogan: design for backfilling from day one; idempotency is non-negotiable. The slogan is correct. Almost nobody acts on it, because the day you most need to reprocess history is the day you discover you built no machinery to do it safely.
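To make the "reviewed artefact" concrete, here is a minimal sketch of a record that has to be complete before a backfill is allowed to run. `BackfillPlan`, its fields, and the example values are all hypothetical, invented for illustration; they are not part of any tool named in this piece.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BackfillPlan:
    """Hypothetical reviewed artefact for one backfill (all names illustrative)."""
    intent: str                        # why history is being rewritten
    owner: str                         # named person who owns the outcome
    start: date                        # first partition to reprocess (inclusive)
    end: date                          # last partition to reprocess (inclusive)
    affected_outputs: tuple[str, ...]  # stated blast radius: tables, metrics, feeds
    reviewed_by: str = ""              # empty until a second person signs off

    def problems(self) -> list[str]:
        """Return the reasons this plan is not fit to run (empty list means ready)."""
        issues = []
        if not self.intent.strip():
            issues.append("no documented intent")
        if not self.owner.strip():
            issues.append("no named owner")
        if self.end < self.start:
            issues.append("partition range is empty")
        if not self.affected_outputs:
            issues.append("no stated blast radius")
        if not self.reviewed_by.strip() or self.reviewed_by == self.owner:
            issues.append("not reviewed by a second person")
        return issues


plan = BackfillPlan(
    intent="Re-run Q1 after fixing a bug in the dedup logic",
    owner="alice",
    start=date(2026, 1, 1),
    end=date(2026, 3, 31),
    affected_outputs=("daily_active_records",),
)
print(plan.problems())
```

The point of the sketch is only that the Slack message becomes a file with a diff: the job refuses to start while `problems()` returns anything.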

Idempotency is the floor, not the achievement

The advice you will hear first is: make your pipelines idempotent. Re-running the job for a partition should produce the same result as running it once. It is true and necessary — and it is where most teams stop, mistaking the precondition for the solution. Idempotency means your backfill won't double-count. It says nothing about whether the numbers it produces are the ones the world already agreed on.

Here is the failure idempotency does not catch. Your pipeline is perfectly idempotent. You fix a bug, re-run a year of partitions, and every run is deterministic and clean. But the fix changed a definition — what counts as an active record, how a null is treated, where a day boundary falls in a timezone you'd been getting wrong. The backfill is idempotent and the output is different from what consumers saw last week. No schema check fires, because the schema didn't change. No data-quality rule trips, because the new values are individually valid. You have silently violated a contract that no validation expresses: the promise that a figure, once published, stays stable unless someone is told.
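A small sketch of that failure, using made-up events and a made-up "active records per day" metric. The only thing that changes between runs is where the day boundary falls. The function names and the data are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw events: (record_id, UTC timestamp of last activity).
EVENTS = [
    ("a", datetime(2025, 3, 1, 23, 30, tzinfo=timezone.utc)),
    ("b", datetime(2025, 3, 2, 0, 15, tzinfo=timezone.utc)),
    ("c", datetime(2025, 3, 2, 12, 0, tzinfo=timezone.utc)),
]


def daily_active(events, offset_hours):
    """Count records per local day; the day boundary is the definition under change."""
    tz = timezone(timedelta(hours=offset_hours))
    counts = {}
    for _record_id, ts in events:
        day = ts.astimezone(tz).date().isoformat()
        counts[day] = counts.get(day, 0) + 1
    return counts


def run_partition(store, events, offset_hours):
    """Idempotent write: each run overwrites a day's value instead of adding to it."""
    for day, count in daily_active(events, offset_hours).items():
        store[day] = count
    return store


published = run_partition({}, EVENTS, offset_hours=0)           # old definition: UTC days
rerun = run_partition(dict(published), EVENTS, offset_hours=0)  # same job, run again
backfilled = run_partition({}, EVENTS, offset_hours=1)          # fixed definition: UTC+1 days

print(rerun == published)       # prints True: re-running is safe
print(backfilled == published)  # prints False: the fix rewrote published history
```

Every value in `backfilled` is still a valid non-negative integer under the same keys-and-counts shape, so a schema or range check has nothing to object to.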

I build open-source data-contract tooling — SEIP, metricspec, dataproduct-kit — and the recurring lesson is that the schema is the easy part of a contract and the values are the hard part. A backfill is precisely the operation that honours the schema and breaks the contract. The column is still a non-null integer; it just means something different now. So idempotency is the floor. The achievement is reproducibility of meaning: being able to say not only "this job is deterministic" but "I can reconstruct exactly what every consumer saw on any past date, and I can tell them precisely what changed and why." That is versioned history, not just repeatable computation.

This is where the Iceberg-and-semantic-layer turn earns its place, beyond the slideware. Table formats that keep snapshots give you the thing a backfill destroys by default — the before. A semantic layer defined as code gives you a single, reviewable place where a definition lives, so that changing it is a pull request with a diff and a blame, not an edit buried in a transformation step. Together they let a backfill carry the one thing it almost never carries: a record of what it changed, expressed in the same terms the consumer reasons in. The tools won't make the decision for you. They make the decision legible, which is the part that was missing.
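What "keeping the before" buys can be shown with a toy model. The sketch below is not Iceberg's API; `SnapshotLog` is a hypothetical in-memory stand-in for any table format that retains snapshots, and the figures are invented. It shows the one query a backfill normally makes impossible: what did a reader see on a given past date, and why did it change afterwards?

```python
from bisect import bisect_right
from datetime import datetime


class SnapshotLog:
    """Toy append-only snapshot log; a stand-in for a snapshotting table format."""

    def __init__(self):
        self._committed_at = []  # commit times, kept in ascending order
        self._snapshots = []     # (reason, frozen copy of the table) per commit

    def commit(self, when, reason, table):
        """Record a full copy of the table; earlier snapshots are never modified."""
        if self._committed_at and when <= self._committed_at[-1]:
            raise ValueError("commits must move forward in time")
        self._committed_at.append(when)
        self._snapshots.append((reason, dict(table)))

    def as_of(self, when):
        """Return (reason, table) for the snapshot a reader saw at `when`."""
        index = bisect_right(self._committed_at, when) - 1
        if index < 0:
            raise LookupError("nothing had been published yet")
        reason, table = self._snapshots[index]
        return reason, dict(table)


log = SnapshotLog()
log.commit(datetime(2025, 4, 1), "initial load", {"2025-03": 120})
log.commit(datetime(2026, 2, 10), "backfill: dedup fix", {"2025-03": 113})

reason, seen = log.as_of(datetime(2025, 12, 25))
print(reason, seen)
```

Storing the reason next to each snapshot is the design choice that matters here: the backfill carries its own explanation instead of leaving it in a chat thread.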

There is a ladder here, and most teams stop on the first rung. Idempotent means re-running is safe to compute; it is table stakes. Reversible means you kept the prior snapshot and can return to it. Iceberg gives you this nearly for free, and almost everyone lets snapshot expiry discard it before they need it. Reconcilable means you can produce the diff between old and new and hand it to the people who depended on the old. That last one takes design, and it is the one that decides whether a backfill is a routine or an incident.
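The third rung is the least familiar, so here is a minimal sketch of it, assuming history can be represented as a mapping from a stable key (a date here) to a published value. `diff_history` and the sample data are hypothetical.

```python
_MISSING = object()  # distinguishes "key absent" from a legitimately stored None


def diff_history(old, new):
    """List every difference between two versions of a keyed history."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        before = old.get(key, _MISSING)
        after = new.get(key, _MISSING)
        if before is _MISSING:
            changes.append((key, "added", None, after))
        elif after is _MISSING:
            changes.append((key, "removed", before, None))
        elif before != after:
            changes.append((key, "changed", before, after))
    return changes


old = {"2025-03-01": 1, "2025-03-02": 2, "2025-03-03": 5}
new = {"2025-03-02": 3, "2025-03-03": 5, "2025-03-04": 1}

for change in diff_history(old, new):
    print(change)
```

The output is something you can hand to a consumer: one row per key that moved, with the value they saw and the value they will see, and nothing for keys that stayed put.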

The hard part is the consumer you cannot pause

You can get all of that right — idempotent, snapshotted, diffable, reviewed — and still be undone by the part of the system you do not control. The consumer. In a closed product you can feature-flag a change, drain the queue, replay behind a flag, and flip it when the numbers reconcile. The defining property of the systems I have worked on is that this option does not exist.

NaPTAN's consumers are uncountable and unreachable. I cannot pause a journey planner I have never heard of while I reprocess a definition. I cannot put a feature flag in someone else's decade-old script. The only lever I have is my own discipline, applied before the change goes out — which means the backfill has to be designed so that the worst version of a downstream consumer survives it. Not the well-behaved one that re-reads on every request. The one that cached your data three years ago and will reconcile against it tomorrow.

I felt this most sharply on the COVID-19 vaccination platform we built for an Australian government, serving 5,000,000-plus residents. The deadline there was not a sprint boundary; it was a physical queue of people in a car park, arriving regardless. When you have to replay or reconcile records under that kind of clock, the question is not "is my reprocessing correct." It is "which of my guarantees are load-bearing, and which can I let slip without stopping the line." A backfill that fails closed on a field nobody downstream actually depends on will halt an operation to protect a number that didn't matter. A backfill that fails open on the field that determines whether someone is recorded as vaccinated is a different kind of disaster. The entire judgement is knowing the difference — which guarantees a consumer would genuinely suffer without, and which are vanity — and no amount of idempotency supplies it.
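One way to write that judgement down, sketched with invented field names. This is not the schema or the code of the platform described above; `POLICY`, `CHECKS`, and `apply_checks` are assumptions made purely to show the shape of a per-field fail-open or fail-closed decision.

```python
from enum import Enum


class OnFailure(Enum):
    FAIL_CLOSED = "stop the backfill"
    FAIL_OPEN = "log it and keep going"


# Hypothetical policy: which guarantees are load-bearing for consumers.
POLICY = {
    "vaccinated": OnFailure.FAIL_CLOSED,         # decides what a person is recorded as
    "clinic_display_name": OnFailure.FAIL_OPEN,  # assumed cosmetic in this sketch
}


class LoadBearingCheckFailed(Exception):
    """Raised when a field that consumers genuinely depend on fails its check."""


def apply_checks(record, checks, policy=POLICY):
    """Run per-field checks; halt only when a load-bearing field fails."""
    warnings = []
    for field, is_valid in checks.items():
        if is_valid(record.get(field)):
            continue
        # Fields with no explicit policy fail closed: unknown is treated as load-bearing.
        if policy.get(field, OnFailure.FAIL_CLOSED) is OnFailure.FAIL_CLOSED:
            raise LoadBearingCheckFailed(field)
        warnings.append(field)
    return warnings


CHECKS = {
    "vaccinated": lambda v: isinstance(v, bool),
    "clinic_display_name": lambda v: isinstance(v, str) and v.strip() != "",
}

print(apply_checks({"vaccinated": True, "clinic_display_name": ""}, CHECKS))
```

The code is trivial on purpose. The hard part is the contents of `POLICY`, and that table is a human decision that has to be made, and reviewed, before the queue forms.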

So the design rule I keep returning to is this: a backfill is safe only to the degree you can describe what it does to a consumer who cannot tell you what they depend on. That forces a particular order of work. You establish the prior state as a snapshot you can point at. You compute the change as a diff, not a silent overwrite. You decide, explicitly and in advance, which differences are publishable as routine and which require telling someone before they ship. And you accept that for the consumers you cannot reach, "telling someone" means the change has to be one they could have survived without being told — backwards-compatible in meaning, not just in shape.
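The "decide in advance" step can be sketched as a gate between computing the change and publishing it. The rule used below, a 1% relative tolerance, is an arbitrary assumption chosen only to make the example run; a real rule would come from what consumers can actually absorb. `Change`, `needs_notice`, and `publish_decision` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: str
    before: float
    after: float


def needs_notice(change, relative_tolerance=0.01):
    """Assumed rule: a shift above the tolerance, relative to the published value, is not routine."""
    if change.before == 0:
        return change.after != 0
    return abs(change.after - change.before) / abs(change.before) > relative_tolerance


def publish_decision(changes, relative_tolerance=0.01):
    """Split a diff into routine changes and ones that must be announced first."""
    routine = [c for c in changes if not needs_notice(c, relative_tolerance)]
    announce = [c for c in changes if needs_notice(c, relative_tolerance)]
    return {"publish_now": not announce, "routine": routine, "announce": announce}


changes = [
    Change("2025-03-01", before=1000.0, after=1004.0),  # 0.4% shift
    Change("2025-03-02", before=1000.0, after=870.0),   # 13% shift
]
decision = publish_decision(changes)
print(decision["publish_now"], [c.key for c in decision["announce"]])
```

For consumers who cannot be reached, an entry in `announce` has no one to go to, which is the argument for reshaping the change until that list is empty.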

This is migration discipline, moved from structure to history and stripped of the courtesy of a maintenance window. The reframing I want to leave you with is small and, I think, load-bearing. Stop treating reprocessing as an operational chore and start treating it as the most consequential change your platform makes, because it edits the one thing everyone assumed was fixed. Build the snapshot, the diff, and the named owner before you need them, the way you write the migration's down-script before you run the up. Most teams discover they needed all three at exactly the moment it is too late to add them. Designing the platform so that backfilling history is a calm, reviewable, reversible routine rather than a held breath — that is the unglamorous, defining work. It is the kind of work I want to be brought into.