Promotion Is A Transaction, Treat It Like One

Six weeks after the credit model went into production, a risk reviewer asked one question: show me the challenger comparison the approver saw when they clicked approve. Not the metrics now. The metrics then, on that screen, at that moment.

The team had a lot of things. They had the model version hash. They had the dataset snapshot reference. They had the approver's name, their SSO identity, and a timestamp accurate to the millisecond. The Jira ticket linked to a Confluence page that linked to a notebook that, at some point, had rendered a challenger diff against the incumbent. The notebook had been re-run since. The Confluence page had been edited. The challenger metrics object in the registry had been overwritten by the next candidate's evaluation, because the registry stores the latest evaluation per model, not the one that was on screen at decision time.

So the honest answer to the reviewer was: we know who approved, and we know what they approved, but we cannot show you what they were looking at. That is the failure mode this post is about. Not a lineage problem. Not a metrics problem. A missing write.

What The Registry Records, And What It Quietly Doesn't

A model registry is a catalogue of things that exist. Model versions, their artefacts, their training runs, their evaluation reports, their tags, their stage transitions. Useful, necessary, and not remotely sufficient.

What the registry records is the state of the world: which artefacts exist, which version is currently in production, who pushed it there. What it does not record is the state of the decision: what the approver had in front of them, what they compared it against, which explanation artefact they read, and which dataset reference framed the evaluation they were shown. Those two states drift apart almost immediately after the click. The world keeps moving. The decision does not get to.

Lineage tools do not close this gap, and this is where teams get caught out. A lineage graph proves that artefact A was produced from dataset B by pipeline C. It shows flow. It does not show consent. A reviewer is not asking how the model got built. They are asking what the human approving it agreed to, in writing, against a specific comparison, on a specific day. Flow is not consent, and provenance is not approval. A clean DAG does not answer the question "did the approver see the calibration drift on segment four before signing?"

The approval ticket, meanwhile, is the worst kind of evidence. Human-written. Editable. It links outward to systems that have since changed. It says "approved per attached evaluation" and the attachment is a URL whose contents were overwritten three retrains ago. (If your governance story rests on a Confluence link, your governance story rests on Confluence's edit history. That is the kind of mistake you only make once.)

So you end up with three records that do not, together, reconstruct the decision. The registry has the artefact identity. Lineage has the production path. The ticket has the human identity. None of them has the thing the reviewer actually wants, which is a frozen view of what was on the approver's screen at t = approve.

The Promotion Event As An Atomic Write

The fix is structural, not procedural. Stop treating promotion as a workflow transition between registry stages. Treat it as a transaction. One atomic write, immutable, content-addressed, that binds the decision to its evidence at the moment the decision happened.

Concretely, the promotion event is a single record with six fields, all of them bound together by being written in the same operation and never written again:

Frozen model version – the candidate's artefact hash, pinned. Not a tag. Not "latest". The bytes.
Challenger comparison – the rendered diff between candidate and incumbent on the evaluation set used for sign-off, captured as the approver saw it. Metrics, segments, calibration, whatever was on screen.
Approval request – the request payload that triggered the human review: who was asked, what materiality tier, which policy applied, which fields were marked required.
Approver identity – SSO subject, group memberships at the time of approval, and the authentication method. Group membership matters because it can change; the snapshot is what was true at sign-off.
Explanation artefact reference – a content-addressed pointer to the SHAP summary, the feature attribution report, the fairness slice deck, or whatever your policy requires. Pointer, plus hash. More on this trade-off below.
Dataset reference – the evaluation dataset's content hash, not its name. Names get reused. Hashes don't.

Written once. Append-only store. Idempotency key derived from the candidate version and the approval request id, so a retry storm during a flaky deploy does not produce two promotion events for the same decision. If you can't write all six fields, the promotion fails closed and nothing transitions. No half-promoted state. No "we'll fill in the explanation reference later". Later never comes, and later is exactly when the auditor shows up.
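
The write described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `PromotionEvent` field names, the `PromotionLog` class, and the in-memory dictionary standing in for an append-only store are all assumptions made for the example. What it shows is the shape of the contract: six fields or a refusal, one key per decision, and a retry that returns the existing event instead of writing a second one.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PromotionEvent:
    # Field names are illustrative, mirroring the six fields in the text.
    model_version_hash: str      # artefact hash, never a tag
    challenger_comparison: dict  # the diff as rendered for the approver
    approval_request: dict       # must carry an "id"
    approver_identity: dict      # SSO subject, groups, auth method at sign-off
    explanation_ref: dict        # pointer plus content hash
    dataset_hash: str            # evaluation set content hash, not its name


def idempotency_key(model_version_hash: str, approval_request_id: str) -> str:
    """Derive one stable key per (candidate version, approval request)."""
    raw = f"{model_version_hash}:{approval_request_id}".encode()
    return hashlib.sha256(raw).hexdigest()


class PromotionLog:
    """In-memory stand-in for an append-only store (hypothetical)."""

    def __init__(self):
        self._events = {}  # idempotency key -> serialised event

    def write(self, event: PromotionEvent) -> str:
        record = asdict(event)
        # Fail closed: any empty field rejects the whole promotion.
        missing = [name for name, value in record.items() if not value]
        request_id = event.approval_request.get("id")
        if missing or not request_id:
            raise ValueError(f"promotion fails closed, missing: {missing or ['approval_request.id']}")

        key = idempotency_key(event.model_version_hash, request_id)
        body = json.dumps(record, sort_keys=True)  # frozen at write time
        existing = self._events.get(key)
        if existing is not None:
            if existing != body:
                # Same decision key, different evidence: never overwrite.
                raise ValueError("conflicting write for an existing promotion event")
            return key  # a retry of the same decision is a no-op
        self._events[key] = body
        return key

    def __len__(self):
        return len(self._events)
```

A real store would enforce the append-only property itself rather than trusting application code, but the calling contract stays the same: `write` either returns the key of the one event for that decision or raises, and the stage transition only proceeds on a returned key.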

This is the registry-metadata-versus-decision-state line, made operational. The registry continues to record what exists. The promotion event records what was decided. They reference each other. Neither replaces the other. The registry can keep mutating, because that is its job. The promotion event cannot, because that is its job.

One engineering trade-off is worth naming directly, because teams get it wrong in both directions. The explanation artefact: do you store the rendered snapshot inside the event, or do you store a reference plus a content hash and keep the bytes in object storage?

Storing the rendered snapshot inline is tempting. It survives object-store lifecycle policies. It survives someone running a cleanup job that deletes "old" SHAP reports. It is the most defensible thing in front of a reviewer because you hand them the artefact, not a pointer to it. The cost is event-store bloat, awkward schema evolution, and the unpleasant moment when somebody discovers a PII column in a feature attribution plot that you have now made immutable. Storing a reference plus a content hash is leaner and lets you handle retention separately, but you have just made the promotion event's evidentiary value depend on a second system staying honest. If that system has a delete API and an IAM policy that grants object delete to more principals than you can name, you do not have evidence. You have a hash that points at nothing.

The defensible middle: reference plus content hash, with the referenced object written to a bucket that has object lock or equivalent write-once semantics, and a retention policy that outlives any plausible audit window. Cheaper than inlining. Still defensible. Slightly more operational burden, because somebody has to own the lock policy and prove it is enforced. Pick your poison deliberately.
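
Whichever storage option you pick, the reference-plus-hash pattern is only worth something if the hash is actually checked when the evidence is read back. A minimal sketch of that check follows. The `make_reference` and `verify_reference` helpers and the `fetch` callable are hypothetical names for this example, and the sketch deliberately says nothing about how object lock is configured, which depends on your storage provider.

```python
import hashlib


class EvidenceError(Exception):
    """Raised when a referenced artefact can no longer back the event."""


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_reference(uri: str, data: bytes) -> dict:
    """Build the pointer-plus-hash pair stored in the promotion event."""
    return {"uri": uri, "sha256": content_hash(data)}


def verify_reference(ref: dict, fetch) -> bytes:
    """Resolve a reference and prove the bytes are the ones signed off.

    `fetch` is any callable taking a URI and returning bytes, or None
    when the object no longer exists (an assumption of this sketch).
    """
    data = fetch(ref["uri"])
    if data is None:
        raise EvidenceError(f"hash points at nothing: {ref['uri']}")
    if content_hash(data) != ref["sha256"]:
        raise EvidenceError(f"content changed since sign-off: {ref['uri']}")
    return data
```

Running `verify_reference` on a schedule against every stored event, not only when a reviewer asks, is one way to find a broken retention policy before the audit does.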

One more thing, because it gets missed. This is a write-path problem, not a dashboard problem. No amount of nicer UI on top of the registry recovers the decision state that was never captured. The instinct to build a "governance dashboard" first is the instinct to paint over the gap. The gap is upstream, at the moment of the click. Fix it there.

Materiality Is The Routing Key, Not Policy Theatre

If every promotion has to carry every field at maximum fidelity, two things happen. Engineers route around the process. Reviewers stop reading the evidence because there is too much of it. Both are worse than the original problem.

Materiality is the routing key. Not the org chart. Not the seniority of the approver. The properties of the change itself.

A weekly retrain on the same features, same hyperparameters, same evaluation slices, with performance inside the pre-agreed band, is a low-materiality event. It still produces a promotion event – that is non-negotiable, because the audit question "what was approved on the 14th" has to have an answer for every promotion, not just the interesting ones – but the event is lightweight. Frozen model hash, dataset hash, an automated approver identity, an approval request that points at the standing policy that authorised auto-approval, a band check result standing in for the challenger comparison, and an explanation reference at whatever fidelity that standing policy requires. Six fields, all present, most of them populated by the pipeline rather than a human. No challenger deck. The challenger is the previous version and the comparison is mechanical.

A threshold change, a new feature, a swap of the evaluation set, a change in the segment definitions, or anything that moves the model into a population it has not been evaluated on before – that is high-materiality. Same six fields, but the challenger comparison has to be the rendered human-facing diff, the explanation artefact reference has to point at the full attribution report, and the approver identity has to be a human with the right group membership at the time of sign-off. Auto-approval is off. The write-path enforces it. Not a checklist. The schema.
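
"The schema, not a checklist" can be made concrete with a validator that sits on the write path. The sketch below is one possible shape, under loud assumptions: the `kind` discriminators (`human`, `rendered_diff`, `full_attribution_report`), the group name `model-risk-approvers`, and the dictionary layout of the event are all invented for illustration and would come from your own policy.

```python
from enum import Enum


class Materiality(Enum):
    LOW = "low"
    HIGH = "high"


class PromotionRejected(Exception):
    """The write path refuses the promotion; nothing transitions."""


REQUIRED_FIELDS = (
    "model_version_hash",
    "challenger_comparison",
    "approval_request",
    "approver_identity",
    "explanation_ref",
    "dataset_hash",
)

APPROVER_GROUP = "model-risk-approvers"  # hypothetical group name


def validate_promotion(event: dict, tier: Materiality) -> None:
    # Every tier carries all six fields; only their fidelity differs.
    missing = [name for name in REQUIRED_FIELDS if not event.get(name)]
    if missing:
        raise PromotionRejected(f"missing fields: {missing}")
    if tier is Materiality.LOW:
        return

    # High materiality: auto-approval is off and full evidence is required.
    approver = event["approver_identity"]
    if approver.get("kind") != "human":
        raise PromotionRejected("high materiality requires a human approver")
    if APPROVER_GROUP not in approver.get("groups", ()):
        raise PromotionRejected("approver lacked the required group at sign-off")
    if event["challenger_comparison"].get("kind") != "rendered_diff":
        raise PromotionRejected("high materiality requires the rendered diff")
    if event["explanation_ref"].get("kind") != "full_attribution_report":
        raise PromotionRejected("high materiality requires the full attribution report")
```

Because the check raises instead of warning, a high-materiality change submitted with pipeline-generated evidence never reaches the store, which is the property a checklist cannot give you.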

Routing by materiality means the heavyweight evidence shows up where it actually gets read, and the lightweight evidence still exists where it doesn't. A reviewer asking about the weekly retrain on the 14th gets a terse, correct, machine-generated promotion event. A reviewer asking about the threshold change on the 21st gets the diff the human actually saw, pinned. Same primitive. Different fidelity. Decided by the change, not by who happened to be on the on-call rota that week.

There is a counterargument worth conceding. This adds a real operational burden. Somebody has to define the materiality classifier, somebody has to maintain it, and somebody has to argue with model owners who think their change is lower-materiality than the classifier says. That argument is the point. The classifier is the artefact of the policy. Reviewable, testable, version-controlled, which is more than can be said for most governance frameworks. If your materiality logic lives in a slide deck, you do not have materiality routing. You have an opinion.
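
To make "reviewable, testable, version-controlled" concrete, here is a deliberately small sketch of a classifier as code. The `ChangeSummary` flags are drawn from the examples in this post; the class name, the flag names, and the rule that anything not clearly low is treated as high are assumptions of the sketch, not a recommended policy.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ChangeSummary:
    # Each flag is a property of the change itself, not of who made it.
    threshold_changed: bool = False
    features_changed: bool = False
    hyperparameters_changed: bool = False
    evaluation_set_changed: bool = False
    segment_definitions_changed: bool = False
    new_population: bool = False
    metrics_outside_band: bool = False


def classify(change: ChangeSummary):
    """Return (tier, reasons). Any raised flag makes the change high."""
    reasons = [f.name for f in fields(change) if getattr(change, f.name)]
    tier = "high" if reasons else "low"
    return tier, reasons
```

Returning the reasons alongside the tier matters for the argument with model owners: the disagreement becomes a diff against a named rule in a reviewed file, and the resolution is a commit.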

Where this approach is overkill: small teams with one or two models, low regulatory exposure, and a single approver who is also the engineer. Don't build this. You will spend more on the plumbing than the plumbing protects. Where it is load-bearing: any environment where the approver and the engineer are different people, where models retrain on a schedule you do not control, and where someone outside the team will eventually ask what was on the screen.

Before your next promotion ships, write the event once, end to end, for one pipeline. All six fields or it fails closed. Route their fidelity by the materiality of the change, not by the org chart. Then make it the only way promotions are allowed to happen.