Six months after a credit-decisioning model went live, an internal audit team sits in a room with the platform lead, the model owner, and someone from second-line risk. The registry is open on the screen. Version 4.2.1. AUC, KS, PSI on the holdout. Training dataset hash. A signed-off approver name with a timestamp. Everything the runbook said to capture is there, neatly rendered.

Then the auditor asks the only question that matters. What did the approver actually see on the day they clicked approve?

Nobody in the room can answer. The SHAP report attached to the model card was regenerated two weeks after promotion, against a slightly newer feature snapshot, because someone wanted cleaner plots for a steering deck. The challenger comparison lives in a notebook on a laptop that has since been re-imaged. The approval email thread exists, but it references an attachment that was a draft. The approver, asked directly, says they remember signing off on a model that handled thin-file applicants better than v4.1 – but they cannot recall the exact numbers, and the numbers in the registry today are not the numbers from that week.

That is the kind of mistake you only make once.

The registry remembers the model, not the decision

A model registry is an inventory system. It tells you which weights exist, where the artefacts live, what dataset hash they were trained against, which environment they're deployed to, and who is nominally accountable. That is artefact lineage. Genuinely useful. Most regulated teams have invested heavily to get it right.

But artefact lineage is not decision state. Decision state is a different object: the specific bundle of evidence in front of an approver at the moment they signed off. The challenger they compared against. The explanation report they actually read, not the one regenerated later. The approval request payload – the form, the questions answered, the residual concerns flagged. The dataset reference at the version the approver believed they were approving against. The approver's verified identity, not just a name in a free-text field.

Registries don't capture this because they weren't designed to. They were designed to answer which model is in production, not what was agreed to, by whom, on what evidence. Related, but answering a different question.

And here is where the failure mode shows up under audit. The registry has fields. The fields look complete. But every field is mutable – someone can re-upload a SHAP report against the same model version, swap a model card description, regenerate plots, edit a metrics table. The row in the registry is a current view, not a frozen record. So when the auditor asks what the approver saw, the honest answer is: we have what we have now. We do not have what was on the screen then.

Reconstruction by guesswork is the polite name for what happens next.

What a promotion event actually has to contain

The fix is not more fields in the registry. The fix is a separate object – call it a promotion event – that is written exactly once, at signoff, and is content-addressed and append-only after that. The registry can keep doing its job. The promotion event sits next to it and answers the audit question the registry cannot.

Six fields, at minimum, have to be frozen into that event:

  • Frozen model version. The actual immutable artefact reference: content hash of the serialised model, plus the registry's version ID at the moment of signoff. Not a registry pointer that can be re-tagged later.
  • Challenger comparison. The specific challenger or champion the new model was evaluated against, with its own frozen artefact hash, and the comparison report exactly as it existed at signoff.
  • Approval request payload. The form the approver filled in, the questions, their answers, any free-text justification, and any flagged residual risk.
  • Approver identity. Verified through the SSO/IdP at the moment of click, with the assertion captured. A typed name in a free-text field doesn't count, and neither does an email CC.
  • Explanation artefact reference. A content hash of the exact SHAP report, fairness slice, or whatever explanation document the approver reviewed.
  • Dataset reference. The specific snapshot – feature store version, offline store partition, or dataset hash – the approver believed underpinned both the metrics and the explanation.

Each of these is sourced from a different system. The model hash from the registry. The challenger from a comparison job. The approval payload from whatever workflow tool runs your signoff – this is the field teams routinely lose, because the workflow tool has its own retention policy and nobody owns coordinating it with model lifecycle. The identity from the IdP. The explanation hash from the artefact store. The dataset reference from the feature store or warehouse. The promotion event's job is to staple them together, in one signed bundle, at the instant of signoff. If a regenerated explanation arrives tomorrow with cleaner axes, it gets a new hash, and the promotion event still points at the original. Dataset references drift in subtle ways too, and approvers are usually not the ones who notice.
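
To make the stapling concrete, here is a minimal sketch of the bundle as a single content-addressed object – a Python dataclass with illustrative field names; nothing here refers to any particular registry, workflow tool, or IdP API:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PromotionEvent:
    # Each field is captured from its source system at the instant of signoff.
    model_version_hash: str      # registry: content hash of the serialised model
    registry_version_id: str     # registry: version ID as it read at signoff
    challenger_hash: str         # comparison job: frozen hash of the challenger artefact
    challenger_report_hash: str  # comparison job: the comparison report as reviewed
    approval_payload: dict       # workflow tool: form, answers, residual risk flags
    approver_assertion: str      # IdP: the SSO assertion captured at the click
    explanation_hash: str        # artefact store: hash of the exact SHAP/fairness report
    dataset_ref: str             # feature store or warehouse: snapshot the approver saw

    def content_address(self) -> str:
        # The hash of the canonicalised bundle is the event's identity; a later
        # "correction" gets a different address instead of silently replacing this one.
        canonical = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()
```

The registry keeps only the returned content address; the bundle itself goes to storage that cannot be quietly edited, which is the next point.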

Append-only matters. Content-addressed matters more. If your promotion event is a row in a table that an admin can update, it is not evidence. It is a note. A first-line examiner who has seen a write-once ledger before will ask about object lock or hash chains inside the first ten minutes of the walkthrough, and the answer they get tells them how seriously to read everything else. The bundle should be written to a store where overwrite is structurally impossible: object storage with object lock, a write-once ledger, or at minimum a hash chain that any later edit would visibly break. Not because anyone on the team is dishonest. Because the value of the artefact to a regulator is exactly proportional to how implausible tampering is.
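
If object lock or a managed ledger is not available on day one, even a simple hash chain makes tampering visible. A sketch, again in Python and purely illustrative – each appended event commits to the hash of the previous one, so editing any historical entry breaks every link after it:

```python
import hashlib
import json

def link_hash(prev_hash: str, event_json: str) -> str:
    # Each link commits to the previous link's hash as well as its own content.
    return hashlib.sha256((prev_hash + event_json).encode()).hexdigest()

def append_event(ledger: list[dict], event: dict) -> dict:
    prev = ledger[-1]["link_hash"] if ledger else "genesis"
    event_json = json.dumps(event, sort_keys=True)
    entry = {"event": event, "prev_hash": prev, "link_hash": link_hash(prev, event_json)}
    ledger.append(entry)
    return entry

def verify(ledger: list[dict]) -> bool:
    # Recompute every link; a single edited event makes this return False.
    prev = "genesis"
    for entry in ledger:
        if entry["prev_hash"] != prev:
            return False
        if entry["link_hash"] != link_hash(prev, json.dumps(entry["event"], sort_keys=True)):
            return False
        prev = entry["link_hash"]
    return True
```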

Consider the failure mode in slow motion. A team promotes a model in March. The approver reviews a SHAP summary showing a particular feature's contribution at roughly the level they're comfortable with. In May, the explanation pipeline is upgraded – different background dataset, slightly different sampling – and the model card is helpfully refreshed with the new plots. By July, when the model comes up for periodic review, the SHAP report on the model card shows the same feature contributing more aggressively than the approver remembers. No malice. No model change. Just no frozen record of what was reviewed in March, so the conversation in July becomes an argument about memory rather than a conversation about evidence. Strictly speaking, the model behaviour hasn't shifted at all. The explanation of it has.

That whole episode costs nothing to prevent if the explanation artefact reference is frozen at signoff. The evidence is unrecoverable afterwards.

Routing by materiality instead of one ceremony for everything

If every promotion has to produce a fully signed, six-field, committee-reviewed event, the retrain cadence dies. Teams that retrain weekly on stable feature sets will not tolerate a heavyweight ceremony for each push, and they shouldn't have to. Uniform process is how you get either shadow workflows or a frozen model estate. Both are worse than the original problem.

Materiality routing is the way out. The promotion event itself – the six fields, frozen and content-addressed – is non-negotiable for every promotion, including the boring ones. That part is cheap once the pipeline produces it automatically. What varies is who has to approve, how much human review the approval requires, and whether a committee is in the loop.

A weekly retrain on the same feature schema, same training window shape, same population, with drift metrics inside pre-agreed bands, does not need a model risk committee. It needs an automated approver – a service identity, with the policy that authorises it captured in the promotion event – plus a named human accountable on the rota. The promotion event still gets written. The bundle still gets frozen. The human ceremony is proportional to the change.
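
What the automated path might look like, as a sketch – the metric names, the band definitions, and the svc-model-promotion identity are all assumptions for illustration, not a prescription:

```python
def auto_approval_payload(drift_metrics: dict[str, float],
                          bands: dict[str, tuple[float, float]],
                          policy_hash: str) -> dict | None:
    """Return an approval payload for the promotion event if every drift metric
    sits inside its pre-agreed band; otherwise return None and route to a human."""
    for metric, (low, high) in bands.items():
        value = drift_metrics.get(metric)
        if value is None or not (low <= value <= high):
            return None  # out of band or missing: no automated signoff
    return {
        "approver": "svc-model-promotion",  # service identity, still asserted via the IdP
        "authorising_policy": policy_hash,  # the policy permitting auto-approval, frozen by hash
        "basis": {"drift_metrics": drift_metrics, "bands": bands},
    }
```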

A model that touches adverse-action decisions, or that introduces a new feature, or that retrains on a population shift, is a different animal. That one routes to committee. The approval request payload is longer. The challenger comparison is mandatory, not optional. The explanation artefact has to include fairness slices the committee specifically asked for last cycle. Same six fields, much heavier process around producing them.

The routing logic itself has to live somewhere auditable. A policy file in the repo, versioned, reviewed, with explicit criteria. Not tribal knowledge about which retrain "feels material". If your routing decision can be re-litigated after the fact, your routing decision is not a control.
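
As a sketch of what explicit criteria can mean in practice – the flag names below are invented for illustration, and in a real repo the criteria would sit in a reviewed, versioned policy file rather than inline in code:

```python
# Versioned alongside the code, reviewed like any other change.
COMMITTEE_TRIGGERS = {
    "touches_adverse_action",
    "new_feature_added",
    "population_shift_retrain",
    "drift_outside_agreed_bands",
}

def route_promotion(change_flags: set[str]) -> str:
    # Explicit criteria, not tribal knowledge about what "feels material".
    if change_flags & COMMITTEE_TRIGGERS:
        return "committee"
    return "automated"  # still writes the same six-field promotion event
```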

One thing not to do: do not try to solve this by extending the model registry's schema. The registry vendor will usually accommodate custom fields, lineage links, even approval workflows. Wrong layer. Registries are read-write systems optimised for discovery and deployment. Promotion events are write-once systems optimised for evidence. Conflating them gives you neither: a registry that's harder to operate and an evidence trail that's still mutable underneath. Keep them separate. Let the registry hold the pointer to the promotion event's hash, and let the promotion event live in storage that cannot be quietly edited.

The counterargument worth conceding: this is more plumbing. It adds a system. It adds a contract between the registry, the workflow tool, the IdP, the feature store, and the artefact store. For a team that retrains a handful of models a quarter, that's overhead the audit pressure may not yet justify. Fair. But the moment the regulator's question lands, the cost of not having it isn't the plumbing – it's reconstructing six months of decisions from Slack threads and memory. The blast radius of that conversation, once it reaches second-line risk or an external examiner, is much larger than the engineering bill for doing it properly the first time.

Most outages are paid for in advance. Audit findings are no different.

Before the next promotion goes out, write down – on one page, not a wiki tree – the six fields your promotion event must capture and exactly which system each one is sourced from at the moment of signoff. Walk a recent promotion through that list. If any field would have to be reconstructed after the fact instead of frozen at the click, treat that promotion as undocumented, and fix the pipeline before the next retrain ships.