An auditor sits down with your model risk lead and asks one question about a credit model that went to production six weeks ago: show me what the approver saw before they clicked approve. Not the metrics in the registry today. The screen, that morning. The challenger comparison they were looking at, the explanation document open beside it, the exact request payload they consented to.
The team has the model artefact. They have the AUC, KS, and PSI on the promotion run. They have the dataset hash, the training code commit, the feature view version. They can rebuild the model bit-for-bit. They cannot rebuild the desk.
That gap is what the registry cannot close, and it is why this kind of review goes badly.
What the registry records, and what it quietly omits
A model registry is a catalogue. It records what the model is: weights, hyperparameters, training data reference, evaluation metrics, lineage edges to upstream features and downstream services. Good registries also pin the container image and the inference signature. None of this is wrong. It is just not the thing the auditor asked for.
The auditor asked about a decision. A decision has an actor, a moment, an alternative considered, and a thing being consented to. Registry rows describe artefacts. Decisions happen against artefacts but are not the same record. Conflating the two is the failure mode that shows up in nearly every first deep review I have sat through.
Here is how it looks in practice. The promotion ran on a Tuesday. The challenger being compared against the incumbent was version 0.7.3 of the same family, trained on a slightly different sample window. Between Tuesday and the audit, two more challengers shipped, the comparison dashboard was refactored to use a new metric panel, and the explanation notebook was re-run against a fresher reference population because someone wanted a cleaner SHAP plot for a different review. Query the lineage tables now and you will get a challenger comparison. Not the one the approver saw. The screen the approver saw may no longer be reproducible from current tables, and pretending otherwise in front of a regulator is worse than admitting the gap.
Teams resist this part. The instinct is to say: we have lineage, we have versioning, we can reconstruct. Reconstruction is not evidence. Reconstruction is a story you tell after the fact using whatever survived. Evidence is a record written at the moment of the decision and never edited.
Most approvers do not read the metrics blob. They scroll past the dashboard, glance at one or two numbers they half-trust, and rely on the explanation artefact – the document that says: this model behaves this way on these segments, here is the drift since last quarter, here is what changed since the last approved version. That document is the load-bearing thing in the room. If your promotion event references the metrics JSON and not the explanation artefact, you have recorded the part the approver barely used and omitted the part they relied on. That is the kind of mistake you only make once.
The promotion event as a first-class artefact
Treat the promotion as an event in its own right. Not a row updated in the registry. Not a status flag flipped from staging to production. A separate, append-only record written once, at the instant of approval, containing the state of the decision.
Six fields. Minimum.
- Frozen model version – the immutable reference to the candidate being promoted, including artefact digest, not just a semver tag that can be re-pointed.
- Challenger comparison – a frozen snapshot of the comparison the approver saw, with the challenger versions named and their metrics captured inline, not as a link to a dashboard that will rerender.
- Approval request payload – the exact request the approver clicked on, byte-for-byte, including the routing rules that brought it to them and the materiality classification at the moment of submission.
- Approver identity – not just a user ID. The principal, the authentication method, the role asserted at the time, and where applicable the delegation chain.
- Explanation artefact reference – a content-addressed pointer to the explanation document that was on the desk. Not a link to a notebook that will be re-executed. A hash of the rendered output as it stood.
- Dataset reference – the evaluation dataset and reference population used in the comparison, pinned by content hash, including the feature view snapshot if you use one.
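One way to make the six fields concrete is a frozen record with a single digest over all of them. This is a minimal sketch, not a standard schema: the field names, the `content_hash` helper, and every value in the example (including the AUC figure) are placeholders I have made up for illustration.

```python
import dataclasses
import hashlib
import json
from dataclasses import dataclass


def content_hash(data: bytes) -> str:
    """Content address for a frozen artefact: SHA-256 of its exact bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class PromotionEvent:
    # Hypothetical field names; one per field in the list above.
    model_version_digest: str        # artefact digest, not a re-pointable tag
    challenger_comparison: str       # canonical JSON, challenger metrics inline
    approval_request_payload: bytes  # the exact bytes the approver acted on
    approver_identity: str           # canonical JSON: principal, auth, role, delegation
    explanation_artefact_hash: str   # hash of the rendered explanation output
    dataset_reference: str           # content hash of evaluation data / reference population

    def digest(self) -> str:
        """Digest over all six fields; this is the value you would sign."""
        body = dataclasses.asdict(self)
        body["approval_request_payload"] = self.approval_request_payload.hex()
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return content_hash(canonical.encode("utf-8"))


# Placeholder values only; none of these come from a real promotion.
event = PromotionEvent(
    model_version_digest=content_hash(b"candidate model bytes"),
    challenger_comparison=json.dumps(
        {"challenger": "0.7.3", "metrics": {"auc": 0.5}}, sort_keys=True
    ),
    approval_request_payload=b'{"request_id": "req-1", "materiality": "tier-1"}',
    approver_identity=json.dumps(
        {"principal": "approver@example.com", "auth": "sso", "role": "model-risk"},
        sort_keys=True,
    ),
    explanation_artefact_hash=content_hash(b"rendered explanation document"),
    dataset_reference=content_hash(b"evaluation dataset snapshot"),
)
```

Because the record is frozen and the digest covers every field, changing any one of them after the fact produces a different digest, which is the property a signature would then protect.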
Write it once. Sign it. Put it somewhere the team that builds the models cannot rewrite. An object store with object lock, a WORM bucket, a ledger table – the mechanism matters less than the property. Append-only. No updates. No soft deletes. If you correct something later, you write a new event that references the old one. You do not edit the old one.
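The append-only property, including corrections as new events that point at the old one, can be sketched with an in-memory log. Everything here is an assumption for illustration: `AppendOnlyLog` stands in for whatever WORM store you use, and the hash chain is one simple way to make edits detectable, not a substitute for object lock or a real signature.

```python
import hashlib
import json
from typing import Optional


class AppendOnlyLog:
    """In-memory stand-in for a WORM store: records are added, never changed."""

    def __init__(self) -> None:
        self._records: list[dict] = []

    @staticmethod
    def _hash(record: dict) -> str:
        # Hash everything except the hash field itself.
        payload = {k: v for k, v in record.items() if k != "record_hash"}
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def append(self, event_id: str, body: dict, corrects: Optional[str] = None) -> dict:
        known = {r["event_id"] for r in self._records}
        if event_id in known:
            raise ValueError(f"event {event_id} exists; write a correction instead")
        if corrects is not None and corrects not in known:
            raise ValueError(f"cannot correct unknown event {corrects}")
        prev_hash = self._records[-1]["record_hash"] if self._records else None
        record = {
            "event_id": event_id,
            "body": body,
            "corrects": corrects,    # reference to the event being corrected
            "prev_hash": prev_hash,  # chains each record to its predecessor
        }
        record["record_hash"] = self._hash(record)
        self._records.append(record)
        return dict(record)

    def verify(self) -> bool:
        """Return False if any stored record was edited or reordered."""
        prev = None
        for record in self._records:
            if record["prev_hash"] != prev or record["record_hash"] != self._hash(record):
                return False
            prev = record["record_hash"]
        return True


log = AppendOnlyLog()
log.append("evt-1", {"approver": "principal-a"})
# A correction is a new event referencing the old one; evt-1 is untouched.
log.append("evt-2", {"approver": "principal-a", "note": "role corrected"}, corrects="evt-1")
```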
Notice what is not on that list. The model artefact itself is not on the list, because the registry already holds it and a content-addressed reference is enough. The training code commit is not on the list, because the question the auditor asked is about the decision, not about how the candidate came to exist. The boundary is narrow on purpose. The promotion event is the record of decision state at promotion time, nothing more. Keep it that way. Every field you add that isn't decision state is a field someone will eventually argue should be updatable, and the moment one field becomes updatable the whole record loses its evidential weight.
One uncomfortable caveat. Writing this event at the moment of approval means your approval UI has to be the thing that emits it, transactionally, with the click. Not a nightly job. Not a downstream ETL that snapshots from the registry. If the event is written by a process that runs after the click, you are back to reconstruction – better reconstruction, sooner, but still reconstruction. The idempotency key for the event should be the approval request ID, and the write should fail closed: no event, no promotion.
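A minimal sketch of that ordering, assuming a store that offers a put-if-absent write and a `promote` callable that is safe to repeat. `InMemoryEventStore`, `EventStoreUnavailable`, and `approve_and_promote` are hypothetical names; in a real system the event write and the approval click would share a transaction in the approval UI's backend.

```python
from typing import Callable


class EventStoreUnavailable(Exception):
    """Raised when the event cannot be durably written."""


class InMemoryEventStore:
    """Hypothetical store with a put-if-absent write keyed by request ID."""

    def __init__(self, available: bool = True) -> None:
        self.available = available
        self.events: dict[str, dict] = {}

    def put_if_absent(self, key: str, event: dict) -> bool:
        if not self.available:
            raise EventStoreUnavailable(key)
        if key in self.events:
            return False
        self.events[key] = event
        return True


def approve_and_promote(
    approval_request_id: str,
    event: dict,
    store: InMemoryEventStore,
    promote: Callable[[str], None],
) -> None:
    # 1. Write the event first. Any failure here propagates, so the
    #    promotion below never runs: no event, no promotion.
    created = store.put_if_absent(approval_request_id, event)
    # 2. A retry with the same request ID is a no-op on the store, but the
    #    same ID carrying a different event is a conflict, not a retry.
    if not created and store.events[approval_request_id] != event:
        raise ValueError("approval request ID reused with a different event")
    # 3. Only now promote. Assumed idempotent, so a retry after a failed
    #    promotion can safely call it again.
    promote(event["model_version_digest"])
```

The design choice worth noticing is the order: the event write is the gate, and the promotion is the thing that can be retried, not the other way round.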
That last part is where teams push back. Fail closed feels heavy. It is heavy. It is also the only configuration that survives a serious audit, because the alternative is a world where some promotions have events and some don't, and you cannot tell from the registry which is which.
Routing by materiality without losing the trail
The obvious counterargument: full committee approval for every retrain is operationally absurd. A weekly refresh of a low-stakes propensity model cannot wait three days for a quorum. Agreed, and nothing here requires it.
Materiality routing decides who approves and what they look at. It does not decide whether a promotion event gets written. A low-risk retrain on the same feature schema, same target, same population, with drift within thresholds, can be auto-approved by a named service principal under a policy. Fine. The promotion event still gets written. The approver identity is the service principal. The challenger comparison is the previous production version against the candidate, captured at the moment the policy fired. The explanation artefact reference points to whatever the automated check produced – the drift report, the segment stability summary, the calibration plot. The approval request payload contains the policy ID and the rule that matched.
Same six fields. Different approver. Same immutability.
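As a rough sketch of the auto-approved path, a policy check can either return a complete event or decline and escalate. The dictionary keys, the PSI-based drift check, and the policy shape are all assumptions made for illustration; your policy engine will have its own representation.

```python
from typing import Optional

COMPARABLE_KEYS = ("feature_schema", "target", "population")


def auto_approve(candidate: dict, production: dict, policy: dict) -> Optional[dict]:
    """Return a six-field promotion event, or None to escalate to a human."""
    same_shape = all(candidate[k] == production[k] for k in COMPARABLE_KEYS)
    within_drift = candidate["drift_psi"] <= policy["max_drift_psi"]
    if not (same_shape and within_drift):
        return None  # outside the policy: escalate, do not approve
    return {
        "model_version_digest": candidate["digest"],
        # Previous production version against the candidate, captured now.
        "challenger_comparison": {
            "production": production["digest"],
            "candidate": candidate["digest"],
            "drift_psi": candidate["drift_psi"],
        },
        # The payload records which policy and which rule matched.
        "approval_request_payload": {
            "policy_id": policy["policy_id"],
            "matched_rule": "same_shape_and_drift_within_threshold",
        },
        # The approver is the service principal, not a person.
        "approver_identity": {"principal": policy["service_principal"], "kind": "service"},
        # Whatever the automated check produced, pinned by hash.
        "explanation_artefact_hash": candidate["drift_report_hash"],
        "dataset_reference": candidate["eval_dataset_hash"],
    }
```

The returned event has the same six keys a human approval would produce; only the approver identity and the payload contents differ.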
This is where the recording discipline pays for itself. When the materiality classification of a model changes – and it will, because regulators reclassify, because the model gets reused for a new product, because a downstream consumer becomes load-bearing – you do not have to retrofit governance to historical promotions. You already have the events. You can answer "who approved this, under which policy, against which alternative?" for every promotion in the history of the model, including the auto-approved ones. Without that, a reclassification turns into an archaeology project.
A tier sketch, kept deliberately rough because the exact thresholds belong to your risk function and not to a blog post:
- Tier 1, high materiality: human committee, frozen comparison pack, explanation artefact reviewed, event written on click.
- Tier 2, medium: named individual approver with delegated authority, policy-bounded, event written on click.
- Tier 3, low: automated approval under a published policy, event written by the policy engine, exceptions escalate to Tier 2.
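The tier sketch maps onto a small routing table. The names below (`Tier`, `Route`, `route`) are hypothetical, and the table deliberately has no entry that skips the event writer.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass(frozen=True)
class Route:
    approver: str      # who approves
    event_writer: str  # which component emits the promotion event


# Every tier names an event writer; routing changes who approves,
# never whether an event is written.
ROUTES = {
    Tier.HIGH: Route(approver="committee", event_writer="approval_ui"),
    Tier.MEDIUM: Route(approver="named_individual", event_writer="approval_ui"),
    Tier.LOW: Route(approver="policy_engine", event_writer="policy_engine"),
}


def route(tier: Tier, policy_exception: bool = False) -> Route:
    """Pick the approval route; low-tier policy exceptions escalate to medium."""
    if tier is Tier.LOW and policy_exception:
        tier = Tier.MEDIUM
    return ROUTES[tier]
```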
The trap to avoid: a fourth tier where retrains skip the event entirely because they are "obviously safe". There is no such tier. If a retrain is so safe it does not warrant an event, it does not warrant a promotion either – wire it as a configuration update inside the existing approved model, not as a new model version. The distinction is real and worth defending, because the moment "obviously safe" becomes a category, the boundary of what counts as a model promotion starts drifting, and the audit trail develops holes that nobody notices until somebody is looking for them.
One thing not to do. Do not try to make the registry itself immutable to solve this. Registries need to evolve. Metadata gets enriched, tags get fixed, ownership changes, deprecation flags get set. A registry that cannot be edited is a registry that rots. The promotion event is a separate record precisely so that the registry can keep being a working catalogue while the decision trail stays frozen.
Before your next promotion, write the promotion event schema first and make it append-only. Decide the six fields, decide where they live, decide who signs them. If a low-risk retrain cannot produce that event automatically under your policy engine, the path is not ready. Raise the materiality bar, not the recording bar.
