The auditor's question was narrow. Which challenger comparison did the approver review before v17 of the credit-decisioning model went to production? The registry was helpful, up to a point. It showed the model artefact, the training dataset hash, the metric snapshot, the git SHA of the training job. It showed who had write access to the Production stage. It did not show what was on the approver's screen at 14:42 on a Thursday when they clicked promote.
What the team handed over instead: three Slack messages, a Jira ticket with two attachments, a registry timestamp. The Slack messages were from the right people. The Jira ticket referenced the right model. Nobody could prove, from the evidence alone, that the approver had opened the challenger comparison before the click, or that the explanation artefact they discussed was the one tied to v17 rather than the one tied to v16. The audit didn't fail. It became expensive. That is the kind of mistake you only make once.
This is the gap. Registries record what exists. They do not record what was decided.
What The Registry Actually Records
A model registry is a catalog. That is its job, and most of them do it well. It indexes versions, pins lineage to training runs, holds metric blobs, exposes a stage transition API, and lets you query backwards from a deployed artefact to a dataset hash and a git commit. Lineage, provenance, the bill of materials – all present, all queryable.
None of that is decision state.
Lineage tells you what flowed into the artefact. Decision state tells you what a named human agreed to, on the basis of what evidence, at what moment. The registry can answer the first question with a graph traversal. The second question requires an event that the registry was never designed to emit – or, more accurately, an event that the registry emits in a degraded form (a stage transition, an actor field, a timestamp) which auditors correctly read as metadata rather than as a binding record.
The honest framing: we have been treating catalog systems as governance systems because they sit closest to the artefact. They are not the same shape. A governance record has to bind a named human to specific evidence at a specific moment, and a catalog has no slot for that binding. Pretending otherwise is how teams end up reconstructing decisions from Slack threads under audit pressure.
Reconstruction is not evidence. It is a story told after the fact, with the artefacts that happened to survive. Sometimes the story is true. Sometimes it isn't. The auditor cannot tell the difference, and neither, honestly, can you.
The failure mode is predictable. A model is retrained. Metrics look good. Someone posts the challenger comparison in a channel. A few people react with a thumbs up. The approver, who may or may not have read the thread carefully, opens the registry UI and clicks the stage transition. Six months later, an incident or an audit asks the obvious question, and the team begins assembling a narrative from Slack search, Jira history, registry timestamps, and human memory. The narrative might even be correct. It is still not evidence.
The Promotion Event And What Goes Inside It
The fix is a single, narrow, immutable record. Call it the promotion event. One event per promotion, emitted at the moment of the approval click, written to an append-only store outside the registry. Not in the registry. The registry is mutable by design – stages move, tags get rewritten, descriptions get cleaned up. You do not want your evidence living in a system whose schema is allowed to change under you.
The event has six fields. Each field is a reference. The payloads live elsewhere.
- Frozen model version – the registry ID plus the content hash of the artefact. Not the tag. The hash.
- Challenger comparison reference – a content-addressed pointer to the exact comparison report rendered to the approver, including the champion version it was compared against.
- Approval request ID – the workflow record that opened the request, with its own immutable history of state transitions.
- Approver identity – the authenticated subject from the IdP, not a display name. SSO subject ID, not @jane.
- Explanation artefact reference – a hash of the explanation bundle (SHAP summaries, partial dependence plots, segment slices) the approver had access to.
- Dataset reference – the training and evaluation dataset hashes, not the dataset name.
Six fields. One event. Signed at write time, stored append-only, queryable by model version.
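A minimal sketch of what such an event could look like in Python, under stated assumptions: the class name, field names, and helper functions are all hypothetical, and HMAC-SHA256 stands in for whatever signing scheme your platform actually uses (an asymmetric key held in a KMS would be the more likely production choice).

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromotionEvent:
    # Hypothetical field names. Every value is a reference, never a payload.
    model_version: str             # registry ID plus content hash of the artefact
    challenger_comparison: str     # content hash of the report rendered to the approver
    approval_request_id: str       # workflow record that opened the request
    approver_subject: str          # IdP subject ID, not a display name
    explanation_artefact: str      # content hash of the explanation bundle
    dataset_refs: tuple[str, ...]  # training and evaluation dataset hashes


def canonical_bytes(event: PromotionEvent) -> bytes:
    # Sorted keys and fixed separators so the same event always serialises identically.
    payload = json.dumps(asdict(event), sort_keys=True, separators=(",", ":"))
    return payload.encode("utf-8")


def sign_event(event: PromotionEvent, key: bytes) -> str:
    # Assumption: a shared-secret HMAC as a stand-in for the real signing scheme.
    return hmac.new(key, canonical_bytes(event), hashlib.sha256).hexdigest()


def verify_event(event: PromotionEvent, signature: str, key: bytes) -> bool:
    # Constant-time comparison; any changed field produces a different signature.
    return hmac.compare_digest(sign_event(event, key), signature)
```

The dataclass is frozen so that an event cannot be edited in memory after it is built, and the signature covers all six fields at once, so swapping any single reference after the fact fails verification.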
The reason the explanation artefact is referenced by hash rather than stored inline is load-bearing. Inline storage tempts teams to "improve" the explanation later – a re-rendered chart, a corrected segment label, a cleaned-up notebook. Each improvement is reasonable in isolation. Each one quietly invalidates the evidence. If the bytes stored under that reference are changed after decision time, they no longer match the hash in the promotion event, and the event is honest about that: it points to nothing, and you know immediately the chain is broken. A registry-stored explanation will just silently mutate and look fine.
The same logic applies to the challenger comparison. The point of the reference is not to make retrieval easy. The point is to make tampering visible. If the bytes change, the hash stops resolving. The audit trail tells on itself.
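To make that concrete, here is a small sketch of a content-addressed lookup, assuming SHA-256 as the hash and an in-memory dictionary as the backing store. `ContentStore` and its methods are hypothetical names for illustration, not an existing library.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory content-addressed store: the key is the SHA-256 of the bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The reference handed to the promotion event is derived from the bytes themselves.
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def resolve(self, digest: str) -> bytes:
        data = self._blobs.get(digest)
        if data is None:
            raise LookupError(f"no artefact stored under {digest}")
        # Re-hash on every read so that altered bytes are reported, not returned.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"artefact under {digest} no longer matches its hash")
        return data
```

The check on read is the part that matters: a report that was re-rendered or edited in place raises instead of quietly returning the new bytes.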
One opinionated aside, because this comes up every time: do not put the promotion event inside the model registry's metadata fields. I know it is convenient. I know the registry has a tags dictionary and a description field that will happily accept JSON. Don't. The registry's storage layer is not append-only, its access controls are scoped for ML engineers rather than for evidence custodians, and the moment someone runs a cleanup script you have lost the record. S3 with Object Lock in compliance mode, a QLDB-style ledger table, an immudb instance, or an append-only Kafka topic with a retention lock. Anywhere except the system whose job is to keep the catalog tidy.
A note on what the event does not contain. It does not contain the explanation itself. It does not contain the metrics. It does not contain the dataset. Those are large, they already live somewhere, and duplicating them into the event creates two problems: storage growth you don't need, and divergence risk between the copy in the event and the copy in the source system. References, hashed. Nothing more.
Routing By Materiality, Not Ceremony
This is where most governance programmes break their own backs. They define a promotion process, declare it mandatory, and apply it uniformly. Every retrain goes through committee. Every retrain generates the same paperwork. Within a quarter, the team has either routed around the process or stopped retraining as often as the model actually needs.
Neither outcome is what the auditor wanted.
The promotion event is the artefact. The route the event takes – who approves, how much review, how much ceremony – should depend on the materiality of the change. A scheduled retrain on the same feature set, the same population, the same hyperparameter grid, where the challenger comparison shows drift within a pre-agreed envelope, does not need a committee. It needs a named approver, a generated comparison, a generated explanation diff against the prior version, and the six-field event. The approver might be a senior engineer on the on-call rota. The review might take ten minutes. The event still gets written, signed, and stored the same way as any other.
A change to the feature set is a different animal. A new feature, a removed feature, a population shift detected in the evaluation slice, a change in the loss function, a new vendor data source – any of these moves the model into territory the previous risk review did not cover. That is what the committee exists for. The event still has the same six fields. The approval request ID just points to a heavier workflow with more named reviewers and a longer audit trail attached to it.
The discipline is to decide the materiality classes in advance and to attach the routing to the change, not to the calendar. Three classes is usually enough:
- Same feature set, same population, drift within envelope. Single approver. Generated artefacts.
- Same feature set, population shift or drift outside envelope. Two approvers, one from risk.
- Feature set change, objective change, or new data source. Full committee review.
I'll concede the counterargument before someone raises it: yes, the boundaries between these classes are fuzzy, and yes, teams will argue about which bucket a particular change falls into. That argument is the point. It is cheaper to argue about classification once, in advance, than to argue about adequacy of review after an incident. Most outages are paid for in advance.
The sharp edge of this pattern, where it stops being free: someone has to own the materiality classifier. Not a document. A function in the promotion pipeline that inspects the proposed change and assigns a class. If it is a document, it will be ignored. If it is a check that blocks promotion until a class is assigned and the corresponding approvers have signed, it will be obeyed. The classifier itself becomes a piece of code that needs review, and that review is real work. There is no version of this that is free.
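As a rough illustration, the classifier and its blocking check might look like the sketch below. Everything here is an assumption made for the example: the `ProposedChange` flags, the role names `"risk"` and `"committee_chair"`, and the idea that a committee decision is recorded as a sign-off by its chair. How each flag is actually computed (feature diffing, drift tests) is the real work and is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class MaterialityClass(Enum):
    SINGLE_APPROVER = 1
    TWO_APPROVERS = 2
    FULL_COMMITTEE = 3


@dataclass(frozen=True)
class ProposedChange:
    # Hypothetical flags, assumed to be computed upstream by the pipeline.
    feature_set_changed: bool
    objective_changed: bool
    new_data_source: bool
    population_shift: bool
    drift_within_envelope: bool


def classify(change: ProposedChange) -> MaterialityClass:
    # Check the heaviest class first so a change never lands in a lighter bucket.
    if change.feature_set_changed or change.objective_changed or change.new_data_source:
        return MaterialityClass.FULL_COMMITTEE
    if change.population_shift or not change.drift_within_envelope:
        return MaterialityClass.TWO_APPROVERS
    return MaterialityClass.SINGLE_APPROVER


def approvals_sufficient(required: MaterialityClass, signers: dict[str, str]) -> bool:
    # signers maps IdP subject ID -> role; role names are illustrative.
    roles = set(signers.values())
    if required is MaterialityClass.SINGLE_APPROVER:
        return len(signers) >= 1
    if required is MaterialityClass.TWO_APPROVERS:
        return len(signers) >= 2 and "risk" in roles
    # Assumption: the committee's decision is recorded as a sign-off by its chair.
    return "committee_chair" in roles
```

A pipeline step would call `classify` on the proposed change and refuse to continue until `approvals_sufficient` returns `True` for the assigned class.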
One thing not to do: do not let the registry's stage transition be the trigger for the promotion event. Make the promotion event the trigger for the stage transition. The approval workflow writes the event first, and only on successful write does the registry move the stage. If you wire it the other way, the registry click becomes the source of truth and the event becomes a side effect that can fail silently. You want the event on the critical path. You want a failed event write to block the promotion. Backwards from what feels natural, and correct.
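The ordering can be sketched in a few lines. This is an illustration only: `write_event` and `transition_stage` are hypothetical callables standing in for your append-only store client and your registry client, and the event is shown as a plain dictionary with an assumed `model_version` key.

```python
from typing import Callable


class PromotionBlocked(RuntimeError):
    """Raised when the promotion event could not be written."""


def promote(
    event: dict,
    write_event: Callable[[dict], None],
    transition_stage: Callable[[str], None],
) -> None:
    # Step 1: write the evidence. Any failure here stops the promotion.
    try:
        write_event(event)
    except Exception as exc:
        raise PromotionBlocked("event write failed; stage not moved") from exc
    # Step 2: only after a successful write does the registry stage move.
    transition_stage(event["model_version"])
```

With this shape there is no code path in which the stage moves without a stored event. The reverse failure (event written, stage transition fails) leaves an event for a promotion that did not complete, which is visible and can be retried.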
Before your next model promotion, write down the six fields your promotion event must contain and where each one is stored immutably. If any field resolves to a Slack thread, a Jira comment, or a registry timestamp, that field is not yet evidence. Fix it before the next retrain ships.

