Six months after a credit model went into production, an internal auditor sits down with the person who approved it and asks a very simple question: show me what you reviewed before you signed off. The registry opens cleanly. There is the frozen version, the training dataset hash, the offline metrics, the lineage graph back to the feature store. All of it correct. None of it answers the question. The approver remembers a challenger comparison on screen that morning, remembers scrolling through a slice report, remembers a SHAP summary that nudged the decision. The registry has no idea any of that happened.
That is the gap this piece is about. Not governance in the abstract. The promotion moment.
The registry tells you what exists, not what was decided
A model registry is an inventory system. It is very good at that. It knows which artefact was produced by which training run, which code commit, which dataset snapshot, which container image. Lineage, in the strict sense, is the chain from inputs to artefact. It answers what produced this model. It does not answer what a human saw before agreeing to put it in front of customers.
Those are different objects. One is a property of the artefact. The other is a property of the decision. Conflating them is the failure mode that quietly breaks audit interviews.
Here is how it plays out. The auditor asks for the approval evidence. The team pulls up the registry record. Version 4.2.1, promoted on the 14th, approved by a named risk officer, with a link to the metrics page. The auditor nods, then asks the question that actually matters: which challenger was on screen when the approval was clicked? The team can name the production model at the time. They can probably reconstruct, from a separate experiment tracker, which candidates were trained that week. What they cannot do is prove which comparison surface the approver was looking at. Maybe it was 4.2.1 against the incumbent 4.1.7. Maybe it was 4.2.1 against a quietly discarded 4.2.0. Six months on, no one is sure, and the screenshot folder on someone's laptop is not evidence.
The registry was never going to answer this. It records artefacts. Decisions need their own record.
Let me narrow the claim before it gets too grand. This is not about every retrain. A nightly refresh of a low-stakes ranking model does not need a committee. But the pipeline still made a decision – promote or do not promote – and that decision still has a state. The state needs to be written down somewhere that is not the registry.
What belongs inside the promotion event
Treat the promotion event as an immutable, append-only record emitted at the exact moment of approval. Not before. Not reconstructed after. Emitted by the approval UI itself, as a side effect of the click. If the click does not produce the event, the event does not exist, and you have nothing to show.
Six fields, at minimum:
- Frozen model version. The artefact identifier the approver is approving. Pinned, content-addressed, not a moving tag like "latest" or "staging".
- Challenger comparison. The identifiers of every model on the comparison surface at decision time, including the incumbent. Not just the winner.
- Approval request payload. The exact JSON (or equivalent) that the UI rendered from. If the screen was driven by a payload, store the payload.
- Approver identity. Cryptographically bound. An SSO subject claim, a signed JWT, or a hardware-key signature. Not a free-text name field.
- Explanation artefact reference. The SHAP plot, the slice report, the fairness summary – referenced by content hash of the rendered artefact, not by URL.
- Dataset reference. The evaluation dataset, by hash, plus a pointer to the slice report the approver actually read.
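One way to pin the six fields down is as a frozen record with a deterministic serialisation. The sketch below is an illustration under assumptions, not a prescribed schema: the class name, the field names, and the choice of SHA-256 over sorted-key JSON are all mine, and the dataset reference is split into two hashes because the slice report is a separate artefact.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PromotionEvent:
    """Hypothetical promotion event; all field names are illustrative."""
    model_version: str                # content-addressed artefact id, not a moving tag
    comparison_models: tuple          # every model on the comparison surface, incumbent included
    request_payload: str              # the exact JSON text the approval UI rendered from
    approver_identity: str            # SSO subject claim or service principal
    explanation_artifact_sha256: str  # hash of the rendered explanation artefact
    dataset_sha256: str               # evaluation dataset hash
    slice_report_sha256: str          # hash of the slice report the approver read
    approved_at: str                  # ISO 8601 timestamp, UTC

    def canonical_bytes(self) -> bytes:
        # Sorted keys and fixed separators give the same bytes for the same event.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":")).encode("utf-8")

    def event_id(self) -> str:
        # The event is addressed by the hash of its own canonical form.
        return hashlib.sha256(self.canonical_bytes()).hexdigest()


# Illustrative values only; none of these identifiers come from a real system.
event = PromotionEvent(
    model_version="sha256:" + "a" * 64,
    comparison_models=("sha256:" + "a" * 64, "sha256:" + "b" * 64),
    request_payload='{"candidate":"a","incumbent":"b"}',
    approver_identity="sso-subject-placeholder",
    explanation_artifact_sha256="c" * 64,
    dataset_sha256="d" * 64,
    slice_report_sha256="e" * 64,
    approved_at=datetime.now(timezone.utc).isoformat(),
)
print(event.event_id())
```

Addressing the event by the hash of its canonical bytes means any later edit to a field produces a different identifier, which is the property the rest of this section leans on.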
Three of these need defending. The easy version of each looks right and isn't.
Take the explanation artefact. The reflex is to store a URL to the dashboard – the same Grafana panel, the same MLflow page, the same internal tool that the approver was looking at. Do not do this. Dashboards re-render. The data behind them rolls forward. The panel that showed a clean reliability curve in March will show something different in September because the underlying query now spans a different window. By the time the auditor clicks the link, you are showing them a different artefact than the approver saw, and you cannot prove otherwise. Reference the rendered artefact by content hash. A PNG, a PDF, a serialised JSON of the chart payload – whatever form it took on the approver's screen, freeze it and address it by hash. Keep the URL as a convenience link beside the record if you want, but the hash is what survives the next dashboard rewrite.
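A minimal sketch of what "reference by content hash" can look like, assuming the rendered artefact is available as bytes at approval time. The function names and the in-memory dictionary standing in for real storage are hypothetical.

```python
import hashlib


def content_address(rendered: bytes) -> str:
    """Return a content address for the exact bytes the approver saw."""
    return "sha256:" + hashlib.sha256(rendered).hexdigest()


def freeze_artifact(store: dict, rendered: bytes) -> str:
    """Store the rendered bytes under their own hash and return the address."""
    address = content_address(rendered)
    store.setdefault(address, rendered)  # same bytes always map to the same key
    return address


def load_artifact(store: dict, address: str) -> bytes:
    """Fetch by address and re-check the hash before trusting the bytes."""
    rendered = store[address]
    if content_address(rendered) != address:
        raise ValueError("stored artefact does not match its content address")
    return rendered


# Two renders of the "same" dashboard panel at different times (made-up bytes).
store = {}
seen_by_approver = freeze_artifact(store, b"chart payload as rendered at approval")
seen_by_auditor = content_address(b"chart payload as re-rendered months later")
print(seen_by_approver == seen_by_auditor)  # False: the URL matched, the artefact did not
```

The re-check on load is the point of the design: the address is a claim about the bytes, and the reader can verify it without trusting the storage layer.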
Take the dataset reference. A dataset hash is necessary, and it is not sufficient. The approver did not stare at a sha256 string and form a judgement. They looked at a slice report built on top of that dataset – performance by segment, calibration by cohort, error rates by region. That report is a derived artefact, and it has its own content hash. Store both. The dataset hash gives you reproducibility. The slice report hash gives you decision evidence. Conflating them is how you end up at the audit interview saying "well, the dataset was definitely correct" while the auditor is asking what the human read.
The approval request payload is the one most teams skip, and it matters more than it looks. UIs change. The version of the approval screen that existed in March is not the version that exists today, because someone shipped a redesign in May. If all you have stored is "approver clicked yes on screen X", you have lost the rendering context. Store the payload the UI consumed. Then, given the payload and the frozen UI code at that commit, the screen is reconstructible. Without the payload, it is not.
One direct opinion, since this is the place for it: the payload feels redundant until the first UI refactor lands between a promotion and its audit. After that, nobody on the team argues about it again.
Storage is its own decision. The promotion event does not belong inside the registry. The registry has its own lifecycle, its own retention, its own access patterns, and it is mutable in ways the event must not be. Put the event in an append-only store. Object storage with object-lock works. An append-only ledger works. The retention boundary should be set by the regulatory or contractual life of the decision, not by the operational life of the model. Models get deprecated. The decision to promote them does not.
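To make "append-only" concrete, here is a small hash-chained log. It is an in-memory illustration of the tamper-evidence idea, not a replacement for object-lock storage or a real ledger, and the class and method names are assumptions of mine.

```python
import hashlib
import json


class AppendOnlyLog:
    """In-memory hash-chained log; illustrative only, not durable storage."""

    GENESIS = "0" * 64  # placeholder hash preceding the first entry

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> str:
        # Each entry commits to the previous entry's hash, forming a chain.
        prev = self._entries[-1]["entry_hash"] if self._entries else self.GENESIS
        body = json.dumps(event, sort_keys=True, separators=(",", ":"))
        entry_hash = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self._entries.append({"prev_hash": prev, "body": body, "entry_hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        # Recompute the chain from the start; any edited entry breaks it.
        prev = self.GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((prev + entry["body"]).encode("utf-8")).hexdigest()
            if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
                return False
            prev = entry["entry_hash"]
        return True


log = AppendOnlyLog()
log.append({"model_version": "candidate-placeholder", "route": "committee"})
log.append({"model_version": "another-placeholder", "route": "automated"})
print(log.verify())  # True until someone rewrites an earlier entry
```

A chain like this only detects rewriting; it does not prevent it. Prevention and retention are what the storage layer's own immutability controls are for.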
One more thing on identity. SSO subject claims are the floor, not the ceiling. If your approval flow accepts a click from a browser session and writes "approved by alice@company.com" into a database row, you have an audit log entry, not a non-repudiation artefact. For high-impact promotions, sign the event with a key bound to the approver – a hardware token, a platform authenticator, something the approver had to actively present. Then what you hand to a regulator is a signature, not a database row, and you can hand it over without flinching.
Route by materiality, not by ceremony
The counterargument lands early, so let me concede it. Not every promotion warrants a committee. Forcing a fraud-scoring retrain that runs four times a day through a human review board is how you build a shadow process where the real promotion happens in a Slack thread and the official one happens in a meeting nobody attends. The instinct to push back on heavy governance is correct.
The mistake is treating "lightweight" as "no event". Materiality should change the route. It should not change the schema.
A low-risk retrain of an established model – same features, same training pipeline, same evaluation harness, drift within agreed bounds – can promote on an automated approval. The event still gets written. The approver identity field carries the service principal of the promotion pipeline, signed with its workload identity. The challenger comparison field carries the incumbent and the candidate. The explanation artefact reference carries whatever automated checks ran. The payload is the policy decision JSON that the gate evaluated. Same six fields. Same store. Same retention.
A high-impact promotion – a new model class, a credit decisioning change, a clinical triage update – routes to a committee. The same event is written at the moment of human signoff. Approver identity is now a person, hardware-signed. The explanation artefact reference points to the deck or the memo the committee reviewed, hashed. Same schema.
This matters for two reasons. First, the audit story is uniform. The auditor does not have to learn a different evidence shape for every tier of model. They learn one event, and they query it. Second, materiality classifications drift. A model that started as a quiet internal tool becomes a customer-facing one. If the lightweight tier and the heavy tier emit different evidence shapes, every reclassification turns into a migration project. If the schema is constant, reclassification only changes the routing rule.
What not to do: do not invent a "governance lite" event that captures three fields for the cheap path and six for the expensive one. You will spend the next two years explaining the asymmetry to auditors, and you will lose every one of those conversations. Write the full event every time. If a field has no natural value for a given route – say, no human approver on an automated promotion – fill it with the nearest equivalent, here the service principal, and move on. A field that records "no human was involved" is data. Improvising the schema is not.
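The split between routing and schema can be sketched in a few lines. Everything here is hypothetical: the `Candidate` attributes, the routing thresholds, and the field names are stand-ins for whatever a real materiality policy defines. The only thing the sketch is meant to show is that the route varies while the required fields do not.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTOMATED = "automated"
    COMMITTEE = "committee"


@dataclass(frozen=True)
class Candidate:
    """Hypothetical materiality inputs for one promotion."""
    new_model_class: bool
    high_impact_domain: bool   # e.g. credit decisioning, clinical triage
    pipeline_unchanged: bool   # same features, training pipeline, evaluation harness
    drift_within_bounds: bool


def route_for(candidate: Candidate) -> Route:
    if candidate.new_model_class or candidate.high_impact_domain:
        return Route.COMMITTEE
    if candidate.pipeline_unchanged and candidate.drift_within_bounds:
        return Route.AUTOMATED
    return Route.COMMITTEE  # assumption: anything unclear defaults to human review


# The same six fields are required on every route.
REQUIRED_FIELDS = frozenset({
    "model_version", "comparison_models", "request_payload",
    "approver_identity", "explanation_artifact", "dataset_reference",
})


def build_event(route: Route, fields: dict) -> dict:
    # Reject a partial event no matter how cheap the route is.
    missing = sorted(name for name in REQUIRED_FIELDS if not fields.get(name))
    if missing:
        raise ValueError(f"promotion event missing fields: {missing}")
    return {**fields, "route": route.value}


routine = Candidate(False, False, True, True)
print(route_for(routine))  # Route.AUTOMATED
```

On the automated route, `approver_identity` would carry the pipeline's service principal and `request_payload` the policy decision JSON; `build_event` does not care which, only that the field is present.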
One uncomfortable caveat, because the recommendation is not free. Writing the event from the UI means the UI is now on the critical path of promotion. If the event store is unavailable, promotion has to fail closed, or you accept a queued-write pattern with its own idempotency story. Neither is trivial. The team that owns the approval surface now owns an event producer with real durability requirements – retries, dead-letter handling, an on-call rota that understands the blast radius of a dropped write. Plan for that before you ship it. Retrofitting durability onto an approval flow that started as a form submit is harder than building it in from the start.
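The fail-closed option can be sketched as an ordering rule: write the event first, promote only if the write succeeded, and make retries harmless. The store class, the exception, and the function names below are hypothetical, and the sketch assumes the `promote` callable is itself safe to call twice.

```python
import hashlib
import json


class EventStoreUnavailable(Exception):
    """Raised when the (hypothetical) event store cannot accept writes."""


class InMemoryEventStore:
    """Stand-in for a durable append-only store; illustrative only."""

    def __init__(self, available: bool = True):
        self.available = available
        self.events = {}

    def put_if_absent(self, key: str, event: dict) -> bool:
        if not self.available:
            raise EventStoreUnavailable("event store is not accepting writes")
        if key in self.events:
            return False  # retry of an event that was already written
        self.events[key] = event
        return True


def idempotency_key(event: dict) -> str:
    # Same event content gives the same key, so a retry cannot create a duplicate.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def approve_and_promote(store, event: dict, promote) -> bool:
    """Write the event, then promote. Returns True if the event was newly written."""
    # If this raises, promote() is never reached: the promotion fails closed.
    newly_written = store.put_if_absent(idempotency_key(event), event)
    # Promote even on a retry, in case the earlier attempt died after the write.
    promote(event["model_version"])
    return newly_written


promoted = []
store = InMemoryEventStore(available=True)
event = {"model_version": "candidate-placeholder", "approver_identity": "placeholder"}
approve_and_promote(store, event, promoted.append)
print(promoted, len(store.events))
```

The queued-write alternative trades this ordering for availability, and it needs the same idempotency key so that a replayed queue entry does not produce a second event.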
A final narrowing, because scope creep is how good schemas die: this whole apparatus is about the promotion boundary. It is not a substitute for monitoring, not a substitute for incident review, not a substitute for the change log on the serving stack. It answers exactly one question – what did the approver see, and when, before they clicked approve – and it answers it well. Asking it to do more is how the schema bloats until nobody trusts it.
Before the next promotion goes out, write the event schema first and make the approval UI emit it. Then sit with the event alone – no registry, no dashboards, no Slack history – and try to reconstruct what the approver was looking at. If you can't, the event is incomplete. Fix the event. Not the audit response.

