An internal auditor sits down with the ML platform team and asks one question about a credit-risk model that was promoted last quarter. Not a hard one. "Show me what the approver was looking at when they clicked approve." The team pulls up the registry: model version, dataset hash, training run ID, the lineage graph back to the feature store, performance metrics on the holdout slice. All present. All correct. None of it answers the question.

That is the gap this post is about.

The auditor's question the registry cannot answer

Registries are good at one job. They describe artifacts. Weights, hyperparameters, training data references, evaluation metrics, the lineage edges that connect them. A well-run registry will tell you, with hashes, exactly which bytes shipped to production on a given date. That is necessary. It isn't what the auditor is asking for.

The auditor is asking about a decision. Who approved this promotion, on what evidence, against which challenger, with which explanation in front of them, under what risk tier. The registry does not store decisions. It stores objects. The distinction sounds pedantic until you are in the room.

Consider what reviewers actually probe at signoff. Was the challenger's PSI (population stability index) drift report the one generated against the current production population, or the stale one from two weeks earlier? Did the approver see the SHAP summary built from the frozen candidate, or one regenerated afterwards when somebody re-ran the notebook? Was the approval request payload the same one the approver saw in their inbox, or did somebody edit the threshold values after the click? These aren't exotic questions. Reviewers ask them every time.

A registry can tell you the model artifact has SHA a13f…. It cannot tell you which rendering of which explanation artifact the approver had on screen at 14:07 when they clicked. Hashing the model is necessary. The approver's field of view is the missing object.

I'll concede the obvious counterargument. Yes, many teams capture some of this in ticketing systems – a Jira approval, a ServiceNow change record, a Slack thread with screenshots. That is evidence of activity, not evidence of state. Activity logs scatter. They get edited. They reference URLs that point to artifacts which have since been regenerated. Reconstructing the approver's view from them is archaeology, and archaeology is not a control.

What belongs in a promotion event

The fix is narrow. One immutable record, written once at signoff, addressable by a single event ID, with six fields:

  • Frozen model version. The content hash of the candidate artifact, plus the registry version ID. Not a pointer that resolves to whatever the latest is. The hash, captured at signoff.
  • Challenger comparison. The reference to the incumbent model version and the specific comparison report ID – metric deltas, drift, slice performance – as it existed at the moment of approval. Not a query that re-computes on read.
  • Approval request payload. The exact serialized request the approver acted on: proposed deployment scope, risk tier, threshold values, rollback plan, the human-readable summary. Whatever was in the box when they said yes.
  • Approver identity. Authenticated principal, the role under which they approved, and the authentication method. An email string in a free-text field will not survive cross-examination.
  • Explanation artifact reference. A content-addressed pointer to the SHAP or counterfactual or partial dependence output the approver reviewed. Immutable. Re-runnable from the frozen model and dataset, but not regenerated lazily on read.
  • Dataset reference. The content-addressed reference to the validation dataset used to produce the metrics and the explanation. Snapshot, not query.

Six fields. One event, append-only.
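
For concreteness, here is a minimal sketch of that record as a frozen Python dataclass. The field names and types are illustrative, not a prescribed schema; the point is that every field is captured at signoff and the object is immutable once constructed.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)  # immutable once constructed, like the event itself
class PromotionEvent:
    event_id: str              # e.g. "pe_2024Q3_0481"
    # 1. Frozen model version
    model_hash: str            # content hash of the candidate, captured at signoff
    model_version_id: str      # the registry version ID
    # 2. Challenger comparison
    incumbent_version_id: str  # the model being challenged
    comparison_report_id: str  # frozen report, not a query that re-computes on read
    # 3. Approval request payload
    approval_payload: str      # exact serialized request the approver acted on
    # 4. Approver identity
    approver_principal: str    # authenticated principal
    approver_role: str         # role under which they approved
    auth_method: str           # how they were authenticated
    # 5. Explanation artifact reference
    explanation_ref: str       # content-addressed, never regenerated on read
    # 6. Dataset reference
    dataset_ref: str           # content-addressed validation snapshot
    approved_at: datetime      # when they clicked
```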

The shape matters less than the discipline: the registry stores the artifact description, the promotion event store holds the decision state, and the registry references the event by ID rather than the other way around. Embedding decision records inside the registry is the version of this idea that fails in eighteen months, when somebody updates a row to fix a typo and the audit trail quietly mutates.

Storage. Boring, load-bearing. Append-only object storage with object lock, or an event log with retention set well past your regulatory horizon. Write once. Reference by ID from the registry entry. If a model version in the registry points to promotion event pe_2024Q3_0481, that ID resolves to the same bytes today, next quarter, three years from now under supervisory review. The registry can evolve. The event cannot.
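
As a sketch of what "write once" can mean mechanically: S3 object lock in compliance mode refuses modification and deletion until the retention date, even from the account root. The bucket name, key layout, and retention date below are illustrative, and the bucket must have been created with object lock enabled.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def write_promotion_event(event: dict) -> str:
    """Write the event exactly once under object lock; return its ID."""
    event_id = event["event_id"]
    s3.put_object(
        Bucket="promotion-events",  # hypothetical bucket, created with object lock
        Key=f"events/{event_id}.json",
        Body=json.dumps(event, sort_keys=True, default=str).encode(),
        # COMPLIANCE mode: nobody, including root, can shorten the retention
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=datetime(2032, 1, 1, tzinfo=timezone.utc),
    )
    return event_id  # the registry entry references this, not the other way around
```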

One concrete failure mode justifies the whole pattern. A team had a perfectly good registry, perfectly good lineage, and a SHAP plot served from a notebook that re-ran on demand. The approver looked at the plot on Tuesday and signed off. On Thursday somebody refactored a feature transform upstream – cosmetic, they thought – and the notebook regenerated the plot with subtly different bar ordering and one feature renamed. When audit asked six months later what the approver had seen, the artifact at the URL had drifted. Not the model. The explanation. Same prediction surface, different picture. That is the kind of mistake you only make once.

Strictly speaking, the SHAP values hadn't changed in any way that affected the model's behavior. The audit finding still stood, because the control – "approver reviewed an explanation of the candidate model" – could no longer be evidenced. The artifact was no longer the one approved. The chain of custody was broken by a refactor nobody flagged as governance-relevant.

In practice: the explanation artifact and the comparison report must be materialized and frozen at signoff, addressed by content hash, written into the event, never regenerated in place. Re-derivable from inputs, yes. Re-served from cache, no.
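
A content-addressed freeze is a few lines. The `store` below stands in for whatever object store you use; its interface (`exists`, `put`) is assumed, not a real client. What matters is that the address is derived from the bytes, so the artifact at that address cannot silently become a different artifact.

```python
import hashlib

def freeze_artifact(content: bytes, store) -> str:
    """Materialize an artifact at a content-addressed key, write-once."""
    digest = hashlib.sha256(content).hexdigest()
    key = f"artifacts/sha256/{digest}"
    if not store.exists(key):    # idempotent: identical bytes, identical address
        store.put(key, content)  # first write wins; there is no "regenerate in place"
    return key                   # this reference is what goes into the promotion event
```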

Materiality decides the route, not the ceremony

There is a predictable objection. If every promotion event has to carry a full challenger comparison, a frozen explanation, an authenticated approval payload, then weekly retrains of a low-risk fraud-scoring auxiliary model become a committee exercise. They shouldn't. And they don't need to.

The schema is the same. The approver set is not.

Tier the models by materiality – the standard axes are decisioning impact, regulatory exposure, customer reach, blast radius on a bad day. A Tier 1 model (credit decisioning, underwriting, anything that touches a regulated outcome on a named customer) gets routed to the model risk committee or its delegated quorum. A Tier 2 model goes to a two-person technical review with a named accountable owner. A Tier 3 retrain – same features, same architecture, scheduled refresh, drift inside the agreed envelope – goes to an automated approver acting under a pre-authorized policy, with a human on the rota notified.

The promotion event still gets written. Six fields, every time. The approver identity field records the automated principal and the policy ID it acted under, not a human name. The approval request payload records the policy evaluation result. If audit asks how a Tier 3 retrain was approved, the event resolves to the policy, the policy resolves to the human governance decision that authorized that class of automated approval, and the chain holds.
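
A sketch of the routing, with illustrative names throughout. Only the approver resolution varies by tier; the event written afterwards has the same six fields regardless of which branch fires.

```python
def resolve_approval_route(tier: int, drift_within_envelope: bool) -> dict:
    """Map materiality tier to an approver; the event schema never varies."""
    if tier == 1:
        # regulated outcomes on named customers: committee or delegated quorum
        return {"principal": "model-risk-committee", "role": "quorum"}
    if tier == 2:
        return {"principal": "named-accountable-owner", "role": "two-person-review"}
    if tier == 3 and drift_within_envelope:
        # automated principal acting under a pre-authorized policy;
        # the policy ID resolves to the human decision that authorized it
        return {
            "principal": "svc-promotion-bot",
            "role": "automated-approver",
            "policy_id": "pol-tier3-scheduled-retrain",
        }
    # drift outside the agreed envelope escalates out of the automated path
    return {"principal": "named-accountable-owner", "role": "two-person-review"}
```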

What you do not do, under any circumstances, is vary the schema by tier. The temptation is real – "low-risk models don't need an explanation artifact reference" – and it is wrong. The moment the schema bends to materiality, the audit reviewer cannot apply one query across the promotion log to answer "show me every Tier 1 promotion in the last twelve months and the explanation each approver reviewed". They have to know in advance which tier had which fields. That is how controls quietly degrade into folklore.
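
That query is exactly what a uniform schema buys. Assuming the events are deserialized as dicts shaped like the record sketched earlier, with the risk tier carried inside the approval payload, the twelve-month Tier 1 review is a single pass:

```python
import json
from datetime import datetime, timedelta, timezone

def tier1_promotions_last_year(events: list[dict]) -> list[tuple[str, str]]:
    """Every Tier 1 promotion and the explanation its approver reviewed."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=365)
    return [
        (e["event_id"], e["explanation_ref"])
        for e in events
        if json.loads(e["approval_payload"])["risk_tier"] == 1
        and e["approved_at"] >= cutoff
    ]
```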

One caveat where this recommendation gets uncomfortable. A frozen promotion event store adds operational surface. Somebody owns the event schema. Somebody owns the materialization of the explanation artefact at signoff. Somebody pages when the event writer fails and a promotion is blocked. If you have three models in production and a quarterly retrain cadence, this is overkill and a spreadsheet with screenshots will get you through the next audit. If you have forty models and weekly retrains across six business lines, the spreadsheet has already failed and you just haven't been audited yet.

Most outages are paid for in advance. So are most audit findings.

The test before your next promotion is one sentence long. If you cannot replay the approver's exact field of view from a single event ID – the frozen model, the challenger comparison they saw, the explanation that was on their screen, the dataset behind it, the payload they acted on, the identity they acted as – the control is not in place. Whatever the registry shows.
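
Mechanically, the test looks like this. `event_store` and `artifact_store` are assumed interfaces, not real clients; the shape of the check is the point. If any lookup fails, or the bytes no longer match their content address, the control is not in place.

```python
import hashlib

def replay_field_of_view(event_id: str, event_store, artifact_store) -> dict:
    """Resolve one event ID to exactly what the approver acted on."""
    e = event_store.get(event_id)  # one ID, one immutable record
    explanation = artifact_store.get(e["explanation_ref"])
    # the artifact must still match its content address, or custody is broken
    digest = hashlib.sha256(explanation).hexdigest()
    if digest not in e["explanation_ref"]:
        raise RuntimeError(f"explanation artifact drifted for {event_id}")
    return {
        "model": artifact_store.get(e["model_hash"]),
        "challenger_comparison": artifact_store.get(e["comparison_report_id"]),
        "explanation": explanation,
        "dataset": artifact_store.get(e["dataset_ref"]),
        "payload": e["approval_payload"],
        "identity": (e["approver_principal"], e["approver_role"], e["auth_method"]),
    }
```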