A credit-risk model went to production six months ago. The registry has the version tag, the metrics, the dataset hash, the training run ID, the git SHA. Everything you would expect. Then second-line review walks in with a question that sounds almost trivial: show me what the approver actually saw on the screen before they clicked approve. Not the metrics that were logged. The view. The challenger comparison they were looking at. The SHAP plot in the tab next to it. The exact approval request, with the exact justification text, signed by the exact human.
Nobody can produce it. The dashboard has been recomputed since. The Slack thread where the approval was discussed scrolled off retention. The approver remembers approving it and is reasonably sure of why, but reasonably sure is not the artefact being asked for.
That is the kind of mistake you only make once.
The artefact the auditor asks for is not in the registry
Model registries are good at one job. They record what the model is: weights, version, lineage pointers, evaluation metrics, the dataset reference, sometimes the training environment. ML lineage tools extend that backwards through features and forwards through deployments. Together they cover the artefact side of the story cleanly.
What they don't record is the decision state at signoff. The registry tells you the model exists and how it was built. It does not tell you what was on the approver's screen at 14:42 on a Tuesday when they decided this version was acceptable to put in front of customers. Different objects. Conflating them is the gap most teams discover during a review, not before.
The distinction matters because the question an auditor or a model risk reviewer actually asks is rarely what was the model. They already have that. The question is what was the basis for the decision. Which challenger was it compared against. Which slice of data. Which explanation. Which human accepted accountability, against which written request, on which day. And, critically, can you reproduce that view today, byte for byte, or only approximately.
Approximately is not an answer in regulated review. It is a finding.
Here is the failure mode in concrete form. The approver clicked approve while looking at a SHAP summary plot rendered against the validation slice as it stood that morning. The plot lived in a notebook on a shared dashboard. Six months later, that dashboard has been refreshed against newer data, the underlying feature pipeline has had two schema patches, and the original rendering is unrecoverable. The model is still the same model. The decision artefact is gone. You have the what, you have lost the why-as-seen.
This isn't an edge case. It's the default behaviour of any setup where the explanation is a live link rather than a frozen object.
What belongs inside a single promotion event
The unit of evidence is not a row in the registry and not a ticket in the workflow tool. It is one immutable event, written once at promotion time, content-addressed, append-only. The registry entry points at it. Not the other way around. If the registry is the source of truth for the decision, you've already lost – the registry is mutable in ways the decision record cannot be.
Six fields have to be sealed together inside that event. All six. Sealed at the same moment, against the same hash.
- Frozen model version. The exact artefact being promoted, by content hash, not by tag. Tags move. Hashes don't.
- Challenger comparison. What was this version measured against. The incumbent in production, or a named baseline, with the comparison metrics as computed at promotion time on a named evaluation set.
- Approval request. The written request the approver responded to. Justification, scope, intended deployment, known limitations. The text the human actually read.
- Approver identity. Who clicked approve. Authenticated identity, not a display name. Tied to a role that had the authority to approve at that moment.
- Explanation artefact reference. A pointer to a frozen rendering of the explanation the approver reviewed. SHAP plot, partial dependence, fairness slice – whichever was on the screen, captured as an immutable object, not a dashboard URL.
- Dataset reference. The evaluation dataset by content hash, not by name. validation_q3 is a label. Labels move under you. Content hashes don't.
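One possible shape for that event, sketched as a Python dataclass. The field names and the "sha256:" prefix convention are illustrative, not a standard; the point is that every reference is a content hash and the record cannot be mutated once constructed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionEvent:
    """One immutable promotion decision. Every reference is a content hash
    (an illustrative "sha256:<hex>" convention), never a mutable tag."""
    model_version: str            # hash of the exact artefact being promoted
    challenger_comparison: dict   # challenger hash plus metrics computed at promotion time
    approval_request: str         # the justification text the approver actually read
    approver_identity: str        # authenticated identity and role, not a display name
    explanation_artefact: str     # hash of the frozen rendering the approver reviewed
    evaluation_dataset: str       # hash of the evaluation data, not its label
```

frozen=True only protects the in-process object; the immutability that matters comes from the append-only store the sealed event eventually lands in.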
The explanation artefact deserves its own line. Most teams get this one wrong because the dashboard is right there and it feels redundant to snapshot it. It isn't. A dashboard is a query against state that drifts. The minute you re-run the notebook, the minute the feature store gets a backfill, the minute someone fixes a bug in the explainer library, the rendering the approver saw is gone. What you need stored is the rendered object – the PNG, the JSON of SHAP values, the HTML of the comparison report – at a content-addressed location that cannot be overwritten. A link to a dashboard is not evidence. A link to s3://promotion-events/{event_hash}/explanation.png where the bucket is write-once, that is evidence.
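A minimal sketch of the freezing step, assuming a hypothetical write-once store: put_if_absent is an assumed interface that must fail rather than overwrite, not a call from any particular client library, and the key prefix is made up for illustration.

```python
import hashlib


def freeze_explanation(rendered_bytes: bytes, store) -> str:
    """Store the rendered explanation (PNG, SHAP JSON, HTML report) under its
    own content hash in a write-once location, and return that hash for the
    promotion event to reference."""
    digest = "sha256:" + hashlib.sha256(rendered_bytes).hexdigest()
    store.put_if_absent(key=f"promotion-artefacts/{digest}", body=rendered_bytes)
    return digest
```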
Strictly speaking, the event itself does not need to contain the explanation bytes. It needs to contain a reference that can be resolved deterministically, forever, to the bytes. Same for the dataset, same for the model weights. The event is a small, signed manifest. The objects it points to are immutable. Together they form one record an auditor can replay a year from now and see the screen the approver saw.
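A sketch of the sealing step, with the six fields as a plain dict (dataclasses.asdict of the structure above would do) and HMAC standing in for whatever signing mechanism your platform actually provides:

```python
import hashlib
import hmac
import json


def seal_event(event: dict, signing_key: bytes) -> dict:
    """Serialise the event deterministically, content-address it, sign it.
    The sealed record is what goes into the append-only store."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    event_hash = "sha256:" + hashlib.sha256(canonical).hexdigest()
    signature = hmac.new(signing_key, canonical, hashlib.sha256).hexdigest()
    return {"event_hash": event_hash, "signature": signature, "event": event}
```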
One detail that gets skipped: the event has to be written before the deployment side effects, and the deployment must refuse to proceed without a valid event hash. Otherwise you've built ceremony, not control. If the model can ship without the event, the event is documentation. If the model cannot ship without the event, the event is a gate.
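A sketch of that gate. event_store and target are hypothetical interfaces; the only point being illustrated is the ordering: verify the sealed event first, produce side effects second.

```python
def deploy(model_hash: str, event_hash: str, event_store, target) -> None:
    """Refuse to deploy unless a sealed promotion event exists for exactly
    this model version."""
    sealed = event_store.get(event_hash)        # lookup in the append-only store
    if sealed is None:
        raise RuntimeError("no promotion event found: refusing to deploy")
    if sealed["event"]["model_version"] != model_hash:
        raise RuntimeError("promotion event does not match this artefact")
    target.roll_out(model_hash)                 # side effects only after the gate passes
```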
The decision record is your problem to design. Registries and lineage tools won't draw this boundary for you, and that's the right separation of concerns – only your second line knows what they will ask for.
Route by materiality, or the process collapses under its own weight
You want us to convene a committee every time we retrain a churn model on Monday's data? No. That would be absurd, and any process that demands it will be quietly bypassed within a quarter. Governance failures, like outages, get paid for in advance: usually by a process so heavy that engineers route around it.
Materiality routing is the load-bearing piece. The same six-field event structure applies to every promotion. What differs is the path the event takes to get sealed.
A scheduled retrain of a stable model on fresh data, where the architecture, features, and evaluation regime are unchanged and the metrics stay inside a pre-agreed band, is a low-materiality change. The event is generated automatically. The approver is the on-call ML engineer. The challenger comparison runs against the incumbent. The explanation artefact is rendered by the pipeline and frozen by the pipeline. Total human time: minutes.
A change in feature set, a change in model class, a shift in the population the model is scored against, a metric movement outside the band, or a deployment into a new decisioning context – that is a material change. It hits committee. The same six fields are sealed, but now the approval request is written by a human, the challenger is named explicitly, the explanation is reviewed in session, and the approver is whichever role your governance framework names for that risk tier.
The threshold between the two paths is engineered, not negotiated case by case. If you negotiate per case, you've recreated the problem you were trying to solve, and you've added meetings. Write the threshold down. Encode it. Examples of things that should be encoded, not debated:
- Metric drift bands per model, set at promotion of the previous version.
- Feature set diff: any change to the input contract auto-routes to material.
- Population shift: a PSI threshold on scoring traffic, monitored continuously, that flips the next promotion to material.
- Model class or hyperparameter family changes: always material.
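One way to encode those rules, sketched in Python. The field and policy names are illustrative; the thresholds live in a policy object written down at promotion of the previous version, and the PSI is computed over pre-binned proportions of scoring traffic.

```python
import math


def psi(expected: list[float], actual: list[float]) -> float:
    """Population stability index over pre-binned proportions."""
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected, actual)
        if e > 0 and a > 0
    )


def route_promotion(change: dict, policy: dict) -> str:
    """Return 'auto' or 'committee' for a candidate promotion."""
    if change["feature_set_changed"]:                       # any change to the input contract
        return "committee"
    if change["model_class_changed"]:                       # new architecture or hyperparameter family
        return "committee"
    if change["scoring_psi"] > policy["psi_threshold"]:     # population shift on live traffic
        return "committee"
    for metric, delta in change["metric_deltas"].items():
        band = policy["drift_bands"].get(metric, 0.0)       # no agreed band: any movement is material
        if abs(delta) > band:
            return "committee"
    return "auto"                                           # lightweight path, on-call engineer approves
```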
The uncomfortable caveat: this works cleanly for models that retrain on a stable input contract. It gets harder for models whose feature pipelines are themselves under active development, because the "is this material" question bleeds into upstream changes that didn't go through promotion at all. That's a real sharp edge and pretending otherwise would be dishonest. The pragmatic answer is that the feature pipeline needs its own promotion discipline, with its own event, and the model promotion event references the feature pipeline version by hash. Otherwise the lineage breaks at exactly the point auditors will push on.
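In terms of the event sketched earlier, that reference is just one more content-hash field on the model's promotion event; the field name and placeholder hashes here are hypothetical.

```python
# The model's event pins the feature pipeline by the hash of the pipeline's
# own sealed promotion event, sealed separately and earlier.
model_event = {
    "model_version": "sha256:aaaa...",
    "feature_pipeline_event": "sha256:bbbb...",
    # ...the remaining fields from the six-field structure above
}
```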
One more thing not to do. Don't try to retrofit historical promotions. The temptation is real – you have an audit next month, you have eighteen months of models in production, and you want to backfill events to make the picture look complete. Don't. A reconstructed event is not an immutable record of a past decision. It is a present-day reconstruction wearing a past timestamp, and any reviewer worth their salary will see through it in one question. State plainly that the discipline starts from a date. Apply it forward. Let the older models carry whatever evidence they actually have.
Before your next promotion, write down the six fields and check where each one currently lives. If any of them lives only in a dashboard, a Slack thread, or an approver's memory, that's the work. Start there. Route by materiality so the lightweight path stays lightweight, and so the committee path is reserved for the changes that actually warrant a room full of people.
