The auditor's question is almost rude in how simple it is. A credit risk model went live in March. Show me what the approver saw the moment they clicked approve. Not the model card as it looks today. Not the metrics in the current dashboard. What was on their screen, in their inbox, in their head, when they signed.

The team has plenty. They have the model artefact, pinned by hash. They have the training dataset reference. They have the AUC and KS (Kolmogorov–Smirnov) numbers from the validation run. They have a Jira ticket with an approval comment, a Confluence page with the model risk write-up, and a registry row that says status: production. Three systems. Each one true. Together, still not an answer.

Because the registry row got updated twice after promotion. The Jira ticket links to a SHAP plot URL that now 404s. The Confluence page was edited in July to reflect the current champion. The dataset hash is correct, but nobody can prove the approver looked at metrics computed against that hash versus the previous slice. The artefact survived. The decision did not.

The Artefact Survived. The Decision Did Not.

Most ML platforms were built to answer a different question: what produced what. Lineage tools are good at this. Given a model, walk backwards to the training run, the dataset version, the feature definitions, the code commit. Given a dataset, walk forwards to every model that consumed it. That graph is useful. It is also not evidence of a decision.

Lineage tells you the inputs existed. It doesn't tell you which inputs a human weighed, in what form, at what moment, against what alternative. Different questions. They need a different record.

Here is the failure mode I keep seeing. A team promotes a retrained fraud model. The approver – a second-line risk reviewer – opens the approval request, glances at the challenger comparison, opens the linked SHAP summary, and signs. Six months later, somebody notices the SHAP plot was generated against the previous quarter's holdout slice, not the new one. The registry never knew the difference. The lineage graph correctly shows the model was trained on the new data. But the approver's decision was anchored to an explanation artefact that referenced the wrong slice. That is the kind of mistake you only make once, and only if you find it before the regulator does.

A registry row plus a Jira ticket plus a Confluence page is three artefacts and zero evidence. Each of them is mutable. Each is owned by a different system with a different retention policy and a different idea of what "the truth" means. Stitching them together at audit time is archaeology, not governance. This is the gap where teams pass an internal review – where everyone trusts each other's reconstructions – and fail an external audit, where nobody does.

The fix isn't another dashboard. It is a single record, written once, never edited, that captures the decision state at the instant of promotion. Call it the promotion event. Treat it as the join key between the model registry, the feature store, and the approval workflow – the one row everyone else points at, rather than the row everyone else tries to reconstruct.

What Belongs Inside a Promotion Event

Six fields. Each one load-bearing, each one captured at the moment of approval rather than reassembled afterwards.

  • Frozen model version. A content-addressed reference to the model artefact – weights, config, preprocessor, the lot. Not a tag like v2.3 that someone can move. A hash.
  • Challenger comparison. The specific champion-versus-challenger numbers shown to the approver, captured as a frozen payload, not a link to a dashboard that will recompute next Tuesday.
  • Approval request payload. The exact body of what the approver was sent or shown, including the rendered metrics, the diffs, the warnings. Bytes, not a URL.
  • Approver identity with auth context. Who signed, via which identity provider, with which session, under which role. An OIDC token reference or signed assertion, not "approved by jdoe" in a free-text field.
  • Explanation artefact reference. A content-addressed pointer to the SHAP plot, the calibration curve, the fairness slice – whatever the approver actually opened. Tied to the dataset slice it was computed on. If the artefact changes, the hash changes, and the old promotion event still points at the old hash.
  • Dataset reference. The exact training and validation slices, by hash, including any segment filters applied. Not the logical dataset name.

These six are the minimum. You'll probably want more – a code commit, a feature store snapshot ID, a policy version – but if any of the six above is missing or mutable, the rest is decoration.
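If that reads abstract, here is a minimal sketch of the record in Python. The field names and types are mine, not a standard, and the event ID and timestamp are extras beyond the six; the point to notice is that every reference is a content hash or a frozen payload, never a mutable link.

```python
from dataclasses import dataclass

# Illustrative sketch of a promotion event record. Field names are
# assumptions for this example, not a standard schema.
@dataclass(frozen=True)
class PromotionEvent:
    event_id: str                    # assigned at write time, e.g. hash of the canonical record
    model_hash: str                  # frozen model version: weights + config + preprocessor
    challenger_comparison: dict      # champion-vs-challenger numbers exactly as shown to the approver
    approval_request_payload: bytes  # the rendered request body: bytes, not a URL
    approver_subject: str            # identity from the IdP, e.g. an OIDC `sub` claim
    approver_auth_context: dict      # issuer, session, role, token reference
    explanation_hash: str            # SHAP plot / calibration curve the approver actually opened
    explanation_dataset_hash: str    # the slice that explanation was computed on
    training_dataset_hash: str       # exact training slice, by hash
    validation_dataset_hash: str     # exact validation slice, including segment filters
    approved_at_utc: str             # ISO 8601 timestamp captured at the moment of signing
```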

Immutability is where most teams quietly fail. They build the schema, capture the fields, and then store the event in a Postgres table that anyone with write access can update. That is not an immutable record. That is a database row with good intentions. The promotion event needs to live in an append-only store, content-addressed, with each event signed at write time. Object storage with object lock, a transparency-log style append-only structure, or a notarised hash chain – pick one. The property you need is simple: an auditor (or you, on a bad day) can prove the event has not been edited since the moment it was written.
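A minimal sketch of the hash-chain option, standard library only. The file name, the environment variable, and the HMAC stand-in for a real signature are all assumptions; in production you would want a proper signing key (Ed25519, a KMS, an HSM) and storage that enforces append-only at the infrastructure level, not just by convention.

```python
import hashlib, hmac, json, os

LEDGER_PATH = "promotion_events.log"  # illustrative stand-in for object storage with object lock
SIGNING_KEY = os.environ.get("PROMOTION_SIGNING_KEY", "dev-only-key").encode()

def _canonical(payload: dict) -> bytes:
    # Canonical JSON, so the same content always hashes to the same ID.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()

def append_promotion_event(payload: dict) -> str:
    """Write a promotion event to the hash chain and return its content address."""
    prev_hash = "0" * 64
    try:
        with open(LEDGER_PATH, "rb") as f:
            lines = f.read().splitlines()
        if lines:
            prev_hash = json.loads(lines[-1])["event_hash"]
    except FileNotFoundError:
        pass

    body = _canonical(payload)
    event_hash = hashlib.sha256(prev_hash.encode() + body).hexdigest()
    # HMAC here is a stand-in for a real signature scheme.
    signature = hmac.new(SIGNING_KEY, event_hash.encode(), hashlib.sha256).hexdigest()

    record = {"prev_hash": prev_hash, "event_hash": event_hash,
              "signature": signature, "payload": payload}
    with open(LEDGER_PATH, "a") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return event_hash
```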

A signed event also gives you something the ticket-plus-wiki approach never will: a stable join key. The registry row points at the promotion event ID. The feature store snapshot points at it. The approval workflow closes against it. The deployment manifest carries it as a label. When something goes wrong in production and somebody asks which promotion decision authorised this behaviour, there is one ID to follow, and it resolves to one record nobody has touched.
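Continuing the same sketch, with the same illustrative ledger file: resolving a promotion event ID means walking the chain, re-deriving every hash, and returning the one frozen record. Anything that carries the ID, a registry row, a deployment label, an incident ticket, can call this and get the same answer.

```python
import hashlib, json

LEDGER_PATH = "promotion_events.log"  # same illustrative ledger as the sketch above

def resolve_promotion_event(event_hash: str) -> dict:
    """Walk the ledger, re-verify the hash chain, and return the frozen record for one ID."""
    prev = "0" * 64
    with open(LEDGER_PATH) as f:
        for line in f:
            record = json.loads(line)
            body = json.dumps(record["payload"], sort_keys=True, separators=(",", ":")).encode()
            expected = hashlib.sha256(prev.encode() + body).hexdigest()
            if record["prev_hash"] != prev or record["event_hash"] != expected:
                raise ValueError(f"ledger tampered with before {event_hash}")
            if record["event_hash"] == event_hash:
                return record
            prev = record["event_hash"]
    raise KeyError(f"no promotion event {event_hash} in the ledger")
```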

One more distinction worth pinning down, because teams blur it constantly. ML lineage is a graph of artefacts and the processes that produced them. Decision state is the snapshot of what a human weighed and when. Lineage is generated by your pipelines. Decision state is generated by your approvers. They live in different systems for good reason. The promotion event is the bridge – the place where lineage references and human judgement are bound together and frozen.

Route by Materiality, Not by Reflex

Now the uncomfortable caveat. If you apply this pattern to every retrain, you will drown. Most production ML systems retrain frequently on stable features against stable schemas, and treating each of those as a committee-grade event is both expensive and dishonest – it manufactures ceremony where there was never a real decision to record.

Route by materiality. The promotion event itself is cheap to write; the human review around it is not. So the question isn't whether to capture the event (always yes, for anything that touches production) but what gates fire around it.

A reasonable split for BFSI and insurance platforms:

  • Automated path. Retrains on the existing feature set, same loss, same segments, drift within agreed bands. The promotion event is written automatically, the approver is the pipeline's service identity (with the auth context to prove it), and the challenger comparison runs without a human in the loop. Evidence is captured; ceremony is not.
  • Reviewer path. Threshold changes, modest feature additions inside an existing family, calibration adjustments. A single named reviewer signs, the promotion event captures their identity, and the explanation artefact is regenerated against the new slice before signoff.
  • Committee path. New feature additions that change the input surface, segment expansion into a population the model wasn't validated against, loss function changes, or anything that alters the model's intended use. Multiple signers, each captured separately in the event, with the full approval payload frozen.

The materiality rules belong in a policy the platform enforces, not in tribal knowledge. Strictly speaking, the platform doesn't decide what is material – the model risk function does – but the platform is the only place that can refuse to promote without the right path completing. That refusal is the point.
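As a sketch of what that enforcement might look like, with the change descriptors and thresholds invented for illustration: the policy maps a described change to a required path, and the platform refuses to promote unless at least that path has completed. The real rules come from the model risk function, not the platform team.

```python
from enum import Enum

class PromotionPath(Enum):
    AUTOMATED = "automated"
    REVIEWER = "reviewer"
    COMMITTEE = "committee"

_STRICTNESS = {PromotionPath.AUTOMATED: 0, PromotionPath.REVIEWER: 1, PromotionPath.COMMITTEE: 2}

def required_path(change: dict) -> PromotionPath:
    """Map a described change to the approval path the platform will insist on.
    The change descriptor keys are illustrative, not a standard vocabulary."""
    if (change.get("new_features") or change.get("new_segments")
            or change.get("loss_changed") or change.get("intended_use_changed")):
        return PromotionPath.COMMITTEE
    if (change.get("threshold_changed") or change.get("features_added_in_family")
            or change.get("calibration_adjusted")):
        return PromotionPath.REVIEWER
    if change.get("drift_within_bands", False):
        return PromotionPath.AUTOMATED
    # Anything the policy cannot classify gets escalated, never waved through.
    return PromotionPath.COMMITTEE

def enforce(change: dict, completed_path: PromotionPath) -> None:
    """The platform's one job here: refuse to promote unless at least the required path completed."""
    required = required_path(change)
    if _STRICTNESS[completed_path] < _STRICTNESS[required]:
        raise PermissionError(f"promotion blocked: {required.value} path required, "
                              f"only {completed_path.value} completed")
```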

What this approach gives up: it won't catch a genuinely novel risk hiding inside a "low-materiality" retrain. Someone will eventually argue, after the fact, that a routine retrain should have been routed to committee. Fine. The promotion event still tells you exactly what was considered and by whom, which is the input you need to revise the materiality rules. Without the event, that conversation is speculation.

One thing not to do: don't let approvers attach evidence by pasting links into a free-text field. Links rot. Pages get edited. Plots get regenerated. If the explanation artefact isn't content-addressed and referenced by hash in the event, you have rebuilt the problem you started with, in a nicer-looking UI.

Before your next promotion, point at where each of the six fields is signed and stored. If any one of them lives only in a mutable system – a ticket body, a wiki page, a registry row that can be overwritten – fix that field first, before you touch the others. Then route the rest by materiality. The event is cheap. The argument you avoid having with an auditor, eighteen months later, is not.