Lineage Without Decision State Fails Model Audits

An auditor sits across the table and asks a very small question about a credit model promoted six months ago. Show me what the approver saw on the day they clicked approve. The registry pulls up the artifact: a model version, training metrics, a dataset hash, a timestamp, the approver's name. Everything is internally consistent. Nothing is missing from the registry's point of view. And yet nobody in the room can reconstruct the actual screen the approver looked at before signoff – which challenger it was compared to, which explanation artefact was attached, which scoring population the metrics referred to on that specific afternoon.

That is the moment a lot of teams discover they have been confusing two different records.

The registry holds artifact state: what the model is. Governance evidence needs decision state: what the approver believed when they approved it. They look adjacent. They aren't. Conflating them is what fails the audit.

The Registry Remembers the Model, Not the Meeting

A model registry is built to answer questions about the artifact. Which version is in production. What it was trained on. What metrics it reported on the holdout. Which feature store snapshot it consumed. Load-bearing, all of it. But registry metadata is mutable in ways that matter for evidence. Tags get re-pointed. Explanation files get regenerated when somebody reruns the notebook. A dataset reference can resolve to a logical name whose underlying partition was rewritten last quarter. None of this is malicious. It is just how working systems behave.

Here is the failure mode I keep seeing. The approver clicks approve on a Tuesday. They reviewed a SHAP plot, a population stability comparison against the incumbent, and a short note from the modeller about a re-weighted training cohort. Three weeks later, somebody re-runs the explanation pipeline against the same model version to refresh a dashboard. The file at the same path is now a different file. The registry still points to it. The link in the approval email still resolves. The bytes have changed. Six months on, the auditor asks to see the explanation artefact, and what surfaces is not what the approver saw. It is a regenerated cousin of it.

A Git commit doesn't save you here. A commit proves code state at a point in time. It says nothing about which human, on which screen, with which comparison in front of them, accepted the risk of pushing that code's output into a decisioning path. A registry tag is worse, because tags are designed to move. The "production" tag today does not point at what it pointed at last March. That is the point of a tag.

Approval is a human act, performed in front of a specific screen on a specific afternoon. The record of that act has to be at least as durable as the artifact it approved – or, more accurately, more durable, because the artifact can be retrained and the decision cannot be re-decided after the fact.

What Belongs Inside a Promotion Event

Treat promotion as an event, not a state transition on a registry row. Emit one record at the moment of signoff. Write it to an append-only store with its own retention policy, separate from the registry's mutable metadata. That store is your evidence. The registry is your operational catalogue. They are allowed to disagree, and when they do, the promotion event wins for audit purposes.

Six fields. Bound atomically. If any one of them is written separately or later, the binding is fiction.

Frozen model version – an immutable reference to the exact artifact bytes, not a tag. Content-addressed if you can manage it.
Challenger comparison – the specific incumbent or alternative the new model was scored against, with the comparison metrics computed at decision time and embedded in the event payload itself, not linked.
Approval request payload – the exact JSON or form the approver acted on. Whatever the UI rendered, snapshot the source of truth behind it.
Approver identity – a verified principal, tied to your IdP, with the auth method recorded (SSO assertion, hardware key, step-up). Not a display name.
Explanation artefact reference – a content hash of the SHAP output, feature-importance file, or whatever explanation surface was reviewed. Hash the bytes the approver saw. If the file is regenerated later, the hash diverges and you can prove which one was original.
Dataset reference – pinned to an immutable snapshot ID, not a logical dataset name. If your platform only exposes logical names, resolve them at event-write time and store the resolution.
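As a minimal sketch of the explanation artefact field, assuming SHA-256 is an acceptable content hash for your platform: hash the bytes at signoff, store the digest in the event, and compare later. The function names and the sample bytes are hypothetical, chosen only for illustration.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a SHA-256 digest of the exact bytes the approver reviewed."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def matches_approved(data: bytes, approved_hash: str) -> bool:
    """True only if the bytes presented today are the bytes hashed at signoff."""
    return content_hash(data) == approved_hash


# Illustrative bytes standing in for an explanation export; not real model output.
reviewed = b'{"feature": "utilisation", "mean_abs_shap": 0.31}'
regenerated = b'{"feature": "utilisation", "mean_abs_shap": 0.29}'

# Stored in the promotion event at signoff time.
approved_hash = content_hash(reviewed)

original_still_matches = matches_approved(reviewed, approved_hash)
regenerated_matches = matches_approved(regenerated, approved_hash)
```

The same path can hold either file. Only the digest recorded at signoff tells you which one the approver saw.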

Bind these together. One event, written once, signed if your threat model warrants it. Give the event an idempotency key derived from the approval request ID plus the approver's principal. Signoff flows retry. Browsers double-submit. Networks blip during the exact second a senior risk officer clicks approve. Without an idempotency key, you end up with two approval rows for the same act, which is arguably worse than having none, because now you have to explain which one is real.
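One way to picture the binding and the idempotency key is sketched below. This is not a reference schema: the field names, the InMemoryEventStore class, and the sample values are all assumptions made for illustration, and a real store would be durable and append-only rather than a dict.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PromotionEvent:
    model_digest: str            # content address of the artifact bytes, not a tag
    challenger_comparison: dict  # metrics embedded at decision time, not linked
    approval_request: dict       # the payload the approver acted on
    approver_principal: str      # verified IdP subject, not a display name
    approver_auth_method: str    # e.g. "sso", "hardware_key", "step_up"
    explanation_hash: str        # hash of the explanation bytes reviewed
    dataset_snapshot_id: str     # immutable snapshot, resolved at write time


def idempotency_key(approval_request_id: str, principal: str) -> str:
    """Derive a stable key from the approval request ID plus the approver's principal."""
    raw = json.dumps([approval_request_id, principal]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class InMemoryEventStore:
    """Hypothetical stand-in for an append-only store; keeps one event per key."""

    def __init__(self):
        self._events = {}

    def append(self, key: str, event: PromotionEvent) -> bool:
        """Write the event once. A retry with the same key is a no-op."""
        if key in self._events:
            return False
        # Serialise at write time so later mutation of nested dicts cannot leak in.
        self._events[key] = json.dumps(asdict(event), sort_keys=True)
        return True

    def __len__(self):
        return len(self._events)


event = PromotionEvent(
    model_digest="sha256:example-model-digest",
    challenger_comparison={"incumbent": "sha256:example-incumbent", "auc_delta": 0.01},
    approval_request={"request_id": "req-001", "rationale": "scheduled retrain"},
    approver_principal="idp|user-123",
    approver_auth_method="hardware_key",
    explanation_hash="sha256:example-explanation-digest",
    dataset_snapshot_id="snapshot-000042",
)

store = InMemoryEventStore()
key = idempotency_key(event.approval_request["request_id"], event.approver_principal)
first_write = store.append(key, event)   # the original click
retry_write = store.append(key, event)   # a browser double-submit or network retry
```

The second append returns False and leaves a single record, so a retried signoff cannot produce two approval rows for the same act.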

The append-only store matters as much as the schema. If the place that holds your promotion events allows in-place updates, you have rebuilt the registry's mutability inside your evidence layer. Use a store whose semantics make rewriting visible: an object store with versioning and object lock, a ledger-style table, a WORM-configured bucket. Retention is a separate policy from your registry's. Models get deprecated. Decisions about models do not.
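To make "rewriting visible" concrete, here is a toy hash-chained log in the spirit of a ledger-style table. It is a sketch under stated assumptions, not a substitute for object lock or a WORM bucket: the HashChainedLog class is hypothetical, and a chain like this only exposes tampering if the latest hash is also anchored somewhere the writer cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


class HashChainedLog:
    """Each entry's hash covers its body and the previous entry's hash."""

    def __init__(self):
        self._entries = []

    def append(self, payload: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        body = json.dumps(payload, sort_keys=True)
        entry_hash = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self._entries.append({"prev": prev, "body": body, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any in-place edit breaks it from that entry on."""
        prev = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((prev + entry["body"]).encode("utf-8")).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


log = HashChainedLog()
log.append({"request_id": "req-001", "decision": "approved"})
log.append({"request_id": "req-002", "decision": "approved"})
intact = log.verify()

# Simulate someone editing an old decision in place.
log._entries[0]["body"] = json.dumps({"request_id": "req-001", "decision": "rejected"})
after_tamper = log.verify()
```

The in-place edit is not prevented, but it no longer goes unnoticed, which is the property the evidence layer needs.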

One opinionated aside, because the polite version of this gets ignored: if your current approval flow is a button on a registry UI that flips a status column from staging to production, you do not have governance evidence. You have a status column. The two have been mistaken for each other in more audits than I care to count.

Routing by Materiality Without Building a Committee Bottleneck

The objection writes itself. If every retrain has to produce a fully bound promotion event with a human approver and a challenger comparison, you have turned a daily drift-triggered refresh into a governance ceremony. Fair. That is a real cost, and ignoring it is how good controls die – teams route around them, and the workaround calcifies before anyone notices.

So let materiality decide the approval path, not whether the event exists.

Every promotion emits a promotion event. Always. The event is cheap to write. What varies is the approval path that feeds into it.

A retrain on the same features, the same population definition, the same decision threshold, triggered by a drift signal within pre-agreed bounds – that takes a lightweight path. The approver can be an automated policy with a named owner of record; the challenger comparison is the previous production version; the explanation artefact is the standard diff against the incumbent. The event still binds all six fields. No human sat in a meeting. But the record exists, and if it is ever questioned, the policy that authorised it is itself a versioned, signed artefact you can produce.

The path escalates when any of the following change: a new feature enters the model, the scoring population shifts (new geography, new product, new segment), the decision threshold or score-to-action mapping moves, or the training data window crosses a known regime break. Now you need a human approver, a written rationale in the approval request payload, and a second reviewer for anything touching protected attributes or adverse-action surfaces. Same event schema. Heavier approval path feeding it.

The trap to avoid: do not build a single committee that approves everything. You will create a queue, the queue will become the bottleneck, and somebody will invent a side channel to bypass it. Push routing into the platform. The platform decides which path a given promotion qualifies for, based on a diff between the candidate and the incumbent, and that routing decision is itself part of the event payload. If somebody later disputes that a retrain should have escalated, you can replay the diff.
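Platform-side routing can be sketched as a pure function over a candidate/incumbent diff. Everything here is an assumption for illustration: the ModelDescriptor fields, the path names, and the choice to treat any feature change (added or removed) as an escalation trigger. The drift-bounds check from the lightweight path is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelDescriptor:
    features: frozenset
    population: str                     # e.g. a population definition ID
    decision_threshold: float
    crosses_regime_break: bool          # training window spans a known regime break
    touches_protected_attributes: bool  # or adverse-action surfaces


def route_promotion(candidate: ModelDescriptor, incumbent: ModelDescriptor) -> dict:
    """Return the approval path plus the reasons, so the routing can be replayed."""
    reasons = []
    changed = candidate.features ^ incumbent.features  # added or removed features
    if changed:
        reasons.append("features_changed:" + ",".join(sorted(changed)))
    if candidate.population != incumbent.population:
        reasons.append("population_changed")
    if candidate.decision_threshold != incumbent.decision_threshold:
        reasons.append("threshold_changed")
    if candidate.crosses_regime_break:
        reasons.append("training_window_crosses_regime_break")

    escalated = bool(reasons)
    return {
        "path": "human_review" if escalated else "automated_policy",
        "reasons": reasons,
        "second_reviewer": escalated and candidate.touches_protected_attributes,
    }


incumbent = ModelDescriptor(frozenset({"income", "utilisation"}), "pop-a", 0.5, False, True)
plain_retrain = ModelDescriptor(frozenset({"income", "utilisation"}), "pop-a", 0.5, False, True)
new_feature = ModelDescriptor(
    frozenset({"income", "utilisation", "bureau_score"}), "pop-a", 0.5, False, True
)

light = route_promotion(plain_retrain, incumbent)
heavy = route_promotion(new_feature, incumbent)
```

The returned dict is what goes into the event payload. Because the function depends only on the two descriptors, re-running it against the stored descriptors replays the routing decision exactly.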

One uncomfortable caveat. This pattern is overkill for a team running three models, all internal, none of them touching customer decisions. For that team, a well-disciplined registry plus the habit of attaching PDFs to a ticket is probably enough, and adding an append-only event store is operational drag with no audit on the horizon. The pattern earns its weight when promotions are frequent, when models touch regulated decisions, or when the population of approvers exceeds the number of people who can keep the history in their heads. That is most banking, financial services and insurance platforms past their second year. It is not every team.

The other thing I will concede: building this correctly creates a new on-call surface. The promotion event writer is now on the critical path of every model going live. If it fails closed, you cannot promote. If it fails open, you have promotions without evidence, which is the exact failure this design exists to prevent. Pick fail-closed, accept that you now have a freeze window when the evidence store is unhealthy, and budget for the incident the first time it happens. It is cheaper to pay for that outage in advance, in the design conversation you have before go-live, than at 11pm during an audit week.
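Fail-closed comes down to ordering: write the evidence first, and touch the serving pointer only if that write succeeded. The sketch below assumes a hypothetical writer that raises EvidenceStoreUnavailable when the store is unhealthy. The promote function and both writers are illustrative, not part of any real registry API.

```python
class EvidenceStoreUnavailable(Exception):
    """Raised by the (hypothetical) event writer when the store cannot accept a write."""


def promote(model_digest, write_event, flip_serving_pointer):
    """Fail closed: no evidence, no promotion."""
    try:
        write_event(model_digest)
    except EvidenceStoreUnavailable:
        return "blocked"  # freeze window: the model does not go live
    flip_serving_pointer(model_digest)
    return "promoted"


served = []  # stands in for whatever the serving layer points at


def healthy_writer(digest):
    pass  # pretend the event was durably written


def unhealthy_writer(digest):
    raise EvidenceStoreUnavailable("evidence store is not accepting writes")


ok = promote("sha256:example-a", healthy_writer, served.append)
frozen = promote("sha256:example-b", unhealthy_writer, served.append)
```

Only the first digest reaches the serving list. The second promotion is refused rather than going live without a record.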

Before Your Next Promotion

Write down the six fields your promotion event will bind. Write down where that event will be stored, who owns the retention policy on it, and what the idempotency key is derived from. If that record does not exist independently of your registry, then your governance evidence is your registry – and the registry was not built to be evidence. It was built to be a catalogue. Those are different jobs. Audits are where you find out which one you actually have.