AI governance engineers won't route around
AI governance that ships as code: policy-as-code, model cards, audit trails, and the NIST AI RMF mapped to engineering artifacts your teams already produce.
Executive summary
AI governance is the system of policies, controls, and accountability that keeps an organization's AI use inside the bounds it has decided are acceptable. The only version of it that survives contact with delivery teams is the version expressed as engineering artifacts: policy enforced in pipelines, model cards versioned in repos, audit trails emitted as structured logs, review gates that behave like CI checks. Governance that lives in documents gets routed around, and shadow AI is what routing around looks like. This article maps each NIST AI RMF function (Govern, Map, Measure, Manage) to artifacts an engineering organization can build, so compliance evidence becomes a byproduct of operating instead of a quarterly archaeology project.
Engineers do not hate governance. They hate governance that arrives as a 40-page PDF, a quarterly attestation spreadsheet, and a review board that meets every three weeks to approve things it does not understand. The same engineers comply with branch protection, mandatory code review, and CI gates every working day without complaint, because those controls are legible, automated, and applied at the point of work.
That contrast is the entire thesis of this article.
AI governance succeeds exactly to the degree that it is expressed as engineering artifacts. Everything else gets routed around, and shadow AI, the team quietly wiring a public model API into a workflow nobody reviewed, is what routing around looks like. You will not find it in a policy attestation. You will find it in an incident, or if you are lucky, in a finance report.
Policy-as-code: the enforcement layer
A governance policy that says “only approved models may process confidential data” is a sentence. Sentences do not fire at request time. The same policy as code is a gateway rule:
    # ai-gateway policy fragment
    policies:
      - name: restrict-confidential-data
        match:
          data_classification: [confidential, regulated]
        allow_models:
          - anthropic/claude-*        # DPA in place, no training on inputs
          - internal/llama-3-onprem   # self-hosted, restricted enclave
        deny_action: block_and_log
      - name: pii-redaction
        match:
          data_classification: [pii]
        require: [redaction_filter]
The gateway pattern is what makes this enforceable at all: one choke point where every model call carries an application identity and a data classification, and policy gets evaluated in milliseconds instead of committee-weeks. Approved-model registries, per-team quotas, regional routing rules, all the same mechanism.
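To make the evaluation step concrete, here is a minimal sketch of how a gateway might apply rules shaped like the fragment above. It is an illustration, not any particular gateway's API: the `Request` type, the `evaluate` function, and the choice to allow a request when no rule matches its classification are all assumptions, and a real deployment might well default to deny.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# Hypothetical in-memory form of the policy fragment above; a real gateway
# would load this from the versioned policy repo.
POLICIES = [
    {
        "name": "restrict-confidential-data",
        "match": {"data_classification": ["confidential", "regulated"]},
        "allow_models": ["anthropic/claude-*", "internal/llama-3-onprem"],
        "deny_action": "block_and_log",
    },
    {
        "name": "pii-redaction",
        "match": {"data_classification": ["pii"]},
        "require": ["redaction_filter"],
    },
]


@dataclass(frozen=True)
class Request:
    application: str
    model: str
    data_classification: str
    filters: tuple = ()  # filters already applied before the model call


def evaluate(request, policies=POLICIES):
    """Return (allowed, decisions), with decisions in 'rule:outcome' form.

    Assumption: a request whose classification matches no rule is allowed.
    """
    allowed = True
    decisions = []
    for policy in policies:
        if request.data_classification not in policy["match"]["data_classification"]:
            continue  # rule does not apply to this classification
        ok = True
        if "allow_models" in policy:
            # Glob match, so 'anthropic/claude-*' covers every Claude variant.
            ok = any(fnmatchcase(request.model, pat) for pat in policy["allow_models"])
        if ok and "require" in policy:
            ok = all(name in request.filters for name in policy["require"])
        decisions.append(f"{policy['name']}:{'allow' if ok else 'deny'}")
        allowed = allowed and ok
    return allowed, decisions
```

The decision strings are the same shape an audit record can carry, so the enforcement path and the evidence path share one vocabulary.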
None of this is a new idea. It is the infrastructure-as-code operating model applied to AI. The policy lives in a repo. Changes go through pull requests. The diff is the change record. When an auditor asks what the policy was on March 3rd, the answer comes from git log, not from whoever has been there longest.
Two practical notes from doing this. Keep the rule set boring and small; a dozen rules that always fire beat a hundred that are half-implemented, and every exception you encode is a promise to maintain it. And version the policy separately from the gateway software, so a policy change is never entangled with a deployment. You want to be able to answer “what changed?” with one artifact, not two.
Model cards: documentation that versions
Model cards, introduced by Mitchell et al. in 2019, are short structured documents describing what a model is for, how it was built, and how it performs. For enterprise use I adapt the format to describe the system rather than just the model, and I enforce one rule without exception: the card lives in the repository, next to the code, updated in the same pull request as the change it describes.
A card that earns its keep contains:
- Intended use and explicit non-uses. “Summarizes claims correspondence for adjusters. Not approved for coverage decisions.” The second sentence is the one that matters in a deposition.
- Model and version, provider, and the contract terms that actually matter: training on inputs, retention, region.
- Data: what the system can read, its retrieval corpora, its classification ceiling.
- Eval results as a link to the current eval run, never a pasted number. A pasted number is stale the week after someone pastes it.
- Limitations observed in testing. This is the honest section, and it is the one an incident commander reads first.
- An owner. A name, not a team alias.
Cards stored in a governance portal die within two quarters; I have never seen an exception. Cards in the repo, required by a CI check whenever model configuration changes, stay alive for a simple mechanical reason: staleness blocks the build.
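A check like that can be very small. The sketch below assumes a hypothetical repo layout in which each system has its own top-level directory, model configuration lives in `model.yaml` or under `prompts/`, and the card is `MODEL_CARD.md`; none of those names come from a standard, so adjust them to your own conventions.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical conventions: which paths count as "model configuration"
# and what the card file is called.
MODEL_CONFIG_PATTERNS = ("*/model.yaml", "*/prompts/*")
CARD_NAME = "MODEL_CARD.md"


def stale_cards(changed_files):
    """Return system directories whose model config changed without a card update."""
    changed = set(changed_files)
    stale = set()
    for path in changed:
        if any(fnmatchcase(path, pat) for pat in MODEL_CONFIG_PATTERNS):
            system_dir = PurePosixPath(path).parts[0]
            if f"{system_dir}/{CARD_NAME}" not in changed:
                stale.add(system_dir)
    return sorted(stale)
```

In CI, the list of changed files would typically come from something like `git diff --name-only` against the target branch, and the job fails when `stale_cards` returns anything. The check only proves the card was touched, not that the edit was meaningful; the reviewer on the pull request still carries that part.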
Audit trails: logs, not testimony
When the auditor, the regulator, or the incident commander asks what the system did and why, the answer must come from structured logs, not from interviewing whoever was on call that week. Memory fades and people leave; logs do neither. The minimum viable AI audit record, emitted by the gateway on every request:
    {
      "request_id": "9f2c...",
      "timestamp": "2026-01-09T14:22:07Z",
      "application": "claims-summarizer",
      "caller_identity": "svc-claims-prod",
      "end_user": "adjuster-4417",
      "model": "claude-sonnet-4-5@2025-09-29",
      "prompt_template": "summarize-v14@sha256:ab3f...",
      "context_sources": ["kb://claims/2026/..."],
      "data_classification": "confidential",
      "policy_decisions": ["restrict-confidential-data:allow"],
      "human_action": {"approver": "adjuster-4417", "action": "accepted_edited"}
    }
Look at what makes this audit-grade. The prompt template hash and model version make the behavior reproducible. The context sources make the answer traceable. The human action field closes the accountability loop, which is the field most implementations forget and the field every hard question eventually lands on. Completions themselves go to a separate store with access controls matching the data classification; the audit index does not need to contain the sensitive text to prove what happened.
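As a rough sketch of how a gateway might assemble that record, the helper below hashes the prompt template text and leaves the completion out entirely. The function names and the choice to record `human_action` as `None` until the reviewer acts are assumptions made for illustration, not a prescribed schema.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def template_ref(name, template_text):
    """Name plus content hash, so the exact template text can be identified later."""
    digest = hashlib.sha256(template_text.encode("utf-8")).hexdigest()
    return f"{name}@sha256:{digest}"


def audit_record(*, application, caller_identity, end_user, model,
                 template_name, template_text, context_sources,
                 data_classification, policy_decisions, human_action=None):
    """Build one audit record. The completion text is deliberately not a field."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "application": application,
        "caller_identity": caller_identity,
        "end_user": end_user,
        "model": model,
        "prompt_template": template_ref(template_name, template_text),
        "context_sources": list(context_sources),
        "data_classification": data_classification,
        "policy_decisions": list(policy_decisions),
        # Assumption: None at request time, filled in when the human acts.
        "human_action": human_action,
    }


def to_log_line(record):
    """Serialize as one JSON line with stable key order for log shipping."""
    return json.dumps(record, sort_keys=True)
```

Hashing the template text rather than trusting a version label means a silently edited template shows up as a different hash, which is the property that makes the behavior reproducible.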
Retention and access deserve their own decision record. Prompt logs are a honeypot of precisely the data your policies exist to protect, which is the point the identity-first security mindset gets right: the log store is a crown jewel and needs crown-jewel access control. Governance tooling that leaks the data it governs is not a hypothetical. It is an architecture review finding I expect to keep making.
Mapping NIST AI RMF to artifacts
The NIST AI Risk Management Framework organizes AI risk work into four functions. It is voluntary, it sets vocabulary, and, usefully, it maps almost one-to-one onto artifacts an engineering organization can produce and automate. ISO/IEC 42001 covers similar ground as a certifiable management system; build the artifacts below and you are most of the way to either.
| RMF function | What NIST asks | Engineering artifact |
|---|---|---|
| Govern | Policies, roles, accountability, risk culture | Policy-as-code repo; approved-model registry; named owners in the system inventory; this quarter’s exceptions list |
| Map | Understand context, systems, and impacts | AI system inventory with data-flow diagram per system; model cards; data classification per pipeline |
| Measure | Assess and track risks and performance | Golden-set eval results in CI; production sampling scores; injection-test reports; drift dashboards |
| Manage | Prioritize, respond, recover | Incident thresholds and runbooks; HITL override telemetry; deprecation calendar; post-incident eval additions |
Two things I emphasize when walking teams through this table.
Map before Measure. Teams love building eval dashboards, because dashboards demo well, and they build them for systems that appear in no inventory. That measures the systems you know about and stays silent about the ones that will hurt you. The inventory is duller work and it comes first.
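One cheap way to keep Map honest is to compare the application identities the gateway actually sees against the inventory. The sketch below assumes inventory entries are dictionaries with an `application` key and that observed identities come from the gateway's audit records; both shapes are illustrative.

```python
def unmapped_systems(inventory, observed_applications):
    """Applications seen at the gateway that have no inventory entry."""
    known = {entry["application"] for entry in inventory}
    return sorted(set(observed_applications) - known)


def unowned_systems(inventory):
    """Inventory entries with no named owner recorded."""
    return sorted(e["application"] for e in inventory if not e.get("owner"))
```

This only catches traffic that passes through the gateway. A team calling a public model API directly will not appear here, which is why the inventory also needs inputs from outside the gateway.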
And Manage means rehearsed. An incident threshold nobody has tripped in a game day is a hypothesis, not a control. The discipline from infrastructure incident response transfers wholesale: declare on thresholds rather than vibes, assign a commander, write the timeline. The 2 AM question is never “what does the framework say.” It is “who rolls this back, and on whose authority.”
Making the review gate an engineer’s tool
Policy-as-code handles the known patterns. Humans still need to review the novel ones, and most organizations keep a review step for new AI use cases. The design goal is a gate engineers use voluntarily because it is faster than guessing, and there are four properties that get you there.
Intake is a pull request, not a form portal. The team adds a system entry, an inventory record plus a draft model card, to the governance repo. The review happens where the work happens.
The SLA is measured in days, with lazy consensus: silence past the SLA is approval for low-risk classifications. A review board that can block indefinitely by being busy will be bypassed, and deserves to be.
Depth is risk-tiered. An internal summarizer over public docs gets a rubber stamp by design. A system touching regulated data or external customers gets the full review. Publish the tiering rubric and most of the queue evaporates, because teams can predict their own tier.
Every rejection produces a rule. If the board turns something down, the reason becomes written policy or a gateway rule, so the next team finds out in seconds instead of at review. A board that rejects the same thing twice has failed at its actual job, which is converting judgment into infrastructure.
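The tiering and lazy-consensus properties are simple enough to write down as code, which is also how you publish the rubric. The sketch below is one possible encoding: the five-day SLA, the two tier names, and the set of classifications treated as high risk are placeholder assumptions, not recommendations.

```python
from datetime import date, timedelta
from enum import Enum


class Tier(Enum):
    LOW = "low"
    HIGH = "high"


REVIEW_SLA = timedelta(days=5)  # placeholder SLA, in calendar days
# Assumption: which data classifications force the full review.
HIGH_RISK_CLASSIFICATIONS = {"confidential", "regulated", "pii"}


def tier_for(data_classification, external_facing):
    """Published rubric: teams can compute their own tier before submitting."""
    if external_facing or data_classification in HIGH_RISK_CLASSIFICATIONS:
        return Tier.HIGH
    return Tier.LOW


def review_status(tier, submitted, today, decision=None):
    """Explicit decisions win; otherwise silence past the SLA approves low risk."""
    if decision is not None:
        return decision
    if tier is Tier.LOW and today - submitted > REVIEW_SLA:
        return "approved"  # lazy consensus
    return "pending"
```

High-tier requests never auto-approve here, so a busy board can still stall them; that is a deliberate trade, and the SLA for that tier has to be enforced some other way, for example by escalation.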
What to write down first
Four artifacts, in this order: the system inventory, because you cannot govern what you have not mapped; the approved-model registry with data-classification ceilings; a model card template enforced by CI for new systems; and the audit log schema at the gateway. That set converts Level 1 of the integrity maturity model into something an auditor can hold in their hands.
Governance earns trust the same way infrastructure does: by being boring, automated, and right. The best governance program, like the best network, is one nobody thinks about because it simply works at the point of use. If your AI governance produces mostly meetings, it is not governing anything. It is narrating, while engineering routes around it.
Frequently asked questions
- What is AI governance in engineering terms?
- It is the set of enforceable controls that determine which models can be used, with which data, for which purposes, with what testing, and with what accountability trail. Concretely: an approved-model registry, data-classification rules enforced at a gateway, eval gates in CI, model cards in version control, structured audit logs. Documents describe the policy. Pipelines enforce it. The difference between those two verbs is the whole subject.
- Is the NIST AI RMF mandatory?
- No. It is a voluntary framework, not a regulation. It matters anyway, because it has become the common vocabulary that US enterprises, auditors, and increasingly regulators use to discuss AI risk, and because its four functions map cleanly onto engineering practice. Treat it as a checklist and you will miss the point. Treat it as a structure for controls you already need and it earns its keep.
- What should a model card actually contain?
- Intended use and explicit non-uses, the model version and provider, training or fine-tuning data described at the level you can honestly describe it, current eval results against your golden sets, the data classifications the system is approved for, and a named owner. Keep it in the repo next to the system it describes, update it in the same pull request as the change, and keep it short enough that people actually read it.
- Who should own AI governance?
- Split the pen from the enforcement. Policy ownership belongs to a cross-functional group spanning legal, security, and engineering leadership. Control ownership must sit with engineering, because the controls live in pipelines and gateways. The recurring failure mode is a governance committee that owns everything and can enforce nothing. Give the committee the policy pen and give engineering the enforcement budget, and be explicit about which is which.