Whitepaper

An AI Integrity Reference Model for the Enterprise

A four-layer reference model for AI integrity — data, model behavior, interaction security, accountability — with control objectives and NIST AI RMF mapping.

By Pavel Glukhikh, May 19, 2026

Abstract

Enterprises deploying AI systems lack a shared structure for answering a basic question: on what grounds do we trust this system's outputs? Security frameworks address confidentiality and availability; ML evaluation addresses accuracy. Neither addresses integrity: whether the system's outputs are grounded in legitimate data, consistent with intended behavior, resistant to manipulation, and attributable to accountable parties. This paper proposes a four-layer AI Integrity Reference Model: data integrity, model behavior integrity, interaction security, and accountability. Each layer carries explicit control objectives suitable for audit. A five-stage maturity model (Stage 0 through Stage 4) describes the progression from ad hoc practice to measured, continuously verified integrity. Finally, the model is mapped to the four functions of the NIST AI Risk Management Framework, so organizations can implement integrity controls inside governance structures they already operate.

The problem this model addresses

Every enterprise I work with is deploying AI systems faster than it can answer one question about them: on what grounds do we trust the output? Security teams reach for the CIA triad and find that confidentiality and availability controls transfer reasonably well. “Integrity,” as defined for databases and file systems, says nothing useful about a model that can be subtly wrong, confidently ungrounded, or quietly manipulated through its inputs. ML teams reach for accuracy metrics and find that those metrics measure only the happy path: performance on representative inputs, not behavior under manipulated ones.

The gap is architectural.

We do not lack point solutions — guardrail products, eval harnesses, red-team services exist in abundance. We lack a shared structure that says what must be true, at which layer, verified by whom. This paper proposes that structure. I have been developing it across enterprise accounts and my own AI integrity engineering work, and it is deliberately modest: four layers, explicit control objectives, a maturity path, and a mapping to the NIST AI RMF so nobody has to invent a parallel governance program. Modesty is a design choice here. Frameworks fail by trying to answer every question; this one tries to make four questions answerable.

Definition and scope

AI integrity is the property that an AI system’s outputs are (1) grounded in legitimate, verifiable data, (2) consistent with the system’s intended and declared behavior, (3) resistant to manipulation through inputs, context, or supply chain, and (4) attributable — traceable to accountable human and organizational owners.

The model covers predictive and generative systems, built or bought. It does not cover model development ethics, workforce impact, or societal questions; those matter, but they belong to governance functions this model plugs into rather than replaces. Scope discipline is what keeps a reference model usable — the moment it claims everything, it verifies nothing.

The four layers

The layers stack from the data a system consumes to the humans who answer for it. Each layer assumes the layers below it; interaction security is meaningless if training data provenance is unknown.

+---------------------------------------------+
| Layer 4: Accountability                     |
|  provenance, audit, human ownership         |
+---------------------------------------------+
| Layer 3: Interaction Security               |
|  input/output boundaries, injection, misuse |
+---------------------------------------------+
| Layer 2: Model Behavior Integrity           |
|  intended behavior, drift, evaluation       |
+---------------------------------------------+
| Layer 1: Data Integrity                     |
|  sources, pipelines, retrieval corpora      |
+---------------------------------------------+

Layer 1 — Data integrity

The claim this layer supports: the data the system learned from and reasons over is what we believe it to be.

Control objectives:

DI-1: Source authenticity. Every training and retrieval source has a recorded origin, owner, and acquisition method. Unknown-origin data is quarantined, not silently ingested.
DI-2: Pipeline tamper evidence. Data pipelines produce verifiable artifacts (checksums, signed manifests) at each transformation stage, so a poisoned or corrupted batch can be detected and traced.
DI-3: Retrieval corpus hygiene. For RAG and grounded systems, the document corpus has access control, versioning, and a review gate. A retrieval corpus that anyone can write to is a prompt-injection delivery system with extra steps.
DI-4: Data currency. Staleness bounds are defined per corpus, monitored, and surfaced to consumers — a model answering from an expired policy document is an integrity failure even though nothing was attacked.
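Here is a minimal Python sketch of DI-1 and DI-2 at ingestion time. It assumes a simple in-memory registry; `REGISTRY`, `admit`, and `StageManifest` are hypothetical names, not part of any product. Each stage records SHA-256 digests of what it consumed and what it produced. A mismatched artifact therefore points at the stage where the chain broke, instead of only signaling that something is wrong somewhere. DI-2 calls for signed manifests, but this sketch only hashes. In practice the manifest itself would need a signature, or storage the pipeline cannot rewrite.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# DI-1 (hypothetical registry shape): source_id -> recorded origin, owner, acquisition method.
REGISTRY = {
    "hr-policies": {"origin": "sharepoint://hr/policies", "owner": "hr-ops", "method": "export"},
}


def admit(source_id: str, quarantine: List[str]) -> bool:
    """DI-1 gate: unknown-origin data is quarantined rather than silently ingested."""
    if source_id in REGISTRY:
        return True
    quarantine.append(source_id)
    return False


@dataclass
class StageManifest:
    """DI-2: per-stage input/output digests so corruption can be located, not just detected."""
    source_id: str
    stages: List[dict] = field(default_factory=list)

    def record(self, stage: str, data_in: bytes, data_out: bytes) -> None:
        self.stages.append({"stage": stage, "in": sha256_hex(data_in), "out": sha256_hex(data_out)})

    def first_break(self, artifacts: List[bytes]) -> Optional[str]:
        """artifacts[i] is the stored output of stage i; return the first failing stage, or None."""
        if len(artifacts) != len(self.stages):
            return "<artifact count mismatch>"
        prev_out = None
        for entry, artifact in zip(self.stages, artifacts):
            # The stage must have consumed exactly what its predecessor produced.
            if prev_out is not None and entry["in"] != prev_out:
                return entry["stage"]
            # The stored artifact must still match the digest recorded when it was produced.
            if sha256_hex(artifact) != entry["out"]:
                return entry["stage"]
            prev_out = entry["out"]
        return None
```

If the stored output of a later stage is swapped for a poisoned batch, `first_break` names that stage. Unregistered sources never reach the manifest at all.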

Layer 2 — Model behavior integrity

The claim: the model behaves as intended and declared, and we would notice if that changed.

Control objectives:

MB-1: Declared behavior. Intended behavior, prohibited behavior, and known limitations are written down per system — the behavioral equivalent of a network baseline. You cannot detect deviation from an undeclared norm.
MB-2: Pre-deployment evaluation. Systems pass a defined evaluation suite — task performance plus integrity-specific tests (grounding faithfulness, refusal consistency, instruction adherence) — before production exposure. My research on measuring AI integrity covers candidate metrics.
MB-3: Change-triggered re-evaluation. Model version changes, fine-tunes, system-prompt changes, and vendor updates all re-trigger the evaluation suite. Vendor model updates are the most commonly skipped trigger.
MB-4: Drift monitoring. Production output distributions are monitored against the evaluation baseline, with defined thresholds that page a human.
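MB-3 and MB-4 can be made mechanical. In this Python sketch, every input that can change behavior is folded into a configuration fingerprint: model version, system prompt, vendor release, and fine-tuned adapter. Any change to that fingerprint re-triggers the evaluation suite. Drift is scored with the population stability index (PSI) over categorical output labels such as answered, refused, or escalated. The choice of PSI, the label set, and the 0.2 paging threshold are illustrative assumptions, not part of the model.

```python
import hashlib
import json
import math
from typing import Dict, Optional, Tuple

# Assumed paging threshold; 0.2 is a commonly quoted PSI rule of thumb, not a standard.
PAGE_THRESHOLD = 0.2


def config_fingerprint(model_version: str, system_prompt: str,
                       vendor_release: Optional[str] = None,
                       adapter: Optional[str] = None) -> str:
    """MB-3: fold every behavior-changing input into one digest."""
    payload = json.dumps(
        {
            "model_version": model_version,
            "system_prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
            "vendor_release": vendor_release,
            "adapter": adapter,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def needs_reevaluation(current: str, last_evaluated: str) -> bool:
    """Any fingerprint change, including a vendor update, re-triggers the evaluation suite."""
    return current != last_evaluated


def psi(baseline: Dict[str, float], production: Dict[str, float], eps: float = 1e-6) -> float:
    """Population stability index between two categorical distributions given as proportions."""
    total = 0.0
    for label in set(baseline) | set(production):
        b = max(baseline.get(label, 0.0), eps)  # floor avoids log(0) for labels absent on one side
        p = max(production.get(label, 0.0), eps)
        total += (p - b) * math.log(p / b)
    return total


def check_drift(baseline: Dict[str, float], production: Dict[str, float]) -> Tuple[float, bool]:
    """MB-4: return the drift score and whether it crosses the paging threshold."""
    score = psi(baseline, production)
    return score, score >= PAGE_THRESHOLD
```

Take a baseline of 80/15/5 percent answered/refused/escalated and a production mix of 60/35/5. The score is roughly 0.23, which is above the assumed threshold, so a human would be paged.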

Layer 3 — Interaction security

The claim: the system’s inputs and outputs cross a controlled, monitored boundary that resists manipulation.

Control objectives:

IS-1: Input boundary control. Content from outside the trust boundary (user prompts, retrieved documents, tool results) is treated as untrusted at every point where it enters model context. Prompt injection is an input-validation failure class, and the OWASP Top 10 for LLM Applications is the working checklist here.
IS-2: Output mediation. Model output is not executed, rendered, or forwarded to downstream systems without policy checks appropriate to the blast radius. An LLM with unmediated tool access has the effective privileges of its tools.
IS-3: Privilege minimization. Agents and tool-using systems hold the minimum credentials for their function, scoped and expiring — identity discipline applied to non-human actors, the same posture argued in identity-first security.
IS-4: Abuse monitoring. Interaction logs are analyzed for manipulation patterns (injection attempts, systematic probing, jailbreak families) with the same seriousness as authentication logs.
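This sketch shows IS-2 and IS-3 together. It assumes a mediator sits between the model's proposed tool calls and their execution. `Grant`, `ToolCall`, `POLICIES`, and the two example tools are hypothetical. The sketch illustrates three points:

- Grants are scoped to named tools.
- Grants expire.
- A tool with no policy is denied by default rather than passed through.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Grant:
    """IS-3: a scoped, expiring credential for one agent (hypothetical shape)."""
    agent_id: str
    tools: FrozenSet[str]
    expires_at: datetime


@dataclass(frozen=True)
class ToolCall:
    agent_id: str
    tool: str
    args: dict


# IS-2: per-tool argument policies sized to blast radius (hypothetical examples).
POLICIES: Dict[str, Callable[[dict], bool]] = {
    "search_docs": lambda args: len(str(args.get("query", ""))) <= 500,
    "send_email": lambda args: str(args.get("to", "")).endswith("@example.com"),
}


def mediate(call: ToolCall, grant: Grant, now: datetime) -> Tuple[bool, str]:
    """Return (allowed, reason); the model's proposed call is only executed if allowed."""
    if call.agent_id != grant.agent_id:
        return False, "grant belongs to another agent"
    if now >= grant.expires_at:
        return False, "grant expired"
    if call.tool not in grant.tools:
        return False, "tool outside grant scope"
    policy = POLICIES.get(call.tool)
    if policy is None:
        return False, "no policy for tool"  # default deny
    if not policy(call.args):
        return False, "arguments violate policy"
    return True, "ok"
```

The ordering is deliberate. Credential checks run before argument checks, so an expired or out-of-scope grant is rejected before any model-supplied argument is inspected.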

Layer 4 — Accountability

The claim: for any output, we can reconstruct what produced it and name who answers for it.

Control objectives:

AC-1: Decision provenance. For consequential outputs, the system records model version, prompt/context composition, retrieved sources, and tool calls — enough to replay the decision.
AC-2: Named ownership. Every production AI system has a named human owner accountable for its behavior, and that owner has actual authority to suspend it. Ownership without a kill switch is ceremony.
AC-3: Audit readiness. Logs from AC-1 are retained, protected from tampering, and queryable within a defined SLA.
AC-4: Disclosure honesty. Where outputs reach customers or regulators, the system’s AI nature and material limitations are disclosed accurately.
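Here is a minimal Python sketch of AC-1 and AC-3, assuming a single-process, in-memory log. `DecisionRecord` and `AuditLog` are hypothetical names. Each record captures the replay inputs that AC-1 lists. Each row's hash covers the previous row's hash, so editing or deleting an earlier record breaks verification. A hash chain only proves tampering if its latest hash is also anchored somewhere the writer cannot modify; that part is out of scope here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List

GENESIS = "0" * 64


@dataclass
class DecisionRecord:
    """AC-1: enough context to replay a consequential output."""
    model_version: str
    system_prompt_sha256: str
    user_input: str
    retrieved_sources: List[str]
    tool_calls: List[dict]
    output: str


class AuditLog:
    """AC-3 sketch: append-only, hash-chained rows so edits to history are detectable."""

    def __init__(self) -> None:
        self._rows: List[dict] = []

    def append(self, record: DecisionRecord) -> str:
        prev = self._rows[-1]["hash"] if self._rows else GENESIS
        body = json.dumps(asdict(record), sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self._rows.append({"prev": prev, "body": body, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited, reordered, or removed row fails."""
        prev = GENESIS
        for row in self._rows:
            expected = hashlib.sha256((prev + row["body"]).encode()).hexdigest()
            if row["prev"] != prev or row["hash"] != expected:
                return False
            prev = row["hash"]
        return True

    def find(self, **criteria) -> List[dict]:
        """Query support: return records whose fields equal all given criteria."""
        matches = []
        for row in self._rows:
            rec = json.loads(row["body"])
            if all(rec.get(k) == v for k, v in criteria.items()):
                matches.append(rec)
        return matches
```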

Maturity stages

Maturity is assessed per layer, not per organization — most enterprises I assess are Stage 2 on accountability and Stage 0–1 on data integrity.

Stage	Name	Characteristics
0	Ad hoc	Controls exist only where an individual engineer cared. No inventory of AI systems.
1	Defined	Control objectives adopted, AI system inventory exists, owners named. Verification is manual and sporadic.
2	Enforced	Controls are gates: systems cannot reach production without evaluation, logging, and ownership. Exceptions are recorded.
3	Measured	Integrity is quantified — drift metrics, injection detection rates, provenance coverage — with trends reviewed by governance.
4	Continuous	Verification is automated and continuous; integrity regressions block deploys the way failing tests do.

The honest advice: get every production system to Stage 1 before pushing any system to Stage 3. An inventory with named owners beats a beautifully instrumented pilot surrounded by shadow AI.
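Because maturity is assessed per layer, the sequencing advice above can be checked mechanically against an inventory. This sketch makes two interpretive assumptions. First, "at Stage 1" means every layer of a system is at Stage 1 or higher. Second, a layer missing from the inventory counts as Stage 0. The system names are made up.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class Stage(IntEnum):
    AD_HOC = 0
    DEFINED = 1
    ENFORCED = 2
    MEASURED = 3
    CONTINUOUS = 4


LAYERS = ("data", "behavior", "interaction", "accountability")


def sequencing_ok(inventory: Dict[str, Dict[str, Stage]]) -> Tuple[bool, List[str]]:
    """Return (ok, lagging systems). Not ok when any layer anywhere is at Stage 3 or above
    while some system still has a layer below Stage 1."""
    lagging = sorted(
        name
        for name, stages in inventory.items()
        # Assumption: a layer absent from the inventory counts as Stage 0.
        if any(stages.get(layer, Stage.AD_HOC) < Stage.DEFINED for layer in LAYERS)
    )
    any_advanced = any(
        stage >= Stage.MEASURED for stages in inventory.values() for stage in stages.values()
    )
    return not (lagging and any_advanced), lagging
```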

Mapping to the NIST AI RMF

The NIST AI RMF organizes risk work into four functions — Govern, Map, Measure, Manage — and deliberately avoids prescribing technical controls. This model supplies that missing layer, and the mapping is clean:

RMF function	Reference model coverage
Govern	AC-2, AC-4 (ownership, disclosure); maturity model as governance instrument; MB-1 as documented intent
Map	System inventory (Stage 1 prerequisite); DI-1 source mapping; IS-3 privilege mapping for agents
Measure	MB-2/MB-3/MB-4 evaluation and drift; IS-4 abuse metrics; Stage 3 measurement criteria
Manage	IS-1/IS-2 boundary controls; DI-2/DI-3 pipeline and corpus controls; AC-1/AC-3 incident reconstruction
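The mapping table can also be kept as data, which makes coverage checkable rather than asserted. This sketch transcribes the control objectives from the mapping table into Python dictionaries and reports any control objective that no RMF function claims. The dictionary and function names are hypothetical.

```python
from typing import Dict, List

CONTROLS: Dict[str, List[str]] = {
    "data": ["DI-1", "DI-2", "DI-3", "DI-4"],
    "behavior": ["MB-1", "MB-2", "MB-3", "MB-4"],
    "interaction": ["IS-1", "IS-2", "IS-3", "IS-4"],
    "accountability": ["AC-1", "AC-2", "AC-3", "AC-4"],
}

# Control objectives transcribed from the RMF mapping table (non-control entries omitted).
RMF_MAPPING: Dict[str, List[str]] = {
    "Govern": ["AC-2", "AC-4", "MB-1"],
    "Map": ["DI-1", "IS-3"],
    "Measure": ["MB-2", "MB-3", "MB-4", "IS-4"],
    "Manage": ["IS-1", "IS-2", "DI-2", "DI-3", "AC-1", "AC-3"],
}


def unmapped_controls(controls: Dict[str, List[str]], mapping: Dict[str, List[str]]) -> List[str]:
    """Control objectives that no RMF function lists."""
    mapped = {c for ids in mapping.values() for c in ids}
    return sorted(c for ids in controls.values() for c in ids if c not in mapped)


def functions_for(control: str, mapping: Dict[str, List[str]]) -> List[str]:
    """RMF functions that list a given control objective."""
    return [fn for fn, ids in mapping.items() if control in ids]
```

Run against the table as written, the check reports one control objective with no assigned RMF function: DI-4 (data currency). An adopting organization would need to decide where its staleness monitoring reports.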

For generative systems specifically, NIST AI 600-1 (the Generative AI Profile) enumerates risks that land almost one-to-one on Layers 2 and 3. These include confabulation, information integrity, and information security, the category under which it discusses prompt injection. An organization already running an RMF-shaped program can adopt this model as its control catalog without new committee structures. That is precisely the point: the governance machinery engineers will actually tolerate is the machinery that reuses what exists.

Implementation notes

Three lessons from applying this in real environments:

Start with the inventory and Layer 4. Provenance logging and named ownership cost little and make every other layer measurable. If you cannot reconstruct what a model saw, you cannot investigate anything.
Treat vendor systems as boundary problems. You will not instrument a SaaS vendor’s model internals. Move the controls to your side of the boundary — log inputs and outputs yourself, run your own acceptance evals, and put provenance and update-notification requirements into contracts.
Resist the platform-first instinct. Buying an “AI security platform” before declaring intended behavior (MB-1) repeats the oldest mistake in security: tooling in search of a policy.
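Here is the vendor-boundary lesson as a Python sketch. It assumes the vendor is reachable through a plain prompt-to-text function; `call_vendor` stands in for whatever SDK call you actually use. The wrapper logs inputs and outputs on your side of the boundary. The same organization-owned acceptance cases are run before and after a vendor model update. The 0.02 tolerance and the case format are illustrative assumptions.

```python
import hashlib
import time
from typing import Callable, List, Tuple

Case = Tuple[str, Callable[[str], bool]]


def logged_boundary(call_vendor: Callable[[str], str], log: List[dict],
                    vendor_model: str) -> Callable[[str], str]:
    """Wrap a vendor call so every input and output is logged on our side of the boundary."""
    def wrapped(prompt: str) -> str:
        output = call_vendor(prompt)
        log.append({
            "ts": time.time(),
            "vendor_model": vendor_model,  # as reported by the vendor or pinned in our config
            "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
            "prompt": prompt,
            "output": output,
        })
        return output
    return wrapped


def acceptance_pass_rate(call: Callable[[str], str], cases: List[Case]) -> float:
    """cases: (prompt, predicate) pairs the organization owns and runs itself."""
    passed = sum(1 for prompt, check in cases if check(call(prompt)))
    return passed / len(cases)


def accept_update(before: float, after: float, tolerance: float = 0.02) -> bool:
    """Accept a vendor update only if the pass rate did not drop by more than `tolerance`."""
    return after >= before - tolerance
```

Because the cases and the log both live on your side, they survive a vendor silently swapping the model behind an unchanged API.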

The model will evolve — the measurement layer in particular is an open research area — but the structure has held up: data, behavior, interaction, accountability. None of it is novel as engineering. It is the same discipline we already apply to networks and databases — declare intended behavior, control the boundaries, log enough to reconstruct events, name an owner — applied to a technology the industry keeps insisting is too new for discipline.

If you can make four true claims, one per layer, about a production AI system, you have engineering grounds for trust. If you cannot, you have a demo in production.

Frequently asked questions

What is AI integrity and how is it different from AI safety?

AI integrity is the property that a system's outputs are grounded in legitimate data, consistent with its intended behavior, resistant to manipulation, and attributable to accountable parties. Safety asks whether the system can cause harm; integrity asks whether you have engineering grounds to trust what it produces. A system can be safe and still confidently wrong.

Do I need this model if I already follow the NIST AI RMF?

The NIST AI RMF tells you which functions a risk program needs (govern, map, measure, manage) but stays deliberately abstract about technical controls. This reference model supplies the layer of concrete, auditable control objectives underneath the RMF. The "Mapping to the NIST AI RMF" section maps every layer to RMF functions so the two compose rather than compete.

Which layer should an enterprise implement first?

Accountability. Logging, provenance capture, and human ownership assignments are cheap relative to the other layers and are prerequisites for measuring anything else. If an incident occurred today, you would need to reconstruct what the model saw and produced. Start there, then work down into data integrity.

Does the model apply to vendor AI products or only in-house systems?

Both, but the control implementation shifts. For vendor systems you cannot instrument model internals, so data integrity and accountability controls move to the contract and the integration boundary. That means input/output logging on your side, provenance requirements in procurement language, and behavioral acceptance tests you run yourself before and after vendor model updates.
