Skip to content
PAVEL GLUKHIKH
Menu

Research

Measuring AI Integrity: Toward Useful Metrics

An active research program on measuring AI integrity: why accuracy metrics miss it, dimensions like grounding faithfulness, and harness design.

5 min read

Executive summary

AI integrity measurement is the attempt to quantify whether an AI system's outputs deserve trust — not merely whether they are accurate on a benchmark. This research write-up documents an active program: the argument for why accuracy metrics structurally miss integrity failures, three candidate measurement dimensions (grounding faithfulness, instruction adherence under adversarial pressure, and provenance completeness), the design of an evaluation harness for running them against production-shaped systems, early observations, and the open questions I have not resolved. It is a progress report, not a results paper.

The question this program is asking

Enterprises ask me a version of the same question about every AI system they deploy: how do we know we can trust it? The honest current answer is that we mostly do not know — we know how it scores on benchmarks, which is a different thing. This research program is my attempt to close some of that gap: to find a small set of integrity metrics that are measurable in an automated harness, meaningful to a risk owner, and stable enough to compare across model versions.

This is a progress report on active work, written the way I would want to read one: method and early observations, clearly separated from results I do not yet have. I am deliberately not publishing numbers, because the numbers I have so far are from harness shakedown runs and would imply more precision than they possess.

Why accuracy metrics structurally miss integrity

The case is not that accuracy is useless — it is that accuracy is measured in exactly the conditions where integrity failures do not occur.

A benchmark scores a model against an answer key under benign inputs.

Integrity failures live elsewhere:

  • The confident blank. Retrieval returns nothing useful and the model answers fluently anyway, from parametric memory or thin air. Benchmark accuracy never sees this, because benchmarks supply the context.
  • The obedient victim. A retrieved document contains embedded instructions and the model follows them. This is a security failure (OWASP’s LLM01) that no accuracy metric touches.
  • The decorative citation. The answer is correct-ish, the cited source exists, and the source does not support the claim. Users check citations exist; almost nobody checks entailment.
  • The version drift. A vendor model update shifts behavior on your workload while headline benchmarks improve. Aggregate leaderboards, HELM included, are built to compare models in general — not to tell you whether your system’s trust properties held through Tuesday’s update.

The common structure: accuracy measures the output against truth; integrity measures the output against the system’s evidence, instructions, and declared behavior. Those are different denominators, and the second one is what an enterprise actually relies on. This is the measurement layer that my AI Integrity Reference Model assumes exists — this program is the attempt to make that assumption true.

Candidate dimensions

I am currently working three dimensions, chosen because each maps to a real production failure I have seen and each looks automatable.

1. Grounding faithfulness

Is the output supported by the evidence the system retrieved?

Method under test: decompose each response into atomic claims, then check each claim for support against the retrieved context — currently using an LLM judge with a strict entailment rubric, spot-audited by hand. The metric is the supported-claim fraction, with a separate score for abstention correctness: when the context contains no answer, did the system say so?

The abstention half matters more than I expected. Systems tuned for helpfulness resist abstaining, and the failure is asymmetric — a wrong answer delivered confidently costs more than a declined one. Any production RAG system should be measured on both halves.

2. Instruction adherence under adversarial pressure

Does the system keep following its operator’s instructions when the inputs push back?

Method under test: take a system’s declared behavior rules (tone, scope, prohibited actions, tool policies) and measure adherence twice — once on benign traffic, once on inputs carrying graded adversarial pressure, from polite social engineering up through structured injection payloads embedded in retrieved documents. The interesting number is not either score; it is the degradation curve between them. Two systems with equal benign adherence can behave very differently at pressure, and the delta is the integrity property.

3. Provenance completeness

For a given output, what fraction of the decision can be reconstructed?

This one is a property of the system, not the model: given an output, can you recover the model version, full context composition, retrieval sources and their versions, and tool calls? The metric is a completeness score against that checklist, measured by actually attempting reconstructions of sampled production outputs. It is the least glamorous dimension and the one enterprises fail hardest, and it gates the other two — you cannot debug a faithfulness failure you cannot replay.

Harness design

The harness constraints, in priority order: it must test systems (prompt + retrieval + model + tools), not bare models; it must rerun cheaply on every change, because the point is regression detection, in the same spirit as production evaluation practice; and its judgments must be auditable, because an unexplainable evaluator scores no points in a governance conversation.

Current shape:

scenario sets (per dimension, versioned in git)
        |
        v
  system under test  (real RAG stack / agent, staging corpus)
        |
        v
  judges: rubric-scored LLM judge + deterministic checks
        |            (claim decomposition, injection canaries,
        |             provenance checklist)
        v
  scored run -> trend store -> regression gates

Two design decisions worth recording. First, every LLM-judge verdict carries its rubric and a fixed judge model version, and a rotating sample gets human re-scoring; judge drift is otherwise indistinguishable from system drift. Second, injection scenarios use canary tokens — instructions to emit a detectable marker — so adversarial adherence has a deterministic scoring path that does not depend on a judge’s opinion.

Early observations (held loosely)

  • Claim decomposition is the load-bearing step in faithfulness scoring, and it is fragile: judges disagree most about what constitutes one claim, not about whether a claim is supported.
  • Abstention behavior is highly sensitive to small system-prompt changes — more sensitive than adherence or faithfulness. It may be the cheapest early-warning signal in the whole program.
  • Provenance completeness in the systems I have assessed informally is low enough that dimensions one and two are often unmeasurable in retrospect. Logging is the prerequisite research finding, boring as it is.

Open questions

Honest list of what I have not solved:

  1. Judge circularity. Using an LLM to judge LLM integrity has an obvious regress. Deterministic checks and human audit bound it; they do not eliminate it.
  2. Scenario representativeness. A harness measures what its scenario sets contain. How much adversarial coverage is enough to claim anything about the open world? I do not know yet.
  3. Aggregation. Risk owners want one number; one number would be a lie. The current compromise — three dimension scores with trend arrows — has not yet survived contact with an actual governance committee.
  4. Cross-system transfer. Whether a threshold that means “healthy” on one RAG stack means anything on another is the question that decides whether this becomes a useful yardstick or stays a local tool.

The pattern underneath all of this is familiar from every other discipline I have worked in: you cannot manage what you cannot measure, but measuring the wrong thing precisely is worse than measuring the right thing roughly. Accuracy is precise and wrong for this job. The bet this program is making is that evidence, adherence, and provenance are the right things — and that rough measurements of them, honestly labeled, beat another decimal place on a benchmark.

If you are running similar evaluations in production and willing to compare notes — especially on abstention measurement — the contact page works. This write-up will be revised as the harness matures.

Frequently asked questions

Why isn't accuracy enough to measure AI trustworthiness?
Accuracy scores a model against a benchmark's answer key under benign conditions. Integrity failures happen off the answer key: a model that answers fluently from nothing when its retrieval fails, follows injected instructions embedded in a document, or cites sources that do not support its claim can all score well on accuracy while being untrustworthy in production.
What is grounding faithfulness?
Grounding faithfulness measures whether a system's output is actually supported by the sources it retrieved and cited — not whether the output is true in general. A faithful system says what its evidence supports and declines when evidence is absent. It is measured by decomposing outputs into claims and checking each claim against the retrieved context.
Is this research producing a product or a standard?
Neither, at this stage. It is a practitioner research program: building an evaluation harness, testing candidate metrics against real RAG and agent systems, and publishing what holds up and what does not. The intended output is a small set of metrics that transfer across systems, feeding the control objectives in my AI Integrity Reference Model whitepaper.

References

Related reading