PAVEL GLUKHIKH

RAG architecture for the enterprise

Enterprise RAG architecture end to end: pipeline design, chunking and index tradeoffs, permission-aware retrieval, and measuring grounding faithfulness.

Executive summary

RAG architecture is the design of the pipeline that retrieves relevant, permitted knowledge at request time and grounds a language model's answer in it. In an enterprise, the hard parts are not the vector database. They are ingestion that keeps the index faithful to changing sources, chunking that preserves meaning, retrieval that enforces the caller's permissions, and measurement that proves answers actually came from the evidence. This article walks the pipeline end to end (ingestion, chunking, indexing, retrieval, generation) and covers the tradeoffs at each stage, the permission model that keeps RAG from becoming a data-leak amplifier, and the faithfulness metrics that tell you whether grounding is real or merely claimed.

RAG is the pattern enterprises actually deploy because the alternative (teaching a model your private, constantly changing knowledge through training) is slow, expensive, and ungovernable. Retrieval keeps knowledge where it belongs, in documents you can update, cite, and revoke, and brings it to the model per request.

The vendor pitch makes it sound like a weekend project. Embed documents, stand up a vector database, done. The production reality is a data pipeline with a search engine in the middle and a security boundary most teams forget to draw. Having built and reviewed several of these, I can report that the vector database is the least of your problems.

The pipeline is where the engineering lives.

The pipeline, end to end

 Sources          Ingestion              Index            Query time
┌─────────┐   ┌──────────────────┐   ┌───────────┐   ┌─────────────────┐
│ Wiki    │──▶│ extract → clean  │──▶│ vector    │◀──│ rewrite → embed │
│ SharePt │   │ → chunk → embed  │   │ + keyword │   │ → filter (ACL!) │
│ Tickets │   │ + ACL metadata   │   │ + metadata│   │ → rerank → k    │
│ DBs     │   │ (delta sync)     │   └───────────┘   └───────┬─────────┘
└─────────┘   └──────────────────┘                           ▼
                                                   ┌──────────────────┐
                                                   │ LLM: answer with │
                                                   │ citations only   │
                                                   │ from context     │
                                                   └──────────────────┘

Each stage has a job and a characteristic failure. Ingestion that misses deletions serves answers from documents legal already retracted. Chunking that splits mid-table produces retrievable nonsense. Retrieval without ACL filters leaks by design. Generation without citation discipline invents.

Notice that only the last failure is the model’s fault. The other three are plumbing, which is why the plumbing gets most of this article.

Ingestion: the unglamorous 60%

Most RAG quality problems are ingestion problems wearing a disguise. Teams tune prompts for weeks while the real defect sits upstream in a PDF parser.

Extraction fidelity comes first. Tables, headers, and layout in PDFs and Office documents carry meaning, and naive text extraction destroys it. Budget real effort here. Garbage extracted is garbage retrieved, and no downstream cleverness recovers it.

Delta sync must include deletions. The index tracks source changes on a defined freshness SLA, and removal has to propagate on that SLA too. An index that only ever adds becomes something nobody signed up to operate: an unauthorized archive of everything your organization has ever tried to retract. When legal pulls a document, “it’s gone from SharePoint but still in the vector index” is a sentence you want never to say.
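A minimal sketch of a delta sync that treats deletions as first-class, assuming both the source and the index can report a map of document id to content hash. The `plan_delta_sync` helper and the document ids are hypothetical; a real connector would read the source system's change feed instead of comparing full snapshots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    upserts: list[str]  # doc ids to re-extract, re-chunk, and re-embed
    deletes: list[str]  # doc ids whose chunks must be purged from the index


def plan_delta_sync(source: dict[str, str], indexed: dict[str, str]) -> SyncPlan:
    """Compare source and index state, both mapping doc_id -> content hash.

    Hypothetical helper for illustration only.
    """
    upserts = sorted(
        doc_id for doc_id, digest in source.items()
        if indexed.get(doc_id) != digest  # new or changed at the source
    )
    # Anything the index still holds that the source no longer has is a purge.
    deletes = sorted(doc_id for doc_id in indexed if doc_id not in source)
    return SyncPlan(upserts=upserts, deletes=deletes)


source_state = {"policy-1": "a1", "policy-2": "b2-edited", "faq-9": "c3"}
index_state = {"policy-1": "a1", "policy-2": "b2", "memo-7": "d4"}

plan = plan_delta_sync(source_state, index_state)
print(plan.upserts)  # ['faq-9', 'policy-2']
print(plan.deletes)  # ['memo-7']
```

The point of the shape is that `deletes` is computed on every run, so a retracted document cannot survive in the index longer than one sync interval.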

Capture metadata at ingestion or lose it forever: source system, author, timestamp, document type, and, critically, the ACL. Every downstream capability (filtering, permissions, citations, purging) hangs off metadata captured now.
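One way to make that concrete is a chunk record that cannot exist without its metadata. This is a sketch: the field names are illustrative, and rejecting a chunk with an empty ACL is an assumption of mine (fail closed), not something the pipeline above mandates.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str
    text: str
    source_system: str                   # e.g. "wiki", "sharepoint"
    author: str
    modified_at: datetime
    doc_type: str
    allowed_principals: frozenset[str]   # ACL copied from the source system

    def __post_init__(self) -> None:
        # Assumption: a chunk with no ACL is refused rather than treated as public.
        if not self.allowed_principals:
            raise ValueError(f"chunk {self.chunk_id} has no ACL; refusing to index")


record = ChunkRecord(
    chunk_id="wiki-42#3",
    text="Refunds are processed within 14 days.",
    source_system="wiki",
    author="dana",
    modified_at=datetime(2024, 5, 1, tzinfo=timezone.utc),
    doc_type="policy",
    allowed_principals=frozenset({"grp:all-staff"}),
)
print(record.source_system, sorted(record.allowed_principals))
```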

And treat corpus curation as a security control, not a librarian’s chore. Whoever can write into a source that feeds the index can inject instructions into future model contexts. Corpus write-access is a trust boundary; the LLM security threat model covers why indirect injection through retrieved content is the attack to plan for.

Chunking and indexing tradeoffs

Chunking decides what “a unit of knowledge” is. The index decides how it is found. Both are tradeoffs to make against your own corpus, not settings to copy from a tutorial.

Decision  | Option                                      | Wins                     | Costs
----------+---------------------------------------------+--------------------------+---------------------------------------------
Chunking  | Fixed-size + overlap                        | Simple, predictable      | Splits meaning mid-thought
          | Structure-aware (headings, paragraphs)      | Preserves semantic units | Needs per-format parsing
          | Parent-child (retrieve small, return large) | Precision + context      | Pipeline complexity
Index     | Vector-only                                 | Semantic matching        | Misses exact terms: SKUs, error codes, names
          | Keyword (BM25)-only                         | Exact-term precision     | Misses paraphrase
          | Hybrid + reranker                           | Best retrieval quality   | More moving parts, rerank latency

Defensible defaults for a document-heavy enterprise corpus: structure-aware chunks of a few hundred tokens, hybrid retrieval, and a cross-encoder reranker over the top 50 candidates before selecting the final handful.
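As a rough illustration of the structure-aware option, the sketch below packs whole paragraphs into chunks under a size budget and never cuts inside a paragraph. It assumes paragraphs are separated by blank lines and uses whitespace word counts as a stand-in for tokens; a real pipeline would use the embedding model's tokenizer and per-format parsing for headings and tables.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 300) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words.

    A single paragraph larger than the budget becomes its own oversized
    chunk rather than being split mid-thought.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # crude token estimate (assumption)
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "alpha beta gamma\n\ndelta epsilon zeta\n\neta theta iota kappa"
chunks = chunk_by_paragraph(doc, max_tokens=6)
print(len(chunks))  # 2
print(chunks[1])    # eta theta iota kappa
```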

The case for hybrid deserves a sentence of its own. Enterprise queries are full of part numbers, error strings, and people’s names, which is exactly what pure vector search fumbles. Adding keyword retrieval is usually the single biggest quality upgrade available, and it is decades-old technology.
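Hybrid retrieval needs some way to merge the vector ranking with the keyword ranking. The article does not prescribe one; the sketch below uses reciprocal rank fusion (RRF), a common choice because it needs only ranks, not comparable scores. The chunk ids are made up, and `k=60` is the conventional smoothing constant, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk; ties break on chunk id
    so the output is deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c3", "c1", "c7"]    # from the embedding index
keyword_hits = ["c9", "c3", "c1"]   # from BM25, e.g. an exact error-code match

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['c3', 'c1', 'c9', 'c7']

candidates = fused[:50]  # the pool a cross-encoder reranker would then rescore
```

Chunks that both retrievers agree on rise to the top, while `c9`, found only by the keyword side, still makes the candidate pool instead of being lost.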

Two more decisions worth making deliberately rather than discovering. Embedding model versioning: changing the embedding model means re-embedding the corpus, so pin it and plan migrations like the infrastructure events they are. And index snapshots tied to releases, so that when you are debugging last week’s bad answer, you can reproduce last week’s index instead of interrogating today’s.
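Pinning can be enforced mechanically. The sketch below assumes each index snapshot ships with a small manifest and that the query path checks its encoder against it before searching; `IndexManifest`, the model name, and the snapshot id are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    snapshot_id: str       # tied to a release so last week's index is reproducible
    embedding_model: str   # pinned model identifier used to build the index
    embedding_dim: int


def check_query_encoder(manifest: IndexManifest, model: str, dim: int) -> None:
    """Refuse to query an index with a different embedding model than built it."""
    if (model, dim) != (manifest.embedding_model, manifest.embedding_dim):
        raise RuntimeError(
            f"index {manifest.snapshot_id} was built with "
            f"{manifest.embedding_model} ({manifest.embedding_dim}d), "
            f"but the query encoder is {model} ({dim}d)"
        )


manifest = IndexManifest("release-2024-05-01", "embed-model-v2", 768)
check_query_encoder(manifest, "embed-model-v2", 768)  # passes silently
```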

Permission-aware retrieval: the boundary that makes it enterprise

Here is the design rule that separates enterprise RAG from a demo: the retriever enforces permissions, and the model never sees what the caller cannot see.

By default, an index is a flattening machine. It happily co-locates the M&A memo with the cafeteria menu, and semantic search will surface whichever one matches the query. Prompting the model to “not reveal restricted content” is not a control. The model summarizes what it is given; the leak already happened at retrieval time.

The only sound design:

  1. Ingest ACLs as chunk metadata, synced from source systems on the same cadence as content.
  2. Resolve the caller’s entitlements at query time (identity and group memberships from your IdP) and apply them as a pre-filter inside the retrieval query. Not post-hoc trimming of results; a pre-filter. This is identity-first security applied to a search index.
  3. Propagate permission changes on a defined SLA. When access is revoked at the source, stale index entries are a live leak until the next sync. Write the SLA down and test it, because untested SLAs are wishes.
  4. Log which chunks entered which answer. Leak forensics and poisoned-source tracing in one artifact.
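
The steps above can be sketched in a few lines. This is an in-memory illustration, not a retrieval engine: the `score` field stands in for whatever similarity the index computes, the principal names are invented, and in a real deployment the filter would be expressed in the index's own query filter rather than a Python loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    allowed: frozenset[str]  # principals synced from the source system's ACL
    score: float             # stand-in for the similarity the index would compute


def retrieve(chunks: list[Chunk], principals: set[str], k: int) -> list[Chunk]:
    """Rank only the chunks the caller is entitled to see (pre-filter)."""
    permitted = [c for c in chunks if c.allowed & principals]
    return sorted(permitted, key=lambda c: c.score, reverse=True)[:k]


def audit_record(request_id: str, caller: str, used: list[Chunk]) -> dict:
    """Step 4: record which chunks entered which answer."""
    return {
        "request_id": request_id,
        "caller": caller,
        "chunk_ids": [c.chunk_id for c in used],
    }


index = [
    Chunk("ma-memo-3", frozenset({"grp:corp-dev"}), 0.95),
    Chunk("cafeteria-1", frozenset({"grp:all-staff"}), 0.60),
    Chunk("travel-policy-2", frozenset({"grp:all-staff"}), 0.55),
]
caller_principals = {"user:dana", "grp:all-staff"}  # resolved from the IdP

hits = retrieve(index, caller_principals, k=2)
print([c.chunk_id for c in hits])  # ['cafeteria-1', 'travel-policy-2']
print(audit_record("req-001", "user:dana", hits))
```

Note what post-hoc trimming would have done with the same data: the top two by score are the M&A memo and the cafeteria menu, so the restricted chunk enters the candidate set and the caller ends up with one result instead of two.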

There is a corollary worth stating plainly. If entitlement models across sources are genuinely incompatible, run separate indexes per sensitivity tier rather than one index with heroic filtering.

Blast radius beats elegance.

Generation: grounding discipline

The generation stage has one job: compose an answer from the retrieved evidence, and nothing else. The guardrails are unglamorous. Instruct the model to answer only from the provided context. Require inline citations to chunk identifiers. Make “I don’t have that in the sources” an approved, tested output rather than a failure the team quietly prompts away.

Then verify citations programmatically after generation: every cited chunk must exist in the retrieved set. It is a cheap check, a set-membership test, and it catches an entire class of confident fabrication before a human ever sees it.
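A sketch of that set-membership check, assuming citations appear inline as bracketed chunk identifiers such as `[kb-12]`. The citation syntax and the ids are assumptions for illustration; adapt the pattern to whatever format your prompt requires.

```python
import re

# Assumed citation syntax: a chunk id in square brackets, e.g. [kb-12].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def invalid_citations(answer: str, retrieved_ids: set[str]) -> set[str]:
    """Return cited chunk ids that were never in the retrieved set."""
    cited = set(CITATION.findall(answer))
    return cited - retrieved_ids


answer = "Refunds take 14 days [kb-12]. Exceptions need VP approval [kb-99]."
retrieved = {"kb-12", "kb-31"}

bad = invalid_citations(answer, retrieved)
print(bad)  # {'kb-99'}
```

An empty result means every citation points at real evidence; it does not prove the cited chunk supports the claim, which is what the faithfulness metric is for.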

Measuring grounding faithfulness

You cannot claim a RAG system is grounded. You have to measure it, and keep measuring it, because every index rebuild changes the system’s behavior whether or not any code changed.

The metric stack, bottom to top:

  • Retrieval recall and precision against a labeled set: for known questions, do the relevant documents appear in the top k? This diagnoses the retriever in isolation. Always debug retrieval before touching prompts; a perfect prompt over the wrong chunks produces fluent nonsense.
  • Faithfulness: decompose each answer into atomic claims and verify that each one is entailed by the retrieved context. This is the approach that frameworks like Ragas operationalize with an LLM judge, and it is the anti-hallucination number.
  • Answer relevance and citation validity round out the quartet.
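
The arithmetic behind the first two metrics is simple enough to sketch. The functions below assume you already have a labeled set of relevant chunk ids per question and a list of per-claim verdicts from a judge; the judge itself (LLM or human) is out of scope here, and the ids are invented.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of known-relevant chunks that appear in the top k."""
    if not relevant:
        raise ValueError("need at least one labeled relevant chunk")
    return len(relevant & set(retrieved[:k])) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k that is actually relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of an answer's atomic claims the judge found supported."""
    if not claim_supported:
        raise ValueError("answer produced no claims to score")
    return sum(claim_supported) / len(claim_supported)


retrieved = ["c4", "c1", "c8", "c2"]
relevant = {"c1", "c2", "c5"}

print(round(recall_at_k(retrieved, relevant, k=3), 3))     # 0.333
print(round(precision_at_k(retrieved, relevant, k=3), 3))  # 0.333
print(faithfulness([True, True, True, False]))             # 0.75
```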

Wire all of it into the machinery described in evaluating AI systems in production: a golden set of question and expected-evidence pairs gating releases, plus continuous faithfulness scoring over sampled production traffic. Index rebuilds, embedding-model changes, and chunking adjustments all go through the gate. Judges get calibrated against human review, per the usual rules.

Set an explicit floor. On systems where answers inform real decisions, I treat sustained faithfulness below the high nineties (percent of claims supported) as an incident condition, the same as an error-rate SLO breach: page someone, investigate, add the failures to the golden set. A grounding metric without a threshold is trivia.
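One hedged way to encode "sustained" is a rolling mean over the most recent scoring windows. In the sketch below, the floor of 0.97 and the window of three are placeholders chosen for illustration, not recommendations; pick values from your own baseline.

```python
from statistics import mean

FLOOR = 0.97   # illustrative placeholder, not a recommended value
WINDOW = 3     # number of recent scoring periods that count as "sustained"


def breaches_floor(scores: list[float], floor: float = FLOOR, window: int = WINDOW) -> bool:
    """True when the mean of the last `window` faithfulness scores is below the floor."""
    if len(scores) < window:
        return False  # not enough history to call it sustained
    return mean(scores[-window:]) < floor


daily_scores = [0.99, 0.98, 0.95, 0.94, 0.96]
print(breaches_floor(daily_scores))      # True
print(breaches_floor(daily_scores[:3]))  # False
```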

What to write down

Five decision records make a RAG system operable by the next engineer: the corpus contract (sources, owners, freshness and deletion SLAs); the chunking and index configuration, with the eval results that justified them; the permission model (ACL sync mechanism, SLA, and tier boundaries); the grounding thresholds and what happens when they breach; and the citation logging schema. That set also answers, in advance, every question a governance review will ask about the system.

RAG earns its place in the enterprise AI pattern catalog because it makes knowledge governable: updatable, citable, revocable, permission-scoped. Skip the governance-bearing parts (deletion sync, ACL filtering, faithfulness measurement) and you have built the opposite: an ungoverned copy of everything your organization knows, fronted by a very persuasive summarizer. The pipeline is the product. The model just talks.

Frequently asked questions

What is RAG in one paragraph?

Retrieval-augmented generation answers a question in two steps: retrieve the most relevant passages from a curated corpus, then have a language model compose an answer using those passages as evidence. It exists because models cannot know your private, current data, and because retraining for every document change is absurd. Done well, it produces current, citable, permission-respecting answers. Done badly, it produces confident summaries of the wrong documents.

Why is permission-aware retrieval so important?

Because a RAG index flattens your access model by default. Documents from HR, finance, and engineering land in one index, and any query can surface any chunk unless the retriever enforces the caller's entitlements. The model cannot be the enforcement point; it summarizes whatever context it is handed. Permissions must be applied as a filter inside the retrieval query, using ACL metadata synced from the source systems.

What chunk size should I use?

There is no universal number, but there is a method. Start structure-aware, splitting on headings, paragraphs, or semantic units rather than fixed character counts, in the low hundreds of tokens with modest overlap, then tune against your own retrieval eval set. Small chunks retrieve precisely but lose surrounding context. Large chunks preserve context but dilute embeddings and waste the context window. Measure on your corpus rather than copying a tutorial's constant.

How do I measure whether answers are actually grounded?

Score faithfulness: decompose each answer into claims and check that every claim is supported by the retrieved passages, using an LLM judge calibrated against human review. Track it alongside retrieval quality metrics like recall of known-relevant documents, so you can tell retriever failures from generation failures. A faithfulness score materially below one means the model is filling gaps from its own weights, which in an enterprise setting is a defect, not a feature.
