Retrieval-Augmented Generation in Practice

RAG is now the default architecture for grounded LLM systems, and rightly so. It is the only pattern that lets a model speak to facts that changed after it was trained — with citation back to source and access control enforced at the document layer. But the published patterns assume a level of data hygiene that most enterprises do not have. The interesting engineering begins when the corpus is the actual corpus.

The corpus is multi-format. In published examples, the source is a clean directory of Markdown files. In practice, it is PDFs with mixed text and image layers, scanned documents in twelve different scanner profiles, Word files with embedded objects, Excel sheets used as forms, email threads with quoted history, presentations exported to PDF with reordered slides. Every format imposes a different extraction strategy, and extraction quality varies document by document.
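One way to keep per-format extraction honest is to route every file through a registry and record an explicit status for each document, rather than silently indexing whatever text comes out. The sketch below is a minimal illustration under assumptions: the `Extraction` record, the `EXTRACTORS` registry, and the status labels are hypothetical names, and only a plain-text extractor is wired in. Real PDF, Word, or OCR extractors would be registered the same way.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Callable, Dict

# Hypothetical extractor signature: raw file bytes in, extracted text out.
Extractor = Callable[[bytes], str]


@dataclass
class Extraction:
    filename: str
    status: str  # "ok", "empty", "failed", or "unsupported"
    text: str = ""
    error: str = ""


def extract_utf8(data: bytes) -> str:
    """Plain-text extractor; undecodable bytes are replaced, not dropped."""
    return data.decode("utf-8", errors="replace")


# Registry keyed by lowercase file suffix. Only text formats are wired in here.
EXTRACTORS: Dict[str, Extractor] = {".md": extract_utf8, ".txt": extract_utf8}


def extract(filename: str, data: bytes) -> Extraction:
    """Route a file to its extractor and report what happened."""
    suffix = PurePath(filename).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        return Extraction(filename, "unsupported")
    try:
        text = extractor(data)
    except Exception as exc:  # one bad document must not stop the batch
        return Extraction(filename, "failed", error=str(exc))
    return Extraction(filename, "ok" if text.strip() else "empty", text)
```

Counting the statuses per format gives a first, rough view of where extraction quality is weakest before any retrieval tuning starts.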

The corpus is multi-language. In the GCC, in continental Europe, in cross-border legal and financial practice, a single document set spans multiple languages — sometimes within the same document. The embedding model must be multilingual. The retrieval ranking has to handle mixed-language queries. The generation model has to respond in the language the user asked in, regardless of the language of the source.

The corpus is partially conflicting. Versioned policies coexist with their predecessors. Acquired companies bring document sets written under different assumptions. Regulatory positions shift, and superseded documents stay on file. A naive retrieval layer surfaces contradictory passages from different time periods as if they were equally current. The pipeline has to express recency, version status, and source authority — and the generation layer has to respect that signal.
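A minimal sketch of carrying that signal through the pipeline, assuming each retrieved passage already has version metadata attached at ingestion. The `Passage` fields, the integer authority scale, and the bracketed header format are all hypothetical. The idea is simply that current documents are ordered ahead of superseded ones and that the status is written into the context the generator sees.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Passage:
    doc_id: str
    text: str
    similarity: float  # first-stage relevance score
    effective: date    # date the document took effect
    superseded: bool   # True if a later version replaces this document
    authority: int     # hypothetical scale, higher means more authoritative


def order_for_generation(passages: List[Passage]) -> List[Passage]:
    """Current documents first, then by similarity, then by recency."""
    return sorted(
        passages,
        key=lambda p: (p.superseded, -p.similarity, -p.effective.toordinal()),
    )


def render_context(passages: List[Passage]) -> str:
    """Prefix each passage with its metadata so the generator can see it.

    Authority is surfaced in the header only; it does not affect ordering here.
    """
    lines = []
    for p in passages:
        status = "SUPERSEDED" if p.superseded else "CURRENT"
        header = (
            f"[{p.doc_id} | effective {p.effective.isoformat()} "
            f"| {status} | authority {p.authority}]"
        )
        lines.append(f"{header} {p.text}")
    return "\n".join(lines)
```

Whether superseded passages are demoted, as here, or dropped outright is a policy decision. Keeping them labeled lets the generator answer questions about what the rule used to be.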

Chunking is not solved. The published advice — “500-token chunks with 50-token overlap” — works for clean technical documentation. It breaks for legal contracts (where the unit of meaning is the clause, often spanning chunks), for tables (which must not be split), for headers and footers (which add noise to every chunk), for code (where the unit is the function), and for transcripts (where the unit is the turn). Chunking is a per-corpus design decision, not a default.
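To make the contrast with fixed-size windows concrete, here is a small sketch of boundary-based chunking: the text is split wherever a line starts a new unit of meaning, so a clause or a speaker turn is never cut in half. Both regular expressions are assumptions about formatting (clauses numbered like "1." or "2.3.", turns written as "Speaker: text") and would need adapting to the corpus at hand.

```python
import re
from typing import List

# Assumption: a clause starts with a dotted number followed by a period, e.g. "1. " or "2.3. ".
CLAUSE_START = re.compile(r"^\d+(?:\.\d+)*\.\s+\S")
# Assumption: a transcript turn starts with a capitalized speaker label and a colon.
TURN_START = re.compile(r"^[A-Z][\w .'-]*:\s")


def split_on_boundaries(text: str, boundary: re.Pattern) -> List[str]:
    """Start a new chunk at every line that matches the boundary pattern."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if boundary.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


contract = (
    "1. Term\n"
    "This agreement runs for two years.\n"
    "2. Termination\n"
    "Either party may terminate\n"
    "with 30 days notice."
)
clauses = split_on_boundaries(contract, CLAUSE_START)
# Each clause, heading and body together, ends up in its own chunk.
```

Very long clauses would still need a secondary split, and tables or code need their own boundary rules. The sketch only covers the "unit of meaning" idea.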

Embedding model selection has consequences. The embedding model determines retrieval quality more than any other component. Smaller models retrieve faster and cost less but miss semantic nuance. Larger models retrieve more accurately but blow up the cost of re-embedding when the corpus changes. Self-hosted models are required for sovereign deployments and add their own operational load. The right answer is corpus-specific.

Re-ranking is usually necessary. First-stage retrieval (vector similarity, or hybrid BM25 plus vector) is fast and recall-oriented: it surfaces roughly the right passages, but not reliably in the best order. A re-ranker, typically a cross-encoder that scores each query-passage pair jointly, reorders the top-k from the first stage. It adds latency but materially improves the quality of what reaches the generation layer.
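The second stage can be sketched independently of any particular model. In the example below, `rerank` takes the first-stage candidates and a pair-scoring function. `overlap_score` is a deliberately crude stand-in, not a cross-encoder, used only so the sketch runs without a model. In a real deployment the scorer would be the cross-encoder's scoring call. All names here are hypothetical.

```python
from typing import Callable, List, Sequence

# A pair scorer maps (query, passage) to a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def rerank(query: str, candidates: Sequence[str], score_pair: Scorer, top_n: int) -> List[str]:
    """Score every first-stage candidate against the query and keep the best top_n."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    # Sort by score only; ties keep their first-stage order because the sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in scorer: fraction of query terms that appear in the passage."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    return len(query_terms & passage_terms) / len(query_terms) if query_terms else 0.0


first_stage = [
    "Refunds are processed weekly.",
    "The notice period is 30 days.",
    "Notice must be given in writing.",
]
best = rerank("notice period", first_stage, overlap_score, top_n=2)
```

Because every candidate is scored against the query, latency grows with the first-stage k, which is why k is usually kept to a few dozen passages rather than hundreds.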

Citations have to be structural, not narrative. The generation layer must emit citations that point to specific passages — by document ID, page, and span — not loose textual references the user has to interpret. Auditors will ask whether a given output is traceable to source. The answer has to be “yes, here is the passage.”
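As an illustration of what "structural" can mean, the sketch below models a citation as a record that either resolves to an exact span of source text or does not. The `Citation` fields, the character-offset convention, and the in-memory `PageStore` are assumptions made for the example. A production system would resolve against its own document store.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Citation:
    doc_id: str
    page: int
    start: int  # character offset within the page text, inclusive
    end: int    # character offset within the page text, exclusive
    quote: str  # the text the generator claims is at that span


# Hypothetical store: (doc_id, page) -> extracted page text.
PageStore = Dict[Tuple[str, int], str]


def resolve(citation: Citation, pages: PageStore) -> bool:
    """Return True only if the cited span exists and matches the quoted text."""
    page_text = pages.get((citation.doc_id, citation.page))
    if page_text is None:
        return False
    if not (0 <= citation.start < citation.end <= len(page_text)):
        return False
    return page_text[citation.start:citation.end] == citation.quote
```

Running `resolve` on every citation before an answer is returned turns the auditor's question into a check the system performs on itself. An answer whose citations do not resolve can be blocked or flagged.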

Evaluation runs against the corpus, not the model. The right question is not “is the model accurate” but “is the system accurate against this corpus.” Answering it requires a held-out evaluation set drawn from the actual corpus, scored against ground-truth answers, with regression monitoring as both corpus and model evolve.
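A minimal sketch of such a harness, assuming the system under test can be called as a function from question to answer. The `EvalCase` record, the exact-match scoring after normalization, and the 0.02 tolerance are illustrative assumptions. Real evaluations usually need a more forgiving answer comparison than exact match.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str  # ground-truth answer drawn from the actual corpus


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.lower().split())


def accuracy(cases: Sequence[EvalCase], answer_fn: Callable[[str], str]) -> float:
    """Share of held-out cases the whole system answers correctly."""
    if not cases:
        raise ValueError("evaluation set is empty")
    correct = sum(
        normalize(answer_fn(case.question)) == normalize(case.expected)
        for case in cases
    )
    return correct / len(cases)


def regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when accuracy drops below the baseline by more than the tolerance."""
    return current < baseline - tolerance
```

Re-running the same set after every corpus refresh or model change, and comparing against the stored baseline, is what makes this regression monitoring rather than a one-off benchmark.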

RAG is a deceptively simple pattern in slides and a substantial engineering program in practice. The teams that get value from it treat the retrieval layer as a first-class system in its own right — versioned, evaluated, monitored — not as plumbing bolted in front of an LLM.

The above is a Veritonix Insights publication. Direct inquiries on this topic or related engagements to [email protected].
