RAG is now the default architecture for grounded LLM systems, and rightly so. It is the most practical pattern for letting a model speak to facts that changed after it was trained, with citations back to the source and with access control enforced at the document layer. The published patterns, however, assume a level of data hygiene that most enterprises do not have. The interesting engineering begins when the corpus is the one the enterprise actually has.
The corpus is multi-format. In published examples, the source is a clean directory of Markdown files. In practice, the source is PDFs with mixed text and image layers, scanned documents in twelve different scanner profiles, Word documents with embedded objects, Excel files used as forms, email threads with quoted history, and presentations exported to PDF with reordered slides. Every format calls for a different extraction strategy, and extraction quality varies from document to document.
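A minimal sketch of per-format routing, assuming extraction is dispatched on file extension and that thin output is flagged for a second pass. The strategy names, the `manual_review` fallback and the 200-characters-per-page threshold are illustrative assumptions, not recommendations.

```python
from pathlib import Path

# Hypothetical strategy names; a real extractor would sit behind each one.
STRATEGY_BY_SUFFIX = {
    ".md": "plain_text",
    ".pdf": "pdf_text_layer_then_ocr",
    ".docx": "word_with_embedded_objects",
    ".xlsx": "spreadsheet_as_form",
    ".eml": "email_strip_quoted_history",
}


def pick_strategy(path: str) -> str:
    """Choose an extraction strategy from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    return STRATEGY_BY_SUFFIX.get(suffix, "manual_review")


def looks_under_extracted(text: str, page_count: int,
                          min_chars_per_page: int = 200) -> bool:
    """Flag documents whose extracted text is suspiciously thin,
    for example a scanned PDF with no usable text layer."""
    if page_count <= 0:
        return True
    return len(text.strip()) / page_count < min_chars_per_page
```

Because quality varies by document, a check of this kind would run per file rather than per format, so that one bad scan does not go unnoticed inside an otherwise clean batch.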
The corpus is multi-language. In the GCC, in continental Europe, in cross-border legal and financial practice, a single document set spans multiple languages — sometimes within the same document. The embedding model must be multilingual. The retrieval ranking has to handle mixed-language queries. The generation model has to respond in the language the user asked in, regardless of the language of the source.
The corpus is partially conflicting. Versioned policies coexist with their predecessors. Acquired companies bring their own document sets, written under different assumptions. Regulatory positions shift, and old documents remain on file. A naive retrieval layer will surface contradictory passages from different time periods as if they were equally current. The pipeline has to express recency, version status and source authority — and the generation layer has to respect that signal.
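One way to express recency, version status and source authority is to carry them as metadata on each passage and fold them into the ranking. The sketch below assumes a hypothetical `Passage` record, an authority weight between 0 and 1, and an exponential recency decay with a two-year half-life. The field names and the scoring formula are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Passage:
    doc_id: str
    text: str
    similarity: float   # first-stage retrieval score, assumed in [0, 1]
    effective: date     # date the document took effect
    superseded: bool    # True if a newer version replaces this document
    authority: float    # assumed weight in [0, 1], e.g. regulator > internal memo


def rank_with_status(passages, today, half_life_days=730.0, keep_superseded=False):
    """Drop superseded passages (by default) and down-weight old or
    low-authority ones. Returns (score, passage) pairs, best first."""
    ranked = []
    for p in passages:
        if p.superseded and not keep_superseded:
            continue
        age_days = max((today - p.effective).days, 0)
        recency = 0.5 ** (age_days / half_life_days)  # 1.0 when new, halves per half-life
        score = p.similarity * (0.5 + 0.5 * p.authority) * (0.5 + 0.5 * recency)
        ranked.append((score, p))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

Passing the surviving metadata (effective date, version status) through to the prompt, alongside the text, is what gives the generation layer a signal to respect.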
Chunking is not solved. The published advice — “500-token chunks with 50-token overlap” — works for clean technical documentation. It breaks for legal contracts (where the unit of meaning is the clause, often spanning chunks), for tables (which must not be split), for headers and footers (which add noise to every chunk), for code (where the unit is the function), and for transcripts (where the unit is the turn). Chunking strategy is a per-corpus design decision, not a default.
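As an illustration of chunking on the unit of meaning rather than on a token count, the sketch below splits a contract on numbered clause headings and a transcript on speaker turns. It assumes clauses start a line with a number such as `2.` or `2.1`, and turns start a line with a capitalised speaker label followed by a colon. Both patterns are simplifications that a real corpus would need tuned; a body line that happens to begin with a number would be split incorrectly.

```python
import re

# Zero-width split points: a line that begins with a clause number ("2." or "2.1").
CLAUSE_START = re.compile(r"(?m)^(?=\d+(?:\.\d+)*\.?\s)")
# Zero-width split points: a line that begins with a speaker label ("Alice: ").
TURN_START = re.compile(r"(?m)^(?=[A-Z][\w ]*:\s)")


def _split_keep_units(pattern, text):
    """Split at the pattern's positions and drop empty fragments."""
    return [part.strip() for part in pattern.split(text) if part.strip()]


def chunk_by_clause(text):
    """One chunk per numbered clause, however long the clause is."""
    return _split_keep_units(CLAUSE_START, text)


def chunk_by_turn(text):
    """One chunk per speaker turn, including continuation lines."""
    return _split_keep_units(TURN_START, text)


# Per-corpus choice of strategy; the keys are hypothetical corpus labels.
CHUNKERS = {"contract": chunk_by_clause, "transcript": chunk_by_turn}
```

For example, `CHUNKERS["contract"](contract_text)` keeps clause 2.1 whole even when it runs past the length a fixed-size splitter would allow.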
Embedding model selection has consequences. The embedding model typically influences retrieval quality more than any other single component. Smaller embedding models retrieve faster and cost less but miss semantic nuance. Larger embedding models retrieve more accurately but sharply raise the cost of re-embedding when the corpus changes. Self-hosted embedding models are required for sovereign deployments and add their own operational load. The right answer is corpus-specific.
Re-ranking is usually necessary. First-stage retrieval (vector similarity, or hybrid BM25 plus vector) is fast and recall-oriented. It tends to surface roughly the right passages, but not in the best order. A re-ranker, typically a cross-encoder model that scores each query-passage pair jointly, re-orders the top-k results from the first stage. The re-ranker adds latency but materially improves the quality of what is actually shown to the generation layer.
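The two-stage shape can be sketched with the scorer passed in as a function. In the example below, `overlap_score` is a toy stand-in (word overlap) used only so the sketch runs without a model; in a real pipeline a cross-encoder would take its place. All names here are hypothetical.

```python
from typing import Callable, List


def rerank(query: str,
           candidates: List[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 3) -> List[str]:
    """Score every (query, passage) pair from the first stage and
    return the top_n passages, best first."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words
    that also appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


first_stage = [
    "The office opens at nine.",
    "Termination requires a notice period of 90 days.",
    "Notice boards are on each floor.",
]
best = rerank("notice period termination", first_stage, overlap_score, top_n=2)
```

The latency cost comes from scoring one pair at a time, which is why the re-ranker is applied only to the first stage's top-k and not to the whole corpus.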
Citations have to be structural, not narrative. The generation layer must emit citations that point to specific passages — by document ID, page, and span — not loose textual references the user has to interpret. Auditors will ask whether a given output is traceable to source. The answer has to be “yes, here is the passage”.
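A structural citation can be as small as a record that resolves mechanically to text. The sketch below assumes page text is stored by `(doc_id, page)` and that spans are character offsets; the `Citation` type and `resolve` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Citation:
    doc_id: str
    page: int
    start: int  # character offset into the page text, inclusive
    end: int    # character offset into the page text, exclusive


def resolve(citation: Citation, pages: Dict[Tuple[str, int], str]) -> str:
    """Return the exact cited text, or raise if the citation does not resolve."""
    key = (citation.doc_id, citation.page)
    if key not in pages:
        raise KeyError(f"no such page: {key}")
    text = pages[key]
    if not (0 <= citation.start < citation.end <= len(text)):
        raise ValueError(f"span {citation.start}:{citation.end} is outside the page")
    return text[citation.start:citation.end]
```

Running `resolve` on every citation before an answer is returned turns the auditor's question into a check that either passes or fails, instead of something a reader has to judge.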
Evaluation runs against the corpus, not against the model. The right evaluation question is not “is the model accurate” but “is the system accurate against this corpus”. That requires a held-out evaluation set drawn from the actual corpus, scored against ground-truth answers, with regression monitoring as the corpus and model evolve.
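A minimal sketch of system-level evaluation with a regression gate, assuming the held-out set is a list of (question, expected answer) pairs and that the whole pipeline is callable as one function. Substring matching is a deliberately crude scoring rule used for illustration, and the 0.02 tolerance is an arbitrary assumption.

```python
from typing import Callable, List, Tuple


def evaluate(system: Callable[[str], str],
             eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of held-out questions whose expected answer appears
    in the system's output (case-insensitive substring match)."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1 for question, expected in eval_set
        if expected.lower() in system(question).lower()
    )
    return hits / len(eval_set)


def regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if accuracy fell below the stored baseline by more than the tolerance."""
    return current < baseline - tolerance
```

Re-running `evaluate` whenever the corpus is re-indexed or the model is swapped, and comparing the result against the last accepted score with `regressed`, is one way to make the monitoring routine.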
RAG is a deceptively simple pattern in slides and a substantial engineering programme in practice. The teams that get value from it treat the retrieval layer as a first-class system in its own right — versioned, evaluated, monitored — rather than as plumbing in front of an LLM.
The above is a Veritonix Insights publication.