Fix the Retrieval Layer That Is Quietly Wrecking Your RAG Answers

Measure recall, chunking, and reranking so you stop blaming the model for the bad context your retrieval layer handed it.

Derrick S. K. Siawor · December 20, 2025 · 8 min read

Abstract blue mesh of glowing nodes and connecting lines on a dark background, suggesting a neural network — Photo · Conny Schneider / Unsplash

When a retrieval-augmented generation system gives a wrong or vague answer, the instinct is to blame the model. It hallucinated, it is not smart enough, we should try a bigger one. Most of the time, that instinct is pointing at the wrong component. The model answered correctly given what it was handed. The problem is that it was handed the wrong context, or no relevant context at all, and a model cannot answer from documents it never received.

This is the central, under-appreciated truth of building RAG: the retrieval layer is doing most of the work and getting none of the blame. A RAG system is a pipeline: chunk the documents, embed them, retrieve the relevant ones for a query, optionally rerank, assemble the context, and generate the answer. The answer is only as good as the context the retrieval step found. If you are not measuring where in that pipeline things break, you are debugging blind, and you will keep swapping models while the actual fault sits in chunking or retrieval.

Separate retrieval failure from generation failure

The first and most valuable thing to do is stop treating the system as a black box that is either right or wrong. There are two distinct failure modes, and they have completely different fixes.

Retrieval failure: the right information was not in the context handed to the model. The answer is wrong because the model never saw the facts it needed. No model upgrade fixes this, because the bigger model also cannot answer from documents it did not receive.

Generation failure: the right information was in the context, and the model still produced a wrong or unfaithful answer. This is where model quality, prompting, and grounding actually matter.

The diagnostic move is simple and it changes everything: when an answer is wrong, look at the context that was retrieved for that query. Did it contain the information needed to answer? If yes, you have a generation problem. If no, you have a retrieval problem, and the vast majority of the time, you have a retrieval problem. Teams that skip this step spend weeks tuning the generation half while the retrieval half quietly hands over garbage. Tracing which step actually failed is far easier when agent observability and tracing spans capture what was retrieved on every call.
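The triage described above can be sketched in a few lines. This is a minimal illustration, not a production check: `FailedCase` and `diagnose` are hypothetical names, the required facts are assumed to be hand-labelled, and the substring match is a crude stand-in for manual review or a model-based judgement.

```python
from dataclasses import dataclass


@dataclass
class FailedCase:
    query: str
    required_facts: list[str]    # hand-labelled facts the answer depends on (assumption)
    retrieved_chunks: list[str]  # what the retriever actually returned for the query


def diagnose(case: FailedCase) -> str:
    """Label a wrong answer as a 'retrieval' or a 'generation' failure."""
    chunks = [c.lower() for c in case.retrieved_chunks]
    missing = [
        fact for fact in case.required_facts
        if not any(fact.lower() in chunk for chunk in chunks)
    ]
    # If any required fact never reached the context, the model could not have used it.
    return "retrieval" if missing else "generation"


case = FailedCase(
    query="What is the refund window?",
    required_facts=["30 days"],
    retrieved_chunks=["Shipping takes 5 business days.", "Returns need a receipt."],
)
print(diagnose(case))  # retrieval
```

Running this over every failed query in a log gives a rough count of how many failures belong to each half of the pipeline, which is usually enough to decide where to spend the next week.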

Figure: RAG diagnosis flow. Split retrieval failure from generation failure, then fix chunking, reranking, or the prompt.

Measure retrieval with recall and precision

Once you know retrieval is where to look, you measure it with two complementary metrics borrowed from information retrieval.

Context recall: of all the information needed to answer the query, how much of it appears in the retrieved chunks. Recall below 100 percent means some required information was missing from what you retrieved, and the model literally could not have answered fully. Low recall is the most damaging failure, because the answer is incomplete or wrong through no fault of the model.

Context precision: of the chunks you retrieved, how many actually contain query-relevant information. Low precision means you are diluting the relevant facts with irrelevant chunks, which both wastes context budget and can distract the model toward the wrong information.
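As a rough sketch of the two definitions above, both metrics reduce to simple ratios once you have a list of facts the answer needs. The function names and the substring test are assumptions made for illustration; real evaluations typically use labelled chunks or a judge model instead of string matching.

```python
def context_recall(required_facts: list[str], chunks: list[str]) -> float:
    """Fraction of required facts that appear in at least one retrieved chunk."""
    if not required_facts:
        return 1.0  # nothing was needed, so nothing is missing
    lowered = [c.lower() for c in chunks]
    found = sum(1 for fact in required_facts if any(fact.lower() in c for c in lowered))
    return found / len(required_facts)


def context_precision(required_facts: list[str], chunks: list[str]) -> float:
    """Fraction of retrieved chunks that contain at least one required fact."""
    if not chunks:
        return 0.0
    facts = [f.lower() for f in required_facts]
    relevant = sum(1 for c in chunks if any(f in c.lower() for f in facts))
    return relevant / len(chunks)


facts = ["30 days", "original receipt"]
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Our stores open at 9am.",
    "Gift cards never expire.",
    "Shipping is free over $50.",
]
print(context_recall(facts, chunks))     # 0.5  (the receipt requirement was never retrieved)
print(context_precision(facts, chunks))  # 0.25 (one useful chunk out of four)
```

In this toy case both numbers are poor for different reasons: half the needed information is missing, and three of the four chunks are noise.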

These point at different fixes. Low recall says your knowledge coverage is incomplete, or there is a mismatch between how queries are phrased and how the documents are written, so the right chunk exists but does not get retrieved. When the gap is genuinely about your own narrow domain rather than phrasing, it can be worth training a small local classifier for the routing step, one that outperforms a frontier model on your task. Low precision says you need better reranking, filtering, or hybrid search to push the noise out of the top results. Knowing which one is failing tells you exactly which lever to pull, instead of guessing.

If you have labelled data (queries paired with the chunks that should answer them) you can compute these precisely with metrics like Recall@K, Precision@K, and NDCG, which measure whether the right chunks showed up in your top K results and how well-ranked they were. If you do not have labels, you can still judge retrieved context by manual review or by using a strong model as a judge to assess whether the retrieved chunks were relevant. Either way, you are now measuring the part that was previously invisible.
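For the labelled case, the three rank-based metrics named above can be written directly. This sketch assumes binary relevance (a chunk either should answer the query or should not) and unique chunk ids in the ranking; the ids themselves are made up.

```python
import math


def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for cid in ranked_ids[:k] if cid in relevant)
    return hits / len(relevant)


def precision_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for cid in ranked_ids[:k] if cid in relevant)
    return hits / k


def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: a relevant chunk counts for more the higher it ranks."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, cid in enumerate(ranked_ids[:k])
        if cid in relevant
    )
    # Ideal ordering puts every relevant chunk first.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["c7", "c2", "c9", "c4", "c1"]  # retriever output, best first
relevant = {"c2", "c4"}                  # labelled chunks for this query

print(recall_at_k(ranked, relevant, 3))               # 0.5
print(round(precision_at_k(ranked, relevant, 3), 3))  # 0.333
print(round(ndcg_at_k(ranked, relevant, 3), 3))       # 0.387
```

Recall@K and Precision@K ignore order inside the top K, while NDCG penalises the relevant chunk for sitting at position two instead of position one, which matters when only the first few chunks reach the model.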

Chunking is usually the silent culprit

When recall is low, the cause is very often chunking, the step where you split documents into pieces before embedding them. Chunking is deceptively important, because a chunk is the atomic unit retrieval works with. If the answer to a query is split across two chunks, retrieval may grab one and miss the other, and recall drops. If a chunk is too large, it dilutes its own embedding with unrelated content and becomes hard to match precisely. If it is too small, it loses the surrounding context that made it meaningful.

The failure mode to watch for: a chunk that contains a fact but not the context needed to recognise it as relevant to the query, or an answer that spans a chunk boundary so no single chunk holds it. You evaluate chunking both intrinsically (does a chunk fully cover the keywords needed to answer, how many tokens until the answer appears) and extrinsically (does this chunking strategy improve end-to-end retrieval recall and answer quality). The practical loop is to try a chunking approach, measure retrieval recall against your test queries, and keep the strategy that retrieves the answers most reliably. There is no universal best chunk size; there is the one that works for your documents and your queries, found by measuring.
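The try-and-measure loop above might look like the following sketch. Everything here is a simplification chosen for illustration: chunks are fixed windows of words, the retriever is a toy word-overlap ranker standing in for an embedding search, and a test counts as a hit when the known answer string appears whole inside a retrieved chunk.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Toy retriever (assumption): rank chunks by words shared with the query."""
    q = tokens(query)
    return sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)[:k]


def recall_for(text: str, tests: list[tuple[str, str]], size: int,
               overlap: int = 0, k: int = 1) -> float:
    """Share of test queries whose answer string sits inside a retrieved chunk."""
    chunks = chunk_words(text, size, overlap)
    hits = sum(
        1 for query, answer in tests
        if any(answer.lower() in c.lower() for c in retrieve(query, chunks, k))
    )
    return hits / len(tests)


TEXT = ("Refunds are accepted within 30 days of purchase with the original receipt. "
        "Our stores open at 9am every day.")
TESTS = [("refunds accepted within how many days", "within 30 days")]

# Sweep a few (size, overlap) settings and compare recall on the same tests.
for size, overlap in [(4, 0), (4, 2), (8, 0)]:
    print(size, overlap, recall_for(TEXT, TESTS, size, overlap, k=2))
```

With four-word chunks and no overlap, the answer phrase straddles a chunk boundary, so no retriever could return it in one piece. Adding overlap or widening the window removes that boundary problem, which is the effect the sweep is meant to surface on real documents.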

Reranking is the precision fix

When precision is low, reranking is the tool. Your initial retrieval (usually a vector similarity search) casts a wide net and returns the top candidates, but similarity is a coarse signal and the most truly relevant chunk is not always ranked first. A reranker is a second, more careful model that takes the retrieved candidates and reorders them by genuine relevance to the query, so the best chunks rise to the top and the noise falls below the cutoff.

This matters because you only pass the top few chunks to the generation model, to stay within context limits and to avoid distraction. If reranking puts the genuinely relevant chunk at position one instead of position eight, it makes the cut; without reranking it might get truncated away and recall effectively drops even though the chunk was retrieved. Reranking improves the precision of the final context, which is often the difference between an answer grounded in the right passage and an answer grounded in a near-miss.
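A minimal sketch of the retrieve-then-rerank step, under stated assumptions: `rerank` is a hypothetical helper that accepts any scoring function, and `overlap_score` is a toy stand-in. In a real pipeline the scorer would be a dedicated reranking model that reads the query and the chunk together.

```python
import re
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int) -> list[str]:
    """Reorder first-stage candidates by relevance score and keep the best top_n."""
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    """Toy scorer (assumption): share of query words that occur in the chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q) if q else 0.0


# First-stage results in their original order; the best passage is ranked last.
candidates = [
    "Our refund policy page was updated in March.",
    "Shipping takes five days.",
    "Contact support for help.",
    "The refund window is 30 days from delivery.",
]

context = rerank("refund window days", candidates, overlap_score, top_n=2)
print(context[0])  # The refund window is 30 days from delivery.
```

If only the top two first-stage results were passed on without reranking, the passage that actually answers the question would have been cut. After reranking it makes the cut at position one.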

Hybrid search (combining vector similarity with keyword matching) is the other lever, and it helps recall as much as precision. Pure vector search can miss exact-term matches that keyword search catches, and pure keyword search misses semantic matches that vectors catch. Combining them covers both, which makes it a common fix when recall is low because the right document used different wording than the query, or because the query hinges on an exact term that embeddings blur.
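The article does not say how the two result lists should be combined. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the author's method. It needs only the rank positions from each search, so the vector and keyword scores never have to be put on the same scale. The chunk ids are made up, and `k = 60` is a commonly used default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); chunks found by both lists add up.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


vector_hits = ["c3", "c1", "c5"]   # from similarity search (hypothetical)
keyword_hits = ["c9", "c3", "c1"]  # from keyword search (hypothetical)

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['c3', 'c1', 'c9', 'c5']
```

Chunks that both searches agree on rise to the top, while a chunk only one search found still stays in the candidate set instead of being lost.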

Build the evaluation loop, then iterate

The thing that turns RAG from frustrating to tractable is a small evaluation harness: a set of representative queries with known good answers, run through the pipeline, with recall and precision measured at the retrieval step and answer quality measured at the end. This is the retrieval-specific version of an eval harness that catches regressions before your users do. With that harness, every change becomes a measured experiment. Change the chunk size, run the suite, see if recall went up. Add a reranker, run the suite, see if precision went up. You are no longer guessing; you are optimising against numbers.
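A harness like the one described can start very small. The sketch below covers only the retrieval half (answer-quality scoring is left out), and every name in it is hypothetical: `EvalCase`, `evaluate`, and the two dictionary-backed retrievers that stand in for a pipeline before and after a change.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    relevant_ids: set[str]  # labelled chunks that should be retrieved for the query


Retriever = Callable[[str], list[str]]  # query -> ranked chunk ids, best first


def evaluate(retriever: Retriever, suite: list[EvalCase], k: int) -> dict[str, float]:
    """Average recall@k and precision@k over the whole suite."""
    recalls, precisions = [], []
    for case in suite:
        top = retriever(case.query)[:k]
        hits = sum(1 for cid in top if cid in case.relevant_ids)
        recalls.append(hits / len(case.relevant_ids))
        precisions.append(hits / k)
    n = len(suite)
    return {"recall": sum(recalls) / n, "precision": sum(precisions) / n}


suite = [
    EvalCase("refund window", {"c2"}),
    EvalCase("store hours", {"c5", "c6"}),
]

# Canned results standing in for two pipeline configurations.
baseline = {"refund window": ["c1", "c3", "c2"], "store hours": ["c5", "c8", "c9"]}
candidate = {"refund window": ["c2", "c1", "c3"], "store hours": ["c5", "c6", "c8"]}

print(evaluate(lambda q: baseline[q], suite, k=2))   # {'recall': 0.25, 'precision': 0.25}
print(evaluate(lambda q: candidate[q], suite, k=2))  # {'recall': 1.0, 'precision': 0.75}
```

Swapping the lambda for a call into the real retriever turns every chunking or reranking change into a before-and-after comparison on the same suite.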

Without the harness, RAG development is a depressing cycle of "the answer is still bad, try another model," which rarely helps because the model was rarely the problem. Worse, swapping to a bigger model to paper over a retrieval bug is also how the LLM bill quietly doubles without improving answer quality. With the harness, you find the actual broken link (almost always in chunking or retrieval) and fix that specific thing. This is the same discipline that makes any AI feature shippable: measure where it breaks before you change anything, and validate the change against real cases rather than vibes.

Getting this right is the difference between a RAG system that confidently cites the wrong document and one that grounds every answer in the passage that actually contains the truth. It is core to how we build AI features that have to be trusted, because a system that retrieves the wrong context and answers from it is more dangerous than one that admits it does not know. The answer the model finally produces still has to be shaped, which is where forcing LLM output into schemas your code can trust takes over, and when retrieval is not enough on its own it is worth knowing when fine-tuning beats prompting and when it just burns money. If your RAG answers are unreliable and you have been blaming the model, the retrieval metrics will usually tell a different story, and finding the real culprit is the fast path to fixing it.

