RAG isn't a search problem — it's a chunking problem
Most teams spend weeks tuning retrieval. The real problem is upstream.
There is a pattern in RAG debugging that most teams recognise only after going through it themselves. Retrieval looks fine in demos. You can pull the right context with a well-formed query. In production, with real users asking real questions in real phrasing, the pipeline returns fragments that are adjacent to the answer, or splits the relevant passage across two chunks that never get retrieved together.
The standard response: try a better embedding model. Swap to a newer one. Tune top_k. Add a reranker. Adjust the similarity threshold.
These are all retrieval changes. The RAG chunking problem sits one layer upstream, and retrieval tuning does not touch it.
Why retrieval gets the blame
Retrieval is visible. You can inspect the top-k chunks returned for any query. You can compute cosine similarity scores. You can see reranker outputs. When a query fails, you debug retrieval because you can observe it.
Chunking happens at ingestion time. The artifacts it creates are not obvious at query time. The chunk that fails you looks like any other chunk in your index. You do not know what you did not retrieve, or that a better cut of the same document would have retrieved it.
This visibility asymmetry is why teams spend weeks on retrieval and never open the ingestion pipeline. They optimise the layer they can see and leave broken the layer they cannot.
What a chunk is — and what retrieval does with it
A chunk is the unit of information retrieval. Your embedding model turns each chunk into a vector. At query time, your query becomes a vector, and the search returns the k chunks whose vectors are nearest.
Two things follow from this. A chunk that mixes multiple topics produces an averaged embedding that is less similar to any single-topic query: you lose precision. A chunk that cuts off a thought mid-sentence, or splits a table across two units, loses the structural context that makes the information interpretable.
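A toy illustration of the precision loss described above, using hand-made two-dimensional vectors rather than a real embedding model. The vectors, the chunk names, and the idea that a mixed chunk sits halfway between two topics are all simplifying assumptions; real embeddings are high-dimensional and do not average this neatly.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings": axis 0 stands for a latency topic, axis 1 for a pricing topic.
chunk_vecs = {
    "latency_only": [1.0, 0.0],
    "pricing_only": [0.0, 1.0],
    "mixed": [0.5, 0.5],  # assumption: a two-topic chunk lands between the topics
}
latency_query = [1.0, 0.0]

focused_score = cosine(latency_query, chunk_vecs["latency_only"])
mixed_score = cosine(latency_query, chunk_vecs["mixed"])
ranking = top_k(latency_query, chunk_vecs, k=2)
```

Under these assumptions the single-topic chunk scores 1.0 against the latency query and the mixed chunk about 0.71. The mixed chunk still ranks second here, but in a real index with thousands of chunks that gap is enough to push it out of the top k.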
The embedding model does not see 'this chunk contains partial information.' It embeds what is there. If what is there is structurally incomplete, no amount of embedding-model quality will recover the missing context. The vectors will be wrong not because the model is bad, but because the input is.
Four ways the RAG chunking problem shows up without telling you
Boundary splits. Fixed-size chunking on character or token count does not know where a sentence ends or a paragraph begins. If one chunk ends with 'The threshold for this approach is' and the next begins with '15 milliseconds, matching our SLA', both will retrieve poorly for 'what is the latency threshold.' Neither chunk contains a complete answer. Both score reasonably on similarity because they are on-topic, but neither is useful on its own.
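The boundary split above is easy to reproduce. The following is a minimal sketch that uses whitespace-separated words as a stand-in for tokens; the function name, the sample text, and the window size of 11 are illustrative assumptions chosen so the cut lands mid-sentence.

```python
def fixed_size_chunks(text, size):
    """Split text into consecutive windows of `size` words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

text = (
    "We evaluated three caching approaches. "
    "The threshold for this approach is 15 milliseconds, matching our SLA."
)

chunks = fixed_size_chunks(text, size=11)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends with 'The threshold for this approach is' and the second starts with '15 milliseconds', so no single chunk holds both the question's subject and its answer.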
Context loss. Many documents carry context at the top that governs everything below: a section heading, a contract party's name, an effective date. Fixed-size chunking often separates this header from the content it applies to. The third paragraph of a legal clause, chunked in isolation, means something different from the same paragraph read under its heading.
Table destruction. Tables in PDFs or Word documents, processed with naive text extraction, often produce a flat sequence of cell values with no structural relationship. A chunk containing '2021 | Q3 | 4.2M | 18%' is nearly unretrievable. No natural-language query will match raw numbers without the column headers that give them meaning. The data is there; the structure that makes it findable is not.
Cross-reference breakage. Technical documents, contracts, and regulations use cross-references: 'see Section 4.3' or 'as defined in clause 2.' Fixed-size chunking strips the context that resolves those references. A chunk containing 'as defined in this agreement' is semantically incomplete when the definition appeared three chunks earlier, and the embedding will reflect that.
Fixed-size chunking: when it works and when it lies
Fixed-size chunking is not always wrong. It works when your source documents have uniform structure, consistent paragraph length, and no tables or cross-references. A corpus of short customer support tickets, product descriptions, or FAQ entries might chunk fine at 256 or 512 tokens.
Most real enterprise documents (policies, contracts, technical documentation, long-form reports) do not fit that profile. Chunking them at a fixed token count optimises for uniformity, not meaning.
One diagnostic: if your chunk sizes look clean and uniform in a size histogram, but chunk recall on your golden query set is below 0.6, the likely problem is that uniform does not mean coherent.
Document-aware chunking: matching the cut to the structure
The alternative is to cut where the document tells you to cut.
For structured documents (HTML, Markdown, DOCX with proper heading styles), chunk at heading boundaries. A section under an H2 is a semantically coherent unit. This often produces chunks ranging from 50 to 800 tokens, and that variability is fine: coherent chunks of variable size outperform uniform chunks of incoherent content.
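For Markdown input, heading-based chunking can be sketched in a few lines. This version cuts at every ATX heading level rather than only at H2, and it does not skip '#' lines inside fenced code blocks; both are simplifications, and the function name is hypothetical.

```python
import re

# Matches Markdown ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_heading(markdown_text):
    """Return a list of (heading, body) pairs, one per heading-delimited section."""
    sections = []
    heading, body = None, []

    def flush():
        # Keep a section if it has a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections

doc = (
    "# Refunds\n"
    "Refunds are issued within 14 days.\n"
    "\n"
    "## Exceptions\n"
    "Gift cards are not refundable.\n"
)
sections = chunk_by_heading(doc)
```

Text that appears before the first heading is kept as a section whose heading is `None`, so nothing is silently dropped.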
For PDFs without built-in structure, use layout analysis tools (pdfplumber, Unstructured, Docling) to detect paragraph boundaries, tables, and lists before chunking. Treat tables as atomic units: embed the whole table, or extract the content as structured Markdown before embedding. A table with headers intact is retrievable; a table whose headers are in a separate chunk is not.
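One way to keep a table atomic is to serialise it to Markdown before embedding. The sketch below assumes your extraction tool has already given you the header row and data rows as plain lists; it does not call any particular library, and the column names are hypothetical labels for the example values used earlier.

```python
def table_to_markdown(headers, rows):
    """Render a table as one Markdown string so headers travel with their values."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

# Hypothetical headers for the raw row '2021 | Q3 | 4.2M | 18%'.
headers = ["Year", "Quarter", "Revenue", "Growth"]
rows = [["2021", "Q3", "4.2M", "18%"]]

table_chunk = table_to_markdown(headers, rows)
print(table_chunk)
```

The resulting string is embedded as a single chunk, so a query that mentions a column name now has words in the chunk to match against.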
For legal and regulatory text, clause-level chunking outperforms sentence and paragraph chunking. The legal unit of meaning is the clause; clause boundaries are usually marked by numbering (1.1, 1.2.a) or explicit legal conjunctions. Cutting there preserves the operative unit.
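A rough sketch of clause-level splitting, assuming each clause starts on its own line with numbering such as '2.', '1.1', or '1.2.a'. The pattern and function name are illustrative; real contracts vary, and a wrapped line that happens to begin with a number would be mistaken for a clause start.

```python
import re

# Assumption: a clause begins at the start of a line with a dotted number,
# optionally followed by a single lowercase letter (e.g. "1.2.a").
CLAUSE_START = re.compile(r"^(\d+(?:\.\d+)*(?:\.[a-z])?)\.?\s+", re.MULTILINE)

def chunk_by_clause(text):
    """Split text into (clause_number, clause_text) pairs at numbered clause starts."""
    matches = list(CLAUSE_START.finditer(text))
    clauses = []
    for i, match in enumerate(matches):
        # A clause runs until the next clause start, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        clauses.append((match.group(1), text[match.end():end].strip()))
    return clauses

contract = (
    "1.1 The Supplier shall deliver within 30 days.\n"
    "1.2.a Late delivery incurs a fee\n"
    "of 2% per week.\n"
    "2. This agreement is governed by Irish law.\n"
)
clauses = chunk_by_clause(contract)
```

Any preamble before the first numbered clause is ignored by this sketch, so it would need separate handling in a real pipeline.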
For long documents where every section depends on context established earlier, prepend the section heading to every chunk. This adds a small amount of redundant text per chunk but substantially improves recall on context-dependent queries. The retrieval system can now match a heading-scoped query to a heading-scoped chunk.
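Heading prepending is a small transformation at ingestion time. The sketch below prefixes each chunk with its full heading trail rather than only the nearest heading, which is an assumption on top of the advice above; the function name and the sample contract text are hypothetical.

```python
def with_heading_context(heading_path, chunk_texts):
    """Prefix each chunk with its heading trail so the scope is embedded with the content."""
    prefix = " > ".join(heading_path)
    return [f"{prefix}\n\n{text}" for text in chunk_texts]

chunks = with_heading_context(
    ["Master Services Agreement", "4. Termination"],
    [
        "Either party may terminate with 30 days' written notice.",
        "Fees accrued before termination remain payable.",
    ],
)
print(chunks[0])
```

Each chunk now carries 'Master Services Agreement > 4. Termination' as its first line, so a query about termination terms has scope words to match even when the paragraph itself never uses them.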
| Strategy | Best for | Avoid for | Typical chunk size |
|---|---|---|---|
| Fixed-size (tokens) | Uniform short docs, support tickets, product listings | Contracts, long-form PDFs, tables, technical docs | 256-512 tokens |
| Sentence / paragraph | Long-form prose, articles, blog posts | Tables, legal clauses, heavily cross-referenced docs | 100-400 tokens |
| Heading-based | HTML, Markdown, structured DOCX | PDFs without heading styles, scanned documents | 50-800 tokens (variable) |
| Clause-level | Legal docs, regulations, SLAs, contracts | Narrative prose, Q&A content | 80-300 tokens |
| Layout-aware (PDF) | PDFs with tables, figures, and mixed content | Simple text-only documents (overkill) | Varies by element |
Measuring the failure before changing anything
Before touching your chunking strategy, measure what is actually broken.
Build a golden query set: 100 to 300 queries where you know which chunk or chunks contain the correct answer. Measure chunk recall at k=5, meaning the fraction of queries for which at least one correct chunk appears in the top five results. If chunk recall is above 0.8 but answer quality is still poor, the problem is likely generation or prompt construction. If it is below 0.6, the problem is chunking or retrieval, and chunking is usually where to start.
Then do the manual inspection that most teams skip: take 20 failing queries and look at the chunks that were returned. Are they adjacent to the right answer? Do they contain half of the relevant information? Are they topically correct but structurally incomplete?
If the retrieved chunks are adjacent-but-not-quite, the answer was split at a boundary: fix boundary detection. If they are topically correct but incomplete, you have context loss: fix heading prepending or chunk size. If they are entirely off-topic, the problem is query-document mismatch, which is an embedding or indexing issue. Each failure mode has a different fix, and manual inspection of 20 examples tells you which one you are dealing with.
```python
def evaluate_chunk_recall(query_fn, golden_set, k=5):
    """
    golden_set: list of (query_str, [ground_truth_chunk_ids])
    query_fn: callable(query_str, k) -> [chunk_ids]
    Returns: recall@k as float
    """
    if not golden_set:
        raise ValueError("golden_set must contain at least one query")
    hits = 0
    for query, ground_truth_ids in golden_set:
        retrieved = query_fn(query, k)
        # A query counts as a hit if any ground-truth chunk is in the top k.
        if any(gt in retrieved for gt in ground_truth_ids):
            hits += 1
    return hits / len(golden_set)

# Example output after switching chunking strategies on a 150-query golden set:
# Fixed-size 512: recall@5 = 0.52
# Sentence-boundary: recall@5 = 0.61
# Heading-based: recall@5 = 0.78
# Clause-level (legal corpus): recall@5 = 0.84
```
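For the manual inspection step, it helps to pull the failing queries out alongside what was retrieved. This is a sketch that reuses the same `query_fn` and `golden_set` shapes as the recall function; `collect_failures`, the stub retriever, and the chunk ids are hypothetical.

```python
def collect_failures(query_fn, golden_set, k=5, limit=20):
    """Return up to `limit` failing queries with expected and retrieved chunk ids."""
    failures = []
    for query, ground_truth_ids in golden_set:
        retrieved = query_fn(query, k)
        if not any(gt in retrieved for gt in ground_truth_ids):
            failures.append({
                "query": query,
                "expected": list(ground_truth_ids),
                "retrieved": list(retrieved),
            })
            if len(failures) >= limit:
                break
    return failures

# Stub retriever standing in for a real vector search.
def fake_query_fn(query, k):
    canned = {
        "what is the latency threshold": ["c7", "c2"],
        "who are the parties": ["c1", "c9"],
    }
    return canned.get(query, [])[:k]

golden_set = [
    ("what is the latency threshold", ["c8"]),
    ("who are the parties", ["c1"]),
]
failures = collect_failures(fake_query_fn, golden_set)
```

If chunk ids follow document order, a failure that retrieved `c7` when `c8` was expected is a hint that the answer was split at a boundary.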
The counter-intuitive thing about better embedding models
Better embedding models do improve retrieval. Newer models, and models fine-tuned on domain-specific text, produce more discriminative vectors for complex queries. The improvement is real. But it is marginal on structurally broken chunks.
A better embedding model will not fix a chunk that cuts off a sentence mid-thought. It will not fix a table whose column headers sit in a different chunk from its values. It will not reconstruct a legal clause whose operative definition is three chunks upstream. The vectors it produces will still be wrong, just slightly less wrong.
Fix chunking before evaluating your embedding model. Once chunking is solid, you get a clean signal on whether the model is the remaining bottleneck. Most teams that upgrade their embedding model after fixing chunking see modest additional gains, because chunking was carrying most of the failure load. A few see substantial gains, which tells them the model was also a bottleneck. Either way, you have now separated the variables.
Where to start
If your RAG pipeline is in production and recall is mediocre, here is the sequence:
- Build a 100-query golden set and measure chunk recall at k=5. This takes half a day and gives you a baseline to reason from.
- Manually inspect 20 failing queries. Categorise each: boundary split, context loss, table destruction, or cross-reference breakage.
- Match the failure mode to a chunking fix: boundary detection, heading-based chunking, layout-aware PDF extraction, or clause-level splitting for legal text.
- Re-ingest a sample corpus with the new strategy and re-run your golden set. Do not re-index everything until the sample shows measurable recall improvement.
- Only after chunk recall is above 0.75 at k=5, evaluate whether a different embedding model or a reranker is the next lever to pull.
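The last two steps amount to a simple gate, sketched below. The 0.75 target comes from the list above; the `min_gain` default of 0.05 is an assumed stand-in for 'measurable improvement', and the function name and return strings are illustrative.

```python
def next_step(baseline_recall, sample_recall, min_gain=0.05, target=0.75):
    """Suggest what to do after re-running the golden set on a re-ingested sample."""
    if sample_recall - baseline_recall < min_gain:
        # The new strategy did not clearly help: do not re-index yet.
        return "revisit chunking strategy"
    if sample_recall < target:
        return "re-index, then keep iterating on chunking"
    return "re-index, then evaluate embedding model or reranker"

print(next_step(0.52, 0.54))
print(next_step(0.52, 0.61))
print(next_step(0.52, 0.78))
```

Set `min_gain` to whatever difference is larger than the noise you see when re-running the same golden set twice.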
Most pipelines see 25-40% improvement in chunk recall from fixing chunking alone, without touching the embedding model. The work is not glamorous. There is no model swap to announce. But it is where the failure usually is.