What chunk size should I use for a general-purpose knowledge base?

It depends on document type, but 512-768 tokens with heading-based boundaries is a reasonable starting point for mixed corpora. Run your golden query set against 256, 512, and 768 tokens to find where chunk recall plateaus for your specific content. The right size is the one your documents tell you, not a default.
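One way to read "where chunk recall plateaus" is to pick the largest size that still buys a meaningful gain over the previous one. The sketch below assumes you have already measured recall at each size; the `min_gain` threshold and the recall numbers are hypothetical placeholders, not measurements.

```python
def plateau_size(recall_by_size, min_gain=0.02):
    """Return the last chunk size that improved recall by at least min_gain.

    recall_by_size: dict mapping chunk size in tokens -> measured recall@k.
    min_gain is a hypothetical tolerance; tune it to your golden set's noise.
    """
    sizes = sorted(recall_by_size)
    best = sizes[0]
    for prev, cur in zip(sizes, sizes[1:]):
        if recall_by_size[cur] - recall_by_size[prev] >= min_gain:
            best = cur
        else:
            break  # gains have flattened out; stop at the previous size
    return best


# Illustrative numbers only.
measured = {256: 0.55, 512: 0.70, 768: 0.71}
print(plateau_size(measured))
```

With a small golden set, differences of a point or two are within noise, so a stricter `min_gain` may be the safer choice.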

Do I need a reranker if my recall is still low?

Not until chunking is fixed. A reranker re-orders an already-retrieved candidate set. It cannot surface a chunk that was not retrieved in the first place. If the correct chunk is not in the top 20 results, no reranker will put it in the top 5. Fix chunking so the right chunk is retrieved; then evaluate whether a reranker improves precision within that set.
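A toy illustration of that ceiling, using made-up chunk ids and scores: the reranker only sorts what retrieval handed it, so a chunk missing from the candidate set cannot appear in the output no matter how well it would have scored.

```python
def rerank(candidates, score_fn, top_n=5):
    """Re-order an already-retrieved candidate list; it cannot add new chunks."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]


# Hypothetical ids and scores, for illustration only.
retrieved_top_20 = [f"chunk-{i}" for i in range(20)]  # what the vector search returned
correct_chunk = "chunk-57"                            # never retrieved
relevance = {cid: 1.0 / (i + 1) for i, cid in enumerate(retrieved_top_20)}
relevance[correct_chunk] = 10.0  # a perfect reranker score still does not help

top_5 = rerank(retrieved_top_20, lambda cid: relevance[cid])
print(correct_chunk in top_5)
```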

How do I handle documents with both structured and unstructured content?

Split at the structural boundary and apply different chunking logic per section. Most layout-aware extraction libraries (Unstructured, Docling, pdfplumber) classify elements by type: heading, paragraph, table, list. Route each element type to the appropriate chunking strategy rather than applying one strategy to the whole document.
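A minimal sketch of that routing, assuming the extraction step has already produced typed elements. The `Element` class and its `kind` labels are hypothetical stand-ins, not the API of any of the libraries named above; map your extractor's own element types onto them.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # hypothetical labels: "heading", "paragraph", "table", "list"
    text: str


def route_elements(elements):
    """Group prose under its heading; keep each table as its own chunk."""
    chunks = []
    heading = ""
    buffer = []

    def flush():
        # Emit the prose collected so far, prefixed by the current heading.
        if buffer:
            body = "\n".join(buffer)
            chunks.append(f"{heading}\n{body}" if heading else body)
            buffer.clear()

    for el in elements:
        if el.kind == "heading":
            flush()
            heading = el.text
        elif el.kind == "table":
            flush()
            # Tables are atomic: never merged with or split across prose.
            chunks.append(f"{heading}\n{el.text}" if heading else el.text)
        else:
            buffer.append(el.text)  # paragraphs and lists accumulate
    flush()
    return chunks
```

A real pipeline would likely also cap the size of the prose buffer; that is left out here to keep the routing logic visible.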

Does adding chunk overlap fix the boundary split problem?

Overlap helps at boundaries but does not fix context loss or table destruction. Use 10-20% overlap as a backstop for boundary splits, but treat it as a supplement to good chunking design, not a substitute. A 20% overlap on a 512-token chunk adds roughly 100 tokens of repeated context per chunk, and that accumulates across a large corpus.
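To see how that accumulates, here is a back-of-the-envelope estimate. It assumes a simple sliding window over one continuous token stream; the one-million-token corpus is an arbitrary example, and real pipelines that chunk per document will differ somewhat.

```python
import math


def overlap_overhead(chunk_tokens, overlap_ratio, corpus_tokens):
    """Estimate chunk count and repeated tokens for sliding-window chunking."""
    overlap = int(chunk_tokens * overlap_ratio)  # tokens repeated per chunk
    stride = chunk_tokens - overlap              # new tokens each window advances by
    # Windows needed to cover the corpus when each one advances by `stride`.
    chunks = max(1, math.ceil((corpus_tokens - overlap) / stride))
    return {
        "overlap_tokens": overlap,
        "chunks": chunks,
        "chunks_without_overlap": math.ceil(corpus_tokens / chunk_tokens),
        "repeated_tokens": (chunks - 1) * overlap,
    }


# Arbitrary example corpus of one million tokens.
print(overlap_overhead(512, 0.20, 1_000_000))
```

In this example the overlap raises the chunk count from 1,954 to 2,439, which means about a quarter more vectors to embed, store, and search.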

AI & LLMs | May 24, 2026 | 7 min read | Reviewed May 24, 2026

RAG isn't a search problem — it's a chunking problem

Most teams spend weeks tuning retrieval. The real problem is upstream.

By FlowVerify Editorial Team

There is a pattern in RAG debugging that most teams recognise only after going through it themselves. Retrieval looks fine in demos. You can pull the right context with a well-formed query. In production, with real users asking real questions in real phrasing, the pipeline returns fragments that are adjacent to the answer, or splits the relevant passage across two chunks that never get retrieved together.

The standard response: try a better embedding model. Swap to a newer one. Tune top_k. Add a reranker. Adjust the similarity threshold.

These are all retrieval changes. The RAG chunking problem sits one layer upstream, and retrieval tuning does not touch it.

Why retrieval gets the blame

Retrieval is visible. You can inspect the top-k chunks returned for any query. You can compute cosine similarity scores. You can see reranker outputs. When a query fails, you debug retrieval because you can observe it.

Chunking happens at ingestion time. The artifacts it creates are not obvious at query time. The chunk that fails you looks like any other chunk in your index. You do not know what you did not retrieve, or that a better cut of the same document would have retrieved it.

This visibility asymmetry is why teams spend weeks on retrieval and never open the ingestion pipeline. They optimise the layer they can see and leave broken the layer they cannot.

What a chunk is — and what retrieval does with it

A chunk is the unit of information retrieval. Your embedding model turns each chunk into a vector. At query time, your query becomes a vector, and the search returns the k chunks whose vectors are nearest.

Two things follow from this. A chunk that mixes multiple topics produces an averaged embedding that is less similar to any single-topic query: you lose precision. A chunk that cuts off a thought mid-sentence, or splits a table across two units, loses the structural context that makes the information interpretable.
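A toy sketch of both points, in two dimensions so the numbers are easy to follow. The vectors and chunk ids are invented, and modelling a mixed-topic chunk as the literal average of two topic vectors is a simplifying assumption; real embedding models behave less tidily, but the direction of the effect is the same.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks whose vectors are nearest the query."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy vectors: axis 0 = "billing", axis 1 = "security".
chunks = {
    "billing-only": [1.0, 0.0],
    "security-only": [0.0, 1.0],
    "mixed": [0.5, 0.5],  # assumed average of the two topics
}
billing_query = [1.0, 0.0]

for cid, vec in chunks.items():
    print(cid, round(cosine(billing_query, vec), 3))
print(top_k(billing_query, chunks, k=2))
```

The mixed chunk still ranks, but with a visibly lower score than the single-topic chunk, which is the precision loss described above.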

The embedding model does not see 'this chunk contains partial information.' It embeds what is there. If what is there is structurally incomplete, no improvement in embedding model quality will recover the missing context. The vectors will be wrong not because the model is bad, but because the input is.

Four ways the RAG chunking problem shows up without telling you

Boundary splits. Fixed-size chunking on character or token count does not know where a sentence ends or a paragraph begins. A chunk that ends with 'The threshold for this approach is', followed by a chunk that begins with '15 milliseconds, matching our SLA', will retrieve poorly for 'what is the latency threshold.' Neither chunk contains a complete answer. Both score reasonably on similarity because they are on-topic, but neither is useful on its own.
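The split is easy to reproduce. The sketch below counts words rather than tokens to stay dependency-free, and the surrounding sentences are invented around the example phrase. The sentence splitter is deliberately naive (it would break on abbreviations such as "e.g."), so treat it as an illustration rather than a production boundary detector.

```python
import re


def fixed_size_chunks(text, size):
    """Cut every `size` words, regardless of sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def sentence_chunks(text, max_words):
    """Pack whole sentences into chunks of at most max_words where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)  # an over-long sentence becomes its own chunk
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = ("Requests are retried twice. The threshold for this approach is "
        "15 milliseconds, matching our SLA. Alerts fire after that.")

for chunk in fixed_size_chunks(text, size=10):
    print("fixed   :", chunk)
for chunk in sentence_chunks(text, max_words=12):
    print("sentence:", chunk)
```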

Context loss. Many documents carry context at the top that governs everything below: a section heading, a contract party's name, an effective date. Fixed-size chunking often separates this header from the content it applies to. The third paragraph of a legal clause, chunked in isolation, means something different from the same paragraph read under its heading.

Table destruction. Tables in PDFs or Word documents, processed with naive text extraction, often produce a flat sequence of cell values with no structural relationship. A chunk containing '2021 | Q3 | 4.2M | 18%' is nearly unretrievable. No natural-language query will match raw numbers without the column headers that give them meaning. The data is there; the structure that makes it findable is not.
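A small sketch of keeping headers attached to values before embedding. The column names are hypothetical, since the example row above does not say what its columns are; both helpers assume the extractor has already given you the header row and the data rows separately.

```python
def table_to_markdown(headers, rows):
    """Render a table as Markdown so the headers travel with the values."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def row_to_text(headers, row):
    """Alternative: one self-describing line per row."""
    return ", ".join(f"{h}: {v}" for h, v in zip(headers, row))


# Hypothetical column names for the example row.
headers = ["Year", "Quarter", "Revenue", "Growth"]
rows = [["2021", "Q3", "4.2M", "18%"]]

print(table_to_markdown(headers, rows))
print(row_to_text(headers, rows[0]))
```

Whole-table Markdown suits small tables that fit in one chunk; the per-row form may work better when a table is too long to embed as a single unit.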

Cross-reference breakage. Technical documents, contracts, and regulations use cross-references: 'see Section 4.3' or 'as defined in clause 2.' Fixed-size chunking strips the context that resolves those references. A chunk containing 'as defined in this agreement' is semantically incomplete when the definition appeared three chunks earlier, and the embedding will reflect that.

Fixed-size chunking: when it works and when it lies

Fixed-size chunking is not always wrong. It works when your source documents have uniform structure, consistent paragraph length, and no tables or cross-references. A corpus of short customer support tickets, product descriptions, or FAQ entries might chunk fine at 256 or 512 tokens.

Most real enterprise documents (policies, contracts, technical documentation, long-form reports) do not fit that profile. Chunking them at a fixed token count optimises for uniformity, not meaning.

One diagnostic: if your chunk sizes look clean and uniform in a size histogram, but chunk recall on your golden query set is below 0.6, the likely problem is that uniform does not mean coherent.

Document-aware chunking: matching the cut to the structure

The alternative is to cut where the document tells you to cut.

For structured documents (HTML, Markdown, DOCX with proper heading styles), chunk at heading boundaries. A section under an H2 is a semantically coherent unit. This often produces chunks ranging from 50 to 800 tokens, and that variability is fine: coherent chunks of variable size outperform uniform chunks of incoherent content.
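For Markdown, heading-boundary chunking can be sketched in a few lines. This version assumes ATX-style headings (`#`, `##`) and cuts at H1 and H2 by default; the `max_level` parameter is an illustrative choice. It does not handle `#` lines inside fenced code blocks, which a real implementation would need to skip.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown, max_level=2):
    """Split Markdown into one chunk per heading at or above max_level."""
    chunks = []
    title, lines = None, []

    def emit():
        # Skip the empty preamble before the first heading.
        if title is not None or any(line.strip() for line in lines):
            chunks.append({"heading": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            emit()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)  # deeper headings stay inside the parent chunk
    emit()
    return chunks


doc = "# Guide\nIntro text.\n\n## Setup\nInstall it.\n### Details\nMore.\n\n## Usage\nRun it."
for chunk in chunk_by_heading(doc):
    print(chunk["heading"], "->", repr(chunk["text"]))
```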

For PDFs without built-in structure, use layout analysis tools (pdfplumber, Unstructured, Docling) to detect paragraph boundaries, tables, and lists before chunking. Treat tables as atomic units: embed the whole table, or extract the content as structured Markdown before embedding. A table with headers intact is retrievable; a table whose headers are in a separate chunk is not.

For legal and regulatory text, clause-level chunking outperforms sentence and paragraph chunking. The legal unit of meaning is the clause; clause boundaries are usually marked by numbering (1.1, 1.2.a) or explicit legal conjunctions. Cutting there preserves the operative unit.
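A rough sketch of numbering-based clause splitting, assuming each clause starts on its own line with a number such as 1.1 or 1.2.a. The pattern is an assumption about one common numbering style, and the sample clauses are invented. Note the limitation: a wrapped line that happens to begin with a number ("30 days ...") would be misread as a clause start, so real contracts need a stricter pattern or layout cues.

```python
import re

# Assumed numbering style: 1, 1.1, 1.2.a, optionally followed by "." or ")".
CLAUSE_START = re.compile(r"^(\d+(?:\.\d+)*(?:\.[a-z])?)[.)]?\s+", re.MULTILINE)


def chunk_by_clause(text):
    """Cut at each numbered clause; text before the first clause is dropped."""
    starts = [(m.start(), m.group(1)) for m in CLAUSE_START.finditer(text)]
    chunks = []
    for i, (pos, number) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(text)
        chunks.append({"clause": number, "text": text[pos:end].strip()})
    return chunks


# Invented sample text.
sample = (
    "1.1 The Supplier shall deliver the Goods.\n"
    "1.2 Payment is due within 30 days\n"
    "of the invoice date.\n"
    "1.2.a Late payment accrues interest.\n"
)
for chunk in chunk_by_clause(sample):
    print(chunk["clause"], "->", repr(chunk["text"]))
```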

For long documents where every section depends on context established earlier, prepend the section heading to every chunk. This adds a small amount of redundant text per chunk but substantially improves recall on context-dependent queries. The retrieval system can now match a heading-scoped query to a heading-scoped chunk.
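Heading prepending is a small transformation applied just before embedding. The sketch below goes slightly beyond the text by accepting a full heading path (document title, then section) rather than a single heading; that extension, the function name, and the separator are all illustrative assumptions.

```python
def with_heading_context(heading_path, chunk_text, separator=" > "):
    """Prefix a chunk with the headings it sits under before embedding it."""
    breadcrumb = separator.join(h for h in heading_path if h)
    return f"{breadcrumb}\n\n{chunk_text}" if breadcrumb else chunk_text


# Invented example content.
print(with_heading_context(
    ["Master Services Agreement", "4. Termination"],
    "Either party may terminate with 30 days' notice.",
))
```

Storing the original chunk text separately from the prefixed version lets you embed with context while still showing users the unmodified passage.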

Strategy	Best for	Avoid for	Typical chunk size
Fixed-size (tokens)	Uniform short docs, support tickets, product listings	Contracts, long-form PDFs, tables, technical docs	256-512 tokens
Sentence / paragraph	Long-form prose, articles, blog posts	Tables, legal clauses, heavily cross-referenced docs	100-400 tokens
Heading-based	HTML, Markdown, structured DOCX	PDFs without heading styles, scanned documents	50-800 tokens (variable)
Clause-level	Legal docs, regulations, SLAs, contracts	Narrative prose, Q&A content	80-300 tokens
Layout-aware (PDF)	PDFs with tables, figures, and mixed content	Simple text-only documents (overkill)	Varies by element

Chunking strategy by document type

Measuring the failure before changing anything

Before touching your chunking strategy, measure what is actually broken.

Build a golden query set: 100 to 300 queries where you know which chunk or chunks contain the correct answer. Measure chunk recall at k=5, meaning what fraction of queries have the correct chunk in the top five results. If chunk recall is above 0.8 but answer quality is still poor, the problem is likely generation or prompt construction. If it is below 0.6, the problem is chunking or retrieval, and chunking is usually where to start.

Then do the manual inspection that most teams skip: take 20 failing queries and look at the chunks that were returned. Are they adjacent to the right answer? Do they contain half of the relevant information? Are they topically correct but structurally incomplete?

If the retrieved chunks are adjacent-but-not-quite, the answer was split at a boundary: fix boundary detection. If they are topically correct but incomplete, you have context loss: fix heading prepending or chunk size. If they are entirely off-topic, the problem is query-document mismatch, which is an embedding or indexing issue. Each failure mode has a different fix, and manual inspection of 20 examples tells you which one you are dealing with.
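Manual inspection is the point, but a rough pre-sort can help decide which 20 failures to read first. The heuristic below assumes chunk ids are `(doc_id, chunk_index)` pairs so that adjacency is computable; that id scheme and the category labels are assumptions, and the output is a hint about where to look, not a diagnosis.

```python
from collections import Counter


def triage_miss(retrieved, ground_truth):
    """Heuristic label for one query. Ids are (doc_id, chunk_index) tuples."""
    if any(r in ground_truth for r in retrieved):
        return "hit"
    # A neighbour of the right chunk was retrieved: possible boundary split.
    for doc, idx in retrieved:
        if any(doc == g_doc and abs(idx - g_idx) == 1 for g_doc, g_idx in ground_truth):
            return "adjacent"
    # Right document, wrong region: possible context loss.
    gt_docs = {doc for doc, _ in ground_truth}
    if any(doc in gt_docs for doc, _ in retrieved):
        return "same_document"
    # Nothing from the right document: possible query-document mismatch.
    return "off_topic"


def triage_summary(results):
    """results: list of (retrieved_ids, ground_truth_ids) pairs."""
    return Counter(triage_miss(r, g) for r, g in results)


# Invented example results.
results = [
    ([("policy", 4), ("policy", 9)], [("policy", 5)]),
    ([("policy", 1)], [("policy", 7)]),
    ([("faq", 2)], [("contract", 3)]),
    ([("contract", 3)], [("contract", 3)]),
]
print(triage_summary(results))
```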

eval_chunk_recall.py

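def evaluate_chunk_recall(query_fn, golden_set, k=5):
    """
    golden_set: list of (query_str, [ground_truth_chunk_ids])
    query_fn:   callable(query_str, k) -> [chunk_ids]
    Returns:    recall@k as float (fraction of queries with at least
                one ground-truth chunk in the top k)
    """
    if not golden_set:
        raise ValueError("golden_set must contain at least one query")
    hits = 0
    for query, ground_truth_ids in golden_set:
        # Set membership keeps the check cheap and ignores duplicate ids.
        retrieved = set(query_fn(query, k))
        if any(gt in retrieved for gt in ground_truth_ids):
            hits += 1
    return hits / len(golden_set)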

# Example output after switching chunking strategies on a 150-query golden set:
# Fixed-size 512:              recall@5 = 0.52
# Sentence-boundary:           recall@5 = 0.61
# Heading-based:               recall@5 = 0.78
# Clause-level (legal corpus): recall@5 = 0.84

The counter-intuitive thing about better embedding models

Better embedding models do improve retrieval. Newer models, fine-tuned on domain-specific text, produce more discriminative vectors for complex queries. The improvement is real. But it is marginal on structurally broken chunks.

A better embedding model will not fix a chunk that cuts off a sentence mid-thought. It will not fix a table whose column headers sit in a different chunk from its values. It will not reconstruct a legal clause whose operative definition is three chunks upstream. The vectors it produces will still be wrong, just slightly less wrong.

Fix chunking before evaluating your embedding model. Once chunking is solid, you get a clean signal on whether the model is the remaining bottleneck. Most teams that upgrade their embedding model after fixing chunking see modest additional gains, because chunking was carrying most of the failure load. A few see substantial gains, which tells them the model was also a bottleneck. Either way, you have now separated the variables.

Where to start

If your RAG pipeline is in production and recall is mediocre, here is the sequence:

1. Build a 100-query golden set and measure chunk recall at k=5. This takes half a day and gives you a baseline to reason from.
2. Manually inspect 20 failing queries. Categorise each: boundary split, context loss, table destruction, or cross-reference breakage.
3. Match the failure mode to a chunking fix: boundary detection, heading-based chunking, layout-aware PDF extraction, or clause-level splitting for legal text.
4. Re-ingest a sample corpus with the new strategy and re-run your golden set. Do not re-index everything until the sample shows measurable recall improvement.
5. Only after chunk recall is above 0.75 at k=5, evaluate whether a different embedding model or a reranker is the next lever to pull.

Most pipelines see 25-40% improvement in chunk recall from fixing chunking alone, without touching the embedding model. The work is not glamorous. There is no model swap to announce. But it is where the failure usually is.

Frequently asked questions
