Technical / AI/ML / September 2024 · 12 min read

RAG Retrieval of 1000s of Chunks

How to build RAG systems that can effectively retrieve and reason over thousands of document chunks using Recursive Context Synthesis.

Tags: Vector Search · Hybrid Retrieval · LLMs · Embedding Models

The Problem with Standard RAG

Standard RAG (Retrieval-Augmented Generation) systems work well when the answer to a question is contained in a small number of document chunks, typically 5 to 10. But enterprise use cases often require synthesizing information from dozens or hundreds of documents.

Consider a legal team asking “What are all the indemnification clauses across our 500 vendor contracts?” or a researcher asking “Summarize the key findings from the last 3 years of clinical trials on drug X.” These queries can’t be answered by retrieving the top-k most similar chunks.


Why Top-K Retrieval Fails

Traditional top-k retrieval has fundamental limitations:

  • Scattered information: Relevant information is distributed across many documents
  • Semantic similarity gaps: Not all relevant chunks score high on similarity
  • Context window limits: Can’t fit all retrieved content into the LLM
  • No iterative refinement: No mechanism for exploration or follow-up

Recursive Context Synthesis

We developed an approach called Recursive Context Synthesis that handles large-scale retrieval through a hierarchical summarization and refinement process. Here’s how it works:

Step 1: Broad Retrieval

Instead of retrieving the top-k chunks, we retrieve a much larger set, often hundreds or thousands of potentially relevant chunks. We use hybrid retrieval combining dense embeddings with sparse keyword matching to maximize recall.

# Hybrid retrieval with high recall: cast a wide net before filtering.
# `vector_store` and `bm25_index` are assumed to be pre-built indices over the same chunked corpus.
dense_results = vector_store.search(query_embedding, top_k=500)
sparse_results = bm25_index.search(query_keywords, top_k=500)
candidates = merge_and_deduplicate(dense_results, sparse_results)
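
The merge step isn't pinned down above; one common way to fuse dense and sparse rankings is reciprocal rank fusion (RRF). A minimal sketch, assuming each result list is ordered best-first and each entry exposes a chunk ID:

# Sketch of merge_and_deduplicate using reciprocal rank fusion (RRF).
# Assumes results are lists of (chunk_id, score) pairs, best first; k=60 is a common RRF constant.
def merge_and_deduplicate(dense_results, sparse_results, k=60):
    fused = {}
    for results in (dense_results, sparse_results):
        for rank, (chunk_id, _score) in enumerate(results):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    # Duplicates collapse on chunk_id; highest fused score comes first.
    return sorted(fused, key=fused.get, reverse=True)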

Step 2: Relevance Filtering

A fast classifier (we use a fine-tuned DistilBERT) scores each candidate chunk for relevance to the specific query. This is more accurate than embedding similarity alone because it’s trained on query-document pairs.
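A minimal sketch of this filtering step with the Hugging Face transformers library. The checkpoint name is a stand-in (the fine-tuned classifier itself isn't published here), and the 0.5 threshold is illustrative:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in checkpoint; in practice this would be the fine-tuned relevance classifier.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
model.eval()

def filter_relevant(query, chunks, threshold=0.5, batch_size=32):
    kept = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        # Score each (query, chunk) pair jointly, as a cross-encoder classifier does.
        inputs = tokenizer([query] * len(batch), batch, truncation=True,
                           padding=True, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)[:, 1]
        kept.extend(chunk for chunk, p in zip(batch, probs.tolist()) if p >= threshold)
    return kept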

Step 3: Hierarchical Summarization

The filtered chunks are grouped into batches that fit within the LLM’s context window. Each batch is summarized with respect to the query. The summaries are then recursively combined until we have a single, comprehensive synthesis.

Level 0: 1000 chunks → 100 batch summaries
Level 1: 100 summaries → 10 section summaries
Level 2: 10 sections → 1 final synthesis
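
In code, the hierarchy above is a simple map-reduce recursion. This sketch assumes a summarize(query, texts) helper that wraps the LLM call and a batch size chosen to fit the context window; both are illustrative:

def recursive_synthesize(query, texts, summarize, batch_size=10):
    # Base case: the remaining texts fit in a single LLM call.
    if len(texts) <= batch_size:
        return summarize(query, texts)
    # Map: summarize each batch with respect to the query.
    summaries = [summarize(query, texts[i:i + batch_size])
                 for i in range(0, len(texts), batch_size)]
    # Reduce: recurse on the summaries until one synthesis remains.
    return recursive_synthesize(query, summaries, summarize, batch_size)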

Step 4: Citation Tracking

Throughout the process, we maintain provenance links from the final synthesis back to the original chunks. This enables the system to cite specific sources and allows users to drill down into the underlying documents.
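
One simple way to keep those provenance links is to carry source chunk IDs alongside every intermediate summary, so each level of the hierarchy inherits the citations of its inputs. The structure below is a sketch with illustrative names, not the production schema:

from dataclasses import dataclass, field

@dataclass
class Summary:
    text: str
    source_chunk_ids: list = field(default_factory=list)  # provenance back to original chunks

def summarize_with_provenance(query, items, llm_summarize):
    # Inputs may be raw chunks (dicts) or Summary objects from a lower level.
    texts, sources = [], []
    for item in items:
        if isinstance(item, Summary):
            texts.append(item.text)
            sources.extend(item.source_chunk_ids)
        else:
            texts.append(item["text"])
            sources.append(item["chunk_id"])
    return Summary(text=llm_summarize(query, texts), source_chunk_ids=sources)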


Implementation Details

Chunking Strategy

Chunk size matters more than you think. We use semantic chunking that respects document structure (paragraphs, sections, list items) rather than fixed token counts. Each chunk includes metadata about its position in the document hierarchy.
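
As a sketch of what this looks like in practice (the parsed-section format here is an assumption, not a fixed schema), each chunk keeps its place in the document hierarchy:

# Illustrative semantic chunker: split on parsed structure, not fixed token counts.
# Assumes `sections` is a list of {"title": str, "paragraphs": [str, ...]} dicts.
def semantic_chunks(doc_id, sections):
    chunks = []
    for s_idx, section in enumerate(sections):
        for p_idx, paragraph in enumerate(section["paragraphs"]):
            chunks.append({
                "chunk_id": f"{doc_id}-{s_idx}-{p_idx}",
                "text": paragraph,
                "metadata": {
                    "doc_id": doc_id,
                    "section_title": section["title"],
                    "section_index": s_idx,
                    "paragraph_index": p_idx,
                },
            })
    return chunks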

Embedding Model Selection

For enterprise documents, we’ve found that domain-adapted embedding models significantly outperform general-purpose models. We typically fine-tune on a corpus of similar documents using contrastive learning.

Recall@100 by embedding model:

  • OpenAI text-embedding-3-large: 78% (baseline)
  • Cohere embed-v3: 81%
  • Domain fine-tuned (ours): 92%
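
A minimal sketch of the contrastive fine-tuning mentioned above, using the sentence-transformers library with in-batch negatives. The base model, example pairs, and hyperparameters are all illustrative:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative base model; start from the strongest general-purpose checkpoint available.
model = SentenceTransformer("all-MiniLM-L6-v2")

# (query, relevant passage) pairs mined from the target corpus; these examples are placeholders.
pairs = [
    ("What is the indemnification cap?", "The indemnification obligations shall not exceed ..."),
    ("What is the contract termination notice period?", "Either party may terminate with 60 days written notice ..."),
]

train_examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: every other passage in the batch acts as a negative for each query.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)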

Handling Contradictions

When synthesizing large document sets, contradictions are inevitable. We explicitly prompt the LLM to identify and flag conflicting information rather than silently resolving it. The final output includes a “conflicts” section when applicable.
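
As an illustration (not the production prompt), the synthesis instruction can make this explicit:

# Illustrative fragment of the synthesis prompt; wording is a sketch.
SYNTHESIS_INSTRUCTIONS = """
Answer the question using only the provided summaries.
If sources disagree, do not resolve the disagreement silently.
Instead, add a "Conflicts" section listing each contradiction and the sources involved.
"""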


Performance Characteristics

The recursive approach trades latency for quality. A typical query over 10,000 documents takes 30-60 seconds: too slow for interactive chat, but well suited to complex research questions where accuracy matters more than speed.

  • 10K+ documents per query (comprehensive coverage)
  • 94% answer completeness (measured on our test set)
  • 30-60s query latency (for complex queries)

When to Use This Approach

Recursive Context Synthesis is ideal for:

  • Research and analysis queries over large document corpora
  • Compliance and audit questions spanning many records
  • Due diligence and document review workflows
  • Any question where completeness matters more than speed

For interactive chat where users expect sub-second responses, standard top-k RAG is still the right choice. The key is matching the retrieval strategy to the use case.


Try It Yourself

This approach is built into Elastiq Discover. If you’re building RAG systems that need to reason over large document sets, we’d love to show you how it works in practice.
