The Problem with Standard RAG
Standard RAG (Retrieval-Augmented Generation) systems work well when the answer to a question is contained in a small number of document chunks (typically 5-10). But enterprise use cases often require synthesizing information from dozens or hundreds of documents.
Consider a legal team asking “What are all the indemnification clauses across our 500 vendor contracts?” or a researcher asking “Summarize the key findings from the last 3 years of clinical trials on drug X.” These queries can’t be answered by retrieving the top-k most similar chunks.
Why Top-K Retrieval Fails
Traditional top-k retrieval has fundamental limitations:
- Scattered information: Relevant content is spread across far more documents than any top-k cutoff covers
- Semantic similarity gaps: Not every relevant chunk scores highly against the query embedding
- Context window limits: The full set of relevant content won’t fit into the LLM’s context window
- No iterative refinement: There is no mechanism for exploring the corpus or issuing follow-up retrievals
Recursive Context Synthesis
We developed an approach called Recursive Context Synthesis that handles large-scale retrieval through a hierarchical summarization and refinement process. Here’s how it works:
Step 1: Broad Retrieval
Instead of retrieving top-k chunks, we retrieve a much larger set, often hundreds or thousands of potentially relevant chunks. We use hybrid retrieval combining dense embeddings with sparse keyword matching to maximize recall.
# Hybrid retrieval with high recall.
# vector_store and bm25_index are pre-built dense and keyword indexes over the chunked corpus.
dense_results = vector_store.search(query_embedding, top_k=500)
sparse_results = bm25_index.search(query_keywords, top_k=500)
candidates = merge_and_deduplicate(dense_results, sparse_results)
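The merge step can be as simple as reciprocal rank fusion over the two ranked lists. Here’s a minimal sketch of what merge_and_deduplicate might look like; it assumes each hit exposes a stable .id attribute, and the fusion constant k=60 is just a common default:

def merge_and_deduplicate(dense_results, sparse_results, k=60):
    """Fuse dense and sparse hits with reciprocal rank fusion, keeping each chunk once."""
    fused = {}  # chunk id -> (fused score, chunk)
    for results in (dense_results, sparse_results):
        for rank, chunk in enumerate(results):
            score, _ = fused.get(chunk.id, (0.0, chunk))
            fused[chunk.id] = (score + 1.0 / (k + rank + 1), chunk)
    # Highest fused score first; downstream stages see each chunk exactly once.
    return [chunk for _, chunk in sorted(fused.values(), key=lambda pair: pair[0], reverse=True)]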
Step 2: Relevance Filtering
A fast classifier (we use a fine-tuned DistilBERT) scores each candidate chunk for relevance to the specific query. This is more accurate than embedding similarity alone because it’s trained on query-document pairs.
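Here’s a rough sketch of this stage using the Hugging Face transformers API. The checkpoint path is a placeholder for the fine-tuned model, chunks are assumed to expose a .text attribute, and label 1 is assumed to mean "relevant":

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "your-org/distilbert-relevance-filter"  # placeholder for the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

def filter_relevant(query, chunks, threshold=0.5, batch_size=64):
    """Score (query, chunk) pairs and keep chunks whose relevance probability clears the threshold."""
    kept = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        inputs = tokenizer([query] * len(batch), [c.text for c in batch],
                           truncation=True, padding=True, return_tensors="pt")
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[:, 1]  # P(relevant)
        kept.extend(c for c, p in zip(batch, probs.tolist()) if p >= threshold)
    return kept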
Step 3: Hierarchical Summarization
The filtered chunks are grouped into batches that fit within the LLM’s context window. Each batch is summarized with respect to the query. The summaries are then recursively combined until we have a single, comprehensive synthesis.
Level 0: 1000 chunks → 100 batch summaries
Level 1: 100 summaries → 10 section summaries
Level 2: 10 sections → 1 final synthesis
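In code, the recursion is short. The sketch below assumes llm_summarize(query, texts) wraps whatever LLM call you use to summarize a single batch with respect to the query, and that batch_size is tuned so a batch fits in the context window:

def synthesize(query, texts, llm_summarize, batch_size=20):
    """Recursively summarize texts until a single query-focused synthesis remains."""
    # Base case: everything fits in one batch, so one call produces the final synthesis.
    if len(texts) <= batch_size:
        return llm_summarize(query, texts)
    # Level N: summarize each batch with respect to the query.
    summaries = [llm_summarize(query, texts[i:i + batch_size])
                 for i in range(0, len(texts), batch_size)]
    # Level N+1: recurse on the batch summaries (e.g. 1000 -> 100 -> 10 -> 1).
    return synthesize(query, summaries, llm_summarize, batch_size)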
Step 4: Citation Tracking
Throughout the process, we maintain provenance links from the final synthesis back to the original chunks. This enables the system to cite specific sources and allows users to drill down into the underlying documents.
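One lightweight way to do this is to carry source chunk IDs alongside every summary as it moves up the hierarchy. A sketch, with an illustrative Summary shape (not the exact data model):

from dataclasses import dataclass

@dataclass
class Summary:
    text: str
    sources: list[str]  # IDs of the original chunks this summary draws on

def combine(query, summaries, llm_summarize):
    """Merge a batch of summaries while preserving provenance links to the original chunks."""
    merged_text = llm_summarize(query, [s.text for s in summaries])
    merged_sources = sorted({cid for s in summaries for cid in s.sources})
    # The final synthesis can cite merged_sources, and users can drill down from there.
    return Summary(text=merged_text, sources=merged_sources)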
Implementation Details
Chunking Strategy
Chunk size matters more than you think. We use semantic chunking that respects document structure (paragraphs, sections, list items) rather than fixed token counts. Each chunk includes metadata about its position in the document hierarchy.
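Concretely, each chunk might carry metadata along these lines (the field names are illustrative, not a fixed schema):

from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str
    doc_id: str
    section_path: list[str]  # e.g. ["Agreement", "9. Indemnification", "9.2"]
    position: int            # order of the chunk within the source document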
Embedding Model Selection
For enterprise documents, we’ve found that domain-adapted embedding models significantly outperform general-purpose models. We typically fine-tune on a corpus of similar documents using contrastive learning.
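A typical contrastive fine-tuning loop with sentence-transformers looks roughly like the sketch below. It assumes training_pairs is a list of (query, relevant passage) pairs mined from your corpus, and the base model name is just an example:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # example base model
train_examples = [InputExample(texts=[query, passage]) for query, passage in training_pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
# In-batch negatives: every other passage in a batch acts as a negative for each query.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)
model.save("domain-adapted-embeddings")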
Handling Contradictions
When synthesizing large document sets, contradictions are inevitable. We explicitly prompt the LLM to identify and flag conflicting information rather than silently resolving it. The final output includes a “conflicts” section when applicable.
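In practice this lives in the synthesis prompt itself. A sketch, reusing the illustrative Summary shape from the citation-tracking example above (the wording is not a fixed template):

CONFLICT_PROMPT = """Answer the question below using only the provided summaries.
If any summaries contradict each other, do not resolve the conflict silently.
Instead, add a "Conflicts" section that states both claims and cites their sources.

Question: {query}

Summaries:
{summaries}
"""

def build_synthesis_prompt(query, summaries):
    # Each summary carries its source chunk IDs so the model can cite them.
    lines = "\n".join(f"[sources: {', '.join(s.sources)}] {s.text}" for s in summaries)
    return CONFLICT_PROMPT.format(query=query, summaries=lines)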
Performance Characteristics
The recursive approach trades latency for quality. A typical query over 10,000 documents takes 30-60 seconds, which is too slow for interactive chat but well suited to complex research questions where accuracy matters more than speed.
When to Use This Approach
Recursive Context Synthesis is ideal for:
- Research and analysis queries over large document corpora
- Compliance and audit questions spanning many records
- Due diligence and document review workflows
- Any question where completeness matters more than speed
For interactive chat where users expect sub-second responses, standard top-k RAG is still the right choice. The key is matching the retrieval strategy to the use case.
Try It Yourself
This approach is built into Elastiq Discover. If you’re building RAG systems that need to reason over large document sets, we’d love to show you how it works in practice.