2026-01-08
RAG Architectures 2025: Deep Dive
research
Learned: 2026-01-08
Topic: Retrieval-Augmented Generation
Key Insights
- Chunking is king - Matters more than the choice of retrieval method (65% → 92% accuracy)
- Hybrid retrieval dominates - Dense + sparse = 53.4% top-1 recall (vs 48.7% dense alone)
- Reranking is non-negotiable - 35% hallucination reduction
- Latency drives business - Not merely an operational concern; it directly affects customer acquisition and retention
Core RAG Patterns
| Pattern | Use Case | Key Features |
|---|---|---|
| Naive RAG | Prototypes | Simple retrieve-read pipeline |
| Advanced RAG | Production | Pre/post-retrieval optimization |
| Modular RAG | Enterprise | Reconfigurable modules, multi-source |
Chunking Strategies
Optimal settings:
- Size: 256-512 tokens
- Overlap: 10-20% (50-100 tokens)
- Method: Structure-aware (Markdown/HTML headers) first, then recursive splitting as fallback
Impact: Semantic chunking provides a 70% accuracy improvement but costs more compute.
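A minimal sketch of a sliding-window chunker under the settings above (whitespace tokens stand in for a real tokenizer such as tiktoken, and a production version would split on Markdown/HTML headers first):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size chunks with ~12.5% overlap (within the 10-20% guideline)."""
    tokens = text.split()
    step = chunk_size - overlap   # advance, carrying `overlap` tokens of context
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break                 # last window already reached the end
    return chunks

# chunk_text(doc_text) -> list of ~512-token chunks with 64 tokens of overlap
```

The overlap carry-over keeps sentences from being stranded at chunk boundaries.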
Retrieval Methods
| Method | Top-1 Recall | Notes |
|---|---|---|
| Dense (BERT) | 48.7% | Good semantic understanding |
| Sparse (BM25) | 22.1% | Good for exact matches |
| Hybrid | 53.4% | Best of both worlds |
Advanced techniques:
- SPLADE: Learned sparse embeddings with term expansion
- ColBERT: Late interaction; encodes query and document separately, then matches token-level embeddings at scoring time
- HyDE: Generate hypothetical answer, embed, retrieve similar
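One simple way to get the hybrid gains in the table is reciprocal rank fusion (RRF) over the two ranked lists; a minimal sketch with hypothetical document IDs:

```python
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Merge dense and sparse rankings without calibrating their raw scores."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense_ids = ["d3", "d1", "d7"]   # hypothetical output of a vector index
sparse_ids = ["d1", "d9", "d3"]  # hypothetical output of BM25
print(rrf_fuse(dense_ids, sparse_ids))  # d1 and d3 rise to the top
```

The `k` constant damps any single ranker's influence; 60 is the commonly used default.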
Reranking
- Cross-encoders: +28% NDCG@10; the production standard
- LLM-based: Highest accuracy but expensive (+0.9s latency)
- Optimal candidate set: 50-75 documents
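A minimal cross-encoder reranking sketch using the sentence-transformers CrossEncoder API (the checkpoint name is a common public MS MARCO model, not one prescribed here):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    # One relevance score per (query, doc) pair, scored jointly.
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

# Typical use: feed it the 50-75 hybrid-retrieval candidates, keep a handful.
```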
Latest Innovations (2024-2025)
Self-RAG
Self-reflective generation: the model emits reflection tokens that decide when to retrieve and critique its own output. Reported 52% reduction in hallucinations.
Corrective RAG (CRAG)
Scores retrieval confidence and falls back to web search when confidence drops below a threshold.
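A sketch of that control flow (the `retrieve`/`web_search` callables and the 0.7 threshold are illustrative assumptions, not values from the paper):

```python
def corrective_retrieve(query: str, retrieve, web_search, threshold: float = 0.7):
    """CRAG-style fallback: trust local retrieval only above a confidence bar."""
    scored_docs = retrieve(query)                # assumed shape: [(doc, score), ...]
    confident = [doc for doc, score in scored_docs if score >= threshold]
    if confident:
        return confident                         # retrieval looks trustworthy
    return web_search(query)                     # low confidence: widen to the web
```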
GraphRAG
Knowledge-graph-based retrieval. Reported 99% search precision (vs. 70-80% for traditional vector retrieval).
Agentic RAG
An LLM planner breaks complex queries into subqueries, retrieves them in parallel, and synthesizes the results.
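The fan-out step can be as simple as gathering async retrievals (`decompose` and `retrieve_async` are assumed stand-ins for the planner LLM call and an async retriever):

```python
import asyncio

async def agentic_retrieve(question: str, decompose, retrieve_async) -> list:
    """Plan subqueries, retrieve them concurrently, flatten for the final prompt."""
    subqueries = decompose(question)   # e.g. one LLM call returning list[str]
    batches = await asyncio.gather(*(retrieve_async(q) for q in subqueries))
    return [doc for batch in batches for doc in batch]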
Production Considerations
Latency Breakdown (typical 2-7s)
- Query Processing: 50-200ms
- Vector Search: 100-500ms
- Document Retrieval: 200-1000ms
- Reranking: 300-800ms
- LLM Generation: 1000-5000ms
Optimization Strategies
- Batching: 65-70% latency reduction
- Caching: Embed cache, document cache, result cache (embed-cache sketch after this list)
- Scope Reduction: Metadata filtering, query routing
- Smart model routing: 60-80% cost reduction
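A minimal embed-cache sketch, memoizing by content hash so repeated chunks and queries skip the model call (`encode` is an assumed embedding function, e.g. from sentence-transformers):

```python
import hashlib

_embed_cache: dict[str, list[float]] = {}

def cached_embed(text: str, encode) -> list[float]:
    # Content-addressed key: identical text always hits the cache.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embed_cache:
        _embed_cache[key] = encode(text)   # only pay for unseen text
    return _embed_cache[key]
```

A production version would back this with Redis or similar rather than an in-process dict.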
Target Metrics
- Answer Rate: ≥90%
- Cost per 1K calls: $2-8
- Mean Time to Answer: <3 seconds
Decision Tree
Starting RAG project?
├─ Chunking: Recursive (256-512 tokens, 10-20% overlap)
├─ Retrieval: Hybrid (BM25 + dense)
├─ Reranking: Cross-encoder (50-75 candidates)
└─ Evaluation: RAGAS framework
Need better accuracy? → Semantic chunking, GraphRAG
Latency problems? → Batching, caching, streaming
High costs? → Smart routing, optimize chunks
Hallucinations? → Reranking + Self-RAG/CRAG
RAGAS Evaluation Metrics
- Faithfulness (0-1): Factual consistency with context
- Answer Relevancy: How pertinent the answer is to the prompt
- Contextual Relevancy: How relevant the retrieved context is to the question
- Contextual Recall: Does the retrieved context contain ALL the info needed to answer?
- Contextual Precision: Are the relevant documents ranked above the irrelevant ones?
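A minimal evaluation sketch with the ragas package (API as of ragas 0.1.x, so check your version; it needs an LLM backend such as an OpenAI key, and the sample row below is made up):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

# One hypothetical evaluation row; real runs use your logged RAG traces.
dataset = Dataset.from_dict({
    "question":     ["What reduces RAG hallucinations?"],
    "answer":       ["Reranking retrieved documents before generation."],
    "contexts":     [["Cross-encoder reranking cut hallucinations by 35%."]],
    "ground_truth": ["Reranking reduces hallucinations."],
})

scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy,
                                    context_precision, context_recall])
print(scores)  # per-metric averages in [0, 1]
```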
Key Papers
- "Sufficient Context: A New Lens on RAG Systems" (Google, ICLR 2025)
- "Corrective Retrieval Augmented Generation" (arXiv:2401.15884)
- "ColBERTv2: Lightweight Late Interaction" (arXiv:2112.01488)
- "SPLADE v2: Sparse Lexical and Expansion" (arXiv:2109.10086)
Personal Application
This knowledge is directly applicable to:
- Building better search for the knowledge base system
- Understanding job recommendation systems (which use similar retrieval patterns)
- Future projects requiring document retrieval + LLM generation