RAG Research Notes

Hybrid Retrieval: Combining Dense and Sparse Embeddings

Tested a hybrid approach using BM25 for keyword matching alongside a dense bi-encoder (all-MiniLM-L6-v2). Reciprocal rank fusion of the two result sets improved recall@10 by about 12% on our internal benchmark compared to dense-only. The tricky part is tuning the alpha weight that controls how much each retriever contributes to the fused score: too much weight on BM25 and you lose semantic matching, too little and acronyms and proper nouns get missed.
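
Rough sketch of the fusion step, assuming rank_bm25 and sentence-transformers; the toy corpus, the hybrid_search helper, and the weighted-RRF formulation (alpha scales the dense list, 1 - alpha the BM25 list) are illustrative, not the actual benchmark code.

    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util

    corpus = [
        "BM25 matches exact keywords and acronyms.",
        "Dense embeddings capture semantic similarity.",
        "Reciprocal rank fusion combines ranked lists.",
    ]

    # Sparse side: BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

    # Dense side: bi-encoder embeddings, precomputed once.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = model.encode(corpus, convert_to_tensor=True)

    def hybrid_search(query, k=10, alpha=0.5, rrf_k=60):
        """Weighted RRF: alpha weights the dense list, (1 - alpha) the BM25 list."""
        # Rank documents by BM25 score (best first).
        bm25_scores = bm25.get_scores(query.lower().split())
        bm25_rank = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])

        # Rank documents by cosine similarity to the query embedding.
        q_emb = model.encode(query, convert_to_tensor=True)
        dense_scores = util.cos_sim(q_emb, doc_emb)[0]
        dense_rank = sorted(range(len(corpus)), key=lambda i: -float(dense_scores[i]))

        # Fuse: score(d) = sum over lists of weight / (rrf_k + 1-based rank of d).
        fused = {}
        for rank, i in enumerate(bm25_rank):
            fused[i] = fused.get(i, 0.0) + (1 - alpha) / (rrf_k + rank + 1)
        for rank, i in enumerate(dense_rank):
            fused[i] = fused.get(i, 0.0) + alpha / (rrf_k + rank + 1)
        top = sorted(fused.items(), key=lambda item: -item[1])[:k]
        return [(corpus[i], score) for i, score in top]

    print(hybrid_search("what does BM25 match?"))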

Chunking Strategies: Recursive vs. Semantic Splitting

Compared recursive character splitting (LangChain default, 512 tokens with 50-token overlap) against semantic splitting using embedding similarity breakpoints. Semantic splitting produced more coherent chunks for technical documentation but was 4x slower to process. For codebases, the recursive approach with language-aware separators (function boundaries, class definitions) still seems to perform better since code structure is already hierarchical.
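
The two splitter setups, roughly, assuming the langchain-text-splitters package plus tiktoken for token counting; the sample_doc/sample_code strings are placeholders for real documentation and source files.

    from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

    # Prose/docs: recursive splitting with token-based length counting.
    doc_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=512,
        chunk_overlap=50,
    )

    # Code: language-aware separators (class/function boundaries tried first).
    # Note chunk_size here is counted in characters, unlike the tokenized splitter above.
    code_splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON,
        chunk_size=512,
        chunk_overlap=50,
    )

    # Semantic splitting (embedding-similarity breakpoints) lives in
    # langchain_experimental.text_splitter.SemanticChunker; it embeds every
    # sentence, which is where most of the ~4x slowdown comes from.

    sample_doc = "RAG pipelines combine retrieval with generation. " * 40
    sample_code = "def retrieve(query):\n    return query\n\n" * 40
    print(len(doc_splitter.split_text(sample_doc)),
          len(code_splitter.split_text(sample_code)))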

Contextual Compression with Cross-Encoder Reranking

Implemented a two-stage retrieval pipeline: first retrieve top-50 with a bi-encoder, then rerank with a cross-encoder (ms-marco-MiniLM-L-6-v2) down to top-5. The reranking step adds about 200ms latency but significantly improves relevance — especially for multi-hop questions where the answer spans multiple chunks. Next step is to try LLM-based compression to extract only the relevant sentences from each chunk before passing to the generator.
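
A minimal version of that pipeline with sentence-transformers; the corpus list and query stand in for the real chunk store, and the k values mirror the 50 -> 5 setup above.

    from sentence_transformers import CrossEncoder, SentenceTransformer, util

    corpus = [
        "Chunk about retrieval latency and caching.",
        "Chunk about cross-encoder reranking quality.",
        "Chunk about multi-hop questions spanning documents.",
    ]

    bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    doc_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

    def retrieve(query, first_stage_k=50, final_k=5):
        # Stage 1: cheap candidate generation with the bi-encoder.
        q_emb = bi_encoder.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, doc_emb, top_k=first_stage_k)[0]

        # Stage 2: score each (query, chunk) pair jointly with the cross-encoder.
        pairs = [(query, corpus[h["corpus_id"]]) for h in hits]
        scores = cross_encoder.predict(pairs)
        reranked = sorted(zip(hits, scores), key=lambda pair: -pair[1])[:final_k]
        return [(corpus[h["corpus_id"]], float(s)) for h, s in reranked]

    print(retrieve("how much latency does reranking add?"))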

Evaluating RAG with RAGAS: Faithfulness vs. Relevance Metrics

Set up the RAGAS evaluation framework to measure faithfulness (is the answer grounded in retrieved context?) and answer relevance (does the answer actually address the question?). Interesting finding: increasing the number of retrieved chunks from 3 to 8 improved context recall but slightly decreased faithfulness — the model starts hallucinating connections between unrelated chunks. Need to find the sweet spot, possibly with dynamic top-k based on query complexity.
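
Roughly what the evaluation harness looks like; column names and metric imports follow the classic ragas API (newer releases reworked the interface), the single sample row is a placeholder, and ragas needs an LLM configured (e.g. OPENAI_API_KEY) to score anything.

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_recall, faithfulness

    # One toy row; real runs use a few hundred question/answer/context triples.
    eval_rows = {
        "question": ["How is the hybrid retriever weighted?"],
        "answer": ["The alpha weight is tuned on a held-out query set against recall@10."],
        "contexts": [[
            "Hybrid retrieval fuses BM25 and dense results with an alpha weight.",
            "Recall@10 on the internal benchmark is used for tuning.",
        ]],
        "ground_truth": ["Alpha is chosen by sweeping values on a held-out query set."],
    }

    result = evaluate(
        Dataset.from_dict(eval_rows),
        metrics=[faithfulness, answer_relevancy, context_recall],
    )
    print(result)  # scores per metric, e.g. faithfulness / answer_relevancy / context_recall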

Fine-Tuning Embedding Models on Domain-Specific Data

Started fine-tuning a sentence-transformers model on our internal Q&A pairs using contrastive learning. After 3 epochs on ~5k pairs, the domain-adapted model showed a 15% improvement in hit rate on our test set compared to the off-the-shelf model. The key was generating hard negatives — random negatives were too easy and the model didn't learn much. Using in-batch negatives from the same topic cluster made a huge difference.
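
Sketch of the training loop, using the classic sentence-transformers fit() API and MultipleNegativesRankingLoss (the other answers in a batch act as negatives, so batching by topic cluster is what makes them hard); the Q&A triples below are placeholders for the internal pairs.

    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses

    # (question, answer, topic_cluster) placeholders for the internal Q&A pairs.
    qa_pairs = [
        ("How do we rotate API keys?", "Rotation is triggered quarterly by the secrets service.", "security"),
        ("Where are API keys stored?", "Keys live in the central secrets vault.", "security"),
        ("How do we page the on-call?", "Paging goes through the incident bot.", "oncall"),
        ("Who is on-call this week?", "The rotation schedule is listed in the incident bot.", "oncall"),
    ]

    # Sort by topic and keep shuffle off so each batch stays within one cluster,
    # making the in-batch negatives hard. (A real run would shuffle cluster order.)
    qa_pairs.sort(key=lambda row: row[2])
    train_examples = [InputExample(texts=[q, a]) for q, a, _ in qa_pairs]
    train_loader = DataLoader(train_examples, batch_size=2, shuffle=False)

    model = SentenceTransformer("all-MiniLM-L6-v2")
    loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(
        train_objectives=[(train_loader, loss)],
        epochs=3,
        warmup_steps=10,
    )
    model.save("models/domain-adapted-minilm")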