Search and retrieval strategies including semantic, hybrid, and reranking for RAG systems. Use when implementing retrieval mechanisms, optimizing search performance, comparing retrieval approaches, or when user mentions semantic search, hybrid search, reranking, BM25, or retrieval optimization.
Limited to specific tools: This skill is limited to using the following tools.

Additional assets for this skill:
examples/conversational-retrieval.py
examples/metadata-filtering.py
scripts/benchmark-retrieval.py
scripts/evaluate-retrieval-quality.py
templates/hybrid-search.py
templates/multi-query-retrieval.py
templates/reranking.py
templates/semantic-search.py

Purpose: Provide comprehensive retrieval strategies, benchmarking tools, and implementation templates for building high-performance RAG retrieval systems using LlamaIndex and LangChain.
Activation Triggers: mentions of semantic search, hybrid search, reranking, BM25, or retrieval optimization; requests to implement retrieval mechanisms, compare retrieval approaches, or optimize search performance.
Key Resources:
scripts/benchmark-retrieval.py - Performance testing for different retrieval methods
scripts/evaluate-retrieval-quality.py - Quality metrics (precision, recall, MRR, NDCG)
templates/semantic-search.py - Pure vector similarity search
templates/hybrid-search.py - Combined vector + BM25 search
templates/reranking.py - Cross-encoder and LLM-based reranking
templates/multi-query-retrieval.py - Query expansion and fusion
examples/conversational-retrieval.py - Context-aware retrieval
examples/metadata-filtering.py - Filtered retrieval with metadata

Semantic Search - How it works: Embed the query and documents, compute cosine similarity, and return the top-k matches (see the sketch after this strategy's details)
Strengths:
Weaknesses:
When to use:
Performance: ~50-100ms per query (depends on index size)
Template: templates/semantic-search.py
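As a framework-agnostic sketch of that process, the core of semantic search is a cosine-similarity top-k lookup over precomputed embeddings; the vectors here are assumed to come from whatever embedding model you use:

import numpy as np

def top_k_semantic(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Return indices and scores of the k documents most similar to the query.

    query_vec has shape (d,); doc_vecs has shape (n, d). Both are L2-normalized
    so the dot product equals cosine similarity.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    top = np.argsort(scores)[::-1][:k]   # highest similarity first
    return top, scores[top]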
Hybrid Search - How it works: Combine semantic vector search with keyword-based BM25 and merge the results using Reciprocal Rank Fusion (RRF)
Strengths:
Weaknesses:
When to use:
Performance: ~100-200ms per query
Template: templates/hybrid-search.py
Reciprocal Rank Fusion (RRF) formula:
RRF_score(d) = sum(1 / (k + rank_i(d)))
where k = 60 (constant), rank_i(d) = rank of document d in retriever i
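A direct Python translation of the formula (a sketch, not the template's code); each input is a list of document IDs ordered best-first, and ranks are 1-based as in the formula:

from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several best-first ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example: fuse a semantic result list with a BM25 result list
fused = rrf_fuse([["d3", "d1", "d7"], ["d1", "d9", "d3"]])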
Reranking - How it works: Initial retrieval (semantic or hybrid) returns top-k candidates (e.g., 20); a reranker then scores each (query, document) pair and returns the top-n (e.g., 5). See the cross-encoder sketch after this strategy's details.
Strengths:
Weaknesses:
When to use:
Performance: +100-500ms additional latency
Reranker options:
Template: templates/reranking.py
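For the cross-encoder option, a minimal sketch using sentence-transformers; the model name is one common public checkpoint, not a requirement of the template:

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, passage) pair jointly, which is slower
# but more accurate than comparing independently computed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    pairs = [(query, passage) for passage in candidates]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]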
Multi-Query Retrieval - How it works: Generate multiple query variations, retrieve for each, then deduplicate and fuse the results (see the sketch after this strategy's details)
Strengths:
Weaknesses:
When to use:
Performance: 3-5x base retrieval time
Template: templates/multi-query-retrieval.py
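Conceptually, the flow looks like the sketch below, which reuses the rrf_fuse helper from the hybrid-search section; generate_variations stands in for an LLM call that paraphrases the query, and retrieved results are assumed to expose a stable id:

def multi_query_retrieve(query, retriever, generate_variations, top_k=5):
    """Retrieve for the original query plus LLM-generated variations, then fuse."""
    queries = [query] + generate_variations(query)            # e.g. 3-4 paraphrases
    ranked_lists = [
        [result.id for result in retriever.retrieve(q)] for q in queries
    ]
    fused = rrf_fuse(ranked_lists)                            # fusion also deduplicates by ID
    return [doc_id for doc_id, _ in fused[:top_k]]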
Start
│
├─ Need highest quality? → Use Hybrid + Reranking
│
├─ Budget/latency constrained?
│ ├─ Yes → Semantic Search (simplest, fastest)
│ └─ No → Hybrid Search (best default)
│
├─ Queries are ambiguous/complex? → Multi-Query Retrieval
│
├─ Need exact keyword matches? → Hybrid Search
│
└─ Multilingual or conceptual similarity? → Semantic Search
Script: scripts/benchmark-retrieval.py
Measures: latency (p50/p95/p99), throughput, and cost per query
Usage:
python scripts/benchmark-retrieval.py \
--strategies semantic,hybrid,reranking \
--queries queries.jsonl \
--num-runs 100 \
--output benchmark-results.json
Output (latency in ms; throughput in queries/sec; cost per query in USD):
{
"semantic": {
"latency_p50": 75.2,
"latency_p95": 120.5,
"latency_p99": 180.3,
"throughput": 13.3,
"cost_per_query": 0.0001
},
"hybrid": {
"latency_p50": 145.8,
"latency_p95": 220.1,
"latency_p99": 300.5,
"throughput": 6.9,
"cost_per_query": 0.0002
}
}
Script: scripts/evaluate-retrieval-quality.py
Measures: precision@k, recall@k, MRR, NDCG@k, and hit rate@k
Usage:
python scripts/evaluate-retrieval-quality.py \
--strategy hybrid \
--test-set labeled-queries.jsonl \
--k-values 1,3,5,10 \
--output quality-metrics.json
Output:
{
"precision@5": 0.78,
"recall@5": 0.65,
"mrr": 0.72,
"ndcg@5": 0.81,
"hit_rate@5": 0.92
}
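For reference, the core of these metrics is straightforward to compute from labeled data; a sketch of hit rate@k and MRR (the script's actual implementation may differ):

def hit_rate_and_mrr(results_per_query, relevant_per_query, k=5):
    """results_per_query: ranked doc-ID lists (best first), one per query;
    relevant_per_query: sets of relevant doc IDs, one per query."""
    hits, rr_sum = 0, 0.0
    for ranked, relevant in zip(results_per_query, relevant_per_query):
        if any(doc_id in relevant for doc_id in ranked[:k]):
            hits += 1
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr_sum += 1.0 / rank  # reciprocal rank of the first relevant doc
                break
    n = len(results_per_query)
    return {"hit_rate@k": hits / n, "mrr": rr_sum / n}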
Use case: Prototype, low latency requirements, conceptual queries
Template: templates/semantic-search.py
Implementation:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
# Initialize
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
# Retrieve
retriever = index.as_retriever(similarity_top_k=5)
results = retriever.retrieve("query")
Use case: Production systems, balanced performance/quality
Template: templates/hybrid-search.py
Implementation:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
# Vector retriever
vector_retriever = FAISS.from_documents(docs, embeddings).as_retriever(
search_kwargs={"k": 10}
)
# BM25 retriever
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 10
# Ensemble with RRF
ensemble = EnsembleRetriever(
retrievers=[vector_retriever, bm25_retriever],
weights=[0.5, 0.5]
)
results = ensemble.get_relevant_documents("query")
Use case: Quality-critical applications
Template: templates/reranking.py
Implementation:
from llama_index.postprocessor.cohere_rerank import CohereRerank
# Initial retrieval (hybrid recommended; build the retriever with top_k=20)
initial_results = hybrid_retriever.retrieve("query")
# Rerank
reranker = CohereRerank(api_key=api_key, top_n=5)
reranked = reranker.postprocess_nodes(
initial_results,
query_str="query"
)
# reranked already contains only the top 5 nodes (top_n=5)
final_results = reranked
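If a hosted reranking API is not an option, LlamaIndex also provides an LLM-based reranker; a sketch (it uses whichever LLM is configured via Settings unless one is passed explicitly):

from llama_index.core.postprocessor import LLMRerank

# The LLM scores candidate nodes in batches and keeps the best top_n
llm_reranker = LLMRerank(choice_batch_size=5, top_n=5)
reranked = llm_reranker.postprocess_nodes(initial_results, query_str="query")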
Use case: Complex queries, exploration scenarios
Template: templates/multi-query-retrieval.py
Implementation:
from langchain.retrievers import MultiQueryRetriever
from langchain_community.llms import OpenAI
# Generate query variations
retriever = MultiQueryRetriever.from_llm(
retriever=base_retriever,
llm=OpenAI(temperature=0.7)
)
# Automatically generates variations and fuses results
results = retriever.get_relevant_documents("query")
Use case: Chatbots, multi-turn interactions
Example: examples/conversational-retrieval.py
Key features:
Implementation highlights:
# Rewrite query with conversation context
standalone_query = rewrite_with_history(current_query, chat_history)
# Retrieve with rewritten query
results = retriever.retrieve(standalone_query)
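A minimal sketch of such a rewrite step, assuming the OpenAI Python client and a chat model; rewrite_with_history mirrors the helper name above and is not a library function, and the model name is only an example:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite_with_history(current_query, chat_history):
    """Turn a follow-up question into a standalone search query using the chat history."""
    history_text = "\n".join(f"{m['role']}: {m['content']}" for m in chat_history)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; any capable chat model works
        messages=[
            {"role": "system", "content": "Rewrite the user's last question as a standalone search query, resolving pronouns and references from the conversation."},
            {"role": "user", "content": f"Conversation so far:\n{history_text}\n\nLast question: {current_query}"},
        ],
    )
    return response.choices[0].message.content.strip()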
Use case: Filtered search, access control, temporal queries
Example: examples/metadata-filtering.py
Key features:
Implementation highlights:
from llama_index.core.vector_stores import (
    ExactMatchFilter,
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)

# Filter by metadata before retrieval
retriever = index.as_retriever(
    similarity_top_k=5,
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(key="source", value="documentation"),
            # Date cutoff via a comparison operator; operator support varies by vector store
            MetadataFilter(key="timestamp", value="2024-01-01", operator=FilterOperator.GT),
        ]
    ),
)
Fast & Cheap:
text-embedding-3-small (1536 dim, $0.02/1M tokens)

High Quality:
text-embedding-3-large (3072 dim, $0.13/1M tokens)
embed-english-v3.0 (1024 dim, $0.10/1M tokens)

Domain-Specific:
Initial Retrieval:
Final Results:
Techniques:
Target latencies:
Strategies:
# 1. Collect test queries
# Create queries.jsonl with representative queries
# 2. Run performance benchmark
python scripts/benchmark-retrieval.py \
--strategies semantic,hybrid \
--queries queries.jsonl \
--num-runs 100
# 3. Run quality evaluation (requires labeled data)
python scripts/evaluate-retrieval-quality.py \
--strategy hybrid \
--test-set labeled-queries.jsonl \
--k-values 3,5,10
# 4. Compare results
# Analyze benchmark-results.json and quality-metrics.json
Pattern:
import hashlib

def retrieve_with_strategy(query, user_id):
    # A/B test: 50% hybrid, 50% semantic.
    # Use a stable hash so each user always lands in the same bucket
    # (Python's built-in hash() is salted per process).
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 2
    if bucket == 0:
        return hybrid_retriever.retrieve(query)
    return semantic_retriever.retrieve(query)
Problem: Semantic-only search misses exact keyword matches
Solution: Default to hybrid search for production

Problem: Initial retrieval may return results in a poor order
Solution: Add reranking for quality-critical applications

Problem: Retrieving too few documents misses relevant content; too many add noise
Solution: Benchmark top-k with your own data (typically 5-10 for final results)

Problem: Multi-query retrieval without caching can be slow
Solution: Profile each component and optimize the critical paths

Problem: Without labeled data you cannot measure improvement
Solution: Implement evaluation with a labeled test set
Scripts:
benchmark-retrieval.py - Measure latency, throughput, cost
evaluate-retrieval-quality.py - Precision, recall, MRR, NDCG

Templates:
semantic-search.py - Vector-only retrieval
hybrid-search.py - Vector + BM25 with RRF
reranking.py - Cross-encoder and LLM reranking
multi-query-retrieval.py - Query expansion and fusion

Examples:
conversational-retrieval.py - Chat context handling
metadata-filtering.py - Filtered retrieval

Documentation:
Supported Frameworks: LlamaIndex, LangChain
Vector Stores: FAISS, Chroma, Pinecone, Weaviate, Qdrant
Rerankers: Cohere, cross-encoders, LLM-based
Best Practice: Start with hybrid search, add reranking if quality is critical, benchmark with your data