
Your RAG system is up and running. Users are asking questions, documents are being retrieved, and your language model is generating responses. Everything looks good in your monitoring dashboard—low latency, high throughput, minimal errors. But then you start getting feedback: "The system gave me information that wasn't in the documents," or "It missed the most relevant section entirely," or worse, "It's making things up."
This is the harsh reality of RAG evaluation. Unlike traditional machine learning where you can rely on accuracy metrics against a held-out test set, evaluating RAG systems requires a more nuanced approach. You need to assess not just whether the system produces the right answer, but whether it retrieves the right information, uses only the information it retrieved, and maintains factual consistency with your knowledge base.
What you'll learn: how to measure retrieval quality with precision, recall, MRR, and NDCG; how to evaluate end-to-end response quality with LLM-based and reference-based metrics; how to measure faithfulness with NLI and semantic-similarity approaches; and how to assemble everything into a production evaluation pipeline while avoiding common pitfalls.
Prerequisites: You should have hands-on experience building RAG systems, including vector databases, embedding models, and retrieval pipelines. Familiarity with evaluation frameworks like RAGAS or deepeval is helpful but not required; we'll build evaluation systems from scratch to understand the underlying mechanics.
RAG evaluation fundamentally differs from traditional NLP evaluation because it involves a two-stage process: retrieval and generation. This creates three distinct but interconnected evaluation dimensions:
Retrieval Precision and Recall measure how well your system finds relevant documents. High precision means most retrieved documents are relevant to the query. High recall means you're not missing relevant documents that exist in your knowledge base.
End-to-End Response Quality evaluates the complete system: whether the final generated response contains the information the user needs, regardless of intermediate retrieval performance.
Faithfulness measures whether the generated response is grounded in the retrieved documents and doesn't introduce information not present in the source material.
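To make the retrieval dimension concrete before building anything, here is a tiny worked example with made-up document IDs: if the system retrieves five documents and two of them are among the four relevant documents in the knowledge base, precision@5 is 0.4 and recall@5 is 0.5.
# Toy example: 5 retrieved documents, 4 relevant documents exist in the knowledge base
retrieved = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
relevant = {"doc_a", "doc_c", "doc_f", "doc_g"}

precision_at_5 = len(relevant.intersection(retrieved)) / len(retrieved)  # 2 / 5 = 0.4
recall_at_5 = len(relevant.intersection(retrieved)) / len(relevant)      # 2 / 4 = 0.5
print(precision_at_5, recall_at_5)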
Let's start with retrieval evaluation, then build toward comprehensive system-level metrics.
Before you can measure anything, you need ground truth. For retrieval evaluation, this means question-document relevance judgments. Here's how to build a robust evaluation dataset:
import pandas as pd
from dataclasses import dataclass
from typing import List, Dict, Optional
import json
@dataclass
class RetrievalExample:
query: str
relevant_doc_ids: List[str]
metadata: Dict = None
def to_dict(self):
return {
'query': self.query,
'relevant_doc_ids': self.relevant_doc_ids,
'metadata': self.metadata or {}
}
class GroundTruthBuilder:
def __init__(self):
self.examples = []
def add_example(self, query: str, relevant_docs: List[str],
category: str = None, difficulty: str = None):
"""Add a labeled example to the ground truth dataset."""
metadata = {}
if category:
metadata['category'] = category
if difficulty:
metadata['difficulty'] = difficulty
example = RetrievalExample(
query=query,
relevant_doc_ids=relevant_docs,
metadata=metadata
)
self.examples.append(example)
def save(self, filepath: str):
"""Save the ground truth dataset."""
data = [ex.to_dict() for ex in self.examples]
with open(filepath, 'w') as f:
json.dump(data, f, indent=2)
# Build a realistic evaluation set for a technical documentation RAG system
builder = GroundTruthBuilder()
# Multi-hop reasoning query
builder.add_example(
query="How do I configure SSL encryption for database connections in production?",
relevant_docs=["doc_ssl_config", "doc_database_security", "doc_production_setup"],
category="configuration",
difficulty="complex"
)
# Factual lookup query
builder.add_example(
query="What is the default timeout for API requests?",
relevant_docs=["doc_api_reference"],
category="factual",
difficulty="simple"
)
# Troubleshooting query requiring multiple sources
builder.add_example(
query="Why am I getting connection refused errors after deployment?",
relevant_docs=["doc_deployment_troubleshooting", "doc_network_config", "doc_firewall_rules"],
category="troubleshooting",
difficulty="complex"
)
builder.save("retrieval_ground_truth.json")
The key insight here is that not all queries are created equal. Simple factual lookups should have perfect precision and recall, while complex multi-step questions might reasonably retrieve broader sets of documents. Categorizing your evaluation examples lets you set appropriate expectations.
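For example, here is a minimal sketch of slicing the saved ground truth by difficulty so you can report metrics per slice (the field names follow the JSON written by GroundTruthBuilder.save above):
import json

# Group saved examples by difficulty so metrics can later be reported per slice
with open("retrieval_ground_truth.json") as f:
    examples = json.load(f)

by_difficulty = {}
for ex in examples:
    difficulty = ex.get("metadata", {}).get("difficulty", "unknown")
    by_difficulty.setdefault(difficulty, []).append(ex["query"])

for difficulty, queries in by_difficulty.items():
    print(f"{difficulty}: {len(queries)} queries")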
Now let's implement precision and recall for retrieval evaluation:
import numpy as np
from typing import Set, List, Tuple
from collections import defaultdict
class RetrievalEvaluator:
def __init__(self, ground_truth_file: str):
with open(ground_truth_file, 'r') as f:
self.ground_truth = json.load(f)
# Index ground truth by query for fast lookup
self.gt_index = {ex['query']: set(ex['relevant_doc_ids'])
for ex in self.ground_truth}
def precision_at_k(self, query: str, retrieved_docs: List[str], k: int = None) -> float:
"""Calculate precision@k for a single query."""
if query not in self.gt_index:
raise ValueError(f"No ground truth for query: {query}")
relevant_docs = self.gt_index[query]
retrieved_set = set(retrieved_docs[:k] if k else retrieved_docs)
if len(retrieved_set) == 0:
return 0.0
return len(relevant_docs & retrieved_set) / len(retrieved_set)
def recall_at_k(self, query: str, retrieved_docs: List[str], k: int = None) -> float:
"""Calculate recall@k for a single query."""
if query not in self.gt_index:
raise ValueError(f"No ground truth for query: {query}")
relevant_docs = self.gt_index[query]
retrieved_set = set(retrieved_docs[:k] if k else retrieved_docs)
if len(relevant_docs) == 0:
return 1.0 # Edge case: no relevant docs means perfect recall
return len(relevant_docs & retrieved_set) / len(relevant_docs)
    def mean_reciprocal_rank(self, query: str, retrieved_docs: List[str]) -> float:
        """Calculate the reciprocal rank for a single query; averaging over queries gives MRR."""
if query not in self.gt_index:
raise ValueError(f"No ground truth for query: {query}")
relevant_docs = self.gt_index[query]
for rank, doc_id in enumerate(retrieved_docs, 1):
if doc_id in relevant_docs:
return 1.0 / rank
return 0.0
def evaluate_batch(self, query_results: List[Tuple[str, List[str]]],
metrics: List[str] = ['precision@5', 'recall@5', 'mrr']) -> Dict:
"""Evaluate multiple queries and return aggregate metrics."""
results = defaultdict(list)
for query, retrieved_docs in query_results:
if 'precision@5' in metrics:
results['precision@5'].append(self.precision_at_k(query, retrieved_docs, 5))
if 'recall@5' in metrics:
results['recall@5'].append(self.recall_at_k(query, retrieved_docs, 5))
if 'mrr' in metrics:
                results['mrr'].append(self.mean_reciprocal_rank(query, retrieved_docs))
# Calculate means and standard deviations
summary = {}
for metric, values in results.items():
summary[f'{metric}_mean'] = np.mean(values)
summary[f'{metric}_std'] = np.std(values)
summary[f'{metric}_values'] = values
return summary
# Example usage with a mock retrieval system
evaluator = RetrievalEvaluator("retrieval_ground_truth.json")
# Simulate retrieval results
mock_results = [
("How do I configure SSL encryption for database connections in production?",
["doc_ssl_config", "doc_api_reference", "doc_database_security", "doc_deployment_guide", "doc_troubleshooting"]),
("What is the default timeout for API requests?",
["doc_api_reference", "doc_configuration", "doc_performance_tuning"]),
]
metrics = evaluator.evaluate_batch(mock_results)
print(f"Precision@5: {metrics['precision@5_mean']:.3f} ± {metrics['precision@5_std']:.3f}")
print(f"Recall@5: {metrics['recall@5_mean']:.3f} ± {metrics['recall@5_std']:.3f}")
Important: These metrics only tell you about retrieval quality, not whether your system ultimately provides useful responses. A document might be technically "relevant" but not contain the specific information needed to answer the user's question.
Real-world retrieval evaluation often requires more nuanced approaches. Here's how to handle common challenges:
class AdvancedRetrievalEvaluator(RetrievalEvaluator):
def __init__(self, ground_truth_file: str):
super().__init__(ground_truth_file)
# Load relevance weights if available (graded relevance)
self.relevance_weights = {}
for ex in self.ground_truth:
query = ex['query']
if 'relevance_scores' in ex:
self.relevance_weights[query] = ex['relevance_scores']
def ndcg_at_k(self, query: str, retrieved_docs: List[str], k: int = 10) -> float:
"""Calculate NDCG@k accounting for graded relevance."""
if query not in self.relevance_weights:
# Fall back to binary relevance
relevant_docs = self.gt_index[query]
relevance_scores = {doc: 1 for doc in relevant_docs}
else:
relevance_scores = self.relevance_weights[query]
# Calculate DCG
dcg = 0.0
for i, doc_id in enumerate(retrieved_docs[:k]):
relevance = relevance_scores.get(doc_id, 0)
dcg += relevance / np.log2(i + 2) # i+2 because log2(1) is 0
# Calculate IDCG (perfect ranking)
sorted_relevance = sorted(relevance_scores.values(), reverse=True)
idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(sorted_relevance[:k]))
return dcg / idcg if idcg > 0 else 0.0
def evaluate_by_category(self, query_results: List[Tuple[str, List[str]]]) -> Dict:
"""Break down metrics by query category."""
category_results = defaultdict(lambda: defaultdict(list))
for query, retrieved_docs in query_results:
# Find the category for this query
category = None
for ex in self.ground_truth:
if ex['query'] == query:
category = ex.get('metadata', {}).get('category', 'unknown')
break
if category is None:
continue
# Calculate metrics for this category
category_results[category]['precision@5'].append(
self.precision_at_k(query, retrieved_docs, 5)
)
category_results[category]['recall@5'].append(
self.recall_at_k(query, retrieved_docs, 5)
)
category_results[category]['ndcg@10'].append(
self.ndcg_at_k(query, retrieved_docs, 10)
)
# Summarize by category
summary = {}
for category, metrics in category_results.items():
summary[category] = {}
for metric, values in metrics.items():
summary[category][f'{metric}_mean'] = np.mean(values)
summary[category][f'{metric}_count'] = len(values)
return summary
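Here is a minimal usage sketch for the advanced evaluator, reusing the mock retrieval results defined earlier (NDCG falls back to binary relevance because our ground-truth file has no relevance_scores field):
adv_evaluator = AdvancedRetrievalEvaluator("retrieval_ground_truth.json")

# Per-category breakdown on the mock retrieval results
for category, metrics in adv_evaluator.evaluate_by_category(mock_results).items():
    print(category, metrics)

# Ranking quality for a single query
query, docs = mock_results[0]
print(f"NDCG@10: {adv_evaluator.ndcg_at_k(query, docs, 10):.3f}")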
Retrieval metrics only tell part of the story. You need to evaluate whether your complete RAG system—retrieval plus generation—produces useful responses. This requires different techniques because you're now evaluating natural language generation quality.
import openai
from typing import Optional
import asyncio
import aiohttp
class EndToEndEvaluator:
    def __init__(self, openai_api_key: str, evaluation_model: str = "gpt-4o"):  # JSON mode below requires a model that supports response_format
self.client = openai.OpenAI(api_key=openai_api_key)
self.evaluation_model = evaluation_model
def evaluate_answer_relevance(self, query: str, generated_answer: str,
context_docs: List[str]) -> Dict:
"""
Use an LLM to evaluate whether the generated answer addresses the query.
"""
evaluation_prompt = f"""
You are evaluating a question-answering system. Given a user query and the system's response,
rate how well the response addresses the query.
Query: {query}
Generated Response: {generated_answer}
Context Documents: {' '.join(context_docs)}
Please evaluate the response on these dimensions:
1. Relevance (1-5): Does the response address the user's query?
2. Completeness (1-5): Does the response fully answer the query?
3. Accuracy (1-5): Based on the context documents, is the response factually correct?
Return your evaluation as JSON:
{{
"relevance": <score>,
"completeness": <score>,
"accuracy": <score>,
"explanation": "<brief explanation of your scoring>"
}}
"""
try:
response = self.client.chat.completions.create(
model=self.evaluation_model,
messages=[{"role": "user", "content": evaluation_prompt}],
temperature=0.1,
response_format={"type": "json_object"}
)
result = json.loads(response.choices[0].message.content)
return result
except Exception as e:
return {
"relevance": 0, "completeness": 0, "accuracy": 0,
"explanation": f"Evaluation failed: {str(e)}"
}
def evaluate_batch_async(self, evaluation_cases: List[Dict]) -> List[Dict]:
"""
Evaluate multiple cases concurrently for better throughput.
"""
async def evaluate_single(case):
return await asyncio.to_thread(
self.evaluate_answer_relevance,
case['query'],
case['generated_answer'],
case['context_docs']
)
async def evaluate_all():
tasks = [evaluate_single(case) for case in evaluation_cases]
return await asyncio.gather(*tasks)
return asyncio.run(evaluate_all())
Warning: LLM-based evaluation can be expensive and inconsistent. Always validate your evaluation prompts on a subset of data and consider using smaller, specialized models for large-scale evaluation.
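With that caveat in mind, here is a small usage sketch (the API key and the answer text are placeholders, not real values):
e2e = EndToEndEvaluator(openai_api_key="your-api-key")

cases = [{
    "query": "What is the default timeout for API requests?",
    "generated_answer": "The default timeout is 30 seconds and can be changed in config.yaml.",
    "context_docs": ["The default timeout for all API requests is 30 seconds. See config.yaml."]
}]

scores = e2e.evaluate_batch_async(cases)
print(scores[0])  # {'relevance': ..., 'completeness': ..., 'accuracy': ..., 'explanation': ...}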
When you have expected answers (ground truth), you can use reference-based metrics:
from sentence_transformers import SentenceTransformer
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import numpy as np
class ReferenceBasedEvaluator:
def __init__(self, embedding_model: str = "all-MiniLM-L6-v2"):
self.sentence_model = SentenceTransformer(embedding_model)
self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# Download required NLTK data
try:
nltk.data.find('tokenizers/punkt')
except LookupError:
nltk.download('punkt')
def semantic_similarity(self, generated_answer: str, reference_answer: str) -> float:
"""Calculate semantic similarity using sentence embeddings."""
generated_embedding = self.sentence_model.encode([generated_answer])
reference_embedding = self.sentence_model.encode([reference_answer])
# Cosine similarity
similarity = np.dot(generated_embedding[0], reference_embedding[0]) / (
np.linalg.norm(generated_embedding[0]) * np.linalg.norm(reference_embedding[0])
)
return float(similarity)
def bleu_score(self, generated_answer: str, reference_answer: str) -> float:
"""Calculate BLEU score for n-gram overlap."""
reference_tokens = nltk.word_tokenize(reference_answer.lower())
generated_tokens = nltk.word_tokenize(generated_answer.lower())
# Use smoothing function to handle short sentences
smoothing = SmoothingFunction().method1
return sentence_bleu([reference_tokens], generated_tokens, smoothing_function=smoothing)
def rouge_scores(self, generated_answer: str, reference_answer: str) -> Dict:
"""Calculate ROUGE scores for summarization-like evaluation."""
scores = self.rouge_scorer.score(reference_answer, generated_answer)
return {
'rouge1_f': scores['rouge1'].fmeasure,
'rouge2_f': scores['rouge2'].fmeasure,
'rougeL_f': scores['rougeL'].fmeasure
}
def comprehensive_evaluation(self, generated_answer: str, reference_answer: str) -> Dict:
"""Run all reference-based metrics."""
results = {
'semantic_similarity': self.semantic_similarity(generated_answer, reference_answer),
'bleu_score': self.bleu_score(generated_answer, reference_answer),
}
rouge_results = self.rouge_scores(generated_answer, reference_answer)
results.update(rouge_results)
return results
# Example usage
ref_evaluator = ReferenceBasedEvaluator()
generated = "The default API timeout is 30 seconds. You can configure this in the settings file."
reference = "API requests timeout after 30 seconds by default. This can be modified in config.yaml."
evaluation = ref_evaluator.comprehensive_evaluation(generated, reference)
print(f"Semantic similarity: {evaluation['semantic_similarity']:.3f}")
print(f"BLEU score: {evaluation['bleu_score']:.3f}")
print(f"ROUGE-L F1: {evaluation['rougeL_f']:.3f}")
Faithfulness is perhaps the most critical metric for RAG systems. A response might be relevant and well-written, but if it contains information not present in the retrieved documents, it's potentially hallucinated content that could mislead users.
The most robust approach to measuring faithfulness uses natural language inference (NLI) models to check whether the generated response is entailed by the source documents:
from transformers import pipeline
import torch
from typing import List, Dict, Tuple
import re
class FaithfulnessEvaluator:
def __init__(self, nli_model: str = "microsoft/deberta-v2-xlarge-mnli"):
self.nli_pipeline = pipeline(
"text-classification",
model=nli_model,
tokenizer=nli_model,
device=0 if torch.cuda.is_available() else -1
)
def split_into_claims(self, text: str) -> List[str]:
"""
Split generated response into individual factual claims.
This is a simplified approach - production systems often use more sophisticated methods.
"""
# Split by sentences, then filter out very short ones
sentences = re.split(r'[.!?]+', text)
claims = []
for sentence in sentences:
sentence = sentence.strip()
if len(sentence) > 10 and not sentence.startswith(('However', 'Additionally', 'Furthermore')):
claims.append(sentence)
return claims
def check_entailment(self, claim: str, context: str) -> Dict:
"""
Check if a claim is entailed by the context using NLI.
"""
# Truncate context if too long (transformer limit)
max_context_length = 1000
if len(context) > max_context_length:
context = context[:max_context_length] + "..."
        try:
            # NLI convention: the premise is the context and the hypothesis is the claim we want supported
            result = self.nli_pipeline({
                "text": context,
                "text_pair": claim
            })
            # Some transformers versions return a one-element list for a single input
            if isinstance(result, list):
                result = result[0]
# Convert to consistent format
label_map = {
'ENTAILMENT': 'entailed',
'CONTRADICTION': 'contradicted',
'NEUTRAL': 'neutral'
}
return {
'label': label_map.get(result['label'], result['label'].lower()),
'confidence': result['score']
}
except Exception as e:
return {
'label': 'error',
'confidence': 0.0,
'error': str(e)
}
def evaluate_faithfulness(self, generated_answer: str, context_documents: List[str]) -> Dict:
"""
Evaluate faithfulness by checking entailment of individual claims.
"""
# Combine context documents
full_context = "\n\n".join(context_documents)
# Extract claims from generated answer
claims = self.split_into_claims(generated_answer)
if not claims:
return {
'faithfulness_score': 1.0,
'num_claims': 0,
'entailed_claims': 0,
'contradicted_claims': 0,
'neutral_claims': 0,
'claim_results': []
}
# Check entailment for each claim
results = []
entailed = 0
contradicted = 0
neutral = 0
for claim in claims:
entailment_result = self.check_entailment(claim, full_context)
results.append({
'claim': claim,
'entailment': entailment_result
})
if entailment_result['label'] == 'entailed':
entailed += 1
elif entailment_result['label'] == 'contradicted':
contradicted += 1
else:
neutral += 1
# Calculate faithfulness score
# Simple approach: (entailed claims) / (total claims)
# More conservative: (entailed claims) / (entailed + contradicted + neutral)
faithfulness_score = entailed / len(claims) if claims else 1.0
return {
'faithfulness_score': faithfulness_score,
'num_claims': len(claims),
'entailed_claims': entailed,
'contradicted_claims': contradicted,
'neutral_claims': neutral,
'claim_results': results
}
# Example usage
faithfulness_evaluator = FaithfulnessEvaluator()
generated_answer = """
The default API timeout is 30 seconds. This setting can be modified in the configuration file.
The system also supports custom retry logic with exponential backoff.
"""
context_docs = [
"API Configuration: The default timeout for all API requests is set to 30 seconds. This can be customized by editing the timeout parameter in config.yaml.",
"Error Handling: The system implements automatic retry with exponential backoff for failed requests."
]
faithfulness_result = faithfulness_evaluator.evaluate_faithfulness(generated_answer, context_docs)
print(f"Faithfulness Score: {faithfulness_result['faithfulness_score']:.3f}")
print(f"Claims Analysis: {faithfulness_result['entailed_claims']}/{faithfulness_result['num_claims']} entailed")
Tip: Different NLI models have varying capabilities and biases. Consider ensemble approaches or fine-tuning on your domain for production systems.
For a faster but less precise approach, you can use semantic similarity between generated content and source documents:
class SemanticFaithfulnessEvaluator:
def __init__(self, embedding_model: str = "all-MiniLM-L6-v2"):
self.sentence_model = SentenceTransformer(embedding_model)
def semantic_faithfulness(self, generated_answer: str, context_documents: List[str],
threshold: float = 0.5) -> Dict:
"""
Measure faithfulness using semantic similarity between answer and context.
"""
# Split answer into sentences
answer_sentences = re.split(r'[.!?]+', generated_answer)
answer_sentences = [s.strip() for s in answer_sentences if len(s.strip()) > 10]
if not answer_sentences:
return {'semantic_faithfulness': 1.0, 'supported_sentences': 0, 'total_sentences': 0}
        # Get normalized embeddings so the dot product below is a true cosine similarity,
        # regardless of which embedding model was configured
        context_embeddings = self.sentence_model.encode(context_documents, normalize_embeddings=True)
        supported_sentences = 0
        sentence_scores = []
        for sentence in answer_sentences:
            sentence_embedding = self.sentence_model.encode([sentence], normalize_embeddings=True)
            # Cosine similarity against every context document; keep the best match
            similarities = np.dot(sentence_embedding, context_embeddings.T)[0]
            max_similarity = np.max(similarities)
sentence_scores.append({
'sentence': sentence,
'max_similarity': float(max_similarity),
'supported': max_similarity > threshold
})
if max_similarity > threshold:
supported_sentences += 1
semantic_faithfulness_score = supported_sentences / len(answer_sentences)
return {
'semantic_faithfulness': semantic_faithfulness_score,
'supported_sentences': supported_sentences,
'total_sentences': len(answer_sentences),
'sentence_scores': sentence_scores,
'threshold': threshold
}
# Usage example
semantic_evaluator = SemanticFaithfulnessEvaluator()
semantic_result = semantic_evaluator.semantic_faithfulness(generated_answer, context_docs)
print(f"Semantic Faithfulness: {semantic_result['semantic_faithfulness']:.3f}")
Now let's put it all together into a production-ready evaluation pipeline:
import logging
from datetime import datetime
from pathlib import Path
import pickle
class RAGEvaluationPipeline:
def __init__(self, config: Dict):
self.config = config
self.retrieval_evaluator = RetrievalEvaluator(config['ground_truth_file'])
        self.faithfulness_evaluator = FaithfulnessEvaluator(config.get('nli_model', 'microsoft/deberta-v2-xlarge-mnli'))
self.end_to_end_evaluator = EndToEndEvaluator(config['openai_api_key'])
        self.reference_evaluator = ReferenceBasedEvaluator()
# Setup logging
self.logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
# Results storage
self.results_dir = Path(config.get('results_dir', 'evaluation_results'))
self.results_dir.mkdir(exist_ok=True)
def evaluate_single_case(self, case: Dict) -> Dict:
"""
Evaluate a single query-response case across all metrics.
"""
query = case['query']
retrieved_docs = case['retrieved_docs']
generated_answer = case['generated_answer']
context_documents = case.get('context_documents', [])
reference_answer = case.get('reference_answer')
        results = {
            'query': query,
            'timestamp': datetime.now().isoformat(),
            'case_id': case.get('case_id', f"case_{hash(query)}"),
            # Carry case metadata through so per-category aggregation works later
            'metadata': case.get('metadata', {})
        }
try:
# Retrieval evaluation
if retrieved_docs:
results['retrieval_precision_at_5'] = self.retrieval_evaluator.precision_at_k(
query, retrieved_docs, 5
)
results['retrieval_recall_at_5'] = self.retrieval_evaluator.recall_at_k(
query, retrieved_docs, 5
)
results['mrr'] = self.retrieval_evaluator.mean_reciprocal_rank(
query, retrieved_docs
)
# Faithfulness evaluation
if context_documents:
faithfulness_result = self.faithfulness_evaluator.evaluate_faithfulness(
generated_answer, context_documents
)
results.update({f'faithfulness_{k}': v for k, v in faithfulness_result.items()})
# End-to-end evaluation (LLM-based)
if self.config.get('use_llm_evaluation', True):
e2e_result = self.end_to_end_evaluator.evaluate_answer_relevance(
query, generated_answer, context_documents
)
results.update({f'e2e_{k}': v for k, v in e2e_result.items()})
# Reference-based evaluation
if reference_answer:
ref_result = self.reference_evaluator.comprehensive_evaluation(
generated_answer, reference_answer
)
results.update({f'ref_{k}': v for k, v in ref_result.items()})
results['evaluation_status'] = 'success'
except Exception as e:
self.logger.error(f"Evaluation failed for query: {query[:100]}... Error: {str(e)}")
results['evaluation_status'] = 'failed'
results['error'] = str(e)
return results
def evaluate_dataset(self, test_cases: List[Dict],
save_results: bool = True) -> Dict:
"""
Evaluate a complete dataset and return aggregate metrics.
"""
individual_results = []
self.logger.info(f"Starting evaluation of {len(test_cases)} cases")
for i, case in enumerate(test_cases):
if i % 10 == 0:
self.logger.info(f"Processed {i}/{len(test_cases)} cases")
result = self.evaluate_single_case(case)
individual_results.append(result)
# Calculate aggregate metrics
aggregate_metrics = self._calculate_aggregate_metrics(individual_results)
# Save results if requested
if save_results:
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_file = self.results_dir / f"evaluation_{timestamp}.json"
full_results = {
'config': self.config,
'aggregate_metrics': aggregate_metrics,
'individual_results': individual_results,
'evaluation_timestamp': datetime.now().isoformat()
}
with open(results_file, 'w') as f:
json.dump(full_results, f, indent=2, default=str)
self.logger.info(f"Results saved to {results_file}")
return aggregate_metrics
def _calculate_aggregate_metrics(self, results: List[Dict]) -> Dict:
"""Calculate aggregate statistics from individual results."""
successful_results = [r for r in results if r.get('evaluation_status') == 'success']
if not successful_results:
return {'error': 'No successful evaluations'}
aggregate = {
'num_cases': len(results),
'successful_cases': len(successful_results),
'success_rate': len(successful_results) / len(results)
}
# Calculate means for numeric metrics
numeric_metrics = [
'retrieval_precision_at_5', 'retrieval_recall_at_5', 'mrr',
'faithfulness_faithfulness_score', 'e2e_relevance', 'e2e_completeness',
'ref_semantic_similarity', 'ref_bleu_score'
]
for metric in numeric_metrics:
values = [r[metric] for r in successful_results if metric in r and isinstance(r[metric], (int, float))]
if values:
aggregate[f'{metric}_mean'] = np.mean(values)
aggregate[f'{metric}_std'] = np.std(values)
aggregate[f'{metric}_count'] = len(values)
# Category breakdown if available
category_metrics = defaultdict(list)
for result in successful_results:
category = result.get('metadata', {}).get('category', 'unknown')
for metric in numeric_metrics:
if metric in result:
category_metrics[f'{category}_{metric}'].append(result[metric])
if category_metrics:
aggregate['by_category'] = {}
for category_metric, values in category_metrics.items():
if values:
aggregate['by_category'][f'{category_metric}_mean'] = np.mean(values)
return aggregate
# Example usage
config = {
'ground_truth_file': 'retrieval_ground_truth.json',
'openai_api_key': 'your-api-key',
'results_dir': './evaluation_results',
'use_llm_evaluation': True,
'nli_model': 'microsoft/deberta-v2-xlarge-mnli'
}
pipeline = RAGEvaluationPipeline(config)
# Mock test cases
test_cases = [
{
'case_id': 'test_001',
'query': 'How do I configure SSL encryption?',
'retrieved_docs': ['doc_ssl_config', 'doc_security', 'doc_deployment'],
'generated_answer': 'To configure SSL encryption, edit the ssl_config section in your configuration file.',
'context_documents': ['SSL configuration requires setting ssl_enabled=true in the config file...'],
'reference_answer': 'SSL can be configured by editing the ssl settings in your configuration file.'
}
]
results = pipeline.evaluate_dataset(test_cases)
print(f"Overall faithfulness: {results.get('faithfulness_faithfulness_score_mean', 'N/A'):.3f}")
Let's build a complete evaluation system for a customer support RAG system. You'll implement custom metrics and run a comprehensive evaluation.
# Exercise: Customer Support RAG Evaluation
# Dataset: Customer support tickets and knowledge base articles
class CustomerSupportEvaluator:
"""Specialized evaluator for customer support RAG systems."""
def __init__(self):
self.categories = ['billing', 'technical', 'account', 'general']
self.severity_levels = ['low', 'medium', 'high', 'critical']
def evaluate_response_tone(self, generated_answer: str) -> Dict:
"""
Evaluate if the response has appropriate tone for customer support.
"""
# Simple keyword-based approach (replace with ML model in production)
positive_indicators = ['please', 'help', 'understand', 'assist', 'resolve', 'apologize']
negative_indicators = ['obviously', 'should have', 'impossible', 'wrong', 'stupid']
text_lower = generated_answer.lower()
positive_count = sum(1 for word in positive_indicators if word in text_lower)
negative_count = sum(1 for word in negative_indicators if word in text_lower)
# Simple scoring
if negative_count > 0:
tone_score = max(0.0, 0.5 - (negative_count * 0.2))
else:
tone_score = min(1.0, 0.7 + (positive_count * 0.1))
return {
'tone_score': tone_score,
'positive_indicators': positive_count,
'negative_indicators': negative_count,
'appropriate_tone': tone_score > 0.6
}
def evaluate_actionability(self, query: str, generated_answer: str) -> Dict:
"""
Check if the response provides actionable steps for the customer.
"""
action_indicators = [
'step', 'click', 'navigate', 'select', 'enter', 'visit',
'contact', 'call', 'email', 'submit', 'follow'
]
answer_lower = generated_answer.lower()
action_count = sum(1 for indicator in action_indicators if indicator in answer_lower)
# Check for numbered lists or bullet points
has_numbered_steps = bool(re.search(r'\d+\.\s', generated_answer))
has_bullets = bool(re.search(r'[•\-*]\s', generated_answer))
actionability_score = min(1.0, (action_count * 0.15) +
(0.3 if has_numbered_steps else 0) +
(0.2 if has_bullets else 0))
return {
'actionability_score': actionability_score,
'action_indicators_count': action_count,
'has_numbered_steps': has_numbered_steps,
'has_bullets': has_bullets,
'is_actionable': actionability_score > 0.4
}
# Your task: Implement the following methods
# 1. evaluate_resolution_time_estimate() - Check if response includes time estimates
# 2. evaluate_escalation_appropriateness() - Determine if issue should be escalated
# 3. comprehensive_customer_support_evaluation() - Combine all metrics
# Test your implementation with these cases:
test_cases = [
{
'query': 'I cannot access my account after the recent update',
'generated_answer': 'I understand your frustration. Please try these steps: 1. Clear your browser cache 2. Try logging in with an incognito window 3. If that doesn\'t work, please contact our technical support team.',
'category': 'technical',
'severity': 'medium'
},
{
'query': 'Why was I charged twice this month?',
'generated_answer': 'You obviously need to check your billing statement more carefully. The charges are clearly listed there.',
'category': 'billing',
'severity': 'high'
}
]
# Implement and test your solution here
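To sanity-check the two methods that are already implemented, you can run them over the test cases first (a quick sketch):
support_evaluator = CustomerSupportEvaluator()
for case in test_cases:
    tone = support_evaluator.evaluate_response_tone(case['generated_answer'])
    actions = support_evaluator.evaluate_actionability(case['query'], case['generated_answer'])
    print(case['category'], tone['tone_score'], actions['actionability_score'])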
Problem: Automated metrics like BLEU or semantic similarity don't capture nuanced aspects of response quality.
Solution: Always complement automated evaluation with human evaluation, especially during system development:
class HybridEvaluator:
    def __init__(self, config: Dict):
        # Reuse the automated pipeline; config is the same dict used to build it above
        self.automated_evaluator = RAGEvaluationPipeline(config)
        self.human_evaluation_queue = []
def flag_for_human_review(self, case: Dict, automated_results: Dict) -> bool:
"""Determine if a case needs human evaluation."""
# Flag cases with conflicting automated metrics
faithfulness = automated_results.get('faithfulness_faithfulness_score', 0)
relevance = automated_results.get('e2e_relevance', 0)
if abs(faithfulness - (relevance / 5.0)) > 0.3: # Normalize relevance to 0-1 scale
return True
# Flag edge cases
if faithfulness < 0.5 or (relevance and relevance < 2):
return True
return False
def evaluate_with_human_fallback(self, cases: List[Dict]) -> Dict:
"""Run automated evaluation with selective human review."""
automated_results = []
human_review_cases = []
for case in cases:
auto_result = self.automated_evaluator.evaluate_single_case(case)
automated_results.append(auto_result)
if self.flag_for_human_review(case, auto_result):
human_review_cases.append({
'case': case,
'auto_result': auto_result,
'review_reason': 'conflicting_metrics'
})
return {
'automated_results': automated_results,
'human_review_queue': human_review_cases,
'review_rate': len(human_review_cases) / len(cases)
}
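A quick sketch of how this fits together (config is the pipeline configuration dict from earlier; the flagging logic itself can be exercised without calling any model):
hybrid = HybridEvaluator(config)

needs_review = hybrid.flag_for_human_review(
    case={'query': 'example'},
    automated_results={'faithfulness_faithfulness_score': 0.4, 'e2e_relevance': 4}
)
print(needs_review)  # True: high relevance but low faithfulness is a conflict worth a human look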
Problem: Comparing systems based on small differences in mean metrics without considering statistical significance.
Solution: Use proper statistical testing:
from scipy import stats
def compare_rag_systems(system_a_results: List[float], system_b_results: List[float],
metric_name: str = "faithfulness") -> Dict:
"""Compare two RAG systems with statistical testing."""
# Basic statistics
mean_a, mean_b = np.mean(system_a_results), np.mean(system_b_results)
    std_a, std_b = np.std(system_a_results, ddof=1), np.std(system_b_results, ddof=1)  # sample std, so the pooled estimate below is unbiased
# Statistical significance test
t_stat, p_value = stats.ttest_ind(system_a_results, system_b_results)
# Effect size (Cohen's d)
pooled_std = np.sqrt(((len(system_a_results) - 1) * std_a**2 +
(len(system_b_results) - 1) * std_b**2) /
(len(system_a_results) + len(system_b_results) - 2))
cohens_d = (mean_a - mean_b) / pooled_std if pooled_std > 0 else 0
# Confidence interval for difference
se_diff = pooled_std * np.sqrt(1/len(system_a_results) + 1/len(system_b_results))
ci_lower = (mean_a - mean_b) - 1.96 * se_diff
ci_upper = (mean_a - mean_b) + 1.96 * se_diff
return {
'metric': metric_name,
'system_a_mean': mean_a,
'system_b_mean': mean_b,
'difference': mean_a - mean_b,
'p_value': p_value,
'significant': p_value < 0.05,
'cohens_d': cohens_d,
'effect_size': 'large' if abs(cohens_d) > 0.8 else ('medium' if abs(cohens_d) > 0.5 else 'small'),
'confidence_interval_95': (ci_lower, ci_upper),
'recommendation': 'System A significantly better' if p_value < 0.05 and mean_a > mean_b
else ('System B significantly better' if p_value < 0.05 and mean_b > mean_a
else 'No significant difference')
}
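For example, with made-up per-query faithfulness scores from two candidate systems:
system_a = [0.82, 0.91, 0.75, 0.88, 0.79, 0.85, 0.90, 0.73]
system_b = [0.78, 0.84, 0.71, 0.80, 0.76, 0.79, 0.83, 0.70]

comparison = compare_rag_systems(system_a, system_b, metric_name="faithfulness")
print(comparison["difference"], comparison["p_value"], comparison["recommendation"])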
Problem: Using the same model family for generation and evaluation can create inflated scores.
Solution: Use diverse evaluation approaches and models:
class MultiModelEvaluator:
    def __init__(self, openai_api_key: str, anthropic_api_key: str = None):
        # Use different model families for evaluation to reduce same-model bias
        self.evaluators = {
            'gpt4': EndToEndEvaluator(openai_api_key, "gpt-4o"),
            'claude': EndToEndEvaluator(anthropic_api_key, "claude-3-sonnet"),  # Hypothetical: would need an Anthropic-compatible client
            'nli_deberta': FaithfulnessEvaluator("microsoft/deberta-v2-xlarge-mnli"),
            'nli_roberta': FaithfulnessEvaluator("roberta-large-mnli")
        }
def ensemble_evaluation(self, query: str, generated_answer: str,
context_docs: List[str]) -> Dict:
"""Get evaluation from multiple models and combine results."""
results = {}
# LLM-based evaluations
for name, evaluator in [('gpt4', self.evaluators['gpt4']),
('claude', self.evaluators['claude'])]:
try:
result = evaluator.evaluate_answer_relevance(query, generated_answer, context_docs)
results[f'{name}_relevance'] = result['relevance']
results[f'{name}_accuracy'] = result['accuracy']
except Exception as e:
results[f'{name}_error'] = str(e)
# NLI-based faithfulness
for name, evaluator in [('deberta', self.evaluators['nli_deberta']),
('roberta', self.evaluators['nli_roberta'])]:
try:
result = evaluator.evaluate_faithfulness(generated_answer, context_docs)
results[f'{name}_faithfulness'] = result['faithfulness_score']
except Exception as e:
results[f'{name}_error'] = str(e)
# Calculate consensus scores
relevance_scores = [v for k, v in results.items() if k.endswith('_relevance')]
faithfulness_scores = [v for k, v in results.items() if k.endswith('_faithfulness')]
if relevance_scores:
results['consensus_relevance'] = np.mean(relevance_scores)
results['relevance_std'] = np.std(relevance_scores)
if faithfulness_scores:
results['consensus_faithfulness'] = np.mean(faithfulness_scores)
results['faithfulness_std'] = np.std(faithfulness_scores)
return results
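A usage sketch (the keys below are placeholders, and as noted above the Claude entry is hypothetical; without a compatible client it simply lands in an error field):
multi = MultiModelEvaluator(openai_api_key="your-openai-key", anthropic_api_key="your-anthropic-key")

ensemble = multi.ensemble_evaluation(
    query="What is the default timeout for API requests?",
    generated_answer="The default timeout is 30 seconds.",
    context_docs=["The default timeout for all API requests is 30 seconds."]
)
print(ensemble.get("consensus_faithfulness"), ensemble.get("faithfulness_std"))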
Problem: Treating all queries the same when they have different complexity levels.
Solution: Stratified evaluation by complexity:
class ComplexityAwareEvaluator:
def __init__(self):
self.complexity_classifier = self._build_complexity_classifier()
def _build_complexity_classifier(self):
"""Simple rule-based complexity classifier."""
def classify_complexity(query: str) -> str:
query_lower = query.lower()
# Count complexity indicators
multi_hop_indicators = ['and', 'also', 'additionally', 'furthermore', 'after', 'before']
comparison_indicators = ['compare', 'difference', 'versus', 'vs', 'better', 'worse']
reasoning_indicators = ['why', 'how', 'explain', 'reason', 'cause']
            complexity_score = 0
            query_tokens = query_lower.split()
            # Multi-hop reasoning (match whole tokens so words like "command" don't count as "and")
            complexity_score += sum(1 for indicator in multi_hop_indicators if indicator in query_tokens)
# Comparison questions
if any(indicator in query_lower for indicator in comparison_indicators):
complexity_score += 2
# Reasoning questions
if any(indicator in query_lower for indicator in reasoning_indicators):
complexity_score += 1
# Question length
if len(query.split()) > 15:
complexity_score += 1
# Multiple question marks or sentences
if query.count('?') > 1 or len(query.split('.')) > 2:
complexity_score += 1
if complexity_score >= 4:
return 'high'
elif complexity_score >= 2:
return 'medium'
else:
return 'low'
return classify_complexity
def evaluate_with_complexity_stratification(self, test_cases: List[Dict]) -> Dict:
"""Evaluate cases grouped by complexity level."""
complexity_groups = {'low': [], 'medium': [], 'high': []}
# Classify and group test cases
for case in test_cases:
complexity = self.complexity_classifier(case['query'])
case['complexity'] = complexity
complexity_groups[complexity].append(case)
# Evaluate each group separately
results = {}
for complexity, cases in complexity_groups.items():
if not cases:
continue
            group_results = self.evaluate_group(cases)  # see the evaluate_group sketch after this class
results[complexity] = {
'count': len(cases),
'metrics': group_results,
'expected_performance': self._get_expected_performance(complexity)
}
return results
def _get_expected_performance(self, complexity: str) -> Dict:
"""Define expected performance thresholds by complexity."""
thresholds = {
'low': {
'faithfulness': 0.9,
'relevance': 4.0,
'precision_at_5': 0.8
},
'medium': {
'faithfulness': 0.8,
'relevance': 3.5,
'precision_at_5': 0.7
},
'high': {
'faithfulness': 0.7,
'relevance': 3.0,
'precision_at_5': 0.6
}
}
return thresholds.get(complexity, thresholds['medium'])
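The stratified evaluator above delegates per-group scoring to self.evaluate_group, which isn't defined in the class itself. A minimal sketch, assuming you wrap the RAGEvaluationPipeline from earlier (the subclass name and pipeline attribute are illustrative, not part of the original class):
class ComplexityAwareEvaluatorWithPipeline(ComplexityAwareEvaluator):
    def __init__(self, pipeline: RAGEvaluationPipeline):
        super().__init__()
        self.pipeline = pipeline  # reuse the full evaluation pipeline for per-group scoring

    def evaluate_group(self, cases: List[Dict]) -> Dict:
        # Delegate to the pipeline and return its aggregate metrics for this group
        return self.pipeline.evaluate_dataset(cases, save_results=False)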
You now have a comprehensive framework for evaluating RAG systems across the three critical dimensions: retrieval performance, end-to-end response quality, and faithfulness. The key insights from this lesson:
Retrieval evaluation using precision, recall, and MRR gives you visibility into whether your system finds relevant documents, but doesn't guarantee useful responses.
End-to-end evaluation requires both automated metrics (BLEU, ROUGE, semantic similarity) and LLM-based evaluation for aspects like relevance and completeness.
Faithfulness evaluation is crucial for preventing hallucinations and maintaining user trust. NLI-based approaches are generally more reliable than similarity-based methods, though they are slower and require heavier models.
Statistical rigor is essential when comparing systems—always test for significance and consider effect sizes, not just mean differences.
Multi-faceted evaluation using diverse models and approaches helps identify biases and provides more robust assessments.
Next steps for mastering RAG evaluation: build out a ground-truth dataset for your own system, run these evaluators against it regularly, use the statistical comparisons above when choosing between candidate systems, and route low-confidence cases to human review. The evaluation techniques you've learned here form the foundation for building reliable, trustworthy RAG systems that users can depend on for accurate information.
Learning Path: RAG & AI Agents