
When your LLM application works perfectly in development but produces bizarre outputs in production, you're experiencing a classic problem: inadequate testing. Unlike traditional software where you test for deterministic outputs, LLM applications require evaluating probabilistic, context-dependent responses that can vary dramatically based on subtle input changes.
Consider this scenario: You've built a customer service chatbot that handles refund requests. During development, it correctly processes simple cases like "I want to return my broken headphones." But in production, users discover it approves expensive electronics returns based on vague complaints, while rejecting legitimate warranty claims due to minor phrasing differences. Without proper evaluation frameworks, these failures only surface when they impact real customers.
By the end of this lesson, you'll have a comprehensive testing strategy for LLM applications that catches problems before production and provides ongoing monitoring for deployed systems.
Prerequisites: You should be familiar with building basic LLM applications using APIs like OpenAI's or open-source models. Basic Python programming skills and familiarity with testing concepts are assumed. Experience with data analysis libraries like pandas will be helpful but not required.
Traditional software testing relies on deterministic inputs and outputs. You test that calculate_tax(100, 0.08) always returns 8.0. But LLM applications break this paradigm. The same prompt might generate different responses due to temperature settings, model updates, or even random sampling during generation.
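To make the contrast concrete, here is a minimal sketch. The deterministic function can be pinned down with an exact-match assertion, while the LLM call (wrapped here in a hypothetical generate_response helper standing in for your model client) may word the same answer differently on every run, so exact-match assertions are too brittle.
def calculate_tax(amount: float, rate: float) -> float:
    return amount * rate

def test_calculate_tax():
    # Deterministic: identical inputs always produce identical output.
    assert calculate_tax(100, 0.08) == 8.0

def show_llm_variability(generate_response):
    # generate_response is a hypothetical wrapper around your LLM client.
    # Even with identical prompts, the two responses may differ in wording,
    # so an exact string comparison is too brittle for LLM outputs.
    prompt = "I want to return my broken headphones."
    first = generate_response(prompt)
    second = generate_response(prompt)
    print(first == second)  # Often False, even when both answers are acceptable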
This non-determinism creates several evaluation challenges:
Output Variability: Even with temperature set to 0, most LLMs can produce slightly different outputs across runs. More problematically, semantically equivalent responses might use different words, sentence structures, or reasoning approaches.
Context Sensitivity: LLM responses depend heavily on conversation history, system prompts, and even the order of examples in few-shot prompts. A customer service bot might handle "cancel my subscription" differently if the previous message was about billing issues versus product complaints.
Emergent Behaviors: Complex prompts can trigger unexpected model behaviors that weren't present in simpler test cases. Multi-step reasoning tasks are especially prone to failure modes that only appear with specific input combinations.
Let's start by building evaluation infrastructure that addresses these challenges systematically.
Your evaluation dataset is the foundation of reliable testing. Unlike unit tests with single inputs and outputs, LLM evaluation requires diverse examples that capture the full range of real-world usage patterns.
Start by analyzing your production data (or expected usage patterns) to identify key dimensions of variation:
import pandas as pd
import json
from typing import Any, Dict, List, Optional
from dataclasses import dataclass
@dataclass
class EvaluationExample:
"""Structure for individual test cases"""
id: str
input_text: str
expected_output: str
category: str
difficulty: str # easy, medium, hard
tags: List[str]
    context: Optional[Dict[str, Any]] = None
# Example: Customer service refund evaluation dataset
refund_examples = [
EvaluationExample(
id="refund_001",
input_text="My headphones stopped working after 2 months. Can I get my money back?",
expected_output="I understand your frustration with the headphones failing so quickly. Since you're within our 6-month warranty period, I can process a full refund. You'll receive an email with return instructions shortly.",
category="warranty_claim",
difficulty="easy",
tags=["electronics", "warranty", "straightforward"]
),
EvaluationExample(
id="refund_002",
input_text="I bought this laptop 8 months ago and now the screen is flickering. This is unacceptable quality. I demand a full refund immediately or I'm calling my lawyer.",
expected_output="I apologize for the screen issue with your laptop. While you're outside our standard 6-month return window, screen flickering could indicate a manufacturing defect covered under warranty. Let me escalate this to our technical support team to determine if this qualifies for warranty repair or replacement.",
category="complex_warranty",
difficulty="hard",
tags=["electronics", "warranty_edge_case", "angry_customer", "escalation"]
),
# More examples covering edge cases...
]
Notice how each example includes metadata beyond just input/output pairs. The category and tags help you analyze performance across different scenarios, while difficulty ratings let you track whether problems concentrate in complex cases.
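Once evaluation results start accumulating, this metadata pays off during analysis. As a rough sketch, assuming a results list of dicts where each record carries its example's category and difficulty plus a pass/fail outcome (the field names here are illustrative), you can slice pass rates with pandas:
import pandas as pd

# Hypothetical per-case results produced by an evaluation run; each record
# carries the metadata from its EvaluationExample plus a boolean outcome.
results = [
    {"id": "refund_001", "category": "warranty_claim", "difficulty": "easy", "passed": True},
    {"id": "refund_002", "category": "complex_warranty", "difficulty": "hard", "passed": False},
]

df = pd.DataFrame(results)

# Pass rate per category reveals which scenarios need more attention.
print(df.groupby("category")["passed"].mean())

# Pass rate per difficulty shows whether failures concentrate in hard cases.
print(df.groupby("difficulty")["passed"].mean())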
Many LLM applications don't have single "correct" answers. A creative writing assistant might generate multiple valid story continuations, or a data analysis tool might choose different but equally appropriate visualization approaches.
For these scenarios, create evaluation examples with multiple acceptable outputs or define success criteria rather than exact matches:
@dataclass
class MultiTargetExample:
id: str
input_text: str
acceptable_outputs: List[str]
success_criteria: List[str]
rejection_criteria: List[str]
creative_writing_example = MultiTargetExample(
id="story_001",
input_text="Continue this story: Sarah walked into the abandoned library and noticed something strange about the books...",
acceptable_outputs=[
"The books were rearranging themselves on the shelves, sliding silently from section to section as if guided by invisible hands.",
"Every book lay open to the same page number - 237 - despite being completely different titles from different eras.",
"The books whispered her name in unison, their pages fluttering without any breeze in the still air."
],
success_criteria=[
"Continues the narrative coherently",
"Maintains the mysterious/supernatural tone",
"Introduces a specific strange element about the books",
"Uses vivid descriptive language"
],
rejection_criteria=[
"Breaks narrative continuity",
"Introduces characters not mentioned in the prompt",
"Shifts to completely different genre (comedy, romance)",
"Provides meta-commentary instead of story continuation"
]
)
This structure allows for flexible evaluation while maintaining clear quality standards.
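One simple way to put this structure to work, sketched below with the sentence-transformers library that reappears later in this lesson, is to accept a response when it is semantically close to any of the acceptable outputs. The 0.6 threshold is an illustrative assumption you would tune on your own data; the success and rejection criteria are better suited to the model-based evaluation covered later.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def matches_any_acceptable(response: str, example: MultiTargetExample,
                           threshold: float = 0.6) -> bool:
    # Accept a response that is semantically close to any acceptable output.
    # The threshold is an assumption for illustration, not a recommended value.
    response_emb = model.encode(response, convert_to_tensor=True)
    acceptable_embs = model.encode(example.acceptable_outputs, convert_to_tensor=True)
    scores = util.cos_sim(response_emb, acceptable_embs)
    return bool(scores.max() >= threshold)

candidate = ("The books began sliding across the shelves on their own, "
             "as though unseen hands were reshelving them.")
print(matches_any_acceptable(candidate, creative_writing_example))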
Rule-based evaluation uses programmatic checks to validate LLM outputs against specific criteria. While less sophisticated than model-based approaches, rule-based methods provide fast, consistent, and interpretable results for many applications.
Many LLM applications require outputs in specific formats: a financial report generator must produce valid JSON, and a code generation tool must output syntactically correct Python. These requirements are perfect for rule-based validation:
import json
import re
from typing import Tuple, List
class FormatValidator:
"""Validates LLM outputs against format requirements"""
def validate_json_response(self, response: str) -> Tuple[bool, str]:
"""Check if response is valid JSON with required fields"""
try:
data = json.loads(response.strip())
required_fields = ["analysis", "confidence", "recommendations"]
missing_fields = [field for field in required_fields if field not in data]
if missing_fields:
return False, f"Missing required fields: {missing_fields}"
return True, "Valid JSON format"
except json.JSONDecodeError as e:
return False, f"Invalid JSON: {str(e)}"
def validate_sql_query(self, response: str) -> Tuple[bool, str]:
"""Basic SQL syntax validation"""
# Extract SQL from response (assuming it's in code blocks)
sql_pattern = r"```sql\s*(.*?)\s*```"
matches = re.findall(sql_pattern, response, re.DOTALL | re.IGNORECASE)
if not matches:
return False, "No SQL code block found"
sql = matches[0].strip()
# Check for dangerous operations in read-only context
dangerous_keywords = ["DROP", "DELETE", "UPDATE", "INSERT", "CREATE", "ALTER"]
for keyword in dangerous_keywords:
if keyword.upper() in sql.upper():
return False, f"Contains prohibited keyword: {keyword}"
# Basic structure validation
if not sql.upper().startswith("SELECT"):
return False, "Query must start with SELECT"
return True, "Valid SQL query"
# Usage example
validator = FormatValidator()
# Test a financial analysis response
llm_response = '''
{
"analysis": "Revenue increased 15% YoY driven by strong Q4 performance",
"confidence": 0.85,
"recommendations": ["Focus on Q4 strategies", "Monitor competitive landscape"]
}
'''
is_valid, message = validator.validate_json_response(llm_response)
print(f"JSON validation: {is_valid} - {message}")
Beyond format validation, you can check content quality using domain-specific rules:
class ContentValidator:
"""Validates content quality and appropriateness"""
def __init__(self):
# Load domain-specific requirements
self.prohibited_phrases = [
"I cannot", "I'm not able to", "I don't know",
"As an AI", "I'm just a language model"
]
self.required_elements = {
"customer_service": ["apology", "solution", "next_steps"],
"technical_explanation": ["definition", "example", "implications"]
}
def check_response_completeness(self, response: str, category: str) -> Tuple[bool, List[str]]:
"""Verify response contains required elements for the category"""
issues = []
# Check for refusal patterns
for phrase in self.prohibited_phrases:
if phrase.lower() in response.lower():
issues.append(f"Contains refusal pattern: '{phrase}'")
# Check category-specific requirements
if category in self.required_elements:
required = self.required_elements[category]
response_lower = response.lower()
for element in required:
# Simple keyword-based checking (could be more sophisticated)
element_keywords = {
"apology": ["sorry", "apologize", "regret", "understand your frustration"],
"solution": ["will", "can", "here's how", "solution", "resolve"],
"next_steps": ["next", "follow up", "contact", "email", "within"]
}
if element in element_keywords:
keywords = element_keywords[element]
if not any(keyword in response_lower for keyword in keywords):
issues.append(f"Missing {element} element")
return len(issues) == 0, issues
def check_response_length(self, response: str, min_words: int = 10, max_words: int = 500) -> Tuple[bool, str]:
"""Validate response length is appropriate"""
word_count = len(response.split())
if word_count < min_words:
return False, f"Response too short: {word_count} words (minimum {min_words})"
elif word_count > max_words:
return False, f"Response too long: {word_count} words (maximum {max_words})"
return True, f"Appropriate length: {word_count} words"
# Example usage
content_validator = ContentValidator()
customer_response = """
I apologize for the inconvenience with your headphones. I understand how frustrating it must be when a product fails so quickly.
I can process a full refund for you since you're within our warranty period. Here's what happens next: you'll receive an email with return instructions within the next hour, and your refund will be processed within 3-5 business days once we receive the item.
Is there anything else I can help you with today?
"""
is_complete, issues = content_validator.check_response_completeness(
customer_response, "customer_service"
)
is_appropriate_length, length_msg = content_validator.check_response_length(customer_response)
print(f"Content completeness: {is_complete}")
if issues:
print(f"Issues found: {issues}")
print(f"Length check: {is_appropriate_length} - {length_msg}")
Rule-based validation excels at catching obvious failures quickly and consistently. However, it struggles with nuanced quality assessment, which requires more sophisticated approaches.
Model-based evaluation uses other AI systems to assess LLM outputs. This approach can capture semantic similarity, quality judgments, and complex criteria that rule-based methods miss.
One of the most effective approaches is using a strong LLM to evaluate outputs from your application's LLM. This "LLM-as-judge" pattern can assess criteria like coherence, relevance, and helpfulness:
import json
import openai
from typing import Dict, List
from dataclasses import dataclass
@dataclass
class EvaluationResult:
    score: float  # 1-10 scale; 0.0 indicates an evaluation failure
reasoning: str
criteria_scores: Dict[str, float]
class LLMJudge:
"""Uses GPT-4 to evaluate LLM outputs"""
def __init__(self, model: str = "gpt-4"):
self.model = model
self.client = openai.OpenAI()
def evaluate_response(self,
input_prompt: str,
llm_output: str,
evaluation_criteria: List[str]) -> EvaluationResult:
"""Evaluate an LLM response against specific criteria"""
criteria_text = "\n".join([f"- {criterion}" for criterion in evaluation_criteria])
judge_prompt = f"""
You are an expert evaluator of AI system outputs. Evaluate the following LLM response based on the given criteria.
USER INPUT:
{input_prompt}
LLM RESPONSE:
{llm_output}
EVALUATION CRITERIA:
{criteria_text}
For each criterion, provide a score from 1-10 where:
1-3: Poor (major issues)
4-6: Adequate (meets basic requirements)
7-8: Good (exceeds expectations)
9-10: Excellent (exceptional quality)
Respond in JSON format:
{{
"overall_score": <average of all criteria>,
"reasoning": "<detailed explanation of the evaluation>",
"criteria_scores": {{
"criterion_1": <score>,
"criterion_2": <score>,
...
}}
}}
"""
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": judge_prompt}],
temperature=0.1 # Lower temperature for more consistent evaluations
)
try:
result = json.loads(response.choices[0].message.content)
return EvaluationResult(
score=result["overall_score"],
reasoning=result["reasoning"],
criteria_scores=result["criteria_scores"]
)
except (json.JSONDecodeError, KeyError) as e:
# Fallback if JSON parsing fails
return EvaluationResult(
score=0.0,
reasoning=f"Evaluation failed: {str(e)}",
criteria_scores={}
)
# Example usage for customer service evaluation
judge = LLMJudge()
customer_input = "I bought a laptop 8 months ago and the screen is flickering. I want a refund."
llm_response = "I understand your frustration. While 8 months is outside our return window, flickering screens may indicate a manufacturing defect covered under warranty. Let me escalate this to determine if you qualify for warranty repair or replacement."
evaluation_criteria = [
"Empathy and understanding of customer concern",
"Accurate application of return/warranty policies",
"Clear communication of next steps",
"Professional and helpful tone",
"Appropriate level of detail"
]
result = judge.evaluate_response(customer_input, llm_response, evaluation_criteria)
print(f"Overall Score: {result.score}/10")
print(f"Reasoning: {result.reasoning}")
print("Criteria Scores:")
for criterion, score in result.criteria_scores.items():
print(f" {criterion}: {score}/10")
For tasks where you have reference outputs, semantic similarity can measure how well the LLM's response matches expected content without requiring exact word matches:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine
import numpy as np
class SemanticEvaluator:
"""Evaluates responses using semantic similarity"""
def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
self.model = SentenceTransformer(model_name)
def calculate_similarity(self, response: str, reference: str) -> float:
"""Calculate semantic similarity between response and reference"""
embeddings = self.model.encode([response, reference])
similarity = 1 - cosine(embeddings[0], embeddings[1])
return float(similarity)
def evaluate_against_references(self,
response: str,
references: List[str]) -> Dict[str, float]:
"""Evaluate response against multiple reference outputs"""
similarities = []
for ref in references:
similarity = self.calculate_similarity(response, ref)
similarities.append(similarity)
return {
"max_similarity": max(similarities),
"mean_similarity": np.mean(similarities),
"all_similarities": similarities
}
# Example usage
evaluator = SemanticEvaluator()
# Test response against multiple valid references
test_response = "I can help you process a refund since your headphones are still under warranty."
reference_responses = [
"I'll be happy to process your refund as the headphones are within the warranty period.",
"Since your headphones are covered under warranty, I can arrange a full refund for you.",
"Your headphones qualify for a warranty refund. I'll start the process right away."
]
similarity_scores = evaluator.evaluate_against_references(test_response, reference_responses)
print(f"Max similarity: {similarity_scores['max_similarity']:.3f}")
print(f"Mean similarity: {similarity_scores['mean_similarity']:.3f}")
# Set threshold for acceptable similarity
acceptable_threshold = 0.7
if similarity_scores['max_similarity'] >= acceptable_threshold:
print("✓ Response semantically matches expected output")
else:
print("✗ Response differs significantly from expected output")
Manual evaluation doesn't scale for production LLM applications. You need automated pipelines that run comprehensive evaluations on every code change and provide continuous monitoring of model performance.
Here's a complete testing pipeline that integrates with your development process:
import json
import numpy as np
import pytest
import pandas as pd
from typing import List, Dict, Any
import asyncio
import aiohttp
import time
from datetime import datetime
class LLMTestSuite:
"""Comprehensive test suite for LLM applications"""
def __init__(self,
model_endpoint: str,
api_key: str,
test_dataset_path: str):
self.endpoint = model_endpoint
self.api_key = api_key
self.test_data = pd.read_json(test_dataset_path)
self.results = []
# Initialize evaluators
self.format_validator = FormatValidator()
self.content_validator = ContentValidator()
self.semantic_evaluator = SemanticEvaluator()
self.llm_judge = LLMJudge()
async def generate_response(self, session: aiohttp.ClientSession, prompt: str) -> str:
"""Generate response from LLM endpoint"""
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"
}
payload = {
"messages": [{"role": "user", "content": prompt}],
"temperature": 0.1,
"max_tokens": 500
}
async with session.post(self.endpoint, json=payload, headers=headers) as response:
result = await response.json()
return result["choices"][0]["message"]["content"]
async def run_comprehensive_evaluation(self) -> Dict[str, Any]:
"""Run full evaluation suite on test dataset"""
start_time = time.time()
async with aiohttp.ClientSession() as session:
# Generate responses for all test cases
tasks = []
for _, row in self.test_data.iterrows():
task = self.evaluate_single_case(session, row)
tasks.append(task)
results = await asyncio.gather(*tasks)
# Aggregate results
evaluation_results = {
"timestamp": datetime.now().isoformat(),
"total_cases": len(results),
"execution_time": time.time() - start_time,
"results": results
}
# Calculate summary statistics
evaluation_results["summary"] = self.calculate_summary_stats(results)
return evaluation_results
async def evaluate_single_case(self, session: aiohttp.ClientSession, test_case: pd.Series) -> Dict[str, Any]:
"""Evaluate a single test case comprehensively"""
try:
# Generate response
response = await self.generate_response(session, test_case["input_text"])
# Run all evaluations
case_result = {
"case_id": test_case["id"],
"category": test_case["category"],
"difficulty": test_case["difficulty"],
"input": test_case["input_text"],
"response": response,
"evaluations": {}
}
# Format validation
if test_case.get("expected_format"):
if test_case["expected_format"] == "json":
is_valid, msg = self.format_validator.validate_json_response(response)
case_result["evaluations"]["format_valid"] = is_valid
case_result["evaluations"]["format_message"] = msg
# Content validation
is_complete, issues = self.content_validator.check_response_completeness(
response, test_case["category"]
)
case_result["evaluations"]["content_complete"] = is_complete
case_result["evaluations"]["content_issues"] = issues
# Semantic similarity (if reference output exists)
if pd.notna(test_case.get("expected_output")):
similarity = self.semantic_evaluator.calculate_similarity(
response, test_case["expected_output"]
)
case_result["evaluations"]["semantic_similarity"] = similarity
# LLM judge evaluation
if test_case.get("evaluation_criteria"):
criteria = test_case["evaluation_criteria"]
judge_result = self.llm_judge.evaluate_response(
test_case["input_text"], response, criteria
)
case_result["evaluations"]["judge_score"] = judge_result.score
case_result["evaluations"]["judge_reasoning"] = judge_result.reasoning
return case_result
except Exception as e:
return {
"case_id": test_case["id"],
"error": str(e),
"evaluations": {}
}
def calculate_summary_stats(self, results: List[Dict[str, Any]]) -> Dict[str, Any]:
"""Calculate aggregate performance metrics"""
total_cases = len(results)
successful_cases = [r for r in results if "error" not in r]
summary = {
"success_rate": len(successful_cases) / total_cases,
"total_cases": total_cases,
"failed_cases": total_cases - len(successful_cases)
}
if successful_cases:
# Format validation stats
format_valid = [r for r in successful_cases
if r["evaluations"].get("format_valid", True)]
summary["format_validation_rate"] = len(format_valid) / len(successful_cases)
# Content completeness stats
content_complete = [r for r in successful_cases
if r["evaluations"].get("content_complete", True)]
summary["content_completeness_rate"] = len(content_complete) / len(successful_cases)
# Semantic similarity stats
similarities = [r["evaluations"]["semantic_similarity"]
for r in successful_cases
if "semantic_similarity" in r["evaluations"]]
if similarities:
summary["mean_semantic_similarity"] = np.mean(similarities)
summary["min_semantic_similarity"] = min(similarities)
# Judge score stats
judge_scores = [r["evaluations"]["judge_score"]
for r in successful_cases
if "judge_score" in r["evaluations"]]
if judge_scores:
summary["mean_judge_score"] = np.mean(judge_scores)
summary["min_judge_score"] = min(judge_scores)
summary["high_quality_rate"] = len([s for s in judge_scores if s >= 7]) / len(judge_scores)
return summary
# Integration with pytest
class TestLLMApplication:
"""Pytest test class for LLM applications"""
@pytest.fixture(scope="class")
def test_suite(self):
return LLMTestSuite(
model_endpoint="https://api.openai.com/v1/chat/completions",
api_key="your-api-key",
test_dataset_path="test_cases.json"
)
@pytest.mark.asyncio
async def test_comprehensive_evaluation(self, test_suite):
"""Run comprehensive evaluation and assert quality thresholds"""
results = await test_suite.run_comprehensive_evaluation()
summary = results["summary"]
# Assert quality thresholds
assert summary["success_rate"] >= 0.95, f"Success rate too low: {summary['success_rate']}"
assert summary["format_validation_rate"] >= 0.98, f"Format validation rate too low: {summary['format_validation_rate']}"
assert summary["content_completeness_rate"] >= 0.90, f"Content completeness too low: {summary['content_completeness_rate']}"
if "mean_semantic_similarity" in summary:
assert summary["mean_semantic_similarity"] >= 0.7, f"Semantic similarity too low: {summary['mean_semantic_similarity']}"
if "mean_judge_score" in summary:
assert summary["mean_judge_score"] >= 6.0, f"Judge score too low: {summary['mean_judge_score']}"
# Save detailed results for analysis
with open(f"evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json", "w") as f:
json.dump(results, f, indent=2)
# Run tests
# pytest test_llm_application.py -v --tb=short
This testing pipeline provides comprehensive evaluation that can be integrated into CI/CD processes, ensuring quality gates before deployment.
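One lightweight way to enforce those gates in CI, sketched below under the assumption that your pipeline can run a Python script and fail the build on a non-zero exit code, is a small gate script that runs the suite directly and checks the summary against your thresholds:
import asyncio
import sys

async def main() -> int:
    suite = LLMTestSuite(
        model_endpoint="https://api.openai.com/v1/chat/completions",
        api_key="your-api-key",  # read from a CI secret in practice
        test_dataset_path="test_cases.json",
    )
    results = await suite.run_comprehensive_evaluation()
    summary = results["summary"]

    # Fail the build when any quality gate is violated; the thresholds are
    # illustrative and should mirror the ones asserted in the pytest suite.
    gates = [
        summary["success_rate"] >= 0.95,
        summary.get("format_validation_rate", 1.0) >= 0.98,
        summary.get("content_completeness_rate", 1.0) >= 0.90,
    ]
    print(summary)
    return 0 if all(gates) else 1

if __name__ == "__main__":
    sys.exit(asyncio.run(main()))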
Once your LLM application is deployed, ongoing monitoring becomes crucial. Model performance can degrade due to distribution shift, model updates, or changing user behavior patterns.
Implement monitoring that tracks key metrics in production:
import logging
import numpy as np
from datetime import datetime, timedelta
from collections import defaultdict, deque
from typing import Any, Dict, List, Optional
import threading
import time
class ProductionMonitor:
"""Monitors LLM application performance in production"""
def __init__(self,
alert_thresholds: Dict[str, float],
window_minutes: int = 60):
self.alert_thresholds = alert_thresholds
self.window_minutes = window_minutes
# Sliding window storage for metrics
self.metrics_window = defaultdict(lambda: deque())
self.lock = threading.Lock()
# Initialize evaluators for real-time assessment
self.content_validator = ContentValidator()
self.semantic_evaluator = SemanticEvaluator()
# Start background monitoring thread
self.monitoring_thread = threading.Thread(target=self._monitor_loop, daemon=True)
self.monitoring_thread.start()
logging.basicConfig(level=logging.INFO)
self.logger = logging.getLogger(__name__)
def log_interaction(self,
user_input: str,
llm_response: str,
response_time_ms: float,
user_feedback: Optional[str] = None,
category: Optional[str] = None):
"""Log a production interaction for monitoring"""
timestamp = datetime.now()
# Perform real-time quality checks
quality_scores = self._assess_response_quality(
user_input, llm_response, category
)
interaction_data = {
"timestamp": timestamp,
"user_input": user_input,
"llm_response": llm_response,
"response_time_ms": response_time_ms,
"user_feedback": user_feedback,
"category": category,
"quality_scores": quality_scores
}
# Update sliding window metrics
with self.lock:
self.metrics_window["interactions"].append(interaction_data)
self.metrics_window["response_times"].append((timestamp, response_time_ms))
self.metrics_window["quality_scores"].append((timestamp, quality_scores))
# Add user feedback if provided
if user_feedback:
feedback_score = self._parse_feedback_score(user_feedback)
self.metrics_window["user_satisfaction"].append((timestamp, feedback_score))
# Check for immediate alerts
self._check_quality_alerts(quality_scores, interaction_data)
def _assess_response_quality(self,
user_input: str,
llm_response: str,
category: Optional[str]) -> Dict[str, float]:
"""Perform real-time quality assessment"""
scores = {}
# Content completeness check
if category:
is_complete, issues = self.content_validator.check_response_completeness(
llm_response, category
)
scores["content_completeness"] = 1.0 if is_complete else 0.0
scores["content_issues_count"] = len(issues)
# Response length appropriateness
is_appropriate, _ = self.content_validator.check_response_length(llm_response)
scores["length_appropriate"] = 1.0 if is_appropriate else 0.0
# Refusal detection
refusal_indicators = ["I cannot", "I'm not able", "I don't know", "I can't help"]
contains_refusal = any(indicator.lower() in llm_response.lower()
for indicator in refusal_indicators)
scores["contains_refusal"] = 1.0 if contains_refusal else 0.0
return scores
def _parse_feedback_score(self, feedback: str) -> float:
"""Convert user feedback to numeric score"""
feedback_lower = feedback.lower()
if any(word in feedback_lower for word in ["excellent", "great", "perfect"]):
return 1.0
elif any(word in feedback_lower for word in ["good", "helpful", "thanks"]):
return 0.8
elif any(word in feedback_lower for word in ["okay", "fine", "adequate"]):
return 0.6
elif any(word in feedback_lower for word in ["poor", "bad", "unhelpful"]):
return 0.2
else:
return 0.5 # Neutral/unclear feedback
def _check_quality_alerts(self, quality_scores: Dict[str, float], interaction_data: Dict):
"""Check if quality scores trigger alerts"""
# High refusal rate alert
if quality_scores.get("contains_refusal", 0) > 0:
self.logger.warning(f"Refusal detected in interaction: {interaction_data['user_input'][:100]}...")
# Content completeness alert
if quality_scores.get("content_completeness", 1) < self.alert_thresholds.get("min_content_completeness", 0.8):
self.logger.warning(f"Content completeness below threshold: {quality_scores['content_completeness']}")
# Response time alert
if interaction_data["response_time_ms"] > self.alert_thresholds.get("max_response_time_ms", 5000):
self.logger.warning(f"Response time exceeded threshold: {interaction_data['response_time_ms']}ms")
def get_current_metrics(self) -> Dict[str, Any]:
"""Get current performance metrics from sliding window"""
cutoff_time = datetime.now() - timedelta(minutes=self.window_minutes)
with self.lock:
# Filter to current window
current_interactions = [
interaction for interaction in self.metrics_window["interactions"]
if interaction["timestamp"] > cutoff_time
]
current_response_times = [
rt for ts, rt in self.metrics_window["response_times"]
if ts > cutoff_time
]
current_satisfaction = [
score for ts, score in self.metrics_window["user_satisfaction"]
if ts > cutoff_time
]
if not current_interactions:
return {"message": "No interactions in current window"}
# Calculate aggregated metrics
total_interactions = len(current_interactions)
# Quality metrics
avg_content_completeness = np.mean([
interaction["quality_scores"].get("content_completeness", 1)
for interaction in current_interactions
])
refusal_rate = np.mean([
interaction["quality_scores"].get("contains_refusal", 0)
for interaction in current_interactions
])
# Performance metrics
avg_response_time = np.mean(current_response_times) if current_response_times else 0
p95_response_time = np.percentile(current_response_times, 95) if current_response_times else 0
# User satisfaction
avg_satisfaction = np.mean(current_satisfaction) if current_satisfaction else None
return {
"window_minutes": self.window_minutes,
"total_interactions": total_interactions,
"avg_content_completeness": avg_content_completeness,
"refusal_rate": refusal_rate,
"avg_response_time_ms": avg_response_time,
"p95_response_time_ms": p95_response_time,
"avg_user_satisfaction": avg_satisfaction,
"satisfaction_samples": len(current_satisfaction)
}
def _monitor_loop(self):
"""Background monitoring loop"""
while True:
try:
metrics = self.get_current_metrics()
# Check alert thresholds
if isinstance(metrics, dict) and "total_interactions" in metrics:
self._check_aggregate_alerts(metrics)
# Clean old data from sliding window
self._cleanup_old_data()
time.sleep(300) # Check every 5 minutes
except Exception as e:
self.logger.error(f"Monitoring loop error: {str(e)}")
time.sleep(60) # Wait before retrying
def _check_aggregate_alerts(self, metrics: Dict[str, Any]):
"""Check aggregate metrics against alert thresholds"""
if metrics["refusal_rate"] > self.alert_thresholds.get("max_refusal_rate", 0.1):
self.logger.error(f"High refusal rate detected: {metrics['refusal_rate']:.2%}")
if metrics["avg_content_completeness"] < self.alert_thresholds.get("min_content_completeness", 0.8):
self.logger.error(f"Low content completeness: {metrics['avg_content_completeness']:.2f}")
if metrics["p95_response_time_ms"] > self.alert_thresholds.get("max_p95_response_time_ms", 10000):
self.logger.error(f"High p95 response time: {metrics['p95_response_time_ms']:.0f}ms")
if metrics["avg_user_satisfaction"] and metrics["avg_user_satisfaction"] < self.alert_thresholds.get("min_satisfaction", 0.6):
self.logger.error(f"Low user satisfaction: {metrics['avg_user_satisfaction']:.2f}")
def _cleanup_old_data(self):
"""Remove data outside the sliding window"""
cutoff_time = datetime.now() - timedelta(minutes=self.window_minutes)
with self.lock:
# Clean interactions
self.metrics_window["interactions"] = deque([
interaction for interaction in self.metrics_window["interactions"]
if interaction["timestamp"] > cutoff_time
])
# Clean response times
self.metrics_window["response_times"] = deque([
(ts, rt) for ts, rt in self.metrics_window["response_times"]
if ts > cutoff_time
])
# Clean quality scores
self.metrics_window["quality_scores"] = deque([
(ts, scores) for ts, scores in self.metrics_window["quality_scores"]
if ts > cutoff_time
])
# Clean user satisfaction
self.metrics_window["user_satisfaction"] = deque([
(ts, score) for ts, score in self.metrics_window["user_satisfaction"]
if ts > cutoff_time
])
# Usage example
monitor = ProductionMonitor(
alert_thresholds={
"max_refusal_rate": 0.05,
"min_content_completeness": 0.85,
"max_response_time_ms": 3000,
"max_p95_response_time_ms": 8000,
"min_satisfaction": 0.7
}
)
# In your application code
def handle_user_request(user_input: str, category: str) -> str:
start_time = time.time()
# Generate LLM response
llm_response = your_llm_function(user_input)
response_time_ms = (time.time() - start_time) * 1000
# Log for monitoring
monitor.log_interaction(
user_input=user_input,
llm_response=llm_response,
response_time_ms=response_time_ms,
category=category
)
return llm_response
# Check current metrics
current_metrics = monitor.get_current_metrics()
print(f"Current performance: {current_metrics}")
Let's build a complete evaluation system for a document summarization application. This exercise combines all the techniques we've covered into a practical implementation.
You'll create an evaluation system for an LLM that summarizes technical documents. The system should assess summary quality across multiple dimensions and provide actionable feedback for improvement.
import json
import numpy as np
from datetime import datetime
from typing import Any, Dict, List

# First, let's create our test dataset
test_documents = [
{
"id": "tech_001",
"title": "Machine Learning Model Deployment Best Practices",
"content": """
Deploying machine learning models to production requires careful consideration of multiple factors including model versioning, monitoring, and scalability.
Model versioning ensures that different iterations of your model can be tracked and rolled back if necessary. Use semantic versioning (e.g., 1.2.3) where major versions indicate breaking changes, minor versions add functionality, and patch versions fix bugs.
Monitoring is crucial for detecting model drift, where the input data distribution changes over time, causing performance degradation. Implement both data drift detection (monitoring input feature distributions) and concept drift detection (monitoring model performance metrics).
Scalability considerations include choosing appropriate infrastructure (cloud vs. on-premise), implementing load balancing for high-traffic scenarios, and optimizing inference latency through techniques like model quantization or caching frequent predictions.
Security measures should include input validation to prevent adversarial attacks, secure API endpoints with proper authentication, and audit logging for compliance requirements.
""",
"expected_summary": "Key considerations for ML model deployment include version control using semantic versioning, monitoring for data and concept drift, scalability through proper infrastructure and optimization techniques, and security via input validation and secure APIs.",
"key_points": [
"Model versioning with semantic versioning",
"Monitoring for data drift and concept drift",
"Scalability through infrastructure and optimization",
"Security via input validation and secure APIs"
],
"category": "technical_summary",
"difficulty": "medium"
},
# Add more test cases...
]
class DocumentSummarizationEvaluator:
"""Comprehensive evaluator for document summarization"""
def __init__(self):
self.semantic_evaluator = SemanticEvaluator()
self.llm_judge = LLMJudge()
# Initialize ROUGE evaluator for text summarization
try:
from rouge_score import rouge_scorer
self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
except ImportError:
print("Warning: rouge_score not installed. ROUGE metrics will be skipped.")
self.rouge_scorer = None
def evaluate_summary(self,
original_document: str,
generated_summary: str,
reference_summary: str,
key_points: List[str]) -> Dict[str, Any]:
"""Comprehensive evaluation of a generated summary"""
evaluation = {
"timestamp": datetime.now().isoformat(),
"summary_length": len(generated_summary.split()),
"compression_ratio": len(original_document.split()) / len(generated_summary.split())
}
# 1. ROUGE scores (if available)
if self.rouge_scorer:
rouge_scores = self.rouge_scorer.score(reference_summary, generated_summary)
evaluation["rouge_scores"] = {
"rouge1": rouge_scores['rouge1'].fmeasure,
"rouge2": rouge_scores['rouge2'].fmeasure,
"rougeL": rouge_scores['rougeL'].fmeasure
}
# 2. Semantic similarity to reference
semantic_sim = self.semantic_evaluator.calculate_similarity(
generated_summary, reference_summary
)
evaluation["semantic_similarity"] = semantic_sim
# 3. Key point coverage analysis
key_point_coverage = self._evaluate_key_point_coverage(
generated_summary, key_points
)
evaluation["key_point_coverage"] = key_point_coverage
# 4. LLM judge evaluation
judge_criteria = [
"Accuracy: Does the summary accurately represent the main points?",
"Completeness: Are all important topics covered?",
"Conciseness: Is the summary appropriately concise without losing key information?",
"Clarity: Is the summary clear and well-written?",
"Coherence: Does the summary flow logically?"
]
judge_result = self.llm_judge.evaluate_response(
f"Original document:\n{original_document}\n\nGenerated summary:\n{generated_summary}",
generated_summary,
judge_criteria
)
evaluation["judge_evaluation"] = {
"overall_score": judge_result.score,
"reasoning": judge_result.reasoning,
"criteria_scores": judge_result.criteria_scores
}
# 5. Summary quality rules
quality_checks = self._check_summary_quality_rules(generated_summary)
evaluation["quality_checks"] = quality_checks
# 6. Calculate overall score
evaluation["overall_score"] = self._calculate_overall_score(evaluation)
return evaluation
def _evaluate_key_point_coverage(self,
summary: str,
key_points: List[str]) -> Dict[str, Any]:
"""Evaluate how well the summary covers key points"""
summary_lower = summary.lower()
coverage_scores = []
for point in key_points:
# Simple keyword-based coverage (could be enhanced with semantic similarity)
point_keywords = point.lower().split()
keyword_matches = sum(1 for keyword in point_keywords if keyword in summary_lower)
coverage_score = keyword_matches / len(point_keywords)
coverage_scores.append(coverage_score)
return {
"individual_coverage": coverage_scores,
"average_coverage": np.mean(coverage_scores),
"points_well_covered": len([score for score in coverage_scores if score > 0.5]),
"total_key_points": len(key_points)
}
def _check_summary_quality_rules(self, summary: str) -> Dict[str, Any]:
"""Apply rule-based quality checks"""
checks = {}
# Length appropriateness
word_count = len(summary.split())
checks["appropriate_length"] = 50 <= word_count <= 200
checks["word_count"] = word_count
# Sentence structure
sentences = summary.split('.')
sentence_count = len([s for s in sentences if s.strip()])
checks["sentence_count"] = sentence_count
checks["appropriate_sentence_count"] = 2 <= sentence_count <= 8
# Avoid common summarization issues
checks["no_repetition"] = not self._has_significant_repetition(summary)
checks["no_first_person"] = not any(phrase in summary.lower()
for phrase in ["i think", "in my opinion", "i believe"])
checks["proper_capitalization"] = summary[0].isupper() if summary else False
return checks
def _has_significant_repetition(self, text: str) -> bool:
"""Check for significant repetition in the text"""
words = text.lower().split()
if len(words) < 10:
return False
# Check for repeated phrases of 3+ words
for i in range(len(words) - 5):
phrase = ' '.join(words[i:i+3])
remaining_text = ' '.join(words[i+3:])
if phrase in remaining_text:
return True
return False
def _calculate_overall_score(self, evaluation: Dict[str, Any]) -> float:
"""Calculate weighted overall score from all evaluation components"""
score_components = []
# Semantic similarity (weight: 0.25)
if "semantic_similarity" in evaluation:
score_components.append((evaluation["semantic_similarity"], 0.25))
# ROUGE scores (weight: 0.25)
if "rouge_scores" in evaluation:
rouge_avg = np.mean(list(evaluation["rouge_scores"].values()))
score_components.append((rouge_avg, 0.25))
# Key point coverage (weight: 0.20)
coverage_score = evaluation["key_point_coverage"]["average_coverage"]
score_components.append((coverage_score, 0.20))
# LLM judge score (weight: 0.20, normalized to 0-1 scale)
judge_score = evaluation["judge_evaluation"]["overall_score"] / 10.0
score_components.append((judge_score, 0.20))
# Quality checks (weight: 0.10)
quality_checks = evaluation["quality_checks"]
quality_score = np.mean([
1.0 if quality_checks["appropriate_length"] else 0.0,
1.0 if quality_checks["appropriate_sentence_count"] else 0.0,
1.0 if quality_checks["no_repetition"] else 0.0,
1.0 if quality_checks["no_first_person"] else 0.0,
1.0 if quality_checks["proper_capitalization"] else 0.0
])
score_components.append((quality_score, 0.10))
# Calculate weighted average
total_weight = sum(weight for _, weight in score_components)
weighted_sum = sum(score * weight for score, weight in score_components)
return weighted_sum / total_weight if total_weight > 0 else 0.0
# Your task: Implement the summarization function and run evaluation
def your_summarization_function(document_text: str) -> str:
"""
TODO: Implement your summarization logic here
This could use:
- An LLM API call with appropriate prompts
- A fine-tuned summarization model
- Traditional extractive summarization techniques
For this exercise, you can use OpenAI's API or any other approach
"""
# Example implementation using OpenAI
import openai
client = openai.OpenAI()
prompt = f"""
Please provide a concise summary of the following technical document.
Focus on the main points and key takeaways. Keep the summary between 50-150 words.
Document:
{document_text}
Summary:
"""
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
temperature=0.3,
max_tokens=200
)
return response.choices[0].message.content.strip()
# Run the evaluation
def run_summarization_evaluation():
"""Run evaluation on the test dataset"""
evaluator = DocumentSummarizationEvaluator()
results = []
for doc in test_documents:
print(f"Evaluating document: {doc['id']}")
# Generate summary
generated_summary = your_summarization_function(doc["content"])
# Evaluate the summary
evaluation = evaluator.evaluate_summary(
original_document=doc["content"],
generated_summary=generated_summary,
reference_summary=doc["expected_summary"],
key_points=doc["key_points"]
)
evaluation["document_id"] = doc["id"]
evaluation["generated_summary"] = generated_summary
results.append(evaluation)
print(f"Overall score: {evaluation['overall_score']:.3f}")
print(f"Generated summary: {generated_summary[:100]}...")
print("-" * 50)
# Calculate aggregate statistics
overall_scores = [r["overall_score"] for r in results]
print(f"\nAggregate Results:")
print(f"Mean overall score: {np.mean(overall_scores):.3f}")
print(f"Min overall score: {min(overall_scores):.3f}")
print(f"Max overall score: {max(overall_scores):.3f}")
# Identify areas for improvement
print(f"\nDetailed Analysis:")
for result in results:
print(f"\nDocument {result['document_id']}:")
print(f" Overall Score: {result['overall_score']:.3f}")
        print(f"  Semantic Similarity: {result.get('semantic_similarity', float('nan')):.3f}")
print(f" Key Point Coverage: {result['key_point_coverage']['average_coverage']:.3f}")
print(f" Judge Score: {result['judge_evaluation']['overall_score']:.1f}/10")
# Highlight specific issues
quality_issues = []
quality_checks = result['quality_checks']
if not quality_checks['appropriate_length']:
quality_issues.append(f"Length issue (current: {quality_checks['word_count']} words)")
if not quality_checks['no_repetition']:
quality_issues.append("Contains repetitive content")
if not quality_checks['appropriate_sentence_count']:
quality_issues.append(f"Sentence count issue (current: {quality_checks['sentence_count']})")
if quality_issues:
print(f" Quality Issues: {', '.join(quality_issues)}")
return results
# Run the evaluation
if __name__ == "__main__":
evaluation_results = run_summarization_evaluation()
# Save results for further analysis
with open(f"summarization_evaluation_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json", "w") as f:
json.dump(evaluation_results, f, indent=2, default=str)
Exercise Instructions:
Implement the summarization function: Choose your approach (OpenAI API, Hugging Face model, or another method) and implement your_summarization_function().
Run the evaluation: Execute the evaluation script and analyze the results. Pay attention to which aspects score well and which need improvement.
Iterate and improve: Based on the evaluation results, modify your summarization approach and re-run the evaluation to confirm the weak areas improve.
Add more test cases: Expand the test_documents list with additional examples covering different domains and difficulty levels.
Customize evaluation criteria: Modify the evaluation criteria in the LLM judge to match your specific requirements.
Problem: Test cases don't reflect real-world usage patterns.
Solution: Continuously update your evaluation dataset based on production data. Set up processes to regularly sample and annotate real user interactions.
def update_evaluation_dataset_from_production(production_logs: List[Dict],
sample_rate: float = 0.01) -> List[EvaluationExample]:
"""Sample and prepare production data for evaluation dataset"""
# Sample random subset
import random
sampled_logs = random.sample(production_logs,
int(len(production_logs) * sample_rate))
evaluation_examples = []
for log in sampled_logs:
# Only include interactions with user feedback
if log.get("user_rating") is not None:
example = EvaluationExample(
id=f"prod_{log['timestamp']}_{log['user_id']}",
input_text=log["user_input"],
expected_output=log["llm_response"] if log["user_rating"] >= 4 else "",
category=log.get("category", "unknown"),
difficulty="real_world",
tags=["production", f"rating_{log['user_rating']}"]
)
evaluation_examples.append(example)
return evaluation_examples
Problem: LLM judges provide inconsistent scores for similar outputs.
Solution: Use multiple judge evaluations and aggregate results. Also, provide more specific evaluation criteria and examples.
def get_consensus_evaluation(input_text: str,
llm_output: str,
criteria: List[str],
num_judges: int = 3) -> EvaluationResult:
"""Get consensus evaluation from multiple LLM judges"""
judge = LLMJudge()
evaluations = []
for i in range(num_judges):
# Add slight variation to reduce identical responses
modified_criteria = criteria + [f"(Evaluation run {i+1})"]
result = judge.evaluate_response(input_text, llm_output, modified_criteria)
evaluations.append(result)
# Calculate consensus
    scores = [e.score for e in evaluations]
consensus_score = np.median(scores) # Use median for robustness
# Identify high-variance evaluations
score_variance = np.var(scores)
if score_variance > 4.0: # High disagreement
print(f"Warning: High variance in judge evaluations: {scores}")
return EvaluationResult(
score=consensus_score,
reasoning=f"Consensus of {num_judges} evaluations. Scores: {scores}",
criteria_scores={} # Could aggregate individual criteria scores
)
Problem: Comprehensive evaluation is too slow for development workflows.
Solution: Implement tiered evaluation - fast checks for development, comprehensive evaluation for pre-production.
class TieredEvaluator:
"""Provides fast and comprehensive evaluation modes"""
def __init__(self):
self.format_validator = FormatValidator()
self.content_validator = ContentValidator()
self.semantic_evaluator = SemanticEvaluator()
self.llm_judge = LLMJudge()
def quick_evaluation(self, response: str, test_case: Dict) -> Dict[str, Any]:
"""Fast evaluation for development workflow"""
start_time = time.time()
# Only run fast checks
results = {}
# Format validation
if test_case.get("expected_format"):
is_valid, msg = self.format_validator.validate_json_response(response)
results["format_valid"] = is_valid
# Basic content checks
is_complete, issues = self.content_validator.check_response_completeness(
response, test_case.get("category", "")
)
results["content_complete"] = is_complete
results["issue_count"] = len(issues)
results["evaluation_time_ms"] = (time.time() - start_time) * 1000
results["evaluation_mode"] = "quick"
return results
def comprehensive_evaluation(self, response: str, test_case: Dict) -> Dict[str, Any]:
"""Thorough evaluation for pre-production validation"""
start_time = time.time()
# Run all evaluation methods
results = self.quick_evaluation(response, test_case)
# Add semantic similarity
if test_case.get("expected_output"):
similarity = self.semantic_evaluator.calculate_similarity(
response, test_case["expected_output"]
)
results["semantic_similarity"] = similarity
# Add LLM judge evaluation
if test_case.get("evaluation_criteria"):
judge_result = self.llm_judge.evaluate_response(
test_case["input_text"], response, test_case["evaluation_criteria"]
)
results["judge_score"] = judge_result.score
results["judge_reasoning"] = judge_result.reasoning
results["evaluation_time_ms"] = (time.time() - start_time) * 1000
results["evaluation_mode"] = "comprehensive"
return results
Problem: Same inputs produce different outputs, making evaluation inconsistent.
Solution: Run multiple evaluations and use statistical methods to account for variance.
def evaluate_with_variance_analysis(llm_function,
test_input: str,
num_runs: int = 5) -> Dict[str, Any]:
"""Evaluate accounting for LLM non-determinism"""
results = []
responses = []
for run in range(num_runs):
response = llm_function(test_input)
responses.append(response)
# Run evaluation on this response
evaluation = your_evaluation_function(response, test_input)
results.append(evaluation)
# Calculate statistics
scores = [r.get("overall_score", 0) for r in results]
return {
"mean_score": np.mean(scores),
"std_score": np.std(scores),
"min_score": min(scores),
"max_score": max(scores),
"confidence_interval_95": (
np.mean(scores) - 1.96 * np.std(scores) / np.sqrt(len(scores)),
np.mean(scores) + 1.96 * np.std(scores) / np.sqrt(len(scores))
),
"all_responses": responses,
"all_results": results,
"num_runs": num_runs
}
You've built a comprehensive testing and evaluation framework for LLM applications that combines multiple evaluation approaches, integrates with development workflows, and provides production monitoring. This framework addresses the unique challenges of evaluating non-deterministic AI systems while maintaining practical usability for development teams.
Key takeaways from this lesson:
Immediate next steps:
Advanced topics to explore:
The evaluation strategies you've learned scale from simple applications to complex multi-agent systems. As LLM capabilities continue advancing, robust evaluation remains the foundation for building reliable, production-ready AI applications that users can trust.