
When your LLM application works perfectly in development but produces bizarre outputs in production, you're experiencing a classic problem: inadequate testing. Unlike traditional software where you test for deterministic outputs, LLM applications require evaluating probabilistic, context-dependent responses that can vary dramatically based on subtle input changes.
Consider this scenario: You've built a customer service chatbot that handles refund requests. During development, it correctly processes simple cases like "I want to return my broken headphones." But in production, users discover it approves expensive electronics returns based on vague complaints, while rejecting legitimate warranty claims due to minor phrasing differences. Without proper evaluation frameworks, these failures only surface when they impact real customers.
By the end of this lesson, you'll have a comprehensive testing strategy for LLM applications that catches problems before production and provides ongoing monitoring for deployed systems.
Prerequisites: You should be familiar with building basic LLM applications using APIs like OpenAI's or open-source models. Basic Python programming skills and familiarity with testing concepts are assumed. Experience with data analysis libraries like pandas will be helpful but not required.
Traditional software testing relies on deterministic inputs and outputs. You test that calculate_tax(100, 0.08) always returns 8.0. But LLM applications break this paradigm. The same prompt might generate different responses due to temperature settings, model updates, or even random sampling during generation.
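To make the contrast concrete, here is a minimal sketch. The deterministic function can be pinned down with an exact-match assertion, while the LLM call (wrapped here in a hypothetical generate_response helper standing in for your model client) may word the same answer differently on every run, so exact-match assertions are too brittle.
def calculate_tax(amount: float, rate: float) -> float:
    return amount * rate

def test_calculate_tax():
    # Deterministic: identical inputs always produce identical output.
    assert calculate_tax(100, 0.08) == 8.0

def show_llm_variability(generate_response):
    # generate_response is a hypothetical wrapper around your LLM client.
    # Even with identical prompts, the two responses may differ in wording,
    # so an exact string comparison is too brittle for LLM outputs.
    prompt = "I want to return my broken headphones."
    first = generate_response(prompt)
    second = generate_response(prompt)
    print(first == second)  # Often False, even when both answers are acceptable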
This non-determinism creates several evaluation challenges:
Output Variability: Even with temperature set to 0, most LLMs can produce slightly different outputs across runs. More problematically, semantically equivalent responses might use different words, sentence structures, or reasoning approaches.
Context Sensitivity: LLM responses depend heavily on conversation history, system prompts, and even the order of examples in few-shot prompts. A customer service bot might handle "cancel my subscription" differently if the previous message was about billing issues versus product complaints.
Emergent Behaviors: Complex prompts can trigger unexpected model behaviors that weren't present in simpler test cases. Multi-step reasoning tasks are especially prone to failure modes that only appear with specific input combinations.
Let's start by building evaluation infrastructure that addresses these challenges systematically.
Your evaluation dataset is the foundation of reliable testing. Unlike unit tests with single inputs and outputs, LLM evaluation requires diverse examples that capture the full range of real-world usage patterns.
Start by analyzing your production data (or expected usage patterns) to identify key dimensions of variation:
import pandas as pd
import json
from typing import Any, Dict, List, Optional
from dataclasses import dataclass
@dataclass
class EvaluationExample:
"""Structure for individual test cases"""
id: str
input_text: str
expected_output: str
category: str
difficulty: str # easy, medium, hard
tags: List[str]
    context: Optional[Dict[str, Any]] = None
# Example: Customer service refund evaluation dataset
refund_examples = [
EvaluationExample(
id="refund_001",
input_text="My headphones stopped working after 2 months. Can I get my money back?",
expected_output="I understand your frustration with the headphones failing so quickly. Since you're within our 6-month warranty period, I can process a full refund. You'll receive an email with return instructions shortly.",
category="warranty_claim",
difficulty="easy",
tags=["electronics", "warranty", "straightforward"]
),
EvaluationExample(
id="refund_002",
input_text="I bought this laptop 8 months ago and now the screen is flickering. This is unacceptable quality. I demand a full refund immediately or I'm calling my lawyer.",
expected_output="I apologize for the screen issue with your laptop. While you're outside our standard 6-month return window, screen flickering could indicate a manufacturing defect covered under warranty. Let me escalate this to our technical support team to determine if this qualifies for warranty repair or replacement.",
category="complex_warranty",
difficulty="hard",
tags=["electronics", "warranty_edge_case", "angry_customer", "escalation"]
),
# More examples covering edge cases...
]
Notice how each example includes metadata beyond just input/output pairs. The category and tags help you analyze performance across different scenarios, while difficulty ratings let you track whether problems concentrate in complex cases.
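Once evaluation results start accumulating, this metadata pays off during analysis. As a rough sketch, assuming a results list of dicts where each record carries its example's category and difficulty plus a pass/fail outcome (the field names here are illustrative), you can slice pass rates with pandas:
import pandas as pd

# Hypothetical per-case results produced by an evaluation run; each record
# carries the metadata from its EvaluationExample plus a boolean outcome.
results = [
    {"id": "refund_001", "category": "warranty_claim", "difficulty": "easy", "passed": True},
    {"id": "refund_002", "category": "complex_warranty", "difficulty": "hard", "passed": False},
]

df = pd.DataFrame(results)

# Pass rate per category reveals which scenarios need more attention.
print(df.groupby("category")["passed"].mean())

# Pass rate per difficulty shows whether failures concentrate in hard cases.
print(df.groupby("difficulty")["passed"].mean())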
Many LLM applications don't have single "correct" answers. A creative writing assistant might generate multiple valid story continuations, or a data analysis tool might choose different but equally appropriate visualization approaches.
For these scenarios, create evaluation examples with multiple acceptable outputs or define success criteria rather than exact matches:
@dataclass
class MultiTargetExample:
id: str
input_text: str
acceptable_outputs: List[str]
success_criteria: List[str]
rejection_criteria: List[str]
creative_writing_example = MultiTargetExample(
id="story_001",
input_text="Continue this story: Sarah walked into the abandoned library and noticed something strange about the books...",
acceptable_outputs=[
"The books were rearranging themselves on the shelves, sliding silently from section to section as if guided by invisible hands.",
"Every book lay open to the same page number - 237 - despite being completely different titles from different eras.",
"The books whispered her name in unison, their pages fluttering without any breeze in the still air."
],
success_criteria=[
"Continues the narrative coherently",
"Maintains the mysterious/supernatural tone",
"Introduces a specific strange element about the books",
"Uses vivid descriptive language"
],
rejection_criteria=[
"Breaks narrative continuity",
"Introduces characters not mentioned in the prompt",
"Shifts to completely different genre (comedy, romance)",
"Provides meta-commentary instead of story continuation"
]
)
This structure allows for flexible evaluation while maintaining clear quality standards.
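One simple way to put this structure to work, sketched below with the sentence-transformers library that reappears later in this lesson, is to accept a response when it is semantically close to any of the acceptable outputs. The 0.6 threshold is an illustrative assumption you would tune on your own data; the success and rejection criteria are better suited to the model-based evaluation covered later.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def matches_any_acceptable(response: str, example: MultiTargetExample,
                           threshold: float = 0.6) -> bool:
    # Accept a response that is semantically close to any acceptable output.
    # The threshold is an assumption for illustration, not a recommended value.
    response_emb = model.encode(response, convert_to_tensor=True)
    acceptable_embs = model.encode(example.acceptable_outputs, convert_to_tensor=True)
    scores = util.cos_sim(response_emb, acceptable_embs)
    return bool(scores.max() >= threshold)

candidate = ("The books began sliding across the shelves on their own, "
             "as though unseen hands were reshelving them.")
print(matches_any_acceptable(candidate, creative_writing_example))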
Rule-based evaluation uses programmatic checks to validate LLM outputs against specific criteria. While less sophisticated than model-based approaches, rule-based methods provide fast, consistent, and interpretable results for many applications.
Many LLM applications require outputs in specific formats: a financial report generator must produce valid JSON, and a code generation tool must output syntactically correct Python. These requirements are perfect for rule-based validation:
import json
import re
from typing import Tuple, List
class FormatValidator:
"""Validates LLM outputs against format requirements"""
def validate_json_response(self, response: str) -> Tuple[bool, str]:
"""Check if response is valid JSON with required fields"""
try:
data = json.loads(response.strip())
required_fields = ["analysis", "confidence", "recommendations"]
missing_fields = [field for field in required_fields if field not in data]
if missing_fields:
return False, f"Missing required fields: {missing_fields}"
return True, "Valid JSON format"
except json.JSONDecodeError as e:
return False, f"Invalid JSON: {str(e)}"
def validate_sql_query(self, response: str) -> Tuple[bool, str]:
"""Basic SQL syntax validation"""
# Extract SQL from response (assuming it's in code blocks)
sql_pattern = r"```sql\s*(.*?)\s*```"
matches = re.findall(sql_pattern, response, re.DOTALL | re.IGNORECASE)
if not matches:
return False, "No SQL code block found"
sql = matches[0].strip()
# Check for dangerous operations in read-only context
dangerous_keywords = ["DROP", "DELETE", "UPDATE", "INSERT", "CREATE", "ALTER"]
for keyword in dangerous_keywords:
if keyword.upper() in sql.upper():
return False, f"Contains prohibited keyword: {keyword}"
# Basic structure validation
if not sql.upper().startswith("SELECT"):
return False, "Query must start with SELECT"
return True, "Valid SQL query"
# Usage example
validator = FormatValidator()
# Test a financial analysis response
llm_response = '''
{
"analysis": "Revenue increased 15% YoY driven by strong Q4 performance",
"confidence": 0.85,
"recommendations": ["Focus on Q4 strategies", "Monitor competitive landscape"]
}
'''
is_valid, message = validator.validate_json_response(llm_response)
print(f"JSON validation: {is_valid} - {message}")
Beyond format validation, you can check content quality using domain-specific rules:
class ContentValidator:
"""Validates content quality and appropriateness"""
def __init__(self):
# Load domain-specific requirements
self.prohibited_phrases = [
"I cannot", "I'm not able to", "I don't know",
"As an AI", "I'm just a language model"
]
self.required_elements = {
"customer_service": ["apology", "solution", "next_steps"],
"technical_explanation": ["definition", "example", "implications"]
}
def check_response_completeness(self, response: str, category: str) -> Tuple[bool, List[str]]:
"""Verify response contains required elements for the category"""
issues = []
# Check for refusal patterns
for phrase in self.prohibited_phrases:
if phrase.lower() in response.lower():
issues.append(f"Contains refusal pattern: '{phrase}'")
# Check category-specific requirements
if category in self.required_elements:
required = self.required_elements[category]
response_lower = response.lower()
for element in required:
# Simple keyword-based checking (could be more sophisticated)
element_keywords = {
"apology": ["sorry", "apologize", "regret", "understand your frustration"],
"solution": ["will", "can", "here's how", "solution", "resolve"],
"next_steps": ["next", "follow up", "contact", "email", "within"]
}
if element in element_keywords:
keywords = element_keywords[element]
if not any(keyword in response_lower for keyword in keywords):
issues.append(f"Missing {element} element")
return len(issues) == 0, issues
def check_response_length(self, response: str, min_words: int = 10, max_words: int = 500) -> Tuple[bool, str]:
"""Validate response length is appropriate"""
word_count = len(response.split())
if word_count < min_words:
return False, f"Response too short: {word_count} words (minimum {min_words})"
elif word_count > max_words:
return False, f"Response too long: {word_count} words (maximum {max_words})"
return True, f"Appropriate length: {word_count} words"
# Example usage
content_validator = ContentValidator()
customer_response = """
I apologize for the inconvenience with your headphones. I understand how frustrating it must be when a product fails so quickly.
I can process a full refund for you since you're within our warranty period. Here's what happens next: you'll receive an email with return instructions within the next hour, and your refund will be processed within 3-5 business days once we receive the item.
Is there anything else I can help you with today?
"""
is_complete, issues = content_validator.check_response_completeness(
customer_response, "customer_service"
)
is_appropriate_length, length_msg = content_validator.check_response_length(customer_response)
print(f"Content completeness: {is_complete}")
if issues:
print(f"Issues found: {issues}")
print(f"Length check: {is_appropriate_length} - {length_msg}")
Rule-based validation excels at catching obvious failures quickly and consistently. However, it struggles with nuanced quality assessment, which requires more sophisticated approaches.
Model-based evaluation uses other AI systems to assess LLM outputs. This approach can capture semantic similarity, quality judgments, and complex criteria that rule-based methods miss.
One of the most effective approaches is using a strong LLM to evaluate outputs from your application's LLM. This "LLM-as-judge" pattern can assess criteria like coherence, relevance, and helpfulness:
import json
import openai
from typing import Dict, List
from dataclasses import dataclass
@dataclass
class EvaluationResult:
    score: float  # 1-10 scale; 0.0 indicates an evaluation failure
reasoning: str
criteria_scores: Dict[str, float]
class LLMJudge:
"""Uses GPT-4 to evaluate LLM outputs"""
def __init__(self, model: str = "gpt-4"):
self.model = model
self.client = openai.OpenAI()
def evaluate_response(self,
input_prompt: str,
llm_output: str,
evaluation_criteria: List[str]) -> EvaluationResult:
"""Evaluate an LLM response against specific criteria"""
criteria_text = "\n".join([f"- {criterion}" for criterion in evaluation_criteria])
judge_prompt = f"""
You are an expert evaluator of AI system outputs. Evaluate the following LLM response based on the given criteria.
USER INPUT:
{input_prompt}
LLM RESPONSE:
{llm_output}
EVALUATION CRITERIA:
{criteria_text}
For each criterion, provide a score from 1-10 where:
1-3: Poor (major issues)
4-6: Adequate (meets basic requirements)
7-8: Good (exceeds expectations)
9-10: Excellent (exceptional quality)
Respond in JSON format:
{{
"overall_score": <average of all criteria>,
"reasoning": "<detailed explanation of the evaluation>",
"criteria_scores": {{
"criterion_1": <score>,
"criterion_2": <score>,
...
}}
}}
"""
response = self.client.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": judge_prompt}],
temperature=0.1 # Lower temperature for more consistent evaluations
)
try:
result = json.loads(response.choices[0].message.content)
return EvaluationResult(
score=result["overall_score"],
reasoning=result["reasoning"],
criteria_scores=result["criteria_scores"]
)
except (json.JSONDecodeError, KeyError) as e:
# Fallback if JSON parsing fails
return EvaluationResult(
score=0.0,
reasoning=f"Evaluation failed: {str(e)}",
criteria_scores={}
)
# Example usage for customer service evaluation
judge = LLMJudge()
customer_input = "I bought a laptop 8 months ago and the screen is flickering. I want a refund."
llm_response = "I understand your frustration. While 8 months is outside our return window, flickering screens may indicate a manufacturing defect covered under warranty. Let me escalate this to determine if you qualify for warranty repair or replacement."
evaluation_criteria = [
"Empathy and understanding of customer concern",
"Accurate application of return/warranty policies",
"Clear communication of next steps",
"Professional and helpful tone",
"Appropriate level of detail"
]
result = judge.evaluate_response(customer_input, llm_response, evaluation_criteria)
print(f"Overall Score: {result.score}/10")
print(f"Reasoning: {result.reasoning}")
print("Criteria Scores:")
for criterion, score in result.criteria_scores.items():
print(f" {criterion}: {score}/10")
For tasks where you have reference outputs, semantic similarity can measure how well the LLM's response matches expected content without requiring exact word matches:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine
import numpy as np
class SemanticEvaluator:
"""Evaluates responses using semantic similarity"""
def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
self.model = SentenceTransformer(model_name)
def calculate_similarity(self, response: str, reference: str) -> float:
"""Calculate semantic similarity between response and reference"""
embeddings = self.model.encode([response, reference])
similarity = 1 - cosine(embeddings[0], embeddings[1])
return float(similarity)
def evaluate_against_references(self,
response: str,
references: List[str]) -> Dict[str, float]:
"""Evaluate response against multiple reference outputs"""
similarities = []
for ref in references:
similarity = self.calculate_similarity(response, ref)
similarities.append(similarity)
return {
"max_similarity": max(similarities),
"mean_similarity": np.mean(similarities),
"all_similarities": similarities
}
# Example usage
evaluator = SemanticEvaluator()
# Test response against multiple valid references
test_response = "I can help you process a refund since your headphones are still under warranty."
reference_responses = [
"I'll be happy to process your refund as the headphones are within the warranty period.",
"Since your headphones are covered under warranty, I can arrange a full refund for you.",
"Your headphones qualify for a warranty refund. I'll start the process right away."
]
similarity_scores = evaluator.evaluate_against_references(test_response, reference_responses)
print(f"Max similarity: {similarity_scores['max_similarity']:.3f}")
print(f"Mean similarity: {similarity_scores['mean_similarity']:.3f}")
# Set threshold for acceptable similarity
acceptable_threshold = 0.7
if similarity_scores['max_similarity'] >= acceptable_threshold:
print("✓ Response semantically matches expected output")
else:
print("✗ Response differs significantly from expected output")
Manual evaluation doesn't scale for production LLM applications. You need automated pipelines that run comprehensive evaluations on every code change and provide continuous monitoring of model performance.
Here's a complete testing pipeline that integrates with your development process:
import json
import numpy as np
import pytest
import pandas as pd
from typing import List, Dict, Any
import asyncio
import aiohttp
import time
from datetime import datetime
class LLMTestSuite:
"""Comprehensive test suite for LLM applications"""
def __init__(self,
model_endpoint: str,
api_key: str,
test_dataset_path: str):
self.endpoint = model_endpoint
self.api_key = api_key
self.test_data = pd.read_json(test_dataset_path)
self.results = []
# Initialize evaluators
self.format_validator = FormatValidator()
self.content_validator = ContentValidator()
self.semantic_evaluator = SemanticEvaluator()
self.llm_judge = LLMJudge()
async def generate_response(self, session: aiohttp.ClientSession, prompt: str) -> str:
"""Generate response from LLM endpoint"""
headers = {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"
}
payload = {
"messages": [{"role": "user", "content": prompt}],
"temperature": 0.1,
"max_tokens": 500
}
async with session.post(self.endpoint, json=payload, headers=headers) as response:
result = await response.json()
return result["choices"][0]["message"]["content"]
async def run_comprehensive_evaluation(self) -> Dict[str, Any]:
"""Run full evaluation suite on test dataset"""
start_time = time.time()
async with aiohttp.ClientSession() as session:
# Generate responses for all test cases
tasks = []
for _, row in self.test_data.iterrows():
task = self.evaluate_single_case(session, row)
tasks.append(task)
results = await asyncio.gather(*tasks)
# Aggregate results
evaluation_results = {
"timestamp": datetime.now().isoformat(),
"total_cases": len(results),
"execution_time": time.time() - start_time,
"results": results
}
# Calculate summary statistics
evaluation_results["summary"] = self.calculate_summary_stats(results)
return evaluation_results
async def evaluate_single_case(self, session: aiohttp.ClientSession, test_case: pd.Series) -> Dict[str, Any]:
"""Evaluate a single test case comprehensively"""
try:
# Generate response
response = await self.generate_response(session, test_case["input_text"])
# Run all evaluations
case_result = {
"case_id": test_case["id"],
"category": test_case["category"],
"difficulty": test_case["difficulty"],
"input": test_case["input_text"],
"response": response,
"evaluations": {}
}
# Format validation
if test_case.get("expected_format"):
if test_case["expected_format"] == "json":
is_valid, msg = self.format_validator.validate_json_response(response)
case_result["evaluations"]["format_valid"] = is_valid
case_result["evaluations"]["format_message"] = msg
# Content validation
is_complete, issues = self.content_validator.check_response_completeness(
response, test_case["category"]
)
case_result["evaluations"]["content_complete"] = is_complete
case_result["evaluations"]["content_issues"] = issues
# Semantic similarity (if reference output exists)
if pd.notna(test_case.get("expected_output")):
similarity = self.semantic_evaluator.calculate_similarity(
response, test_case["expected_output"]
)
case_result["evaluations"]["semantic_similarity"] = similarity
# LLM judge evaluation
if test_case.get("evaluation_criteria"):
criteria = test_case["evaluation_criteria"]
judge_result = self.llm_judge.evaluate_response(
test_case["input_text"], response, criteria
)
case_result["evaluations"]["judge_score"] = judge_result.score
case_result["evaluations"]["judge_reasoning"] = judge_result.reasoning
return case_result
except Exception as e:
return {
"case_id": test_case["id"],
"error": str(e),
"evaluations": {}
}
def calculate_summary_stats(self, results: List[Dict[str, Any]]) -> Dict[str, Any]:
"""Calculate aggregate performance metrics"""
total_cases = len(results)
successful_cases = [r for r in results if "error" not in r]
summary = {
"success_rate": len(successful_cases) / total_cases,
"total_cases": total_cases,
"failed_cases": total_cases - len(successful_cases)
}
if successful_cases:
# Format validation stats
format_valid = [r for r in successful_cases
if r["evaluations"].get("format_valid", True)]
summary["format_validation_rate"] = len(format_valid) / len(successful_cases)
# Content completeness stats
content_complete = [r for r in successful_cases
if r["evaluations"].get("content_complete", True)]
summary["content_completeness_rate"] = len(content_complete) / len(successful_cases)
# Semantic similarity stats
similarities = [r["evaluations"]["semantic_similarity"]
for r in successful_cases
if "semantic_similarity" in r["evaluations"]]
if similarities:
summary["mean_semantic_similarity"] = np.mean(similarities)
summary["min_semantic_similarity"] = min(similarities)
# Judge score stats
judge_scores = [r["evaluations"]["judge_score"]
for r in successful_cases
if "judge_score" in r["evaluations"]]
if judge_scores:
summary["mean_judge_score"] = np.mean(judge_scores)
summary["min_judge_score"] = min(judge_scores)
summary["high_quality_rate"] = len([s for s in judge_scores if s >= 7]) / len(judge_scores)
return summary
# Integration with pytest
class TestLLMApplication:
"""Pytest test class for LLM applications"""
@pytest.fixture(scope="class")
def test_suite(self):
return LLMTestSuite(
model_endpoint="https://api.openai.com/v1/chat/completions",
api_key="your-api-key",
test_dataset_path="test_cases.json"
)
@pytest.mark.asyncio
async def test_comprehensive_evaluation(self, test_suite):
"""Run comprehensive evaluation and assert quality thresholds"""
results = await test_suite.run_comprehensive_evaluation()
summary = results["summary"]
# Assert quality thresholds
assert summary["success_rate"] >= 0.95, f"Success rate too low: {summary['success_rate']}"
assert summary["format_validation_rate"] >= 0.98, f"Format validation rate too low: {summary['format_validation_rate']}"
assert summary["content_completeness_rate"] >= 0.90, f"Content completeness too low: {summary['content_completeness_rate']}"
if "mean_semantic_similarity" in summary:
assert summary["mean_semantic_similarity"] >= 0.7, f"Semantic similarity too low: {summary['mean_semantic_similarity']}"
if "mean_judge_score" in summary:
assert summary["mean_judge_score"] >= 6.0, f"Judge score too low: {summary['mean_judge_score']}"
# Save detailed results for analysis
with open(f"evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json", "w") as f:
json.dump(results, f, indent=2)
# Run tests
# pytest test_llm_application.py -v --tb=short
This testing pipeline provides comprehensive evaluation that can be integrated into CI/CD processes, ensuring quality gates before deployment.
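One lightweight way to enforce those gates in CI, sketched below under the assumption that your pipeline can run a Python script and fail the build on a non-zero exit code, is a small gate script that runs the suite directly and checks the summary against your thresholds:
import asyncio
import sys

async def main() -> int:
    suite = LLMTestSuite(
        model_endpoint="https://api.openai.com/v1/chat/completions",
        api_key="your-api-key",  # read from a CI secret in practice
        test_dataset_path="test_cases.json",
    )
    results = await suite.run_comprehensive_evaluation()
    summary = results["summary"]

    # Fail the build when any quality gate is violated; the thresholds are
    # illustrative and should mirror the ones asserted in the pytest suite.
    gates = [
        summary["success_rate"] >= 0.95,
        summary.get("format_validation_rate", 1.0) >= 0.98,
        summary.get("content_completeness_rate", 1.0) >= 0.90,
    ]
    print(summary)
    return 0 if all(gates) else 1

if __name__ == "__main__":
    sys.exit(asyncio.run(main()))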
Once your LLM application is deployed, ongoing monitoring becomes crucial. Model performance can degrade due to distribution shift, model updates, or changing user behavior patterns.
Implement monitoring that tracks key metrics in production:
import logging
import numpy as np
from datetime import datetime, timedelta
from collections import defaultdict, deque
from typing import Any, Dict, List, Optional
import threading
import time
class ProductionMonitor:
"""Monitors LLM application performance in production"""
def __init__(self,
alert_thresholds: Dict[str, float],
window_minutes: int = 60):
self.alert_thresholds = alert_thresholds
self.window_minutes = window_minutes
# Sliding window storage for metrics
self.metrics_window = defaultdict(lambda: deque())
self.lock = threading.Lock()
# Initialize evaluators for real-time assessment
self.content_validator = ContentValidator()
self.semantic_evaluator = SemanticEvaluator()
# Start background monitoring thread
self.monitoring_thread = threading.Thread(target=self._monitor_loop, daemon=True)
self.monitoring_thread.start()
logging.basicConfig(level=logging.INFO)
self.logger = logging.getLogger(__name__)
def log_interaction(self,
user_input: str,
llm_response: str,
response_time_ms: float,
user_feedback: Optional[str] = None,
category: Optional[str] = None):
"""Log a production interaction for monitoring"""
timestamp = datetime.now()
# Perform real-time quality checks
quality_scores = self._assess_response_quality(
user_input, llm_response, category
)
interaction_data = {
"timestamp": timestamp,
"user_input": user_input,
"llm_response": llm_response,
"response_time_ms": response_time_ms,
"user_feedback": user_feedback,
"category": category,
"quality_scores": quality_scores
}
# Update sliding window metrics
with self.lock:
self.metrics_window["interactions"].append(interaction_data)
self.metrics_window["response_times"].append((timestamp, response_time_ms))
self.metrics_window["quality_scores"].append((timestamp, quality_scores))
# Add user feedback if provided
if user_feedback:
feedback_score = self._parse_feedback_score(user_feedback)
self.metrics_window["user_satisfaction"].append((timestamp, feedback_score))
# Check for immediate alerts
self._check_quality_alerts(quality_scores, interaction_data)
def _assess_response_quality(self,
user_input: str,
llm_response: str,
category: Optional[str]) -> Dict[str, float]:
"""Perform real-time quality assessment"""
scores = {}
# Content completeness check
if category:
is_complete, issues = self.content_validator.check_response_completeness(
llm_response, category
)
scores["content_completeness"] = 1.0 if is_complete else 0.0
scores["content_issues_count"] = len(issues)
# Response length appropriateness
is_appropriate, _ = self.content_validator.check_response_length(llm_response)
scores["length_appropriate"] = 1.0 if is_appropriate else 0.0
# Refusal detection
refusal_indicators = ["I cannot", "I'm not able", "I don't know", "I can't help"]
contains_refusal = any(indicator.lower() in llm_response.lower()
for indicator in refusal_indicators)
scores["contains_refusal"] = 1.0 if contains_refusal else 0.0
return scores
def _parse_feedback_score(self, feedback: str) -> float:
"""Convert user feedback to numeric score"""
feedback_lower = feedback.lower()
if any(word in feedback_lower for word in ["excellent", "great", "perfect"]):
return 1.0
elif any(word in feedback_lower for word in ["good", "helpful", "thanks"]):
return 0.8
elif any(word in feedback_lower for word in ["okay", "fine", "adequate"]):
return 0.6
elif any(word in feedback_lower for word in ["poor", "bad", "unhelpful"]):
return 0.2
else:
return 0.5 # Neutral/unclear feedback
def _check_quality_alerts(self, quality_scores: Dict[str, float], interaction_data: Dict):
"""Check if quality scores trigger alerts"""
# High refusal rate alert
if quality_scores.get("contains_refusal", 0) > 0:
self.logger.warning(f"Refusal detected in interaction: {interaction_data['user_input'][:100]}...")
# Content completeness alert
if quality_scores.get("content_completeness", 1) < self.alert_thresholds.get("min_content_completeness", 0.8):
self.logger.warning(f"Content completeness below threshold: {quality_scores['content_completeness']}")
# Response time alert
if interaction_data["response_time_ms"] > self.alert_thresholds.get("max_response_time_ms", 5000):
self.logger.warning(f"Response time exceeded threshold: {interaction_data['response_time_ms']}ms")
def get_current_metrics(self) -> Dict[str, Any]:
"""Get current performance metrics from sliding window"""
cutoff_time = datetime.now() - timedelta(minutes=self.window_minutes)
with self.lock:
# Filter to current window
current_interactions = [
interaction for interaction in self.metrics_window["interactions"]
if interaction["timestamp"] > cutoff_time
]
current_response_times = [
rt for ts, rt in self.metrics_window["response_times"]
if ts > cutoff_time
]
current_satisfaction = [
score for ts, score in self.metrics_window["user_satisfaction"]
if ts > cutoff_time
]
if not current_interactions:
return {"message": "No interactions in current window"}
# Calculate aggregated metrics
total_interactions = len(current_interactions)
# Quality metrics
avg_content_completeness = np.mean([
interaction["quality_scores"].get("content_completeness", 1)
for interaction in current_interactions
])
refusal_rate = np.mean([
interaction["quality_scores"].get("contains_refusal", 0)
for interaction in current_interactions
])
# Performance metrics
avg_response_time = np.mean(current_response_times) if current_response_times else 0
p95_response_time = np.percentile(current_response_times, 95) if current_response_times else 0
# User satisfaction
avg_satisfaction = np.mean(current_satisfaction) if current_satisfaction else None
return {
"window_minutes": self.window_minutes,
"total_interactions": total_interactions,
"avg_content_completeness": avg_content_completeness,
"refusal_rate": refusal_rate,
"avg_response_time_ms": avg_response_time,
"p95_response_time_ms": p95_response_time,
"avg_user_satisfaction": avg_satisfaction,
"satisfaction_samples": len(current_satisfaction)
}
def _monitor_loop(self):
"""Background monitoring loop"""
while True:
try:
metrics = self.get_current_metrics()
# Check alert thresholds
if isinstance(metrics, dict) and "total_interactions" in metrics:
self._check_aggregate_alerts(metrics)
# Clean old data from sliding window
self._cleanup_old_data()
time.sleep(300) # Check every 5 minutes
except Exception as e:
self.logger.error(f"Monitoring loop error: {str(e)}")
time.sleep(60) # Wait before retrying
def _check_aggregate_alerts(self, metrics: Dict[str, Any]):
"""Check aggregate metrics against alert thresholds"""
if metrics["refusal_rate"] > self.alert_thresholds.get("max_refusal_rate", 0.1):
self.logger.error(f"High refusal rate detected: {metrics['refusal_rate']:.2%}")
if metrics["avg_content_completeness"] < self.alert_thresholds.get("min_content_completeness", 0.8):
self.logger.error(f"Low content completeness: {metrics['avg_content_completeness']:.2f}")
if metrics["p95_response_time_ms"] > self.alert_thresholds.get("max_p95_response_time_ms", 10000):
self.logger.error(f"High p95 response time: {metrics['p95_response_time_ms']:.0f}ms")
if metrics["avg_user_satisfaction"] and metrics["avg_user_satisfaction"] < self.alert_thresholds.get("min_satisfaction", 0.6):
self.logger.error(f"Low user satisfaction: {metrics['avg_user_satisfaction']:.2f}")
def _cleanup_old_data(self):
"""Remove data outside the sliding window"""
cutoff_time = datetime.now() - timedelta(minutes=self.window_minutes)
with self.lock:
# Clean interactions
self.metrics_window["interactions"] = deque([
interaction for interaction in self.metrics_window["interactions"]
if interaction["timestamp"] > cutoff_time
])
# Clean response times
self.metrics_window["response_times"] = deque([
(ts, rt) for ts, rt in self.metrics_window["response_times"]
if ts > cutoff_time
])
# Clean quality scores
self.metrics_window["quality_scores"] = deque([
(ts, scores) for ts, scores in self.metrics_window["quality_scores"]
if ts > cutoff_time
])
# Clean user satisfaction
self.metrics_window["user_satisfaction"] = deque([
(ts, score) for ts, score in self.metrics_window["user_satisfaction"]
if ts > cutoff_time
])
# Usage example
monitor = ProductionMonitor(
alert_thresholds={
"max_refusal_rate": 0.05,
"min_content_completeness": 0.85,
"max_response_time_ms": 3000,
"max_p95_response_time_ms": 8000,
"min_satisfaction": 0.7
}
)
# In your application code
def handle_user_request(user_input: str, category: str) -> str:
start_time = time.time()
# Generate LLM response
llm_response = your_llm_function(user_input)
response_time_ms = (time.time() - start_time) * 1000
# Log for monitoring
monitor.log_interaction(
user_input=user_input,
llm_response=llm_response,
response_time_ms=response_time_ms,
category=category
)
return llm_response
# Check current metrics
current_metrics = monitor.get_current_metrics()
print(f"Current performance: {current_metrics}")
Let's build a complete evaluation system for a document summarization application. This exercise combines all the techniques we've covered into a practical implementation.
You'll create an evaluation system for an LLM that summarizes technical documents. The system should assess summary quality across multiple dimensions and provide actionable feedback for improvement.
import json
import numpy as np
from datetime import datetime
from typing import Any, Dict, List

# First, let's create our test dataset
test_documents = [
{
"id": "tech_001",
"title": "Machine Learning Model Deployment Best Practices",
"content": """
Deploying machine learning models to production requires careful consideration of multiple factors including model versioning, monitoring, and scalability.
Model versioning ensures that different iterations of your model can be tracked and rolled back if necessary. Use semantic versioning (e.g., 1.2.3) where major versions indicate breaking changes, minor versions add functionality, and patch versions fix bugs.
Monitoring is crucial for detecting model drift, where the input data distribution changes over time, causing performance degradation. Implement both data drift detection (monitoring input feature distributions) and concept drift detection (monitoring model performance metrics).
Scalability considerations include choosing appropriate infrastructure (cloud vs. on-premise), implementing load balancing for high-traffic scenarios, and optimizing inference latency through techniques like model quantization or caching frequent predictions.
Security measures should include input validation to prevent adversarial attacks, secure API endpoints with proper authentication, and audit logging for compliance requirements.
""",
"expected_summary": "Key considerations for ML model deployment include version control using semantic versioning, monitoring for data and concept drift, scalability through proper infrastructure and optimization techniques, and security via input validation and secure APIs.",
"key_points": [
"Model versioning with semantic versioning",
"Monitoring for data drift and concept drift",
"Scalability through infrastructure and optimization",
"Security via input validation and secure APIs"
],
"category": "technical_summary",
"difficulty": "medium"
},
# Add more test cases...
]
class DocumentSummarizationEvaluator:
"""Comprehensive evaluator for document summarization"""
def __init__(self):
self.semantic_evaluator = SemanticEvaluator()
self.llm_judge = LLMJudge()
# Initialize ROUGE evaluator for text summarization
try:
from rouge_score import rouge_scorer
self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
except ImportError:
print("Warning: rouge_score not installed. ROUGE metrics will be skipped.")
self.rouge_scorer = None
def evaluate_summary(self,
original_document: str,
generated_summary: str,
reference_summary: str,
key_points: List[str]) -> Dict[str, Any]:
"""Comprehensive evaluation of a generated summary"""
evaluation = {
"timestamp": datetime.now().isoformat(),
"summary_length": len(generated_summary.split()),
"compression_ratio": len(original_document.split()) / len(generated_summary.split())
}
# 1. ROUGE scores (if available)
if self.rouge_scorer:
rouge_scores = self.rouge_scorer.score(reference_summary, generated_summary)
evaluation["rouge_scores"] = {
"rouge1": rouge_scores['rouge1'].fmeasure,
"rouge2": rouge_scores['rouge2'].fmeasure,
"rougeL": rouge_scores['rougeL'].fmeasure
}
# 2. Semantic similarity to reference
semantic_sim = self.semantic_evaluator.calculate_similarity(
generated_summary, reference_summary
)
evaluation["semantic_similarity"] = semantic_sim
# 3. Key point coverage analysis
key_point_coverage = self._evaluate_key_point_coverage(
generated_summary, key_points
)
evaluation["key_point_coverage"] = key_point_coverage
# 4. LLM judge evaluation
judge_criteria = [
"Accuracy: Does the summary accurately represent the main points?",
"Completeness: Are all important topics covered?",
"Conciseness: Is the summary appropriately concise without losing key information?",
"Clarity: Is the summary clear and well-written?",
"Coherence: Does the summary flow logically?"
]
judge_result = self.llm_judge.evaluate_response(
f"Original document:\n{original_document}\n\nGenerated summary:\n{generated_summary}",
generated_summary,
judge_criteria
)
evaluation["judge_evaluation"] = {
"overall_score": judge_result.score,
"reasoning": judge_result.reasoning,
"criteria_scores": judge_result.criteria_scores
}
# 5. Summary quality rules
quality_checks = self._check_summary_quality_rules(generated_summary)
evaluation["quality_checks"] = quality_checks
# 6. Calculate overall score
evaluation["overall_score"] = self._calculate_overall_score(evaluation)
return evaluation
def _evaluate_key_point_coverage(self,
summary: str,
key_points: List[str]) -> Dict[str, Any]:
"""Evaluate how well the summary covers key points"""
summary_lower = summary.lower()
coverage_scores = []
for point in key_points:
# Simple keyword-based coverage (could be enhanced with semantic similarity)
point_keywords = point.lower().split()
keyword_matches = sum(1 for keyword in point_keywords if keyword in summary_lower)
coverage_score = keyword_matches / len(point_keywords)
coverage_scores.append(coverage_score)
return {
"individual_coverage": coverage_scores,
"average_coverage": np.mean(coverage_scores),
"points_well_covered": len([score for score in coverage_scores if score > 0.5]),
"total_key_points": len(key_points)
}
def _check_summary_quality_rules(self, summary: str) -> Dict[str, Any]:
"""Apply rule-based quality checks"""
checks = {}
# Length appropriateness
word_count = len(summary.split())
checks["appropriate_length"] = 50 <= word_count <= 200
checks["word_count"] = word_count
# Sentence structure
sentences = summary.split('.')
sentence_count = len([s for s in sentences if s.strip()])
checks["sentence_count"] = sentence_count
checks["appropriate_sentence_count"] = 2 <= sentence_count <= 8
# Avoid common summarization issues
checks["no_repetition"] = not self._has_significant_repetition(summary)
checks["no_first_person"] = not any(phrase in summary.lower()
for phrase in ["i think", "in my opinion", "i believe"])
checks["proper_capitalization"] = summary[0].isupper() if summary else False
return checks
def _has_significant_repetition(self, text: str) -> bool:
"""Check for significant repetition in the text"""
words = text.lower().split()
if len(words) < 10:
return False
# Check for repeated phrases of 3+ words
for i in range(len(words) - 5):
phrase = ' '.join(words[i:i+3])
remaining_text = ' '.join(words[i+3:])
if phrase in remaining_text:
return True
return False
def _calculate_overall_score(self, evaluation: Dict[str, Any]) -> float:
"""Calculate weighted overall score from all evaluation components"""
score_components = []
# Semantic similarity (weight: 0.25)
if "semantic_similarity" in evaluation:
score_components.append((evaluation["semantic_similarity"], 0.25))
# ROUGE scores (weight: 0.25)
if "rouge_scores" in evaluation:
rouge_avg = np.mean(list(evaluation["rouge_scores"].values()))
score_components.append((rouge_avg, 0.25))
# Key point coverage (weight: 0.20)
coverage_score = evaluation["key_point_coverage"]["average_coverage"]
score_components.append((coverage_score, 0.20))
# LLM judge score (weight: 0.20, normalized to 0-1 scale)
judge_score = evaluation["judge_evaluation"]["overall_score"] / 10.0
score_components.append((judge_score, 0.20))
# Quality checks (weight: 0.10)
quality_checks = evaluation["quality_checks"]
quality_score = np.mean([
1.0 if quality_checks["appropriate_length"] else 0.0,
1.0 if quality_checks["appropriate_sentence_count"] else 0.0,
1.0 if quality_checks["no_repetition"] else 0.0,
1.0 if quality_checks["no_first_person"] else 0.0,
1.0 if quality_checks["proper_capitalization"] else 0.0
])
score_components.append((quality_score, 0.10))
# Calculate weighted average
total_weight = sum(weight for _, weight in score_components)
weighted_sum = sum(score * weight for score, weight in score_components)
return weighted_sum / total_weight if total_weight > 0 else 0.0
# Your task: Implement the summarization function and run evaluation
def your_summarization_function(document_text: str) -> str:
"""
TODO: Implement your summarization logic here
This could use:
- An LLM API call with appropriate prompts
- A fine-tuned summarization model
- Traditional extractive summarization techniques
For this exercise, you can use OpenAI's API or any other approach
"""
# Example implementation using OpenAI
import openai
client = openai.OpenAI()
prompt = f"""
Please provide a concise summary of the following technical document.
Focus on the main points and key takeaways. Keep the summary between 50-150 words.
Document:
{document_text}
Summary:
"""
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
temperature=0.3,
max_tokens=200
)
return response.choices[0].message.content.strip()
# Run the evaluation
def run_summarization_evaluation():
"""Run evaluation on the test dataset"""
evaluator = DocumentSummarizationEvaluator()
results = []
for doc in test_documents:
print(f"Evaluating document: {doc['id']}")
# Generate summary
generated_summary = your_summarization_function(doc["content"])
# Evaluate the summary
evaluation = evaluator.evaluate_summary(
original_document=doc["content"],
generated_summary=generated_summary,
reference_summary=doc["expected_summary"],
key_points=doc["key_points"]
)
evaluation["document_id"] = doc["id"]
evaluation["generated_summary"] = generated_summary
results.append(evaluation)
print(f"Overall score: {evaluation['overall_score']:.3f}")
print(f"Generated summary: {generated_summary[:100]}...")
print("-" * 50)
# Calculate aggregate statistics
overall_scores = [r["overall_score"] for r in results]
print(f"\nAggregate Results:")
print(f"Mean overall score: {np.mean(overall_scores):.3f}")
print(f"Min overall score: {min(overall_scores):.3f}")
print(f"Max overall score: {max(overall_scores):.3f}")
# Identify areas for improvement
print(f"\nDetailed Analysis:")
for result in results:
print(f"\nDocument {result['document_id']}:")
print(f" Overall Score: {result['overall_score']:.3f}")
        print(f"  Semantic Similarity: {result.get('semantic_similarity', float('nan')):.3f}")
print(f" Key Point Coverage: {result['key_point_coverage']['average_coverage']:.3f}")
print(f" Judge Score: {result['judge_evaluation']['overall_score']:.1f}/10")
# Highlight specific issues
quality_issues = []
quality_checks = result['quality_checks']
if not quality_checks['appropriate_length']:
quality_issues.append(f"Length issue (current: {quality_checks['word_count']} words)")
if not quality_checks['no_repetition']:
quality_issues.append("Contains repetitive content")
if not quality_checks['appropriate_sentence_count']:
quality_issues.append(f"Sentence count issue (current: {quality_checks['sentence_count']})")
if quality_issues:
print(f" Quality Issues: {', '.join(quality_issues)}")
return results
# Run the evaluation
if __name__ == "__main__":
evaluation_results = run_summarization_evaluation()
# Save results for further analysis
with open(f"summarization_evaluation_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json", "w") as f:
json.dump(evaluation_results, f, indent=2, default=str)
Exercise Instructions:
Implement the summarization function: Choose your approach (OpenAI API, Hugging Face model, or another method) and implement your_summarization_function().
Run the evaluation: Execute the evaluation script and analyze the results. Pay attention to which aspects score well and which need improvement.
Iterate and improve: Based on the evaluation results, modify your summarization approach and re-run the evaluation to confirm the weak areas improve.
Add more test cases: Expand the test_documents list with additional examples covering different domains and difficulty levels.
Customize evaluation criteria: Modify the evaluation criteria in the LLM judge to match your specific requirements.
Problem: Test cases don't reflect real-world usage patterns.
Solution: Continuously update your evaluation dataset based on production data. Set up processes to regularly sample and annotate real user interactions.
def update_evaluation_dataset_from_production(production_logs: List[Dict],
sample_rate: float = 0.01) -> List[EvaluationExample]:
"""Sample and prepare production data for evaluation dataset"""
# Sample random subset
import random
sampled_logs = random.sample(production_logs,
int(len(production_logs) * sample_rate))
evaluation_examples = []
for log in sampled_logs:
# Only include interactions with user feedback
if log.get("user_rating") is not None:
example = EvaluationExample(
id=f"prod_{log['timestamp']}_{log['user_id']}",
input_text=log["user_input"],
expected_output=log["llm_response"] if log["user_rating"] >= 4 else "",
category=log.get("category", "unknown"),
difficulty="real_world",
tags=["production", f"rating_{log['user_rating']}"]
)
evaluation_examples.append(example)
return evaluation_examples
Problem: LLM judges provide inconsistent scores for similar outputs.
Solution: Use multiple judge evaluations and aggregate results. Also, provide more specific evaluation criteria and examples.
def get_consensus_evaluation(input_text: str,
llm_output: str,
criteria: List[str],
num_judges: int = 3) -> EvaluationResult:
"""Get consensus evaluation from multiple LLM judges"""
judge = LLMJudge()
evaluations = []
for i in range(num_judges):
# Add slight variation to reduce identical responses
modified_criteria = criteria + [f"(Evaluation run {i+1})"]
result = judge.evaluate_response(input_text, llm_output, modified_criteria)
evaluations.append(result)
# Calculate consensus
    scores = [e.score for e in evaluations]
consensus_score = np.median(scores) # Use median for robustness
# Identify high-variance evaluations
score_variance = np.var(scores)
if score_variance > 4.0: # High disagreement
print(f"Warning: High variance in judge evaluations: {scores}")
return EvaluationResult(
score=consensus_score,
reasoning=f"Consensus of {num_judges} evaluations. Scores: {scores}",
criteria_scores={} # Could aggregate individual criteria scores
)
Problem: Comprehensive evaluation is too slow for development workflows.
Solution: Implement tiered evaluation - fast checks for development, comprehensive evaluation for pre-production.
class TieredEvaluator:
"""Provides fast and comprehensive evaluation modes"""
def __init__(self):
self.format_validator = FormatValidator()
self.content_validator = ContentValidator()
self.semantic_evaluator = SemanticEvaluator()
self.llm_judge = LLMJudge()
def quick_evaluation(self, response: str, test_case: Dict) -> Dict[str, Any]:
"""Fast evaluation for development workflow"""
start_time = time.time()
# Only run fast checks
results = {}
# Format validation
if test_case.get("expected_format"):
is_valid, msg = self.format_validator.validate_json_response(response)
results["format_valid"] = is_valid
# Basic content checks
is_complete, issues = self.content_validator.check_response_completeness(
response, test_case.get("category", "")
)
results["content_complete"] = is_complete
results["issue_count"] = len(issues)
results["evaluation_time_ms"] = (time.time() - start_time) * 1000
results["evaluation_mode"] = "quick"
return results
def comprehensive_evaluation(self, response: str, test_case: Dict) -> Dict[str, Any]:
"""Thorough evaluation for pre-production validation"""
start_time = time.time()
# Run all evaluation methods
results = self.quick_evaluation(response, test_case)
# Add semantic similarity
if test_case.get("expected_output"):
similarity = self.semantic_evaluator.calculate_similarity(
response, test_case["expected_output"]
)
results["semantic_similarity"] = similarity
# Add LLM judge evaluation
if test_case.get("evaluation_criteria"):
judge_result = self.llm_judge.evaluate_response(
test_case["input_text"], response, test_case["evaluation_criteria"]
)
results["judge_score"] = judge_result.score
results["judge_reasoning"] = judge_result.reasoning
results["evaluation_time_ms"] = (time.time() - start_time) * 1000
results["evaluation_mode"] = "comprehensive"
return results
Problem: Same inputs produce different outputs, making evaluation inconsistent.
Solution: Run multiple evaluations and use statistical methods to account for variance.
def evaluate_with_variance_analysis(llm_function,
test_input: str,
num_runs: int = 5) -> Dict[str, Any]:
"""Evaluate accounting for LLM non-determinism"""
results = []
responses = []
for run in range(num_runs):
response = llm_function(test_input)
responses.append(response)
# Run evaluation on this response
evaluation = your_evaluation_function(response, test_input)
results.append(evaluation)
# Calculate statistics
scores = [r.get("overall_score", 0) for r in results]
return {
"mean_score": np.mean(scores),
"std_score": np.std(scores),
"min_score": min(scores),
"max_score": max(scores),
"confidence_interval_95": (
np.mean(scores) - 1.96 * np.std(scores) / np.sqrt(len(scores)),
np.mean(scores) + 1.96 * np.std(scores) / np.sqrt(len(scores))
),
"all_responses": responses,
"all_results": results,
"num_runs": num_runs
}
You've built a comprehensive testing and evaluation framework for LLM applications that combines multiple evaluation approaches, integrates with development workflows, and provides production monitoring. This framework addresses the unique challenges of evaluating non-deterministic AI systems while maintaining practical usability for development teams.
Key takeaways from this lesson:
Immediate next steps:
Advanced topics to explore:
The evaluation strategies you've learned scale from simple applications to complex multi-agent systems. As LLM capabilities continue advancing, robust evaluation remains the foundation for building reliable, production-ready AI applications that users can trust.