
You're building an AI-powered customer support chatbot for your e-commerce company. During the first week of testing, your AWS bill shows $847 in language model API costs. Your manager calls an emergency meeting, and suddenly you're tasked with cutting AI expenses by 75% while maintaining the same quality of responses. Sound familiar?
This scenario plays out daily across organizations implementing Large Language Models (LLMs). The challenge isn't just building AI applications—it's building them cost-effectively. Many developers focus solely on functionality, only to discover that their "smart" chatbot costs more per conversation than hiring a human agent.
By the end of this lesson, you'll have practical skills to dramatically reduce your LLM costs without sacrificing quality. We'll explore three fundamental cost optimization strategies that can transform your AI budget from a runaway expense into a controlled, predictable investment.
What you'll learn:
- How to count tokens and monitor LLM spending before the bill arrives
- How to cache responses so you never pay twice for the same answer
- How to match each task to the cheapest model that can handle it
Before diving into cost optimization, you should have:
- A working Python environment
- Access to an LLM API (the examples use OpenAI-style models)
- Basic familiarity with making API calls
Think of tokens as the "currency" of language models. Just as you pay per gallon of gas or per kilowatt-hour of electricity, you pay per token processed by an LLM. But unlike gas or electricity, tokens are abstract units that don't directly correspond to words or characters.
A token is the smallest unit of text that a language model processes. The tokenization process breaks your input text into these units using specific rules. For most modern LLMs:
- One token is roughly 4 characters, or about 0.75 words, of English text
- Common words are usually a single token, while rare or long words are split into several
- Punctuation, whitespace, and formatting all consume tokens too
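To see tokenization in action, here's a minimal sketch using the tiktoken library (the same tokenizer the tracking code later in this lesson relies on); the sample sentence is just an illustration:
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
text = "The delivery was extremely slow."
token_ids = encoding.encode(text)
print(f"Token count: {len(token_ids)}")
# Decode each ID individually to see where the token boundaries fall
print([encoding.decode([tid]) for tid in token_ids])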
Here's a practical example. Let's say you send this prompt to GPT-4:
Please analyze this customer feedback: "The delivery was extremely slow and the package arrived damaged. Very disappointed with the service."
This seemingly simple request consumes about 25 tokens for the prompt, plus whatever tokens the model generates in response. If GPT-4 responds with a 150-word analysis (roughly 200 tokens), you're looking at about 225 total tokens, costing around $0.013 per interaction at the rates quoted below.
That might seem trivial, but multiply by 10,000 customer interactions daily, and you're spending about $130/day, or roughly $47,000/year, on this one use case.
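The arithmetic is worth making explicit. Here's a quick sanity check using the GPT-4 list prices assumed throughout this lesson ($0.03 per 1K input tokens, $0.06 per 1K output tokens):
# Back-of-envelope cost math for the example above
input_tokens, output_tokens = 25, 200
input_cost = (input_tokens / 1000) * 0.03    # $0.00075
output_cost = (output_tokens / 1000) * 0.06  # $0.01200
per_interaction = input_cost + output_cost   # ~$0.013
daily = per_interaction * 10_000
print(f"${per_interaction:.4f}/interaction, ${daily:.0f}/day, ${daily * 365:,.0f}/year")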
Before optimizing, you need visibility. Here's how to implement accurate token counting:
import tiktoken
import openai  # used for real API calls in later examples
from collections import defaultdict

class TokenTracker:
    def __init__(self, model="gpt-4"):
        self.model = model
        self.encoding = tiktoken.encoding_for_model(model)
        self.usage_log = defaultdict(int)
        # Current pricing (update these regularly)
        self.pricing = {
            "gpt-4": {"input": 0.03, "output": 0.06},  # per 1K tokens
            "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
            "claude-3-sonnet": {"input": 0.003, "output": 0.015}
        }

    def count_tokens(self, text):
        """Count tokens in a text string"""
        return len(self.encoding.encode(text))

    def estimate_cost(self, prompt, expected_response_length=100):
        """Estimate cost before making an API call"""
        input_tokens = self.count_tokens(prompt)
        output_tokens = expected_response_length  # rough estimate
        input_cost = (input_tokens / 1000) * self.pricing[self.model]["input"]
        output_cost = (output_tokens / 1000) * self.pricing[self.model]["output"]
        return {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "estimated_cost": input_cost + output_cost,
            "breakdown": {"input_cost": input_cost, "output_cost": output_cost}
        }

    def log_usage(self, prompt, response, actual_cost=None):
        """Log actual usage for tracking"""
        input_tokens = self.count_tokens(prompt)
        output_tokens = self.count_tokens(response)
        self.usage_log["total_input_tokens"] += input_tokens
        self.usage_log["total_output_tokens"] += output_tokens
        self.usage_log["total_requests"] += 1
        if actual_cost is not None:
            self.usage_log["estimated_total_cost"] += actual_cost
        else:
            # Fall back to an estimate based on the observed response length
            cost = self.estimate_cost(prompt, output_tokens)["estimated_cost"]
            self.usage_log["estimated_total_cost"] += cost

    def get_usage_report(self):
        """Generate usage summary"""
        return {
            "total_requests": self.usage_log["total_requests"],
            "total_tokens": self.usage_log["total_input_tokens"] + self.usage_log["total_output_tokens"],
            "input_tokens": self.usage_log["total_input_tokens"],
            "output_tokens": self.usage_log["total_output_tokens"],
            "estimated_cost": self.usage_log["estimated_total_cost"],
            "avg_cost_per_request": self.usage_log["estimated_total_cost"] / max(1, self.usage_log["total_requests"])
        }

# Example usage
tracker = TokenTracker("gpt-4")
prompt = """Analyze this customer feedback and provide actionable recommendations:
'The checkout process was confusing and took forever. I almost gave up.'"""

# Check cost before the API call
cost_estimate = tracker.estimate_cost(prompt, expected_response_length=150)
print(f"Estimated cost: ${cost_estimate['estimated_cost']:.4f}")
print(f"Input tokens: {cost_estimate['input_tokens']}")

# Make your API call here...
response = "Based on the feedback, I recommend simplifying the checkout flow..."

# Log the actual usage
tracker.log_usage(prompt, response)
Pro tip: Set up automated alerts when your daily token usage exceeds specific thresholds. This prevents bill shock and helps you identify cost spikes immediately.
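As a rough sketch of that tip, you can layer a check on top of the TokenTracker above. The budget number and the plain print are placeholders for your own threshold and alerting channel:
DAILY_BUDGET_USD = 25.0  # hypothetical daily budget, tune to your needs

def check_budget_alert(tracker, budget=DAILY_BUDGET_USD):
    """Warn once estimated spend crosses the daily budget."""
    spend = tracker.get_usage_report()["estimated_cost"]
    if spend >= budget:
        # Swap print for your real channel: email, Slack, PagerDuty...
        print(f"ALERT: ${spend:.2f} spent, over the ${budget:.2f} daily budget")
    return spend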
Now that you're tracking tokens, let's reduce them:
1. Prompt Engineering for Efficiency
Instead of verbose prompts, use concise, direct language:
# Inefficient (47 tokens)
verbose_prompt = """
I would like you to please carefully analyze the following customer feedback
and then provide me with detailed recommendations for improvement. Here is
the feedback: "Product quality was poor."
"""
# Efficient (12 tokens)
concise_prompt = """Analyze feedback and suggest improvements: "Product quality was poor." """
print(f"Verbose: {tracker.count_tokens(verbose_prompt)} tokens")
print(f"Concise: {tracker.count_tokens(concise_prompt)} tokens")
# Savings: 75% reduction in input tokens
2. Template-Based Responses
For repetitive tasks, create reusable templates:
def create_analysis_prompt(feedback_text, analysis_type="general"):
    templates = {
        "general": f"Analyze: {feedback_text}",
        "sentiment": f"Sentiment of: {feedback_text}",
        "categories": f"Categorize: {feedback_text}"
    }
    return templates[analysis_type]
# Instead of crafting unique prompts each time
feedback = "Delivery was late"
prompt = create_analysis_prompt(feedback, "sentiment")
print(f"Template prompt: {prompt}")
print(f"Tokens: {tracker.count_tokens(prompt)}")
Caching is your most powerful weapon against unnecessary API costs. Think of it as creating a "memory" for your AI application—if you've already paid for an answer once, why pay again for the same question?
Consider these scenarios where caching provides massive savings:
FAQ responses: Customer service bots often get identical questions. "What's your return policy?" doesn't need a fresh API call every time.
Document analysis: If multiple users upload the same contract template, cache the analysis.
Translation requests: The same product descriptions translated repeatedly across different sessions.
Here's a production-ready caching implementation:
import hashlib
import json
import pickle
from datetime import datetime

import redis

class LLMCache:
    def __init__(self, redis_host='localhost', redis_port=6379, default_ttl=3600):
        self.redis_client = redis.Redis(host=redis_host, port=redis_port, decode_responses=False)
        self.default_ttl = default_ttl  # time to live in seconds
        self.hit_count = 0
        self.miss_count = 0

    def _generate_cache_key(self, prompt, model, temperature=0.0):
        """Generate a unique cache key from prompt parameters"""
        # Include every parameter that affects the output
        cache_data = {
            "prompt": prompt,
            "model": model,
            "temperature": temperature
        }
        cache_string = json.dumps(cache_data, sort_keys=True)
        return hashlib.sha256(cache_string.encode()).hexdigest()

    def get(self, prompt, model, temperature=0.0):
        """Retrieve a cached response if available"""
        cache_key = self._generate_cache_key(prompt, model, temperature)
        try:
            cached_data = self.redis_client.get(cache_key)
            if cached_data:
                self.hit_count += 1
                result = pickle.loads(cached_data)
                result['cached'] = True
                result['cache_hit_time'] = datetime.now().isoformat()
                return result
            self.miss_count += 1
            return None
        except Exception as e:
            print(f"Cache retrieval error: {e}")
            self.miss_count += 1
            return None

    def set(self, prompt, model, response, cost, temperature=0.0, ttl=None):
        """Cache a response with metadata"""
        cache_key = self._generate_cache_key(prompt, model, temperature)
        ttl = ttl or self.default_ttl
        cache_data = {
            "response": response,
            "cost": cost,
            "timestamp": datetime.now().isoformat(),
            "model": model,
            "temperature": temperature
        }
        try:
            self.redis_client.setex(cache_key, ttl, pickle.dumps(cache_data))
        except Exception as e:
            print(f"Cache storage error: {e}")

    def get_cache_stats(self):
        """Get cache performance statistics"""
        total_requests = self.hit_count + self.miss_count
        hit_rate = self.hit_count / max(1, total_requests)
        return {
            "cache_hits": self.hit_count,
            "cache_misses": self.miss_count,
            "hit_rate": hit_rate,
            "total_requests": total_requests
        }

    def invalidate_pattern(self, pattern):
        """Invalidate cache entries matching a pattern"""
        # Useful for clearing caches when the underlying data changes
        keys = self.redis_client.keys(pattern)
        if keys:
            self.redis_client.delete(*keys)
            return len(keys)
        return 0
# Integration with your LLM calls
class CachedLLMClient:
    def __init__(self, cache, token_tracker):
        self.cache = cache
        self.tracker = token_tracker

    def query(self, prompt, model="gpt-4", temperature=0.0):
        """Query with caching"""
        # Check the cache first
        cached_result = self.cache.get(prompt, model, temperature)
        if cached_result:
            print("Cache hit! Saved an API call.")
            return cached_result

        # Cache miss: make the API call
        print("Cache miss - calling API...")
        cost_estimate = self.tracker.estimate_cost(prompt)

        # Make the actual API call (placeholder)
        response = self._call_llm_api(prompt, model, temperature)
        actual_cost = cost_estimate['estimated_cost']  # replace with the actual cost from the API response

        # Cache the result and track usage
        self.cache.set(prompt, model, response, actual_cost, temperature)
        self.tracker.log_usage(prompt, response, actual_cost)

        return {
            "response": response,
            "cost": actual_cost,
            "model": model,
            "cached": False,
            "timestamp": datetime.now().isoformat()
        }

    def _call_llm_api(self, prompt, model, temperature):
        """Placeholder for your actual OpenAI/Claude/etc. API call"""
        return f"AI response to: {prompt[:50]}..."
# Example usage
cache = LLMCache()
tracker = TokenTracker("gpt-4")
client = CachedLLMClient(cache, tracker)
# First call - cache miss
result1 = client.query("What is your return policy?")
print(f"First call cost: ${result1['cost']:.4f}")
# Second identical call - cache hit
result2 = client.query("What is your return policy?")
print(f"Second call cost: $0.0000 (cached)")
# Check cache performance
stats = cache.get_cache_stats()
print(f"Cache hit rate: {stats['hit_rate']:.2%}")
Semantic Caching: Cache similar questions, not just identical ones:
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache(LLMCache):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.similarity_threshold = 0.9

    def find_similar_cached(self, prompt, model, temperature=0.0):
        """Find semantically similar cached responses"""
        # Assumes entries were stored under "semantic:{model}:..." keys with an
        # 'embedding' field; set() would need to be overridden accordingly.
        prompt_embedding = self.encoder.encode([prompt])

        # In production, use a vector database like Pinecone or Weaviate;
        # scanning Redis keys like this is a simplified example.
        pattern = f"semantic:{model}:*"
        keys = self.redis_client.keys(pattern)
        for key in keys:
            cached_data = pickle.loads(self.redis_client.get(key))
            cached_embedding = cached_data.get('embedding')
            if cached_embedding is not None:
                # Cosine similarity between the new prompt and the cached one
                similarity = np.dot(prompt_embedding[0], cached_embedding) / (
                    np.linalg.norm(prompt_embedding[0]) * np.linalg.norm(cached_embedding)
                )
                if similarity > self.similarity_threshold:
                    self.hit_count += 1
                    return cached_data
        self.miss_count += 1
        return None
Warning: Be careful with caching when temperature > 0, as LLMs are designed to give varied responses. Only cache deterministic queries (temperature = 0) unless you specifically want identical responses to similar questions.
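One way to enforce that rule mechanically is a guard that bypasses the cache whenever temperature is above zero. Here's a minimal sketch extending the CachedLLMClient above (the subclass name is just for illustration):
class SafeCachedLLMClient(CachedLLMClient):
    def query(self, prompt, model="gpt-4", temperature=0.0):
        # Nondeterministic requests never touch the cache
        if temperature > 0:
            response = self._call_llm_api(prompt, model, temperature)
            cost = self.tracker.estimate_cost(prompt)["estimated_cost"]
            self.tracker.log_usage(prompt, response, cost)
            return {"response": response, "cost": cost, "model": model,
                    "cached": False, "timestamp": datetime.now().isoformat()}
        return super().query(prompt, model, temperature)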
Choosing the right model for each task is like selecting the right tool for a job. You wouldn't use a sledgehammer to hang a picture frame, and you shouldn't use GPT-4 for simple text classification that GPT-3.5 can handle just as well.
Different models excel at different tasks and come with vastly different price points:
GPT-4: Premium model, excellent for complex reasoning, creative tasks, and nuanced analysis. 20x more expensive than GPT-3.5.
GPT-3.5-turbo: Great balance of capability and cost. Handles most business tasks effectively.
Claude-3-haiku: Fast and economical, excellent for simple analysis and classification.
Specialized models: Task-specific models often outperform general models while costing less.
class ModelSelector:
    def __init__(self):
        self.models = {
            "gpt-4": {
                "cost_per_1k_input": 0.03,
                "cost_per_1k_output": 0.06,
                "capabilities": ["complex_reasoning", "creative_writing", "code_generation", "analysis"],
                "max_tokens": 8192,
                "latency": "high"
            },
            "gpt-3.5-turbo": {
                "cost_per_1k_input": 0.0015,
                "cost_per_1k_output": 0.002,
                "capabilities": ["general_tasks", "simple_analysis", "basic_coding"],
                "max_tokens": 4096,
                "latency": "medium"
            },
            "claude-3-haiku": {
                "cost_per_1k_input": 0.00025,
                "cost_per_1k_output": 0.00125,
                "capabilities": ["classification", "simple_qa", "formatting"],
                "max_tokens": 4096,
                "latency": "low"
            }
        }
        self.task_model_map = {
            "sentiment_analysis": "claude-3-haiku",
            "content_classification": "gpt-3.5-turbo",
            "creative_writing": "gpt-4",
            "complex_analysis": "gpt-4",
            "simple_qa": "gpt-3.5-turbo",
            "data_extraction": "gpt-3.5-turbo",
            "code_review": "gpt-4"
        }

    def select_model(self, task_type, prompt_length, budget_priority=True):
        """Select the optimal model based on task and constraints"""
        recommended_model = self.task_model_map.get(task_type, "gpt-3.5-turbo")

        # If budget is the priority, consider cheaper alternatives first
        if budget_priority:
            if task_type in ["sentiment_analysis", "simple_qa"]:
                recommended_model = "claude-3-haiku"
            elif task_type in ["content_classification", "data_extraction"]:
                recommended_model = "gpt-3.5-turbo"

        # Then make sure the prompt fits the model's context window
        if prompt_length > self.models[recommended_model]["max_tokens"]:
            recommended_model = "gpt-4"  # upgrade to the larger context

        return recommended_model, self.models[recommended_model]

    def compare_costs(self, prompt, task_type, expected_output_length=100):
        """Compare costs across models for the same task"""
        # Rough token estimate; use tiktoken for exact counts
        input_tokens = len(prompt.split()) * 1.3
        results = {}
        for model_name, model_info in self.models.items():
            input_cost = (input_tokens / 1000) * model_info["cost_per_1k_input"]
            output_cost = (expected_output_length / 1000) * model_info["cost_per_1k_output"]
            results[model_name] = {
                "total_cost": input_cost + output_cost,
                "input_cost": input_cost,
                "output_cost": output_cost,
                "suitable_for_task": task_type in model_info.get("capabilities", [])
            }
        return results
# Example usage
selector = ModelSelector()

# Analyze a customer feedback classification task
task = "sentiment_analysis"
prompt = "Customer wrote: 'The product broke after one day. Terrible quality.'"
prompt_tokens = len(prompt.split()) * 1.3  # rough token estimate

# Get the optimal model
selected_model, model_info = selector.select_model(task, prompt_tokens, budget_priority=True)
print(f"Recommended model for {task}: {selected_model}")

# Compare costs across models
cost_comparison = selector.compare_costs(prompt, task)
for model, costs in cost_comparison.items():
    print(f"{model}: ${costs['total_cost']:.6f} (suitable: {costs['suitable_for_task']})")
Cascade Pattern: Start with the cheapest model, escalate only if needed:
class CascadeProcessor:
    def __init__(self, model_selector, llm_client):
        self.selector = model_selector
        self.client = llm_client
        self.escalation_keywords = ["complex", "unclear", "need more context"]

    def process_with_cascade(self, prompt, task_type):
        """Try cheaper models first, escalate if needed"""
        # Start with the most economical model
        models_to_try = ["claude-3-haiku", "gpt-3.5-turbo", "gpt-4"]
        for model in models_to_try:
            try:
                result = self.client.query(prompt, model=model, temperature=0)
                response = result["response"]
                # Check whether the response suggests the model struggled
                if not self._needs_escalation(response):
                    return {
                        "response": response,
                        "model_used": model,
                        "cost": result["cost"],
                        "escalated": model != models_to_try[0]
                    }
                print(f"{model} response unclear, trying next model...")
            except Exception as e:
                print(f"Error with {model}: {e}")
                continue
        raise Exception("All models failed")

    def _needs_escalation(self, response):
        """Decide whether response quality warrants a better model"""
        response_lower = response.lower()
        # Simple heuristics; in production, use more sophisticated checks
        if len(response) < 10:  # too short
            return True
        if any(keyword in response_lower for keyword in self.escalation_keywords):
            return True
        if "i'm not sure" in response_lower or "unclear" in response_lower:
            return True
        return False

# Example usage
processor = CascadeProcessor(selector, client)
result = processor.process_with_cascade(
    "Is this review positive or negative: 'Okay product, nothing special'",
    "sentiment_analysis"
)
print(f"Final result from {result['model_used']}: {result['response']}")
print(f"Cost: ${result['cost']:.4f}")
Batch Processing: Group similar requests to optimize model selection:
def batch_process_by_complexity(requests, batch_size=10):
    """Group requests by complexity and process each group with an appropriate model"""
    # Classify requests by complexity
    simple_requests = []
    complex_requests = []
    for req in requests:
        if len(req["prompt"]) < 200 and req["task_type"] in ["sentiment_analysis", "classification"]:
            simple_requests.append(req)
        else:
            complex_requests.append(req)

    results = []

    # Process simple requests in batches with the cheap model
    for i in range(0, len(simple_requests), batch_size):
        batch = simple_requests[i:i + batch_size]
        batch_prompt = "Process these requests:\n" + "\n".join(
            f"{j + 1}. {req['prompt']}" for j, req in enumerate(batch)
        )
        result = client.query(batch_prompt, model="claude-3-haiku")
        # Parse batch result and append to results...

    # Process complex requests individually with the better model
    for req in complex_requests:
        result = client.query(req["prompt"], model="gpt-4")
        results.append(result)

    return results
Now let's put everything together in a practical exercise. You'll build a cost-optimized customer feedback analysis system that demonstrates all three optimization strategies.
Your e-commerce company receives 1,000 customer reviews daily. Currently, you're using GPT-4 for all analysis, costing about $50/day. Your goal: maintain analysis quality while reducing costs by 70%.
import time
import random
from datetime import datetime

# Sample customer feedback data
sample_feedbacks = [
    "Great product, fast delivery!",
    "Package arrived damaged and customer service was unhelpful",
    "Average quality, nothing special but does the job",
    "Expensive but worth it for the quality",
    "Worst purchase ever, complete waste of money",
    "Good value for money, would recommend",
    "Delivery was delayed but product quality is excellent",
    "Interface is confusing, needs better instructions"
]

class FeedbackAnalyzer:
    def __init__(self):
        self.cache = LLMCache()
        self.tracker = TokenTracker("gpt-4")
        self.selector = ModelSelector()
        self.client = CachedLLMClient(self.cache, self.tracker)
        # Cost tracking
        self.total_cost = 0
        self.total_requests = 0
        self.cache_savings = 0
        self.model_savings = 0

    def analyze_feedback(self, feedback_text, optimize=True):
        """Analyze customer feedback, with or without cost optimization"""
        if optimize:
            return self._analyze_optimized(feedback_text)
        return self._analyze_baseline(feedback_text)

    def _analyze_baseline(self, feedback_text):
        """Baseline: always use GPT-4, no caching"""
        prompt = f"""Analyze this customer feedback:
Feedback: "{feedback_text}"
Provide:
1. Sentiment (positive/negative/neutral)
2. Key themes
3. Priority level (high/medium/low)
4. Suggested action"""

        # Simulate the API call without caching
        cost_estimate = self.tracker.estimate_cost(prompt, expected_response_length=100)
        response = f"""1. Sentiment: {'positive' if 'great' in feedback_text.lower() else 'negative' if any(word in feedback_text.lower() for word in ['bad', 'terrible', 'worst']) else 'neutral'}
2. Key themes: Product quality, delivery experience
3. Priority level: {'high' if 'worst' in feedback_text.lower() else 'medium'}
4. Suggested action: Follow up with customer service team"""

        self.total_cost += cost_estimate['estimated_cost']
        self.total_requests += 1
        return {
            "response": response,
            "cost": cost_estimate['estimated_cost'],
            "model": "gpt-4",
            "cached": False
        }

    def _analyze_optimized(self, feedback_text):
        """Optimized version combining all three strategies"""
        # Step 1: Determine task complexity
        task_type = "sentiment_analysis" if len(feedback_text) < 100 else "complex_analysis"

        # Create a token-efficient prompt
        if task_type == "sentiment_analysis":
            prompt = f"Sentiment and priority of: '{feedback_text}'"
        else:
            prompt = f"Analyze feedback: '{feedback_text}' (sentiment, themes, priority, action)"

        # Step 2: Select the optimal model
        selected_model, _ = self.selector.select_model(task_type, len(prompt), budget_priority=True)

        # Step 3: Query through the cache
        result = self.client.query(prompt, model=selected_model, temperature=0)

        # Track savings against a GPT-4 baseline
        baseline_cost = self.tracker.estimate_cost(
            f"Analyze customer feedback: '{feedback_text}' with detailed analysis...",
            expected_response_length=150
        )['estimated_cost']
        actual_cost = result.get('cost', 0) if not result.get('cached') else 0

        self.total_cost += actual_cost
        self.total_requests += 1
        if result.get('cached'):
            self.cache_savings += baseline_cost
        if selected_model != "gpt-4":
            self.model_savings += (baseline_cost - actual_cost)
        return result

    def get_optimization_report(self):
        """Generate a cost optimization report"""
        baseline_estimated = self.total_requests * 0.006  # estimated GPT-4 cost per request
        return {
            "total_requests": self.total_requests,
            "actual_cost": self.total_cost,
            "baseline_estimated_cost": baseline_estimated,
            "total_savings": baseline_estimated - self.total_cost,
            "savings_percentage": ((baseline_estimated - self.total_cost) / baseline_estimated) * 100 if baseline_estimated > 0 else 0,
            "cache_savings": self.cache_savings,
            "model_selection_savings": self.model_savings,
            "cache_stats": self.cache.get_cache_stats()
        }
# Run the exercise
analyzer = FeedbackAnalyzer()
print("=== Cost Optimization Exercise ===\n")

# Process sample feedback with optimization
print("Processing feedback with optimization...")
for i, feedback in enumerate(sample_feedbacks):
    result = analyzer.analyze_feedback(feedback, optimize=True)
    print(f"\nFeedback {i+1}: {feedback}")
    print(f"Model used: {result.get('model', 'cached')}")
    print(f"Cached: {result.get('cached', False)}")
    print(f"Cost: ${result.get('cost', 0):.4f}")

# Process some duplicates to show caching benefits
print("\n--- Testing cache with duplicate feedback ---")
for feedback in sample_feedbacks[:3]:  # repeat the first 3
    result = analyzer.analyze_feedback(feedback, optimize=True)
    print(f"Duplicate feedback cost: ${result.get('cost', 0):.4f} (cached: {result.get('cached')})")

# Generate the optimization report
report = analyzer.get_optimization_report()
print("\n=== OPTIMIZATION RESULTS ===")
print(f"Total requests processed: {report['total_requests']}")
print(f"Actual cost: ${report['actual_cost']:.4f}")
print(f"Baseline estimated cost: ${report['baseline_estimated_cost']:.4f}")
print(f"Total savings: ${report['total_savings']:.4f}")
print(f"Savings percentage: {report['savings_percentage']:.1f}%")
print(f"Cache hit rate: {report['cache_stats']['hit_rate']:.1%}")
def simulate_production_load(analyzer, num_requests=100):
    """Simulate a realistic production load with various feedback types"""
    # Generate a realistic distribution of feedback
    feedback_types = {
        "simple_positive": ["Great!", "Love it!", "Perfect product"],
        "simple_negative": ["Terrible", "Waste of money", "Poor quality"],
        "complex_mixed": [
            "The product quality is good but delivery was slow and packaging could be better",
            "Great customer service but the product didn't meet my expectations for the price point",
            "Interface is intuitive for basic tasks but lacks advanced features I needed for my work"
        ]
    }

    start_time = time.time()

    for i in range(num_requests):
        # Realistic distribution: 40% simple positive, 30% simple negative, 30% complex
        rand_val = random.random()
        if rand_val < 0.4:
            feedback = random.choice(feedback_types["simple_positive"])
        elif rand_val < 0.7:
            feedback = random.choice(feedback_types["simple_negative"])
        else:
            feedback = random.choice(feedback_types["complex_mixed"])

        # Add some randomness to simulate real variations
        if random.random() < 0.2:  # 20% chance of additional context
            feedback += f" Order #{random.randint(1000, 9999)}"

        analyzer.analyze_feedback(feedback, optimize=True)

        if i % 50 == 0:  # progress update
            print(f"Processed {i} requests...")

    processing_time = time.time() - start_time

    # Get the final report
    optimization_report = analyzer.get_optimization_report()
    return {
        "requests_processed": num_requests,
        "total_time": processing_time,
        "avg_time_per_request": processing_time / num_requests,
        **optimization_report
    }
# Run production simulation
print("\n=== PRODUCTION SCALE SIMULATION ===")
print("Simulating 100 customer feedback analyses...")
final_results = simulate_production_load(analyzer, 100)
print(f"\n=== FINAL RESULTS ===")
print(f"Requests processed: {final_results['requests_processed']}")
print(f"Total processing time: {final_results['total_time']:.2f} seconds")
print(f"Average time per request: {final_results['avg_time_per_request']:.3f} seconds")
print(f"Total cost: ${final_results['actual_cost']:.4f}")
print(f"Estimated baseline cost: ${final_results['baseline_estimated_cost']:.4f}")
print(f"Cost savings: ${final_results['total_savings']:.4f} ({final_results['savings_percentage']:.1f}%)")
print(f"Cache hit rate: {final_results['cache_stats']['hit_rate']:.1%}")
# Calculate ROI
daily_volume = 1000 # feedbacks per day
monthly_savings = final_results['total_savings'] * 10 * 30 # Scale to monthly
print(f"\nProjected monthly savings at 1,000 requests/day: ${monthly_savings:.2f}")
Problem: Caching responses when using temperature > 0 or when you want varied responses.
# WRONG: caching creative content generation
def generate_marketing_copy(product_name):
    prompt = f"Write creative marketing copy for {product_name}"
    # This will always return the same cached response,
    # even though temperature=0.8 asks for varied output
    return client.query(prompt, model="gpt-4", temperature=0.8)

# CORRECT: conditional caching
def generate_marketing_copy(product_name, use_cache=False):
    prompt = f"Write creative marketing copy for {product_name}"
    if use_cache:
        # Only cache when identical responses are acceptable
        return client.query(prompt, model="gpt-4", temperature=0)
    # Skip the cache for creative content (placeholder for a direct API call)
    return make_direct_api_call(prompt, model="gpt-4", temperature=0.8)
Solution: Only cache deterministic outputs (temperature=0) or when identical responses are desired.
Problem: Counting tokens after API calls instead of before, leading to budget overruns.
# WRONG: counting tokens after the money is spent
def risky_api_call(prompt):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # Too late! The money is already spent
    tokens_used = response.usage.total_tokens
    if tokens_used > budget_limit:  # budget_limit defined elsewhere
        print("Oops, went over budget!")

# CORRECT: budget control before the API call
def safe_api_call(prompt, max_budget=0.01):
    estimated_cost = tracker.estimate_cost(prompt)
    if estimated_cost['estimated_cost'] > max_budget:
        # Try a shorter prompt or a cheaper model (placeholder helper)
        return optimize_prompt_for_budget(prompt, max_budget)
    return make_api_call(prompt)
Problem: Using expensive models for simple tasks or cheap models for complex ones.
# WRONG: expensive model for simple classification
def classify_sentiment_expensive(text):
    prompt = f"Is this positive or negative: {text}"
    return client.query(prompt, model="gpt-4")  # overkill and expensive

# WRONG: cheap model for complex reasoning
def complex_analysis_cheap(data):
    prompt = f"Perform detailed statistical analysis with recommendations: {data}"
    return client.query(prompt, model="claude-3-haiku")  # may produce poor results

# CORRECT: task-appropriate model selection
def classify_sentiment(text):
    prompt = f"Sentiment: {text}"
    return client.query(prompt, model="gpt-3.5-turbo")  # the right balance

def complex_analysis(data):
    prompt = f"Analyze and recommend: {data}"
    return client.query(prompt, model="gpt-4")  # complex reasoning needs the premium model
Cache Inconsistencies:
def debug_cache_issues():
    # Check that cache key generation is consistent
    prompt1 = "What is the weather?"
    prompt2 = "What is the weather?"  # same text, different object
    key1 = cache._generate_cache_key(prompt1, "gpt-4", 0.0)
    key2 = cache._generate_cache_key(prompt2, "gpt-4", 0.0)
    assert key1 == key2, "Cache key generation is inconsistent!"

    # Check cache expiration
    cache.set("test", "gpt-4", "response", 0.001, ttl=1)  # 1-second TTL
    time.sleep(2)
    result = cache.get("test", "gpt-4")
    assert result is None, "Cache not expiring properly!"
Token Count Discrepancies:
def verify_token_counting():
    test_text = "Hello, world!"

    # Count locally using tiktoken
    local_count = tracker.count_tokens(test_text)

    # Make an API call and compare
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": test_text}]
    )
    api_count = response.usage.prompt_tokens

    # The chat format adds a few tokens of overhead per message,
    # so small differences are normal
    difference = abs(local_count - api_count)
    if difference > 10:
        print(f"Token count discrepancy: local={local_count}, api={api_count}")
Memory Issues with Large Caches:
def monitor_cache_size():
    # Monitor Redis memory usage
    info = cache.redis_client.info('memory')
    used_memory_mb = info['used_memory'] / 1024 / 1024

    if used_memory_mb > 100:  # 100 MB threshold
        print(f"Cache using {used_memory_mb:.1f}MB, consider cleanup")
        # One possible cleanup step (Redis 4+)
        cache.redis_client.execute_command('MEMORY PURGE')
You've now built a comprehensive cost optimization system that can reduce LLM expenses by 60-80% while maintaining quality. Let's recap the key strategies:
Token Counting & Monitoring: You learned to track usage proactively, optimize prompts for efficiency, and set up automated cost alerts before bills get out of control.
Intelligent Caching: You implemented both exact-match and semantic caching systems that eliminate redundant API calls, potentially saving thousands of dollars monthly on repetitive tasks.
Strategic Model Selection: You built a framework to automatically choose the most cost-effective model for each task, using premium models only when necessary.
Audit your current LLM usage: Implement token tracking on your existing applications this week. You'll likely find surprising cost drivers.
Start with caching wins: Identify your most repetitive API calls and implement caching; FAQ systems and content analysis are prime candidates (see the sketch after this list).
Evaluate model choices: Review tasks currently using GPT-4. Can 70% of them work just as well with GPT-3.5-turbo?
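For the caching audit mentioned above, a quick way to surface candidates is to count duplicate prompts in whatever request log you already keep. Here's a sketch assuming you can export recent prompts as a list of strings:
from collections import Counter

def top_repeated_prompts(prompts, n=10):
    """Rank prompts by repetition count: prime caching candidates."""
    counts = Counter(p.strip().lower() for p in prompts)
    return [(p, c) for p, c in counts.most_common(n) if c > 1]

# Hypothetical request log
log = ["What is your return policy?", "what is your return policy? ",
       "Where is my order?", "What is your return policy?"]
print(top_repeated_prompts(log))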
Fine-tuning for cost optimization: Train smaller, task-specific models that outperform general models while costing less.
Prompt engineering at scale: Advanced techniques like few-shot prompting and chain-of-thought optimization for specific business use cases.
Multi-model architectures: Design systems that automatically route requests to different models based on real-time pricing and availability.
Production monitoring: Build comprehensive dashboards that track cost per business outcome, not just raw API expenses.
The strategies you've learned today will serve as the foundation for building profitable AI applications. Remember: the goal isn't to minimize costs at all costs, but to maximize the value you get from every dollar spent on AI. Start implementing these techniques gradually, measure the impact, and iterate based on your specific use cases.
Your AI applications can be both intelligent and economical—now you have the tools to make it happen.