
You're building an AI-powered customer support chatbot for your e-commerce company. During the first week of testing, your AWS bill shows $847 in language model API costs. Your manager calls an emergency meeting, and suddenly you're tasked with cutting AI expenses by 75% while maintaining the same quality of responses. Sound familiar?
This scenario plays out daily across organizations implementing Large Language Models (LLMs). The challenge isn't just building AI applications—it's building them cost-effectively. Many developers focus solely on functionality, only to discover that their "smart" chatbot costs more per conversation than hiring a human agent.
By the end of this lesson, you'll have practical skills to dramatically reduce your LLM costs without sacrificing quality. We'll explore three fundamental cost optimization strategies that can transform your AI budget from a runaway expense into a controlled, predictable investment.
What you'll learn:
- How to count tokens and monitor LLM spending before the bill arrives
- How to cache responses so you never pay twice for the same answer
- How to match each task to the cheapest model that can handle it
Before diving into cost optimization, you should have:
- A working Python environment
- Access to an LLM API (the examples use OpenAI-style models)
- Basic familiarity with making API calls
Think of tokens as the "currency" of language models. Just as you pay per gallon of gas or per kilowatt-hour of electricity, you pay per token processed by an LLM. But unlike gas or electricity, tokens are abstract units that don't directly correspond to words or characters.
A token is the smallest unit of text that a language model processes. The tokenization process breaks your input text into these units using specific rules. For most modern LLMs:
- One token is roughly 4 characters, or about 0.75 words, of English text
- Common words are usually a single token, while rare or long words are split into several
- Punctuation, whitespace, and formatting all consume tokens too
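To see tokenization in action, here's a minimal sketch using the tiktoken library (the same tokenizer the tracking code later in this lesson relies on); the sample sentence is just an illustration:
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
text = "The delivery was extremely slow."
token_ids = encoding.encode(text)
print(f"Token count: {len(token_ids)}")
# Decode each ID individually to see where the token boundaries fall
print([encoding.decode([tid]) for tid in token_ids])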
Here's a practical example. Let's say you send this prompt to GPT-4:
Please analyze this customer feedback: "The delivery was extremely slow and the package arrived damaged. Very disappointed with the service."
This seemingly simple request consumes about 25 tokens for the prompt, plus whatever tokens the model generates in response. If GPT-4 responds with a 150-word analysis (roughly 200 tokens), you're looking at about 225 total tokens, costing around $0.013 per interaction at the rates quoted below.
That might seem trivial, but multiply by 10,000 customer interactions daily, and you're spending about $130/day, or roughly $47,000/year, on this one use case.
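The arithmetic is worth making explicit. Here's a quick sanity check using the GPT-4 list prices assumed throughout this lesson ($0.03 per 1K input tokens, $0.06 per 1K output tokens):
# Back-of-envelope cost math for the example above
input_tokens, output_tokens = 25, 200
input_cost = (input_tokens / 1000) * 0.03    # $0.00075
output_cost = (output_tokens / 1000) * 0.06  # $0.01200
per_interaction = input_cost + output_cost   # ~$0.013
daily = per_interaction * 10_000
print(f"${per_interaction:.4f}/interaction, ${daily:.0f}/day, ${daily * 365:,.0f}/year")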
Before optimizing, you need visibility. Here's how to implement accurate token counting:
import tiktoken
import openai  # used for real API calls in later examples
from collections import defaultdict

class TokenTracker:
    def __init__(self, model="gpt-4"):
        self.model = model
        self.encoding = tiktoken.encoding_for_model(model)
        self.usage_log = defaultdict(int)
        # Current pricing (update these regularly)
        self.pricing = {
            "gpt-4": {"input": 0.03, "output": 0.06},  # per 1K tokens
            "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
            "claude-3-sonnet": {"input": 0.003, "output": 0.015}
        }

    def count_tokens(self, text):
        """Count tokens in a text string"""
        return len(self.encoding.encode(text))

    def estimate_cost(self, prompt, expected_response_length=100):
        """Estimate cost before making an API call"""
        input_tokens = self.count_tokens(prompt)
        output_tokens = expected_response_length  # rough estimate
        input_cost = (input_tokens / 1000) * self.pricing[self.model]["input"]
        output_cost = (output_tokens / 1000) * self.pricing[self.model]["output"]
        return {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "estimated_cost": input_cost + output_cost,
            "breakdown": {"input_cost": input_cost, "output_cost": output_cost}
        }

    def log_usage(self, prompt, response, actual_cost=None):
        """Log actual usage for tracking"""
        input_tokens = self.count_tokens(prompt)
        output_tokens = self.count_tokens(response)
        self.usage_log["total_input_tokens"] += input_tokens
        self.usage_log["total_output_tokens"] += output_tokens
        self.usage_log["total_requests"] += 1
        if actual_cost is not None:
            self.usage_log["estimated_total_cost"] += actual_cost
        else:
            # Fall back to an estimate based on the observed response length
            cost = self.estimate_cost(prompt, output_tokens)["estimated_cost"]
            self.usage_log["estimated_total_cost"] += cost

    def get_usage_report(self):
        """Generate usage summary"""
        return {
            "total_requests": self.usage_log["total_requests"],
            "total_tokens": self.usage_log["total_input_tokens"] + self.usage_log["total_output_tokens"],
            "input_tokens": self.usage_log["total_input_tokens"],
            "output_tokens": self.usage_log["total_output_tokens"],
            "estimated_cost": self.usage_log["estimated_total_cost"],
            "avg_cost_per_request": self.usage_log["estimated_total_cost"] / max(1, self.usage_log["total_requests"])
        }

# Example usage
tracker = TokenTracker("gpt-4")
prompt = """Analyze this customer feedback and provide actionable recommendations:
'The checkout process was confusing and took forever. I almost gave up.'"""

# Check cost before the API call
cost_estimate = tracker.estimate_cost(prompt, expected_response_length=150)
print(f"Estimated cost: ${cost_estimate['estimated_cost']:.4f}")
print(f"Input tokens: {cost_estimate['input_tokens']}")

# Make your API call here...
response = "Based on the feedback, I recommend simplifying the checkout flow..."

# Log the actual usage
tracker.log_usage(prompt, response)
Pro tip: Set up automated alerts when your daily token usage exceeds specific thresholds. This prevents bill shock and helps you identify cost spikes immediately.
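As a rough sketch of that tip, you can layer a check on top of the TokenTracker above. The budget number and the plain print are placeholders for your own threshold and alerting channel:
DAILY_BUDGET_USD = 25.0  # hypothetical daily budget, tune to your needs

def check_budget_alert(tracker, budget=DAILY_BUDGET_USD):
    """Warn once estimated spend crosses the daily budget."""
    spend = tracker.get_usage_report()["estimated_cost"]
    if spend >= budget:
        # Swap print for your real channel: email, Slack, PagerDuty...
        print(f"ALERT: ${spend:.2f} spent, over the ${budget:.2f} daily budget")
    return spend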
Now that you're tracking tokens, let's reduce them:
1. Prompt Engineering for Efficiency
Instead of verbose prompts, use concise, direct language:
# Inefficient (47 tokens)
verbose_prompt = """
I would like you to please carefully analyze the following customer feedback
and then provide me with detailed recommendations for improvement. Here is
the feedback: "Product quality was poor."
"""
# Efficient (12 tokens)
concise_prompt = """Analyze feedback and suggest improvements: "Product quality was poor." """
print(f"Verbose: {tracker.count_tokens(verbose_prompt)} tokens")
print(f"Concise: {tracker.count_tokens(concise_prompt)} tokens")
# Savings: 75% reduction in input tokens
2. Template-Based Responses
For repetitive tasks, create reusable templates:
def create_analysis_prompt(feedback_text, analysis_type="general"):
    templates = {
        "general": f"Analyze: {feedback_text}",
        "sentiment": f"Sentiment of: {feedback_text}",
        "categories": f"Categorize: {feedback_text}"
    }
    return templates[analysis_type]
# Instead of crafting unique prompts each time
feedback = "Delivery was late"
prompt = create_analysis_prompt(feedback, "sentiment")
print(f"Template prompt: {prompt}")
print(f"Tokens: {tracker.count_tokens(prompt)}")
Caching is your most powerful weapon against unnecessary API costs. Think of it as creating a "memory" for your AI application—if you've already paid for an answer once, why pay again for the same question?
Consider these scenarios where caching provides massive savings:
FAQ responses: Customer service bots often get identical questions. "What's your return policy?" doesn't need a fresh API call every time.
Document analysis: If multiple users upload the same contract template, cache the analysis.
Translation requests: The same product descriptions translated repeatedly across different sessions.
Here's a production-ready caching implementation:
import hashlib
import json
import pickle
from datetime import datetime

import redis

class LLMCache:
    def __init__(self, redis_host='localhost', redis_port=6379, default_ttl=3600):
        self.redis_client = redis.Redis(host=redis_host, port=redis_port, decode_responses=False)
        self.default_ttl = default_ttl  # time to live in seconds
        self.hit_count = 0
        self.miss_count = 0

    def _generate_cache_key(self, prompt, model, temperature=0.0):
        """Generate a unique cache key from prompt parameters"""
        # Include every parameter that affects the output
        cache_data = {
            "prompt": prompt,
            "model": model,
            "temperature": temperature
        }
        cache_string = json.dumps(cache_data, sort_keys=True)
        return hashlib.sha256(cache_string.encode()).hexdigest()

    def get(self, prompt, model, temperature=0.0):
        """Retrieve a cached response if available"""
        cache_key = self._generate_cache_key(prompt, model, temperature)
        try:
            cached_data = self.redis_client.get(cache_key)
            if cached_data:
                self.hit_count += 1
                result = pickle.loads(cached_data)
                result['cached'] = True
                result['cache_hit_time'] = datetime.now().isoformat()
                return result
            self.miss_count += 1
            return None
        except Exception as e:
            print(f"Cache retrieval error: {e}")
            self.miss_count += 1
            return None

    def set(self, prompt, model, response, cost, temperature=0.0, ttl=None):
        """Cache a response with metadata"""
        cache_key = self._generate_cache_key(prompt, model, temperature)
        ttl = ttl or self.default_ttl
        cache_data = {
            "response": response,
            "cost": cost,
            "timestamp": datetime.now().isoformat(),
            "model": model,
            "temperature": temperature
        }
        try:
            self.redis_client.setex(cache_key, ttl, pickle.dumps(cache_data))
        except Exception as e:
            print(f"Cache storage error: {e}")

    def get_cache_stats(self):
        """Get cache performance statistics"""
        total_requests = self.hit_count + self.miss_count
        hit_rate = self.hit_count / max(1, total_requests)
        return {
            "cache_hits": self.hit_count,
            "cache_misses": self.miss_count,
            "hit_rate": hit_rate,
            "total_requests": total_requests
        }

    def invalidate_pattern(self, pattern):
        """Invalidate cache entries matching a pattern"""
        # Useful for clearing caches when the underlying data changes
        keys = self.redis_client.keys(pattern)
        if keys:
            self.redis_client.delete(*keys)
            return len(keys)
        return 0
# Integration with your LLM calls
class CachedLLMClient:
    def __init__(self, cache, token_tracker):
        self.cache = cache
        self.tracker = token_tracker

    def query(self, prompt, model="gpt-4", temperature=0.0):
        """Query with caching"""
        # Check the cache first
        cached_result = self.cache.get(prompt, model, temperature)
        if cached_result:
            print("Cache hit! Saved an API call.")
            return cached_result

        # Cache miss: make the API call
        print("Cache miss - calling API...")
        cost_estimate = self.tracker.estimate_cost(prompt)

        # Make the actual API call (placeholder)
        response = self._call_llm_api(prompt, model, temperature)
        actual_cost = cost_estimate['estimated_cost']  # replace with the actual cost from the API response

        # Cache the result and track usage
        self.cache.set(prompt, model, response, actual_cost, temperature)
        self.tracker.log_usage(prompt, response, actual_cost)

        return {
            "response": response,
            "cost": actual_cost,
            "model": model,
            "cached": False,
            "timestamp": datetime.now().isoformat()
        }

    def _call_llm_api(self, prompt, model, temperature):
        """Placeholder for your actual OpenAI/Claude/etc. API call"""
        return f"AI response to: {prompt[:50]}..."
# Example usage
cache = LLMCache()
tracker = TokenTracker("gpt-4")
client = CachedLLMClient(cache, tracker)
# First call - cache miss
result1 = client.query("What is your return policy?")
print(f"First call cost: ${result1['cost']:.4f}")
# Second identical call - cache hit
result2 = client.query("What is your return policy?")
print(f"Second call cost: $0.0000 (cached)")
# Check cache performance
stats = cache.get_cache_stats()
print(f"Cache hit rate: {stats['hit_rate']:.2%}")
Semantic Caching: Cache similar questions, not just identical ones:
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache(LLMCache):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.similarity_threshold = 0.9

    def find_similar_cached(self, prompt, model, temperature=0.0):
        """Find semantically similar cached responses"""
        # Assumes entries were stored under "semantic:{model}:..." keys with an
        # 'embedding' field; set() would need to be overridden accordingly.
        prompt_embedding = self.encoder.encode([prompt])

        # In production, use a vector database like Pinecone or Weaviate;
        # scanning Redis keys like this is a simplified example.
        pattern = f"semantic:{model}:*"
        keys = self.redis_client.keys(pattern)
        for key in keys:
            cached_data = pickle.loads(self.redis_client.get(key))
            cached_embedding = cached_data.get('embedding')
            if cached_embedding is not None:
                # Cosine similarity between the new prompt and the cached one
                similarity = np.dot(prompt_embedding[0], cached_embedding) / (
                    np.linalg.norm(prompt_embedding[0]) * np.linalg.norm(cached_embedding)
                )
                if similarity > self.similarity_threshold:
                    self.hit_count += 1
                    return cached_data
        self.miss_count += 1
        return None
Warning: Be careful with caching when temperature > 0, as LLMs are designed to give varied responses. Only cache deterministic queries (temperature = 0) unless you specifically want identical responses to similar questions.
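One way to enforce that rule mechanically is a guard that bypasses the cache whenever temperature is above zero. Here's a minimal sketch extending the CachedLLMClient above (the subclass name is just for illustration):
class SafeCachedLLMClient(CachedLLMClient):
    def query(self, prompt, model="gpt-4", temperature=0.0):
        # Nondeterministic requests never touch the cache
        if temperature > 0:
            response = self._call_llm_api(prompt, model, temperature)
            cost = self.tracker.estimate_cost(prompt)["estimated_cost"]
            self.tracker.log_usage(prompt, response, cost)
            return {"response": response, "cost": cost, "model": model,
                    "cached": False, "timestamp": datetime.now().isoformat()}
        return super().query(prompt, model, temperature)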
Choosing the right model for each task is like selecting the right tool for a job. You wouldn't use a sledgehammer to hang a picture frame, and you shouldn't use GPT-4 for simple text classification that GPT-3.5 can handle just as well.
Different models excel at different tasks and come with vastly different price points:
GPT-4: Premium model, excellent for complex reasoning, creative tasks, and nuanced analysis. 20x more expensive than GPT-3.5.
GPT-3.5-turbo: Great balance of capability and cost. Handles most business tasks effectively.
Claude-3-haiku: Fast and economical, excellent for simple analysis and classification.
Specialized models: Task-specific models often outperform general models while costing less.
class ModelSelector:
    def __init__(self):
        self.models = {
            "gpt-4": {
                "cost_per_1k_input": 0.03,
                "cost_per_1k_output": 0.06,
                "capabilities": ["complex_reasoning", "creative_writing", "code_generation", "analysis"],
                "max_tokens": 8192,
                "latency": "high"
            },
            "gpt-3.5-turbo": {
                "cost_per_1k_input": 0.0015,
                "cost_per_1k_output": 0.002,
                "capabilities": ["general_tasks", "simple_analysis", "basic_coding"],
                "max_tokens": 4096,
                "latency": "medium"
            },
            "claude-3-haiku": {
                "cost_per_1k_input": 0.00025,
                "cost_per_1k_output": 0.00125,
                "capabilities": ["classification", "simple_qa", "formatting"],
                "max_tokens": 4096,
                "latency": "low"
            }
        }
        self.task_model_map = {
            "sentiment_analysis": "claude-3-haiku",
            "content_classification": "gpt-3.5-turbo",
            "creative_writing": "gpt-4",
            "complex_analysis": "gpt-4",
            "simple_qa": "gpt-3.5-turbo",
            "data_extraction": "gpt-3.5-turbo",
            "code_review": "gpt-4"
        }

    def select_model(self, task_type, prompt_length, budget_priority=True):
        """Select the optimal model based on task and constraints"""
        recommended_model = self.task_model_map.get(task_type, "gpt-3.5-turbo")

        # If budget is the priority, consider cheaper alternatives first
        if budget_priority:
            if task_type in ["sentiment_analysis", "simple_qa"]:
                recommended_model = "claude-3-haiku"
            elif task_type in ["content_classification", "data_extraction"]:
                recommended_model = "gpt-3.5-turbo"

        # Then make sure the prompt fits the model's context window
        if prompt_length > self.models[recommended_model]["max_tokens"]:
            recommended_model = "gpt-4"  # upgrade to the larger context

        return recommended_model, self.models[recommended_model]

    def compare_costs(self, prompt, task_type, expected_output_length=100):
        """Compare costs across models for the same task"""
        # Rough token estimate; use tiktoken for exact counts
        input_tokens = len(prompt.split()) * 1.3
        results = {}
        for model_name, model_info in self.models.items():
            input_cost = (input_tokens / 1000) * model_info["cost_per_1k_input"]
            output_cost = (expected_output_length / 1000) * model_info["cost_per_1k_output"]
            results[model_name] = {
                "total_cost": input_cost + output_cost,
                "input_cost": input_cost,
                "output_cost": output_cost,
                "suitable_for_task": task_type in model_info.get("capabilities", [])
            }
        return results
# Example usage
selector = ModelSelector()

# Analyze a customer feedback classification task
task = "sentiment_analysis"
prompt = "Customer wrote: 'The product broke after one day. Terrible quality.'"
prompt_tokens = len(prompt.split()) * 1.3  # rough token estimate

# Get the optimal model
selected_model, model_info = selector.select_model(task, prompt_tokens, budget_priority=True)
print(f"Recommended model for {task}: {selected_model}")

# Compare costs across models
cost_comparison = selector.compare_costs(prompt, task)
for model, costs in cost_comparison.items():
    print(f"{model}: ${costs['total_cost']:.6f} (suitable: {costs['suitable_for_task']})")
Cascade Pattern: Start with the cheapest model, escalate only if needed:
class CascadeProcessor:
    def __init__(self, model_selector, llm_client):
        self.selector = model_selector
        self.client = llm_client
        self.escalation_keywords = ["complex", "unclear", "need more context"]

    def process_with_cascade(self, prompt, task_type):
        """Try cheaper models first, escalate if needed"""
        # Start with the most economical model
        models_to_try = ["claude-3-haiku", "gpt-3.5-turbo", "gpt-4"]
        for model in models_to_try:
            try:
                result = self.client.query(prompt, model=model, temperature=0)
                response = result["response"]
                # Check whether the response suggests the model struggled
                if not self._needs_escalation(response):
                    return {
                        "response": response,
                        "model_used": model,
                        "cost": result["cost"],
                        "escalated": model != models_to_try[0]
                    }
                print(f"{model} response unclear, trying next model...")
            except Exception as e:
                print(f"Error with {model}: {e}")
                continue
        raise Exception("All models failed")

    def _needs_escalation(self, response):
        """Decide whether response quality warrants a better model"""
        response_lower = response.lower()
        # Simple heuristics; in production, use more sophisticated checks
        if len(response) < 10:  # too short
            return True
        if any(keyword in response_lower for keyword in self.escalation_keywords):
            return True
        if "i'm not sure" in response_lower or "unclear" in response_lower:
            return True
        return False

# Example usage
processor = CascadeProcessor(selector, client)
result = processor.process_with_cascade(
    "Is this review positive or negative: 'Okay product, nothing special'",
    "sentiment_analysis"
)
print(f"Final result from {result['model_used']}: {result['response']}")
print(f"Cost: ${result['cost']:.4f}")
Batch Processing: Group similar requests to optimize model selection:
def batch_process_by_complexity(requests, batch_size=10):
    """Group requests by complexity and process each group with an appropriate model"""
    # Classify requests by complexity
    simple_requests = []
    complex_requests = []
    for req in requests:
        if len(req["prompt"]) < 200 and req["task_type"] in ["sentiment_analysis", "classification"]:
            simple_requests.append(req)
        else:
            complex_requests.append(req)

    results = []

    # Process simple requests in batches with the cheap model
    for i in range(0, len(simple_requests), batch_size):
        batch = simple_requests[i:i + batch_size]
        batch_prompt = "Process these requests:\n" + "\n".join(
            f"{j + 1}. {req['prompt']}" for j, req in enumerate(batch)
        )
        result = client.query(batch_prompt, model="claude-3-haiku")
        # Parse batch result and append to results...

    # Process complex requests individually with the better model
    for req in complex_requests:
        result = client.query(req["prompt"], model="gpt-4")
        results.append(result)

    return results
Now let's put everything together in a practical exercise. You'll build a cost-optimized customer feedback analysis system that demonstrates all three optimization strategies.
Your e-commerce company receives 1,000 customer reviews daily. Currently, you're using GPT-4 for all analysis, costing about $50/day. Your goal: maintain analysis quality while reducing costs by 70%.
import time
import random
from datetime import datetime

# Sample customer feedback data
sample_feedbacks = [
    "Great product, fast delivery!",
    "Package arrived damaged and customer service was unhelpful",
    "Average quality, nothing special but does the job",
    "Expensive but worth it for the quality",
    "Worst purchase ever, complete waste of money",
    "Good value for money, would recommend",
    "Delivery was delayed but product quality is excellent",
    "Interface is confusing, needs better instructions"
]

class FeedbackAnalyzer:
    def __init__(self):
        self.cache = LLMCache()
        self.tracker = TokenTracker("gpt-4")
        self.selector = ModelSelector()
        self.client = CachedLLMClient(self.cache, self.tracker)
        # Cost tracking
        self.total_cost = 0
        self.total_requests = 0
        self.cache_savings = 0
        self.model_savings = 0

    def analyze_feedback(self, feedback_text, optimize=True):
        """Analyze customer feedback, with or without cost optimization"""
        if optimize:
            return self._analyze_optimized(feedback_text)
        return self._analyze_baseline(feedback_text)

    def _analyze_baseline(self, feedback_text):
        """Baseline: always use GPT-4, no caching"""
        prompt = f"""Analyze this customer feedback:
Feedback: "{feedback_text}"
Provide:
1. Sentiment (positive/negative/neutral)
2. Key themes
3. Priority level (high/medium/low)
4. Suggested action"""

        # Simulate the API call without caching
        cost_estimate = self.tracker.estimate_cost(prompt, expected_response_length=100)
        response = f"""1. Sentiment: {'positive' if 'great' in feedback_text.lower() else 'negative' if any(word in feedback_text.lower() for word in ['bad', 'terrible', 'worst']) else 'neutral'}
2. Key themes: Product quality, delivery experience
3. Priority level: {'high' if 'worst' in feedback_text.lower() else 'medium'}
4. Suggested action: Follow up with customer service team"""

        self.total_cost += cost_estimate['estimated_cost']
        self.total_requests += 1
        return {
            "response": response,
            "cost": cost_estimate['estimated_cost'],
            "model": "gpt-4",
            "cached": False
        }

    def _analyze_optimized(self, feedback_text):
        """Optimized version combining all three strategies"""
        # Step 1: Determine task complexity
        task_type = "sentiment_analysis" if len(feedback_text) < 100 else "complex_analysis"

        # Create a token-efficient prompt
        if task_type == "sentiment_analysis":
            prompt = f"Sentiment and priority of: '{feedback_text}'"
        else:
            prompt = f"Analyze feedback: '{feedback_text}' (sentiment, themes, priority, action)"

        # Step 2: Select the optimal model
        selected_model, _ = self.selector.select_model(task_type, len(prompt), budget_priority=True)

        # Step 3: Query through the cache
        result = self.client.query(prompt, model=selected_model, temperature=0)

        # Track savings against a GPT-4 baseline
        baseline_cost = self.tracker.estimate_cost(
            f"Analyze customer feedback: '{feedback_text}' with detailed analysis...",
            expected_response_length=150
        )['estimated_cost']
        actual_cost = result.get('cost', 0) if not result.get('cached') else 0

        self.total_cost += actual_cost
        self.total_requests += 1
        if result.get('cached'):
            self.cache_savings += baseline_cost
        if selected_model != "gpt-4":
            self.model_savings += (baseline_cost - actual_cost)
        return result

    def get_optimization_report(self):
        """Generate a cost optimization report"""
        baseline_estimated = self.total_requests * 0.006  # estimated GPT-4 cost per request
        return {
            "total_requests": self.total_requests,
            "actual_cost": self.total_cost,
            "baseline_estimated_cost": baseline_estimated,
            "total_savings": baseline_estimated - self.total_cost,
            "savings_percentage": ((baseline_estimated - self.total_cost) / baseline_estimated) * 100 if baseline_estimated > 0 else 0,
            "cache_savings": self.cache_savings,
            "model_selection_savings": self.model_savings,
            "cache_stats": self.cache.get_cache_stats()
        }
# Run the exercise
analyzer = FeedbackAnalyzer()
print("=== Cost Optimization Exercise ===\n")

# Process sample feedback with optimization
print("Processing feedback with optimization...")
for i, feedback in enumerate(sample_feedbacks):
    result = analyzer.analyze_feedback(feedback, optimize=True)
    print(f"\nFeedback {i+1}: {feedback}")
    print(f"Model used: {result.get('model', 'cached')}")
    print(f"Cached: {result.get('cached', False)}")
    print(f"Cost: ${result.get('cost', 0):.4f}")

# Process some duplicates to show caching benefits
print("\n--- Testing cache with duplicate feedback ---")
for feedback in sample_feedbacks[:3]:  # repeat the first 3
    result = analyzer.analyze_feedback(feedback, optimize=True)
    print(f"Duplicate feedback cost: ${result.get('cost', 0):.4f} (cached: {result.get('cached')})")

# Generate the optimization report
report = analyzer.get_optimization_report()
print("\n=== OPTIMIZATION RESULTS ===")
print(f"Total requests processed: {report['total_requests']}")
print(f"Actual cost: ${report['actual_cost']:.4f}")
print(f"Baseline estimated cost: ${report['baseline_estimated_cost']:.4f}")
print(f"Total savings: ${report['total_savings']:.4f}")
print(f"Savings percentage: {report['savings_percentage']:.1f}%")
print(f"Cache hit rate: {report['cache_stats']['hit_rate']:.1%}")
def simulate_production_load(analyzer, num_requests=100):
    """Simulate a realistic production load with various feedback types"""
    # Generate a realistic distribution of feedback
    feedback_types = {
        "simple_positive": ["Great!", "Love it!", "Perfect product"],
        "simple_negative": ["Terrible", "Waste of money", "Poor quality"],
        "complex_mixed": [
            "The product quality is good but delivery was slow and packaging could be better",
            "Great customer service but the product didn't meet my expectations for the price point",
            "Interface is intuitive for basic tasks but lacks advanced features I needed for my work"
        ]
    }

    start_time = time.time()

    for i in range(num_requests):
        # Realistic distribution: 40% simple positive, 30% simple negative, 30% complex
        rand_val = random.random()
        if rand_val < 0.4:
            feedback = random.choice(feedback_types["simple_positive"])
        elif rand_val < 0.7:
            feedback = random.choice(feedback_types["simple_negative"])
        else:
            feedback = random.choice(feedback_types["complex_mixed"])

        # Add some randomness to simulate real variations
        if random.random() < 0.2:  # 20% chance of additional context
            feedback += f" Order #{random.randint(1000, 9999)}"

        analyzer.analyze_feedback(feedback, optimize=True)

        if i % 50 == 0:  # progress update
            print(f"Processed {i} requests...")

    processing_time = time.time() - start_time

    # Get the final report
    optimization_report = analyzer.get_optimization_report()
    return {
        "requests_processed": num_requests,
        "total_time": processing_time,
        "avg_time_per_request": processing_time / num_requests,
        **optimization_report
    }
# Run production simulation
print("\n=== PRODUCTION SCALE SIMULATION ===")
print("Simulating 100 customer feedback analyses...")
final_results = simulate_production_load(analyzer, 100)
print(f"\n=== FINAL RESULTS ===")
print(f"Requests processed: {final_results['requests_processed']}")
print(f"Total processing time: {final_results['total_time']:.2f} seconds")
print(f"Average time per request: {final_results['avg_time_per_request']:.3f} seconds")
print(f"Total cost: ${final_results['actual_cost']:.4f}")
print(f"Estimated baseline cost: ${final_results['baseline_estimated_cost']:.4f}")
print(f"Cost savings: ${final_results['total_savings']:.4f} ({final_results['savings_percentage']:.1f}%)")
print(f"Cache hit rate: {final_results['cache_stats']['hit_rate']:.1%}")
# Calculate ROI
daily_volume = 1000 # feedbacks per day
monthly_savings = final_results['total_savings'] * 10 * 30 # Scale to monthly
print(f"\nProjected monthly savings at 1,000 requests/day: ${monthly_savings:.2f}")
Problem: Caching responses when using temperature > 0 or when you want varied responses.
# WRONG: caching creative content generation
def generate_marketing_copy(product_name):
    prompt = f"Write creative marketing copy for {product_name}"
    # This will always return the same cached response,
    # even though temperature=0.8 asks for varied output
    return client.query(prompt, model="gpt-4", temperature=0.8)

# CORRECT: conditional caching
def generate_marketing_copy(product_name, use_cache=False):
    prompt = f"Write creative marketing copy for {product_name}"
    if use_cache:
        # Only cache when identical responses are acceptable
        return client.query(prompt, model="gpt-4", temperature=0)
    # Skip the cache for creative content (placeholder for a direct API call)
    return make_direct_api_call(prompt, model="gpt-4", temperature=0.8)
Solution: Only cache deterministic outputs (temperature=0) or when identical responses are desired.
Problem: Counting tokens after API calls instead of before, leading to budget overruns.
# WRONG: counting tokens after the money is spent
def risky_api_call(prompt):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # Too late! The money is already spent
    tokens_used = response.usage.total_tokens
    if tokens_used > budget_limit:  # budget_limit defined elsewhere
        print("Oops, went over budget!")

# CORRECT: budget control before the API call
def safe_api_call(prompt, max_budget=0.01):
    estimated_cost = tracker.estimate_cost(prompt)
    if estimated_cost['estimated_cost'] > max_budget:
        # Try a shorter prompt or a cheaper model (placeholder helper)
        return optimize_prompt_for_budget(prompt, max_budget)
    return make_api_call(prompt)
Problem: Using expensive models for simple tasks or cheap models for complex ones.
# WRONG: expensive model for simple classification
def classify_sentiment_expensive(text):
    prompt = f"Is this positive or negative: {text}"
    return client.query(prompt, model="gpt-4")  # overkill and expensive

# WRONG: cheap model for complex reasoning
def complex_analysis_cheap(data):
    prompt = f"Perform detailed statistical analysis with recommendations: {data}"
    return client.query(prompt, model="claude-3-haiku")  # may produce poor results

# CORRECT: task-appropriate model selection
def classify_sentiment(text):
    prompt = f"Sentiment: {text}"
    return client.query(prompt, model="gpt-3.5-turbo")  # the right balance

def complex_analysis(data):
    prompt = f"Analyze and recommend: {data}"
    return client.query(prompt, model="gpt-4")  # complex reasoning needs the premium model
Cache Inconsistencies:
def debug_cache_issues():
    # Check that cache key generation is consistent
    prompt1 = "What is the weather?"
    prompt2 = "What is the weather?"  # same text, different object
    key1 = cache._generate_cache_key(prompt1, "gpt-4", 0.0)
    key2 = cache._generate_cache_key(prompt2, "gpt-4", 0.0)
    assert key1 == key2, "Cache key generation is inconsistent!"

    # Check cache expiration
    cache.set("test", "gpt-4", "response", 0.001, ttl=1)  # 1-second TTL
    time.sleep(2)
    result = cache.get("test", "gpt-4")
    assert result is None, "Cache not expiring properly!"
Token Count Discrepancies:
def verify_token_counting():
    test_text = "Hello, world!"

    # Count locally using tiktoken
    local_count = tracker.count_tokens(test_text)

    # Make an API call and compare
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": test_text}]
    )
    api_count = response.usage.prompt_tokens

    # The chat format adds a few tokens of overhead per message,
    # so small differences are normal
    difference = abs(local_count - api_count)
    if difference > 10:
        print(f"Token count discrepancy: local={local_count}, api={api_count}")
Memory Issues with Large Caches:
def monitor_cache_size():
    # Monitor Redis memory usage
    info = cache.redis_client.info('memory')
    used_memory_mb = info['used_memory'] / 1024 / 1024

    if used_memory_mb > 100:  # 100 MB threshold
        print(f"Cache using {used_memory_mb:.1f}MB, consider cleanup")
        # One possible cleanup step (Redis 4+)
        cache.redis_client.execute_command('MEMORY PURGE')
You've now built a comprehensive cost optimization system that can reduce LLM expenses by 60-80% while maintaining quality. Let's recap the key strategies:
Token Counting & Monitoring: You learned to track usage proactively, optimize prompts for efficiency, and set up automated cost alerts before bills get out of control.
Intelligent Caching: You implemented both exact-match and semantic caching systems that eliminate redundant API calls, potentially saving thousands of dollars monthly on repetitive tasks.
Strategic Model Selection: You built a framework to automatically choose the most cost-effective model for each task, using premium models only when necessary.
Audit your current LLM usage: Implement token tracking on your existing applications this week. You'll likely find surprising cost drivers.
Start with caching wins: Identify your most repetitive API calls and implement caching; FAQ systems and content analysis are prime candidates (see the sketch after this list).
Evaluate model choices: Review tasks currently using GPT-4. Can 70% of them work just as well with GPT-3.5-turbo?
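For the caching audit mentioned above, a quick way to surface candidates is to count duplicate prompts in whatever request log you already keep. Here's a sketch assuming you can export recent prompts as a list of strings:
from collections import Counter

def top_repeated_prompts(prompts, n=10):
    """Rank prompts by repetition count: prime caching candidates."""
    counts = Counter(p.strip().lower() for p in prompts)
    return [(p, c) for p, c in counts.most_common(n) if c > 1]

# Hypothetical request log
log = ["What is your return policy?", "what is your return policy? ",
       "Where is my order?", "What is your return policy?"]
print(top_repeated_prompts(log))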
Fine-tuning for cost optimization: Train smaller, task-specific models that outperform general models while costing less.
Prompt engineering at scale: Advanced techniques like few-shot prompting and chain-of-thought optimization for specific business use cases.
Multi-model architectures: Design systems that automatically route requests to different models based on real-time pricing and availability.
Production monitoring: Build comprehensive dashboards that track cost per business outcome, not just raw API expenses.
The strategies you've learned today will serve as the foundation for building profitable AI applications. Remember: the goal isn't to minimize costs at all costs, but to maximize the value you get from every dollar spent on AI. Start implementing these techniques gradually, measure the impact, and iterate based on your specific use cases.
Your AI applications can be both intelligent and economical—now you have the tools to make it happen.