
Imagine you're a customer service representative for a tech company with thousands of product manuals, troubleshooting guides, and policy documents. A customer asks a complex question about warranty coverage for a specific product purchased two years ago. You could spend 20 minutes digging through files, or you could have an AI assistant that instantly finds the relevant information and crafts a personalized, accurate response in seconds.
This scenario illustrates the core challenge that Retrieval-Augmented Generation (RAG) solves: how to combine the vast knowledge stored in documents with the conversational abilities of large language models (LLMs). By the end of this lesson, you'll understand exactly how RAG works and be able to implement a basic RAG system yourself.
What you'll learn: how RAG combines document retrieval with LLM generation, how embeddings and vector databases work, and how to build and debug a basic RAG system in Python.
You should have: basic Python experience and an OpenAI API key for the hands-on examples.
No prior experience with vector databases or embeddings is required—we'll explain everything from the ground up.
Before diving into RAG, let's understand the fundamental problem it addresses. Large language models like GPT-4 or Claude are incredibly capable, but they have a critical limitation: their knowledge is frozen at their training cutoff date. Ask ChatGPT about your company's new product launched last month, and it will politely tell you it doesn't know.
Even worse, LLMs can "hallucinate"—confidently provide information that sounds plausible but is completely wrong. When you need accurate, up-to-date information from specific sources, pure LLMs fall short.
Traditional approaches to this problem include fine-tuning the model on new data or retraining it from scratch, both of which are slow, expensive, and must be repeated every time your documents change.
RAG offers a different approach: instead of changing the model, we change what information it has access to when generating responses.
RAG operates on a deceptively simple principle: before generating an answer, first retrieve relevant information from a knowledge base. Think of it like an open-book exam where the AI can consult reference materials before answering.
Here's the complete RAG workflow:
Before any retrieval can happen, we need to prepare our knowledge base. This involves three sub-steps:
Document Chunking: Large documents are broken into smaller, manageable pieces (typically 200-1000 tokens each). This chunking is crucial because an embedding represents a focused passage more faithfully than an entire document, and only a limited amount of retrieved text can fit in the LLM's context window.
Embedding Generation: Each chunk is converted into a vector embedding—a list of numbers that represents the semantic meaning of the text. Documents about similar topics will have similar embeddings.
Vector Storage: These embeddings are stored in a vector database, which allows for fast similarity searches.
When a user asks a question:
Query Embedding: The user's question is converted into an embedding using the same model used for the knowledge base.
Similarity Search: The system searches the vector database to find the chunks most similar to the query embedding.
Context Selection: The top-k most relevant chunks are selected (typically 3-10 pieces).
Prompt Construction: The retrieved chunks are combined with the user's question into a single prompt.
LLM Generation: The language model generates a response based on both the question and the retrieved context.
Response Delivery: The answer is returned to the user, often with source citations.
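To make this flow concrete before we build each piece, here is a minimal sketch that strings the query-time steps together. The embed, search, and generate callables are placeholders for the OpenAI and Chroma calls implemented later in this lesson.
from typing import Callable, List

def answer_question(
    question: str,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float], int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Run the query-time RAG steps end to end using caller-supplied components."""
    query_embedding = embed(question)                      # Query Embedding
    chunks = search(query_embedding, top_k)                # Similarity Search + Context Selection
    context = "\n\n".join(chunks)                          # combine retrieved chunks
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"  # Prompt Construction
    return generate(prompt)                                # LLM Generation -> Response Delivery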
Vector embeddings are the foundation that makes RAG possible. Think of embeddings as a way to represent the "meaning" of text as coordinates in a high-dimensional space.
Here's an intuitive way to understand embeddings: imagine you could plot every possible concept on a map. Related concepts would be close together—"dog" and "puppy" might be nearby, while "dog" and "algebra" would be far apart. Embeddings do exactly this, but instead of a 2D map, they use spaces with hundreds or thousands of dimensions.
# Example: How embeddings capture semantic similarity
# These are simplified 3D embeddings for illustration
document_chunks = {
"Dogs are loyal pets that require daily exercise": [0.8, 0.2, 0.1],
"Cats are independent animals that sleep often": [0.7, 0.1, 0.3],
"Python is a programming language used for data science": [0.1, 0.9, 0.8],
"Machine learning requires large datasets": [0.2, 0.8, 0.9]
}
query = "What pets need exercise?" # Embedding: [0.75, 0.15, 0.05]
# The query embedding is most similar to the first chunk about dogs
Modern embedding models like OpenAI's text-embedding-3-small or Sentence-BERT can capture these semantic relationships with remarkable accuracy.
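For example, here is a small sketch of generating embeddings locally with the sentence-transformers library (install with pip install sentence-transformers; the all-MiniLM-L6-v2 checkpoint is just one common choice):
from sentence_transformers import SentenceTransformer

# Load a small, widely used embedding model that produces 384-dimensional vectors
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "Dogs are loyal pets that require daily exercise",
    "What pets need exercise?",
])
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per input text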
A vector database is specifically designed to store and search through embeddings efficiently. Unlike traditional databases that search for exact matches, vector databases find the most similar vectors using mathematical operations like cosine similarity or dot product.
Popular vector database options include Chroma, Pinecone, Weaviate, FAISS, Qdrant, and Milvus. This lesson uses Chroma because it runs locally with minimal setup.
Here's how similarity search works conceptually:
import numpy as np
def cosine_similarity(vec1, vec2):
"""Calculate how similar two vectors are (0-1 scale)"""
dot_product = np.dot(vec1, vec2)
magnitude1 = np.linalg.norm(vec1)
magnitude2 = np.linalg.norm(vec2)
return dot_product / (magnitude1 * magnitude2)
# Example similarity calculation
query_embedding = [0.75, 0.15, 0.05]
doc_embedding = [0.8, 0.2, 0.1]
similarity = cosine_similarity(query_embedding, doc_embedding)
print(f"Similarity: {similarity:.3f}") # Output: Similarity: 0.997
Let's build a basic RAG system step by step. We'll create a system that can answer questions about a collection of company policy documents.
import openai
import chromadb
from chromadb.config import Settings
import tiktoken
import numpy as np
from typing import List, Dict
# You'll need to install these packages:
# pip install openai chromadb tiktoken numpy
class DocumentProcessor:
def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.tokenizer = tiktoken.get_encoding("cl100k_base")
def chunk_text(self, text: str, source: str) -> List[Dict]:
"""Split text into overlapping chunks"""
tokens = self.tokenizer.encode(text)
chunks = []
start = 0
chunk_id = 0
while start < len(tokens):
# Define chunk boundaries
end = min(start + self.chunk_size, len(tokens))
chunk_tokens = tokens[start:end]
chunk_text = self.tokenizer.decode(chunk_tokens)
chunks.append({
'id': f"{source}_{chunk_id}",
'text': chunk_text,
'source': source,
'chunk_id': chunk_id
})
            # Stop once the final tokens have been chunked to avoid a duplicate tail chunk
            if end == len(tokens):
                break
            # Move start position forward, keeping an overlap with the previous chunk
            start += self.chunk_size - self.chunk_overlap
            chunk_id += 1
return chunks
# Example usage
processor = DocumentProcessor()
# Sample company policy document
policy_text = """
Employee Vacation Policy
All full-time employees are entitled to 15 days of paid vacation per year.
Vacation days accrue monthly at a rate of 1.25 days per month. Employees
must request vacation at least two weeks in advance through the HR portal.
Unused vacation days up to 5 days may be carried over to the following year.
Vacation requests during peak business periods (November-December) require
approval from department managers.
Remote Work Policy
Employees may work remotely up to 2 days per week with manager approval.
Remote work days must be scheduled in advance and documented in the team
calendar. All remote workers must be available during core business hours
(9 AM - 3 PM EST) and maintain reliable internet connectivity.
"""
chunks = processor.chunk_text(policy_text, "company_policies.txt")
print(f"Created {len(chunks)} chunks")
class RAGSystem:
def __init__(self, openai_api_key: str):
self.client = openai.OpenAI(api_key=openai_api_key)
self.chroma_client = chromadb.Client()
self.collection = self.chroma_client.create_collection(
name="company_knowledge_base",
metadata={"hnsw:space": "cosine"}
)
def get_embedding(self, text: str) -> List[float]:
"""Generate embedding for a piece of text"""
response = self.client.embeddings.create(
model="text-embedding-3-small",
input=text
)
return response.data[0].embedding
def add_documents(self, chunks: List[Dict]):
"""Add document chunks to the vector database"""
texts = [chunk['text'] for chunk in chunks]
metadatas = [
{
'source': chunk['source'],
'chunk_id': chunk['chunk_id']
}
for chunk in chunks
]
ids = [chunk['id'] for chunk in chunks]
# Generate embeddings for all chunks
embeddings = []
for text in texts:
embedding = self.get_embedding(text)
embeddings.append(embedding)
# Add to vector database
self.collection.add(
embeddings=embeddings,
documents=texts,
metadatas=metadatas,
ids=ids
)
print(f"Added {len(chunks)} chunks to knowledge base")
# Initialize and populate the RAG system
rag = RAGSystem(openai_api_key="your-api-key-here")
rag.add_documents(chunks)
def retrieve_relevant_chunks(self, query: str, n_results: int = 3) -> List[Dict]:
"""Retrieve the most relevant chunks for a query"""
query_embedding = self.get_embedding(query)
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=n_results,
include=['documents', 'metadatas', 'distances']
)
relevant_chunks = []
for i, doc in enumerate(results['documents'][0]):
distance = results['distances'][0][i]
source = results['metadatas'][0][i]['source']
# Only include chunks with high similarity (low distance)
if distance < 0.8: # Adjust threshold as needed
relevant_chunks.append({
'text': doc,
'source': source,
'similarity': 1 - distance # Convert distance to similarity
})
return relevant_chunks
# Add this method to the RAGSystem class
RAGSystem.retrieve_relevant_chunks = retrieve_relevant_chunks
def generate_answer(self, query: str) -> Dict:
"""Generate an answer using retrieved context"""
# Retrieve relevant chunks
relevant_chunks = self.retrieve_relevant_chunks(query)
if not relevant_chunks:
return {
'answer': "I don't have enough information to answer that question.",
'sources': []
}
# Build context from retrieved chunks
context = "\n\n".join([
f"Source: {chunk['source']}\nContent: {chunk['text']}"
for chunk in relevant_chunks
])
# Create the prompt
prompt = f"""
Based on the following context, please answer the user's question.
If the context doesn't contain enough information, say so clearly.
Context:
{context}
Question: {query}
Answer:
"""
# Generate response
response = self.client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{
"role": "system",
"content": "You are a helpful assistant that answers questions based on provided context. Always cite your sources."
},
{"role": "user", "content": prompt}
],
temperature=0.1
)
return {
'answer': response.choices[0].message.content,
'sources': [chunk['source'] for chunk in relevant_chunks],
'retrieved_chunks': len(relevant_chunks)
}
# Add this method to the RAGSystem class
RAGSystem.generate_answer = generate_answer
Let's put our RAG system to work with some realistic queries:
# Test the complete RAG system
def test_rag_system():
# Initialize with your OpenAI API key
rag = RAGSystem(openai_api_key="your-api-key-here")
# Process and add documents
processor = DocumentProcessor()
chunks = processor.chunk_text(policy_text, "company_policies.txt")
rag.add_documents(chunks)
# Test queries
test_queries = [
"How many vacation days do employees get?",
"Can I work from home every day?",
"What's the policy on carrying over unused vacation?",
"How much notice do I need to give for vacation?",
"What are the core business hours for remote workers?"
]
for query in test_queries:
print(f"\nQuestion: {query}")
result = rag.generate_answer(query)
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
print(f"Retrieved chunks: {result['retrieved_chunks']}")
print("-" * 50)
# Run the test
test_rag_system()
Try running this exercise with different types of questions. Notice how the system retrieves different chunks depending on the question, cites the source document it drew from, and falls back to a safe "not enough information" answer when nothing relevant is found.
Tip: Start with a small document collection when testing. It's easier to understand what's happening and debug issues.
Problem: Chunks that are too large contain too much irrelevant information. Chunks that are too small lack sufficient context.
Solution: Experiment with chunk sizes between 200-800 tokens. For technical documents, smaller chunks (200-400 tokens) often work better. For narrative content, larger chunks (500-800 tokens) may be more effective.
# Test different chunk sizes
chunk_sizes_to_test = [200, 400, 600, 800]
for size in chunk_sizes_to_test:
processor = DocumentProcessor(chunk_size=size)
chunks = processor.chunk_text(your_text, "test_doc")
print(f"Chunk size {size}: Created {len(chunks)} chunks")
Problem: The system breaks when no relevant context is found or when queries are very different from anything in the knowledge base.
Solution: Always check retrieval results and set similarity thresholds.
def safe_retrieve(self, query: str, similarity_threshold: float = 0.7):
chunks = self.retrieve_relevant_chunks(query)
# Filter out chunks below similarity threshold
relevant_chunks = [
chunk for chunk in chunks
if chunk['similarity'] >= similarity_threshold
]
if not relevant_chunks:
return "I don't have enough relevant information to answer this question confidently."
return relevant_chunks
Problem: Focusing only on generation quality while ignoring whether the right information is being retrieved.
Solution: Always evaluate retrieval performance separately from generation performance.
def evaluate_retrieval(self, query: str, expected_sources: List[str]):
"""Check if retrieval finds the expected sources"""
chunks = self.retrieve_relevant_chunks(query)
retrieved_sources = [chunk['source'] for chunk in chunks]
overlap = set(retrieved_sources) & set(expected_sources)
precision = len(overlap) / len(retrieved_sources) if retrieved_sources else 0
recall = len(overlap) / len(expected_sources) if expected_sources else 0
return {
'precision': precision,
'recall': recall,
'retrieved_sources': retrieved_sources
}
Problem: The language model doesn't effectively use the retrieved context or provides inconsistent responses.
Solution: Use clear, specific prompts that explicitly instruct the model how to use the context.
def create_better_prompt(self, query: str, context_chunks: List[str]) -> str:
context = "\n\n---\n\n".join([
f"Document Section {i+1}:\n{chunk}"
for i, chunk in enumerate(context_chunks)
])
return f"""
You are an expert assistant that answers questions using only the provided context.
CONTEXT:
{context}
INSTRUCTIONS:
1. Answer the question using ONLY information from the context above
2. If the context doesn't contain enough information, clearly state this
3. Quote specific parts of the context when possible
4. If you make any inferences, clearly mark them as such
QUESTION: {query}
ANSWER:
"""
As you become more comfortable with basic RAG, consider these advanced topics:
Hybrid Search: Combining vector similarity search with traditional keyword-based search often improves results. This approach captures both semantic similarity and exact term matches.
Re-ranking: After initial retrieval, use a separate model to re-rank the results based on their relevance to the specific query. This two-stage approach often improves precision.
Query Expansion: Automatically expand user queries with synonyms or related terms to improve retrieval coverage.
Metadata Filtering: Use document metadata (date, author, document type) to filter retrieval results before similarity search, as sketched below.
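As a small illustration of metadata filtering, Chroma accepts a where filter at query time. The sketch below reuses the rag object and the 'source' metadata field from the implementation above:
# Restrict similarity search to chunks from one source document before ranking
results = rag.collection.query(
    query_embeddings=[rag.get_embedding("How many vacation days do I get?")],
    n_results=3,
    where={"source": "company_policies.txt"},  # metadata filter applied alongside similarity search
    include=["documents", "metadatas", "distances"],
)
print(results["documents"][0])  # the top matching chunks from the filtered set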
You've learned the fundamental concepts behind Retrieval-Augmented Generation and built a working RAG system from scratch: chunking documents, embedding them, storing the embeddings in a vector database, retrieving the most relevant chunks for a query, and grounding the LLM's answer in that retrieved context.
RAG is a foundational technique that opens the door to building AI applications with access to real-world knowledge. The system you've built today is a stepping stone to more sophisticated AI applications that can reason over your organization's unique data and documents.