
You've built a basic RAG pipeline. Your agent can answer questions from a document store, reason through prompts, and produce coherent responses. Then a user asks it: "What's the current EUR/USD exchange rate?" Your agent confidently hallucinates a number from its training data, which could be months or years out of date. Or they ask: "Can you analyze this CSV and tell me which products have declining 90-day trends?" Your agent tries to reason through it in prose and gets the math wrong.
These are the moments where a language model running in isolation hits its ceiling. LLMs are remarkable at language — at reasoning, summarizing, and synthesizing — but they have no access to the present, no ability to run code, and no way to call external systems. Tool augmentation is the architectural pattern that fixes this. By giving your agent tools — callable functions that interface with the real world — you extend it from a very smart text predictor into a system that can actually do things.
By the end of this lesson, you'll have built a production-oriented tool-augmented agent using LangChain and OpenAI's function-calling interface. You'll understand not just how to wire up tools but how agents decide when to use them, how to handle errors gracefully, and how to avoid the common failure modes that turn demo-worthy agents into unreliable production nightmares.
What you'll learn:
You should be comfortable with Python and have some experience building LLM-powered applications; working knowledge of LangChain or a similar framework is helpful but not required. You'll need API keys for OpenAI and a search provider: the examples use Tavily, and SerpAPI is an alternative. Familiarity with REST APIs, JSON, and basic async Python will serve you well here.
Install the required packages before continuing:
pip install langchain langchain-openai langchain-community \
serpapi tavily-python \
requests pandas \
python-dotenv
Before writing a single line of tool code, you need to understand the control loop your agent is running. The dominant pattern is called ReAct, short for Reason + Act, and most tool-using agents today run some variant of it.
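The loop looks like this: the model reasons about what it needs next (Thought), calls a tool (Action), reads the tool's result (Observation), and repeats until it has enough information to answer.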
This iterative cycle is what separates agents from simple LLM calls. A basic RAG pipeline asks the model once and returns the answer. An agent can ask itself "do I have enough to answer this?" repeatedly until it does.
Here's what a ReAct trace looks like in practice for the query "What's the market cap of Nvidia today, and how does that compare to their revenue over the last 4 quarters?":
Thought: I need current market cap data for Nvidia. I should search for this.
Action: web_search("Nvidia market cap today 2024")
Observation: Nvidia market cap is approximately $2.2 trillion as of...
Thought: Now I need recent quarterly revenue. I'll search for their earnings.
Action: web_search("Nvidia quarterly revenue 2023 2024 earnings")
Observation: Q1 FY2025: $26.0B, Q4 FY2024: $22.1B, Q3 FY2024: $18.1B...
Thought: I have all the data. Let me calculate the ratio and write the answer.
Final Answer: Nvidia's current market cap of ~$2.2 trillion represents...
The agent didn't do this in one shot. It planned, acted, observed, and iterated. That's the mental model you need to hold when you're debugging agent behavior — you're not debugging a function call, you're debugging a reasoning loop.
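The reason/act/observe cycle can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: `scripted_model` is a hypothetical stand-in for the LLM that replays fixed decisions, and `lookup_price` is a made-up tool that returns canned data.

```python
from typing import Callable

# Hypothetical tool with canned data; a real tool would call an API.
def lookup_price(ticker: str) -> str:
    prices = {"NVDA": "900.00"}
    return prices.get(ticker, f"No price found for {ticker}")

TOOLS: dict[str, Callable[[str], str]] = {"lookup_price": lookup_price}

def scripted_model(observations: list[str]) -> dict:
    """Stand-in for the LLM: pick the next step from what has been observed."""
    if not observations:
        # Nothing is known yet, so request a tool call.
        return {"action": "lookup_price", "input": "NVDA"}
    # Enough information has been gathered, so finish.
    return {"final_answer": f"NVDA last traded at ${observations[-1]}"}

def run_agent(model, tools, max_iterations: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_iterations):
        decision = model(observations)                    # Thought
        if "final_answer" in decision:
            return decision["final_answer"]
        tool_fn = tools[decision["action"]]               # Action
        observations.append(tool_fn(decision["input"]))   # Observation
    return "Stopped: iteration limit reached."

print(run_agent(scripted_model, TOOLS))
```

The iteration cap matters as much as the loop itself: a model that never produces a final answer would otherwise run forever.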
Modern agents use one of two mechanisms for tool selection:
Prompt-based selection (older, less reliable): The LLM is given a description of available tools in the system prompt and asked to output tool calls in a structured format like Action: tool_name\nAction Input: .... LangChain's classic initialize_agent uses this approach with ZERO_SHOT_REACT_DESCRIPTION.
Function calling / tool use (current standard): The LLM provider exposes a structured API where you define tools as JSON schemas. The model returns a structured tool_call object rather than unstructured text. OpenAI's function calling, Anthropic's tool use, and Google's function calling all use this pattern. It's dramatically more reliable because your code no longer has to parse tool calls out of free text; the model emits a tool name and arguments that conform to a typed schema.
We'll use OpenAI's function calling interface throughout this lesson, which LangChain exposes cleanly via bind_tools.
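To make the structured path concrete, here is a rough sketch of a tool schema and a dispatcher that routes a structured call to a Python function. The schema follows the general shape of OpenAI-style function definitions, where arguments arrive as a JSON string in the raw API response. The `get_weather` tool and the hand-written `example_call` payload are illustrative assumptions, not output from a real model.

```python
import json

# Hypothetical tool; the name and behavior are for illustration only.
def get_weather(city: str) -> str:
    return f"Weather lookup requested for {city}"

# JSON-schema description of the tool, in the general shape used by
# OpenAI-style function calling.
WEATHER_SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Route a structured tool call to the matching Python function."""
    name = tool_call["name"]
    if name not in REGISTRY:
        return f"Unknown tool: {name}"
    args = json.loads(tool_call["arguments"])  # arguments are a JSON string
    return REGISTRY[name](**args)

# Hand-written stand-in for what a model might return.
example_call = {"name": "get_weather", "arguments": '{"city": "Lisbon"}'}
print(dispatch(example_call))
```

No regular expressions or text parsing are involved: the tool name is a dictionary key and the arguments are decoded JSON, which is where the reliability gain comes from.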
Web search is the most universally useful tool you can give an agent. Let's build it properly — not just "it works on my machine" but with error handling, rate limit awareness, and result filtering.
We'll use Tavily, which is purpose-built for LLM agents and returns cleaner, more structured results than raw SerpAPI. Create your .env file:
OPENAI_API_KEY=sk-...
TAVILY_API_KEY=tvly-...
Here's the search tool implementation:
import os

from dotenv import load_dotenv
from langchain_core.tools import tool
from tavily import TavilyClient

load_dotenv()
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])


@tool
def web_search(query: str) -> str:
    """
    Search the web for current information about a topic. Use this for:
    - Current events, prices, or statistics
    - Recent news about companies or markets
    - Any factual information that may have changed after 2023

    Args:
        query: A specific, well-formed search query

    Returns:
        Summarized search results with source URLs
    """
    try:
        results = tavily.search(
            query=query,
            search_depth="advanced",
            max_results=5,
            include_answer=True
        )
        # Build a structured response the LLM can reason about
        output_parts = []
        if results.get("answer"):
            output_parts.append(f"Summary: {results['answer']}\n")
        output_parts.append("Sources:")
        for i, result in enumerate(results.get("results", []), 1):
            output_parts.append(
                f"{i}. {result['title']}\n"
                f"   URL: {result['url']}\n"
                f"   Content: {result['content'][:300]}..."
            )
        return "\n".join(output_parts)
    except Exception as e:
        return f"Search failed: {str(e)}. Try rephrasing your query."
Notice a few deliberate choices here:
The docstring is part of the tool's contract with the model. When you use @tool, LangChain extracts the docstring as the tool's description and passes it to the LLM. A vague docstring like "searches the web" produces worse tool selection than a specific one that tells the model when to use it. Write docstrings for the LLM, not just for human readers.
The error handler returns a string, not a raised exception. If the tool raises, the agent crashes. If it returns an error string, the agent can read it, reason about it, and try again with a different query. This is a critical difference in production.
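The same error-as-text behavior can be factored into a reusable wrapper so every tool gets it for free. The sketch below is one possible shape, assuming a hypothetical decorator name `returns_errors_as_text` and a toy `parse_amount` function; it mirrors the try/except in web_search rather than any LangChain feature.

```python
import functools

def returns_errors_as_text(func):
    """Wrap a tool function so failures come back as readable strings."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Name the exception type and message, then suggest a next step
            return (
                f"{func.__name__} failed ({type(exc).__name__}): {exc}. "
                "Try different input."
            )
    return wrapper

@returns_errors_as_text
def parse_amount(text: str) -> str:
    """Toy tool: double a numeric string."""
    return str(float(text) * 2)

print(parse_amount("21"))
print(parse_amount("abc"))
```

For tools that talk to authenticated services, the raw exception message may contain sensitive details, so a production version would sanitize it before returning.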
This is where agents get genuinely powerful. A code execution tool lets your agent write Python, run it, and observe the output — which means it can do math, data analysis, string manipulation, and statistical calculations without hallucinating results.
The tradeoff is security. Running arbitrary LLM-generated code is a genuine attack surface. The implementation below runs the code in a separate subprocess with a timeout and a stripped-down environment, which is safer than calling exec() inside your agent's own process. Treat it as a baseline for controlled environments rather than a true sandbox: the child process can still reach the filesystem and the network.
Here's the subprocess approach:
import os
import subprocess
import sys
import tempfile

from langchain_core.tools import tool


@tool
def execute_python(code: str) -> str:
    """
    Execute Python code and return the output. Use this for:
    - Mathematical calculations and statistical analysis
    - Data manipulation and transformation
    - Generating computed results that require precision
    - Any task where reasoning through numbers might be error-prone

    The code runs in a separate Python process. Print your results explicitly.
    Pandas, numpy, and standard library modules are available.
    Do NOT use this for file I/O, network requests, or system commands.

    Args:
        code: Valid Python code to execute

    Returns:
        stdout output from the code, or error message if execution failed
    """
    # Write the code to a temp file and run it in a separate interpreter,
    # rather than calling exec() inside the agent's own process
    with tempfile.NamedTemporaryFile(
        mode='w',
        suffix='.py',
        delete=False
    ) as f:
        f.write(code)
        temp_path = f.name
    try:
        result = subprocess.run(
            [sys.executable, temp_path],
            capture_output=True,
            text=True,
            timeout=30,  # Hard limit: LLM-generated loops can be infinite
            # Restrict environment variables to prevent credential access
            env={
                "PATH": os.environ.get("PATH", ""),
                "HOME": tempfile.gettempdir(),
                "PYTHONPATH": os.environ.get("PYTHONPATH", "")
            }
        )
        if result.returncode != 0:
            # Return stderr so the agent can self-correct
            return f"Execution error:\n{result.stderr[:1000]}"
        output = result.stdout.strip()
        if not output:
            return "Code executed successfully but produced no output. Make sure to print() your results."
        # Truncate very long outputs to avoid context overflow
        if len(output) > 3000:
            output = output[:3000] + "\n[Output truncated at 3000 chars]"
        return output
    except subprocess.TimeoutExpired:
        return "Execution timed out after 30 seconds. Simplify your code or break it into smaller steps."
    finally:
        os.unlink(temp_path)  # Always clean up
Let's watch this tool in action. If a user asks "What's the compound annual growth rate of a stock that went from $45 to $312 over 7 years?", the agent should write and execute this:
# What the LLM generates and sends to execute_python:
initial = 45
final = 312
years = 7
cagr = (final / initial) ** (1 / years) - 1
print(f"CAGR: {cagr:.4f} ({cagr*100:.2f}%)")
Output: CAGR: 0.3187 (31.87%)
The agent gets a precise, correct answer rather than reasoning through the math and potentially drifting. This is the core value proposition: offload computation to a deterministic system, offload reasoning to the probabilistic one.
Security note for production: If your agent is user-facing with arbitrary user inputs, the subprocess isolation above is a starting point, not a complete solution. Consider Docker containerization with resource limits, or use an LLM sandbox service like E2B or Modal for genuine isolation. Never run LLM-generated code with elevated privileges.
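One cheap extra layer is a static check that rejects code importing obviously risky modules before it ever reaches the subprocess. The sketch below is an assumption-laden illustration: the function name `find_blocked_imports` and the blocklist are hypothetical, and the check is easy to bypass (for example with `__import__` or `importlib`), so it complements real isolation rather than replacing it.

```python
import ast

# Hypothetical blocklist; tune it to your own threat model.
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}

def find_blocked_imports(code: str, blocked: set = BLOCKED_MODULES) -> list:
    """Return the sorted blocked top-level modules that the code imports."""
    tree = ast.parse(code)  # raises SyntaxError on invalid code
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "os.path" counts as "os"
            if root in blocked:
                found.add(root)
    return sorted(found)

print(find_blocked_imports("import os\nfrom subprocess import run\nimport math"))
print(find_blocked_imports("import pandas as pd\nprint(1 + 1)"))
```

A tool could return a message such as "imports of os are not allowed" when the list is non-empty, giving the agent a chance to rewrite its code.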
The third major tool category is external API access — CRMs, databases, internal services, weather APIs, financial data providers. We'll build a tool that hits a real financial data API to fetch stock information, which is representative of the pattern you'd use for any REST API.
We'll use Yahoo Finance via the yfinance library as a dependency-light example, then show how the same pattern applies to authenticated APIs:
pip install yfinance
import json

import yfinance as yf
from langchain_core.tools import tool


@tool
def get_stock_data(ticker: str, period: str = "1mo") -> str:
    """
    Fetch current and historical stock data for a given ticker symbol.
    Use this when you need:
    - Current stock price, volume, or market cap
    - Historical price data for trend analysis
    - Basic company financials (P/E ratio, earnings, etc.)

    Args:
        ticker: Stock ticker symbol (e.g., 'AAPL', 'NVDA', 'MSFT')
        period: Time period for historical data. Options: '1d', '5d', '1mo', '3mo', '1y', '5y'

    Returns:
        JSON string with current info and recent price history
    """
    try:
        stock = yf.Ticker(ticker.upper())
        info = stock.info
        hist = stock.history(period=period)
        if hist.empty:
            return f"No data found for ticker '{ticker}'. Verify the symbol is correct."
        # Extract the most useful fields rather than dumping everything
        summary = {
            "ticker": ticker.upper(),
            "company_name": info.get("longName", "N/A"),
            "current_price": info.get("currentPrice") or hist["Close"].iloc[-1],
            "market_cap": info.get("marketCap"),
            "pe_ratio": info.get("trailingPE"),
            "52_week_high": info.get("fiftyTwoWeekHigh"),
            "52_week_low": info.get("fiftyTwoWeekLow"),
            "volume_avg": info.get("averageVolume"),
            "price_history": {
                "period": period,
                "start_price": round(hist["Close"].iloc[0], 2),
                "end_price": round(hist["Close"].iloc[-1], 2),
                "high": round(hist["High"].max(), 2),
                "low": round(hist["Low"].min(), 2),
                "pct_change": round(
                    ((hist["Close"].iloc[-1] - hist["Close"].iloc[0])
                     / hist["Close"].iloc[0]) * 100,
                    2
                )
            }
        }
        return json.dumps(summary, indent=2, default=str)
    except Exception as e:
        return f"Failed to fetch data for {ticker}: {str(e)}"
For APIs that require authentication, the pattern is identical — you just pull credentials from environment variables and pass them in the request headers. Never hardcode API keys, and never expose them in tool outputs:
import json
import os

import requests
from langchain_core.tools import tool


@tool
def query_internal_crm(account_id: str, fields: str = "all") -> str:
    """
    Query internal CRM for account information.

    Args:
        account_id: The CRM account identifier (format: ACC-XXXXX)
        fields: Comma-separated fields to retrieve, or 'all' for full record

    Returns:
        Account data as JSON string
    """
    api_key = os.environ["CRM_API_KEY"]  # Never from user input
    base_url = os.environ["CRM_BASE_URL"]
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    params = {}
    if fields != "all":
        params["fields"] = fields
    try:
        response = requests.get(
            f"{base_url}/accounts/{account_id}",
            headers=headers,
            params=params,
            timeout=10
        )
        response.raise_for_status()
        return json.dumps(response.json(), indent=2)
    except requests.exceptions.Timeout:
        return "CRM request timed out. Try again or contact support."
    except requests.exceptions.HTTPError as e:
        # Report only the status code; response bodies can contain internal details
        return f"CRM returned error {e.response.status_code}. Check the account ID and fields."
    except Exception as e:
        # Report only the exception type; raw messages can include internal URLs
        return f"CRM query failed ({type(e).__name__})."
Now we wire everything together. We'll use LangChain's create_tool_calling_agent with OpenAI's function-calling interface, which is the current recommended approach:
import os

from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Collect all tools
tools = [web_search, execute_python, get_stock_data]

# Initialize the model: gpt-4o handles tool selection better than gpt-3.5
llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,  # Determinism matters for tool selection
    api_key=os.environ["OPENAI_API_KEY"]
)

# No explicit llm.bind_tools(tools) call is needed here:
# create_tool_calling_agent binds the tool schemas to the model for us

# The system prompt shapes how the agent reasons about tool use
system_prompt = """You are a financial research assistant with access to web search,
code execution, and real-time stock data.
When answering questions:
1. Use web_search for current news, recent events, or general market context
2. Use get_stock_data when you need specific price or financial metrics for a ticker
3. Use execute_python when you need to perform calculations, statistical analysis,
   or data transformations that require precision
Always show your reasoning. If a tool returns an error, try a different approach
before giving up. When presenting financial data, always note the data source
and approximate timestamp."""

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    MessagesPlaceholder(variable_name="chat_history", optional=True),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

# Create the agent
agent = create_tool_calling_agent(llm, tools, prompt)

# AgentExecutor handles the ReAct loop
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,  # Shows the reasoning trace, which is invaluable for debugging
    max_iterations=10,  # Prevent infinite loops
    handle_parsing_errors=True,  # Graceful recovery from malformed outputs
    return_intermediate_steps=True  # Capture the full reasoning chain
)
Let's test it with a multi-step query that requires all three tools:
response = agent_executor.invoke({
    "input": """I'm evaluating Nvidia as a potential investment.
    Can you: (1) get their current stock metrics, (2) search for any recent
    news about their AI chip business, and (3) calculate what a $10,000
    investment would be worth today if purchased at their 52-week low?"""
})
print(response["output"])
With verbose=True, you'll see the full reasoning trace in your terminal — each thought, each tool call, each observation. This is your primary debugging interface. Read it carefully when things go wrong.
A single-turn agent is useful but limited. Real workflows require conversational context — "now do the same analysis for AMD" should understand that "same analysis" means the financial evaluation you just did for Nvidia.
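LangChain provides several memory classes: ConversationBufferMemory keeps the full history, ConversationBufferWindowMemory keeps only the last k turns, and ConversationSummaryMemory compresses longer conversations into a running summary. Here's how to add a windowed buffer: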
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=10  # Keep last 10 turns to balance context length against cost
)

agent_executor_with_memory = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    verbose=True,
    max_iterations=10,
    handle_parsing_errors=True
)

# First turn
result1 = agent_executor_with_memory.invoke({
    "input": "What's Nvidia's current P/E ratio compared to their 5-year average?"
})

# Second turn: references prior context
result2 = agent_executor_with_memory.invoke({
    "input": "Now do the same comparison for AMD and tell me which looks more fairly valued."
})
Watch your token budget. Each tool call appends its output to the context window. A 5-step agent with verbose tool outputs can easily consume 15,000–20,000 tokens per invocation. Set max_iterations conservatively (8–12 for most cases) and truncate tool outputs as we did in the code execution tool above.
Let's build something you could actually use. This agent will take a company name, research it using all three tools, and produce a structured investment brief.
from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
import json

# Assume web_search, execute_python, get_stock_data are defined from above

RESEARCH_SYSTEM_PROMPT = """You are a professional equity research assistant.
When asked to research a company, you will:
1. Fetch current stock data (price, P/E, market cap, 52-week range)
2. Search for recent news: earnings results, product launches, competitive threats,
   regulatory issues, and analyst ratings changes (from the last 30 days)
3. Use code execution to calculate:
   - Year-to-date performance
   - Volatility (if you have price history)
   - Any other quantitative metrics useful for evaluation
Output a structured brief with sections:
- SNAPSHOT: Key current metrics
- RECENT DEVELOPMENTS: 3-5 most important recent news items with dates
- QUANTITATIVE ANALYSIS: Calculated metrics with methodology shown
- RISKS & TAILWINDS: Balanced view based on your research
- DATA SOURCES: Where information came from
Be precise with numbers. Flag any data that seems outdated or uncertain."""


def research_company(company_ticker: str, company_name: str) -> dict:
    """Run a full research cycle on a public company."""
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    tools = [web_search, execute_python, get_stock_data]
    prompt = ChatPromptTemplate.from_messages([
        ("system", RESEARCH_SYSTEM_PROMPT),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ])
    agent = create_tool_calling_agent(llm, tools, prompt)
    executor = AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=True,
        max_iterations=12,
        handle_parsing_errors=True,
        return_intermediate_steps=True
    )
    research_query = f"""
    Please research {company_name} (ticker: {company_ticker}) and produce
    a comprehensive investment brief. Use all available tools to gather
    current data and recent news. Calculate key metrics using the code
    execution tool for precision.
    """
    result = executor.invoke({"input": research_query})
    return {
        "company": company_name,
        "ticker": company_ticker,
        "brief": result["output"],
        "tool_calls_made": len(result["intermediate_steps"]),
        "steps": [
            {
                "tool": step[0].tool,
                "input": step[0].tool_input,
                "output_preview": str(step[1])[:200]
            }
            for step in result["intermediate_steps"]
        ]
    }


# Run it
if __name__ == "__main__":
    report = research_company("MSFT", "Microsoft")
    print(f"\n{'='*60}")
    print(f"RESEARCH BRIEF: {report['company']} ({report['ticker']})")
    print(f"Tool calls made: {report['tool_calls_made']}")
    print('='*60)
    print(report["brief"])
    print(f"\n{'='*60}")
    print("REASONING TRACE:")
    print('='*60)
    for i, step in enumerate(report["steps"], 1):
        print(f"\nStep {i}: {step['tool']}")
        print(f"  Input: {step['input']}")
        print(f"  Output preview: {step['output_preview']}...")
Run this and you'll see the agent plan its approach, make targeted tool calls, retrieve real data, calculate metrics, and synthesize a coherent brief — all without you micromanaging the steps. That's the architecture working correctly.
Symptom: Your agent keeps calling the same tool repeatedly, or hitting max_iterations every time.
Cause: Usually a tool that keeps returning unhelpful errors, causing the agent to retry the same approach. Or the task is genuinely underspecified — the agent can't determine when it's done.
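Fix: Check your tool error messages; they should guide the agent toward alternatives, not just say "failed." Also check your system prompt: it should tell the agent what "done" looks like. Set max_iterations to 10–15 and make sure hitting the limit produces a handled response rather than an exception. With agents built by create_tool_calling_agent, the supported early_stopping_method is "force" (the default), which returns a fixed "stopped" message; at the time of writing, "generate" is only implemented for the legacy agent classes and raises an error with this kind of agent.
# AgentExecutor returns a fixed message instead of throwing when a limit is hit:
agent_executor = AgentExecutor(
    ...,
    max_iterations=10,
    max_execution_time=120,  # Also add a wall-clock timeout in seconds
    early_stopping_method="force"  # Default: return a "stopped" message when a limit is hit
)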
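When reading a long trace by eye gets tedious, a small helper can flag the repeated calls that characterize this failure mode. The sketch below assumes a simplified, hypothetical step format of `(tool, input, output)` tuples; with the real `intermediate_steps` from AgentExecutor you would first map each `(action, observation)` pair to `(action.tool, action.tool_input, observation)`.

```python
from collections import Counter

def find_repeated_calls(steps, threshold: int = 2):
    """Return (tool, input) pairs that occur at least `threshold` times."""
    counts = Counter((tool, str(tool_input)) for tool, tool_input, _ in steps)
    return [call for call, n in counts.items() if n >= threshold]

# Made-up trace in which the agent keeps retrying the same failing search
steps = [
    ("web_search", "nvidia revenue", "Search failed"),
    ("web_search", "nvidia revenue", "Search failed"),
    ("get_stock_data", "NVDA", "{...}"),
    ("web_search", "nvidia revenue", "Search failed"),
]
print(find_repeated_calls(steps))
```

Identical inputs repeated against an identical error usually point to an error message that gives the agent nothing new to act on.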
Symptom: Costs are high, performance degrades on longer conversations, or you hit context length errors.
Cause: Tool outputs being returned in full — a web search might return 5,000 characters, a code execution might print a large dataframe.
Fix: Truncate all tool outputs. We showed this in the web search and code execution tools above. A good rule of thumb is 1,000–2,000 characters per tool call result. The agent rarely needs more.
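A shared helper keeps the truncation policy consistent across tools. This is a minimal sketch; the name `truncate_output` and the 1,500-character default are assumptions chosen to sit inside the rule of thumb above.

```python
def truncate_output(text: str, max_chars: int = 1500) -> str:
    """Cap a tool result and tell the model that content was cut."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return f"{text[:max_chars]}\n[Output truncated: {omitted} characters omitted]"

print(truncate_output("short result"))
print(truncate_output("x" * 5000)[-45:])
```

Telling the model how much was dropped matters: it can then decide to narrow the query instead of assuming it saw everything.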
Symptom: Agent uses web_search for stock prices instead of get_stock_data, or tries to do math via reasoning instead of execute_python.
Cause: Poor tool descriptions. The LLM decides which tool to use based on the docstring. Ambiguous descriptions lead to inconsistent selection.
Fix: Make tool docstrings explicit about when to use them and not use them. Add negative examples if needed:
"""
Use get_stock_data for: current prices, P/E ratios, market cap, price history
Do NOT use for: news, analysis, or non-price company information (use web_search for those)
"""
Symptom: Agent reports "code executed successfully but produced no output."
Cause: The LLM generates code that assigns variables but never prints them — very common when the model is writing code in a REPL-style mental model.
Fix: Your tool's error message is already handling this (we included it above). Additionally, add to your system prompt: "When writing code for execute_python, always print() your results explicitly. Don't just assign to variables."
Symptom: API error messages contain your API key or internal URLs.
Cause: Exception messages from requests library often include the full URL with query parameters, which can contain keys.
Fix: Always catch exceptions and sanitize the error message before returning it. Never return str(e) raw from an API tool.
except requests.exceptions.HTTPError as e:
    # Good: sanitized message
    return f"API returned status {e.response.status_code}. Check input parameters."
    # Bad: might expose internals
    # return f"Request failed: {str(e)}"
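When you do want to keep part of the original message for debugging, a redaction pass is one option. The sketch below uses a hypothetical `sanitize_error` helper that removes known secret values and strips URL query strings; the URL and key are made up, and a real deployment would likely need additional patterns.

```python
import re

def sanitize_error(message: str, secrets: list) -> str:
    """Remove known secret values and URL query strings from an error message."""
    cleaned = message
    for secret in secrets:
        if secret:  # skip empty strings, which would match everywhere
            cleaned = cleaned.replace(secret, "[REDACTED]")
    # Drop everything after '?' in URLs, since keys are often sent as query params
    cleaned = re.sub(r"(https?://[^\s?]+)\?\S*", r"\1?[query removed]", cleaned)
    return cleaned

raw = "401 Client Error for url: https://api.example.com/v1/quote?symbol=MSFT&apikey=abc123"
print(sanitize_error(raw, secrets=["abc123"]))
```

Redacting the exact secret values first covers cases where a key appears outside a URL, such as in a header echoed back by the server.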
Symptom: With older models or prompt-based agents, the agent "thinks" it called a tool and invents the results.
Cause: This happens with ZERO_SHOT_REACT_DESCRIPTION agents on weaker models. The model generates the action and then generates a fake observation.
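Fix: Use function-calling based agents (create_tool_calling_agent) with models that support it (GPT-4o, GPT-4, Claude 3+). Function calling puts tool calls and tool results in separate, structured message slots, and your code, not the model, fills in the results. That makes fabricated observations far less likely, though the model can still misreport a result in its final answer, so spot-check outputs against the trace.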
Tool-augmented agents are significantly more expensive than single LLM calls. A research query that makes 5 tool calls with verbose outputs can cost 10x more than a simple completion. Here's how to manage it:
Cache deterministic tool results. Stock prices change every second, but a news article from yesterday doesn't. Use functools.lru_cache or Redis for tools where the same input reliably produces the same output.
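`functools.lru_cache` never expires entries, which is a poor fit for data that goes stale. A small time-to-live wrapper is one way to bridge the gap. The sketch below is an in-memory illustration only: the `ttl_cache` decorator and the `fetch_headlines` function are hypothetical, and a multi-process deployment would more likely use Redis as mentioned above.

```python
import functools
import time

def ttl_cache(ttl_seconds: float):
    """Cache results per argument tuple and expire them after ttl_seconds."""
    def decorator(func):
        store = {}  # maps args -> (expiry_time, value)

        @functools.wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh, so skip the real call
            value = func(*args)
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator

call_count = 0

@ttl_cache(ttl_seconds=60)
def fetch_headlines(topic: str) -> str:
    """Hypothetical slow lookup; counts how often it really runs."""
    global call_count
    call_count += 1
    return f"headlines for {topic}"

fetch_headlines("nvidia")
fetch_headlines("nvidia")  # served from the cache
fetch_headlines("amd")     # different argument, so a real call
print(call_count)
```

Pick the TTL per tool: seconds for prices, hours for news searches, and no caching at all for anything with side effects.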
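Choose the right model for each role. You don't need GPT-4o for every step. Consider using a cheaper model for tool selection and a stronger one for final synthesis. Because each agent takes its llm as an argument, you can build the tool-calling agent on one model and hand its results to a second model for the final write-up.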
Instrument everything. Log every tool call, input, output length, latency, and cost. You can't optimize what you can't measure. LangSmith (LangChain's tracing product) is excellent for this.
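Set hard budgets. Track cumulative token usage and cost for each agent invocation. The snippet below measures usage after the run and flags expensive calls; to cap spend during a run, pair it with max_iterations and max_execution_time: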
import logging

from langchain_community.callbacks import get_openai_callback

logger = logging.getLogger(__name__)

with get_openai_callback() as cb:
    result = agent_executor.invoke({"input": user_query})

print(f"Total tokens: {cb.total_tokens}")
print(f"Total cost: ${cb.total_cost:.4f}")
if cb.total_cost > 0.50:
    # Log expensive invocations for review
    logger.warning(f"Expensive agent call: ${cb.total_cost:.4f} for query: {user_query[:100]}")
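For a budget that actually halts work, the core logic is a counter that raises once a limit is crossed. The sketch below shows only that logic with made-up per-step token counts; `TokenBudget` and `BudgetExceeded` are hypothetical names, and wiring `charge()` into a real agent (for example from a callback that sees each LLM response) is left as an assumption.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an invocation spends more tokens than allowed."""

class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> int:
        """Record usage and return the remaining budget; raise once exceeded."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"Used {self.used} tokens, limit is {self.max_tokens}"
            )
        return self.max_tokens - self.used

budget = TokenBudget(max_tokens=10_000)
step_usages = [3_000, 4_000, 5_000]  # made-up per-step token counts
completed = 0
try:
    for usage in step_usages:
        budget.charge(usage)
        completed += 1
except BudgetExceeded as exc:
    print(f"Stopped after {completed} steps: {exc}")
```

Catching the exception at the call site lets you return whatever partial results exist instead of failing the whole request.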
You've built a tool-augmented agent that can search the web, execute code, and call external APIs — and more importantly, you understand why the architecture works and how to debug it when it doesn't.
The key principles to take forward:
Tools extend the agent's sensory reach. The LLM is the reasoner; tools are how it perceives and affects the world. That division of responsibility is the architectural foundation.
Docstrings are model-facing API contracts. Write them with the model as your audience. Clear, specific descriptions about when and how to use each tool directly determine how reliably the agent uses them.
Errors should guide, not crash. Every tool should return a descriptive error string rather than raising an exception. The agent can recover from a bad tool call if it understands what went wrong.
The reasoning trace is your debugger. verbose=True and return_intermediate_steps=True give you full visibility into why the agent made each decision. Use them aggressively during development.
Where to go from here:
The research agent you built in the exercise is a genuine starting point for production systems. With proper error handling, caching, and logging, the same pattern powers AI assistants used in real financial research, customer support, and data analytics workflows.