
You have a folder full of vendor contracts, a pile of customer support emails, a stack of scanned PDFs from a due diligence process, or a database of raw clinical notes — and somewhere buried in all of that prose are the facts your business actually needs. Dates, dollar amounts, named parties, risk clauses, sentiment signals, product SKUs. Right now, a human reads through each document and types those facts into a spreadsheet. It takes forever, it's error-prone, and it doesn't scale. This is the problem that structured data extraction solves — and AI has made it genuinely tractable for the first time.
What "structuring unstructured data" means, concretely, is teaching an AI to read a blob of text and return something your database, spreadsheet, or pipeline can actually consume: a JSON object with named fields, a clean table with rows and columns, a list of extracted entities with their types and positions. The key insight is that large language models aren't just autocomplete engines — they're remarkably good at following instructions that say "read this, identify these things, and give them back to me in this shape." When you combine that capability with disciplined prompt design and a clear output schema, you get a repeatable extraction pipeline that can process thousands of documents overnight.
By the end of this lesson, you will have built working extraction pipelines for real-world document types and understood the design decisions that separate fragile demos from production-grade systems.
What you'll learn:
You should already be comfortable with:
You do not need a machine learning background. This lesson is about applied prompt engineering and pipeline design, not model internals.
Before AI, the standard toolkit for this problem was regex, rule-based parsers, and classical NLP libraries like spaCy or NLTK. These approaches work — if your documents are highly consistent. The moment a vendor writes "net thirty days" instead of "NET 30" or puts the contract value in a footnote instead of a header, your regex breaks. Rule-based systems are brittle by design: they only know what you explicitly programmed them to know.
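A minimal sketch of that brittleness, assuming a hypothetical `NET_TERMS` pattern written for the "NET 30" style. The pattern, the helper name, and the sample strings are illustrative and not taken from a real parser:

```python
import re
from typing import Optional

# Hypothetical rule written for invoices that say "NET 30" or "Net 45".
NET_TERMS = re.compile(r"\bnet\s*(\d+)\b", re.IGNORECASE)


def payment_days(text: str) -> Optional[int]:
    """Return the day count from a 'Net N' payment term, or None if nothing matches."""
    match = NET_TERMS.search(text)
    return int(match.group(1)) if match else None


print(payment_days("Payment Terms: NET 30"))           # 30
print(payment_days("Payment is due net thirty days"))  # None: same meaning, no match
```

Each new phrasing needs another hand-written rule, which is the maintenance burden described above.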
LLMs approach this differently. Instead of matching surface patterns, they use context. They can read "the parties have agreed that payment shall be remitted no later than the final business day of the calendar month following delivery" and recognize it as a payment term: due at the end of the month after delivery, which is close to net 30 in practice but not identical. A regex would need a hand-built rule for every such phrasing. That contextual understanding is what makes AI extraction qualitatively different from the old approach.
But LLMs come with their own failure modes: they hallucinate values that aren't in the text, they format output inconsistently when not constrained, and they can silently misinterpret ambiguous passages. Good extraction engineering is about harnessing the contextual power of LLMs while systematically guarding against these failure modes. That's what this lesson is actually about.
The single most important concept in AI-powered data extraction is schema-first thinking. Before you write a single line of prompt text, you need to define exactly what you want to get out. What are the fields? What type is each field? What should the model return when a field isn't present?
This sounds obvious but most people skip it, which is why most extraction demos fall apart when they hit real data.
Let's work with a concrete example. Suppose you're processing vendor invoices. Before prompting anything, define your target schema:
invoice_schema = {
    "invoice_number": "string or null",
    "invoice_date": "ISO 8601 date string or null",
    "due_date": "ISO 8601 date string or null",
    "vendor_name": "string or null",
    "vendor_address": "string or null",
    "total_amount": "float or null",
    "currency": "3-letter ISO currency code or null",
    "line_items": [
        {
            "description": "string",
            "quantity": "float or null",
            "unit_price": "float or null",
            "line_total": "float or null"
        }
    ],
    "payment_terms": "string or null",
    "notes": "string or null"
}
Once you have this, your prompt writes itself. You're not asking the model to "extract information from an invoice" — you're asking it to populate a specific data structure. That specificity dramatically improves consistency.
Here's what a production-worthy extraction prompt looks like:
INVOICE_EXTRACTION_PROMPT = """
You are a data extraction assistant. Your job is to extract structured information
from the invoice text provided and return it as valid JSON.
RULES:
1. Return ONLY valid JSON. No explanation, no markdown, no code blocks.
2. If a field is not present in the document, return null for that field.
3. Normalize dates to ISO 8601 format (YYYY-MM-DD).
4. Normalize monetary amounts to float (e.g., "$1,234.56" becomes 1234.56).
5. Currency should be a 3-letter ISO code (USD, EUR, GBP, etc.). Default to USD
if currency is implied but not stated.
6. Do not infer or calculate values that are not explicitly present in the text.
SCHEMA:
{schema}
INVOICE TEXT:
{document_text}
Return the populated JSON object now:
"""
Notice what's happening in each rule:
Let's build a complete, working extraction pipeline. We'll use the OpenAI Python SDK, but the pattern applies to any LLM API.
import json
import openai
from typing import Optional
client = openai.OpenAI() # assumes OPENAI_API_KEY is set in environment
INVOICE_SCHEMA = {
    "invoice_number": "string or null",
    "invoice_date": "ISO 8601 date string or null",
    "due_date": "ISO 8601 date string or null",
    "vendor_name": "string or null",
    "vendor_address": "string or null",
    "total_amount": "float or null",
    "currency": "3-letter ISO currency code or null",
    "line_items": [
        {
            "description": "string",
            "quantity": "float or null",
            "unit_price": "float or null",
            "line_total": "float or null"
        }
    ],
    "payment_terms": "string or null",
    "notes": "string or null"
}
EXTRACTION_PROMPT = """
You are a data extraction assistant. Extract structured information from the
invoice text below and return it as valid JSON matching the schema provided.
RULES:
1. Return ONLY valid JSON. No explanation, no markdown, no code blocks.
2. Use null (not "N/A", not "") for any field not present in the document.
3. Normalize dates to ISO 8601 format (YYYY-MM-DD).
4. Normalize monetary amounts to float (e.g., "$1,234.56" becomes 1234.56).
5. Currency should be a 3-letter ISO code. Default to USD if clearly implied.
6. Never infer or calculate values not explicitly present in the document.
SCHEMA:
{schema}
DOCUMENT TEXT:
{document_text}
"""
def extract_invoice_data(document_text: str) -> Optional[dict]:
    """
    Extract structured invoice data from raw text.
    Returns a dict on success, None on failure.
    """
    prompt = EXTRACTION_PROMPT.format(
        schema=json.dumps(INVOICE_SCHEMA, indent=2),
        document_text=document_text
    )
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You are a precise data extraction system. Return only valid JSON."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            temperature=0,  # deterministic output for extraction tasks
            max_tokens=2000,
            response_format={"type": "json_object"}  # enforces JSON output at API level
        )
        raw_output = response.choices[0].message.content
        extracted = json.loads(raw_output)
        return extracted
    except json.JSONDecodeError as e:
        print(f"JSON parsing failed: {e}")
        print(f"Raw output was: {raw_output[:500]}")
        return None
    except Exception as e:
        print(f"API call failed: {e}")
        return None
def validate_extracted_invoice(data: dict) -> dict:
    """
    Post-process and validate extracted invoice data.
    Returns the data dict with a 'validation_errors' list appended.
    """
    errors = []

    # Check required fields
    required_fields = ["invoice_number", "vendor_name", "total_amount"]
    for field in required_fields:
        if data.get(field) is None:
            errors.append(f"Missing required field: {field}")

    # Validate total_amount is numeric and not negative
    total = data.get("total_amount")
    if total is not None:
        if not isinstance(total, (int, float)):
            errors.append(f"total_amount is not numeric: {total}")
        elif total < 0:
            errors.append(f"total_amount is negative: {total}")

    # Validate line item math (flag discrepancies > 1% as potential issues).
    # The schema has no subtotal or tax field, so an invoice whose total
    # includes tax will be flagged here and routed to review.
    # The isinstance guard avoids a TypeError when total_amount came back as a string.
    if data.get("line_items") and isinstance(total, (int, float)) and total:
        line_sum = sum(
            item.get("line_total", 0) or 0
            for item in data["line_items"]
        )
        if line_sum > 0 and abs(line_sum - total) / total > 0.01:
            errors.append(
                f"Line item sum ({line_sum:.2f}) doesn't match total ({total:.2f})"
            )

    data["validation_errors"] = errors
    data["extraction_status"] = "clean" if not errors else "needs_review"
    return data
Why temperature=0? Extraction tasks have a correct answer: the information is either in the document or it isn't. Setting temperature to zero makes the model pick the most likely token at each step, which suppresses the sampling variety that's useful for generation but actively harmful for extraction. Output becomes far more repeatable, though hosted APIs don't guarantee bit-for-bit identical results across runs. Always use temperature 0 or very close to it for structured extraction.
Now let's test it with a realistic messy invoice:
sample_invoice = """
INVOICE
Acme Supplies Co.
742 Evergreen Terrace, Springfield, IL 62701
Tel: (555) 238-1000
Bill To:
Riverside Medical Group
1400 Harbor Boulevard
Fullerton, CA 92835
Invoice #: INV-2024-08847
Date: October 15, 2024
Payment Due: November 14, 2024
DESCRIPTION QTY UNIT PRICE AMOUNT
------------------------------------------------------------------
Nitrile Examination Gloves (Box/100) 50 $18.75 $937.50
Disposable Face Masks (Box/50) 30 $12.00 $360.00
Hand Sanitizer 1L Pump 20 $8.50 $170.00
Biohazard Waste Bags (Case/100) 10 $24.99 $249.90
SUBTOTAL: $1,717.40
TAX (8%): $137.39
TOTAL: $1,854.79
Payment Terms: Net 30
Please make checks payable to Acme Supplies Co.
For wire transfer inquiries, contact ar@acmesupplies.com
"""
result = extract_invoice_data(sample_invoice)
if result:
    validated = validate_extracted_invoice(result)
    print(json.dumps(validated, indent=2))
The output you'd get back would be a clean, typed JSON object with properly formatted dates, numeric totals, and structured line items — ready to insert into a database or feed to a downstream process.
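It is worth checking the sample's arithmetic by hand before trusting any pipeline run. The sketch below uses only the figures printed on the sample invoice. It also shows that the line items alone differ from the stated total by roughly 7%, because the total includes tax, so a check that compares the line-item sum against `total_amount` with a 1% tolerance would mark this invoice for review. Adding subtotal and tax fields to the schema is one possible fix; that is a suggestion, not part of the schema above.

```python
from decimal import Decimal, ROUND_HALF_UP

# Figures copied from the sample invoice above.
line_totals = [Decimal("937.50"), Decimal("360.00"), Decimal("170.00"), Decimal("249.90")]
stated_total = Decimal("1854.79")
tax_rate = Decimal("0.08")

subtotal = sum(line_totals, Decimal("0"))
tax = (subtotal * tax_rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
gap = abs(subtotal - stated_total) / stated_total

print(subtotal)               # 1717.40
print(subtotal + tax)         # 1854.79
print(gap > Decimal("0.01"))  # True: the line items alone miss the total by about 7%
```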
Entity extraction is a specific flavor of structured extraction where you're pulling typed facts out of text — not filling out a form template, but finding and classifying mentions of things the document talks about.
Classic NER (Named Entity Recognition) using spaCy or similar tools will find entities, but only the ones it was trained on, and only when they're phrased conventionally. LLM-based NER is much more flexible. You can define custom entity types, handle unusual phrasings, and extract relationships between entities — not just the entities themselves.
Here's a realistic scenario: you're processing merger and acquisition press releases to build a competitive intelligence database. You need to extract the acquiring company, the target company, deal value, announcement date, and deal type.
MA_EXTRACTION_PROMPT = """
You are an M&A intelligence analyst. Extract deal information from the press
release below and return it as valid JSON.
ENTITY TYPES TO EXTRACT:
- acquiring_company: The company making the acquisition
- target_company: The company being acquired or merged with
- deal_value: The announced transaction value (normalize to USD millions as a float,
or null if undisclosed)
- deal_type: One of ["acquisition", "merger", "majority_stake", "minority_investment",
"asset_purchase", "unknown"]
- announcement_date: ISO 8601 date
- expected_close_date: ISO 8601 date or null if not stated
- advisors: List of financial/legal advisors mentioned with their role
(e.g., [{{"firm": "Goldman Sachs", "role": "financial advisor", "advising": "acquirer"}}])
- deal_rationale: A 1-2 sentence summary of the stated strategic rationale (not your
own analysis — only what the press release explicitly states)
- regulatory_approvals_required: List of regulatory bodies mentioned
(e.g., ["FTC", "EU Commission"])
Return ONLY valid JSON. Use null for fields not present in the text.
Do not infer values not explicitly stated.
PRESS RELEASE TEXT:
{text}
"""
sample_press_release = """
FOR IMMEDIATE RELEASE
October 22, 2024
NORTHSTAR HEALTH SYSTEMS ANNOUNCES DEFINITIVE AGREEMENT TO ACQUIRE
PRECISION DIAGNOSTICS INC. FOR $2.3 BILLION
CHICAGO — NorthStar Health Systems (NYSE: NHS) today announced it has entered
into a definitive agreement to acquire Precision Diagnostics Inc., a leading
provider of AI-powered pathology solutions, for approximately $2.3 billion in
an all-cash transaction.
The acquisition is expected to accelerate NorthStar's precision medicine strategy
by integrating Precision Diagnostics' proprietary image analysis platform, which
currently serves over 400 hospital systems nationwide, into NorthStar's existing
oncology service line.
"This transaction represents a compelling opportunity to deliver faster, more
accurate diagnoses to patients while creating significant value for our shareholders,"
said Dr. Margaret Chen, Chief Executive Officer of NorthStar Health Systems.
The transaction is subject to customary closing conditions, including regulatory
approval from the Federal Trade Commission, and is expected to close in the first
quarter of 2025.
Goldman Sachs & Co. LLC is serving as financial advisor and Skadden, Arps, Slate,
Meagher & Flom LLP is serving as legal counsel to NorthStar. Centerview Partners
LLC is serving as financial advisor and Weil, Gotshal & Manges LLP is serving as
legal counsel to Precision Diagnostics.
###
"""
The model will return something like:
{
  "acquiring_company": "NorthStar Health Systems",
  "target_company": "Precision Diagnostics Inc.",
  "deal_value": 2300.0,
  "deal_type": "acquisition",
  "announcement_date": "2024-10-22",
  "expected_close_date": "2025-03-31",
  "advisors": [
    {"firm": "Goldman Sachs & Co. LLC", "role": "financial advisor", "advising": "acquirer"},
    {"firm": "Skadden, Arps, Slate, Meagher & Flom LLP", "role": "legal counsel", "advising": "acquirer"},
    {"firm": "Centerview Partners LLC", "role": "financial advisor", "advising": "target"},
    {"firm": "Weil, Gotshal & Manges LLP", "role": "legal counsel", "advising": "target"}
  ],
  "deal_rationale": "The acquisition is intended to accelerate NorthStar's precision medicine strategy by integrating Precision Diagnostics' AI-powered image analysis platform into NorthStar's oncology service line.",
  "regulatory_approvals_required": ["Federal Trade Commission"]
}
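Notice that expected_close_date is 2025-03-31. The model read "first quarter of 2025" as an approximate date and normalized it to the last day of the quarter. That contextual reading is what separates LLM extraction from regex, but it is also an inference: the press release never states March 31, and the prompt says not to infer values that aren't explicitly stated. Decide which behavior you want and say so in the prompt, either by allowing quarter-end normalization explicitly or by asking for the original phrase so the date stays null until a downstream rule or a reviewer resolves it.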
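One way to keep the quarter-to-date convention explicit is to have the model return the phrase as written and resolve it in code. The sketch below is a hypothetical helper: the name `quarter_end`, the phrasings it recognizes, and the choice of quarter end as the convention are all assumptions.

```python
import calendar
import re
from datetime import date
from typing import Optional

# Assumption: only spelled-out phrases like "first quarter of 2025" are handled.
QUARTERS = {"first": 1, "second": 2, "third": 3, "fourth": 4}
QUARTER_PHRASE = re.compile(
    r"\b(first|second|third|fourth)\s+quarter\s+of\s+(\d{4})\b", re.IGNORECASE
)


def quarter_end(text: str) -> Optional[date]:
    """Map a phrase like 'first quarter of 2025' to the last day of that quarter."""
    match = QUARTER_PHRASE.search(text)
    if not match:
        return None
    quarter = QUARTERS[match.group(1).lower()]
    year = int(match.group(2))
    last_month = quarter * 3
    last_day = calendar.monthrange(year, last_month)[1]  # days in that month
    return date(year, last_month, last_day)


print(quarter_end("expected to close in the first quarter of 2025"))  # 2025-03-31
```

With this split, the convention lives in one reviewable function instead of in the model's judgment.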
One of the most common extraction challenges is when tabular data exists in a document but isn't stored as a structured table — it's in a PDF that got OCR'd poorly, it's in the body of an email, or it's embedded in a narrative financial report.
Let's say you're processing quarterly earnings call transcripts to extract financial guidance tables. The table might appear in the source text like this:
Management reaffirmed full-year 2024 guidance as follows: revenue is expected
to come in between $4.2 billion and $4.4 billion, representing growth of 12 to
18 percent over the prior year. Adjusted EBITDA margin is expected to be in the
range of 23 to 25 percent. Capital expenditures are guided to approximately
$180 million, flat with 2023 levels. Free cash flow conversion is expected to
remain above 85 percent of net income.
Your extraction prompt needs to recognize that this narrative contains a table and reconstruct it:
TABLE_EXTRACTION_PROMPT = """
You are a financial data analyst. The text below contains financial guidance
information presented in narrative form. Extract it into a structured table.
Return a JSON object with a single key "guidance_table" containing a list of rows.
Each row should have these fields:
- metric: The financial metric name (string)
- low_end: The low end of the guidance range as a float, or the single value if
no range is given (float or null)
- high_end: The high end of the guidance range as a float, or null if it's a
point estimate (float or null)
- unit: The unit of measurement — one of ["USD_millions", "USD_billions",
"percentage", "ratio", "count"] (string)
- notes: Any qualifying language (string or null)
Normalize all dollar amounts — "$4.2 billion" becomes 4200.0 with unit "USD_millions".
Percentages like "23 to 25 percent" become low_end: 23.0, high_end: 25.0, unit: "percentage".
Return ONLY valid JSON. No explanation.
TEXT:
{text}
"""
Expected output:
{
  "guidance_table": [
    {
      "metric": "Revenue",
      "low_end": 4200.0,
      "high_end": 4400.0,
      "unit": "USD_millions",
      "notes": "Represents 12 to 18 percent growth over prior year"
    },
    {
      "metric": "Adjusted EBITDA Margin",
      "low_end": 23.0,
      "high_end": 25.0,
      "unit": "percentage",
      "notes": null
    },
    {
      "metric": "Capital Expenditures",
      "low_end": 180.0,
      "high_end": null,
      "unit": "USD_millions",
      "notes": "Flat with 2023 levels"
    },
    {
      "metric": "Free Cash Flow Conversion",
      "low_end": 85.0,
      "high_end": null,
      "unit": "percentage",
      "notes": "Expected to remain above 85 percent of net income"
    }
  ]
}
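Once the rows are in this shape, turning them into a table a spreadsheet can open is mechanical. The following is a minimal sketch using only the standard library; the function name `guidance_to_csv` is hypothetical, and the sample input is a shortened copy of the expected output above.

```python
import csv
import io
import json

# A shortened copy of the expected output above, used as sample input.
raw = """
{"guidance_table": [
  {"metric": "Revenue", "low_end": 4200.0, "high_end": 4400.0,
   "unit": "USD_millions", "notes": "Represents 12 to 18 percent growth over prior year"},
  {"metric": "Capital Expenditures", "low_end": 180.0, "high_end": null,
   "unit": "USD_millions", "notes": "Flat with 2023 levels"}
]}
"""

FIELDS = ["metric", "low_end", "high_end", "unit", "notes"]


def guidance_to_csv(payload: str) -> str:
    """Convert the extracted guidance JSON into CSV text; nulls become empty cells."""
    rows = json.loads(payload)["guidance_table"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing keys are written as empty cells rather than raising.
        writer.writerow({name: row.get(name) for name in FIELDS})
    return buffer.getvalue()


print(guidance_to_csv(raw))
```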
Pro tip: When extracting tables from OCR'd PDFs, add a pre-processing instruction: "The text may contain OCR errors such as '0' rendered as 'O', missing spaces, or broken line breaks. Use context to resolve obvious OCR artifacts." This small addition significantly improves accuracy on imperfect source documents.
Everything above assumes your document fits in the model's context window. When you're processing long contracts (50+ pages), full annual reports, or large email threads, you need a strategy for handling documents that exceed your token budget.
There are two main approaches, and choosing between them depends on your document structure:
Strategy 1: Hierarchical Chunking — Split the document into meaningful chunks (by section, by page, by paragraph), extract from each chunk independently, then merge the results. This works well when each chunk is self-contained.
Strategy 2: Map-Reduce Extraction — Extract from each chunk, then pass all chunk extractions to a final "merge and deduplicate" pass. This is better when facts might span chunk boundaries.
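Strategy 1 can be sketched without any model calls. The example below splits on section boundaries, assuming sections begin with a numbered heading such as "1. Definitions". That heading convention, the function name, and the sample contract text are assumptions for illustration; real contracts need a pattern matched to their own layout.

```python
import re

# Assumption: sections start with a numbered heading such as "1. Definitions".
SECTION_HEADING = re.compile(r"^\d+\.\s+\S.*$", re.MULTILINE)


def split_by_section(text: str) -> list[str]:
    """Split text at numbered headings, keeping each heading with its body."""
    starts = [m.start() for m in SECTION_HEADING.finditer(text)]
    if not starts:
        return [text.strip()]
    chunks = []
    preamble = text[:starts[0]].strip()
    if preamble:
        chunks.append(preamble)  # title block before the first heading
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunks.append(text[begin:end].strip())
    return chunks


contract = """MASTER SERVICES AGREEMENT
1. Definitions
"Services" means the work described in Exhibit A.
2. Payment Terms
Invoices are due within thirty days of receipt.
"""

for chunk in split_by_section(contract):
    print(repr(chunk.splitlines()[0]))
```

Each chunk can then go to the extractor on its own, which works when sections are self-contained as described above.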
Here's a practical implementation of map-reduce extraction for long contracts:
import tiktoken

# Reuses json, client, and extract_invoice_data from the pipeline above.


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens in text using tiktoken."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))


def chunk_document(text: str, max_tokens: int = 6000, overlap_tokens: int = 200) -> list[str]:
    """
    Split document into overlapping chunks by token count.
    Overlap helps catch facts that span chunk boundaries.
    """
    if overlap_tokens >= max_tokens:
        # Otherwise start would never advance and the loop would not terminate.
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    enc = tiktoken.encoding_for_model("gpt-4o")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunk_text = enc.decode(chunk_tokens)
        chunks.append(chunk_text)
        if end >= len(tokens):
            break
        # Move start forward by (max_tokens - overlap_tokens)
        start += max_tokens - overlap_tokens
    return chunks


def extract_from_long_document(document_text: str, schema: dict, extraction_prompt: str) -> dict:
    """
    Process a long document using map-reduce extraction.
    Note: this simplified version calls extract_invoice_data directly, so
    extraction_prompt is not used; swap in your own extractor as needed.
    """
    total_tokens = count_tokens(document_text)

    # If it fits in context, process directly
    if total_tokens < 6000:
        return extract_invoice_data(document_text)  # or whatever your extractor is

    print(f"Document is {total_tokens} tokens; chunking into segments...")
    chunks = chunk_document(document_text, max_tokens=6000, overlap_tokens=300)
    print(f"Created {len(chunks)} chunks")

    # Map phase: extract from each chunk
    chunk_results = []
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}...")
        result = extract_invoice_data(chunk)
        if result:
            chunk_results.append(result)

    # Reduce phase: merge chunk extractions
    merge_prompt = f"""
You are a data merging assistant. Below are {len(chunk_results)} partial extractions
from different sections of the same document. Some fields may appear in multiple
partial extractions.
Your job: merge them into a single coherent JSON object following the schema below.
RULES:
- For fields with single values: use the most specific/complete value across all partials
- For list fields (like line_items): combine all unique entries, deduplicate
- If the same field has conflicting non-null values across chunks, note the conflict
in a "merge_conflicts" list
- Fields that are null in all partials should remain null
SCHEMA:
{json.dumps(schema, indent=2)}
PARTIAL EXTRACTIONS:
{json.dumps(chunk_results, indent=2)}
Return only the merged JSON object:
"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": merge_prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    merged = json.loads(response.choices[0].message.content)
    merged["_chunks_processed"] = len(chunks)
    return merged
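The reduce step does not have to be a model call. When the merge rules are as simple as the ones listed in the merge prompt, a deterministic merge in plain code is an alternative worth considering, since it cannot invent values. The sketch below is one possible version, not the approach used above; `merge_partials` is a hypothetical name, and "first non-null value wins" is a simplification of "most specific/complete value".

```python
from typing import Any


def merge_partials(partials: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge per-chunk extractions: first non-null scalar wins, lists are unioned."""
    merged: dict[str, Any] = {}
    conflicts: list[dict[str, Any]] = []
    for partial in partials:
        for key, value in partial.items():
            if value is None:
                merged.setdefault(key, None)
            elif isinstance(value, list):
                existing = merged.get(key) or []
                # Keep order; skip entries already seen (chunk overlap creates duplicates).
                merged[key] = existing + [item for item in value if item not in existing]
            elif merged.get(key) is None:
                merged[key] = value
            elif merged[key] != value:
                conflicts.append({"field": key, "kept": merged[key], "other": value})
    merged["merge_conflicts"] = conflicts
    return merged


partials = [
    {"invoice_number": "INV-2024-08847", "total_amount": None,
     "line_items": [{"description": "Gloves", "line_total": 937.5}]},
    {"invoice_number": "INV-2024-08847", "total_amount": 1854.79,
     "line_items": [{"description": "Gloves", "line_total": 937.5},
                    {"description": "Masks", "line_total": 360.0}]},
]
print(merge_partials(partials))
```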
Production extraction systems need more than the extracted values — they need to know how confident the model is and where in the document each value came from. This is essential for human-in-the-loop review workflows where you want to automatically accept high-confidence extractions and flag low-confidence ones for a human to verify.
Modify your extraction prompt to request source attribution:
ATTRIBUTED_EXTRACTION_PROMPT = """
You are a precise data extraction system. For each field you extract, also provide:
1. A confidence score from 0.0 to 1.0 (how certain are you this value is correct?)
2. The verbatim text from the document that supports this extraction (quote it exactly)
Return a JSON object where each field has this structure:
{{
"value": <the extracted value>,
"confidence": <float 0.0-1.0>,
"source_text": <verbatim quote from document, or null if field is null>
}}
Confidence guide:
- 1.0: The value is stated explicitly and unambiguously
- 0.8-0.9: The value is clearly implied or requires minor interpretation
- 0.5-0.7: The value requires significant inference or there's some ambiguity
- Below 0.5: You are guessing — consider returning null instead
DOCUMENT:
{document_text}
"""
With this structure, your validation pipeline can automatically route extractions:
def route_extraction_by_confidence(extraction: dict, threshold: float = 0.85) -> str:
    """
    Route an extraction based on minimum field confidence.
    Returns 'auto_accept', 'needs_review', or 'reject'
    """
    if not extraction:
        return 'reject'

    critical_fields = ['invoice_number', 'total_amount', 'vendor_name', 'invoice_date']
    confidences = []
    for field in critical_fields:
        field_data = extraction.get(field, {})
        if isinstance(field_data, dict):
            conf = field_data.get('confidence', 0)
            confidences.append(conf)

    if not confidences:
        return 'needs_review'

    min_confidence = min(confidences)
    avg_confidence = sum(confidences) / len(confidences)

    if min_confidence >= threshold:
        return 'auto_accept'
    elif avg_confidence >= 0.6:
        return 'needs_review'
    else:
        return 'reject'
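A confidence score is the model's own estimate, so it helps to pair it with a check that does not depend on the model at all. Because the prompt asks for a verbatim quote, you can verify that each `source_text` really occurs in the document. The sketch below assumes the per-field structure requested by the attributed prompt; the function names are hypothetical.

```python
import re


def normalize_ws(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()


def unsupported_fields(extraction: dict, document_text: str) -> list[str]:
    """Return field names whose quoted source_text does not appear in the document."""
    haystack = normalize_ws(document_text)
    missing = []
    for field, data in extraction.items():
        if not isinstance(data, dict) or data.get("value") is None:
            continue  # null fields need no supporting quote
        quote = data.get("source_text")
        if not quote or normalize_ws(quote) not in haystack:
            missing.append(field)
    return missing


document = "Invoice #: INV-2024-08847\nDate: October 15, 2024\nTOTAL:   $1,854.79"
extraction = {
    "invoice_number": {"value": "INV-2024-08847", "confidence": 1.0,
                       "source_text": "Invoice #: INV-2024-08847"},
    "total_amount": {"value": 1854.79, "confidence": 0.9,
                     "source_text": "TOTAL: $1,854.79"},
    "vendor_name": {"value": "Acme Supplies Co.", "confidence": 0.9,
                    "source_text": "Acme Supplies Co."},
}
print(unsupported_fields(extraction, document))  # ['vendor_name']
```

Any field this returns is a candidate for human review regardless of the confidence the model reported.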
Here's a complete real-world project to build on your own. You're building a customer support ticket triage system that needs to extract structured data from raw email/ticket text so tickets can be automatically routed and prioritized.
Your target schema:
TICKET_SCHEMA = {
"customer_name": "string or null",
"customer_account_id": "string or null — look for patterns like ACC-XXXXX or account numbers",
"product_name": "string or null",
"issue_category": "one of: ['billing', 'technical_error', 'feature_request', 'account_access', 'shipping', 'other']",
"severity": "one of: ['critical', 'high', 'medium', 'low'] — infer from urgency language",
"error_codes": "list of any error codes or HTTP status codes mentioned",
"affected_feature": "string or null — the specific feature or workflow that's broken",
"reproducible": "boolean or null — can the customer reproduce the issue?",
"workaround_exists": "boolean or null — has the customer found any workaround?",
"sentiment": "one of: ['frustrated', 'neutral', 'positive']",
"action_requested": "string — what does the customer want done?"
}
Your test tickets (use all three to test different edge cases):
TICKET 1:
Subject: URGENT - Production system down, cannot process payments
From: sarah.kowalski@techcorp.com
Our entire payment processing pipeline has been throwing error code ERR-5031
since approximately 2:15 PM EST. We've tried restarting the service three times
with no luck. Every transaction is failing with HTTP 503. This is costing us
roughly $12,000 per hour. We need someone on the phone immediately.
Account: ACC-88291. Our CTO is copied on this thread.
TICKET 2:
Subject: Question about adding users to our Enterprise plan
From: james.liu@startup.io
Hi team, love the product! Quick question — we're growing and want to add 5 more
seats to our plan. Is this something I can do from the admin panel or do I need
to go through sales? We're on the Enterprise Plus tier. Not urgent, just want
to sort it out before our next billing cycle. Thanks!
TICKET 3:
Subject: Fwd: Export to CSV not working properly
From: m.petersen@globalcorp.net
See the attached screenshots (note: I can reproduce this consistently).
When I try to export reports to CSV from the Analytics Dashboard, the file
downloads but the date column is all messed up — dates are showing as 5-digit
numbers instead of formatted dates. Looks like a Unix timestamp issue maybe?
This affects ALL of our analysts. We've found that exporting to XLSX works fine
as a temporary fix. Running Chrome 119.0 on Windows 11.
Your tasks:
Mistake 1: Not specifying null behavior
If you don't tell the model what to return for missing fields, it will invent plausible-sounding values. A missing invoice number might come back as "INV-001" or "N/A" or an empty string — none of which are consistent or useful. Always explicitly state: "return null for any field not present in the document."
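Even with that instruction in place, a small post-processing step can catch placeholder strings that slip through. This is a minimal sketch; the set of placeholder spellings is an assumption you would extend for your own data. It cannot catch an invented but plausible value such as "INV-001", which is why source attribution and validation still matter.

```python
from typing import Any

# Assumption: example placeholder spellings; extend the set for your data.
PSEUDO_NULLS = {"", "n/a", "na", "none", "null", "unknown", "not provided"}


def normalize_nulls(value: Any) -> Any:
    """Recursively replace placeholder strings with None in dicts and lists."""
    if isinstance(value, dict):
        return {key: normalize_nulls(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize_nulls(item) for item in value]
    if isinstance(value, str) and value.strip().lower() in PSEUDO_NULLS:
        return None
    return value


raw = {"invoice_number": "N/A", "notes": "", "vendor_name": "Acme Supplies Co.",
       "line_items": [{"description": "Gloves", "quantity": "unknown"}]}
print(normalize_nulls(raw))
```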
Mistake 2: Asking for too much in one prompt
If your schema has 30 fields and your document is complex, prompt quality degrades. The model starts making more errors as it tries to track everything at once. Solution: break complex extractions into 2-3 focused passes. First extract the header fields, then extract line items, then extract terms and conditions.
Mistake 3: Not using temperature 0
This one bites people constantly. They run an extraction, it works, they move on. Then they notice that the same document extracts differently on different runs because temperature > 0 introduces variability. Set temperature to 0 for any extraction task where you want deterministic behavior.
Mistake 4: Trusting extracted numbers without validation
The model can silently misread $1,234.56 as $12,345.6 or drop a zero from a large number. Always validate numeric extractions against any checksums or totals available in the document. The line item sum validation in our pipeline above is an example of this.
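Line items carry their own checksum: quantity times unit price should equal the line total. A minimal sketch of that check follows, with a hypothetical function name and an assumed tolerance of one cent. The second sample row contains a deliberately misplaced decimal.

```python
def line_item_errors(line_items: list[dict], tolerance: float = 0.01) -> list[str]:
    """Flag line items where quantity * unit_price differs from line_total."""
    errors = []
    for index, item in enumerate(line_items):
        qty = item.get("quantity")
        price = item.get("unit_price")
        total = item.get("line_total")
        if qty is None or price is None or total is None:
            continue  # nothing to cross-check
        if abs(qty * price - total) > tolerance:
            errors.append(
                f"line {index}: {qty} x {price} = {qty * price:.2f}, stated {total:.2f}"
            )
    return errors


items = [
    {"description": "Nitrile Examination Gloves (Box/100)",
     "quantity": 50, "unit_price": 18.75, "line_total": 937.50},
    {"description": "Biohazard Waste Bags (Case/100)",
     "quantity": 10, "unit_price": 24.99, "line_total": 2499.0},  # misplaced decimal
]
print(line_item_errors(items))
```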
Mistake 5: Not handling API JSON mode correctly
OpenAI's response_format={"type": "json_object"} enforces JSON output, but it doesn't guarantee your JSON matches your schema. A model can return perfectly valid JSON that completely ignores your schema. You still need schema validation — consider using Pydantic models or jsonschema for this:
from pydantic import BaseModel, validator
from typing import Optional, List

# Note: in Pydantic v2, `validator` still works but is deprecated
# in favor of `field_validator`.


class LineItem(BaseModel):
    description: str
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    line_total: Optional[float] = None


class Invoice(BaseModel):
    invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    vendor_name: Optional[str] = None
    total_amount: Optional[float] = None
    currency: Optional[str] = "USD"
    line_items: List[LineItem] = []

    @validator('total_amount')
    def total_must_not_be_negative(cls, v):
        if v is not None and v < 0:
            raise ValueError('total_amount must not be negative')
        return v


# Then validate like this (extracted_data is the dict returned by your extractor):
try:
    invoice = Invoice(**extracted_data)
    print("Validation passed")
except Exception as e:
    print(f"Validation failed: {e}")
Mistake 6: Skipping chunk overlap on long documents
If you chunk a document without overlap, facts that appear near a chunk boundary get split. An entity mention that starts at the end of chunk 1 might be meaningless without the context that continues into chunk 2. Use 200-300 token overlaps between chunks.
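The effect is easy to see on a toy example. The sketch below uses whitespace-separated words as a stand-in for tokens, which is an assumption made to keep the example dependency-free, and a window of 8 words instead of thousands of tokens. The function name and the sample sentence are illustrative.

```python
def chunk_words(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split a word list into windows of `size` that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end
    return chunks


words = "the total contract value shall not exceed 2.5 million dollars per year".split()
phrase = "2.5 million dollars"

for overlap in (0, 3):
    joined = [" ".join(c) for c in chunk_words(words, size=8, overlap=overlap)]
    print(overlap, any(phrase in c for c in joined))
# 0 False  (the phrase is cut in half at the boundary)
# 3 True   (the second window contains the whole phrase)
```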
You've now built the conceptual and technical foundation for production-grade document extraction pipelines. The core principles to carry forward:
Schema-first always. Define your output before you write your prompt. The schema is the contract between your extraction system and everything downstream.
LLMs excel at contextual understanding; validation code handles correctness. Use the model for the hard semantic work (understanding "net thirty" as a payment term), and use deterministic code for the verification work (checking that line items sum to the stated total).
Temperature 0, explicit null handling, and source attribution are the three non-negotiable ingredients of extraction prompts that you'll trust in production.
Chunking is a pipeline design problem, not an afterthought. Think about document structure and chunk boundaries before you write your first line of code for any document type.
Where to go next:
pdfplumber, pypdf, and pytesseract to extract clean text from PDFs and scanned documents before passing them to your LLM: garbage in, garbage out.

The extraction patterns in this lesson aren't toy demonstrations. They're the same architectural patterns used in production contract intelligence platforms, financial document processing systems, and medical record structuring tools. The difference between a demo and a production system is mostly in the validation layer, the error handling, and the evaluation framework. You now have the foundation for all three.