Structuring Unstructured Data with AI: Extracting Tables, Entities, and Insights from Text and Documents

Introduction

You have a folder full of vendor contracts, a pile of customer support emails, a stack of scanned PDFs from a due diligence process, or a database of raw clinical notes — and somewhere buried in all of that prose are the facts your business actually needs. Dates, dollar amounts, named parties, risk clauses, sentiment signals, product SKUs. Right now, a human reads through each document and types those facts into a spreadsheet. It takes forever, it's error-prone, and it doesn't scale. This is the problem that structured data extraction solves — and AI has made it genuinely tractable for the first time.

What "structuring unstructured data" means, concretely, is teaching an AI to read a blob of text and return something your database, spreadsheet, or pipeline can actually consume: a JSON object with named fields, a clean table with rows and columns, a list of extracted entities with their types and positions. The key insight is that large language models aren't just autocomplete engines — they're remarkably good at following instructions that say "read this, identify these things, and give them back to me in this shape." When you combine that capability with disciplined prompt design and a clear output schema, you get a repeatable extraction pipeline that can process thousands of documents overnight.

By the end of this lesson, you will have built working extraction pipelines for real-world document types and understood the design decisions that separate fragile demos from production-grade systems.

What you'll learn:

How to design prompts that extract structured JSON from free-form text reliably
How to define and enforce output schemas so downstream systems don't break
How to extract named entities (people, organizations, dates, monetary values) at scale
How to pull tabular data out of unstructured prose, PDFs, and messy reports
How to handle edge cases, missing fields, and model hallucinations in extraction pipelines

Prerequisites

You should already be comfortable with:

Writing and iterating on prompts with an LLM (GPT-4, Claude, or equivalent)
Basic Python (reading files, calling APIs, working with dictionaries and lists)
JSON as a data format — you know what it is and how to parse it
The general idea of tokens, context windows, and why prompt length matters

You do not need a machine learning background. This lesson is about applied prompt engineering and pipeline design, not model internals.

Why Unstructured-to-Structured Is Hard (and Why AI Changes the Game)

Before AI, the standard toolkit for this problem was regex, rule-based parsers, and classical NLP libraries like spaCy or NLTK. These approaches work — if your documents are highly consistent. The moment a vendor writes "net thirty days" instead of "NET 30" or puts the contract value in a footnote instead of a header, your regex breaks. Rule-based systems are brittle by design: they only know what you explicitly programmed them to know.

LLMs approach this differently. Instead of matching surface patterns, they use context. They can read "the parties have agreed that payment shall be remitted no later than the final business day of the calendar month following delivery" and correctly identify it as a payment term (end of the month following delivery, which is not the same thing as a flat 30 days). A regex would only catch that if you had anticipated the exact phrasing. That contextual understanding is what makes AI extraction qualitatively different from the old approach.

But LLMs come with their own failure modes: they hallucinate values that aren't in the text, they format output inconsistently when not constrained, and they can silently misinterpret ambiguous passages. Good extraction engineering is about harnessing the contextual power of LLMs while systematically guarding against these failure modes. That's what this lesson is actually about.

The Core Pattern: Schema-First Extraction

The single most important concept in AI-powered data extraction is schema-first thinking. Before you write a single line of prompt text, you need to define exactly what you want to get out. What are the fields? What type is each field? What should the model return when a field isn't present?

This sounds obvious, but most people skip it, which is why most extraction demos fall apart when they hit real data.

Let's work with a concrete example. Suppose you're processing vendor invoices. Before prompting anything, define your target schema:

invoice_schema = {
    "invoice_number": "string or null",
    "invoice_date": "ISO 8601 date string or null",
    "due_date": "ISO 8601 date string or null",
    "vendor_name": "string or null",
    "vendor_address": "string or null",
    "total_amount": "float or null",
    "currency": "3-letter ISO currency code or null",
    "line_items": [
        {
            "description": "string",
            "quantity": "float or null",
            "unit_price": "float or null",
            "line_total": "float or null"
        }
    ],
    "payment_terms": "string or null",
    "notes": "string or null"
}

Once you have this, your prompt writes itself. You're not asking the model to "extract information from an invoice" — you're asking it to populate a specific data structure. That specificity dramatically improves consistency.

Here's what a production-worthy extraction prompt looks like:

INVOICE_EXTRACTION_PROMPT = """
You are a data extraction assistant. Your job is to extract structured information 
from the invoice text provided and return it as valid JSON.

RULES:
1. Return ONLY valid JSON. No explanation, no markdown, no code blocks.
2. If a field is not present in the document, return null for that field.
3. Normalize dates to ISO 8601 format (YYYY-MM-DD).
4. Normalize monetary amounts to float (e.g., "$1,234.56" becomes 1234.56).
5. Currency should be a 3-letter ISO code (USD, EUR, GBP, etc.). Default to USD 
   if currency is implied but not stated.
6. Do not infer or calculate values that are not explicitly present in the text.

SCHEMA:
{schema}

INVOICE TEXT:
{document_text}

Return the populated JSON object now:
"""

Notice what's happening in each rule:

Rule 1 discourages the model from wrapping JSON in markdown code fences (which breaks json.loads()); it lowers the odds rather than guaranteeing bare JSON
Rule 2 handles missing fields explicitly so you don't get hallucinated values
Rules 3 and 4 normalize format so downstream systems don't need to handle ten variations of date format
Rule 6 is critical: it tells the model not to calculate totals from line items if the total isn't printed on the document
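Because Rule 1 is an instruction rather than a guarantee, it helps to parse defensively. The sketch below shows one way to tolerate a fenced reply; parse_model_json is a hypothetical helper name (not part of any SDK), and it assumes the reply is either bare JSON or JSON wrapped in a single markdown fence.

```python
import json
import re
from typing import Any, Optional

# Build the fence marker programmatically so this listing stays easy to embed.
FENCE = "`" * 3

# Matches a reply that is entirely wrapped in one fence, with an optional "json" tag.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"\s*$",
    re.DOTALL | re.IGNORECASE,
)


def parse_model_json(raw_output: str) -> Optional[Any]:
    """Parse a model reply as JSON, unwrapping a markdown fence if present.

    Returns the parsed value, or None when the reply is not valid JSON.
    """
    match = _FENCE_RE.match(raw_output)
    candidate = match.group(1) if match else raw_output.strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

Returning None instead of raising keeps the calling code simple: a None result can be routed straight to a retry or a review queue.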

Building the Extraction Pipeline in Python

Let's build a complete, working extraction pipeline. We'll use the OpenAI Python SDK, but the pattern applies to any LLM API.

import json
import openai
from typing import Optional

client = openai.OpenAI()  # assumes OPENAI_API_KEY is set in environment

INVOICE_SCHEMA = {
    "invoice_number": "string or null",
    "invoice_date": "ISO 8601 date string or null", 
    "due_date": "ISO 8601 date string or null",
    "vendor_name": "string or null",
    "vendor_address": "string or null",
    "total_amount": "float or null",
    "currency": "3-letter ISO currency code or null",
    "line_items": [
        {
            "description": "string",
            "quantity": "float or null",
            "unit_price": "float or null",
            "line_total": "float or null"
        }
    ],
    "payment_terms": "string or null",
    "notes": "string or null"
}

EXTRACTION_PROMPT = """
You are a data extraction assistant. Extract structured information from the 
invoice text below and return it as valid JSON matching the schema provided.

RULES:
1. Return ONLY valid JSON. No explanation, no markdown, no code blocks.
2. Use null (not "N/A", not "") for any field not present in the document.
3. Normalize dates to ISO 8601 format (YYYY-MM-DD).
4. Normalize monetary amounts to float (e.g., "$1,234.56" becomes 1234.56).
5. Currency should be a 3-letter ISO code. Default to USD if clearly implied.
6. Never infer or calculate values not explicitly present in the document.

SCHEMA:
{schema}

DOCUMENT TEXT:
{document_text}
"""

def extract_invoice_data(document_text: str) -> Optional[dict]:
    """
    Extract structured invoice data from raw text.
    Returns a dict on success, None on failure.
    """
    prompt = EXTRACTION_PROMPT.format(
        schema=json.dumps(INVOICE_SCHEMA, indent=2),
        document_text=document_text
    )
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system", 
                    "content": "You are a precise data extraction system. Return only valid JSON."
                },
                {
                    "role": "user", 
                    "content": prompt
                }
            ],
            temperature=0,       # minimize run-to-run variation for extraction tasks
            max_tokens=2000,
            response_format={"type": "json_object"}  # enforces JSON output at API level
        )
        
        raw_output = response.choices[0].message.content
        extracted = json.loads(raw_output)
        return extracted
        
    except json.JSONDecodeError as e:
        print(f"JSON parsing failed: {e}")
        print(f"Raw output was: {raw_output[:500]}")
        return None
    except Exception as e:
        print(f"API call failed: {e}")
        return None


def validate_extracted_invoice(data: dict) -> dict:
    """
    Post-process and validate extracted invoice data.
    Returns the data dict with a 'validation_errors' list appended.
    """
    errors = []
    
    # Check required fields
    required_fields = ["invoice_number", "vendor_name", "total_amount"]
    for field in required_fields:
        if data.get(field) is None:
            errors.append(f"Missing required field: {field}")
    
    # Validate total_amount is numeric and non-negative
    total = data.get("total_amount")
    total_is_number = isinstance(total, (int, float)) and not isinstance(total, bool)
    if total is not None:
        if not total_is_number:
            errors.append(f"total_amount is not numeric: {total}")
        elif total < 0:
            errors.append(f"total_amount is negative: {total}")
    
    # Validate line item math (flag discrepancies > 1% as potential issues).
    # This compares against total_amount, so an invoice whose total includes
    # tax or shipping will be flagged for review. Non-numeric values are skipped
    # here so a bad extraction cannot crash the validator.
    if data.get("line_items") and total_is_number and total > 0:
        line_sum = sum(
            item["line_total"]
            for item in data["line_items"]
            if isinstance(item.get("line_total"), (int, float))
        )
        if line_sum > 0 and abs(line_sum - total) / total > 0.01:
            errors.append(
                f"Line item sum ({line_sum:.2f}) doesn't match total ({total:.2f})"
            )
    
    data["validation_errors"] = errors
    data["extraction_status"] = "clean" if not errors else "needs_review"
    return data

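Why temperature=0? Extraction tasks have a correct answer: the information is either in the document or it isn't. Setting temperature to zero makes the model pick the most likely token at each step, which suppresses the sampling randomness that's useful for generation but actively harmful for extraction. It does not guarantee identical output on every run (hosted APIs can still show small variations), but it removes most of the run-to-run drift. Always use temperature 0 or very close to it for structured extraction.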

Now let's test it with a realistic messy invoice:

sample_invoice = """
INVOICE

Acme Supplies Co.
742 Evergreen Terrace, Springfield, IL 62701
Tel: (555) 238-1000

Bill To:
Riverside Medical Group
1400 Harbor Boulevard
Fullerton, CA 92835

Invoice #: INV-2024-08847
Date: October 15, 2024
Payment Due: November 14, 2024

DESCRIPTION                          QTY    UNIT PRICE    AMOUNT
------------------------------------------------------------------
Nitrile Examination Gloves (Box/100) 50     $18.75        $937.50
Disposable Face Masks (Box/50)       30     $12.00        $360.00
Hand Sanitizer 1L Pump               20     $8.50         $170.00
Biohazard Waste Bags (Case/100)      10     $24.99        $249.90

                                              SUBTOTAL:   $1,717.40
                                              TAX (8%):     $137.39
                                              TOTAL:      $1,854.79

Payment Terms: Net 30
Please make checks payable to Acme Supplies Co.
For wire transfer inquiries, contact ar@acmesupplies.com
"""

result = extract_invoice_data(sample_invoice)
if result:
    validated = validate_extracted_invoice(result)
    print(json.dumps(validated, indent=2))

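The output you'd get back should be a typed JSON object with ISO-formatted dates, numeric totals, and structured line items, ready to insert into a database or feed to a downstream process. Note that validate_extracted_invoice will mark this particular invoice as needs_review: the line items sum to 1,717.40, while the printed total of 1,854.79 includes 8% tax, so the two differ by more than 1%. That is the check working as written, and a reminder that a line-item check is best compared against a pre-tax subtotal when the document prints one.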
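The sample invoice prints a subtotal and a tax line that the schema above does not capture, so comparing line items directly to the total cannot reconcile. The sketch below shows one way to check the arithmetic in two steps. It assumes two hypothetical extra schema fields, subtotal and tax_amount, and the function name reconcile_invoice_totals is illustrative only.

```python
from decimal import Decimal
from typing import Optional


def _dec(value: float) -> Decimal:
    """Convert via str() so 1717.4 becomes Decimal('1717.4'), not a binary-float expansion."""
    return Decimal(str(value))


def reconcile_invoice_totals(
    line_totals: list[float],
    subtotal: Optional[float],      # hypothetical extra field, not in the schema above
    tax_amount: Optional[float],    # hypothetical extra field, not in the schema above
    total_amount: Optional[float],
    tolerance: Decimal = Decimal("0.01"),
) -> list[str]:
    """Return human-readable discrepancies; an empty list means the numbers reconcile."""
    problems = []
    line_sum = sum((_dec(x) for x in line_totals), Decimal("0"))

    # Step 1: line items should add up to the pre-tax subtotal.
    if subtotal is not None and abs(line_sum - _dec(subtotal)) > tolerance:
        problems.append(f"Line items sum to {line_sum}, but subtotal is {subtotal}")

    # Step 2: subtotal plus tax should equal the printed total.
    if subtotal is not None and tax_amount is not None and total_amount is not None:
        expected = _dec(subtotal) + _dec(tax_amount)
        if abs(expected - _dec(total_amount)) > tolerance:
            problems.append(f"Subtotal + tax is {expected}, but total is {total_amount}")

    return problems


# Figures taken from the sample invoice above.
sample_problems = reconcile_invoice_totals(
    [937.50, 360.00, 170.00, 249.90], 1717.40, 137.39, 1854.79
)
```

Using Decimal avoids spurious one-cent mismatches from binary floating point, which matters when the tolerance itself is one cent.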

Named Entity Recognition: People, Organizations, Dates, and More

Entity extraction is a specific flavor of structured extraction where you're pulling typed facts out of text — not filling out a form template, but finding and classifying mentions of things the document talks about.

Classic NER (Named Entity Recognition) using spaCy or similar tools will find entities, but only the ones it was trained on, and only when they're phrased conventionally. LLM-based NER is much more flexible. You can define custom entity types, handle unusual phrasings, and extract relationships between entities — not just the entities themselves.

Here's a realistic scenario: you're processing merger and acquisition press releases to build a competitive intelligence database. You need to extract the acquiring company, the target company, deal value, announcement date, and deal type.

# Literal braces in the advisors example are doubled ({{ }}) so that
# MA_EXTRACTION_PROMPT.format(text=...) only substitutes {text}.
MA_EXTRACTION_PROMPT = """
You are an M&A intelligence analyst. Extract deal information from the press 
release below and return it as valid JSON.

ENTITY TYPES TO EXTRACT:
- acquiring_company: The company making the acquisition
- target_company: The company being acquired or merged with  
- deal_value: The announced transaction value (normalize to USD millions as a float, 
  or null if undisclosed)
- deal_type: One of ["acquisition", "merger", "majority_stake", "minority_investment", 
  "asset_purchase", "unknown"]
- announcement_date: ISO 8601 date
- expected_close_date: ISO 8601 date or null if not stated
- advisors: List of financial/legal advisors mentioned with their role 
  (e.g., [{{"firm": "Goldman Sachs", "role": "financial advisor", "advising": "acquirer"}}])
- deal_rationale: A 1-2 sentence summary of the stated strategic rationale (not your 
  own analysis — only what the press release explicitly states)
- regulatory_approvals_required: List of regulatory bodies mentioned 
  (e.g., ["FTC", "EU Commission"])

Return ONLY valid JSON. Use null for fields not present in the text.
Do not infer values not explicitly stated.

PRESS RELEASE TEXT:
{text}
"""

sample_press_release = """
FOR IMMEDIATE RELEASE
October 22, 2024

NORTHSTAR HEALTH SYSTEMS ANNOUNCES DEFINITIVE AGREEMENT TO ACQUIRE 
PRECISION DIAGNOSTICS INC. FOR $2.3 BILLION

CHICAGO — NorthStar Health Systems (NYSE: NHS) today announced it has entered 
into a definitive agreement to acquire Precision Diagnostics Inc., a leading 
provider of AI-powered pathology solutions, for approximately $2.3 billion in 
an all-cash transaction.

The acquisition is expected to accelerate NorthStar's precision medicine strategy 
by integrating Precision Diagnostics' proprietary image analysis platform, which 
currently serves over 400 hospital systems nationwide, into NorthStar's existing 
oncology service line.

"This transaction represents a compelling opportunity to deliver faster, more 
accurate diagnoses to patients while creating significant value for our shareholders," 
said Dr. Margaret Chen, Chief Executive Officer of NorthStar Health Systems.

The transaction is subject to customary closing conditions, including regulatory 
approval from the Federal Trade Commission, and is expected to close in the first 
quarter of 2025.

Goldman Sachs & Co. LLC is serving as financial advisor and Skadden, Arps, Slate, 
Meagher & Flom LLP is serving as legal counsel to NorthStar. Centerview Partners 
LLC is serving as financial advisor and Weil, Gotshal & Manges LLP is serving as 
legal counsel to Precision Diagnostics.

###
"""

The model will return something like:

{
  "acquiring_company": "NorthStar Health Systems",
  "target_company": "Precision Diagnostics Inc.",
  "deal_value": 2300.0,
  "deal_type": "acquisition",
  "announcement_date": "2024-10-22",
  "expected_close_date": "2025-03-31",
  "advisors": [
    {"firm": "Goldman Sachs & Co. LLC", "role": "financial advisor", "advising": "acquirer"},
    {"firm": "Skadden, Arps, Slate, Meagher & Flom LLP", "role": "legal counsel", "advising": "acquirer"},
    {"firm": "Centerview Partners LLC", "role": "financial advisor", "advising": "target"},
    {"firm": "Weil, Gotshal & Manges LLP", "role": "legal counsel", "advising": "target"}
  ],
  "deal_rationale": "The acquisition is intended to accelerate NorthStar's precision medicine strategy by integrating Precision Diagnostics' AI-powered image analysis platform into NorthStar's oncology service line.",
  "regulatory_approvals_required": ["Federal Trade Commission"]
}

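Notice that expected_close_date is 2025-03-31. The press release only says "first quarter of 2025", so the model has normalized a quarter to its last day. That is a reasonable convention, but it is an interpretation rather than a stated fact, and it sits uneasily with the prompt's "do not infer" instruction: another run or another model may return null instead. If you want this behavior, state the convention explicitly in the prompt (for example, "map a quarter to its final calendar day"), or extract the raw phrase and normalize it in code. Either way, recognizing "first quarter of 2025" as the expected close at all is the kind of contextual understanding that a regex does not give you for free.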
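Quarter phrases like "first quarter of 2025" can also be normalized in code rather than left to the model, which makes the convention explicit and repeatable. The sketch below is one way to do it; quarter_to_date_range is a hypothetical helper, and it assumes calendar quarters (not a company's fiscal quarters) and only the two phrasings shown in the regex.

```python
import calendar
import re
from datetime import date
from typing import Optional

_ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4}

# Handles "first quarter of 2025" and "Q1 2025"; other phrasings return None.
_QUARTER_RE = re.compile(
    r"\b(?:(first|second|third|fourth)\s+quarter\s+of\s+|q([1-4])\s+)(\d{4})\b",
    re.IGNORECASE,
)


def quarter_to_date_range(phrase: str) -> Optional[tuple[date, date]]:
    """Return (first_day, last_day) of the calendar quarter named in phrase."""
    match = _QUARTER_RE.search(phrase)
    if not match:
        return None
    word, digit, year_text = match.groups()
    quarter = _ORDINALS[word.lower()] if word else int(digit)
    year = int(year_text)
    first_month = 3 * (quarter - 1) + 1
    last_month = first_month + 2
    last_day = calendar.monthrange(year, last_month)[1]
    return date(year, first_month, 1), date(year, last_month, last_day)


close_window = quarter_to_date_range("expected to close in the first quarter of 2025")
```

Returning the whole range keeps the uncertainty visible: downstream code can pick the end date when it needs a single value, without pretending the press release named a specific day.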

Extracting Tables from Unstructured Text

One of the most common extraction challenges is when tabular data exists in a document but isn't stored as a structured table — it's in a PDF that got OCR'd poorly, it's in the body of an email, or it's embedded in a narrative financial report.

Let's say you're processing quarterly earnings call transcripts to extract financial guidance tables. The table might appear in the source text like this:

Management reaffirmed full-year 2024 guidance as follows: revenue is expected 
to come in between $4.2 billion and $4.4 billion, representing growth of 12 to 
18 percent over the prior year. Adjusted EBITDA margin is expected to be in the 
range of 23 to 25 percent. Capital expenditures are guided to approximately 
$180 million, flat with 2023 levels. Free cash flow conversion is expected to 
remain above 85 percent of net income.

Your extraction prompt needs to recognize that this narrative contains a table and reconstruct it:

TABLE_EXTRACTION_PROMPT = """
You are a financial data analyst. The text below contains financial guidance 
information presented in narrative form. Extract it into a structured table.

Return a JSON object with a single key "guidance_table" containing a list of rows.
Each row should have these fields:
- metric: The financial metric name (string)
- low_end: The low end of the guidance range as a float, or the single value if 
  no range is given (float or null)
- high_end: The high end of the guidance range as a float, or null if it's a 
  point estimate (float or null)
- unit: The unit of measurement, one of ["USD_millions", "percentage", 
  "ratio", "count"] (string)
- notes: Any qualifying language (string or null)

Normalize all dollar amounts — "$4.2 billion" becomes 4200.0 with unit "USD_millions".
Percentages like "23 to 25 percent" become low_end: 23.0, high_end: 25.0, unit: "percentage".

Return ONLY valid JSON. No explanation.

TEXT:
{text}
"""

Expected output:

{
  "guidance_table": [
    {
      "metric": "Revenue",
      "low_end": 4200.0,
      "high_end": 4400.0,
      "unit": "USD_millions",
      "notes": "Represents 12 to 18 percent growth over prior year"
    },
    {
      "metric": "Adjusted EBITDA Margin",
      "low_end": 23.0,
      "high_end": 25.0,
      "unit": "percentage",
      "notes": null
    },
    {
      "metric": "Capital Expenditures",
      "low_end": 180.0,
      "high_end": null,
      "unit": "USD_millions",
      "notes": "Flat with 2023 levels"
    },
    {
      "metric": "Free Cash Flow Conversion",
      "low_end": 85.0,
      "high_end": null,
      "unit": "percentage",
      "notes": "Expected to remain above 85 percent of net income"
    }
  ]
}
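Once the guidance is in this shape, turning it into an actual table is ordinary code. The sketch below writes the rows to CSV with the standard library; guidance_to_csv is a hypothetical helper name, and it assumes the extraction has the "guidance_table" key and row fields requested in the prompt above.

```python
import csv
import io

COLUMNS = ["metric", "low_end", "high_end", "unit", "notes"]


def guidance_to_csv(extraction: dict) -> str:
    """Render the extracted guidance rows as CSV text, one row per metric."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in extraction.get("guidance_table", []):
        # Missing keys and JSON nulls both become empty cells.
        writer.writerow({col: "" if row.get(col) is None else row[col] for col in COLUMNS})
    return buffer.getvalue()


example = {
    "guidance_table": [
        {"metric": "Revenue", "low_end": 4200.0, "high_end": 4400.0,
         "unit": "USD_millions", "notes": "Represents 12 to 18 percent growth over prior year"},
        {"metric": "Capital Expenditures", "low_end": 180.0, "high_end": None,
         "unit": "USD_millions", "notes": "Flat with 2023 levels"},
    ]
}
csv_lines = guidance_to_csv(example).splitlines()
```

The extrasaction="ignore" setting drops any unexpected keys the model adds, so a stray field does not break the export.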

Pro tip: When extracting tables from OCR'd PDFs, add a pre-processing instruction: "The text may contain OCR errors such as '0' rendered as 'O', missing spaces, or broken line breaks. Use context to resolve obvious OCR artifacts." This small addition can noticeably improve accuracy on imperfect source documents. It also gives the model license to alter characters, so spot-check corrected values against the source.

Processing Documents at Scale: Chunking and Batching

Everything above assumes your document fits in the model's context window. When you're processing long contracts (50+ pages), full annual reports, or large email threads, you need a strategy for handling documents that exceed your token budget.

There are two main approaches, and choosing between them depends on your document structure:

Strategy 1: Hierarchical Chunking. Split the document along its own structure (by section, by page, by paragraph), extract from each chunk independently, then combine the results. This works well when each chunk is self-contained, so no fact depends on text in another chunk.

Strategy 2: Map-Reduce Extraction. Extract from each chunk (typically fixed-size and overlapping), then pass all chunk extractions to a final "merge and deduplicate" pass that reconciles duplicates and conflicts. This is better when facts might span chunk boundaries.

Here's a practical implementation of map-reduce extraction for long contracts:

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens in text using tiktoken."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def chunk_document(text: str, max_tokens: int = 6000, overlap_tokens: int = 200) -> list[str]:
    """
    Split document into overlapping chunks by token count.
    Overlap helps catch facts that span chunk boundaries.
    """
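    if overlap_tokens >= max_tokens:
        # Otherwise start would never advance and the loop below would not terminate
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    enc = tiktoken.encoding_for_model("gpt-4o")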
    tokens = enc.encode(text)
    
    chunks = []
    start = 0
    
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunk_text = enc.decode(chunk_tokens)
        chunks.append(chunk_text)
        
        if end >= len(tokens):
            break
        
        # Move start forward by (max_tokens - overlap_tokens)
        start += max_tokens - overlap_tokens
    
    return chunks

def extract_from_long_document(document_text: str, schema: dict, extraction_prompt: str) -> Optional[dict]:
    """
    Process a long document using map-reduce extraction.
    Returns the merged dict, or None if no chunk could be extracted.

    Note: extraction_prompt is currently unused; the map phase calls
    extract_invoice_data, which formats EXTRACTION_PROMPT itself.
    """
    total_tokens = count_tokens(document_text)
    
    # If it fits in a single chunk, process directly
    if total_tokens < 6000:
        return extract_invoice_data(document_text)  # or whatever your extractor is
    
    print(f"Document is {total_tokens} tokens; chunking into segments...")
    chunks = chunk_document(document_text, max_tokens=6000, overlap_tokens=300)
    print(f"Created {len(chunks)} chunks")
    
    # Map phase: extract from each chunk
    chunk_results = []
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}...")
        result = extract_invoice_data(chunk)
        if result:
            chunk_results.append(result)
    
    # Nothing to merge if every chunk failed
    if not chunk_results:
        return None
    
    # Reduce phase: merge chunk extractions
    merge_prompt = f"""
You are a data merging assistant. Below are {len(chunk_results)} partial extractions 
from different sections of the same document. Some fields may appear in multiple 
partial extractions. 

Your job: merge them into a single coherent JSON object following the schema below.

RULES:
- For fields with single values: use the most specific/complete value across all partials
- For list fields (like line_items): combine all unique entries, deduplicate
- If the same field has conflicting non-null values across chunks, note the conflict 
  in a "merge_conflicts" list
- Fields that are null in all partials should remain null

SCHEMA:
{json.dumps(schema, indent=2)}

PARTIAL EXTRACTIONS:
{json.dumps(chunk_results, indent=2)}

Return only the merged JSON object:
"""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": merge_prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    
    merged = json.loads(response.choices[0].message.content)
    merged["_chunks_processed"] = len(chunks)
    return merged
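The reduce phase above asks the model to do the merge. For simple schemas the same rules can be applied in plain code, which is cheaper and fully repeatable. The sketch below is one possible deterministic merge; merge_partials is a hypothetical name, and the "first non-null value wins, later disagreements are recorded" policy is an assumption you may want to change.

```python
from typing import Any


def merge_partials(
    partials: list[dict], list_fields: tuple[str, ...] = ("line_items",)
) -> dict:
    """Merge per-chunk extractions without a model call.

    Scalar fields: the first non-null value wins; later conflicting values
    are recorded in "merge_conflicts". List fields: entries are concatenated
    and exact duplicates (common with overlapping chunks) are dropped.
    """
    merged: dict[str, Any] = {}
    conflicts: list[dict] = []

    for partial in partials:
        for field, value in partial.items():
            if field in list_fields:
                existing = merged.setdefault(field, [])
                for entry in value or []:
                    if entry not in existing:
                        existing.append(entry)
            elif value is None:
                merged.setdefault(field, None)  # keep the key, stay null
            elif merged.get(field) is None:
                merged[field] = value
            elif merged[field] != value:
                conflicts.append({"field": field, "kept": merged[field], "other": value})

    merged["merge_conflicts"] = conflicts
    return merged


merged_example = merge_partials([
    {"invoice_number": "INV-1", "total_amount": None,
     "line_items": [{"description": "A", "line_total": 10.0}]},
    {"invoice_number": "INV-1", "total_amount": 25.0,
     "line_items": [{"description": "A", "line_total": 10.0},
                    {"description": "B", "line_total": 15.0}]},
    {"invoice_number": "INV-2", "total_amount": 25.0, "line_items": []},
])
```

Exact-match deduplication is deliberately strict: two line items that differ by a single character are both kept, which is safer to review than a silent merge.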

Adding Confidence Scores and Source Attribution

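Production extraction systems need more than the extracted values: they need some signal of how reliable each value is and where in the document it came from. This is essential for human-in-the-loop review workflows where you want to automatically accept high-confidence extractions and flag low-confidence ones for a human to verify. Keep in mind that a confidence score the model reports about itself is a rough heuristic, not a calibrated probability; the verbatim source quote is the part you can actually check.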

Modify your extraction prompt to request source attribution:

# Literal braces in the JSON example are doubled ({{ }}) so that
# ATTRIBUTED_EXTRACTION_PROMPT.format(document_text=...) only fills {document_text}.
ATTRIBUTED_EXTRACTION_PROMPT = """
You are a precise data extraction system. For each field you extract, also provide:
1. A confidence score from 0.0 to 1.0 (how certain are you this value is correct?)
2. The verbatim text from the document that supports this extraction (quote it exactly)

Return a JSON object where each field has this structure:
{{
  "value": <the extracted value>,
  "confidence": <float 0.0-1.0>,
  "source_text": <verbatim quote from document, or null if field is null>
}}

Confidence guide:
- 1.0: The value is stated explicitly and unambiguously
- 0.8-0.9: The value is clearly implied or requires minor interpretation
- 0.5-0.7: The value requires significant inference or there's some ambiguity
- Below 0.5: You are guessing — consider returning null instead

DOCUMENT:
{document_text}
"""

With this structure, your validation pipeline can automatically route extractions:

def route_extraction_by_confidence(extraction: dict, threshold: float = 0.85) -> str:
    """
    Route an extraction based on minimum field confidence.
    Returns 'auto_accept', 'needs_review', or 'reject'
    """
    if not extraction:
        return 'reject'
    
    critical_fields = ['invoice_number', 'total_amount', 'vendor_name', 'invoice_date']
    
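    confidences = []
    for field in critical_fields:
        field_data = extraction.get(field)
        conf = field_data.get('confidence') if isinstance(field_data, dict) else None
        # A missing field, a bare null, or a non-numeric score counts as zero
        # confidence, so a critical field can never be silently skipped
        confidences.append(conf if isinstance(conf, (int, float)) else 0.0)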
    
    min_confidence = min(confidences)
    avg_confidence = sum(confidences) / len(confidences)
    
    if min_confidence >= threshold:
        return 'auto_accept'
    elif avg_confidence >= 0.6:
        return 'needs_review'
    else:
        return 'reject'
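The source_text quotes are what make this format checkable: a value whose supporting quote does not appear in the document deserves review regardless of the confidence the model reported. The sketch below shows one way to run that check; find_unsupported_fields is a hypothetical helper, and comparing after collapsing whitespace and lowercasing is an assumption that tolerates layout differences but not paraphrase.

```python
import re


def _normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_unsupported_fields(extraction: dict, document_text: str) -> list[str]:
    """Return fields that claim a value but whose quote is absent from the document."""
    haystack = _normalize(document_text)
    unsupported = []
    for field, payload in extraction.items():
        if not isinstance(payload, dict) or payload.get("value") is None:
            continue  # nothing claimed, nothing to verify
        quote = payload.get("source_text")
        if not quote or _normalize(quote) not in haystack:
            unsupported.append(field)
    return unsupported


document = "Invoice #: INV-2024-08847\nDate: October 15, 2024\nTOTAL:      $1,854.79"
attributed = {
    "invoice_number": {"value": "INV-2024-08847", "confidence": 1.0,
                       "source_text": "Invoice #: INV-2024-08847"},
    "total_amount": {"value": 1854.79, "confidence": 0.9,
                     "source_text": "TOTAL: $1,854.79"},
    "vendor_name": {"value": "Acme Corp", "confidence": 0.95,
                    "source_text": "Vendor: Acme Corp"},
    "due_date": {"value": None, "confidence": 0.0, "source_text": None},
}
flagged = find_unsupported_fields(attributed, document)
```

In this example vendor_name is flagged even though its reported confidence is 0.95, which is exactly the case a confidence threshold alone would miss.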

Hands-On Exercise: Build a Support Ticket Triage Extractor

Here's a complete real-world project to build on your own. You're building a customer support ticket triage system that needs to extract structured data from raw email/ticket text so tickets can be automatically routed and prioritized.

Your target schema:

TICKET_SCHEMA = {
    "customer_name": "string or null",
    "customer_account_id": "string or null — look for patterns like ACC-XXXXX or account numbers",
    "product_name": "string or null",
    "issue_category": "one of: ['billing', 'technical_error', 'feature_request', 'account_access', 'shipping', 'other']",
    "severity": "one of: ['critical', 'high', 'medium', 'low'] — infer from urgency language",
    "error_codes": "list of any error codes or HTTP status codes mentioned",
    "affected_feature": "string or null — the specific feature or workflow that's broken",
    "reproducible": "boolean or null — can the customer reproduce the issue?",
    "workaround_exists": "boolean or null — has the customer found any workaround?",
    "sentiment": "one of: ['frustrated', 'neutral', 'positive']",
    "action_requested": "string — what does the customer want done?"
}
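Several of these fields are closed lists, and a model can still return a value outside them (a different casing, a synonym, or a new category). A small check like the sketch below catches that before routing; check_ticket_enums is a hypothetical helper and is offered as a starting point, not as part of the exercise solution.

```python
# Allowed values copied from TICKET_SCHEMA above.
ALLOWED_VALUES = {
    "issue_category": {"billing", "technical_error", "feature_request",
                       "account_access", "shipping", "other"},
    "severity": {"critical", "high", "medium", "low"},
    "sentiment": {"frustrated", "neutral", "positive"},
}


def check_ticket_enums(ticket: dict) -> list[str]:
    """Return one message per enum field whose value is missing or not allowed."""
    errors = []
    for field, allowed in ALLOWED_VALUES.items():
        value = ticket.get(field)
        # The isinstance guard also protects against unhashable values such as lists.
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{field}: unexpected value {value!r}")
    return errors


enum_errors = check_ticket_enums(
    {"issue_category": "technical_error", "severity": "Critical", "sentiment": "frustrated"}
)
```

The comparison is case-sensitive on purpose: "Critical" is reported rather than quietly lowercased, so you can decide whether to normalize or tighten the prompt.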

Your test tickets (use all three to test different edge cases):

TICKET 1:
Subject: URGENT - Production system down, cannot process payments
From: sarah.kowalski@techcorp.com

Our entire payment processing pipeline has been throwing error code ERR-5031 
since approximately 2:15 PM EST. We've tried restarting the service three times 
with no luck. Every transaction is failing with HTTP 503. This is costing us 
roughly $12,000 per hour. We need someone on the phone immediately. 
Account: ACC-88291. Our CTO is copied on this thread.

TICKET 2:
Subject: Question about adding users to our Enterprise plan
From: james.liu@startup.io

Hi team, love the product! Quick question — we're growing and want to add 5 more 
seats to our plan. Is this something I can do from the admin panel or do I need 
to go through sales? We're on the Enterprise Plus tier. Not urgent, just want 
to sort it out before our next billing cycle. Thanks!

TICKET 3:
Subject: Fwd: Export to CSV not working properly
From: m.petersen@globalcorp.net

See the attached screenshots (note: I can reproduce this consistently).
When I try to export reports to CSV from the Analytics Dashboard, the file 
downloads but the date column is all messed up — dates are showing as 5-digit 
numbers instead of formatted dates. Looks like a Unix timestamp issue maybe? 
This affects ALL of our analysts. We've found that exporting to XLSX works fine 
as a temporary fix. Running Chrome 119.0 on Windows 11.

Your tasks:

Write the extraction prompt for this schema
Process all three tickets and store results in a list
Add a post-processing function that assigns a queue (e.g., "emergency_escalation", "billing_queue", "technical_queue", "general_support") based on the extracted severity and issue_category
Output a summary table showing ticket number, customer name, severity, category, and queue assignment

Common Mistakes & Troubleshooting

Mistake 1: Not specifying null behavior

If you don't tell the model what to return for missing fields, it will invent plausible-sounding values. A missing invoice number might come back as "INV-001" or "N/A" or an empty string — none of which are consistent or useful. Always explicitly state: "return null for any field not present in the document."
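Even with that instruction in place, placeholder strings occasionally slip through, so it is worth normalizing them in post-processing. The sketch below is one way to do that; normalize_nulls and the NULL_LIKE set are illustrative assumptions, and the set should be adjusted so it never swallows a value that is legitimate in your schema (for example "unknown" where that is an allowed category).

```python
from typing import Any

# Placeholder strings treated as "no value". Adjust per schema.
NULL_LIKE = {"", "n/a", "na", "none", "null", "-"}


def normalize_nulls(value: Any) -> Any:
    """Recursively replace placeholder strings with None in dicts and lists."""
    if isinstance(value, dict):
        return {key: normalize_nulls(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize_nulls(item) for item in value]
    if isinstance(value, str) and value.strip().lower() in NULL_LIKE:
        return None
    return value


cleaned = normalize_nulls({
    "invoice_number": "N/A",
    "vendor_name": "Acme Supplies Co.",
    "notes": "",
    "line_items": [{"description": "Gloves", "quantity": "null"}],
})
```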

Mistake 2: Asking for too much in one prompt

If your schema has 30 fields and your document is complex, extraction quality tends to degrade. The model starts making more errors as it tries to track everything at once. Solution: break complex extractions into 2-3 focused passes. First extract the header fields, then extract line items, then extract terms and conditions.
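The multi-pass idea can be sketched as splitting one schema into per-pass sub-schemas and combining the per-pass results afterwards. Everything named below (INVOICE_FIELDS, PASSES, split_schema, combine_pass_results) is hypothetical, and the grouping of fields into passes is an assumption; the model call for each pass is left out.

```python
# Compact stand-in for the invoice schema used earlier (field name -> type hint).
INVOICE_FIELDS = {
    "invoice_number": "string or null",
    "invoice_date": "ISO 8601 date string or null",
    "vendor_name": "string or null",
    "total_amount": "float or null",
    "line_items": [{"description": "string", "line_total": "float or null"}],
    "payment_terms": "string or null",
    "notes": "string or null",
}

# Which fields each focused pass is responsible for.
PASSES = {
    "header": ["invoice_number", "invoice_date", "vendor_name", "total_amount"],
    "line_items": ["line_items"],
    "terms": ["payment_terms", "notes"],
}


def split_schema(schema: dict, passes: dict[str, list[str]]) -> dict[str, dict]:
    """Return one sub-schema per pass; fail loudly if a field is left out."""
    assigned = {field for fields in passes.values() for field in fields}
    unassigned = set(schema) - assigned
    if unassigned:
        raise ValueError(f"Fields not covered by any pass: {sorted(unassigned)}")
    return {
        name: {field: schema[field] for field in fields}
        for name, fields in passes.items()
    }


def combine_pass_results(results: list[dict]) -> dict:
    """Merge per-pass outputs; passes own disjoint fields, so update() is enough."""
    combined: dict = {}
    for result in results:
        combined.update(result)
    return combined


sub_schemas = split_schema(INVOICE_FIELDS, PASSES)
```

Each sub-schema would be sent with the same document text in its own prompt; the coverage check guards against a field quietly dropping out when the schema grows.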

Mistake 3: Not using temperature 0

This one bites people constantly. They run an extraction, it works, they move on. Then they notice that the same document extracts differently on different runs because temperature > 0 introduces sampling variability. Set temperature to 0 for any extraction task; it will not make a hosted model perfectly deterministic, but it removes most of that variation.

Mistake 4: Trusting extracted numbers without validation

The model can silently misread $1,234.56 as $12,345.6 or drop a zero from a large number. Always validate numeric extractions against any checksums or totals available in the document. The line item sum validation in our pipeline above is an example of this.
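When the extraction carries a verbatim quote for the amount, you can re-parse the printed figure in code and compare it with the number the model returned. The sketch below illustrates this; parse_printed_amount and amount_matches_source are hypothetical helpers, and they assume US-style formatting (comma thousands separators, period decimal point) and that the amount is the last number in the quoted text.

```python
import re
from decimal import Decimal
from typing import Optional

# A digit followed by digits or commas, with an optional decimal part.
_NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_printed_amount(text: str) -> Optional[Decimal]:
    """Parse the last number in text, e.g. 'TOTAL: $1,854.79' -> Decimal('1854.79')."""
    matches = _NUMBER_RE.findall(text)
    if not matches:
        return None
    return Decimal(matches[-1].replace(",", ""))


def amount_matches_source(
    extracted: float, source_text: str, tolerance: Decimal = Decimal("0.005")
) -> bool:
    """True when the model's number agrees with the figure printed in the quote."""
    printed = parse_printed_amount(source_text)
    if printed is None:
        return False
    return abs(Decimal(str(extracted)) - printed) <= tolerance


ok = amount_matches_source(1854.79, "TOTAL:      $1,854.79")
shifted = amount_matches_source(18547.9, "TOTAL:      $1,854.79")
```

Taking the last number handles quotes such as "TAX (8%): $137.39", where an earlier number is a rate rather than the amount.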

Mistake 5: Not handling API JSON mode correctly

OpenAI's response_format={"type": "json_object"} enforces JSON output, but it doesn't guarantee your JSON matches your schema. A model can return perfectly valid JSON that completely ignores your schema. You still need schema validation — consider using Pydantic models or jsonschema for this:

from typing import List, Optional

# Pydantic v2 API; on Pydantic v1, use `validator` instead of `field_validator`
from pydantic import BaseModel, ValidationError, field_validator

class LineItem(BaseModel):
    description: str
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    line_total: Optional[float] = None

class Invoice(BaseModel):
    invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    vendor_name: Optional[str] = None
    total_amount: Optional[float] = None
    currency: Optional[str] = "USD"
    line_items: List[LineItem] = []
    
    @field_validator('total_amount')
    @classmethod
    def total_must_be_non_negative(cls, v):
        # Zero is allowed; only negative totals are rejected
        if v is not None and v < 0:
            raise ValueError('total_amount must not be negative')
        return v

# Then validate like this (extracted_data is the dict returned by extract_invoice_data):
try:
    invoice = Invoice(**extracted_data)
    print("Validation passed")
except ValidationError as e:
    print(f"Validation failed: {e}")

Mistake 6: Skipping chunk overlap on long documents

If you chunk a document without overlap, facts that appear near a chunk boundary get split. An entity mention that starts at the end of chunk 1 might be meaningless without the context that continues into chunk 2. Use 200-300 token overlaps between chunks.
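
A minimal sliding-window chunker with overlap might look like the sketch below. It operates on a list of tokens that has already been produced; the function name and default sizes are assumptions for illustration, and in a real pipeline the tokens would come from your model's tokenizer rather than a whitespace split.

```python
from typing import List

def chunk_with_overlap(tokens: List[str], chunk_size: int = 1000, overlap: int = 250) -> List[List[str]]:
    """Split tokens into chunks where each chunk repeats the last `overlap` tokens of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Because neighboring chunks share tokens, the same entity can be extracted twice, so plan a deduplication step when you merge per-chunk results.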

Summary & Next Steps

You've now built the conceptual and technical foundation for production-grade document extraction pipelines. The core principles to carry forward:

Schema-first always. Define your output before you write your prompt. The schema is the contract between your extraction system and everything downstream.

LLMs excel at contextual understanding; validation code handles correctness. Use the model for the hard semantic work (understanding "net thirty" as a payment term), and use deterministic code for the verification work (checking that line items sum to the stated total).

Temperature 0, explicit null handling, and source attribution are the three non-negotiable ingredients of an extraction pipeline you can trust in production.

Chunking is a pipeline design problem, not an afterthought. Think about document structure and chunk boundaries before you write your first line of code for any document type.

Where to go next:

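Function calling / tool use: OpenAI's function calling feature and Anthropic's tool use let you pass your schema as a JSON Schema object with the API call, so the model returns structured arguments instead of free text you have to parse. This eliminates an entire class of parsing errors, but how strictly the schema is enforced depends on the provider and mode, so keep your validation layer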
Fine-tuning for domain-specific extraction: When you're processing thousands of documents of the same type, fine-tuning a smaller model on your extraction task can be dramatically cheaper and faster than using GPT-4 for every document
Document pre-processing: Learn to use libraries like pdfplumber, pypdf, and pytesseract to extract clean text from PDFs and scanned documents before passing them to your LLM — garbage in, garbage out
Vector databases for extraction at scale: When your document corpus is massive, combine embedding-based retrieval with extraction so you only send the relevant sections to the LLM instead of the whole document
Evaluation frameworks: Build a labeled test set of 50-100 documents where you know the correct extractions, and measure your pipeline's field-level precision and recall as you iterate on prompts
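
As a starting point for the evaluation item above, field-level precision and recall can be computed with a few lines of code. This sketch makes several assumptions: predictions and labels are parallel lists of dicts, None means the field is absent, and a prediction counts as correct only on an exact match. The function name is hypothetical.

```python
from typing import Dict, List, Tuple

def field_level_scores(predictions: List[dict], labels: List[dict]) -> Dict[str, Tuple[float, float]]:
    """Return {field: (precision, recall)} over a labeled test set."""
    fields = sorted({key for doc in labels for key in doc})
    scores = {}
    for field in fields:
        correct = predicted = expected = 0
        for pred, gold in zip(predictions, labels):
            p, g = pred.get(field), gold.get(field)
            if p is not None:
                predicted += 1  # the pipeline produced a value
            if g is not None:
                expected += 1   # the label says a value exists
            if p is not None and p == g:
                correct += 1    # exact match with the label
        precision = correct / predicted if predicted else 0.0
        recall = correct / expected if expected else 0.0
        scores[field] = (precision, recall)
    return scores
```

Low precision on a field points to invented or misread values, while low recall points to values the pipeline is missing. Exact matching is deliberately strict; for dates or amounts you may want to normalize both sides before comparing.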

The extraction patterns in this lesson aren't toy demonstrations — they're the same architectural patterns used in production contract intelligence platforms, financial document processing systems, and medical record structuring tools. The difference between a demo and a production system is mostly in the validation layer, the error handling, and the evaluation framework. You now have the foundation for all three.
