
Schema-First PDF Extraction in Python with Pydantic

Subhajit Bhar

Most PDF extraction projects start with the document. You open a PDF, look at the text, write a regex, extract a value. Repeat for each field.

This works for one document type with one layout. It doesn’t scale.

Schema-first extraction inverts the order: define exactly what the output should look like before you write a single line of extraction code. The schema becomes the specification that every extraction function has to satisfy — and the tool that tells you, immediately and explicitly, when an extraction fails.


Why schema-first matters in production
#

Consider what happens with the ad-hoc approach over time. You have an invoice extractor. It works on your main supplier’s format. A second supplier sends invoices with a different layout — you add a special case. A third supplier sends scanned PDFs — another special case. The total field is sometimes Total Amount Due, sometimes Amount Payable, sometimes Grand Total — another regex branch.

Six months later you have a function that’s 300 lines long, full of conditionals, and nobody remembers why the third elif exists.

The schema-first approach doesn’t eliminate this complexity, but it contains it. You know exactly what you’re trying to extract. Every extraction path tries to produce the same output shape. Failures are field-level and explicit.


Define the schema first
#

Start with a Pydantic model that describes what you want out of a document. For an invoice:

from pydantic import BaseModel, Field, field_validator
from datetime import date
from decimal import Decimal
from typing import Optional
import re


class LineItem(BaseModel):
    description: str
    quantity: Optional[float] = None
    unit_price: Optional[Decimal] = None
    total: Decimal

    @field_validator("total", "unit_price", mode="before")
    @classmethod
    def clean_currency(cls, v):
        if isinstance(v, str):
            # Remove currency symbols and commas
            v = re.sub(r"[£$€,\s]", "", v)
        return v


class Invoice(BaseModel):
    invoice_number: str
    invoice_date: date
    vendor_name: str
    vendor_address: Optional[str] = None
    total_amount: Decimal
    vat_amount: Optional[Decimal] = None
    line_items: list[LineItem] = Field(default_factory=list)

    @field_validator("invoice_date", mode="before")
    @classmethod
    def parse_date(cls, v):
        if isinstance(v, str):
            from datetime import datetime  # imported once, not per loop iteration
            for fmt in ("%d/%m/%Y", "%d-%m-%Y", "%B %d, %Y", "%d %B %Y"):
                try:
                    return datetime.strptime(v, fmt).date()
                except ValueError:
                    continue
            raise ValueError(f"Cannot parse date: {v}")
        return v

A few things to notice:

  • Fields are typed. total_amount is a Decimal, not a string. If you extract "£1,234.56" and can’t convert it to Decimal, you know immediately.
  • Validators handle format variation. The date validator tries multiple date formats — the variation is in one place, not scattered across extraction functions.
  • Optional is explicit. Fields that may not be present in all documents are Optional. Fields that must always be present are not. If extraction misses a required field, validation fails.
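To see the coercion in action, here is a minimal round trip through the model (a trimmed redefinition of LineItem so the snippet runs on its own): a currency string goes in, a Decimal comes out, and a value that cannot be coerced fails loudly.

```python
import re
from decimal import Decimal

from pydantic import BaseModel, ValidationError, field_validator


class LineItem(BaseModel):
    description: str
    total: Decimal

    @field_validator("total", mode="before")
    @classmethod
    def clean_currency(cls, v):
        if isinstance(v, str):
            # Strip currency symbols, commas, and whitespace before coercion
            v = re.sub(r"[£$€,\s]", "", v)
        return v


item = LineItem(description="Widgets", total="£1,234.56")
print(item.total)  # 1234.56 (a Decimal, not a string)

try:
    LineItem(description="Widgets", total="N/A")
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('total',)
```

The failure case is the point: "N/A" survives the symbol-stripping validator unchanged, so Decimal coercion rejects it and the error names the exact field.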

Write extraction functions that produce schema-conformant output
#

With the schema defined, each extraction function has a clear contract: given page text (or a PDF object), return the value for one field, or raise if it can’t.

import re
from typing import Optional


def extract_invoice_number(text: str) -> str:
    patterns = [
        r"Invoice\s*(?:No\.?|Number|#)[:\s]*([A-Z0-9\-]+)",
        r"INV[:\s]*([A-Z0-9\-]+)",
        r"(?:Invoice|Bill)\s+([A-Z]{2,4}\d{4,})",
    ]
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    raise ValueError("Invoice number not found")


def extract_total_amount(text: str) -> str:
    patterns = [
        r"(?:Total Amount Due|Amount Payable|Grand Total|Total)[:\s]*([£$€]?[\d,]+\.?\d{0,2})",
        r"(?:TOTAL)[:\s]*([£$€]?[\d,]+\.?\d{0,2})",
    ]
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    raise ValueError("Total amount not found")

Each function raises if extraction fails. This keeps the logic explicit — no silent empty strings or None where a value was expected.
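A quick sanity check makes the contract concrete (extract_total_amount is redefined here so the snippet is self-contained): a matching label returns the raw string, and an unmatched document raises rather than returning an empty value.

```python
import re


def extract_total_amount(text: str) -> str:
    patterns = [
        r"(?:Total Amount Due|Amount Payable|Grand Total|Total)[:\s]*([£$€]?[\d,]+\.?\d{0,2})",
        r"(?:TOTAL)[:\s]*([£$€]?[\d,]+\.?\d{0,2})",
    ]
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    raise ValueError("Total amount not found")


print(extract_total_amount("Grand Total: £1,234.56"))  # £1,234.56
print(extract_total_amount("Amount Payable: 980.00"))  # 980.00

try:
    extract_total_amount("Thank you for your business")
except ValueError as e:
    print(e)  # Total amount not found
```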


Assemble and validate
#

The extraction functions produce raw strings. Pydantic handles validation and type coercion:

import pdfplumber
from pydantic import ValidationError


class ExtractionError(Exception):
    """Raised when one or more required fields cannot be extracted."""


def extract_invoice(pdf_path: str) -> Invoice:
    with pdfplumber.open(pdf_path) as pdf:
        # Combine text from all pages
        full_text = "\n".join(
            page.extract_text() or "" for page in pdf.pages
        )

    raw_data = {}
    errors = {}

    # Extract each field, collecting errors rather than failing immediately.
    # (extract_invoice_date, extract_vendor_name, and extract_vat_amount
    # follow the same pattern as the two extractors shown above.)
    extractors = {
        "invoice_number": extract_invoice_number,
        "invoice_date": extract_invoice_date,
        "vendor_name": extract_vendor_name,
        "total_amount": extract_total_amount,
        "vat_amount": extract_vat_amount,
    }

    for field, extractor in extractors.items():
        try:
            raw_data[field] = extractor(full_text)
        except ValueError as e:
            errors[field] = str(e)

    if errors:
        # Log which fields failed before attempting validation
        raise ExtractionError(f"Field extraction failed: {errors}")

    # Pydantic validates types, coerces formats, runs field validators
    return Invoice(**raw_data)

When this works, you get a fully typed, validated Invoice object. When it doesn’t, you know exactly which fields failed and why.
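The "why" comes from Pydantic itself: a ValidationError carries one entry per failing field. A minimal sketch of catching it at the call site (using a trimmed Invoice model so the snippet runs standalone):

```python
from decimal import Decimal

from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    invoice_number: str
    total_amount: Decimal


try:
    # "see attached" can't be coerced to Decimal, so validation fails
    Invoice(invoice_number="INV-001", total_amount="see attached")
except ValidationError as e:
    for err in e.errors():
        print(err["loc"], err["msg"])
```

In a pipeline, this is the natural place to log the per-field errors alongside the source document path.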


Adding confidence scoring
#

Raw extraction either succeeds or fails. For a production pipeline, you want a middle ground: extractions that succeeded but aren’t certain.

Add a parallel confidence model:

from pydantic import BaseModel, Field
from typing import Optional


class FieldConfidence(BaseModel):
    value: str
    confidence: float  # 0.0 - 1.0
    method: str        # "regex", "llm", "coordinate"


class InvoiceExtractionResult(BaseModel):
    data: Optional[Invoice] = None
    field_confidences: dict[str, FieldConfidence] = Field(default_factory=dict)
    requires_review: bool = False
    review_reasons: list[str] = Field(default_factory=list)

    def flag_for_review(self, reason: str):
        self.requires_review = True
        self.review_reasons.append(reason)

When a field is extracted with a low-confidence method (say, an LLM fallback rather than a regex match), record it:

CONFIDENCE_THRESHOLD = 0.85


def extract_with_confidence(pdf_path: str) -> InvoiceExtractionResult:
    result = InvoiceExtractionResult()

    # ... extraction logic ...

    for field, conf in result.field_confidences.items():
        if conf.confidence < CONFIDENCE_THRESHOLD:
            result.flag_for_review(
                f"{field} extracted with low confidence ({conf.confidence:.0%}) via {conf.method}"
            )

    return result

Documents with low-confidence fields go to a human review queue rather than straight to your database. High-confidence documents pass through automatically. Nothing fails silently.
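Putting those pieces together, a single low-confidence field flips the review flag. The two models are redefined in trimmed form here so the snippet runs standalone; the 0.62 confidence is an invented example value.

```python
from pydantic import BaseModel, Field


class FieldConfidence(BaseModel):
    value: str
    confidence: float  # 0.0 - 1.0
    method: str        # "regex", "llm", "coordinate"


class InvoiceExtractionResult(BaseModel):
    field_confidences: dict[str, FieldConfidence] = Field(default_factory=dict)
    requires_review: bool = False
    review_reasons: list[str] = Field(default_factory=list)

    def flag_for_review(self, reason: str):
        self.requires_review = True
        self.review_reasons.append(reason)


CONFIDENCE_THRESHOLD = 0.85

result = InvoiceExtractionResult()
result.field_confidences["vat_amount"] = FieldConfidence(
    value="206.00", confidence=0.62, method="llm"
)

for field, conf in result.field_confidences.items():
    if conf.confidence < CONFIDENCE_THRESHOLD:
        result.flag_for_review(
            f"{field} extracted with low confidence ({conf.confidence:.0%}) via {conf.method}"
        )

print(result.requires_review)    # True
print(result.review_reasons[0])  # vat_amount extracted with low confidence (62%) via llm
```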


Handling layout variation
#

The schema stays constant across document layouts. The extraction functions absorb the variation.

When a new invoice format arrives that your existing patterns don’t handle, you extend the extractors — not the schema:

def extract_invoice_number(text: str) -> str:
    patterns = [
        r"Invoice\s*(?:No\.?|Number|#)[:\s]*([A-Z0-9\-]+)",
        r"INV[:\s]*([A-Z0-9\-]+)",
        # Added for Supplier X's format:
        r"Reference:\s*([A-Z]{3}\d{6})",
    ]
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    raise ValueError("Invoice number not found")

The output shape — an Invoice with an invoice_number: str — never changes. The extraction logic evolves; the schema doesn’t.


When to bring in an LLM
#

For fields where layout variation makes regex patterns too brittle, an LLM can extract the value and return it as the raw string the Pydantic validator expects:

import json
from openai import OpenAI

client = OpenAI()

def llm_extract_field(text: str, field: str, description: str) -> tuple[str, float]:
    """Returns (extracted_value, confidence_score)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Extract the {field} from this document text.
{description}

Return JSON: {{"value": "<extracted value>", "confidence": <0.0-1.0>}}
If not found, return {{"value": null, "confidence": 0.0}}

Document text:
{text[:3000]}"""
        }],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return result["value"], result["confidence"]

The LLM output feeds into the same Pydantic validation chain. The schema doesn’t care whether a value came from regex or an LLM — it validates the same way.
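One way to wire that fallback is a small router: try the regex extractor first, and only pay for the LLM call when it raises. This is a sketch, not the article's exact pipeline; the fixed 0.95 regex confidence is an assumed constant you would tune, and fake_llm stands in for llm_extract_field so the example runs without an API key.

```python
from typing import Callable


def extract_field_with_fallback(
    text: str,
    field: str,
    regex_extractor: Callable[[str], str],
    llm_extractor: Callable[[str, str], tuple[str, float]],
) -> tuple[str, float, str]:
    """Try the cheap regex path first; fall back to the LLM on failure.

    Returns (value, confidence, method) so the caller can record a
    FieldConfidence entry either way.
    """
    try:
        # Treat a regex hit as high confidence (assumed constant, tune as needed)
        return regex_extractor(text), 0.95, "regex"
    except ValueError:
        value, confidence = llm_extractor(text, field)
        if value is None:
            raise ValueError(f"{field} not found by regex or LLM")
        return value, confidence, "llm"


# Stand-ins for demonstration: a regex extractor that misses, and a stub LLM.
def regex_miss(text: str) -> str:
    raise ValueError("not found")


def fake_llm(text: str, field: str) -> tuple[str, float]:
    return "INV-XYZ-001", 0.7


value, confidence, method = extract_field_with_fallback(
    "unstructured text", "invoice_number", regex_miss, fake_llm
)
print(value, confidence, method)  # INV-XYZ-001 0.7 llm
```

The returned method string is exactly what FieldConfidence.method expects, so regex and LLM results flow into the same review logic.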


The key principle
#

Schema-first extraction separates two concerns that ad-hoc extraction conflates:

  1. What the output should look like — the schema, defined once, stable
  2. How to get there from a given document — the extractors, which change as layouts change

When the schema is clear, extraction failures are field-level and explicit. When extraction logic changes, the schema enforces that the output contract hasn’t changed. And when you add LLMs for hard cases, they slot into the same validation pipeline as everything else.

This is what makes a pipeline maintainable at scale — not the tools, but the structure.
