Schema-First Extraction: What It Is and Why It Matters for Production IDP

Subhajit Bhar
I build production-grade document extraction pipelines for businesses that process invoices, lab reports, contracts, and other document types at scale.

Schema-first extraction is an approach to document processing where you define the output structure — every field, its type, its validation rules — before writing a single line of extraction logic.

The schema is the specification. It describes exactly what a successful extraction looks like: which fields are required, which are optional, what format dates should be in, what range is valid for numeric values. Extraction logic is then written to satisfy that specification.

This sounds obvious. In practice, most document extraction projects work the other way around.


How extraction usually goes wrong

The typical approach to building a document extractor:

  1. Get some sample documents
  2. Write code to pull out the values you need
  3. Test it on the samples
  4. Ship it

The output structure emerges from the code rather than being defined upfront. Fields are extracted in whatever format the document provides them. Validation — if it exists at all — is added later, piecemeal.

This works on the sample documents. Real production documents are different. A date that’s always been DD/MM/YYYY arrives in Month D, YYYY format. A field that’s always present is missing. A number that should be a decimal arrives as a string with a currency symbol.

Without a schema, none of these failures are obvious. The extraction runs. Something ends up in the output field. That something propagates downstream. The error surfaces far from where it originated.
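This failure mode can be sketched in a few lines of plain Python (the field names and values are illustrative, not from a real pipeline):

```python
# Hypothetical schema-less extractor output: whatever string the document
# provided lands in the field, with no type or format check.
record = {"invoice_date": "March 4, 2024", "total_amount": "£1,234.50"}

# Nothing fails here -- the malformed values simply propagate.
stored = dict(record)

# The error only surfaces downstream, far from the extraction step.
try:
    total = float(stored["total_amount"])
except ValueError:
    total = None

print(total)  # None: the bad value travelled all the way to the consumer
```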


What schema-first looks like

Before touching any documents, you define a schema — typically as a data class or a structured type definition. In Python, Pydantic is the natural tool for this:

from pydantic import BaseModel, Field, field_validator
from datetime import date
from decimal import Decimal
from typing import Optional

class InvoiceExtraction(BaseModel):
    invoice_number: str = Field(..., description="Unique invoice identifier")
    invoice_date: date = Field(..., description="Date of invoice issuance")
    supplier_name: str = Field(..., description="Name of the issuing supplier")
    total_amount: Decimal = Field(..., ge=0, description="Total invoice amount including tax")
    currency: str = Field(default="GBP", description="Currency code")
    vat_amount: Optional[Decimal] = Field(None, ge=0, description="VAT amount if itemised")

    @field_validator("invoice_number")
    @classmethod
    def invoice_number_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Invoice number cannot be empty")
        return v.strip()
The schema defines what “done” looks like before extraction begins. Any extraction result that doesn’t satisfy this schema fails explicitly — a validation error, not a silent bad value.

This shifts the design question from “how do I pull data out of this document?” to “what does a valid output look like, and how do I get there reliably?”
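The explicit failure is easy to see with a trimmed-down version of such a model (Pydantic v2 API; the payload here is invented for illustration):

```python
from datetime import date
from decimal import Decimal

from pydantic import BaseModel, Field, ValidationError

class Invoice(BaseModel):
    invoice_number: str
    invoice_date: date
    total_amount: Decimal = Field(ge=0)

# A payload with a malformed date: validation fails here, explicitly,
# instead of a bad string passing through to downstream systems.
raw = {"invoice_number": "INV-001", "invoice_date": "4th of March", "total_amount": "12.50"}
try:
    Invoice(**raw)
except ValidationError as exc:
    print(f"{exc.error_count()} field(s) failed validation")
```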


Why this changes everything

Failures are explicit. When extraction produces something that doesn’t satisfy the schema, it fails loudly — at the point of extraction, not downstream. A missing required field raises a validation error immediately. A date in the wrong format fails the type validator. Nothing ambiguous passes through.

The scope of the problem becomes clear early. Writing the schema before extraction forces you to think precisely about what you need and what variation looks like. You discover that “total amount” can arrive as a number, a string with a currency symbol, or a string with thousands separators — before you’ve written extraction logic that assumes one format.

Extraction logic has a clear target. Each piece of extraction logic exists to populate specific fields in the schema. If a field is extracted inconsistently, the schema validates it consistently. Normalisation happens at the schema boundary, not scattered throughout extraction code.
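One way to put that normalisation at the schema boundary is a Pydantic `mode="before"` validator that coerces the messy string forms into one canonical type (a sketch; the set of currency symbols handled is an assumption):

```python
from decimal import Decimal

from pydantic import BaseModel, field_validator

class AmountField(BaseModel):
    total_amount: Decimal

    @field_validator("total_amount", mode="before")
    @classmethod
    def normalise_amount(cls, v):
        # Coerce "£1,234.50", "1,234.50", or a plain number into Decimal input.
        if isinstance(v, str):
            v = v.strip().lstrip("£$€").replace(",", "")
        return v

print(AmountField(total_amount="£1,234.50").total_amount)  # Decimal("1234.50")
```

However the document formats the amount, everything downstream of the schema sees a clean `Decimal`.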

New document layouts don’t require schema changes. When a supplier changes their invoice template, the extraction rules for that template need updating — but the schema stays the same. Downstream systems continue to receive the same structured output regardless of which template the document used.
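The split between per-template rules and a fixed schema can be sketched with plain functions (the template names and raw field labels are hypothetical):

```python
# Hypothetical per-template extraction rules: each maps a raw layout to
# the same output shape, so downstream consumers never see the difference.
def extract_template_a(raw: dict) -> dict:
    return {"supplier_name": raw["vendor"], "total_amount": raw["grand_total"]}

def extract_template_b(raw: dict) -> dict:
    return {"supplier_name": raw["supplier"], "total_amount": raw["amount_due"]}

EXTRACTORS = {"template_a": extract_template_a, "template_b": extract_template_b}

def extract(template: str, raw: dict) -> dict:
    # Updating a supplier's template touches only its entry here;
    # the schema the output is validated against stays fixed.
    return EXTRACTORS[template](raw)
```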


In production

In the water consultancy pipeline I’ve run for two years, the schema was defined before any extraction logic was written. The lab report schema specified every field: parameter name, measurement value, units, detection limit, sample collection date, sample ID, accreditation code.
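The exact field names are illustrative rather than taken from that pipeline, but a lab-report schema along those lines might look like:

```python
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel

class LabResult(BaseModel):
    parameter_name: str                        # e.g. "Lead", "Nitrate"
    measurement_value: Decimal
    units: str                                 # e.g. "mg/l"
    detection_limit: Optional[Decimal] = None  # present only for some assays
    sample_collected: date
    sample_id: str
    accreditation_code: Optional[str] = None
```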

When a new testing laboratory was onboarded — different layout, different field labelling — only the extraction rules for that lab needed updating. The schema stayed stable. The downstream system that consumed the output didn’t change. The validation that checked each extracted report against the schema caught layout-specific extraction failures before they propagated.

That’s the practical value of schema-first: it makes new document sources a bounded problem rather than a system-wide one.




Building a document extraction system? Start with a Diagnostic Session →
