The naive approach is obvious: take a document, pass it to an LLM, ask for the data you want. It works on clean examples. Ask GPT-4 to extract invoice fields from a well-formatted PDF and you get a clean JSON response that looks exactly right.
Then you run it on production documents. The model returns a date in a format your parser doesn’t expect. It invents a vendor name that almost matches the one on the invoice. It returns null for a field that’s clearly present, or returns a plausible-sounding value for a field that isn’t there at all. You have no way to know which outputs are correct without checking each one manually — which defeats the point.
The problem is that LLMs are probabilistic. They produce plausible-sounding output, not guaranteed-correct output. Without a validation layer, you have no signal when they’re wrong. And at scale, wrong 3% of the time means hundreds of bad records a day.
## Why raw LLM extraction breaks in production
The demo-to-production gap in LLM document extraction is real, and it has specific causes.
Hallucination on unfamiliar layouts. LLMs have seen a lot of documents, but they haven’t seen your specific supplier’s unusual invoice format. When a field appears in an unexpected position or uses non-standard labelling, the model will often fill in something that looks right rather than admit it couldn’t find the value.
Inconsistent field names in output. Ask the same model to extract an invoice the same way ten times and you may get `invoice_number`, `invoiceNumber`, `invoice_no`, and `InvoiceNum` across those runs. Without strict output formatting, you're adding a normalisation problem on top of an extraction problem.
Different formats for the same field. Dates come back as 2026-03-11, 11/03/2026, March 11 2026, and 11 Mar 26 — sometimes within the same batch. Currency fields arrive with and without symbols, with different decimal separators depending on which locale the model decided the document was from.
Silent failures. The model returns something that looks valid. It passes basic type checks. It reaches your database. Three weeks later you find that one supplier’s totals have been systematically wrong because the model was reading the subtotal line instead of the grand total. Nothing failed — it just produced the wrong answer quietly.
Cost at scale. If every document goes through an LLM call, you’re paying LLM pricing for documents where a regex would have done the job in two milliseconds for free. On moderate volumes, this adds up quickly and creates pressure to cut corners elsewhere.
## The right mental model
LLMs are one layer in a pipeline, not the whole pipeline. This is the distinction that separates production systems from demos.
A document processing pipeline that works reliably looks like this:
- Ingest and preprocess — receive the document, determine whether it’s text-based or scanned, run OCR if needed, normalise page order and orientation
- Rules-based extraction for fields that appear in predictable locations or follow predictable patterns
- LLM extraction only for fields where rules can’t reach reliably
- Schema validation — every output, regardless of source, is validated against a Pydantic model
- Confidence scoring — every extracted field gets a reliability estimate
- Human review for results below the confidence threshold
Each layer has a specific job. None of them replace the others. The LLM layer is important — it handles cases that rules can’t — but it sits inside a structure that validates what it produces.
## Schema-first design
Before writing any extraction logic, define the output schema. This is the schema-first extraction principle, and it matters more than any model or library decision you make later.
A Pydantic model defines what a correct extraction looks like. Every field has a type. Required fields are marked. Optional fields are explicit. Validators handle known format variation — the date field validator tries multiple formats, the currency field strips symbols and handles both decimal conventions.
The schema is the contract between the pipeline and the downstream system. If an extraction produces a value that doesn’t satisfy the schema, you know immediately and explicitly, before it reaches your database. The schema doesn’t flex to accommodate bad extractions. Bad extractions get flagged, corrected, or rejected.
This also forces a useful discipline early: if you can’t define what a correctly-extracted field looks like, you don’t understand the problem well enough to build reliable extraction for it. The schema definition step often surfaces ambiguity in the requirements before a single line of extraction code is written.
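A minimal sketch of such a schema, assuming Pydantic v2 (the field names, date formats, and currency handling are illustrative, not a complete invoice model):

```python
from datetime import date, datetime
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, field_validator


class InvoiceExtraction(BaseModel):
    invoice_number: str
    invoice_date: date
    vendor_name: str
    total: Decimal
    purchase_order: Optional[str] = None  # optional fields are explicit

    @field_validator("invoice_date", mode="before")
    @classmethod
    def parse_date(cls, v):
        # Handle the known format variation before type coercion.
        if isinstance(v, date):
            return v
        for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y"):
            try:
                return datetime.strptime(str(v).strip(), fmt).date()
            except ValueError:
                continue
        raise ValueError(f"unparseable date: {v!r}")

    @field_validator("total", mode="before")
    @classmethod
    def parse_currency(cls, v):
        # Strip symbols, then handle both decimal conventions.
        s = str(v).strip().lstrip("£$€").replace(" ", "")
        if "," in s and s.rfind(",") > s.rfind("."):
            # Continental style: "1.234,56" -> "1234.56"
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
        return Decimal(s)
```

A value that survives construction of this model is, by definition, a correctly-shaped extraction; anything else raises a validation error at the pipeline boundary rather than in the downstream system.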
See schema-first PDF extraction with Pydantic for how to build this out in practice.
## Layer 1: rules-based extraction
The first extraction layer handles everything it can deterministically. Regex for fields that follow consistent patterns — invoice numbers, dates in known positions, totals with known labels. Coordinate-based extraction for layout-stable documents where fields appear at predictable positions on the page. pdfplumber’s table extraction for structured line item data.
Rules-based extraction is fast, cheap, and auditable. When a regex matches, you know exactly why and what it matched. When it fails, you know immediately. There are no surprises about what the extractor was thinking.
In most document types with a reasonably consistent supplier or source base, rules handle 70-80% of fields reliably. That’s 70-80% of your extraction volume that doesn’t require an LLM call, doesn’t introduce probabilistic uncertainty, and runs in milliseconds per document.
The investment here is worth making properly. Good regex patterns with named groups, document classification to apply the right patterns to each document type, fallback patterns when the primary pattern doesn’t match. The more you invest in the rules layer, the less you need from the LLM layer.
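A sketch of the pattern-plus-fallback idea for one field (the patterns and confidence values are hypothetical, tuned per supplier in practice):

```python
import re

# Primary pattern: anchored to a known field label, named group for the value.
PRIMARY = re.compile(
    r"Invoice\s*(?:No\.?|Number)[:\s]+(?P<invoice_number>[A-Z]{2,4}-\d{4,8})"
)
# Fallback: looser match on the value shape alone, used when the label varies.
FALLBACK = re.compile(r"\b(?P<invoice_number>INV-\d{4,8})\b")


def extract_invoice_number(text: str):
    """Return (value, confidence); (None, 0.0) when nothing matched."""
    m = PRIMARY.search(text)
    if m:
        return m.group("invoice_number"), 0.95  # labelled match: high confidence
    m = FALLBACK.search(text)
    if m:
        return m.group("invoice_number"), 0.6   # loose match: lower confidence
    return None, 0.0
```

The confidence attached at match time is what feeds the scoring layer later: a labelled match and a loose fallback match are not equally trustworthy, and the pipeline should know the difference.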
## Layer 2: LLM extraction
The fields that rules can't reach reliably — variable-position fields, fields that require reading surrounding context to identify, natural language fields, fields where the label varies too much for regex to be practical — are where the LLM layer earns its place.
Three practices are non-negotiable for reliable LLM extraction in production.
Structured output only. Never ask an LLM for freeform text and parse it yourself. Use JSON mode, OpenAI’s structured outputs, or the instructor library to enforce that the model returns data conforming to your Pydantic schema. The model fills your schema. It doesn’t write prose that you then try to extract values from.
Narrow scope. Ask for one field at a time, or small logical groups of related fields. A prompt that asks for 20 fields at once introduces more surface area for hallucination and makes it harder to attribute errors to specific fields. Smaller, focused prompts are more reliable.
Temperature zero. For extraction tasks, you want the most probable output given the input, not creative variation. Set temperature to 0. This doesn't eliminate hallucination, but it makes outputs close to deterministic, so identical inputs produce identical outputs on almost every run — which makes testing and debugging tractable.
The LLM layer produces values in the same format the rules layer produces. Both feed into the same schema validation step.
## Confidence scoring
Every extracted field needs a reliability estimate, regardless of how it was extracted. This is what makes routing decisions possible downstream.
For rules-based extraction, confidence is largely binary: the pattern matched or it didn’t, the value parsed or it didn’t. A strong regex match against a known field label with a specific pattern gets a high confidence score. A loose fallback match that found something that might be the right value gets a lower one.
For LLM extraction, confidence is harder to derive. The model’s self-reported confidence is unreliable. More useful signals come from consistency checks: does the extracted total match the sum of line items? Does the date fall within a plausible range? Is the extracted vendor name in your known supplier list? A value that passes these cross-checks gets a higher confidence score than one that doesn’t.
The confidence scoring threshold determines what goes to human review. Set it based on the cost of errors in your context. For a financial document feeding an ERP, the threshold should be high. For a lower-stakes categorisation task, you can tolerate lower confidence before requiring review.
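These cross-checks can be combined into a simple additive score. A sketch with illustrative weights and a hypothetical supplier list (real weights should be calibrated against a labelled sample):

```python
from decimal import Decimal

KNOWN_SUPPLIERS = {"Acme Ltd", "Globex GmbH"}  # hypothetical known-vendor list


def score_total(total: Decimal, line_items: list[Decimal], vendor: str) -> float:
    """Combine consistency checks into one confidence score for the total field."""
    score = 0.5  # base score: the value parsed at all
    if sum(line_items) == total:
        score += 0.3  # total matches the sum of line items
    if vendor in KNOWN_SUPPLIERS:
        score += 0.2  # vendor appears in the known supplier list
    return round(min(score, 1.0), 2)
```

A value that passes every cross-check approaches 1.0; a value that merely parsed sits at the base score and is a candidate for review.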
## Human review routing
Extractions below the confidence threshold don’t proceed to the downstream system. They go to a review queue. A human looks at the flagged fields alongside the original document, confirms or corrects the extraction, and releases the record.
This is human-in-the-loop processing, and it’s a design decision, not a fallback. The pipeline is designed so that uncertain results go to humans rather than silently downstream. The goal is not to eliminate human review — it’s to route documents that genuinely need human judgment to humans, while automating everything that can be automated reliably.
Corrected extractions are useful beyond fixing individual records. Patterns in what gets corrected reveal where the extraction logic is weak. Rules that consistently produce low-confidence results can be improved. Document types that consistently require human review might benefit from LLM extraction where rules were previously used. The review queue is a feedback mechanism, not just a safety net.
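The routing decision itself is small. A sketch, assuming a record carries per-field confidence scores (the threshold value and type names are illustrative):

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.8  # hypothetical value; tune to the cost of errors downstream


@dataclass
class ExtractionRecord:
    fields: dict       # field name -> extracted value
    confidences: dict  # field name -> confidence score


def route(record: ExtractionRecord) -> str:
    """One low-confidence field is enough to hold the whole record for review."""
    lowest = min(record.confidences.values(), default=0.0)
    return "downstream" if lowest >= REVIEW_THRESHOLD else "review_queue"
```

Gating on the weakest field is a deliberate choice: a record with nineteen confident fields and one doubtful total is still a record you don't want reaching the ERP unreviewed.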
## Production architecture
A complete document extraction pipeline for production looks like this in outline:
An ingestion service receives documents from whatever sources your workflow uses — email attachments, SFTP drops, a document upload API, a scanner integration. Documents are normalised to a consistent internal format and queued for processing.
A preprocessing step determines whether the document has a usable text layer or needs OCR, handles page orientation, and classifies the document type if you’re handling multiple document types in the same pipeline.
The extraction pipeline runs the rules layer first, then the LLM layer for fields the rules couldn’t extract with sufficient confidence. Both layers produce field-level results with confidence scores.
Schema validation runs on the assembled extraction result. Cross-field checks run here — totals that should match, dates that should be in sequence, references that should exist in related systems.
Confidence scoring at the record level aggregates field-level scores and determines whether the record can proceed automatically or needs review.
Human review handles the flagged records via a web interface for higher volumes or a simple spreadsheet export for lower volumes. Either way, the interface shows the original document alongside the extracted fields so reviewers can work quickly.
The output API delivers validated records to the downstream system — an ERP, an accounting system, a database, whatever the use case requires.
This isn’t a complex architecture. The value is in the discipline: every component has a defined responsibility, failures are explicit and routed appropriately, and nothing reaches the downstream system without passing validation. See what is intelligent document processing for a broader overview of where this fits.
LLMs make document processing genuinely more capable than a rules-only approach. They handle the variation that rules can’t reach. But they only deliver that capability reliably inside a pipeline that validates what they produce, scores confidence, and routes uncertain results to humans. Without that structure, you have a demo, not a production system.
If you’re building a document processing pipeline and want to get the architecture right before you’re dealing with production failures, a diagnostic session is a good starting point.
Book a Diagnostic Session →