Every document automation project starts the same way. You pick a tool, write some code, test it on a handful of documents — and it works. Fields are extracted, outputs look right. You ship it.
Then the edge cases arrive.
A supplier changes their invoice template. A lab report comes in from a new testing facility with a different layout. A PDF is scanned at an angle. A table spans two pages. A date field that’s always been in one corner moves to the header.
The system breaks. You fix it. Another edge case appears. You fix that one too. Eventually, you’re spending more time maintaining the automation than you saved by building it — or you’ve quietly gone back to doing parts of it manually.
This is the normal trajectory for document automation built without accounting for edge cases from the start.
## What an edge case actually is
An edge case isn’t a rare event. In document processing, it’s anything that differs from the documents you built and tested against.
That includes:
- Layout variation — the same document type from different sources, or the same source at different times, formatted differently
- Field movement — a field that usually appears in a fixed location appearing somewhere else in some versions
- Missing fields — a value that’s usually present being absent in some documents
- Format variation — dates written as DD/MM/YYYY in some documents and Month DD, YYYY in others
- Multi-page structures — tables, totals, or summaries that span page breaks
- Scan quality — rotation, skew, low resolution, background noise
- Embedded images — text rendered as images rather than actual text in a PDF
- Encoding issues — PDFs where text extraction produces garbled characters
None of these are unusual. In any real-world document workflow with more than a few sources, you’ll see all of them eventually.
## Why simple scripts break
A Python script that extracts data from PDFs using pdfplumber or PyMuPDF works by finding text in expected locations. It might look for a value after a label, extract numbers from a specific bounding box, or use a regex to match a known field format.
This works well when the document looks exactly like the ones you wrote the script against.
When the layout shifts — even slightly — the extraction logic fails. The label moves. The bounding box is wrong. The regex doesn’t match the new date format. The script returns nothing, or worse, the wrong value with no indication that anything went wrong.
Simple scripts have no concept of uncertainty. They extract what they find, or they don’t. They can’t tell you that a result looks suspicious. They have no fallback.
## Why enterprise platforms break
Azure Document Intelligence, AWS Textract, Google Document AI — these tools are trained on large datasets of representative documents. For standard formats, they perform well.
The limit is the training distribution. Your documents aren’t in it.
A water utility’s lab report. A customs broker’s specific bill-of-lading (BoL) template. A financial services firm’s proprietary reporting format. The prebuilt models weren’t trained on these. When you submit them, the model extracts what it can and assigns confidence scores — and those scores reflect the model’s confidence in its own extraction, not the accuracy of the result.
Custom training helps, but it requires labelled examples. The more your documents vary, the more examples you need. And when your document layouts change — new supplier templates, regulatory format updates — you’re back to retraining.
The deeper problem: there’s no clean mechanism for routing uncertain extractions to a human reviewer. That logic is yours to build on top. The platform doesn’t know what it doesn’t know about your documents.
## Why LLM-only extraction breaks
Large language models can read a document and extract fields from it without any explicit rules. You describe what you want; the model finds it.
This is genuinely useful for documents where field locations are unpredictable or where the relevant data is embedded in natural language. The flexibility is real.
The problem is that LLMs are probabilistic. They don’t know when they’re wrong. A well-calibrated LLM might get 90% of extractions right on your document types — which sounds good until you realise that a 10% error rate on high-stakes documents (invoices, compliance records, customs declarations) produces bad data that passes silently downstream.
“Silently” is the key word. LLM extraction without a validation layer has no way to distinguish a correct extraction from a plausible-sounding hallucination. The model returns a value, the pipeline accepts it, it ends up in your database.
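One inexpensive validation layer is a groundedness check: reject any value the model returns that does not appear verbatim in the source text. A sketch — the LLM call itself is assumed and not shown, and this check is crude (it misses legitimately reformatted values), but it rejects outright hallucinations instead of passing them downstream:

```python
def is_grounded(value: str, source_text: str) -> bool:
    """Accept an extracted value only if it appears verbatim in the document."""
    return value.strip() in source_text

doc = "Invoice 4417. Amount due: $1,250.00 by 14/03/2025."

is_grounded("$1,250.00", doc)  # present in the document → True
is_grounded("$1,200.00", doc)  # plausible but hallucinated → False
```

A rejected value doesn't have to be discarded — it can be routed to the same human-review path that handles low-confidence extractions.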
LLM-only extraction isn’t a production IDP system. It’s a component of one — the part you reach for when rules aren’t enough.
## What production pipelines do differently
The systems that hold up in production share a few design properties that prevent edge cases from becoming silent failures.
Schema defined first. Before any document is touched, the output structure is defined precisely: which fields, what types, what validation rules. The schema is the contract that the extraction logic has to satisfy. Any document that can’t produce a valid output against the schema fails explicitly.
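A minimal sketch of a schema-as-contract using a plain dataclass — production pipelines often reach for a validation library such as pydantic instead, and the field names here are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class InvoiceRecord:
    """The output contract: extraction must produce exactly this shape."""
    invoice_number: str
    issue_date: date
    total: float

    def __post_init__(self):
        # Constructing an invalid record fails loudly, at the boundary.
        if not self.invoice_number:
            raise ValueError("invoice_number is required")
        if self.total < 0:
            raise ValueError("total must be non-negative")
```

Any document that can't be turned into a valid `InvoiceRecord` raises at construction time — which is exactly the explicit failure the schema-first approach is for.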
Deterministic extraction as the baseline. Rules, regex, and layout logic handle everything they can. This is fast, auditable, and doesn’t introduce uncertainty where none is necessary. The rules expose where the hard cases actually are.
LLMs only where rules can’t reach. LLM extraction is introduced specifically for fields where layout variation makes rules too brittle — not applied globally because it’s easier.
Confidence scoring on every field. Every extracted value gets a reliability score. Values above the threshold pass through automatically. Values below are flagged for human review before going downstream. Nothing fails silently.
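The routing logic itself is simple to sketch; the threshold value below is illustrative and would be tuned per field in practice:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # illustrative; tune per field in practice

@dataclass
class FieldResult:
    name: str
    value: str
    confidence: float

def route(fields: list[FieldResult]) -> tuple[list[FieldResult], list[FieldResult]]:
    """Split extractions into auto-accepted values and a human-review queue."""
    accepted = [f for f in fields if f.confidence >= CONFIDENCE_THRESHOLD]
    review = [f for f in fields if f.confidence < CONFIDENCE_THRESHOLD]
    return accepted, review
```

The point is not the three lines of filtering — it's that every field passes through this gate, so nothing reaches the database without either clearing the threshold or being seen by a person.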
Human-in-the-loop by design. The pipeline doesn’t try to handle everything automatically. Uncertain cases go to a review queue. A person resolves them. This isn’t a fallback for when the automation fails — it’s an explicit part of the design.
Loud failures over silent ones. When something goes wrong — missing required field, validation failure, extraction below confidence threshold — the system flags it and stops rather than passing bad data downstream.
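A sketch of the fail-loud pattern — raise an explicit, typed error instead of returning partial data (the names are illustrative):

```python
class ExtractionError(Exception):
    """Raised instead of returning partial or suspect data."""

def require(fields: dict, name: str) -> str:
    """Fetch a required field, failing loudly if it's missing."""
    value = fields.get(name)
    if value is None:
        raise ExtractionError(f"required field missing: {name}")
    return value
```

A dedicated exception type means the pipeline can catch extraction failures specifically — logging them, alerting on them, or queueing the document for review — without swallowing unrelated bugs.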
## The common thread
Every approach that breaks on edge cases has the same underlying problem: it was designed for the documents it was tested on, not for the documents it will eventually see.
Real document workflows have variation. The sources change, the templates evolve, the exception cases multiply as the business grows. A system built without accounting for this will require constant maintenance — or it will quietly degrade as the documents diverge from what it was built for.
The way out isn’t a better tool. It’s a different design philosophy: define what you expect the output to look like, establish what you can extract deterministically, and handle uncertainty explicitly rather than pretending it isn’t there.
That’s the difference between a document automation script and a production IDP pipeline.