
Handling PDF Layout Variations in Python

·1302 words·7 mins·
Subhajit Bhar
I build production-grade document extraction pipelines for businesses that process invoices, lab reports, contracts, and other document types at scale.

The script works perfectly on the sample document. You tested it on twenty invoices from your main supplier and it extracted every field correctly. Then the first real batch arrives — invoices from six different suppliers — and half of them fail.

The invoice number you’re extracting at coordinates (120, 340) doesn’t appear there in supplier B’s format. It’s at (80, 210), after “Ref:” instead of “Invoice No.”. Supplier C uses a completely different page layout. The table column names are different. Some suppliers put the total at the bottom right; one puts it mid-page in a summary box.

This is the layout variation problem. It’s not a bug in your extraction logic — that logic is correct for the format it was built on. The problem is that coordinate-based, position-dependent extraction doesn’t generalise across document variants. The first batch of real documents exposes that.


Why coordinate-based extraction breaks
#

Extracting text at fixed pixel coordinates works when every document is generated from the same template. The moment a supplier updates their invoice template, or a new supplier uses a different design, the coordinates shift and your extraction fails.

The same applies to fixed page regions — “the vendor name is always in the top-left block” — and to column index assumptions in table extraction. These approaches are fragile by design. They encode layout assumptions directly into extraction logic, so any layout change breaks them.

The underlying issue is that coordinates describe position, not meaning. What you actually want is the value after the label “Invoice Number:”, not whatever happens to be at x=120, y=340.


Pattern-based extraction instead
#

A more robust approach anchors extraction to content patterns rather than positions. Instead of “get text at coordinates (x, y)”, the instruction becomes “get text after the label ‘Invoice Number:’ on the same line” — or within a few lines below it.

This is keyword anchoring. You define the labels that signal a field’s location in the document, then extract the value relative to that label. Regex patterns handle the matching:

patterns = [
    r"Invoice\s*(?:No\.?|Number|#)[:\s]*([A-Z0-9\-]+)",
    r"Ref[:\s]*([A-Z0-9\-]+)",
    r"Reference[:\s]*([A-Z]{3}\d{4,})",
]

Each pattern represents a different way the same field is labelled across supplier formats. The list grows as you encounter new variants. The extraction logic stays in one place.

For fields where the label varies too much for reliable regex, proximity-based extraction — finding the nearest plausible value to a set of candidate labels — adds another layer of robustness. The schema-first PDF extraction with Pydantic post covers how to structure these extraction functions so failures are explicit rather than silent.


Building a layout classifier
#

When you have known document variants — supplier A’s format, supplier B’s format, an internal purchase order format — it’s worth detecting which variant you’re dealing with at ingestion time and applying the appropriate rule set.

The classifier doesn’t need to be complex. Detection signals include: distinctive header text that only appears in one format, table column names, the supplier name embedded in the document footer, or even page dimensions if suppliers consistently differ there. A simple rule-based classifier that checks for these signals in order is often sufficient.

import re

def classify_layout(text: str) -> str:
    # Check distinctive signals in order; first match wins.
    if "ACME CORP" in text and "PO Reference" in text:
        return "supplier_acme"
    if re.search(r"Tax Invoice\s+\d{4}", text):
        return "supplier_format_b"
    if "Consolidated Statement" in text:
        return "statement_format"
    return "unknown"

The classifier returns a layout identifier, and the extraction pipeline routes to the matching rule set. Supplier A’s extractor handles supplier A’s column names and label conventions. Supplier B’s extractor handles its own. Adding a new known variant means adding a detection rule and a rule set — not modifying existing logic.
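The routing itself can be a plain dispatch table. The extractor functions below are stand-ins for real rule sets, and the classifier is repeated so the sketch is self-contained:

```python
import re

def classify_layout(text: str) -> str:
    # Same rule-based classifier as above.
    if "ACME CORP" in text and "PO Reference" in text:
        return "supplier_acme"
    if re.search(r"Tax Invoice\s+\d{4}", text):
        return "supplier_format_b"
    return "unknown"

def extract_acme(text: str) -> dict:
    return {"source": "supplier_acme"}      # stand-in for the real rule set

def extract_format_b(text: str) -> dict:
    return {"source": "supplier_format_b"}  # stand-in for the real rule set

def flag_for_review(text: str) -> dict:
    # Unknown layouts are routed to human review, never silently extracted.
    return {"source": "unknown", "needs_review": True}

EXTRACTORS = {
    "supplier_acme": extract_acme,
    "supplier_format_b": extract_format_b,
}

def run_pipeline(text: str) -> dict:
    # Route to the matching rule set; anything unmatched goes to review.
    return EXTRACTORS.get(classify_layout(text), flag_for_review)(text)
```

Adding a new variant means one new classifier rule and one new entry in `EXTRACTORS` — the existing extractors stay untouched.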


Handling unknown layouts
#

Not every document that arrives will match a known variant. When a new supplier sends their first batch, or an existing supplier changes their template, the classifier returns “unknown”.

There are three things to do with an unknown layout. First, flag it for human review rather than attempting extraction and silently producing wrong values. Second, use LLM extraction as a fallback for fields that rules can’t reliably extract — this provides a best-effort result that a human reviewer can confirm or correct. Third, once the layout is confirmed and understood, add it to the classifier as a new known variant.

This is the human-in-the-loop processing design for layout variation. New variants surface as reviewable exceptions rather than causing silent downstream errors. The system degrades gracefully — unknown layouts get human attention, not corrupted data.

The LLM fallback is a bridge, not a permanent solution. Once a layout is encountered frequently enough, it deserves its own rule set.


Table extraction across layouts
#

Tables are the hardest part of layout variation. Different suppliers use bordered tables, whitespace-separated columns, different column orders, merged header cells, and inconsistent row formatting.

pdfplumber handles the two main cases through its table settings: a "lines" strategy for tables with visible borders (it traces the ruling lines), and a "text" strategy for whitespace-separated columns (it infers column boundaries from text alignment). The right strategy depends on how the table was rendered in the PDF.

Even with the correct mode, column names vary. One supplier’s “Unit Price” is another’s “Rate” or “Each”. After extracting the raw table, normalise column names before processing:

COLUMN_ALIASES = {
    "description": ["description", "item", "details", "service"],
    "quantity": ["quantity", "qty", "units"],
    "unit_price": ["unit price", "rate", "each", "unit cost"],
    "total": ["total", "amount", "line total", "ext. price"],
}

When neither strategy produces a reliable table — scanned PDFs, complex merged cells, unusual rendering — LLM-based table extraction with a structured output schema is the practical fallback. Pass the raw text of the table region to the model and ask it to return structured rows conforming to your schema. The extract tables from PDF Python post covers how to set this up with pdfplumber.


Confidence scoring per field
#

After extraction, every field should have a confidence score attached to it. High confidence: the regex matched cleanly and the extracted value validates against the expected format (an invoice number that matches [A-Z]{2,4}\d{4,}, a total that parses as a decimal). Low confidence: the value came from an LLM fallback, the pattern matched loosely, or the value fails a format check.

Confidence scoring is what prevents layout variation from causing silent downstream errors. Without it, a value extracted incorrectly because of a layout mismatch flows into your database or downstream system looking identical to a correctly extracted value. With it, low-confidence fields go to a human review queue.

The threshold matters. Set it too high and you’re reviewing everything. Set it too low and errors slip through. The right level depends on the downstream cost of errors in your specific case — a wrong total on a finance record has a different cost to a wrong line item description in a log.


Schema-first design as the foundation
#

Every technique described here — layout classifiers, pattern libraries, LLM fallbacks, confidence scoring — works because the output schema is fixed.

Define what you want to extract before you write extraction logic. The schema describes the fields, their types, their validation rules. It’s the same for an invoice from supplier A, supplier B, or an unknown new supplier. What varies is the extraction path to get there.
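As a sketch, assuming Pydantic v2 — the field names and validation rules here are examples, not the author's actual production model:

```python
from decimal import Decimal
from pydantic import BaseModel, Field

class InvoiceRecord(BaseModel):
    # The contract every extraction path must satisfy, regardless of layout.
    invoice_number: str = Field(pattern=r"^[A-Z0-9\-]+$")
    supplier_name: str
    total: Decimal = Field(ge=0)  # must parse as a non-negative decimal
```

Whether the values came from supplier A's regex rule set, supplier B's, or an LLM fallback, they all pass through the same model — and fail loudly if they don't fit.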

This separation is what makes layout variation manageable. You can add a new rule set for a new supplier without touching the schema. You can swap a regex pattern for an LLM fallback without changing what downstream systems receive. You can add a new layout to the classifier without breaking existing variants.

Without a stable schema, each new layout variant risks changing the output shape. With one, the output contract holds regardless of what arrives.

Schema-first extraction is the structural decision that ties the rest of this together. It’s worth getting right before the first new supplier arrives.

Book a Diagnostic Session →
