Extracting vendor name, invoice number, date, line items, and total from a single consistent invoice format is a few lines of pdfplumber. If your company uses one internal invoice template and you control the format, that script will probably hold. The real problem appears the moment you have invoices from 30 different suppliers, each using a different layout, font, table structure, and occasionally a different currency format. That’s when a script becomes a pipeline — or it becomes a maintenance burden.
What you’re actually trying to extract
Before touching any document, define the schema. This is the schema-first extraction principle, and it matters more than any library choice or model you pick later.
The invoice schema you’re working toward typically includes: vendor name, vendor address, invoice number, invoice date, due date, line items (each with description, quantity, unit price, and line total), subtotal, tax amount (sometimes multiple), total amount due, payment terms, and PO number if present.
Defining this up front does two things. It tells you exactly what to look for — which focuses your extraction logic — and it gives you a validation target at the end. If the extracted data doesn’t satisfy the schema, you know before it reaches your accounting system.
Most extraction projects that go wrong skip this step. They start by opening a PDF with pdfplumber, pulling out text, and figuring out what they can grab. The schema ends up implicitly defined by whatever the first supplier invoice happened to contain. That causes problems the moment supplier two shows up.
Model your schema as a Pydantic class. Every field gets a type, optional fields are marked explicitly, and line items are a typed list. This is also the input contract for your document extraction pipeline — everything downstream expects this shape.
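A minimal sketch of that schema in Pydantic (field names and types here are illustrative, not prescribed — adapt them to your supplier mix):

```python
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel


class LineItem(BaseModel):
    description: str
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal


class TaxLine(BaseModel):
    label: str       # e.g. "VAT", "GST"
    rate: Decimal    # percentage, e.g. Decimal("20")
    amount: Decimal


class Invoice(BaseModel):
    vendor_name: str
    vendor_address: Optional[str] = None
    invoice_number: str
    invoice_date: date
    due_date: Optional[date] = None
    line_items: list[LineItem]
    subtotal: Decimal
    taxes: list[TaxLine] = []    # a list, because some invoices carry multiple rates
    total: Decimal
    payment_terms: Optional[str] = None
    po_number: Optional[str] = None
```

Two design choices worth noting: money fields are `Decimal`, not `float`, so you don't introduce binary rounding errors before validation; and tax is a typed list rather than a single number, which matters later for multi-rate invoices.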
The basic approach (and where it breaks)
For a clean, text-based PDF invoice from a single known supplier, pdfplumber works well. You open the PDF, extract text by page, run a few regex patterns against it — one for invoice number, one for the date, one for the total — and you’re done in an afternoon. For table-based line items, pdfplumber’s extract_table() gives you rows and columns that map directly to your schema.
The problems start with variation. Even within a single supplier, invoices change: they update their template, add a new fee line, reformat the tax section. Across suppliers, the variation is substantial.
Layout variation is the most common issue. “Invoice Number” might appear top-right on one supplier’s invoice and below the header on another. Total might be labelled “Total Due”, “Amount Payable”, “Grand Total”, or just sit in the bottom-right cell of a table with no label. Regex patterns anchored to specific strings break when the label changes.
Multi-page invoices break naive extraction immediately. Line items that span pages require you to concatenate correctly — pdfplumber gives you one page at a time, and a table that starts on page one and continues on page two needs explicit handling.
Scanned invoices bypass text-based extraction entirely. pdfplumber and PyMuPDF read the PDF’s text layer. If the invoice was scanned and no OCR was run, you get nothing or garbage. You need an OCR step before extraction — Tesseract via pytesseract, or a cloud OCR API depending on your volume and accuracy requirements.
Tables with merged cells — common in more complex invoice formats — confuse table extraction algorithms. The coordinates don’t align cleanly, and you end up with misaligned columns or missing cells. See the comparison of PDF extraction libraries for where each library handles this differently.
None of these are edge cases. They’re routine when you’re processing invoices from multiple suppliers at scale.
The production approach: rules first, LLMs selectively
The mistake is jumping straight to an LLM because the regex approach feels fragile. LLMs introduce their own problems — latency, cost, hallucination on numeric fields, and outputs that are hard to validate. The right architecture uses them selectively.
Layer 1: Rules-based extraction. For fields that appear consistently — invoice number always follows “Invoice #:”, total always in the bottom-right cell of the summary table — write deterministic extractors. Regex with named groups, coordinate-based extraction with PyMuPDF, or pdfplumber’s table extraction for structured line items. This layer is fast, predictable, and auditable. In a reasonably consistent supplier base, it handles 70-80% of invoices cleanly.
Each field extracted at this layer gets a confidence score. A regex match with a specific pattern against a known field label gets a high score. A fallback extraction from an unexpected position gets a lower one. Confidence isn’t binary — it’s a signal you use downstream.
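A sketch of what per-field confidence looks like in practice — a labelled match scores high, a pattern fallback scores lower, and the source of the match travels with the value (patterns and scores here are illustrative):

```python
import re
from typing import NamedTuple, Optional


class FieldResult(NamedTuple):
    value: Optional[str]
    confidence: float
    source: str


# Hypothetical patterns, ordered strongest first.
PRIMARY = re.compile(r"Invoice\s*#:\s*(\S+)")      # exact known label
FALLBACK = re.compile(r"\b(INV-\d{3,})\b")         # bare pattern, anywhere


def extract_invoice_number(text: str) -> FieldResult:
    m = PRIMARY.search(text)
    if m:
        return FieldResult(m.group(1), 0.95, "label_match")
    m = FALLBACK.search(text)
    if m:
        return FieldResult(m.group(1), 0.6, "pattern_fallback")
    return FieldResult(None, 0.0, "not_found")
```

Carrying `source` alongside the score is cheap and pays off in review: a human can see *why* a field was trusted, not just how much.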
Layer 2: LLM extraction. For the 20-30% where layout variation makes rules insufficient — the supplier uses an unusual format, fields are embedded in running text, or the table structure doesn’t parse cleanly — pass the page text or image to an LLM. The output must be structured: a Pydantic model, not a freeform string. You’re asking the model to fill your schema, not summarise the invoice. See schema-first PDF extraction with Pydantic for how to structure this prompt and parse the response reliably.
Every field the LLM returns also gets a confidence score. LLM confidence isn’t the model’s self-reported certainty — it’s a signal derived from consistency checks: does the extracted total match the sum of line items? Does the date parse correctly? Is the vendor name in your known supplier list?
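A sketch of deriving that score from cross-checks rather than from the model (penalty weights and field names are illustrative, and assume the LLM output has already been parsed into a dict):

```python
from decimal import Decimal


def llm_confidence(extracted: dict, known_vendors: set[str]) -> float:
    """Score an LLM extraction by checking it for internal consistency,
    not by asking the model how sure it is."""
    score = 1.0
    items_sum = sum(Decimal(i["line_total"]) for i in extracted["line_items"])
    if items_sum != Decimal(extracted["subtotal"]):
        score -= 0.4    # line items don't sum to the stated subtotal
    if Decimal(extracted["subtotal"]) + Decimal(extracted["tax"]) != Decimal(extracted["total"]):
        score -= 0.4    # subtotal + tax doesn't reconcile with total
    if extracted["vendor_name"] not in known_vendors:
        score -= 0.2    # supplier not in the known registry
    return max(score, 0.0)
```

The arithmetic checks are the important ones: a hallucinated total almost never reconciles with hallucinated line items, so reconciliation failure is a strong signal to route the record to review.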
Layer 3: Human review. Extractions below your confidence threshold go to a review queue. A human checks the flagged fields against the original document and corrects them before the record proceeds. This is human-in-the-loop processing — not a fallback for when the system fails, but a designed part of the architecture. Nothing fails silently.
Handling the hard cases
Multi-page line items require tracking state across pages. When you detect that a table continues to the next page (usually via a “continued” label or by checking whether the table ends before the page does), you concatenate the row sets before parsing. Keep the raw page boundaries in your extraction metadata — you’ll want them when something goes wrong.
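The merge itself can be a small pure function over per-page row sets, such as those returned by `extract_table()`. This sketch assumes each continuation page repeats the header row — a common pattern, but verify it against your own supplier documents:

```python
def merge_table_rows(pages: list[list[list[str]]]) -> list[list[str]]:
    """Concatenate rows extracted per page into one table, dropping
    repeated header rows on continuation pages."""
    if not pages:
        return []
    header = pages[0][0]
    rows = list(pages[0])
    for page in pages[1:]:
        # Drop the header if this page repeats it; otherwise keep all rows.
        body = page[1:] if page and page[0] == header else page
        rows.extend(body)
    return rows
```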
Scanned invoices need an OCR layer before any text extraction runs. The pipeline detects whether a PDF has a text layer by attempting extraction and checking whether the result is substantively empty or garbled. If it is, route to OCR. The OCR output feeds the same extraction logic as text PDFs, but expect lower confidence scores — OCR introduces its own errors, particularly on tabular data and small fonts.
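Detection can be a cheap heuristic over the attempted text extraction — too few characters, or mostly non-alphanumeric garbage, means no usable text layer. Thresholds here are illustrative starting points; pages that fail the check get routed to your OCR step (pytesseract, or a cloud API):

```python
def has_usable_text_layer(page_text: str, min_chars: int = 50,
                          min_alnum_ratio: float = 0.5) -> bool:
    """Heuristic: a scanned page yields little or no text from the PDF's
    text layer, or text dominated by non-alphanumeric noise."""
    stripped = "".join(page_text.split())
    if len(stripped) < min_chars:
        return False
    alnum = sum(c.isalnum() for c in stripped)
    return alnum / len(stripped) >= min_alnum_ratio
```

Tune `min_chars` per page type — a sparse cover page can legitimately carry little text, which is one reason to run the check per page rather than per document.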
Non-standard tax formats are common internationally. UK invoices show VAT, Australian invoices show GST, some European invoices break down multiple tax rates. Your line item schema needs to handle tax as a structured field — rate, label, amount — not a single number. Assuming a single “tax” field causes silent data loss when a supplier breaks it into components.
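A sketch of parsing a tax breakdown into structured components rather than a single number (the label set and regex are illustrative — extend them for the jurisdictions you actually see):

```python
import re
from decimal import Decimal

TAX_LINE = re.compile(
    r"(?P<label>VAT|GST|Sales Tax)"        # hypothetical label set
    r"\s*@?\s*(?P<rate>\d+(?:\.\d+)?)\s*%"  # rate, e.g. "20%" or "@ 20%"
    r"\s*:?\s*[£$€]?(?P<amount>[\d.,]+)"    # amount, optional currency symbol
)


def parse_tax_lines(text: str) -> list[dict]:
    """Capture each tax component as {label, rate, amount}."""
    return [
        {
            "label": m["label"],
            "rate": Decimal(m["rate"]),
            "amount": Decimal(m["amount"].replace(",", "")),
        }
        for m in TAX_LINE.finditer(text)
    ]
```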
Currency and number formatting varies. European invoices often use period as a thousands separator and comma as a decimal separator — 1.234,56 rather than 1,234.56. Parse numbers carefully: strip currency symbols, detect the format convention from context, and convert to a canonical float before storing. Getting this wrong produces values that are off by a factor of a thousand and may not fail any downstream validation.
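A sketch of a format-aware number parser, using the heuristic that the rightmost separator is the decimal mark. This mis-reads genuinely ambiguous values (a bare `1.234` could be either convention), so keep the raw string in your extraction metadata for review:

```python
import re
from decimal import Decimal


def parse_amount(raw: str) -> Decimal:
    """Normalise an amount that may use either the 1,234.56 (US/UK) or
    1.234,56 (much of Europe) convention."""
    s = re.sub(r"[^\d.,\-]", "", raw)  # strip currency symbols and spaces
    last_dot, last_comma = s.rfind("."), s.rfind(",")
    if last_comma > last_dot:
        # Comma is the decimal mark: drop dots (thousands), swap comma for dot.
        s = s.replace(".", "").replace(",", ".")
    else:
        s = s.replace(",", "")
    return Decimal(s)
```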
Output schema and validation
Extraction produces a populated Pydantic model. Before that record moves anywhere, validate it.
Cross-check: do the line item totals sum to the subtotal? Does subtotal plus tax equal the total amount? If not, flag it — either an extraction error or a legitimate discount or rounding difference that should be reviewed. Does the invoice date precede the due date? Is the invoice number format consistent with what this supplier normally produces? Is the vendor name or VAT number in your known supplier registry?
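The arithmetic checks can be a small pure function that returns anomaly flags rather than raising, so the record can carry its flags into the review queue (field names and the rounding tolerance are illustrative):

```python
from datetime import date
from decimal import Decimal


def validate_invoice(inv: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Cheap cross-checks; returns a list of anomaly flags (empty = clean)."""
    flags = []
    items_sum = sum(Decimal(i["line_total"]) for i in inv["line_items"])
    if abs(items_sum - Decimal(inv["subtotal"])) > tolerance:
        flags.append("line_items_vs_subtotal")
    if abs(Decimal(inv["subtotal"]) + Decimal(inv["tax"]) - Decimal(inv["total"])) > tolerance:
        flags.append("subtotal_plus_tax_vs_total")
    if inv.get("due_date") and inv["invoice_date"] > inv["due_date"]:
        flags.append("invoice_date_after_due_date")
    return flags
```

The small tolerance absorbs legitimate rounding in per-line tax calculations; anything beyond it is either an extraction error or a discount worth a human look.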
These checks are cheap to run and catch a significant proportion of extraction errors before they reach your accounting system. Flag anomalies — don’t silently pass them downstream and don’t silently drop them. Every anomaly either gets resolved in human review or it reveals a pattern worth addressing in the extraction logic.
What production looks like
A working invoice pipeline ingests PDFs from email attachments, SFTP drops, or a document upload API. Each PDF enters the extraction pipeline, gets classified (is this an invoice? which supplier?), and runs through the layered extraction logic. The result is a validated Pydantic object — or a flagged record routed to a review interface with the original document alongside the extracted fields.
Reviewed records feed back into the pipeline as training signal: patterns that consistently required human correction get translated into new rules or few-shot examples for the LLM layer. The pipeline improves over time, but not automatically — someone has to look at the corrections and decide what they mean.
Validated records are pushed to your accounting system or ERP via API: Xero, QuickBooks, SAP, whatever your stack is. The output format is defined by your schema. The integration is straightforward once the data is clean.
The engineering work is in the edge cases, not the happy path. The happy path — a clean PDF from a familiar supplier — works on day one. What takes time is building the handling for the scanned invoice with rotated text, the supplier who changed their template without notice, and the line item that spans three pages. That’s the pipeline, not the script.
If you’re building an invoice extraction pipeline and hitting the limits of a script-based approach, a short diagnostic session can identify where the gaps are and what the right architecture looks like for your supplier mix and volume.
Book a Diagnostic Session →