Purchase order extraction appears simple on the surface. A PO has a number, a date, a list of line items, and a total. If your business receives POs only from customers who use a single consistent format, a script handles it fine.
Lab reports are among the harder document types for automated extraction. They come from multiple testing laboratories, each with a proprietary format built around their own LIMS software, reporting preferences, and historical conventions. The same parameter — say, nitrate concentration — might appear in a column headed “NO3-N (mg/L)”, “Nitrate as N”, or “NO₃⁻” depending on which lab issued the report. The value might be in a structured table, a semi-structured list, or embedded in narrative text alongside method references and QA annotations. A pipeline that works reliably on one laboratory’s reports needs to be explicitly designed and tested against each additional format. That’s not a limitation of the approach — it’s the nature of the domain.
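One way to keep that per-laboratory variation manageable is an explicit alias table that maps every header variant seen so far to a single canonical parameter name. The sketch below is only an illustration: the names `PARAMETER_ALIASES` and `canonical_parameter` are hypothetical, and the three aliases are the ones mentioned above. A real pipeline would probably also need to record the unit and reporting basis alongside the name.

```python
from typing import Optional


def _normalise(header: str) -> str:
    """Collapse whitespace and ignore case so trivial variations match."""
    return " ".join(header.split()).casefold()


# Hypothetical alias table: header text as printed by each lab -> canonical name.
# Supporting a new lab format means adding entries here and testing them.
_RAW_ALIASES = {
    "NO3-N (mg/L)": "nitrate",
    "Nitrate as N": "nitrate",
    "NO₃⁻": "nitrate",
}
PARAMETER_ALIASES = {_normalise(k): v for k, v in _RAW_ALIASES.items()}


def canonical_parameter(header: str) -> Optional[str]:
    """Return the canonical parameter name, or None for an unknown header."""
    return PARAMETER_ALIASES.get(_normalise(header))
```

Returning `None` for an unknown header, rather than guessing, makes an unseen lab format show up as an explicit gap instead of a silently wrong value.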
Extracting vendor name, invoice number, date, line items, and total from a single consistent invoice format is a few lines of pdfplumber. If your company uses one internal invoice template and you control the format, that script will probably hold. The real problem appears the moment you have invoices from 30 different suppliers, each using a different layout, font, table structure, and occasionally a different currency format. That’s when a script becomes a pipeline — or it becomes a maintenance burden.
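For reference, the single-format script might look roughly like the sketch below. It assumes a hypothetical template that prints labels such as "Invoice No:", "Date:" and "Total:" with ISO dates. The patterns and function names are illustrative, not taken from any real supplier format. The pdfplumber import sits inside the function so the parsing logic can be exercised without a PDF.

```python
import re
from typing import Dict, Optional

# Patterns for ONE assumed layout; a second supplier will likely break them.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*(\S+)", re.I),
    "date": re.compile(r"\bDate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    # \b keeps "Subtotal" from matching the "Total" pattern.
    "total": re.compile(r"\bTotal\s*:?\s*([\d,]+\.\d{2})", re.I),
}


def parse_invoice_text(text: str) -> Dict[str, Optional[str]]:
    """Apply each pattern to the page text; missing fields come back as None."""
    result: Dict[str, Optional[str]] = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


def extract_invoice(path: str) -> Dict[str, Optional[str]]:
    """Read the first page with pdfplumber and parse its text."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(path) as pdf:
        text = pdf.pages[0].extract_text() or ""
    return parse_invoice_text(text)
```

Every regular expression here encodes an assumption about one layout, which is why the approach stops scaling once the supplier count grows.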
The script works perfectly on the sample document. You tested it on twenty invoices from your main supplier and it extracted every field correctly. Then the first real batch arrives — invoices from six different suppliers — and half of them fail.
If pdfplumber returns empty strings or None on pages that clearly have content, stop before writing more extraction code. The problem almost certainly isn’t your code — it’s the PDF type.
Scanned PDFs are images wrapped in a PDF container. There is no underlying text layer. Only pixels. Every Python PDF library that operates on text — pdfplumber, PyPDF2, even PyMuPDF in text mode — will return nothing useful, because there is nothing to return. You need OCR before any extraction can happen.
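A quick check for this case can run before any extraction logic. The sketch below treats a document as probably scanned when no page yields more than a handful of non-whitespace characters. The threshold of 20 and the function names are assumptions chosen for illustration, not values from pdfplumber itself.

```python
from typing import List


def is_probably_scanned(page_texts: List[str], min_chars: int = 20) -> bool:
    """True if the document has pages and none of them has a usable text layer."""
    if not page_texts:
        return False  # no pages at all: nothing to conclude

    def visible_chars(text: str) -> int:
        return len("".join(text.split()))

    return all(visible_chars(text) < min_chars for text in page_texts)


def page_texts_from_pdf(path: str) -> List[str]:
    """Collect the text layer of every page; empty string where there is none."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]
```

If `is_probably_scanned(page_texts_from_pdf(path))` returns `True`, the document would go to an OCR step first rather than to the text-based extractor.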
The naive approach is obvious: take a document, pass it to an LLM, ask for the data you want. It works on clean examples. Ask GPT-4 to extract invoice fields from a well-formatted PDF and you get a clean JSON response that looks exactly right.
Most PDF extraction projects start with the document. You open a PDF, look at the text, write a regex, extract a value. Repeat for each field.
This works for one document type with one layout. It doesn’t scale.
Schema-first extraction inverts the order: define exactly what the output should look like before you write a single line of extraction code. The schema becomes the specification that every extraction function has to satisfy — and the tool that tells you, immediately and explicitly, when an extraction fails.
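As a minimal sketch of the idea, the schema can be written as plain dataclasses whose constructor refuses invalid records. The field names and the rule that the total must equal the sum of the line items are assumptions made for illustration. A real schema would carry whatever fields and checks the business needs.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Tuple


@dataclass(frozen=True)
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal


@dataclass(frozen=True)
class Invoice:
    invoice_number: str
    issued_on: date
    line_items: Tuple[LineItem, ...]
    total: Decimal

    def __post_init__(self) -> None:
        # Fail loudly at construction time instead of passing bad data on.
        if not self.invoice_number.strip():
            raise ValueError("invoice_number is empty")
        if not self.line_items:
            raise ValueError("invoice has no line items")
        computed = sum(
            (item.quantity * item.unit_price for item in self.line_items),
            Decimal("0"),
        )
        if computed != self.total:
            raise ValueError(
                f"total {self.total} does not match line items {computed}"
            )
```

Any extraction function, whatever library it uses, then has one job: produce arguments that `Invoice(...)` accepts. A `ValueError` at that point is the explicit failure signal described above.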
If you’re extracting data from PDFs in Python, you’ll encounter three libraries repeatedly: pdfplumber, PyMuPDF (imported as fitz), and PyPDF2. They overlap in capability but differ in what they’re optimised for.
Picking the wrong one costs time. Here’s how to pick the right one.
Extracting tables from PDFs is one of the most common requirements in document automation and one of the most reliable ways to introduce subtle errors if you do it carelessly.
This guide covers table extraction with pdfplumber, one of the most capable Python libraries for the job, including how it works, when it works, and what to do when it doesn’t.
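As a starting point, the sketch below pulls the tables from one page with pdfplumber's `extract_tables()` and turns the first one into a list of dictionaries. It assumes the first row of the table is the header row, which will not hold for every document. The helper names are hypothetical.

```python
from typing import Dict, List, Optional

Table = List[List[Optional[str]]]


def _clean(cell: Optional[str]) -> str:
    """pdfplumber yields None for empty cells and keeps in-cell line breaks."""
    return (cell or "").replace("\n", " ").strip()


def rows_to_records(table: Table) -> List[Dict[str, str]]:
    """Treat the first row as headers (an assumption) and skip blank rows."""
    if not table:
        return []
    header = [_clean(cell) for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [_clean(cell) for cell in row]
        if any(cells):
            records.append(dict(zip(header, cells)))
    return records


def extract_first_table(path: str, page_index: int = 0) -> List[Dict[str, str]]:
    """Return the first table found on the given page, or [] if there is none."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(path) as pdf:
        tables = pdf.pages[page_index].extract_tables()
    return rows_to_records(tables[0]) if tables else []
```

An empty result does not prove the page has no table. It only means the default detection settings found no table, so it is worth checking the page visually before trusting it.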