Python

Lab Report Data Extraction with Python

·2239 words·11 mins
Lab reports are among the harder document types for automated extraction. They come from multiple testing laboratories, each with a proprietary format built around its own LIMS software, reporting preferences, and historical conventions. The same parameter (say, nitrate concentration) might appear in a column headed “NO3-N (mg/L)”, “Nitrate as N”, or “NO₃⁻” depending on which lab issued the report. The value might be in a structured table, a semi-structured list, or embedded in narrative text alongside method references and QA annotations. A pipeline that works reliably on one laboratory’s reports needs to be explicitly designed and tested against each additional format. That’s not a limitation of the approach; it’s the nature of the domain.
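
One way to handle the varying column headers is an explicit alias table. The following is a minimal sketch, not a complete solution: the `PARAMETER_ALIASES` table and the function names are hypothetical, and real reports would need many more entries.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical alias table: normalized header text -> canonical parameter name.
# It only resolves naming. Whether a value is reported on an "as N" basis
# is a separate question this sketch does not address.
PARAMETER_ALIASES = {
    "no3-n": "nitrate",
    "nitrate as n": "nitrate",
    "no3-": "nitrate",
}


def normalize_label(raw: str) -> str:
    """Reduce a column header to a comparable form."""
    # NFKC folds subscript/superscript characters to their plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # The superscript minus becomes U+2212 after NFKC; map it to ASCII "-".
    text = text.replace("\u2212", "-")
    # Drop a parenthesised unit such as "(mg/L)".
    text = re.sub(r"\([^)]*\)", "", text)
    # Collapse whitespace and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()


def canonical_parameter(raw: str) -> Optional[str]:
    """Return the canonical name, or None if the header is unknown."""
    return PARAMETER_ALIASES.get(normalize_label(raw))
```

Returning `None` for an unknown header, instead of guessing, keeps new lab formats visible as explicit gaps in the alias table.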

Invoice Data Extraction with Python: From Script to Production Pipeline

·1616 words·8 mins
Extracting vendor name, invoice number, date, line items, and total from a single consistent invoice format is a few lines of pdfplumber. If your company uses one internal invoice template and you control the format, that script will probably hold. The real problem appears the moment you have invoices from 30 different suppliers, each using a different layout, font, table structure, and occasionally a different currency format. That’s when a script becomes a pipeline — or it becomes a maintenance burden.
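
Currency formats are a small, concrete instance of that variation. Here is a sketch of one possible normalizer, assuming amounts arrive as strings such as "1,234.56" or "1.234,56"; the function name `parse_amount` and the separator heuristic are illustrative assumptions.

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse an amount written with either "." or "," as decimal separator."""
    # Keep only digits, separators and the sign; drops symbols and spaces.
    cleaned = re.sub(r"[^\d.,-]", "", text)
    last_dot = cleaned.rfind(".")
    last_comma = cleaned.rfind(",")

    if last_dot != -1 and last_comma != -1:
        # Both present: whichever appears last is the decimal separator.
        if last_comma > last_dot:
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif last_comma != -1:
        digits_after = len(cleaned) - last_comma - 1
        if cleaned.count(",") == 1 and digits_after != 3:
            # A single comma not followed by three digits: decimal comma.
            cleaned = cleaned.replace(",", ".")
        else:
            # Otherwise treat commas as thousands separators.
            cleaned = cleaned.replace(",", "")
    elif cleaned.count(".") > 1:
        # Several dots and no comma: dots are thousands separators.
        cleaned = cleaned.replace(".", "")

    # Raises decimal.InvalidOperation if nothing numeric is left.
    return Decimal(cleaned)
```

The heuristic is ambiguous for inputs like "1.234", which this sketch reads as a decimal value. A per-supplier format setting would be more reliable than guessing.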

Extract Data from Scanned PDFs with Python

·1337 words·7 mins
If pdfplumber returns empty strings or None on pages that clearly have content, stop before writing more extraction code. The problem almost certainly isn’t your code — it’s the PDF type. Scanned PDFs are images wrapped in a PDF container. There is no underlying text layer. Only pixels. Every Python PDF library that operates on text — pdfplumber, PyPDF2, even PyMuPDF in text mode — will return nothing useful, because there is nothing to return. You need OCR before any extraction can happen.
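
That check can be automated before any extraction logic runs. The sketch below assumes you have already collected the text of each page, for example with pdfplumber's `page.extract_text()`; the function name and the `min_chars` threshold are hypothetical and would need tuning.

```python
from typing import List, Optional, Sequence


def pages_needing_ocr(
    page_texts: Sequence[Optional[str]], min_chars: int = 20
) -> List[int]:
    """Return the zero-based indexes of pages with no usable text layer."""
    flagged = []
    for index, text in enumerate(page_texts):
        # None, an empty string, or whitespace only all mean "nothing here".
        if text is None or len(text.strip()) < min_chars:
            flagged.append(index)
    return flagged
```

With pdfplumber, the input could be built as `[page.extract_text() for page in pdf.pages]` inside a `with pdfplumber.open(path) as pdf:` block. Pages in the returned list go to OCR and the rest go straight to text extraction.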

Schema-First PDF Extraction in Python with Pydantic

·1186 words·6 mins
Most PDF extraction projects start with the document. You open a PDF, look at the text, write a regex, extract a value. Repeat for each field. This works for one document type with one layout. It doesn’t scale. Schema-first extraction inverts the order: define exactly what the output should look like before you write a single line of extraction code. The schema becomes the specification that every extraction function has to satisfy — and the tool that tells you, immediately and explicitly, when an extraction fails.
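
As a rough illustration of the idea, here is a minimal Pydantic schema with a validation wrapper. The `LabResult` model, its fields, and `validate_extraction` are hypothetical examples rather than the schema from the article.

```python
from datetime import date
from typing import Any, Dict, List, Optional, Tuple

from pydantic import BaseModel, Field, ValidationError


class LabResult(BaseModel):
    """Hypothetical target schema, written before any extraction code."""

    parameter: str = Field(min_length=1)
    value: float = Field(ge=0)
    unit: str = Field(min_length=1)
    sample_date: date


def validate_extraction(
    raw: Dict[str, Any],
) -> Tuple[Optional[LabResult], List[str]]:
    """Return (record, []) on success or (None, failing field names)."""
    try:
        return LabResult(**raw), []
    except ValidationError as exc:
        # Each error carries the location of the offending field.
        failed = [".".join(str(part) for part in err["loc"]) for err in exc.errors()]
        return None, failed
```

Whatever an extraction function produces, it passes through `validate_extraction`, so a missing or malformed field surfaces as a named failure rather than as a silently wrong value.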

pdfplumber vs PyMuPDF vs PyPDF2 for PDF Extraction

·875 words·5 mins
If you’re extracting data from PDFs in Python, you’ll encounter three libraries repeatedly: pdfplumber, PyMuPDF (imported as fitz), and PyPDF2. They overlap in capability but differ in what they’re optimised for. Picking the wrong one costs time. Here’s how to pick the right one.

Extracting Tables from PDFs in Python: The Complete Guide

·1073 words·6 mins
Extracting tables from PDFs is one of the most common requirements in document automation and one of the most reliable ways to introduce subtle errors if you do it carelessly. This guide covers table extraction with pdfplumber — the most capable Python library for this — including how it works, when it works, and what to do when it doesn’t.
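
One place subtle errors creep in is the step after extraction. pdfplumber's `page.extract_tables()` returns each table as a list of rows, where a cell is a string or `None`. The sketch below, with a hypothetical helper name, turns such a table into records and reports the rows it refuses to guess about.

```python
from typing import Dict, List, Optional, Tuple

Row = List[Optional[str]]


def table_to_records(table: List[Row]) -> Tuple[List[Dict[str, str]], List[int]]:
    """Convert a header-first table into dicts; return rejected row indexes too."""
    if not table:
        return [], []

    # The first row is assumed to be the header.
    header = [(cell or "").strip() for cell in table[0]]
    records: List[Dict[str, str]] = []
    rejected: List[int] = []

    for row_index, row in enumerate(table[1:], start=1):
        # Empty cells come back as None; wrapped cells contain newlines.
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        # Reject rows with the wrong column count or no content at all.
        if len(cells) != len(header) or not any(cells):
            rejected.append(row_index)
            continue
        records.append(dict(zip(header, cells)))

    return records, rejected
```

Rejected row indexes are returned instead of being dropped silently, so a misaligned table shows up as a reviewable list.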