Case Study: Lab report extraction for a UK water consultancy
The situation
A UK water consultancy receives lab analysis results as PDF reports from multiple testing laboratories. Each lab uses a different format — over 10 distinct layouts, and the templates drift over time without warning.
Every morning, someone on the team opened each PDF, read the tables, and typed the results into their own Excel templates. Slow, error-prone, and completely dependent on one person knowing which numbers went where.
They’d tried an extraction script before I got involved. It worked for one lab’s format, then broke the moment a template changed.
Why it’s hard
Lab A puts test results in a vertical table. Lab B uses a horizontal layout. Lab C splits results across two pages. A script built for one format fails on the others.
Naming is inconsistent too. What one lab calls “Total Hardness,” another labels “Hardness (as CaCO3).” You can’t match on field names.
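One common way to handle this is an alias table that maps every label a lab might print to a single canonical field name. The sketch below is illustrative rather than the production code: `ALIASES`, `canonical_name` and the pH entries are hypothetical, and only the two hardness labels come from the reports described above.

```python
import re
from typing import Optional

# Hypothetical alias table: canonical field name -> labels seen on reports.
# Only the two hardness labels come from the text; the pH ones are made up.
ALIASES = {
    "total_hardness": ["Total Hardness", "Hardness (as CaCO3)"],
    "ph": ["pH", "pH Value"],
}


def _normalise(label: str) -> str:
    """Lowercase the label and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


# Reverse lookup built once: normalised label -> canonical name.
_LOOKUP = {
    _normalise(label): canonical
    for canonical, labels in ALIASES.items()
    for label in labels
}


def canonical_name(label: str) -> Optional[str]:
    """Return the canonical field name, or None if the label is unknown."""
    return _LOOKUP.get(_normalise(label))
```

Returning `None` for an unknown label, rather than guessing, means a new or renamed parameter can be surfaced for review instead of landing in the wrong column.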
And labs update their report templates without telling anyone. A pipeline that worked yesterday stops working today, and nobody knows until the data is already wrong in the spreadsheet.
The previous script had no way to flag when something went wrong. Bad data just passed through silently.
What I built
A single extraction pipeline that handles all the layout variations. I started with a Pydantic schema that defined exactly which fields to extract, before writing any extraction code. Every output is validated against that schema.
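To give a feel for the schema-first approach, here is a minimal sketch of what such a schema might look like. The model names, field names and the `parse_report` helper are assumptions made for illustration, not the client's actual schema.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class LabResult(BaseModel):
    """One measured parameter from a report (field names are illustrative)."""
    parameter: str
    value: float                                 # must parse as a number
    unit: str
    confidence: float = Field(ge=0.0, le=1.0)    # score must lie in [0, 1]


class LabReport(BaseModel):
    """Everything extracted from a single PDF (hypothetical shape)."""
    lab_name: str
    sample_id: str
    results: List[LabResult]


def parse_report(raw: dict) -> Optional[LabReport]:
    """Validate raw extractor output; return None instead of bad data."""
    try:
        return LabReport(**raw)
    except ValidationError:
        return None
```

Because the schema exists before any extraction code, every extractor has the same target to hit, and a result such as a non-numeric `value` is rejected at the boundary rather than written to the spreadsheet.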
For predictable fields, I used deterministic rules. LLMs only came in for fields where layout variation made rules too brittle. Every extracted field gets a confidence score. Low-confidence results get flagged for human review instead of passing through quietly.
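The routing between rules and LLM calls, and the review flag, can be sketched roughly as follows. Everything named here is hypothetical: the `REVIEW_THRESHOLD` value, the "Sample ID" pattern, and the idea that an LLM-backed extractor is just another callable returning a value and a score. The real pipeline's scoring is not shown.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

REVIEW_THRESHOLD = 0.8  # assumed cut-off, not a figure from the project

# An extractor takes the report text and returns (value, confidence).
Extractor = Callable[[str], Tuple[Optional[str], float]]


@dataclass
class Extraction:
    field: str
    value: Optional[str]
    confidence: float

    @property
    def needs_review(self) -> bool:
        # Missing values are always reviewed, whatever score came back.
        return self.value is None or self.confidence < REVIEW_THRESHOLD


def rule_sample_id(text: str) -> Tuple[Optional[str], float]:
    """Deterministic rule for a hypothetical 'Sample ID: AB-12345' line."""
    match = re.search(r"Sample ID:\s*([A-Z]{2}-\d{5})", text)
    return (match.group(1), 1.0) if match else (None, 0.0)


def extract_fields(text: str, extractors: Dict[str, Extractor]) -> List[Extraction]:
    """Run each field's extractor (rule or LLM-backed) and keep its score."""
    return [Extraction(name, *fn(text)) for name, fn in extractors.items()]


def review_queue(extractions: List[Extraction]) -> List[Extraction]:
    """Everything a human should look at before it reaches the spreadsheet."""
    return [e for e in extractions if e.needs_review]
```

Keeping rules and LLM extractors behind the same signature means a field can move from one to the other without touching the review logic.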
In year two, I added email integration. The system reads the inbox every morning, finds the lab report attachments, extracts the data, and sends structured results back. The team stopped downloading and uploading PDFs entirely.
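The attachment-finding step can be illustrated with Python's standard `email` package. This is a sketch under assumptions: it takes the raw bytes of one message (however they were fetched from the mailbox) and the function name `pdf_attachments` is hypothetical. Connecting to the inbox and sending results back are left out.

```python
from email import message_from_bytes, policy
from typing import List, Tuple


def pdf_attachments(raw_message: bytes) -> List[Tuple[str, bytes]]:
    """Return (filename, content) for every PDF attached to one email."""
    # policy.default gives an EmailMessage, which supports iter_attachments().
    msg = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        filename = part.get_filename() or ""
        is_pdf = (
            part.get_content_type() == "application/pdf"
            or filename.lower().endswith(".pdf")
        )
        if is_pdf:
            # decode=True undoes the transfer encoding (typically base64).
            found.append((filename, part.get_payload(decode=True)))
    return found
```

Checking both the content type and the file extension is a deliberate choice here, since some mail clients send PDFs as generic `application/octet-stream`.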
Results
Manual data entry dropped by 75% from day one. The engagement ran for two years on retainer, during which the system became central to daily operations and expanded as new labs came on board. One pipeline handles all 10+ layout variations.
“Thanks to Subhajit’s work, we are saving countless hours having to manually enter results into our own template.”
“He always delivers great results. His communication is clear, he takes feedback well, and he consistently finds effective solutions.”
Similar problem?
If your team is manually pulling data from PDFs and you’ve tried tools that only work some of the time, I’m happy to look at your documents.