
Certificate of Analysis Data Extraction: A Production Guide

·1701 words·8 mins·
Subhajit Bhar
I build production-grade document extraction pipelines for businesses that process invoices, lab reports, contracts, and other document types at scale.

A certificate of analysis (CoA) is one of the most information-dense documents in regulated industries. It carries test results, method references, accreditation details, chain-of-custody information, and the laboratory’s sign-off — all in a format designed for human reading, not machine parsing.

Extracting structured data from CoAs reliably is harder than it looks. The document type is consistent in what it contains, but not in how it presents that information. Manual extraction works at low volume. It breaks down the moment the volume grows or the number of issuing laboratories increases.


What a CoA contains and why it matters

A standard certificate of analysis includes:

Results table — the test parameters, their measured values, units, and the applicable limits or specifications. This is the data most systems need. It’s also the section most prone to layout variation.

Method references — the test method used for each parameter (e.g., ISO 17294-2, EPA 200.8, BS EN 1484). Relevant for audit trails and compliance documentation.

Accreditation details — which accreditation body covers the laboratory, the scope reference, and the specific tests included in the scope. For regulated industries, using results from outside the accreditation scope is a compliance risk.

Sample information — sample ID, sampling date, receipt date, analysis date, and the sample matrix (water, soil, air, biological material). These need to stay linked to the results, not just extracted as free-floating values.

Detection limits — the minimum level the method can reliably detect. A result reported as < 0.01 requires the detection limit to interpret it correctly as a non-detect, not as a value of 0.01.

Signatory and issue date — who signed the certificate and when. Required for chain-of-custody documentation.

When extraction misses or misinterprets any of these, the error propagates into the downstream record that feeds compliance submissions, quality assurance databases, or production release decisions.


Why CoA extraction is harder than invoice extraction

Invoice extraction is a well-understood problem. The variation is in layout, but the semantics are consistent — every invoice has a total, a date, a vendor. The values are usually easy to identify once the layout is known.

CoAs are more complex for several reasons:

Results are relational, not flat. Every extracted result value must stay associated with its parameter name, unit, detection limit, and sample ID. If those associations break — because the table was parsed with misaligned columns — you get orphaned values or, worse, values assigned to the wrong parameter.

Non-detect values require special handling. A result below the detection limit isn’t zero. It’s a non-detect, typically reported as < DL where DL is the detection limit. Simple numeric extraction misses this entirely. The system needs to handle <0.01, ND, BDL, < DL, Not detected, and < LOQ as non-detect indicators and extract the limit value separately.
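A minimal sketch of that normalisation step. The `parse_result` helper and its token set are illustrative, not the production implementation; a real deployment extends the tokens per laboratory:

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Common non-detect spellings seen on CoAs; extend per laboratory
NON_DETECT_TOKENS = {"nd", "bdl", "not detected", "< dl", "< loq"}

def parse_result(raw: str) -> Tuple[Optional[Decimal], bool, Optional[Decimal]]:
    """Return (value, is_non_detect, detection_limit) for a raw result cell."""
    text = raw.strip().lower()
    if text in NON_DETECT_TOKENS:
        # Non-detect with no inline limit; take it from the detection-limit
        # column instead
        return None, True, None
    match = re.fullmatch(r"<\s*(\d+(?:\.\d+)?)", text)
    if match:
        # "<0.01" style: a non-detect whose number is the detection limit,
        # not a measured value of 0.01
        return None, True, Decimal(match.group(1))
    if re.fullmatch(r"\d+(?:\.\d+)?", text):
        return Decimal(text), False, None
    raise ValueError(f"unparseable result cell: {raw!r}")
```

Raising on anything unrecognised keeps the failure loud; a silently dropped qualifier is exactly the error this layer exists to prevent.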

Units must be preserved exactly. mg/L and µg/L differ by a factor of 1000. mg/kg and mg/L refer to different sample matrices. An extraction system that normalises units or strips special characters will produce values that look correct but are off by orders of magnitude. For a compliance submission, this is a critical error.

Layout variation across laboratories is high. Unlike invoices, where a handful of software packages produce most of the templates, laboratory report systems are diverse. Some laboratories use LIMS software that exports structured PDFs. Others generate reports in Word. Some smaller labs produce scanned documents. Each has its own table structure, column ordering, and convention for how qualifiers and detection limits appear.

Accreditation scope requires cross-reference. Extracting that the pH was measured by Method X is straightforward. Confirming that Method X falls within the laboratory’s UKAS or A2LA accreditation scope requires a cross-reference that simple extraction can’t provide — it requires domain knowledge embedded in the pipeline.


The schema for CoA extraction

Schema-first extraction is especially important for CoAs because the downstream use cases depend on the relational structure being intact. A flat list of extracted values isn’t enough.

from pydantic import BaseModel, Field, model_validator
from datetime import date
from decimal import Decimal
from typing import Optional, List
from enum import Enum

class NonDetectIndicator(str, Enum):
    LESS_THAN = "less_than"
    NOT_DETECTED = "not_detected"
    BELOW_DETECTION = "below_detection"

class AnalysisResult(BaseModel):
    parameter: str = Field(..., description="Parameter name, normalised")
    parameter_raw: str = Field(..., description="Parameter name as it appears in the CoA")
    value: Optional[Decimal] = Field(None, description="Numeric result value; None if non-detect")
    non_detect: Optional[NonDetectIndicator] = Field(None, description="Non-detect type if applicable")
    detection_limit: Optional[Decimal] = Field(None, description="Method detection limit")
    unit: str = Field(..., description="Unit exactly as reported")
    method: Optional[str] = Field(None, description="Test method reference")
    specification_limit: Optional[str] = Field(None, description="Applicable limit or spec if shown")

    @model_validator(mode="after")
    def require_value_or_non_detect(self):
        # Every result must carry either a numeric value or a non-detect
        # indicator; an empty result is a schema violation, not missing data
        if self.value is None and self.non_detect is None:
            raise ValueError("value and non_detect cannot both be empty")
        return self

class CertificateOfAnalysis(BaseModel):
    laboratory_name: str
    certificate_number: str
    issue_date: date
    analysis_date: date
    sample_id: str
    sample_description: Optional[str] = None
    sample_matrix: Optional[str] = None
    accreditation_number: Optional[str] = None
    accreditation_body: Optional[str] = None
    results: List[AnalysisResult]
    signatory: Optional[str] = None
    within_accreditation_scope: Optional[bool] = None

The AnalysisResult schema forces the pipeline to handle non-detects explicitly. If value is None, non_detect must be populated. This constraint prevents silent data loss on non-detect values — instead of storing nothing, the system stores the detection limit and the non-detect indicator.


Extraction approach by CoA type

Structured digital PDFs from LIMS are the cleanest case. Tables have clear column headers, values are in predictable positions, and text extraction with pdfplumber or PyMuPDF produces reliable output for rules-based extraction. Most fields extract with high confidence.

Word-generated PDFs often have table structures that appear clean visually but are poorly formed in the PDF’s internal structure. Tables with merged cells, split rows, or wrapped text in cells require more careful parsing. pdfplumber’s extract_tables() with explicit settings often outperforms the defaults here.
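As a sketch of the post-processing this implies, the `rows_to_records` helper below is hypothetical and assumes input shaped like `extract_tables()` output: a header row first, then data rows whose cells may be `None` or contain embedded newlines:

```python
from typing import Dict, List, Optional

def rows_to_records(rows: List[List[Optional[str]]]) -> List[Dict[str, str]]:
    """Turn raw table rows into header-keyed records, cleaning the
    wrapped-text artefacts that Word-generated PDFs tend to produce."""
    def clean(cell: Optional[str]) -> str:
        # Wrapped or merged cells come back with embedded newlines, or as None
        return " ".join((cell or "").split())

    header = [clean(cell) for cell in rows[0]]
    records = []
    for row in rows[1:]:
        cells = [clean(cell) for cell in row]
        if not any(cells):
            continue  # skip fully blank separator rows
        records.append(dict(zip(header, cells)))
    return records
```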

Scanned CoAs require an OCR layer first. Quality varies considerably: a scan of a laser-printed document from a modern printer extracts well; a photocopy of a fax extracts poorly. For scanned CoAs below a quality threshold, routing to human review is more reliable than attempting extraction and getting noisy output.

Multi-sample CoAs — where one certificate covers multiple samples, often presented as transposed tables — require the extraction logic to correctly associate each result column with the corresponding sample ID. This is where column-alignment errors produce the most dangerous mismatches.
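One way to sketch that association step. `transpose_results` is a hypothetical helper that pivots a parameter-by-sample table and fails loudly on a column-count mismatch rather than guessing an alignment:

```python
from typing import Dict, List

def transpose_results(header: List[str], rows: List[List[str]]) -> Dict[str, Dict[str, str]]:
    """Pivot a multi-sample table where the first header cell labels the
    parameter column and the remaining header cells are sample IDs."""
    sample_ids = header[1:]
    out: Dict[str, Dict[str, str]] = {sid: {} for sid in sample_ids}
    for row in rows:
        parameter, values = row[0], row[1:]
        if len(values) != len(sample_ids):
            # A count mismatch means column alignment is broken; never
            # zip-truncate, because that assigns results to the wrong sample
            raise ValueError(
                f"row {parameter!r}: {len(values)} values for {len(sample_ids)} samples"
            )
        for sid, value in zip(sample_ids, values):
            out[sid][parameter] = value
    return out
```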


Confidence scoring and validation

Confidence scoring on CoA extraction should account for the specific failure modes of this document type:

  • Table alignment confidence: were the columns parsed cleanly, or are there alignment ambiguities?
  • Unit extraction confidence: was the unit extracted exactly, or does it match a known-valid unit string approximately?
  • Non-detect handling: was the non-detect indicator recognised, or was a numeric value extracted from a < expression?
  • Accreditation metadata: are the accreditation number and scope present?
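Those per-check scores can gate the record with something as small as this; the check names and the 0.85 threshold are illustrative, not from the production pipeline:

```python
REVIEW_THRESHOLD = 0.85  # illustrative; tune against real review outcomes

def needs_review(scores: dict[str, float]) -> bool:
    """scores maps check names (table_alignment, unit_match, non_detect,
    accreditation) to confidences in [0, 1]. Gating on min() rather than
    the mean means one weak check routes the whole extraction to review."""
    return min(scores.values()) < REVIEW_THRESHOLD
```

The design choice is the `min()`: a high average cannot mask a shaky table-alignment score, which is the failure mode that produces wrong-parameter values.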

Cross-validation checks that catch CoA-specific errors:

  • Does the certificate number follow the laboratory’s known numbering convention?
  • Are analysis dates within a plausible range relative to sampling dates?
  • Do extracted units match expected units for the known parameters?
  • Are detection limits consistent with the extraction method?

Any extraction that fails these checks — or that scores below threshold on column alignment — routes to human-in-the-loop review before the record is accepted.
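A sketch of two of those checks. The laboratory name, certificate-number pattern, and 30-day holding time are hypothetical placeholders; real values come from each laboratory's documented conventions:

```python
import re
from datetime import date, timedelta

# Hypothetical per-laboratory certificate numbering conventions
CERT_PATTERNS = {"Acme Labs": re.compile(r"AL-\d{2}-\d{5}$")}

def cross_validate(lab: str, cert_no: str, sampling: date, analysis: date,
                   max_holding_days: int = 30) -> list[str]:
    """Return failed-check descriptions; an empty list means the record passes."""
    failures = []
    pattern = CERT_PATTERNS.get(lab)
    if pattern and not pattern.match(cert_no):
        failures.append(f"certificate number {cert_no!r} breaks {lab}'s convention")
    if analysis < sampling:
        failures.append("analysis date precedes sampling date")
    elif (analysis - sampling) > timedelta(days=max_holding_days):
        failures.append("analysis date outside plausible holding time")
    return failures
```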


In practice: the water consultancy pipeline
#

The CoA extraction I built for a water consultancy handles certificates from over ten laboratories in daily operation. Non-detect handling and unit preservation were the two failure modes that required the most careful design.

Non-detect handling was addressed by building a normalisation layer that maps all known non-detect representations to the NonDetectIndicator enum before any numeric extraction is attempted. The detection limit is extracted as a separate field, so downstream systems always know what ND means for a given result.

Unit preservation was addressed by treating unit extraction as a string match against a validated unit vocabulary specific to water quality parameters — not a free-form string extraction. If the extracted string doesn’t match a known unit for the parameter type, the field is flagged for review rather than accepted at face value.
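A minimal sketch of that vocabulary check, with an illustrative (not production) vocabulary:

```python
# Illustrative per-parameter unit vocabulary for water quality work
UNIT_VOCAB = {
    "lead": {"mg/L", "µg/L"},
    "ph": {"pH units"},
    "conductivity": {"µS/cm", "mS/cm"},
}

def unit_is_valid(parameter: str, unit: str) -> bool:
    """Exact string match, deliberately with no normalisation: a stripped
    µ or a case change must fail and go to review, not pass silently."""
    return unit in UNIT_VOCAB.get(parameter.lower(), set())
```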

The result is that CoAs from laboratories with very different formats — some highly structured, some quite variable — produce consistent structured output that feeds directly into the compliance dataset.


FAQ

What Python library is best for certificate of analysis extraction?

For structured digital CoAs, pdfplumber handles table extraction reliably. For complex table layouts with merged cells or wrapped text, PyMuPDF with coordinate-based extraction gives more control. Scanned CoAs need an OCR step first — pytesseract for on-premise or a cloud OCR API for higher volume and accuracy requirements.

How do you handle non-detect values in CoA extraction?

Non-detect values should never be treated as numeric zero. Build a normalisation layer that maps representations like <0.01, ND, BDL, and Not detected to a non-detect flag, and extract the detection limit as a separate field. Store both — the downstream system needs to know the limit, not just that the result was below it.

Can IDP achieve high accuracy on CoAs from multiple laboratories?

Yes, with a per-laboratory extraction profile. A single generic extractor rarely achieves high accuracy across diverse lab formats. The approach that works in production is a laboratory identification step followed by laboratory-specific extraction rules, all producing output against the same validated schema.
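The routing itself can be as plain as a lookup table. The laboratory names and profile functions below are hypothetical stubs, not the production rules:

```python
from typing import Callable, Dict

def extract_acme(text: str) -> dict:
    ...  # laboratory-specific rules producing fields for the shared schema

def extract_riverlab(text: str) -> dict:
    ...

PROFILES: Dict[str, Callable[[str], dict]] = {
    "ACME ANALYTICAL": extract_acme,
    "RIVERLAB": extract_riverlab,
}

def identify_laboratory(text: str) -> str:
    """Match a known laboratory marker in the document text."""
    upper = text.upper()
    for marker in PROFILES:
        if marker in upper:
            return marker
    raise LookupError("unknown laboratory: route to manual onboarding")

def extract(text: str) -> dict:
    return PROFILES[identify_laboratory(text)](text)
```

Every profile writes into the same validated schema, so downstream consumers never see per-laboratory differences.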

How long does it take to onboard a new laboratory’s CoA format?

For a well-structured digital CoA from a new laboratory, onboarding a new extraction profile typically takes a few hours to a day — analysing the format, writing the extraction rules, and testing against a sample batch. The schema doesn’t change; only the extraction rules for that laboratory are new.

What happens when a CoA fails extraction?

A well-designed pipeline fails loudly. If required fields are missing, the table structure doesn’t parse cleanly, or extraction falls below confidence thresholds, the document is flagged and routed to a review queue. A person reviews the original document alongside the extraction result and corrects any errors before the record is accepted downstream.

