Purchase order extraction appears simple on the surface. A PO has a number, a date, a list of line items, and a total. If your business receives POs only from customers who use a single consistent format, a script handles it fine.
Reality for most operations: POs arrive from dozens of customers using different ERP systems, templates, and approval workflows. One customer sends a PDF export from SAP. Another sends a Word document saved as PDF. A third sends a scanned paper form. Each has different column ordering, different field labelling, and different conventions for how product codes and quantities are presented.
That’s when PO extraction becomes a production engineering problem.
What purchase order extraction needs to capture
The standard PO schema includes:
- PO number — the customer’s internal reference; this is often also your accounts receivable (AR) reference
- PO date — date the order was raised; not the same as the date you receive it
- Requested delivery date — sometimes per-line, sometimes a single date for the full order
- Buyer details — company name, contact, billing address, delivery address
- Line items — each with: your product code or description, the customer’s product code (may differ), quantity, unit of measure, unit price, and line total
- Payment terms — net 30, 30 days EOM, immediate, etc.
- Total order value — for cross-validation against line item sum
- Delivery instructions — warehouse location, reference codes, handling requirements
The tricky fields are the line items. Product codes rarely match between buyer and supplier. A customer’s internal SKU “WIDGET-A-100” may correspond to your product code “P4521”. Managing this mapping is part of the extraction system, not just a post-processing step.
The hard parts of PO extraction
Line item matching. Your customer orders “100mm M6 hex bolts” using their internal part number. You need to map that to your catalogue SKU. This requires a product mapping table that the extraction system can reference — either a fixed lookup or a fuzzy-match against your product database. Extraction without this mapping produces line items that require manual rekeying regardless of how accurately the raw values were extracted.
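A minimal sketch of the fuzzy-description fallback, using the standard library's difflib for character-level similarity (production systems often prefer token-based scorers). The catalogue dict, product codes, and the 0.6 threshold are all illustrative:

```python
from difflib import SequenceMatcher

# Hypothetical catalogue: our product code -> catalogue description
CATALOGUE = {
    "P4521": "Hex bolt M6 100mm zinc plated",
    "P4522": "Hex bolt M8 100mm zinc plated",
    "P9001": "Flat washer M6 stainless",
}

def fuzzy_match_product(description: str, threshold: float = 0.6):
    """Return (our_code, score) for the best catalogue match, or
    (None, score) when nothing clears the threshold and the line
    must go to human review instead."""
    best_code, best_score = None, 0.0
    for code, cat_desc in CATALOGUE.items():
        score = SequenceMatcher(None, description.lower(), cat_desc.lower()).ratio()
        if score > best_score:
            best_code, best_score = code, score
    if best_score >= threshold:
        return best_code, best_score
    return None, best_score
```

Note the deliberate two-outcome design: a confident match resolves automatically, anything below threshold is routed to review rather than guessed.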
Quantity and unit consistency. A customer ordering “10 boxes” when your pricing is per-unit creates a discrepancy that must be handled. Quantity and unit of measure are best extracted together and validated against your product catalogue’s expected units.
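A sketch of that extract-together, validate-against-catalogue step. The catalogue structure, the per-box pack size, and the product code are illustrative:

```python
from decimal import Decimal

# Hypothetical catalogue entry: expected selling unit and units per box
CATALOGUE_UNITS = {
    "P4521": {"unit": "each", "per_box": 50},
}

def normalise_quantity(product_code: str, quantity: Decimal, unit: str):
    """Convert an extracted (quantity, unit) pair to the catalogue's
    selling unit. Returns (quantity, unit, ok); ok=False flags the
    line for human review rather than guessing."""
    entry = CATALOGUE_UNITS.get(product_code)
    if entry is None:
        return quantity, unit, False            # unknown product: review
    if unit.lower() == entry["unit"]:
        return quantity, unit, True             # already in selling units
    if unit.lower() in ("box", "boxes") and "per_box" in entry:
        return quantity * entry["per_box"], entry["unit"], True
    return quantity, unit, False                # unrecognised unit: review
```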
Multi-page POs. Large orders from manufacturing or retail customers can span many pages. Table extraction needs to correctly detect continuation across pages and not treat each page as a separate order.
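Once each page's table has been extracted (for example with pdfplumber's extract_table), continuation handling reduces to a pure merge step. A sketch, assuming continuation pages may repeat the column header as their first row:

```python
def merge_multipage_tables(page_tables):
    """Merge line item tables extracted page by page into one table,
    dropping the header row when it is repeated at the top of a
    continuation page."""
    merged = []
    header = None
    for rows in page_tables:
        if not rows:
            continue
        if header is None:
            header = rows[0]          # first page establishes the header
            merged.append(header)
            merged.extend(rows[1:])
        else:
            start = 1 if rows[0] == header else 0
            merged.extend(rows[start:])
    return merged
```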
Approval-stage POs. Many ERP systems produce POs at multiple approval stages — draft, approved, revised. The same PO number can appear in multiple versions with different line items or values. The extraction system needs to handle version identification, either from document metadata or from explicit version labelling in the document.
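For the keep-latest-version case, the ingestion rule can be a few lines over the extracted records. The po_number and integer revision field names here are illustrative:

```python
def ingest_po_version(store: dict, po: dict) -> dict:
    """Store an extracted PO, replacing any previously stored version
    with a lower revision number for the same PO number. Out-of-order
    arrivals of older revisions are ignored."""
    existing = store.get(po["po_number"])
    if existing is None or po["revision"] > existing["revision"]:
        store[po["po_number"]] = po
    return store
```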
Customer-specific quirks. One customer always puts their delivery address where most customers put the billing address. Another uses a non-standard date format. A third embeds special handling instructions in the line item description rather than in a separate field. Per-customer extraction profiles handle this more reliably than a single generic extractor.
Schema and extraction approach
```python
from pydantic import BaseModel, validator
from datetime import date
from decimal import Decimal
from typing import Optional, List

class POLineItem(BaseModel):
    line_number: int
    customer_product_code: Optional[str] = None
    customer_description: str
    our_product_code: Optional[str] = None  # resolved from mapping table
    quantity: Decimal
    unit_of_measure: str
    unit_price: Optional[Decimal] = None
    line_total: Optional[Decimal] = None
    requested_delivery_date: Optional[date] = None

class PurchaseOrderExtraction(BaseModel):
    po_number: str
    po_date: date
    customer_name: str
    customer_reference: Optional[str] = None
    billing_address: Optional[str] = None
    delivery_address: Optional[str] = None
    payment_terms: Optional[str] = None
    requested_delivery_date: Optional[date] = None  # document-level if uniform
    line_items: List[POLineItem]
    subtotal: Optional[Decimal] = None
    tax: Optional[Decimal] = None
    total: Optional[Decimal] = None
    currency: str = "GBP"
    special_instructions: Optional[str] = None

    @validator("line_items")
    def at_least_one_line(cls, v):
        if not v:
            raise ValueError("PO must have at least one line item")
        return v
```

Schema-first extraction here means the line item is a typed model, not a list of raw strings. quantity is Decimal, not a string that might be "100" or "100.00" or "1e2". unit_of_measure is preserved as extracted — normalisation to a standard unit vocabulary happens as a separate post-extraction step with its own validation.
Rules-based vs. LLM extraction for POs
Unlike contracts, most POs are structured enough that rules-based extraction handles them well. The variation is in layout, not in semantics — the same fields are present, just positioned differently.
Rules-based extraction per customer template — once you’ve seen a customer’s PO format and written extraction rules for it, those rules will handle their future POs reliably unless they change their ERP or template. Customer identification (from email sender, filename pattern, or document content) routes each PO to the right set of rules.
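The routing step can be a simple ordered lookup with an LLM fallback. The sender-address patterns and profile names below are hypothetical:

```python
import re

# Illustrative routing table: sender-address pattern -> extraction profile
CUSTOMER_PROFILES = [
    (re.compile(r"@acme-manufacturing\.example$"), "acme_sap_rules"),
    (re.compile(r"@northside-retail\.example$"), "northside_word_rules"),
]

def route_po(sender_email: str) -> str:
    """Return the rules-based profile for a known customer, or the
    LLM-with-review fallback for an unrecognised sender."""
    for pattern, profile in CUSTOMER_PROFILES:
        if pattern.search(sender_email.lower()):
            return profile
    return "llm_generic_with_review"
```

In practice the routing key is often a combination of sender, filename pattern, and content fingerprint, but the shape of the decision is the same.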
LLM extraction for unknown templates — when a new customer sends their first PO and no rules exist yet, an LLM extracts the fields against the schema. This produces a record that goes to human review before being processed. The reviewed output confirms what the LLM extracted or corrects it, and that becomes the training basis for building extraction rules for the new customer.
Confidence scoring is applied throughout. Cross-validation checks are particularly useful for POs:
- Do line item totals sum to the stated subtotal?
- Does the PO date precede the requested delivery date?
- Does the extracted PO number match any known format for this customer?
These checks catch a high proportion of extraction errors before the record reaches your order management system.
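The first two checks can be sketched as a validation function over the extracted record. Field names follow the schema earlier in this piece; the 0.01 rounding tolerance is an assumption:

```python
from decimal import Decimal
from datetime import date

def cross_validate(po: dict) -> list:
    """Return a list of validation failures; an empty list means the
    PO passes and can proceed without review."""
    failures = []
    line_sum = sum((li["line_total"] for li in po["line_items"]
                    if li.get("line_total") is not None), Decimal("0"))
    subtotal = po.get("subtotal")
    if subtotal is not None and abs(line_sum - subtotal) > Decimal("0.01"):
        failures.append(f"line items sum to {line_sum}, stated subtotal is {subtotal}")
    rdd = po.get("requested_delivery_date")
    if rdd is not None and po["po_date"] > rdd:
        failures.append("PO date is after requested delivery date")
    return failures
```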
The product mapping problem
Every PO extraction system for a supplier-side operation eventually needs to handle product mapping — the customer’s product codes don’t match your catalogue.
The options range from simple to sophisticated:
Static lookup table — a maintained mapping of known customer SKUs to your product codes. Works well for steady-state customer relationships. Requires maintenance when customers update their product codes or when new products are added.
Fuzzy description matching — when the customer provides a text description rather than a code, fuzzy-match against your product catalogue descriptions. Useful as a fallback but produces false matches on common terms. Always route fuzzy matches to human review.
LLM-assisted mapping — for complex descriptions or when the customer uses terminology that doesn’t match your catalogue vocabulary. The LLM suggests the most likely match, which is reviewed before being committed to the lookup table.
The operational goal is a lookup table that’s maintained from the review process — every manually resolved mapping is added to the table, so the same mapping doesn’t require human resolution twice.
Integrating with order management
The downstream integration shapes the extraction pipeline’s output format. The two common scenarios:
API integration to ERP/OMS — the extraction pipeline outputs a structured order object that maps directly to the API schema (SAP BAPI, NetSuite SuiteScript, Xero API). Product code mapping happens before output. Delivery addresses are normalised to match the address format the ERP expects.
File-based integration — the pipeline produces a standardised CSV or XML that’s imported into the OMS. The format is fixed by the import specification. Extraction validation includes checking that extracted values fit the import format constraints.
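A minimal sketch of the file-based path using Python's csv module. The column layout here is hypothetical; in practice it is dictated by the OMS import specification:

```python
import csv
import io

def po_to_import_csv(po: dict) -> str:
    """Flatten an extracted PO to one CSV row per line item, in a
    hypothetical OMS import layout (header names are illustrative)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["po_number", "po_date", "product_code", "quantity", "unit_price"])
    for li in po["line_items"]:
        writer.writerow([po["po_number"], po["po_date"],
                         li["our_product_code"], li["quantity"], li["unit_price"]])
    return buf.getvalue()
```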
In either case, the extraction pipeline’s output schema is designed around the downstream integration requirements, not just around what the PO contains. This is the schema-first principle applied from both ends — what the document contains and what the downstream system expects.
FAQ
How do you extract line items from purchase order PDFs with Python?
Use pdfplumber’s extract_table() for tabular POs, with explicit table settings for edge cases with poorly-formed tables. Define line items as a typed list in a Pydantic schema before writing any extraction logic. Cross-validate line totals against quantity × unit price to catch extraction errors. For multi-page POs, concatenate tables across pages before parsing.
What’s the best way to handle POs from multiple customers in different formats?
A customer identification step routes each PO to a customer-specific extraction profile. For known customers, rules-based extraction handles the layout variation. For new customers, LLM extraction with human review produces the first batch while rules are being built for the new template. The review queue for new customers is temporary — it shrinks as rules are added.
How accurate is automated PO extraction?
For POs from known customers with stable templates, 95%+ field-level accuracy is achievable with rules-based extraction. The realistic bottleneck is product code mapping — even when the PO fields extract correctly, unmapped customer codes require resolution. A well-maintained mapping table reduces this to a small proportion of POs.
Can you extract data from scanned purchase orders?
Yes, with an OCR layer before text extraction. Scanned PO quality varies: a clean scan of a laser-printed document extracts well; a photocopy of a fax extracts poorly. For scanned POs, apply stricter confidence thresholds and route more aggressively to human review. OCR-extracted table alignment is less reliable than text-layer extraction, so cross-validation of line item totals matters more.
How do you handle PO version control in extraction?
Extract a version or revision indicator if present in the document. Where the same PO number appears in multiple versions, store the extracted version alongside the PO data. For systems that need only the latest approved version, implement an ingestion rule that replaces previous versions when a new one is received for the same PO number.
Related articles
- Invoice Data Extraction with Python — same pipeline architecture for supplier-side document processing
- What is Schema-First Extraction? — defining output structure before touching documents
- What is a Document Extraction Pipeline? — end-to-end system architecture
- What is Confidence Scoring in Document Extraction? — flagging uncertain extractions for review
