Structured documents have a predictable, machine-readable layout — the same fields in the same positions, every time. Unstructured documents present information in free-form natural language, where the relevant data could be anywhere and phrased in any number of ways.
The distinction matters because it determines your extraction approach. Structured documents can be extracted reliably with rules. Unstructured documents require more sophisticated methods, and reliable extraction is harder to guarantee.
Structured documents#
A structured document has a defined format. The fields are in known positions, labelled consistently, and the document was designed to be machine-readable (even if it’s rendered as a PDF for human reading).
Examples:
- Tax forms (W-2, P60, VAT returns) — defined fields in regulated positions
- SAP or Oracle ERP exports — consistent column headers, consistent field positions
- Standard customs forms (SAD) — box-numbered fields defined by regulation
- Database-generated reports — same schema every time, just different data
Extraction approach: Rules-based extraction handles structured documents well. Regex patterns, coordinate-based extraction, and table parsing against known column headers all work reliably. Confidence scores are high because the extraction logic is deterministic.
The catch: “Structured” is relative to the source. A tax form is highly structured if you know the exact form type. But the same data type (annual income) appears in W-2s, P60s, self-assessment returns, and employer letters — all structured within their own format, but each requiring different extraction logic.
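Rules-based extraction of a known structured form can be sketched in a few lines. This is a minimal illustration, not a production parser: the field names and label patterns below are hypothetical W-2-style examples, and a real system would key the rule set off the detected form type.

```python
import re

# Hypothetical rules for one known form type: each field is anchored
# to a label the form always uses, so a miss is a hard failure rather
# than a guess. Patterns here are illustrative, not real W-2 layout.
FIELD_RULES = {
    "employer_ein": re.compile(r"Employer ID number \(EIN\)\s*:?\s*(\d{2}-\d{7})"),
    "wages": re.compile(r"Wages, tips, other compensation\s*:?\s*([\d,]+\.\d{2})"),
}

def extract_structured(text: str) -> dict:
    """Apply deterministic rules; return None for any field not found."""
    result = {}
    for name, pattern in FIELD_RULES.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result

sample = (
    "Employer ID number (EIN): 12-3456789\n"
    "Wages, tips, other compensation: 55,000.00"
)
fields = extract_structured(sample)
```

Because the logic is deterministic, a returned value is either exactly what the rule matched or explicitly absent, which is why confidence on structured documents can be treated as near-certain.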
Semi-structured documents#
Semi-structured documents have a consistent general format — the same sections always appear — but the exact position, labelling, and presentation of individual fields varies between instances.
Examples:
- Invoices — always have a total, a date, a vendor name, and line items, but positioned and labelled differently across suppliers
- Purchase orders — always have a PO number, line items, and delivery address, but in different layouts across different customers’ ERP systems
- Lab reports — always contain test results, but table structure and parameter labelling vary by laboratory
- Contracts — always contain parties, dates, and key clauses, but in different locations and phrasings across different templates
Extraction approach: Semi-structured documents need a combination of rules-based extraction (for fields that are consistently located within a known template variant) and LLM or ML extraction (for fields that vary). Per-source profiles handle the most common variants deterministically; LLM extraction handles new or unusual variants.
Layout variation is the defining challenge of semi-structured document extraction. The variation isn’t in whether the fields exist — they always do — but in where they are and how they’re presented.
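The layered approach described above can be sketched as a per-source profile lookup with an LLM fallback. Everything here is illustrative: the vendor names and label patterns are made up, and `llm_extract` is a stub standing in for a real model call.

```python
import re

# Per-source profiles: deterministic rules for invoice layouts we have
# seen before, keyed by vendor. Vendors and patterns are hypothetical.
PROFILES = {
    "acme_corp": {"total": re.compile(r"Amount Due:\s*\$([\d,]+\.\d{2})")},
    "globex": {"total": re.compile(r"TOTAL\s+\$([\d,]+\.\d{2})")},
}

def llm_extract(text: str, field: str) -> tuple:
    """Stub for an LLM extraction call; returns (value, confidence)."""
    return None, 0.0

def extract_field(vendor: str, text: str, field: str) -> tuple:
    """Try the known profile first; fall back to the LLM for new or
    drifted layouts. Returns (value, confidence)."""
    profile = PROFILES.get(vendor, {})
    pattern = profile.get(field)
    if pattern:
        match = pattern.search(text)
        if match:
            return match.group(1), 0.99  # deterministic hit
    return llm_extract(text, field)  # unknown vendor or layout drift

value, conf = extract_field("acme_corp", "Amount Due: $1,234.50", "total")
```

The design point is the ordering: the cheap, deterministic path handles the common variants, and only the residue reaches the more expensive, lower-confidence path.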
Unstructured documents#
Unstructured documents present information in free-form natural language. There’s no predictable structure. The relevant data might be anywhere, phrased differently in each document, and sometimes embedded in context that requires understanding to extract correctly.
Examples:
- Correspondence and emails — relevant information (dates, amounts, commitments) appears in running text
- Meeting minutes — decisions and action items buried in narrative
- Analyst reports — data and conclusions in prose paragraphs
- Legal opinions — conclusions depend on reasoning that precedes them
- Field survey notes — observations in unformatted text
Extraction approach: Unstructured documents rely on LLMs or trained NLP models. Rules work poorly because there’s no structure to anchor them. Extraction confidence is inherently lower, and human-in-the-loop review handles a higher proportion of outputs.
In practice, most “unstructured” extraction targets are actually specific entities in context — a date, a decision, an amount — rather than complete freeform text. Named entity recognition (NER) and context-window extraction with LLMs both work for this, with the schema defining what entities to extract.
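Schema-defined entity extraction from free text can be sketched as follows. The regex patterns below are crude stand-ins for a real NER model or LLM call, kept only to show the shape: the schema names the target entities, and the extractor returns each match with its position so a reviewer can see it in context.

```python
import re

# Schema-first extraction from running text: the schema declares which
# entities to pull. These regexes are illustrative placeholders for a
# trained NER model or an LLM with a context window.
SCHEMA = {
    "date": re.compile(
        r"\b(\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4})\b"
    ),
    "amount": re.compile(r"[$£€]\s?[\d,]+(?:\.\d{2})?"),
}

def extract_entities(text: str) -> dict:
    """Return every schema entity found, with its character span for review."""
    found = {}
    for entity, pattern in SCHEMA.items():
        found[entity] = [(m.group(0), m.span()) for m in pattern.finditer(text)]
    return found

email = "As agreed on 3 March 2025, we will settle the $4,200.00 balance."
entities = extract_entities(email)
```

Returning spans alongside values matters for the human-in-the-loop step: a reviewer confirms an extraction far faster when shown where in the text it came from.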
Why the distinction matters for extraction design#
The structured/unstructured axis determines the appropriate extraction approach, the achievable accuracy, and the proportion of outputs that require human review.
| | Structured | Semi-structured | Unstructured |
|---|---|---|---|
| Primary extraction method | Rules, templates | Rules + LLM | LLM, NER |
| Achievable accuracy | 95%+ | 85-95% | 70-85% |
| Human review proportion | Low | Medium | High |
| Sensitivity to layout change | High | Medium | Low |
| Domain knowledge required | Low | Medium | High |
Most real-world document workflows contain semi-structured documents, not purely structured or purely unstructured ones. The extraction system needs to handle the semi-structured case well — which means per-source profiles, layered extraction methods, and confidence scoring, not just one approach applied uniformly.
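Confidence scoring only pays off if it drives routing. A minimal sketch of the review split, with an illustrative threshold (real systems tune thresholds per field and per source):

```python
# Route extracted fields by confidence: high-confidence values are
# auto-accepted, the rest go to human review. The 0.90 threshold is
# an assumption for illustration, not a recommendation.
REVIEW_THRESHOLD = 0.90

def route(extractions: dict) -> tuple:
    """Split {field: (value, confidence)} into accepted and review queues."""
    accepted, review = {}, {}
    for field, (value, confidence) in extractions.items():
        target = accepted if confidence >= REVIEW_THRESHOLD else review
        target[field] = value
    return accepted, review

accepted, review = route({
    "total": ("1,234.50", 0.99),  # deterministic rule hit
    "date": ("2025-03-03", 0.72),  # LLM extraction, lower confidence
})
```

The practical effect follows the table above: structured fields mostly land in the accepted queue, unstructured ones disproportionately in review.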
Related concepts#
- What is Layout Variation in Document Extraction? — the key challenge for semi-structured documents
- What is Schema-First Extraction? — defining output structure regardless of document structure
- What is Intelligent Document Processing? — the broader context for extraction across document types
- What is a Document Extraction Pipeline? — how different document types are handled in one pipeline
