Structured documents have a predictable, machine-readable layout — the same fields in the same positions, every time. Unstructured documents present information in free-form natural language, where the relevant data could be anywhere and phrased in any number of ways.
The distinction matters because it determines your extraction approach. Structured documents can be extracted reliably with rules. Unstructured documents require more sophisticated methods, and reliable extraction is harder to guarantee.
Structured documents#
A structured document has a defined format. The fields are in known positions, labelled consistently, and the document was designed to be machine-readable (even if it’s rendered as a PDF for human reading).
Examples:
- Tax forms (W-2, P60, VAT returns) — defined fields in regulated positions
- SAP or Oracle ERP exports — consistent column headers, consistent field positions
- Standard customs forms (SAD) — box-numbered fields defined by regulation
- Database-generated reports — same schema every time, just different data
Extraction approach: Rules-based extraction handles structured documents well. Regex patterns, coordinate-based extraction, and table parsing against known column headers all work reliably. Confidence scores are high because the extraction logic is deterministic.
The catch: “Structured” is relative to the source. A tax form is highly structured if you know the exact form type. But the same data type (annual income) appears in W-2s, P60s, self-assessment returns, and employer letters — all structured within their own format, but each requiring different extraction logic.
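Rules-based extraction of a known structured form can be sketched in a few lines. This is a minimal illustration, not a production parser: the field names and label patterns below are hypothetical W-2-style examples, and a real system would key the rule set off the detected form type.

```python
import re

# Hypothetical rules for one known form type: each field is anchored
# to a label the form always uses, so a miss is a hard failure rather
# than a guess. Patterns here are illustrative, not real W-2 layout.
FIELD_RULES = {
    "employer_ein": re.compile(r"Employer ID number \(EIN\)\s*:?\s*(\d{2}-\d{7})"),
    "wages": re.compile(r"Wages, tips, other compensation\s*:?\s*([\d,]+\.\d{2})"),
}

def extract_structured(text: str) -> dict:
    """Apply deterministic rules; return None for any field not found."""
    result = {}
    for name, pattern in FIELD_RULES.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result

sample = (
    "Employer ID number (EIN): 12-3456789\n"
    "Wages, tips, other compensation: 55,000.00"
)
fields = extract_structured(sample)
```

Because the logic is deterministic, a returned value is either exactly what the rule matched or explicitly absent, which is why confidence on structured documents can be treated as near-certain.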
Semi-structured documents#
Semi-structured documents have a consistent general format — the same sections always appear — but the exact position, labelling, and presentation of individual fields varies between instances.
Examples:
- Invoices — always have a total, a date, a vendor name, and line items, but positioned and labelled differently across suppliers
- Purchase orders — always have a PO number, line items, and delivery address, but in different layouts across different customers’ ERP systems
- Lab reports — always contain test results, but table structure and parameter labelling vary by laboratory
- Contracts — always contain parties, dates, and key clauses, but in different locations and phrasings across different templates
Extraction approach: Semi-structured documents need a combination of rules-based extraction (for fields that are consistently located within a known template variant) and LLM or ML extraction (for fields that vary). Per-source profiles handle the most common variants deterministically; LLM extraction handles new or unusual variants.
Layout variation is the defining challenge of semi-structured document extraction. The variation isn’t in whether the fields exist — they always do — but in where they are and how they’re presented.
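The layered approach described above can be sketched as a per-source profile lookup with an LLM fallback. Everything here is illustrative: the vendor names and label patterns are made up, and `llm_extract` is a stub standing in for a real model call.

```python
import re

# Per-source profiles: deterministic rules for invoice layouts we have
# seen before, keyed by vendor. Vendors and patterns are hypothetical.
PROFILES = {
    "acme_corp": {"total": re.compile(r"Amount Due:\s*\$([\d,]+\.\d{2})")},
    "globex": {"total": re.compile(r"TOTAL\s+\$([\d,]+\.\d{2})")},
}

def llm_extract(text: str, field: str) -> tuple:
    """Stub for an LLM extraction call; returns (value, confidence)."""
    return None, 0.0

def extract_field(vendor: str, text: str, field: str) -> tuple:
    """Try the known profile first; fall back to the LLM for new or
    drifted layouts. Returns (value, confidence)."""
    profile = PROFILES.get(vendor, {})
    pattern = profile.get(field)
    if pattern:
        match = pattern.search(text)
        if match:
            return match.group(1), 0.99  # deterministic hit
    return llm_extract(text, field)  # unknown vendor or layout drift

value, conf = extract_field("acme_corp", "Amount Due: $1,234.50", "total")
```

The design point is the ordering: the cheap, deterministic path handles the common variants, and only the residue reaches the more expensive, lower-confidence path.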
Unstructured documents#
Unstructured documents present information in free-form natural language. There’s no predictable structure. The relevant data might be anywhere, phrased differently in each document, and sometimes embedded in context that requires understanding to extract correctly.
Examples:
- Correspondence and emails — relevant information (dates, amounts, commitments) appears in running text
- Meeting minutes — decisions and action items buried in narrative
- Analyst reports — data and conclusions in prose paragraphs
- Legal opinions — conclusions depend on reasoning that precedes them
- Field survey notes — observations in unformatted text
Extraction approach: Unstructured documents rely on LLMs or trained NLP models. Rules work poorly because there’s no structure to anchor them. Extraction confidence is inherently lower, and human-in-the-loop review handles a higher proportion of outputs.
In practice, most “unstructured” extraction targets are actually specific entities in context — a date, a decision, an amount — rather than complete freeform text. Named entity recognition (NER) and context-window extraction with LLMs both work for this, with the schema defining what entities to extract.
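Schema-defined entity extraction from free text can be sketched as follows. The regex patterns below are crude stand-ins for a real NER model or LLM call, kept only to show the shape: the schema names the target entities, and the extractor returns each match with its position so a reviewer can see it in context.

```python
import re

# Schema-first extraction from running text: the schema declares which
# entities to pull. These regexes are illustrative placeholders for a
# trained NER model or an LLM with a context window.
SCHEMA = {
    "date": re.compile(
        r"\b(\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4})\b"
    ),
    "amount": re.compile(r"[$£€]\s?[\d,]+(?:\.\d{2})?"),
}

def extract_entities(text: str) -> dict:
    """Return every schema entity found, with its character span for review."""
    found = {}
    for entity, pattern in SCHEMA.items():
        found[entity] = [(m.group(0), m.span()) for m in pattern.finditer(text)]
    return found

email = "As agreed on 3 March 2025, we will settle the $4,200.00 balance."
entities = extract_entities(email)
```

Returning spans alongside values matters for the human-in-the-loop step: a reviewer confirms an extraction far faster when shown where in the text it came from.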
Why the distinction matters for extraction design#
The structured/unstructured axis determines the appropriate extraction approach, the achievable accuracy, and the proportion of outputs that require human review.
| | Structured | Semi-structured | Unstructured |
|---|---|---|---|
| Primary extraction method | Rules, templates | Rules + LLM | LLM, NER |
| Achievable accuracy | 95%+ | 85-95% | 70-85% |
| Human review proportion | Low | Medium | High |
| Sensitivity to layout change | High | Medium | Low |
| Domain knowledge required | Low | Medium | High |
Most real-world document workflows contain semi-structured documents, not purely structured or purely unstructured ones. The extraction system needs to handle the semi-structured case well — which means per-source profiles, layered extraction methods, and confidence scoring, not just one approach applied uniformly.
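Confidence scoring only pays off if it drives routing. A minimal sketch of the review split, with an illustrative threshold (real systems tune thresholds per field and per source):

```python
# Route extracted fields by confidence: high-confidence values are
# auto-accepted, the rest go to human review. The 0.90 threshold is
# an assumption for illustration, not a recommendation.
REVIEW_THRESHOLD = 0.90

def route(extractions: dict) -> tuple:
    """Split {field: (value, confidence)} into accepted and review queues."""
    accepted, review = {}, {}
    for field, (value, confidence) in extractions.items():
        target = accepted if confidence >= REVIEW_THRESHOLD else review
        target[field] = value
    return accepted, review

accepted, review = route({
    "total": ("1,234.50", 0.99),  # deterministic rule hit
    "date": ("2025-03-03", 0.72),  # LLM extraction, lower confidence
})
```

The practical effect follows the table above: structured fields mostly land in the accepted queue, unstructured ones disproportionately in review.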
Related concepts#
- What is Layout Variation in Document Extraction? — the key challenge for semi-structured documents
- What is Schema-First Extraction? — defining output structure regardless of document structure
- What is Intelligent Document Processing? — the broader context for extraction across document types
- What is a Document Extraction Pipeline? — how different document types are handled in one pipeline
