
What is Document Classification in IDP?

Subhajit Bhar

Document classification is the step in an extraction pipeline that identifies what type of document has arrived before any field extraction begins. In a pipeline that handles multiple document types — invoices, purchase orders, lab reports, contracts — classification routes each document to the extraction logic designed for it.

Without classification, you either need separate ingestion pipelines per document type (which creates operational complexity), or you apply all extraction logic to every document (which produces noise and false matches).


Why classification matters

A document extraction pipeline that handles one document type doesn’t need classification — you already know what every incoming document is. Classification becomes necessary when the same pipeline ingests multiple document types from the same source or channel.

Consider an accounts payable team that receives invoices, credit notes, remittance advices, and delivery notes all via the same email inbox. A pipeline without classification attempts to extract invoice fields from credit notes, remittance advices, and delivery notes — producing nonsense for non-invoice documents.

With classification, each incoming document is identified first. Invoices go to invoice extraction. Credit notes go to credit note extraction. Delivery notes are identified and either extracted with their own logic or routed to a simple filing step if extraction isn’t needed.


Classification approaches

Rule-based classification uses document features to determine type — keywords in the document, file name patterns, email subject lines, sender addresses, or document metadata. An email from a known supplier with “Invoice” in the subject and a PDF attachment is likely an invoice. A document with the word “Credit Note” in the header is likely a credit note.

Rule-based classification is fast, transparent, and sufficient for cases where document types are clearly distinguishable by surface features. It fails when document types are similar enough that surface features don’t reliably distinguish them.
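A minimal sketch of the rule-based approach, assuming keyword matching over the document text and filename (the keywords and type names here are illustrative, not a fixed taxonomy). Note that rule order matters: a credit note usually mentions the word "invoice", so the more specific rules are checked first.

```python
import re
from typing import Optional

# Illustrative rules only: each document type maps to keyword patterns.
# More specific types come first, because a credit note often contains
# the word "invoice" in its body.
RULES = [
    ("credit_note", [r"\bcredit note\b"]),
    ("remittance_advice", [r"\bremittance advice\b"]),
    ("delivery_note", [r"\bdelivery note\b"]),
    ("invoice", [r"\binvoice\b", r"\btax invoice\b"]),
]

def classify_by_rules(text: str, filename: str = "") -> Optional[str]:
    """Return the first document type whose patterns match, else None."""
    haystack = f"{filename}\n{text}".lower()
    for doc_type, patterns in RULES:
        if any(re.search(p, haystack) for p in patterns):
            return doc_type
    return None  # no rule fired -> fall through to ML or human review
```

Returning `None` when nothing matches is deliberate: the document falls through to a second-stage classifier or a human queue instead of receiving a forced guess.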

Layout signature matching compares the spatial structure of the incoming document against known layout templates. An invoice from a specific supplier has a consistent layout signature — the positions of the header, the line item table, and the total section. If the incoming document matches that signature closely enough, it’s classified as that supplier’s invoice.

Layout signature matching works well for identifying specific template variants within a document type. It doesn’t generalise well to documents with novel layouts.
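One way to sketch a layout signature, assuming the document has already been segmented into text blocks with normalised bounding boxes: reduce each page to an occupancy grid and compare grids by Jaccard overlap. The grid size and similarity measure are assumptions for illustration; production systems use richer representations.

```python
# Illustrative layout signature: text-block boxes (x0, y0, x1, y1 in 0..1)
# are mapped onto a coarse occupancy grid, and two signatures are compared
# by Jaccard overlap of their occupied cells.
GRID = 10  # 10x10 cells per page (an assumption, tune per use case)

def signature(blocks: list) -> frozenset:
    """Map normalised block boxes to the set of grid cells they cover."""
    cells = set()
    for x0, y0, x1, y1 in blocks:
        for gx in range(int(x0 * GRID), min(int(x1 * GRID) + 1, GRID)):
            for gy in range(int(y0 * GRID), min(int(y1 * GRID) + 1, GRID)):
                cells.add((gx, gy))
    return frozenset(cells)

def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: 1.0 for identical layouts, 0.0 for disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

An incoming document would be classified as a known template when its signature exceeds a similarity threshold against that template's stored signature.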

ML-based classification trains a model on labelled examples of each document type. The model learns features that distinguish document types without explicit rules. It generalises better to novel layouts within known types and handles cases where surface features are ambiguous.

The tradeoff: ML models require labelled training data, perform worse on document types underrepresented in the training set, and produce less interpretable classifications (you know the model said “invoice” but not exactly why). For most production pipelines, a combination of rule-based and ML classification covers the majority of cases — rules handle confident, clear-cut classifications; ML handles the ambiguous cases.
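To make the ML idea concrete without pulling in a training framework, here is a toy nearest-centroid classifier over bag-of-words vectors. The training examples and labels are invented for illustration; a real system would train a proper model on a large labelled corpus. The returned score also gives a rough confidence signal, which matters in the next section.

```python
from collections import Counter
import math

# Toy labelled examples, one centroid per document type (illustrative data).
TRAIN = {
    "invoice": ["invoice number 1001 amount due in 30 days"],
    "credit_note": ["credit note issued against invoice 1001"],
    "purchase_order": ["purchase order for 50 units of part 7"],
}

def vec(text: str) -> Counter:
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One centroid per label, built from all of that label's examples.
CENTROIDS = {label: vec(" ".join(docs)) for label, docs in TRAIN.items()}

def classify_ml(text: str):
    """Return (label, score) for the closest centroid."""
    scores = {label: cosine(vec(text), c) for label, c in CENTROIDS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Unlike the rule-based version, this learns from examples rather than hand-written patterns, but the tradeoffs above still apply: it is only as good as its labelled data, and the score explains little about why a label won.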


Classification confidence

Like field extraction, classification should produce a confidence score, not just a label. A document that clearly matches a known type should score near 1.0. A document that partially matches multiple types should score lower and potentially route to human classification.

This matters because classification errors compound downstream. A purchase order classified as an invoice goes to invoice extraction and produces fields that look plausible but are incorrect. Those incorrect values may pass schema validation because they’re in the right format — just from the wrong document.

Routing low-confidence classifications to human review is cheaper than catching misclassification errors after extraction has run.
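The routing decision itself is simple. A sketch, assuming a single confidence threshold (the 0.8 value is an assumption; the right threshold depends on the cost of review versus the cost of a missed misclassification):

```python
# Hypothetical confidence gate: the threshold value is an assumption.
REVIEW_THRESHOLD = 0.8

def route(label: str, confidence: float) -> str:
    """Send confident classifications to extraction, the rest to a human queue."""
    if confidence >= REVIEW_THRESHOLD:
        return f"extract:{label}"
    return "human_review"
```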


Classification in the pipeline sequence

Classification sits after ingestion (the document has been received and normalised) and before extraction (which requires knowing what to extract). The pipeline sequence:

  1. Ingestion — receive document, convert format if needed, OCR if image-based
  2. Classification — identify document type (and optionally document source, for per-template extraction profiles)
  3. Extraction — apply the extraction logic for the identified type and source
  4. Validation — check extracted output against schema
  5. Routing — pass to downstream system or review queue

When classification confidence is below threshold, the document routes to human classification before extraction runs — not after.
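The five-step sequence above can be sketched as a single orchestration function. Every stage here is a stand-in stub so the flow is runnable; real ingestion, classification, extraction, and validation logic would be far richer.

```python
# Sketch of the pipeline sequence; all stage logic is an illustrative stub.
def ingest(raw: bytes) -> str:
    return raw.decode("utf-8")  # 1. normalise (format conversion / OCR omitted)

def classify(doc: str):
    # 2. stand-in classifier: keyword match with a fixed confidence
    return ("invoice", 0.95) if "invoice" in doc.lower() else ("unknown", 0.2)

def extract(doc: str, label: str) -> dict:
    return {"type": label, "raw_text": doc}  # 3. type-specific extraction

def validate(fields: dict) -> list:
    return [] if fields.get("type") else ["missing type"]  # 4. schema check

def process(raw: bytes) -> dict:
    doc = ingest(raw)
    label, conf = classify(doc)
    if conf < 0.8:
        # Below threshold: human classification happens BEFORE extraction runs.
        return {"status": "needs_classification"}
    fields = extract(doc, label)
    errors = validate(fields)
    if errors:
        return {"status": "review", "errors": errors}
    return {"status": "routed", "fields": fields}  # 5. downstream system
```

Note where the confidence check sits: a low-confidence document exits to human classification before any extraction logic is applied, matching the ordering described above.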



Building a multi-document-type extraction pipeline? Start with a Diagnostic Session →
