Document validation is the step in an extraction pipeline that checks whether the extracted data is internally consistent, correctly formatted, and plausible — before that data passes to any downstream system.
Extraction produces values. Validation determines whether those values are correct. The two steps are distinct, and skipping validation is the most common reason extraction errors reach production systems undetected.
Why validation is a separate step
Extraction and validation do different things. Extraction finds values in documents. Validation checks whether found values make sense.
An extraction that reads £1.234,56 from a European invoice might produce 1234.56 (correct), 1.234 (incorrect: the period was treated as the decimal separator and the digits after the comma were dropped), or 1234 (dropped the decimal part entirely). All three are plausible numbers. None will cause a type error. Only validation — specifically, cross-checking the line item sum against the stated total — will detect the wrong value.
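A minimal sketch of why this fails silently: both parses below return valid floats, and only a cross-check against known line items (the figures here are invented for illustration) separates right from wrong.

```python
def parse_naive(s: str) -> float:
    # Strips the comma and treats "." as the decimal separator —
    # produces a plausible number at the wrong magnitude.
    return float(s.replace(",", ""))

def parse_european(s: str) -> float:
    # "." is a thousands separator, "," is the decimal separator.
    return float(s.replace(".", "").replace(",", "."))

raw = "1.234,56"
wrong = parse_naive(raw)     # 1.23456 — no type error, looks fine
right = parse_european(raw)  # 1234.56

# Only a consistency check against the stated line items exposes the bad parse:
line_items = [1000.00, 234.56]
assert abs(sum(line_items) - right) < 0.01
assert abs(sum(line_items) - wrong) >= 0.01
```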
Without validation, extraction errors look like correct data until something breaks downstream: an invoice is paid at the wrong amount, a compliance value is off by a factor, a product ships in the wrong quantity. By then, the error is hard to trace.
Types of validation in extraction pipelines
Schema validation — the first and most basic check. Does the extracted value match the expected type and format? A date field should contain a parseable date. A numeric field should be a number. A required field should be present. Schema-first extraction defines these constraints upfront so they can be checked as soon as extraction produces output.
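A schema check of this kind can be sketched with nothing more than a type map; the field names and types here are hypothetical, not a real extraction schema.

```python
from datetime import date

# Hypothetical schema: required fields and their expected types.
SCHEMA = {
    "invoice_number": str,
    "invoice_date": date,
    "total": float,
}

def schema_errors(record: dict) -> list[str]:
    """Return a list of schema violations; empty list means the record passes."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record or record[field] is None:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

ok = {"invoice_number": "INV-1", "invoice_date": date(2024, 5, 1), "total": 99.0}
assert schema_errors(ok) == []
assert schema_errors({"invoice_number": "INV-1"}) == [
    "missing required field: invoice_date",
    "missing required field: total",
]
```

In practice a schema library would handle coercion and nesting as well, but the principle is the same: the constraints exist before extraction runs, so output can be checked immediately.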
Range validation — is the value within a plausible range? A lab report nitrate value of 5,000 mg/L might extract correctly as a number but is physically implausible for a drinking water sample. A delivery date ten years in the past is probably an extraction error. Range checks catch outliers that schema validation can’t.
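A range check is a per-field lookup; the bounds below are illustrative placeholders, not authoritative limits for any real parameter.

```python
# Hypothetical plausibility bounds per parameter (drinking-water context).
RANGES = {
    "nitrate_mg_l": (0.0, 100.0),
    "ph": (0.0, 14.0),
}

def in_range(parameter: str, value: float) -> bool:
    low, high = RANGES[parameter]
    return low <= value <= high

assert in_range("nitrate_mg_l", 12.3)
assert not in_range("nitrate_mg_l", 5000.0)  # extracted fine, physically implausible
```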
Cross-field consistency — do the extracted values make sense together? Invoice line totals should sum to the stated subtotal. Total plus tax should equal the amount due. On a certificate of analysis, the analysis date shouldn’t precede the sample collection date. These cross-field checks catch a category of errors that field-level validation misses.
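The invoice arithmetic checks can be sketched as follows; the field names and the 0.01 tolerance are assumptions, and a real implementation would need a currency-aware tolerance.

```python
def consistency_errors(inv: dict, tol: float = 0.01) -> list[str]:
    """Check that the invoice's own numbers agree with each other."""
    errors = []
    if abs(sum(inv["line_totals"]) - inv["subtotal"]) > tol:
        errors.append("line totals do not sum to subtotal")
    if abs(inv["subtotal"] + inv["tax"] - inv["amount_due"]) > tol:
        errors.append("subtotal + tax != amount due")
    return errors

ok = {"line_totals": [40.0, 60.0], "subtotal": 100.0, "tax": 20.0, "amount_due": 120.0}
bad = {"line_totals": [40.0, 60.0], "subtotal": 100.0, "tax": 20.0, "amount_due": 125.0}
assert consistency_errors(ok) == []
assert consistency_errors(bad) == ["subtotal + tax != amount due"]
```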
Business rule validation — does the extracted data satisfy rules specific to your domain? A purchase order for a product not in your catalogue is flagged for review. An invoice from a supplier not in your approved vendor list is held. A lab result below a detection limit shouldn’t be stored as a numeric value. These rules are added as additional validators in the schema model.
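The detection-limit rule, for example, can be expressed as a small normaliser; the `"<"` prefix convention and return shape here are assumptions about one plausible representation, not a standard.

```python
def normalise_result(raw: str, detection_limit: float):
    """Return (value, non_detect_flag); non-detects carry no numeric value."""
    if raw.strip().startswith("<"):
        return (None, True)          # reported as below the detection limit
    value = float(raw)
    if value < detection_limit:
        return (None, True)          # numerically below DL: flag, don't store as a number
    return (value, False)

assert normalise_result("<0.5", 0.5) == (None, True)
assert normalise_result("12.3", 0.5) == (12.3, False)
```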
Reference validation — does the extracted value match known reference data? A supplier name should match an entity in your supplier registry. An HS code should exist in the current tariff schedule. A laboratory accreditation number should resolve against the accreditation body’s register. Reference validation catches extraction errors that produce plausible but incorrect values.
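At its simplest this is a normalised membership lookup against the registry; the supplier names below are invented, and real pipelines typically layer fuzzy matching on top of the exact check.

```python
# Hypothetical supplier registry, pre-normalised to lowercase.
SUPPLIER_REGISTRY = {"acme ltd", "globex corporation"}

def matches_registry(name: str) -> bool:
    # Collapse whitespace and case before the lookup so trivial
    # formatting differences don't cause false mismatches.
    return " ".join(name.lower().split()) in SUPPLIER_REGISTRY

assert matches_registry("Acme  Ltd")
assert not matches_registry("Acme Limited")  # plausible extraction, not in the registry
```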
Validation failures and routing
When validation fails, there are two options: reject the record or route it for human review. The right choice depends on the severity and the downstream consequence.
Hard failures — missing required fields, type mismatches, schema violations — should stop the record from proceeding. A record with no invoice total shouldn’t enter an accounts payable system.
Soft failures — range anomalies, consistency warnings, reference mismatches — are better handled by routing to a human-in-the-loop review queue. A line item total that’s slightly inconsistent with the sum might be a rounding difference or an extraction error. A human can resolve this in seconds; stopping the record entirely is too aggressive.
Confidence scoring complements validation. A high-confidence extraction that passes all validation checks proceeds automatically. A low-confidence extraction that also fails a consistency check routes to review with both signals available to the reviewer.
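Putting the routing logic together — hard failures reject, soft failures or low confidence route to review, everything else proceeds — might look like this sketch (the 0.9 threshold is an arbitrary placeholder):

```python
def route(hard_failures: list, soft_failures: list,
          confidence: float, threshold: float = 0.9) -> str:
    """Decide where a record goes after extraction and validation."""
    if hard_failures:
        return "reject"      # schema violations, missing required fields
    if soft_failures or confidence < threshold:
        return "review"      # human-in-the-loop, with both signals attached
    return "auto"            # proceeds downstream without intervention

assert route([], [], 0.97) == "auto"
assert route([], ["line sum mismatch"], 0.97) == "review"
assert route(["missing total"], [], 0.99) == "reject"
assert route([], [], 0.60) == "review"   # low confidence alone is enough
```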
Validation in practice
In the water consultancy pipeline I built, validation is applied at two levels.
Field-level validation checks each extracted result: is it a number? Is the unit a recognised unit for this parameter? Is the value within a plausible range for this parameter and sample type? Is the non-detect flag handled correctly?
Cross-field validation checks the record as a whole: are analysis dates after sample collection dates? Are there any parameters reported with values below their detection limits (a data quality flag)? Do the sample IDs match the pattern expected from this laboratory?
Records that fail field-level validation on required fields are rejected and logged. Records that fail cross-field validation, or where any field is below the confidence threshold, go to the review queue. Nothing goes downstream until automated validation clears the record or a person has reviewed it.
The validation layer is where most detectable errors are caught. After two years of daily operation, the patterns of validation failures have informed improvements to both extraction rules and validation thresholds — the validation log is a quality improvement tool, not just a gating mechanism.
Related concepts
- What is Schema-First Extraction? — how output constraints are defined before extraction begins
- What is Confidence Scoring in Document Extraction? — how extraction uncertainty is quantified
- What is Human-in-the-Loop Document Processing? — what happens when validation flags an issue
- What is a Document Extraction Pipeline? — where validation fits in the overall pipeline
