Customs Declaration Data Extraction: Automating Import and Export Documentation

Subhajit Bhar
I build production-grade document extraction pipelines for businesses that process invoices, lab reports, contracts, and other document types at scale.

Customs declarations are among the most error-sensitive documents in logistics. A wrong tariff code or an incorrectly extracted commodity value can trigger delays, fines, or hold actions. At the same time, import/export operations process hundreds or thousands of declarations per month, and the manual effort of verifying and entering data from these documents is substantial.

Extracting data from customs documents reliably — at scale, across multiple origin countries and document formats — requires a pipeline designed specifically for this document type.


What customs declaration data extraction involves

The core document types in customs operations each have distinct extraction challenges:

Single Administrative Documents (SAD / C88) are the standard EU/UK customs entry form. They have a fixed field structure defined by the TARIC/CDS system — box numbers correspond to defined fields. The structure is consistent, which makes rules-based extraction viable. The challenge is that SADs arrive both as digital PDFs and as scanned paper documents, and scanned quality varies considerably.

Bills of Lading (BoL) describe the shipment from the carrier’s perspective: shipper, consignee, port of loading, port of discharge, container numbers, and goods description. These are issued by shipping lines, each with their own format. Maersk, MSC, and CMA CGM all produce BoLs with different layouts. The goods description is often a free-text field with no structure.

Commercial invoices in a customs context need fields beyond the standard invoice extraction: country of origin, HS/commodity code, incoterms, and the declared customs value (which may differ from the commercial price). These fields aren’t always present or consistently labelled.

Packing lists itemise the contents of each package, with quantities, weights, and dimensions. They’re often tabular but with varying table structures. The challenge is associating each line item correctly with the corresponding invoice line.

Certificates of origin certify where goods were manufactured. They have a defined structure (EUR.1, Form A, GSP certificates) but arrive from multiple issuing bodies with slightly different layouts.


The specific challenges

Tariff code accuracy is critical. A Harmonised System (HS) code incorrectly extracted — even a transposition of two digits — changes the duty rate applied. Unlike a date extraction error, which may be obvious, a plausible-looking wrong HS code can pass through without triggering obvious validation failures. The extraction system needs HS code validation against the applicable tariff schedule, not just format checking.

Multilingual fields. International shipments involve documents from multiple countries. A goods description on a BoL might be in Chinese, German, or Spanish. Extraction systems need to handle multilingual content, at minimum preserving the original text accurately even when translation isn’t performed.

Incoterms affect customs value calculation. The declared customs value depends on which Incoterm was agreed — DDP, DAP, CIF, FOB all produce different dutiable values for the same shipment. Extracting the incoterm correctly is a prerequisite for correct customs value derivation.
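The incoterm-to-value relationship can be sketched in a few lines. This is illustrative only and assumes a regime (such as the EU/UK) that values goods on a CIF basis; real valuation rules have many more cases (DDP duty deduction, assists, royalties), and the incoterm grouping below is a simplification.

```python
from decimal import Decimal

# Incoterms under which the invoice price already includes main carriage
# (freight + insurance). Simplified grouping for illustration; DDP prices
# also include duty, which real valuation logic would deduct.
FREIGHT_INCLUSIVE = {"CIF", "CIP", "DDP", "DAP", "DPU"}

def cif_basis_value(invoice_value: Decimal, incoterm: str,
                    freight: Decimal = Decimal("0"),
                    insurance: Decimal = Decimal("0")) -> Decimal:
    """Derive a CIF-basis customs value from the invoice price and incoterm."""
    if incoterm.upper() in FREIGHT_INCLUSIVE:
        return invoice_value
    # EXW/FCA/FOB etc.: the price excludes main carriage, so add it back
    return invoice_value + freight + insurance
```

The point the sketch makes concrete: the same shipment with the same invoice total produces different dutiable values depending on the extracted incoterm, so a misread incoterm silently shifts the duty base.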

Container and reference number formats vary. ISO 6346 container numbers have a check digit. BIC owner codes (the first four characters of a container number) identify the container operator. IMO numbers identify vessels. These identifiers have defined formats that can be validated — but only if the extraction system knows what it’s looking for. A system that extracts them as free text has no way to detect transcription errors.
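The ISO 6346 check digit is cheap to verify at extraction time. A minimal sketch: letters map to 10–38 (skipping multiples of 11), each of the first ten characters is weighted by a power of two, and the sum modulo 11 modulo 10 must equal the eleventh character.

```python
import string

def _iso6346_letter_values() -> dict:
    # ISO 6346 maps A..Z to 10..38, skipping multiples of 11 (11, 22, 33)
    values, v = {}, 10
    for ch in string.ascii_uppercase:
        while v % 11 == 0:
            v += 1
        values[ch] = v
        v += 1
    return values

_LETTER_VALUES = _iso6346_letter_values()

def iso6346_valid(container_number: str) -> bool:
    """Validate an ISO 6346 container number (4 letters + 6 digits + check digit)."""
    cn = container_number.strip().upper().replace(" ", "")
    if len(cn) != 11 or not cn[:4].isalpha() or not cn[4:].isdigit():
        return False
    # Weight each of the first 10 characters by 2**position
    total = sum(
        (_LETTER_VALUES[c] if c.isalpha() else int(c)) * (2 ** i)
        for i, c in enumerate(cn[:10])
    )
    return total % 11 % 10 == int(cn[10])
```

A single transposed digit fails this check, which is exactly the class of OCR error that free-text extraction lets through.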

Volume creates pressure for automation. A customs broker or freight forwarder processing 500 declarations per week can’t afford manual verification of every field. The extraction system needs to be reliable enough that humans review exceptions, not every record.


The schema for customs declaration extraction

from pydantic import BaseModel, Field, validator
from datetime import date
from decimal import Decimal
from typing import Optional, List

class CustomsLineItem(BaseModel):
    line_number: int
    commodity_description: str
    hs_code: str = Field(..., description="HS/commodity code as declared")
    country_of_origin: str = Field(..., description="ISO 3166-1 alpha-2 country code")
    quantity: Decimal
    unit_of_measure: str
    gross_weight_kg: Optional[Decimal] = None
    net_weight_kg: Optional[Decimal] = None
    customs_value: Decimal
    currency: str

    # Pydantic v1-style validator; in Pydantic v2, use @field_validator instead
    @validator("hs_code")
    def hs_code_format(cls, v):
        # Format check: 6 digits minimum, 10 digits maximum
        cleaned = v.replace(".", "").replace(" ", "")
        if not cleaned.isdigit() or not 6 <= len(cleaned) <= 10:
            raise ValueError(f"HS code format invalid: {v}")
        return cleaned

class CustomsDeclaration(BaseModel):
    document_type: str  # SAD, BoL, commercial_invoice, packing_list, certificate_of_origin
    declaration_number: Optional[str] = None
    declaration_date: Optional[date] = None
    declarant_name: str
    declarant_reference: Optional[str] = None
    exporter_name: str
    exporter_country: str
    importer_name: str
    importer_country: str
    port_of_loading: Optional[str] = None
    port_of_discharge: Optional[str] = None
    incoterms: Optional[str] = None
    container_numbers: Optional[List[str]] = None
    total_gross_weight_kg: Optional[Decimal] = None
    total_customs_value: Decimal
    currency: str
    line_items: List[CustomsLineItem]

The @validator on hs_code catches format errors immediately. An HS code that’s four digits, contains letters, or is otherwise malformed fails validation at extraction time, not after it’s been submitted to a customs authority.


Extraction approach by document type

SADs and structured customs forms are well-suited to rules-based extraction. The field numbering system provides reliable anchors — Box 1 is always declaration type, Box 8 is always consignee, Box 33 is always commodity code. Rules tied to box numbers are stable across template variations.
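Box-anchored rules can be as simple as a mapping from box number to schema field. The patterns below assume text rendered as `Box <n>: <value>` lines, which is a simplification; real SAD renderings vary, so production rules are tuned per template family.

```python
import re

# Illustrative subset of the SAD box-to-field mapping
SAD_BOX_FIELDS = {
    "1": "declaration_type",
    "8": "consignee",
    "33": "commodity_code",
}

def extract_sad_boxes(text: str) -> dict:
    """Pull box values from text rendered as 'Box <n>: <value>' lines."""
    extracted = {}
    for box, field in SAD_BOX_FIELDS.items():
        match = re.search(rf"Box\s+{box}\b\s*[:.]\s*(.+)", text)
        if match:
            extracted[field] = match.group(1).strip()
    return extracted
```

Because the box numbering is defined by the form itself, these anchors survive layout changes that would break position-based extraction.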

Bills of Lading require per-carrier extraction profiles. The shipper, consignee, and port fields are in different positions across different shipping line templates. A carrier identification step routes each BoL to the right extraction profile. Where the carrier isn’t identified, LLM extraction provides a fallback with lower confidence.
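The routing step can be sketched as a signature match on the document header. The profile names and confidence numbers below are hypothetical placeholders, not a real carrier registry.

```python
# Hypothetical carrier signatures mapped to extraction profile names
CARRIER_SIGNATURES = {
    "MAERSK": "maersk_profile",
    "MSC": "msc_profile",
    "CMA CGM": "cma_cgm_profile",
}

def route_bol(header_text: str) -> tuple:
    """Return (extraction_profile, base_confidence) for a BoL header."""
    upper = header_text.upper()
    for signature, profile in CARRIER_SIGNATURES.items():
        if signature in upper:
            return profile, 0.95   # known template: high base confidence
    return "llm_fallback", 0.75    # unknown carrier: LLM with lower confidence
```

Carrying the lower base confidence through to the review queue means BoLs from unrecognised carriers are disproportionately sampled for human checking, which is the behaviour you want.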

Commercial invoices from international suppliers combine invoice extraction with customs-specific field extraction. The challenge is that fields like HS code and country of origin may be in a separate column, a footer, or embedded in the goods description text. LLM extraction handles the variable cases, with rules for suppliers who use consistent templates.

Scanned documents require OCR first. Customs document OCR quality matters for accuracy: a misread digit in a container number or HS code has downstream consequences. For scanned customs documents, higher OCR confidence thresholds and stricter validation are warranted than for general document extraction.


Confidence scoring for customs data

Confidence scoring in customs extraction must account for the cost of errors. Fields with direct regulatory or financial consequences need higher thresholds:

  • HS codes: validate format, check against known tariff schedule, cross-check against goods description
  • Customs value: cross-check against line item totals, verify currency consistency
  • Country of origin: validate against ISO 3166-1, check plausibility against supplier country
  • Container numbers: validate ISO 6346 check digit

Fields where errors are less consequential — goods description text, packaging details — can use lower thresholds.
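This tiering is naturally expressed as a per-field threshold table. The numbers here are illustrative assumptions; in practice they are tuned against observed error rates and the cost of each error class.

```python
# Illustrative per-field review thresholds (assumed values, not tuned)
FIELD_THRESHOLDS = {
    "hs_code": 0.98,                # regulatory consequence: review unless near-certain
    "customs_value": 0.97,
    "country_of_origin": 0.95,
    "container_numbers": 0.95,
    "commodity_description": 0.80,  # lower stakes: free text
}
DEFAULT_THRESHOLD = 0.90

def needs_review(field: str, confidence: float) -> bool:
    """Flag a field for human review when confidence is below its threshold."""
    return confidence < FIELD_THRESHOLDS.get(field, DEFAULT_THRESHOLD)
```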

The human-in-the-loop review queue for customs extraction is primarily a compliance tool, not just a quality tool. Fields flagged for review should be resolved before submission to the customs authority, not after.


FAQ

Can you automate customs declaration data entry with IDP?

Yes, with a pipeline designed for the specific document types your operation processes. SADs and structured forms are the most tractable — rules-based extraction achieves high accuracy on standard fields. BoLs and commercial invoices from multiple origins require per-template extraction profiles and LLM extraction for variable fields. The key constraint is that validation must be rigorous: errors in HS codes and customs values have direct compliance and financial consequences.

What is the hardest field to extract from customs documents?

HS codes are the highest-risk extraction target. They look like simple numeric codes, but a wrong code — including a plausible-looking one — changes the duty classification. Format validation catches obvious errors; semantic validation (does this HS code plausibly match this goods description?) requires additional logic or domain knowledge.
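A thin version of that semantic check can be sketched as a chapter-to-keyword lookup. The chapters and keywords below are a tiny hand-picked sample for illustration; a production system would use the full tariff schedule or a trained classifier.

```python
# Illustrative only: 2-digit HS chapters mapped to expected description keywords
HS_CHAPTER_KEYWORDS = {
    "61": {"apparel", "knitted", "shirt", "sweater"},
    "84": {"machinery", "machine", "pump", "engine"},
    "85": {"electrical", "electronic", "cable", "battery"},
}

def hs_plausible(hs_code: str, description: str):
    """Return True/False when the chapter is known, None when it isn't."""
    keywords = HS_CHAPTER_KEYWORDS.get(hs_code[:2])
    if keywords is None:
        return None  # chapter not covered: can't judge, route to review
    words = set(description.lower().split())
    return bool(keywords & words)
```

The three-valued return matters: an implausible match and an un-assessable match should both reach review, but they are different signals to the reviewer.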

How do you handle customs documents in multiple languages?

For fields with defined positions or box numbers (SADs, structured forms), language doesn’t significantly affect extraction. For free-text fields like goods descriptions in BoLs, preserve the original text first, then apply translation if needed downstream. Attempting to extract and translate simultaneously introduces additional error risk.

What happens if customs declaration data is extracted incorrectly?

Incorrect data submitted to a customs authority can result in incorrect duty assessment, customs holds, penalties for misdeclaration, or delays to shipment clearance. This is why customs extraction pipelines must fail loudly rather than silently — an extraction the system isn’t confident about should go to human review, not be submitted automatically.

Is Azure Document Intelligence or AWS Textract suitable for customs declaration extraction?

Both can handle structured fields from standard forms adequately. Neither has built-in understanding of HS codes, customs value rules, or the per-carrier variation in BoL formats. For reliable production use, you typically need custom extraction logic on top of — or instead of — the off-the-shelf models.

