A mid-size law firm might review two hundred NDAs in a year. Each one has the same core fields — parties, effective date, governing law, notice period, confidentiality obligations — but those fields appear in different places, with different phrasing, in different structures. Some are two pages. Some are fourteen. Some have amendments attached. One paralegal manually extracting key terms from each document spends hours per week on work that produces a spreadsheet no one fully trusts.
A mid-sized water consultancy typically works with anywhere from five to twenty testing laboratories. Each one returns results in its own format. Nitrate concentration might appear in the third column of a table on page two of one lab’s report, and as a value embedded in a narrative paragraph on page one of another’s. Units shift between mg/L and µg/L. Sample IDs follow different naming conventions. Detection limits are sometimes in a dedicated column, sometimes appended as footnotes, sometimes absent entirely when a result exceeds the reporting threshold.
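The unit drift between mg/L and µg/L can be handled once the value and unit have been pulled out of a report. The following is a minimal sketch, not a definitive implementation: the function name `to_mg_per_l` and the list of accepted unit spellings are assumptions for illustration, and real lab reports may use other spellings.

```python
from decimal import Decimal

# Conversion factors to mg/L. The accepted spellings are an assumption;
# extend the table to match what your labs actually write.
_TO_MG_PER_L = {
    "mg/l": Decimal("1"),
    "\u00b5g/l": Decimal("0.001"),  # micro sign
    "\u03bcg/l": Decimal("0.001"),  # Greek small letter mu
    "ug/l": Decimal("0.001"),       # ASCII fallback some labs use
}


def to_mg_per_l(value, unit):
    """Convert a concentration to mg/L (hypothetical helper).

    `value` may be a string or number; Decimal avoids float rounding noise.
    Raises ValueError on an unrecognised unit instead of guessing.
    """
    key = "".join(unit.split()).lower()  # drop whitespace, ignore case
    if key not in _TO_MG_PER_L:
        raise ValueError(f"Unrecognised unit: {unit!r}")
    return Decimal(str(value)) * _TO_MG_PER_L[key]
```

Failing loudly on an unknown unit is a deliberate choice here: a silently mis-scaled nitrate value is harder to catch later than a rejected row.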
There are three ways to approach intelligent document processing: SaaS platforms like Nanonets and Docsumo; cloud provider APIs like Azure Document Intelligence, AWS Textract, and Google Document AI; and custom-built pipelines designed around your specific documents.
Each is genuinely right for different situations. The mistake isn’t choosing the wrong technology — it’s choosing based on what’s easiest to procure rather than what fits the actual documents you need to process.
The script works perfectly on the sample document. You tested it on twenty invoices from your main supplier and it extracted every field correctly. Then the first real batch arrives — invoices from six different suppliers — and half of them fail.
Google Document AI is Google Cloud’s managed document processing service. It uses OCR and machine learning to extract structured data from PDFs, images, and forms. For teams already in the GCP ecosystem, it’s a natural starting point — strong table parsing, solid form extraction, and tight integration with BigQuery and Cloud Storage. Gemini integration extends it to unstructured text where rule-based extraction would otherwise struggle.
If pdfplumber returns empty strings or None on pages that clearly have content, stop before writing more extraction code. The problem almost certainly isn’t your code — it’s the PDF type.
Scanned PDFs are images wrapped in a PDF container. There is no underlying text layer. Only pixels. Every Python PDF library that operates on text — pdfplumber, PyPDF2, even PyMuPDF in text mode — will return nothing useful, because there is nothing to return. You need OCR before any extraction can happen.
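A quick way to check for this before writing more extraction code is to look at how much text each page actually yields. The sketch below assumes pdfplumber's `page.extract_text()`, which returns a string or None; the helper names and the `min_chars` threshold of 20 are assumptions chosen for illustration, and a genuinely blank page would be flagged as well.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is missing or near-empty.

    `page_texts` is a sequence of strings or None, one entry per page.
    `min_chars` is an assumed threshold, not a library default.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        if text is None or len(text.strip()) < min_chars:
            flagged.append(number)
    return flagged


def extract_page_texts(pdf_path):
    """Collect per-page text with pdfplumber (third-party, imported lazily)."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

Calling `pages_needing_ocr(extract_page_texts("report.pdf"))` on a hypothetical file gives the pages to route through OCR first. If every page is flagged, the document is most likely a pure scan.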
Docsumo is a SaaS intelligent document processing platform built specifically for financial and lending workflows. It handles bank statements, tax returns, pay stubs, and utility bills well — with a clean API, a human review interface, and SOC 2 compliance that matters in financial services. For fintechs and lenders processing standard KYC documents, it does what it says.
The naive approach is obvious: take a document, pass it to an LLM, ask for the data you want. It works on clean examples. Ask GPT-4 to extract invoice fields from a well-formatted PDF and you get a clean JSON response that looks exactly right.
Both Azure Document Intelligence and a custom extraction pipeline can work. The question is not which one is better in the abstract — it is which one fits your documents, your accuracy requirements, and your operational context. This article is written for teams who have already done some research and are now trying to make an actual decision.
AWS Textract is Amazon’s managed document extraction service. It reads text, tables, and form fields from PDFs and images, and integrates directly into AWS infrastructure — S3, Lambda, Comprehend. For organisations already in the AWS ecosystem processing standard document types, it is a reasonable place to start. For documents that deviate from the expected, or for operations running at volume, the limits show up quickly.