AWS Textract is Amazon’s managed document extraction service. It reads text, tables, and form fields from PDFs and images, and integrates directly into AWS infrastructure — S3, Lambda, Comprehend. For organisations already in the AWS ecosystem processing standard document types, it is a reasonable place to start. For documents that deviate from the expected, or for operations running at volume, the limits show up quickly.
What AWS Textract does well#
Before the alternatives, the genuine strengths are worth stating clearly.
Table and form extraction. The Queries API lets you ask specific questions about a document — “What is the invoice total?”, “What is the delivery date?” — and Textract returns the answer with a confidence score. For well-structured forms and tables, this works.
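As a sketch of what consuming those answers looks like: `AnalyzeDocument` returns query results as `QUERY` blocks linked to `QUERY_RESULT` blocks via `ANSWER` relationships. The helper below walks that structure; it is run here against a hand-written sample dict that mimics the response shape, not a live API call.

```python
def query_answers(response: dict) -> dict:
    """Map each query alias (or question text) to (answer, confidence)."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        key = block["Query"].get("Alias") or block["Query"]["Text"]
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for rid in rel["Ids"]:
                    result = blocks[rid]
                    answers[key] = (result["Text"], result["Confidence"])
    return answers

# Hand-written sample mimicking the AnalyzeDocument response shape.
sample = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the invoice total?", "Alias": "TOTAL"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT",
         "Text": "$1,240.00", "Confidence": 93.4},
    ]
}
print(query_answers(sample))  # {'TOTAL': ('$1,240.00', 93.4)}
```

Note that the confidence score arrives alongside the answer; what you do with it is entirely up to your pipeline, which becomes relevant below.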
AWS integration. If your documents already land in S3 and your processing runs on Lambda, Textract fits naturally into that architecture. No external API to manage, IAM handles access control, and the Textract + Comprehend combination covers extraction plus basic entity recognition.
Decent OCR on digital PDFs. Textract performs well on clean, digitally generated PDFs. Not all extraction services do.

Managed infrastructure. No model to host, no scaling to manage. AWS handles availability and updates. For teams without MLOps capacity, that matters.
Where it falls short#
The limitations become visible as soon as your documents leave the expected range.
Complex multi-page layouts#
Textract handles single-page forms well. Multi-page documents with variable structure — where a field might appear on page 2 in one version and page 4 in another — are harder. Context doesn’t carry cleanly across page boundaries, and the extraction logic doesn’t adapt to layout shifts.
Domain-specific documents#
Textract’s prebuilt models were trained on representative samples of common document types. Lab reports from water utilities, bills of lading from freight forwarders, proprietary templates from financial services firms — these fall outside that training distribution. Extraction quality drops, and the API doesn’t always signal clearly when it has.
The Custom Adapter feature added in 2023 allows some fine-tuning. In practice, it requires labelled examples, produces models with the same fundamental constraints as the base models, and adds retraining overhead when layouts change.
Cost at volume#
Textract pricing is per page and varies by feature: basic text detection is priced at roughly $1.50 per 1,000 pages, while the analysis features used for forms, tables, and queries cost an order of magnitude more per page (check current pricing for your region and feature set). At low volumes that’s negligible. At tens of thousands of pages per month, it becomes significant — and the economics of a custom document extraction pipeline often look better at that scale.
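The arithmetic is worth doing explicitly for your own volumes. The numbers below are illustrative only: $15 per 1,000 pages is a plausible analysis-tier list price, not a quote.

```python
def monthly_api_cost(pages_per_month: int, price_per_1000: float) -> float:
    """Linear per-page API spend for one month."""
    return pages_per_month / 1000 * price_per_1000

# 50,000 pages/month through a hypothetical analysis API priced
# at $15 per 1,000 pages:
print(monthly_api_cost(50_000, 15.0))  # 750.0 (dollars/month)
```

At that monthly spend, comparing against the amortised build and hosting cost of a fixed-infrastructure pipeline is a straightforward spreadsheet exercise rather than a guess.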
No built-in confidence routing#
Textract returns confidence scores on extracted fields. What it doesn’t do is route low-confidence results anywhere. If the extraction is uncertain, you get a low-confidence value back — and whatever comes next in your pipeline receives it without knowing it should be treated differently.
Building human-in-the-loop processing on top of Textract is possible, but it’s entirely your logic to write. That’s not always a problem, but it’s a cost to account for.
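A minimal version of that routing logic might look like the sketch below. The `Field` shape and the 90% threshold are illustrative choices, not anything Textract provides; the point is only that the split has to live in your code.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    value: str
    confidence: float  # 0-100, as Textract reports it

def route(fields: list[Field], threshold: float = 90.0):
    """Split extracted fields into auto-accept and human-review buckets."""
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    return accepted, review

fields = [
    Field("invoice_total", "$1,240.00", 96.2),
    Field("delivery_date", "2024-03-07", 71.5),
]
accepted, review = route(fields)
print([f.name for f in review])  # ['delivery_date']
```

Everything downstream of `route` (the review queue, the reviewer UI, the re-injection of corrected values) is likewise yours to build.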
Silent failures on edge cases#
The deeper issue with any managed extraction service is that errors can pass silently downstream. A field extracted incorrectly with a moderate confidence score looks identical to a field extracted correctly. Without field-level confidence scoring baked into your workflow — not just API scores, but scored against business rules — you don’t know which is which until something breaks downstream.
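One way to add that business-rule layer: validate each extracted value against a pattern it must satisfy, independently of the confidence number the API attached to it. The field names and patterns below are hypothetical.

```python
import re

# Hypothetical business rules: each field must match a format rule,
# regardless of the API's confidence score.
RULES = {
    "invoice_total": re.compile(r"\$[\d,]+\.\d{2}"),
    "delivery_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

def rule_check(name: str, value: str) -> bool:
    """True only if the field has a rule and the value satisfies it fully."""
    pattern = RULES.get(name)
    return bool(pattern and pattern.fullmatch(value))

# A high-confidence extraction can still fail the rule check:
print(rule_check("invoice_total", "$1,240.00"))  # True
print(rule_check("delivery_date", "March 7th"))  # False
```

A value that passes the rule check and carries a high API score can flow through; anything that fails either test is the "which is which" signal the paragraph above describes.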
The alternatives#
Azure Document Intelligence#
Microsoft’s equivalent, formerly Form Recognizer. Similar prebuilt models for standard types (invoices, receipts, IDs), similar custom training options. Strong handwriting recognition.
Worth evaluating if you’re in Azure infrastructure or if handwritten forms are part of your document mix. The same edge-case limitations apply. See also: Azure Document Intelligence alternatives for a fuller comparison.
Google Document AI#
Google’s offering. In some benchmarks, its form parsing and table extraction perform better than the other managed services, particularly on structured documents. The Workbench product allows custom model training.
Worth testing if you’re in GCP or if table-heavy documents — multi-column layouts, nested tables — are your primary challenge.
Open-source OCR + custom pipeline#
Tools like pdfplumber and PyMuPDF handle text and layout extraction from PDFs without an external API. Tesseract covers OCR for scanned documents. You write the extraction logic yourself.
This approach has the highest build cost and the most control. It makes sense when your document layouts are specific enough that prebuilt models can’t cover them, when volume makes per-page pricing unworkable, or when your extraction logic needs to be auditable and maintainable without vendor dependency.
Nanonets / Docsumo#
Smaller managed IDP products with stronger domain-specific model support and lower barrier to custom training. Worth evaluating for teams that want more configurability than the hyperscaler platforms but don’t want to build from scratch. Both have per-page pricing models.
Custom pipeline with selective LLMs#
This is the approach I use in production. The baseline is schema-first extraction — every field defined before extraction starts, with regex and structural rules as the first pass. LLMs are introduced selectively, only where layout variation genuinely makes rules insufficient. Every field gets a confidence score, and anything below threshold is routed to a human reviewer before it moves downstream.
For a water consultancy client, this approach reduced document processing from weeks of manual work to minutes, and the pipeline has been running in production for two years without fundamental rework. The key is that the system knows what it doesn’t know — uncertain results don’t propagate silently. That’s the part managed platforms make difficult.
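A compressed sketch of that schema-first first pass follows. The field names and patterns are invented for illustration (the real schema is domain-specific); the structural idea is that every field is declared up front and anything the rules cannot match is returned as `None` for the LLM or human tier to handle.

```python
import re
from typing import NamedTuple, Optional

class FieldSpec(NamedTuple):
    name: str
    pattern: re.Pattern  # structural/regex first pass

# Schema defined before extraction starts (hypothetical fields).
SCHEMA = [
    FieldSpec("sample_id", re.compile(r"Sample ID:\s*([A-Z]{2}-\d{4})")),
    FieldSpec("ph_value", re.compile(r"pH\s*[:=]\s*(\d+(?:\.\d+)?)")),
]

def extract(text: str) -> dict:
    """Regex first pass; unmatched fields come back None and escalate."""
    out: dict[str, Optional[str]] = {}
    for spec in SCHEMA:
        m = spec.pattern.search(text)
        out[spec.name] = m.group(1) if m else None
    return out

doc = "Sample ID: WX-1042 ... pH = 7.2"
print(extract(doc))  # {'sample_id': 'WX-1042', 'ph_value': '7.2'}
```

Because the schema is explicit, a `None` is an auditable event rather than a silently wrong value, which is the property the paragraph above is about.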
How to decide#
| Criterion | AWS Textract | Custom Pipeline |
|---|---|---|
| Standard document types | Works well | Overkill |
| High layout variation | Breaks at edges | Handles it |
| Domain-specific documents | Needs Custom Adapter | Built for this |
| Silent failures acceptable | Manageable | Not recommended |
| Control over failure modes | Limited | Full control |
| Time to first result | Days | Weeks |
| Cost at high volume | Per-page, scales linearly | Fixed infrastructure |
| Ongoing maintenance | Platform-managed | Your team or contractor |
A useful rule of thumb: if all your documents look like textbook examples of a common type, a managed platform is probably fine. If a meaningful proportion are domain-specific, have variable layouts, or feed downstream decisions where errors are costly — you need more control than Textract provides.
The real question#
The choice between Textract and a custom pipeline is not primarily about which has better OCR. It’s about what happens when extraction fails.
With Textract, an incorrect extraction with a moderate confidence score passes silently into whatever comes next. Your downstream process — the database record, the report, the compliance log — contains a wrong value and doesn’t know it. That’s a manageable risk if your documents are standard and your stakes are low. It’s a serious problem if your extracted data feeds operational or financial decisions.
If you’re not sure which category your documents fall into, the fastest way to find out is to run your actual documents — the awkward ones, not the clean samples — through the platform. The edge cases make the decision obvious. If you want help running that test systematically, that’s what a diagnostic session is for.
Book a Diagnostic Session →