
Extract Data from Scanned PDFs with Python

Subhajit Bhar
I build production-grade document extraction pipelines for businesses that process invoices, lab reports, contracts, and other document types at scale.

If pdfplumber returns empty strings or None on pages that clearly have content, stop before writing more extraction code. The problem almost certainly isn’t your code — it’s the PDF type.

Scanned PDFs are images wrapped in a PDF container. There is no underlying text layer. Only pixels. Every Python PDF library that operates on text — pdfplumber, PyPDF2, even PyMuPDF in text mode — will return nothing useful, because there is nothing to return. You need OCR before any extraction can happen.


How to detect if a PDF is scanned
#

The quick test: call page.extract_text() in pdfplumber. If it returns an empty string or None for a page that visibly has content, you’re almost certainly dealing with a scanned document.

A more reliable check uses PyMuPDF. Open the PDF with fitz, then for each page call both page.get_text() and page.get_images(). If get_text() returns nothing but get_images() returns one or more image objects, the page is image-based. Text PDFs have text objects. Scanned PDFs have image objects in their place.

Some PDFs are hybrid — they have a mix of scanned and digital pages. Check per page, not per document.
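A sketch of that per-page check (`classify_page` and `classify_pdf` are illustrative names, not a PyMuPDF API; `fitz` is imported inside the function so the classification logic itself has no third-party dependency):

```python
def classify_page(text: str, image_count: int) -> str:
    """Classify a single page from its extracted text and image count."""
    if text.strip():
        return "text"      # real text layer present
    if image_count > 0:
        return "scanned"   # no text, but image objects: image-based page
    return "empty"

def classify_pdf(path: str) -> list[str]:
    """Return a per-page classification for the whole document."""
    import fitz  # PyMuPDF; imported lazily so classify_page stays dependency-free

    doc = fitz.open(path)
    labels = [
        classify_page(page.get_text(), len(page.get_images()))
        for page in doc
    ]
    doc.close()
    return labels
```

Because hybrid documents exist, act on the per-page labels rather than collapsing them into one document-level verdict.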


The OCR pipeline: three steps
#

Once you’ve confirmed the PDF is scanned, you need a three-step pipeline:

Step 1 — Convert PDF pages to images. pdf2image converts each PDF page to a PIL image using Poppler under the hood. PyMuPDF can do the same via page.get_pixmap(), which avoids the Poppler dependency. Both work well. Request at least 300 DPI at this stage — more on why below.
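A minimal rendering sketch for the PyMuPDF route (`render_pages` and `dpi_to_zoom` are illustrative names; PDF user space is 72 units per inch, which is where the `dpi / 72` zoom factor comes from):

```python
def dpi_to_zoom(dpi: int) -> float:
    """PDF user space is 72 units per inch, so the render scale is dpi / 72."""
    return dpi / 72

def render_pages(path: str, dpi: int = 300):
    """Render each page to a pixmap at the requested DPI via PyMuPDF."""
    import fitz  # lazy import keeps dpi_to_zoom testable without PyMuPDF

    matrix = fitz.Matrix(dpi_to_zoom(dpi), dpi_to_zoom(dpi))
    with fitz.open(path) as doc:
        return [page.get_pixmap(matrix=matrix) for page in doc]
```

With pdf2image the equivalent is `convert_from_path(path, dpi=300)`, which returns PIL images directly but requires Poppler to be installed.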

Step 2 — Run OCR on the images. pytesseract is the standard Python wrapper for Tesseract OCR. Pass the PIL image in and you get back either a raw string (image_to_string) or structured data with bounding boxes (image_to_data). The structured output is more useful if you need positional information for field extraction.
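The structured path can be sketched like this; `extract_words` is a hypothetical helper that filters the parallel lists `image_to_data` returns in `Output.DICT` form:

```python
def extract_words(data: dict, min_conf: float = 0.0) -> list[dict]:
    """Filter Tesseract word boxes by confidence.

    `data` has the shape of pytesseract.image_to_data(..., output_type=Output.DICT):
    parallel lists under 'text', 'conf', 'left', 'top', 'width', 'height'.
    Confidence -1 marks non-word layout blocks and is dropped.
    """
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if conf < 0 or not text.strip():
            continue
        if conf >= min_conf:
            words.append({
                "text": text,
                "conf": conf,
                "box": (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i]),
            })
    return words

def ocr_page(pil_image):
    """Run Tesseract on a PIL image and return filtered word boxes."""
    import pytesseract  # lazy import; extract_words above is pure logic

    data = pytesseract.image_to_data(
        pil_image, output_type=pytesseract.Output.DICT
    )
    return extract_words(data, min_conf=40)
```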

Cloud OCR — AWS Textract, Google Cloud Vision, Azure Document Intelligence — is the alternative to local Tesseract. You send the image bytes to an API and get back extracted text, bounding boxes, and in some cases structured tables. These services outperform Tesseract on difficult inputs. More on when to use them below.

Step 3 — Extract structured data from the OCR output. This is where your actual extraction logic runs. The approach is the same as for digital PDFs, just applied to OCR output instead of native text.

This pipeline has clearly defined stages with explicit failure points. Treat each step as independently testable.


OCR quality factors
#

OCR quality sets the ceiling on extraction accuracy. If the OCR output is bad, nothing downstream fixes it.

Scan resolution is the most critical factor. 300 DPI is the practical minimum for reliable Tesseract output. Below 200 DPI you’ll see consistent character-level errors, especially on small fonts. When converting PDF pages to images, specify the DPI explicitly — don’t rely on the default.

Image orientation matters. Tesseract has an orientation and script detection mode (--psm 0), but it’s not always reliable. For production pipelines, add an explicit deskew step before OCR. Libraries like deskew or OpenCV can correct moderate rotation.

Noise and contrast. Scans with speckle noise, shadows, or low contrast produce degraded OCR output. A preprocessing step that applies thresholding (converting to clean black-and-white) before OCR reduces noise and typically improves accuracy. OpenCV provides the tools for this.
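To make the thresholding step concrete, here is a pure-Python sketch of Otsu's method, the global-threshold algorithm OpenCV exposes as `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)`. In production you would call OpenCV rather than this illustration:

```python
def otsu_threshold(histogram: list[int]) -> int:
    """Pick the grayscale threshold that maximizes between-class variance.

    `histogram` is a 256-bin count of pixel intensities (0-255).
    """
    total = sum(histogram)
    sum_all = sum(i * h for i, h in enumerate(histogram))
    sum_b = w_b = 0
    max_var, threshold = 0.0, 0
    for i, h in enumerate(histogram):
        w_b += h                      # background weight
        if w_b == 0:
            continue
        w_f = total - w_b             # foreground weight
        if w_f == 0:
            break
        sum_b += i * h
        m_b = sum_b / w_b             # background mean
        m_f = (sum_all - sum_b) / w_f # foreground mean
        var = w_b * w_f * (m_b - m_f) ** 2
        if var > max_var:
            max_var, threshold = var, i
    return threshold

def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map intensities to clean black-and-white around the threshold."""
    return [255 if p > threshold else 0 for p in pixels]
```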

Investing time in preprocessing pays back in extraction accuracy. A clean image through Tesseract often outperforms a noisy image through a cloud OCR service.


From OCR text to structured data
#

After OCR you have raw text — a string, or a dataframe of words with bounding box coordinates. The extraction approaches from this point are the same as for digital PDFs:

  • Regex for predictable patterns: invoice numbers, dates, totals, postcodes. If the format is consistent, a well-written pattern is reliable and fast.
  • Coordinate-based extraction for layout-stable documents. If a field always appears in roughly the same region of the page, extract by bounding box rather than by content pattern. pytesseract.image_to_data() gives you word-level coordinates. PyMuPDF's OCR mode (page.get_textpage_ocr(), which drives Tesseract) and cloud OCR responses provide positioned text as well.
  • LLM extraction for variable or complex layouts. Pass the OCR text to an LLM with a structured output schema. This handles layout variation that regex and coordinates cannot, at the cost of latency and per-call expense.
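A regex sketch for the first approach, assuming a hypothetical invoice layout (the patterns are illustrative and must be tuned to your own documents):

```python
import re

# Illustrative patterns for a consistent invoice layout; tune to your documents.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.I
    ),
    "date": re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b"),
    "total": re.compile(r"Total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}

def extract_fields(ocr_text: str) -> dict:
    """Pull predictable fields out of raw OCR text with regex."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields
```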

The difference between digital PDFs and scanned PDFs at this stage is noise. OCR output has spelling errors, merged words, incorrect line breaks, and positional uncertainty. Your extraction logic needs to be more tolerant than it would be for clean digital text. Fuzzy matching for field labels, wider bounding box tolerances, and confidence scoring on extraction results all help.
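Fuzzy label matching needs nothing beyond the standard library; difflib's get_close_matches tolerates the character-level substitutions OCR introduces (`EXPECTED_LABELS` and `match_label` are illustrative names):

```python
from difflib import get_close_matches

# The field labels your extraction logic expects to find on the page.
EXPECTED_LABELS = ["Invoice Number", "Due Date", "Total Amount"]

def match_label(ocr_label: str, cutoff: float = 0.7):
    """Map a noisy OCR label (e.g. 'Invo1ce Nunber') to a known field label.

    Returns None when nothing clears the similarity cutoff.
    """
    matches = get_close_matches(ocr_label, EXPECTED_LABELS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```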


Table extraction from scanned PDFs
#

Tables are the hardest case. In a digital PDF, a table has explicit structure — cell boundaries, coordinates, text within cells. After scanning and OCR, that structure is gone. You have text in roughly grid-shaped positions, but the cell boundaries are not explicitly represented.

Camelot and pdfplumber’s lattice mode both require a text layer. They don’t work on scanned documents.

For scanned tables, your options are:

  • Vision-based table detection: detect table borders as image features using OpenCV or a dedicated model (Microsoft Table Transformer, for example), extract cell regions, then OCR each cell individually. This works but requires meaningful engineering effort to make robust.
  • Cloud OCR with table detection: AWS Textract has a table detection mode that returns structured table objects with row and column assignments. Google Document AI does the same. These are purpose-built for this problem and handle many real-world cases reasonably well out of the box.
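As a sketch of what consuming that structured output looks like: Textract's TABLES feature returns CELL blocks carrying 1-based RowIndex and ColumnIndex values, and a small helper can reassemble them into a grid. The input here is a simplified list of cell dicts, not the full Textract response shape:

```python
def cells_to_grid(cells: list[dict]) -> list[list[str]]:
    """Rebuild a 2D table from cells carrying 1-based row/column indices,
    the indexing scheme Textract's TABLES output uses."""
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = c.get("Text", "")
    return grid
```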

For a deeper look at table extraction options across PDF types, see the companion guide on extracting tables from PDFs in Python.


When cloud OCR beats local Tesseract
#

Tesseract is free, runs locally, and is fast enough for many use cases. Use it as the default for clean, high-resolution, Latin-script documents.

Cloud OCR wins on:

  • Low scan quality. Textract and Document AI are trained on a much wider distribution of document quality than Tesseract.
  • Handwriting. Tesseract has minimal handwriting support. Cloud services handle it substantially better, though accuracy varies with handwriting legibility.
  • Complex layouts. Multi-column documents, forms with mixed text and tables, documents with watermarks — cloud OCR handles these more reliably.
  • Non-Latin scripts. Tesseract supports many languages but has clear accuracy gaps on Arabic, Chinese, and other non-Latin scripts in difficult conditions. Cloud OCR is more consistent.

The cost is real: cloud OCR is per-page pricing. For high-volume pipelines, running the numbers matters. A common pattern is to use Tesseract on high-quality scans and route low-confidence pages to cloud OCR — a hybrid approach that balances cost and accuracy.
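The routing decision itself is simple to express (the 80-point threshold and function names are illustrative; the right cutoff depends on your documents and error costs):

```python
def route_page(tesseract_conf: float, threshold: float = 80.0) -> str:
    """Decide whether a page's local OCR result is good enough to keep,
    or whether the page should be re-run through a cloud OCR service."""
    return "keep_local" if tesseract_conf >= threshold else "send_to_cloud"

def route_document(page_confidences: list[float], threshold: float = 80.0):
    """Split a document's page indices into locally-accepted and cloud-bound."""
    keep, cloud = [], []
    for i, conf in enumerate(page_confidences):
        target = keep if route_page(conf, threshold) == "keep_local" else cloud
        target.append(i)
    return keep, cloud
```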


Production considerations
#

A Tesseract call wrapped in a script is not a production pipeline. For anything running at scale:

Preprocessing is non-optional. Build a step before OCR that checks resolution, corrects orientation, and applies noise reduction. Log what you’re doing so you can diagnose OCR failures later.

Per-page error handling. Some pages will be unreadable — water damage, extreme skew, blank pages. Catch failures per page, log them, and continue processing the rest of the document. Crashing on one bad page is not acceptable in production.
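A sketch of that per-page isolation (`process_pages` and `extract_fn` are illustrative names):

```python
import logging

logger = logging.getLogger("ocr_pipeline")

def process_pages(pages, extract_fn):
    """Run extract_fn on every page. A failure on one page is logged and
    recorded, never allowed to abort the rest of the document."""
    results, failures = {}, {}
    for i, page in enumerate(pages):
        try:
            results[i] = extract_fn(page)
        except Exception as exc:  # broad on purpose: isolate per-page faults
            logger.warning("page %d failed: %s", i, exc)
            failures[i] = str(exc)
    return results, failures
```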

Confidence scoring on OCR output. Tesseract returns a word-level confidence score via image_to_data. Aggregate this into a page-level confidence signal. Pages with low OCR confidence should be flagged before extraction even runs — garbage in, garbage out applies here before it applies at the extraction stage.
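A sketch of that aggregation, assuming word confidences pulled from image_to_data (the 75-point flag threshold is illustrative):

```python
def page_confidence(word_confs: list[float]) -> float:
    """Aggregate Tesseract word confidences into a page-level score.

    Entries of -1 mark non-word layout blocks and are excluded.
    """
    scored = [c for c in word_confs if c >= 0]
    return sum(scored) / len(scored) if scored else 0.0

def flag_low_confidence(word_confs: list[float], threshold: float = 75.0) -> bool:
    """True when the page should be flagged before extraction even runs."""
    return page_confidence(word_confs) < threshold
```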

Human-in-the-loop processing for low-confidence output. Define a confidence threshold below which a document goes to a human reviewer rather than being passed downstream as clean data. The threshold depends on the cost of an extraction error in your specific use case.

Scanned PDFs are solvable. The pipeline is more complex than digital PDF extraction, but each step is well-understood. The problems that derail production deployments are usually skipping preprocessing, ignoring confidence signals, and assuming OCR output is cleaner than it is.


