OCR (Optical Character Recognition) is software that converts images containing text — scanned documents, photos of pages, image-based PDFs — into machine-readable characters.
How OCR works
An OCR engine does not read text the way you do. It processes an image and works through several stages to identify what characters are present.
Image preprocessing comes first. The engine cleans up the input: straightening a skewed scan (deskew), reducing noise and speckle (denoise), and converting the image to high-contrast black and white (binarise). The goal is to make character boundaries as clear as possible before the recognition step.
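The binarisation step can be sketched in a few lines. This is a minimal illustration using a plain global threshold on a grid of grayscale values; real pipelines use library routines such as adaptive or Otsu thresholding, but the idea is the same: every pixel becomes either ink or background.

```python
def binarise(pixels, threshold=128):
    """Map each grayscale value (0-255) to 0 (ink) or 255 (background)."""
    return [[0 if p < threshold else 255 for p in row] for row in pixels]

# A faint, noisy stroke becomes a crisp black line on a white background.
# The 3x3 "scan" here is invented for illustration.
scan = [
    [240, 250, 245],
    [90, 60, 110],   # darker row: part of a character stroke
    [235, 248, 252],
]
print(binarise(scan))
```

The fixed threshold is the simplification: production preprocessing picks the threshold per image, or per region, precisely because lighting and contrast vary.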
Character segmentation is next. The engine divides the image into regions, then lines, then individual words or characters. This stage is where complex layouts — multi-column pages, tables, headers in unusual positions — introduce errors.
Recognition is the core step. The engine matches each segmented region against known character shapes and outputs the most likely character. Modern OCR engines use neural networks for this, trained on large volumes of text. Earlier systems used template matching, which was brittle. Neural approaches handle more variation in fonts and image quality but still degrade significantly on handwriting and low-quality scans.
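A toy version of the older template-matching approach shows both how it works and why it is brittle: compare a segmented glyph bitmap against stored templates and pick the closest match. The 3x3 "glyphs" below are invented for illustration; one missing pixel is survivable, but heavier noise or an unseen font quickly breaks this scheme, which is why neural classifiers replaced it.

```python
TEMPLATES = {
    # Hypothetical reference bitmaps, 1 = ink.
    "I": [(0, 1, 0), (0, 1, 0), (0, 1, 0)],
    "L": [(1, 0, 0), (1, 0, 0), (1, 1, 1)],
}

def distance(a, b):
    """Count pixels where two bitmaps disagree (Hamming distance)."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def recognise(glyph):
    """Return the template character with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))

noisy_l = [(1, 0, 0), (1, 0, 0), (1, 1, 0)]  # an 'L' with one pixel missing
print(recognise(noisy_l))
```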
The output is raw text, typically with position information attached to each word or line.
What OCR outputs
OCR gives you text. That is all.
The output is a flat sequence of characters in roughly the order they appear on the page, sometimes with bounding box coordinates indicating where each word sits on the image. There are no labelled fields, no structured records, no understanding of what the text means.
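A sketch of what word-level output typically looks like: each word with a bounding box in image pixels. The field names mirror common engines' output but are illustrative, not any specific engine's schema; note that even recovering reading order is something you do yourself, here by sorting top-to-bottom, then left-to-right.

```python
# Hypothetical word-level OCR output: text plus pixel coordinates.
words = [
    {"text": "Total:",   "left": 40,  "top": 300, "width": 60, "height": 14},
    {"text": "Invoice",  "left": 40,  "top": 50,  "width": 70, "height": 16},
    {"text": "4,250.00", "left": 110, "top": 300, "width": 80, "height": 14},
]

# Recover rough reading order from the coordinates alone.
ordered = sorted(words, key=lambda w: (w["top"], w["left"]))
print(" ".join(w["text"] for w in ordered))
```

This naive sort already fails on multi-column pages, which is exactly the layout problem described above.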
If a page contains an invoice with a total of £4,250.00, OCR will output the string “4,250.00” somewhere in a stream of text. It will not tell you that this is the invoice total. It will not separate it from the supplier name, the line items, or the payment terms. That interpretation is a separate problem entirely.
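Even locating that total in the raw text is custom work. A naive regex keyed on a nearby label is a common first attempt, and it is fragile: it assumes the label survived OCR intact and that the layout put label and value on the same line. The sample text below is invented for illustration.

```python
import re

# Raw OCR output: just a stream of characters, no labelled fields.
ocr_text = "ACME Supplies Ltd\nInvoice 1042\nTotal 4,250.00\nTerms: 30 days"

# Guess that the amount follows the word "Total" -- brittle, but typical
# of what a pure-OCR pipeline has to resort to.
match = re.search(r"Total\s+([\d,]+\.\d{2})", ocr_text)
total = match.group(1) if match else None
print(total)
```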
This matters because OCR is frequently described as “reading” a document. It is not. It is transcribing the visual appearance of characters. Reading — in the sense of understanding meaning and structure — requires something else.
OCR quality factors
Several factors determine how accurately OCR performs on a given document.
Resolution is the most controllable factor. 300 DPI is the widely cited minimum for reliable results on printed text; below that, character edges become ambiguous and error rates rise sharply. Scanning at 300 DPI or above is standard practice for documents intended for automated processing.
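The 300 DPI rule of thumb is easy to check for a given scan: divide the pixel width by the physical page width in inches. The helper below is a trivial sketch, not part of any library.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Estimate a scan's effective resolution in dots per inch."""
    return pixel_width / page_width_inches

# An A4 page is about 8.27 inches wide, so a 2480-pixel-wide scan works
# out to roughly 300 DPI, while a 1200-pixel-wide one falls well short.
print(round(effective_dpi(2480, 8.27)))
print(round(effective_dpi(1200, 8.27)))
```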
Image quality covers noise, contrast, and skew. A clean, flat, high-contrast scan will OCR well. A photo taken at an angle under fluorescent lighting will not. Preprocessing helps, but it cannot fully recover a poor original.
Font type has a large effect. Printed text in standard fonts is the easiest case. Stylised fonts, small print, and degraded photocopies introduce errors. Handwriting is significantly harder — most general-purpose OCR engines are not trained for it, and dedicated handwriting recognition models exist for this reason.
Language matters because the recognition model must know what character set and patterns to expect. Most engines support major languages well. Less common languages and mixed-language documents are more error-prone.
Document complexity affects accuracy at the layout stage. A simple single-column page with no tables is straightforward. A multi-column layout with embedded tables and footnotes will produce a character-accurate but structurally scrambled output that is harder to use downstream.
OCR engines
Several options are available, with meaningful differences in performance and cost.
Tesseract is open source and widely used. It works well on clean, printed, single-column text. It is a reasonable starting point for controlled environments where you can guarantee input quality. On difficult inputs — low resolution, handwriting, complex layouts — accuracy degrades noticeably.
Cloud OCR engines (Google Vision OCR, AWS Textract’s OCR layer, Azure Document Intelligence, Adobe Acrobat) significantly outperform Tesseract on difficult inputs. They are trained on much larger datasets, handle handwriting, and deal better with layout complexity. They cost money per page and require sending documents to a third-party service, which is a relevant consideration if your documents contain sensitive information.
The gap between Tesseract and cloud engines is small on ideal inputs. On real-world documents with variable quality, the gap is large.
OCR vs IDP
OCR is one component in a larger process. Intelligent Document Processing uses OCR as an input and adds extraction logic, validation, and structured output on top.
OCR tells you what text is on the page. IDP tells you what that text means in terms of specific data fields — invoice number, supplier name, line item amounts, document date.
A pipeline built purely on OCR output requires custom code to locate and extract each field, validate the result, and handle the inevitable cases where the OCR got something wrong. IDP platforms and frameworks provide that layer, or a custom pipeline implements it explicitly. Either way, OCR is the start of the process, not the end of it. For a detailed comparison, see OCR vs Intelligent Document Processing.
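A minimal sketch of that custom layer, assuming the invoice-style raw text used earlier: per-field extraction patterns plus a validation pass that flags fields the OCR left unreadable. Field names and patterns are illustrative, not any IDP platform's API.

```python
import re

def extract_fields(text):
    """Pull labelled fields out of raw OCR text with per-field patterns."""
    patterns = {
        "invoice_number": r"Invoice\s+(\d+)",
        "total": r"Total\s+([\d,]+\.\d{2})",
    }
    return {name: (m.group(1) if (m := re.search(p, text)) else None)
            for name, p in patterns.items()}

def validate(fields):
    """Return the names of required fields that could not be extracted."""
    return [name for name, value in fields.items() if value is None]

ocr_text = "ACME Supplies Ltd\nInvoice 1042\nTotal 4,250.00"
fields = extract_fields(ocr_text)
print(fields, validate(fields))
```

Every new document layout means new patterns, and every OCR misread lands in the validation bucket; this maintenance burden is what IDP platforms exist to absorb.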
When OCR alone is enough
There are cases where raw text output is exactly what you need, with no further extraction required.
Full-text search indexing is the clearest example. If you want users to search a library of scanned documents by keyword, you need the text. You do not need structure. OCR output fed into a search index solves this problem directly.
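The search case really is this simple: an inverted index mapping each term to the documents containing it is all the "structure" keyword search needs, and raw OCR text feeds it directly. The document IDs and text below are invented for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: ocr_text}. Returns {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    "scan_001": "Annual report 1987",
    "scan_002": "Board meeting minutes 1987",
}
index = build_index(docs)
print(sorted(index["1987"]))
```

A production index adds tokenisation, stemming, and ranking, but none of that requires knowing what any field on the page means.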
Archive digitisation where the goal is to preserve and make accessible the text content of historical documents — without any need to parse specific fields from them — is another case where OCR output is sufficient.
Inputs to downstream NLP that only need a text string are a third case. If a language model or text classifier is going to process the content, it typically needs raw text. OCR provides that, and the NLP layer then handles whatever interpretation is needed.
Outside these cases, if you need specific data fields from a document, you need more than OCR. See how to build a document extraction pipeline or how to extract data from scanned PDFs in Python for where to go next.
