
pdfplumber vs PyMuPDF vs PyPDF2 for PDF Extraction

Subhajit Bhar
I build production-grade document extraction pipelines for businesses that process invoices, lab reports, contracts, and other document types at scale.

If you’re extracting data from PDFs in Python, you’ll encounter three libraries repeatedly: pdfplumber, PyMuPDF (imported as fitz), and PyPDF2. They overlap in capability but differ in what they’re optimised for.

Picking the wrong one costs time. Here’s how to pick the right one.


Quick comparison

| | pdfplumber | PyMuPDF (fitz) | PyPDF2 |
|---|---|---|---|
| Text extraction | ✓ Good | ✓ Excellent | ✓ Basic |
| Table extraction | ✓ Built-in | ✗ Manual | ✗ None |
| Layout / coordinates | ✓ Rich | ✓ Rich | ✗ Limited |
| Image extraction | ✗ Limited | ✓ Excellent | ✗ None |
| Speed | Medium | Fast | Fast |
| Scanned PDFs | ✗ Text PDFs only | ✗ Text PDFs only | ✗ Text PDFs only |
| Maintenance | Active | Active | Limited |
| License | MIT | AGPL / commercial | BSD |

pdfplumber

pdfplumber is built on top of pdfminer.six and exposes detailed layout information for every character, word, and line on the page. Its main strength is table extraction.

Text extraction

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

extract_text() returns text in reading order: left-to-right, top-to-bottom. For single-column, text-based PDFs this is reliable; on multi-column pages the columns can end up interleaved line by line.

Table extraction

pdfplumber’s table detection algorithm looks for lines (rules) in the PDF to identify table boundaries. When tables have visible borders, it works well:

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)
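extract_tables() hands back each table as a list of rows, and each row as a list of cell strings, with None for empty cells. A minimal sketch of turning that into records, assuming the first row is a header row (an assumption that real invoices often break). `table_to_records` is a hypothetical helper, not part of pdfplumber:

```python
from typing import Optional

Row = list[Optional[str]]


def table_to_records(table: list[Row]) -> list[dict[str, str]]:
    """Turn one table from extract_tables() into dicts keyed by header cell."""
    if not table:
        return []
    # Empty cells arrive as None; normalise them to "" and trim whitespace.
    clean = [[(cell or "").strip() for cell in row] for row in table]
    header, *body = clean
    records = []
    for row in body:
        if not any(row):  # skip rows where every cell is empty
            continue
        records.append(dict(zip(header, row)))
    return records


# Sample data shaped like the output of page.extract_tables()[0]
sample = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.99"],
    [None, None, None],
    ["Gadget", None, "4.50"],
]
records = table_to_records(sample)
```

Keeping this step separate from the extraction call means it can be unit-tested without a PDF on disk.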

For borderless tables, you can tune the extraction settings:

import pdfplumber

table_settings = {
    "vertical_strategy": "text",    # use text alignment, not lines
    "horizontal_strategy": "text",
    "intersection_tolerance": 5,
}

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables(table_settings)
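When you don't know in advance whether a document's tables have borders, one option is to try the line-based strategy first and fall back to text alignment only if nothing is found. This is a sketch of that idea, not a pdfplumber feature: `extract_tables_with_fallback` is a hypothetical helper, and it accepts any object that exposes `extract_tables(settings)` the way a pdfplumber page does.

```python
# "lines" is pdfplumber's default strategy; "text" infers edges from word alignment.
LINES = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
TEXT = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def extract_tables_with_fallback(page):
    """Return (tables, strategy_used), preferring ruled-line detection."""
    tables = page.extract_tables(LINES)
    if tables:
        return tables, "lines"
    # No ruled tables found: retry using text alignment.
    return page.extract_tables(TEXT), "text"
```

The text strategy is more prone to false positives on ordinary paragraphs, so returning the strategy name lets downstream code treat those results with more suspicion.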

Coordinate-based extraction

Every word has position metadata you can use to extract by region:

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    # Crop to a bounding box (x0, top, x1, bottom)
    region = page.crop((0, 100, 300, 200))
    text = region.extract_text()

This is useful for documents where a field always appears in the same location.
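The same idea works directly on word metadata. page.extract_words() returns one dict per word with "text", "x0", "x1", "top", and "bottom" keys, so a region filter can be written as a plain function. The sketch below uses hand-written sample words in that shape; `words_in_bbox` and `field_text` are hypothetical helpers, and the coordinates are made up for illustration.

```python
def words_in_bbox(words, bbox):
    """Keep words that lie fully inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return [
        w for w in words
        if w["x0"] >= x0 and w["x1"] <= x1
        and w["top"] >= top and w["bottom"] <= bottom
    ]


def field_text(words, bbox):
    """Join the words inside bbox, ordered top-to-bottom then left-to-right."""
    hits = sorted(words_in_bbox(words, bbox),
                  key=lambda w: (round(w["top"]), w["x0"]))
    return " ".join(w["text"] for w in hits)


# Sample data shaped like the output of page.extract_words()
words = [
    {"text": "#1042", "x0": 95, "x1": 130, "top": 110, "bottom": 122},
    {"text": "Invoice", "x0": 50, "x1": 90, "top": 110, "bottom": 122},
    {"text": "Total", "x0": 400, "x1": 430, "top": 110, "bottom": 122},
]
invoice_field = field_text(words, (0, 100, 300, 200))
```

Unlike crop(), this keeps the per-word coordinates around, which helps when a field's position shifts slightly between documents and you want to widen or log the match.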

Use pdfplumber when: You need table extraction, you want to filter by coordinates, or you need detailed word-level layout metadata.


PyMuPDF (fitz)

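PyMuPDF is a Python binding for the MuPDF library, a fast, production-grade PDF renderer. It handles a broader range of PDF types and has stronger text extraction than pdfplumber. Tables are not its focus: older versions have no built-in table logic, and the Page.find_tables() method added in recent releases is worth testing against pdfplumber on your own documents before you rely on it.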

Text extraction

import fitz  # PyMuPDF

doc = fitz.open("invoice.pdf")
for page in doc:
    text = page.get_text()
    print(text)

For more control, use the "dict" or "rawdict" output to get block and span-level metadata:

for page in doc:
    blocks = page.get_text("dict")["blocks"]
    for block in blocks:
        if block["type"] == 0:  # text block
            for line in block["lines"]:
                for span in line["spans"]:
                    print(span["text"], span["bbox"], span["size"])

This gives you font size, position, and text content for every span — useful for identifying headers, labels, and values based on visual properties.
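As an illustration of using those visual properties, here is a rough sketch that flags likely headings by font size. It works on the dict returned by page.get_text("dict"), so it can be tried on sample data without opening a PDF. `find_headings` and the 1.2 size ratio are assumptions for the example, not PyMuPDF features.

```python
from collections import Counter


def iter_spans(page_dict):
    """Yield every text span from a get_text("dict") result."""
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # image blocks carry no "lines"
            continue
        for line in block["lines"]:
            yield from line["spans"]


def find_headings(page_dict, ratio=1.2):
    """Return span texts whose size is at least `ratio` times the body size."""
    spans = [s for s in iter_spans(page_dict) if s["text"].strip()]
    if not spans:
        return []
    # Assume the most common (rounded) font size is the body text size.
    body_size = Counter(round(s["size"]) for s in spans).most_common(1)[0][0]
    return [s["text"].strip() for s in spans if s["size"] >= body_size * ratio]


# Sample data shaped like page.get_text("dict")
page_dict = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Invoice", "size": 18.0}]}]},
        {"type": 1},
        {"type": 0, "lines": [
            {"spans": [{"text": "Bill to:", "size": 10.0},
                       {"text": "ACME Ltd", "size": 10.0}]},
            {"spans": [{"text": "Total due", "size": 10.2}]},
        ]},
    ]
}
headings = find_headings(page_dict)
```

Font size alone is a blunt signal; in practice you would likely combine it with the span's "font" name or position on the page.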

Image extraction

PyMuPDF handles embedded images cleanly:

import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
for page_num, page in enumerate(doc):
    for img_index, img in enumerate(page.get_images(full=True)):
        xref = img[0]  # cross-reference number of the embedded image
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]
        ext = base_image["ext"]  # actual format, e.g. "jpeg" or "png"; not always PNG
        # write to file or pass to OCR
        with open(f"page{page_num}_img{img_index}.{ext}", "wb") as f:
            f.write(image_bytes)

Speed

PyMuPDF is the fastest of the three for raw text extraction, which matters at high document volumes.
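Relative speed depends heavily on the documents involved, so it is worth measuring on your own files. A minimal timing harness, assuming you wrap each library's extraction in a function that takes a file path; `time_extractor` is a hypothetical helper:

```python
import time
from statistics import median


def time_extractor(extract, path, repeats=5):
    """Return the median wall-clock seconds taken by extract(path)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        extract(path)
        timings.append(time.perf_counter() - start)
    return median(timings)
```

Pass it one callable per library (for example, a function that opens the file with fitz and calls get_text() on every page) and compare the medians across a representative sample of documents rather than a single file.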

Use PyMuPDF when: Speed matters, you need image extraction, you’re dealing with complex PDFs that pdfplumber struggles with, or you need span-level font/size metadata.


PyPDF2

PyPDF2 is the oldest of the three and the most limited. It handles basic text extraction and PDF manipulation (merging, splitting, rotating pages) but has no table support and limited layout awareness. Development has moved to its successor, pypdf, so PyPDF2 itself no longer receives new features.

from PyPDF2 import PdfReader

reader = PdfReader("invoice.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)

Text extraction quality is inconsistent — multi-column layouts and complex spacing often produce garbled output.

The honest assessment: For new production work, there’s rarely a reason to choose PyPDF2 over pdfplumber or PyMuPDF. It’s encountered mainly in legacy codebases.

Use PyPDF2 when: You’re maintaining existing code, or you only need PDF manipulation (splitting, merging) rather than content extraction.


Handling scanned PDFs

None of these libraries extract text from scanned PDFs — they only work with text-based PDFs where the text is embedded as actual characters. For scanned inputs, you need an OCR step first.

The standard approach is to render each page to an image with PyMuPDF and pass it to Tesseract or a cloud OCR service:

import fitz
import pytesseract
from PIL import Image
import io

doc = fitz.open("scanned_report.pdf")
for page in doc:
    # Render page to image at 300 DPI
    mat = fitz.Matrix(300/72, 300/72)
    pix = page.get_pixmap(matrix=mat)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    text = pytesseract.image_to_string(img)
    print(text)
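To avoid running OCR on pages that already have embedded text, one common heuristic is to check how much text the normal extraction returns. A sketch, assuming a threshold of 20 non-whitespace characters (an arbitrary starting point you would tune); `needs_ocr` is a hypothetical helper:

```python
def needs_ocr(embedded_text, min_chars=20):
    """Heuristic: a page with almost no embedded text is probably a scan."""
    # Treat None as empty and ignore whitespace when counting characters.
    visible = "".join((embedded_text or "").split())
    return len(visible) < min_chars
```

Feed it the result of page.get_text() or page.extract_text() and only render and OCR the pages where it returns True. Scans that carry a hidden OCR text layer will pass as text pages, so this check says nothing about the quality of that layer.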

Decision framework

Does the PDF contain scanned images?
  └─> Yes: render with PyMuPDF + OCR (Tesseract or cloud)
  └─> No: continue

Do you need table extraction?
  └─> Yes: pdfplumber
  └─> No: continue

Do you need image extraction or span-level metadata?
  └─> Yes: PyMuPDF
  └─> No: continue

Is speed a priority at high volume?
  └─> Yes: PyMuPDF
  └─> No: either pdfplumber or PyMuPDF

Is this legacy code using PyPDF2?
  └─> Yes: consider migrating if extraction quality is a problem
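The first four questions can be written down as a small function, which is handy if the choice is made per document type in a config. This is just the tree above restated in code; `choose_library` and its parameter names are made up for the example, and the PyPDF2 migration question is left out because it is a maintenance decision rather than a routing one.

```python
def choose_library(*, scanned=False, needs_tables=False,
                   needs_images_or_spans=False, high_volume=False):
    """Walk the decision questions in order and return a recommendation."""
    if scanned:
        return "PyMuPDF + OCR"
    if needs_tables:
        return "pdfplumber"
    if needs_images_or_spans or high_volume:
        return "PyMuPDF"
    return "pdfplumber or PyMuPDF"
```

The order of the checks matters: a scanned document that also contains tables still has to go through OCR first.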

In practice, many production pipelines use both pdfplumber and PyMuPDF: pdfplumber for table-heavy pages, PyMuPDF for everything else. The overhead of importing both is negligible.
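A rough sketch of what combining the two can look like, routing page by page. It assumes a page counts as "table-heavy" whenever pdfplumber's default detection finds at least one table, which is a simplification; `extract_page` and `extract_document` are hypothetical names.

```python
def extract_page(plumber_page, fitz_page):
    """Route one page: tables via pdfplumber, plain text via PyMuPDF."""
    tables = plumber_page.extract_tables()
    if tables:
        return {"kind": "tables", "tables": tables}
    return {"kind": "text", "text": fitz_page.get_text()}


def extract_document(path):
    """Open the same file with both libraries and route each page."""
    # Imported here so extract_page can be used and tested on its own.
    import fitz  # PyMuPDF
    import pdfplumber

    with pdfplumber.open(path) as pdf, fitz.open(path) as doc:
        return [extract_page(p, f) for p, f in zip(pdf.pages, doc)]
```

Running table detection on every page gives up some of PyMuPDF's speed advantage, so for high volumes you may prefer to route by document type or by a cheaper signal instead.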


License note

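PyMuPDF is AGPL-licensed: software that incorporates it generally has to be released under AGPL-compatible terms, including when it is offered as a network service, which is a problem for commercial closed-source projects. If this is a concern, the alternatives are pdfplumber (MIT) or a commercial PyMuPDF license. PyPDF2 is BSD-licensed.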
