Extracting tables from PDFs is one of the most common requirements in document automation and one of the most reliable ways to introduce subtle errors if you do it carelessly.
This guide covers table extraction with pdfplumber — the most capable Python library for this — including how it works, when it works, and what to do when it doesn’t.
## Why PDF tables are hard
A PDF is a rendering format, not a data format. When a table is drawn in a PDF, there’s no underlying structure that says “this is a table with these rows and columns.” What exists is:
- Lines and rectangles (if the table has visible borders)
- Text positioned at specific coordinates on the page
The extraction library has to infer the table structure from the visual layout. This works well for clean, bordered tables and becomes unreliable as tables deviate from that.
## Basic table extraction with pdfplumber
For PDFs with visible table borders, `extract_tables()` usually works out of the box:

```python
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
```

`extract_tables()` returns a list of tables. Each table is a list of rows. Each row is a list of cell values as strings (or `None` for empty cells).
For a single table on a page:
```python
with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # singular — returns first table
    print(table)
```

## Converting to a DataFrame
The most common next step is converting to pandas:
```python
import pdfplumber
import pandas as pd

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

    if table:
        # First row as header
        df = pd.DataFrame(table[1:], columns=table[0])
        print(df)
```

Watch out for `None` headers — pdfplumber returns `None` for merged or empty cells. Clean them before constructing the DataFrame:
```python
headers = [h or f"col_{i}" for i, h in enumerate(table[0])]
df = pd.DataFrame(table[1:], columns=headers)
```

## Borderless tables
Many real-world tables have no visible borders — columns are separated by whitespace, rows by line spacing. pdfplumber’s default strategy (looking for lines) fails here.
Switch to text-based strategies:
```python
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table(table_settings)
```

The text strategy uses the spacing between text elements to infer column and row boundaries. Results depend on how consistently the PDF was laid out.
Additional settings to tune when the text strategy misaligns:
```python
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,          # points within which edges snap to the same row/col
    "intersection_tolerance": 5,  # how far apart edges can be and still intersect
    "edge_min_length": 3,         # ignore detected edges shorter than this
}
```

## Extracting from a specific region
When a page contains both text and a table, sometimes `extract_tables()` captures surrounding text as table content. Cropping to the table region fixes this:
```python
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    # Crop to bounding box (x0, top, x1, bottom) — coordinates in points
    # Use page.width and page.height for reference
    table_area = page.crop((50, 200, page.width - 50, 500))
    table = table_area.extract_table()
```

To find the right bounding box without guessing, use pdfplumber's visual debugging:
```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]

    # Save a visual with bounding boxes drawn around each word
    im = page.to_image()
    im.draw_rects(page.extract_words())
    im.save("debug_words.png")

    # Overlay what the table finder detects (lines, intersections, cells)
    im2 = page.to_image()
    im2.debug_tablefinder()
    im2.save("debug_tables.png")
```

The debug images show you what pdfplumber sees — useful when tables aren't being detected correctly.
## Multi-page tables
When a table spans multiple pages, `extract_table()` on each page returns the portion on that page. You need to concatenate them — but the repeated header row on continuation pages creates duplicates.
```python
import pdfplumber
import pandas as pd

def extract_multipage_table(pdf_path: str) -> pd.DataFrame:
    all_rows = []
    header = None
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            table = page.extract_table()
            if not table:
                continue
            if page_num == 0:
                # First page: first row is the header
                header = table[0]
                all_rows.extend(table[1:])
            else:
                # Subsequent pages: check if first row is a repeated header
                if table[0] == header:
                    all_rows.extend(table[1:])
                else:
                    all_rows.extend(table)
    if not header or not all_rows:
        return pd.DataFrame()
    return pd.DataFrame(all_rows, columns=header)
```

This works when pages repeat the header exactly. For partial headers or summary rows at page breaks, you'll need to inspect the specific patterns in your PDFs.
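For example, continuation pages sometimes inject per-page subtotal or "continued" rows into the extracted data. A hedged post-filter sketch; the marker strings here are assumptions, so match them to whatever your PDFs actually emit:

```python
def drop_marker_rows(rows, markers=("Subtotal", "Continued")):
    """Drop rows whose first cell is a page-break artifact rather than data."""
    cleaned = []
    for row in rows:
        first = (row[0] or "").strip()
        if any(first.startswith(m) for m in markers):
            continue
        cleaned.append(row)
    return cleaned

rows = [
    ["Widget A", "2", "10.00", "20.00"],
    ["Subtotal", None, None, "20.00"],  # page-break artifact (hypothetical)
    ["Widget B", "1", "5.00", "5.00"],
]
print(drop_marker_rows(rows))  # the Subtotal row is removed
```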
## Post-processing extracted values
Raw table cells from pdfplumber are strings with inconsistent whitespace and formatting. Clean them before use:
```python
import re
from decimal import Decimal, InvalidOperation

def clean_cell(value: str | None) -> str | None:
    if value is None:
        return None
    # Collapse whitespace and strip
    return re.sub(r"\s+", " ", value).strip()

def parse_currency(value: str | None) -> Decimal | None:
    if not value:
        return None
    cleaned = re.sub(r"[£$€,\s]", "", value)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

def parse_table_row(row: list[str | None]) -> dict:
    """Example for an invoice line items table."""
    return {
        "description": clean_cell(row[0]),
        "quantity": float(row[1]) if row[1] else None,
        "unit_price": parse_currency(row[2]),
        "total": parse_currency(row[3]),
    }
```

## Common failure modes
**Merged cells** — pdfplumber doesn't handle merged cells (where one cell spans multiple columns or rows). It typically returns `None` for the spanned cells. You'll need to forward-fill or detect merges manually.
```python
# Forward-fill None values in a column (common for merged row headers)
current_value = None
for row in table:
    if row[0] is not None:
        current_value = row[0]
    else:
        row[0] = current_value
```

**Footnotes inside tables** — small footnote text within a table region gets extracted as a row. Filter by checking if the row has the expected number of non-`None` values.
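That footnote check can be written as a few lines. A sketch assuming a four-column table where genuine rows fill at least three cells; tune `min_filled` for your layout:

```python
def drop_sparse_rows(table, min_filled=3):
    """Keep only rows with at least min_filled non-empty cells;
    footnote text usually lands in one cell of an otherwise empty row."""
    return [
        row for row in table
        if sum(1 for cell in row if cell not in (None, "")) >= min_filled
    ]

table = [
    ["Widget A", "2", "10.00", "20.00"],
    ["* Prices exclude VAT", None, None, None],  # footnote captured as a row
    ["Widget B", "1", "5.00", "5.00"],
]
print(drop_sparse_rows(table))  # the footnote row is gone
```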
**Number formatting** — European number formats (`1.234,56`) will break a simple `Decimal()` parse. Detect and normalise before parsing:
```python
def parse_number(value: str) -> Decimal:
    # Detect European format: period as thousands separator, comma as decimal
    if re.match(r"^\d{1,3}(\.\d{3})*(,\d+)?$", value.strip()):
        value = value.replace(".", "").replace(",", ".")
    else:
        value = value.replace(",", "")
    return Decimal(value)
```

**Image-based tables** — if the table is a scanned image rather than rendered text, pdfplumber returns nothing. The text simply isn't there. This requires an OCR step followed by table reconstruction, which is considerably more involved.
## When pdfplumber isn't enough
pdfplumber handles bordered and text-based borderless tables reliably. For more complex cases:
- Heavily nested or merged cell tables — consider Camelot, which uses more sophisticated grid detection and has explicit handling for complex table structures
- Scanned tables — OCR (Tesseract or a cloud service) to get the text, then post-processing to reconstruct rows and columns
- Tables in Word/Excel documents converted to PDF — often produce well-structured text that pdfplumber handles cleanly, but layouts vary
For most production invoice and report processing, pdfplumber is sufficient if you account for the edge cases above.
