Extracting tables from PDFs is one of the most common requirements in document automation and one of the most reliable ways to introduce subtle errors if you do it carelessly.
This guide covers table extraction with pdfplumber — the most capable Python library for this — including how it works, when it works, and what to do when it doesn’t.
## Why PDF tables are hard
A PDF is a rendering format, not a data format. When a table is drawn in a PDF, there’s no underlying structure that says “this is a table with these rows and columns.” What exists is:
- Lines and rectangles (if the table has visible borders)
- Text positioned at specific coordinates on the page
The extraction library has to infer the table structure from the visual layout. This works well for clean, bordered tables and becomes unreliable as tables deviate from that.
## Basic table extraction with pdfplumber
For PDFs with visible table borders, `extract_tables()` usually works out of the box:

```python
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
```

`extract_tables()` returns a list of tables. Each table is a list of rows. Each row is a list of cell values as strings (or `None` for empty cells).
For a single table on a page:
```python
with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # singular — returns first table
    print(table)
```

## Converting to a DataFrame
The most common next step is converting to pandas:
```python
import pdfplumber
import pandas as pd

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

    if table:
        # First row as header
        df = pd.DataFrame(table[1:], columns=table[0])
        print(df)
```

Watch out for `None` headers — pdfplumber returns `None` for merged or empty cells. Clean them before constructing the DataFrame:
```python
headers = [h or f"col_{i}" for i, h in enumerate(table[0])]
df = pd.DataFrame(table[1:], columns=headers)
```

## Borderless tables
Many real-world tables have no visible borders — columns are separated by whitespace, rows by line spacing. pdfplumber’s default strategy (looking for lines) fails here.
Switch to text-based strategies:
```python
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table(table_settings)
```

The text strategy uses the spacing between text elements to infer column and row boundaries. Results depend on how consistently the PDF was laid out.
Additional settings to tune when the text strategy misaligns:
```python
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,          # points within which edges snap to the same row/col
    "intersection_tolerance": 5,  # how far apart edges can be and still intersect
    "edge_min_length": 3,         # ignore detected edges shorter than this
}
```

## Extracting from a specific region
When a page contains both text and a table, sometimes `extract_tables()` captures surrounding text as table content. Cropping to the table region fixes this:
```python
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    # Crop to bounding box (x0, top, x1, bottom) — coordinates in points
    # Use page.width and page.height for reference
    table_area = page.crop((50, 200, page.width - 50, 500))
    table = table_area.extract_table()
```

To find the right bounding box without guessing, use pdfplumber's visual debugging:
```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]

    # Save a visual with bounding boxes drawn around each word
    im = page.to_image()
    im.draw_rects(page.extract_words())
    im.save("debug_words.png")

    # Overlay what the table finder detects (lines, intersections, cells)
    im2 = page.to_image()
    im2.debug_tablefinder()
    im2.save("debug_tables.png")
```

The debug images show you what pdfplumber sees — useful when tables aren't being detected correctly.
## Multi-page tables
When a table spans multiple pages, `extract_table()` on each page returns the portion on that page. You need to concatenate them — but the repeated header row on continuation pages creates duplicates.
```python
import pdfplumber
import pandas as pd

def extract_multipage_table(pdf_path: str) -> pd.DataFrame:
    all_rows = []
    header = None
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            table = page.extract_table()
            if not table:
                continue
            if page_num == 0:
                # First page: first row is the header
                header = table[0]
                all_rows.extend(table[1:])
            else:
                # Subsequent pages: check if first row is a repeated header
                if table[0] == header:
                    all_rows.extend(table[1:])
                else:
                    all_rows.extend(table)
    if not header or not all_rows:
        return pd.DataFrame()
    return pd.DataFrame(all_rows, columns=header)
```

This works when pages repeat the header exactly. For partial headers or summary rows at page breaks, you'll need to inspect the specific patterns in your PDFs.
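For example, continuation pages sometimes inject per-page subtotal or "continued" rows into the extracted data. A hedged post-filter sketch; the marker strings here are assumptions, so match them to whatever your PDFs actually emit:

```python
def drop_marker_rows(rows, markers=("Subtotal", "Continued")):
    """Drop rows whose first cell is a page-break artifact rather than data."""
    cleaned = []
    for row in rows:
        first = (row[0] or "").strip()
        if any(first.startswith(m) for m in markers):
            continue
        cleaned.append(row)
    return cleaned

rows = [
    ["Widget A", "2", "10.00", "20.00"],
    ["Subtotal", None, None, "20.00"],  # page-break artifact (hypothetical)
    ["Widget B", "1", "5.00", "5.00"],
]
print(drop_marker_rows(rows))  # the Subtotal row is removed
```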
## Post-processing extracted values
Raw table cells from pdfplumber are strings with inconsistent whitespace and formatting. Clean them before use:
```python
import re
from decimal import Decimal, InvalidOperation

def clean_cell(value: str | None) -> str | None:
    if value is None:
        return None
    # Collapse whitespace and strip
    return re.sub(r"\s+", " ", value).strip()

def parse_currency(value: str | None) -> Decimal | None:
    if not value:
        return None
    cleaned = re.sub(r"[£$€,\s]", "", value)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

def parse_table_row(row: list[str | None]) -> dict:
    """Example for an invoice line items table."""
    return {
        "description": clean_cell(row[0]),
        "quantity": float(row[1]) if row[1] else None,
        "unit_price": parse_currency(row[2]),
        "total": parse_currency(row[3]),
    }
```

## Common failure modes
**Merged cells** — pdfplumber doesn't handle merged cells (where one cell spans multiple columns or rows). It typically returns `None` for the spanned cells. You'll need to forward-fill or detect merges manually.
```python
# Forward-fill None values in a column (common for merged row headers)
current_value = None
for row in table:
    if row[0] is not None:
        current_value = row[0]
    else:
        row[0] = current_value
```

**Footnotes inside tables** — small footnote text within a table region gets extracted as a row. Filter by checking if the row has the expected number of non-`None` values.
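That footnote check can be written as a few lines. A sketch assuming a four-column table where genuine rows fill at least three cells; tune `min_filled` for your layout:

```python
def drop_sparse_rows(table, min_filled=3):
    """Keep only rows with at least min_filled non-empty cells;
    footnote text usually lands in one cell of an otherwise empty row."""
    return [
        row for row in table
        if sum(1 for cell in row if cell not in (None, "")) >= min_filled
    ]

table = [
    ["Widget A", "2", "10.00", "20.00"],
    ["* Prices exclude VAT", None, None, None],  # footnote captured as a row
    ["Widget B", "1", "5.00", "5.00"],
]
print(drop_sparse_rows(table))  # the footnote row is gone
```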
**Number formatting** — European number formats (`1.234,56`) will break a simple `Decimal()` parse. Detect and normalise before parsing:
```python
def parse_number(value: str) -> Decimal:
    # Detect European format: period as thousands separator, comma as decimal
    if re.match(r"^\d{1,3}(\.\d{3})*(,\d+)?$", value.strip()):
        value = value.replace(".", "").replace(",", ".")
    else:
        value = value.replace(",", "")
    return Decimal(value)
```

**Image-based tables** — if the table is a scanned image rather than rendered text, pdfplumber returns nothing. The text simply isn't there. This requires an OCR step followed by table reconstruction, which is considerably more involved.
## When pdfplumber isn't enough
pdfplumber handles bordered and text-based borderless tables reliably. For more complex cases:
- Heavily nested or merged cell tables — consider Camelot, which uses more sophisticated grid detection and has explicit handling for complex table structures
- Scanned tables — OCR (Tesseract or a cloud service) to get the text, then post-processing to reconstruct rows and columns
- Tables in Word/Excel documents converted to PDF — often produce well-structured text that pdfplumber handles cleanly, but layouts vary
For most production invoice and report processing, pdfplumber is sufficient if you account for the edge cases above.
