Document summarization is a critical NLP task that helps users quickly grasp key information from long documents. But how do you know if your model is actually working? This guide shows a workflow that starts with evaluation and acceptance criteria before touching models — the approach that got a finance report summarizer from prototype to production in three weeks.
## TL;DR
- Define ROUGE-L and BERTScore thresholds before building.
- Use an extractive baseline (TextRank/TF-IDF) before reaching for LLMs — it’s free, fast, and sets a floor.
- LLM abstractive summarization is for refinement, not replacement, of the extractive base.
- Human checklist catches what metrics miss: faithfulness, specificity, no hallucinated numbers.
## Why Eval-First
When we built an extractive summarizer for finance reports, we shipped faster because evaluation and acceptance criteria were defined before any model work began.
## Workflow
- Curate a small, representative dataset (20–50 docs)
- Define extractive baseline + abstractive model
- Compute ROUGE/BERTScore, then human checklist (coverage, faithfulness)
- Review failure modes and iterate on chunking/prompts
```python
from __future__ import annotations

# datasets.load_metric was removed in recent releases;
# use the evaluate library instead: pip install evaluate rouge_score
import evaluate

def rouge(refs: list[str], hyps: list[str]) -> dict[str, float]:
    """Return aggregated ROUGE F-measures (rouge1, rouge2, rougeL, rougeLsum)."""
    metric = evaluate.load("rouge")
    return metric.compute(predictions=hyps, references=refs)

if __name__ == "__main__":
    refs = ["Revenue increased due to subscriptions and lower churn."]
    hyps = ["Revenue increased from new subscriptions; churn was lower."]
    print(rouge(refs, hyps))
```

## BERTScore: Semantic Evaluation
ROUGE measures n-gram overlap, which can penalize valid paraphrases. BERTScore uses contextual embeddings to capture semantic similarity — a better proxy for human judgment on abstractive summaries.
```python
from bert_score import score as bert_score

def bertscore(refs: list[str], hyps: list[str], lang: str = "en") -> dict[str, float]:
    """Mean precision/recall/F1 over the corpus using contextual embeddings."""
    P, R, F1 = bert_score(hyps, refs, lang=lang, verbose=False)
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item(),
    }

if __name__ == "__main__":
    refs = ["Revenue increased due to subscriptions and lower churn."]
    hyps = ["Subscription growth and declining churn drove revenue higher."]
    print(bertscore(refs, hyps))
    # {'precision': 0.91, 'recall': 0.89, 'f1': 0.90}
    # ROUGE-L would score this lower despite it being a valid paraphrase
```

Install: `pip install bert-score`
## Extractive vs Abstractive: When to Use Which
| Criterion | Extractive | Abstractive (LLM) |
|---|---|---|
| Speed | Fast (~10ms/doc) | Slow (~1–3s/doc + API cost) |
| Faithfulness | High — uses source sentences | Risk of hallucination |
| Fluency | Choppy — sentence fragments | Natural prose |
| Best for | Internal tools, high-stakes domains | Consumer-facing, narrative docs |
| Cost | Free | $0.01–0.10 per document |
Recommended approach: Use extractive as a baseline and a pre-filter. Feed only key extracted sentences to the LLM for refinement — this cuts tokens (and cost) by 60–80%.
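The pre-filter pattern above can be sketched as a small composition. Here `select_top_k` and `refine` are placeholders for your extractive selector and LLM call; the names are illustrative, not a real API:

```python
from typing import Callable

def extract_then_refine(
    sentences: list[str],
    select_top_k: Callable[[list[str], int], list[str]],
    refine: Callable[[str], str],  # placeholder for an LLM call
    top_k: int = 5,
) -> str:
    """Pre-filter with an extractive selector, then pass only the selected
    sentences to the abstractive model. Token savings scale with
    top_k / len(sentences): top_k=5 on a 25-sentence section sends ~20%
    of the original text to the LLM."""
    key_sentences = select_top_k(sentences, top_k)
    return refine("\n".join(key_sentences))
```

Because both stages are injected as callables, the extractive selector can be swapped (TextRank, embeddings) without touching the refinement step.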
## Human Checklist (Print and Use)
- Coverage: all key bullets present?
- Faithfulness: no invented numbers or facts?
- Specificity: numbers and entities preserved?
- Brevity: filler and boilerplate removed?
Run this on 20 sampled summaries weekly. A pass rate < 0.9 means something changed — check for model drift, doc format changes, or chunking regressions.
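The weekly audit is easy to automate once reviewers record each checklist item as a boolean. A minimal sketch, assuming one dict per reviewed summary (field names are illustrative):

```python
CHECKLIST = ("coverage", "faithfulness", "specificity", "brevity")

def pass_rate(reviews: list[dict[str, bool]]) -> float:
    """A summary passes only if every checklist item is True;
    a missing item counts as a failure."""
    if not reviews:
        return 0.0
    passed = sum(all(r.get(item, False) for item in CHECKLIST) for r in reviews)
    return passed / len(reviews)
```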
## Architecture
- Ingest and clean text (see text-cleaning pipeline)
- Segment by sections; avoid cross-topic chunks
- Extractive baseline (TextRank or embedding-based key sentence selection)
- Abstractive refinement with constrained prompting
- Score with ROUGE/BERTScore + human checklist
## Extractive baseline example
```python
from __future__ import annotations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def textrank_sentences(sentences: list[str], top_k: int = 5) -> list[str]:
    """Rank sentences by summed TF-IDF cosine similarity — a simplified
    TextRank: centrality scores without the iterative PageRank step."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = (tfidf @ tfidf.T).toarray()  # .A is deprecated on sparse matrices
    scores = sim.sum(axis=1)
    idx = np.argsort(-scores)[:top_k]
    return [sentences[i] for i in sorted(idx)]  # restore document order
```

## Abstractive refinement prompt (LLM)
```text
You are summarizing a section for financial analysts.

Constraints:
- Keep numbers and entities accurate.
- No claims beyond the provided sentences.
- Max 120 words.

Sentences:
<paste extractive sentences>
```

## Metrics and acceptance criteria
- ROUGE-L ≥ 0.35 on validation set; BERTScore-F1 ≥ 0.86 on domain corpus.
- Human checklist pass rate ≥ 0.9 (sampled 20 summaries weekly).
- Drift alerts if either metric drops ≥ 10% week-over-week.
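The drift rule above can be encoded directly. A minimal sketch of the week-over-week check:

```python
def drift_alert(previous: float, current: float, threshold: float = 0.10) -> bool:
    """Flag a relative week-over-week drop of `threshold` (10%) or more."""
    if previous <= 0:
        return False  # no meaningful baseline to compare against
    return (previous - current) / previous >= threshold
```

Run it on both ROUGE-L and BERTScore-F1; either metric tripping the alert should block deploys until the cause (model drift, format change, chunking regression) is identified.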
## Failure modes and fixes
- Missing critical bullet: increase top_k extractive or re-segment by section headings.
- Fabricated numbers: add unit tests scanning for number changes vs source.
- Repetition/bloat: enforce word cap and remove boilerplate via cleaning.
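The number-scanning test for fabricated figures can be as simple as a regex diff between source and summary. A sketch — the regex is a starting point, not exhaustive (it misses spelled-out numbers and unit suffixes):

```python
import re

def extract_numbers(text: str) -> set[str]:
    """Pull numeric tokens (integers, decimals, percentages) from text."""
    return set(re.findall(r"\d+(?:[.,]\d+)?%?", text))

def has_fabricated_numbers(source: str, summary: str) -> bool:
    """True if the summary contains any number absent from the source."""
    return not extract_numbers(summary) <= extract_numbers(source)
```

Wire this into CI as a per-summary assertion so a prompt or model change that starts inventing figures fails loudly instead of silently.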
## Chunking Strategy Matters
Bad chunking is the most common reason summarization fails silently. If a chunk cuts in the middle of a key finding, the model never sees it.
```python
import re

def segment_by_headings(text: str) -> list[str]:
    """Split on markdown headings or ALL-CAPS heading lines."""
    pattern = r"(?=\n#{1,3} |\n[A-Z][A-Z\s]{3,}\n)"
    sections = re.split(pattern, text)
    return [s.strip() for s in sections if s.strip()]

def chunk_section(section: str, max_tokens: int = 800) -> list[str]:
    """Sentence-boundary chunker for sections exceeding max_tokens.
    Uses word count as a cheap proxy for token count."""
    sentences = re.split(r"(?<=[.!?])\s+", section)
    chunks, current, count = [], [], 0
    for sent in sentences:
        word_count = len(sent.split())
        if count + word_count > max_tokens and current:
            chunks.append(" ".join(current))
            current, count = [sent], word_count
        else:
            current.append(sent)
            count += word_count
    if current:
        chunks.append(" ".join(current))
    return chunks
```

## Integration Notes
- Store source sentence IDs alongside summaries for traceability (which sentences informed each claim).
- Log tokens, latency, and scores per job — create a dashboard so regressions are visible before users report them.
- For long docs (annual reports, legal briefs): summarize sections first, then synthesize an executive summary from those section summaries. Two-pass keeps each LLM call within context window.
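The two-pass scheme can be sketched as follows, with `summarize` standing in as a placeholder for a single LLM call:

```python
from typing import Callable

def two_pass_summary(
    sections: list[str],
    summarize: Callable[[str], str],  # placeholder for one LLM call
) -> str:
    """First pass: summarize each section independently.
    Second pass: synthesize an executive summary from the section
    summaries, so no single call exceeds the context window."""
    section_summaries = [summarize(s) for s in sections]
    return summarize("\n\n".join(section_summaries))
```

In practice the two passes can use different prompts — section summaries optimize for coverage, while the synthesis pass optimizes for narrative and brevity.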
