Document summarization is a critical NLP task that helps users quickly grasp key information from long documents. But how do you know if your model is actually working? This guide shows a workflow that starts with evaluation and acceptance criteria before touching models — the approach that got a finance report summarizer from prototype to production in three weeks.
## TL;DR
- Define ROUGE-L and BERTScore thresholds before building.
- Use an extractive baseline (TextRank/TF-IDF) before reaching for LLMs — it’s free, fast, and sets a floor.
- LLM abstractive summarization is for refinement, not replacement, of the extractive base.
- Human checklist catches what metrics miss: faithfulness, specificity, no hallucinated numbers.
## Why Eval-First
When we built an extractive summarizer for finance reports, we shipped faster because evaluation and acceptance criteria were defined before any model work began.
## Workflow
- Curate a small, representative dataset (20–50 docs)
- Define extractive baseline + abstractive model
- Compute ROUGE/BERTScore, then human checklist (coverage, faithfulness)
- Review failure modes and iterate on chunking/prompts
```python
from __future__ import annotations

# datasets.load_metric was removed in recent releases;
# use the evaluate library instead: pip install evaluate rouge_score
import evaluate

def rouge(refs: list[str], hyps: list[str]) -> dict[str, float]:
    """Return aggregated ROUGE F-measures (rouge1, rouge2, rougeL, rougeLsum)."""
    metric = evaluate.load("rouge")
    return metric.compute(predictions=hyps, references=refs)

if __name__ == "__main__":
    refs = ["Revenue increased due to subscriptions and lower churn."]
    hyps = ["Revenue increased from new subscriptions; churn was lower."]
    print(rouge(refs, hyps))
```

## BERTScore: Semantic Evaluation
ROUGE measures n-gram overlap, which can penalize valid paraphrases. BERTScore uses contextual embeddings to capture semantic similarity — a better proxy for human judgment on abstractive summaries.
```python
from bert_score import score as bert_score

def bertscore(refs: list[str], hyps: list[str], lang: str = "en") -> dict[str, float]:
    """Mean precision/recall/F1 over the corpus using contextual embeddings."""
    P, R, F1 = bert_score(hyps, refs, lang=lang, verbose=False)
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item(),
    }

if __name__ == "__main__":
    refs = ["Revenue increased due to subscriptions and lower churn."]
    hyps = ["Subscription growth and declining churn drove revenue higher."]
    print(bertscore(refs, hyps))
    # {'precision': 0.91, 'recall': 0.89, 'f1': 0.90}
    # ROUGE-L would score this lower despite it being a valid paraphrase
```

Install: `pip install bert-score`
## Extractive vs Abstractive: When to Use Which
| Criterion | Extractive | Abstractive (LLM) |
|---|---|---|
| Speed | Fast (~10ms/doc) | Slow (~1–3s/doc + API cost) |
| Faithfulness | High — uses source sentences | Risk of hallucination |
| Fluency | Choppy — sentence fragments | Natural prose |
| Best for | Internal tools, high-stakes domains | Consumer-facing, narrative docs |
| Cost | Free | $0.01–0.10 per document |
Recommended approach: Use extractive as a baseline and a pre-filter. Feed only key extracted sentences to the LLM for refinement — this cuts tokens (and cost) by 60–80%.
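The pre-filter pattern above can be sketched as a small composition. Here `select_top_k` and `refine` are placeholders for your extractive selector and LLM call; the names are illustrative, not a real API:

```python
from typing import Callable

def extract_then_refine(
    sentences: list[str],
    select_top_k: Callable[[list[str], int], list[str]],
    refine: Callable[[str], str],  # placeholder for an LLM call
    top_k: int = 5,
) -> str:
    """Pre-filter with an extractive selector, then pass only the selected
    sentences to the abstractive model. Token savings scale with
    top_k / len(sentences): top_k=5 on a 25-sentence section sends ~20%
    of the original text to the LLM."""
    key_sentences = select_top_k(sentences, top_k)
    return refine("\n".join(key_sentences))
```

Because both stages are injected as callables, the extractive selector can be swapped (TextRank, embeddings) without touching the refinement step.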
## Human Checklist (Print and Use)
- Coverage: all key bullets present?
- Faithfulness: no invented numbers or facts?
- Specificity: numbers and entities preserved?
- Brevity: filler and boilerplate removed?
Run this on 20 sampled summaries weekly. A pass rate < 0.9 means something changed — check for model drift, doc format changes, or chunking regressions.
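The weekly audit is easy to automate once reviewers record each checklist item as a boolean. A minimal sketch, assuming one dict per reviewed summary (field names are illustrative):

```python
CHECKLIST = ("coverage", "faithfulness", "specificity", "brevity")

def pass_rate(reviews: list[dict[str, bool]]) -> float:
    """A summary passes only if every checklist item is True;
    a missing item counts as a failure."""
    if not reviews:
        return 0.0
    passed = sum(all(r.get(item, False) for item in CHECKLIST) for r in reviews)
    return passed / len(reviews)
```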
## Architecture
- Ingest and clean text (see text-cleaning pipeline)
- Segment by sections; avoid cross-topic chunks
- Extractive baseline (TextRank or embedding-based key sentence selection)
- Abstractive refinement with constrained prompting
- Score with ROUGE/BERTScore + human checklist
## Extractive baseline example
```python
from __future__ import annotations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def textrank_sentences(sentences: list[str], top_k: int = 5) -> list[str]:
    """Rank sentences by summed TF-IDF cosine similarity — a simplified
    TextRank: centrality scores without the iterative PageRank step."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = (tfidf @ tfidf.T).toarray()  # .A is deprecated on sparse matrices
    scores = sim.sum(axis=1)
    idx = np.argsort(-scores)[:top_k]
    return [sentences[i] for i in sorted(idx)]  # restore document order
```

## Abstractive refinement prompt (LLM)
```text
You are summarizing a section for financial analysts.

Constraints:
- Keep numbers and entities accurate.
- No claims beyond the provided sentences.
- Max 120 words.

Sentences:
<paste extractive sentences>
```

## Metrics and acceptance criteria
- ROUGE-L ≥ 0.35 on validation set; BERTScore-F1 ≥ 0.86 on domain corpus.
- Human checklist pass rate ≥ 0.9 (sampled 20 summaries weekly).
- Drift alerts if either metric drops ≥ 10% week-over-week.
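The drift rule above can be encoded directly. A minimal sketch of the week-over-week check:

```python
def drift_alert(previous: float, current: float, threshold: float = 0.10) -> bool:
    """Flag a relative week-over-week drop of `threshold` (10%) or more."""
    if previous <= 0:
        return False  # no meaningful baseline to compare against
    return (previous - current) / previous >= threshold
```

Run it on both ROUGE-L and BERTScore-F1; either metric tripping the alert should block deploys until the cause (model drift, format change, chunking regression) is identified.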
## Failure modes and fixes
- Missing critical bullet: increase top_k extractive or re-segment by section headings.
- Fabricated numbers: add unit tests scanning for number changes vs source.
- Repetition/bloat: enforce word cap and remove boilerplate via cleaning.
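The number-scanning test for fabricated figures can be as simple as a regex diff between source and summary. A sketch — the regex is a starting point, not exhaustive (it misses spelled-out numbers and unit suffixes):

```python
import re

def extract_numbers(text: str) -> set[str]:
    """Pull numeric tokens (integers, decimals, percentages) from text."""
    return set(re.findall(r"\d+(?:[.,]\d+)?%?", text))

def has_fabricated_numbers(source: str, summary: str) -> bool:
    """True if the summary contains any number absent from the source."""
    return not extract_numbers(summary) <= extract_numbers(source)
```

Wire this into CI as a per-summary assertion so a prompt or model change that starts inventing figures fails loudly instead of silently.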
## Chunking Strategy Matters
Bad chunking is the most common reason summarization fails silently. If a chunk cuts in the middle of a key finding, the model never sees it.
```python
import re

def segment_by_headings(text: str) -> list[str]:
    """Split on markdown headings or ALL-CAPS heading lines."""
    pattern = r"(?=\n#{1,3} |\n[A-Z][A-Z\s]{3,}\n)"
    sections = re.split(pattern, text)
    return [s.strip() for s in sections if s.strip()]

def chunk_section(section: str, max_tokens: int = 800) -> list[str]:
    """Sentence-boundary chunker for sections exceeding max_tokens.
    Uses word count as a cheap proxy for token count."""
    sentences = re.split(r"(?<=[.!?])\s+", section)
    chunks, current, count = [], [], 0
    for sent in sentences:
        word_count = len(sent.split())
        if count + word_count > max_tokens and current:
            chunks.append(" ".join(current))
            current, count = [sent], word_count
        else:
            current.append(sent)
            count += word_count
    if current:
        chunks.append(" ".join(current))
    return chunks
```

## Integration Notes
- Store source sentence IDs alongside summaries for traceability (which sentences informed each claim).
- Log tokens, latency, and scores per job — create a dashboard so regressions are visible before users report them.
- For long docs (annual reports, legal briefs): summarize sections first, then synthesize an executive summary from those section summaries. Two-pass keeps each LLM call within context window.
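The two-pass scheme can be sketched as follows, with `summarize` standing in as a placeholder for a single LLM call:

```python
from typing import Callable

def two_pass_summary(
    sections: list[str],
    summarize: Callable[[str], str],  # placeholder for one LLM call
) -> str:
    """First pass: summarize each section independently.
    Second pass: synthesize an executive summary from the section
    summaries, so no single call exceeds the context window."""
    section_summaries = [summarize(s) for s in sections]
    return summarize("\n\n".join(section_summaries))
```

In practice the two passes can use different prompts — section summaries optimize for coverage, while the synthesis pass optimizes for narrative and brevity.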
