Document Summarization: Eval First

·823 words·4 mins·
Subhajit Bhar
I build production-grade document extraction pipelines for businesses that process invoices, lab reports, contracts, and other document types at scale.

Document summarization is a critical NLP task that helps users quickly grasp key information from long documents. But how do you know if your model is actually working? This guide shows a workflow that starts with evaluation and acceptance criteria before touching models — the approach that got a finance report summarizer from prototype to production in three weeks.

TL;DR

  • Define ROUGE-L and BERTScore thresholds before building.
  • Use an extractive baseline (TextRank/TF-IDF) before reaching for LLMs — it’s free, fast, and sets a floor.
  • LLM abstraction is for refinement, not replacement of the extractive base.
  • Human checklist catches what metrics miss: faithfulness, specificity, no hallucinated numbers.

Why Eval-First
#

When I built an extractive summarizer for finance reports, we shipped faster by defining evaluation and acceptance criteria before touching models.

Workflow
#

  1. Curate a small, representative dataset (20–50 docs)
  2. Define extractive baseline + abstractive model
  3. Compute ROUGE/BERTScore, then human checklist (coverage, faithfulness)
  4. Review failure modes and iterate on chunking/prompts
The ROUGE step from the workflow above, updated for the evaluate package (`datasets.load_metric` has been removed from recent datasets releases):

from __future__ import annotations
import evaluate


def rouge(refs: list[str], hyps: list[str]) -> dict[str, float]:
    """Compute ROUGE F-measures; evaluate's rouge returns plain floats."""
    metric = evaluate.load("rouge")
    return metric.compute(predictions=hyps, references=refs)


if __name__ == "__main__":
    refs = ["Revenue increased due to subscriptions and lower churn."]
    hyps = ["Revenue increased from new subscriptions; churn was lower."]
    print(rouge(refs, hyps))

BERTScore: Semantic Evaluation
#

ROUGE measures n-gram overlap, which can penalize valid paraphrases. BERTScore uses contextual embeddings to capture semantic similarity — a better proxy for human judgment on abstractive summaries.

from bert_score import score as bert_score

def bertscore(refs: list[str], hyps: list[str], lang: str = "en"):
    P, R, F1 = bert_score(hyps, refs, lang=lang, verbose=False)
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item(),
    }

if __name__ == "__main__":
    refs = ["Revenue increased due to subscriptions and lower churn."]
    hyps = ["Subscription growth and declining churn drove revenue higher."]
    print(bertscore(refs, hyps))
    # {'precision': 0.91, 'recall': 0.89, 'f1': 0.90}
    # ROUGE-L would score this lower despite it being a valid paraphrase

Install: pip install bert-score

Extractive vs Abstractive: When to Use Which
#

| Criterion | Extractive | Abstractive (LLM) |
|---|---|---|
| Speed | Fast (~10 ms/doc) | Slow (~1–3 s/doc + API cost) |
| Faithfulness | High — uses source sentences | Risk of hallucination |
| Fluency | Choppy — sentence fragments | Natural prose |
| Best for | Internal tools, high-stakes domains | Consumer-facing, narrative docs |
| Cost | Free | $0.01–0.10 per document |

Recommended approach: Use extractive as a baseline and a pre-filter. Feed only key extracted sentences to the LLM for refinement — this cuts tokens (and cost) by 60–80%.
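One way to sanity-check that savings claim on your own corpus (a hypothetical helper; words stand in as a rough token proxy):

```python
def prefilter_savings(doc_sentences: list[str], extracted: list[str]) -> float:
    """Fraction of words (a rough token proxy) saved by sending only the
    extracted sentences to the LLM instead of the full document."""
    full = sum(len(s.split()) for s in doc_sentences)
    kept = sum(len(s.split()) for s in extracted)
    return 1 - kept / full
```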

Human Checklist (Print and Use)
#

  • Coverage: all key bullets present?
  • Faithfulness: no invented numbers or facts?
  • Specificity: numbers and entities preserved?
  • Brevity: filler and boilerplate removed?

Run this on 20 sampled summaries weekly. A pass rate < 0.9 means something changed — check for model drift, doc format changes, or chunking regressions.
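The weekly tally can be a few lines of bookkeeping (a sketch; field names mirror the checklist above):

```python
CHECKLIST = ("coverage", "faithfulness", "specificity", "brevity")


def pass_rate(reviews: list[dict[str, bool]]) -> float:
    """A summary passes only if every checklist item is True."""
    passed = sum(all(r[item] for item in CHECKLIST) for r in reviews)
    return passed / len(reviews)
```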

Architecture
#

  1. Ingest and clean text (see text-cleaning pipeline)
  2. Segment by sections; avoid cross-topic chunks
  3. Extractive baseline (TextRank or embedding-based key sentence selection)
  4. Abstractive refinement with constrained prompting
  5. Score with ROUGE/BERTScore + human checklist

Extractive baseline example
#

from __future__ import annotations
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def textrank_sentences(sentences: list[str], top_k: int = 5) -> list[str]:
    """TextRank-style centrality: score each sentence by its summed
    TF-IDF cosine similarity to all others (no full PageRank iteration)."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = (tfidf @ tfidf.T).toarray()  # .A is deprecated in recent SciPy
    scores = sim.sum(axis=1)
    idx = np.argsort(-scores)[:top_k]
    return [sentences[i] for i in sorted(idx)]  # restore document order

Abstractive refinement prompt (LLM)
#

You are summarizing a section for financial analysts.
Constraints:
- Keep numbers and entities accurate.
- No claims beyond the provided sentences.
- Max 120 words.

Sentences:
<paste extractive sentences>
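The prompt above can be templated so the extractive output slots in programmatically (a sketch; the template and function names are assumptions, adapt to whatever client you call):

```python
PROMPT_TEMPLATE = """You are summarizing a section for financial analysts.
Constraints:
- Keep numbers and entities accurate.
- No claims beyond the provided sentences.
- Max 120 words.

Sentences:
{sentences}"""


def build_prompt(extracted: list[str]) -> str:
    """Render the constrained prompt with the extractive sentences filled in."""
    return PROMPT_TEMPLATE.format(
        sentences="\n".join(f"- {s}" for s in extracted)
    )
```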

Metrics and acceptance criteria
#

  • ROUGE-L ≥ 0.35 on validation set; BERTScore-F1 ≥ 0.86 on domain corpus.
  • Human checklist pass rate ≥ 0.9 (sampled 20 summaries weekly).
  • Drift alerts if either metric drops ≥ 10% week-over-week.
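These criteria are easy to encode as a CI-style gate (threshold values copied from the bullets above; the function names are hypothetical):

```python
def accept(rouge_l: float, bert_f1: float, checklist_rate: float) -> bool:
    """Release gate mirroring the acceptance criteria."""
    return rouge_l >= 0.35 and bert_f1 >= 0.86 and checklist_rate >= 0.9


def drifted(prev: float, curr: float, tol: float = 0.10) -> bool:
    """Alert when a metric drops >= 10% week-over-week."""
    return prev > 0 and (prev - curr) / prev >= tol
```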

Failure modes and fixes
#

  • Missing critical bullet: increase top_k extractive or re-segment by section headings.
  • Fabricated numbers: add unit tests scanning for number changes vs source.
  • Repetition/bloat: enforce word cap and remove boilerplate via cleaning.
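The fabricated-numbers test from the list above can start as a simple set difference (the regex is a rough sketch; tune it for the currencies and date formats in your corpus):

```python
import re


def extract_numbers(text: str) -> set[str]:
    """Pull numeric tokens, normalizing thousands separators."""
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*\.?\d*%?", text)}


def fabricated_numbers(source: str, summary: str) -> set[str]:
    """Numbers appearing in the summary but nowhere in the source."""
    return extract_numbers(summary) - extract_numbers(source)
```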

Chunking Strategy Matters
#

Bad chunking is the most common reason summarization fails silently. If a chunk cuts in the middle of a key finding, the model never sees it.

import re

def segment_by_headings(text: str) -> list[str]:
    """Split on markdown or common heading patterns."""
    pattern = r"(?=\n#{1,3} |\n[A-Z][A-Z\s]{3,}\n)"
    sections = re.split(pattern, text)
    return [s.strip() for s in sections if s.strip()]

def chunk_section(section: str, max_tokens: int = 800) -> list[str]:
    """Sentence-boundary chunker; words approximate tokens here (a rough
    but serviceable proxy for English prose)."""
    sentences = re.split(r"(?<=[.!?])\s+", section)
    chunks, current, count = [], [], 0
    for sent in sentences:
        word_count = len(sent.split())
        if count + word_count > max_tokens and current:
            chunks.append(" ".join(current))
            current, count = [sent], word_count
        else:
            current.append(sent)
            count += word_count
    if current:
        chunks.append(" ".join(current))
    return chunks

Integration Notes
#

  • Store source sentence IDs alongside summaries for traceability (which sentences informed each claim).
  • Log tokens, latency, and scores per job — create a dashboard so regressions are visible before users report them.
  • For long docs (annual reports, legal briefs): summarize sections first, then synthesize an executive summary from those section summaries. Two-pass keeps each LLM call within context window.
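The two-pass pattern is a few lines once the summarizer is injected as a callable (a sketch; `summarize` stands in for your model call):

```python
from typing import Callable


def two_pass_summary(sections: list[str], summarize: Callable[[str], str]) -> str:
    """Pass 1: summarize each section independently. Pass 2: synthesize an
    executive summary from the section summaries, keeping each call small."""
    section_summaries = [summarize(sec) for sec in sections]
    return summarize("\n\n".join(section_summaries))
```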
