NLP Entity Matching with Fuzzy Search

·1100 words·6 mins·
Subhajit Bhar

Product catalogs rarely match 1:1. Supplier A calls it “Apple iPhone 13 Pro 256GB Space Grey” while your system has “iPhone 13 Pro - 256 - Gray”. String equality fails. This guide covers a three-stage approach combining lexical, surface, and semantic similarity to match entities at scale with minimal false positives.

TL;DR

  • Stage 1: TF-IDF cosine for fast candidate generation (cheap, no model needed)
  • Stage 2: Jaro-Winkler for surface/character similarity re-ranking
  • Stage 3: Small embeddings (text-embedding-3-small) for semantic tie-breaking
  • Route results through score thresholds: auto-accept ≥ 0.8, manual review 0.6–0.8, reject < 0.6

The Problem

Entity matching (also called record linkage or deduplication) is the task of identifying records across two datasets that refer to the same real-world entity. It comes up constantly in:

  • Retail: Matching your product catalog against a supplier feed
  • Finance: Deduplicating company names across data sources (e.g., “Goldman Sachs” vs “Goldman Sachs & Co LLC”)
  • Healthcare: Matching patient records across hospital systems
  • E-commerce: Normalizing product listings from multiple merchants

Exact string matching fails immediately — different abbreviations, punctuation, word order, and typos make it unusable at scale. You need a pipeline that handles all of these gracefully.
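To make the failure concrete, here is a minimal sketch using the two product strings from the intro: exact equality returns False, and even naive token-set overlap (Jaccard) only reaches 0.3 on a pair a human matches instantly.

```python
a = "Apple iPhone 13 Pro 256GB Space Grey"
b = "iPhone 13 Pro - 256 - Gray"

print(a == b)  # False: exact equality fails outright

# Token-level Jaccard overlap recovers only a weak signal
ta, tb = set(a.lower().split()), set(b.lower().split())
jaccard = len(ta & tb) / len(ta | tb)
print(jaccard)  # 0.3
```

"256GB" vs "256" and "grey" vs "gray" never intersect as raw tokens, which is exactly the gap the pipeline below closes.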

Approach

Three-stage pipeline, ordered by cost:

  1. Candidate generation with TF-IDF cosine — fast, no model, narrows the search space
  2. Re-ranking with Jaro-Winkler — character-level surface similarity catches abbreviations and typos
  3. Semantic tie-breaking with embeddings — resolves synonyms and paraphrase mismatches

Step 0: Normalize Before Matching

Normalization removes noise before any similarity computation. It’s the highest-ROI step and often ignored.

import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, remove punctuation, collapse whitespace."""
    text = text.lower()
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
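Before adding domain rules, it helps to eyeball what the generic normalizer produces (the function is repeated here so the snippet runs standalone):

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, remove punctuation, collapse whitespace."""
    text = text.lower()
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Müller 0.5L!"))                # "muller 0 5l"
print(normalize("iPhone 13 Pro - 256 - Gray"))  # "iphone 13 pro 256 gray"
```

Note that the punctuation pass turns "0.5l" into "0 5l", so any unit rewrites have to run before this step.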

# Domain-specific: normalize units and common abbreviations.
# Unit rewrites run on the raw lowercased text, before punctuation is
# stripped, so keys like "0.5 l" can still match.
UNIT_MAP = {"milliliter": "ml", "0.5 l": "500ml", "0.5l": "500ml"}
ALIAS_MAP = {"google": "alphabet", "fb": "facebook", "meta": "facebook"}

def normalize_domain(text: str) -> str:
    text = text.lower()
    for src, dst in UNIT_MAP.items():
        text = text.replace(src, dst)
    text = normalize(text)
    for alias, canonical in ALIAS_MAP.items():
        text = re.sub(rf"\b{alias}\b", canonical, text)
    return text

Steps 1 + 2: Candidate Generation and Surface Re-Ranking

TF-IDF cosine generates candidates cheaply, and Jaro-Winkler re-ranks them on character-level similarity; the script below blends the two scores 70/30.
from __future__ import annotations
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def jaro_winkler(a: str, b: str) -> float:
    """Jaro-Winkler similarity via jellyfish; falls back to 0.0 if unavailable."""
    try:
        import jellyfish
    except ImportError:
        return 0.0
    # jellyfish >= 1.0 renamed jaro_winkler to jaro_winkler_similarity
    fn = getattr(jellyfish, "jaro_winkler_similarity", None) or jellyfish.jaro_winkler
    return fn(a, b)


def match_entities(left: list[str], right: list[str]) -> list[tuple[str, str, float]]:
    # Fit the vocabulary on both sides so right-only tokens aren't dropped
    tfidf = TfidfVectorizer(min_df=1, ngram_range=(1, 2))
    tfidf.fit(left + right)
    Xl = tfidf.transform(left)
    Xr = tfidf.transform(right)
    sims = cosine_similarity(Xl, Xr)
    matches = []
    for i, row in enumerate(sims):
        idx = int(np.argmax(row))
        jw = jaro_winkler(left[i], right[idx])
        # Blend: lexical similarity dominates, surface similarity re-ranks
        score = 0.7 * row[idx] + 0.3 * jw
        matches.append((left[i], right[idx], float(score)))
    return matches


if __name__ == "__main__":
    a = ["Apple iPhone 13 Pro", "Samsung Galaxy S22"]
    b = ["iPhone 13 Pro Max by Apple", "Galaxy S22 Ultra Samsung"]
    for m in match_entities(a, b):
        print(m)

Step 3: Semantic Tie-Breaking with Embeddings

For pairs that score 0.6–0.8 on the blended score, a small embedding model resolves ambiguity from synonyms and paraphrases that TF-IDF and Jaro-Winkler miss.

import numpy as np
from openai import OpenAI

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    client = OpenAI()
    resp = client.embeddings.create(input=texts, model=model)
    vecs = np.array([r.embedding for r in resp.data], dtype="float32")
    # L2-normalize for cosine via dot product
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-10)

def semantic_score(a: str, b: str) -> float:
    vecs = embed_batch([a, b])
    return float(np.dot(vecs[0], vecs[1]))

# Only call embeddings for ambiguous pairs to keep costs low
REVIEW_BAND = (0.6, 0.8)

def resolve_ambiguous(matches):
    resolved = []
    for left, right, score in matches:
        if REVIEW_BAND[0] <= score < REVIEW_BAND[1]:
            sem = semantic_score(left, right)
            # Blend: give semantic score 40% weight in ambiguous band
            score = 0.6 * score + 0.4 * sem
        resolved.append((left, right, score))
    return resolved

Thresholds and QA

  • Accept ≥ 0.8 as confident match; 0.6–0.8 → manual review queue; < 0.6 reject.
  • Evaluate with precision@1 and manual spot-checks on a labelled gold set.
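
The routing rule is small enough to pin down in code (a sketch, with the threshold values from the bullets above as defaults):

```python
def route(score: float, accept: float = 0.8, review: float = 0.6) -> str:
    """Map a blended match score to a pipeline decision."""
    if score >= accept:
        return "accept"
    if score >= review:
        return "review"
    return "reject"

print(route(0.85), route(0.7), route(0.4))  # accept review reject
```

Keeping the thresholds as parameters makes it easy to tighten the accept band per dataset once you have a gold set to calibrate against.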

Full Pipeline

def run_pipeline(left: list[str], right: list[str]) -> list[tuple[str, str, float, str]]:
    """
    Returns: list of (left_entity, right_entity, score, decision)
    decision: 'accept' | 'review' | 'reject'
    """
    # Step 0: Normalize
    left_norm = [normalize_domain(x) for x in left]
    right_norm = [normalize_domain(x) for x in right]

    # Step 1 + 2: TF-IDF + Jaro-Winkler blended score
    raw_matches = match_entities(left_norm, right_norm)

    # Step 3: Semantic resolution for ambiguous pairs
    resolved = resolve_ambiguous(raw_matches)

    # Map normalized names back to the originals (match_entities pairs
    # left[i] with its best right-side candidate, not with right[i])
    orig_l = dict(zip(left_norm, left))
    orig_r = dict(zip(right_norm, right))

    # Route by threshold
    results = []
    for l_norm, r_norm, score in resolved:
        if score >= 0.8:
            decision = "accept"
        elif score >= 0.6:
            decision = "review"
        else:
            decision = "reject"
        results.append((orig_l[l_norm], orig_r[r_norm], score, decision))
    return results

Architecture & Workflow

  1. Normalize product titles (case, unicode, punctuation, units, aliases)
  2. Generate candidates via TF-IDF cosine (the code above keeps the single best candidate per entity; extend to top-k if you need recall)
  3. Re-rank with Jaro-Winkler; compute blended score (0.7 TF-IDF + 0.3 JW)
  4. For 0.6–0.8 band: embed with text-embedding-3-small for semantic tie-breaking
  5. Threshold routing: accept / manual review / reject

Evaluation Harness

from __future__ import annotations
import numpy as np


def precision_at_1(gold: list[tuple[str, str]], preds: list[tuple[str, str, float]]):
    lookup = {a: b for a, b in gold}
    hits = 0
    for a, b, _ in preds:
        hits += int(lookup.get(a) == b)
    return hits / max(1, len(preds))


if __name__ == "__main__":
    gold = [("Apple iPhone 13 Pro", "iPhone 13 Pro Max by Apple"), ("Samsung Galaxy S22", "Galaxy S22 Ultra Samsung")]
    preds = match_entities([g[0] for g in gold], [g[1] for g in gold])
    print({"p@1": precision_at_1(gold, preds)})

Target: ≥ 0.9 p@1 on clean catalogs; add human review queue for ambiguous ranges.

Performance Characteristics

| Stage | Latency (10K pairs) | Cost | When it helps |
|-------|---------------------|------|---------------|
| TF-IDF cosine | ~0.5s | Free | Always — fast candidate narrowing |
| Jaro-Winkler | ~1s | Free | Typos, abbreviations, word-order differences |
| Embeddings (3-small) | ~3s + API | ~$0.002/1K tokens | Synonyms, paraphrases, cross-language |

The embedding stage only runs on the ambiguous 0.6–0.8 band, so in practice you only pay for a fraction of your catalog.
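
A back-of-envelope estimate makes the point (the pair count and token length here are illustrative assumptions, not measurements):

```python
# Hypothetical workload: 10K pairs land in the ambiguous band
pairs = 10_000
tokens_per_pair = 40       # both strings, roughly 20 tokens each
price_per_1k = 0.002       # USD per 1K tokens, text-embedding-3-small
cost = pairs * tokens_per_pair / 1000 * price_per_1k
print(round(cost, 2))  # 0.8
```

Even 10K ambiguous pairs cost well under a dollar, so the embedding stage is rarely the budget concern.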

Edge Cases and Fixes

  • Brand aliases (Google vs Alphabet): maintain a canonical alias map and apply it during normalization.
  • Units and pack sizes (“500ml” vs “0.5 L”): normalize units before matching — both become “500ml”.
  • Noise tokens (“new”, “sale”, “limited edition”): remove domain-specific stopwords to avoid false matches.
  • Transliteration (Müller vs Mueller): strip accents in normalization (unicodedata.NFD).
  • Short entity names (single words like “Apple”): set a minimum candidate threshold — short names match too broadly on TF-IDF.
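
For the noise-token fix, a minimal sketch (the stopword set here is illustrative; build yours from token frequencies in your own catalog):

```python
# Hypothetical domain stopword list; tune per catalog
NOISE_TOKENS = {"new", "sale", "limited", "edition", "official"}

def strip_noise(text: str) -> str:
    """Drop marketing filler tokens before similarity scoring."""
    return " ".join(t for t in text.lower().split() if t not in NOISE_TOKENS)

print(strip_noise("NEW Apple iPhone 13 Pro SALE"))  # "apple iphone 13 pro"
```

Run this after the generic normalization step so casing and punctuation are already gone.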

Integrations

  • Push accept matches to your MDM system with versioned lineage (record source, score, timestamp).
  • Emit review matches to a human-review dashboard with both raw names and scores visible.
  • Schedule nightly diffs on new catalog entries; alert on drift in p@1 vs your gold set.
