Product catalogs rarely match 1:1. Supplier A calls it “Apple iPhone 13 Pro 256GB Space Grey” while your system has “iPhone 13 Pro - 256 - Gray”. String equality fails. This guide covers a three-stage approach combining lexical, surface, and semantic similarity to match entities at scale with minimal false positives.
## TL;DR
- Stage 1: TF-IDF cosine for fast candidate generation (cheap, no model needed)
- Stage 2: Jaro-Winkler for surface/character similarity re-ranking
- Stage 3: Small embeddings (`text-embedding-3-small`) for semantic tie-breaking
- Route results through score thresholds: auto-accept ≥ 0.8, manual review 0.6–0.8, reject < 0.6
## The Problem
Entity matching (also called record linkage or deduplication) is the task of identifying records across two datasets that refer to the same real-world entity. It comes up constantly in:
- Retail: Matching your product catalog against a supplier feed
- Finance: Deduplicating company names across data sources (e.g., “Goldman Sachs” vs “Goldman Sachs & Co LLC”)
- Healthcare: Matching patient records across hospital systems
- E-commerce: Normalizing product listings from multiple merchants
Exact string matching fails immediately — different abbreviations, punctuation, word order, and typos make it unusable at scale. You need a pipeline that handles all of these gracefully.
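To make the failure concrete, a minimal sketch using the two hypothetical names from the intro:

```python
a = "Apple iPhone 13 Pro 256GB Space Grey"
b = "iPhone 13 Pro - 256 - Gray"

print(a == b)                  # False: exact equality is useless
print(a.lower() == b.lower())  # still False: case is not the problem
# The names clearly overlap, but only partially:
shared = set(a.lower().split()) & set(b.lower().split())
print(sorted(shared))          # ['13', 'iphone', 'pro']
```

The overlapping tokens are the signal the pipeline below exploits; everything else (punctuation, "256GB" vs "256", "Grey" vs "Gray") is noise to normalize away or score fuzzily.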
## Approach
Three-stage pipeline, ordered by cost:
- Candidate generation with TF-IDF cosine — fast, no model, narrows the search space
- Re-ranking with Jaro-Winkler — character-level surface similarity catches abbreviations and typos
- Semantic tie-breaking with embeddings — resolves synonyms and paraphrase mismatches
## Step 0: Normalize Before Matching
Normalization removes noise before any similarity computation. It's the highest-ROI step, and the one most often skipped.
```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, remove punctuation, collapse whitespace."""
    text = text.lower()
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
```
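The accent-stripping trick inside `normalize` is worth seeing in isolation: a small sketch of the NFD decompose-then-drop-marks approach.

```python
import unicodedata

def strip_accents(s: str) -> str:
    # NFD splits base characters from their combining marks (category "Mn");
    # dropping the marks leaves plain ASCII-friendly base letters.
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

print(strip_accents("Müller Crème Brûlée"))  # Muller Creme Brulee
```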
```python
# Domain-specific: normalize units and common aliases.
# Keys must be in post-normalize form (lowercase, punctuation stripped),
# so "0.5 l" appears here as "0 5 l".
UNIT_MAP = {"milliliter": "ml", "millilitre": "ml", "0 5 l": "500ml", "0 5l": "500ml"}
ALIAS_MAP = {"google": "alphabet", "fb": "facebook", "meta": "facebook"}

def normalize_domain(text: str) -> str:
    text = normalize(text)
    for src, dst in UNIT_MAP.items():
        text = text.replace(src, dst)
    for alias, canonical in ALIAS_MAP.items():
        text = re.sub(rf"\b{alias}\b", canonical, text)
    return text
```

## Steps 1–2: Candidate Generation and Re-ranking

TF-IDF cosine narrows the search space cheaply; Jaro-Winkler re-ranks the survivors on character-level similarity. The two scores are blended 0.7/0.3.

```python
from __future__ import annotations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaro_winkler(a: str, b: str) -> float:
    try:
        import jellyfish
        # jellyfish ≥ 0.8 renamed jaro_winkler to jaro_winkler_similarity
        fn = getattr(jellyfish, "jaro_winkler_similarity", None) or jellyfish.jaro_winkler
        return fn(a, b)
    except Exception:
        return 0.0  # fall back to TF-IDF-only scoring if jellyfish is unavailable

def match_entities(left: list[str], right: list[str]) -> list[tuple[str, str, float]]:
    # Fit on both sides so vocabulary unique to `right` is not dropped
    tfidf = TfidfVectorizer(min_df=1, ngram_range=(1, 2))
    tfidf.fit(left + right)
    Xl = tfidf.transform(left)
    Xr = tfidf.transform(right)
    sims = cosine_similarity(Xl, Xr)
    matches = []
    for i, row in enumerate(sims):
        idx = int(np.argmax(row))
        jw = jaro_winkler(left[i], right[idx])
        score = 0.7 * row[idx] + 0.3 * jw
        matches.append((left[i], right[idx], float(score)))
    return matches

if __name__ == "__main__":
    a = ["Apple iPhone 13 Pro", "Samsung Galaxy S22"]
    b = ["iPhone 13 Pro Max by Apple", "Galaxy S22 Ultra Samsung"]
    for m in match_entities(a, b):
        print(m)
```

## Step 3: Semantic Tie-Breaking with Embeddings
For pairs that score 0.6–0.8 on the blended score, a small embedding model resolves ambiguity from synonyms and paraphrases that TF-IDF and Jaro-Winkler miss.
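Since the embedding helper below L2-normalizes its vectors, cosine similarity reduces to a plain dot product. A toy sketch with stand-in 2-D vectors (not real embeddings) shows the property:

```python
import numpy as np

# Toy vectors standing in for embeddings: once rows are L2-normalized,
# the dot product of any two rows equals their cosine similarity.
v = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]])
n = v / np.linalg.norm(v, axis=1, keepdims=True)
print(round(float(n[0] @ n[1]), 6))  # 1.0 (same direction, different magnitude)
print(round(float(n[0] @ n[2]), 6))  # 0.0 (orthogonal)
```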
```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reuse one client rather than constructing per call

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    resp = client.embeddings.create(input=texts, model=model)
    vecs = np.array([r.embedding for r in resp.data], dtype="float32")
    # L2-normalize so cosine similarity is a plain dot product
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-10)

def semantic_score(a: str, b: str) -> float:
    vecs = embed_batch([a, b])
    return float(np.dot(vecs[0], vecs[1]))

# Only call embeddings for ambiguous pairs to keep costs low
REVIEW_BAND = (0.6, 0.8)

def resolve_ambiguous(matches):
    resolved = []
    for left, right, score in matches:
        if REVIEW_BAND[0] <= score < REVIEW_BAND[1]:
            sem = semantic_score(left, right)
            # Blend: give the semantic score 40% weight in the ambiguous band
            score = 0.6 * score + 0.4 * sem
        resolved.append((left, right, score))
    return resolved
```

## Thresholds and QA
- Accept ≥ 0.8 as confident match; 0.6–0.8 → manual review queue; < 0.6 reject.
- Evaluate with precision@1 and manual spot-checks on a labelled gold set.
## Full Pipeline
```python
def run_pipeline(left: list[str], right: list[str]) -> list[tuple[str, str, float, str]]:
    """
    Returns: list of (left_entity, right_entity, score, decision)
    decision: 'accept' | 'review' | 'reject'
    """
    # Step 0: Normalize
    left_norm = [normalize_domain(x) for x in left]
    right_norm = [normalize_domain(x) for x in right]
    # Steps 1 + 2: TF-IDF + Jaro-Winkler blended score
    raw_matches = match_entities(left_norm, right_norm)
    # Step 3: Semantic resolution for ambiguous pairs
    resolved = resolve_ambiguous(raw_matches)
    # Map the matched (normalized) right strings back to their originals.
    # Assumes normalization doesn't collide two right-side entries.
    norm_to_orig = dict(zip(right_norm, right))
    # Route by threshold
    results = []
    for orig_l, (_, matched_r, score) in zip(left, resolved):
        orig_r = norm_to_orig.get(matched_r, matched_r)
        if score >= 0.8:
            decision = "accept"
        elif score >= 0.6:
            decision = "review"
        else:
            decision = "reject"
        results.append((orig_l, orig_r, score, decision))
    return results
```

## Architecture & Workflow
- Normalize product titles (case, unicode, punctuation, units, aliases)
- Generate candidates via TF-IDF cosine (top-10)
- Re-rank with Jaro-Winkler; compute blended score (0.7 TF-IDF + 0.3 JW)
- For the 0.6–0.8 band: embed with `text-embedding-3-small` for semantic tie-breaking
- Threshold routing: accept / manual review / reject
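The top-10 candidate step above can be sketched with `np.argpartition` over the cosine matrix. Here `sims` is a random stand-in for the TF-IDF similarity matrix produced inside `match_entities`:

```python
import numpy as np

rng = np.random.default_rng(0)
sims = rng.random((4, 100))  # stand-in cosine matrix: 4 left rows x 100 right columns
k = 10

# argpartition finds the k largest columns per row in O(n) (unsorted);
# then sort just that k-wide slice in descending order.
topk = np.argpartition(sims, -k, axis=1)[:, -k:]
order = np.argsort(np.take_along_axis(sims, topk, axis=1), axis=1)[:, ::-1]
topk_sorted = np.take_along_axis(topk, order, axis=1)
print(topk_sorted.shape)  # (4, 10)
```

For catalogs in the 10K range this avoids materializing a full sorted ranking per row; the re-ranking stages then only score these k candidates.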
## Evaluation Harness
```python
from __future__ import annotations

def precision_at_1(gold: list[tuple[str, str]], preds: list[tuple[str, str, float]]) -> float:
    lookup = {a: b for a, b in gold}
    hits = 0
    for a, b, _ in preds:
        hits += int(lookup.get(a) == b)
    return hits / max(1, len(preds))

if __name__ == "__main__":
    gold = [("Apple iPhone 13 Pro", "iPhone 13 Pro Max by Apple"), ("Samsung Galaxy S22", "Galaxy S22 Ultra Samsung")]
    preds = match_entities([g[0] for g in gold], [g[1] for g in gold])
    print({"p@1": precision_at_1(gold, preds)})
```

Target: ≥ 0.9 p@1 on clean catalogs; add a human review queue for ambiguous ranges.
## Performance Characteristics
| Stage | Latency (10K pairs) | Cost | When it helps |
|---|---|---|---|
| TF-IDF cosine | ~0.5s | Free | Always — fast candidate narrowing |
| Jaro-Winkler | ~1s | Free | Typos, abbreviations, word-order differences |
| Embeddings (3-small) | ~3s + API | ~$0.02/1M tokens | Synonyms, paraphrases, cross-language |
The embedding stage only runs on the ambiguous 0.6–0.8 band, so in practice you only pay for a fraction of your catalog.
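Back-of-envelope cost, with hypothetical volumes (15% review-band share, roughly 20 tokens per pair) and the text-embedding-3-small list price at the time of writing:

```python
pairs = 10_000
review_frac = 0.15        # hypothetical share of pairs landing in the 0.6-0.8 band
tokens_per_pair = 20      # two short product titles, roughly 10 tokens each
usd_per_1m_tokens = 0.02  # text-embedding-3-small list price (check current pricing)

cost = pairs * review_frac * tokens_per_pair / 1_000_000 * usd_per_1m_tokens
print(f"${cost:.4f}")  # $0.0006
```

Even a full-catalog pass would cost well under a cent at these sizes; the band filter matters more for latency than for dollars here, but the ratio scales with catalog size.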
## Edge Cases and Fixes
- Brand aliases (Google vs Alphabet): maintain a canonical alias map and apply it during normalization.
- Units and pack sizes (“500ml” vs “0.5 L”): normalize units before matching — both become “500ml”.
- Noise tokens (“new”, “sale”, “limited edition”): remove domain-specific stopwords to avoid false matches.
- Transliteration (Müller vs Mueller): strip accents in normalization (`unicodedata` NFD decomposition).
- Short entity names (single words like “Apple”): set a minimum candidate threshold — short names match too broadly on TF-IDF.
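A minimal sketch of the noise-token fix; the stopword list is hypothetical and should be tuned per catalog:

```python
# Hypothetical domain stopwords: marketing tokens that inflate similarity
# without saying anything about the entity's identity.
DOMAIN_STOPWORDS = {"new", "sale", "limited", "edition", "official"}

def strip_noise(text: str) -> str:
    """Drop domain stopwords from an already-normalized string."""
    return " ".join(t for t in text.split() if t not in DOMAIN_STOPWORDS)

print(strip_noise("new apple iphone 13 pro limited edition"))  # apple iphone 13 pro
```

Apply this after `normalize_domain` so the stopword list only needs lowercase, punctuation-free forms.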
## Integrations
- Push `accept` matches to your MDM system with versioned lineage (record source, score, timestamp).
- Emit `review` matches to a human-review dashboard with both raw names and scores visible.
- Schedule nightly diffs on new catalog entries; alert on drift in p@1 vs your gold set.
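The nightly drift alert can be as simple as a tolerance check against the gold-set baseline (a sketch; the 0.02 tolerance is an assumed starting point, not a recommendation):

```python
def check_drift(current_p1: float, baseline_p1: float, tol: float = 0.02) -> bool:
    """Alert when precision@1 drops more than `tol` below the gold-set baseline."""
    return (baseline_p1 - current_p1) > tol

print(check_drift(current_p1=0.87, baseline_p1=0.92))  # True: a 0.05 drop exceeds 0.02
print(check_drift(current_p1=0.91, baseline_p1=0.92))  # False: within tolerance
```

Recompute the baseline whenever the gold set or the normalization maps change, or every legitimate improvement will look like drift.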
