Product catalogs rarely match 1:1. Supplier A calls it “Apple iPhone 13 Pro 256GB Space Grey” while your system has “iPhone 13 Pro - 256 - Gray”. String equality fails. This guide covers a three-stage approach combining lexical, surface, and semantic similarity to match entities at scale with minimal false positives.
## TL;DR
- Stage 1: TF-IDF cosine for fast candidate generation (cheap, no model needed)
- Stage 2: Jaro-Winkler for surface/character similarity re-ranking
- Stage 3: Small embeddings (`text-embedding-3-small`) for semantic tie-breaking
- Route results through score thresholds: auto-accept ≥ 0.8, manual review 0.6–0.8, reject < 0.6
## The Problem
Entity matching (also called record linkage or deduplication) is the task of identifying records across two datasets that refer to the same real-world entity. It comes up constantly in:
- Retail: Matching your product catalog against a supplier feed
- Finance: Deduplicating company names across data sources (e.g., “Goldman Sachs” vs “Goldman Sachs & Co LLC”)
- Healthcare: Matching patient records across hospital systems
- E-commerce: Normalizing product listings from multiple merchants
Exact string matching fails immediately — different abbreviations, punctuation, word order, and typos make it unusable at scale. You need a pipeline that handles all of these gracefully.
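To make the failure concrete, a minimal sketch using the two hypothetical names from the intro:

```python
a = "Apple iPhone 13 Pro 256GB Space Grey"
b = "iPhone 13 Pro - 256 - Gray"

print(a == b)                  # False: exact equality is useless
print(a.lower() == b.lower())  # still False: case is not the problem
# The names clearly overlap, but only partially:
shared = set(a.lower().split()) & set(b.lower().split())
print(sorted(shared))          # ['13', 'iphone', 'pro']
```

The overlapping tokens are the signal the pipeline below exploits; everything else (punctuation, "256GB" vs "256", "Grey" vs "Gray") is noise to normalize away or score fuzzily.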
## Approach
Three-stage pipeline, ordered by cost:
- Candidate generation with TF-IDF cosine — fast, no model, narrows the search space
- Re-ranking with Jaro-Winkler — character-level surface similarity catches abbreviations and typos
- Semantic tie-breaking with embeddings — resolves synonyms and paraphrase mismatches
## Step 0: Normalize Before Matching
Normalization removes noise before any similarity computation. It's the highest-ROI step, and the one most often skipped.
```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, remove punctuation, collapse whitespace."""
    text = text.lower()
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
```
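The accent-stripping trick inside `normalize` is worth seeing in isolation: a small sketch of the NFD decompose-then-drop-marks approach.

```python
import unicodedata

def strip_accents(s: str) -> str:
    # NFD splits base characters from their combining marks (category "Mn");
    # dropping the marks leaves plain ASCII-friendly base letters.
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

print(strip_accents("Müller Crème Brûlée"))  # Muller Creme Brulee
```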
```python
# Domain-specific: normalize units and common aliases.
# Keys must be in post-normalize form (lowercase, punctuation stripped),
# so "0.5 l" appears here as "0 5 l".
UNIT_MAP = {"milliliter": "ml", "millilitre": "ml", "0 5 l": "500ml", "0 5l": "500ml"}
ALIAS_MAP = {"google": "alphabet", "fb": "facebook", "meta": "facebook"}

def normalize_domain(text: str) -> str:
    text = normalize(text)
    for src, dst in UNIT_MAP.items():
        text = text.replace(src, dst)
    for alias, canonical in ALIAS_MAP.items():
        text = re.sub(rf"\b{alias}\b", canonical, text)
    return text
```

## Steps 1–2: Candidate Generation and Re-ranking

TF-IDF cosine narrows the search space cheaply; Jaro-Winkler re-ranks the survivors on character-level similarity. The two scores are blended 0.7/0.3.

```python
from __future__ import annotations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaro_winkler(a: str, b: str) -> float:
    try:
        import jellyfish
        # jellyfish ≥ 0.8 renamed jaro_winkler to jaro_winkler_similarity
        fn = getattr(jellyfish, "jaro_winkler_similarity", None) or jellyfish.jaro_winkler
        return fn(a, b)
    except Exception:
        return 0.0  # fall back to TF-IDF-only scoring if jellyfish is unavailable

def match_entities(left: list[str], right: list[str]) -> list[tuple[str, str, float]]:
    # Fit on both sides so vocabulary unique to `right` is not dropped
    tfidf = TfidfVectorizer(min_df=1, ngram_range=(1, 2))
    tfidf.fit(left + right)
    Xl = tfidf.transform(left)
    Xr = tfidf.transform(right)
    sims = cosine_similarity(Xl, Xr)
    matches = []
    for i, row in enumerate(sims):
        idx = int(np.argmax(row))
        jw = jaro_winkler(left[i], right[idx])
        score = 0.7 * row[idx] + 0.3 * jw
        matches.append((left[i], right[idx], float(score)))
    return matches

if __name__ == "__main__":
    a = ["Apple iPhone 13 Pro", "Samsung Galaxy S22"]
    b = ["iPhone 13 Pro Max by Apple", "Galaxy S22 Ultra Samsung"]
    for m in match_entities(a, b):
        print(m)
```

## Step 3: Semantic Tie-Breaking with Embeddings
For pairs that score 0.6–0.8 on the blended score, a small embedding model resolves ambiguity from synonyms and paraphrases that TF-IDF and Jaro-Winkler miss.
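Since the embedding helper below L2-normalizes its vectors, cosine similarity reduces to a plain dot product. A toy sketch with stand-in 2-D vectors (not real embeddings) shows the property:

```python
import numpy as np

# Toy vectors standing in for embeddings: once rows are L2-normalized,
# the dot product of any two rows equals their cosine similarity.
v = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]])
n = v / np.linalg.norm(v, axis=1, keepdims=True)
print(round(float(n[0] @ n[1]), 6))  # 1.0 (same direction, different magnitude)
print(round(float(n[0] @ n[2]), 6))  # 0.0 (orthogonal)
```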
```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reuse one client rather than constructing per call

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    resp = client.embeddings.create(input=texts, model=model)
    vecs = np.array([r.embedding for r in resp.data], dtype="float32")
    # L2-normalize so cosine similarity is a plain dot product
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-10)

def semantic_score(a: str, b: str) -> float:
    vecs = embed_batch([a, b])
    return float(np.dot(vecs[0], vecs[1]))

# Only call embeddings for ambiguous pairs to keep costs low
REVIEW_BAND = (0.6, 0.8)

def resolve_ambiguous(matches):
    resolved = []
    for left, right, score in matches:
        if REVIEW_BAND[0] <= score < REVIEW_BAND[1]:
            sem = semantic_score(left, right)
            # Blend: give the semantic score 40% weight in the ambiguous band
            score = 0.6 * score + 0.4 * sem
        resolved.append((left, right, score))
    return resolved
```

## Thresholds and QA
- Accept ≥ 0.8 as confident match; 0.6–0.8 → manual review queue; < 0.6 reject.
- Evaluate with precision@1 and manual spot-checks on a labelled gold set.
## Full Pipeline
```python
def run_pipeline(left: list[str], right: list[str]) -> list[tuple[str, str, float, str]]:
    """
    Returns: list of (left_entity, right_entity, score, decision)
    decision: 'accept' | 'review' | 'reject'
    """
    # Step 0: Normalize
    left_norm = [normalize_domain(x) for x in left]
    right_norm = [normalize_domain(x) for x in right]
    # Steps 1 + 2: TF-IDF + Jaro-Winkler blended score
    raw_matches = match_entities(left_norm, right_norm)
    # Step 3: Semantic resolution for ambiguous pairs
    resolved = resolve_ambiguous(raw_matches)
    # Map the matched (normalized) right strings back to their originals.
    # Assumes normalization doesn't collide two right-side entries.
    norm_to_orig = dict(zip(right_norm, right))
    # Route by threshold
    results = []
    for orig_l, (_, matched_r, score) in zip(left, resolved):
        orig_r = norm_to_orig.get(matched_r, matched_r)
        if score >= 0.8:
            decision = "accept"
        elif score >= 0.6:
            decision = "review"
        else:
            decision = "reject"
        results.append((orig_l, orig_r, score, decision))
    return results
```

## Architecture & Workflow
- Normalize product titles (case, unicode, punctuation, units, aliases)
- Generate candidates via TF-IDF cosine (top-10)
- Re-rank with Jaro-Winkler; compute blended score (0.7 TF-IDF + 0.3 JW)
- For the 0.6–0.8 band: embed with `text-embedding-3-small` for semantic tie-breaking
- Threshold routing: accept / manual review / reject
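The top-10 candidate step above can be sketched with `np.argpartition` over the cosine matrix. Here `sims` is a random stand-in for the TF-IDF similarity matrix produced inside `match_entities`:

```python
import numpy as np

rng = np.random.default_rng(0)
sims = rng.random((4, 100))  # stand-in cosine matrix: 4 left rows x 100 right columns
k = 10

# argpartition finds the k largest columns per row in O(n) (unsorted);
# then sort just that k-wide slice in descending order.
topk = np.argpartition(sims, -k, axis=1)[:, -k:]
order = np.argsort(np.take_along_axis(sims, topk, axis=1), axis=1)[:, ::-1]
topk_sorted = np.take_along_axis(topk, order, axis=1)
print(topk_sorted.shape)  # (4, 10)
```

For catalogs in the 10K range this avoids materializing a full sorted ranking per row; the re-ranking stages then only score these k candidates.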
## Evaluation Harness
```python
from __future__ import annotations

def precision_at_1(gold: list[tuple[str, str]], preds: list[tuple[str, str, float]]) -> float:
    lookup = {a: b for a, b in gold}
    hits = 0
    for a, b, _ in preds:
        hits += int(lookup.get(a) == b)
    return hits / max(1, len(preds))

if __name__ == "__main__":
    gold = [("Apple iPhone 13 Pro", "iPhone 13 Pro Max by Apple"), ("Samsung Galaxy S22", "Galaxy S22 Ultra Samsung")]
    preds = match_entities([g[0] for g in gold], [g[1] for g in gold])
    print({"p@1": precision_at_1(gold, preds)})
```

Target: ≥ 0.9 p@1 on clean catalogs; add a human review queue for ambiguous ranges.
## Performance Characteristics
| Stage | Latency (10K pairs) | Cost | When it helps |
|---|---|---|---|
| TF-IDF cosine | ~0.5s | Free | Always — fast candidate narrowing |
| Jaro-Winkler | ~1s | Free | Typos, abbreviations, word-order differences |
| Embeddings (3-small) | ~3s + API | ~$0.02/1M tokens | Synonyms, paraphrases, cross-language |
The embedding stage only runs on the ambiguous 0.6–0.8 band, so in practice you only pay for a fraction of your catalog.
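Back-of-envelope cost, with hypothetical volumes (15% review-band share, roughly 20 tokens per pair) and the text-embedding-3-small list price at the time of writing:

```python
pairs = 10_000
review_frac = 0.15        # hypothetical share of pairs landing in the 0.6-0.8 band
tokens_per_pair = 20      # two short product titles, roughly 10 tokens each
usd_per_1m_tokens = 0.02  # text-embedding-3-small list price (check current pricing)

cost = pairs * review_frac * tokens_per_pair / 1_000_000 * usd_per_1m_tokens
print(f"${cost:.4f}")  # $0.0006
```

Even a full-catalog pass would cost well under a cent at these sizes; the band filter matters more for latency than for dollars here, but the ratio scales with catalog size.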
## Edge Cases and Fixes
- Brand aliases (Google vs Alphabet): maintain a canonical alias map and apply it during normalization.
- Units and pack sizes (“500ml” vs “0.5 L”): normalize units before matching — both become “500ml”.
- Noise tokens (“new”, “sale”, “limited edition”): remove domain-specific stopwords to avoid false matches.
- Transliteration (Müller vs Mueller): strip accents in normalization (`unicodedata` NFD decomposition).
- Short entity names (single words like “Apple”): set a minimum candidate threshold — short names match too broadly on TF-IDF.
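A minimal sketch of the noise-token fix; the stopword list is hypothetical and should be tuned per catalog:

```python
# Hypothetical domain stopwords: marketing tokens that inflate similarity
# without saying anything about the entity's identity.
DOMAIN_STOPWORDS = {"new", "sale", "limited", "edition", "official"}

def strip_noise(text: str) -> str:
    """Drop domain stopwords from an already-normalized string."""
    return " ".join(t for t in text.split() if t not in DOMAIN_STOPWORDS)

print(strip_noise("new apple iphone 13 pro limited edition"))  # apple iphone 13 pro
```

Apply this after `normalize_domain` so the stopword list only needs lowercase, punctuation-free forms.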
## Integrations
- Push `accept` matches to your MDM system with versioned lineage (record source, score, timestamp).
- Emit `review` matches to a human-review dashboard with both raw names and scores visible.
- Schedule nightly diffs on new catalog entries; alert on drift in p@1 vs your gold set.
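The nightly drift alert can be as simple as a tolerance check against the gold-set baseline (a sketch; the 0.02 tolerance is an assumed starting point, not a recommendation):

```python
def check_drift(current_p1: float, baseline_p1: float, tol: float = 0.02) -> bool:
    """Alert when precision@1 drops more than `tol` below the gold-set baseline."""
    return (baseline_p1 - current_p1) > tol

print(check_drift(current_p1=0.87, baseline_p1=0.92))  # True: a 0.05 drop exceeds 0.02
print(check_drift(current_p1=0.91, baseline_p1=0.92))  # False: within tolerance
```

Recompute the baseline whenever the gold set or the normalization maps change, or every legitimate improvement will look like drift.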
