LLM Engineering

FAISS Index Types for Production RAG

29 January 2026·420 words·2 mins

IndexFlatIP works for small corpora. For production with 100K+ vectors, you need smarter indexes. Here’s how to choose and implement them.

FAISS Index Types Overview

| Index | Corpus Size | Memory | Accuracy | Build Time |
|---|---|---|---|---|
| IndexFlatIP | < 50K | High | Exact | Fast |
| IndexIVFFlat | 50K - 1M | Medium | ~95-99% | Medium |
| IndexHNSWFlat | 50K - 10M | High | ~95-99% | Slow |
| IndexIVFPQ | 1M+ | Low | ~90-95% | Slow |

IndexFlatIP (Baseline)

Exact search, no training required. Use for prototypes and small corpora.
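To make "exact search, no training required" concrete, here is a minimal pure-Python sketch of what a flat inner-product index computes: score every stored vector against the query and keep the best k. This is not FAISS code; `flat_ip_search` and the toy vectors are hypothetical names and data used only for illustration.

```python
import heapq
from typing import List, Sequence, Tuple


def flat_ip_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> List[Tuple[float, int]]:
    """Brute-force inner-product search: up to k (score, index) pairs, best first."""
    scored = (
        (sum(a * b for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    )
    # nlargest keeps the k highest scores; ties are broken by the larger index.
    return heapq.nlargest(k, scored)


# Toy corpus (hypothetical data): three 2-d vectors.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
top = flat_ip_search(corpus, query=[1.0, 0.0], k=2)
```

Every query touches every stored vector, so the cost grows linearly with corpus size. That is the trade-off the approximate index types in the table are designed to avoid.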

RAG for Knowledge-Intensive Tasks

24 September 2025·842 words·4 mins

Picture this: you’re asking an AI about cancer treatments. It sounds super confident and gives you detailed answers. But here’s the problem: it just made up a medical study that doesn’t exist. That’s not just embarrassing. When we’re talking about healthcare, finance, or legal advice, these AI “hallucinations” can be downright dangerous.

TL;DR

- RAG reduces LLM hallucinations by grounding answers in retrieved documents.
- Pipeline: chunk documents → embed → store in vector index → retrieve at query time → generate.
- Use RAG for knowledge-intensive tasks (legal, medical, finance) where accuracy is non-negotiable.
- Evaluate with RAGAS or custom metrics: faithfulness, answer relevancy, context recall.
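The pipeline from the TL;DR (chunk → embed → store → retrieve → generate) can be sketched end to end in plain Python. This is a toy under stated assumptions: the "embedding" is a bag-of-words count rather than a real model, the "vector index" is a list, and generation stops at building the grounded prompt. All function names here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk(text: str, max_words: int = 12) -> List[str]:
    """Split text into fixed-size word windows (no overlap, for brevity)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts. A real system would use a model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(docs: List[str]) -> List[Tuple[str, Counter]]:
    """Chunk and embed every document; the 'vector index' is just a list here."""
    return [(c, embed(c)) for doc in docs for c in chunk(doc)]


def retrieve(index: List[Tuple[str, Counter]], question: str, k: int = 2) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, contexts: List[str]) -> str:
    """Ground the generator by asking it to answer only from retrieved context."""
    joined = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Answer using only the context below.\n{joined}\nQuestion: {question}"


docs = [
    "Aspirin is commonly used to reduce fever and relieve mild pain.",
    "The Eiffel Tower is located in Paris and was completed in 1889.",
]
index = build_index(docs)
question = "When was the Eiffel Tower completed?"
contexts = retrieve(index, question, k=1)
prompt = build_prompt(question, contexts)  # this string would be sent to the LLM
```

The grounding happens in `build_prompt`: the model is asked to answer from the numbered context chunks rather than from memory, which is also what makes faithfulness measurable later.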

NLP Entity Matching with Fuzzy Search

14 August 2025·1100 words·6 mins

Product catalogs rarely match 1:1. Supplier A calls it “Apple iPhone 13 Pro 256GB Space Grey” while your system has “iPhone 13 Pro - 256 - Gray”. String equality fails. This guide covers a three-stage approach combining lexical, surface, and semantic similarity to match entities at scale with minimal false positives.
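A staged matcher along these lines might look like the sketch below. The mapping of stages to techniques is an assumption, not taken from the guide: token-set Jaccard for the lexical stage, `difflib.SequenceMatcher` for the surface stage, and a pluggable callable for the semantic stage. `toy_semantic_sim` is a hypothetical stand-in for an embedding model, and the thresholds are illustrative.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Set


def tokens(s: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def lexical_sim(a: str, b: str) -> float:
    """Token-set Jaccard overlap."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def surface_sim(a: str, b: str) -> float:
    """Character-level similarity from difflib, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def toy_semantic_sim(a: str, b: str) -> float:
    """Stand-in for embedding similarity: token containment after
    normalizing a few known variants (hypothetical mapping)."""
    canon = {"grey": "gray", "256gb": "256"}
    ta = {canon.get(t, t) for t in tokens(a)}
    tb = {canon.get(t, t) for t in tokens(b)}
    return len(ta & tb) / min(len(ta), len(tb)) if ta and tb else 0.0


def is_match(
    a: str,
    b: str,
    semantic_sim: Callable[[str, str], float],
    lexical_min: float = 0.2,
    surface_min: float = 0.5,
    semantic_min: float = 0.8,
) -> bool:
    # Stage 1: cheap lexical filter discards pairs with little token overlap.
    if lexical_sim(a, b) < lexical_min:
        return False
    # Stage 2: character-level check on the surviving pairs.
    if surface_sim(a, b) < surface_min:
        return False
    # Stage 3: the most expensive check runs only on the remaining candidates.
    return semantic_sim(a, b) >= semantic_min


supplier = "Apple iPhone 13 Pro 256GB Space Grey"
catalog = "iPhone 13 Pro - 256 - Gray"
matched = is_match(supplier, catalog, toy_semantic_sim)
```

Ordering the stages from cheapest to most expensive means the costly semantic comparison only sees pairs that already passed the earlier filters, which helps both throughput and false-positive control.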

Document Summarization: Eval First

14 August 2025·823 words·4 mins

Document summarization is a critical NLP task that helps users quickly grasp key information from long documents. But how do you know if your model is actually working? This guide shows a workflow that starts with evaluation and acceptance criteria before touching models — the approach that got a finance report summarizer from prototype to production in three weeks.
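As a rough illustration of "evaluation and acceptance criteria before touching models", the sketch below fixes pass/fail thresholds first and then checks candidate summaries against them. The two metrics (unigram recall against a reference, and a length ratio against the source) and the threshold values are assumptions chosen for the example, not the guide's actual criteria.

```python
import re
from typing import Dict, List


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(reference: str, summary: str) -> float:
    """Share of distinct reference words that also appear in the summary."""
    ref, summ = set(_tokens(reference)), set(_tokens(summary))
    return len(ref & summ) / len(ref) if ref else 0.0


def compression_ratio(source: str, summary: str) -> float:
    """Summary length divided by source length, in words."""
    return len(_tokens(summary)) / max(len(_tokens(source)), 1)


# Hypothetical acceptance criteria, written down before any model is chosen.
ACCEPTANCE: Dict[str, float] = {"min_recall": 0.6, "max_compression": 0.6}


def passes(source: str, reference: str, summary: str,
           criteria: Dict[str, float] = ACCEPTANCE) -> bool:
    return (
        unigram_recall(reference, summary) >= criteria["min_recall"]
        and compression_ratio(source, summary) <= criteria["max_compression"]
    )


source = (
    "Revenue rose 12 percent in the third quarter, driven by strong cloud sales. "
    "Operating costs were flat. The board approved a dividend of 40 cents per share."
)
reference = "Revenue rose 12 percent on cloud sales; dividend of 40 cents approved."
good = "Revenue rose 12 percent on strong cloud sales and a 40 cents dividend was approved."
bad = "Operating costs were flat."

good_ok = passes(source, reference, good)
bad_ok = passes(source, reference, bad)
```

Any model, prompt, or baseline can then be dropped into the same harness and compared against the same bar.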

RAG with LangChain: Architecture, Code, and Metrics

2 August 2025·1260 words·6 mins

RAG is a design pattern, not a product. LangChain supports it out of the box. This guide shows a production-ready RAG setup in LangChain with architecture, retrieval choices, runnable code, evaluation metrics, and trade-offs from my client projects.

TL;DR

- Short answer: LangChain doesn’t “contain” RAG; it provides the building blocks to implement RAG cleanly. You wire up chunking, embeddings, vector store, and a retrieval-aware prompt chain.
- What you get below: Architecture diagram, runnable code (LangChain 0.2+), evaluation harness, parameter trade-offs, and when to avoid LangChain for leaner stacks.
- Related deep dives: Foundations of RAG → RAG for Knowledge-Intensive Tasks. Lightweight pipelines → LightRAG: Lean RAG with Benchmarks.

Who should read this

- You’re building an internal knowledge assistant, support bot, or compliance Q&A system.
- You need answers that cite real documents with predictable latency and cost.
- You want a minimal, maintainable RAG in LangChain with evaluation, not a toy demo.

The problem I solved in production

When I implemented an extractive summarizer for financial and compliance reports, two pain points surfaced:

LightRAG: Lean RAG with Benchmarks

30 July 2025·884 words·5 mins

LightRAG is a minimal RAG toolkit that strips away heavy abstractions. Here’s a complete build with code, performance numbers versus a LangChain baseline, and when LightRAG is the right choice.

TL;DR

- LightRAG is a minimal RAG stack: FAISS + embeddings + prompt composition, ~120 lines.
- ~20% faster p50 latency vs LangChain on small corpora (≤ 500 chunks) due to fewer abstractions.
- Best for: serverless/edge deployments, small teams, single-purpose Q&A.
- Use LangChain instead when you need agents, tracing, callbacks, or multi-step workflows.
- Don’t skip data quality: clean text, handle missing values, validate numeric tables before indexing.

Why LightRAG

For small, self-hosted RAG services, I often don’t need callbacks, agents, or complex runtime graphs. I need:
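As an aside on the p50 latency comparison in the TL;DR: a median-latency figure like that could be collected with a small harness such as the sketch below. This is not the benchmark behind the ~20% number; `p50_latency_ms` and `fake_pipeline` are hypothetical names, and a real run would pass in the LightRAG and LangChain query functions.

```python
import statistics
import time
from typing import Callable, List


def p50_latency_ms(fn: Callable[[str], object], queries: List[str], warmup: int = 2) -> float:
    """Median wall-clock latency of fn over the queries, in milliseconds."""
    for q in queries[:warmup]:
        fn(q)  # warm caches so first-call overhead does not skew the median
    samples = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_pipeline(query: str) -> str:
    """Hypothetical stand-in for a real retrieve-and-generate call."""
    return query.upper()


latency = p50_latency_ms(fake_pipeline, ["what is rag?"] * 20)
```

Running both pipelines over the same query set on the same machine, and comparing medians rather than means, keeps a few slow outliers from dominating the result.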

Reranking for Better RAG Retrieval

29 January 2025·566 words·3 mins

Bi-encoder retrieval is fast but imprecise. Cross-encoder reranking improves top-k precision at the cost of some latency. Here’s when and how to add it.

TL;DR

- Bi-encoders are fast (embeddings precomputed) but miss query-document interaction.
- Cross-encoders are slower but far more accurate because they encode the query and document together.
- Pattern: retrieve top-20 with bi-encoder, rerank to top-4 with cross-encoder.
- Start with ms-marco-MiniLM-L-6-v2 (80MB, fast, good accuracy).
- Skip reranking if latency budget < 200ms or bi-encoder recall is already high.

Bi-Encoder vs Cross-Encoder

Bi-encoder (used in vector search):
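The retrieve-then-rerank pattern from the TL;DR can be sketched independently of any model library. In the sketch below, both scorers are toy stand-ins (assumptions for illustration): in practice `fast_score` would be a bi-encoder similarity over precomputed embeddings and `slow_score` a cross-encoder such as ms-marco-MiniLM-L-6-v2. The function and variable names are hypothetical.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    fast_score: Scorer,
    slow_score: Scorer,
    retrieve_k: int = 20,
    final_k: int = 4,
) -> List[str]:
    # Stage 1: cheap scorer over the whole corpus, keep retrieve_k candidates.
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:retrieve_k]
    # Stage 2: expensive scorer over the candidates only.
    return sorted(candidates, key=lambda d: slow_score(query, d), reverse=True)[:final_k]


def shared_words(query: str, doc: str) -> float:
    """Toy 'bi-encoder': number of distinct words in common."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_density(query: str, doc: str) -> float:
    """Toy 'cross-encoder': shared words relative to document length."""
    return shared_words(query, doc) / max(len(doc.split()), 1)


docs = [
    "to reset your password open the login page and follow the emailed link",
    "reset password help",
    "password rules",
    "lunch menu",
]
result = retrieve_then_rerank(
    "reset password", docs, shared_words, overlap_density, retrieve_k=3, final_k=2
)
```

The expensive scorer runs on `retrieve_k` documents instead of the whole corpus, which is what keeps the added latency bounded.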

LightRAG as a LangChain Retriever

29 January 2025·594 words·3 mins

Want LightRAG’s lean retrieval with LangChain’s chain ecosystem? Here’s how to wrap LightRAG as a LangChain-compatible retriever, keeping retrieval explicit and fast while using LangChain for everything downstream.

TL;DR

- Implement BaseRetriever._get_relevant_documents to make any retriever LangChain-compatible.
- LightRAG’s FAISS retrieval slots straight into LangChain chains, LCEL, and agents.
- Use this pattern when migrating an existing LangChain pipeline to leaner retrieval incrementally.
- For full LangChain pipelines without constraints, the standard LangChain retriever is fine.

Why Combine LightRAG with LangChain

LightRAG gives you minimal, fast retrieval. LangChain gives you chains, agents, and tooling. Sometimes you want both:

BM25 Hybrid Search with LightRAG

29 January 2025·643 words·4 mins

Vector search misses keyword-heavy queries. BM25 misses semantic similarity. Combine both with hybrid search for better retrieval recall.

TL;DR

- Vector search (FAISS): great for semantic/paraphrase queries, bad for exact codes or IDs.
- BM25: great for keyword/exact matches, bad for synonyms and paraphrases.
- Hybrid with RRF: combines both rank lists, so no score normalization is needed.
- Start with vector_weight=0.5. Lower it if users search exact product codes frequently.

Why Hybrid Search

Pure vector search struggles with:
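Returning to the RRF fusion mentioned in the TL;DR: reciprocal rank fusion scores each document by the sum of 1 / (k + rank) over the lists it appears in, so only rank positions matter and raw scores never need to be put on a common scale. The sketch below is one possible weighted variant. How `vector_weight` enters the formula and the constant k=60 are assumptions, and `rrf_fuse` is a hypothetical name, not necessarily the post's implementation.

```python
from typing import Dict, List


def rrf_fuse(
    vector_ranked: List[str],
    bm25_ranked: List[str],
    vector_weight: float = 0.5,
    k: int = 60,
) -> List[str]:
    """Fuse two ranked lists of doc ids with weighted reciprocal rank fusion."""
    scores: Dict[str, float] = {}
    weighted_lists = (
        (vector_weight, vector_ranked),
        (1.0 - vector_weight, bm25_ranked),
    )
    for weight, ranking in weighted_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank position is used, never the retriever's raw score.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; doc id breaks exact ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]   # hypothetical ids from vector search
bm25_hits = ["c", "a", "d"]     # hypothetical ids from BM25

balanced = rrf_fuse(vector_hits, bm25_hits, vector_weight=0.5)
keyword_heavy = rrf_fuse(vector_hits, bm25_hits, vector_weight=0.2)
```

With equal weights, "a" ranks first because it sits near the top of both lists. Lowering `vector_weight` to 0.2 moves BM25's top hit "c" to the front, which is the adjustment suggested for users who search exact product codes.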
