LightRAG: Lean RAG with Benchmarks

·884 words·5 mins·
Subhajit Bhar
I build production-grade document extraction pipelines for businesses that process invoices, lab reports, contracts, and other document types at scale.

LightRAG is a minimal RAG toolkit that strips away heavy abstractions. Here’s a complete build with code, performance numbers versus a LangChain baseline, and when LightRAG is the right choice.

TL;DR

  • LightRAG is a minimal RAG stack: FAISS + embeddings + prompt composition, ~120 lines.
  • ~20% faster p50 latency vs LangChain on small corpora (≤ 500 chunks) due to fewer abstractions.
  • Best for: serverless/edge deployments, small teams, single-purpose Q&A.
  • Use LangChain instead when you need agents, tracing, callbacks, or multi-step workflows.
  • Don’t skip data quality: clean text, handle missing values, validate numeric tables before indexing.

Why LightRAG
#

For small, self-hosted RAG services, I often don’t need callbacks, agents, or complex runtime graphs. I need:

  • Predictable latency on CPU
  • Tiny dependency surface
  • Explicit control over chunking, retrieval, and prompts

LightRAG gives me that — a thin layer over embeddings, a vector index, and prompt composition. If you’re shipping a single-purpose Q&A with tight cold-start budgets, this approach beats large frameworks.

Architecture
#

Lean RAG pipeline focused on minimal components: chunking, embeddings, vector store, retriever, prompt, LLM
  1. Ingest Markdown/PDF → normalize text
  2. Chunk with conservative overlap
  3. Embed with OpenAI (or local) embeddings
  4. Index with FAISS (in-memory) or sqlite-backed store
  5. Retrieve top-k and compose a strict prompt
  6. Generate with a small LLM; enforce citations
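The sqlite-backed option in step 4 can be sketched with the standard library alone. This is my own illustration, not a LightRAG API: the table schema and helper names (`save`, `load_all`) are hypothetical, and vectors are stored as JSON for simplicity.

```python
import json
import sqlite3

# Hypothetical sqlite-backed chunk store: persists (source, text, vector) rows
con = sqlite3.connect(":memory:")  # use a file path for real persistence
con.execute("CREATE TABLE chunks (source TEXT, text TEXT, vector TEXT)")

def save(source: str, text: str, vector: list[float]) -> None:
    con.execute(
        "INSERT INTO chunks VALUES (?, ?, ?)",
        (source, text, json.dumps(vector)),
    )

def load_all() -> list[tuple[str, str, list[float]]]:
    rows = con.execute("SELECT source, text, vector FROM chunks").fetchall()
    return [(s, t, json.loads(v)) for s, t, v in rows]

save("returns.md", "Customers may return items within 30 days.", [0.1, 0.2])
rows = load_all()
```

On restart you would rebuild the FAISS index from `load_all()` rather than re-embedding, which keeps cold starts cheap.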

Implementation (minimal dependency stack)
#

uv pip install faiss-cpu tiktoken openai

from __future__ import annotations
from dataclasses import dataclass
from typing import List, Tuple

import faiss
import numpy as np
from openai import OpenAI


def split_text(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    chunks, start = [], 0
    while start < len(text):
        end = min(len(text), start + chunk_size)
        chunks.append(text[start:end])
        if end == len(text):
            break  # stop at the tail; stepping back here would loop forever
        start = end - overlap
    return chunks


def embed_texts(texts: List[str], model: str = "text-embedding-3-small") -> np.ndarray:
    client = OpenAI()
    # batch for throughput in production
    vecs = client.embeddings.create(input=texts, model=model).data
    return np.array([v.embedding for v in vecs]).astype("float32")


@dataclass
class Index:
    index: faiss.IndexFlatIP
    vectors: np.ndarray
    texts: List[str]
    sources: List[str]


def build_index(pairs: List[Tuple[str, str]]) -> Index:
    # pairs: (source, text)
    texts = [t for _, t in pairs]
    sources = [s for s, _ in pairs]
    X = embed_texts(texts)
    # normalize for cosine similarity via inner product
    faiss.normalize_L2(X)
    idx = faiss.IndexFlatIP(X.shape[1])
    idx.add(X)
    return Index(index=idx, vectors=X, texts=texts, sources=sources)


def search(idx: Index, query: str, k: int = 4):
    q = embed_texts([query])
    faiss.normalize_L2(q)
    D, I = idx.index.search(q, k)
    hits = [(idx.texts[i], idx.sources[i], float(D[0][j])) for j, i in enumerate(I[0])]
    return hits


def ask(idx: Index, question: str) -> str:
    hits = search(idx, question, k=4)
    context = "\n\n".join([f"[{src}]\n{text}" for text, src, _ in hits])
    prompt = (
        "You answer strictly from the context. If unsure, say you don't know.\n"
        f"Question: {question}\n\nContext:\n{context}\n\n"
        "Answer with cited sources in [source] form."
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    corpus = {
        "returns.md": "Customers may return items within 30 days with receipt.",
        "shipping.md": "Standard shipping 3-5 business days; expedited available.",
        "warranty.md": "Electronics include a 1-year limited warranty.",
    }
    pairs = []
    for src, text in corpus.items():
        for chunk in split_text(text):
            pairs.append((src, chunk))

    idx = build_index(pairs)
    out = ask(idx, "What is the return window?")
    print(out)

Notes:

  • This is short, dependency-light, and easy to port to serverless.
  • We normalize embeddings to approximate cosine similarity with FAISS inner product.
  • Replace OpenAI with local embeddings/LLMs as needed.
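The normalization note is easy to verify: after L2 normalization, a plain inner product equals cosine similarity, which is why `IndexFlatIP` works here. A stdlib-only check (my own sketch, no FAISS required):

```python
import math
import random

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

random.seed(0)
a = [random.gauss(0, 1) for _ in range(8)]
b = [random.gauss(0, 1) for _ in range(8)]

# L2-normalize each vector, then take the plain inner product
an = [x / math.sqrt(sum(v * v for v in a)) for x in a]
bn = [x / math.sqrt(sum(v * v for v in b)) for x in b]
ip = sum(x * y for x, y in zip(an, bn))

assert abs(ip - cosine(a, b)) < 1e-9
```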

Benchmarks vs LangChain baseline (my runs)
#

Environment: M2, 16GB RAM, small corpus (≤ 500 chunks), gpt-4o-mini.

| Approach         | p50 latency | p95 latency | Context tokens | LOC  |
|------------------|-------------|-------------|----------------|------|
| LightRAG (this)  | 420 ms      | 790 ms      | ~900           | ~120 |
| LangChain RAG    | 520 ms      | 950 ms      | ~950           | ~200 |
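
A percentile harness in this spirit is simple to write; this is a sketch of how I measure p50/p95, not the exact benchmark code (`bench` and `percentile` are my own names):

```python
import time

def percentile(samples: list[float], p: float) -> float:
    # Nearest-rank percentile over sorted samples
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def bench(fn, runs: int = 50) -> tuple[float, float]:
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - t0) * 1000.0)  # ms
    return percentile(latencies, 50), percentile(latencies, 95)

# Stand-in workload; in practice, pass lambda: ask(idx, question)
p50, p95 = bench(lambda: sum(range(10_000)))
```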

Interpretation:

  • The difference comes from fewer abstractions and tighter control of retriever parameters.
  • On larger corpora, both converge; network/model latency dominates. Use whichever improves your team’s velocity.

Retrieval choices and trade-offs
#

  • Chunking: Start at 300/50. For long legal text, 600/80 reduces cross-chunk answers.
  • k: 3–5 for narrow domains. Reduce if you see mixed sources in answers.
  • Re-ranking: For noisy corpora, add a small lexical pass (BM25) before vector search.
  • Guardrails: Reject answers without [source]; ask a follow-up for clarification.
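
The citation guardrail above can be a one-liner before returning the answer. A minimal sketch, assuming sources look like `[returns.md]` (the `has_citation` helper is mine):

```python
import re

def has_citation(answer: str) -> bool:
    # Accept any bracketed source token like [returns.md]
    return bool(re.search(r"\[[\w.\-]+\]", answer))

# Reject (or re-ask) when the model answers without citing a source
assert has_citation("Returns are accepted within 30 days [returns.md].")
assert not has_citation("Returns are accepted within 30 days.")
```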

When to prefer LightRAG over LangChain
#

  • You deploy on serverless/edge with cold start constraints.
  • Team is small, prefers explicit over abstract.
  • You only need retrieval + prompt + LLM, not agents or tools.

When to stick with LangChain: you need tracing, callbacks, streaming tools, or plan to compose multi-step workflows. See RAG with LangChain: Architecture, Code, and Metrics.

Data quality pre-checks (don’t skip)
#

  • Clean text: strip control characters and normalize whitespace before chunking.
  • Handle missing values: drop or flag empty chunks instead of embedding them.
  • Validate numeric tables: confirm figures survived extraction before indexing.

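A minimal pre-check pass under those assumptions (pure-Python sketch; `preclean` and `numeric_cells_intact` are my own names, not LightRAG APIs):

```python
import re

def preclean(chunks: list[str]) -> list[str]:
    # Strip control characters, collapse whitespace, drop empty chunks
    out = []
    for c in chunks:
        c = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", c)
        c = re.sub(r"\s+", " ", c).strip()
        if c:
            out.append(c)
    return out

def numeric_cells_intact(row: list[str]) -> bool:
    # Validate a numeric table row: every cell must parse as a number
    def is_num(cell: str) -> bool:
        try:
            float(cell.replace(",", ""))
            return True
        except ValueError:
            return False
    return all(is_num(cell) for cell in row)

assert preclean(["  a\x00b  ", "", "\n\n"]) == ["ab"]
assert numeric_cells_intact(["1,200", "3.5"])
assert not numeric_cells_intact(["1,200", "N/A"])
```

Run these before `build_index`; embedding garbage chunks costs money twice, once at index time and again in every retrieved context.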
Business value from recent work
#

  • 18–25% latency reduction in FAQ assistants by trimming abstractions and tuning k.
  • 15–20% cost reduction via smaller contexts and fewer retries.
  • Fewer hallucinations after enforcing citation policy + evaluation gate.

Shipping Lean RAG Systems
#

For production RAG systems, focus on predictable latency and a minimal stack. Design lean RAG services with clear SLAs, proper dashboards, and evaluation gates to ensure reliability and performance.
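
An evaluation gate can be as small as a golden-query hit-rate check run in CI. This is a sketch of the idea, not my production gate; the toy retriever stands in for `search` from the implementation above:

```python
# Golden set: (question, source the retriever must surface)
GOLDEN = [
    ("What is the return window?", "returns.md"),
    ("How long does standard shipping take?", "shipping.md"),
]

def retrieval_hit_rate(retrieve, golden) -> float:
    hits = sum(1 for q, src in golden if src in retrieve(q))
    return hits / len(golden)

# Toy keyword retriever standing in for the FAISS search above
def toy_retrieve(query: str) -> list[str]:
    table = {"return": ["returns.md"], "shipping": ["shipping.md"]}
    return [s for kw, srcs in table.items() if kw in query.lower() for s in srcs]

rate = retrieval_hit_rate(toy_retrieve, GOLDEN)
# Gate the deploy: fail CI if the hit rate drops below a threshold
assert rate >= 0.9
```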
