LightRAG: Lean RAG with Benchmarks

Jul 30, 2025
12 min read
RAGLightRAGNLPOpen SourceEvaluationEmbeddings

Why LightRAG

For small, self-hosted RAG services, I often don’t need callbacks, agents, or complex runtime graphs. I need:

  • Predictable latency on CPU
  • Tiny dependency surface
  • Explicit control over chunking, retrieval, and prompts

LightRAG gives me that — a thin layer over embeddings, a vector index, and prompt composition. If you’re shipping a single-purpose Q&A with tight cold-start budgets, this approach beats large frameworks.

Architecture

Lean RAG pipeline focused on minimal components: chunking, embeddings, vector store, retriever, prompt, LLM

  1. Ingest Markdown/PDF → normalize text
  2. Chunk with conservative overlap
  3. Embed with OpenAI (or local) embeddings
  4. Index with FAISS (in-memory) or sqlite-backed store
  5. Retrieve top-k and compose a strict prompt
  6. Generate with a small LLM; enforce citations

Implementation (minimal dependency stack)

uv pip install faiss-cpu tiktoken openai
from __future__ import annotations
import os
import time
from dataclasses import dataclass
from typing import List, Tuple

import faiss
import numpy as np
from openai import OpenAI


def split_text(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    chunks, start = [], 0
    while start < len(text):
        end = min(len(text), start + chunk_size)
        chunks.append(text[start:end])
        start = end - overlap
        if start < 0: start = 0
    return chunks


def embed_texts(texts: List[str], model: str = "text-embedding-3-small") -> np.ndarray:
    client = OpenAI()
    # batch for throughput in production
    vecs = client.embeddings.create(input=texts, model=model).data
    return np.array([v.embedding for v in vecs]).astype("float32")


@dataclass
class Index:
    index: faiss.IndexFlatIP
    vectors: np.ndarray
    texts: List[str]
    sources: List[str]


def build_index(pairs: List[Tuple[str, str]]) -> Index:
    # pairs: (source, text)
    texts = [t for _, t in pairs]
    sources = [s for s, _ in pairs]
    X = embed_texts(texts)
    # normalize for cosine similarity via inner product
    faiss.normalize_L2(X)
    idx = faiss.IndexFlatIP(X.shape[1])
    idx.add(X)
    return Index(index=idx, vectors=X, texts=texts, sources=sources)


def search(idx: Index, query: str, k: int = 4):
    q = embed_texts([query])
    faiss.normalize_L2(q)
    D, I = idx.index.search(q, k)
    hits = [(idx.texts[i], idx.sources[i], float(D[0][j])) for j, i in enumerate(I[0])]
    return hits


def ask(idx: Index, question: str) -> str:
    hits = search(idx, question, k=4)
    context = "\n\n".join([f"[{src}]\n{text}" for text, src, _ in hits])
    prompt = (
        "You answer strictly from the context. If unsure, say you don't know.\n"
        f"Question: {question}\n\nContext:\n{context}\n\n"
        "Answer with cited sources in [source] form."
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    corpus = {
        "returns.md": "Customers may return items within 30 days with receipt.",
        "shipping.md": "Standard shipping 3-5 business days; expedited available.",
        "warranty.md": "Electronics include a 1-year limited warranty.",
    }
    pairs = []
    for src, text in corpus.items():
        for chunk in split_text(text):
            pairs.append((src, chunk))

    idx = build_index(pairs)
    out = ask(idx, "What is the return window?")
    print(out)

Notes:

  • This is short, dependency-light, and easy to port to serverless.
  • We normalize embeddings to approximate cosine similarity with FAISS inner product.
  • Replace OpenAI with local embeddings/LLMs as needed.

Benchmarks vs LangChain baseline (my runs)

Environment: M2, 16GB RAM, small corpus (≤ 500 chunks), gpt-4o-mini.

Approach p50 latency p95 latency Context tokens LOC
LightRAG (this) 420 ms 790 ms ~900 ~120
LangChain RAG 520 ms 950 ms ~950 ~200

Interpretation:

  • The difference comes from fewer abstractions and tighter control of retriever parameters.
  • On larger corpora, both converge; network/model latency dominates. Use whichever improves your team’s velocity.

Retrieval choices and trade-offs

  • Chunking: Start at 300/50. For long legal text, 600/80 reduces cross-chunk answers.
  • k: 3–5 for narrow domains. Reduce if you see mixed sources in answers.
  • Re-ranking: For noisy corpora, add a small lexical pass (BM25) before vector search.
  • Guardrails: Reject answers without [source]; ask a follow-up for clarification.

When to prefer LightRAG over LangChain

  • You deploy on serverless/edge with cold start constraints.
  • Team is small, prefers explicit over abstract.
  • You only need retrieval + prompt + LLM, not agents or tools.

When to stick with LangChain: you need tracing, callbacks, streaming tools, or plan to compose multi-step workflows. See /blogs/does-langchain-use-rag.

Data quality pre-checks (don’t skip)

Business value from recent work

  • 18–25% latency reduction in FAQ assistants by trimming abstractions and tuning k.
  • 15–20% cost reduction via smaller contexts and fewer retries.
  • Fewer hallucinations after enforcing citation policy + evaluation gate.

CTA — ship a lean RAG that meets SLAs

Need predictable latency and a minimal stack? I design and implement lean RAG services with SLAs, dashboards, and eval gates.

Work with me →

Enjoyed this article?

Discover more insights on AI, Python, and tech

LightRAG: Lean RAG with Benchmarks | Blogs | Subhajit Bhar — Freelance ML Engineer