
FAISS Index Types for Production RAG


IndexFlatIP works for small corpora. For production with 100K+ vectors, you need smarter indexes. Here’s how to choose and implement them.

FAISS Index Types Overview

Index           Corpus Size   Memory   Accuracy   Build Time
IndexFlatIP     < 50K         High     Exact      Fast
IndexIVFFlat    50K - 1M      Medium   ~95-99%    Medium
IndexHNSWFlat   50K - 10M     High     ~95-99%    Slow
IndexIVFPQ      1M+           Low      ~90-95%    Slow

IndexFlatIP (Baseline)

Exact search, no training required. Use for prototypes and small corpora.

import faiss
import numpy as np

def build_flat_index(vectors: np.ndarray) -> faiss.IndexFlatIP:
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

Pros: Exact results, simple
Cons: O(n) search time, doesn’t scale

IndexIVFFlat (Clustered Search)

Partitions vectors into clusters. Searches only nearby clusters for speed.

def build_ivf_index(
    vectors: np.ndarray, 
    nlist: int = 100,
    nprobe: int = 10
) -> faiss.IndexIVFFlat:
    """
    Args:
        vectors: Normalized embedding vectors
        nlist: Number of clusters (sqrt(n) is a good start)
        nprobe: Clusters to search at query time
    """
    faiss.normalize_L2(vectors)
    dim = vectors.shape[1]
    
    # Quantizer for cluster centroids
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    
    # Must train before adding
    index.train(vectors)
    index.add(vectors)
    
    # Set search-time parameter
    index.nprobe = nprobe
    return index

Tuning:

  • nlist: Start with sqrt(n). More clusters = faster search, lower recall
  • nprobe: Start with nlist / 10. Increase for better recall

IndexHNSWFlat (Graph-Based)

Hierarchical Navigable Small World graph. Excellent recall with fast search.

def build_hnsw_index(
    vectors: np.ndarray,
    M: int = 32,
    ef_construction: int = 200,
    ef_search: int = 64
) -> faiss.IndexHNSWFlat:
    """
    Args:
        vectors: Normalized embedding vectors
        M: Connections per node (higher = more accurate, more memory)
        ef_construction: Build-time search depth
        ef_search: Query-time search depth
    """
    faiss.normalize_L2(vectors)
    dim = vectors.shape[1]
    
    index = faiss.IndexHNSWFlat(dim, M, faiss.METRIC_INNER_PRODUCT)
    index.hnsw.efConstruction = ef_construction
    index.hnsw.efSearch = ef_search
    index.add(vectors)
    return index

Trade-offs:

  • Higher M = better recall, 4-8x more memory
  • HNSW doesn’t support removal; rebuild for updates

Persistence

Save and load indexes for production:

def save_index(index: faiss.Index, path: str):
    faiss.write_index(index, path)

def load_index(path: str) -> faiss.Index:
    return faiss.read_index(path)

# Usage
save_index(index, "vectors.index")
index = load_index("vectors.index")

For IVF indexes, you can also memory-map for reduced RAM:

index = faiss.read_index("vectors.index", faiss.IO_FLAG_MMAP)

Choosing an Index

Corpus < 50K vectors?
  └─> IndexFlatIP (exact, simple)

Corpus 50K - 500K?
  └─> IndexIVFFlat (nlist=sqrt(n), nprobe=10-20)

Corpus 500K - 5M?
  └─> IndexHNSWFlat (M=32, ef=64-128)

Corpus > 5M or memory constrained?
  └─> IndexIVFPQ (compressed, ~90% recall)

Production Checklist

  • Benchmark recall on held-out queries before deploying
  • Set nprobe / efSearch based on latency budget
  • Use faiss.write_index for persistence
  • Monitor recall and p99 latency together; raising nprobe / efSearch recovers recall at a latency cost
  • For updates: IVF supports add(), HNSW requires rebuild
