Subhajit

Intelligent Document Processing — Guides and Code

4 March 2026·87 words·1 min

Intelligent Document Processing (IDP) is the discipline of extracting structured, decision-ready data from unstructured documents — invoices, lab reports, contracts, purchase orders — automatically and reliably. This cluster covers production IDP engineering: understanding what IDP actually is, choosing between platforms and custom pipelines, handling the edge cases that break every generic solution, and building systems that stay reliable as document volume and layout variation grow.

Extracting Tables from PDFs in Python: The Complete Guide

4 March 2026·1073 words·6 mins

PDF Extraction

Extracting tables from PDFs is one of the most common requirements in document automation and one of the most reliable ways to introduce subtle errors if you do it carelessly. This guide covers table extraction with pdfplumber — the most capable Python library for this — including how it works, when it works, and what to do when it doesn’t.
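As a rough illustration of where the subtle errors creep in, here is a minimal sketch. It assumes a table arrives the way pdfplumber's `extract_table()` returns it: a list of rows whose cells are strings or `None`. The helper names `table_to_records` and `extract_first_table` are hypothetical, and reading only the first page is an assumption made for brevity.

```python
from typing import Optional

Row = list[Optional[str]]


def table_to_records(table: list[Row]) -> list[dict[str, str]]:
    """Turn a header row plus data rows into a list of dicts."""
    if not table:
        return []
    # Cells can be None (empty) and may carry stray whitespace or line breaks.
    clean = [[" ".join((cell or "").split()) for cell in row] for row in table]
    header, *rows = clean
    records = []
    for row in rows:
        if not any(row):  # skip rows that are entirely empty
            continue
        # zip() silently truncates if a row is longer or shorter than the header.
        records.append(dict(zip(header, row)))
    return records


def extract_first_table(pdf_path: str) -> list[dict[str, str]]:
    """Read the largest table on page 1 of a PDF (requires pdfplumber)."""
    import pdfplumber  # third-party; imported lazily so the helper above has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        table = pdf.pages[0].extract_table()  # None when no table is detected
    return table_to_records(table or [])
```

The cleaning step is kept separate from the PDF read so it can be checked against hand-written rows without needing a sample file.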

Azure Document Intelligence Alternatives

4 March 2026·1300 words·7 mins

Intelligent Document Processing

Azure Document Intelligence (formerly Form Recognizer) is Microsoft’s managed IDP service. It handles invoices, receipts, purchase orders, and ID documents well — out of the box, with no custom training required for standard formats. For many use cases, it’s a reasonable starting point. For many production workflows, it’s not enough.

FAISS Index Types for Production RAG

29 January 2026·420 words·2 mins

LLM Engineering

IndexFlatIP works for small corpora. For production with 100K+ vectors, you need smarter indexes. Here’s how to choose and implement them.

FAISS Index Types Overview

Index          Corpus Size  Memory  Accuracy  Build Time
IndexFlatIP    < 50K        High    Exact     Fast
IndexIVFFlat   50K - 1M     Medium  ~95-99%   Medium
IndexHNSWFlat  50K - 10M    High    ~95-99%   Slow
IndexIVFPQ     1M+          Low     ~90-95%   Slow

IndexFlatIP (Baseline)

Exact search, no training required. Use for prototypes and small corpora.
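To make "exact search, no training" concrete, here is a plain-Python sketch of what a flat inner-product index computes: score every stored vector against the query and keep the top k. This is not FAISS code. The names `inner_product` and `flat_ip_search` are illustrative only, and a real index would work on contiguous float32 arrays rather than Python lists.

```python
import heapq


def inner_product(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(
    corpus: list[list[float]], query: list[float], k: int
) -> list[tuple[float, int]]:
    """Return the k (score, index) pairs with the highest inner product."""
    # Every vector is scored, so the result is exact and nothing needs training.
    scored = ((inner_product(vec, query), i) for i, vec in enumerate(corpus))
    return heapq.nlargest(k, scored)


corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = flat_ip_search(corpus, query=[1.0, 0.2], k=2)
print([i for _, i in top])
```

Because each query touches every stored vector, the cost grows linearly with corpus size, which is the reason this baseline is suggested only for prototypes and small corpora.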

Outliers Detection in Python

30 September 2025·14 words·1 min

This post has moved. Read the updated guide: Detect and Remove Outliers in Python.

Detect and Remove Outliers in Python: IQR and Z-Score

30 September 2025·1925 words·10 mins

Data Science

Outliers can significantly skew statistical analysis and machine learning model performance. This guide covers every practical method to detect, visualize, and handle outliers in Python — from IQR and Z-Score to Isolation Forest — with runnable code at each step.
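The two methods named in the title can be sketched with the standard library alone. This is a minimal illustration, not the guide's own code: the function names are hypothetical, and the 1.5 × IQR fences and the Z-score threshold of 3 are the conventional defaults, assumed here.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values: list[float], threshold: float = 3.0) -> list[float]:
    """Values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]


data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(data))
print(zscore_outliers(data))
print(zscore_outliers(data, threshold=2.5))
```

On this small sample the IQR rule flags 95, while the Z-score rule at the default threshold flags nothing: a single extreme value inflates the standard deviation enough to keep its own Z-score below 3. Lowering the threshold to 2.5 catches it.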

System Architecture — A Comprehensive, Practical Guide

24 September 2025·1562 words·8 mins

Software Engineering

Designing and evolving system architecture is about making informed trade‑offs. This guide provides a practical, opinionated walkthrough of the core concepts, patterns, and decisions you need to build scalable, reliable, and cost‑efficient systems—plus answers to the most common questions engineers and architects ask.

Ruff: Modern Python Linter & Formatter Walkthrough

24 September 2025·1016 words·5 mins

Software Engineering

TL;DR

Ruff replaces Flake8, Black, isort, and pydocstyle: one tool, 10–100× faster (written in Rust).
Install: uv add --dev ruff or pip install ruff.
Run: ruff check . (lint) and ruff format . (format).
Add pre-commit hooks + GitHub Actions to enforce on every commit and PR.
Pair with the Python CI Pipeline guide for the full uv + Ruff + ty setup.

Writing clean, readable code is essential for collaboration and maintainability. Linters and formatters help us keep our codebase consistent and easy to understand.

RAG for Knowledge-Intensive Tasks

24 September 2025·842 words·4 mins

LLM Engineering

Picture this: You’re asking an AI about cancer treatments. It sounds super confident and gives you detailed answers. But here’s the problem: it just made up a medical study that doesn’t exist. That’s not just embarrassing. When we’re talking about healthcare, finance, or legal advice, these AI “hallucinations” can be downright dangerous.

TL;DR

RAG fixes LLM hallucinations by grounding answers in retrieved documents.
Pipeline: chunk documents → embed → store in vector index → retrieve at query time → generate.
Use RAG for knowledge-intensive tasks (legal, medical, finance) where accuracy is non-negotiable.
Evaluate with RAGAS or custom metrics: faithfulness, answer relevancy, context recall.
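The pipeline in the TL;DR (chunk, embed, store, retrieve, generate) can be sketched end to end in plain Python. Everything here is a stand-in chosen for illustration: `embed` counts words instead of calling an embedding model, the "vector index" is a list, and the generate step stops at building the grounded prompt that would be sent to an LLM. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(documents: list[str], size: int = 12) -> list[tuple[str, Counter]]:
    """Chunk every document and store (chunk text, embedding) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc, size)]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(item[1], q), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, contexts: list[str]) -> str:
    """Ground the question in retrieved text; an LLM call would consume this."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


docs = [
    "Aspirin reduces fever and mild pain.",
    "FAISS stores vectors for similarity search.",
]
index = build_index(docs)
question = "What does aspirin reduce?"
print(build_prompt(question, retrieve(index, question, k=1)))
```

Swapping the toy `embed` for a real embedding model and the list for a vector index changes the quality of retrieval, not the shape of the pipeline.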

Python Code Quality CI Pipeline with uv and Ruff

24 September 2025·928 words·5 mins

Software Engineering

You can create a Python code quality CI pipeline using uv, Ruff, and ty within 5 minutes.

TL;DR

Replace pip + requirements.txt with uv for fast, reproducible installs.
Replace Flake8 + Black + isort with ruff: one tool, 10–100× faster.
Add ty for type checking (Astral’s faster mypy replacement).
Total CI time: ~30s. GitHub Actions config fits in 20 lines.

Most of us begin a Python project with high hopes. We set up a clean virtual environment, organize a requirements file, and plan to add a linter, then forget.
