About

I’m Subhajit — a freelance engineer specialising in Intelligent Document Processing.

If someone on your team is manually reading PDFs and typing data into a spreadsheet, that’s the problem I solve. Invoices, lab reports, contracts, purchase orders — I build the pipelines that handle them automatically, reliably, and without babysitting.


What that actually means

Most businesses hit the same wall eventually: documents arrive — by email, by upload, sometimes still by post — and a person has to read them, figure out what matters, and enter the data somewhere. It works fine until it doesn’t. Until the volume grows, the errors pile up, or the one person who knows the process leaves.

The fix isn’t just a script. It’s a system that handles the messiness of real documents — layouts that change, fields that move around, edge cases that break simple rules — and keeps working without someone constantly tweaking it.

That’s what I build.


How I work

Most document automation projects fail for the same reason: people reach for the most powerful tool first. They throw a large language model at a problem that a well-designed schema could solve in milliseconds, then wonder why the outputs are all over the place.

I work the other way around.

Every project starts with a Pydantic schema — a precise definition of what the output should look like before I write a single line of extraction code. From there I establish a regex baseline: deterministic extraction that shows me exactly where the hard cases are. Only then do I bring in an LLM — for the specific fields where layout variation or ambiguous language makes rules too brittle.

The result is a pipeline that’s fast where it can be, intelligent where it needs to be, and auditable throughout.
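
To make the schema-first idea concrete, here is a minimal sketch rather than code from any client project. The `Invoice` fields, the regex patterns, and the `extract` helper are all hypothetical, chosen only to show how validating a regex baseline against a Pydantic schema reveals which fields are the hard cases.

```python
import re
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    """Hypothetical target schema, written before any extraction code."""
    invoice_number: str
    issue_date: date
    total: Decimal


# Deterministic baseline: one illustrative pattern per schema field.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.I),
    "issue_date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([0-9]+(?:\.[0-9]{2})?)", re.I),
}


def regex_baseline(text: str) -> dict:
    """Return whatever the patterns find; unmatched fields are simply absent."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1)
    return found


def extract(text: str) -> tuple[Optional[Invoice], list[str]]:
    """Validate the baseline output and list the fields it could not satisfy."""
    fields = regex_baseline(text)
    try:
        return Invoice(**fields), []
    except ValidationError as exc:
        # These are the fields a later LLM step would be asked to fill.
        hard_fields = sorted({str(err["loc"][0]) for err in exc.errors()})
        return None, hard_fields
```

On a clean document, `extract` returns a validated `Invoice` and an empty list. On a layout the patterns do not cover, it returns `None` plus the names of the failing fields, which is the short list worth sending to an LLM.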


What you get

A finished pipeline isn’t just an extraction script. It includes:

  • Schema validation and field-level confidence scoring
  • Retry logic and graceful failure handling
  • A FastAPI layer so your existing tools can talk to it
  • A web interface your team can actually use without any technical setup

You get a complete, production-ready system — not a model, not a proof of concept.
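
As a rough illustration of two items on that list, retry logic with graceful failure and field-level confidence, the sketch below uses only the standard library. `FieldResult`, `extract_with_retry`, `needs_review`, and the 0.8 threshold are assumptions made for the example, not a description of any delivered system.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FieldResult:
    value: Optional[str]
    confidence: float  # hypothetical scale from 0.0 to 1.0
    error: Optional[str] = None


def extract_with_retry(
    extractor: Callable[[], tuple[str, float]],
    attempts: int = 3,
    base_delay: float = 0.0,
) -> FieldResult:
    """Call a flaky extractor up to `attempts` times with exponential backoff."""
    last_error = "no attempts made"
    for attempt in range(attempts):
        try:
            value, confidence = extractor()
            return FieldResult(value=value, confidence=confidence)
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    # Graceful failure: report an empty, zero-confidence field instead of raising.
    return FieldResult(value=None, confidence=0.0, error=last_error)


def needs_review(result: FieldResult, threshold: float = 0.8) -> bool:
    """Flag fields that are missing or below the confidence threshold."""
    return result.value is None or result.confidence < threshold
```

Returning a `FieldResult` rather than raising means one bad field does not abort the whole document, and anything `needs_review` flags can be routed to a person.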


Where this comes from

My approach comes from two years building a live document processing system for a water consultancy — a three-phase pipeline handling real lab reports in production across ten or more layout variations (and their inevitable drifts). I built it from a simple CLI script into a full web application with LLM-augmented extraction. That system is now central to their daily operations.

That experience taught me what breaks at scale, what makes clients actually trust a system, and why the gap between a working demo and a working product is bigger than most people think.


Work with me

The fastest way to start is a Diagnostic Session — a fixed-price engagement where you share your documents and tell me what’s failing or what you need. I review your current approach, identify the exact failure points, and deliver a written action plan with implementation steps within 3 days.

No long commitment. No ambiguity. Just a clear picture of what needs to happen and how.

If you want to move forward after that, we scope the full pipeline together. If you don’t, you still walk away with a concrete plan you can hand to anyone.

Book a Diagnostic Session →