Document AI & Provenance Systems Engineer
PDF triage and extraction pipeline that detects document origin, layout, and domain, escalates extraction strategies by confidence, builds PageIndex trees, and answers with provenance chains.
"Converted messy PDFs into auditable structured facts with confidence gates, retrieval indexes, and source-cited answers."

Enterprise PDFs mix native text, scans, tables, forms, legal language, and financial facts, which makes naive OCR or single-pass extraction brittle and hard to audit.
Built a triage-first pipeline that combines FastText, layout, and vision extraction strategies with chunk validation, PageIndex navigation, Chroma vector search, FactTable extraction, and claim verification.
Document Intelligence Refinery starts by profiling PDFs, then routes each document through the cheapest extraction strategy likely to work. Low-confidence outputs escalate from fast text to layout-aware parsing and finally to vision extraction.
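The escalation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Extraction` record, the `extract_with_escalation` helper, the three stub extractors, their confidence scores, and the 0.8 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Extraction:
    strategy: str      # name of the strategy that produced this result
    text: str          # extracted content
    confidence: float  # score in [0.0, 1.0] reported by the strategy


# Hypothetical extractor signature: document path in, Extraction out.
Extractor = Callable[[str], Extraction]


def extract_with_escalation(
    path: str,
    ladder: List[Extractor],
    threshold: float = 0.8,  # assumed confidence gate, not a project value
) -> Tuple[Extraction, List[Extraction]]:
    """Run extractors from cheapest to most expensive.

    Stops at the first result whose confidence clears the threshold.
    Returns the chosen result plus every attempt, so the escalation
    path itself can be audited.
    """
    if not ladder:
        raise ValueError("ladder must contain at least one extractor")

    attempts: List[Extraction] = []
    for extractor in ladder:
        result = extractor(path)
        attempts.append(result)
        if result.confidence >= threshold:
            return result, attempts

    # Nothing cleared the gate: fall back to the best attempt seen.
    best = max(attempts, key=lambda r: r.confidence)
    return best, attempts


# Stub strategies with made-up scores, standing in for real extractors.
def fast_text(path: str) -> Extraction:
    return Extraction("fast_text", "", 0.35)


def layout(path: str) -> Extraction:
    return Extraction("layout", "Revenue: 4.2M", 0.86)


def vision(path: str) -> Extraction:
    return Extraction("vision", "Revenue: 4.2M", 0.95)


result, attempts = extract_with_escalation(
    "report.pdf", [fast_text, layout, vision]
)
print(result.strategy, [a.strategy for a in attempts])
# prints: layout ['fast_text', 'layout']
```

In this sketch the vision strategy is never invoked because the layout result already clears the gate, which is the cost-saving behavior the routing aims for. Keeping the full list of attempts is one way to make the escalation decision itself traceable.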
Case studies in similar engineering domains.
Event-sourced lending pipeline for document intake, extraction, credit analysis, fraud, compliance, and decision orchestration over an append-only ledger.
Multi-agent codebase cartography tool that analyzes local or GitHub repositories with Surveyor and Hydrologist agents to produce module graphs and data lineage artifacts.
Schema integrity and lineage attribution system that turns inter-system dependencies into formal contracts, detects schema/type/statistical drift, and reports downstream blast radius.