Natnael Alemseged
© 2026 Natnael Alemseged. All Rights Reserved.

Document Intelligence Refinery

Document AI & Provenance Systems Engineer

PDF triage and extraction pipeline that detects document origin, layout, and domain, escalates extraction strategies by confidence, builds PageIndex trees, and answers with provenance chains.

"Converted messy PDFs into auditable structured facts with confidence gates, retrieval indexes, and source-cited answers."
Document Intelligence Refinery confidence-gated PDF extraction pipeline
Triage, multi-strategy extraction, PageIndex retrieval, FactTable storage, and provenance-backed answers

Problem

Enterprise PDFs mix native text, scans, tables, forms, legal language, and financial facts, making naive OCR or single-pass extraction brittle and hard to audit.

Solution

Built a triage-first pipeline with fast-text, layout-aware, and vision extraction strategies, plus chunk validation, PageIndex navigation, Chroma vector search, FactTable extraction, and claim verification.

Deep Dive

What It Refines

Document Intelligence Refinery starts by profiling PDFs, then routes each document through the cheapest extraction strategy likely to work. Low-confidence outputs escalate from fast text to layout-aware parsing and finally to vision extraction.
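The escalation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Extraction` record, the `extract_with_escalation` function, the stub strategies, and the 0.8 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Extraction:
    strategy: str      # which strategy produced this result
    text: str          # extracted content
    confidence: float  # 0.0 to 1.0, scored by the strategy (assumed scale)


# A strategy takes a PDF path and returns an Extraction.
Strategy = Callable[[str], Extraction]


def extract_with_escalation(
    pdf_path: str,
    strategies: Sequence[Strategy],
    threshold: float = 0.8,  # hypothetical confidence gate
) -> Extraction:
    """Try strategies cheapest-first and stop at the first confident result."""
    if not strategies:
        raise ValueError("at least one strategy is required")
    best: Optional[Extraction] = None
    for strategy in strategies:
        result = strategy(pdf_path)
        if result.confidence >= threshold:
            return result  # gate passed, so costlier strategies never run
        if best is None or result.confidence > best.confidence:
            best = result
    # Nothing cleared the gate: fall back to the highest-confidence attempt.
    return best


# Stub strategies standing in for real extractors (illustrative values only).
def fast_text(path: str) -> Extraction:
    return Extraction("fast_text", "", 0.35)


def layout_aware(path: str) -> Extraction:
    return Extraction("layout", "Revenue: 4.2M", 0.86)


def vision(path: str) -> Extraction:
    return Extraction("vision", "Revenue: 4.2M", 0.95)


result = extract_with_escalation("report.pdf", [fast_text, layout_aware, vision])
print(result.strategy, result.confidence)
```

Ordering the strategies from cheapest to most expensive means the vision step only runs when both earlier passes fall below the gate, which is where most of the cost saving in a design like this would come from.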

Engineering Highlights

  • PageIndex retrieval: hierarchical page trees improve navigation over naive vector search.
  • Provenance chains: answers include document name, page number, bounding box, and content hash.
  • Audit mode: claims can be verified, marked not found, or declared unverifiable with cited evidence.
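As a rough illustration of the provenance and audit highlights, the sketch below models a provenance record carrying the four listed fields and an audit function returning one of the three outcomes. Everything here is an assumption made for the example: the `Provenance`, `Chunk`, and `Verdict` names, the use of SHA-256 for the content hash, the substring match standing in for real claim verification, and the rule that a claim is unverifiable when there is no evidence to check against. The project lists Pydantic in its stack; plain dataclasses are used here only to keep the sketch dependency-free.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # x0, y0, x1, y1 (assumed layout)


@dataclass(frozen=True)
class Provenance:
    document: str
    page: int
    bbox: BBox
    content_hash: str


def make_provenance(document: str, page: int, bbox: BBox, content: str) -> Provenance:
    # SHA-256 is an assumption; the source only says "content hash".
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return Provenance(document, page, bbox, digest)


class Verdict(Enum):
    VERIFIED = "verified"
    NOT_FOUND = "not_found"
    UNVERIFIABLE = "unverifiable"


@dataclass(frozen=True)
class Chunk:
    text: str
    provenance: Provenance


def audit_claim(
    claim: str, chunks: Sequence[Chunk]
) -> Tuple[Verdict, Optional[Provenance]]:
    """Naive case-insensitive substring check, a placeholder for real verification."""
    needle = claim.strip().lower()
    if not needle or not chunks:
        return Verdict.UNVERIFIABLE, None  # nothing to check, or nothing to check against
    for chunk in chunks:
        if needle in chunk.text.lower():
            return Verdict.VERIFIED, chunk.provenance  # cite the supporting source
    return Verdict.NOT_FOUND, None


text = "Total revenue was 4.2M in fiscal 2023."
chunk = Chunk(text, make_provenance("report.pdf", 3, (72.0, 540.0, 410.0, 556.0), text))
verdict, source = audit_claim("revenue was 4.2M", [chunk])
print(verdict.value, source.document, source.page)
```

Hashing the exact extracted text lets a later reader confirm that a cited passage has not changed since the answer was produced, which is what makes the citation auditable rather than merely descriptive.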

Tech Stack

Python, Pydantic, Docling, LangGraph, ChromaDB, SQLite

Tags

#Document AI #PDF Extraction #RAG #Provenance

More Software

Case studies in similar engineering domains.

Axiom Ledger

Event-sourced lending pipeline for document intake, extraction, credit analysis, fraud, compliance, and decision orchestration over an append-only ledger.

Brownfield Cartographer

Multi-agent codebase cartography tool that analyzes local or GitHub repositories with Surveyor and Hydrologist agents to produce module graphs and data lineage artifacts.

Data Contract Enforcer

Schema integrity and lineage attribution system that turns inter-system dependencies into formal contracts, detects schema/type/statistical drift, and reports downstream blast radius.