Data Agent Benchmark Operator
Fork and evaluation workspace for DataAgentBench (DAB), a realistic enterprise data-agent benchmark spanning multi-database integration, messy joins, unstructured text transformation, and domain knowledge.
"Used a serious external benchmark to stress-test data agents against multi-database enterprise complexity rather than SQL-only toy tasks."

Data agents often appear strong on single-database SQL tasks but fail when real enterprise workloads require cross-system joins, text transformation, domain reasoning, and tool execution.
Set up the DAB benchmark workflow with local database dependencies, Dockerized Python execution, supported LLM provider configuration, run logs, validation scripts, and pass@1 aggregation.
DataAgentBench covers 12 datasets and 54 queries across 9 domains and multiple DBMSes, including PostgreSQL, MongoDB, SQLite, and DuckDB. It evaluates agents on realistic data work rather than isolated SQL answering.
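The pass@1 aggregation step in the workflow above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's own script: the record layout (dataset, query_id, passed) and the function name pass_at_1 are hypothetical, and it assumes each query may be run for one or more trials. With k = 1, the standard pass@k estimator reduces to the fraction of passing trials per query, which is then averaged across queries.

```python
from collections import defaultdict


def pass_at_1(records):
    """Aggregate pass@1 from per-trial validation results.

    `records` is an iterable of (dataset, query_id, passed) tuples,
    one per trial. This layout is an assumption for illustration only.
    Returns (overall_score, per_dataset_scores).
    """
    # Group trial outcomes by (dataset, query).
    trials = defaultdict(list)
    for dataset, query_id, passed in records:
        trials[(dataset, query_id)].append(bool(passed))

    # pass@1 for one query = passing trials / total trials.
    per_query = {key: sum(runs) / len(runs) for key, runs in trials.items()}

    # Average the per-query rates within each dataset.
    by_dataset = defaultdict(list)
    for (dataset, _), rate in per_query.items():
        by_dataset[dataset].append(rate)
    dataset_scores = {d: sum(r) / len(r) for d, r in by_dataset.items()}

    # Overall score weights every query equally.
    overall = sum(per_query.values()) / len(per_query) if per_query else 0.0
    return overall, dataset_scores


if __name__ == "__main__":
    # Made-up example records, not real benchmark results.
    demo = [
        ("sales", "q1", True),
        ("sales", "q1", False),
        ("sales", "q2", True),
        ("support", "q1", False),
    ]
    overall, per_dataset = pass_at_1(demo)
    print(overall, per_dataset)
```

Averaging per query before averaging per dataset keeps a query with many trials from outweighing a query with only one.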
Case studies in similar engineering domains.
Sales-domain benchmark and trained critic layer for Tenacious-style B2B outreach, with contamination-aware task generation, deterministic scoring checks, preference data, and a small LoRA judge.
Async multi-provider AI content generation framework for music, video, and images with plugin providers, style presets, CLI workflows, job tracking, duplicate detection, and cost controls.
Spec-driven autonomous influencer network foundation using FastRender Swarm architecture, MCP-only external IO, Planner/Worker/Judge task DAGs, HITL review, multi-tenancy, and budget governance.