Domain Benchmark & Judge Training Architect
Sales-domain benchmark and trained critic layer for Tenacious-style B2B outreach, with contamination-aware task generation, deterministic scoring checks, preference data, and a small LoRA judge.
"Turned prospect-facing sales failures into a measurable benchmark and judge gate instead of trusting generated drafts by default."

Generic agent benchmarks miss high-cost sales mistakes such as bench overcommitment, ICP misclassification, ungrounded gap claims, tone drift, and booking CTAs that arrive too early.
Built Tenacious-Bench v0.2 with 240 tasks, deterministic evaluator checks, human grading support, preference-pair generation, and a Path B judge that reproduces a +76.6 percentage-point held-out lift over the deterministic baseline.
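As an illustration of the preference-pair step, here is a minimal sketch in Python. It assumes each task has several candidate drafts that already carry a numeric score, and it pairs the best draft with the worst when the gap is large enough. The `ScoredDraft` type, the `build_preference_pairs` helper, and the `min_margin` threshold are hypothetical names chosen for this example, not the project's actual pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ScoredDraft:
    """Hypothetical record: one candidate draft for a benchmark task."""
    task_id: str
    text: str
    score: float  # assumed to come from an upstream evaluator, higher is better


def build_preference_pairs(
    drafts: List[ScoredDraft], min_margin: float = 0.2
) -> List[Dict[str, str]]:
    """Pair the best and worst draft per task when their scores differ enough.

    min_margin is an assumed threshold that filters out near-ties,
    which would otherwise add noisy labels to the judge's training data.
    """
    by_task: Dict[str, List[ScoredDraft]] = {}
    for draft in drafts:
        by_task.setdefault(draft.task_id, []).append(draft)

    pairs: List[Dict[str, str]] = []
    for task_id, group in by_task.items():
        best = max(group, key=lambda d: d.score)
        worst = min(group, key=lambda d: d.score)
        if best.score - worst.score >= min_margin:
            pairs.append(
                {"task_id": task_id, "chosen": best.text, "rejected": worst.text}
            )
    return pairs


example_drafts = [
    ScoredDraft("t1", "Grounded, low-pressure opener.", 0.9),
    ScoredDraft("t1", "Overclaims a gap and pushes a meeting.", 0.3),
    ScoredDraft("t2", "Draft A", 0.50),
    ScoredDraft("t2", "Draft B", 0.45),
]
example_pairs = build_preference_pairs(example_drafts)
```

In this sketch only task `t1` yields a pair, because the two `t2` drafts score too close together to count as a clear preference.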
SalesConversion-Bench evaluates whether a sales agent follows Tenacious-specific business rules. The dataset covers trace-derived, programmatic, multi-LLM, and hand-authored tasks across failure categories such as bench overcommitment, signal overclaiming, ICP misclassification, and tone drift.
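To make the idea of a deterministic check concrete, the sketch below shows what two rule-based checks might look like for two of the failure categories named above: booking CTAs that arrive too early and bench overcommitment. The regular expressions, the function names, and the `available_engineers` and `min_turn` parameters are assumptions made for illustration; the benchmark's real rules are not shown in this description.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


# Hypothetical phrase list; a real rule set would be tuned to the domain.
BOOKING_CTA = re.compile(
    r"\b(book a call|schedule a (call|meeting)|grab time|calendar link)\b",
    re.IGNORECASE,
)
ENGINEER_COUNT = re.compile(r"\b(\d+)\s+engineers?\b", re.IGNORECASE)


def check_early_booking_cta(draft: str, turn_index: int, min_turn: int = 1) -> CheckResult:
    """Fail when a booking CTA appears before turn `min_turn` (0-based)."""
    has_cta = BOOKING_CTA.search(draft) is not None
    passed = not (has_cta and turn_index < min_turn)
    detail = "ok" if passed else f"booking CTA at turn {turn_index}"
    return CheckResult("early_booking_cta", passed, detail)


def check_bench_overcommitment(draft: str, available_engineers: int) -> CheckResult:
    """Fail when the draft promises more engineers than the bench can supply."""
    promised = [int(n) for n in ENGINEER_COUNT.findall(draft)]
    largest = max(promised, default=0)
    passed = largest <= available_engineers
    detail = "ok" if passed else f"promised {largest}, bench has {available_engineers}"
    return CheckResult("bench_overcommitment", passed, detail)


def score_draft(draft: str, turn_index: int, available_engineers: int) -> List[CheckResult]:
    """Run every check; the same inputs always produce the same results."""
    return [
        check_early_booking_cta(draft, turn_index),
        check_bench_overcommitment(draft, available_engineers),
    ]


bad_results = score_draft(
    "We can staff 12 engineers next week. Book a call?",
    turn_index=0,
    available_engineers=5,
)
good_results = score_draft(
    "Saw your team is hiring; we could support you with 3 engineers.",
    turn_index=0,
    available_engineers=5,
)
```

Checks of this kind are cheap and repeatable, which is what makes them usable as a baseline; they cannot assess softer failures such as tone drift, which is where a trained judge would be expected to help.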
Case studies in similar engineering domains.
Fork and evaluation workspace for DAB, a realistic enterprise data-agent benchmark spanning multi-database integration, messy joins, unstructured text transformation, and domain knowledge.
Async multi-provider AI content generation framework for music, video, and images with plugin providers, style presets, CLI workflows, job tracking, duplicate detection, and cost controls.
Spec-driven autonomous influencer network foundation using FastRender Swarm architecture, MCP-only external IO, Planner/Worker/Judge task DAGs, HITL review, multi-tenancy, and budget governance.