Natnael Alemseged
© 2026 Natnael Alemseged. All Rights Reserved.
Secure Agent Protocol // Latency Critical // Addis Ababa

SalesConversion-Bench

Domain Benchmark & Judge Training Architect

Sales-domain benchmark and trained critic layer for Tenacious-style B2B outreach, with contamination-aware task generation, deterministic scoring checks, preference data, and a small LoRA judge.

"Turned prospect-facing sales failures into a measurable benchmark and judge gate instead of trusting generated drafts by default."
SalesConversion-Bench evaluation and trained critic pipeline
Domain benchmark, deterministic checks, paired bootstrap evidence, and judge-gated outreach drafts

Problem

Generic agent benchmarks miss high-cost sales mistakes such as bench overcommitment, ICP misclassification, ungrounded gap claims, tone drift, and booking CTAs that arrive too early.

Solution

Built Tenacious-Bench v0.2 with 240 tasks, deterministic evaluator checks, human grading support, preference-pair generation, and a Path B judge that reproduces a +76.6 percentage-point (pp) lift over the deterministic baseline on held-out tasks.
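
A lift reported in percentage points over a baseline on the same held-out tasks is the kind of number a paired bootstrap can put an interval around. The following is a minimal sketch of that technique, not the project's published script: the function name `paired_bootstrap_lift`, the 0/1 pass encoding, and the toy data are all assumptions made for illustration, and the toy numbers are unrelated to the +76.6pp result.

```python
import random


def paired_bootstrap_lift(baseline, judge, n_resamples=10_000, seed=0):
    """Return (lift_pp, ci_low_pp, ci_high_pp) for paired 0/1 outcomes.

    baseline[i] and judge[i] must refer to the same held-out task, so the
    resampling is done over per-task differences (the "paired" part).
    """
    if len(baseline) != len(judge):
        raise ValueError("paired bootstrap needs one outcome per task per system")
    n = len(baseline)
    diffs = [j - b for b, j in zip(baseline, judge)]
    observed = 100.0 * sum(diffs) / n

    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    samples = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(100.0 * sum(resample) / n)
    samples.sort()

    # Percentile 95% interval.
    low = samples[int(0.025 * n_resamples)]
    high = samples[int(0.975 * n_resamples) - 1]
    return observed, low, high


# Toy outcomes (hypothetical): 1 = task passed, 0 = task failed.
toy_baseline = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
toy_judge = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
lift, low, high = paired_bootstrap_lift(toy_baseline, toy_judge)
print(f"lift = {lift:.1f}pp, 95% CI = [{low:.1f}, {high:.1f}]")
```

Resampling the per-task differences rather than each system separately keeps task difficulty constant across the two systems, which is why the pairing matters when both are scored on the same held-out split.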

Deep Dive

What It Measures

SalesConversion-Bench evaluates whether a sales agent follows Tenacious-specific business rules. The dataset covers trace-derived, programmatic, multi-LLM, and hand-authored tasks across failure categories such as bench overcommitment, signal overclaiming, ICP misclassification, and tone drift.
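
To make the idea of rule-following checks concrete, here is a minimal sketch of two deterministic checks that return inspectable pass/fail reasons. It is an illustration under stated assumptions, not the benchmark's evaluator: the task fields `bench_available` and `touch_number`, the phrase list, and the rule that a booking CTA on the first touch counts as "too early" are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    reason: str  # human-readable explanation of the verdict


def check_bench_overcommitment(draft: str, task: dict) -> CheckResult:
    """Fail if the draft promises more engineers than the task's bench holds."""
    available = task["bench_available"]  # hypothetical task field
    promised = [int(n) for n in re.findall(r"(\d+)\s+engineers", draft.lower())]
    worst = max(promised, default=0)
    if worst > available:
        return CheckResult(
            "bench_overcommitment", False,
            f"draft promises {worst} engineers, bench has {available}",
        )
    return CheckResult("bench_overcommitment", True, "no promise exceeds the bench")


def check_early_booking_cta(draft: str, task: dict) -> CheckResult:
    """Fail if a booking CTA appears on the first touch (assumed rule)."""
    cta_phrases = ("book a call", "schedule a call", "grab time")
    has_cta = any(phrase in draft.lower() for phrase in cta_phrases)
    if has_cta and task["touch_number"] == 1:  # hypothetical task field
        return CheckResult("early_booking_cta", False, "booking CTA on first touch")
    return CheckResult("early_booking_cta", True, "no premature booking CTA")


def evaluate(draft: str, task: dict) -> list[CheckResult]:
    """Run every check; the same draft and task always give the same results."""
    checks = (check_bench_overcommitment, check_early_booking_cta)
    return [check(draft, task) for check in checks]


# Hypothetical task and draft, for illustration only.
task = {"bench_available": 3, "touch_number": 1}
draft = "We can staff 5 engineers next week. Book a call with me today."
for result in evaluate(draft, task):
    print(result.name, "PASS" if result.passed else "FAIL", "-", result.reason)
```

Because each check is a pure function of the draft and the task, a failure can be traced to a specific rule and reason rather than to an opaque score.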

Engineering Highlights

  • Machine-checkable evaluator: schema-backed tasks and deterministic scoring provide inspectable pass/fail reasons.
  • Preference training loop: failed drafts are converted into chosen/rejected pairs for a small SimPO/LoRA critic.
  • Published artifacts: the repo links to the Hugging Face dataset, judge adapter, technical write-up, and reproducible paired-bootstrap scripts.
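
The preference training loop in the list above can be sketched as a small conversion step from scored drafts to chosen/rejected pairs. This is a hedged illustration, not the repo's pipeline: the record layout (`task_id`, `prompt`, `drafts`, `passed`) is hypothetical, and pairing each failed draft with the first passing draft for the same task is an assumption about where the "chosen" side comes from.

```python
import json


def build_preference_pairs(records):
    """Turn scored drafts into prompt/chosen/rejected rows.

    Each record is assumed to look like:
      {"task_id": str, "prompt": str,
       "drafts": [{"text": str, "passed": bool}, ...]}
    Tasks without both a passing and a failing draft yield no pairs.
    """
    pairs = []
    for record in records:
        passing = [d["text"] for d in record["drafts"] if d["passed"]]
        failing = [d["text"] for d in record["drafts"] if not d["passed"]]
        if not passing or not failing:
            continue  # nothing to contrast for this task
        chosen = passing[0]
        for rejected in failing:
            pairs.append({
                "task_id": record["task_id"],
                "prompt": record["prompt"],
                "chosen": chosen,
                "rejected": rejected,
            })
    return pairs


# Hypothetical records, for illustration only.
records = [
    {
        "task_id": "t-001",
        "prompt": "Write a first-touch email to a Series A data team.",
        "drafts": [
            {"text": "We have 2 engineers free who know your stack.", "passed": True},
            {"text": "We can staff 10 engineers tomorrow. Book a call!", "passed": False},
        ],
    },
    {
        "task_id": "t-002",
        "prompt": "Follow up after no reply.",
        "drafts": [{"text": "Book a call now.", "passed": False}],
    },
]

# One JSON object per line is a common on-disk layout for preference data.
for pair in build_preference_pairs(records):
    print(json.dumps(pair))
```

Keeping `task_id` on every row makes it possible to hold out whole tasks later, so that pairs from one task never land on both sides of a train/held-out split.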

Tech Stack

Python, JSON Schema, LoRA, SimPO, Hugging Face, Streamlit

Tags

#Benchmarking #LLM Judge #Preference Training #Sales AI

More CreativeWork Software

Case studies in similar engineering domains.

DataAgentBench Evaluation Fork

Fork and evaluation workspace for DAB, a realistic enterprise data-agent benchmark spanning multi-database integration, messy joins, unstructured text transformation, and domain knowledge.

TRP1 AI Artist

Async multi-provider AI content generation framework for music, video, and images with plugin providers, style presets, CLI workflows, job tracking, duplicate detection, and cost controls.

Project Chimera

Spec-driven autonomous influencer network foundation using FastRender Swarm architecture, MCP-only external IO, Planner/Worker/Judge task DAGs, HITL review, multi-tenancy, and budget governance.