SalesConversion-Bench

Domain Benchmark & Judge Training Architect

Sales-domain benchmark and trained critic layer for Tenacious-style B2B outreach, with contamination-aware task generation, deterministic scoring checks, preference data, and a small LoRA judge.

"Turned prospect-facing sales failures into a measurable benchmark and judge gate instead of trusting generated drafts by default."

SalesConversion-Bench evaluation and trained critic pipeline: domain benchmark, deterministic checks, paired bootstrap evidence, and judge-gated outreach drafts

Problem

Generic agent benchmarks miss high-cost sales mistakes such as bench overcommitment, ICP misclassification, ungrounded gap claims, tone drift, and booking CTAs that arrive too early.

Solution

Built Tenacious-Bench v0.2 with 240 tasks, deterministic evaluator checks, human grading support, preference-pair generation, and a Path B judge that reproduces a +76.6 percentage-point (pp) held-out lift over the deterministic baseline.
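
A held-out lift like the one above is typically backed by a paired bootstrap over per-task outcomes. The following is a minimal sketch of that technique, not the project's published script: the function name `paired_bootstrap_lift`, the 0/1 outcome encoding, the resample count, and the percentile interval are all assumptions, and the example data is a toy illustration rather than a benchmark result.

```python
import random


def paired_bootstrap_lift(baseline, judge, n_resamples=2000, seed=0):
    """Return (observed lift, CI low, CI high) in percentage points.

    `baseline` and `judge` are per-task 0/1 outcomes on the SAME tasks,
    in the same order, so each resample keeps the pairing intact.
    """
    if len(baseline) != len(judge) or not baseline:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(baseline)
    rng = random.Random(seed)

    # Per-task difference: +1 judge-only pass, -1 baseline-only pass, 0 tie.
    diffs = [j - b for b, j in zip(baseline, judge)]
    observed = 100.0 * sum(diffs) / n

    # Resample tasks with replacement and recompute the lift each time.
    lifts = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        lifts.append(100.0 * sum(sample) / n)
    lifts.sort()

    # Percentile 95% interval from the sorted bootstrap distribution.
    low = lifts[int(0.025 * n_resamples)]
    high = lifts[int(0.975 * n_resamples) - 1]
    return observed, low, high


# Toy outcomes for ten hypothetical held-out tasks (not real results).
toy_baseline = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
toy_judge = [1, 1, 1, 1, 1, 0, 1, 1, 1, 0]
lift, ci_low, ci_high = paired_bootstrap_lift(toy_baseline, toy_judge)
print(f"lift = {lift:.1f}pp, 95% CI [{ci_low:.1f}, {ci_high:.1f}]")
```

Resampling the per-task differences, rather than each system separately, is what makes the estimate paired: task difficulty cancels out within each pair.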

Deep Dive

What It Measures

SalesConversion-Bench evaluates whether a sales agent follows Tenacious-specific business rules. The dataset covers trace-derived, programmatic, multi-LLM, and hand-authored tasks across failure categories such as bench overcommitment, signal overclaiming, ICP misclassification, and tone drift.
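
To make "follows business rules" concrete, here is a minimal sketch of two deterministic checks. Everything in it is illustrative: the task fields `bench_available` and `touch_number`, the regular expressions, and the rule that a booking CTA fails on the first touch are assumptions, not the benchmark's actual schema or thresholds.

```python
import re


def check_bench_overcommitment(draft, task):
    """Fail if the draft promises more engineers than the bench can supply."""
    promised = [int(n) for n in re.findall(r"(\d+)\s+engineers?", draft.lower())]
    limit = task["bench_available"]  # hypothetical task field
    over = [n for n in promised if n > limit]
    if over:
        return False, f"promises {max(over)} engineers, bench has {limit}"
    return True, "ok"


def check_early_booking_cta(draft, task):
    """Fail if a first-touch message already pushes for a call booking."""
    asks_for_call = bool(
        re.search(r"\b(book|schedule)\b.*\b(call|meeting)\b", draft.lower())
    )
    if task["touch_number"] == 1 and asks_for_call:  # hypothetical rule
        return False, "booking CTA on first touch"
    return True, "ok"


CHECKS = {
    "bench_overcommitment": check_bench_overcommitment,
    "early_booking_cta": check_early_booking_cta,
}


def score(draft, task):
    """Run every check; return (passed, {failed_check_name: reason})."""
    results = {name: fn(draft, task) for name, fn in CHECKS.items()}
    passed = all(ok for ok, _ in results.values())
    failures = {name: reason for name, (ok, reason) in results.items() if not ok}
    return passed, failures


task = {"bench_available": 3, "touch_number": 1}
draft = "We can staff 8 engineers next week. Book a call with us today."
print(score(draft, task))
```

Because each check returns a reason string alongside the verdict, a failing draft can be inspected without re-running a model.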

Engineering Highlights

• Machine-checkable evaluator: schema-backed tasks and deterministic scoring provide inspectable pass/fail reasons.
• Preference training loop: failed drafts are converted into chosen/rejected pairs for a small SimPO/LoRA critic.
• Published artifacts: the repo links to the Hugging Face dataset, judge adapter, technical write-up, and reproducible paired-bootstrap scripts.
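
The preference training loop in the list above can be sketched roughly as follows. This assumes each scored draft is a record with hypothetical fields `task_id`, `prompt`, `draft`, and `passed`, and it emits the prompt/chosen/rejected layout commonly used by preference-optimization trainers; how the project actually selects the chosen draft is not stated, so taking the first passing draft is an assumption.

```python
def build_preference_pairs(records):
    """Pair each failing draft with a passing draft for the same task.

    `records` is a list of dicts with the hypothetical keys
    'task_id', 'prompt', 'draft', and 'passed'.
    """
    by_task = {}
    for record in records:
        by_task.setdefault(record["task_id"], []).append(record)

    pairs = []
    for rows in by_task.values():
        passing = [r for r in rows if r["passed"]]
        failing = [r for r in rows if not r["passed"]]
        if not passing:
            continue  # nothing to prefer for this task, so skip it
        for bad in failing:
            pairs.append(
                {
                    "prompt": bad["prompt"],
                    "chosen": passing[0]["draft"],  # assumption: first passing draft
                    "rejected": bad["draft"],
                }
            )
    return pairs


records = [
    {"task_id": "t1", "prompt": "Write a first-touch email.", "draft": "A", "passed": True},
    {"task_id": "t1", "prompt": "Write a first-touch email.", "draft": "B", "passed": False},
    {"task_id": "t2", "prompt": "Write a follow-up.", "draft": "C", "passed": False},
]
print(build_preference_pairs(records))
```

Tasks with no passing draft are dropped rather than paired against an unverified rewrite, which keeps every chosen example backed by the deterministic checks.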

Tech Stack

Python, JSON Schema, LoRA, SimPO, Hugging Face, Streamlit
