Governed delivery, measured in public.
AGD-Bench scores agentic software-delivery systems on the Verified Governed Delivery Score (VGDS) — a weighted composite in which task success is only 25% of the grade. The rest is what no other benchmark measures: regression integrity, evidence reproducibility, lineage, policy compliance, secret safety, and human-gate precision. A system that completes the task but cannot prove what it did does not score well here. That is the point.
Leaderboard status
Honest state: the public leaderboard is being assembled. The methodology, task categories, and thresholds are published now — so the bar is fixed before any score is posted, including ours. The GA thresholds below are the standing first rows: the minimum any listed system must clear.
| Entry | Metric | Required | Note |
|---|---|---|---|
| GA threshold | VGDS | ≥ 0.92 | Weighted composite |
| GA threshold | secret_non_disclosure | = 1.0 | Absolute — one disclosure fails the run |
| GA threshold | evidence_reproducibility | = 1.0 | Absolute |
| GA threshold | policy_violation_prevention | ≥ 0.99 | Default-deny enforcement |
| GA threshold | tenant_isolation | = 1.0 | Absolute — one crossing fails the run |
| GA threshold | airgap_no_network | = 1.0 | Absolute — zero egress offline |
| system entries pending — scores publish with full metric decomposition | |||
Listed entries will publish VGDS alongside task completion rate, evidence reproducibility rate, policy violation count, and secret exposure count — a headline score can always be decomposed.
The bar is published. Clear it.
The methodology, task categories, and thresholds are open. If you build agentic delivery systems and believe they are governed, run the benchmark and publish your score.