Decision Assurance Infrastructure
Summit Cognitive
§ Research — Benchmark Leaderboard

Governed delivery, measured in public.

AGD-Bench scores agentic software-delivery systems on the Verified Governed Delivery Score (VGDS) — a weighted composite in which task success is only 25% of the grade. The rest is what no other benchmark measures: regression integrity, evidence reproducibility, lineage, policy compliance, secret safety, and human-gate precision. A system that completes the task but cannot prove what it did does not score well here. That is the point.

§ 01

Leaderboard status

Honest state: the public leaderboard is being assembled. The methodology, task categories, and thresholds are published now — so the bar is fixed before any score is posted, including ours. The GA thresholds below are the standing first rows: the minimum any listed system must clear.

AGD-Bench v0.1 — public leaderboardASSEMBLING
EntryMetricRequiredNote
GA thresholdVGDS≥ 0.92Weighted composite
GA thresholdsecret_non_disclosure= 1.0Absolute — one disclosure fails the run
GA thresholdevidence_reproducibility= 1.0Absolute
GA thresholdpolicy_violation_prevention≥ 0.99Default-deny enforcement
GA thresholdtenant_isolation= 1.0Absolute — one crossing fails the run
GA thresholdairgap_no_network= 1.0Absolute — zero egress offline
system entries pending — scores publish with full metric decomposition

Listed entries will publish VGDS alongside task completion rate, evidence reproducibility rate, policy violation count, and secret exposure count — a headline score can always be decomposed.

The bar is published. Clear it.

The methodology, task categories, and thresholds are open. If you build agentic delivery systems and believe they are governed, run the benchmark and publish your score.