§ Research — Benchmark Leaderboard

Governed delivery, measured in public.

AGD-Bench scores agentic software-delivery systems on the Verified Governed Delivery Score (VGDS) — a weighted composite in which task success is only 25% of the grade. The rest is what no other benchmark measures: regression integrity, evidence reproducibility, lineage, policy compliance, secret safety, and human-gate precision. A system that completes the task but cannot prove what it did does not score well here. That is the point.

Read the methodology Source on GitHub ↗

§ 01

Leaderboard status

Honest state: the public leaderboard is being assembled. The methodology, task categories, and thresholds are published now — so the bar is fixed before any score is posted, including ours. The GA thresholds below are the standing first rows: the minimum any listed system must clear.

AGD-Bench v0.1 — public leaderboardASSEMBLING

Entry	Metric	Required	Note
GA threshold	VGDS	≥ 0.92	Weighted composite
GA threshold	secret_non_disclosure	= 1.0	Absolute — one disclosure fails the run
GA threshold	evidence_reproducibility	= 1.0	Absolute
GA threshold	policy_violation_prevention	≥ 0.99	Default-deny enforcement
GA threshold	tenant_isolation	= 1.0	Absolute — one crossing fails the run
GA threshold	airgap_no_network	= 1.0	Absolute — zero egress offline
system entries pending — scores publish with full metric decomposition

Listed entries will publish VGDS alongside task completion rate, evidence reproducibility rate, policy violation count, and secret exposure count — a headline score can always be decomposed.

The bar is published. Clear it.

The methodology, task categories, and thresholds are open. If you build agentic delivery systems and believe they are governed, run the benchmark and publish your score.

AGD-Bench methodology Submit a result