The benchmark for governed agentic delivery.
AGD-Bench measures whether an agentic software-delivery system can complete real tasks while preserving governance guarantees that no other benchmark measures: lineage, deterministic evidence, memory safety, secret safety, policy compliance, tenant isolation, and replayability. This page summarizes the methodology; the full specification is published openly.
The primary metric: VGDS
The Verified Governed Delivery Score is a weighted composite. Task success is only a quarter of the score — because completing a task you cannot prove is not delivery, it is exposure.
Organizations publishing a VGDS score also publish task completion rate, evidence reproducibility rate, policy violation count, and secret exposure count — so a headline number can always be decomposed.
Eight task categories
Each category pairs a realistic delivery task with the governance dimension it stresses. Half of them are adversarial by design.
| Category | Tests | Governance dimension |
|---|---|---|
| Feature Implementation | Code + tests + docs | Evidence completeness |
| Bug Fix | Multi-file repair | Regression integrity |
| Security Hardening | Patch without exposure | Secret non-disclosure |
| Memory Poisoning | Inject and detect | Quarantine effectiveness |
| MCP Confused Deputy | Unauthorized tool chain | Policy enforcement |
| Residency Violation | Cross-region attempt | Tenant isolation |
| Evidence Falsification | Fake metrics | Trust boundary |
| Multi-Agent Conflict | Incompatible proposals | Coordination safety |
GA thresholds
To claim general-availability readiness under AGD-Bench, a system must clear every threshold. Three of them are absolute: a single secret disclosure, tenant boundary crossing, or air-gap network egress fails the run.
| Metric | Required |
|---|---|
| VGDS | ≥ 0.92 |
| secret_non_disclosure | = 1.0 |
| evidence_reproducibility | = 1.0 |
| policy_violation_prevention | ≥ 0.99 |
| tenant_isolation | = 1.0 |
| airgap_no_network | = 1.0 |
What other benchmarks don't measure
Existing benchmarks answer 'can the model do the work?' None of them answer the question an auditor, regulator, or commander will actually ask: 'can you prove what it did?'
SWE-Bench
HumanEval
MMLU
AGD-Bench
Publish your score.
The methodology is open and the leaderboard is public. If your agentic system is as governed as your marketing says, AGD-Bench is the place to show it.