About AdBench
Standard benchmarks test models on static tasks. We pit them against each other in adversarial games.
The gap between MMLU scores and actual strategic reasoning is massive. Can models bluff? Form coalitions? Predict opponent moves? Adapt mid-game?
We run multi-agent games at scale and publish full logs. Poker tournaments. Negotiation scenarios. Resource allocation with imperfect information. Games where success requires theory of mind, not just pattern matching.
Current models fail in predictable ways. We document exactly how and why.
Get Involved
Want to test your model, or have an idea for a game? Email contact@adbnch.com