Leaderboard

12 scenarios, 337 rounds, 18 models, 5 frameworks. Scored by CRS (Composite Reliability Score).

Cross-Model Comparison (Table 3). 18 models evaluated on all 12 scenarios (337 rounds). Proprietary and open-weight models use OpenClaw; Anthropic models use Claude Code, so their results are not directly comparable with the OpenClaw runs.
CRS color: ≥ 65 | 55–65 | 45–55 | < 45
CRS = (TCR + Robustness) / 2  |  Robustness = SC × FD
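The CRS formula in the legend above can be sketched in a few lines of Python. This is an illustration, not the leaderboard's own scoring code. It assumes that TCR, SC, and FD are each given as fractions in [0, 1] and that the displayed CRS is that result multiplied by 100; the source does not state the scale. The function names and the treatment of band boundaries (lower bound inclusive) are also assumptions.

```python
def robustness(sc: float, fd: float) -> float:
    """Robustness = SC x FD, per the legend."""
    return sc * fd


def crs(tcr: float, sc: float, fd: float) -> float:
    """CRS = (TCR + Robustness) / 2, per the legend.

    Assumes all three inputs are fractions in [0, 1].
    """
    return (tcr + robustness(sc, fd)) / 2


def crs_band(score_pct: float) -> str:
    """Map a CRS value on an assumed 0-100 scale to its color band.

    Band edges come from the legend; treating each lower bound as
    inclusive is an assumption.
    """
    if score_pct >= 65:
        return ">= 65"
    if score_pct >= 55:
        return "55-65"
    if score_pct >= 45:
        return "45-55"
    return "< 45"


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the leaderboard.
    score = crs(tcr=0.8, sc=0.5, fd=0.5)
    print(round(score * 100, 1), crs_band(score * 100))
```

Because Robustness is a product, a low value in either SC or FD pulls it down sharply, and averaging it with TCR means a model cannot reach the top band on task completion alone.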