Dataset
All ClawArena data is publicly available under the MIT license. The benchmark includes 64 scenarios, 1,879 evaluation rounds, and 365 dynamic updates across 8 domains.
Full Dataset
Complete ClawArena benchmark with 64 scenarios across 8 domains, 1,879 evaluation rounds, workspace files, session histories, and dynamic updates.
Spec Templates
The 6-layer specification system (L0–L4 + GUIDE) used to author all 64 scenarios. Extend it for new domains or personas.
Plugin System
Add new agent frameworks via the plugin adapter interface. No core code modification needed.
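The actual adapter interface is not shown on this page; the following is a hypothetical sketch of what registering a new framework through a plugin adapter might look like. All class, method, and registry names here are illustrative assumptions, not ClawArena's real API.

```python
from abc import ABC, abstractmethod

# Hypothetical adapter interface: one method per evaluation round.
class AgentAdapter(ABC):
    @abstractmethod
    def run_round(self, question: str, workspace: str) -> str:
        """Run one evaluation round; return the agent's raw response text."""

# Illustrative plugin registry: adapters register by framework name,
# so no core code needs modification to add a new one.
ADAPTERS: dict[str, type] = {}

def register(name: str):
    def wrap(cls):
        ADAPTERS[name] = cls
        return cls
    return wrap

@register("echo")
class EchoAdapter(AgentAdapter):
    def run_round(self, question: str, workspace: str) -> str:
        return f"echo: {question}"

agent = ADAPTERS["echo"]()
```

A new framework would ship only its own `AgentAdapter` subclass plus the `@register(...)` call.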
Data Format
Each scenario has a questions.json with evaluation rounds:
// questions.json — per scenario
{
  "id": "hil_c6",
  "desc": "NexaFlow retention/morale crisis ...",
  "rounds": [
    {
      "id": "q1",
      "type": "multi_choice",
      "question": "Based on the workspace documents ...",
      "eval": {
        "options": { "A": "...", "B": "...", "C": "...", "D": "..." },
        "answer": ["A", "C"]
      },
      "update_ids": []
    },
    {
      "id": "q7",
      "type": "exec_check",
      "question": "Generate a product prioritization matrix ...",
      "exec_check": {
        "command": "test -f ${workspace}/report.md && grep -q 'NPS' ..."
      },
      "update_ids": ["upd2_sessions", "upd2_workspace"]
    }
  ]
}
Evaluation Types
multi_choice
The agent selects from a discrete option set. The \bbox{A,B,...} marker is extracted from the response and scored with IoU/F1 against the ground-truth answer key.
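A minimal sketch of multi-choice scoring, assuming the answer letters appear as a comma-separated set inside a literal \bbox{...} marker as described above; the function names are illustrative:

```python
import re

def extract_answers(response: str) -> set[str]:
    """Pull the option letters out of a \\bbox{A, C, ...} marker."""
    m = re.search(r"\\bbox\{([A-Z,\s]+)\}", response)
    return {s.strip() for s in m.group(1).split(",")} if m else set()

def score(pred: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based IoU and F1 between predicted options and the answer key."""
    inter, union = len(pred & gold), len(pred | gold)
    iou = inter / union if union else 0.0
    p = inter / len(pred) if pred else 0.0
    r = inter / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return iou, f1

pred = extract_answers(r"Final answer: \bbox{A, C, D}")
iou, f1 = score(pred, {"A", "C"})  # one extra option predicted
```

With gold {A, C} and prediction {A, C, D}, the intersection is 2 of a union of 3, so IoU is 2/3 and F1 is 0.8.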
exec_check
The agent produces files or code. The round is scored by running shell commands that verify file existence, content matches, and output correctness.
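An exec_check round like q7 above can be driven by a small runner that substitutes the `${workspace}` placeholder and treats a zero exit status as a pass. This is a sketch under that assumption; the runner name and the workspace setup are illustrative.

```python
import os
import subprocess
import tempfile
from string import Template

def run_exec_check(command: str, workspace: str) -> bool:
    """Substitute ${workspace} into the check command and run it in a shell."""
    cmd = Template(command).substitute(workspace=workspace)
    return subprocess.run(cmd, shell=True).returncode == 0

# Illustrative setup: a throwaway workspace with the expected artifact.
ws = tempfile.mkdtemp()
with open(os.path.join(ws, "report.md"), "w") as f:
    f.write("NPS-driven prioritization matrix\n")

ok = run_exec_check(
    "test -f ${workspace}/report.md && grep -q 'NPS' ${workspace}/report.md", ws
)
```

The check passes only if every chained command succeeds, so a missing file or missing content fails the round.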