Dataset
All ClawArena data is publicly available under the MIT license. The benchmark includes 12 scenarios, 337 evaluation rounds, and 45 dynamic updates across diverse professional contexts.
Full Dataset
Complete ClawArena benchmark with 12 scenarios, 337 evaluation rounds, workspace files, session histories, and 45 dynamic updates.
Spec Templates
The 6-layer specification system (L0–L4 + GUIDE) used to author all 64 scenarios. Extend it for new domains or personas.
Plugin System
Add new agent frameworks through the plugin adapter interface, with no changes to the core code.
Data Format
Each scenario has a questions.json with evaluation rounds:
// questions.json — per scenario
{
  "id": "hil_c6",
  "desc": "NexaFlow retention/morale crisis ...",
  "rounds": [
    {
      "id": "q1",
      "type": "multi_choice",
      "question": "Based on the workspace documents ...",
      "eval": {
        "options": { "A": "...", "B": "...", "C": "...", "D": "..." },
        "answer": ["A", "C"]
      },
      "update_ids": []
    },
    {
      "id": "q7",
      "type": "exec_check",
      "question": "Generate a product prioritization matrix ...",
      "exec_check": {
        "command": "test -f ${workspace}/report.md && grep -q 'NPS' ..."
      },
      "update_ids": ["upd2_sessions", "upd2_workspace"]
    }
  ]
}
Evaluation Types
multi_choice: The agent selects from a discrete option set. The scorer extracts \bbox{A,B,...} from the response and computes IoU/F1 against the ground-truth answer key.
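The multi_choice scoring described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual scorer: the function names, the choice to read the last \bbox{...} in the response, and the handling of empty sets are all assumptions made for the example.

```python
import re

# Matches \bbox{...} and captures the comma-separated labels inside the braces.
BBOX_RE = re.compile(r"\\bbox\{([^}]*)\}")


def extract_choices(response: str) -> set[str]:
    """Return the set of option labels found in the response.

    Assumption: if several \\bbox{...} spans appear, the last one is the answer.
    """
    matches = BBOX_RE.findall(response)
    if not matches:
        return set()
    return {tok.strip().upper() for tok in matches[-1].split(",") if tok.strip()}


def iou(pred: set[str], gold: set[str]) -> float:
    """Intersection over union of the predicted and ground-truth label sets."""
    union = pred | gold
    if not union:
        return 0.0  # assumption: nothing to compare counts as zero
    return len(pred & gold) / len(union)


def f1(pred: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over option labels."""
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Example: the answer key from round q1 is ["A", "C"].
gold = {"A", "C"}
pred = extract_choices(r"After reviewing the files, my answer is \bbox{A, B}.")
score_iou = iou(pred, gold)
score_f1 = f1(pred, gold)
```

Partial credit falls out of both metrics: picking one correct and one incorrect option against a two-option key scores above zero but below a full match.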
exec_check: The agent produces files or code. The round is scored by running shell commands that verify file existence, content matching, and output correctness.
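An exec_check round could be evaluated along these lines. This is a hedged sketch assuming a POSIX shell is available, that ${workspace} is replaced with the agent's workspace path before execution, and that an exit status of 0 means the check passed. The helper name run_exec_check and the timeout value are hypothetical and not taken from the ClawArena code.

```python
import shlex
import subprocess


def run_exec_check(command: str, workspace: str, timeout: float = 30.0) -> bool:
    """Run one exec_check command and report whether it passed.

    Assumptions: ${workspace} is a literal placeholder in the command string,
    and a zero exit status means the check succeeded.
    """
    # Quote the path so workspaces containing spaces do not break the command.
    resolved = command.replace("${workspace}", shlex.quote(workspace))
    try:
        result = subprocess.run(
            resolved,
            shell=True,           # the checks are shell one-liners (test, grep, &&)
            capture_output=True,  # keep check output out of the caller's console
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # assumption: a hung check is treated as a failure
    return result.returncode == 0
```

Because the command runs through a shell, it should only be used with check commands from a trusted questions.json, ideally inside the same sandbox the agent ran in.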