Dataset

All ClawArena data is publicly available under the MIT license. The benchmark includes 64 scenarios, 1,879 evaluation rounds, and 365 dynamic updates across 8 domains.

📦 Full Dataset

Complete ClawArena benchmark with 64 scenarios across 8 domains, 1,879 evaluation rounds, workspace files, session histories, and dynamic updates.

data/clawarena/eval/
data/clawarena/openclaw/
data/clawarena/tests.json
Size: ~42 MB
📋 Spec Templates

The 6-layer specification system (L0–L4 + GUIDE) used to author all 64 scenarios. Extend it for new domains or personas.

docs/data-spec/
Size: ~1 MB
🔌 Plugin System

Add new agent frameworks via the plugin adapter interface. No core code modification needed.

docs/plugin.md
src/clawarena/plugins/
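The adapter contract itself is documented in docs/plugin.md; as an illustration only, a plugin might take roughly this shape. All names below (`AgentAdapter`, `run_round`, `EchoAdapter`) are assumptions for the sketch, not the real interface:

```python
from abc import ABC, abstractmethod

class AgentAdapter(ABC):
    """Hypothetical adapter base class; see docs/plugin.md for the real API."""

    name: str  # identifier the harness would use to select this adapter

    @abstractmethod
    def run_round(self, question: str, workspace: str) -> str:
        """Run the agent on one evaluation round and return its raw response."""

class EchoAdapter(AgentAdapter):
    """Trivial adapter used only to illustrate the shape of a plugin."""
    name = "echo"

    def run_round(self, question: str, workspace: str) -> str:
        # A real adapter would invoke its agent framework here.
        return question
```

Because adapters only implement this one entry point, new frameworks plug in without touching core code.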

Data Format

Each scenario has a questions.json with evaluation rounds:

// questions.json — per scenario
{
  "id": "hil_c6",
  "desc": "NexaFlow retention/morale crisis ...",
  "rounds": [
    {
      "id": "q1",
      "type": "multi_choice",
      "question": "Based on the workspace documents ...",
      "eval": {
        "options": { "A": "...", "B": "...", "C": "...", "D": "..." },
        "answer": ["A", "C"]
      },
      "update_ids": []
    },
    {
      "id": "q7",
      "type": "exec_check",
      "question": "Generate a product prioritization matrix ...",
      "exec_check": {
        "command": "test -f ${workspace}/report.md && grep -q 'NPS' ..."
      },
      "update_ids": ["upd2_sessions", "upd2_workspace"]
    }
  ]
}
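A minimal loader for this format might look like the following sketch, which parses a scenario's questions.json and checks that each round carries the fields shown above (the helper name `load_rounds` is ours, not part of the benchmark):

```python
import json

VALID_TYPES = {"multi_choice", "exec_check"}

def load_rounds(path):
    """Parse a scenario's questions.json and return its evaluation rounds."""
    with open(path) as f:
        scenario = json.load(f)
    rounds = scenario["rounds"]
    for r in rounds:
        # Every round has an id, a recognized type, and its update dependencies.
        assert r["type"] in VALID_TYPES, f"unknown round type: {r['type']}"
        assert "id" in r and "update_ids" in r
    return rounds
```

Note that the example above uses a `//` comment for readability; a strict `json.load` expects the file itself to be comment-free.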

Evaluation Types

multi_choice
Multiple Choice

The agent selects from a discrete option set. The grader extracts \bbox{A,B,...} from the response and scores it with IoU/F1 against the ground-truth answer key.
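Set-level IoU and F1 over option letters can be computed as in this sketch (standard definitions; the exact tie-breaking for empty sets is our assumption):

```python
def iou(pred, gold):
    """Intersection-over-union between predicted and gold option sets."""
    p, g = set(pred), set(gold)
    return len(p & g) / len(p | g) if p | g else 1.0

def f1(pred, gold):
    """Set-level F1 between predicted and gold option sets."""
    p, g = set(pred), set(gold)
    if not p or not g:
        return 1.0 if p == g else 0.0
    precision = len(p & g) / len(p)
    recall = len(p & g) / len(g)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

For example, predicting {A} when the key is {A, C} gives IoU 0.5 and F1 2/3, so partial credit is preserved on multi-answer rounds.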

exec_check
Execution Check

The agent produces files or code. Scoring runs shell commands that verify file existence, content matches, and output correctness.
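A check like the `exec_check` round above can be run by substituting `${workspace}` and inspecting the exit code. This is a sketch of the idea only; the actual harness may sandbox or time-limit the command:

```python
import subprocess

def run_exec_check(command, workspace):
    """Substitute ${workspace} into the check command and run it via sh.

    Returns True when the command exits 0, i.e. the check passes.
    """
    cmd = command.replace("${workspace}", workspace)
    result = subprocess.run(["sh", "-c", cmd], capture_output=True)
    return result.returncode == 0
```

Exit-code scoring keeps checks composable: `test -f ... && grep -q ...` passes only when every clause holds.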