Dataset

All ClawArena data is publicly available under the MIT license. The benchmark includes 12 scenarios, 337 evaluation rounds, and 45 dynamic updates across diverse professional contexts.

Full Dataset

Complete ClawArena benchmark with 12 scenarios, 337 evaluation rounds, workspace files, session histories, and 45 dynamic updates.

data/clawarena/eval/
data/clawarena/openclaw/
data/clawarena/tests.json
Size: ~42 MB

Spec Templates

The 6-layer specification system (L0–L4 + GUIDE) used to author all 12 scenarios. Extend it for new domains or personas.

docs/data-spec/
Size: ~1 MB

Plugin System

Add new agent frameworks via the plugin adapter interface. No core code modification needed.

docs/plugin.md
src/clawarena/plugins/

Data Format

Each scenario has a questions.json with evaluation rounds:

// questions.json — per scenario
{
  "id": "hil_c6",
  "desc": "NexaFlow retention/morale crisis ...",
  "rounds": [
    {
      "id": "q1",
      "type": "multi_choice",
      "question": "Based on the workspace documents ...",
      "eval": {
        "options": { "A": "...", "B": "...", "C": "...", "D": "..." },
        "answer": ["A", "C"]
      },
      "update_ids": []
    },
    {
      "id": "q7",
      "type": "exec_check",
      "question": "Generate a product prioritization matrix ...",
      "exec_check": {
        "command": "test -f ${workspace}/report.md && grep -q 'NPS' ..."
      },
      "update_ids": ["upd2_sessions", "upd2_workspace"]
    }
  ]
}
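
As a rough illustration of working with this format, the sketch below parses a scenario shaped like the sample above and summarizes its rounds. The function name `summarize_scenario` and the inline `SAMPLE` string are hypothetical and not part of ClawArena. The sample is an abbreviated copy of the example (the truncated `exec_check` command is cut short rather than completed), with the leading `//` comment dropped so that it is valid JSON.

```python
import json
from collections import Counter

# Abbreviated copy of the sample above (illustrative only, not real benchmark data).
SAMPLE = """
{
  "id": "hil_c6",
  "desc": "NexaFlow retention/morale crisis ...",
  "rounds": [
    {"id": "q1", "type": "multi_choice", "question": "...",
     "eval": {"options": {"A": "...", "B": "...", "C": "...", "D": "..."},
              "answer": ["A", "C"]},
     "update_ids": []},
    {"id": "q7", "type": "exec_check", "question": "...",
     "exec_check": {"command": "test -f ${workspace}/report.md"},
     "update_ids": ["upd2_sessions", "upd2_workspace"]}
  ]
}
"""


def summarize_scenario(scenario: dict) -> dict:
    """Count rounds per evaluation type and list rounds that reference updates."""
    rounds = scenario["rounds"]
    return {
        "id": scenario["id"],
        "round_types": dict(Counter(r["type"] for r in rounds)),
        "rounds_with_updates": [r["id"] for r in rounds if r["update_ids"]],
    }


summary = summarize_scenario(json.loads(SAMPLE))
print(summary)
```

To run the same summary on a real scenario, `json.loads(SAMPLE)` would be replaced by loading that scenario's questions.json from disk.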

Evaluation Types

multi_choice
Multiple Choice

The agent selects from a discrete option set. The scorer extracts \bbox{A,B,...} from the response and computes IoU/F1 against the ground-truth answer key.
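
The scoring described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual scorer: the regular expression, the choice to use the last \bbox{...} occurrence, and the handling of empty sets are all assumptions, and the function names are hypothetical.

```python
import re


def extract_choices(response: str) -> set[str]:
    """Pull the option letters out of the last \\bbox{...} in a response.

    Assumption: letters are comma-separated and the final occurrence wins.
    """
    matches = re.findall(r"\\bbox\{([^}]*)\}", response)
    if not matches:
        return set()
    return {tok.strip().upper() for tok in matches[-1].split(",") if tok.strip()}


def iou(pred: set[str], gold: set[str]) -> float:
    """Intersection over union of the predicted and gold option sets."""
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0


def f1(pred: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over option letters."""
    if not pred and not gold:
        return 1.0
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {"A", "C"}  # the "answer" field from the sample round q1
pred = extract_choices(r"Reasoning ... Final answer: \bbox{A, B}")
print(sorted(pred), iou(pred, gold), f1(pred, gold))
```

With a prediction of {A, B} against a key of {A, C}, one of three distinct letters overlaps, so IoU is 1/3, while precision and recall are both 1/2, giving an F1 of 0.5.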

exec_check
Execution Check

The agent produces files or code, which are scored by running shell commands that verify file existence, content matching, and output correctness.
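
One way such a check could be run is sketched below, assuming a POSIX `sh` is available. The `${workspace}` placeholder is taken from the sample command, but how ClawArena actually substitutes it, the pass criterion (exit code 0), and the timeout are assumptions here. `run_exec_check` is a hypothetical helper, and the demo command is illustrative rather than the benchmark's real, truncated command.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path
from string import Template


def run_exec_check(command: str, workspace: str, timeout: float = 30.0) -> bool:
    """Substitute ${workspace}, run the command in a shell, pass on exit code 0."""
    # safe_substitute leaves any other $-expressions in the command untouched.
    resolved = Template(command).safe_substitute(workspace=shlex.quote(workspace))
    result = subprocess.run(["sh", "-c", resolved], capture_output=True, timeout=timeout)
    return result.returncode == 0


# Hypothetical check in the spirit of the sample: the file exists and mentions NPS.
demo_command = "test -f ${workspace}/report.md && grep -q 'NPS' ${workspace}/report.md"

with tempfile.TemporaryDirectory() as tmp:
    passed_before = run_exec_check(demo_command, tmp)  # report.md not written yet
    Path(tmp, "report.md").write_text("NPS trend by segment\n")
    passed_after = run_exec_check(demo_command, tmp)

print(passed_before, passed_after)
```

Quoting the workspace path with `shlex.quote` keeps the check working when the path contains spaces. Running checks from untrusted scenario files in a sandbox would be a sensible precaution.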