Dataset

All ClawArena data is publicly available under the MIT license. The benchmark includes 12 scenarios, 337 evaluation rounds, and 45 dynamic updates across diverse professional contexts.

Full Dataset

Complete ClawArena benchmark with 12 scenarios, 337 evaluation rounds, workspace files, session histories, and 45 dynamic updates.

data/clawarena/eval/

data/clawarena/openclaw/

data/clawarena/tests.json

Size: ~42 MB

GitHub HuggingFace

Spec Templates

The 6-layer specification system (L0–L4 + GUIDE) used to author all 12 scenarios. Extend it for new domains or personas.

docs/data-spec/

Size: ~1 MB

GitHub

Plugin System

Add new agent frameworks through the plugin adapter interface, without modifying core code.

docs/plugin.md

src/clawarena/plugins/

Documentation

Data Format

Each scenario has a questions.json file that lists its evaluation rounds:


// questions.json — per scenario
{
  "id": "hil_c6",
  "desc": "NexaFlow retention/morale crisis ...",
  "rounds": [
    {
      "id": "q1",
      "type": "multi_choice",
      "question": "Based on the workspace documents ...",
      "eval": {
        "options": { "A": "...", "B": "...", "C": "...", "D": "..." },
        "answer": ["A", "C"]
      },
      "update_ids": []
    },
    {
      "id": "q7",
      "type": "exec_check",
      "question": "Generate a product prioritization matrix ...",
      "exec_check": {
        "command": "test -f ${workspace}/report.md && grep -q 'NPS' ..."
      },
      "update_ids": ["upd2_sessions", "upd2_workspace"]
    }
  ]
}
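
As a rough illustration of how this structure might be consumed, the sketch below parses a scenario and summarizes its rounds. It relies only on the fields shown above (id, rounds, type, update_ids). The helper name summarize_scenario and the inline sample are hypothetical, and the sketch assumes the file on disk is strict JSON, without the leading comment line used in the listing.

```python
import json
from collections import Counter

# Inline sample mirroring the fields shown in the listing above.
# A real file would be read with json.load() instead.
SAMPLE = """
{
  "id": "hil_c6",
  "desc": "example scenario",
  "rounds": [
    {"id": "q1", "type": "multi_choice", "question": "...",
     "eval": {"options": {"A": "...", "B": "..."}, "answer": ["A"]},
     "update_ids": []},
    {"id": "q7", "type": "exec_check", "question": "...",
     "exec_check": {"command": "true"},
     "update_ids": ["upd2_sessions", "upd2_workspace"]}
  ]
}
"""


def summarize_scenario(scenario: dict) -> dict:
    """Hypothetical helper: count rounds per type and list rounds with updates."""
    rounds = scenario.get("rounds", [])
    return {
        "id": scenario["id"],
        "num_rounds": len(rounds),
        "by_type": dict(Counter(r["type"] for r in rounds)),
        # Rounds whose update_ids list is non-empty are tied to dynamic updates.
        "rounds_with_updates": [r["id"] for r in rounds if r.get("update_ids")],
    }


summary = summarize_scenario(json.loads(SAMPLE))
print(summary)
```

Grouping by the type field is convenient because each type carries its scoring data under a different key (eval for multi_choice, exec_check for exec_check in the sample above).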

Evaluation Types

multi_choice

Multiple Choice

The agent selects from a discrete option set. The scorer extracts the \bbox{A,B,...} answer from the response and computes IoU/F1 between the selected options and the ground-truth answer key.
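
The scoring described above can be sketched as follows. This is a minimal sketch, not the benchmark's actual scorer: the regular expression, the choice to use the last \bbox{...} occurrence, and the function names extract_choices, iou, and f1 are all assumptions made for illustration.

```python
import re

# Assumed answer format: a literal \bbox{...} containing comma-separated labels.
BBOX_RE = re.compile(r"\\bbox\{([^}]*)\}")


def extract_choices(response: str) -> set[str]:
    """Return the option labels from the last \\bbox{...} in the response."""
    matches = BBOX_RE.findall(response)
    if not matches:
        return set()
    # Assumption: if several boxes appear, the last one is the final answer.
    return {tok.strip().upper() for tok in matches[-1].split(",") if tok.strip()}


def iou(pred: set[str], gold: set[str]) -> float:
    """Intersection over union (Jaccard index) of two label sets."""
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0


def f1(pred: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over the label sets."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {"A", "C"}
pred = extract_choices(r"After reviewing the files, my answer is \bbox{A,B}")
print(pred, iou(pred, gold), f1(pred, gold))
```

Both metrics give partial credit for a partially correct selection, which a plain exact-match comparison would not.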

exec_check

Execution Check

The agent produces files or code. The result is scored by running shell commands that check file existence, content matches, and output correctness.
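
One way such a check could be run is sketched below. It assumes that ${workspace} is replaced with the scenario's workspace directory before execution and that an exit code of 0 counts as a pass; neither detail is confirmed by the text above. The function name run_exec_check and the example command are hypothetical, and the sketch needs a POSIX shell.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path
from string import Template


def run_exec_check(command: str, workspace: str, timeout: float = 30.0) -> bool:
    """Substitute ${workspace}, run the command in a shell, pass on exit code 0."""
    # safe_substitute leaves any other $-expressions in the command untouched.
    resolved = Template(command).safe_substitute(workspace=shlex.quote(workspace))
    try:
        result = subprocess.run(
            resolved, shell=True, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False  # Assumption: a check that hangs is treated as failed.
    return result.returncode == 0


# Hypothetical command for illustration; it is not the one from the dataset.
CHECK = "test -f ${workspace}/report.md && grep -q 'NPS' ${workspace}/report.md"

with tempfile.TemporaryDirectory() as ws:
    failed = run_exec_check(CHECK, ws)  # report.md does not exist yet
    Path(ws, "report.md").write_text("NPS trend: improving\n")
    passed = run_exec_check(CHECK, ws)  # file exists and mentions NPS

print(failed, passed)
```

Because the commands are executed through a shell, a harness along these lines would normally run only trusted checks, ideally inside an isolated workspace.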