About ClawArena

ClawArena is a rigorous evaluation framework for benchmarking AI agents in evolving information environments. Unlike benchmarks that test static knowledge, ClawArena places agents in scenarios where information changes, which demands multi-source conflict reasoning, dynamic belief revision, and implicit personalization. Its 12 multi-turn scenarios span diverse professional contexts.

The benchmark comprises 12 scenarios, 337 evaluation rounds, and 45 dynamic updates, and is evaluated across 5 frameworks and 18 language models. A hierarchical 6-layer specification system and the Composite Reliability Score (CRS) ensure consistent, auditable evaluation across all configurations.

Comparison with Other Benchmarks

The comparison uses four design axes that characterize evolving information environments: multi-source conflict (MSC), dynamic updates (DU), multi-turn user engagement (MU), and implicit preferences (Pref).

Benchmark  MSC  DU  MU  Pref.
ClawBench
Claw-Eval
Claw-Eval-Live
ClawMark
ClawsBench
MetaClawBench🟡🟡
PinchBench
QwenClawBench🟡
WildClawBench🟡
ZClawBench
ClawArena (Ours)

3 Evaluation Dimensions

Each scenario is scored across three orthogonal capability dimensions (Section 2 of the paper).

MS: Multi-Source Conflict Reasoning

Evaluates the agent's ability to reconcile contradictory information from multiple sources. Covers four conflict types: C1 (factual), C2 (authority), C3 (non-conflict), and C4 (temporal/process).

DU: Dynamic Belief Revision

Measures how well agents update their beliefs when workspace files and session histories are modified via dynamic update packages. Difficulty is governed by update design strategy, not volume.

P: Implicit Personalization

Tests whether agents can infer unstated user preferences from behavioral patterns in session histories — explicit preferences alone are insufficient for top performance.

14-Category Question Taxonomy (Table 1)

Seven dimension combinations × two question types (Recall and Reasoning) = 14 fine-grained evaluation categories. This structure allows aggregate scores to be decomposed into qualitatively distinct failure modes.

Dimension Combination  Recall        Reasoning
MS                     MS-Recall     MS-Reasoning
DU                     DU-Recall     DU-Reasoning
P                      P-Recall      P-Reasoning
MS×DU                  MS×DU-Recall  MS×DU-Reasoning
MS×P                   MS×P-Recall   MS×P-Reasoning
DU×P                   DU×P-Recall   DU×P-Reasoning
All                    All-Recall    All-Reasoning
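The seven dimension combinations are the non-empty subsets of {MS, DU, P}, so the 14 category labels can be enumerated mechanically. The sketch below is illustrative only: the names `CATEGORIES` and `per_category_accuracy` are hypothetical and not part of any ClawArena tooling, and plain accuracy is an assumed stand-in for whatever per-category metric the paper reports.

```python
from itertools import product

# Labels copied from the taxonomy table; the variable names are hypothetical.
DIMENSION_COMBINATIONS = ["MS", "DU", "P", "MS×DU", "MS×P", "DU×P", "All"]
QUESTION_TYPES = ["Recall", "Reasoning"]

# 7 combinations x 2 question types = 14 categories.
CATEGORIES = [
    f"{combo}-{qtype}"
    for combo, qtype in product(DIMENSION_COMBINATIONS, QUESTION_TYPES)
]


def per_category_accuracy(results):
    """Decompose results into per-category accuracy.

    `results` is an iterable of (category, is_correct) pairs.
    Accuracy is an assumed metric used here for illustration.
    """
    counts = {}
    for category, is_correct in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        correct, total = counts.get(category, (0, 0))
        counts[category] = (correct + int(is_correct), total + 1)
    return {c: correct / total for c, (correct, total) in counts.items()}


if __name__ == "__main__":
    print(len(CATEGORIES))
    print(per_category_accuracy([("MS-Recall", True), ("MS-Recall", False)]))
```

Grouping results this way is what lets a single aggregate number be read as a profile of distinct failure modes, for example strong MS-Recall but weak MS×DU-Reasoning.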

Four Conflict Types

The MS dimension is further subdivided by conflict type to capture qualitatively distinct reasoning challenges.

C1: Factual Conflict

Two or more sources assert contradictory facts about the same entity or event.

C2: Authority Conflict

Sources with different authority levels (e.g., official policy vs. user preference) disagree.

C3: Non-Conflict

Sources are consistent; tests whether agents correctly avoid hallucinating conflicts.

C4: Temporal / Process Conflict

Information becomes outdated or process steps conflict across time-stamped sources.

6-Layer Specification System (Section 2.3)

ClawArena uses a hierarchical specification system to ensure reproducible and fair evaluation. Each layer narrows scope from a hidden ground-truth model down to dynamic update packages.

L0: Narrative Bible (hidden)
Hidden ground-truth world model defining all canonical facts for the scenario. Not visible to the agent.
L1: Workspace Files
Structured files accessible to the agent, including documents, calendars, databases, and reference materials.
L2: Session Histories
Prior conversation logs that encode implicit user preferences and behavioral patterns.
L3: Evaluation Questions
Per-scenario questions across all 14 taxonomy categories, with reference answers.
L4: Update Packages
Dynamic updates that modify L1/L2 mid-evaluation to test belief revision capabilities.
Guide: Evaluator Guide
Human and LLM judge guidelines for consistent, calibrated scoring across all configurations.
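To make the relationship between the agent-visible layers and update packages concrete, here is a minimal sketch of an update being applied to the workspace and session histories. Everything in it is an assumption made for illustration: the class names `AgentView` and `UpdatePackage`, their fields, and the choice to model session changes as appended logs are hypothetical and do not describe ClawArena's actual file formats.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AgentView:
    """What the agent can see: workspace files and session histories.

    The hidden narrative bible is deliberately absent from this view.
    """
    workspace: Dict[str, str] = field(default_factory=dict)   # path -> content
    sessions: List[str] = field(default_factory=list)         # conversation logs


@dataclass
class UpdatePackage:
    """Hypothetical shape of a dynamic update delivered mid-evaluation."""
    after_round: int
    # path -> new content; None means the file is removed (an assumption)
    workspace_changes: Dict[str, Optional[str]] = field(default_factory=dict)
    new_sessions: List[str] = field(default_factory=list)


def apply_update(view: AgentView, update: UpdatePackage) -> AgentView:
    """Return a new view with the update applied; the input is left untouched."""
    workspace = dict(view.workspace)
    for path, content in update.workspace_changes.items():
        if content is None:
            workspace.pop(path, None)
        else:
            workspace[path] = content
    return AgentView(workspace, view.sessions + update.new_sessions)


if __name__ == "__main__":
    before = AgentView({"policy.md": "v1"}, ["session-1"])
    update = UpdatePackage(3, {"policy.md": "v2", "memo.md": "new"}, ["session-2"])
    after = apply_update(before, update)
    print(after.workspace, after.sessions)
```

Returning a fresh view rather than mutating in place keeps the pre-update state available, which is convenient when comparing an agent's answers before and after a revision.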

Construction Pipeline

ClawArena scenarios are constructed via a five-stage pipeline combining expert authorship, empirical grounding (200+ published distributions), and automated validation.

01. Seed Construction

Domain experts author scenario seeds with cross-validation until all four contradiction types are present and every answer requires multiple sources.

02. Meta-Spec Induction

Structural invariants distilled from seeds: narrative patterns, contradiction-type ratios, bias-phrase rules, and update-question binding constraints.

03. Batch Generation

LLM-assisted generation grounded in 200+ published empirical distributions (email volume, commit patterns, messaging activity, social network structure).

04. Validation

Three-level checks: structural (schema, files), semantic (contradiction coverage, answer keys), and control (bias-phrase placement, non-conflict consistency).

05. Refinement

Scenarios failing validation are removed; answer keys rewritten for clarity; MC/EC ratio rebalanced. The released 12 scenarios satisfy all design constraints.
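The validation and refinement stages can be pictured as three groups of checks followed by a filter. The sketch below is a rough illustration under stated assumptions, not the authors' pipeline: the scenario dictionary layout, the field names (`workspace_files`, `questions`, `conflict_type`, `sources`, `source_claims`, `answer`), and all function names are hypothetical. The control level covers only non-conflict consistency; bias-phrase placement is not modeled because the page does not define it.

```python
REQUIRED_CONFLICT_TYPES = {"C1", "C2", "C3", "C4"}


def structural_errors(scenario):
    """Structural level: required top-level fields exist and are non-empty."""
    return [
        f"structural: missing '{key}'"
        for key in ("workspace_files", "questions")
        if not scenario.get(key)
    ]


def semantic_errors(scenario):
    """Semantic level: conflict coverage, answer keys, multi-source answers."""
    questions = scenario.get("questions") or []
    errors = []
    covered = {q.get("conflict_type") for q in questions}
    missing = REQUIRED_CONFLICT_TYPES - covered
    if missing:
        errors.append(f"semantic: uncovered conflict types {sorted(missing)}")
    for i, q in enumerate(questions):
        if not q.get("answer"):
            errors.append(f"semantic: question {i} has no answer key")
        if len(q.get("sources", [])) < 2:
            errors.append(f"semantic: question {i} cites fewer than two sources")
    return errors


def control_errors(scenario):
    """Control level (non-conflict consistency only): C3 sources must agree."""
    errors = []
    for i, q in enumerate(scenario.get("questions") or []):
        claims = set(q.get("source_claims", []))
        if q.get("conflict_type") == "C3" and len(claims) > 1:
            errors.append(f"control: C3 question {i} has disagreeing sources")
    return errors


def validate(scenario):
    """Run all three levels and return every error found."""
    return (
        structural_errors(scenario)
        + semantic_errors(scenario)
        + control_errors(scenario)
    )


def refine(scenarios):
    """Keep only scenarios that pass every check (the removal step of stage 05)."""
    return [s for s in scenarios if not validate(s)]
```

Collecting all errors instead of stopping at the first one makes it easier to decide whether a failing scenario is worth repairing or should simply be dropped.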

12 Scenarios

The scenarios span retail analytics, finance, healthcare, information security, HR, education, research integrity, and more, totaling 337 rounds (95 MC + 242 EC) with 45 dynamic updates.

Scenario  Rounds  MC  EC  Context
hil_s1    24      8   16  Startup outage & engineering incident
hil_c7    28      8   20  Retail analytics
hil_d3    30      8   22  Finance
hil_e4    24      7   17  Healthcare
hil_g4    27      8   19  Information security
hil_g1    30      8   22  HR
hil_h3    27      8   19  Education
hil_j1    30      8   22  Research integrity
hil_f3    30      8   22  Professional services
hil_i2    30      8   22  Clinical/Medical
hil_g3    30      8   22  Enterprise
hil_f7    27      8   19  E-commerce
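As a quick consistency check, the per-scenario counts in the table can be summed and compared with the headline totals of 337 rounds, 95 MC, and 242 EC. The snippet below does only that arithmetic; the tuple layout and variable names are chosen for illustration.

```python
# (scenario, rounds, MC, EC) copied from the scenario table above.
SCENARIOS = [
    ("hil_s1", 24, 8, 16),
    ("hil_c7", 28, 8, 20),
    ("hil_d3", 30, 8, 22),
    ("hil_e4", 24, 7, 17),
    ("hil_g4", 27, 8, 19),
    ("hil_g1", 30, 8, 22),
    ("hil_h3", 27, 8, 19),
    ("hil_j1", 30, 8, 22),
    ("hil_f3", 30, 8, 22),
    ("hil_i2", 30, 8, 22),
    ("hil_g3", 30, 8, 22),
    ("hil_f7", 27, 8, 19),
]

total_rounds = sum(rounds for _, rounds, _, _ in SCENARIOS)
total_mc = sum(mc for _, _, mc, _ in SCENARIOS)
total_ec = sum(ec for _, _, _, ec in SCENARIOS)

# Every row should satisfy rounds = MC + EC.
rows_consistent = all(rounds == mc + ec for _, rounds, mc, ec in SCENARIOS)

print(len(SCENARIOS), total_rounds, total_mc, total_ec, rows_consistent)
# prints: 12 337 95 242 True
```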

Cross-Domain Data Samples

Each tile presents one scenario in a distinct professional context, highlighting the workspace, session sources, evaluation question, and an evidence chain.


Case Studies

Per-option diagnostics across MS, DU, P, and exec_check dimensions.

Case 1-2: Multi-Source Conflict Reasoning & Framework-Induced Divergence
Case 3-4: Self-Diagnostic Accuracy & Authority-Influenced Revision
Case 5-6: Preference Compliance & Compound Format Ceiling
Case 7-8: Update-Specific Failure & JSON Schema Adherence
Case 9-10: Compound Claims & Pipeline Authorship

Citation

If you use ClawArena in your research, please cite our paper:

BibTeX Citation
@article{ji2026clawarena,
  title={ClawArena: Benchmarking AI Agents in Evolving Information Environments},
  author={Ji, Haonian and Xiong, Kaiwen and Han, Siwei and Xia, Peng and Qiu, Shi and Zhou, Yiyang and Liu, Jiaqi and Li, Jinlong and Li, Bingzhou and Zheng, Zeyu and Xie, Cihang and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2604.04202},
  year={2026}
}