About ClawArena

ClawArena is a rigorous evaluation framework for benchmarking AI agents in evolving information environments. Unlike benchmarks that test static knowledge, ClawArena places agents in scenarios where information changes — requiring multi-source conflict reasoning, dynamic belief revision, and implicit personalization across 12 multi-turn scenarios spanning diverse professional contexts.

The benchmark comprises 12 scenarios, 337 evaluation rounds, 45 dynamic updates, evaluated across 5 frameworks and 20 language models. A hierarchical 6-layer specification system and the Composite Reliability Score (CRS) ensure consistent, auditable evaluation across all configurations.

Comparison with Other Benchmarks

Four design axes that drive evolving information environments: multi-source conflict (MSC), dynamic updates (DU), multi-turn user engagement (MU), and implicit preferences (Pref).

Benchmark	MSC	DU	MU	Pref.	Verification	Frmw.	Scale
ClawBench	❌	❌	❌	❌	rule+llm	8	283 / 144 sites
Claw-Eval	❌	❌	✅	❌	rule+llm	1	300 / 9 cats
Claw-Eval-Live	❌	❌	❌	❌	rule+llm	1	105 / 17 fam.
ClawMark	✅	✅	✅	❌	rule-based	1	100 / 13 scen.
ClawsBench	✅	❌	❌	❌	rule-based	4	44 / 5 svc.
MetaClawBench	🟡	✅	✅	🟡	rule-based	1	346 / 30 days
PinchBench	❌	❌	❌	❌	rule+llm	1	23 / 8 cats
QwenClawBench	❌	❌	❌	🟡	rule+llm	1	100 / 8 dom.
WildClawBench	✅	❌	❌	🟡	rule+llm	1	60 / 6 cats
ZClawBench	❌	❌	❌	❌	rule+llm	1	116 / 6 cats
ClawArena (Ours)	✅	✅	✅	✅	rule-based	5	337 / 12 scen.

3 Evaluation Dimensions

Each scenario is scored across three orthogonal capability dimensions (Section 2 of the paper).

Multi-Source Conflict Reasoning

Evaluates the agent's ability to reconcile contradictory information from multiple sources. Covers four conflict types: C1 (factual), C2 (authority), C3 (non-conflict), and C4 (temporal/process).

Dynamic Belief Revision

Measures how well agents update their beliefs when workspace files and session histories are modified via dynamic update packages. Difficulty is governed by update design strategy, not volume.

Implicit Personalization

Tests whether agents can infer unstated user preferences from behavioral patterns in session histories — explicit preferences alone are insufficient for top performance.

14-Category Question Taxonomy (Table 1)

Seven dimension combinations × two question types (Recall and Reasoning) = 14 fine-grained evaluation categories. This structure allows aggregate scores to be decomposed into qualitatively distinct failure modes.

Dimension CombinationRecallReasoning

MS✓ MS-Recall✓ MS-Reasoning

DU✓ DU-Recall✓ DU-Reasoning

P✓ P-Recall✓ P-Reasoning

MS×DU✓ MS×DU-Recall✓ MS×DU-Reasoning

MS×P✓ MS×P-Recall✓ MS×P-Reasoning

DU×P✓ DU×P-Recall✓ DU×P-Reasoning

All✓ All-Recall✓ All-Reasoning

Four Conflict Types

The MS dimension is further subdivided by conflict type to capture qualitatively distinct reasoning challenges.

C1Factual Conflict

Two or more sources assert contradictory facts about the same entity or event.

C2Authority Conflict

Sources with different authority levels (e.g., official policy vs. user preference) disagree.

C3Non-Conflict

Sources are consistent; tests whether agents correctly avoid hallucinating conflicts.

C4Temporal / Process Conflict

Information becomes outdated or process steps conflict across time-stamped sources.

6-Layer Specification System (Section 2.3)

ClawArena uses a hierarchical specification system to ensure reproducible and fair evaluation. Each layer narrows scope from a hidden ground-truth model down to dynamic update packages.

Narrative Bible (hidden)

Hidden ground-truth world model defining all canonical facts for the scenario. Not visible to the agent.

Workspace Files

Structured files accessible to the agent — documents, calendars, databases, and reference materials.

Session Histories

Prior conversation logs that encode implicit user preferences and behavioral patterns.

Evaluation Questions

Per-scenario questions across all 14 taxonomy categories, with reference answers.

Update Packages

Dynamic updates that modify L1/L2 mid-evaluation to test belief revision capabilities.

Guide

Evaluator Guide

Human and LLM judge guidelines for consistent, calibrated scoring across all configurations.

Construction Pipeline

ClawArena scenarios are constructed via a five-stage pipeline combining expert authorship, empirical grounding (200+ published distributions), and automated validation.

Seed Construction

Domain experts author scenario seeds with cross-validation until all four contradiction types are present and every answer requires multiple sources.

Meta-Spec Induction

Structural invariants distilled from seeds: narrative patterns, contradiction-type ratios, bias-phrase rules, and update-question binding constraints.

Batch Generation

LLM-assisted generation grounded in 200+ published empirical distributions (email volume, commit patterns, messaging activity, social network structure).

Validation

Three-level checks: structural (schema, files), semantic (contradiction coverage, answer keys), and control (bias-phrase placement, non-conflict consistency).

Refinement

Scenarios failing validation are removed; answer keys rewritten for clarity; MC/EC ratio rebalanced. The released 12 scenarios satisfy all design constraints.

12 Scenarios

Spanning retail analytics, finance, healthcare, information security, HR, education, research integrity, and more. 337 total rounds (95 MC + 242 EC) with 45 dynamic updates.

ScenarioRoundsMCECContext

hil_s124816Startup outage & engineering incident

hil_c728820Retail analytics

hil_d330822Finance

hil_e424717Healthcare

hil_g427819Information security

hil_g130822HR

hil_h327819Education

hil_j130822Research integrity

hil_f330822Professional services

hil_i230822Clinical/Medical

hil_g330822Enterprise

hil_f727819E-commerce

Cross-Domain Data Samples

Each tile presents one scenario in a distinct professional context, highlighting the workspace, session sources, evaluation question, and an evidence chain.

Case Studies

Per-option diagnostics across MS, DU, P, and exec_check dimensions. Click to expand.

Case 1-2: Multi-Source Conflict Reasoning & Framework-Induced Divergence

Case 3-4: Self-Diagnostic Accuracy & Authority-Influenced Revision

Case 5-6: Preference Compliance & Compound Format Ceiling

Case 7-8: Update-Specific Failure & JSON Schema Adherence

Case 9-10: Compound Claims & Pipeline Authorship

Citation

If you use ClawArena in your research, please cite our paper:

BibTeX Citation

@article{ji2026clawarena,
  title={ClawArena: Benchmarking AI Agents in Evolving Information Environments},
  author={Ji, Haonian and Xiong, Kaiwen and Han, Siwei and Xia, Peng and Qiu, Shi and Zhou, Yiyang and Liu, Jiaqi and Li, Jinlong and Li, Bingzhou and Zheng, Zeyu and Xie, Cihang and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2604.04202},
  year={2026}
}