About ClawArena

ClawArena is a rigorous evaluation framework for benchmarking AI agents in evolving information environments. Unlike benchmarks that test static knowledge, ClawArena places agents in scenarios where information changes — requiring multi-source conflict reasoning, dynamic belief revision, and implicit personalization across 64 professional scenarios in 8 domains.

The benchmark comprises 64 scenarios, 1,879 evaluation rounds, 365 dynamic updates, and a hierarchical 6-layer specification system (L0 Narrative Bible → L1–L4 → Guide) that ensures consistent, auditable evaluation across all configurations.

3 Evaluation Dimensions

Each scenario is scored across three orthogonal capability dimensions (Section 2 of the paper).

MS · Multi-Source Conflict Reasoning
Evaluates the agent's ability to reconcile contradictory information from multiple sources. Covers four conflict types: C1 (factual), C2 (authority), C3 (non-conflict), and C4 (temporal/process).

DU · Dynamic Belief Revision
Measures how well agents update their beliefs when workspace files and session histories are modified via dynamic update packages. Difficulty is governed by update design strategy, not update volume.

P · Implicit Personalization
Tests whether agents can infer unstated user preferences from behavioral patterns in session histories — explicit preferences alone are insufficient for top performance.

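To make the DU dimension concrete, here is a minimal sketch of how a dynamic update package might modify an agent's workspace and session history mid-evaluation. All function and field names here are hypothetical illustrations, not ClawArena's actual API or file format.

```python
# Hypothetical sketch: applying a dynamic update package mid-evaluation.
# Names (apply_update_package, file_updates, new_turns) are illustrative only.

def apply_update_package(workspace: dict, sessions: list, package: dict):
    """Overwrite workspace files (L1) and append session turns (L2)."""
    for path, new_content in package.get("file_updates", {}).items():
        workspace[path] = new_content              # L1: workspace files change
    sessions.extend(package.get("new_turns", []))  # L2: session histories grow
    return workspace, sessions

workspace = {"calendar.md": "Meeting: Tue 10:00"}
sessions = [{"role": "user", "text": "Prefer morning meetings."}]
package = {
    # The stale Tuesday entry is superseded; the agent must revise its belief.
    "file_updates": {"calendar.md": "Meeting: Wed 14:00"},
    "new_turns": [{"role": "user", "text": "Wednesday works better now."}],
}
workspace, sessions = apply_update_package(workspace, sessions, package)
print(workspace["calendar.md"])  # the updated Wednesday entry
```

The point of the sketch is that difficulty comes from *what* the update contradicts, not from how many files it touches.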
14-Category Question Taxonomy (Table 1)

Seven dimension combinations × two question types (Recall and Reasoning) = 14 fine-grained evaluation categories. This structure allows aggregate scores to be decomposed into qualitatively distinct failure modes.

Dimension Combination | Recall       | Reasoning
MS                    | MS-Recall    | MS-Reasoning
DU                    | DU-Recall    | DU-Reasoning
P                     | P-Recall     | P-Reasoning
MS×DU                 | MS×DU-Recall | MS×DU-Reasoning
MS×P                  | MS×P-Recall  | MS×P-Reasoning
DU×P                  | DU×P-Recall  | DU×P-Reasoning
All                   | All-Recall   | All-Reasoning
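The 7 × 2 structure of the taxonomy can be enumerated mechanically, which is handy when aggregating per-category scores:

```python
from itertools import product

# The seven dimension combinations and two question types from Table 1.
combinations = ["MS", "DU", "P", "MS×DU", "MS×P", "DU×P", "All"]
question_types = ["Recall", "Reasoning"]

# Cross product yields the 14 fine-grained evaluation categories.
categories = [f"{combo}-{qtype}" for combo, qtype in product(combinations, question_types)]
print(len(categories))  # 14
print(categories[:2])   # ['MS-Recall', 'MS-Reasoning']
```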

Four Conflict Types

The MS dimension is further subdivided by conflict type to capture qualitatively distinct reasoning challenges.

C1 · Factual Conflict
Two or more sources assert contradictory facts about the same entity or event.

C2 · Authority Conflict
Sources with different authority levels (e.g., official policy vs. user preference) disagree.

C3 · Non-Conflict
Sources are consistent; tests whether agents correctly avoid hallucinating conflicts.

C4 · Temporal / Process Conflict
Information becomes outdated or process steps conflict across time-stamped sources.

6-Layer Specification System (Section 2.3)

ClawArena uses a hierarchical specification system to ensure reproducible and fair evaluation. Each layer narrows scope from a hidden ground-truth model down to dynamic update packages.

L0 · Narrative Bible (hidden)
Hidden ground-truth world model defining all canonical facts for the scenario. Not visible to the agent.

L1 · Workspace Files
Structured files accessible to the agent — documents, calendars, databases, and reference materials.

L2 · Session Histories
Prior conversation logs that encode implicit user preferences and behavioral patterns.

L3 · Evaluation Questions
Per-scenario questions across all 14 taxonomy categories, with reference answers.

L4 · Update Packages
Dynamic updates that modify L1/L2 mid-evaluation to test belief revision capabilities.

Guide · Evaluator Guide
Human and LLM judge guidelines for consistent, calibrated scoring across all configurations.
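One way to picture the layer hierarchy is as a scenario manifest in which each layer records whether the agent can see it. The structure and field names below are a hypothetical illustration, not the benchmark's actual file format:

```python
# Hypothetical scenario manifest mirroring the 6-layer specification system.
# Field names are illustrative; ClawArena's real format may differ.
scenario = {
    "L0_narrative_bible":   {"visible_to_agent": False, "canonical_facts": {}},
    "L1_workspace_files":   {"visible_to_agent": True,  "files": ["calendar.md", "policy.pdf"]},
    "L2_session_histories": {"visible_to_agent": True,  "sessions": []},
    "L3_questions":         {"visible_to_agent": False, "items": []},
    "L4_update_packages":   {"visible_to_agent": False, "packages": []},
    "guide":                {"visible_to_agent": False, "purpose": "judge calibration"},
}

# Only L1 and L2 are exposed to the agent under evaluation;
# L0 ground truth, questions, updates, and the guide stay evaluator-side.
agent_view = {name: layer for name, layer in scenario.items()
              if layer["visible_to_agent"]}
print(sorted(agent_view))  # ['L1_workspace_files', 'L2_session_histories']
```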

Construction Pipeline (Section 2.4)

ClawArena scenarios are constructed via a four-stage pipeline that balances expert authorship with LLM-assisted generation.

1. Seed Construction
Domain experts author canonical scenario seeds with ground-truth world models (L0 Narrative Bible).

2. Meta-Spec Induction
Structured meta-specifications are induced from seeds to define the generation space for each scenario.

3. Batch Generation
LLM-assisted generation populates workspace files (L1), session histories (L2), questions (L3), and update packages (L4) at scale.

4. Validation
Automated consistency checks and human review ensure factual accuracy, conflict fidelity, and evaluation quality.

8 Evaluation Domains

Four English and four Chinese domains, each with a distinct professional persona and real-world context.

Domain | Setting          | Persona                  | Lang | Scenarios · Rounds
💼 C   | Tech/HR Startup  | Alex Rivera (CEO)        | EN   | 8 scenarios · 204 rounds
🏥 D   | Hospital Admin   | Dr. Kenji Tanaka         | EN   | 8 scenarios · 240 rounds
🌱 E   | Nonprofit/NGO    | Sarah Chen               | EN   | 8 scenarios · 240 rounds
🏠 F   | Family/Personal  | 赵磊 (Quant Trader)      | ZH   | 8 scenarios · 240 rounds
🏢 G   | Tech Corporate   | 陈静 (HR Manager)        | ZH   | 8 scenarios · 240 rounds
🎓 H   | Campus/Student   | 王明 (CS Undergrad)      | ZH   | 8 scenarios · 235 rounds
🩺 I   | Clinical/Medical | 林逸 (ER Attending)      | ZH   | 8 scenarios · 240 rounds
📱 J   | Community/MCN    | 周芳 (Food Blogger)      | ZH   | 8 scenarios · 240 rounds
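The per-domain counts above are consistent with the benchmark totals stated at the top of this page, which a quick sum confirms:

```python
# Evaluation rounds per domain, as listed above (Domains C through J).
rounds = {"C": 204, "D": 240, "E": 240, "F": 240,
          "G": 240, "H": 235, "I": 240, "J": 240}

print(sum(rounds.values()))  # 1879, the total evaluation rounds
print(len(rounds) * 8)       # 64 scenarios (8 domains x 8 scenarios each)
```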

Citation

If you use ClawArena in your research, please cite our paper:

BibTeX Citation
@article{clawarena2026,
  title={ClawArena: Benchmarking AI Agents in Evolving Information Environments},
  author={Anonymous},
  year={2026},
  note={Preprint}
}