About ClawArena
ClawArena is a rigorous evaluation framework for benchmarking AI agents in evolving information environments. Unlike benchmarks that test static knowledge, ClawArena places agents in scenarios where information changes — requiring multi-source conflict reasoning, dynamic belief revision, and implicit personalization across 64 professional scenarios in 8 domains.
The benchmark comprises 64 scenarios, 1,879 evaluation rounds, 365 dynamic updates, and a hierarchical 6-layer specification system (L0 Narrative Bible → L1–L4 → Guide) that ensures consistent, auditable evaluation across all configurations.
3 Evaluation Dimensions
Each scenario is scored across three orthogonal capability dimensions (Section 2 of the paper).
Multi-source conflict reasoning: Evaluates the agent's ability to reconcile contradictory information from multiple sources. Covers four conflict types: C1 (factual), C2 (authority), C3 (non-conflict), and C4 (temporal/process).
Dynamic belief revision: Measures how well agents update their beliefs when workspace files and session histories are modified via dynamic update packages. Difficulty is governed by update design strategy, not volume.
Implicit personalization: Tests whether agents can infer unstated user preferences from behavioral patterns in session histories; explicit preferences alone are insufficient for top performance.
14-Category Question Taxonomy (Table 1)
Seven dimension combinations × two question types (Recall and Reasoning) = 14 fine-grained evaluation categories. This structure allows aggregate scores to be decomposed into qualitatively distinct failure modes.
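The 7 × 2 decomposition can be sketched in a few lines. This is a minimal illustration, assuming the seven combinations are the non-empty subsets of the three dimensions (2³ − 1 = 7); the shorthand labels MS, BR, and IP are hypothetical, and Table 1 of the paper defines the authoritative category names.

```python
from itertools import combinations

# Hypothetical shorthand for the three dimensions and two question types;
# Table 1 of the paper defines the authoritative labels.
DIMENSIONS = ["MS", "BR", "IP"]
QUESTION_TYPES = ["Recall", "Reasoning"]

# Assumption: the seven dimension combinations are the non-empty subsets
# of the three dimensions (3 singletons + 3 pairs + 1 triple = 7).
dim_combos = [
    "+".join(subset)
    for r in range(1, len(DIMENSIONS) + 1)
    for subset in combinations(DIMENSIONS, r)
]
categories = [f"{combo}/{qtype}" for combo in dim_combos for qtype in QUESTION_TYPES]

print(len(dim_combos), len(categories))  # → 7 14
```

Because the categories factor cleanly into (dimension combination, question type) pairs, an aggregate score can be averaged over either axis to isolate a failure mode.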
Four Conflict Types
The MS dimension is further subdivided by conflict type to capture qualitatively distinct reasoning challenges.
C1 (factual): Two or more sources assert contradictory facts about the same entity or event.
C2 (authority): Sources with different authority levels (e.g., official policy vs. user preference) disagree.
C3 (non-conflict): Sources are consistent; tests whether agents correctly avoid hallucinating conflicts.
C4 (temporal/process): Information becomes outdated, or process steps conflict across time-stamped sources.
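The four conflict types above can be encoded as a small enum for scoring or analysis code. This is an illustrative sketch only: the class and member names are hypothetical, and the values paraphrase the descriptions in the text.

```python
from enum import Enum

# Illustrative encoding of the four MS conflict types. Names are
# hypothetical; descriptions paraphrase the benchmark's definitions.
class ConflictType(Enum):
    C1_FACTUAL = "contradictory facts about the same entity or event"
    C2_AUTHORITY = "sources at different authority levels disagree"
    C3_NON_CONFLICT = "sources are consistent; no conflict should be reported"
    C4_TEMPORAL = "outdated information or process steps conflicting over time"

print([t.name for t in ConflictType])
```

Note that C3 is deliberately a negative case: a scorer keyed on this enum would penalize an agent for reporting any conflict at all.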
6-Layer Specification System (Section 2.3)
ClawArena uses a hierarchical specification system to ensure reproducible and fair evaluation. Each layer narrows scope from a hidden ground-truth model down to dynamic update packages.
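The layer ordering can be written out as a simple table. The L1–L4 contents follow the construction-pipeline description later in this page; the Guide layer's content is not detailed here, so its description below is an assumption.

```python
# Sketch of the six-layer hierarchy, broadest to narrowest scope.
# L1-L4 descriptions come from the construction pipeline; the Guide
# layer's description is an assumption.
SPEC_LAYERS = [
    ("L0", "Narrative Bible: hidden ground-truth world model"),
    ("L1", "workspace files"),
    ("L2", "session histories"),
    ("L3", "questions"),
    ("L4", "dynamic update packages"),
    ("Guide", "scenario-level evaluation guide (assumed content)"),
]

for layer, description in SPEC_LAYERS:
    print(f"{layer}: {description}")
```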
Construction Pipeline (Section 2.4)
ClawArena scenarios are constructed via a four-stage pipeline that balances expert authorship with LLM-assisted generation.
Stage 1: Domain experts author canonical scenario seeds with ground-truth world models (L0 Narrative Bible).
Stage 2: Structured meta-specifications are induced from seeds to define the generation space for each scenario.
Stage 3: LLM-assisted generation populates workspace files (L1), session histories (L2), questions (L3), and update packages (L4) at scale.
Stage 4: Automated consistency checks and human review ensure factual accuracy, conflict fidelity, and evaluation quality.
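The four stages compose into a linear data flow, which can be sketched as function signatures. Every name and return shape here is hypothetical; Section 2.4 of the paper describes the actual process.

```python
# Function-signature sketch of the four-stage pipeline. All names, payloads,
# and the example domain are hypothetical placeholders.
def author_seed(domain: str) -> dict:
    """Stage 1: expert-authored scenario seed with an L0 Narrative Bible."""
    return {"domain": domain, "l0_bible": "..."}

def induce_meta_spec(seed: dict) -> dict:
    """Stage 2: structured meta-specification defining the generation space."""
    return {"seed": seed, "generation_space": "..."}

def generate_artifacts(meta_spec: dict) -> dict:
    """Stage 3: LLM-assisted generation of L1-L4 artifacts."""
    return {"L1": [], "L2": [], "L3": [], "L4": [], "meta": meta_spec}

def validate(artifacts: dict) -> dict:
    """Stage 4: automated consistency checks plus human review."""
    return artifacts

# One scenario flows through all four stages in order.
scenario = validate(generate_artifacts(induce_meta_spec(author_seed("example-domain"))))
```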
8 Evaluation Domains
Four English and four Chinese domains, each with a distinct professional persona and real-world context.
Citation
If you use ClawArena in your research, please cite our paper:
@article{clawarena2026,
  title={ClawArena: Benchmarking AI Agents in Evolving Information Environments},
  author={Anonymous},
  year={2026},
  note={Preprint}
}