Benchmark for AI Agents in Evolving Environments

ClawArena

Benchmarking AI Agents in Evolving Information Environments

A rigorous evaluation framework for AI agents — measuring reasoning under multi-source conflicts, dynamic belief revision, and implicit personalization across 12 multi-turn scenarios spanning diverse professional contexts.

12 Scenarios
337 Rounds
18 Models
5 Frameworks
45 Updates

Overview

An evolving information environment with conflicting and progressively updated evidence, structured along three coupled evaluation dimensions.

Figure: ClawArena overview

Leaderboard

All configurations ranked by CRS (Composite Reliability Score) — 18 models × 5 frameworks


What we found

01

Model capability dominates framework design

Model choice accounts for a 29-point CRS range across 18 models, while framework design accounts for up to 24 points across 4 frameworks. Model capability still dominates, but framework choice is more consequential than previously reported.

02

MetaClaw improves robustness without degrading accuracy

Skill-based self-evolution consistently improves CRS by 0.33–1.19 across all three tested model families. The mechanism is behavioral consistency (SC and FD both rise), not raw accuracy.

03

Belief revision difficulty is governed by update design

Updates that force re-interpretation of earlier claims cause clustered failures, while updates that merely extend prior evidence are handled reliably. Update specificity, not volume, determines difficulty.

Three coupled challenges

Real information environments are multi-source, dynamic, and personalized. ClawArena evaluates all three jointly.

MS

Multi-Source Conflict

Evidence is scattered across heterogeneous sources that may contradict each other. The agent must judge source reliability across four canonical conflict types.

DU

Dynamic Belief Revision

New evidence can invalidate previously correct conclusions. 45 staged updates across 12 scenarios test whether agents revise rather than simply accumulate.
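The difference between revising and accumulating can be made concrete with a minimal sketch. Everything here is hypothetical and not part of ClawArena: the `Update` record, the `accumulate` and `revise` functions, and the example claim are illustrations only. The rule "the latest update wins" is a deliberate simplification; a real agent would also have to weigh source reliability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    claim: str   # hypothetical key for what the evidence is about
    value: str   # what this piece of evidence asserts
    stage: int   # order in which the update is revealed


def accumulate(updates):
    """Keep the first value seen per claim; later evidence never overrides it."""
    beliefs = {}
    for u in sorted(updates, key=lambda u: u.stage):
        beliefs.setdefault(u.claim, u.value)
    return beliefs


def revise(updates):
    """Let later evidence replace earlier conclusions (simplified: latest wins)."""
    beliefs = {}
    for u in sorted(updates, key=lambda u: u.stage):
        beliefs[u.claim] = u.value
    return beliefs


# Illustrative data only, not taken from any benchmark scenario.
updates = [
    Update(claim="deadline", value="Friday", stage=1),
    Update(claim="deadline", value="Wednesday", stage=2),
]

print(accumulate(updates))  # the stale conclusion survives
print(revise(updates))      # the conclusion follows the newer evidence
```

An agent that behaves like `accumulate` still answers with the stage-1 value after the stage-2 update arrives, which is the failure mode the staged updates are meant to expose.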

P

Implicit Personalization

User preferences surface through corrections and behavioral patterns, not explicit instructions. A four-stage protocol ends in silent-exam rounds.

14-category taxonomy

7 dimension combinations × 2 types (Recall, Reasoning) prevent systems from scoring well by solving only one dimension.
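The count of 14 can be reproduced with a short sketch, assuming the 7 combinations are the non-empty subsets of the three dimensions (2^3 - 1 = 7). The `MS`, `DU`, and `P` abbreviations come from this page; the `+`-joined label format and the `taxonomy` function are hypothetical.

```python
from itertools import combinations

DIMENSIONS = ("MS", "DU", "P")            # abbreviations used on this page
QUESTION_TYPES = ("Recall", "Reasoning")  # the two types named above


def taxonomy():
    """Return every (dimension combination, question type) pair."""
    combos = [
        "+".join(subset)
        for size in range(1, len(DIMENSIONS) + 1)
        for subset in combinations(DIMENSIONS, size)
    ]
    return [(combo, qtype) for combo in combos for qtype in QUESTION_TYPES]


categories = taxonomy()
for combo, qtype in categories:
    print(f"{combo:8s} {qtype}")
print(len(categories))  # 7 combinations x 2 types
```

Because most categories involve two or three dimensions at once, a system that handles only one dimension can score on at most a small share of them.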

Executable checks

Shell-based verification of workspace file state. Agents must produce working artifacts, not just text answers.
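A check of this kind might look like the sketch below. It is an assumption-laden illustration, not ClawArena's actual harness: the `run_check` helper, the `report.csv` file, and the check command are all hypothetical. The idea is that a check is a shell command run inside the agent's workspace, and its exit status decides pass or fail. It assumes a POSIX shell with `test` and `grep` available.

```python
import subprocess
import tempfile
from pathlib import Path


def run_check(workspace, command, timeout=10.0):
    """Run a shell command inside the workspace; exit status 0 means pass."""
    result = subprocess.run(
        command,
        shell=True,          # the check is written as a shell one-liner
        cwd=workspace,       # evaluate file state relative to the workspace
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0


# Demo with a throwaway workspace standing in for an agent's output directory.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "report.csv").write_text("id,total\n1,42\n")

    # Hypothetical check: the file exists and contains the expected row.
    demo_pass = run_check(workspace, "test -f report.csv && grep -q '^1,42$' report.csv")
    # A check against a file the agent never produced should fail.
    demo_fail = run_check(workspace, "test -f missing.txt")

print(demo_pass, demo_fail)
```

Checking file state rather than the agent's reply means a fluent description of the artifact earns nothing unless the artifact itself exists and is correct.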

6-layer specifications

Hidden ground truth (L0) is never shown to agents. Observable layers are noisy, partial reflections of the same underlying reality.

Comparison with agent benchmarks

Four design axes for evolving information environments. ClawArena is the only benchmark satisfying all four simultaneously.

Benchmark    MSC    DU    MU    Pref.
ClawBench
Claw-Eval
Claw-Eval-Live
ClawMark
ClawsBench
MetaClawBench 🟡 🟡
PinchBench
QwenClawBench 🟡
WildClawBench 🟡
ZClawBench
ClawArena (Ours)
MSC: multi-source conflict · DU: dynamic updates · MU: multi-turn user · Pref: implicit preferences