leaderboard

Open source · MIT · TypeScript · 38 stars · 11 forks · 10 issues · 1 watcher · last commit 2 weeks ago

About leaderboard

Open leaderboard for browser agents

Platforms

Web, Self-hosted

Languages

TypeScript

AI Browser Agent Leaderboards

Open reference for evaluating AI browser agents, computer-use systems, and coding agents across the public benchmarks teams actually compare on.

Live site: leaderboard.steel.dev — full rankings, methodology notes, and per-result detail.

Maintained by Steel, the open-source browser API for AI agents.

Top results by benchmark

The tables below show the current top entries on each tracked benchmark. Each section links to the full leaderboard with sources, methodology, and additional context.

WebVoyager

Browser agents · Agent scope · 19 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | Alumnium (new) | Alumnium | 98.5% |
| 2 | Surfer 2 | H Company | 97.1% |
| 3 | Magnitude | Magnitude | 93.9% |
| 4 | Surfer-H + Holo1 | H Company | 92.2% |
| 5 | Browserable | Browserable | 90.4% |

See all 19 entries →


BrowseComp

Research/search · Mixed scope · 82 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | GPT-5.5 Pro | OpenAI | 90.1% |
| 2 | GPT-5.4 Pro | OpenAI | 89.3% |
| 3 | MiroThinker-H1 | MiroMind | 88.2% |
| 4 | Claude Mythos Preview | Anthropic | 86.9% |
| 5 | Kimi K2.6 | Moonshot AI | 86.3% |

See all 82 entries →


WebArena

Browser agents · Agent scope · 49 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | WebTactix (DeepSeek v3.2) | WebTactix | 74.3% |
| 2 | OpAgent | CodeFuse AI | 71.6% |
| 3 | ColorBrowserAgent | MadeAgents | 71.2% |
| 4 | Claude Code + GBOX MCP | GBOX AI | 68.0% |
| 5 | DeepSky Agent | DeepSky | 66.9% |

See all 49 entries →


SWE-bench Verified

Coding · Model scope · 16 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | Claude Mythos | Anthropic | 93.9% |
| 2 | Claude Opus 4.8 (new) | Anthropic | 88.6% |
| 3 | Claude Opus 4.7 | Anthropic | 87.6% |
| 4 | Claude Opus 4.5 | Anthropic | 80.9% |
| 5 | Claude Opus 4.6 | Anthropic | 80.8% |

See all 16 entries →


OSWorld

Computer use · Agent scope · 17 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | Claude Opus 4.8 (new) | Anthropic | 83.4% |
| 2 | Mythos Preview (new) | Anthropic | 79.6% |
| 3 | OSAgent | TheAGI Company | 76.26% |
| 4 | GPT-5.4 (new) | OpenAI | 75.0% |
| 5 | Claude Opus 4.6 | Anthropic | 72.7% |

See all 17 entries →


GAIA

Model evals / reasoning · Agent scope · 21 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | OPS-Agentic-Search (new) | Alibaba Cloud | 92.36% |
| 1 | openJiuwen-deepagent (new) | Suzhou AI Lab / Shuqian Tech | 92.36% |
| 3 | openJiuwen-deepagent (GPT5/Gemini) | openJiuwen | 91.69% |
| 4 | Lemon Agent | Lenovo CTO Org | 91.36% |
| 5 | JoinAI V2.2 | JoinAI-CMCC | 90.7% |

See all 21 entries →


ClawBench

Browser agents · Agent scope · 7 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | Claude Sonnet 4.6 | Anthropic | 33.3% |
| 2 | GLM-5 (new) | Z.ai | 24.2% |
| 3 | Gemini 3 Flash | Google | 19.0% |
| 4 | Claude Haiku 4.5 | Anthropic | 18.3% |
| 5 | GPT-5.4 | OpenAI | 6.5% |

See all 7 entries →


Online-Mind2Web

Browser agents · Agent scope · 22 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | Browser Use Cloud (bu-max) (new) | Browser-Use | 97.0% |
| 2 | GPT-5.4 Native Computer Use | OpenAI | 93.0% |
| 3 | ABP + Claude Opus 4.6 | theredsix | 90.53% |
| 4 | TinyFish | TinyFish AI | 90.0% |
| 5 | UI-TARS-2 | ByteDance / VLM-Research | 88.2% |

See all 22 entries →


τ-bench

Model evals / reasoning · Model scope · 12 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | Step-3.5-Flash | StepFun | 88.2% |
| 2 | GLM-4.7 | Z.ai | 87.4% |
| 3 | MiMo-V2-Flash | Xiaomi | 80.3% |
| 4 | GLM-4.7-Flash | Z.ai | 79.5% |
| 5 | MiniMax M2 | MiniMax | 77.2% |

See all 12 entries →


AgentBench

Model evals / reasoning · Model scope · 10 entries tracked

| Rank | System | Organization | Score |
| --- | --- | --- | --- |
| 1 | AgentRL w/ Qwen2.5-32B-Instruct | Tsinghua University | 70.4% |
| 2 | AgentRL w/ Qwen2.5-14B-Instruct | Tsinghua University | 67.7% |
| 3 | AgentRL w/ GLM-4-9B-0414 | Tsinghua University | 65.0% |
| 4 | AgentRL w/ Qwen2.5-7B-Instruct | Tsinghua University | 62.0% |
| 5 | AgentRL w/ Qwen2.5-3B-Instruct | Tsinghua University | 60.0% |

See all 10 entries →

How to read these tables

  • Within-benchmark only. Scores measure different things on different benchmarks; don't compare a WebVoyager number to a SWE-bench Verified number.
  • Source-linked. Each score links to the original report — paper, blog post, or official leaderboard. Treat anything self-reported with the appropriate dose of skepticism.
  • Scope matters. Agent pages reflect full system setups (model + tools + policy). Model pages emphasize base model capability under a stated harness. Mixed pages combine both — read the per-row notes on the site before drawing conclusions.
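
For anyone consuming these numbers in code, the "within-benchmark only" rule can be enforced mechanically. The sketch below is illustrative only: the `Result` type, its field names, and `compareWithinBenchmark` are hypothetical and are not taken from this repository's actual data model.

```typescript
// Hypothetical shapes for illustration; the repository's real schema may differ.
type Scope = "agent" | "model" | "mixed";

interface Result {
  benchmark: string;
  scope: Scope;
  system: string;
  score: number; // percentage, as shown in the tables
}

// Returns a positive number when `a` scores higher than `b`.
// Refuses to compare results that come from different benchmarks.
function compareWithinBenchmark(a: Result, b: Result): number {
  if (a.benchmark !== b.benchmark) {
    throw new Error(
      `Cannot compare ${a.benchmark} with ${b.benchmark}: scores measure different things`
    );
  }
  return a.score - b.score;
}

// Mixed-scope pages combine agents and models, so per-row notes should be read first.
function needsRowNotes(r: Result): boolean {
  return r.scope === "mixed";
}

const alumnium: Result = { benchmark: "WebVoyager", scope: "agent", system: "Alumnium", score: 98.5 };
const surfer2: Result = { benchmark: "WebVoyager", scope: "agent", system: "Surfer 2", score: 97.1 };
const mythos: Result = { benchmark: "SWE-bench Verified", scope: "model", system: "Claude Mythos", score: 93.9 };

console.log(compareWithinBenchmark(alumnium, surfer2) > 0); // same benchmark: comparison allowed
console.log(needsRowNotes(mythos)); // model scope, so no mixed-page caveat
// compareWithinBenchmark(alumnium, mythos) would throw: different benchmarks.
```

Throwing instead of returning a number for cross-benchmark pairs is a deliberate choice in this sketch: it makes an invalid comparison fail loudly rather than produce a plausible-looking result.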

Contributing

We welcome new entries, corrections, methodology notes, and new benchmark pages. See CONTRIBUTING.md for the evidence standard, JSON schema expectations, ranking rules, and new-leaderboard checklist.
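
The authoritative ranking rules live in CONTRIBUTING.md and are not reproduced here. As a rough illustration, the GAIA table above shows two entries sharing rank 1 followed by rank 3, which is consistent with standard competition ranking (tied scores share a rank and the next rank is skipped). A minimal sketch of that scheme, assuming it is what the project uses; `rankWithTies` and the types are hypothetical names:

```typescript
interface Scored {
  system: string;
  score: number; // percentage
}

interface Ranked extends Scored {
  rank: number;
}

// Standard competition ranking ("1, 1, 3"): equal scores share a rank,
// and the following rank reflects the number of entries ahead of it.
function rankWithTies(entries: Scored[]): Ranked[] {
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...entries].sort((a, b) => b.score - a.score);
  const ranked: Ranked[] = [];
  sorted.forEach((entry, i) => {
    const prev = ranked[i - 1];
    const rank = prev !== undefined && prev.score === entry.score ? prev.rank : i + 1;
    ranked.push({ ...entry, rank });
  });
  return ranked;
}

// Scores copied from the GAIA top five listed above.
const gaiaTop5: Scored[] = [
  { system: "Lemon Agent", score: 91.36 },
  { system: "OPS-Agentic-Search", score: 92.36 },
  { system: "JoinAI V2.2", score: 90.7 },
  { system: "openJiuwen-deepagent", score: 92.36 },
  { system: "openJiuwen-deepagent (GPT5/Gemini)", score: 91.69 },
];

console.log(rankWithTies(gaiaTop5).map((e) => e.rank).join(", ")); // 1, 1, 3, 4, 5
```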

At minimum, every submitted score needs a public source URL that directly supports the benchmark, system name, score, and setup notes. If you update leaderboard data, run npm run update-readme before opening a pull request.
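
Before opening a pull request, the minimum evidence requirement above can be checked locally. The sketch below is not the project's schema or tooling: the `Submission` field names and `validateSubmission` are assumptions made for illustration, and the real JSON schema is described in CONTRIBUTING.md.

```typescript
// Hypothetical entry shape; consult CONTRIBUTING.md for the actual JSON schema.
interface Submission {
  benchmark?: string;
  system?: string;
  score?: number;
  sourceUrl?: string;
  setupNotes?: string;
}

// Returns a list of problems; an empty list means the minimum fields are present.
function validateSubmission(s: Submission): string[] {
  const problems: string[] = [];
  if (!s.benchmark || s.benchmark.trim() === "") problems.push("benchmark is required");
  if (!s.system || s.system.trim() === "") problems.push("system name is required");
  if (typeof s.score !== "number" || !Number.isFinite(s.score)) {
    problems.push("score must be a finite number");
  }
  if (!s.sourceUrl || !/^https?:\/\/\S+$/.test(s.sourceUrl)) {
    problems.push("a public http(s) source URL is required");
  }
  if (!s.setupNotes || s.setupNotes.trim() === "") problems.push("setup notes are required");
  return problems;
}

// Example with placeholder values (not a real leaderboard entry).
const draft: Submission = {
  benchmark: "WebVoyager",
  system: "Example Agent",
  score: 90.0,
  sourceUrl: "https://example.com/report",
};

console.log(validateSubmission(draft)); // lists the missing setup notes
```

A check like this only confirms that the fields exist. Whether the linked source actually supports the claimed score still needs human review.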

License

MIT