AI Browser Agent Leaderboards
Open reference for evaluating AI browser agents, computer-use systems, and coding agents across the public benchmarks teams actually compare on.

Live site: leaderboard.steel.dev — full rankings, methodology notes, and per-result detail.
Maintained by Steel, the open-source browser API for AI agents.
Top results by benchmark
The tables below show the current top entries on each tracked benchmark. Each section links to the full leaderboard with sources, methodology, and additional context.
WebVoyager
Browser agents · Agent scope · 19 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Alumnium (new) | Alumnium | 98.5% |
| 2 | Surfer 2 | H Company | 97.1% |
| 3 | Magnitude | Magnitude | 93.9% |
| 4 | Surfer-H + Holo1 | H Company | 92.2% |
| 5 | Browserable | Browserable | 90.4% |
BrowseComp
Research/search · Mixed scope · 82 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | GPT-5.5 Pro | OpenAI | 90.1% |
| 2 | GPT-5.4 Pro | OpenAI | 89.3% |
| 3 | MiroThinker-H1 | MiroMind | 88.2% |
| 4 | Claude Mythos Preview | Anthropic | 86.9% |
| 5 | Kimi K2.6 | Moonshot AI | 86.3% |
WebArena
Browser agents · Agent scope · 49 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | WebTactix (DeepSeek v3.2) | WebTactix | 74.3% |
| 2 | OpAgent | CodeFuse AI | 71.6% |
| 3 | ColorBrowserAgent | MadeAgents | 71.2% |
| 4 | Claude Code + GBOX MCP | GBOX AI | 68.0% |
| 5 | DeepSky Agent | DeepSky | 66.9% |
SWE-bench Verified
Coding · Model scope · 16 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Claude Mythos | Anthropic | 93.9% |
| 2 | Claude Opus 4.8 (new) | Anthropic | 88.6% |
| 3 | Claude Opus 4.7 | Anthropic | 87.6% |
| 4 | Claude Opus 4.5 | Anthropic | 80.9% |
| 5 | Claude Opus 4.6 | Anthropic | 80.8% |
OSWorld
Computer use · Agent scope · 17 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Claude Opus 4.8 (new) | Anthropic | 83.4% |
| 2 | Mythos Preview (new) | Anthropic | 79.6% |
| 3 | OSAgent | TheAGI Company | 76.26% |
| 4 | GPT-5.4 (new) | OpenAI | 75.0% |
| 5 | Claude Opus 4.6 | Anthropic | 72.7% |
GAIA
Model evals / reasoning · Agent scope · 21 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | OPS-Agentic-Search (new) | Alibaba Cloud | 92.36% |
| 1 | openJiuwen-deepagent (new) | Suzhou AI Lab / Shuqian Tech | 92.36% |
| 3 | openJiuwen-deepagent (GPT5/Gemini) | openJiuwen | 91.69% |
| 4 | Lemon Agent | Lenovo CTO Org | 91.36% |
| 5 | JoinAI V2.2 | JoinAI-CMCC | 90.7% |
ClawBench
Browser agents · Agent scope · 7 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 33.3% |
| 2 | GLM-5 (new) | Z.ai | 24.2% |
| 3 | Gemini 3 Flash | | 19.0% |
| 4 | Claude Haiku 4.5 | Anthropic | 18.3% |
| 5 | GPT-5.4 | OpenAI | 6.5% |
Online-Mind2Web
Browser agents · Agent scope · 22 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Browser Use Cloud (bu-max) (new) | Browser-Use | 97.0% |
| 2 | GPT-5.4 Native Computer Use | OpenAI | 93.0% |
| 3 | ABP + Claude Opus 4.6 | theredsix | 90.53% |
| 4 | TinyFish | TinyFish AI | 90.0% |
| 5 | UI-TARS-2 | ByteDance / VLM-Research | 88.2% |
τ-bench
Model evals / reasoning · Model scope · 12 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Step-3.5-Flash | StepFun | 88.2% |
| 2 | GLM-4.7 | Z.ai | 87.4% |
| 3 | MiMo-V2-Flash | Xiaomi | 80.3% |
| 4 | GLM-4.7-Flash | Z.ai | 79.5% |
| 5 | MiniMax M2 | MiniMax | 77.2% |
AgentBench
Model evals / reasoning · Model scope · 10 entries tracked
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | AgentRL w/ Qwen2.5-32B-Instruct | Tsinghua University | 70.4% |
| 2 | AgentRL w/ Qwen2.5-14B-Instruct | Tsinghua University | 67.7% |
| 3 | AgentRL w/ GLM-4-9B-0414 | Tsinghua University | 65.0% |
| 4 | AgentRL w/ Qwen2.5-7B-Instruct | Tsinghua University | 62.0% |
| 5 | AgentRL w/ Qwen2.5-3B-Instruct | Tsinghua University | 60.0% |
How to read these tables
- Within-benchmark only. Scores measure different things on different benchmarks; don't compare a WebVoyager number to a SWE-bench Verified number.
- Source-linked. Each score links to the original report — paper, blog post, or official leaderboard. Treat anything self-reported with the appropriate dose of skepticism.
- Scope matters. Agent pages reflect full system setups (model + tools + policy). Model pages emphasize base model capability under a stated harness. Mixed pages combine both — read the per-row notes on the site before drawing conclusions.
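As a rough illustration of the "within-benchmark only" rule, the sketch below groups rows by benchmark and ranks each group independently. The `Entry` shape and the `rankWithinBenchmarks` helper are hypothetical names, not the repository's actual code. The tie handling (equal scores share a rank and the next rank is skipped, as the GAIA table's 1, 1, 3 appears to show) is an assumption; the authoritative ranking rules are in CONTRIBUTING.md.

```typescript
// Hypothetical row shape; the real data files may use different field names.
interface Entry {
  benchmark: string;
  system: string;
  score: number; // percentage, e.g. 92.36
}

interface RankedEntry extends Entry {
  rank: number;
}

// Rank entries inside each benchmark separately, highest score first.
// Assumption: tied scores share a rank and the next rank is skipped (1, 1, 3).
function rankWithinBenchmarks(entries: Entry[]): Record<string, RankedEntry[]> {
  // Group rows by benchmark so scores are never compared across benchmarks.
  const groups: Record<string, Entry[]> = {};
  for (const entry of entries) {
    if (!groups[entry.benchmark]) {
      groups[entry.benchmark] = [];
    }
    groups[entry.benchmark].push(entry);
  }

  const ranked: Record<string, RankedEntry[]> = {};
  for (const benchmark of Object.keys(groups)) {
    const sorted = groups[benchmark].slice().sort((a, b) => b.score - a.score);
    const rows: RankedEntry[] = [];
    sorted.forEach((entry, index) => {
      const previous = index > 0 ? rows[index - 1] : undefined;
      // Reuse the previous rank on a tie; otherwise use the 1-based position.
      const rank =
        previous !== undefined && previous.score === entry.score
          ? previous.rank
          : index + 1;
      rows.push({ ...entry, rank });
    });
    ranked[benchmark] = rows;
  }
  return ranked;
}

// Sample rows copied from the tables above.
const sample: Entry[] = [
  { benchmark: "GAIA", system: "OPS-Agentic-Search", score: 92.36 },
  { benchmark: "GAIA", system: "openJiuwen-deepagent", score: 92.36 },
  { benchmark: "GAIA", system: "openJiuwen-deepagent (GPT5/Gemini)", score: 91.69 },
  { benchmark: "WebVoyager", system: "Alumnium", score: 98.5 },
];

const rankedSample = rankWithinBenchmarks(sample);
console.log(rankedSample["GAIA"].map((row) => row.rank));
```

Because each benchmark is ranked in its own group, the WebVoyager row gets rank 1 within WebVoyager and is never ordered against the GAIA rows.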
Contributing
We welcome new entries, corrections, methodology notes, and new benchmark pages. See
CONTRIBUTING.md for the evidence standard, JSON schema expectations, ranking
rules, and new-leaderboard checklist.
At minimum, every submitted score needs a public source URL that directly supports the benchmark,
system name, score, and setup notes. If you update leaderboard data, run `npm run update-readme`
before opening a pull request.
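The minimum evidence requirement can be pictured as a pre-submission check along the following lines. This is only a sketch: the `Submission` fields and the `checkSubmission` helper are hypothetical, and the real JSON schema is whatever CONTRIBUTING.md specifies. The check only confirms that the fields are present and that the URL looks like a public http(s) link; whether the source actually supports the claimed score still needs human review.

```typescript
// Hypothetical submission shape; CONTRIBUTING.md defines the real schema.
interface Submission {
  benchmark?: string;
  system?: string;
  score?: number;
  sourceUrl?: string;
  setupNotes?: string;
}

// True when a string field is absent or contains only whitespace.
function isBlank(value: string | undefined): boolean {
  return value === undefined || value.trim() === "";
}

// Return a list of problems; an empty list means the minimum fields are present.
function checkSubmission(submission: Submission): string[] {
  const problems: string[] = [];
  if (isBlank(submission.benchmark)) problems.push("missing benchmark");
  if (isBlank(submission.system)) problems.push("missing system name");
  if (typeof submission.score !== "number" || isNaN(submission.score)) {
    problems.push("missing or non-numeric score");
  }
  // Shape check only: this does not fetch the URL or verify its contents.
  if (isBlank(submission.sourceUrl) || !/^https?:\/\/\S+$/.test(submission.sourceUrl as string)) {
    problems.push("source URL must be a public http(s) link");
  }
  if (isBlank(submission.setupNotes)) problems.push("missing setup notes");
  return problems;
}

// Illustrative values only, not a real leaderboard entry.
const draft: Submission = {
  benchmark: "WebArena",
  system: "Example Agent",
  score: 50.0,
};

console.log(checkSubmission(draft));
```

Returning a list of problems rather than a boolean lets a contributor see everything that is missing in one pass before opening a pull request.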
License
MIT