browseruse-agent-bench

About browseruse-agent-bench

Real-world browser-agent benchmark: 210 tasks across 107 websites, multi-agent/multi-browser evaluation, reproducible leaderboard and result submissions.


Published by

lexmount

Why browseruse-agent-bench

browseruse-agent-bench is a reproducible evaluation framework for browser agents. LexBench-Browser is the built-in public dataset used by the default benchmark workflow. Together they make external results easy to run, compare, cite, and submit back.

What you can do	Why it matters
Run LexBench-Browser: 210 public tasks across 107 real websites	Test browser agents on long-tail multilingual workflows beyond toy pages
Compare Agent × Model × Browser × Eval	Separate agent quality from model choice, browser backend, and judge strategy
Inspect leaderboard, cost, latency, token usage, and trajectories	Debug failures instead of only reporting a final score
Submit agents, dataset tasks, and reproducible results	Turn forks and PRs into visible benchmark contributions

Description

browseruse-agent-bench is an all-in-one evaluation framework for AI browser agents, designed to benchmark multiple agents across multiple datasets, browser backends, and models under controlled and reproducible settings. The Python package/CLI is published as browseruse-bench and bubench. It supports both local and cloud browsers, integrates LLM-as-Judge for automated evaluation, and provides a built-in local leaderboard along with efficiency and cost metrics such as agent steps, end-to-end latency, and token usage.

Supported Datasets

[x] LexBench-Browser — Browser-agent dataset covering e-commerce, social, academic, financial, and other mainstream Chinese/English websites (v1.0, 2026-04-30)
- All (210, no login required)
- lexmount (118, mainland-accessible websites) / global (92, international websites)
- Hugging Face: Lexmount/LexBench-Browser
[x] Online-Mind2Web — Real website interaction tasks
- All (300) / Hard (hard subset)
[x] BrowseComp — Browser operation competition tasks, no login required
- All (1266)
[ ] More benchmarks

Details: Benchmarks overview.

Supported Agents & Browsers

Agent	Supported Browsers
browser-use	`Chrome-Local`, `lexmount`, `browser-use-cloud`, `agentbay`, `browserbase`, `browserless`, `steel`
skyvern	`local`, `lexmount`, `skyvern-cloud`, `agentbay`, `browserbase`, `browserless`, `steel`
Agent-TARS	Built-in browser
More agents	—

Details: Agents overview.

News

[2026.04.30] 🎉 browseruse-agent-bench v1.0 — initial open-source release. The LexBench-Browser dataset v1.0 ships 210 public tasks across 107 distinct websites with a 6-category × 16-tag robustness label system; reference integrations cover browser-use, skyvern, Agent-TARS and deepbrowse.

Quickstart

1. Clone the repository

git clone https://github.com/lexmount/browseruse-agent-bench.git
cd browseruse-agent-bench

2. Install dependencies (Python >= 3.11)

The commands below use uv (the recommended installer). Follow the section for your agent.

Note: browser-use and skyvern have conflicting dependencies and cannot be installed together. If you plan to run multiple agents in parallel, refer to the Environment Isolation section in the documentation.

browser-use

uv sync --extra browser-use
source .venv/bin/activate          # macOS / Linux
.venv\Scripts\Activate.ps1         # Windows PowerShell

skyvern

uv sync --extra skyvern
source .venv/bin/activate          # macOS / Linux
.venv\Scripts\Activate.ps1         # Windows PowerShell

Agent-TARS (requires Node.js 18+)

uv sync
npm install -g @agent-tars/[email protected]
source .venv/bin/activate          # macOS / Linux
.venv\Scripts\Activate.ps1         # Windows PowerShell

After activation, the bubench CLI is available on your PATH. Without activation, prefix every bubench … command in the following steps with uv run (e.g. uv run bubench run …).

3. Configure

Principle: .env holds sensitive credentials (API keys). config.yaml (git-ignored, copied from config.example.yaml) holds all agent, model, browser, and eval settings in one place.

3.1 Shared credentials (.env)

cp .env.example .env
vim .env

Variable	Description	Sign up	Required
`OPENAI_API_KEY`	API key for agents and evaluation	platform.openai.com	✅
`OPENAI_BASE_URL`	Custom API base URL (e.g. LiteLLM proxy)	—	Optional
`LEXMOUNT_API_KEY` + `LEXMOUNT_PROJECT_ID`	Lexmount cloud browser	browser.lexmount.cn	When using lexmount
`BROWSER_USE_API_KEY`	Browser Use cloud browser	browser-use.com	When using browser-use-cloud
`AGENTBAY_API_KEY`	AgentBay cloud browser	agentbay.ai	When using agentbay
`BROWSERBASE_API_KEY` + optional `BROWSERBASE_PROJECT_ID`	Browserbase cloud browser	browserbase.com	When using browserbase
`BROWSERLESS_API_KEY`	Browserless BaaS cloud browser	browserless.io	When using browserless
`STEEL_API_KEY`	Steel.dev cloud browser	steel.dev	When using steel
`HF_ENDPOINT=https://hf-mirror.com`	HuggingFace mirror (China)	—	Optional
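
Before a run, it can help to confirm that the variables for the chosen browser backend are set. The sketch below transcribes the table above into a mapping. The helper name `missing_credentials` and the mapping are illustrative and not part of bubench, and the sketch assumes that backends not listed in the table (such as `Chrome-Local`) need no browser credential.

```python
import os

# Required variables per browser backend, transcribed from the table above.
# Hypothetical helper data: bubench itself may validate credentials differently.
REQUIRED_BY_BROWSER = {
    "lexmount": ["LEXMOUNT_API_KEY", "LEXMOUNT_PROJECT_ID"],
    "browser-use-cloud": ["BROWSER_USE_API_KEY"],
    "agentbay": ["AGENTBAY_API_KEY"],
    "browserbase": ["BROWSERBASE_API_KEY"],  # BROWSERBASE_PROJECT_ID is optional
    "browserless": ["BROWSERLESS_API_KEY"],
    "steel": ["STEEL_API_KEY"],
}


def missing_credentials(browser, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    # OPENAI_API_KEY is marked as required for every run (agents and evaluation).
    required = ["OPENAI_API_KEY"] + REQUIRED_BY_BROWSER.get(browser, [])
    return [name for name in required if not env.get(name)]
```

By default the helper inspects the process environment, so values that exist only in .env must be exported or loaded first. An empty list means nothing from the table is missing for that backend.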

3.2 Runtime config (config.yaml)

cp config.example.yaml config.yaml
vim config.yaml

All agents are configured in one file. Runtime config is resolved as: agents.<agent> + models.<model> + browsers.<browser>.

Field	Description
`default.model`	Default model key (overridden by `--model`)
`default.browser`	Default browser key (overridden by `--browser-id`)
`agents.<agent>.*`	Agent params: `max_steps`, `timeout`, `use_vision`, etc.
`models.<name>.model_type`	Provider: `BROWSER_USE`, `OPENAI`, `AZURE`, `GEMINI`, `ANTHROPIC`
`models.<name>.model_id`	Model ID (e.g. `gpt-4.1`, `qwen3.5-plus`, `kimi-k2.5`)
`models.<name>.api_key`	API key for this model (supports `$ENV_VAR` expansion)
`models.<name>.base_url`	API base URL (optional, supports `$ENV_VAR` expansion)
`browsers.<name>.browser_id`	Browser backend: `Chrome-Local`, `lexmount`, `browser-use-cloud`, `agentbay`, `browserbase`, `browserless`, `steel`, `cdp`
`eval.model` + `eval.api_key` + `eval.base_url`	Evaluation model settings

To switch per run, use --model <name> and --browser-id <name>.
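
To make the resolution rule concrete, here is a minimal sketch of how agents.<agent>, models.<model>, and browsers.<browser> could be combined, with `$ENV_VAR` expansion and per-run overrides. It is not the bubench implementation: the `CONFIG` dict stands in for a parsed config.yaml, the keys `gpt41` and `local` and all values are placeholders, and `resolve` and `expand` are hypothetical names.

```python
import os

# Illustrative stand-in for a parsed config.yaml. Field names follow the
# table above; entry names ("gpt41", "local") and values are placeholders.
CONFIG = {
    "default": {"model": "gpt41", "browser": "local"},
    "agents": {"browser-use": {"max_steps": 30, "use_vision": True}},
    "models": {
        "gpt41": {
            "model_type": "OPENAI",
            "model_id": "gpt-4.1",
            "api_key": "$OPENAI_API_KEY",
        }
    },
    "browsers": {"local": {"browser_id": "Chrome-Local"}},
}


def expand(value):
    """Expand $ENV_VAR references in strings; leave other types untouched."""
    return os.path.expandvars(value) if isinstance(value, str) else value


def resolve(config, agent, model=None, browser=None):
    """Combine agent, model, and browser sections, applying per-run overrides."""
    # A missing override falls back to the default.* keys, as --model and
    # --browser-id are described as doing.
    model = model or config["default"]["model"]
    browser = browser or config["default"]["browser"]
    return {
        "agent": dict(config["agents"][agent]),
        "model": {k: expand(v) for k, v in config["models"][model].items()},
        "browser": {k: expand(v) for k, v in config["browsers"][browser].items()},
    }
```

Note that `os.path.expandvars` leaves a reference unchanged when the variable is not set, so an unexpanded `$OPENAI_API_KEY` in the result is a quick sign that .env was not loaded.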

4. Install Skills (Optional)

bubench skills

Installs the prebuilt developer-friendly skills pack (browseruse_bench/skills/) into your agent toolchain.

5. Run & Evaluate

Run

bubench run --agent {AGENT} --data {BENCHMARK} --mode first_n --count 3
# Output: experiments/{benchmark}/{split}/{agent}/{model_id}/{timestamp}/

# Example: LexBench-Browser (no login required)
bubench run --agent browser-use --data LexBench-Browser --mode first_n --count 3
# Output: experiments/LexBench-Browser/All/browser-use/gpt-4.1/20260101_120000/
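
When scripting around runs, the output layout above can be reconstructed to locate results. The following sketch assumes the timestamp format `%Y%m%d_%H%M%S`, inferred from the example `20260101_120000`. The helpers `run_dir` and `latest_run` are hypothetical and not part of bubench.

```python
from datetime import datetime
from pathlib import Path


def run_dir(benchmark, split, agent, model_id, when=None, root="experiments"):
    """Build a run directory path following the layout shown above."""
    when = when or datetime.now()
    # Timestamp format inferred from the example "20260101_120000".
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(root) / benchmark / split / agent / model_id / stamp


def latest_run(benchmark, split, agent, model_id, root="experiments"):
    """Return the newest run directory, or None if no run exists yet."""
    parent = Path(root) / benchmark / split / agent / model_id
    if not parent.is_dir():
        return None
    runs = sorted(p for p in parent.iterdir() if p.is_dir())
    # Zero-padded timestamps sort chronologically as plain strings.
    return runs[-1] if runs else None
```

For example, `latest_run("LexBench-Browser", "All", "browser-use", "gpt-4.1")` would point at the most recent run for that combination, if one exists.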

Evaluate

bubench eval --agent {AGENT} --data {BENCHMARK} --model-id {MODEL_ID}

# Example
bubench eval --agent browser-use --data LexBench-Browser --model-id gpt-4.1

--split is optional — the benchmark's default_split (from data_info.json) is used automatically. Pass --split <name> only to override the default. For the full parameter reference, see the Quickstart docs.
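
The split fallback can be pictured as follows. This is a sketch only: the source documents just the `default_split` key of data_info.json, so the file location is left as a parameter, and `resolve_split` is a hypothetical name.

```python
import json
from pathlib import Path


def resolve_split(data_info_path, cli_split=None):
    """Return the split to run: the CLI override if given, else default_split."""
    if cli_split:
        return cli_split
    info = json.loads(Path(data_info_path).read_text(encoding="utf-8"))
    # Only the "default_split" key is documented; no other keys are assumed.
    return info["default_split"]
```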

Data Loading

Use --data-source to control where benchmark data is loaded from:

Mode	Description	Example
`local` (default)	Uses local files under `benchmarks/{benchmark}/data/`, errors if missing	`--data-source local`
`huggingface`	Downloads to HF cache (`~/.cache/huggingface`), does not write back to repo	`--data-source huggingface`
`huggingface` + `--force-download`	Forces re-download, refreshes HF cache	`--data-source huggingface --force-download`

Speed up in China: set HF_ENDPOINT=https://hf-mirror.com in .env.
Private datasets: set HF_TOKEN=hf_your_token_here in .env.

Details: Data Loading.
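
Because the default `local` mode errors when files are missing, a pre-flight check can save a failed run. The sketch below only mirrors the documented path `benchmarks/{benchmark}/data/`. The helper `local_data_dir` and its error message are illustrative and not bubench's own behavior.

```python
from pathlib import Path


def local_data_dir(benchmark, repo_root="."):
    """Return benchmarks/{benchmark}/data/ or raise if it is missing."""
    data_dir = Path(repo_root) / "benchmarks" / benchmark / "data"
    if not data_dir.is_dir():
        # Point the user at the documented alternative data source.
        raise FileNotFoundError(
            f"{data_dir} not found; try --data-source huggingface"
        )
    return data_dir
```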

📖 For complete guides, API reference, and more examples, see the full documentation.

Leaderboard

We provide an interactive local leaderboard to compare agent performance across benchmarks.

Generate leaderboard HTML:

bubench leaderboard

Deploy leaderboard service (temporary process):

bubench server --host 0.0.0.0 --port 8012 &

Deploy leaderboard service (systemd):

sudo bubench service install
sudo bubench service start

See Leaderboard Documentation for more details.

Access URLs (default port 8012):

Local leaderboard: http://localhost:8012
Local API docs: http://localhost:8012/docs
Remote leaderboard: http://<SERVER_IP>:8012/
Remote API docs: http://<SERVER_IP>:8012/docs

Visualization

An interactive experiment explorer for browsing agent trajectories, evaluation details, and per-task API logs — complements the static leaderboard with task-level drill-down.

# Start server (auto-regenerates index when experiment files change)
bubench viz --watch

# Access at http://localhost:8080

Options:

bubench viz --port 8090              # custom port (default: 8080)
bubench viz --generate-only          # regenerate experiments.json and exit
bubench viz --watch-interval 5       # poll interval in seconds (default: 3)

For remote sharing with tmux and firewall configuration, see Visualization Documentation.

Acknowledgements

Some code in this project is adapted from Online-Mind2Web and simple-evals.

Citation

@misc{lexbench_browser_2026,
    title        = {LexBench-Browser: A Real-World Browser Agent Benchmark with Long-Tail and Multilingual Tasks},
    author       = {Lexmount Research and Collaborators},
    year         = {2026},
    howpublished = {\url{https://lexmount.github.io/browseruse-agent-bench/}},
    note         = {Open benchmark; v1.0 reference release},
}

Contact

Questions, benchmark proposals, agent integrations, and result reproductions are welcome:

Report bugs or request features in GitHub Issues.
Ask questions and discuss results in GitHub Discussions.
Email official result, dataset, and collaboration questions to [email protected].
Track upcoming releases in Milestones.
Use Contributing when opening pull requests or adding a new agent/benchmark.
See Governance and Evaluation Protocol for result review rules.

Coming Soon

🔐 Login-state preservation — first-class support for reusing browser login across eval runs, so login-gated tasks can be benchmarked end-to-end without manual re-login. Stay tuned.

Roadmap / Development Plan

Refer to our Milestones for upcoming versions and deadlines.