WebVoyager Benchmark Results
Browserable: an open-source, state-of-the-art browser agent
Browserable achieved 90.4% on the WebVoyager benchmark, measured across 567 of the benchmark's web tasks. This is best-in-class performance among web agents.
Browserable is fully open source and self-hostable. You can check out the GitHub repo here.
We ran the benchmark evaluation with Gemini 2.0 Flash as the primary LLM, and with GPT-4o and Claude 3.5 Sonnet as backup LLMs for when requests were rate limited.
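The primary-plus-backup setup described above can be sketched as a simple ordered fallback. This is a minimal illustration, not Browserable's actual implementation: `RateLimitError`, `complete_with_fallback`, and the stand-in provider functions are hypothetical names, and the model strings are used only as labels.

```python
from typing import Callable, Sequence


class RateLimitError(Exception):
    """Hypothetical error a provider wrapper raises when it is rate limited."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order; move to the next one only on a rate limit."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimitError:
            continue  # this provider is throttled, try the next backup
    raise RuntimeError("all providers were rate limited")


# Stand-in callables for illustration only; real ones would call each vendor's API.
def gemini_flash(prompt: str) -> str:
    raise RateLimitError("primary is throttled")


def gpt_4o(prompt: str) -> str:
    return f"answer to: {prompt}"


def claude_sonnet(prompt: str) -> str:
    return f"answer to: {prompt}"


# Order matters: primary first, then the backups.
PROVIDERS = [
    ("gemini-2.0-flash", gemini_flash),
    ("gpt-4o", gpt_4o),
    ("claude-3.5-sonnet", claude_sonnet),
]

used, reply = complete_with_fallback("Find the cheapest flight", PROVIDERS)
print(used, reply)
```

In this sketch only a rate-limit error triggers the fallback; any other exception propagates, so genuine failures are not masked by silently switching models.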
Benchmark Details
We removed 56 tasks from the original task list for the following reasons:
- Some tasks are outdated or no longer valid because of website updates.
- Some tasks cannot be completed by a bare-bones browser agent without human intervention (for example, logging in to a website).
Wherever possible, we updated tasks to reflect the current state of the web (for example, changing the date in a flight-booking task).
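The task-list preparation described above (dropping some tasks, updating others) might look roughly like the following. This is a sketch under assumptions: the task file is taken to be JSON Lines with `id` and `ques` fields, and the task IDs, questions, and the names `REMOVED_IDS`, `UPDATED_QUESTIONS`, and `prepare_tasks` are invented for illustration.

```python
import json

# Toy stand-in for the benchmark's task file; the field names are assumed.
RAW_TASKS = """\
{"id": "site-a--0", "ques": "Find a vegetarian lasagna recipe."}
{"id": "site-b--1", "ques": "Book a flight departing on 2023-12-25."}
{"id": "site-c--2", "ques": "Log in and list your recent orders."}
"""

# Hypothetical curation data: tasks to drop and tasks to reword.
REMOVED_IDS = {"site-c--2"}  # e.g. requires a login
UPDATED_QUESTIONS = {"site-b--1": "Book a flight departing next month."}


def prepare_tasks(raw, removed_ids, updated_questions):
    """Parse JSON Lines, skip removed tasks, and apply updated questions."""
    tasks = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        task = json.loads(line)
        if task["id"] in removed_ids:
            continue
        # Keep the original question unless an update is registered for this ID.
        task["ques"] = updated_questions.get(task["id"], task["ques"])
        tasks.append(task)
    return tasks


tasks = prepare_tasks(RAW_TASKS, REMOVED_IDS, UPDATED_QUESTIONS)
print(len(tasks), "tasks kept")
```

Keeping removals and updates in separate, explicit tables makes the curated task list easy to audit against the original.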
Coming Soon
- Shareable links for individual runs
- Full list of results
The total LLM cost of running the 567 tasks was ~70 USD.
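For a rough sense of scale, the reported total can be turned into an average cost per task. This is simple arithmetic on the two figures stated above; since the total is approximate, the result is approximate too.

```python
TOTAL_COST_USD = 70.0  # approximate total LLM cost reported above
TASK_COUNT = 567       # number of tasks in the evaluated set

cost_per_task = TOTAL_COST_USD / TASK_COUNT
print(f"~{cost_per_task:.2f} USD per task")  # prints "~0.12 USD per task"
```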
What is WebVoyager?
The WebVoyager benchmark is a standardized evaluation suite designed to measure the real-world browsing capabilities of autonomous agents. It tests how well an agent can complete tasks in real browser environments (navigating a website, filling in forms, clicking buttons, and extracting information) without human assistance on live websites such as Amazon, Apple, and Google Flights.