WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

1 Microsoft, 2 UIUC, 3 CMU
*Work partly done during internship at Microsoft

Recipe Overview 🧑‍🍳

WebGym Demo


WebGym Teaser

Example rollouts from visual web agents trained in different training environments. Tasks in prior large-scale training setups were relatively simple, e.g., Test-Time-Interaction (TTI), causing trained agents to fail on many held-out tasks. We build WebGym, a significantly larger training environment supporting harder and more diverse tasks, totaling nearly 300k tasks (3× the size of TTI), and train an agent via online reinforcement learning to acquire more generalizable skills.

Abstract

We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe that learns from the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To scale RL, we first speed up trajectory sampling in WebGym by developing a high-throughput asynchronous rollout system designed specifically for web agents, achieving a 4-5× rollout speedup over naive implementations. Second, we scale the task set's breadth, depth, and size, which yields continued performance improvements. Fine-tuning a strong base vision-language model, Qwen3-VL-8B-Instruct, on WebGym improves the success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking, which achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike the evaluation setups of much prior work on training visual web agents.

Tasks in WebGym

Training with online RL requires a large number of model-generated rollouts guided by reliable reward signals. To learn generalizable policies, the task set must (1) span diverse domains and websites, (2) cover a range of difficulties from atomic to compositional tasks, and (3) include verifiable evaluators that translate outcomes into meaningful learning signals.
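As a concrete illustration of these requirements, the sketch below shows one plausible way to represent such a task record with a rubric-based evaluator in Python. The class and field names (WebTask, RubricCriterion, difficulty, rubric) are our own illustrative assumptions, not the released WebGym schema.

```python
# Illustrative task record with a rubric-based evaluator; field names are
# assumptions for exposition, not the released WebGym schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricCriterion:
    description: str        # e.g., "The final answer states an exact departure time."
    weight: float = 1.0     # relative importance when aggregating criteria

@dataclass
class WebTask:
    task_id: str
    website: str            # starting URL on a real website
    instruction: str        # natural-language goal given to the agent
    difficulty: int         # 1 (atomic) .. 10 (highly compositional)
    rubric: List[RubricCriterion] = field(default_factory=list)

example = WebTask(
    task_id="demo-0001",
    website="https://www.example.com",
    instruction="Find the cheapest one-way flight from SFO to SEA next Friday.",
    difficulty=4,
    rubric=[
        RubricCriterion("The answer names a specific flight and its price."),
        RubricCriterion("The price corresponds to a one-way SFO-to-SEA itinerary."),
    ],
)
```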

Task Sourcing Statistics

| Source Task Set | Difficulties | # Websites | # Tasks |
|---|---|---|---|
| InSTA-v3 | | 146,348 | 146,441 |
| PAE-WebVoyager | | 13 | 128,499 |
| AgentSynth-Web | | 328 | 2,086 |
| BrowseComp | | 7 | 1,266 |
| TravelPlanner | | 1 | 1,225 |
| Mind2Web-Live | | 76 | 542 |
| Online Mind2Web | | 139 | 300 |
| DeepShop | | 1 | 150 |
| Mind2Web-2 | | 44 | 130 |
| GAIA-Web | | 1 | 87 |
| WebGym (Ours) | ✓ | 127,645 | 292,092 |

Task sourcing for WebGym. We aggregate web agent task sets from 10 widely-used benchmarks and environments to seed our procedural construction. WebGym provides the largest task set with difficulty annotations.

Task Decomposition

Task Decomposition Illustration

Task decomposition in WebGym. We decompose complex tasks into atomic subtasks, enabling compositional learning from simple to complex tasks.
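As a rough sketch of how such decomposition could be driven by an LLM, the snippet below asks a generic llm callable to split a compositional task into an ordered list of atomic subtasks. The prompt wording and the fallback behavior are assumptions for illustration, not the exact WebGym pipeline.

```python
# Minimal sketch of LLM-driven task decomposition; the prompt and the generic
# `llm` callable are placeholders, not the exact WebGym pipeline.
from typing import Callable, List
import json

DECOMPOSE_PROMPT = """You are given a complex web task.
Break it into an ordered list of atomic subtasks, each achievable in a few
browser actions. Respond with a JSON list of strings.

Task: {task}
"""

def decompose_task(task: str, llm: Callable[[str], str]) -> List[str]:
    """Ask an LLM to split a compositional task into atomic subtasks."""
    raw = llm(DECOMPOSE_PROMPT.format(task=task))
    try:
        subtasks = json.loads(raw)
    except json.JSONDecodeError:
        subtasks = [task]            # fall back to the original, undecomposed task
    return [s for s in subtasks if isinstance(s, str) and s.strip()]

# Toy usage with a stubbed LLM response.
fake_llm = lambda prompt: '["Open the airline site", "Search SFO to SEA", "Sort results by price"]'
print(decompose_task("Book the cheapest SFO-to-SEA flight", fake_llm))
```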

Task Set Characteristics

Domain distribution in WebGym. Distribution of tasks across domains according to the Mind2Web-2 taxonomy (6 domains, 24 subdomains). The task set spans 127,645 unique websites with broad domain coverage, promoting diverse training samples across e-commerce, information retrieval, social media, entertainment, and more.

Domain Distribution

Difficulty distribution across training and out-of-distribution test sets. The transparent bars overlaying the solid bars represent decomposed tasks, while the slashed bars indicate the test set. Around 80% of tasks are Easy (difficulty 1-3); this is intentional, encouraging the agent to learn generalizable basic skills, while higher difficulty levels focus on more specific objectives.

Difficulty Distribution

Distribution of trajectory lengths by task difficulty (run-time statistics). Violin plots show the probability density of trajectory lengths for answered trajectories, with the KDE mode (black line, where the violin is widest) indicating the most likely length and the mean (red) indicating the average. Easy tasks (difficulty 1-3) require 7.8 steps on average, medium tasks (4-6) require 9.9 steps, and hard tasks (7+) require 11.9 steps, confirming that harder tasks involve longer, more complex sequences of actions and navigation.

Length vs Difficulty

Agreement between automated evaluators and human judgment. Rubric-based evaluation (with explicit criteria) consistently improves agreement over task-only evaluation, yielding higher accuracy and precision across LLM-based evaluators. Among the evaluators, GPT-4o shows the largest shift after adding the rubric: precision increases the most while recall drops slightly, indicating that the rubric makes GPT-4o apply stricter, more conservative pass criteria.

Rubric-based Evaluation
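A minimal sketch of rubric-based judging is shown below: the task, the rubric criteria, and a summary of the agent's trajectory are given to an LLM judge, whose YES/NO verdict becomes the reward signal. The prompt format and the generic judge callable are illustrative assumptions, not the exact evaluator used in WebGym.

```python
# Sketch of rubric-based trajectory evaluation with an LLM judge; the prompt
# format and the generic `judge` callable are assumptions for illustration.
from typing import Callable, List

JUDGE_PROMPT = """Task: {task}
Rubric criteria (all must be satisfied for success):
{rubric}

Agent's final answer and trajectory summary:
{summary}

Did the agent complete the task according to the rubric? Answer YES or NO.
"""

def rubric_pass(task: str, rubric: List[str], summary: str,
                judge: Callable[[str], str]) -> bool:
    """Return True if the LLM judge says the trajectory satisfies the rubric."""
    rubric_text = "\n".join(f"- {c}" for c in rubric)
    verdict = judge(JUDGE_PROMPT.format(task=task, rubric=rubric_text, summary=summary))
    return verdict.strip().upper().startswith("YES")

# Toy usage with a stubbed judge.
stub_judge = lambda prompt: "YES"
print(rubric_pass("Find the departure time",
                  ["The answer states an exact time"],
                  "The agent answered: the flight departs at 9:40 AM.",
                  stub_judge))
```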

High-Throughput Rollout System

To enable efficient RL training at scale, we developed a high-throughput asynchronous rollout system designed specifically for web agents. Our system achieves a 4-5× speedup compared to naive implementations.

System Architecture

Our rollout system employs an asynchronous architecture that decouples environment simulation from policy inference. This design enables efficient batching of policy forward passes across multiple concurrent environments, maximizing GPU utilization while maintaining high throughput.

Async Design

Async design (ours): the asynchronous architecture decouples the threads handling different trajectories in a batch, so each environment proceeds independently while policy queries are batched.

Key Features

  • Asynchronous Architecture: Decouples environment simulation from policy inference, allowing environments to run independently while batching policy queries
  • Efficient Batching: Batches policy forward passes across multiple concurrent environments to maximize GPU utilization
  • Scalable Design: Can collect 1,800 trajectories (avg. 13.2 steps) in 30 minutes using 128 CPUs and 24 H100 GPUs
  • Browser Pool Management: Manages a pool of browser instances to minimize startup overhead and maximize throughput
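The sketch below illustrates the core idea under simplifying assumptions: each environment runs as its own coroutine, policy requests are funneled through a shared queue, and a server coroutine answers them with a single batched call. DummyEnv and batched_policy are stand-ins for a browser-backed environment and a batched VLM inference call; this is a conceptual sketch, not our actual implementation.

```python
# Conceptual sketch of an asynchronous rollout loop with batched policy inference.
# `DummyEnv` and `batched_policy` are placeholders, not the real system.
import asyncio
import random

class DummyEnv:
    """Stand-in for a browser environment (e.g., a Playwright-controlled page)."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.steps = 0

    async def observe(self):
        await asyncio.sleep(random.uniform(0.01, 0.05))   # simulate page I/O
        return f"obs(task={self.task_id}, step={self.steps})"

    async def step(self, action):
        await asyncio.sleep(random.uniform(0.01, 0.05))   # simulate a browser action
        self.steps += 1
        return self.steps >= 5                             # toy termination condition

async def batched_policy(observations):
    """Stand-in for one batched VLM forward pass on GPU."""
    await asyncio.sleep(0.05)                              # amortized over the batch
    return [f"action_for({o})" for o in observations]

async def policy_server(request_q: asyncio.Queue, max_batch: int = 8):
    """Collect pending observations and answer them with a single batched call."""
    while True:
        requests = [await request_q.get()]
        while len(requests) < max_batch and not request_q.empty():
            requests.append(request_q.get_nowait())
        actions = await batched_policy([obs for obs, _ in requests])
        for (_, fut), act in zip(requests, actions):
            fut.set_result(act)

async def rollout(env: DummyEnv, request_q: asyncio.Queue):
    """One trajectory: the env runs independently, queueing its policy requests."""
    done = False
    while not done:
        obs = await env.observe()
        fut = asyncio.get_running_loop().create_future()
        await request_q.put((obs, fut))
        action = await fut                                 # wait for the batched response
        done = await env.step(action)
    return env.task_id, env.steps

async def main(num_envs: int = 16):
    request_q = asyncio.Queue()
    server = asyncio.create_task(policy_server(request_q))
    results = await asyncio.gather(*(rollout(DummyEnv(i), request_q) for i in range(num_envs)))
    server.cancel()
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```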

Performance Metrics

  • Throughput: 4-5× speedup over the naive implementation (4.5× in the benchmark shown)
  • Collection rate: 1,800 trajectories in 30 minutes
  • Average steps: 13.2 steps per trajectory
  • Resources: 128 CPUs and 24 H100 GPUs

Design Benefits

The asynchronous design enables efficient scaling to large batch sizes, which is critical for online RL training. By batching policy queries across multiple environments, we achieve high GPU utilization while maintaining low latency for individual environment steps. This design is particularly effective for web agents, where environment simulation (browser interactions) can be parallelized across CPU cores while policy inference is batched on GPUs.

Performance Benchmarks

We conducted comprehensive benchmarks to evaluate the scalability and efficiency of our rollout system. The results demonstrate strong scaling with both increased parallelism and GPU resources.

CPU Scaling Performance

CPU Scaling: Rollout time decreases significantly with increased CPU parallelism, demonstrating efficient scaling.
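For intuition, a toy benchmark of the kind used to produce such a scaling curve could look like the sketch below, where an I/O-bound sleep stands in for browser interaction. The step time, trajectory count, and worker counts are arbitrary assumptions, not our actual benchmark configuration.

```python
# Toy benchmark sketch: wall-clock rollout time vs. number of parallel workers.
# The simulated step time is an assumption; it only illustrates the measurement.
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_rollout(num_steps=13, step_seconds=0.01):
    """Stand-in for one browser trajectory dominated by I/O-bound waits."""
    for _ in range(num_steps):
        time.sleep(step_seconds)

def time_batch(num_trajectories=32, workers=8):
    """Time how long it takes to collect a fixed batch with a given worker count."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: simulated_rollout(), range(num_trajectories)))
    return time.perf_counter() - start

for w in (1, 4, 16, 32):
    print(f"{w:>2} workers: {time_batch(workers=w):.2f}s for 32 trajectories")
```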

Training Results

We train agents using a simple REINFORCE-style RL training procedure on WebGym. Our results show that fine-tuning Qwen3-VL-Instruct-8B with RL on WebGym significantly improves performance on held-out test tasks.
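A minimal REINFORCE-style loss, assuming per-token log-probabilities of the agent's own actions and a scalar trajectory-level task reward from the rubric-based evaluator, could look like the sketch below. The optional baseline and the toy usage are illustrative assumptions, not the exact WebGym training code.

```python
# Minimal REINFORCE-style update sketch (not the exact WebGym training code).
# Assumptions: `logprobs` are log-probabilities of the agent's own actions under
# the current policy; `reward` is the scalar task reward for the trajectory.
import torch

def reinforce_loss(logprobs: torch.Tensor, reward: float, baseline: float = 0.0):
    """REINFORCE objective: -(R - b) * sum_t log pi(a_t | s_t)."""
    advantage = reward - baseline          # an optional baseline reduces variance
    return -(advantage * logprobs.sum())

# Toy usage: one trajectory with 5 action tokens and a binary task reward.
logits = torch.randn(5, requires_grad=True)    # placeholder action logits
logprobs = logits.log_softmax(dim=0)           # stand-in for per-step log pi(a_t | s_t)
loss = reinforce_loss(logprobs, reward=1.0, baseline=0.5)
loss.backward()
print(loss.item(), logits.grad)
```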

Main Results

| | Base Model | Prompt | RL on WebGym | Performance (%) |
|---|---|---|---|---|
| Proprietary Models | GPT-4o | Memory | | 27.1 |
| | GPT-5 | Memory | | 29.8 |
| Open-source Models | Qwen3-VL-Instruct-8B | Official | | 27.4 |
| | Qwen3-VL-Instruct-8B | Memory | | 26.2 |
| | Qwen3-VL-Thinking-8B | Memory | | 28.2 |
| | Qwen3-VL-Instruct-8B | Memory | ✓ (Ours) | 42.9 |

WebGym enables the small Qwen3-VL-Instruct-8B model to achieve state-of-the-art performance on the held-out test task set, reaching a 42.9% success rate and significantly surpassing agents built on proprietary models as well as other open-source models.

Performance Analysis

We analyze the effects of scaling different dimensions of WebGym: task set breadth (domain diversity), depth (difficulty composition), and size, as well as the train-time interaction horizon.

Training Performance Curves

Below, we show training curves comparing different ablations and reporting performance across difficulty levels.

Ablations on Base Models, Prompting, and Action Filtering

Figure: Ablation study on model variants, prompting, and action filtering. Curves are smoothed with EMA (α=0.5).
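A minimal sketch of the EMA smoothing applied to these curves, assuming the standard recursion s_t = α·x_t + (1−α)·s_{t−1} with α=0.5 (the exact convention used for the plots may differ):

```python
# Exponential moving average for curve smoothing, assuming
# s_t = alpha * x_t + (1 - alpha) * s_{t-1} with alpha = 0.5.
def ema(values, alpha=0.5):
    smoothed, prev = [], None
    for x in values:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

print(ema([0.20, 0.28, 0.25, 0.35, 0.40]))  # toy success-rate curve
```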

Exploring Scaling Dimensions and Difficulty-Aware Training

Panels: (a) Overall, (b) Easy (1-3), (c) Medium (4-6), (d) Hard (7+).

Figure: Impact of scaling dimensions on test-set success rate across difficulty levels. Curves are smoothed with EMA (α=0.5).

BibTeX

@article{bai2025webgym,
    title={WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks},
    author={Bai, Hao and Taymanov, Alexey and Zhang, Tong and Kumar, Aviral and Whitehead, Spencer},
    journal={arXiv preprint},
    year={2025}
}