Scaling Up Reinforcement Learning for Traffic Smoothing: A 100-AV Highway Deployment
Sources: http://bair.berkeley.edu/blog/2025/03/25/rl-av-smoothing, bair.berkeley.edu
TL;DR
- Berkeley BAIR researchers deployed 100 reinforcement learning (RL) controlled cars on a busy highway to dampen stop-and-go waves and reduce fuel consumption for all drivers.
- The effort combines fast, data-driven simulations with real-world deployment, showing meaningful energy savings and smoother traffic at scale.
- Field tests on I-24 near Nashville suggest 15–20% energy savings around the controlled vehicles, with overall congestion reduced and wave amplitudes dampened.
- The approach relies on decentralized control using only basic sensor inputs, compatible with standard adaptive cruise control (ACC) systems in consumer vehicles.
- The work highlights a path from simulation to large-scale deployment, while outlining routes for further improvements via richer traffic models, data, and potential 5G-enabled communication.
Context and background
Stop-and-go waves are common on busy highways: minor fluctuations in driving behavior propagate backward through traffic, lowering energy efficiency and increasing emissions and crash risk. Traditional traffic-management tools like ramp metering and variable speed limits require infrastructure or centralized coordination. The Berkeley AI Research (BAIR) team argues that autonomous vehicles (AVs) can adapt dynamically in real time to dampen these waves, provided the control policy balances energy efficiency with safety and throughput.

The researchers built fast, data-driven simulations that faithfully mirror highway stop-and-go dynamics. They trained RL agents in environments that replay real traffic trajectories recorded on Interstate 24 (I-24) near Nashville, Tennessee. This data-driven realism was essential to bridging the simulation-to-reality gap and ensuring the policies could generalize to real traffic patterns.

Observations for the RL agents are intentionally local: each AV uses only its own speed, the speed of the vehicle ahead, and the gap between the two. From these inputs, the agent outputs either an instantaneous acceleration or a desired speed for the AV. This minimalist sensing scheme makes the resulting controllers deployable on most modern vehicles in a decentralized fashion, without extra infrastructure.

The reward structure was critical. It had to reflect multiple objectives: reduce fuel consumption, maintain throughput, and preserve safe, reasonable behavior around human drivers. The team introduced dynamic minimum and maximum gap thresholds to prevent the policy from chasing energy savings by stopping in unsafe locations, and penalized the fuel consumption of human-driven vehicles behind the AV to avoid selfish strategies that save energy for the AV while harming surrounding traffic.

In simulation, the approach achieved substantial fuel savings: up to 20% across all road users in the most congested scenarios, even with AVs making up a small share of traffic (fewer than 5% of vehicles on the road). The AVs themselves could be standard consumer cars equipped with a smart adaptive cruise control (ACC) system.
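To make the sensing scheme above concrete, here is a minimal sketch of the local observation and action interface. The function names, data layout, and numeric defaults are assumptions for illustration, not the team's actual code:

```python
import numpy as np

def observation(av_speed, lead_speed, gap):
    """Local observation: the AV's own speed, the lead vehicle's speed,
    and the bumper-to-bumper gap. No infrastructure or V2V data needed."""
    return np.array([av_speed, lead_speed, gap], dtype=np.float32)

def apply_action(policy, obs, dt=0.1, action_mode="accel"):
    """The policy outputs either an instantaneous acceleration or a desired
    speed for the AV; both variants described in the post are sketched here."""
    out = float(policy(obs))
    if action_mode == "accel":
        # Integrate the commanded acceleration to obtain the next speed.
        return max(0.0, obs[0] + out * dt)
    # Otherwise treat the output as a target speed for a low-level controller.
    return max(0.0, out)
```

Because the observation is limited to what a forward radar or camera already measures, a controller with this interface can run on each vehicle independently, which is what makes the decentralized deployment described below possible.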
What’s new
The centerpiece of the work is a large-scale, real-world test: deploying RL controllers on 100 cars driving on I-24 during peak traffic hours over several days, in what the team dubbed the MegaVanderTest. This represents the largest mixed-autonomy traffic-smoothing experiment to date. Before the field test, the researchers validated the controllers in simulation and then tested them on hardware to ensure robust performance in real traffic conditions.

During the field test, dozens of overhead cameras captured millions of vehicle trajectories, which a computer vision pipeline extracted for analysis. Across these data, initial metrics showed a trend toward lower energy use near AVs and reduced variation in speeds and accelerations, both indicators of dampened stop-and-go waves. The field results align with prior simulation findings: energy savings were observed around controlled cars, with a broader signal of congestion reduction when AVs were present. Notably, the deployment remained decentralized, with no explicit cooperation or communication between AVs, consistent with current autonomous driving deployment realities.

The team also emphasized practical deployment considerations: the RL controllers can be integrated with existing ACC systems, enabling near-term field deployment at scale. The experiment suggests that as more vehicles gain smart traffic-smoothing control, the benefits for all road users will grow, including reductions in fuel consumption and emissions.
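For a sense of how such smoothing indicators can be derived from trajectory data, the sketch below computes per-vehicle speed and acceleration variance. The data layout, sampling interval, and function name are assumptions; the team's actual pipeline also applies a calibrated energy model to the extracted trajectories:

```python
import numpy as np

def smoothing_indicators(speed_traces, dt=0.1):
    """speed_traces: list of per-vehicle speed time series in m/s, sampled
    every dt seconds. Returns mean speed variance and mean acceleration
    variance; lower values indicate dampened stop-and-go waves."""
    speed_vars, accel_vars = [], []
    for speeds in speed_traces:
        speeds = np.asarray(speeds, dtype=float)
        if speeds.size < 2:
            continue  # need at least two samples to estimate acceleration
        accels = np.diff(speeds) / dt
        speed_vars.append(speeds.var())
        accel_vars.append(accels.var())
    return float(np.mean(speed_vars)), float(np.mean(accel_vars))
```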
Why it matters (impact for developers/enterprises)
- Energy efficiency and emissions: The combination of RL-driven smoothing and decentralized control demonstrates a viable path to lower fuel consumption for both AVs and human-driven vehicles in dense traffic. The observed energy savings in field tests point to tangible emissions reductions if deployed at scale.
- Low hardware requirements: The approach relies on basic sensor data already available in many vehicles and a standard ACC-enabled platform, which lowers barriers to broad adoption.
- Gradual deployment model: The policy is designed to operate safely and effectively with limited penetration of AVs, and it can be extended to additional data sources and communications channels (e.g., 5G) as more vehicles gain connectivity.
- Simulation-to-field bridge: By using real traffic data to seed simulations and then validating in hardware, the work demonstrates a practical workflow for bringing RL-based traffic-control ideas from research to real roads.
- Industry relevance: As cities and mobility providers seek scalable solutions to congestion and energy efficiency, the MegaVanderTest offers a blueprint for evaluating and deploying decentralized traffic-smoothing controllers at scale.
Technical details and implementation (what was done)
- Environment and data sources: The team used a mixed-autonomy traffic environment in which RL agents learned to dampen stop-and-go waves and reduce energy use for nearby human drivers. They replayed highway trajectories derived from experimental data on I-24 to generate realistic dynamics for training.
- Observations and actions: Each AV’s observation is limited to the vehicle’s own speed, the speed of the lead vehicle, and the gap between them. The RL policy chooses either an instantaneous acceleration or a target speed for the AV.
- Reward design: Balancing energy efficiency with safe and reasonable driving behavior was central. To prevent degenerate behavior (e.g., stopping on the highway to save energy), the researchers implemented dynamic gap thresholds and penalized the energy consumption of human drivers behind the AVs; a minimal sketch of this kind of reward shaping appears after this list.
- Deployment and evaluation: After simulation validation, the RL controllers were deployed on 100 real vehicles during morning rush on I-24. Surrounding traffic continued normally and drivers were unaware of the experiment. Data collection occurred via overhead cameras, followed by trajectory extraction and energy-use modeling.
- Results and metrics: Two complementary signals emerged. First, field observations indicated a trend of reduced energy consumption near AVs and lower variance in speeds and accelerations, consistent with dampened stop-and-go waves. Second, the field results showed energy savings in the range of 15–20% around the controlled cars, in line with the separate simulation result of up to 20% savings across all road users in the most congested scenarios with a small AV fraction.
- Key deployment characteristics: The 100-car field test was decentralized, reflecting current autonomous deployment practices with no explicit AV-to-AV coordination. The work points to pathways for further improvement through faster and more accurate simulations, better human-driving models, and richer data inputs, including potential centralized planning or 5G-enabled communication between AVs.
- Practical takeaway: The controllers were designed to be compatible with standard ACC systems, enabling scalable field deployment without requiring bespoke hardware.
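As referenced in the reward-design item above, the sketch below illustrates one plausible shape for such a reward: a fuel term for the AV and the human drivers behind it, plus penalties when the gap leaves a speed-dependent safe band. The coefficients, threshold formulas, and fuel inputs are placeholders, not the published reward function:

```python
def reward(av_speed, gap, av_fuel_rate, trailing_human_fuel_rates,
           w_fuel=1.0, w_human=1.0, w_gap=5.0):
    """Hypothetical reward shaping: penalize fuel use by the AV and by the
    human-driven vehicles behind it, and discourage gaps outside a
    dynamic [min_gap, max_gap] band tied to the AV's current speed."""
    # Illustrative dynamic gap thresholds: roughly a 1 s minimum headway
    # and a 6 s maximum headway, with absolute floors in meters.
    min_gap = max(5.0, 1.0 * av_speed)
    max_gap = max(40.0, 6.0 * av_speed)

    fuel_penalty = w_fuel * av_fuel_rate
    human_penalty = w_human * sum(trailing_human_fuel_rates)

    gap_penalty = 0.0
    if gap < min_gap:          # too close: unsafe following
        gap_penalty = w_gap * (min_gap - gap)
    elif gap > max_gap:        # too far: hurts throughput, invites cut-ins
        gap_penalty = w_gap * (gap - max_gap)

    return -(fuel_penalty + human_penalty + gap_penalty)
```

Penalizing the trailing human drivers' fuel use is what discourages the selfish strategies mentioned earlier, while the gap band rules out stopping on the highway as a degenerate way to save energy.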
Table: key findings (comparative snapshot)

| Setting | AV penetration | Typical energy savings | Notes |
|---|---:|---:|---|
| Simulation (most congested scenarios) | <5% of vehicles | Up to 20% across all road users | Trained on replayed I-24 trajectories |
| Field test (MegaVanderTest, I-24) | 100 controlled cars | 15–20% around controlled vehicles | Decentralized; no AV-to-AV communication |