Scaling RL for Traffic Smoothing: 100-AV Highway Deployment
Source: http://bair.berkeley.edu/blog/2025/03/25/rl-av-smoothing/ (BAIR Blog)
Overview
Stop-and-go waves are common in dense highway traffic and lead to wasted energy and higher emissions. Researchers deployed 100 reinforcement learning (RL) controlled cars on a real highway (I-24 near Nashville) to learn driving strategies that dampen these waves, improve energy efficiency, and maintain throughput around human drivers. A key finding is that a small fraction of well-controlled autonomous vehicles (AVs) can meaningfully improve traffic flow and reduce fuel use for all road users.

The team built fast, data-driven simulations from experimental highway data to train RL agents to optimize energy use while operating safely around humans. Highway trajectory data from I-24 was replayed to generate unstable traffic patterns in simulation, letting AVs learn smoothing strategies behind human-driven traffic.

The approach emphasizes local sensing: the observations fed to the RL agent are the AV's own speed, the speed of the vehicle in front, and the gap between them. From these signals, the agent prescribes either an instantaneous acceleration or a desired speed for the AV. The reward function balances energy efficiency with throughput and safe, reasonable driving, and it incorporates dynamic minimum and maximum gap thresholds to rule out degenerate, unsafe behaviors. It also penalizes the fuel consumption of human drivers behind the AV to discourage selfish optimization by the RL controller.

In simulation, the learned behavior typically keeps slightly larger gaps than humans do, allowing AVs to absorb upcoming slowdowns more effectively. In the most congested scenarios, simulations reported up to ~20% total fuel savings across road users with fewer than 5% AV penetration. The controllers are designed to work with standard adaptive cruise control (ACC) hardware and operate in a decentralized fashion, using only basic sensor input available on most modern vehicles and requiring no special infrastructure.

Following simulation validation, the team deployed the RL controllers in the field in what they called the MegaVanderTest: a large-scale, 100-vehicle experiment on I-24 during peak traffic hours. Overhead cameras reconstructed millions of vehicle trajectories, enabling detailed analysis of traffic dynamics and energy use. The project chronicles the path from simulation to the field, detailing the steps taken to bridge the gap between training and real-world deployment.
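To make the observation and reward design above concrete, the sketch below shows one way such a per-step reward could be structured. It is a minimal illustration, not the authors' implementation: the weights, the quadratic energy proxy, and the speed-scaled gap thresholds are all assumptions.

```python
def step_reward(av_speed, lead_speed, gap, accel, follower_energy,
                w_energy=1.0, w_throughput=0.5, w_gap=2.0):
    """Hypothetical per-step reward in the spirit of the blog post:
    trade off energy use, throughput, and safe gaps. All weights and
    models here are illustrative assumptions."""
    # Energy proxy: harsh accelerations burn fuel; also charge the fuel
    # consumed by human drivers behind the AV to discourage selfish behavior.
    energy_cost = accel ** 2 + follower_energy

    # Throughput proxy: stay close to the lead vehicle's speed.
    throughput = -abs(av_speed - lead_speed)

    # Dynamic gap thresholds, scaled with speed, so the agent cannot
    # "win" by stopping (huge gap) or by tailgating (tiny gap).
    min_gap = max(5.0, 0.5 * av_speed)   # meters
    max_gap = max(30.0, 3.0 * av_speed)  # meters
    if gap < min_gap:
        gap_penalty = (min_gap - gap) ** 2
    elif gap > max_gap:
        gap_penalty = (gap - max_gap) ** 2
    else:
        gap_penalty = 0.0

    return -w_energy * energy_cost + w_throughput * throughput - w_gap * gap_penalty
```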
Key features
- Data-driven RL training in fast, realistic traffic simulations built from experimental highway data.
- Local observations: AV speed, lead-vehicle speed, and inter-vehicle gap.
- Reward shaping that balances energy efficiency, throughput, and safety; dynamic gap thresholds prevent degenerate, unsafe behaviors (see the controller sketch after this list).
- Decentralized deployment compatible with standard ACC hardware and radar sensors; no special infrastructure required.
- Large-scale field validation (MegaVanderTest): 100 vehicles during morning rush hours with millions of trajectories collected.
- Reported energy savings of up to ~20% in congested scenarios; notable reductions in speed/acceleration variance as a proxy for wave dampening.
- Observed in field data that AVs following at closer gaps can reduce energy use in the traffic behind them and shrink congestion footprints.
- Evidence of potential future gains with faster simulations, better human-driver models, and exploration of 5G-enabled coordination.
- The field test was conducted without explicit inter-AV communication, aligning with current autonomous-vehicle deployments.
- Integrated deployment with existing adaptive cruise control (ACC) systems to enable scale.
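As referenced in the reward-shaping bullet above, a decentralized controller of this kind can be sketched as a thin safety wrapper around a trained policy: local observations go in, and a clamped desired-speed command comes out for a stock ACC interface. The class, the `policy` callable, and the threshold values below are illustrative assumptions, not the deployed code.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    av_speed: float    # m/s, ego speed from the speedometer
    lead_speed: float  # m/s, estimated from radar
    gap: float         # m, distance to the lead vehicle from radar

def safe_desired_speed(policy, obs: Observation) -> float:
    """Hypothetical decentralized wrapper: local observations in,
    a clamped desired-speed command for a stock ACC interface out."""
    proposed = policy(obs)  # trained RL policy proposes a desired speed

    # Speed-dependent gap thresholds (illustrative values).
    min_gap = max(5.0, 0.5 * obs.av_speed)
    max_gap = max(30.0, 3.0 * obs.av_speed)

    if obs.gap < min_gap:
        # Too close: back off toward the lead vehicle's speed.
        return min(proposed, obs.lead_speed - 1.0)
    if obs.gap > max_gap:
        # Gap has opened up: close it to preserve throughput.
        return max(proposed, obs.lead_speed)
    return proposed
```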
Common use cases
- Smoothing traffic and reducing fuel consumption on congested highways with minimal infrastructure changes.
- Deploying RL-based traffic-smoothing controllers on existing AVs equipped with ACC to achieve broader energy efficiency benefits.
- Bridging simulation-to-reality gaps in mixed-autonomy traffic research and informing future large-scale experiments.
- Exploring data-driven human-driving models and advanced sensing methods to improve model fidelity and robustness.
Setup & installation
The source does not provide setup or installation commands. The work describes training RL agents in fast simulations built from real data, validating the controllers on hardware, and then deploying them on 100 vehicles; no command-level steps are included.
# Setup commands not provided in the source
Quick start
The source provides a high-level blueprint but does not include runnable code or a ready-to-run quickstart. A minimal outline derived from the content follows, with an illustrative sketch after it:
- Build fast, data-driven highway simulations using real trajectory data.
- Train RL agents to optimize energy efficiency while maintaining throughput and safety around human drivers.
- Validate controllers in hardware, then deploy on a small fleet of AVs.
- Collect and analyze field data to quantify energy savings and traffic smoothing effects.
# Quick start not provided by the source
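As a stand-in for that outline, here is a self-contained toy: a one-AV car-following simulation with a synthetic stop-and-go wave, and a linear policy tuned by random search in place of a full RL algorithm. Every constant and dynamic here is an assumption for illustration; the project's real simulator is built from I-24 trajectory data and is not published in the source.

```python
import random

DT = 0.5  # simulation timestep in seconds (illustrative)

def rollout(params, steps=400):
    """Simulate one AV behind a lead vehicle whose speed oscillates,
    emulating a stop-and-go wave. Returns the cumulative toy reward."""
    k_speed, k_gap, target_gap = params
    lead, av, gap = 25.0, 25.0, 30.0  # m/s, m/s, m
    total = 0.0
    for t in range(steps):
        # Lead vehicle alternates between ~33 and ~17 m/s every 20 s.
        lead = 25.0 + 8.0 * (1.0 if (t // 40) % 2 == 0 else -1.0)
        # Linear policy: track the lead speed and a target gap.
        accel = k_speed * (lead - av) + k_gap * (gap - target_gap)
        accel = max(-3.0, min(1.5, accel))  # comfort/safety limits
        av = max(0.0, av + accel * DT)
        gap += (lead - av) * DT
        # Toy reward: penalize acceleration (energy proxy) and unsafe gaps.
        total += -accel ** 2 - (100.0 if gap < 5.0 else 0.0)
        if gap < 1.0:  # near-collision ends the episode
            break
    return total

def random_search(iters=200, seed=0):
    """Crude policy search standing in for RL training."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(iters):
        params = (rng.uniform(0, 1), rng.uniform(0, 0.2), rng.uniform(10, 60))
        score = rollout(params)
        if score > best_score:
            best, best_score = params, score
    return best, best_score

if __name__ == "__main__":
    params, score = random_search()
    print("best (k_speed, k_gap, target_gap):", params, "score:", score)
```

The gap-tracking gain and target gap found by the search play the same qualitative role as the learned policy described above: smoothing accelerations behind a wave rather than mirroring the lead vehicle's oscillations.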
Pros and cons
- Pros
- Demonstrates scalable, decentralized control that can be deployed on standard AVs without new infrastructure.
- Evidence of meaningful energy savings (15–20%) around controlled vehicles in field data.
- Field deployment (MegaVanderTest) on 100 cars represents one of the largest mixed-autonomy experiments to date.
- Uses local sensor information, enabling deployment with existing radar-based sensing and ACC.
- Smoothing effects observed as reduced speed/acceleration variance, indicating dampened stop-and-go waves.
- Cons
- Bridging the sim-to-reality gap remains a challenge; the authors emphasize the need for faster, more accurate simulations and better human-driver models.
- Future gains may rely on enhanced data sharing or explicit inter-AV communication (e.g., over 5G), which was not deployed in the field test.
- The reward design requires careful balancing to avoid unsafe or suboptimal behaviors; dynamic gap thresholds are used to mitigate this risk.
Alternatives (brief comparisons)
| Approach | Key traits | Pros | Cons |
| - | - | - | - |
| Ramp metering / infrastructure control | Centralized, infrastructure-based traffic management | Can shape traffic at network scale without relying on vehicle penetration | Requires infrastructure, coordination, and investment |
| Variable speed limits | Infrastructure-based speed control on corridors | Simple policy that can reduce stop-and-go waves | Needs sensor/communication coverage; limited adaptivity to mixed autonomy |
| RL-based AV smoothing (this work) | Decentralized, vehicle-level control using local observations | Scales with vehicle adoption; can operate without new infrastructure; leverages existing ACC hardware | Sim-to-real challenges; benefits depend on AV penetration; field results depend on driver behavior behind AVs |
Pricing or License
Not specified in the source.