Vision Language Model Alignment in TRL: GRPO, GSPO, and MPO
Sources: https://huggingface.co/blog/trl-vlm-alignment, Hugging Face Blog
Overview
Vision Language Models (VLMs) are becoming increasingly capable, but aligning them to human preferences remains crucial for reliable real-world use. In TRL, we previously showed post-training of VLMs with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). The latest developments add two new multimodal alignment methods, Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO), as well as Mixed Preference Optimization (MPO). These methods extract more signal from preference data and scale better with modern VLMs. TRL now also provides native SFT support for vision-language models, and the project releases training scripts and demo notebooks to get started quickly.

DPO optimizes preferences between pairs of model responses using a contrastive loss over chosen and rejected answers. While DPO remains a strong baseline, GRPO, GSPO, and MPO introduce richer signal and greater stability for multimodal settings. MPO, in particular, extends DPO with multiple losses: the DPO preference loss (sigmoid), a quality loss from Binary Classifier Optimization (BCO), and a generation loss from SFT. This combined loss can yield notable improvements (e.g., a reported 6.2-point gain on MathVista in the referenced work). The TRL team has added support for this combined loss in the DPOTrainer class, enabling easier experimentation, and a complete notebook demonstrates how to use MPO in practice.

GRPO (Group Relative Policy Optimization) is a cutting-edge alignment method originally introduced in DeepSeek Math and later integrated into DeepSeek R1. It augments PPO by performing policy updates over groups of trajectories (i.e., batches of dialogue rollouts), which helps average out reward noise and encourages a broader notion of a good response rather than chasing single high-reward samples. TRL adds GRPO support for vision-language models, with reward functions crafted to capture the aspects of interest in the multimodal setting. You can run training by constructing a GRPOConfig and a GRPOTrainer, supplying the reward functions, and invoking train(). A full notebook example is provided for reference.

GSPO (Group Sequence Policy Optimization) is a variant of GRPO that computes importance-sampling weights at the sequence level rather than per token. This tends to offer more stable training, and its benefits are particularly relevant for models with Mixture-of-Experts (MoE) architectures. TRL's latest release includes GSPO with multimodal support, and the trainer follows a setup parallel to GRPO with additional parameters drawn from the original paper. A concise walkthrough is available in the accompanying notebook.

Together with native SFT support for VLMs, these methods give practitioners a palette of choices for aligning multimodal models with human preferences while addressing the weaknesses observed with SFT-only or standard DPO setups. The blog also includes a comparison table highlighting how model responses differ under these methods.
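To make the GRPO workflow described above concrete, here is a minimal sketch, assuming a recent TRL release with multimodal GRPO support; the model id, dataset name, and reward function are illustrative assumptions rather than the blog's exact recipe.

```python
# Minimal GRPO sketch for a VLM. Dataset name and model id are placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical dataset in TRL's standard format (a "prompt" column plus an image
# column), so the reward function receives completions as plain strings.
dataset = load_dataset("your-org/vlm-grpo-prompts", split="train")

def format_reward(completions, **kwargs):
    """Toy reward: prefer completions that wrap their reasoning in <think> tags."""
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-vlm",
    per_device_train_batch_size=4,   # kept compatible with the group size below
    num_generations=4,               # rollouts per prompt, i.e. the "group"
    max_completion_length=256,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # any VLM supported by TRL
    reward_funcs=format_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

The reward function here is deliberately simple; in practice, rewards should be crafted to capture the multimodal aspects of interest, such as whether an answer is grounded in the image.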
Earlier attempts to align VLMs with instruction-following via SFT can suffer from distribution shifts when reasoning tasks are required. DPO improves alignment with preferences but can lead to less coherent rationales and repetitive outputs. MPO aims to balance these aspects by combining losses that encourage quality generation, preference alignment, and fluent responses. The combination is designed to scale to larger, more diverse multimodal datasets and models.
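Since MPO is positioned as the remedy for these SFT and DPO shortcomings, the following is a minimal sketch of the combined loss in TRL's DPOTrainer, assuming a recent release in which DPOConfig accepts a list of loss types with per-loss weights; the model, dataset name, and weights are illustrative assumptions, not the blog's exact recipe.

```python
from datasets import load_dataset
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"          # illustrative VLM choice
model = AutoModelForImageTextToText.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Hypothetical multimodal preference dataset with prompt/chosen/rejected columns
# plus images.
dataset = load_dataset("your-org/vlm-preferences", split="train")

training_args = DPOConfig(
    output_dir="mpo-vlm",
    # MPO = weighted mix of the DPO sigmoid loss, the BCO quality loss,
    # and an SFT generation loss; the weights here are illustrative.
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=processor,
)
trainer.train()
```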
Key features
- Multimodal alignment methods: GRPO, GSPO, and MPO for VLMs.
- MPO combines three losses: DPO preference loss (sigmoid), BCO quality loss, and SFT generation loss.
- MPO has demonstrated performance gains (e.g., 6.2 points on MathVista in the cited work).
- DPOTrainer enhancements: MPO can be used by enabling a combined loss in DPOConfig and DPOTrainer.
- GRPO extends PPO with group-based trajectory updates that reduce reward noise and encourage broader notions of good responses.
- GSPO improves stability by computing importance-sampling weights at the sequence level, which is especially relevant for MoE-style models (see the configuration sketch after this list).
- Native SFT support for vision-language models in TRL, with training scripts and demo notebooks.
- Practical guidance via notebooks, including a complete example for multimodal alignment.
- Candid discussion of limits: SFT-only alignment may underperform on tasks requiring reasoning; DPO can produce repetitive or less coherent rationales; MPO addresses these gaps.
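As a rough illustration of the GSPO switch mentioned in the list above, the configuration below assumes a recent TRL release in which GRPOConfig exposes an importance_sampling_level option; the hyperparameter values are illustrative and only loosely follow the ranges discussed in the GSPO paper.

```python
from trl import GRPOConfig

# GSPO reuses the GRPO trainer; only the configuration changes.
training_args = GRPOConfig(
    output_dir="gspo-vlm",
    importance_sampling_level="sequence",  # sequence-level ratios instead of per-token
    epsilon=3e-4,                          # illustrative clipping range
    epsilon_high=4e-4,
    beta=0.0,                              # drop the KL penalty in this sketch
    num_generations=4,
    per_device_train_batch_size=4,
    max_completion_length=256,
)
# Pass this config to a GRPOTrainer exactly as in the GRPO sketch in the Overview.
```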
Common use cases
- Align VLMs to human preferences for tasks requiring both vision and language, such as multimodal instruction following, reasoning with image contexts, and justification generation.
- Leverage richer preference signals to improve beyond pairwise comparisons, especially with large, diverse multimodal datasets.
- Mitigate distribution shifts associated with SFT-only pipelines by incorporating policy optimization methods (GRPO/GSPO) and multimodal-specific signals (MPO).
- Improve coherence and reduce repetition in model rationales by combining objective components (DPO, BCO quality, SFT generation loss).
- Scale experiments to larger models and datasets using the group-based or sequence-based optimization updates that GRPO/GSPO provide.
- Validate approaches with dedicated notebooks and examples that accompany the TRL releases.
Setup & installation
Setup and installation specifics are not provided in the source excerpt. The TRL project ships training scripts and notebooks for experimenting with GRPO, GSPO, MPO, and SFT on VLMs, but exact commands, environments, and dependencies are not enumerated here; see the referenced blog and the accompanying notebooks for concrete guidance.
Quick start
A minimal runnable example is not included in the excerpt. The blog points to a complete notebook that shows how to initialize DPOConfig and DPOTrainer for MPO and walks through the GRPO/GSPO workflows with sample reward functions; refer to the notebook linked in the post for a hands-on start.
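In the meantime, a native VLM SFT run in TRL might look like the minimal sketch below; it assumes TRL is installed (for example via pip install trl), that the dataset is in TRL's conversational format with an images column, and that the model id and dataset name are illustrative placeholders.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical conversational vision dataset ("messages" plus an "images" column).
dataset = load_dataset("your-org/vlm-instruct-mix", split="train")

training_args = SFTConfig(
    output_dir="sft-vlm",
    per_device_train_batch_size=2,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # any VLM with native SFTTrainer support
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```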
Pros and cons
- Pros
- GRPO reduces reward noise by updating over groups of trajectories, promoting a broader notion of good responses.
- GSPO provides sequence-level stability in importance sampling, with particular relevance to MoE models.
- MPO yields richer training signals by combining DPO, BCO, and SFT losses, potentially improving multimodal tasks (e.g., MathVista).
- Native SFT support for VLMs simplifies end-to-end training pipelines.
- These methods are designed for the scale and data diversity of modern VLMs, and notebooks are available to facilitate experimentation.
- Cons
- A full GRPO training script is not included in the post; users must rely on the notebooks to implement the workflow.
- SFT-only alignment can fall short on tasks that require reasoning, and DPO-based alignment can hurt rationale coherence, underscoring the need for MPO or other multimodal objectives.
- Multimodal alignment training can require large models, extensive data, and substantial compute; reported results come from limited experiments, so generalization should be assessed with care.
Alternatives (brief comparisons)
| Method | Core idea | Strengths | Trade-offs |
|---|---|---|---|
| SFT | Supervised fine-tuning on instructions | Simple to implement; aligns to written instructions | May suffer from distribution shift on reasoning tasks; lacks explicit preference modeling |
| DPO | Pairwise preference optimization | Directly optimizes user preferences; strong baseline | Can produce less coherent rationales or repetitive outputs |
| MPO | DPO + BCO + SFT losses | Rich multimodal objective; reported gains on MathVista | More complex to tune; requires careful balancing of losses |
| GRPO | Grouped PPO-style updates | Robust to reward noise; encourages broader notions of good responses | Full training script not always provided; group design choices matter |
| GSPO | Sequence-level importance sampling variant of GRPO | More stable training; well suited to MoE models | Requires sequence-level weight estimation; may be more complex to implement |
Pricing or License
License and pricing information are not specified in the source. For licensing and usage terms, consult the TRL repository or the Hugging Face blog linked below.
References
- Vision Language Model Alignment in TRL (Hugging Face Blog): https://huggingface.co/blog/trl-vlm-alignment