
SlowFast-LLaVA-1.5: Token-Efficient Video LLMs for Long-Form Understanding

Sources: https://machinelearning.apple.com/research/slowfast-llava, Apple ML Research

TL;DR

  • SlowFast-LLaVA-1.5 (SF-LLaVA-1.5) is a family of token-efficient video large language models for long-form video understanding. Apple ML Research
  • It integrates the two-stream SlowFast mechanism into a streamlined training pipeline and performs joint video–image training on a data mixture composed only of publicly available datasets. Apple ML Research
  • The design targets efficient model scales (1B and 3B) but demonstrates robust performance across 1B–7B, achieving state-of-the-art results on long-form video benchmarks. Apple ML Research
  • In addition to SF-LLaVA-1.5, the publication highlights related streaming and multimodal research, including StreamBridge for streaming Video-LLMs and egocentric video QA data generation efforts. Apple ML Research

Context and background

The SlowFast-LLaVA-1.5 family, published May 12, 2025 (research areas: Computer Vision; Methods and Algorithms), is presented as a token-efficient approach to long-form video understanding. Building on the SlowFast two-stream architecture, the authors use a streamlined training pipeline that jointly trains on video and image data drawn from a carefully curated set of publicly available datasets. This aligns with a broader Apple ML Research push toward efficient, scalable multimodal models that can operate effectively on devices with constrained resources.

The reported results show strong performance across model sizes from 1B to 7B parameters, including state-of-the-art results on long-form video benchmarks such as LongVideoBench and MLVU. The work reflects a clear design philosophy: achieve high accuracy while attending to compute and memory efficiency, and remain reproducible by relying only on public data. The page also references related work on streaming and egocentric multimodal understanding, notably StreamBridge, a framework that converts offline Video-LLMs into streaming-capable models supporting multi-turn real-time understanding and proactive responses. Apple ML Research

What’s new

  • Introduction of SlowFast-LLaVA-1.5 as a family of token-efficient video LLMs focused on long-form video understanding. Apple ML Research
  • Incorporation of the two-stream SlowFast mechanism into a streamlined, joint video–image training pipeline (see the sketch after this list). Apple ML Research
  • Training uses a data mixture composed exclusively of publicly available datasets, with emphasis on efficient scales (1B and 3B). Apple ML Research
  • Demonstrated strong performance across model sizes from 1B to 7B, achieving state-of-the-art results on long-form video benchmarks (LongVideoBench and MLVU) and solid performance on a range of video tasks. Apple ML Research
  • The publication also discusses broader multimodal research efforts, including StreamBridge for streaming adaptation of offline Video-LLMs and egocentric QA data generation (Ego4D) as part of related work. Apple ML Research
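
To make the two-stream idea concrete, here is a minimal sketch of how a SlowFast token layout can reduce the visual token count: a Slow pathway keeps a sparse set of frames at full spatial detail, while a Fast pathway keeps every frame with heavily pooled tokens. The strides, pooling factors, and shapes below are illustrative assumptions, not the configuration used by SF-LLaVA-1.5.

```python
import numpy as np

def slowfast_tokens(frames: np.ndarray, slow_stride: int = 4, fast_pool: int = 4) -> np.ndarray:
    """Combine a Slow and a Fast stream of visual tokens.

    frames: (T, N, D) array -- T frames, N tokens per frame, D feature dims.
    Hyperparameters here are hypothetical, chosen only for illustration.
    """
    T, N, D = frames.shape

    # Slow pathway: a sparse subset of frames at full token resolution
    # (preserves fine-grained spatial detail).
    slow = frames[::slow_stride]                               # (T/4, N, D)

    # Fast pathway: every frame, but adjacent tokens average-pooled
    # (preserves temporal coverage at low cost).
    fast = frames.reshape(T, N // fast_pool, fast_pool, D).mean(axis=2)

    # One flat token sequence for the LLM.
    return np.concatenate([slow.reshape(-1, D), fast.reshape(-1, D)])

# 32 frames x 196 tokens would be 6,272 raw tokens; the two streams
# together produce 8*196 + 32*49 = 3,136 tokens instead.
feats = np.random.randn(32, 196, 1024).astype(np.float32)
print(slowfast_tokens(feats).shape)  # (3136, 1024)
```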

Why it matters (impact for developers/enterprises)

  • Token efficiency and compact model scales (with 1B and 3B as the primary focus) make high-quality video understanding more accessible for mobile and edge deployments, reducing compute and memory demands while maintaining accuracy on long-form content. Apple ML Research
  • The demonstrated strength at 1B–7B scales suggests flexible deployment options for a range of applications, from video search and content moderation to interactive assistants that can reason over extended video streams. Apple ML Research
  • Streaming adaptations, as highlighted by related work like StreamBridge, point toward practical online inference scenarios where models need to process video streams in real time with memory-efficient mechanisms. Apple ML Research
  • The inclusion of publicly available data for training helps address reproducibility and accessibility concerns, enabling researchers and organizations to build and evaluate robust video LLMs without relying on proprietary datasets. Apple ML Research

Technical details and implementation

SF-LLaVA-1.5 builds on the SlowFast two-stream design, integrating it into a streamlined training workflow that jointly optimizes over video and image inputs. A carefully curated mixture of publicly accessible datasets forms the training data, with token efficiency and scalability as the guiding constraints. The emphasis on 1B and 3B scales targets mobile-friendly deployment while still delivering state-of-the-art performance on long-form video benchmarks such as LongVideoBench and MLVU, and results remain robust across model sizes from 1B to 7B, underscoring the approach's versatility.

Beyond SF-LLaVA-1.5, the source describes StreamBridge, a framework that transforms offline Video-LLMs into streaming-capable models. StreamBridge addresses two core challenges in online scenarios: limited capability for multi-turn real-time understanding and the lack of proactive response mechanisms. It combines a memory buffer with a round-decayed compression strategy to support streaming inference (see the sketch below), demonstrating Apple's broader interest in making video LLMs practical for real-time applications rather than only static evaluation benchmarks. The page also notes efforts toward multimodal egocentric video understanding, including the generation of 7 million QA samples from Ego4D data. Apple ML Research
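
The source names the memory buffer and round-decayed compression but does not spell out the algorithm, so the following is only a hypothetical sketch of one way such a scheme could work: each new conversation round shrinks the token budget of every earlier round, keeping recent context sharp and old context cheap. The class name, halving schedule, and `min_tokens` floor are all assumptions for illustration.

```python
from collections import deque

class RoundDecayedBuffer:
    """Hypothetical streaming memory: every earlier round loses half
    its tokens whenever a new round arrives (one guess at what a
    'round-decayed compression strategy' might look like)."""

    def __init__(self, min_tokens: int = 8):
        self.rounds = deque()          # per-round token lists, oldest first
        self.min_tokens = min_tokens   # floor so old rounds never vanish

    def add_round(self, tokens):
        # Decay all existing rounds before storing the new one.
        for i, old in enumerate(self.rounds):
            keep = max(self.min_tokens, len(old) // 2)
            stride = max(1, len(old) // keep)
            self.rounds[i] = old[::stride][:keep]  # uniform subsampling
        self.rounds.append(list(tokens))

    def context(self):
        # Flat token sequence handed to the model at each turn.
        return [t for r in self.rounds for t in r]

buf = RoundDecayedBuffer()
for r in range(4):
    buf.add_round(range(r * 100, r * 100 + 64))   # 64 tokens per round
print([len(r) for r in buf.rounds])  # [8, 16, 32, 64]: older rounds shrink
```

A real system would pool feature vectors rather than subsample token IDs, but the budget schedule is the point: total memory stays bounded as conversation rounds accumulate.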

Key table: model scales and capabilities

| Model size | Primary goal | Benchmarks noted | Notes |
|---|---|---|---|
| 1B | Mobile-friendly, efficient long-form understanding | LongVideoBench, MLVU | Strong baseline across video tasks |
| 3B | Balanced accuracy and efficiency | LongVideoBench, MLVU | Improved robustness across benchmarks |
| 7B | Full-precision long-form understanding | LongVideoBench, MLVU | Robust performance across tasks |

Key takeaways

  • SF-LLaVA-1.5 represents a token-efficient family of video LLMs tailored for long-form understanding, using public data for training.
  • The two-stream SlowFast integration enables effective video and image joint modeling within a streamlined pipeline.
  • Focusing on 1B and 3B scales supports mobile-friendly deployments while achieving state-of-the-art performance on long-form benchmarks.
  • The broader publication ecosystem includes StreamBridge for streaming adaptation and egocentric QA data generation, illustrating a wider push toward practical, real-time multimodal systems. Apple ML Research

FAQ

  • What is SF-LLaVA-1.5?

    It is a family of token-efficient video large language models designed for long-form video understanding, incorporating a SlowFast two-stream mechanism in a streamlined training pipeline with public data.

  • What data is used for training?

    The joint video–image training setup uses a data mixture composed only of publicly available datasets. [Apple ML Research](https://machinelearning.apple.com/research/slowfast-llava)

  • What model sizes are emphasized?

    The work focuses on efficient scales (1B and 3B) but demonstrates robust results across 1B–7B. [Apple ML Research](https://machinelearning.apple.com/research/slowfast-llava)

  • What is StreamBridge?

    StreamBridge is a framework to transform offline Video-LLMs into streaming-capable models, addressing real-time understanding and proactive response through a memory buffer and round-decayed compression. [Apple ML Research](https://machinelearning.apple.com/research/slowfast-llava)

  • Are there related multimodal efforts mentioned?

    Yes, the publication references ongoing work in egocentric video understanding with 7M QA samples for Ego4D and other streaming-enabled research. [Apple ML Research](https://machinelearning.apple.com/research/slowfast-llava)
