SlowFast-LLaVA-1.5: Token-Efficient Video LLMs for Long-Form Understanding
Sources: https://machinelearning.apple.com/research/slowfast-llava, Apple ML Research
TL;DR
- SlowFast-LLaVA-1.5 (SF-LLaVA-1.5) is a family of token-efficient video large language models for long-form video understanding.
- It integrates the two-stream SlowFast mechanism into a streamlined training pipeline and performs joint video–image training on a data mixture composed only of publicly available datasets.
- The design targets efficient model scales (1B and 3B) but demonstrates robust performance across 1B–7B, achieving state-of-the-art results on long-form video benchmarks.
- In addition to SF-LLaVA-1.5, the publication highlights related streaming and multimodal research, including StreamBridge for streaming Video-LLMs and egocentric video QA data generation efforts.
Context and background
The SlowFast-LLaVA-1.5 family, published May 12, 2025 under the Computer Vision and Methods and Algorithms research areas, is presented as a token-efficient approach to long-form video understanding. Building on the SlowFast two-stream architecture, the authors use a streamlined pipeline that jointly trains on video and image modalities with a carefully curated set of publicly available datasets. This aligns with a broader Apple ML Research push toward efficient, scalable multimodal models that can operate effectively on devices with constrained resources. Reported results show strong performance across a spectrum of tasks and model sizes, from 1B to 7B parameters, including state-of-the-art performance on long-form video benchmarks such as LongVideoBench and MLVU. The work underscores a design philosophy: achieve high accuracy with careful attention to compute and memory efficiency while remaining reproducible via public data.

The page also references related work on streaming capabilities and multimodal egocentric video understanding, reflecting a broader research agenda around real-time video comprehension and data collection strategies. In particular, it highlights StreamBridge, a framework that converts offline Video-LLMs into streaming-capable models, addressing multi-turn real-time understanding and proactive responses.
What’s new
- Introduction of SlowFast-LLaVA-1.5 as a family of token-efficient video LLMs focused on long-form video understanding.
- Incorporation of the two-stream SlowFast mechanism into a streamlined, joint video–image training pipeline.
- Training uses a data mixture composed exclusively of publicly available datasets, with emphasis on efficient scales (1B and 3B).
- Demonstrated strong performance across model sizes from 1B to 7B, achieving state-of-the-art results on long-form video benchmarks (LongVideoBench and MLVU) and solid performance on a range of video tasks.
- The publication also discusses broader multimodal research efforts, including StreamBridge for streaming adaptation of offline Video-LLMs and egocentric QA data generation (Ego4D) as part of related work.
Why it matters (impact for developers/enterprises)
- Token efficiency and compact model scales (1B–3B being a focus) make high-quality video understanding more accessible for mobile and edge deployments, reducing compute and memory demands without compromising accuracy on long-form content.
- The demonstrated strength at 1B–7B scales suggests flexible deployment options for a range of applications, from video search and content moderation to interactive assistants that can reason over extended video streams.
- Streaming adaptations, as highlighted by related work like StreamBridge, point toward practical online inference scenarios where models need to process video streams in real time with memory-efficient mechanisms.
- The inclusion of publicly available data for training helps address reproducibility and accessibility concerns, enabling researchers and organizations to build and evaluate robust video LLMs without relying on proprietary datasets.
Technical details and implementation
SF-LLaVA-1.5 builds on the SlowFast two-stream design, integrating it into a streamlined training workflow that jointly optimizes video and image inputs. A carefully curated mixture of publicly accessible datasets forms the training data, with an emphasis on token efficiency and scalability. The focus on 1B and 3B scales targets mobile-friendly deployment while still delivering state-of-the-art performance on long-form video benchmarks like LongVideoBench and MLVU; results indicate robust performance across model sizes from 1B to 7B, underscoring the approach's versatility.

In addition to the SF-LLaVA-1.5 work, the source describes StreamBridge, a framework that transforms offline Video-LLMs into streaming-capable models. StreamBridge addresses two core challenges in online scenarios: limited capability for multi-turn real-time understanding and the lack of proactive response mechanisms. It uses a memory buffer combined with a round-decayed compression strategy to support streaming inference, demonstrating Apple's broader interest in making video LLMs practical for real-time applications rather than only static evaluation benchmarks. The page also notes efforts toward multimodal egocentric video understanding, including the generation of 7 million QA samples from Ego4D data.
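The page does not give SF-LLaVA-1.5's exact frame counts or pooling sizes, but the general two-stream token layout can be illustrated with a minimal sketch: a slow pathway keeps a few frames at full spatial token resolution, while a fast pathway keeps every frame with aggressive spatial pooling. All hyperparameters below (stride, pool size, feature shapes) are illustrative assumptions, not the paper's values.

```python
import numpy as np

def slowfast_tokens(frames, slow_stride=8, fast_pool=4):
    """Split per-frame visual features into slow (few frames, full detail)
    and fast (all frames, spatially pooled) token streams.

    frames: (T, H, W, C) array of patch features.
    Hyperparameters are illustrative, not SF-LLaVA-1.5's actual values.
    """
    T, H, W, C = frames.shape
    # Slow pathway: every `slow_stride`-th frame keeps all H*W tokens.
    slow = frames[::slow_stride].reshape(-1, C)
    # Fast pathway: every frame, but average-pool spatially by `fast_pool`.
    pooled = frames.reshape(T, H // fast_pool, fast_pool,
                            W // fast_pool, fast_pool, C).mean(axis=(2, 4))
    fast = pooled.reshape(-1, C)
    # Concatenated token sequence handed to the LLM.
    return np.concatenate([slow, fast], axis=0)

# Example: 32 frames of 24x24 patch features, 768-dim.
feats = np.random.randn(32, 24, 24, 768).astype(np.float32)
tokens = slowfast_tokens(feats)
# Slow: 4 frames * 576 tokens = 2304; fast: 32 frames * 36 tokens = 1152.
print(tokens.shape)  # (3456, 768)
```

With these made-up settings the model sees 3,456 visual tokens instead of the 18,432 a naive per-frame encoding would produce, which is the kind of budget reduction "token-efficient" refers to.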
Key table: model scales and capabilities
| Model size | Primary goal | Benchmarks noted | Notes |
|---|---|---|---|
| 1B | Mobile-friendly, efficient long-form understanding | LongVideoBench, MLVU | Strong baseline across video tasks |
| 3B | Balanced accuracy and efficiency | LongVideoBench, MLVU | Improved robustness across benchmarks |
| 7B | Full-precision long-form understanding | LongVideoBench, MLVU | Robust performance across tasks |
Key takeaways
- SF-LLaVA-1.5 represents a token-efficient family of video LLMs tailored for long-form understanding, using public data for training.
- The two-stream SlowFast integration enables effective video and image joint modeling within a streamlined pipeline.
- Focusing on 1B and 3B scales supports mobile-friendly deployments while achieving state-of-the-art performance on long-form benchmarks.
- The broader publication ecosystem includes StreamBridge for streaming adaptation and egocentric QA data generation, illustrating a wider push toward practical, real-time multimodal systems.
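The source describes StreamBridge's memory buffer and round-decayed compression only at a high level. As a hedged illustration of the general idea, the toy buffer below gives each older conversation round an exponentially smaller token budget; the class name, decay schedule, and uniform subsampling are assumptions for the sketch, not StreamBridge's actual mechanism.

```python
from collections import deque

class RoundDecayedBuffer:
    """Toy streaming memory buffer with round-decayed compression:
    older rounds keep exponentially fewer tokens. Illustrative only."""

    def __init__(self, base_tokens=256, decay=0.5, min_tokens=8):
        self.rounds = deque()        # token lists, newest last
        self.base_tokens = base_tokens
        self.decay = decay
        self.min_tokens = min_tokens

    def add_round(self, tokens):
        self.rounds.append(list(tokens))
        self._compress()

    def _compress(self):
        # Newest round keeps base_tokens; each older round keeps a
        # decayed budget, floored at min_tokens.
        for age, rnd in enumerate(reversed(self.rounds)):
            budget = max(int(self.base_tokens * self.decay ** age),
                         self.min_tokens)
            if len(rnd) > budget:
                # Uniform subsampling as a stand-in for learned compression.
                stride = len(rnd) / budget
                rnd[:] = [rnd[int(i * stride)] for i in range(budget)]

    def context(self):
        # Flattened context the model would attend over.
        return [tok for rnd in self.rounds for tok in rnd]

buf = RoundDecayedBuffer()
for _ in range(3):
    buf.add_round(range(256))
print(len(buf.context()))  # 448: budgets 256 + 128 + 64
```

The design point is that context growth becomes roughly geometric rather than linear in the number of rounds, which is what makes long multi-turn streaming sessions tractable.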
FAQ
- **What is SF-LLaVA-1.5?** It is a family of token-efficient video large language models designed for long-form video understanding, incorporating a SlowFast two-stream mechanism in a streamlined training pipeline with public data.
- **What data is used for training?** The joint video–image training setup uses a data mixture composed only of publicly available datasets. [Apple ML Research](https://machinelearning.apple.com/research/slowfast-llava)
- **What model sizes are emphasized?** The work focuses on efficient scales (1B and 3B) but demonstrates robust results across 1B–7B. [Apple ML Research](https://machinelearning.apple.com/research/slowfast-llava)
- **What is StreamBridge?** StreamBridge is a framework that transforms offline Video-LLMs into streaming-capable models, addressing real-time understanding and proactive response through a memory buffer and round-decayed compression. [Apple ML Research](https://machinelearning.apple.com/research/slowfast-llava)
- **Are there related multimodal efforts mentioned?** Yes, the publication references ongoing work in egocentric video understanding with 7M QA samples for Ego4D and other streaming-enabled research. [Apple ML Research](https://machinelearning.apple.com/research/slowfast-llava)