Seq vs Seq: Ettin — Paired Encoders and Decoders Redefine Open-Data LLM Benchmarks
Sources: https://huggingface.co/blog/ettin, Hugging Face Blog
TL;DR
- Ettin introduces the first state-of-the-art paired encoder-only and decoder-only models trained with identical data, model shapes, and training recipes [source](https://huggingface.co/blog/ettin).
- The suite covers six scales from 17M to 1B parameters for both encoders and decoders [source](https://huggingface.co/blog/ettin).
- Training follows a three-phase recipe: pre-training on 1.7T tokens with shorter contexts, context extension to 8K tokens over 250B tokens, and a final decay phase on 100B tokens from premium sources [source](https://huggingface.co/blog/ettin).
- All training data is public and reproducible, enabling apples-to-apples comparisons; encoders excel at classification and retrieval, while decoders lead in generation source.
- In benchmarks, Ettin models beat or match baselines like Llama 3.2 1B and SmolLM2 on key tasks, with strong gains on knowledge-intensive tasks such as SciQ [source](https://huggingface.co/blog/ettin).
Context and background
The LLM community has largely converged on decoder-only models (GPT, Llama, Qwen) for generation, while encoder-only models (like BERT) remain central to production tasks such as classification, retrieval, and embeddings. Encoders are typically faster and more memory-efficient for discriminative tasks, but historically lagged in generative capability compared to decoders. Ettin situates itself as a controlled, apples-to-apples comparison by training encoder-only and decoder-only architectures under the same data, the same model shapes, and identical training recipes; the only differences are attention patterns and the training objectives. The project builds on the ModernBERT recipe, which adapted modern techniques from decoder training to encoder training, providing a strong, shared foundation for both architectures. The work is named after the two-headed Norse giant Ettin to reflect its paired-architecture approach. All training data used in Ettin is public and reproducible, reinforcing the emphasis on open science and reproducibility within the community. You can continue to train these models on new data or propose a new recipe to push results further [source](https://huggingface.co/blog/ettin).
What’s new
Ettin presents a coherent suite of six model sizes for each architecture, encoder-only and decoder-only, ranging from 17M up to 1B parameters. The same data, model shapes, and training recipes are used across architectures to enable fair comparisons of how attention patterns and objectives shape learning outcomes. Specifically:
- Two training objectives are explored in a controlled setting: masked language modeling (MLM) for encoders and causal language modeling (CLM) for decoders. The same data and recipe are applied to both, allowing apples-to-apples analysis of architecture vs objective.
- The data regime is designed to maximize realism and openness: training data is public and reproducible, enabling users to reproduce results or extend training with new data.
- Ettin’s three-phase training approach mirrors the ModernBERT lineage while extending it to cover longer contexts and higher-quality data filters. Phase 1 pre-trains on a diverse mix of sources with 1.7T tokens and short 1024-token contexts to establish foundational knowledge. Phase 2 extends context length to 8K tokens over 250B tokens of higher-quality filtered data. Phase 3 completes with 100B tokens drawn from premium sources (including scientific papers and textbooks) while gradually decaying the learning rate [source](https://huggingface.co/blog/ettin).
- The encoder and decoder models deliver state-of-the-art performance for open-data models across tasks and sizes. On multiple benchmarks, Ettin encoders outperform ModernBERT, while Ettin decoders match or exceed established baselines such as Llama 3.2 1B and SmolLM2, with pronounced gains on knowledge-intensive tasks like SciQ.
- The results emphasize fundamental architectural advantages that persist when data, recipes, and scales are controlled: encoders excel at classification and retrieval, while decoders maintain an edge in generation as model size grows. Notably, with data and recipes held nearly identical, the remaining performance gaps are attributable primarily to architectural choices rather than to the training objective alone.
- The project also experiments with switching objectives after initial training (continuing with the opposite objective for an additional 50B tokens) to probe how architecture choice shapes learning, with effects visible not only on standard metrics but also on bias benchmarks such as Winogender. This demonstrates that the architecture matters as a fundamental factor in learning, not just the chosen objective [source](https://huggingface.co/blog/ettin).
- For practitioners and researchers, Ettin provides practical boilerplates and a clear path to reproduce or build upon the work. The authors explicitly invite users to try the models and consider downstream tasks such as classification, retrieval, and generation using encoder-based or decoder-based workflows; a minimal loading sketch follows this list [source](https://huggingface.co/blog/ettin).
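To make the paired setup concrete, here is a minimal loading sketch for one encoder/decoder pair. It assumes checkpoints published on the Hugging Face Hub under IDs of the form jhu-clsp/ettin-encoder-150m and jhu-clsp/ettin-decoder-150m; treat these names as placeholders and confirm the exact repository IDs and sizes in the blog post.

```python
# Minimal sketch: load one Ettin encoder/decoder pair with the standard transformers API.
# The model IDs below are assumptions; verify the exact Hub repository names first.
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

ENCODER_ID = "jhu-clsp/ettin-encoder-150m"  # encoder-only, trained with MLM (assumed ID)
DECODER_ID = "jhu-clsp/ettin-decoder-150m"  # decoder-only, trained with CLM (assumed ID)

# Encoder: fill-in-the-blank usage (also the backbone for classification/retrieval).
enc_tok = AutoTokenizer.from_pretrained(ENCODER_ID)
encoder = AutoModelForMaskedLM.from_pretrained(ENCODER_ID)
masked = f"The capital of France is {enc_tok.mask_token}."
enc_out = encoder(**enc_tok(masked, return_tensors="pt"))
print(enc_out.logits.shape)  # (1, seq_len, vocab_size)

# Decoder: autoregressive generation.
dec_tok = AutoTokenizer.from_pretrained(DECODER_ID)
decoder = AutoModelForCausalLM.from_pretrained(DECODER_ID)
prompt = dec_tok("The capital of France is", return_tensors="pt")
generated = decoder.generate(**prompt, max_new_tokens=16)
print(dec_tok.decode(generated[0], skip_special_tokens=True))
```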
Why it matters (impact for developers/enterprises)
The core contribution of Ettin is to enable fair, apples-to-apples comparisons between encoder and decoder architectures under identical data and training recipes. This clarity helps teams choose the right backbone for a given application, whether it is a fast, on-device discriminative model or a capable generative system that can handle long-context tasks. The open-data nature of Ettin lowers the barrier to entry for research and production teams seeking to adapt, reproduce, or extend high-quality models without relying on proprietary data. In practice, enterprises can assess: (1) which architecture best suits their primary tasks (classification, retrieval vs. generation), (2) how scale influences performance within a fixed data regime, and (3) how training objectives might shape model behavior beyond accuracy, including bias-related aspects that are increasingly scrutinized in deployment scenarios [source](https://huggingface.co/blog/ettin).
Technical details and implementation
The Ettin suite comprises six model scales for each architecture (encoder and decoder), spanning 17M to 1B parameters. Across both families, training uses an identical data mix, identical model shapes, and the same training recipe to guarantee apples-to-apples comparisons. The three-phase training regimen is a central feature: 1) pre-training on 1.7T tokens with shorter 1024-token contexts to establish robust foundations; 2) context extension to 8K tokens over 250B tokens of higher-quality filtered data to capture longer-range dependencies; and 3) a decay phase with 100B tokens from premium sources, while progressively reducing the learning rate. A key experimental twist involved continuing training for 50B tokens with the opposite objective (CLM on encoders, MLM on decoders) to study how objectives shape learning under the same architecture and data regime. The open nature of Ettin’s data and training ingredients means researchers can explore architecture-specific advantages and bias dynamics with transparent inputs [source](https://huggingface.co/blog/ettin).
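To illustrate the "same data, different objective" idea, the following hedged sketch builds MLM and CLM training batches from an identical text stream using Hugging Face's standard collator; it mirrors the contrast described above rather than the project's actual training code, and the tokenizer and masking rate are placeholder assumptions.

```python
# Sketch: one data stream, two objectives (MLM for encoders, CLM for decoders).
# Illustrative only; the placeholder tokenizer and 30% masking rate are assumptions.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer
texts = ["Encoders excel at retrieval.", "Decoders excel at generation."]
features = [tokenizer(t) for t in texts]

# Masked language modeling: a fraction of tokens is replaced with the mask token;
# labels hold the original tokens at masked positions (-100 elsewhere).
mlm_collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.3)
mlm_batch = mlm_collator(features)

# Causal language modeling: labels mirror the inputs; the model predicts each
# token from the tokens to its left (the shift happens inside the model).
clm_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
clm_batch = clm_collator(features)

print(mlm_batch["input_ids"][0])  # contains mask tokens
print(clm_batch["labels"][0])     # mirrors the input_ids (padding set to -100)
```

The same contrast applies to the cross-objective experiment: keep the architecture fixed and swap which collator feeds the continued 50B-token training run.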
Key takeaways
- Ettin establishes a fair apples-to-apples framework for comparing encoder and decoder architectures, using identical data, shapes, and recipes.
- Encoders tend to dominate classification and retrieval tasks, even at smaller sizes, while decoders maintain advantages in generation, particularly as scale increases.
- The six-scale coverage (17M–1B) for both encoders and decoders enables a wide range of deployment options, from fast on-device models to larger, high-performance systems.
- All training data for Ettin is public and reproducible, lowering the barrier to replication, extension, and community-driven improvements.
- A deliberate experiment swapping objectives after initial training highlights that architecture choice matters beyond the objective alone, with measurable effects on learning dynamics and downstream behavior, including bias benchmarks.
- The Ettin suite provides practical boilerplates and a clear path for practitioners to try open-data models across tasks, including classification, retrieval, and generation pipelines; see the embedding sketch after this list [source](https://huggingface.co/blog/ettin).
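As a small downstream illustration of the encoder side, here is a hedged retrieval-style sketch that mean-pools encoder hidden states into sentence embeddings; the model ID and pooling choice are assumptions made for illustration, not the blog's prescribed fine-tuning recipe.

```python
# Sketch: retrieval-style embeddings from an Ettin encoder via mean pooling.
# The Hub ID and pooling scheme are illustrative assumptions, not an official recipe.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jhu-clsp/ettin-encoder-150m"  # assumed Hub ID; verify before use
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def embed(texts):
    """Mean-pool the last hidden state over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

docs = ["Ettin pairs encoders and decoders.", "HENS forecasts extreme weather."]
query_vec = embed(["Which models pair encoders with decoders?"])
scores = torch.nn.functional.cosine_similarity(query_vec, embed(docs))
print(scores)  # the more relevant document should score higher
```

For classification, the same backbone would typically be loaded with AutoModelForSequenceClassification and fine-tuned on labeled data.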
FAQ
- What is Ettin?
Ettin is described as the first suite of state-of-the-art paired encoder-only and decoder-only models trained using identical data, model shapes, and training recipes, differing only in attention patterns and objectives [source](https://huggingface.co/blog/ettin).
- How many model sizes does Ettin include, and what range do they cover?
The complete suite includes six scales for both encoders and decoders, ranging from 17M to 1B parameters [source](https://huggingface.co/blog/ettin).
- What data regime is used for training?
Ettin uses a three-phase training approach with 1.7T tokens for pre-training, 250B tokens for context extension to 8K, and 100B tokens during the decay phase from premium data sources; all data is public and reproducible [source](https://huggingface.co/blog/ettin).
- How do the results compare to established baselines?
Ettin encoders outperform ModernBERT on various tasks, while Ettin decoders outperform or match baselines such as Llama 3.2 and SmolLM2, especially on knowledge-intensive tasks like SciQ [source](https://huggingface.co/blog/ettin).
- Can I reproduce or try these models myself?
Yes. The authors provide boilerplates at the end of the blog post, and all training data is public, enabling others to reproduce or extend the work [source](https://huggingface.co/blog/ettin).
References
- Hugging Face Blog, Ettin: https://huggingface.co/blog/ettin
More news
First look at the Google Home app powered by Gemini
The Verge reports Google is updating the Google Home app to bring Gemini features, including an Ask Home search bar, a redesigned UI, and Gemini-driven controls for the home.
Shadow Leak shows how ChatGPT agents can exfiltrate Gmail data via prompt injection
Security researchers demonstrated a prompt-injection attack called Shadow Leak that leveraged ChatGPT’s Deep Research to covertly extract data from a Gmail inbox. OpenAI patched the flaw; the case highlights risks of agentic AI.
Predict Extreme Weather in Minutes Without a Supercomputer: Huge Ensembles (HENS)
NVIDIA and Berkeley Lab unveil Huge Ensembles (HENS), an open-source AI tool that forecasts low-likelihood, high-impact weather events using 27,000 years of data, with ready-to-run options.
Scaleway Joins Hugging Face Inference Providers for Serverless, Low-Latency Inference
Scaleway is now a supported Inference Provider on the Hugging Face Hub, enabling serverless inference directly on model pages with JS and Python SDKs. Access popular open-weight models and enjoy scalable, low-latency AI workflows.
Google expands Gemini in Chrome with cross-platform rollout and no membership fee
Gemini AI in Chrome gains access to tabs, history, and Google properties, rolling out to Mac and Windows in the US without a fee, and enabling task automation and Workspace integrations.
How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo
NVIDIA Dynamo offloads KV Cache from GPU memory to cost-efficient storage, enabling longer context windows, higher concurrency, and lower inference costs for large-scale LLMs and generative AI workloads.