How Meta Keeps Its AI Hardware Reliable
Source: https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable (engineering.fb.com)
TL;DR
- Meta operates a global AI infrastructure with thousands of hardware components, designed for training large-scale models and advanced AI applications, relying on specialized file systems and PyTorch workloads.
- Hardware issues fall into three broad categories: binary hardware failures, transient errors, and silent data corruptions (SDCs), with SDCs posing unique challenges at scale.
- Detection mechanisms and telemetry, including RAS telemetry, reduce downtime and enable rapid triage, while mitigation strategies for SDCs are applied at the cluster level.
- Case studies from Meta’s Llama 3 models show that hardware issues in SRAMs, HBMs, processing grids, and network switches significantly affect reliability, highlighting the need for robust detection and repair pipelines.
Context and background
Meta’s AI infrastructure is a global ecosystem of hardware components and servers connected through network fabric across distributed data centers. The stack integrates storage, compute, and network architectures with unique file systems and PyTorch applications tailored for both training and inference workloads. This setup supports training large-scale models and advanced AI applications such as text-to-image generation and object segmentation. Since 2018, Meta has pursued a reliability journey that has revealed novel failure modes across disks, CPUs, memories, switches, GPUs, ASICs, and networks, often leading the industry in discovering new failure types. The goal has been to establish mitigation policies that sustain smooth operation and availability for billions of users and thousands of internal use cases.
What’s new
Meta emphasizes that large AI training clusters operate with thousands of accelerators in a synchronous environment, so a failure in any single component can interrupt training. The focus is on reducing hardware failures during training through detection, diagnostics, and rapid restarts with healthy servers and accelerators, which requires optimizing fault categorization, device triage, node selection, cluster validation, and checkpoint restore. Lessons drawn from running the Llama 3 herd of models show that hardware faults in SRAMs, HBMs, processing grids, and network switch hardware significantly impact AI cluster reliability, with over 66% of training interruptions attributed to such failures. Scale also introduces further challenges: accelerators can be less reliable than CPUs because of their complexity and limited telemetry, network complexity can cause failures to be misattributed, and GPU software stack errors may require extensive configuration work to correct. Reducing hardware and configuration failures is therefore central to improving cluster efficiency.

A key takeaway is the classification of hardware faults into three broad categories:

- Hardware failures are often binary states: a device powers on or off. While they become more frequent in larger fleets, they remain easier to triage, root-cause, and repair with simple health checks.
- Transient errors are reproducible only in a limited sense and can be load-dependent or partially observable, such as issues caused by thermal runaway or random uncorrectable errors. Mitigation involves understanding the conditions under which these faults manifest and using larger-scale data to aid triage and pattern matching, including "traps" that trigger faults under controlled conditions. Advances in RAS (reliability, availability, and serviceability) telemetry in hyperscale infrastructure have greatly improved this process.
- Silent errors or silent data corruptions (SDCs) occur when hardware miscomputes without leaving detectable traces, producing incorrect results that can go unnoticed until dramatic deviations are observed. Detecting SDCs requires extensive engineering and costly telemetry to trace data corruption back to specific devices.

Case studies show that SDCs are a real concern in hyperscale infrastructures, and the frequency of silent faults rises as silicon density increases in accelerators. Historically, soft-error-related bitflips were reduced to about one fault per million devices, but SDCs now occur at roughly one fault per thousand devices, a much higher rate. Because SDCs depend on data values, device voltage, frequency, operating temperature, and wear over the life cycle, they create a data-dependent, difficult-to-test problem that demands continuous, periodic testing across a random state space throughout a device's life. To protect applications from SDCs, Meta employs the detection mechanisms described in its papers "Detecting Silent Errors in the Wild" and "Hardware Sentinel," which together provide substantial fleet-wide coverage for detecting and protecting infrastructure against SDCs.

In AI training, SDCs can cause incorrect computations during forward and backward passes, diverging the run from the intended training path and reducing training efficacy. Inference workloads also suffer, as incorrect results can propagate to thousands of inference consumers and affect systems such as recommendation engines or LLM outputs. In training, NaN propagation and corrupted gradient variance are the two most common SDC manifestations:
- NaN propagation: a corruption turns a representable value into an invalid representation, generating NaNs that cascade across a training iteration and across board and domain boundaries; resolving the issue often requires identifying and quarantining the offending accelerator and nodes.
- Corrupted gradient variance: corrupt values are exchanged as if they were true values, which can cause gradient explosion or implosion or trap the algorithm in local minima, progressively degrading training progress. These issues can be time-delayed and harder to trace, potentially requiring long restarts and significant debugging. A minimal detection sketch for both cases follows this list.
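As an illustration only, the following is a minimal, hypothetical PyTorch sketch (not Meta's production tooling) of in-training checks that surface these two manifestations early: it scans gradients for non-finite values after the backward pass and tracks a running gradient-norm baseline to flag sudden variance anomalies before the optimizer step. The window size, spike factor, and response policy are assumptions for the sketch.

```python
import math
import torch

def check_gradients(model, grad_norm_history, window=100, spike_factor=10.0):
    """Return (ok, reason). Flags NaN/Inf gradients and abrupt gradient-norm spikes.

    Illustrative sketch: thresholds, the spike heuristic, and what to do on a
    failure (skip the step, restore a checkpoint, quarantine the node) are
    assumptions, not Meta's actual policy.
    """
    total_sq = 0.0
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        grad = param.grad
        # NaN-propagation check: any non-finite value poisons the whole update.
        if not torch.isfinite(grad).all():
            return False, f"non-finite gradient in {name}"
        total_sq += grad.float().pow(2).sum().item()

    grad_norm = math.sqrt(total_sq)

    # Corrupted-gradient-variance check: compare against a moving baseline.
    if len(grad_norm_history) >= window:
        baseline = sum(grad_norm_history[-window:]) / window
        if baseline > 0 and grad_norm > spike_factor * baseline:
            return False, f"gradient norm spike: {grad_norm:.3e} vs baseline {baseline:.3e}"

    grad_norm_history.append(grad_norm)
    return True, "ok"


# Usage inside a training loop (sketch):
#   loss.backward()
#   ok, reason = check_gradients(model, history)
#   if ok:
#       optimizer.step()
#   else:
#       # e.g., skip the step or restore the last known-good checkpoint,
#       # and report the node/accelerator for triage.
#       ...
```

A check like this only localizes the symptom; attributing it to a specific accelerator still depends on the fleet-level telemetry and triage the article describes.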
Why it matters (impact for developers/enterprises)
The scale and complexity of Meta’s AI workloads mean that undetected SDCs and misattributed faults can cause substantial computational waste, extended training times, and degraded model quality. Training with thousands of accelerators in sync means that a single faulty device can halt or derail an entire training run, while silent faults can masquerade as progress, making root cause analysis extremely challenging. For enterprises running large AI pipelines, the implication is clear: robust fault detection, rapid triage, and repeatable testing are essential to maintain reliable training and inference at scale. The combination of infrastructure-level telemetry, structured fault categorization, and specialized detection mechanisms helps reduce downtime and improves resilience for production AI workloads.
Technical details and implementation
The implementation framework Meta describes rests on three fault categories and a multi-layered detection approach.

First, hardware failures present as binary states: a device powers on or powers off, so simple health checks can verify presence and configuration. Although such faults become more frequent as configurations and device counts grow, they remain tractable to triage and repair in large fleets.

Second, transient errors are characterized by limited reproducibility and can be load-dependent or partially observable. By analyzing the conditions under which they manifest and leveraging large-scale telemetry, Meta can set traps and trigger mitigations when needed, especially during non-production testing where artificial workloads make faults more repeatable. Advances in RAS telemetry have significantly improved triage speed and accuracy in hyperscale environments.

Third, silent data corruptions (SDCs) are the most challenging: miscomputations leave no obvious traces and can propagate through training and inference. Meta notes that SDCs require extensive engineering and costly telemetry to trace data corruption back to a particular device. The detection framework includes the mechanisms described in the "Detecting Silent Errors in the Wild" and "Hardware Sentinel" papers to achieve strong fleet-wide SDC coverage at scale. The two most common SDC cases in training are NaN propagation and corrupted gradient variance, each with distinct symptoms and debugging challenges: NaNs can propagate across the training path, while corrupted gradient variances can distort the optimization trajectory over many iterations, hiding faults until much later.

Mitigation strategies for SDCs in AI training workloads are classified into infrastructure strategies and stack strategies. These are applied during operational triage at the cluster level and are designed to manage and mitigate SDCs across the training stack, in concert with the larger aim of reducing hardware and configuration failures to improve overall cluster efficiency and reliability.
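To make the "continuous, periodic testing across a random state space" idea concrete, here is a minimal, hypothetical sketch of an out-of-band SDC screen: it runs randomized matrix multiplications on an accelerator and compares the results against a trusted CPU reference. This is an illustrative pattern only; Meta's actual detection mechanisms (described in "Detecting Silent Errors in the Wild" and "Hardware Sentinel") are far more extensive, and the tolerances, shapes, and scheduling here are assumptions.

```python
import torch

def sdc_screen(device="cuda", size=1024, iterations=10, rtol=1e-3, atol=1e-2):
    """Periodic screening sketch: run randomized matmuls on an accelerator and
    compare against a CPU reference. Illustrative only; real fleet screens vary
    data patterns and operating conditions over a device's life cycle.
    """
    mismatches = []
    for i in range(iterations):
        # Random inputs sample a slice of the data-dependent state space.
        a = torch.randn(size, size)
        b = torch.randn(size, size)

        reference = a @ b                               # CPU reference result
        observed = (a.to(device) @ b.to(device)).cpu()  # device under test

        if not torch.allclose(observed, reference, rtol=rtol, atol=atol):
            max_err = (observed - reference).abs().max().item()
            mismatches.append((i, max_err))

    return mismatches  # any entries here flag the device for deeper triage


# Example: run during idle windows or maintenance drains; a device that
# repeatedly mismatches is a candidate for quarantine and repair.
```

The design choice to compare against an independent reference is what makes a silent fault visible at all; the cost is extra compute, which is why such screens are typically scheduled rather than run inline with production workloads.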
Key takeaways
- Meta’s AI infrastructure relies on a multi-component hardware ecosystem, where reliability is essential for training large models and deploying AI applications.
- There are three fault categories driving reliability work: binary hardware failures, transient errors, and silent data corruptions (SDCs).
- SDCs present the biggest challenge at scale due to their data-dependent nature and lack of traces, requiring advanced telemetry and dedicated detection mechanisms.
- Case studies with Llama 3 illustrate the real-world impact of hardware faults on training interruptions and the importance of robust triage, checkpointing, and restart processes; a minimal checkpoint save/restore sketch follows this list.
- Meta couples infrastructure and software stack strategies to mitigate SDCs and improve the overall resilience of AI training and inference workloads.
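Several of these takeaways hinge on restarting quickly from a last-known-good state once faulty hardware is swapped out. As a minimal, hypothetical sketch (single-process PyTorch using torch.save/torch.load; Meta's distributed checkpointing across thousands of accelerators is far more involved), the pattern looks like this:

```python
import os
import torch

def save_checkpoint(model, optimizer, step, path):
    """Atomically persist training state so a run can restart on healthy hardware."""
    tmp = path + ".tmp"
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        tmp,
    )
    os.replace(tmp, path)  # atomic rename avoids truncated checkpoints on failure

def restore_checkpoint(model, optimizer, path, device="cuda"):
    """Reload the last known-good state after faulty nodes have been replaced."""
    state = torch.load(path, map_location=device)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]  # resume the loop from this step
```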
FAQ
- What are the three broad categories of hardware faults Meta observes?
  Hardware failures (binary states), transient errors (reproducible only under certain conditions), and silent data corruptions (SDCs), where hardware miscomputes without leaving traces.
- Why are SDCs particularly problematic in AI training and inference?
  SDCs can corrupt forward and backward computations, leading to training divergence or incorrect inference results, and their data-dependent nature makes them hard to trace.
- What approaches does Meta use to detect and mitigate SDCs at scale?
  Meta uses the detection mechanisms described in its papers (Detecting Silent Errors in the Wild and Hardware Sentinel) and classifies mitigations into infrastructure and stack strategies applied during cluster-level triage.
- How does telemetry improve reliability in large AI clusters?
  Telemetry, including RAS data, improves fault detection, triage speed, and pattern recognition, helping identify and repair faults more efficiently across thousands of devices.
References
- How Meta keeps its AI hardware reliable (Meta Engineering): https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable
- Detecting Silent Errors in the Wild (Meta paper)
- Hardware Sentinel (Meta paper)