Why Doesn’t My Model Work? A Practical Guide to ML Pitfalls
Source: https://thegradient.pub/why-doesnt-my-model-work/ (The Gradient)
Overview
Training a model that looks good on a held-out test set, only to watch it fail on real-world data, is a common and frustrating experience. The article surveys the ways machine learning can quietly go astray through pitfalls that aren't obvious at first glance. Drawing on two decades of experience in ML, the author describes many cases where models seemed to perform well during development but delivered little utility in practice, and suggests ways to avoid the mistakes that inflate apparent performance, including the REFORMS checklist for ML-based science.

Misleading data is a natural starting point for trouble. The core of ML is data, and when data quality is weak, models can appear successful yet be useless outside the training environment. Covid-19 prediction efforts, for example, relied on public datasets that turned out to carry signals unrelated to the underlying disease: overlapping records, mislabellings, and hidden variables that let models predict the labels without learning meaningful patterns. Hidden variables are features that correlate with the target in the data but aren't causally related to the phenomenon of interest. In Covid chest imaging, for instance, the orientation of a scan can correlate with disease severity, so a model ends up predicting patient posture rather than illness.

A related issue is spurious correlations: patterns that happen to align with the labels but aren't causally connected to them. The classic toy example is the tank image dataset in which the model picks up background or time-of-day cues rather than the tank itself. Many real datasets contain such cues, and deep models are particularly adept at exploiting background pixels and other incidental features, which makes them vulnerable to distribution shift and adversarial perturbations; if a model latches onto background cues, small changes can flip its predictions. Adversarial training can help but is expensive. A simpler first step is to inspect what information the model actually uses: saliency maps and other explanations can reveal reliance on non-signal features.

Label quality matters too. When labelling is done by humans, biases and mistakes creep in, especially in subjective tasks such as sentiment analysis, and even widely used benchmarks carry mislabellings at a non-trivial rate. A small improvement in reported accuracy may therefore reflect labelling noise or annotator bias rather than a genuine improvement on the task.

Beyond data quality, the pipeline itself can leak information that should not be available during training. Data leakage occurs when knowledge from the test set (or from the future) influences the model during training. A common form is running preprocessing steps such as centering, scaling, or feature selection on the entire dataset before splitting it into train and test sets, which lets the model absorb test-set characteristics and inflates performance estimates. In time-series tasks, look-ahead bias is a particularly pernicious form of leakage: future information shown during training can falsely boost test metrics. Correcting such leaks often reduces reported performance dramatically, sometimes from near-perfect to only modestly better than random.

In response, the article advocates practices that improve reliability, including the REFORMS checklist for ML-based science. The aim is to make the data, the modelling choices, and the evaluation more transparent and robust, reducing the risk that a model only "works" under the very conditions under which it was developed.
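As a concrete illustration of the preprocessing leak described above, the sketch below (not taken from the article; the data and model are toy scikit-learn placeholders) contrasts fitting a scaler on the full dataset with fitting it inside a pipeline on the training split only.

```python
# Minimal sketch (not from the article): preprocessing before the split leaks
# test-set statistics into training; a Pipeline fitted after the split does not.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Leaky: the scaler sees statistics from the soon-to-be test rows.
X_scaled = StandardScaler().fit_transform(X)
Xl_tr, Xl_te, yl_tr, yl_te = train_test_split(X_scaled, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(Xl_tr, yl_tr)

# Safer: split first; the Pipeline fits the scaler on training data only.
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean.fit(Xc_tr, yc_tr)

print("leaky preprocessing, test accuracy:", leaky.score(Xl_te, yl_te))
print("pipeline preprocessing, test accuracy:", clean.score(Xc_te, yc_te))
```

On this toy data the gap from scaling alone is small; leakage from target-dependent steps such as feature selection, or from duplicated records as in the Covid datasets discussed above, can be far larger.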
Key features
- Detect and mitigate misleading data early: guard against garbage in, garbage out.
- Identify hidden variables and verify they aren’t driving predictions.
- Be wary of spurious correlations that can inflate apparent performance.
- Examine model explanations (e.g., saliency maps) to ensure the model relies on meaningful signals rather than background cues; a minimal saliency sketch follows this list.
- Prevent data leakage by keeping test data isolated during preprocessing and feature selection.
- Watch for look-ahead bias in time-series and other domains where future information can slip into training.
- Mind labeling biases and mislabels; recognize that subjectivity can lead to overfitting to annotator idiosyncrasies.
- Use reporting checklists such as REFORMS to structure rigorous, transparent ML experiments.
- Acknowledge the trade-offs of defenses like adversarial training, weighing cost against robustness.
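The explanation-checking item above can be made concrete with a small gradient-saliency probe. This is a minimal sketch assuming a PyTorch image classifier; the model, input, and border heuristic are hypothetical placeholders rather than the article's method.

```python
# Minimal sketch (assumes PyTorch): gradient saliency shows which pixels drive a
# prediction. The model and input below are placeholders, not a real classifier.
import torch
import torch.nn as nn

model = nn.Sequential(              # stand-in for any trained image classifier
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)
model.eval()

image = torch.rand(1, 3, 64, 64, requires_grad=True)   # placeholder scan
logits = model(image)
logits[0, logits.argmax()].backward()                   # gradient of top class

saliency = image.grad.abs().max(dim=1).values[0]        # per-pixel importance
# If most of the mass sits on background regions (borders, annotations, scanner
# artefacts) rather than the anatomy of interest, suspect a spurious cue.
border_share = ((saliency.sum() - saliency[8:-8, 8:-8].sum()) / saliency.sum()).item()
print(f"share of saliency in the outer 8-pixel border: {border_share:.2%}")
```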
Common use cases
- Scientific ML studies where reported results may overstate real-world utility due to data leakage or hidden variables.
- Covid-19 predictive modeling efforts where datasets contained latent signals or mislabelings that inflated performance on test splits (a brief label-agreement sketch follows this list).
- Image and time-series benchmarks where background patterns or timing cues unintentionally correlate with labels, undermining generalization.
- Deployments in safety- or health-critical domains where robust generalization and resistance to distribution shift are essential.
- Reproducibility-focused research that investigates why models appear to work in some settings but fail when applied to new data.
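For the mislabeling and annotator-bias concerns above, a quick sanity check is to measure inter-annotator agreement before reading too much into small accuracy differences. The sketch below is a hypothetical example using scikit-learn's Cohen's kappa; the labels are made up.

```python
# Minimal sketch (not from the article): quantify annotator disagreement before
# trusting small accuracy gains. The sentiment labels below are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos"]
annotator_b = ["pos", "neu", "pos", "neg", "pos", "neg", "neu", "neu"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"inter-annotator Cohen's kappa: {kappa:.2f}")
# If kappa is low, a point or two of model accuracy on these labels is within
# labelling noise rather than evidence of a genuinely better model.
```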
Setup & installation
Note: The article references the REFORMS checklist and emphasizes careful evaluation and reporting, but it does not provide concrete setup or installation steps for a software package.
Quick start
- Read the article to understand common ML pitfalls: data quality issues (mislabeling, hidden variables, spurious correlations), data leakage, and look-ahead bias.
- When designing a model, inspect what the model is using to make decisions. If a saliency map or explainability tool highlights background features or non-meaningful cues, anticipate poor generalization.
- Audit the data pipeline for leakage. Ensure that any preprocessing, scaling, or feature selection is performed after the train/test split. Be especially cautious with time-series data, where future information can leak into training (see the walk-forward sketch after this list).
- Evaluate the impact of label quality. Check for annotator biases and mislabeling rates that could distort performance comparisons.
- Consider simpler robustness strategies before resorting to expensive defenses. If the model's apparent strength relies on spurious cues, invest in data curation and proper evaluation before reaching for heavy adversarial training.
- Use the REFORMS checklist to document data provenance, modeling choices, and evaluation procedures to improve transparency and reproducibility.
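For the time-series leakage item above, here is a minimal sketch (not from the article) of a walk-forward evaluation using scikit-learn's TimeSeriesSplit; the series, features, and model are hypothetical.

```python
# Minimal sketch (not from the article): chronological splits avoid look-ahead
# bias, whereas a shuffled split lets future rows leak into training.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
n = 500
trend = np.cumsum(rng.normal(size=n))             # hypothetical time series
X = np.column_stack([trend, rng.normal(size=n)])  # features observed at time t
y = np.roll(trend, -1)                            # target: next step's value
X, y = X[:-1], y[:-1]                             # drop the wrapped last row

# Walk-forward evaluation: each fold trains only on the past.
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=tscv)
print("walk-forward R^2 per fold:", np.round(scores, 3))
# A shuffled KFold here would interleave future and past rows and typically
# report optimistic scores that do not survive deployment.
```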
Pros and cons
- Pros:
  - Highlights practical failure modes often overlooked in academia.
  - Emphasizes data quality, leakage control, and explainability as pillars of robust ML.
  - Recommends a structured evaluation framework (REFORMS) to improve trust in results.
- Cons:
  - Some mitigation strategies (e.g., adversarial training) can be expensive and may not be feasible for all projects.
  - The guidance is high-level and may require domain-specific adaptation to concrete pipelines.
  - The discussion focuses on pitfalls and evaluation; it does not prescribe a single, universal workflow.
Alternatives (brief comparisons)
| Approach | What it targets | Pros | Cons |
|---|---|---|---|
| Adversarial training | Robustness to adversarial examples | Can improve resilience to perturbations | Computationally expensive; not a cure-all for leakage or data quality issues |
| Post-hoc explainability | Understanding model decisions | Helps detect reliance on non-signal features | Explanations can be noisy; may not reflect causal relevance |
| Data-centric cleanup | Fixing data quality issues, labels, and leakage | Often yields greater gains with less model complexity | Requires data engineering discipline; may be time-consuming |
| Rigorous evaluation protocols | Structured, reproducible experiments | Improves trust and comparability | Requires discipline and documentation; may slow rapid iteration |
Pricing or License
Not specified in the source article.
References
- Why Doesn’t My Model Work? — The Gradient https://thegradient.pub/why-doesnt-my-model-work/