Empowering air quality research with secure, ML-driven predictive analytics

TL;DR

A data imputation workflow fills PM2.5 gaps using ML trained in SageMaker Canvas, orchestrated by AWS Lambda and AWS Step Functions.
The sample dataset includes over 15 million records from March 2022 to October 2022 across Kenya and Nigeria, from 23 sensors in 15 locations.
Predicted values for missing PM2.5 readings fall within plus or minus 4.875 µg/m³ of the actual concentration, which preserves trend accuracy.
The solution emphasizes security, encryption, and private-network deployment, with a shared responsibility model guiding customer protections.

Context and background

Air pollution remains a critical environmental health challenge in Africa. Organizations like sensors.AFRICA have deployed hundreds of air quality sensors to monitor conditions, but data gaps persist due to power instability and connectivity issues in high-risk regions where maintenance is limited. Missing PM2.5 data reduces statistical power and biases parameter estimates, hindering reliable trend detection and sound conclusions about air quality patterns. These gaps compromise evidence-based decisions for pollution control, health impact assessments, and regulatory compliance. PM2.5 exposure contributes to millions of premature deaths globally, underscoring the importance of accurate forecasting for public health.

The post showcases the time-series forecasting capability of Amazon SageMaker Canvas, a low-code/no-code ML platform, to predict PM2.5 values from incomplete datasets. SageMaker Canvas offers resilience to incomplete data, enabling continuous operation of air quality networks during sensor outages or maintenance periods. This helps environmental agencies and public health officials maintain uninterrupted access to critical air quality information for timely alerts and long-term trend analysis.

The approach combines SageMaker Canvas forecasting with a data imputation workflow implemented using Amazon SageMaker AI, AWS Lambda, and AWS Step Functions. A sample training dataset sourced from openAFRICA contains over 15 million records (March 2022 to October 2022) from 23 sensor devices across 15 locations in Kenya and Nigeria, illustrating how the solution can be adapted to real-world PM2.5 datasets. The README in the repository provides detailed deployment guidance.

The solution’s architecture centers on two ML components: a training workflow and an inference workflow. These workflows are built with SageMaker Canvas for model development and exporting a trained model for batch inference. The end-to-end process begins by extracting sensor data from a database, importing it into SageMaker Canvas for transformation and model training, and then exporting a Canvas-trained model for batch processing. The approach supports retraining as new PM2.5 data become available, ensuring models stay current with evolving sensor patterns.

What’s new

This article demonstrates a complete, secure data-imputation workflow for PM2.5 data gaps using a combination of AWS services. Key features include:

Time-series forecasting with SageMaker Canvas to predict missing PM2.5 values in datasets containing gaps.
A robust end-to-end workflow: data extraction from a database, Canvas-based model training, model export, and a batch transform inference pipeline.
Daily orchestration: a Lambda function runs every 24 hours to trigger a SageMaker batch transform job on newly received data with gaps, then updates the dataset with predicted values.
Realistic sample data: the demonstration uses a training corpus of over 15 million records from 23 sensors across 15 locations in Kenya and Nigeria (Mar 2022–Oct 2022).
Security-first deployment: encryption at rest for S3, Aurora PostgreSQL-compatible database, and SageMaker Canvas; encryption in transit via SSL/TLS; temporary IAM-based credentials for RDS access; least-privilege Lambda roles; and a private-subnet deployment using VPC endpoints for S3 and SageMaker AI.
IaC-driven deployment: model training and deployment stages described within a CDK-based workflow, enabling repeatable, version-controlled updates.

To implement the approach, the project provides a Git repository with sample code and a README for step-by-step deployment. The architecture aims to deliver completed PM2.5 datasets to public health decision-makers, supporting timely pollution alerts and more comprehensive long-term analyses.

Why it matters (impact for developers/enterprises)

For developers and enterprises working on environmental monitoring, this approach offers:

Resilient data pipelines that maintain operational continuity despite sensor downtime, reducing data gaps and downtime costs.
Actionable PM2.5 insights extracted from incomplete data without requiring complete data streams, helping researchers and policymakers maintain timely trend visibility.
A scalable, secure, cloud-based solution that integrates with existing data stores (e.g., relational databases like Aurora) and object storage (S3) while maintaining strict security controls.
An auditable deployment pattern: infrastructure-as-code via CDK enables consistent, repeatable deployments and easier updates when sensor networks change.
Clear guidance on protecting data in transit and at rest, with granular IAM permissions and private networking to minimize exposure.

This work aligns with public health objectives by enabling continuous monitoring and more reliable PM2.5 trend analyses, ultimately supporting evidence-based pollution control strategies and regulatory compliance.

Technical details or Implementation

The solution comprises two main ML components: a training workflow and an inference workflow, both integrated into a secure end-to-end pipeline.

Data inputs and training

Historical PM2.5 datasets are ingested from a relational database and prepared in SageMaker Canvas for predictive analysis.
Canvas supports training a model for single-target PM2.5 prediction with data wrangling steps (transformations, feature engineering) suitable for time-series forecasting.
After training, Canvas exports the model for batch inference.
The training dataset referenced in the example contains over 15 million records spanning March 2022 to October 2022, drawn from 23 sensor devices across 15 locations in Kenya and Nigeria.
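
Before training or inference, it helps to know where the gaps in a sensor's series actually are. The following is a minimal sketch, not code from the repository: it assumes one reading per fixed interval (hourly here, which the source does not state), and the function name and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def find_missing_timestamps(timestamps, step):
    """Return the expected timestamps that are absent from one sensor's series.

    `timestamps` is an iterable of datetimes for a single sensor; `step` is
    the assumed sampling interval (an assumption; use your sensors' cadence).
    """
    present = set(timestamps)
    if not present:
        return []
    current, end = min(present), max(present)
    missing = []
    # Walk the expected grid from the first to the last observed reading.
    while current <= end:
        if current not in present:
            missing.append(current)
        current += step
    return missing


# Illustrative readings with two hourly slots missing (02:00 and 03:00).
readings = [
    datetime(2022, 3, 1, 0),
    datetime(2022, 3, 1, 1),
    datetime(2022, 3, 1, 4),
]
gaps = find_missing_timestamps(readings, timedelta(hours=1))
print([g.isoformat() for g in gaps])
```

The missing slots found this way are the rows the inference workflow would need predictions for.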

Inference and data imputation

A Step Functions orchestration coordinates the workflow, with a Lambda function invoked every 24 hours.
The Lambda function starts a SageMaker Batch Transform job to predict missing PM2.5 values for the new data with gaps.
The batch transform processes the entire dataset in one pass, and the Lambda function updates the existing dataset with the predicted values.
The resulting completed dataset enables distribution to public health decision-makers for more effective pattern analysis of PM2.5 data.
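
The two Lambda responsibilities described above, starting a batch transform and writing predictions back, might look roughly like the sketch below. It is not the repository's code. The request keys follow the SageMaker CreateTransformJob API, but the job name prefix, environment variable names, instance type, CSV content type, and row layout are all assumptions made for illustration.

```python
import os
from datetime import datetime, timezone


def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type="ml.m5.xlarge"):
    """Build keyword arguments for SageMaker's CreateTransformJob API."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",  # assumed input format
            "SplitType": "Line",
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }


def fill_gaps(rows, predictions):
    """Return rows with missing PM2.5 values (None) replaced by predictions.

    `predictions` maps (sensor_id, timestamp) to a predicted value.
    Observed readings are never overwritten.
    """
    filled = []
    for row in rows:
        key = (row["sensor_id"], row["timestamp"])
        if row["pm25"] is None and key in predictions:
            row = {**row, "pm25": predictions[key], "imputed": True}
        filled.append(row)
    return filled


def lambda_handler(event, context):
    """Hypothetical daily entry point; the environment variables are assumed."""
    import boto3  # bundled with the AWS Lambda Python runtime

    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    request = build_transform_request(
        job_name=f"pm25-impute-{stamp}",
        model_name=os.environ["MODEL_NAME"],
        input_s3_uri=os.environ["INPUT_S3_URI"],
        output_s3_uri=os.environ["OUTPUT_S3_URI"],
    )
    boto3.client("sagemaker").create_transform_job(**request)
    return {"TransformJobName": request["TransformJobName"]}
```

Keeping the request builder and the merge step as plain functions makes them testable without AWS access. Flagging imputed rows is a design choice of this sketch, so that downstream analysts can distinguish predicted values from measured ones.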

Model lifecycle and deployment

After training and evaluation (including RMSE and other metrics), the model is registered in the SageMaker Model Registry and deployed for batch inference.
CDK-based deployment creates a SageMaker AI domain and user profile, then provisions the necessary resources for model training and inference.
The deployment workflow creates the SageMaker model in a VPC, deploys the batch transform job, and updates the infrastructure with the new model ID via cdk deploy.
The solution supports retraining with updated PM2.5 datasets to adapt to evolving sensor data patterns.
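
To make the evaluation step concrete, the sketch below computes RMSE and the share of predictions that fall within the plus or minus 4.875 µg/m³ band mentioned in the article. The sample values are invented for illustration and are not results from the article. How the 4.875 figure was derived is not stated in the source, so it is used here only as a tolerance.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def share_within_tolerance(actual, predicted, tolerance=4.875):
    """Fraction of predictions within +/- tolerance (µg/m³) of the actual value."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tolerance)
    return hits / len(actual)


# Invented hold-out values in µg/m³; absolute errors are 2, 3, 6, and 0.
actual = [12.0, 30.0, 45.0, 20.0]
predicted = [14.0, 27.0, 51.0, 20.0]

print(rmse(actual, predicted))                    # 3.5
print(share_within_tolerance(actual, predicted))  # 0.75
```

RMSE penalizes large misses more heavily, while the tolerance share reports how often a prediction is close enough to keep a trend intact, so the two are worth reading together.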

Security and compliance highlights

Encryption at rest is enabled for Amazon S3, the Aurora PostgreSQL-compatible database, and the SageMaker Canvas application.
Encryption in transit is enforced by SSL/TLS for all connections from Lambda functions.
Temporary dynamic credentials are used for Amazon RDS access via IAM authentication, eliminating static passwords.
Each Lambda function operates with least-privilege permissions tailored to its function.
The Lambda functions, Aurora PostgreSQL-compatible instance, and SageMaker Batch Transform jobs run in private VPC subnets, so their traffic does not traverse the public internet; VPC endpoints provide access to S3 and SageMaker AI.
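
As one illustration of least privilege combined with IAM database authentication, the sketch below builds an IAM policy that lets a role connect to a single database user and nothing else. The rds-db:connect action and the ARN layout come from the AWS documentation for IAM database authentication. The Region, account ID, cluster resource ID, and database user are placeholders, and the repository's actual policies may differ.

```python
import json


def rds_connect_policy(region, account_id, cluster_resource_id, db_user):
    """Build an IAM policy allowing IAM-authenticated login as one DB user."""
    resource = (
        f"arn:aws:rds-db:{region}:{account_id}:dbuser:"
        f"{cluster_resource_id}/{db_user}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "rds-db:connect", "Resource": resource}
        ],
    }


# Placeholder identifiers; substitute your own values.
policy = rds_connect_policy(
    region="eu-west-1",
    account_id="123456789012",
    cluster_resource_id="cluster-EXAMPLERESOURCEID",
    db_user="lambda_writer",
)
print(json.dumps(policy, indent=2))
```

At runtime, a function holding this permission could request a short-lived token with the boto3 RDS client's generate_db_auth_token method and pass it as the password over a TLS connection, which is how static passwords are avoided.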

Configuration and extensibility

The architecture is designed to be adaptable to future configuration changes via CDK, with a configuration file detailing parameter defaults.
The approach emphasizes a security-first mindset and aligns with the AWS Shared Responsibility Model, encouraging customers to review their responsibilities for secure deployment.
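
One common way to pair parameter defaults with per-deployment overrides is sketched below. Every field name and default value here is hypothetical; the repository's configuration file defines its own parameters, so treat this only as an illustration of the pattern.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical parameter defaults for an imputation pipeline."""
    schedule_hours: int = 24
    transform_instance_type: str = "ml.m5.xlarge"
    target_column: str = "pm25"
    database_name: str = "airquality"


def load_config(overrides):
    """Apply overrides on top of the defaults, rejecting unknown keys."""
    known = {f.name for f in fields(PipelineConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown configuration keys: {sorted(unknown)}")
    return replace(PipelineConfig(), **overrides)


config = load_config({"transform_instance_type": "ml.m5.2xlarge"})
print(config)
```

Rejecting unknown keys catches typos in a configuration file before they silently fall back to a default during deployment.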

Key architecture components and roles

Component	Role
SageMaker Canvas	Training and exporting the predictive model for batch inference
AWS Lambda	Orchestrates data updates and kicks off batch transforms on a 24-hour cadence
AWS Step Functions	Coordinates the end-to-end workflow across services
Amazon Aurora PostgreSQL-Compatible	Stores sensor data with IAM-authenticated access
Amazon S3	Immutable data lake storage with encryption at rest
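
The coordination role listed for Step Functions in the table above can be pictured with a small Amazon States Language definition, built here as a Python dictionary. This is a sketch under assumptions: the state name, retry settings, and Lambda ARN are invented, and the article does not show the repository's actual state machine.

```python
import json


def build_state_machine(lambda_arn):
    """Return a one-step Amazon States Language definition that invokes a Lambda."""
    return {
        "Comment": "Sketch: run the PM2.5 imputation Lambda with retries",
        "StartAt": "ImputeMissingPM25",
        "States": {
            "ImputeMissingPM25": {
                "Type": "Task",
                # Step Functions' optimized integration for invoking Lambda.
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": lambda_arn, "Payload.$": "$"},
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 60,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "End": True,
            }
        },
    }


# Placeholder ARN for illustration only.
definition = build_state_machine(
    "arn:aws:lambda:eu-west-1:123456789012:function:pm25-impute"
)
print(json.dumps(definition, indent=2))
```

A definition like this could be extended with further states, for example one that waits for the batch transform to finish before the dataset update runs.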

References and deployment notes

The approach is documented with sample code and step-by-step deployment guidance in the referenced Git repository: git@github.com:aws-samples/sample-empowering-air-quality-research-secure-machine-learning-predictive-analytics.git
The published article is available at https://aws.amazon.com/blogs/machine-learning/empowering-air-quality-research-secure-ml-driven-predictive-analytics/.

Key takeaways

Incomplete PM2.5 data can be effectively imputed using a SageMaker Canvas-based model and a batch inference pipeline.
Daily Lambda-triggered batch transforms keep datasets up-to-date with minimal downtime and interruptions to monitoring.
Security-by-design choices—including encryption, IAM-based credentials, and private subnets—help protect sensitive environmental data.
An IaC-driven deployment pattern supports repeatable, auditable infrastructure changes and easier scaling.

FAQ

How does this solution impute missing PM2.5 values?

The workflow trains a SageMaker Canvas model on historical PM2.5 data and uses a SageMaker Batch Transform job to predict missing values within a range of plus or minus 4.875 µg/m³ of the actual PM2.5 concentration.

What data were used in the example?

A sample training dataset from openAFRICA containing over 15 million records from March 2022 to October 2022, collected across Kenya and Nigeria from 23 sensor devices at 15 locations.

Which AWS services comprise the end-to-end pipeline?

SageMaker Canvas for model training and export, AWS Lambda for orchestration, AWS Step Functions for workflow coordination, SageMaker Batch Transform for inference, an Amazon Aurora PostgreSQL-compatible database for storage, and Amazon S3 for data lake storage, all in a secure network configuration.

How is security implemented?

Encryption at rest for S3, Aurora, and SageMaker Canvas; TLS for data in transit; IAM authentication for temporary RDS credentials; least-privilege Lambda roles; private subnets with VPC endpoints; and alignment with the AWS Shared Responsibility Model.

How can I deploy this in my environment?

The solution provides a CDK-based deployment approach and a README in the Git repository, enabling you to reproduce the end-to-end workflow and adapt it to your PM2.5 datasets.

References

https://aws.amazon.com/blogs/machine-learning/empowering-air-quality-research-secure-ml-driven-predictive-analytics/
git@github.com:aws-samples/sample-empowering-air-quality-research-secure-machine-learning-predictive-analytics.git
