Urban Air Quality Prediction
Using Temporal ML Models
Where atmospheric physics meets machine learning: how I built a pipeline to predict VOC, O₃, and NOx concentrations in urban environments, from raw sensor data to real-time forecasts.
Air quality in urban environments has become a critical public health concern. Traditional monitoring systems rely on expensive, sparsely distributed sensor networks — leaving massive blind spots across city grids. This project addresses that gap by building a temporal machine learning pipeline that predicts concentrations of volatile organic compounds (VOCs), ozone (O₃), and nitrogen oxides (NOx) using low-cost IoT sensor arrays and historical atmospheric data.
The Problem
Existing air quality indices (AQI) are computed from a handful of reference-grade stations per city. In a metropolis like Kolkata, a single station may represent conditions for over 500,000 people — despite air quality varying block-by-block based on traffic, industry, and micro-meteorology.
The research question became: can we build a model that accurately forecasts pollutant levels 1–6 hours ahead, using sensor data from low-cost nodes augmented with weather and traffic features?
"The real challenge isn't the ML model — it's building a dataset that's actually usable."
Raw sensor readings drift over time, are sensitive to humidity, and have systematic biases that pure ML cannot handle without domain-aware preprocessing.
Dataset Construction
Data Sources
- Electrochemical sensor nodes — 8 low-cost nodes (MQ-series + CO/NO₂ electrochemical) deployed across 4 urban zones
- State reference monitors — CPCB-certified stations for ground-truth calibration
- ERA5 reanalysis data — hourly wind speed, direction, humidity, temperature, boundary layer height
- OpenStreetMap traffic proxies — road density and vehicle count estimates per grid cell
Preprocessing Pipeline
Raw readings underwent a 4-stage cleaning process: outlier removal via IQR clamping, humidity correction using a polynomial regression calibration curve, temporal alignment to UTC+5:30, and Kalman filter smoothing to reduce measurement noise without introducing lag artifacts.
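The four cleaning stages can be sketched roughly as follows. The clamp factor, polynomial coefficients, and Kalman variances below are illustrative placeholders, not the project's tuned values:

```python
import numpy as np

def iqr_clamp(x, k=1.5):
    # Stage 1: clamp outliers to [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    return np.clip(x, q1 - k * iqr, q3 + k * iqr)

def humidity_correct(x, rh, coeffs):
    # Stage 2: subtract a polynomial humidity response fitted
    # against co-located reference readings (coeffs: highest degree first).
    return x - np.polyval(coeffs, rh)

def kalman_smooth(x, process_var=1e-2, meas_var=1.0):
    # Stage 4: minimal 1-D Kalman filter, which smooths measurement
    # noise without the lag of a trailing moving average.
    est, p = float(x[0]), 1.0
    out = np.empty_like(x, dtype=float)
    for i, z in enumerate(x):
        p += process_var            # predict: variance grows
        gain = p / (p + meas_var)   # Kalman gain
        est += gain * (z - est)     # update toward the measurement
        p *= (1 - gain)
        out[i] = est
    return out
```

Stage 3 (temporal alignment to UTC+5:30) is a plain timestamp shift, so it is omitted here.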
Model Architecture
I experimented with three architectures before settling on a Temporal Fusion Transformer (TFT) variant — which handles multi-horizon forecasting natively and provides variable importance scores, making it interpretable enough for research publication.
```python
import torch
import torch.nn as nn

# Simplified TFT-style encoder for pollutant forecasting
class AQIForecaster(nn.Module):
    def __init__(self, input_dim=34, hidden=128, heads=4, horizon=6):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(input_dim, hidden, 2,
                               batch_first=True, dropout=0.2)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, horizon * 3)  # 3 pollutants

    def forward(self, x):
        enc, _ = self.encoder(x)
        ctx, _ = self.attn(enc, enc, enc)
        out = self.head(ctx[:, -1, :])
        return out.view(-1, self.horizon, 3)  # (batch, horizon, pollutant)
```
Loss Function
Standard MSE performed poorly due to the skewed distribution of pollution spikes. I switched to a QuantileLoss with q={0.1, 0.5, 0.9} — this gives uncertainty intervals alongside point forecasts, which is crucial for alerting systems.
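The quantile (pinball) loss penalizes under-prediction by q and over-prediction by (1 − q), so each output head converges to the corresponding conditional quantile. A framework-agnostic NumPy sketch, shown for brevity instead of the PyTorch module used in training:

```python
import numpy as np

def pinball_loss(pred, target, quantiles=(0.1, 0.5, 0.9)):
    # pred: (..., n_quantiles) predicted quantiles; target: (...,)
    # For quantile q: cost is q*err when under-predicting,
    # (1 - q)*|err| when over-predicting.
    target = np.asarray(target)[..., None]
    err = target - np.asarray(pred)
    q = np.asarray(quantiles)
    return np.mean(np.maximum(q * err, (q - 1) * err))
```

With q = 0.5 this reduces to half the MAE, which is why the median head behaves like a robust point forecast.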
Results
The final model was evaluated on a held-out 6-week test set. Compared against three baselines — persistence (last observed value), ARIMA, and vanilla LSTM:
| Model | MAE (NOx) | RMSE (O₃) | R² |
|---|---|---|---|
| Persistence | 18.4 | 23.1 | 0.61 |
| ARIMA | 14.2 | 19.7 | 0.71 |
| Vanilla LSTM | 9.8 | 14.3 | 0.83 |
| TFT (ours) | 6.1 | 9.4 | 0.91 |
Deployment Considerations
Productionizing an ML model for real-time forecasting introduces constraints that rarely appear in research: inference latency must stay under 200ms end-to-end, the sensor stream is unreliable (missing data is the norm), and the model must be retrained periodically as sensor drift accumulates.
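One way to keep inference running when readings go missing is forward-filling each channel and exposing an explicit staleness counter as an extra feature, so the model can discount stale inputs. This is a hypothetical sketch, not the pipeline's actual imputation code:

```python
import math

def impute_with_mask(window, max_stale=3):
    # Forward-fill missing readings (None/NaN) and track, per step,
    # how many steps have passed since the last real observation.
    filled, staleness = [], []
    last, stale = None, max_stale
    for v in window:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            stale += 1
        else:
            last, stale = v, 0
        filled.append(last if last is not None else 0.0)
        staleness.append(min(stale, max_stale))
    return filled, staleness
```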
Stack
- Model serving — FastAPI + ONNX Runtime for 3× faster inference vs. raw PyTorch
- Data pipeline — Apache Kafka for sensor ingestion, ClickHouse for time-series storage
- Retraining — Weekly automated fine-tune triggered by RMSE drift detection
- Monitoring — Grafana dashboards tracking prediction vs. actual AQI in real time
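The RMSE-drift trigger behind the weekly retraining could look roughly like this; the window length and tolerance are assumed values, not the production configuration:

```python
from collections import deque

class DriftMonitor:
    # Rolling RMSE over recent (prediction, actual) pairs; signals
    # retraining when it exceeds the baseline by a relative margin.
    def __init__(self, baseline_rmse, window=168, tolerance=0.25):
        self.baseline = baseline_rmse
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)

    def update(self, pred, actual):
        # Returns True when accumulated drift warrants a retrain.
        self.errors.append((pred - actual) ** 2)
        return self.rmse() > self.baseline * (1 + self.tolerance)

    def rmse(self):
        if not self.errors:
            return 0.0
        return (sum(self.errors) / len(self.errors)) ** 0.5
```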
Conclusion
Urban air quality prediction is a domain where strong ML is necessary but not sufficient. Domain knowledge — atmospheric physics, sensor calibration, meteorological context — is what separates a working prototype from a reliable system.
The TFT architecture's built-in interpretability was invaluable: it let us validate model behaviour against known atmospheric patterns rather than treating it as a black box, which was essential for building stakeholder trust in the predictions.
Next steps include expanding the sensor network, experimenting with graph neural networks to model spatial pollutant dispersion, and publishing the cleaned dataset for the research community.
Key Takeaways
- Boundary layer height proved more predictive than traffic density — domain knowledge beats feature volume.
- Training-serving skew is the #1 silent killer in ML pipelines; the feature computation path must be identical.
- Quantile loss gives uncertainty bounds alongside point forecasts — essential for alerting systems that need confidence intervals.
- Low-cost sensor networks can rival expensive reference stations with proper humidity correction and Kalman smoothing.
- ONNX Runtime + Kafka + ClickHouse is a production-proven stack for real-time environmental ML at scale.