Urban Air Quality Prediction
Using Temporal ML Models
Where atmospheric physics meets machine learning: how I built a pipeline to predict VOC, O₃, and NOx concentrations in urban environments, from raw sensor data to real-time forecasts.
Air quality in urban environments has become a critical public health concern. Traditional monitoring systems rely on expensive, sparsely distributed sensor networks — leaving massive blind spots across city grids. This project addresses that gap by building a temporal machine learning pipeline that predicts concentrations of volatile organic compounds (VOCs), ozone (O₃), and nitrogen oxides (NOx) using low-cost IoT sensor arrays and historical atmospheric data.
The Problem
Existing air quality indices (AQI) are computed from a handful of reference-grade stations per city. In a metropolis like Kolkata, a single station may represent conditions for over 500,000 people — despite air quality varying block-by-block based on traffic, industry, and micro-meteorology.
The research question became: can we build a model that accurately forecasts pollutant levels 1–6 hours ahead, using sensor data from low-cost nodes augmented with weather and traffic features?
"The real challenge isn't the ML model — it's building a dataset that's actually usable."
Raw sensor readings drift over time, are sensitive to humidity, and have systematic biases that pure ML cannot handle without domain-aware preprocessing.
Dataset Construction
Data Sources
- Electrochemical sensor nodes — 8 low-cost nodes (MQ-series + CO/NO₂ electrochemical) deployed across 4 urban zones
- State reference monitors — CPCB-certified stations for ground-truth calibration
- ERA5 reanalysis data — hourly wind speed, direction, humidity, temperature, boundary layer height
- OpenStreetMap traffic proxies — road density and vehicle count estimates per grid cell
Preprocessing Pipeline
Raw readings underwent a 4-stage cleaning process: outlier removal via IQR clamping, humidity correction using a polynomial regression calibration curve, temporal alignment to UTC+5:30, and Kalman filter smoothing to reduce measurement noise without introducing lag artifacts.
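The four cleaning stages can be sketched roughly as follows. The clamp factor, polynomial coefficients, and Kalman variances below are illustrative placeholders, not the project's tuned values:

```python
import numpy as np

def iqr_clamp(x, k=1.5):
    # Stage 1: clamp outliers to [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    return np.clip(x, q1 - k * iqr, q3 + k * iqr)

def humidity_correct(x, rh, coeffs):
    # Stage 2: subtract a polynomial humidity response fitted
    # against co-located reference readings (coeffs: highest degree first).
    return x - np.polyval(coeffs, rh)

def kalman_smooth(x, process_var=1e-2, meas_var=1.0):
    # Stage 4: minimal 1-D Kalman filter, which smooths measurement
    # noise without the lag of a trailing moving average.
    est, p = float(x[0]), 1.0
    out = np.empty_like(x, dtype=float)
    for i, z in enumerate(x):
        p += process_var            # predict: variance grows
        gain = p / (p + meas_var)   # Kalman gain
        est += gain * (z - est)     # update toward the measurement
        p *= (1 - gain)
        out[i] = est
    return out
```

Stage 3 (temporal alignment to UTC+5:30) is a plain timestamp shift, so it is omitted here.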
Model Architecture
I experimented with three architectures before settling on a Temporal Fusion Transformer (TFT) variant — which handles multi-horizon forecasting natively and provides variable importance scores, making it interpretable enough for research publication.
```python
import torch
import torch.nn as nn

# Simplified TFT-style encoder for pollutant forecasting
class AQIForecaster(nn.Module):
    def __init__(self, input_dim=34, hidden=128, heads=4, horizon=6):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(input_dim, hidden, 2,
                               batch_first=True, dropout=0.2)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, horizon * 3)  # 3 pollutants

    def forward(self, x):
        enc, _ = self.encoder(x)
        ctx, _ = self.attn(enc, enc, enc)
        out = self.head(ctx[:, -1, :])
        return out.view(-1, self.horizon, 3)  # (batch, horizon, pollutant)
```
Loss Function
Standard MSE performed poorly due to the skewed distribution of pollution spikes. I switched to a QuantileLoss with q={0.1, 0.5, 0.9} — this gives uncertainty intervals alongside point forecasts, which is crucial for alerting systems.
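The quantile (pinball) loss penalizes under-prediction by q and over-prediction by (1 − q), so each output head converges to the corresponding conditional quantile. A framework-agnostic NumPy sketch, shown for brevity instead of the PyTorch module used in training:

```python
import numpy as np

def pinball_loss(pred, target, quantiles=(0.1, 0.5, 0.9)):
    # pred: (..., n_quantiles) predicted quantiles; target: (...,)
    # For quantile q: cost is q*err when under-predicting,
    # (1 - q)*|err| when over-predicting.
    target = np.asarray(target)[..., None]
    err = target - np.asarray(pred)
    q = np.asarray(quantiles)
    return np.mean(np.maximum(q * err, (q - 1) * err))
```

With q = 0.5 this reduces to half the MAE, which is why the median head behaves like a robust point forecast.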
Results
The final model was evaluated on a held-out 6-week test set. Compared against three baselines — persistence (last observed value), ARIMA, and vanilla LSTM:
| Model | MAE (NOx) | RMSE (O₃) | R² |
|---|---|---|---|
| Persistence | 18.4 | 23.1 | 0.61 |
| ARIMA | 14.2 | 19.7 | 0.71 |
| Vanilla LSTM | 9.8 | 14.3 | 0.83 |
| TFT (ours) | 6.1 | 9.4 | 0.91 |
Deployment Considerations
Productionizing an ML model for real-time forecasting introduces constraints that rarely appear in research: inference latency must stay under 200ms end-to-end, the sensor stream is unreliable (missing data is the norm), and the model must be retrained periodically as sensor drift accumulates.
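One way to keep inference running when readings go missing is forward-filling each channel and exposing an explicit staleness counter as an extra feature, so the model can discount stale inputs. This is a hypothetical sketch, not the pipeline's actual imputation code:

```python
import math

def impute_with_mask(window, max_stale=3):
    # Forward-fill missing readings (None/NaN) and track, per step,
    # how many steps have passed since the last real observation.
    filled, staleness = [], []
    last, stale = None, max_stale
    for v in window:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            stale += 1
        else:
            last, stale = v, 0
        filled.append(last if last is not None else 0.0)
        staleness.append(min(stale, max_stale))
    return filled, staleness
```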
Stack
- Model serving — FastAPI + ONNX Runtime for 3× faster inference vs. raw PyTorch
- Data pipeline — Apache Kafka for sensor ingestion, ClickHouse for time-series storage
- Retraining — Weekly automated fine-tune triggered by RMSE drift detection
- Monitoring — Grafana dashboards tracking prediction vs. actual AQI in real time
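The RMSE-drift trigger behind the weekly retraining could look roughly like this; the window length and tolerance are assumed values, not the production configuration:

```python
from collections import deque

class DriftMonitor:
    # Rolling RMSE over recent (prediction, actual) pairs; signals
    # retraining when it exceeds the baseline by a relative margin.
    def __init__(self, baseline_rmse, window=168, tolerance=0.25):
        self.baseline = baseline_rmse
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)

    def update(self, pred, actual):
        # Returns True when accumulated drift warrants a retrain.
        self.errors.append((pred - actual) ** 2)
        return self.rmse() > self.baseline * (1 + self.tolerance)

    def rmse(self):
        if not self.errors:
            return 0.0
        return (sum(self.errors) / len(self.errors)) ** 0.5
```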
Conclusion
Urban air quality prediction is a domain where strong ML is necessary but not sufficient. Domain knowledge — atmospheric physics, sensor calibration, meteorological context — is what separates a working prototype from a reliable system.
The TFT architecture's built-in interpretability was invaluable: it let us validate model behaviour against known atmospheric patterns rather than treating it as a black box, which was essential for building stakeholder trust in the predictions.
Next steps include expanding the sensor network, experimenting with graph neural networks to model spatial pollutant dispersion, and publishing the cleaned dataset for the research community.
Key Takeaways
- Boundary layer height proved more predictive than traffic density — domain knowledge beats feature volume.
- Training-serving skew is the #1 silent killer in ML pipelines; the feature computation path must be identical.
- Quantile loss gives uncertainty bounds alongside point forecasts — essential for alerting systems that need confidence intervals.
- Low-cost sensor networks can rival expensive reference stations with proper humidity correction and Kalman smoothing.
- ONNX Runtime + Kafka + ClickHouse is a production-proven stack for real-time environmental ML at scale.