Urban Air Quality Prediction Using Temporal ML Models
From raw sensor data to real-time forecasts — where atmospheric physics meets machine learning.
Air quality in urban environments has become a critical public health concern. Traditional monitoring systems rely on expensive, sparsely distributed sensor networks — leaving massive blind spots across city grids. This project addresses that gap by building a temporal machine learning pipeline that predicts concentrations of volatile organic compounds (VOCs), ozone (O₃), and nitrogen oxides (NOx) using low-cost IoT sensor arrays and historical atmospheric data.
The Problem
Existing air quality indices (AQI) are computed from a handful of reference-grade stations per city. In a metropolis like Kolkata, a single station may represent conditions for over 500,000 people — despite air quality varying block by block with traffic, industry, and micro-meteorology.
The research question became: can we build a model that accurately forecasts pollutant levels 1–6 hours ahead, using sensor data from low-cost nodes augmented with weather and traffic features?
"The real challenge isn't the ML model — it's building a dataset that's actually usable."
Raw sensor readings drift over time, are sensitive to humidity, and have systematic biases that pure ML cannot handle without domain-aware preprocessing.
Dataset Construction
Data Sources
- Electrochemical sensor nodes — 8 low-cost nodes (MQ-series + CO/NO₂ electrochemical) deployed across 4 urban zones
- State reference monitors — CPCB-certified stations for ground-truth calibration
- ERA5 reanalysis data — hourly wind speed, direction, humidity, temperature, boundary layer height
- OpenStreetMap traffic proxies — road density and vehicle count estimates per grid cell
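Fusing these sources means putting everything on a common hourly index before any modeling. A minimal sketch with pandas — the column names and sample values here are illustrative, not the project's actual schema:

```python
import pandas as pd

# Hypothetical raw sensor readings (irregular timestamps) and hourly ERA5 weather
sensor = pd.DataFrame({
    "ts": pd.to_datetime(["2024-03-01 00:12", "2024-03-01 00:47", "2024-03-01 01:05"]),
    "no2_ppb": [21.0, 24.0, 19.5],
})
weather = pd.DataFrame({
    "ts": pd.to_datetime(["2024-03-01 00:00", "2024-03-01 01:00"]),
    "wind_ms": [2.1, 3.4],
    "blh_m": [450.0, 520.0],  # boundary layer height
})

# Resample irregular sensor readings to hourly means,
# then join the already-hourly weather features on the timestamp index
hourly = (
    sensor.set_index("ts")
    .resample("1h")
    .mean()
    .join(weather.set_index("ts"))
    .reset_index()
)
```

The same join pattern extends to the traffic proxies, which arrive per grid cell rather than per timestamp.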
Preprocessing Pipeline
Raw readings underwent a 4-stage cleaning process: outlier removal via IQR clamping, humidity correction using a polynomial regression calibration curve, temporal alignment to UTC+5:30, and Kalman filter smoothing to reduce measurement noise without introducing lag artifacts.
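A stripped-down sketch of three of those stages — the thresholds, calibration coefficients, and noise parameters below are placeholders, not the values fitted against the CPCB reference monitors:

```python
from statistics import quantiles

def iqr_clamp(values, k=1.5):
    """Clamp readings outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lo), hi) for v in values]

def humidity_correct(reading, rh, coeffs=(1.0, -0.004, 0.00002)):
    """Polynomial humidity correction: reading * (a + b*rh + c*rh**2).
    Coefficients are illustrative, not the fitted calibration curve."""
    a, b, c = coeffs
    return reading * (a + b * rh + c * rh ** 2)

def kalman_smooth(values, q=1e-3, r=0.25):
    """Scalar Kalman filter: q = process noise, r = measurement noise."""
    x, p = values[0], 1.0
    out = []
    for z in values:
        p += q                 # predict: uncertainty grows
        k = p / (p + r)        # Kalman gain
        x += k * (z - x)       # update toward measurement z
        p *= (1 - k)
        out.append(x)
    return out

raw = [20.1, 19.8, 95.0, 20.4, 20.0, 19.9, 20.3]  # 95.0 is a sensor spike
clamped = iqr_clamp(raw)
smoothed = kalman_smooth(clamped)
```

Because the Kalman filter updates toward each new measurement rather than averaging a trailing window, it damps noise without the lag a moving average would introduce.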
Model Architecture
After benchmarking ARIMA, XGBoost, LSTM, and Transformer approaches, the best results came from a Temporal Fusion Transformer (TFT) — a model that combines recurrent layers with multi-head attention, specifically designed for multi-horizon time-series forecasting with heterogeneous inputs.
```python
# TFT configuration (PyTorch Forecasting)
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.metrics import QuantileLoss

tft = TemporalFusionTransformer.from_dataset(
    training,  # a TimeSeriesDataSet holding the training split
    learning_rate=3e-4,
    hidden_size=64,
    attention_head_size=4,
    dropout=0.15,
    hidden_continuous_size=32,
    loss=QuantileLoss(),
    log_interval=10,
)
```
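The `QuantileLoss` in that configuration is the pinball loss, which trains the model to emit calibrated prediction intervals rather than single point forecasts. A minimal scalar version, independent of PyTorch Forecasting, looks like:

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one prediction at quantile q.
    Under-prediction is penalized in proportion to q,
    over-prediction in proportion to (1 - q)."""
    err = y_true - y_pred
    return max(q * err, (q - 1) * err)

# At q = 0.5 this reduces to half the absolute error;
# at q = 0.9 under-predicting costs far more than over-predicting.
mid = pinball_loss(10.0, 8.0, 0.5)
hi_under = pinball_loss(10.0, 8.0, 0.9)
hi_over = pinball_loss(8.0, 10.0, 0.9)
```

Asymmetric penalties like this are what let a single model produce, say, both a median forecast and a 90th-percentile "worst plausible" pollution level.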
Results
The TFT model achieved an MAE of 4.2 µg/m³ on VOC prediction (6-hour horizon) — a 31% improvement over the LSTM baseline. Crucially, it also provides interpretable attention weights that reveal which features (time-of-day, wind direction, boundary layer height) the model relies on at each step.
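For context, the relative improvement follows the usual formula against the baseline's error; the LSTM value below is hypothetical, chosen only to show the arithmetic, since only the TFT's MAE is reported above:

```python
mae_tft = 4.2    # µg/m³, reported TFT MAE at the 6-hour horizon
mae_lstm = 6.1   # µg/m³, illustrative baseline MAE (not a reported figure)

# Relative improvement of the TFT over the baseline
improvement = (mae_lstm - mae_tft) / mae_lstm
```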
What's Next
The next phase involves deploying a lightweight version of the model to an edge device (Raspberry Pi 4) co-located with sensor nodes, enabling real-time inference without cloud round-trips. The calibration pipeline will be retrained quarterly to compensate for long-term sensor drift.