Urban Air Quality Prediction Using Temporal ML Models
From raw sensor data to real-time forecasts — where atmospheric physics meets machine learning.
Air quality in urban environments has become a critical public health concern. Traditional monitoring systems rely on expensive, sparsely distributed sensor networks — leaving massive blind spots across city grids. This project addresses that gap by building a temporal machine learning pipeline that predicts concentrations of volatile organic compounds (VOCs), ozone (O₃), and nitrogen oxides (NOx) using low-cost IoT sensor arrays and historical atmospheric data.
The Problem
Existing air quality indices (AQI) are computed from a handful of reference-grade stations per city. In a metropolis like Kolkata, a single station may represent conditions for over 500,000 people — despite air quality varying block by block with traffic, industry, and micro-meteorology.
The research question became: can we build a model that accurately forecasts pollutant levels 1–6 hours ahead, using sensor data from low-cost nodes augmented with weather and traffic features?
"The real challenge isn't the ML model — it's building a dataset that's actually usable."
Raw sensor readings drift over time, are sensitive to humidity, and have systematic biases that pure ML cannot handle without domain-aware preprocessing.
Dataset Construction
Data Sources
- Electrochemical sensor nodes — 8 low-cost nodes (MQ-series + CO/NO₂ electrochemical) deployed across 4 urban zones
- State reference monitors — CPCB-certified stations for ground-truth calibration
- ERA5 reanalysis data — hourly wind speed, direction, humidity, temperature, boundary layer height
- OpenStreetMap traffic proxies — road density and vehicle count estimates per grid cell
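Fusing these sources means putting everything on a common hourly index before any modeling. A minimal sketch with pandas — the column names and sample values here are illustrative, not the project's actual schema:

```python
import pandas as pd

# Hypothetical raw sensor readings (irregular timestamps) and hourly ERA5 weather
sensor = pd.DataFrame({
    "ts": pd.to_datetime(["2024-03-01 00:12", "2024-03-01 00:47", "2024-03-01 01:05"]),
    "no2_ppb": [21.0, 24.0, 19.5],
})
weather = pd.DataFrame({
    "ts": pd.to_datetime(["2024-03-01 00:00", "2024-03-01 01:00"]),
    "wind_ms": [2.1, 3.4],
    "blh_m": [450.0, 520.0],  # boundary layer height
})

# Resample irregular sensor readings to hourly means,
# then join the already-hourly weather features on the timestamp index
hourly = (
    sensor.set_index("ts")
    .resample("1h")
    .mean()
    .join(weather.set_index("ts"))
    .reset_index()
)
```

The same join pattern extends to the traffic proxies, which arrive per grid cell rather than per timestamp.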
Preprocessing Pipeline
Raw readings underwent a 4-stage cleaning process: outlier removal via IQR clamping, humidity correction using a polynomial regression calibration curve, temporal alignment to UTC+5:30, and Kalman filter smoothing to reduce measurement noise without introducing lag artifacts.
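A stripped-down sketch of three of those stages — the thresholds, calibration coefficients, and noise parameters below are placeholders, not the values fitted against the CPCB reference monitors:

```python
from statistics import quantiles

def iqr_clamp(values, k=1.5):
    """Clamp readings outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lo), hi) for v in values]

def humidity_correct(reading, rh, coeffs=(1.0, -0.004, 0.00002)):
    """Polynomial humidity correction: reading * (a + b*rh + c*rh**2).
    Coefficients are illustrative, not the fitted calibration curve."""
    a, b, c = coeffs
    return reading * (a + b * rh + c * rh ** 2)

def kalman_smooth(values, q=1e-3, r=0.25):
    """Scalar Kalman filter: q = process noise, r = measurement noise."""
    x, p = values[0], 1.0
    out = []
    for z in values:
        p += q                 # predict: uncertainty grows
        k = p / (p + r)        # Kalman gain
        x += k * (z - x)       # update toward measurement z
        p *= (1 - k)
        out.append(x)
    return out

raw = [20.1, 19.8, 95.0, 20.4, 20.0, 19.9, 20.3]  # 95.0 is a sensor spike
clamped = iqr_clamp(raw)
smoothed = kalman_smooth(clamped)
```

Because the Kalman filter updates toward each new measurement rather than averaging a trailing window, it damps noise without the lag a moving average would introduce.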
Model Architecture
After benchmarking ARIMA, XGBoost, LSTM, and Transformer approaches, the best results came from a Temporal Fusion Transformer (TFT) — a model that combines recurrent layers with multi-head attention, specifically designed for multi-horizon time-series forecasting with heterogeneous inputs.
```python
# TFT configuration (PyTorch Forecasting)
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.metrics import QuantileLoss

tft = TemporalFusionTransformer.from_dataset(
    training,  # a TimeSeriesDataSet holding the training split
    learning_rate=3e-4,
    hidden_size=64,
    attention_head_size=4,
    dropout=0.15,
    hidden_continuous_size=32,
    loss=QuantileLoss(),
    log_interval=10,
)
```
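The `QuantileLoss` in that configuration is the pinball loss, which trains the model to emit calibrated prediction intervals rather than single point forecasts. A minimal scalar version, independent of PyTorch Forecasting, looks like:

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one prediction at quantile q.
    Under-prediction is penalized in proportion to q,
    over-prediction in proportion to (1 - q)."""
    err = y_true - y_pred
    return max(q * err, (q - 1) * err)

# At q = 0.5 this reduces to half the absolute error;
# at q = 0.9 under-predicting costs far more than over-predicting.
mid = pinball_loss(10.0, 8.0, 0.5)
hi_under = pinball_loss(10.0, 8.0, 0.9)
hi_over = pinball_loss(8.0, 10.0, 0.9)
```

Asymmetric penalties like this are what let a single model produce, say, both a median forecast and a 90th-percentile "worst plausible" pollution level.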
Results
The TFT model achieved an MAE of 4.2 µg/m³ on VOC prediction (6-hour horizon) — a 31% improvement over the LSTM baseline. Crucially, it also provides interpretable attention weights that reveal which features (time-of-day, wind direction, boundary layer height) the model relies on at each step.
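For context, the relative improvement follows the usual formula against the baseline's error; the LSTM value below is hypothetical, chosen only to show the arithmetic, since only the TFT's MAE is reported above:

```python
mae_tft = 4.2    # µg/m³, reported TFT MAE at the 6-hour horizon
mae_lstm = 6.1   # µg/m³, illustrative baseline MAE (not a reported figure)

# Relative improvement of the TFT over the baseline
improvement = (mae_lstm - mae_tft) / mae_lstm
```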
What's Next
The next phase involves deploying a lightweight version of the model to an edge device (Raspberry Pi 4) co-located with sensor nodes, enabling real-time inference without cloud round-trips. The calibration pipeline will be retrained quarterly to compensate for long-term sensor drift.