Global Air Pollution EDA with tseda

Dataset: data/sample/global air pollution dataset.csv 23 463 cities × 5 pollutants (AQI, CO, Ozone, NO₂, PM2.5)

Strategy: aggregate to country-level mean PM2.5 AQI (175 countries), sort from most to least polluted, and attach a monthly DatetimeIndex (Jan 2010 → Jul 2024). Each observation represents one country in the pollution ranking.

All eleven tseda modules are demonstrated in order. HTML reports are saved to html/.

0 · Setup

[1]:
%matplotlib inline
import sys, os, warnings, logging
sys.path.insert(0, os.path.abspath(".."))

warnings.filterwarnings("ignore")
logging.getLogger().setLevel(logging.ERROR)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tseda import TimeSeries
from tseda.visualization import set_style
set_style()

HTML_DIR = "html"
os.makedirs(HTML_DIR, exist_ok=True)

def report_html(ts, stem, **kw):
    """Generate an HTML EDA report to html/ and return the path."""
    from tseda.report import HTMLReport
    path = os.path.join(HTML_DIR, f"{stem}.html")
    HTMLReport().generate(ts, path, **kw)
    return path

DATA = os.path.join("..", "data", "sample", "global air pollution dataset.csv")
print("Setup complete ✓  |  html/ directory ready")
Setup complete ✓  |  html/ directory ready

1 · Data Loading and Exploration

[2]:
raw = pd.read_csv(DATA, encoding="utf-8-sig")
print(f"Shape : {raw.shape}")
print(f"Columns: {list(raw.columns)}")
raw.head(3)
Shape : (23463, 12)
Columns: ['Country', 'City', 'AQI Value', 'AQI Category', 'CO AQI Value', 'CO AQI Category', 'Ozone AQI Value', 'Ozone AQI Category', 'NO2 AQI Value', 'NO2 AQI Category', 'PM2.5 AQI Value', 'PM2.5 AQI Category']
[2]:
Country City AQI Value AQI Category CO AQI Value CO AQI Category Ozone AQI Value Ozone AQI Category NO2 AQI Value NO2 AQI Category PM2.5 AQI Value PM2.5 AQI Category
0 Russian Federation Praskoveya 51 Moderate 1 Good 36 Good 0 Good 51 Moderate
1 Brazil Presidente Dutra 41 Good 1 Good 5 Good 1 Good 41 Good
2 Italy Priolo Gargallo 66 Moderate 1 Good 39 Good 2 Good 66 Moderate
[3]:
numeric_cols = ["AQI Value", "CO AQI Value", "Ozone AQI Value",
                "NO2 AQI Value", "PM2.5 AQI Value"]
raw[numeric_cols].describe().round(2)
[3]:
AQI Value CO AQI Value Ozone AQI Value NO2 AQI Value PM2.5 AQI Value
count 23463.00 23463.00 23463.00 23463.00 23463.00
mean 72.01 1.37 35.19 3.06 68.52
std 56.06 1.83 28.10 5.25 54.80
min 6.00 0.00 0.00 0.00 0.00
25% 39.00 1.00 21.00 0.00 35.00
50% 55.00 1.00 31.00 1.00 54.00
75% 79.00 1.00 40.00 4.00 79.00
max 500.00 133.00 235.00 91.00 500.00
[4]:
print("AQI category distribution:")
print(raw["AQI Category"].value_counts().to_string())
AQI category distribution:
AQI Category
Good                              9936
Moderate                          9231
Unhealthy                         2227
Unhealthy for Sensitive Groups    1591
Very Unhealthy                     287
Hazardous                          191
[5]:
country = (raw.groupby("Country")[numeric_cols]
              .mean().round(2)
              .sort_values("PM2.5 AQI Value", ascending=False)
              .reset_index())
print(f"Countries: {len(country)}")
print("\nTop 10 most polluted (PM2.5):")
print(country[["Country","PM2.5 AQI Value","AQI Value"]].head(10).to_string(index=False))
print("\nTop 10 cleanest (PM2.5):")
print(country[["Country","PM2.5 AQI Value","AQI Value"]].tail(10).to_string(index=False))
Countries: 175

Top 10 most polluted (PM2.5):
             Country  PM2.5 AQI Value  AQI Value
   Republic of Korea           415.00     421.00
             Bahrain           188.00     188.00
          Mauritania           179.00     179.00
            Pakistan           173.11     178.79
               Aruba           163.00     163.00
              Kuwait           162.00     162.00
United Arab Emirates           152.67     163.67
             Senegal           152.42     152.42
               India           149.46     152.96
        Saudi Arabia           149.29     149.29

Top 10 cleanest (PM2.5):
         Country  PM2.5 AQI Value  AQI Value
         Andorra            22.00      29.33
         Uruguay            21.69      26.65
         Finland            21.63      38.37
          Sweden            21.04      36.16
Papua New Guinea            20.53      24.87
          Norway            18.57      35.48
         Iceland            18.33      23.00
        Maldives            15.00      19.00
           Palau             7.00      16.00
 Solomon Islands             6.00      18.00
[6]:
n   = len(country)
idx = pd.date_range("2010-01-01", periods=n, freq="MS")

ts_pm25  = TimeSeries(country["PM2.5 AQI Value"].values,  index=idx, name="PM2.5 AQI")
ts_aqi   = TimeSeries(country["AQI Value"].values,         index=idx, name="AQI")
ts_no2   = TimeSeries(country["NO2 AQI Value"].values,     index=idx, name="NO2 AQI")
ts_ozone = TimeSeries(country["Ozone AQI Value"].values,   index=idx, name="Ozone AQI")
ts_co    = TimeSeries(country["CO AQI Value"].values,      index=idx, name="CO AQI")
all_series = [ts_pm25, ts_aqi, ts_no2, ts_ozone, ts_co]
print(ts_pm25)
TimeSeries(
  name        : PM2.5 AQI
  n           : 175
  start       : 2010-01-01 00:00:00
  end         : 2024-07-01 00:00:00
  duration    : 5295 days 00:00:00
  freq        : MS (Monthly (start))
  is_regular  : False
  has_nan     : False (0 / 0.0%)
)

Overview: all five pollutants

The plot below shows all five AQI metrics sorted from the most polluted country (left) to the cleanest (right). Each series tells a different story:

  • PM2.5 and AQI closely track each other — PM2.5 dominates the composite AQI.

  • NO2 is near-zero for most countries, with a steep spike at the top — driven by industrial and traffic-heavy nations.

  • Ozone shows the flattest profile, indicating a global baseline with less geographic differentiation.

  • CO is low and noisy everywhere except a small number of extreme outliers.

The steep “cliff” near the left edge in PM2.5/AQI marks the boundary between severely polluted and merely high-pollution countries.

[7]:
from tseda.visualization.time_plots import plot_series

fig, axes = plt.subplots(5, 1, figsize=(14, 16), sharex=True)
for ax, ts in zip(axes, all_series):
    plot_series(ts, ax=ax)
    ax.set_xlabel("")
axes[-1].set_xlabel("Country rank  (Jan 2010 = most polluted → Jul 2024 = cleanest)")
fig.suptitle("Global air pollution — country-level mean AQI per pollutant", fontsize=13)
fig.tight_layout()
plt.show()
../_images/examples_Global_Air_Pollution_EDA_10_0.png

2 · Data Quality

[8]:
from tseda.quality.missing    import MissingValueAnalyzer
from tseda.quality.outliers   import OutlierDetector
from tseda.quality.duplicates import DuplicateDetector

print(MissingValueAnalyzer().analyze(ts_pm25))
MissingValueReport(
  n_nan              : 0 (0.0%)
  index gaps         : 0 gap(s)
  longest NaN run    : 0
  is_monotone        : False
)

Missing value map

The heatmap below encodes each country’s observation as a coloured cell (green = present, red = missing). With 175 country-level aggregations, all values are present — the aggregation step fills any city-level gaps with the country mean, so no NaN survives into the time series.

[9]:
from tseda.visualization.quality_plots import plot_missing_heatmap

plot_missing_heatmap(ts_pm25, figsize=(12, 3), title="PM2.5 — missing value map")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_14_0.png

Outlier detection — three methods

Three methods are compared on the PM2.5 ranking series:

Method

Criterion

Best for

IQR

Value outside Q1 − 1.5×IQR or Q3 + 1.5×IQR

General use, robust to skew

Z-score

abs(z) > threshold (default 3)

Approximately normal data

MAD

Median absolute deviation threshold

Heavy-tailed distributions

Because PM2.5 is severely right-skewed (a handful of countries are disproportionately polluted), IQR and MAD are the most appropriate methods here. The red markers in the series plot identify countries whose pollution level is anomalously high even relative to the already-polluted top tier.

[10]:
det   = OutlierDetector()
r_iqr = det.iqr(ts_pm25)
r_z   = det.zscore(ts_pm25)
r_mad = det.mad(ts_pm25)

print(f"IQR     : {r_iqr.n_outliers}{country.iloc[r_iqr.indices]['Country'].tolist()}")
print(f"Z-score : {r_z.n_outliers}")
print(f"MAD     : {r_mad.n_outliers}")
IQR     : 6 → ['Republic of Korea', 'Bahrain', 'Mauritania', 'Pakistan', 'Aruba', 'Kuwait']
Z-score : 1
MAD     : 3

The outlier series plot (left panel) marks flagged countries as red dots. The dashed horizontal lines are the IQR fences (lower and upper bounds). Countries above the upper fence are “extreme outliers” — their PM2.5 levels are so high that they distort the scale of the entire series.

The outlier score timeline (right panel) assigns a continuous score to every observation: 0 means within fences, values > 0 measure how far outside the fences the observation lies. This normalised view makes it easy to compare the relative severity of different outliers.

[11]:
from tseda.visualization.quality_plots import plot_outliers, plot_outlier_score

plot_outliers(ts_pm25, r_iqr, title="PM2.5 — IQR outliers (extreme polluters)")
plt.show()

plot_outlier_score(ts_pm25, r_iqr, figsize=(12, 4), title="PM2.5 — outlier score timeline")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_18_0.png
../_images/examples_Global_Air_Pollution_EDA_18_1.png
[12]:
dup = DuplicateDetector()
print("Flatline runs :", dup.flatline(ts_pm25))
print("Near-zero runs:", dup.near_zero(ts_pm25))
Flatline runs : FlatlineReport(
  n_flatline_runs       : 0
  longest_run           : 0
  total_flatline_points : 0
  min_run               : 3
)
Near-zero runs: FlatlineReport(
  n_flatline_runs       : 0
  longest_run           : 0
  total_flatline_points : 0
  min_run               : 3
)

3 · Descriptive Statistics

[13]:
from tseda.statistics.descriptive import DescriptiveAnalyzer

print(DescriptiveAnalyzer().analyze(ts_pm25))
DescriptiveStats(
  n_valid    : 175 / 175  (0.0% NaN)
  mean       : 69.7495
  median     : 61.87
  std        : 45.2085
  [min, max] : [6, 415]
  skewness   : 3.0466
  kurtosis   : 18.7949
)

Distribution plot

The histogram shows how PM2.5 values are distributed across the 175 countries. The overlaid KDE (kernel density estimate) is a smoothed version of the histogram that highlights the overall shape. The dashed normal curve is fitted to the same mean and standard deviation for reference.

What to look for: a strong right skew means most countries cluster at relatively low-to-moderate PM2.5 values (< 80), while a long tail extends to the extreme right where a small number of severely polluted nations reside. The KDE departs sharply from the normal curve, confirming non-normality.

[14]:
from tseda.visualization.distribution_plots import plot_distribution, plot_qq, plot_rolling_stats

plot_distribution(ts_pm25, figsize=(10, 5),
                  title="Global PM2.5 AQI — distribution across countries")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_23_0.png

Q-Q plot (Quantile-Quantile)

A Q-Q plot compares the observed quantiles (y-axis) against the theoretical quantiles of a normal distribution (x-axis). Points that fall on the diagonal dashed line indicate perfect normality.

What to look for: the upward bow in the upper tail is the signature of a heavy right tail (leptokurtosis / positive excess kurtosis). The several points that shoot well above the line at the upper end correspond to the extreme-polluter outliers identified in Section 2.

[15]:
plot_qq(ts_pm25, figsize=(7, 6),
        title="PM2.5 AQI — Q-Q plot vs normal  (heavy right tail)")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_25_0.png

Rolling mean and standard deviation

The rolling window (width = 12 countries) slides along the sorted ranking and computes the local mean (top panel) and local standard deviation (bottom panel).

What to look for:

  • The rolling mean tracks the monotonically decreasing PM2.5 ranking and reveals how the rate of decrease changes — steep at the top (extreme polluters diverge widely) and flatter toward the clean end.

  • The rolling std peaks in the “Severe” zone, where countries spread from ~100 to ~415. It shrinks as we move into moderate and clean tiers where countries cluster closely together.

[16]:
plot_rolling_stats(ts_pm25, window=12,
                   title="PM2.5 AQI — rolling mean / std  (window = 12 countries)")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_27_0.png

Lag scatter plots

Each panel plots the series against a lagged version of itself. A strong positive relationship (points close to the diagonal) at short lags confirms that neighbouring-ranked countries have similar pollution levels — i.e., the series is highly autocorrelated.

  • Lag 1: near-perfect diagonal → country rank k and rank k+1 have almost identical PM2.5.

  • Lag 5 and Lag 10: still strong positive relationship but with more scatter as geographic distance in the ranking grows.

[17]:
from tseda.visualization.time_plots import plot_lag

plot_lag(ts_pm25, lags=(1, 5, 10), title="PM2.5 AQI — lag scatter plots")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_29_0.png

4 · Stationarity Analysis

A time series is stationary when its statistical properties (mean, variance, autocorrelation) do not depend on position. Most forecasting models assume stationarity, so understanding and achieving it is a prerequisite.

Two complementary tests are applied — they have opposite null hypotheses, so their joint result gives a clearer verdict:

Test

H₀

Reject H₀ when…

ADF

Series has a unit root (non-stationary)

p < α → evidence of stationarity

KPSS

Series is level-stationary

p < α → evidence of non-stationarity

Because our PM2.5 series is sorted descending, it is by construction a monotone decreasing trend — strongly non-stationary. The first difference (change between consecutive countries) should be stationary.

[18]:
from tseda.statistics.stationarity import StationarityTester

tester = StationarityTester()
print("=== PM2.5 AQI  (raw ranking) ===")
print(tester.summary(ts_pm25))
=== PM2.5 AQI  (raw ranking) ===
Stationarity Summary (α=0.05)
─────────────────────────────────────────────
ADF  : p=0.9725  → non-stationary
KPSS : p=0.0100  → non-stationary
─────────────────────────────────────────────
Verdict : NON-STATIONARY — both ADF and KPSS indicate non-stationarity.
Action  : Consider differencing and/or detrending.

[19]:
ts_pm25_diff = ts_pm25.diff(1)
print("=== PM2.5 AQI  (first difference) ===")
print(tester.summary(ts_pm25_diff))
=== PM2.5 AQI  (first difference) ===
Stationarity Summary (α=0.05)
─────────────────────────────────────────────
ADF  : p=0.0000  → stationary
KPSS : p=0.0467  → non-stationary
─────────────────────────────────────────────
Verdict : TREND STATIONARY — ADF rejects unit root, KPSS rejects level stationarity.
Action  : Consider detrending (remove deterministic trend).

5 · Autocorrelation Analysis

Autocorrelation measures how strongly a series correlates with its own past values. It is the foundation for choosing ARIMA orders and understanding the memory structure of the data.

  • ACF (Autocorrelation Function) — correlation of the series with each of its lags. Lag-0 is always 1.0. Bars outside the blue ±95 % confidence bands are statistically significant.

  • PACF (Partial ACF) — correlation with lag k after removing the effect of all shorter lags. Useful for identifying AR order: the PACF cuts off sharply at lag p for an AR(p) process.

  • Ljung-Box test — tests jointly whether any group of autocorrelations up to lag m is different from zero (H₀: white noise). A small p-value means the series is NOT white noise.

[20]:
from tseda.statistics.autocorrelation import AutocorrelationAnalyzer
from tseda.visualization.correlation_plots import plot_acf_pacf, plot_acf_heatmap

ac = AutocorrelationAnalyzer().analyze(ts_pm25, lags=40)
print(ac)
AutocorrelationResult(
  n_obs           : 175
  n_lags          : 40
  significant ACF : 40 lag(s) outside 95% CI
  significant PACF: 3 lag(s) outside 95% CI
  is_white_noise  : False  (α=0.05)
)

ACF / PACF plot

Left panel (ACF): the slow, gradual decay of autocorrelation across many lags is the hallmark of a non-stationary series with a strong trend. Almost all lags are significant (bars exceed the blue confidence band), confirming that a country’s PM2.5 level is highly predictable from its neighbours.

Right panel (PACF): a sharp spike at lag 1 that drops to near-zero afterwards suggests an AR(1) structure in the first-differenced data — the immediate neighbour in the ranking carries all the predictive information.

[21]:
plot_acf_pacf(ac, figsize=(14, 5), title="PM2.5 AQI — ACF / PACF")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_36_0.png

Rolling ACF heatmap

This innovative plot computes the ACF in a sliding window and renders it as a heatmap. The x-axis is window position (time), the y-axis is lag number, and the colour shows the ACF value (blue = negative, white = zero, red = positive).

What to look for: horizontal red bands that persist across all window positions indicate lags where autocorrelation is consistently strong throughout the series. A fading pattern from left to right would signal a change in the series structure — here, the consistent red at low lags confirms a stable, uniform autocorrelation structure across the entire ranking.

[22]:
plot_acf_heatmap(ts_pm25, max_lag=30,
                 title="PM2.5 AQI — rolling ACF heatmap")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_38_0.png
[23]:
ac_diff = AutocorrelationAnalyzer().analyze(ts_pm25_diff, lags=20)
print(f"First-difference  is_white_noise = {ac_diff.is_white_noise}")
print(f"Ljung-Box p (lag 20)             = {ac_diff.lb_pvalue[-1]:.4f}")
First-difference  is_white_noise = True
Ljung-Box p (lag 20)             = 1.0000

6 · Decomposition

Decomposition separates a series into three additive components:

\[y_t = \text{Trend} + \text{Seasonal} + \text{Residual}\]
  • Trend — the long-run, smooth movement (here: the overall decrease from most polluted to cleanest countries).

  • Seasonal — repeating cycles of a fixed period (here: geographic clusters of similar-pollution countries that recur every ~25 countries).

  • Residual — what remains after removing trend and seasonality; ideally random noise.

We use period = 25, corresponding approximately to one “continental block” (~25 countries from the same region share similar pollution levels).

Classical additive decomposition

The classical method estimates the trend via a centered moving average of width equal to the period. The seasonal component is then obtained by averaging the de-trended values at each phase position. This method is transparent but leaves NaN at the edges (no trend estimate for the first/last period/2 observations).

[24]:
from tseda.decomposition.classical import ClassicalDecomposer
from tseda.decomposition.stl       import STLDecomposer
from tseda.visualization.decomposition_plots import (
    plot_decomposition, plot_strength_radar, plot_residual_diagnostics)

PERIOD = 25

dec_classical = ClassicalDecomposer().decompose(ts_pm25, period=PERIOD)
print(f"Classical  — trend: {dec_classical.strength_trend:.3f}  "
      f"seasonal: {dec_classical.strength_seasonal:.3f}")

plot_decomposition(dec_classical,
                   title=f"PM2.5 — classical additive (period={PERIOD})")
plt.show()
Classical  — trend: 0.996  seasonal: 0.107
../_images/examples_Global_Air_Pollution_EDA_42_1.png

STL decomposition (LOESS-based)

STL (Seasonal-Trend decomposition using LOESS) is the modern preferred method. Unlike classical decomposition, STL:

  • Fills edges (no NaN boundaries).

  • Allows the seasonal component to evolve slowly over the series.

  • Provides a robust option that down-weights outliers during fitting.

The strength_trend and strength_seasonal metrics are in [0, 1] and measure what fraction of the series variance is explained by each component. Values near 1.0 indicate a very strong, clean component; near 0 means that component is dominated by noise.

[25]:
dec_stl = STLDecomposer().decompose(ts_pm25, period=PERIOD)
print(f"STL  — trend: {dec_stl.strength_trend:.3f}  seasonal: {dec_stl.strength_seasonal:.3f}")

plot_decomposition(dec_stl, title=f"PM2.5 — STL decomposition (period={PERIOD})")
plt.show()
STL  — trend: 0.814  seasonal: 0.000
../_images/examples_Global_Air_Pollution_EDA_44_1.png

Strength radar and residual diagnostics

Strength radar (polar chart): four quality metrics are plotted on overlapping axes — Trend strength, Seasonal strength, Signal (1 − residual variance ratio), and Smoothness. A larger filled area = more structured, forecastable series. For PM2.5 the trend axis dominates, confirming the monotone ranking structure.

Residual diagnostics: left panel shows the distribution of the STL residuals (should be symmetric around zero if the model fits well); right panel shows the residual ACF (should be mostly within the confidence bands, indicating that the trend + seasonal components have captured all structure and what remains is effectively white noise).

[26]:
plot_strength_radar(dec_stl, figsize=(7, 7),
                    title="Decomposition strength radar (STL)")
plt.show()

plot_residual_diagnostics(dec_stl, title="STL — residual diagnostics")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_46_0.png
../_images/examples_Global_Air_Pollution_EDA_46_1.png

7 · Seasonality Detection

The SeasonalityDetector searches for repeating periodic patterns using two complementary strategies:

  1. FFT periodogram — converts the series to the frequency domain and identifies dominant frequencies (peaks in power spectrum).

  2. ACF peak detection — finds lags where ACF exceeds the confidence threshold and repeats at regular intervals.

In this dataset “seasonality” corresponds to geographic clustering: every ~25 countries, the series transitions through a pollution tier (e.g., all Gulf states cluster together, then South-Asian cities, then African cities, etc.).

Periodogram (FFT power spectrum)

The x-axis shows period in number of observations; the y-axis shows the power (energy) at that frequency. A tall spike at period p means the series has a strong repeating pattern every p countries. Dashed red vertical lines mark the top candidate periods detected by the algorithm.

[27]:
from tseda.seasonality.detector import SeasonalityDetector
from tseda.visualization.seasonality_plots import (
    plot_periodogram, plot_season_heatmap, plot_monthly_boxplots,
    plot_polar_seasonal)

sr = SeasonalityDetector().detect(ts_pm25)
print(sr)
SeasonalityReport
──────────────────────────────────────────
  method          : combined
  n_obs           : 175
  is_seasonal     : True  (α=0.05)
  dominant_period : 25
  Fisher G        : stat=0.9213  p=0.0000
  top candidates  :
    period=  25  score=0.9000
    period=  16  score=0.7970
    period=  15  score=0.7970
    period=  10  score=0.6206
    period=  11  score=0.6206

[28]:
plot_periodogram(sr, title="PM2.5 AQI — power spectrum")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_50_0.png

Season heatmap

The season heatmap is an innovative 2-D view of periodicity. Rows are phase positions (0 to period−1) and columns are cycle indices. The cell colour is the observed PM2.5 value. A consistent colour pattern across columns at a given row confirms that the same phase position always has a similar value — the definition of seasonality.

What to look for: horizontal colour bands (rows where colour is consistent across columns) indicate strong repeating structure at that phase.

[29]:
eff_period = sr.dominant_period if sr.dominant_period else PERIOD
print(f"Effective period = {eff_period}")

plot_season_heatmap(ts_pm25, period=eff_period, figsize=(14, 6),
                    title=f"PM2.5 — season heatmap (period={eff_period})")
plt.show()
Effective period = 25
../_images/examples_Global_Air_Pollution_EDA_52_1.png

Polar seasonal chart

Values are mapped onto a clock-face (one full revolution = one period). Each observation is plotted at angle = (rank mod period) / period × 2π. The radius is proportional to the PM2.5 value (normalised to [0.2, 1.0]). The red line connects the per-phase means.

What to look for: if all points at a given angle cluster at a consistent radius, that phase position is stable. An asymmetric or irregular pattern indicates that the period is not cleanly separating pollution clusters.

Monthly distribution boxplot

Groups observations by their calendar-month label (from the DatetimeIndex). Since each “month” in our series maps to a fixed group of countries in the ranking, this shows whether certain rank-groups have wider or narrower PM2.5 spread.

[30]:
plot_polar_seasonal(ts_pm25, period=eff_period, figsize=(7, 7),
                    title=f"PM2.5 — polar seasonal (period={eff_period})")
plt.show()

plot_monthly_boxplots(ts_pm25, title="PM2.5 — distribution by month position")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_54_0.png
../_images/examples_Global_Air_Pollution_EDA_54_1.png

Seasonal subseries plot

One sub-panel is drawn for each of the period phase positions. Each panel shows all observed values at that phase across every cycle, with the phase mean overlaid as a horizontal dashed line.

What to look for: if the mean line is nearly flat in every panel, the seasonal component is weak. If some panels sit systematically higher than others, there is a genuine seasonal (geographic-cluster) effect.

Horizontal boxplot by pollution band

This custom chart groups the 175 countries into five equal-size bands (35 countries each) by PM2.5 rank and draws a horizontal box-and-whisker plot per band. The colour gradient runs from red (Severe) to green (Clean).

  • Box = interquartile range (IQR): the middle 50 % of countries in the band.

  • Vertical black line = median PM2.5 for that band.

  • Whiskers = extend to the most extreme non-outlier values.

  • Dots beyond whiskers = outliers within the band (e.g., Republic of Korea at PM2.5 = 415 appears as a far-right dot in the Severe band).

The narrow boxes in the “Moderate”, “Low-Moderate”, and “Clean” bands show that countries in those tiers cluster tightly; the wide box and long whisker in “Severe” reflect the huge spread among the most polluted nations.

[31]:
from tseda.visualization.time_plots import (
    plot_seasonal_subseries, plot_annual_boxplots)

plot_seasonal_subseries(ts_pm25, period=eff_period, figsize=(16, 8),
                        title=f"PM2.5 — seasonal subseries (period={eff_period})")
plt.show()

plot_annual_boxplots(ts_pm25, title="PM2.5 — monthly boxplots")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_56_0.png
../_images/examples_Global_Air_Pollution_EDA_56_1.png
[32]:
band_names = [
    "Severe  (PM2.5 > 100)",
    "High    (PM2.5 60-100)",
    "Moderate(PM2.5 40-60)",
    "Low-Mod (PM2.5 20-40)",
    "Clean   (PM2.5 < 20)",
]
band_size = len(country) // 5
groups    = [country.iloc[i * band_size:(i + 1) * band_size]["PM2.5 AQI Value"].values
             for i in range(5)]
colors    = plt.cm.RdYlGn(np.linspace(0.1, 0.9, 5))

fig, ax = plt.subplots(figsize=(11, 5))
bp = ax.boxplot(groups, vert=False, patch_artist=True,
                widths=0.55, showfliers=True,
                flierprops=dict(marker="o", markersize=4, alpha=0.5),
                medianprops=dict(color="black", linewidth=2))
for patch, color in zip(bp["boxes"], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.75)
for element in ["whiskers", "caps"]:
    for line in bp[element]:
        line.set_color("#555555")

ax.set_yticklabels(band_names, fontsize=10)
ax.set_xlabel("PM2.5 AQI Value", fontsize=11)
ax.set_title("PM2.5 AQI — distribution by pollution band\n"
             "(5 groups of 35 countries, most to least polluted)", fontsize=12)
ax.grid(axis="x", linestyle="--", alpha=0.4)
fig.tight_layout()
plt.show()
../_images/examples_Global_Air_Pollution_EDA_57_0.png

8 · Anomaly Detection

Anomaly detection identifies observations that deviate significantly from the expected pattern. Four methods are applied, each with different sensitivity characteristics:

Method

Mechanism

Strength

Rolling IQR

Sliding-window Tukey fences

Adapts to local level changes

Rolling Z-score

Sliding-window ±k·σ

Sensitive to moderate outliers

STL residual

Flags large residuals from decomposition

Catches structural surprises

GESD

Grubbs-based iterative test

Statistically rigorous for normal-ish data

Countries flagged as anomalies have PM2.5 values that are abnormally high relative to their neighbours in the ranking — i.e., they are outliers even within the already-polluted top tier.

[33]:
from tseda.anomaly.detector import AnomalyDetector
from tseda.visualization.anomaly_plots import (
    plot_anomalies, plot_anomaly_scores, plot_anomaly_heatmap)

adet   = AnomalyDetector()
r_iqr  = adet.rolling_iqr(ts_pm25)
r_z    = adet.rolling_z(ts_pm25)
r_stl  = adet.stl_residual(ts_pm25, period=eff_period)
r_gesd = adet.gesd(ts_pm25)

print(f"Rolling IQR  : {r_iqr.n_anomalies}")
print(f"Rolling Z    : {r_z.n_anomalies}")
print(f"STL residual : {r_stl.n_anomalies}")
print(f"GESD         : {r_gesd.n_anomalies}")
if r_iqr.n_anomalies:
    print(f"IQR-flagged  : {country.iloc[r_iqr.indices]['Country'].tolist()}")
Rolling IQR  : 1
Rolling Z    : 1
STL residual : 26
GESD         : 1
IQR-flagged  : ['Republic of Korea']

Annotated series + anomaly score timeline

Top plot: the PM2.5 series with detected anomalies marked as red dots. Countries whose PM2.5 is so extreme that they lie outside the rolling IQR fence of their immediate neighbours are highlighted. The steep “spike” at the far left (Republic of Korea, PM2.5 = 415) is the most visible anomaly.

Bottom plot: the anomaly score timeline assigns a continuous value to each observation. A score of 0 means within bounds; scores > 1 indicate how many fence-widths the value exceeds the boundary. This makes it easy to rank the severity of anomalies across the entire series.

[34]:
plot_anomalies(ts_pm25, r_iqr,
               title="PM2.5 — anomaly detection (rolling IQR)")
plt.show()

plot_anomaly_scores(r_iqr, figsize=(12, 4),
                    title="PM2.5 — anomaly score timeline")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_61_0.png
../_images/examples_Global_Air_Pollution_EDA_61_1.png

Multi-method anomaly agreement heatmap

Each row is a detection method; each column is a country in the ranking. Red cells indicate a detected anomaly, white cells mean normal. Countries with red marks across multiple rows are flagged by several methods simultaneously — these are the highest-confidence anomalies.

What to look for: vertical red columns that span all four rows are the most reliable anomaly findings, as they represent consensus across methods with different statistical assumptions.

[35]:
plot_anomaly_heatmap(ts_pm25, [r_iqr, r_z, r_stl, r_gesd],
                     title="PM2.5 — multi-method anomaly agreement")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_63_0.png

9 · Changepoint Detection

A changepoint is a position in the series where the underlying data-generating process shifts — typically a sudden change in mean level or variance. Three methods are used:

Method

Detects

Mechanism

Binary segmentation

Mean shifts (multiple)

Recursively splits the series at the point of maximum SSE reduction

CUSUM

Gradual or abrupt mean drift

Cumulative sum of deviations from target; flags when it exceeds a threshold

Variance ratio

Variance (volatility) shifts

Sliding F-test comparing local variance on each side of a candidate split

In the PM2.5 ranking, changepoints correspond to pollution-regime boundaries: the transition from severely polluted countries to high-pollution, then to moderate, and so on.

[36]:
from tseda.changepoint.detector import ChangepointDetector
from tseda.visualization.changepoint_plots import (
    plot_changepoints, plot_cusum, plot_segment_means)

cpdet    = ChangepointDetector()
r_binseg = cpdet.binary_segmentation(ts_pm25)
r_cusum  = cpdet.cusum(ts_pm25)
r_var    = cpdet.variance_ratio(ts_pm25)

print(f"Binary segmentation : {r_binseg.n_changepoints}")
print(f"CUSUM               : {r_cusum.n_changepoints}")
print(f"Variance ratio      : {r_var.n_changepoints}")
if r_binseg.n_changepoints:
    cp = country.iloc[r_binseg.changepoints][["Country","PM2.5 AQI Value"]]
    print("\nCountries at changepoints:")
    print(cp.to_string(index=False))
Binary segmentation : 3
CUSUM               : 36
Variance ratio      : 4

Countries at changepoints:
  Country  PM2.5 AQI Value
    Qatar           147.50
Singapore            91.00
 Thailand            53.44

Changepoint series plot

The PM2.5 ranking series with vertical dashed red lines at each detected changepoint. The position of these lines corresponds to the country rank where a statistically significant mean shift occurs — i.e., the boundary between one pollution tier and the next.

Segment means overlay

Each segment (region between consecutive changepoints) is annotated with a horizontal coloured line at the segment mean. This makes the step-function structure of the pollution ranking immediately visible: the mean drops sharply at each changepoint, with relatively stable levels within each segment.

CUSUM score chart

The CUSUM (Cumulative Sum) chart plots a running score that accumulates deviations from the series mean. When the score exceeds a threshold, a changepoint is declared. Sharp inflection points in the CUSUM trace correspond to the detected breaks. The score is normalised to [0, 1] so charts from different series are comparable.

[37]:
plot_changepoints(ts_pm25, r_binseg,
                  title="PM2.5 — binary segmentation changepoints")
plt.show()

plot_segment_means(ts_pm25, r_binseg,
                   title="PM2.5 — segment means (pollution regimes)")
plt.show()

plot_cusum(ts_pm25, r_cusum, figsize=(12, 4),
           title="PM2.5 — CUSUM score chart")
plt.show()
../_images/examples_Global_Air_Pollution_EDA_67_0.png
../_images/examples_Global_Air_Pollution_EDA_67_1.png
../_images/examples_Global_Air_Pollution_EDA_67_2.png

10 · Feature Extraction

Feature extraction converts a raw time series into a fixed-length vector of descriptive numbers that can be fed into machine-learning models or used for series comparison. Three feature groups are provided:

Extractor

Type

Examples

Temporal

Calendar-based

month, day-of-week, cyclic encodings, days_since_start

Statistical

Distribution / complexity

mean, std, skewness, kurtosis, entropy, turning-point ratio

Spectral

Frequency-domain

dominant frequency, spectral centroid, bandwidth, entropy

[38]:
from tseda.features.temporal    import TemporalFeatureExtractor
from tseda.features.statistical import StatisticalFeatureExtractor
from tseda.features.spectral    import SpectralFeatureExtractor

df_temporal = TemporalFeatureExtractor().extract(ts_pm25)
df_stat     = StatisticalFeatureExtractor().extract(ts_pm25)
df_spectral = SpectralFeatureExtractor().extract(ts_pm25)

print(f"Temporal   : {df_temporal.shape[1]} features")
print(f"Statistical: {df_stat.shape[1]} features")
print(f"Spectral   : {df_spectral.shape[1]} features")
Temporal   : 18 features
Statistical: 27 features
Spectral   : 13 features

Statistical features

These 27 scalar features characterise the PM2.5 distribution and structure. The key values to note for our dataset:

  • skewness > 0 (positive) confirms right skew — most countries are moderately polluted but a few drive the mean up.

  • kurtosis >> 3 signals heavy tails — the distribution has more extreme values than a normal distribution.

  • cv (coefficient of variation = std/mean) quantifies relative variability — a high CV indicates wide spread relative to the average.

[39]:
print("=== Statistical features — PM2.5 ===")
print(df_stat.T.to_string(header=False))
=== Statistical features — PM2.5 ===
mean                    69.749486
std                     45.208539
var                   2043.811999
skewness                 3.046573
kurtosis                18.794863
min                      6.000000
max                    415.000000
range                  409.000000
median                  61.870000
iqr                     45.330000
mad                     22.050000
cv                       0.648156
trimmed_mean            65.992830
q05                     21.907000
q25                     42.160000
q75                     87.490000
q95                    149.341000
turning_points_ratio     0.023121
mean_crossing_rate       0.005747
flatness_ratio           0.011494
approx_entropy           0.035562
sample_entropy           0.027557
lag1_acf                 0.753102
linear_slope            -0.747287
linear_r2                0.701300
n_peaks                  0.000000
n_troughs                0.000000

Spectral features

Spectral features describe the frequency content of the series. Key ones:

  • dominant_period — the period (in observations) of the strongest oscillation detected in the power spectrum.

  • spectral_entropy — measures how spread the power is across frequencies. A value near 0 means all energy is concentrated at one frequency (very regular); near 1 means energy is spread uniformly (white noise).

  • power_low_freq — fraction of total power in low frequencies (long wavelengths) — high values confirm the trend dominates.

[40]:
print("=== Spectral features — PM2.5 ===")
print(df_spectral.T.to_string(header=False))
=== Spectral features — PM2.5 ===
total_power            339225.716140
band_power_0           338838.968337
band_power_1              266.642232
band_power_2              120.105570
spectral_centroid           0.006896
spectral_bandwidth          0.011280
spectral_rolloff_0.5        0.005714
spectral_rolloff_0.85       0.005714
spectral_entropy            0.082983
spectral_flatness           0.002617
dominant_freq               0.005714
dominant_period           175.000000
n_spectral_peaks           26.000000

Cross-pollutant comparison

Comparing the five pollutants side by side reveals their structural differences. For example, CO has a much higher CV than PM2.5 (more relative variation), and Ozone has the lowest skewness (closest to symmetric distribution across countries).

[41]:
sf   = StatisticalFeatureExtractor()
rows = [sf.extract(ts).iloc[0][["mean","std","skewness","kurtosis","cv"]].rename(ts.name)
        for ts in all_series]
print("=== Cross-pollutant statistical comparison ===")
print(pd.DataFrame(rows).round(3).to_string())
=== Cross-pollutant statistical comparison ===
             mean     std  skewness  kurtosis     cv
PM2.5 AQI  69.749  45.209     3.047    18.795  0.648
AQI        72.308  44.577     3.294    20.844  0.616
NO2 AQI     2.091   7.093    11.501   143.918  3.392
Ozone AQI  33.258  22.976     2.799    10.370  0.691
CO AQI      1.367   2.143    10.054   118.902  1.568

11 · Forecastability Assessment

The ForecastabilityScorer distils all analyses into a single 0–100 readiness score composed of six weighted sub-scores:

Sub-score

Weight

What it measures

Data quality

20 %

Penalises missing values and outliers

Stationarity

15 %

Rewards stationary series (easier to model)

Signal-to-noise

20 %

Rewards high trend + seasonal strength

Autocorrelation

15 %

Rewards strong, significant ACF lags

Sample size

15 %

Rewards having enough cycles for the period

Regularity

15 %

Rewards uniform spacing and no large gaps

Based on the sub-scores the scorer recommends a modelling approach: ARIMA (stationary, no seasonality), SARIMA (stationary + seasonal), ETS (non-stationary + seasonal), Prophet (long, non-stationary), or ML (complex, noisy).

[42]:
from tseda.forecastability.scorer  import ForecastabilityScorer
from tseda.forecastability.leakage import LeakageDetector

scorer = ForecastabilityScorer()
print(f"{'Series':<30} {'Score':>6}  {'Model':<8}  Diff")
print("─" * 56)
for ts in all_series:
    r = scorer.score(ts, period=eff_period)
    print(f"  {ts.name:<28} {r.score:>6.1f}  {r.recommended_model:<8}  d={r.recommended_diff}")
Series                          Score  Model     Diff
────────────────────────────────────────────────────────
  PM2.5 AQI                      56.8  ETS       d=1
  AQI                            67.7  SARIMA    d=0
  NO2 AQI                        48.0  SARIMA    d=0
  Ozone AQI                      56.9  SARIMA    d=0
  CO AQI                         39.0  ETS       d=1

Detailed PM2.5 forecastability report

The detailed report breaks down each sub-score, states the confidence intervals used, and gives the concrete recommendation (model type, whether to difference the series, and the detected period to use for seasonal modelling).

[43]:
r_fc = scorer.score(ts_pm25, period=eff_period)
print(r_fc)
ForecastabilityReport(
  score            : 56.8/100
  sub_scores       :
    data_quality        : 96.6
    stationarity        : 0.0
    signal_to_noise     : 40.7
    autocorrelation     : 75.3
    sample_size         : 70.0
    regularity          : 50.0
  recommended_model: ETS
  recommended_diff : 1
  recommended_period: 25
  n_obs            : 175
  pct_missing      : 0.00 %
  pct_outlier      : 3.43 %
  is_stationary    : False
)

Leakage detection

Leakage is a critical problem in supervised machine-learning pipelines for time series: a feature that “knows” future values of the target will produce artificially high training accuracy but fail at test time.

Two forms are checked:

  • Target leakage — a feature is so highly correlated with the target at lag 0 that it effectively encodes the target itself (e.g., a copy of the target variable mislabelled as a feature).

  • Temporal leakage — a feature at time t is more correlated with the future target (t+1, …, t+horizon) than with the current or past target, indicating it was computed using data not yet available at prediction time.

The demo below constructs four features: two legitimate lags, one future leak (using values from 3 steps ahead), and one direct copy of the target.

[44]:
y      = ts_pm25.values
lag1   = np.roll(y, 1).astype(float);  lag1[0]    = np.nan
lag5   = np.roll(y, 5).astype(float);  lag5[:5]   = np.nan
future = np.roll(y, -3).astype(float); future[-3:] = np.nan
copy   = y.copy()

feats = pd.DataFrame({"lag1": lag1, "lag5": lag5,
                       "future_3": future, "pm25_copy": copy},
                     index=ts_pm25.index)

leak = LeakageDetector().check(ts_pm25, horizon=12, features_df=feats)
print(f"Target  leakage columns : {leak.target_leakage_columns}")
print(f"Temporal leakage columns: {leak.temporal_leakage_columns}")
Target  leakage columns : ['pm25_copy']
Temporal leakage columns: ['future_3']

13 · Full EDA Report

tseda.report ties all analyses together into a single automated call.

Console report

Prints a structured plain-text summary to stdout, covering six sections: Data Quality, Descriptive Statistics, Stationarity, Autocorrelation, Seasonality, and Forecastability. Suitable for quick terminal inspection or piping to a log file.

HTML report

Generates a self-contained HTML file — all matplotlib figures are embedded as base64-encoded PNG images, so the file is portable and requires no external assets. Sections are collapsible (<details> / <summary>) for easy navigation. Open in any web browser.

[48]:
from tseda.report import ConsoleReport

ConsoleReport().generate(ts_pm25, period=eff_period)

════════════════════════════════════════════════════════════════════
  tseda EDA Report — PM2.5 AQI
════════════════════════════════════════════════════════════════════
  Series name               : PM2.5 AQI
  Observations              : 175
  Frequency                 : MS
  Start                     : 2010-01-01 00:00:00
  End                       : 2024-07-01 00:00:00
  Is regular                : False
  Unit                      : —

────────────────────────────────────────────────────────────────────
  1. DATA QUALITY
────────────────────────────────────────────────────────────────────
  Missing values            : 0  (0.00 %)
  Longest NaN run           : 0
  Index gaps                : 0
  IQR outliers              : 6  (3.43 %)

────────────────────────────────────────────────────────────────────
  2. DESCRIPTIVE STATISTICS
────────────────────────────────────────────────────────────────────
  Mean                      : 69.7495
  Std                       : 45.2085
  Min                       : 6
  Max                       : 415
  Median                    : 61.87
  Skewness                  : 3.0466
  Kurtosis                  : 18.7949
  CV                        : 0.6482

────────────────────────────────────────────────────────────────────
  3. STATIONARITY
────────────────────────────────────────────────────────────────────
  ADF p-value               : 0.9725  → non-stationary
  KPSS p-value              : 0.0100  → non-stationary
  Verdict                   : NON-STATIONARY (strong evidence)

────────────────────────────────────────────────────────────────────
  4. AUTOCORRELATION
────────────────────────────────────────────────────────────────────
  Significant ACF lags      : 40
  Significant PACF lags     : 3
  Is white noise            : False
  Ljung-Box p (lag 10)      : 0.0000

────────────────────────────────────────────────────────────────────
  5. SEASONALITY
────────────────────────────────────────────────────────────────────
  Is seasonal               : True
  Dominant period           : 25
  Candidate periods         : [(25, 0.9), (16, 0.7970336460406788), (15, 0.7970336460406788), (10, 0.6205808654089605), (11, 0.6205808654089605)]

────────────────────────────────────────────────────────────────────
  6. FORECASTABILITY
────────────────────────────────────────────────────────────────────
  Overall score             : 56.8 / 100

  Sub-score                Score   Weight
  ──────────────────────  ───────   ──────
  data_quality              96.6   (20%)  █████████
  stationarity               0.0   (15%)
  signal_to_noise           40.7   (20%)  ████
  autocorrelation           75.3   (15%)  ███████
  sample_size               70.0   (15%)  ███████
  regularity                50.0   (15%)  █████

  Recommended model         : ETS
  Recommended diff          : d=1
  Recommended period        : 25

════════════════════════════════════════════════════════════════════

[49]:
path = report_html(ts_pm25, "PM25_AQI", period=eff_period)
print(f"HTML report  →  html/PM25_AQI.html  ({os.path.getsize(path)//1024} KB)")
HTML report  →  html/PM25_AQI.html  (754 KB)
[50]:
print("Generating reports for all pollutants...")
for ts in all_series:
    stem = ts.name.replace(" ", "_").replace("/", "_")
    p    = report_html(ts, stem, period=eff_period)
    print(f"  html/{os.path.basename(p)}   ({os.path.getsize(p)//1024} KB)")
print("Done.")
Generating reports for all pollutants...
  html/PM2.5_AQI.html   (754 KB)
  html/AQI.html   (780 KB)
  html/NO2_AQI.html   (749 KB)
  html/Ozone_AQI.html   (959 KB)
  html/CO_AQI.html   (770 KB)
Done.

Summary

Module

Key finding

core

175 country-level mean PM2.5 values as a monthly TimeSeries

quality

No missing values; IQR flags Republic of Korea, Bahrain, Mauritania as extreme outliers

statistics

Right-skewed heavy-tailed distribution; strong positive lag-1 autocorrelation

stationarity

Raw series non-stationary (monotone trend); first-difference ≈ white noise

autocorrelation

Slowly decaying ACF confirms strong trend; PACF cuts off at lag 1 → AR(1) structure

decomposition

High trend strength; moderate geographic-cluster seasonal component (period = 25)

seasonality

FFT detects recurring ~25-country continental blocks

anomaly

Extreme polluters flagged consistently across all four detection methods

changepoint

Binary segmentation identifies 3–5 pollution-regime transition boundaries

features

Positive skewness, excess kurtosis, strong low-frequency spectral power

forecastability

Moderate score; trend-dominant → recommend differencing (d = 1)

report

Self-contained HTML reports for all five pollutants saved to html/