Data Quality

tseda.quality

Data quality diagnostics for time series.

Public API

MissingValueReport

Immutable result of MissingValueAnalyzer.

MissingValueAnalyzer

Detect NaN values and index gaps, and interpolate missing values.

OutlierReport

Immutable result of OutlierDetector.

OutlierDetector

IQR, Z-score, MAD, and GESD outlier detection.

FlatlineReport

Immutable result of DuplicateDetector.

DuplicateDetector

Consecutive flat-line and near-zero segment detection.

class tseda.quality.MissingValueReport(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)[source]

Bases: object

Immutable summary of missing values in a TimeSeries.

Parameters:
n_nan

Number of NaN values in the observed array.

Type:

int

pct_nan

Percentage of NaN observations (0–100).

Type:

float

n_gaps

Number of missing timestamps (index gaps) when the series frequency is known. -1 when frequency is unknown.

Type:

int

gap_locations

Start timestamp of each index gap. Empty when n_gaps <= 0.

Type:

list of pandas.Timestamp

longest_nan_run

Length of the longest consecutive run of NaN values.

Type:

int

nan_run_lengths

Lengths of every consecutive NaN run (ascending order).

Type:

list of int

nan_positions

Integer positions (0-based) of all NaN values.

Type:

numpy.ndarray

is_monotone_missing

True when all NaN values cluster at the start or end of the series (monotone missing pattern — easier to handle).

Type:

bool

n_nan: int
pct_nan: float
n_gaps: int
gap_locations: List[Timestamp]
longest_nan_run: int
nan_run_lengths: List[int]
nan_positions: ndarray
is_monotone_missing: bool
__repr__()[source]

Return repr(self).

Return type:

str

__init__(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)
Return type:

None

class tseda.quality.MissingValueAnalyzer[source]

Bases: object

Analyze and repair missing values in a TimeSeries.

This class is stateless — instantiate once and call its methods on different series objects.

analyze(ts)[source]

Return a MissingValueReport for ts.

Parameters:

ts (TimeSeries)

Return type:

MissingValueReport

interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]

Fill NaN values and return a new TimeSeries.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020-01-01", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()
>>> r = ana.analyze(ts)
>>> r.n_nan
2
>>> filled = ana.interpolate(ts)
>>> filled.has_nan
False
analyze(ts)[source]

Compute a complete missing-value summary for ts.

Parameters:

ts (TimeSeries) – The series to analyze.

Return type:

MissingValueReport

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=4, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0])
>>> report = MissingValueAnalyzer().analyze(TimeSeries(vals, index=idx))
>>> report.n_nan
2
>>> report.longest_nan_run
2
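
The run-length fields of the report (longest_nan_run, nan_run_lengths) can be illustrated with a NumPy-only sketch. This is not the tseda implementation; the helper name nan_runs is hypothetical and the code only shows one common way to find consecutive NaN runs.

```python
import numpy as np

def nan_runs(values):
    """Return (start, length) for each consecutive run of NaN values."""
    is_nan = np.isnan(np.asarray(values, dtype=float))
    # Pad with False so every run has a detectable rising and falling edge.
    padded = np.concatenate(([False], is_nan, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)    # first position of each run
    ends = np.flatnonzero(edges == -1)     # exclusive end of each run
    return list(zip(starts.tolist(), (ends - starts).tolist()))

vals = [1.0, np.nan, np.nan, 4.0, np.nan]
runs = nan_runs(vals)
longest = max((length for _, length in runs), default=0)
print(runs, longest)  # [(1, 2), (4, 1)] 2
```

The padding step means runs that touch the start or end of the array are handled without special cases.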
interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]

Fill NaN values and return a new TimeSeries.

Parameters:
  • ts (TimeSeries) – Series to fill.

  • method (str, optional) –

    Interpolation strategy. One of:

    • "linear" — linear interpolation between neighbours (default). Leading and trailing NaN are filled with the nearest observed boundary value when limit is None.

    • "forward" — forward-fill (carry last observed value).

    • "backward" — backward-fill (carry next observed value).

    • "nearest" — fill with the nearest non-NaN value.

    • "zero" — fill with 0.0.

    • "constant" — fill with fill_value (must be provided).

    • "spline" — cubic spline (requires scipy).

  • limit (int, optional) – Maximum number of consecutive NaN values to fill. None fills all gaps.

  • fill_value (float, optional) – Used only with method="constant".

Returns:

A new series with NaN values replaced. Metadata (name, unit, freq, description) is preserved.

Return type:

TimeSeries

Raises:

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()

Linear interpolation:

>>> filled = ana.interpolate(ts, "linear")
>>> filled.values.tolist()
[1.0, 2.0, 3.0, 4.0, 5.0]

Forward fill:

>>> fwd = ana.interpolate(ts, "forward")
>>> fwd.values.tolist()
[1.0, 1.0, 1.0, 4.0, 5.0]
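
For readers who want to see what the main strategies do to the same data, here is a sketch that uses plain pandas rather than tseda. It assumes the strategies behave like their pandas counterparts; in particular, the exact semantics of limit in tseda may differ from the pandas limit argument shown here.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [1.0, np.nan, np.nan, 4.0, 5.0],
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

linear = s.interpolate(method="linear")  # [1.0, 2.0, 3.0, 4.0, 5.0]
forward = s.ffill()                      # [1.0, 1.0, 1.0, 4.0, 5.0]
backward = s.bfill()                     # [1.0, 4.0, 4.0, 4.0, 5.0]
constant = s.fillna(-1.0)                # [1.0, -1.0, -1.0, 4.0, 5.0]
capped = s.ffill(limit=1)                # only the first NaN of the gap is filled
```

With limit=1 the second NaN of the two-value gap stays missing, which is useful when long gaps should not be bridged silently.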
class tseda.quality.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)[source]

Bases: object

Immutable outlier detection result.

Parameters:
mask

Boolean array of shape (n,); True where an outlier was found.

Type:

numpy.ndarray

indices

Integer positions (0-based) of detected outliers.

Type:

numpy.ndarray

timestamps

Timestamps of detected outliers.

Type:

pandas.DatetimeIndex

values

Observed values at outlier positions.

Type:

numpy.ndarray

method

Name of the detection method used.

Type:

str

n_outliers

Number of outliers detected.

Type:

int

lower_bound

Lower fence / threshold (when applicable).

Type:

float or None

upper_bound

Upper fence / threshold (when applicable).

Type:

float or None

mask: ndarray
indices: ndarray
timestamps: DatetimeIndex
values: ndarray
method: str
n_outliers: int
lower_bound: float | None
upper_bound: float | None
__repr__()[source]

Return repr(self).

Return type:

str

__init__(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)
Return type:

None

class tseda.quality.OutlierDetector[source]

Bases: object

Detect, remove, or clip outliers in a TimeSeries.

This class is stateless — create one instance and reuse across many series.

iqr(ts, k=1.5)[source]

Tukey IQR fence method.

Return type:

OutlierReport

zscore(ts, threshold=3.0)[source]

Standard Z-score method.

Return type:

OutlierReport

mad(ts, threshold=3.5)[source]

Median Absolute Deviation method.

Return type:

OutlierReport

gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Generalized Extreme Studentized Deviate test.

Return type:

OutlierReport

remove(ts, report)[source]

Replace outlier values with NaN.

Return type:

TimeSeries

clip(ts, report)[source]

Clip outlier values to the fence bounds.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0, 2.0], index=idx)
>>> det = OutlierDetector()

IQR detection:

>>> r = det.iqr(ts)
>>> r.n_outliers
1
>>> int(r.indices[0])
4

Remove the outlier (replace with NaN):

>>> cleaned = det.remove(ts, r)
>>> cleaned.has_nan
True
iqr(ts, k=1.5)[source]

Detect outliers using Tukey’s IQR fences.

Points below Q1 - k * IQR or above Q3 + k * IQR are flagged. NaN values are excluded from the quartile computation but not flagged as outliers.

Parameters:
  • ts (TimeSeries) – Input series.

  • k (float, optional) –

    Fence multiplier. Common choices:

    • 1.5 — standard outlier fence (default).

    • 3.0 — extreme outlier fence.

Return type:

OutlierReport

Raises:

ValueError – If k is not positive, or if fewer than 4 non-NaN values exist.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 3.0, 4.0, 100.0], index=idx)
>>> OutlierDetector().iqr(ts).n_outliers
1
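
The fence rule can be sketched directly in NumPy. This is an illustration rather than the library's code: the helper name iqr_fences is hypothetical, and it assumes quartiles are computed with NumPy's default linear interpolation, which the documentation does not state.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (mask, lower, upper) for Tukey fences with multiplier k."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])   # NaN is ignored for the quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (x < lower) | (x > upper)         # comparisons with NaN are False
    return mask, lower, upper

mask, lower, upper = iqr_fences([1.0, 2.0, 100.0, 2.0, 1.0])
print(lower, upper, mask.tolist())
# -0.5 3.5 [False, False, True, False, False]
```

Note that when Q1 equals Q3 the IQR is zero and both fences collapse onto the same value, so every point that differs from it is flagged.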
zscore(ts, threshold=3.0)[source]

Detect outliers using the standard Z-score.

A value is flagged when |z| > threshold, where z = (x - mean) / std.

Parameters:
  • ts (TimeSeries) – Input series.

  • threshold (float, optional) – Z-score cut-off. Default 3.0 (≈ 0.3 % false-positive rate under normality).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or if std == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=21, freq="D")
>>> ts  = TimeSeries([0.0, 0.1, 0.0, -0.1] * 5 + [10.0], index=idx)
>>> OutlierDetector().zscore(ts).n_outliers
1
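
One practical caveat of the Z-score rule is that an extreme value inflates the mean and standard deviation it is measured against. For n points, the largest possible |z| is sqrt(n - 1) with the population standard deviation (ddof=0) and (n - 1) / sqrt(n) with the sample standard deviation (ddof=1), so a threshold of 3.0 can only trigger once n is at least 11. The sketch below checks this with NumPy; max_abs_z is a hypothetical helper, and which ddof the library uses is not stated here.

```python
import numpy as np

def max_abs_z(values, ddof=0):
    """Largest absolute Z-score in the sample."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std(ddof=ddof)
    return float(np.abs(z).max())

small = [0.0, 0.1, 0.0, -0.1, 10.0]            # n = 5
large = [0.0, 0.1, 0.0, -0.1] * 5 + [10.0]     # n = 21

print(max_abs_z(small) < 3.0)   # True: bounded by sqrt(5 - 1) = 2
print(max_abs_z(large) > 3.0)   # True: enough points for the spike to stand out
```

For very short series, the MAD-based detector is usually the safer choice because the median is not dragged by the spike.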
mad(ts, threshold=3.5)[source]

Detect outliers using the Median Absolute Deviation (MAD).

A value is flagged when the absolute modified Z-score |M_i| exceeds threshold, where:

\[M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\text{MAD}}\]

where \(\tilde{x}\) is the median and \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\).

This is more robust than the Z-score for skewed distributions or heavy-tailed noise.

Parameters:
  • ts (TimeSeries) – Input series.

  • threshold (float, optional) – Modified Z-score cut-off. Iglewicz & Hoaglin (1993) recommend 3.5 (default).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or MAD == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0], index=idx)
>>> OutlierDetector().mad(ts).n_outliers
1
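
The modified Z-score formula can be written out in a few lines of NumPy. This is a sketch of the formula as documented, not the library's source; the helper name modified_z is hypothetical.

```python
import numpy as np

def modified_z(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    x = np.asarray(values, dtype=float)
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (x - med) / mad

m = modified_z([2.0, 2.1, 1.9, 2.0, 50.0])
mask = np.abs(m) > 3.5
print(mask.tolist())  # [False, False, False, False, True]
```

Because more than half of the values must move before the median or the MAD changes much, the single spike barely affects the scale it is judged against.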
gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Detect outliers using the Generalized ESD (Extreme Studentized Deviate) test.

The GESD test (Rosner, 1983) sequentially removes the most extreme value and tests whether it is a statistical outlier, up to max_outliers iterations. It assumes the underlying data are approximately normal after removing the outliers.

Parameters:
  • ts (TimeSeries) – Input series.

  • alpha (float, optional) – Significance level. Default 0.05.

  • max_outliers (int, optional) – Upper bound on the number of outliers to test for. Default 10. Must be less than n // 2.

Returns:

An OutlierReport whose lower_bound and upper_bound are None (GESD uses per-iteration critical values, not fixed fences).

Return type:

OutlierReport

Raises:
  • ValueError – If alpha is not in (0, 1) or max_outliers is invalid.

  • ImportError – If scipy is not installed (needed for the t-distribution quantile function).

Notes

The critical value at each step i (1-indexed) is:

\[\lambda_i = \frac{(n - i) \, t_{p, n-i-1}} {\sqrt{(n-i-1+t_{p,n-i-1}^2)(n-i+1)}}\]

where \(p = 1 - \alpha / (2(n - i + 1))\) and \(t_{p, \nu}\) is the \(p\)-quantile of the t-distribution with \(\nu\) degrees of freedom.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> rng = np.random.default_rng(0)
>>> idx = pd.date_range("2020", periods=50, freq="D")
>>> vals = rng.standard_normal(50)
>>> vals[10] = 15.0   # plant a spike
>>> ts  = TimeSeries(vals, index=idx)
>>> r   = OutlierDetector().gesd(ts)
>>> 10 in r.indices
True
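
The procedure and the critical-value formula from the Notes can be sketched as follows. This is an illustrative implementation under stated assumptions, not the library's code: gesd_sketch is a hypothetical name, the sample standard deviation (ddof=1) is assumed, and NaN handling is omitted. It uses scipy.stats.t.ppf for the t-distribution quantile.

```python
import numpy as np
from scipy import stats

def gesd_sketch(x, alpha=0.05, max_outliers=10):
    """Return sorted positions flagged by a generalized ESD test."""
    x = np.asarray(x, dtype=float)
    n = x.size
    remaining = np.arange(n)      # original positions still in the sample
    removed, n_out = [], 0
    for i in range(1, max_outliers + 1):
        vals = x[remaining]
        dev = np.abs(vals - vals.mean())
        j = int(np.argmax(dev))                  # most extreme remaining point
        r_i = dev[j] / vals.std(ddof=1)          # test statistic R_i
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)            # t quantile, n - i - 1 dof
        lam = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        removed.append(int(remaining[j]))
        remaining = np.delete(remaining, j)
        if r_i > lam:
            n_out = i     # outlier count is the largest i with R_i > lambda_i
    return sorted(removed[:n_out])

vals = np.sin(np.arange(50.0))   # deterministic, bounded test signal
vals[10] = 15.0                  # plant a spike
print(gesd_sketch(vals))         # [10]
```

The loop always runs all max_outliers steps and takes the largest i that passes, which is what lets the test recover from masking when several outliers hide one another.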
remove(ts, report)[source]

Replace detected outlier values with NaN.

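Parameters:
  • ts (TimeSeries) – The original series.

  • report (OutlierReport) – Result from iqr(), zscore(), mad(), or gesd().
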
Returns:

A new series with outliers replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> cleaned = det.remove(ts, det.iqr(ts))
>>> cleaned.has_nan
True
clip(ts, report)[source]

Clip outlier values to the fence bounds of report.

Parameters:
  • ts (TimeSeries) – The original series.

  • report (OutlierReport) – Must have non-None lower_bound and upper_bound (i.e., from iqr, zscore, or mad).

Returns:

A new series with values clamped to [lower_bound, upper_bound].

Return type:

TimeSeries

Raises:

ValueError – If report has no bounds (e.g., from gesd).

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> r   = det.iqr(ts)
>>> clipped = det.clip(ts, r)
>>> float(clipped.values.max()) < 100.0
True
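
The difference between removing and clipping is easiest to see side by side. The NumPy sketch below is not the library's implementation; the fences -0.5 and 3.5 are the k = 1.5 Tukey fences for this data, assuming NumPy's default quartile interpolation.

```python
import numpy as np

x = np.array([1.0, 2.0, 100.0, 2.0, 1.0])
lower, upper = -0.5, 3.5              # assumed IQR fences for this data
mask = (x < lower) | (x > upper)

removed = np.where(mask, np.nan, x)   # outliers become NaN
clipped = np.clip(x, lower, upper)    # outliers are pulled to the nearest fence

print(removed.tolist())  # [1.0, 2.0, nan, 2.0, 1.0]
print(clipped.tolist())  # [1.0, 2.0, 3.5, 2.0, 1.0]
```

Removing leaves a gap that can be interpolated later, while clipping keeps the series complete at the cost of retaining a biased value at the fence.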
class tseda.quality.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)[source]

Bases: object

Immutable summary of flat-line segments in a TimeSeries.

Parameters:
n_flatline_runs

Number of runs that meet or exceed min_run in length.

Type:

int

longest_run

Length of the single longest flat-line run.

Type:

int

total_flatline_points

Total number of observations that belong to a qualifying flat-line run (includes the first observation of each run).

Type:

int

runs

Each element is a tuple (start_pos, end_pos, value) where start_pos and end_pos are 0-based integer positions and value is the repeated value. Only runs of length >= min_run are included.

Type:

list of (start_pos, end_pos, value)

mask

Boolean array; True at every position that is part of a qualifying flat-line run.

Type:

numpy.ndarray

min_run

The minimum run length used for this report.

Type:

int

n_flatline_runs: int
longest_run: int
total_flatline_points: int
runs: List[Tuple[int, int, float]]
mask: ndarray
min_run: int
__repr__()[source]

Return repr(self).

Return type:

str

__init__(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)
Return type:

None

class tseda.quality.DuplicateDetector[source]

Bases: object

Detect consecutive duplicate (flat-line) value runs.

flatline(ts, min_run=3, *, atol=0.0)[source]

Detect flat-line segments of repeated values.

Return type:

FlatlineReport

near_zero(ts, min_run=3, *, threshold=1e-08)[source]

Detect segments where the series is stuck near zero.

Return type:

FlatlineReport

remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN (keeping the first value of each run by default).

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4
flatline(ts, min_run=3, *, atol=0.0)[source]

Detect consecutive runs of identical (or near-identical) values.

Parameters:
  • ts (TimeSeries) – Input series.

  • min_run (int, optional) – Minimum number of consecutive identical observations to constitute a “flat line”. Default 3.

  • atol (float, optional) – Absolute tolerance for equality. Two values a and b are considered equal when |a - b| <= atol. Default 0.0 (exact equality).

Return type:

FlatlineReport

Raises:

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

Exact flat line of length 4:

>>> idx  = pd.date_range("2020", periods=7, freq="D")
>>> vals = np.array([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.runs[0]
(1, 4, 3.0)

No flat line (min_run too high):

>>> r2 = DuplicateDetector().flatline(ts, min_run=5)
>>> r2.n_flatline_runs
0
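
A run detector of this kind can be sketched in plain Python and NumPy. The helper flat_runs is hypothetical and only illustrates the idea; it assumes that equality is tested between adjacent observations and that end_pos is inclusive, as the example output above suggests.

```python
import numpy as np

def flat_runs(values, min_run=3, atol=0.0):
    """Return (start_pos, end_pos, value) for runs of at least min_run values."""
    x = np.asarray(values, dtype=float)
    runs, start = [], 0
    for i in range(1, len(x) + 1):
        # A run ends at the end of the array or where adjacent values differ.
        # NaN never compares equal, so NaN always breaks a run.
        if i == len(x) or not abs(x[i] - x[i - 1]) <= atol:
            if i - start >= min_run:
                runs.append((start, i - 1, float(x[start])))
            start = i
    return runs

print(flat_runs([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0]))  # [(1, 4, 3.0)]
```

With a non-zero atol and adjacent comparison, a slowly drifting signal can form one long run even though its first and last values differ by more than atol.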
near_zero(ts, min_run=3, *, threshold=1e-08)[source]

Detect segments where the series is stuck near zero.

Only consecutive runs where every value satisfies |x| <= threshold are reported. This differs from flatline(), which detects any repeated value regardless of magnitude.

Parameters:
  • ts (TimeSeries) – Input series.

  • min_run (int, optional) – Minimum run length. Default 3.

  • threshold (float, optional) – Maximum absolute value to count as “near zero”. Default 1e-8.

Returns:

Runs where every value satisfies |x| <= threshold.

Return type:

FlatlineReport

Raises:

ValueError – If threshold < 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().near_zero(ts, min_run=3)
>>> r.n_flatline_runs
1
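
The contrast with flatline() shows up when values hover around zero without being identical. The sketch below is illustrative only (near_zero_runs is a hypothetical helper, and end positions are assumed to be inclusive).

```python
import numpy as np

def near_zero_runs(values, min_run=3, threshold=1e-8):
    """Return inclusive (start_pos, end_pos) for runs with |x| <= threshold."""
    x = np.asarray(values, dtype=float)
    near = np.abs(x) <= threshold             # NaN compares False
    padded = np.concatenate(([False], near, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1    # convert to inclusive ends
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s + 1 >= min_run]

print(near_zero_runs([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0]))  # [(1, 4)]
print(near_zero_runs([1e-9, -1e-9, 0.0, 5.0]))                   # [(0, 2)]
```

The second call reports a run even though no two values are equal, which an exact flat-line check with atol=0.0 would miss.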
remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN.

Parameters:
  • ts (TimeSeries) – The original series.

  • report (FlatlineReport) – Result from flatline() or near_zero().

  • keep_first (bool, optional) – When True (default), the first observation of each flat-line run is preserved; only the repeated copies are set to NaN. When False, the entire run including the first observation is set to NaN.

Returns:

A new series with flat-line values replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=6, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> cleaned = det.remove_flatlines(ts, r, keep_first=True)
>>> cleaned.n_nan
2
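
The effect of keep_first can be sketched with NumPy slicing. The helper blank_runs is hypothetical and assumes runs are (start_pos, end_pos, value) tuples with an inclusive end_pos.

```python
import numpy as np

def blank_runs(values, runs, keep_first=True):
    """Set flat-line positions to NaN, optionally keeping each run's first value."""
    out = np.asarray(values, dtype=float).copy()
    for start, end, _value in runs:           # end is inclusive
        first = start + 1 if keep_first else start
        out[first:end + 1] = np.nan
    return out

vals = [1.0, 5.0, 5.0, 5.0, 2.0, 3.0]
runs = [(1, 3, 5.0)]

kept = blank_runs(vals, runs)                        # [1.0, 5.0, nan, nan, 2.0, 3.0]
dropped = blank_runs(vals, runs, keep_first=False)   # [1.0, nan, nan, nan, 2.0, 3.0]
```

Keeping the first value treats it as the last genuine reading before the sensor stuck; dropping it is the more conservative choice when the onset of the fault is uncertain.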

Missing-value analysis for time series.

Two distinct concepts are handled here:

  • Value NaN — a timestamp is present in the index but its observed value is numpy.nan.

  • Index gap — a timestamp that should exist (given the series frequency) is absent from the index entirely.

Both are reported by MissingValueAnalyzer. Interpolation of NaN values is also provided via MissingValueAnalyzer.interpolate().
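
To make the distinction concrete, here is a pandas-only sketch that does not use tseda. The variable names are illustrative, and counting gaps as the number of absent timestamps is an assumption based on the description of n_gaps.

```python
import numpy as np
import pandas as pd

# Daily index with 2020-01-03 and 2020-01-04 absent: two index gaps.
idx = pd.DatetimeIndex(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-06"])
# The second observation exists in the index, but its value is NaN.
s = pd.Series([1.0, np.nan, 3.0, 4.0], index=idx)

# Value NaN: missing observations among the timestamps that are present.
n_nan = int(s.isna().sum())

# Index gap: timestamps expected at the known frequency but absent entirely.
expected = pd.date_range(idx[0], idx[-1], freq="D")
missing_stamps = expected.difference(idx)
n_gaps = len(missing_stamps)

print(n_nan, n_gaps)  # 1 2
```

Reindexing the series onto the expected index would turn the index gaps into value NaNs, after which both kinds of missingness can be filled the same way.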

Classes

MissingValueReport

Immutable result dataclass returned by MissingValueAnalyzer.analyze().

MissingValueAnalyzer

Stateless analyzer; analyze() returns a MissingValueReport and interpolate() returns a new TimeSeries.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020-01-01", periods=10, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, np.nan, 6.0, 7.0, 8.0, np.nan, 10.0])
>>> ts  = TimeSeries(vals, index=idx)
>>> ana = MissingValueAnalyzer()
>>> report = ana.analyze(ts)
>>> report.n_nan
4
>>> report.pct_nan
40.0
class tseda.quality.missing.MissingValueReport(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)[source]

Bases: object

Immutable summary of missing values in a TimeSeries.

Parameters:
n_nan

Number of NaN values in the observed array.

Type:

int

pct_nan

Percentage of NaN observations (0–100).

Type:

float

n_gaps

Number of missing timestamps (index gaps) when the series frequency is known. -1 when frequency is unknown.

Type:

int

gap_locations

Start timestamp of each index gap. Empty when n_gaps <= 0.

Type:

list of pandas.Timestamp

longest_nan_run

Length of the longest consecutive run of NaN values.

Type:

int

nan_run_lengths

Lengths of every consecutive NaN run (ascending order).

Type:

list of int

nan_positions

Integer positions (0-based) of all NaN values.

Type:

numpy.ndarray

is_monotone_missing

True when all NaN values cluster at the start or end of the series (monotone missing pattern — easier to handle).

Type:

bool

n_nan: int
pct_nan: float
n_gaps: int
gap_locations: List[Timestamp]
longest_nan_run: int
nan_run_lengths: List[int]
nan_positions: ndarray
is_monotone_missing: bool
__repr__()[source]

Return repr(self).

Return type:

str

__init__(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)
Return type:

None

class tseda.quality.missing.MissingValueAnalyzer[source]

Bases: object

Analyze and repair missing values in a TimeSeries.

This class is stateless — instantiate once and call its methods on different series objects.

analyze(ts)[source]

Return a MissingValueReport for ts.

Parameters:

ts (TimeSeries)

Return type:

MissingValueReport

interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]

Fill NaN values and return a new TimeSeries.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020-01-01", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()
>>> r = ana.analyze(ts)
>>> r.n_nan
2
>>> filled = ana.interpolate(ts)
>>> filled.has_nan
False
analyze(ts)[source]

Compute a complete missing-value summary for ts.

Parameters:

ts (TimeSeries) – The series to analyze.

Return type:

MissingValueReport

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=4, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0])
>>> report = MissingValueAnalyzer().analyze(TimeSeries(vals, index=idx))
>>> report.n_nan
2
>>> report.longest_nan_run
2
interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]

Fill NaN values and return a new TimeSeries.

Parameters:
  • ts (TimeSeries) – Series to fill.

  • method (str, optional) –

    Interpolation strategy. One of:

    • "linear" — linear interpolation between neighbours (default). Leading and trailing NaN are filled with the nearest observed boundary value when limit is None.

    • "forward" — forward-fill (carry last observed value).

    • "backward" — backward-fill (carry next observed value).

    • "nearest" — fill with the nearest non-NaN value.

    • "zero" — fill with 0.0.

    • "constant" — fill with fill_value (must be provided).

    • "spline" — cubic spline (requires scipy).

  • limit (int, optional) – Maximum number of consecutive NaN values to fill. None fills all gaps.

  • fill_value (float, optional) – Used only with method="constant".

Returns:

A new series with NaN values replaced. Metadata (name, unit, freq, description) is preserved.

Return type:

TimeSeries

Raises:

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()

Linear interpolation:

>>> filled = ana.interpolate(ts, "linear")
>>> filled.values.tolist()
[1.0, 2.0, 3.0, 4.0, 5.0]

Forward fill:

>>> fwd = ana.interpolate(ts, "forward")
>>> fwd.values.tolist()
[1.0, 1.0, 1.0, 4.0, 5.0]

Outlier detection for time series.

Four detection methods are provided, all implemented with pure numpy / scipy — no machine-learning dependencies:

Method    Statistic                   Best for
IQR       Tukey fences                Symmetric or skewed data
Z-score   Mean / std deviation        Approximately normal data
MAD       Median absolute deviation   Skewed data, heavy tails
GESD      Generalized ESD test        Known-normal data; number of outliers unknown

Every detection method returns an OutlierReport. OutlierDetector also provides remove() and clip() helpers, which return cleaned TimeSeries objects.

Classes

OutlierReport

Immutable result dataclass.

OutlierDetector

Stateless detector.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020-01-01", periods=10, freq="D")
>>> vals = np.array([1, 2, 2, 3, 100, 2, 3, 2, 1, 2], dtype=float)
>>> ts  = TimeSeries(vals, index=idx)
>>> det = OutlierDetector()
>>> r   = det.iqr(ts)
>>> r.n_outliers
1
class tseda.quality.outliers.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)[source]

Bases: object

Immutable outlier detection result.

Parameters:
mask

Boolean array of shape (n,); True where an outlier was found.

Type:

numpy.ndarray

indices

Integer positions (0-based) of detected outliers.

Type:

numpy.ndarray

timestamps

Timestamps of detected outliers.

Type:

pandas.DatetimeIndex

values

Observed values at outlier positions.

Type:

numpy.ndarray

method

Name of the detection method used.

Type:

str

n_outliers

Number of outliers detected.

Type:

int

lower_bound

Lower fence / threshold (when applicable).

Type:

float or None

upper_bound

Upper fence / threshold (when applicable).

Type:

float or None

mask: ndarray
indices: ndarray
timestamps: DatetimeIndex
values: ndarray
method: str
n_outliers: int
lower_bound: float | None
upper_bound: float | None
__repr__()[source]

Return repr(self).

Return type:

str

__init__(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)
Return type:

None

class tseda.quality.outliers.OutlierDetector[source]

Bases: object

Detect, remove, or clip outliers in a TimeSeries.

This class is stateless — create one instance and reuse across many series.

iqr(ts, k=1.5)[source]

Tukey IQR fence method.

Return type:

OutlierReport

zscore(ts, threshold=3.0)[source]

Standard Z-score method.

Return type:

OutlierReport

mad(ts, threshold=3.5)[source]

Median Absolute Deviation method.

Return type:

OutlierReport

gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Generalized Extreme Studentized Deviate test.

Return type:

OutlierReport

remove(ts, report)[source]

Replace outlier values with NaN.

Return type:

TimeSeries

clip(ts, report)[source]

Clip outlier values to the fence bounds.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0, 2.0], index=idx)
>>> det = OutlierDetector()

IQR detection:

>>> r = det.iqr(ts)
>>> r.n_outliers
1
>>> int(r.indices[0])
4

Remove the outlier (replace with NaN):

>>> cleaned = det.remove(ts, r)
>>> cleaned.has_nan
True
iqr(ts, k=1.5)[source]

Detect outliers using Tukey’s IQR fences.

Points below Q1 - k * IQR or above Q3 + k * IQR are flagged. NaN values are excluded from the quartile computation but not flagged as outliers.

Parameters:
  • ts (TimeSeries) – Input series.

  • k (float, optional) –

    Fence multiplier. Common choices:

    • 1.5 — standard outlier fence (default).

    • 3.0 — extreme outlier fence.

Return type:

OutlierReport

Raises:

ValueError – If k is not positive, or if fewer than 4 non-NaN values exist.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 3.0, 4.0, 100.0], index=idx)
>>> OutlierDetector().iqr(ts).n_outliers
1
zscore(ts, threshold=3.0)[source]

Detect outliers using the standard Z-score.

A value is flagged when |z| > threshold, where z = (x - mean) / std.

Parameters:
  • ts (TimeSeries) – Input series.

  • threshold (float, optional) – Z-score cut-off. Default 3.0 (≈ 0.3 % false-positive rate under normality).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or if std == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=21, freq="D")
>>> ts  = TimeSeries([0.0, 0.1, 0.0, -0.1] * 5 + [10.0], index=idx)
>>> OutlierDetector().zscore(ts).n_outliers
1
mad(ts, threshold=3.5)[source]

Detect outliers using the Median Absolute Deviation (MAD).

A value is flagged when the absolute modified Z-score |M_i| exceeds threshold, where:

\[M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\text{MAD}}\]

where \(\tilde{x}\) is the median and \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\).

This is more robust than the Z-score for skewed distributions or heavy-tailed noise.

Parameters:
  • ts (TimeSeries) – Input series.

  • threshold (float, optional) – Modified Z-score cut-off. Iglewicz & Hoaglin (1993) recommend 3.5 (default).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or MAD == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0], index=idx)
>>> OutlierDetector().mad(ts).n_outliers
1
gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Detect outliers using the Generalized ESD (Extreme Studentized Deviate) test.

The GESD test (Rosner, 1983) sequentially removes the most extreme value and tests whether it is a statistical outlier, up to max_outliers iterations. It assumes the underlying data are approximately normal after removing the outliers.

Parameters:
  • ts (TimeSeries) – Input series.

  • alpha (float, optional) – Significance level. Default 0.05.

  • max_outliers (int, optional) – Upper bound on the number of outliers to test for. Default 10. Must be less than n // 2.

Returns:

An OutlierReport whose lower_bound and upper_bound are None (GESD uses per-iteration critical values, not fixed fences).

Return type:

OutlierReport

Raises:
  • ValueError – If alpha is not in (0, 1) or max_outliers is invalid.

  • ImportError – If scipy is not installed (needed for the t-distribution quantile function).

Notes

The critical value at each step i (1-indexed) is:

\[\lambda_i = \frac{(n - i) \, t_{p, n-i-1}} {\sqrt{(n-i-1+t_{p,n-i-1}^2)(n-i+1)}}\]

where \(p = 1 - \alpha / (2(n - i + 1))\) and \(t_{p, \nu}\) is the \(p\)-quantile of the t-distribution with \(\nu\) degrees of freedom.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> rng = np.random.default_rng(0)
>>> idx = pd.date_range("2020", periods=50, freq="D")
>>> vals = rng.standard_normal(50)
>>> vals[10] = 15.0   # plant a spike
>>> ts  = TimeSeries(vals, index=idx)
>>> r   = OutlierDetector().gesd(ts)
>>> 10 in r.indices
True
remove(ts, report)[source]

Replace detected outlier values with NaN.

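Parameters:
  • ts (TimeSeries) – The original series.

  • report (OutlierReport) – Result from iqr(), zscore(), mad(), or gesd().
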
Returns:

A new series with outliers replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> cleaned = det.remove(ts, det.iqr(ts))
>>> cleaned.has_nan
True
clip(ts, report)[source]

Clip outlier values to the fence bounds of report.

Parameters:
  • ts (TimeSeries) – The original series.

  • report (OutlierReport) – Must have non-None lower_bound and upper_bound (i.e., from iqr, zscore, or mad).

Returns:

A new series with values clamped to [lower_bound, upper_bound].

Return type:

TimeSeries

Raises:

ValueError – If report has no bounds (e.g., from gesd).

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> r   = det.iqr(ts)
>>> clipped = det.clip(ts, r)
>>> float(clipped.values.max()) < 100.0
True

Flat-line and near-constant segment detection for time series.

Timestamp duplicates are rejected at construction time by validate_datetime_index(). This module addresses the complementary problem: consecutive identical or near-zero values, which typically signal:

  • A stuck sensor / ADC saturation.

  • A data-pipeline bug that forward-filled data without marking it.

  • A genuine flat segment that may confuse differencing-based methods.

Classes

FlatlineReport

Immutable result dataclass returned by DuplicateDetector.flatline() and DuplicateDetector.near_zero().

DuplicateDetector

Stateless detector for flat-line and near-zero runs.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020-01-01", periods=10, freq="D")
>>> vals = np.array([1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4
class tseda.quality.duplicates.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)[source]

Bases: object

Immutable summary of flat-line segments in a TimeSeries.

Parameters:
n_flatline_runs

Number of runs that meet or exceed min_run in length.

Type:

int

longest_run

Length of the single longest flat-line run.

Type:

int

total_flatline_points

Total number of observations that belong to a qualifying flat-line run (includes the first observation of each run).

Type:

int

runs

Each element is a tuple (start_pos, end_pos, value) where start_pos and end_pos are 0-based, inclusive integer positions and value is the repeated value. Only runs of length >= min_run are included.

Type:

list of (start_pos, end_pos, value)

mask

Boolean array; True at every position that is part of a qualifying flat-line run.

Type:

numpy.ndarray

min_run

The minimum run length used for this report.

Type:

int

n_flatline_runs: int
longest_run: int
total_flatline_points: int
runs: List[Tuple[int, int, float]]
mask: ndarray
min_run: int
__repr__()[source]

Return repr(self).

Return type:

str

__init__(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)
Parameters:
Return type:

None

class tseda.quality.duplicates.DuplicateDetector[source]

Bases: object

Detect consecutive duplicate (flat-line) value runs.

flatline(ts, min_run=3, *, atol=0.0)[source]

Detect flat-line segments of repeated values.

Return type:

FlatlineReport

near_zero(ts, min_run=3, *, threshold=1e-08)[source]

Detect segments where the series is stuck near zero.

Return type:

FlatlineReport

remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN (keeping the first value of each run by default).

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4
flatline(ts, min_run=3, *, atol=0.0)[source]

Detect consecutive runs of identical (or near-identical) values.

Parameters:
  • ts (TimeSeries) – Input series.

  • min_run (int, optional) – Minimum number of consecutive identical observations to constitute a “flat line”. Default 3.

  • atol (float, optional) – Absolute tolerance for equality. Two values a and b are considered equal when |a - b| <= atol. Default 0.0 (exact equality).

Return type:

FlatlineReport

Raises:

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

Exact flat line of length 4:

>>> idx  = pd.date_range("2020", periods=7, freq="D")
>>> vals = np.array([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.runs[0]
(1, 4, 3.0)

No flat line (min_run too high):

>>> r2 = DuplicateDetector().flatline(ts, min_run=5)
>>> r2.n_flatline_runs
0
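
The run detection described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: flat_runs is a hypothetical helper, and it assumes a value joins the current run when it is within atol of the run's first value (comparing adjacent values instead is an equally plausible reading of the atol rule).

```python
from typing import List, Tuple


def flat_runs(
    values: List[float], min_run: int = 3, atol: float = 0.0
) -> List[Tuple[int, int, float]]:
    """Return (start_pos, end_pos, value) for each run of length >= min_run.

    Assumption: each value is compared with the first value of the current
    run, so a run cannot drift further than atol from where it started.
    """
    runs = []
    start = 0
    for pos in range(1, len(values) + 1):
        at_end = pos == len(values)
        # Close the run at the end of the data or when the value leaves the
        # tolerance band; the negated <= also breaks runs on NaN.
        if at_end or not abs(values[pos] - values[start]) <= atol:
            if pos - start >= min_run:
                runs.append((start, pos - 1, values[start]))
            start = pos
    return runs


vals = [1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0]
runs = flat_runs(vals, min_run=3)
print(runs)                         # [(1, 4, 3.0)]
print(flat_runs(vals, min_run=5))   # []
print(flat_runs([1.0, 1.05, 0.98, 2.0], min_run=3, atol=0.1))  # [(0, 2, 1.0)]
```

With the same input as the doctest above, the sketch yields the same (1, 4, 3.0) tuple, which shows that end_pos is the inclusive position of the last repeated value.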
near_zero(ts, min_run=3, *, threshold=1e-08)[source]

Detect segments where the series is stuck near zero.

Only consecutive runs where every value satisfies |x| <= threshold are reported. This differs from flatline(), which detects any repeated value regardless of magnitude.

Parameters:
  • ts (TimeSeries) – Input series.

  • min_run (int, optional) – Minimum run length. Default 3.

  • threshold (float, optional) – Maximum absolute value to count as “near zero”. Default 1e-8.

Returns:

Runs where every value satisfies |x| <= threshold.

Return type:

FlatlineReport

Raises:

ValueError – If threshold < 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().near_zero(ts, min_run=3)
>>> r.n_flatline_runs
1
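
To make the contrast with flatline() concrete, here is a plain-Python sketch of the |x| <= threshold rule. It is an illustration under assumptions, not the library's code: near_zero_runs is a hypothetical helper that returns inclusive (start_pos, end_pos) pairs rather than a FlatlineReport.

```python
from itertools import groupby
from typing import List, Tuple


def near_zero_runs(
    values: List[float], min_run: int = 3, threshold: float = 1e-8
) -> List[Tuple[int, int]]:
    """Return inclusive (start_pos, end_pos) pairs of near-zero runs."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    runs = []
    pos = 0
    # groupby splits the series into consecutive blocks that are either
    # entirely near zero or entirely not.
    for is_small, block in groupby(values, key=lambda x: abs(x) <= threshold):
        length = len(list(block))
        if is_small and length >= min_run:
            runs.append((pos, pos + length - 1))
        pos += length
    return runs


vals = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
print(near_zero_runs(vals))                    # [(1, 4)]
print(near_zero_runs([1e-9, -1e-9, 5e-10]))    # [(0, 2)]
```

The second call shows the difference in intent: the three values are all distinct, so an exact-equality flat-line check would not group them, yet each lies within the threshold of zero.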
remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN.

Parameters:
  • ts (TimeSeries) – The original series.

  • report (FlatlineReport) – Result from flatline() or near_zero().

  • keep_first (bool, optional) – When True (default), the first observation of each flat-line run is preserved; only the repeated copies are set to NaN. When False, the entire run including the first observation is set to NaN.

Returns:

A new series with flat-line values replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=6, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> cleaned = det.remove_flatlines(ts, r, keep_first=True)
>>> cleaned.n_nan
2
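
The effect of keep_first can be sketched on a plain list. This is a hedged illustration rather than the library's implementation: blank_runs is a hypothetical helper, and the runs argument is assumed to hold inclusive (start_pos, end_pos, value) tuples as described for FlatlineReport.runs.

```python
import math
from typing import List, Tuple


def blank_runs(
    values: List[float],
    runs: List[Tuple[int, int, float]],
    keep_first: bool = True,
) -> List[float]:
    """Return a copy of values with flat-line positions set to NaN."""
    cleaned = list(values)  # copy, so the input is left untouched
    for start, end, _value in runs:
        # Skip the first observation of the run when keep_first is set.
        first = start + 1 if keep_first else start
        for pos in range(first, end + 1):
            cleaned[pos] = math.nan
    return cleaned


vals = [1.0, 5.0, 5.0, 5.0, 2.0, 3.0]
runs = [(1, 3, 5.0)]  # one run of three 5.0 values at positions 1..3
kept = blank_runs(vals, runs, keep_first=True)
dropped = blank_runs(vals, runs, keep_first=False)
print(sum(math.isnan(v) for v in kept))     # 2
print(sum(math.isnan(v) for v in dropped))  # 3
```

With keep_first=True the count of 2 agrees with the doctest above: the run has three members, and only the two repeated copies are blanked.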