Data Quality 

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer

>>> idx  = pd.date_range("2020-01-01", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()
>>> r = ana.analyze(ts)
>>> r.n_nan
2
>>> filled = ana.interpolate(ts)
>>> filled.has_nan
False

analyze(ts)[source]

Compute a complete missing-value summary for ts.

Parameters:: ts (TimeSeries) – The series to analyze.
Return type:: MissingValueReport

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=4, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0])
>>> report = MissingValueAnalyzer().analyze(TimeSeries(vals, index=idx))
>>> report.n_nan
2
>>> report.longest_nan_run
2

interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]

Fill NaN values and return a new TimeSeries.

Parameters:

ts (TimeSeries) – Series to fill.
method (str, optional) –
Interpolation strategy. One of:
- "linear" — linear interpolation between neighbours (default). Leading and trailing NaN are filled with the nearest observed boundary value when limit is None.
- "forward" — forward-fill (carry last observed value).
- "backward" — backward-fill (carry next observed value).
- "nearest" — fill with the nearest non-NaN value.
- "zero" — fill with 0.0.
- "constant" — fill with fill_value (must be provided).
- "spline" — cubic spline (requires scipy).
limit (int, optional) – Maximum number of consecutive NaN values to fill. None fills all gaps.
fill_value (float, optional) – Used only with method="constant".

Returns:

A new series with NaN values replaced. Metadata (name, unit, freq, description) is preserved.

Return type:

TimeSeries

Raises:

TypeError – If ts is not a TimeSeries.
ValueError – If method is not recognised, or if "constant" is chosen without supplying fill_value.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()

Linear interpolation:

>>> filled = ana.interpolate(ts, "linear")
>>> filled.values.tolist()
[1.0, 2.0, 3.0, 4.0, 5.0]

Forward fill:

>>> fwd = ana.interpolate(ts, "forward")
>>> fwd.values.tolist()
[1.0, 1.0, 1.0, 4.0, 5.0]
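
The "linear" strategy described above can be sketched without tseda. This is only an illustration under stated assumptions, not the library's implementation: the helper name fill_linear is hypothetical, it works on a plain list, and it ignores the limit parameter.

```python
import math

def fill_linear(values):
    """Interpolate interior NaN runs linearly; pad edge NaN with the nearest observation."""
    out = list(values)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    if not known:
        return out  # nothing to anchor the interpolation on
    first, last = known[0], known[-1]
    # Leading and trailing NaN: copy the nearest observed boundary value.
    for i in range(first):
        out[i] = out[first]
    for i in range(last + 1, len(out)):
        out[i] = out[last]
    # Interior gaps: straight line between the two surrounding observations.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            out[i] = out[left] + (out[right] - out[left]) * (i - left) / span
    return out

nan = math.nan
interior = fill_linear([1.0, nan, nan, 4.0, 5.0])   # [1.0, 2.0, 3.0, 4.0, 5.0]
edges = fill_linear([nan, 2.0, nan, 4.0, nan])      # [2.0, 2.0, 3.0, 4.0, 4.0]
print(interior, edges)
```

The second call shows the boundary rule: the leading NaN takes the first observed value and the trailing NaN takes the last one, so nothing is extrapolated.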

class tseda.quality.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)[source]

Bases: object

Immutable outlier detection result.

Parameters:

mask (ndarray)
indices (ndarray)
timestamps (DatetimeIndex)
values (ndarray)
method (str)
n_outliers (int)
lower_bound (float | None)
upper_bound (float | None)

mask

Boolean array of shape (n,); True where an outlier was found.

Type:: numpy.ndarray

indices

Integer positions (0-based) of detected outliers.

Type:: numpy.ndarray

timestamps

Timestamps of detected outliers.

Type:: pandas.DatetimeIndex

values

Observed values at outlier positions.

Type:: numpy.ndarray

method

Name of the detection method used.

Type:: str

n_outliers

Number of outliers detected.

Type:: int

lower_bound

Lower fence / threshold (when applicable).

Type:: float or None

upper_bound

Upper fence / threshold (when applicable).

Type:: float or None

mask: ndarray

indices: ndarray

timestamps: DatetimeIndex

values: ndarray

method: str

n_outliers: int

lower_bound: float | None

upper_bound: float | None

__repr__()[source]

Return repr(self).

Return type:: str

__init__(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)

Parameters:

mask (ndarray)
indices (ndarray)
timestamps (DatetimeIndex)
values (ndarray)
method (str)
n_outliers (int)
lower_bound (float | None)
upper_bound (float | None)

Return type:

None

class tseda.quality.OutlierDetector[source]

Bases: object

Detect, remove, or clip outliers in a TimeSeries.

This class is stateless: create one instance and reuse it across many series.

iqr(ts, k=1.5)[source]

Tukey IQR fence method.

Parameters:

ts (TimeSeries)
k (float)

Return type:

OutlierReport

zscore(ts, threshold=3.0)[source]

Standard Z-score method.

Parameters:

ts (TimeSeries)
threshold (float)

Return type:

OutlierReport

mad(ts, threshold=3.5)[source]

Median Absolute Deviation method.

Parameters:

ts (TimeSeries)
threshold (float)

Return type:

OutlierReport

gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Generalized Extreme Studentized Deviate test.

Parameters:

ts (TimeSeries)
alpha (float)
max_outliers (int)

Return type:

OutlierReport

remove(ts, report)[source]

Replace outlier values with NaN.

Parameters:

ts (TimeSeries)
report (OutlierReport)

Return type:

TimeSeries

clip(ts, report)[source]

Clip outlier values to the fence bounds.

Parameters:

ts (TimeSeries)
report (OutlierReport)

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector

>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0, 2.0], index=idx)
>>> det = OutlierDetector()

IQR detection:

>>> r = det.iqr(ts)
>>> r.n_outliers
1
>>> int(r.indices[0])
4

Remove the outlier (replace with NaN):

>>> cleaned = det.remove(ts, r)
>>> cleaned.has_nan
True

iqr(ts, k=1.5)[source]

Detect outliers using Tukey’s IQR fences.

Points below Q1 - k * IQR or above Q3 + k * IQR are flagged. NaN values are excluded from the quartile computation and are never flagged as outliers.

Parameters:

ts (TimeSeries) – Input series.
k (float, optional) –
Fence multiplier. Common choices:
- 1.5 — standard outlier fence (default).
- 3.0 — extreme outlier fence.

Return type:

OutlierReport

Raises:

ValueError – If k is not positive, or if fewer than 4 non-NaN values exist.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 3.0, 2.0, 100.0], index=idx)
>>> OutlierDetector().iqr(ts).n_outliers
1
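
The fence rule can be sketched with the standard library alone. This is an illustration, not tseda's code: iqr_fences is a hypothetical helper, and the quartiles are assumed to use linear interpolation between order statistics (statistics.quantiles with method="inclusive").

```python
import math
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences, ignoring NaN when computing quartiles."""
    clean = [v for v in values if not math.isnan(v)]
    # method="inclusive" interpolates linearly between order statistics.
    q1, _median, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [1.0, 2.0, 2.0, 3.0, 100.0, math.nan, 3.0, 2.0, 1.0, 2.0]
lower, upper = iqr_fences(data)            # Q1 = 2.0, Q3 = 3.0, so fences are (0.5, 4.5)
# Comparisons against NaN are always False, so the NaN at position 5 is never flagged.
outliers = [i for i, v in enumerate(data) if v < lower or v > upper]
print(lower, upper, outliers)              # 0.5 4.5 [4]
```

Note that when Q1 equals Q3 the interquartile range is zero and both fences collapse onto the quartile, so every value different from it is flagged; small samples with many ties are worth checking by hand.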

zscore(ts, threshold=3.0)[source]

Detect outliers using the standard Z-score.

A value is flagged when |z| > threshold, where z = (x - mean) / std.

Parameters:

ts (TimeSeries) – Input series.
threshold (float, optional) – Z-score cut-off. Default 3.0 (≈ 0.3 % false-positive rate under normality).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or if std == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
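>>> idx = pd.date_range("2020", periods=20, freq="D")
>>> vals = np.array([0.0, 0.1, 0.0, -0.1] * 5)
>>> vals[-1] = 10.0   # plant a spike; 10 or fewer points can never reach |z| > 3
>>> ts  = TimeSeries(vals, index=idx)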
>>> OutlierDetector().zscore(ts).n_outliers
1
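
The rule |z| > threshold can be sketched with the standard library alone. The snippet assumes the population standard deviation (divisor n); whether tseda uses that or the sample standard deviation is not stated here, and the helper name zscores is hypothetical. It also illustrates a known limit of the method: with the population standard deviation, |z| can never exceed sqrt(n - 1), so very short series cannot reach a threshold of 3.0.

```python
import math
import statistics

def zscores(values):
    """Population z-scores: (x - mean) / std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("std == 0: z-scores are undefined")
    return [(v - mean) / std for v in values]

data = [0.0, 0.1, 0.0, -0.1] * 5    # 20 quiet observations
data[-1] = 10.0                      # one planted spike at position 19
z = zscores(data)
flagged = [i for i, zi in enumerate(z) if abs(zi) > 3.0]
print(flagged)                       # [19]

# Upper limit on |z| for a series of n points (population std).
bound_for_5 = math.sqrt(5 - 1)       # 2.0, below the default threshold
bound_for_20 = math.sqrt(20 - 1)     # about 4.36
```

For short series, the MAD-based method is usually the more practical choice because its score is not capped in this way.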

mad(ts, threshold=3.5)[source]

Detect outliers using the Median Absolute Deviation (MAD).

A value is flagged when the absolute value of the modified Z-score, |M_i|, exceeds threshold:

\[M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\text{MAD}}\]

where \(\tilde{x}\) is the median and \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\).

This is more robust than the Z-score for skewed distributions or heavy-tailed noise.

Parameters:

ts (TimeSeries) – Input series.
threshold (float, optional) – Modified Z-score cut-off. Iglewicz & Hoaglin (1993) recommend 3.5 (default).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or MAD == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0], index=idx)
>>> OutlierDetector().mad(ts).n_outliers
1
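
The modified Z-score formula above translates almost directly into code. The following is a minimal sketch using only the standard library; modified_zscores is a hypothetical helper name and this is not the tseda implementation.

```python
import statistics

def modified_zscores(values):
    """M_i = 0.6745 * (x_i - median) / MAD, with MAD = median(|x_i - median|)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD == 0: modified Z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]

data = [2.0, 2.1, 1.9, 2.0, 50.0]
scores = modified_zscores(data)      # median = 2.0, MAD is about 0.1
flagged = [i for i, m in enumerate(scores) if abs(m) > 3.5]
print(flagged)                       # [4]
```

Because the median and the MAD barely move when a single extreme value is added, the spike scores in the hundreds here while the ordinary points stay below 1.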

gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Detect outliers using the Generalized ESD (Extreme Studentized Deviate) test.

The GESD test (Rosner, 1983) sequentially removes the most extreme value and tests whether it is a statistical outlier, up to max_outliers iterations. It assumes the underlying data are approximately normal after removing the outliers.

Parameters:

ts (TimeSeries) – Input series.
alpha (float, optional) – Significance level. Default 0.05.
max_outliers (int, optional) – Upper bound on the number of outliers to test for. Default 10. Must be less than n // 2.

Returns:

lower_bound and upper_bound are None (GESD uses per-iteration critical values, not fixed fences).

Return type:

OutlierReport

Raises:

ValueError – If alpha is not in (0, 1) or max_outliers is invalid.
ImportError – If scipy is not installed (needed for the t-distribution quantile function).

Notes

The critical value at each step i (1-indexed) is:

\[\lambda_i = \frac{(n - i) \, t_{p, n-i-1}} {\sqrt{(n-i-1+t_{p,n-i-1}^2)(n-i+1)}}\]

where \(p = 1 - \alpha / (2(n - i + 1))\) and \(t_{p, \nu}\) is the \(p\)-quantile of the t-distribution with \(\nu\) degrees of freedom.
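
As a worked substitution into the critical-value formula, take the illustrative values n = 50 and α = 0.05 (chosen here only for the sake of example). For the first two steps, the quantile level, the degrees of freedom, and the critical value become:

```latex
\[
i = 1:\quad p = 1 - \frac{0.05}{2 \cdot 50} = 0.9995,\quad \nu = 48,\quad
\lambda_1 = \frac{49 \, t_{p,48}}{\sqrt{(48 + t_{p,48}^2) \cdot 50}}
\]
\[
i = 2:\quad p = 1 - \frac{0.05}{2 \cdot 49} \approx 0.99949,\quad \nu = 47,\quad
\lambda_2 = \frac{48 \, t_{p,47}}{\sqrt{(47 + t_{p,47}^2) \cdot 49}}
\]
```

The quantile \(t_{p,\nu}\) itself still has to be looked up or computed numerically. Both \(p\) and \(\nu\) change at every step, which is why the method cannot report a single pair of fixed fences.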

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> rng = np.random.default_rng(0)
>>> idx = pd.date_range("2020", periods=50, freq="D")
>>> vals = rng.standard_normal(50)
>>> vals[10] = 15.0   # plant a spike
>>> ts  = TimeSeries(vals, index=idx)
>>> r   = OutlierDetector().gesd(ts)
>>> 10 in r.indices
True

remove(ts, report)[source]

Replace detected outlier values with NaN.

Parameters:

ts (TimeSeries) – The original series.
report (OutlierReport) – Result from one of the detection methods.

Returns:

A new series with outliers replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> cleaned = det.remove(ts, det.iqr(ts))
>>> cleaned.has_nan
True

clip(ts, report)[source]

Clip outlier values to the fence bounds of report.

Parameters:

ts (TimeSeries) – The original series.
report (OutlierReport) – Must have non-None lower_bound and upper_bound (i.e., from iqr, zscore, or mad).

Returns:

A new series with values clamped to [lower_bound, upper_bound].

Return type:

TimeSeries

Raises:

ValueError – If report has no bounds (e.g., from gesd).

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> r   = det.iqr(ts)
>>> clipped = det.clip(ts, r)
>>> float(clipped.values.max()) < 100.0
True

class tseda.quality.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)[source]

Bases: object

Immutable summary of flat-line segments in a TimeSeries.

Parameters:

n_flatline_runs (int)
longest_run (int)
total_flatline_points (int)
runs (List[Tuple[int, int, float]])
mask (ndarray)
min_run (int)

n_flatline_runs

Number of runs that meet or exceed min_run in length.

Type:: int

longest_run

Length of the single longest flat-line run.

Type:: int

total_flatline_points

Total number of observations that belong to a qualifying flat-line run (includes the first observation of each run).

Type:: int

runs

Each element is a tuple (start_pos, end_pos, value) where start_pos and end_pos are 0-based integer positions and value is the repeated value. Only runs of length >= min_run are included.

Type:: list of (start_pos, end_pos, value)

mask

Boolean array; True at every position that is part of a qualifying flat-line run.

Type:: numpy.ndarray

min_run

The minimum run length used for this report.

Type:: int

n_flatline_runs: int

longest_run: int

total_flatline_points: int

runs: List[Tuple[int, int, float]]

mask: ndarray

min_run: int

__repr__()[source]

Return repr(self).

Return type:: str

__init__(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)

Parameters:

n_flatline_runs (int)
longest_run (int)
total_flatline_points (int)
runs (List[Tuple[int, int, float]])
mask (ndarray)
min_run (int)

Return type:

None

class tseda.quality.DuplicateDetector[source]

Bases: object

Detect consecutive duplicate (flat-line) value runs.

flatline(ts, min_run=3, *, atol=0.0)[source]

Detect flat-line segments of repeated values.

Parameters:

ts (TimeSeries)
min_run (int)
atol (float)

Return type:

FlatlineReport

near_zero(ts, min_run=3, *, threshold=1e-08)[source]

Detect segments where the series is stuck near zero.

Parameters:

ts (TimeSeries)
min_run (int)
threshold (float)

Return type:

FlatlineReport

remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN (by default, keeping the first value of each run).

Parameters:

ts (TimeSeries)
report (FlatlineReport)
keep_first (bool)

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4

flatline(ts, min_run=3, *, atol=0.0)[source]

Detect consecutive runs of identical (or near-identical) values.

Parameters:

ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum number of consecutive identical observations to constitute a “flat line”. Default 3.
atol (float, optional) – Absolute tolerance for equality. Two values a and b are considered equal when |a - b| <= atol. Default 0.0 (exact equality).

Return type:

FlatlineReport

Raises:

TypeError – If ts is not a TimeSeries.
ValueError – If min_run < 2.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

Exact flat line of length 4:

>>> idx  = pd.date_range("2020", periods=7, freq="D")
>>> vals = np.array([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.runs[0]
(1, 4, 3.0)

No flat line (min_run too high):

>>> r2 = DuplicateDetector().flatline(ts, min_run=5)
>>> r2.n_flatline_runs
0
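
Run detection of this kind can be sketched in a single pass. The helper flatline_runs is hypothetical and makes one assumption the text does not settle: each value is compared against the first value of the current run (the library might instead compare neighbours when atol > 0).

```python
def flatline_runs(values, min_run=3, atol=0.0):
    """Return (start_pos, end_pos, value) for every run of at least min_run equal values."""
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the data or when the next value leaves the
        # tolerance band. Written with "not (... <= atol)" so NaN always breaks a run.
        if i == len(values) or not (abs(values[i] - values[start]) <= atol):
            if i - start >= min_run:
                runs.append((start, i - 1, values[start]))
            start = i
    return runs

data = [1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0]
print(flatline_runs(data, min_run=3))              # [(1, 4, 3.0)]
print(flatline_runs(data, min_run=5))              # []
print(flatline_runs([1.0, 1.05, 0.95, 2.0], atol=0.1))
```

The last call shows the effect of a non-zero atol: three slightly different readings are treated as one flat segment.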

near_zero(ts, min_run=3, *, threshold=1e-08)[source]

Detect segments where the series is stuck near zero.

Only consecutive runs where every value satisfies |x| <= threshold are reported. This differs from flatline(), which detects any repeated value regardless of magnitude.

Parameters:

ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum run length. Default 3.
threshold (float, optional) – Maximum absolute value to count as “near zero”. Default 1e-8.

Returns:

Runs where every value satisfies |x| <= threshold.

Return type:

FlatlineReport

Raises:

ValueError – If threshold < 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().near_zero(ts, min_run=3)
>>> r.n_flatline_runs
1
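
The contrast with flatline() can be made concrete with a small standard-library sketch. near_zero_runs is a hypothetical helper, not the tseda implementation; it groups consecutive values by whether |x| <= threshold holds.

```python
from itertools import groupby

def near_zero_runs(values, min_run=3, threshold=1e-8):
    """Return (start_pos, end_pos) for runs where every value has |x| <= threshold."""
    runs = []
    pos = 0
    for is_small, group in groupby(values, key=lambda v: abs(v) <= threshold):
        length = len(list(group))
        if is_small and length >= min_run:
            runs.append((pos, pos + length - 1))
        pos += length
    return runs

stuck = near_zero_runs([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])   # [(1, 4)]
# A repeated non-zero value is a flat line but not a near-zero run.
constant = near_zero_runs([5.0, 5.0, 5.0, 5.0])                     # []
# Tiny values of either sign count, even though they are not identical.
jitter = near_zero_runs([1e-9, -1e-9, 0.0, 2.0])                    # [(0, 2)]
print(stuck, constant, jitter)
```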

remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN.

Parameters:

ts (TimeSeries) – The original series.
report (FlatlineReport) – Result from flatline() or near_zero().
keep_first (bool, optional) – When True (default), the first observation of each flat-line run is preserved; only the repeated copies are set to NaN. When False, the entire run including the first observation is set to NaN.

Returns:

A new series with flat-line values replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

>>> idx  = pd.date_range("2020", periods=6, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> cleaned = det.remove_flatlines(ts, r, keep_first=True)
>>> cleaned.n_nan
2
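
The effect of keep_first can be sketched on a plain list. blank_runs is a hypothetical helper that takes runs in the (start_pos, end_pos, value) form described for FlatlineReport.runs; it is an illustration rather than the library's code.

```python
import math

def blank_runs(values, runs, keep_first=True):
    """Set run positions to NaN, optionally sparing the first observation of each run."""
    out = list(values)
    for start, end, _value in runs:
        first = start + 1 if keep_first else start
        for i in range(first, end + 1):
            out[i] = math.nan
    return out

data = [1.0, 5.0, 5.0, 5.0, 2.0, 3.0]
runs = [(1, 3, 5.0)]                                  # one run covering positions 1..3
kept = blank_runs(data, runs, keep_first=True)        # [1.0, 5.0, nan, nan, 2.0, 3.0]
dropped = blank_runs(data, runs, keep_first=False)    # [1.0, nan, nan, nan, 2.0, 3.0]
print(sum(math.isnan(v) for v in kept), sum(math.isnan(v) for v in dropped))   # 2 3
```

Keeping the first observation treats it as the last trustworthy reading before the value stopped changing; dropping it is the more conservative choice when the whole segment is suspect.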

Missing-value analysis for time series.

Two distinct concepts are handled here:

Value NaN — a timestamp is present in the index but its observed value is numpy.nan.
Index gap — a timestamp that should exist (given the series frequency) is absent from the index entirely.

Both are reported by MissingValueAnalyzer. Interpolation of NaN values is also provided via MissingValueAnalyzer.interpolate().
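
The two concepts are easy to confuse, so here is a small standard-library sketch that separates them. It does not use tseda; the daily step and the dictionary layout are assumptions made for the illustration.

```python
import math
from datetime import date, timedelta

step = timedelta(days=1)   # assumed series frequency
observations = {
    date(2020, 1, 1): 1.0,
    date(2020, 1, 2): math.nan,   # value NaN: timestamp present, value missing
    date(2020, 1, 3): 3.0,
    # 2020-01-04 is absent entirely: an index gap
    date(2020, 1, 5): 5.0,
}

# Value NaN: look at the values that are actually stored.
value_nans = [d for d, v in observations.items() if math.isnan(v)]

# Index gap: compare the stored timestamps with the full expected grid.
expected = []
d = min(observations)
while d <= max(observations):
    expected.append(d)
    d += step
index_gaps = [d for d in expected if d not in observations]

print(value_nans)   # [datetime.date(2020, 1, 2)]
print(index_gaps)   # [datetime.date(2020, 1, 4)]
```

An index gap can only be detected when the expected frequency is known, which is why a gap count needs a sentinel for series without one.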

Classes

MissingValueReport: Immutable result dataclass returned by MissingValueAnalyzer.analyze().
MissingValueAnalyzer: Stateless analyzer; all methods accept a TimeSeries and return plain Python / numpy objects or a new TimeSeries.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer

>>> idx = pd.date_range("2020-01-01", periods=10, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, np.nan, 6.0, 7.0, 8.0, np.nan, 10.0])
>>> ts  = TimeSeries(vals, index=idx)
>>> ana = MissingValueAnalyzer()
>>> report = ana.analyze(ts)
>>> report.n_nan
4
>>> report.pct_nan
40.0

class tseda.quality.missing.MissingValueReport(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)[source]

Bases: object

Immutable summary of missing values in a TimeSeries.

Parameters:

n_nan (int)
pct_nan (float)
n_gaps (int)
gap_locations (List[Timestamp])
longest_nan_run (int)
nan_run_lengths (List[int])
nan_positions (ndarray)
is_monotone_missing (bool)

n_nan

Number of NaN values in the observed array.

Type:: int

pct_nan

Percentage of NaN observations (0–100).

Type:: float

n_gaps

Number of missing timestamps (index gaps) when the series frequency is known. -1 when frequency is unknown.

Type:: int

gap_locations

Start timestamp of each index gap. Empty when n_gaps <= 0.

Type:: list of pandas.Timestamp

longest_nan_run

Length of the longest consecutive run of NaN values.

Type:: int

nan_run_lengths

Lengths of every consecutive NaN run (ascending order).

Type:: list of int

nan_positions

Integer positions (0-based) of all NaN values.

Type:: numpy.ndarray

is_monotone_missing

True when all NaN values cluster at the start or end of the series (monotone missing pattern — easier to handle).

Type:: bool

n_nan: int

pct_nan: float

n_gaps: int

gap_locations: List[Timestamp]

longest_nan_run: int

nan_run_lengths: List[int]

nan_positions: ndarray

is_monotone_missing: bool

__repr__()[source]

Return repr(self).

Return type:: str

__init__(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)

Parameters:

n_nan (int)
pct_nan (float)
n_gaps (int)
gap_locations (List[Timestamp])
longest_nan_run (int)
nan_run_lengths (List[int])
nan_positions (ndarray)
is_monotone_missing (bool)

Return type:

None

class tseda.quality.missing.MissingValueAnalyzer[source]

Bases: object

Analyze and repair missing values in a TimeSeries.

This class is stateless — instantiate once and call its methods on different series objects.

analyze(ts)[source]

Return a MissingValueReport for ts.

Parameters:: ts (TimeSeries)
Return type:: MissingValueReport

interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]

Fill NaN values and return a new TimeSeries.

Parameters:

ts (TimeSeries)
method (str)
limit (int | None)
fill_value (float | None)

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer

>>> idx  = pd.date_range("2020-01-01", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()
>>> r = ana.analyze(ts)
>>> r.n_nan
2
>>> filled = ana.interpolate(ts)
>>> filled.has_nan
False

analyze(ts)[source]

Compute a complete missing-value summary for ts.

Parameters:: ts (TimeSeries) – The series to analyze.
Return type:: MissingValueReport

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=4, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0])
>>> report = MissingValueAnalyzer().analyze(TimeSeries(vals, index=idx))
>>> report.n_nan
2
>>> report.longest_nan_run
2
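
The run-based fields of the report can be derived in one pass over the values. The sketch below uses only the standard library; nan_runs is a hypothetical helper and the variable names simply mirror the report's field names.

```python
import math
from itertools import groupby

def nan_runs(values):
    """Return (start_pos, length) for every consecutive NaN run, in order of appearance."""
    runs, pos = [], 0
    for is_nan, group in groupby(values, key=math.isnan):
        length = len(list(group))
        if is_nan:
            runs.append((pos, length))
        pos += length
    return runs

nan = math.nan
values = [1.0, nan, nan, 4.0, nan, 6.0]
runs = nan_runs(values)                                   # [(1, 2), (4, 1)]

n_nan = sum(length for _, length in runs)                 # 3
pct_nan = 100.0 * n_nan / len(values)                     # 50.0
nan_run_lengths = sorted(length for _, length in runs)    # [1, 2]
longest_nan_run = max(nan_run_lengths, default=0)         # 2
print(n_nan, pct_nan, nan_run_lengths, longest_nan_run)
```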

interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]

Fill NaN values and return a new TimeSeries.

Parameters:

ts (TimeSeries) – Series to fill.
method (str, optional) –
Interpolation strategy. One of:
- "linear" — linear interpolation between neighbours (default). Leading and trailing NaN are filled with the nearest observed boundary value when limit is None.
- "forward" — forward-fill (carry last observed value).
- "backward" — backward-fill (carry next observed value).
- "nearest" — fill with the nearest non-NaN value.
- "zero" — fill with 0.0.
- "constant" — fill with fill_value (must be provided).
- "spline" — cubic spline (requires scipy).
limit (int, optional) – Maximum number of consecutive NaN values to fill. None fills all gaps.
fill_value (float, optional) – Used only with method="constant".

Returns:

A new series with NaN values replaced. Metadata (name, unit, freq, description) is preserved.

Return type:

TimeSeries

Raises:

TypeError – If ts is not a TimeSeries.
ValueError – If method is not recognised, or if "constant" is chosen without supplying fill_value.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()

Linear interpolation:

>>> filled = ana.interpolate(ts, "linear")
>>> filled.values.tolist()
[1.0, 2.0, 3.0, 4.0, 5.0]

Forward fill:

>>> fwd = ana.interpolate(ts, "forward")
>>> fwd.values.tolist()
[1.0, 1.0, 1.0, 4.0, 5.0]

Outlier detection for time series.

Four detection methods are provided, all implemented with pure numpy / scipy — no machine-learning dependencies:

Method	Statistic	Best for
IQR	Tukey fences	Symmetric or skewed data
Z-score	Mean / std deviation	Approximately normal data
MAD	Median absolute deviation	Skewed data, heavy tails
GESD	Generalized ESD test	Approximately normal data with an unknown number of outliers

Every detection method returns an OutlierReport. OutlierDetector also provides remove() and clip() helpers, which take that report and return a cleaned TimeSeries.

Classes

OutlierReport: Immutable result dataclass.
OutlierDetector: Stateless detector.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector

>>> idx = pd.date_range("2020-01-01", periods=10, freq="D")
>>> vals = np.array([1, 2, 2, 3, 100, 2, 3, 2, 1, 2], dtype=float)
>>> ts  = TimeSeries(vals, index=idx)
>>> det = OutlierDetector()
>>> r   = det.iqr(ts)
>>> r.n_outliers
1

class tseda.quality.outliers.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)[source]

Bases: object

Immutable outlier detection result.

Parameters:

mask (ndarray)
indices (ndarray)
timestamps (DatetimeIndex)
values (ndarray)
method (str)
n_outliers (int)
lower_bound (float | None)
upper_bound (float | None)

mask

Boolean array of shape (n,); True where an outlier was found.

Type:: numpy.ndarray

indices

Integer positions (0-based) of detected outliers.

Type:: numpy.ndarray

timestamps

Timestamps of detected outliers.

Type:: pandas.DatetimeIndex

values

Observed values at outlier positions.

Type:: numpy.ndarray

method

Name of the detection method used.

Type:: str

n_outliers

Number of outliers detected.

Type:: int

lower_bound

Lower fence / threshold (when applicable).

Type:: float or None

upper_bound

Upper fence / threshold (when applicable).

Type:: float or None

mask: ndarray

indices: ndarray

timestamps: DatetimeIndex

values: ndarray

method: str

n_outliers: int

lower_bound: float | None

upper_bound: float | None

__repr__()[source]

Return repr(self).

Return type:: str

__init__(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)

Parameters:

mask (ndarray)
indices (ndarray)
timestamps (DatetimeIndex)
values (ndarray)
method (str)
n_outliers (int)
lower_bound (float | None)
upper_bound (float | None)

Return type:

None

class tseda.quality.outliers.OutlierDetector[source]

Bases: object

Detect, remove, or clip outliers in a TimeSeries.

This class is stateless: create one instance and reuse it across many series.

iqr(ts, k=1.5)[source]

Tukey IQR fence method.

Parameters:

ts (TimeSeries)
k (float)

Return type:

OutlierReport

zscore(ts, threshold=3.0)[source]

Standard Z-score method.

Parameters:

ts (TimeSeries)
threshold (float)

Return type:

OutlierReport

mad(ts, threshold=3.5)[source]

Median Absolute Deviation method.

Parameters:

ts (TimeSeries)
threshold (float)

Return type:

OutlierReport

gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Generalized Extreme Studentized Deviate test.

Parameters:

ts (TimeSeries)
alpha (float)
max_outliers (int)

Return type:

OutlierReport

remove(ts, report)[source]

Replace outlier values with NaN.

Parameters:

ts (TimeSeries)
report (OutlierReport)

Return type:

TimeSeries

clip(ts, report)[source]

Clip outlier values to the fence bounds.

Parameters:

ts (TimeSeries)
report (OutlierReport)

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector

>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0, 2.0], index=idx)
>>> det = OutlierDetector()

IQR detection:

>>> r = det.iqr(ts)
>>> r.n_outliers
1
>>> int(r.indices[0])
4

Remove the outlier (replace with NaN):

>>> cleaned = det.remove(ts, r)
>>> cleaned.has_nan
True

iqr(ts, k=1.5)[source]

Detect outliers using Tukey’s IQR fences.

Points below Q1 - k * IQR or above Q3 + k * IQR are flagged. NaN values are excluded from the quartile computation and are never flagged as outliers.

Parameters:

ts (TimeSeries) – Input series.
k (float, optional) –
Fence multiplier. Common choices:
- 1.5 — standard outlier fence (default).
- 3.0 — extreme outlier fence.

Return type:

OutlierReport

Raises:

ValueError – If k is not positive, or if fewer than 4 non-NaN values exist.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 3.0, 2.0, 100.0], index=idx)
>>> OutlierDetector().iqr(ts).n_outliers
1

zscore(ts, threshold=3.0)[source]

Detect outliers using the standard Z-score.

A value is flagged when |z| > threshold, where z = (x - mean) / std.

Parameters:

ts (TimeSeries) – Input series.
threshold (float, optional) – Z-score cut-off. Default 3.0 (≈ 0.3 % false-positive rate under normality).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or if std == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
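>>> idx = pd.date_range("2020", periods=20, freq="D")
>>> vals = np.array([0.0, 0.1, 0.0, -0.1] * 5)
>>> vals[-1] = 10.0   # plant a spike; 10 or fewer points can never reach |z| > 3
>>> ts  = TimeSeries(vals, index=idx)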
>>> OutlierDetector().zscore(ts).n_outliers
1

mad(ts, threshold=3.5)[source]

Detect outliers using the Median Absolute Deviation (MAD).

A value is flagged when the absolute value of the modified Z-score, |M_i|, exceeds threshold:

\[M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\text{MAD}}\]

where \(\tilde{x}\) is the median and \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\).

This is more robust than the Z-score for skewed distributions or heavy-tailed noise.

Parameters:

ts (TimeSeries) – Input series.
threshold (float, optional) – Modified Z-score cut-off. Iglewicz & Hoaglin (1993) recommend 3.5 (default).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or MAD == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0], index=idx)
>>> OutlierDetector().mad(ts).n_outliers
1

gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Detect outliers using the Generalized ESD (Extreme Studentized Deviate) test.

The GESD test (Rosner, 1983) sequentially removes the most extreme value and tests whether it is a statistical outlier, up to max_outliers iterations. It assumes the underlying data are approximately normal after removing the outliers.

Parameters:

ts (TimeSeries) – Input series.
alpha (float, optional) – Significance level. Default 0.05.
max_outliers (int, optional) – Upper bound on the number of outliers to test for. Default 10. Must be less than n // 2.

Returns:

lower_bound and upper_bound are None (GESD uses per-iteration critical values, not fixed fences).

Return type:

OutlierReport

Raises:

ValueError – If alpha is not in (0, 1) or max_outliers is invalid.
ImportError – If scipy is not installed (needed for the t-distribution quantile function).

Notes

The critical value at each step i (1-indexed) is:

\[\lambda_i = \frac{(n - i) \, t_{p, n-i-1}} {\sqrt{(n-i-1+t_{p,n-i-1}^2)(n-i+1)}}\]

where \(p = 1 - \alpha / (2(n - i + 1))\) and \(t_{p, \nu}\) is the \(p\)-quantile of the t-distribution with \(\nu\) degrees of freedom.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> rng = np.random.default_rng(0)
>>> idx = pd.date_range("2020", periods=50, freq="D")
>>> vals = rng.standard_normal(50)
>>> vals[10] = 15.0   # plant a spike
>>> ts  = TimeSeries(vals, index=idx)
>>> r   = OutlierDetector().gesd(ts)
>>> 10 in r.indices
True

remove(ts, report)[source]

Replace detected outlier values with NaN.

Parameters:

ts (TimeSeries) – The original series.
report (OutlierReport) – Result from one of the detection methods.

Returns:

A new series with outliers replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> cleaned = det.remove(ts, det.iqr(ts))
>>> cleaned.has_nan
True

clip(ts, report)[source]

Clip outlier values to the fence bounds of report.

Parameters:

ts (TimeSeries) – The original series.
report (OutlierReport) – Must have non-None lower_bound and upper_bound (i.e., from iqr, zscore, or mad).

Returns:

A new series with values clamped to [lower_bound, upper_bound].

Return type:

TimeSeries

Raises:

ValueError – If report has no bounds (e.g., from gesd).

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> r   = det.iqr(ts)
>>> clipped = det.clip(ts, r)
>>> float(clipped.values.max()) < 100.0
True

Flat-line and near-constant segment detection for time series.

Timestamp duplicates are rejected at construction time by validate_datetime_index(). This module addresses the complementary problem: consecutive identical or near-zero values, which typically signal:

A stuck sensor / ADC saturation.
A data-pipeline bug that forward-filled data without marking it.
A genuine flat segment that may confuse differencing-based methods.

Classes

FlatlineReport: Immutable result dataclass returned by DuplicateDetector.flatline() and DuplicateDetector.near_zero().
DuplicateDetector: Stateless detector providing flatline(), near_zero(), and remove_flatlines().

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

>>> idx  = pd.date_range("2020-01-01", periods=10, freq="D")
>>> vals = np.array([1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4

class tseda.quality.duplicates.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)[source]

Bases: object

Immutable summary of flat-line segments in a TimeSeries.

Parameters:

n_flatline_runs (int)
longest_run (int)
total_flatline_points (int)
runs (List[Tuple[int, int, float]])
mask (ndarray)
min_run (int)

n_flatline_runs

Number of runs that meet or exceed min_run in length.

Type:: int

longest_run

Length of the single longest flat-line run.

Type:: int

total_flatline_points

Total number of observations that belong to a qualifying flat-line run (includes the first observation of each run).

Type:: int

runs

Each element is a tuple (start_pos, end_pos, value), where start_pos and end_pos are 0-based integer positions (both inclusive) and value is the repeated value. Only runs of length >= min_run are included.

Type:: list of (start_pos, end_pos, value)

mask

Boolean array; True at every position that is part of a qualifying flat-line run.

Type:: numpy.ndarray

min_run

The minimum run length used for this report.

Type:: int

__repr__()[source]

Return repr(self).

Return type:: str

__init__(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)

Parameters:

n_flatline_runs (int)
longest_run (int)
total_flatline_points (int)
runs (List[Tuple[int, int, float]])
mask (ndarray)
min_run (int)

Return type:

None

class tseda.quality.duplicates.DuplicateDetector[source]

Bases: object

Detect consecutive duplicate (flat-line) value runs.

flatline(ts, min_run=3, atol=0.0)[source]

Detect flat-line segments of repeated values.

Parameters:

ts (TimeSeries)
min_run (int)
atol (float)

Return type:

FlatlineReport

near_zero(ts, min_run=3, threshold=1e-8)[source]

Detect segments where the series is stuck near zero.

Parameters:

ts (TimeSeries)
min_run (int)
threshold (float)

Return type:

FlatlineReport

remove_flatlines(ts, report, keep_first=True)[source]

Replace flat-line positions with NaN, keeping the first value of each run by default.

Parameters:

ts (TimeSeries)
report (FlatlineReport)
keep_first (bool)

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4

flatline(ts, min_run=3, *, atol=0.0)[source]

Detect consecutive runs of identical (or near-identical) values.

Parameters:

ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum number of consecutive identical observations to constitute a “flat line”. Default 3.
atol (float, optional) – Absolute tolerance for equality. Two values a and b are considered equal when |a - b| <= atol. Default 0.0 (exact equality).

Return type:

FlatlineReport

Raises:

TypeError – If ts is not a TimeSeries.
ValueError – If min_run < 2.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

Exact flat line of length 4:

>>> idx  = pd.date_range("2020", periods=7, freq="D")
>>> vals = np.array([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.runs[0]
(1, 4, 3.0)

No flat line (min_run too high):

>>> r2 = DuplicateDetector().flatline(ts, min_run=5)
>>> r2.n_flatline_runs
0
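
The run detection described above can be sketched as a single pass over consecutive differences. This is an illustrative sketch, not tseda's implementation: find_flat_runs is a hypothetical helper, and it assumes each value is compared with its immediate predecessor (the text does not say whether the comparison is against the predecessor or the first value of the run).

```python
import numpy as np

def find_flat_runs(values, min_run=3, atol=0.0):
    """Return ([(start_pos, end_pos, value), ...], mask) for flat runs.

    Hypothetical helper; positions are 0-based and inclusive.
    """
    x = np.asarray(values, dtype=float)
    runs = []
    mask = np.zeros(x.size, dtype=bool)
    if x.size == 0:
        return runs, mask
    # same[i] is True when x[i + 1] equals x[i] within atol
    same = np.abs(np.diff(x)) <= atol
    start = 0
    for i in range(1, x.size + 1):
        # a run ends at the end of the array or where the neighbour differs
        if i == x.size or not same[i - 1]:
            if i - start >= min_run:
                runs.append((start, i - 1, float(x[start])))
                mask[start:i] = True
            start = i
    return runs, mask

vals = [1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0]
runs, mask = find_flat_runs(vals, min_run=3)
no_runs, _ = find_flat_runs(vals, min_run=5)
```

With a nonzero atol, predecessor comparison lets a slowly drifting series form one long run; comparing against the first value of the run would avoid that, at the cost of a slightly more involved loop.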

near_zero(ts, min_run=3, *, threshold=1e-08)[source]

Detect segments where the series is stuck near zero.

Only consecutive runs where every value satisfies |x| <= threshold are reported. This differs from flatline(), which detects any repeated value regardless of magnitude.

Parameters:

ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum run length. Default 3.
threshold (float, optional) – Maximum absolute value to count as “near zero”. Default 1e-8.

Returns:

Runs where every value satisfies |x| <= threshold.

Return type:

FlatlineReport

Raises:

ValueError – If threshold < 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().near_zero(ts, min_run=3)
>>> r.n_flatline_runs
1
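
The difference from flatline() can be sketched by thresholding the absolute values first and then locating the edges of each qualifying run. find_near_zero_runs is a hypothetical helper written for illustration and is not part of tseda.

```python
import numpy as np

def find_near_zero_runs(values, min_run=3, threshold=1e-8):
    """Return [(start_pos, end_pos), ...] for runs with |x| <= threshold.

    Hypothetical helper; positions are 0-based and inclusive.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    x = np.asarray(values, dtype=float)
    near = np.abs(x) <= threshold
    # Pad with False so every run has a detectable start and end edge.
    padded = np.concatenate(([False], near, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1   # inclusive end positions
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s + 1 >= min_run]

zero_runs = find_near_zero_runs([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
# Values need not be identical, only small: this run mixes signs.
tiny_runs = find_near_zero_runs([2.0, 1e-9, -1e-9, 0.0, 2.0])
# A repeated non-zero value is not a near-zero run.
flat_runs = find_near_zero_runs([5.0, 5.0, 5.0, 5.0])
```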

remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN.

Parameters:

ts (TimeSeries) – The original series.
report (FlatlineReport) – Result from flatline() or near_zero().
keep_first (bool, optional) – When True (default), the first observation of each flat-line run is preserved; only the repeated copies are set to NaN. When False, the entire run including the first observation is set to NaN.

Returns:

A new series with flat-line values replaced by NaN.

Return type:

TimeSeries
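
The effect of keep_first can be sketched on a plain array. blank_runs is a hypothetical helper, not tseda's implementation; it assumes runs are given as (start_pos, end_pos, value) tuples with 0-based, inclusive positions, as in FlatlineReport.runs.

```python
import numpy as np

def blank_runs(values, runs, keep_first=True):
    """Return a float copy of `values` with flat-line runs set to NaN.

    Hypothetical helper; `runs` holds (start_pos, end_pos, value) tuples
    with 0-based, inclusive positions.
    """
    out = np.asarray(values, dtype=float).copy()
    for start, end, _value in runs:
        # keep_first=True preserves the first observation of the run
        first = start + 1 if keep_first else start
        out[first:end + 1] = np.nan
    return out

vals = [1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0]
kept = blank_runs(vals, [(1, 4, 5.0)])
dropped = blank_runs(vals, [(1, 4, 5.0)], keep_first=False)
```

Keeping the first observation suits forward-fill artifacts, where the first value is presumably a real measurement; dropping the whole run is the safer choice when the sensor itself may have been stuck.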