tseda.quality

Data quality diagnostics for time series.

Public API

MissingValueReport

Immutable result of MissingValueAnalyzer.

MissingValueAnalyzer

Detect NaN values, index gaps, and interpolate.

OutlierReport

Immutable result of OutlierDetector.

OutlierDetector

IQR, Z-score, MAD, and GESD outlier detection.

FlatlineReport

Immutable result of DuplicateDetector.

DuplicateDetector

Consecutive flat-line and near-zero segment detection.

class tseda.quality.MissingValueReport(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)[source]

Bases: object

Immutable summary of missing values in a TimeSeries.

Parameters:
n_nan

Number of NaN values in the observed array.

Type:

int

pct_nan

Percentage of NaN observations (0–100).

Type:

float

n_gaps

Number of missing timestamps (index gaps) when the series frequency is known. -1 when frequency is unknown.

Type:

int

gap_locations

Start timestamp of each index gap. Empty when n_gaps <= 0.

Type:

list of pandas.Timestamp

longest_nan_run

Length of the longest consecutive run of NaN values.

Type:

int

nan_run_lengths

Lengths of every consecutive NaN run (ascending order).

Type:

list of int

nan_positions

Integer positions (0-based) of all NaN values.

Type:

numpy.ndarray

is_monotone_missing

True when all NaN values cluster at the start or end of the series (monotone missing pattern — easier to handle).

Type:

bool

n_nan: int
pct_nan: float
n_gaps: int
gap_locations: List[Timestamp]
longest_nan_run: int
nan_run_lengths: List[int]
nan_positions: ndarray
is_monotone_missing: bool
__repr__()[source]

Return repr(self).

Return type:

str

__init__(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)
Return type:

None

class tseda.quality.MissingValueAnalyzer[source]

Bases: object

Analyze and repair missing values in a TimeSeries.

This class is stateless — instantiate once and call its methods on different series objects.

analyze(ts)[source]

Return a MissingValueReport for ts.

Parameters:

ts (TimeSeries)

Return type:

MissingValueReport

interpolate(ts, method='linear')[source]

Fill NaN values and return a new TimeSeries.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020-01-01", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()
>>> r = ana.analyze(ts)
>>> r.n_nan
2
>>> filled = ana.interpolate(ts)
>>> filled.has_nan
False
analyze(ts)[source]

Compute a complete missing-value summary for ts.

Parameters:

ts (TimeSeries) – The series to analyze.

Return type:

MissingValueReport

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=4, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0])
>>> report = MissingValueAnalyzer().analyze(TimeSeries(vals, index=idx))
>>> report.n_nan
2
>>> report.longest_nan_run
2
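
The NaN-run fields of the report can be illustrated without the library. The following is a minimal sketch that works on a plain list of floats; the helper name nan_summary is hypothetical and is not part of tseda, and sorting the run lengths is only one reading of "ascending order".

```python
import math

def nan_summary(values):
    """Summarise NaN positions and consecutive NaN runs (hypothetical helper)."""
    positions = [i for i, v in enumerate(values) if math.isnan(v)]
    run_lengths = []
    current = 0
    for v in values:
        if math.isnan(v):
            current += 1                  # extend the current NaN run
        elif current:
            run_lengths.append(current)   # a run has just ended
            current = 0
    if current:                           # the series ends inside a NaN run
        run_lengths.append(current)
    return {
        "n_nan": len(positions),
        "pct_nan": 100.0 * len(positions) / len(values) if values else 0.0,
        "nan_positions": positions,
        "nan_run_lengths": sorted(run_lengths),  # assumption: sorted by length
        "longest_nan_run": max(run_lengths, default=0),
    }

summary = nan_summary([1.0, math.nan, math.nan, 4.0])
print(summary["n_nan"], summary["longest_nan_run"])  # prints: 2 2
```

Index gaps (n_gaps, gap_locations) depend on the series frequency and are not covered by this sketch.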
interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]

Fill NaN values and return a new TimeSeries.

Parameters:
  • ts (TimeSeries) – Series to fill.

  • method (str, optional) –

    Interpolation strategy. One of:

    • "linear" — linear interpolation between neighbours (default). Leading and trailing NaN are filled with the nearest observed boundary value when limit is None.

    • "forward" — forward-fill (carry last observed value).

    • "backward" — backward-fill (carry next observed value).

    • "nearest" — fill with the nearest non-NaN value.

    • "zero" — fill with 0.0.

    • "constant" — fill with fill_value (must be provided).

    • "spline" — cubic spline (requires scipy).

  • limit (int, optional) – Maximum number of consecutive NaN values to fill. None fills all gaps.

  • fill_value (float, optional) – Used only with method="constant".

Returns:

A new series with NaN values replaced. Metadata (name, unit, freq, description) is preserved.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()

Linear interpolation:

>>> filled = ana.interpolate(ts, "linear")
>>> filled.values.tolist()
[1.0, 2.0, 3.0, 4.0, 5.0]

Forward fill:

>>> fwd = ana.interpolate(ts, "forward")
>>> fwd.values.tolist()
[1.0, 1.0, 1.0, 4.0, 5.0]
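
The "linear" strategy with limit=None can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function name linear_fill is hypothetical, observations are treated as equally spaced, and leading or trailing NaN take the nearest observed value as described for method="linear".

```python
import math

def linear_fill(values):
    """Fill NaN by linear interpolation between observed neighbours (hypothetical helper)."""
    out = list(values)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    if not known:
        return out                       # nothing observed, nothing to interpolate from
    # Leading and trailing NaN take the nearest observed boundary value.
    for i in range(known[0]):
        out[i] = out[known[0]]
    for i in range(known[-1] + 1, len(out)):
        out[i] = out[known[-1]]
    # Interior gaps: weight the two neighbours by relative position.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span
            out[i] = out[left] + w * (out[right] - out[left])
    return out

filled = linear_fill([1.0, math.nan, math.nan, 4.0, 5.0])
print([round(v, 6) for v in filled])  # prints: [1.0, 2.0, 3.0, 4.0, 5.0]
```

The input list is copied, which mirrors the documented behaviour of returning a new series rather than modifying the original.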
class tseda.quality.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)[source]

Bases: object

Immutable outlier detection result.

Parameters:
mask

Boolean array of shape (n,); True where an outlier was found.

Type:

numpy.ndarray

indices

Integer positions (0-based) of detected outliers.

Type:

numpy.ndarray

timestamps

Timestamps of detected outliers.

Type:

pandas.DatetimeIndex

values

Observed values at outlier positions.

Type:

numpy.ndarray

method

Name of the detection method used.

Type:

str

n_outliers

Number of outliers detected.

Type:

int

lower_bound

Lower fence / threshold (when applicable).

Type:

float or None

upper_bound

Upper fence / threshold (when applicable).

Type:

float or None

mask: ndarray
indices: ndarray
timestamps: DatetimeIndex
values: ndarray
method: str
n_outliers: int
lower_bound: float | None
upper_bound: float | None
__repr__()[source]

Return repr(self).

Return type:

str

__init__(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)
Return type:

None

class tseda.quality.OutlierDetector[source]

Bases: object

Detect, remove, or clip outliers in a TimeSeries.

This class is stateless — create one instance and reuse across many series.

iqr(ts, k=1.5)[source]

Tukey IQR fence method.

Return type:

OutlierReport

zscore(ts, threshold=3.0)[source]

Standard Z-score method.

Return type:

OutlierReport

mad(ts, threshold=3.5)[source]

Median Absolute Deviation method.

Return type:

OutlierReport

gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Generalized Extreme Studentized Deviate test.

Return type:

OutlierReport

remove(ts, report)[source]

Replace outlier values with NaN.

Return type:

TimeSeries

clip(ts, report)[source]

Clip outlier values to the fence bounds.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0, 2.0], index=idx)
>>> det = OutlierDetector()

IQR detection:

>>> r = det.iqr(ts)
>>> r.n_outliers
1
>>> int(r.indices[0])
4

Remove the outlier (replace with NaN):

>>> cleaned = det.remove(ts, r)
>>> cleaned.has_nan
True
iqr(ts, k=1.5)[source]

Detect outliers using Tukey’s IQR fences.

Points below Q1 - k * IQR or above Q3 + k * IQR are flagged. NaN values are excluded from the quartile computation and are never flagged as outliers.

Parameters:
  • ts (TimeSeries) – Input series.

  • k (float, optional) –

    Fence multiplier. Common choices:

    • 1.5 — standard outlier fence (default).

    • 3.0 — extreme outlier fence.

Return type:

OutlierReport

Raises:

ValueError – If k is not positive, or if fewer than 4 non-NaN values exist.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 2.0, 2.0, 100.0], index=idx)
>>> OutlierDetector().iqr(ts).n_outliers
1
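
The fence rule can be sketched with the standard library only. The helper names below are hypothetical, and the quartile convention (linear interpolation between order statistics, which is what statistics.quantiles calls "inclusive") is an assumption; the library may compute quartiles differently, which can change the result on very small samples.

```python
import math
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences, ignoring NaN (hypothetical helper)."""
    clean = [v for v in values if not math.isnan(v)]
    # Assumption: quartiles by linear interpolation between order statistics.
    q1, _, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return 0-based positions outside the fences."""
    lower, upper = iqr_fences(values, k)
    # Comparisons with NaN are always False, so missing values are never flagged.
    return [i for i, v in enumerate(values) if v < lower or v > upper]

data = [2.0, 2.1, 1.9, 2.0, 50.0, 2.0]
print(iqr_outliers(data))  # prints: [4]
```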
zscore(ts, threshold=3.0)[source]

Detect outliers using the standard Z-score.

A value is flagged when |z| > threshold, where z = (x - mean) / std.

Parameters:
  • ts (TimeSeries) – Input series.

  • threshold (float, optional) – Z-score cut-off. Default 3.0 (≈ 0.3 % false-positive rate under normality).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or if std == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=20, freq="D")
>>> ts  = TimeSeries([0.0] * 19 + [10.0], index=idx)
>>> OutlierDetector().zscore(ts).n_outliers
1
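
The rule |z| > threshold can be sketched with the standard library. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the text above does not say which estimator is used.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return 0-based positions whose |z| exceeds threshold (hypothetical helper)."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # assumption: population standard deviation
    if std == 0:
        raise ValueError("std == 0: z-scores are undefined for a constant series")
    return [i for i, v in enumerate(values) if abs((v - mean) / std) > threshold]

data = [0.0] * 19 + [10.0]
print(zscore_outliers(data))  # prints: [19]
```

Because the spike itself inflates both the mean and the standard deviation, the Z-score is less sensitive on short series than the median-based methods.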
mad(ts, threshold=3.5)[source]

Detect outliers using the Median Absolute Deviation (MAD).

A value is flagged when the absolute value of its modified Z-score, \(|M_i|\), exceeds threshold:

\[M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\text{MAD}}\]

where \(\tilde{x}\) is the median and \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\).

This is more robust than the Z-score for skewed distributions or heavy-tailed noise.

Parameters:
  • ts (TimeSeries) – Input series.

  • threshold (float, optional) – Modified Z-score cut-off. Iglewicz & Hoaglin (1993) recommend 3.5 (default).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or MAD == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0], index=idx)
>>> OutlierDetector().mad(ts).n_outliers
1
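
The modified Z-score defined above can be computed directly with the standard library. This is a minimal sketch; the function name is hypothetical and NaN handling is left out.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return 0-based positions whose |modified Z-score| exceeds threshold (hypothetical helper)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD == 0: more than half of the values equal the median")
    # 0.6745 rescales MAD so that M_i is comparable to a Z-score under normality.
    scores = [0.6745 * (v - med) / mad for v in values]
    return [i for i, m in enumerate(scores) if abs(m) > threshold]

data = [2.0, 2.1, 1.9, 2.0, 50.0]
print(mad_outliers(data))  # prints: [4]
```

The median and MAD are barely moved by the spike at position 4, which is why this rule flags it even in a five-point series.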
gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Detect outliers using the Generalized ESD (Extreme Studentized Deviate) test.

The GESD test (Rosner, 1983) sequentially removes the most extreme value and tests whether it is a statistical outlier, up to max_outliers iterations. It assumes the underlying data are approximately normal after removing the outliers.

Parameters:
  • ts (TimeSeries) – Input series.

  • alpha (float, optional) – Significance level. Default 0.05.

  • max_outliers (int, optional) – Upper bound on the number of outliers to test for. Default 10. Must be less than n // 2.

Returns:

An OutlierReport whose lower_bound and upper_bound are None (GESD uses per-iteration critical values, not fixed fences).

Return type:

OutlierReport

Raises:
  • ValueError – If alpha is not in (0, 1) or max_outliers is invalid.

  • ImportError – If scipy is not installed (needed for the t-distribution quantile function).

Notes

The critical value at each step i (1-indexed) is:

\[\lambda_i = \frac{(n - i) \, t_{p, n-i-1}} {\sqrt{(n-i-1+t_{p,n-i-1}^2)(n-i+1)}}\]

where \(p = 1 - \alpha / (2(n - i + 1))\) and \(t_{p, \nu}\) is the \(p\)-quantile of the t-distribution with \(\nu\) degrees of freedom.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> rng = np.random.default_rng(0)
>>> idx = pd.date_range("2020", periods=50, freq="D")
>>> vals = rng.standard_normal(50)
>>> vals[10] = 15.0   # plant a spike
>>> ts  = TimeSeries(vals, index=idx)
>>> r   = OutlierDetector().gesd(ts)
>>> 10 in r.indices
True
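
The sequential-removal half of the procedure can be sketched with the standard library. The sketch below only computes the test statistic R_i = max|x - mean| / s at each step and records which observation was removed; the comparison against the critical value \(\lambda_i\) is deliberately omitted because it needs a t-distribution quantile. The function name and the sample data are hypothetical.

```python
import statistics

def gesd_statistics(values, max_outliers):
    """Return [(original_position, R_i), ...] for each removal step (hypothetical helper)."""
    remaining = list(enumerate(values))       # keep original positions alongside values
    steps = []
    for _step in range(max_outliers):
        xs = [v for _pos, v in remaining]
        mean = statistics.fmean(xs)
        std = statistics.stdev(xs)            # sample standard deviation
        if std == 0:
            break                             # remaining data are constant
        # Locate the observation farthest from the current mean.
        j = max(range(len(remaining)), key=lambda n: abs(remaining[n][1] - mean))
        pos, val = remaining.pop(j)
        steps.append((pos, abs(val - mean) / std))
    return steps

data = [0.1, -0.2, 0.3, 0.0, 15.0, -0.1, 0.2, -0.3]
steps = gesd_statistics(data, max_outliers=3)
print(steps[0][0])  # prints: 4
```

In the full test, the number of outliers is the largest i for which R_i exceeds \(\lambda_i\), so all steps are computed before the decision is made.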
remove(ts, report)[source]

Replace detected outlier values with NaN.

Returns:

A new series with outliers replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> cleaned = det.remove(ts, det.iqr(ts))
>>> cleaned.has_nan
True
clip(ts, report)[source]

Clip outlier values to the fence bounds of report.

Parameters:
  • ts (TimeSeries) – The original series.

  • report (OutlierReport) – Must have non-None lower_bound and upper_bound (i.e., from iqr, zscore, or mad).

Returns:

A new series with values clamped to [lower_bound, upper_bound].

Return type:

TimeSeries

Raises:

ValueError – If report has no bounds (e.g., from gesd).

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> r   = det.iqr(ts)
>>> clipped = det.clip(ts, r)
>>> float(clipped.values.max()) < 100.0
True
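
Clamping to the report bounds amounts to the following sketch. The function name is hypothetical, and the bounds -0.5 and 3.5 are the fences obtained for this data with Q1 = 1, Q3 = 2 and k = 1.5, assuming linearly interpolated quartiles.

```python
import math

def clip_to_bounds(values, lower, upper):
    """Clamp every value to [lower, upper] (hypothetical helper)."""
    if lower is None or upper is None:
        raise ValueError("report has no bounds to clip to")
    # With this argument order NaN passes through unchanged,
    # because every comparison against NaN is False.
    return [min(max(v, lower), upper) for v in values]

clipped = clip_to_bounds([1.0, 2.0, 100.0, 2.0, 1.0], -0.5, 3.5)
print(clipped)  # prints: [1.0, 2.0, 3.5, 2.0, 1.0]
```

Unlike remove, clipping keeps the series free of new NaN values, at the cost of replacing the outlier with an artificial value at the fence.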
class tseda.quality.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)[source]

Bases: object

Immutable summary of flat-line segments in a TimeSeries.

Parameters:
n_flatline_runs

Number of runs that meet or exceed min_run in length.

Type:

int

longest_run

Length of the single longest flat-line run.

Type:

int

total_flatline_points

Total number of observations that belong to a qualifying flat-line run (includes the first observation of each run).

Type:

int

runs

Each element is a tuple (start_pos, end_pos, value) where start_pos and end_pos are 0-based integer positions and value is the repeated value. Only runs of length >= min_run are included.

Type:

list of (start_pos, end_pos, value)

mask

Boolean array; True at every position that is part of a qualifying flat-line run.

Type:

numpy.ndarray

min_run

The minimum run length used for this report.

Type:

int

n_flatline_runs: int
longest_run: int
total_flatline_points: int
runs: List[Tuple[int, int, float]]
mask: ndarray
min_run: int
__repr__()[source]

Return repr(self).

Return type:

str

__init__(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)
Return type:

None

class tseda.quality.DuplicateDetector[source]

Bases: object

Detect consecutive duplicate (flat-line) value runs.

flatline(ts, min_run=3, *, atol=0.0)[source]

Detect flat-line segments of repeated values.

Return type:

FlatlineReport

near_zero(ts, min_run=3, *, threshold=1e-8)[source]

Detect segments where the series is stuck near zero.

Return type:

FlatlineReport

remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN (keeping the first value of each run by default).

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4
flatline(ts, min_run=3, *, atol=0.0)[source]

Detect consecutive runs of identical (or near-identical) values.

Parameters:
  • ts (TimeSeries) – Input series.

  • min_run (int, optional) – Minimum number of consecutive identical observations to constitute a “flat line”. Default 3.

  • atol (float, optional) – Absolute tolerance for equality. Two values a and b are considered equal when |a - b| <= atol. Default 0.0 (exact equality).

Return type:

FlatlineReport

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

Exact flat line of length 4:

>>> idx  = pd.date_range("2020", periods=7, freq="D")
>>> vals = np.array([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.runs[0]
(1, 4, 3.0)

No flat line (min_run too high):

>>> r2 = DuplicateDetector().flatline(ts, min_run=5)
>>> r2.n_flatline_runs
0
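
Run detection with a tolerance can be sketched as a single pass over the values. The function name is hypothetical. One assumption is made loudly here: each value is compared with the first value of the current run, whereas the library might instead compare consecutive neighbours, which gives different results when atol > 0.

```python
def flatline_runs(values, min_run=3, atol=0.0):
    """Return [(start_pos, end_pos, value), ...] for runs of length >= min_run (hypothetical helper)."""
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # Close the run at the end of the data, or when the value departs from
        # the first value of the run (assumption). NaN never compares equal.
        if i == len(values) or not abs(values[i] - values[start]) <= atol:
            if i - start >= min_run:
                runs.append((start, i - 1, values[start]))
            start = i
    return runs

vals = [1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0]
print(flatline_runs(vals, min_run=3))  # prints: [(1, 4, 3.0)]
print(flatline_runs(vals, min_run=5))  # prints: []
```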
near_zero(ts, min_run=3, *, threshold=1e-08)[source]

Detect segments where the series is stuck near zero.

Only consecutive runs where every value satisfies |x| <= threshold are reported. This differs from flatline(), which detects any repeated value regardless of magnitude.

Parameters:
  • ts (TimeSeries) – Input series.

  • min_run (int, optional) – Minimum run length. Default 3.

  • threshold (float, optional) – Maximum absolute value to count as “near zero”. Default 1e-8.

Returns:

Runs where every value satisfies |x| <= threshold.

Return type:

FlatlineReport

Raises:

ValueError – If threshold < 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().near_zero(ts, min_run=3)
>>> r.n_flatline_runs
1
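
The near-zero rule differs from flat-line detection only in its membership test, as the following sketch shows. The function name is hypothetical, and runs are returned as (start_pos, end_pos) pairs for brevity rather than in the report's three-element form.

```python
def near_zero_runs(values, min_run=3, threshold=1e-8):
    """Return [(start_pos, end_pos), ...] where every |x| <= threshold (hypothetical helper)."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    runs = []
    start = None
    # A trailing sentinel that is never near zero closes a run ending at the last value.
    for i, v in enumerate(list(values) + [float("inf")]):
        if abs(v) <= threshold:
            if start is None:
                start = i                 # a new near-zero run begins here
        elif start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    return runs

vals = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
print(near_zero_runs(vals, min_run=3))  # prints: [(1, 4)]
```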
remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN.

Parameters:
  • ts (TimeSeries) – The original series.

  • report (FlatlineReport) – Result from flatline() or near_zero().

  • keep_first (bool, optional) – When True (default), the first observation of each flat-line run is preserved; only the repeated copies are set to NaN. When False, the entire run including the first observation is set to NaN.

Returns:

A new series with flat-line values replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=6, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> cleaned = det.remove_flatlines(ts, r, keep_first=True)
>>> cleaned.n_nan
2
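
The effect of keep_first can be shown on a plain list, given runs in the (start_pos, end_pos, value) form used by FlatlineReport.runs. The function name is hypothetical; this is a sketch, not the library code.

```python
import math

def blank_runs(values, runs, keep_first=True):
    """Return a copy of values with flat-line positions set to NaN (hypothetical helper)."""
    out = list(values)
    for start, end, _value in runs:
        first = start + 1 if keep_first else start   # optionally spare the first observation
        for i in range(first, end + 1):
            out[i] = math.nan
    return out

vals = [1.0, 5.0, 5.0, 5.0, 2.0, 3.0]
runs = [(1, 3, 5.0)]
kept = blank_runs(vals, runs, keep_first=True)
dropped = blank_runs(vals, runs, keep_first=False)
print(sum(math.isnan(v) for v in kept), sum(math.isnan(v) for v in dropped))  # prints: 2 3
```

Keeping the first observation preserves the moment at which the series reached the repeated value, which is often the last trustworthy reading before a sensor stuck.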

Missing Values

class tseda.quality.missing.MissingValueReport(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)[source]

Bases: object

Immutable summary of missing values in a TimeSeries.

Parameters:
n_nan

Number of NaN values in the observed array.

Type:

int

pct_nan

Percentage of NaN observations (0–100).

Type:

float

n_gaps

Number of missing timestamps (index gaps) when the series frequency is known. -1 when frequency is unknown.

Type:

int

gap_locations

Start timestamp of each index gap. Empty when n_gaps <= 0.

Type:

list of pandas.Timestamp

longest_nan_run

Length of the longest consecutive run of NaN values.

Type:

int

nan_run_lengths

Lengths of every consecutive NaN run (ascending order).

Type:

list of int

nan_positions

Integer positions (0-based) of all NaN values.

Type:

numpy.ndarray

is_monotone_missing

True when all NaN values cluster at the start or end of the series (monotone missing pattern — easier to handle).

Type:

bool

n_nan: int
pct_nan: float
n_gaps: int
gap_locations: List[Timestamp]
longest_nan_run: int
nan_run_lengths: List[int]
nan_positions: ndarray
is_monotone_missing: bool
__repr__()[source]

Return repr(self).

Return type:

str

__init__(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)
Return type:

None

class tseda.quality.missing.MissingValueAnalyzer[source]

Bases: object

Analyze and repair missing values in a TimeSeries.

This class is stateless — instantiate once and call its methods on different series objects.

analyze(ts)[source]

Return a MissingValueReport for ts.

Parameters:

ts (TimeSeries)

Return type:

MissingValueReport

interpolate(ts, method='linear')[source]

Fill NaN values and return a new TimeSeries.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020-01-01", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()
>>> r = ana.analyze(ts)
>>> r.n_nan
2
>>> filled = ana.interpolate(ts)
>>> filled.has_nan
False
analyze(ts)[source]

Compute a complete missing-value summary for ts.

Parameters:

ts (TimeSeries) – The series to analyze.

Return type:

MissingValueReport

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=4, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0])
>>> report = MissingValueAnalyzer().analyze(TimeSeries(vals, index=idx))
>>> report.n_nan
2
>>> report.longest_nan_run
2
interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]

Fill NaN values and return a new TimeSeries.

Parameters:
  • ts (TimeSeries) – Series to fill.

  • method (str, optional) –

    Interpolation strategy. One of:

    • "linear" — linear interpolation between neighbours (default). Leading and trailing NaN are filled with the nearest observed boundary value when limit is None.

    • "forward" — forward-fill (carry last observed value).

    • "backward" — backward-fill (carry next observed value).

    • "nearest" — fill with the nearest non-NaN value.

    • "zero" — fill with 0.0.

    • "constant" — fill with fill_value (must be provided).

    • "spline" — cubic spline (requires scipy).

  • limit (int, optional) – Maximum number of consecutive NaN values to fill. None fills all gaps.

  • fill_value (float, optional) – Used only with method="constant".

Returns:

A new series with NaN values replaced. Metadata (name, unit, freq, description) is preserved.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx  = pd.date_range("2020", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> ana  = MissingValueAnalyzer()

Linear interpolation:

>>> filled = ana.interpolate(ts, "linear")
>>> filled.values.tolist()
[1.0, 2.0, 3.0, 4.0, 5.0]

Forward fill:

>>> fwd = ana.interpolate(ts, "forward")
>>> fwd.values.tolist()
[1.0, 1.0, 1.0, 4.0, 5.0]

Outliers

class tseda.quality.outliers.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)[source]

Bases: object

Immutable outlier detection result.

Parameters:
mask

Boolean array of shape (n,); True where an outlier was found.

Type:

numpy.ndarray

indices

Integer positions (0-based) of detected outliers.

Type:

numpy.ndarray

timestamps

Timestamps of detected outliers.

Type:

pandas.DatetimeIndex

values

Observed values at outlier positions.

Type:

numpy.ndarray

method

Name of the detection method used.

Type:

str

n_outliers

Number of outliers detected.

Type:

int

lower_bound

Lower fence / threshold (when applicable).

Type:

float or None

upper_bound

Upper fence / threshold (when applicable).

Type:

float or None

mask: ndarray
indices: ndarray
timestamps: DatetimeIndex
values: ndarray
method: str
n_outliers: int
lower_bound: float | None
upper_bound: float | None
__repr__()[source]

Return repr(self).

Return type:

str

__init__(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)
Return type:

None

class tseda.quality.outliers.OutlierDetector[source]

Bases: object

Detect, remove, or clip outliers in a TimeSeries.

This class is stateless — create one instance and reuse across many series.

iqr(ts, k=1.5)[source]

Tukey IQR fence method.

Return type:

OutlierReport

zscore(ts, threshold=3.0)[source]

Standard Z-score method.

Return type:

OutlierReport

mad(ts, threshold=3.5)[source]

Median Absolute Deviation method.

Return type:

OutlierReport

gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Generalized Extreme Studentized Deviate test.

Return type:

OutlierReport

remove(ts, report)[source]

Replace outlier values with NaN.

Return type:

TimeSeries

clip(ts, report)[source]

Clip outlier values to the fence bounds.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0, 2.0], index=idx)
>>> det = OutlierDetector()

IQR detection:

>>> r = det.iqr(ts)
>>> r.n_outliers
1
>>> int(r.indices[0])
4

Remove the outlier (replace with NaN):

>>> cleaned = det.remove(ts, r)
>>> cleaned.has_nan
True
iqr(ts, k=1.5)[source]

Detect outliers using Tukey’s IQR fences.

Points below Q1 - k * IQR or above Q3 + k * IQR are flagged. NaN values are excluded from the quartile computation and are never flagged as outliers.

Parameters:
  • ts (TimeSeries) – Input series.

  • k (float, optional) –

    Fence multiplier. Common choices:

    • 1.5 — standard outlier fence (default).

    • 3.0 — extreme outlier fence.

Return type:

OutlierReport

Raises:

ValueError – If k is not positive, or if fewer than 4 non-NaN values exist.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 2.0, 2.0, 100.0], index=idx)
>>> OutlierDetector().iqr(ts).n_outliers
1
zscore(ts, threshold=3.0)[source]

Detect outliers using the standard Z-score.

A value is flagged when |z| > threshold, where z = (x - mean) / std.

Parameters:
  • ts (TimeSeries) – Input series.

  • threshold (float, optional) – Z-score cut-off. Default 3.0 (≈ 0.3 % false-positive rate under normality).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or if std == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=20, freq="D")
>>> ts  = TimeSeries([0.0] * 19 + [10.0], index=idx)
>>> OutlierDetector().zscore(ts).n_outliers
1
mad(ts, threshold=3.5)[source]

Detect outliers using the Median Absolute Deviation (MAD).

A value is flagged when the absolute value of its modified Z-score, \(|M_i|\), exceeds threshold:

\[M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\text{MAD}}\]

where \(\tilde{x}\) is the median and \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\).

This is more robust than the Z-score for skewed distributions or heavy-tailed noise.

Parameters:
  • ts (TimeSeries) – Input series.

  • threshold (float, optional) – Modified Z-score cut-off. Iglewicz & Hoaglin (1993) recommend 3.5 (default).

Return type:

OutlierReport

Raises:

ValueError – If threshold is not positive or MAD == 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0], index=idx)
>>> OutlierDetector().mad(ts).n_outliers
1
gesd(ts, *, alpha=0.05, max_outliers=10)[source]

Detect outliers using the Generalized ESD (Extreme Studentized Deviate) test.

The GESD test (Rosner, 1983) sequentially removes the most extreme value and tests whether it is a statistical outlier, up to max_outliers iterations. It assumes the underlying data are approximately normal after removing the outliers.

Parameters:
  • ts (TimeSeries) – Input series.

  • alpha (float, optional) – Significance level. Default 0.05.

  • max_outliers (int, optional) – Upper bound on the number of outliers to test for. Default 10. Must be less than n // 2.

Returns:

An OutlierReport whose lower_bound and upper_bound are None (GESD uses per-iteration critical values, not fixed fences).

Return type:

OutlierReport

Raises:
  • ValueError – If alpha is not in (0, 1) or max_outliers is invalid.

  • ImportError – If scipy is not installed (needed for the t-distribution quantile function).

Notes

The critical value at each step i (1-indexed) is:

\[\lambda_i = \frac{(n - i) \, t_{p, n-i-1}} {\sqrt{(n-i-1+t_{p,n-i-1}^2)(n-i+1)}}\]

where \(p = 1 - \alpha / (2(n - i + 1))\) and \(t_{p, \nu}\) is the \(p\)-quantile of the t-distribution with \(\nu\) degrees of freedom.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> rng = np.random.default_rng(0)
>>> idx = pd.date_range("2020", periods=50, freq="D")
>>> vals = rng.standard_normal(50)
>>> vals[10] = 15.0   # plant a spike
>>> ts  = TimeSeries(vals, index=idx)
>>> r   = OutlierDetector().gesd(ts)
>>> 10 in r.indices
True
remove(ts, report)[source]

Replace detected outlier values with NaN.

Returns:

A new series with outliers replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> cleaned = det.remove(ts, det.iqr(ts))
>>> cleaned.has_nan
True
clip(ts, report)[source]

Clip outlier values to the fence bounds of report.

Parameters:
  • ts (TimeSeries) – The original series.

  • report (OutlierReport) – Must have non-None lower_bound and upper_bound (i.e., from iqr, zscore, or mad).

Returns:

A new series with values clamped to [lower_bound, upper_bound].

Return type:

TimeSeries

Raises:

ValueError – If report has no bounds (e.g., from gesd).

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts  = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> r   = det.iqr(ts)
>>> clipped = det.clip(ts, r)
>>> float(clipped.values.max()) < 100.0
True

Flat-line / Duplicates

class tseda.quality.duplicates.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)[source]

Bases: object

Immutable summary of flat-line segments in a TimeSeries.

Parameters:
n_flatline_runs

Number of runs that meet or exceed min_run in length.

Type:

int

longest_run

Length of the single longest flat-line run.

Type:

int

total_flatline_points

Total number of observations that belong to a qualifying flat-line run (includes the first observation of each run).

Type:

int

runs

Each element is a tuple (start_pos, end_pos, value) where start_pos and end_pos are 0-based integer positions and value is the repeated value. Only runs of length >= min_run are included.

Type:

list of (start_pos, end_pos, value)

mask

Boolean array; True at every position that is part of a qualifying flat-line run.

Type:

numpy.ndarray

min_run

The minimum run length used for this report.

Type:

int

n_flatline_runs: int
longest_run: int
total_flatline_points: int
runs: List[Tuple[int, int, float]]
mask: ndarray
min_run: int
__repr__()[source]

Return repr(self).

Return type:

str

__init__(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)
Return type:

None

class tseda.quality.duplicates.DuplicateDetector[source]

Bases: object

Detect consecutive duplicate (flat-line) value runs.

flatline(ts, min_run=3, *, atol=0.0)[source]

Detect flat-line segments of repeated values.

Return type:

FlatlineReport

near_zero(ts, min_run=3, *, threshold=1e-8)[source]

Detect segments where the series is stuck near zero.

Return type:

FlatlineReport

remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN (keeping the first value of each run by default).

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4
flatline(ts, min_run=3, *, atol=0.0)[source]

Detect consecutive runs of identical (or near-identical) values.

Parameters:
  • ts (TimeSeries) – Input series.

  • min_run (int, optional) – Minimum number of consecutive identical observations to constitute a “flat line”. Default 3.

  • atol (float, optional) – Absolute tolerance for equality. Two values a and b are considered equal when |a - b| <= atol. Default 0.0 (exact equality).

Return type:

FlatlineReport

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector

Exact flat line of length 4:

>>> idx  = pd.date_range("2020", periods=7, freq="D")
>>> vals = np.array([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.runs[0]
(1, 4, 3.0)

No flat line (min_run too high):

>>> r2 = DuplicateDetector().flatline(ts, min_run=5)
>>> r2.n_flatline_runs
0
near_zero(ts, min_run=3, *, threshold=1e-08)[source]

Detect segments where the series is stuck near zero.

Only consecutive runs where every value satisfies |x| <= threshold are reported. This differs from flatline(), which detects any repeated value regardless of magnitude.

Parameters:
  • ts (TimeSeries) – Input series.

  • min_run (int, optional) – Minimum run length. Default 3.

  • threshold (float, optional) – Maximum absolute value to count as “near zero”. Default 1e-8.

Returns:

Runs where every value satisfies |x| <= threshold.

Return type:

FlatlineReport

Raises:

ValueError – If threshold < 0.

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> r    = DuplicateDetector().near_zero(ts, min_run=3)
>>> r.n_flatline_runs
1
remove_flatlines(ts, report, *, keep_first=True)[source]

Replace flat-line positions with NaN.

Parameters:
  • ts (TimeSeries) – The original series.

  • report (FlatlineReport) – Result from flatline() or near_zero().

  • keep_first (bool, optional) – When True (default), the first observation of each flat-line run is preserved; only the repeated copies are set to NaN. When False, the entire run including the first observation is set to NaN.

Returns:

A new series with flat-line values replaced by NaN.

Return type:

TimeSeries

Examples

>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx  = pd.date_range("2020", periods=6, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 2.0, 3.0])
>>> ts   = TimeSeries(vals, index=idx)
>>> det  = DuplicateDetector()
>>> r    = det.flatline(ts, min_run=3)
>>> cleaned = det.remove_flatlines(ts, r, keep_first=True)
>>> cleaned.n_nan
2