Data Quality
tseda.quality
Data quality diagnostics for time series.
Public API
- MissingValueReport
Immutable result of MissingValueAnalyzer.
- MissingValueAnalyzer
Detect NaN values, index gaps, and interpolate.
- OutlierReport
Immutable result of OutlierDetector.
- OutlierDetector
IQR, Z-score, MAD, and GESD outlier detection.
- FlatlineReport
Immutable result of DuplicateDetector.
- DuplicateDetector
Consecutive flat-line and near-zero segment detection.
- class tseda.quality.MissingValueReport(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)[source]
Bases: object
Immutable summary of missing values in a TimeSeries.
- n_gaps
Number of missing timestamps (index gaps) when the series frequency is known; -1 when the frequency is unknown.
- gap_locations
Start timestamp of each index gap. Empty when n_gaps <= 0.
- nan_positions
Integer positions (0-based) of all NaN values.
- is_monotone_missing
True when all NaN values cluster at the start or end of the series (a monotone missing pattern, which is easier to handle).
- __init__(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)
- class tseda.quality.MissingValueAnalyzer[source]
Bases: object
Analyze and repair missing values in a TimeSeries.
This class is stateless: instantiate once and call its methods on different series objects.
- analyze(ts)[source]
Return a MissingValueReport for ts.
- Parameters:
ts (TimeSeries)
- Return type:
MissingValueReport
- interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]
Fill NaN values and return a new TimeSeries.
- Parameters:
ts (TimeSeries)
method (str)
limit (int | None)
fill_value (float | None)
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020-01-01", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> ts = TimeSeries(vals, index=idx)
>>> ana = MissingValueAnalyzer()
>>> r = ana.analyze(ts)
>>> r.n_nan
2
>>> filled = ana.interpolate(ts)
>>> filled.has_nan
False
- analyze(ts)[source]
Compute a complete missing-value summary for ts.
- Parameters:
ts (TimeSeries) – The series to analyze.
- Return type:
MissingValueReport
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020", periods=4, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0])
>>> report = MissingValueAnalyzer().analyze(TimeSeries(vals, index=idx))
>>> report.n_nan
2
>>> report.longest_nan_run
2
- interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]
Fill NaN values and return a new TimeSeries.
- Parameters:
ts (TimeSeries) – Series to fill.
method (str, optional) – Interpolation strategy. One of:
"linear": linear interpolation between neighbours (default). Leading and trailing NaN are filled with the nearest observed boundary value when limit is None.
"forward": forward-fill (carry the last observed value).
"backward": backward-fill (carry the next observed value).
"nearest": fill with the nearest non-NaN value.
"zero": fill with 0.0.
"constant": fill with fill_value (must be provided).
"spline": cubic spline (requires scipy).
limit (int, optional) – Maximum number of consecutive NaN values to fill. None fills all gaps.
fill_value (float, optional) – Used only with method="constant".
- Returns:
A new series with NaN values replaced. Metadata (name, unit, freq, description) is preserved.
- Return type:
TimeSeries
- Raises:
TypeError – If ts is not a TimeSeries.
ValueError – If method is not recognised, or if "constant" is chosen without supplying fill_value.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
>>> ts = TimeSeries(vals, index=idx)
>>> ana = MissingValueAnalyzer()
Linear interpolation:
>>> filled = ana.interpolate(ts, "linear")
>>> filled.values.tolist()
[1.0, 2.0, 3.0, 4.0, 5.0]
Forward fill:
>>> fwd = ana.interpolate(ts, "forward")
>>> fwd.values.tolist()
[1.0, 1.0, 1.0, 4.0, 5.0]
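The "linear" strategy, including its boundary rule for leading and trailing NaN, can be sketched in plain Python. This is only an illustration under stated assumptions: `fill_linear` is a hypothetical helper, not part of tseda; it interpolates over integer positions (so it assumes evenly spaced observations) and ignores the limit argument.

```python
import math

def fill_linear(values):
    """Fill NaN by linear interpolation over positions (hypothetical helper, not tseda code)."""
    out = list(values)
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not known:
        return out  # no observed value to interpolate from
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # leading NaN: copy the first observed value
            out[i] = values[right]
        elif right is None:     # trailing NaN: copy the last observed value
            out[i] = values[left]
        else:                   # interior NaN: weight the two neighbours by distance
            weight = (i - left) / (right - left)
            out[i] = values[left] + weight * (values[right] - values[left])
    return out

nan = float("nan")
filled = fill_linear([nan, 1.0, nan, nan, 4.0, nan])
print(filled)
```

The neighbour search here is quadratic in the worst case, which keeps the sketch short; a vectorised implementation would normally be preferred for long series.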
- class tseda.quality.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)[source]
Bases: object
Immutable outlier detection result.
- mask
Boolean array of shape (n,); True where an outlier was found.
- indices
Integer positions (0-based) of detected outliers.
- timestamps
Timestamps of detected outliers.
- Type: DatetimeIndex
- values
Observed values at outlier positions.
- class tseda.quality.OutlierDetector[source]
Bases: object
Detect, remove, or clip outliers in a TimeSeries.
This class is stateless: create one instance and reuse it across many series.
- iqr(ts, k=1.5)[source]
Tukey IQR fence method.
- Parameters:
ts (TimeSeries)
k (float)
- Return type:
OutlierReport
- zscore(ts, threshold=3.0)[source]
Standard Z-score method.
- Parameters:
ts (TimeSeries)
threshold (float)
- Return type:
OutlierReport
- mad(ts, threshold=3.5)[source]
Median Absolute Deviation method.
- Parameters:
ts (TimeSeries)
threshold (float)
- Return type:
OutlierReport
- gesd(ts, *, alpha=0.05, max_outliers=10)[source]
Generalized Extreme Studentized Deviate test.
- Parameters:
ts (TimeSeries)
alpha (float)
max_outliers (int)
- Return type:
OutlierReport
- remove(ts, report)[source]
Replace outlier values with NaN.
- Parameters:
ts (TimeSeries)
report (OutlierReport)
- Return type:
TimeSeries
- clip(ts, report)[source]
Clip outlier values to the fence bounds.
- Parameters:
ts (TimeSeries)
report (OutlierReport)
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> ts = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0, 2.0], index=idx)
>>> det = OutlierDetector()
IQR detection:
>>> r = det.iqr(ts)
>>> r.n_outliers
1
>>> int(r.indices[0])
4
Remove the outlier (replace with NaN):
>>> cleaned = det.remove(ts, r)
>>> cleaned.has_nan
True
- iqr(ts, k=1.5)[source]
Detect outliers using Tukey’s IQR fences.
Points below Q1 - k * IQR or above Q3 + k * IQR are flagged. NaN values are excluded from the quartile computation but not flagged as outliers.
- Parameters:
ts (TimeSeries) – Input series.
k (float, optional) – Fence multiplier. Common choices:
1.5: standard outlier fence (default).
3.0: extreme outlier fence.
- Return type:
OutlierReport
- Raises:
ValueError – If k is not positive, or if fewer than 4 non-NaN values exist.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([1.0, 2.0, 2.0, 2.0, 100.0], index=idx)
>>> OutlierDetector().iqr(ts).n_outliers
1
- zscore(ts, threshold=3.0)[source]
Detect outliers using the standard Z-score.
A value is flagged when |z| > threshold, where z = (x - mean) / std.
- Parameters:
ts (TimeSeries) – Input series.
threshold (float, optional) – Z-score cut-off. Default 3.0 (≈ 0.3 % false-positive rate under normality).
- Return type:
OutlierReport
- Raises:
ValueError – If threshold is not positive or if std == 0.
Examples
With n observations |z| cannot exceed sqrt(n - 1), which is 2 for the five-point series below, so the example passes a lower threshold than the default:
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([0.0, 0.1, 0.0, -0.1, 10.0], index=idx)
>>> OutlierDetector().zscore(ts, threshold=1.5).n_outliers
1
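The flagging rule |z| > threshold can be sketched with the standard library alone. `zscore_outlier_positions` is a hypothetical helper, not tseda API, and the use of the population standard deviation is an assumption; the text above does not say which normalisation the library applies.

```python
import statistics

def zscore_outlier_positions(values, threshold=3.0):
    """Return positions where |z| > threshold (hypothetical helper, not tseda code)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("std == 0: Z-scores are undefined for a constant series")
    return [i for i, v in enumerate(values) if abs((v - mean) / std) > threshold]

# Twenty quiet observations followed by one spike.
values = [0.0, 0.1, 0.0, -0.1] * 5 + [10.0]
flagged = zscore_outlier_positions(values)
print(flagged)
```

Because the spike itself inflates the mean and the standard deviation, the Z-score loses sensitivity as the series gets shorter; that masking effect is one reason the median-based mad() method is often preferred for small or contaminated samples.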
- mad(ts, threshold=3.5)[source]
Detect outliers using the Median Absolute Deviation (MAD).
A value is flagged when the absolute modified Z-score |M_i| exceeds threshold:
\[M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\text{MAD}}\]
where \(\tilde{x}\) is the median and \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\).
This is more robust than the Z-score for skewed distributions or heavy-tailed noise.
- Parameters:
ts (TimeSeries) – Input series.
threshold (float, optional) – Modified Z-score cut-off. Iglewicz & Hoaglin (1993) recommend 3.5 (default).
- Return type:
OutlierReport
- Raises:
ValueError – If threshold is not positive or MAD == 0.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0], index=idx)
>>> OutlierDetector().mad(ts).n_outliers
1
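The modified Z-score formula above can be traced by hand with a short standard-library sketch. `mad_outlier_positions` is a hypothetical name used for illustration only, and the sketch assumes the input contains no NaN values.

```python
import statistics

def mad_outlier_positions(values, threshold=3.5):
    """Return positions where |M_i| > threshold (hypothetical helper, not tseda code)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD == 0: modified Z-scores are undefined")
    # M_i = 0.6745 * (x_i - median) / MAD, compared in absolute value.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

series = [2.0, 2.1, 1.9, 2.0, 50.0]
flagged = mad_outlier_positions(series)
print(flagged)
```

In this series the median is 2.0 and the MAD is about 0.1, so the spike at position 4 scores in the hundreds while the neighbouring values stay below 1. The spike barely moves the median or the MAD, which is what makes the method robust.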
- gesd(ts, *, alpha=0.05, max_outliers=10)[source]
Detect outliers using the Generalized ESD (Extreme Studentized Deviate) test.
The GESD test (Rosner, 1983) sequentially removes the most extreme value and tests whether it is a statistical outlier, up to max_outliers iterations. It assumes the underlying data are approximately normal after removing the outliers.
- Parameters:
ts (TimeSeries) – Input series.
alpha (float, optional) – Significance level. Default 0.05.
max_outliers (int, optional) – Upper bound on the number of outliers to test for. Default 10. Must be less than n // 2.
- Returns:
lower_bound and upper_bound are None (GESD uses per-iteration critical values, not fixed fences).
- Return type:
OutlierReport
- Raises:
ValueError – If alpha is not in (0, 1) or max_outliers is invalid.
ImportError – If scipy is not installed (needed for the t-distribution quantile function).
Notes
The critical value at each step i (1-indexed) is:
\[\lambda_i = \frac{(n - i) \, t_{p, n-i-1}} {\sqrt{(n-i-1+t_{p,n-i-1}^2)(n-i+1)}}\]
where \(p = 1 - \alpha / (2(n - i + 1))\) and \(t_{p, \nu}\) is the \(p\)-quantile of the t-distribution with \(\nu\) degrees of freedom.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> rng = np.random.default_rng(0)
>>> idx = pd.date_range("2020", periods=50, freq="D")
>>> vals = rng.standard_normal(50)
>>> vals[10] = 15.0  # plant a spike
>>> ts = TimeSeries(vals, index=idx)
>>> r = OutlierDetector().gesd(ts)
>>> 10 in r.indices
True
- remove(ts, report)[source]
Replace detected outlier values with NaN.
- Parameters:
ts (TimeSeries) – The original series.
report (OutlierReport) – Result from one of the detection methods.
- Returns:
A new series with outliers replaced by NaN.
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> cleaned = det.remove(ts, det.iqr(ts))
>>> cleaned.has_nan
True
- clip(ts, report)[source]
Clip outlier values to the fence bounds of report.
- Parameters:
ts (TimeSeries) – The original series.
report (OutlierReport) – Must have non-None lower_bound and upper_bound (i.e., from iqr, zscore, or mad).
- Returns:
A new series with values clamped to [lower_bound, upper_bound].
- Return type:
TimeSeries
- Raises:
ValueError – If report has no bounds (e.g., from gesd).
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> r = det.iqr(ts)
>>> clipped = det.clip(ts, r)
>>> float(clipped.values.max()) < 100.0
True
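The difference between remove() and clip() comes down to what replaces a flagged value: NaN, or the nearest bound. The sketch below illustrates both on a plain list. The helper names are hypothetical, and the bounds -0.5 and 3.5 are the Tukey fences for this series under linearly interpolated quartiles, which is an assumption about the quartile convention.

```python
import math

def remove_outliers(values, positions):
    """Replace the values at the given positions with NaN (hypothetical helper)."""
    flagged = set(positions)
    return [float("nan") if i in flagged else v for i, v in enumerate(values)]

def clip_to_bounds(values, lower, upper):
    """Clamp every value into [lower, upper] (hypothetical helper)."""
    return [min(max(v, lower), upper) for v in values]

series = [1.0, 2.0, 100.0, 2.0, 1.0]
removed = remove_outliers(series, [2])        # position 2 holds the spike
clipped = clip_to_bounds(series, -0.5, 3.5)   # assumed fences for k = 1.5
print(math.isnan(removed[2]), clipped)
```

Removing keeps the series honest about what was observed and leaves the gap for a later interpolation step; clipping keeps the series gap-free but retains a value that was never measured.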
- class tseda.quality.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)[source]
Bases: object
Immutable summary of flat-line segments in a TimeSeries.
- total_flatline_points
Total number of observations that belong to a qualifying flat-line run (includes the first observation of each run).
- runs
Each element is a tuple (start_pos, end_pos, value) where start_pos and end_pos are 0-based integer positions and value is the repeated value. Only runs of length >= min_run are included.
- Type: list of (start_pos, end_pos, value)
- mask
Boolean array; True at every position that is part of a qualifying flat-line run.
- class tseda.quality.DuplicateDetector[source]
Bases: object
Detect consecutive duplicate (flat-line) value runs.
- flatline(ts, min_run=3, *, atol=0.0)[source]
Detect flat-line segments of repeated values.
- Parameters:
ts (TimeSeries)
min_run (int)
atol (float)
- Return type:
FlatlineReport
- near_zero(ts, min_run=3, *, threshold=1e-8)[source]
Detect segments where the series is stuck near zero.
- Parameters:
ts (TimeSeries)
min_run (int)
threshold (float)
- Return type:
FlatlineReport
- remove_flatlines(ts, report, *, keep_first=True)[source]
Replace flat-line positions with NaN (keeping the first value of each run by default).
- Parameters:
ts (TimeSeries)
report (FlatlineReport)
keep_first (bool)
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0])
>>> ts = TimeSeries(vals, index=idx)
>>> det = DuplicateDetector()
>>> r = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4
- flatline(ts, min_run=3, *, atol=0.0)[source]
Detect consecutive runs of identical (or near-identical) values.
- Parameters:
ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum number of consecutive identical observations to constitute a “flat line”. Default 3.
atol (float, optional) – Absolute tolerance for equality. Two values a and b are considered equal when |a - b| <= atol. Default 0.0 (exact equality).
- Return type:
FlatlineReport
- Raises:
TypeError – If ts is not a TimeSeries.
ValueError – If min_run < 2.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
Exact flat line of length 4:
>>> idx = pd.date_range("2020", periods=7, freq="D")
>>> vals = np.array([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0])
>>> ts = TimeSeries(vals, index=idx)
>>> r = DuplicateDetector().flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.runs[0]
(1, 4, 3.0)
No flat line (min_run too high):
>>> r2 = DuplicateDetector().flatline(ts, min_run=5)
>>> r2.n_flatline_runs
0
- near_zero(ts, min_run=3, *, threshold=1e-08)[source]
Detect segments where the series is stuck near zero.
Only consecutive runs where every value satisfies |x| <= threshold are reported. This differs from flatline(), which detects any repeated value regardless of magnitude.
- Parameters:
ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum run length. Default 3.
threshold (float, optional) – Maximum absolute value to count as “near zero”. Default 1e-8.
- Returns:
Runs where every value satisfies |x| <= threshold.
- Return type:
FlatlineReport
- Raises:
ValueError – If threshold < 0.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
>>> ts = TimeSeries(vals, index=idx)
>>> r = DuplicateDetector().near_zero(ts, min_run=3)
>>> r.n_flatline_runs
1
- remove_flatlines(ts, report, *, keep_first=True)[source]
Replace flat-line positions with NaN.
- Parameters:
ts (TimeSeries) – The original series.
report (FlatlineReport) – Result from flatline() or near_zero().
keep_first (bool, optional) – When True (default), the first observation of each flat-line run is preserved; only the repeated copies are set to NaN. When False, the entire run including the first observation is set to NaN.
- Returns:
A new series with flat-line values replaced by NaN.
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 2.0, 3.0])
>>> ts = TimeSeries(vals, index=idx)
>>> det = DuplicateDetector()
>>> r = det.flatline(ts, min_run=3)
>>> cleaned = det.remove_flatlines(ts, r, keep_first=True)
>>> cleaned.n_nan
2
Missing-value analysis for time series.
Two distinct concepts are handled here:
Value NaN: a timestamp is present in the index but its observed value is numpy.nan.
Index gap: a timestamp that should exist (given the series frequency) is absent from the index entirely.
Both are reported by MissingValueAnalyzer. Interpolation of NaN
values is also provided via MissingValueAnalyzer.interpolate().
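An index gap, unlike a value NaN, can only be found by comparing the index against the timestamps that the frequency implies. A minimal standard-library sketch of that comparison follows; `missing_timestamps` is a hypothetical helper, and it assumes a sorted index with a fixed step.

```python
from datetime import datetime, timedelta

def missing_timestamps(index, step):
    """List timestamps that a fixed-step index should contain but does not."""
    missing = []
    for prev, curr in zip(index, index[1:]):
        expected = prev + step
        while expected < curr:  # every expected stamp before the next observed one
            missing.append(expected)
            expected += step
    return missing

index = [datetime(2020, 1, 1), datetime(2020, 1, 2),
         datetime(2020, 1, 5), datetime(2020, 1, 6)]
gaps = missing_timestamps(index, timedelta(days=1))
print(len(gaps), gaps[0].date())
```

When the step is unknown no expected timestamps can be generated, which is consistent with n_gaps being reported as -1 for series without a known frequency.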
Classes
- MissingValueReport
Immutable result dataclass returned by MissingValueAnalyzer.analyze().
- MissingValueAnalyzer
Stateless analyzer; all methods accept a TimeSeries and return plain Python / numpy objects or a new TimeSeries.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020-01-01", periods=10, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, np.nan, 6.0, 7.0, 8.0, np.nan, 10.0])
>>> ts = TimeSeries(vals, index=idx)
>>> ana = MissingValueAnalyzer()
>>> report = ana.analyze(ts)
>>> report.n_nan
4
>>> report.pct_nan
40.0
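The NaN counts in a report can be checked by hand. The sketch below recomputes several of the report fields for the ten-value series in the example using only the standard library; `nan_summary` is a hypothetical helper whose keys mirror the MissingValueReport attribute names, not the library's implementation.

```python
import math
from itertools import groupby

def nan_summary(values):
    """Summarise NaN values and NaN runs in a sequence of floats (hypothetical helper)."""
    flags = [math.isnan(v) for v in values]
    # Length of each maximal run of consecutive NaN values, in order of appearance.
    run_lengths = [sum(1 for _ in grp) for is_nan, grp in groupby(flags) if is_nan]
    n_nan = sum(flags)
    return {
        "n_nan": n_nan,
        "pct_nan": 100.0 * n_nan / len(values) if values else 0.0,
        "nan_positions": [i for i, flag in enumerate(flags) if flag],
        "nan_run_lengths": run_lengths,
        "longest_nan_run": max(run_lengths, default=0),
    }

nan = float("nan")
summary = nan_summary([1.0, nan, 3.0, nan, nan, 6.0, 7.0, 8.0, nan, 10.0])
print(summary["n_nan"], summary["pct_nan"], summary["nan_run_lengths"])
```

The series holds NaN at positions 1, 3, 4 and 8, that is, four missing values in three separate runs.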
- class tseda.quality.missing.MissingValueReport(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)[source]
Bases: object
Immutable summary of missing values in a TimeSeries.
- n_gaps
Number of missing timestamps (index gaps) when the series frequency is known; -1 when the frequency is unknown.
- gap_locations
Start timestamp of each index gap. Empty when n_gaps <= 0.
- nan_positions
Integer positions (0-based) of all NaN values.
- is_monotone_missing
True when all NaN values cluster at the start or end of the series (a monotone missing pattern, which is easier to handle).
- __init__(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)
- class tseda.quality.missing.MissingValueAnalyzer[source]
Bases: object
Analyze and repair missing values in a TimeSeries.
This class is stateless: instantiate once and call its methods on different series objects.
- analyze(ts)[source]
Return a MissingValueReport for ts.
- Parameters:
ts (TimeSeries)
- Return type:
MissingValueReport
- interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]
Fill NaN values and return a new TimeSeries.
- Parameters:
ts (TimeSeries)
method (str)
limit (int | None)
fill_value (float | None)
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020-01-01", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> ts = TimeSeries(vals, index=idx)
>>> ana = MissingValueAnalyzer()
>>> r = ana.analyze(ts)
>>> r.n_nan
2
>>> filled = ana.interpolate(ts)
>>> filled.has_nan
False
- analyze(ts)[source]
Compute a complete missing-value summary for ts.
- Parameters:
ts (TimeSeries) – The series to analyze.
- Return type:
MissingValueReport
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020", periods=4, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0])
>>> report = MissingValueAnalyzer().analyze(TimeSeries(vals, index=idx))
>>> report.n_nan
2
>>> report.longest_nan_run
2
- interpolate(ts, method='linear', *, limit=None, fill_value=None)[source]
Fill NaN values and return a new TimeSeries.
- Parameters:
ts (TimeSeries) – Series to fill.
method (str, optional) – Interpolation strategy. One of:
"linear": linear interpolation between neighbours (default). Leading and trailing NaN are filled with the nearest observed boundary value when limit is None.
"forward": forward-fill (carry the last observed value).
"backward": backward-fill (carry the next observed value).
"nearest": fill with the nearest non-NaN value.
"zero": fill with 0.0.
"constant": fill with fill_value (must be provided).
"spline": cubic spline (requires scipy).
limit (int, optional) – Maximum number of consecutive NaN values to fill. None fills all gaps.
fill_value (float, optional) – Used only with method="constant".
- Returns:
A new series with NaN values replaced. Metadata (name, unit, freq, description) is preserved.
- Return type:
TimeSeries
- Raises:
TypeError – If ts is not a TimeSeries.
ValueError – If method is not recognised, or if "constant" is chosen without supplying fill_value.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
>>> ts = TimeSeries(vals, index=idx)
>>> ana = MissingValueAnalyzer()
Linear interpolation:
>>> filled = ana.interpolate(ts, "linear")
>>> filled.values.tolist()
[1.0, 2.0, 3.0, 4.0, 5.0]
Forward fill:
>>> fwd = ana.interpolate(ts, "forward")
>>> fwd.values.tolist()
[1.0, 1.0, 1.0, 4.0, 5.0]
Outlier detection for time series.
Four detection methods are provided, all implemented with pure numpy / scipy — no machine-learning dependencies:
| Method | Statistic | Best for |
|---|---|---|
| IQR | Tukey fences | Symmetric or skewed data |
| Z-score | Mean / std deviation | Approximately normal data |
| MAD | Median absolute deviation | Skewed data, heavy tails |
| GESD | Generalized ESD test | Known-normal data; number of outliers unknown |
All detection methods return an OutlierReport. OutlierDetector also exposes remove() and clip() helpers that take a report and return a cleaned TimeSeries object.
Classes
- OutlierReport
Immutable result dataclass.
- OutlierDetector
Stateless detector.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020-01-01", periods=10, freq="D")
>>> vals = np.array([1, 2, 2, 3, 100, 2, 3, 2, 1, 2], dtype=float)
>>> ts = TimeSeries(vals, index=idx)
>>> det = OutlierDetector()
>>> r = det.iqr(ts)
>>> r.n_outliers
1
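The single outlier reported above can be reproduced by computing the Tukey fences by hand. The sketch uses only the standard library and assumes linearly interpolated quartiles (the "inclusive" method of statistics.quantiles, which matches numpy's default percentile interpolation); the quartile convention tseda actually uses is not stated here, and the helper names are hypothetical.

```python
import math
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences, ignoring NaN values (hypothetical helper)."""
    clean = [v for v in values if not math.isnan(v)]
    q1, _median, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outlier_positions(values, k=1.5):
    lower, upper = iqr_fences(values, k)
    # NaN compares False against both fences, so it is never flagged.
    return [i for i, v in enumerate(values) if v < lower or v > upper]

vals = [1.0, 2.0, 2.0, 3.0, 100.0, 2.0, 3.0, 2.0, 1.0, 2.0]
lower, upper = iqr_fences(vals)
flagged = iqr_outlier_positions(vals)
print(lower, upper, flagged)
```

Here Q1 = 2.0 and Q3 = 2.75, so the fences sit at 0.875 and 3.875 and only the value 100.0 at position 4 falls outside them.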
- class tseda.quality.outliers.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)[source]
Bases: object
Immutable outlier detection result.
- mask
Boolean array of shape (n,); True where an outlier was found.
- indices
Integer positions (0-based) of detected outliers.
- timestamps
Timestamps of detected outliers.
- Type: DatetimeIndex
- values
Observed values at outlier positions.
- class tseda.quality.outliers.OutlierDetector[source]
Bases: object
Detect, remove, or clip outliers in a TimeSeries.
This class is stateless: create one instance and reuse it across many series.
- iqr(ts, k=1.5)[source]
Tukey IQR fence method.
- Parameters:
ts (TimeSeries)
k (float)
- Return type:
OutlierReport
- zscore(ts, threshold=3.0)[source]
Standard Z-score method.
- Parameters:
ts (TimeSeries)
threshold (float)
- Return type:
OutlierReport
- mad(ts, threshold=3.5)[source]
Median Absolute Deviation method.
- Parameters:
ts (TimeSeries)
threshold (float)
- Return type:
OutlierReport
- gesd(ts, *, alpha=0.05, max_outliers=10)[source]
Generalized Extreme Studentized Deviate test.
- Parameters:
ts (TimeSeries)
alpha (float)
max_outliers (int)
- Return type:
OutlierReport
- remove(ts, report)[source]
Replace outlier values with NaN.
- Parameters:
ts (TimeSeries)
report (OutlierReport)
- Return type:
TimeSeries
- clip(ts, report)[source]
Clip outlier values to the fence bounds.
- Parameters:
ts (TimeSeries)
report (OutlierReport)
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> ts = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0, 2.0], index=idx)
>>> det = OutlierDetector()
IQR detection:
>>> r = det.iqr(ts)
>>> r.n_outliers
1
>>> int(r.indices[0])
4
Remove the outlier (replace with NaN):
>>> cleaned = det.remove(ts, r)
>>> cleaned.has_nan
True
- iqr(ts, k=1.5)[source]
Detect outliers using Tukey’s IQR fences.
Points below Q1 - k * IQR or above Q3 + k * IQR are flagged. NaN values are excluded from the quartile computation but not flagged as outliers.
- Parameters:
ts (TimeSeries) – Input series.
k (float, optional) – Fence multiplier. Common choices:
1.5: standard outlier fence (default).
3.0: extreme outlier fence.
- Return type:
OutlierReport
- Raises:
ValueError – If k is not positive, or if fewer than 4 non-NaN values exist.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([1.0, 2.0, 2.0, 2.0, 100.0], index=idx)
>>> OutlierDetector().iqr(ts).n_outliers
1
- zscore(ts, threshold=3.0)[source]
Detect outliers using the standard Z-score.
A value is flagged when |z| > threshold, where z = (x - mean) / std.
- Parameters:
ts (TimeSeries) – Input series.
threshold (float, optional) – Z-score cut-off. Default 3.0 (≈ 0.3 % false-positive rate under normality).
- Return type:
OutlierReport
- Raises:
ValueError – If threshold is not positive or if std == 0.
Examples
With n observations |z| cannot exceed sqrt(n - 1), which is 2 for the five-point series below, so the example passes a lower threshold than the default:
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([0.0, 0.1, 0.0, -0.1, 10.0], index=idx)
>>> OutlierDetector().zscore(ts, threshold=1.5).n_outliers
1
- mad(ts, threshold=3.5)[source]
Detect outliers using the Median Absolute Deviation (MAD).
A value is flagged when the absolute modified Z-score |M_i| exceeds threshold:
\[M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\text{MAD}}\]
where \(\tilde{x}\) is the median and \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\).
This is more robust than the Z-score for skewed distributions or heavy-tailed noise.
- Parameters:
ts (TimeSeries) – Input series.
threshold (float, optional) – Modified Z-score cut-off. Iglewicz & Hoaglin (1993) recommend 3.5 (default).
- Return type:
OutlierReport
- Raises:
ValueError – If threshold is not positive or MAD == 0.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0], index=idx)
>>> OutlierDetector().mad(ts).n_outliers
1
- gesd(ts, *, alpha=0.05, max_outliers=10)[source]
Detect outliers using the Generalized ESD (Extreme Studentized Deviate) test.
The GESD test (Rosner, 1983) sequentially removes the most extreme value and tests whether it is a statistical outlier, up to max_outliers iterations. It assumes the underlying data are approximately normal after removing the outliers.
- Parameters:
ts (TimeSeries) – Input series.
alpha (float, optional) – Significance level. Default 0.05.
max_outliers (int, optional) – Upper bound on the number of outliers to test for. Default 10. Must be less than n // 2.
- Returns:
lower_bound and upper_bound are None (GESD uses per-iteration critical values, not fixed fences).
- Return type:
OutlierReport
- Raises:
ValueError – If alpha is not in (0, 1) or max_outliers is invalid.
ImportError – If scipy is not installed (needed for the t-distribution quantile function).
Notes
The critical value at each step i (1-indexed) is:
\[\lambda_i = \frac{(n - i) \, t_{p, n-i-1}} {\sqrt{(n-i-1+t_{p,n-i-1}^2)(n-i+1)}}\]
where \(p = 1 - \alpha / (2(n - i + 1))\) and \(t_{p, \nu}\) is the \(p\)-quantile of the t-distribution with \(\nu\) degrees of freedom.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> rng = np.random.default_rng(0)
>>> idx = pd.date_range("2020", periods=50, freq="D")
>>> vals = rng.standard_normal(50)
>>> vals[10] = 15.0  # plant a spike
>>> ts = TimeSeries(vals, index=idx)
>>> r = OutlierDetector().gesd(ts)
>>> 10 in r.indices
True
- remove(ts, report)[source]
Replace detected outlier values with NaN.
- Parameters:
ts (TimeSeries) – The original series.
report (OutlierReport) – Result from one of the detection methods.
- Returns:
A new series with outliers replaced by NaN.
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> cleaned = det.remove(ts, det.iqr(ts))
>>> cleaned.has_nan
True
- clip(ts, report)[source]
Clip outlier values to the fence bounds of report.
- Parameters:
ts (TimeSeries) – The original series.
report (OutlierReport) – Must have non-None lower_bound and upper_bound (i.e., from iqr, zscore, or mad).
- Returns:
A new series with values clamped to [lower_bound, upper_bound].
- Return type:
TimeSeries
- Raises:
ValueError – If report has no bounds (e.g., from gesd).
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> r = det.iqr(ts)
>>> clipped = det.clip(ts, r)
>>> float(clipped.values.max()) < 100.0
True
Flat-line and near-constant segment detection for time series.
Timestamp duplicates are rejected at construction time by
validate_datetime_index(). This module
addresses the complementary problem: consecutive identical or near-zero
values, which typically signal:
A stuck sensor / ADC saturation.
A data-pipeline bug that forward-filled data without marking it.
A genuine flat segment that may confuse differencing-based methods.
Classes
- FlatlineReport
Immutable result dataclass returned by DuplicateDetector.flatline().
- DuplicateDetector
Stateless detector.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020-01-01", periods=10, freq="D")
>>> vals = np.array([1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0])
>>> ts = TimeSeries(vals, index=idx)
>>> det = DuplicateDetector()
>>> r = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4
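The run detection behind this result can be sketched as a single pass over the values. `flatline_runs` is a hypothetical helper, not the tseda implementation. It assumes that each value is compared against the first value of the current run (comparing consecutive neighbours is another plausible reading of atol), and it does not handle NaN values.

```python
def flatline_runs(values, min_run=3, atol=0.0):
    """Return (start_pos, end_pos, value) for each run of at least min_run equal values."""
    if min_run < 2:
        raise ValueError("min_run must be >= 2")
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of the data or when the value moves away.
        if i == len(values) or abs(values[i] - values[start]) > atol:
            if i - start >= min_run:
                runs.append((start, i - 1, values[start]))
            start = i
    return runs

vals = [1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0]
runs = flatline_runs(vals, min_run=3)
print(runs)
```

The pair of 5.0 values at positions 7 and 8 forms a run of length 2, which is below min_run and therefore not reported; only the four 3.0 values at positions 2 to 5 qualify.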
- class tseda.quality.duplicates.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)[source]
Bases: object
Immutable summary of flat-line segments in a TimeSeries.
- total_flatline_points
Total number of observations that belong to a qualifying flat-line run (includes the first observation of each run).
- runs
Each element is a tuple (start_pos, end_pos, value) where start_pos and end_pos are 0-based integer positions and value is the repeated value. Only runs of length >= min_run are included.
- Type: list of (start_pos, end_pos, value)
- mask
Boolean array; True at every position that is part of a qualifying flat-line run.
- class tseda.quality.duplicates.DuplicateDetector[source]
Bases: object
Detect consecutive duplicate (flat-line) value runs.
- flatline(ts, min_run=3, *, atol=0.0)[source]
Detect flat-line segments of repeated values.
- Parameters:
ts (TimeSeries)
min_run (int)
atol (float)
- Return type:
FlatlineReport
- near_zero(ts, min_run=3, *, threshold=1e-8)[source]
Detect segments where the series is stuck near zero.
- Parameters:
ts (TimeSeries)
min_run (int)
threshold (float)
- Return type:
FlatlineReport
- remove_flatlines(ts, report, *, keep_first=True)[source]
Replace flat-line positions with NaN (keeping the first value of each run by default).
- Parameters:
ts (TimeSeries)
report (FlatlineReport)
keep_first (bool)
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0])
>>> ts = TimeSeries(vals, index=idx)
>>> det = DuplicateDetector()
>>> r = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4
- flatline(ts, min_run=3, *, atol=0.0)[source]
Detect consecutive runs of identical (or near-identical) values.
- Parameters:
ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum number of consecutive identical observations to constitute a “flat line”. Default 3.
atol (float, optional) – Absolute tolerance for equality. Two values a and b are considered equal when |a - b| <= atol. Default 0.0 (exact equality).
- Return type:
FlatlineReport
- Raises:
TypeError – If ts is not a TimeSeries.
ValueError – If min_run < 2.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
Exact flat line of length 4:
>>> idx = pd.date_range("2020", periods=7, freq="D")
>>> vals = np.array([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0])
>>> ts = TimeSeries(vals, index=idx)
>>> r = DuplicateDetector().flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.runs[0]
(1, 4, 3.0)
No flat line (min_run too high):
>>> r2 = DuplicateDetector().flatline(ts, min_run=5)
>>> r2.n_flatline_runs
0
- near_zero(ts, min_run=3, *, threshold=1e-08)[source]
Detect segments where the series is stuck near zero.
Only consecutive runs where every value satisfies |x| <= threshold are reported. This differs from flatline(), which detects any repeated value regardless of magnitude.
- Parameters:
ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum run length. Default 3.
threshold (float, optional) – Maximum absolute value to count as “near zero”. Default 1e-8.
- Returns:
Runs where every value satisfies |x| <= threshold.
- Return type:
FlatlineReport
- Raises:
ValueError – If threshold < 0.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
>>> ts = TimeSeries(vals, index=idx)
>>> r = DuplicateDetector().near_zero(ts, min_run=3)
>>> r.n_flatline_runs
1
- remove_flatlines(ts, report, *, keep_first=True)[source]
Replace flat-line positions with NaN.
- Parameters:
ts (TimeSeries) – The original series.
report (FlatlineReport) – Result from flatline() or near_zero().
keep_first (bool, optional) – When True (default), the first observation of each flat-line run is preserved; only the repeated copies are set to NaN. When False, the entire run including the first observation is set to NaN.
- Returns:
A new series with flat-line values replaced by NaN.
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 2.0, 3.0])
>>> ts = TimeSeries(vals, index=idx)
>>> det = DuplicateDetector()
>>> r = det.flatline(ts, min_run=3)
>>> cleaned = det.remove_flatlines(ts, r, keep_first=True)
>>> cleaned.n_nan
2