tseda.quality
Data quality diagnostics for time series.
Public API
- MissingValueReport
Immutable result of MissingValueAnalyzer.
- MissingValueAnalyzer
Detect NaN values, index gaps, and interpolate.
- OutlierReport
Immutable result of OutlierDetector.
- OutlierDetector
IQR, Z-score, MAD, and GESD outlier detection.
- FlatlineReport
Immutable result of DuplicateDetector.
- DuplicateDetector
Consecutive flat-line and near-zero segment detection.
- class tseda.quality.MissingValueReport(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)
Bases: object
Immutable summary of missing values in a TimeSeries.
- n_gaps
Number of missing timestamps (index gaps) when the series frequency is known. -1 when the frequency is unknown.
Type: int
- gap_locations
Start timestamp of each index gap. Empty when n_gaps <= 0.
- nan_positions
Integer positions (0-based) of all NaN values.
- is_monotone_missing
True when all NaN values cluster at the start or end of the series (a monotone missing pattern, which is easier to handle).
Type: bool
- __init__(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)
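The NaN-related fields above can be illustrated on a plain list of floats. This is a minimal sketch, not the tseda implementation: the helper name summarize_nans is hypothetical, and the reading of "monotone" (the observed values form one contiguous block, so every NaN sits at the start or end) is an assumption based on the field description.

```python
import math

def summarize_nans(values):
    """Summarize NaN positions and runs in a list of floats (hypothetical helper)."""
    nan_positions = [i for i, v in enumerate(values) if math.isnan(v)]

    # Collect the length of every maximal run of consecutive NaN values.
    run_lengths, run = [], 0
    for v in values:
        if math.isnan(v):
            run += 1
        elif run:
            run_lengths.append(run)
            run = 0
    if run:
        run_lengths.append(run)

    # Assumed meaning of "monotone": the non-NaN values are contiguous,
    # so NaN values can only appear before the first or after the last one.
    observed = [i for i, v in enumerate(values) if not math.isnan(v)]
    contiguous = not observed or observed[-1] - observed[0] + 1 == len(observed)

    return {
        "n_nan": len(nan_positions),
        "nan_positions": nan_positions,
        "nan_run_lengths": run_lengths,
        "longest_nan_run": max(run_lengths, default=0),
        "is_monotone_missing": contiguous,
    }

edge = summarize_nans([math.nan, 1.0, 2.0, math.nan, math.nan])
interior = summarize_nans([1.0, math.nan, math.nan, 4.0])
```

Here edge has NaN values only at the boundaries, while interior has a gap between two observed values.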
- class tseda.quality.MissingValueAnalyzer
Bases: object
Analyze and repair missing values in a TimeSeries.
This class is stateless: instantiate it once and call its methods on different series objects.
- analyze(ts)
Return a MissingValueReport for ts.
- Parameters:
ts (TimeSeries)
- Return type:
MissingValueReport
- interpolate(ts, method='linear', *, limit=None, fill_value=None)
Fill NaN values and return a new TimeSeries.
- Parameters:
ts (TimeSeries)
method (str)
limit (int | None)
fill_value (float | None)
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020-01-01", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> ts = TimeSeries(vals, index=idx)
>>> ana = MissingValueAnalyzer()
>>> r = ana.analyze(ts)
>>> r.n_nan
2
>>> filled = ana.interpolate(ts)
>>> filled.has_nan
False
- analyze(ts)
Compute a complete missing-value summary for ts.
- Parameters:
ts (TimeSeries) – The series to analyze.
- Return type:
MissingValueReport
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020", periods=4, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0])
>>> report = MissingValueAnalyzer().analyze(TimeSeries(vals, index=idx))
>>> report.n_nan
2
>>> report.longest_nan_run
2
- interpolate(ts, method='linear', *, limit=None, fill_value=None)
Fill NaN values and return a new TimeSeries.
- Parameters:
ts (TimeSeries) – Series to fill.
method (str, optional) – Interpolation strategy. One of:
  - "linear": linear interpolation between neighbours (default). Leading and trailing NaN are filled with the nearest observed boundary value when limit is None.
  - "forward": forward-fill (carry the last observed value).
  - "backward": backward-fill (carry the next observed value).
  - "nearest": fill with the nearest non-NaN value.
  - "zero": fill with 0.0.
  - "constant": fill with fill_value (must be provided).
  - "spline": cubic spline (requires scipy).
limit (int, optional) – Maximum number of consecutive NaN values to fill. None fills all gaps.
fill_value (float, optional) – Used only with method="constant".
- Returns:
A new series with NaN values replaced. Metadata (name, unit, freq, description) is preserved.
- Return type:
TimeSeries
- Raises:
TypeError – If ts is not a TimeSeries.
ValueError – If method is not recognised, or if "constant" is chosen without supplying fill_value.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.missing import MissingValueAnalyzer
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> vals = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
>>> ts = TimeSeries(vals, index=idx)
>>> ana = MissingValueAnalyzer()
Linear interpolation:
>>> filled = ana.interpolate(ts, "linear")
>>> filled.values.tolist()
[1.0, 2.0, 3.0, 4.0, 5.0]
Forward fill:
>>> fwd = ana.interpolate(ts, "forward")
>>> fwd.values.tolist()
[1.0, 1.0, 1.0, 4.0, 5.0]
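To make the "linear" and "forward" strategies concrete, the sketch below re-implements them on a plain list. It is an illustration under assumptions, not the library code: fill_linear and fill_forward are hypothetical names, the series is treated as evenly spaced, and the limit argument is not modelled.

```python
import math

def fill_linear(values):
    """Interpolate interior NaN runs linearly; copy the nearest observed value into edge NaNs."""
    out = list(values)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    if not known:
        return out  # no observed value to anchor on
    first, last = known[0], known[-1]
    for i in range(first):                 # leading NaN values
        out[i] = out[first]
    for i in range(last + 1, len(out)):    # trailing NaN values
        out[i] = out[last]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            # Weighted average of the two neighbouring observed values.
            out[i] = (out[left] * (right - i) + out[right] * (i - left)) / span
    return out

def fill_forward(values):
    """Carry the last observed value forward; leading NaN values stay NaN."""
    out, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        out.append(last)
    return out

series = [1.0, math.nan, math.nan, 4.0, 5.0]
linear = fill_linear(series)    # [1.0, 2.0, 3.0, 4.0, 5.0]
forward = fill_forward(series)  # [1.0, 1.0, 1.0, 4.0, 5.0]
edges = fill_linear([math.nan, 2.0, math.nan])  # [2.0, 2.0, 2.0]
```

The edges case shows the boundary behaviour described for "linear": with no neighbour on one side, the nearest observed value is copied.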
- class tseda.quality.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)
Bases: object
Immutable outlier detection result.
- mask
Boolean array of shape (n,); True where an outlier was found.
- indices
Integer positions (0-based) of detected outliers.
- timestamps
Timestamps of detected outliers.
Type: DatetimeIndex
- values
Observed values at outlier positions.
- class tseda.quality.OutlierDetector
Bases: object
Detect, remove, or clip outliers in a TimeSeries.
This class is stateless: create one instance and reuse it across many series.
- iqr(ts, k=1.5)
Tukey IQR fence method.
- Parameters:
ts (TimeSeries)
k (float)
- Return type:
OutlierReport
- zscore(ts, threshold=3.0)
Standard Z-score method.
- Parameters:
ts (TimeSeries)
threshold (float)
- Return type:
OutlierReport
- mad(ts, threshold=3.5)
Median Absolute Deviation method.
- Parameters:
ts (TimeSeries)
threshold (float)
- Return type:
OutlierReport
- gesd(ts, *, alpha=0.05, max_outliers=10)
Generalized Extreme Studentized Deviate test.
- Parameters:
ts (TimeSeries)
alpha (float)
max_outliers (int)
- Return type:
OutlierReport
- remove(ts, report)
Replace outlier values with NaN.
- Parameters:
ts (TimeSeries)
report (OutlierReport)
- Return type:
TimeSeries
- clip(ts, report)
Clip outlier values to the fence bounds.
- Parameters:
ts (TimeSeries)
report (OutlierReport)
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> ts = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0, 2.0], index=idx)
>>> det = OutlierDetector()
IQR detection:
>>> r = det.iqr(ts)
>>> r.n_outliers
1
>>> int(r.indices[0])
4
Remove the outlier (replace with NaN):
>>> cleaned = det.remove(ts, r)
>>> cleaned.has_nan
True
- iqr(ts, k=1.5)
Detect outliers using Tukey's IQR fences.
Points below Q1 - k * IQR or above Q3 + k * IQR are flagged. NaN values are excluded from the quartile computation and are never flagged as outliers.
- Parameters:
ts (TimeSeries) – Input series.
k (float, optional) – Fence multiplier. Common choices:
  - 1.5: standard outlier fence (default).
  - 3.0: extreme outlier fence.
- Return type:
OutlierReport
- Raises:
ValueError – If k is not positive, or if fewer than 4 non-NaN values exist.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([1.0, 2.0, 3.0, 4.0, 100.0], index=idx)
>>> OutlierDetector().iqr(ts).n_outliers
1
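The fence rule can be reproduced with the standard library alone. The sketch below is illustrative rather than the tseda code: iqr_outliers is a hypothetical name, and the quartiles use linear interpolation between order statistics (statistics.quantiles with method="inclusive"), which is an assumption because the quartile convention used by tseda is not stated here.

```python
import math
import statistics

def iqr_outliers(values, k=1.5):
    """Return (flagged positions, lower fence, upper fence) using Tukey's rule."""
    if k <= 0:
        raise ValueError("k must be positive")
    clean = [v for v in values if not math.isnan(v)]
    if len(clean) < 4:
        raise ValueError("need at least 4 non-NaN values")
    # Assumed quartile convention: linear interpolation between order statistics.
    q1, _median, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are always False, so NaN values are never flagged.
    flagged = [i for i, v in enumerate(values) if v < lower or v > upper]
    return flagged, lower, upper

data = [2.0, 2.1, 1.9, 2.0, 50.0, 2.0]
flagged, lower, upper = iqr_outliers(data)               # flagged == [4]
flagged_nan, _, _ = iqr_outliers(data + [math.nan])      # still [4]
```

With these six values Q1 is 2.0 and Q3 is 2.075, so the fences sit near 1.8875 and 2.1875 and only the 50.0 at position 4 falls outside them.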
- zscore(ts, threshold=3.0)
Detect outliers using the standard Z-score.
A value is flagged when |z| > threshold, where z = (x - mean) / std.
- Parameters:
ts (TimeSeries) – Input series.
threshold (float, optional) – Z-score cut-off. Default 3.0 (≈ 0.3 % false-positive rate under normality).
- Return type:
OutlierReport
- Raises:
ValueError – If threshold is not positive or if std == 0.
Examples
With only five observations no |z| can reach 3.0, so this example lowers the threshold:
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([0.0, 0.1, 0.0, -0.1, 10.0], index=idx)
>>> OutlierDetector().zscore(ts, threshold=1.5).n_outliers
1
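A short standalone sketch of the Z-score rule helps show a small-sample limit: with n observations, |z| cannot exceed sqrt(n - 1) when the population standard deviation is used, or (n - 1) / sqrt(n) with the sample standard deviation. The helper name zscores and its ddof switch are hypothetical; which of the two conventions tseda uses is not stated here.

```python
import math
import statistics

def zscores(values, ddof=0):
    """Return z = (x - mean) / std for each value; NaN inputs are ignored in the statistics."""
    clean = [v for v in values if not math.isnan(v)]
    mean = statistics.fmean(clean)
    # ddof=0 uses the population standard deviation, ddof=1 the sample one.
    std = statistics.pstdev(clean) if ddof == 0 else statistics.stdev(clean)
    if std == 0:
        raise ValueError("std == 0")
    return [(v - mean) / std for v in values]

data = [0.0, 0.1, 0.0, -0.1, 10.0]
z = zscores(data)
max_abs_z = max(abs(v) for v in z)      # just under 2.0
ceiling = math.sqrt(len(data) - 1)      # 2.0 for five observations
flagged_at_3 = [i for i, v in enumerate(z) if abs(v) > 3.0]    # []
flagged_at_1_5 = [i for i, v in enumerate(z) if abs(v) > 1.5]  # [4]
```

For very short series a cut-off of 3.0 therefore flags nothing, however extreme a single value is; a lower threshold or a robust method such as MAD is the usual alternative.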
- mad(ts, threshold=3.5)
Detect outliers using the Median Absolute Deviation (MAD).
A value is flagged when the absolute modified Z-score \(|M_i|\) exceeds threshold:
\[M_i = \frac{0.6745 \,(x_i - \tilde{x})}{\text{MAD}}\]
where \(\tilde{x}\) is the median and \(\text{MAD} = \text{median}(|x_i - \tilde{x}|)\).
This is more robust than the Z-score for skewed distributions or heavy-tailed noise.
- Parameters:
ts (TimeSeries) – Input series.
threshold (float, optional) – Modified Z-score cut-off. Iglewicz & Hoaglin (1993) recommend 3.5 (default).
- Return type:
OutlierReport
- Raises:
ValueError – If threshold is not positive or MAD == 0.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([2.0, 2.1, 1.9, 2.0, 50.0], index=idx)
>>> OutlierDetector().mad(ts).n_outliers
1
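The modified Z-score formula translates directly into a few lines of standard-library Python. This is a sketch of the formula only: mad_outliers is a hypothetical name, and the NaN handling (ignored when computing the median, never flagged) is assumed to mirror the IQR method.

```python
import math
import statistics

def mad_outliers(values, threshold=3.5):
    """Return positions whose absolute modified Z-score exceeds threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    clean = [v for v in values if not math.isnan(v)]
    med = statistics.median(clean)
    mad = statistics.median(abs(v - med) for v in clean)
    if mad == 0:
        raise ValueError("MAD == 0")
    # M_i = 0.6745 * (x_i - median) / MAD
    scores = [0.6745 * (v - med) / mad for v in values]
    return [i for i, m in enumerate(scores) if abs(m) > threshold]

flagged = mad_outliers([2.0, 2.1, 1.9, 2.0, 50.0])  # [4]
```

The median is 2.0 and the MAD is about 0.1, so the value 50.0 scores roughly 324 while the remaining points score below 1.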
- gesd(ts, *, alpha=0.05, max_outliers=10)
Detect outliers using the Generalized ESD (Extreme Studentized Deviate) test.
The GESD test (Rosner, 1983) sequentially removes the most extreme value and tests whether it is a statistical outlier, up to max_outliers iterations. It assumes the underlying data are approximately normal after removing the outliers.
- Parameters:
ts (TimeSeries) – Input series.
alpha (float, optional) – Significance level. Default 0.05.
max_outliers (int, optional) – Upper bound on the number of outliers to test for. Default 10. Must be less than n // 2.
- Returns:
lower_bound and upper_bound are None (GESD uses per-iteration critical values, not fixed fences).
- Return type:
OutlierReport
- Raises:
ValueError – If alpha is not in (0, 1) or max_outliers is invalid.
ImportError – If scipy is not installed (needed for the t-distribution quantile function).
Notes
The critical value at each step i (1-indexed) is:
\[\lambda_i = \frac{(n - i) \, t_{p, n-i-1}} {\sqrt{(n-i-1+t_{p,n-i-1}^2)(n-i+1)}}\]
where \(p = 1 - \alpha / (2(n - i + 1))\) and \(t_{p, \nu}\) is the \(p\)-quantile of the t-distribution with \(\nu\) degrees of freedom.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> rng = np.random.default_rng(0)
>>> idx = pd.date_range("2020", periods=50, freq="D")
>>> vals = rng.standard_normal(50)
>>> vals[10] = 15.0  # plant a spike
>>> ts = TimeSeries(vals, index=idx)
>>> r = OutlierDetector().gesd(ts)
>>> 10 in r.indices
True
- remove(ts, report)
Replace detected outlier values with NaN.
- Parameters:
ts (TimeSeries) – The original series.
report (OutlierReport) – Result from one of the detection methods.
- Returns:
A new series with outliers replaced by NaN.
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> cleaned = det.remove(ts, det.iqr(ts))
>>> cleaned.has_nan
True
- clip(ts, report)
Clip outlier values to the fence bounds of report.
- Parameters:
ts (TimeSeries) – The original series.
report (OutlierReport) – Must have non-None lower_bound and upper_bound (i.e., from iqr, zscore, or mad).
- Returns:
A new series with values clamped to [lower_bound, upper_bound].
- Return type:
TimeSeries
- Raises:
ValueError – If report has no bounds (e.g., from gesd).
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.outliers import OutlierDetector
>>> idx = pd.date_range("2020", periods=5, freq="D")
>>> ts = TimeSeries([1.0, 2.0, 100.0, 2.0, 1.0], index=idx)
>>> det = OutlierDetector()
>>> r = det.iqr(ts)
>>> clipped = det.clip(ts, r)
>>> float(clipped.values.max()) < 100.0
True
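The difference between remove and clip can be sketched on a plain list. The helpers below are hypothetical stand-ins that take the flagged positions and the two bounds directly instead of an OutlierReport. The bounds -0.5 and 3.5 are what Tukey's rule gives for this data when quartiles are linearly interpolated, which is an assumption about the quartile convention.

```python
import math

def remove_outliers(values, indices):
    """Return a copy with the flagged positions replaced by NaN."""
    drop = set(indices)
    return [math.nan if i in drop else v for i, v in enumerate(values)]

def clip_to_bounds(values, lower, upper):
    """Return a copy with every value clamped to [lower, upper]."""
    if lower is None or upper is None:
        # Mirrors the documented error for reports without fixed fences (e.g. GESD).
        raise ValueError("report has no bounds")
    return [min(max(v, lower), upper) for v in values]

data = [1.0, 2.0, 100.0, 2.0, 1.0]
removed = remove_outliers(data, [2])        # NaN at position 2
clipped = clip_to_bounds(data, -0.5, 3.5)   # [1.0, 2.0, 3.5, 2.0, 1.0]
```

Removing leaves a gap that can later be interpolated, whereas clipping keeps the observation but caps its magnitude at the fence.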
- class tseda.quality.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)
Bases: object
Immutable summary of flat-line segments in a TimeSeries.
- total_flatline_points
Total number of observations that belong to a qualifying flat-line run (includes the first observation of each run).
Type: int
- runs
Each element is a tuple (start_pos, end_pos, value), where start_pos and end_pos are 0-based integer positions (end_pos is inclusive) and value is the repeated value. Only runs of length >= min_run are included.
Type: list of (start_pos, end_pos, value)
- mask
Boolean array; True at every position that is part of a qualifying flat-line run.
- class tseda.quality.DuplicateDetector
Bases: object
Detect consecutive duplicate (flat-line) value runs.
- flatline(ts, min_run=3, *, atol=0.0)
Detect flat-line segments of repeated values.
- Parameters:
ts (TimeSeries)
min_run (int)
atol (float)
- Return type:
FlatlineReport
- near_zero(ts, min_run=3, *, threshold=1e-08)
Detect segments where the series is stuck near zero.
- Parameters:
ts (TimeSeries)
min_run (int)
threshold (float)
- Return type:
FlatlineReport
- remove_flatlines(ts, report, *, keep_first=True)
Replace flat-line positions with NaN (keeping the first value by default).
- Parameters:
ts (TimeSeries)
report (FlatlineReport)
keep_first (bool)
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 3.0, 4.0])
>>> ts = TimeSeries(vals, index=idx)
>>> det = DuplicateDetector()
>>> r = det.flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.longest_run
4
- flatline(ts, min_run=3, *, atol=0.0)
Detect consecutive runs of identical (or near-identical) values.
- Parameters:
ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum number of consecutive identical observations to constitute a "flat line". Default 3.
atol (float, optional) – Absolute tolerance for equality. Two values a and b are considered equal when |a - b| <= atol. Default 0.0 (exact equality).
- Return type:
FlatlineReport
- Raises:
TypeError – If ts is not a TimeSeries.
ValueError – If min_run < 2.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
Exact flat line of length 4:
>>> idx = pd.date_range("2020", periods=7, freq="D")
>>> vals = np.array([1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0])
>>> ts = TimeSeries(vals, index=idx)
>>> r = DuplicateDetector().flatline(ts, min_run=3)
>>> r.n_flatline_runs
1
>>> r.runs[0]
(1, 4, 3.0)
No flat line (min_run too high):
>>> r2 = DuplicateDetector().flatline(ts, min_run=5)
>>> r2.n_flatline_runs
0
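Run detection of this kind amounts to a single pass that tracks where the current run started. The sketch below is a hypothetical re-implementation on a plain list, not the tseda code. In particular, comparing each value with the first value of its run (rather than with its immediate predecessor) is an assumed interpretation of atol.

```python
def find_flat_runs(values, min_run=3, atol=0.0):
    """Return (start_pos, end_pos, value) for each run of at least min_run equal values."""
    if min_run < 2:
        raise ValueError("min_run must be >= 2")
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # Close the run at the end of the data, or when the value is not within
        # atol of the run's first value. Writing the test as "not (... <= atol)"
        # makes NaN values break a run instead of silently extending it.
        if i == len(values) or not abs(values[i] - values[start]) <= atol:
            if i - start >= min_run:
                runs.append((start, i - 1, values[start]))
            start = i
    return runs

vals = [1.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0]
runs = find_flat_runs(vals, min_run=3)      # [(1, 4, 3.0)]
no_runs = find_flat_runs(vals, min_run=5)   # []
loose = find_flat_runs([1.0, 1.05, 0.98, 2.0], min_run=3, atol=0.1)  # [(0, 2, 1.0)]
```

The end position is inclusive, matching the (1, 4, 3.0) tuple in the example above.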
- near_zero(ts, min_run=3, *, threshold=1e-08)
Detect segments where the series is stuck near zero.
Only consecutive runs where every value satisfies |x| <= threshold are reported. This differs from flatline(), which detects any repeated value regardless of magnitude.
- Parameters:
ts (TimeSeries) – Input series.
min_run (int, optional) – Minimum run length. Default 3.
threshold (float, optional) – Maximum absolute value to count as "near zero". Default 1e-8.
- Returns:
Runs where every value satisfies |x| <= threshold.
- Return type:
FlatlineReport
- Raises:
ValueError – If threshold < 0.
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020", periods=8, freq="D")
>>> vals = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
>>> ts = TimeSeries(vals, index=idx)
>>> r = DuplicateDetector().near_zero(ts, min_run=3)
>>> r.n_flatline_runs
1
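The contrast with flatline() is easiest to see in code: membership in a near-zero run depends only on each value's magnitude, not on equality with its neighbours. The helper below is a hypothetical sketch on a plain list that returns (start_pos, end_pos) pairs with an inclusive end.

```python
def find_near_zero_runs(values, min_run=3, threshold=1e-8):
    """Return (start_pos, end_pos) for each run where every |x| <= threshold."""
    if threshold < 0:
        raise ValueError("threshold must be >= 0")
    runs, start = [], None
    # An infinite sentinel at the end closes a run that reaches the last value.
    for i, v in enumerate(list(values) + [float("inf")]):
        if abs(v) <= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    return runs

stuck = find_near_zero_runs([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])  # [(1, 4)]
jitter = find_near_zero_runs([1e-9, -2e-9, 0.0, 5.0], min_run=3)        # [(0, 2)]
constant = find_near_zero_runs([5.0, 5.0, 5.0, 5.0])                    # []
```

The jitter case would not count as an exact flat line because the values differ, yet it is one near-zero run. The constant case is a flat line but not near zero.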
- remove_flatlines(ts, report, *, keep_first=True)
Replace flat-line positions with NaN.
- Parameters:
ts (TimeSeries) – The original series.
report (FlatlineReport) – Result from flatline() or near_zero().
keep_first (bool, optional) – When True (default), the first observation of each flat-line run is preserved; only the repeated copies are set to NaN. When False, the entire run including the first observation is set to NaN.
- Returns:
A new series with flat-line values replaced by NaN.
- Return type:
TimeSeries
Examples
>>> import numpy as np, pandas as pd
>>> from tseda import TimeSeries
>>> from tseda.quality.duplicates import DuplicateDetector
>>> idx = pd.date_range("2020", periods=6, freq="D")
>>> vals = np.array([1.0, 5.0, 5.0, 5.0, 2.0, 3.0])
>>> ts = TimeSeries(vals, index=idx)
>>> det = DuplicateDetector()
>>> r = det.flatline(ts, min_run=3)
>>> cleaned = det.remove_flatlines(ts, r, keep_first=True)
>>> cleaned.n_nan
2
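The effect of keep_first can be shown with a small stand-in that takes the runs tuples directly. The name blank_runs is hypothetical, and the runs are assumed to use the (start_pos, end_pos, value) layout with an inclusive end_pos described for FlatlineReport.

```python
import math

def blank_runs(values, runs, keep_first=True):
    """Return a copy with each run set to NaN, optionally sparing its first value."""
    out = list(values)
    for start, end, _value in runs:
        first = start + 1 if keep_first else start
        for i in range(first, end + 1):   # end is inclusive
            out[i] = math.nan
    return out

vals = [1.0, 5.0, 5.0, 5.0, 2.0, 3.0]
runs = [(1, 3, 5.0)]
kept = blank_runs(vals, runs, keep_first=True)      # NaN at positions 2 and 3
dropped = blank_runs(vals, runs, keep_first=False)  # NaN at positions 1, 2 and 3
n_kept = sum(math.isnan(v) for v in kept)           # 2
n_dropped = sum(math.isnan(v) for v in dropped)     # 3
```

Keeping the first value treats it as the last trustworthy reading before the sensor stuck; dropping it is the stricter choice when the whole run is suspect.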
Missing Values
- class tseda.quality.missing.MissingValueReport(n_nan, pct_nan, n_gaps, gap_locations, longest_nan_run, nan_run_lengths, nan_positions, is_monotone_missing)
Documented above as tseda.quality.MissingValueReport.
- class tseda.quality.missing.MissingValueAnalyzer
Documented above as tseda.quality.MissingValueAnalyzer (methods analyze and interpolate).
Outliers
- class tseda.quality.outliers.OutlierReport(mask, indices, timestamps, values, method, n_outliers, lower_bound, upper_bound)
Documented above as tseda.quality.OutlierReport.
- class tseda.quality.outliers.OutlierDetector
Documented above as tseda.quality.OutlierDetector (methods iqr, zscore, mad, gesd, remove, and clip).
Flat-line / Duplicates
- class tseda.quality.duplicates.FlatlineReport(n_flatline_runs, longest_run, total_flatline_points, runs, mask, min_run)
Documented above as tseda.quality.FlatlineReport.
- class tseda.quality.duplicates.DuplicateDetector
Documented above as tseda.quality.DuplicateDetector (methods flatline, near_zero, and remove_flatlines).