Replace a detected outlier with an equal value or multiple values


I am analysing a dataset of R-R intervals. Each interval is a fraction of a second, and together they add up to 240 seconds (4 minutes).

The data often contains some outliers, which I can detect and replace with the following code:

RRs:

[1.076,
 0.957,
 0.939,
 0.956,
 0.942,
 0.888,
 0.821,
 0.903,
 0.806,
 0.971,
 0.973,
 0.999,
 0.909,
 0.811,
 0.921,
 0.915,
 0.917,
 0.801,
 0.815,
 0.94,
 0.946,
 0.963,
 0.911,
 0.804,
 1.015,
 0.981,
 0.971,
 0.901,
 0.809,
 1.005,
 0.962,
 0.986,
 0.911,
 0.816,
 0.99,
 0.948,
 0.969,
 0.832,
 0.932,
 0.928,
 0.972,
 0.932,
 0.809,
 0.932,
 0.928,
 0.974,
 0.943,
 0.81,
 0.936,
 0.918,
 0.94,
 0.941,
 0.826,
 0.909,
 0.913,
 0.947,
 0.965,
 0.913,
 0.895,
 0.941,
 0.963,
 0.954,
 0.905,
 0.803,
 0.954,
 0.917,
 0.949,
 0.896,
 0.777,
 0.913,
 0.905,
 0.957,
 0.918,
 0.799,
 0.966,
 0.975,
 1.021,
 0.986,
 0.839,
 0.938,
 0.95,
 0.973,
 0.959,
 0.823,
 0.842,
 0.959,
 0.946,
 0.983,
 0.924,
 0.814,
 1.002,
 0.972,
 0.98,
 0.967,
 0.836,
 0.95,
 0.957,
 1.012,
 0.939,
 0.812,
 0.964,
 0.937,
 0.963,
 0.931,
 0.92,
 0.963,
 0.995,
 0.985,
 0.914,
 0.914,
 0.943,
 0.977,
 0.957,
 0.822,
 0.926,
 0.932,
 0.992,
 1.061,
 0.999,
 0.841,
 0.983,
 0.955,
 0.972,
 0.823,
 0.809,
 0.769,
 0.765,
 0.728,
 0.697,
 0.699,
 0.694,
 0.694,
 0.695,
 0.689,
 0.692,
 0.697,
 0.76,
 0.669,
 0.676,
 0.673,
 0.67,
 0.668,
 0.665,
 0.666,
 0.753,
 0.778,
 8.154,  # the outlier
 0.784,
 0.762,
 0.741,
 0.743,
 0.752,
 0.836,
 0.738,
 0.838,
 0.813,
 0.807,
 0.798,
 0.793,
 0.784,
 0.71,
 0.729,
 0.73,
 0.801,
 0.771,
 0.709,
 0.798,
 0.778,
 0.782,
 0.712,
 0.804,
 0.781,
 0.784,
 0.774,
 0.779,
 0.785,
 0.786,
 0.773,
 0.77,
 0.769,
 0.771,
 0.766,
 0.773,
 0.784,
 0.79,
 0.789,
 0.779,
 0.784,
 0.792,
 0.795,
 0.786,
 0.784,
 0.784,
 0.791,
 0.784,
 0.783,
 0.783,
 0.785,
 0.787,
 0.776,
 0.792,
 0.807,
 0.81,
 0.814,
 0.824,
 0.833,
 0.839,
 0.807,
 0.795,
 0.795,
 0.789,
 0.777,
 0.759,
 0.745,
 0.748,
 0.756,
 0.759,
 0.753,
 0.767,
 0.783,
 0.793,
 0.787,
 0.793,
 0.797,
 0.813,
 0.826,
 0.805,
 0.779,
 0.771,
 0.762,
 0.746,
 0.737,
 0.739,
 0.745,
 0.746,
 0.691,
 0.771,
 0.765,
 0.805,
 0.807,
 0.789,
 0.806,
 0.811,
 0.8,
 0.732,
 0.798,
 0.771,
 0.761,
 0.705,
 0.775,
 0.76,
 0.771,
 0.775,
 0.777,
 0.797,
 0.806,
 0.799,
 0.786,
 0.789,
 0.789,
 0.782,
 0.769,
 0.776,
 0.78,
 0.786,
 0.77,
 0.774,
 0.782,
 0.785,
 0.787,
 0.79,
 0.786,
 0.773,
 0.779,
 0.787]

from scipy import stats
import pandas as pd
import numpy as np

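# RRs is the list of R-R intervals shown above
df_rrs = pd.DataFrame(RRs, columns=['RRs'])
# flag values more than 1 standard deviation from the mean
mask = (np.abs(stats.zscore(df_rrs['RRs'])) > 1)
# blank out the flagged values and fill them by linear interpolation
df_rrs.RRs = df_rrs.RRs.mask(mask).interpolate()
df_rrs = df_rrs['RRs'].to_list()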

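The code does what it is supposed to: it detects the outlier and replaces it with an interpolated value. The problem is that this leaves a gap in the data, because the cumulative sum should be 4 minutes, and the outlier in the example RRs list is 8.154 seconds.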

So my code above replaces it with just one value, and the dataset ends up shortened, missing about 8 seconds.
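To make the shortening concrete, here is a minimal sketch on a made-up five-beat series. The values are illustrative assumptions, not the data above, and the z-score is computed with plain pandas (population standard deviation, `ddof=0`) instead of scipy.

```python
import pandas as pd

# Hypothetical toy recording: four normal beats and one 4-second artefact.
rr = pd.Series([0.8, 0.8, 4.0, 0.8, 0.8])
total_before = rr.sum()

# Population z-score; values more than 1 standard deviation away are flagged.
z = (rr - rr.mean()) / rr.std(ddof=0)
mask = z.abs() > 1

# Replace the flagged beat with a single linearly interpolated value.
cleaned = rr.mask(mask).interpolate()
total_after = cleaned.sum()

# The recording is now shorter by the outlier minus its single replacement.
print(total_before, total_after)
```

Only the 4.0 entry is flagged here, and replacing it with one interpolated beat removes roughly 3.2 seconds from the total duration.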

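So instead of replacing it with a single mean/interpolated value of its neighbours, I need to replace it with multiple mean/interpolated values that together add up to 8.154 seconds. Ideally, each of those values should be interpolated as well.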
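As a rough illustration of the simplest variant (equal values rather than interpolated ones), the sketch below splits one long interval into n equal parts that sum back to the original. The helper name `split_interval` and the typical beat length of 0.78 s are assumptions made for this example only.

```python
import numpy as np

def split_interval(outlier, typical):
    """Split one long interval into n equal parts that sum back to it.

    Hypothetical helper: n is the outlier length divided by a typical
    beat length, rounded to the nearest whole number (at least 1).
    """
    n = max(int(round(outlier / typical)), 1)
    return np.full(n, outlier / n)

# 0.78 s is an assumed typical beat length near the outlier.
parts = split_interval(8.154, 0.78)
print(len(parts), parts.sum())
```

Every replacement beat has the same length here, so this preserves the total duration but not any trend between the neighbouring beats.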

What is the best way to do this?

python pandas scipy outliers z-score
1 Answer

Here is one approach:

Minimal reproducible example

from scipy import stats
import pandas as pd
import numpy as np

np.random.seed(0) # for reproducibility

RRs = np.random.randint(40, 60, size=(8)).astype("float") / 100

desired_sum = 10

# adding 3 outliers at index `2, 3, 6`
RRs[[2, 3, 6]] += (desired_sum - RRs.sum())/3

df_rrs = pd.DataFrame(RRs, columns=['RRs'])

    RRs
0  0.52
1  0.55
2  2.44 # outlier (consecutive: to be grouped)
3  2.47 # outlier (consecutive: to be grouped)
4  0.43
5  0.47
6  2.53 # outlier
7  0.59

Code

# mask for outliers
mask = (np.abs(stats.zscore(df_rrs['RRs'])) > 1)

# store outliers
outliers = df_rrs.loc[mask, 'RRs']

# group consecutive outliers
group_outliers = (outliers.index.to_series().diff() != 1).cumsum()

# grouper values as first index value per outlier group
grouper = group_outliers.index.to_series().mask(group_outliers.duplicated()).ffill()

# get sum per group
outliers_grouped = outliers.groupby(grouper).sum()

# determine rows needed per group
rows = np.round(outliers_grouped / df_rrs.loc[~mask, 'RRs'].mean())

# isolate index values we no longer want
outliers_excluded = outliers.index.difference(outliers_grouped.index)

# get index `df_rrs` without index values we no longer want
reindex = df_rrs.index.difference(outliers_excluded)

# reindex with `np.repeat` to get appropriate repeats for index values in `rows`
df_rrs = df_rrs.reindex(
    np.repeat(reindex, rows.reindex(reindex, fill_value=1))
    )

# interpolate
df_rrs['RRs'] = df_rrs['RRs'].mask(mask).interpolate()

# scale interpolated values
df_rrs.loc[mask, 'RRs'] = (
    df_rrs.loc[mask, 'RRs'] * (outliers_grouped 
                               / df_rrs.loc[mask, 'RRs'].groupby(level=0).sum())
    )

Output:

        RRs
0  0.520000
1  0.550000
2  0.540191 # index value 2, outlier interpolated (grouped)
2  0.529260
2  0.518328
2  0.507397
2  0.496466
2  0.485534
2  0.474603
2  0.463672
2  0.452740
2  0.441809
4  0.430000
5  0.470000
6  0.467811 # index value 6, outlier interpolated
6  0.486906
6  0.506000
6  0.525094
6  0.544189
7  0.590000

Sum check with np.isclose:

np.isclose(df_rrs['RRs'].sum(), desired_sum)
# True

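Note that inherent floating-point precision issues may cause a slight difference between the sum and desired_sum, so df_rrs['RRs'].sum() == desired_sum may not always be True (it happens to be True here, by the way). This seems unavoidable.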
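A generic illustration of why the tolerance-based check is the safer one. This is the standard 0.1 + 0.2 case, not the data above:

```python
import numpy as np

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary representation.
print(total == 0.3)            # False

# np.isclose compares within a small relative/absolute tolerance.
print(np.isclose(total, 0.3))  # True
```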


Explanation / intermediates

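  • Create the mask and store the outlier subset as outliers. Consecutive outliers are then labelled with the first index value of their run (grouper) and summed per group.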
outliers

2    2.44
3    2.47
6    2.53
Name: RRs, dtype: float64
grouper

2    2.0 # one group
3    2.0 # one group
6    6.0
dtype: float64
outliers.groupby(grouper).sum()

2.0    4.91
6.0    2.53
Name: RRs, dtype: float64
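The grouping of consecutive outliers relies on a diff/cumsum idiom. A minimal sketch on the three outlier labels from this example (the variable names are illustrative, not taken from the code above):

```python
import pandas as pd

idx = pd.Series([2, 3, 6])          # index labels flagged as outliers

# True wherever a label is not exactly "previous label + 1",
# i.e. wherever a new run of consecutive outliers starts.
starts_new_run = idx.diff() != 1

# The running count of run starts gives one id per run.
group_id = starts_new_run.cumsum()
print(group_id.tolist())            # [1, 1, 2]
```

Labels 2 and 3 share an id because they are adjacent, while 6 starts a new run.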
  • Now we need to work out how many rows each group needs, using np.round on outliers_grouped divided by the mean of the non-outlier values.
rows

2.0    10.0
6.0     5.0
Name: RRs, dtype: float64
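The row counts can be reproduced by hand from the example values. A small sketch with the numbers copied from the outputs above (plain NumPy arrays stand in for the pandas objects):

```python
import numpy as np

normal = np.array([0.52, 0.55, 0.43, 0.47, 0.59])  # non-outlier values
group_sums = np.array([4.91, 2.53])                # summed outliers per group

typical = normal.mean()                  # about 0.512
rows = np.round(group_sums / typical)    # 4.91 / 0.512 ≈ 9.59, 2.53 / 0.512 ≈ 4.94
print(typical, rows)
```

Rounding means the replacement beats will not have exactly the typical length, which is why a scaling step is still needed afterwards.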
  • We need to reindex with df.reindex. First, use index.difference on outliers.index and outliers_grouped.index to find the index values that were dropped during grouping (outliers_excluded):
outliers.index.difference(outliers_grouped.index)

Index([3], dtype='int64')
  • Next, take the difference between df_rrs.index and outliers_excluded, which in this case excludes 3.
  • Now we can reindex with np.repeat, getting 10 and 5 rows for the index values in rows and 1 row for every other index value (fill_value=1).
df_rrs.index # reindexed

Index([0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 5, 6, 6, 6, 6, 6, 7], dtype='int64')
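The repeated index can be checked in isolation. A minimal sketch, with plain arrays standing in for the pandas index and the rows series:

```python
import numpy as np

kept = np.array([0, 1, 2, 4, 5, 6, 7])     # labels left after dropping 3
counts = np.array([1, 1, 10, 1, 1, 5, 1])  # rows wanted per label

# np.repeat repeats each element by the matching count.
new_index = np.repeat(kept, counts)
print(new_index.tolist())
print(len(new_index))                      # 20
```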
  • At this stage we can finally use Series.interpolate.
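  • The last step is to scale the interpolated values so that we reach the desired sum. For this we need another df.groupby on the interpolated values to get their sums as input: interpolated values * (outliers_grouped / grouped sum of interpolated values).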
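The scaling step for a single group boils down to one multiplication. A sketch with a made-up interpolated run (the five values are assumptions, the target of 2.53 is the second group's sum from the example):

```python
import numpy as np

interpolated = np.array([0.50, 0.52, 0.54, 0.56, 0.58])  # hypothetical interpolated run
target = 2.53                                            # original outlier sum for the group

# Multiply every value by the same factor so the run sums to the target.
factor = target / interpolated.sum()
scaled = interpolated * factor
print(scaled.sum())
```

Because every value is multiplied by the same factor, the relative shape of the interpolated run is preserved while its total matches the original outlier duration.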