I'm analyzing a dataset of R-R intervals, expressed as fractions of a second, which together sum to 240 seconds (4 minutes).
The data regularly contains outliers, which I can detect and replace with the following code:
RRs = [1.076,
0.957,
0.939,
0.956,
0.942,
0.888,
0.821,
0.903,
0.806,
0.971,
0.973,
0.999,
0.909,
0.811,
0.921,
0.915,
0.917,
0.801,
0.815,
0.94,
0.946,
0.963,
0.911,
0.804,
1.015,
0.981,
0.971,
0.901,
0.809,
1.005,
0.962,
0.986,
0.911,
0.816,
0.99,
0.948,
0.969,
0.832,
0.932,
0.928,
0.972,
0.932,
0.809,
0.932,
0.928,
0.974,
0.943,
0.81,
0.936,
0.918,
0.94,
0.941,
0.826,
0.909,
0.913,
0.947,
0.965,
0.913,
0.895,
0.941,
0.963,
0.954,
0.905,
0.803,
0.954,
0.917,
0.949,
0.896,
0.777,
0.913,
0.905,
0.957,
0.918,
0.799,
0.966,
0.975,
1.021,
0.986,
0.839,
0.938,
0.95,
0.973,
0.959,
0.823,
0.842,
0.959,
0.946,
0.983,
0.924,
0.814,
1.002,
0.972,
0.98,
0.967,
0.836,
0.95,
0.957,
1.012,
0.939,
0.812,
0.964,
0.937,
0.963,
0.931,
0.92,
0.963,
0.995,
0.985,
0.914,
0.914,
0.943,
0.977,
0.957,
0.822,
0.926,
0.932,
0.992,
1.061,
0.999,
0.841,
0.983,
0.955,
0.972,
0.823,
0.809,
0.769,
0.765,
0.728,
0.697,
0.699,
0.694,
0.694,
0.695,
0.689,
0.692,
0.697,
0.76,
0.669,
0.676,
0.673,
0.67,
0.668,
0.665,
0.666,
0.753,
0.778,
8.154,  # outlier
0.784,
0.762,
0.741,
0.743,
0.752,
0.836,
0.738,
0.838,
0.813,
0.807,
0.798,
0.793,
0.784,
0.71,
0.729,
0.73,
0.801,
0.771,
0.709,
0.798,
0.778,
0.782,
0.712,
0.804,
0.781,
0.784,
0.774,
0.779,
0.785,
0.786,
0.773,
0.77,
0.769,
0.771,
0.766,
0.773,
0.784,
0.79,
0.789,
0.779,
0.784,
0.792,
0.795,
0.786,
0.784,
0.784,
0.791,
0.784,
0.783,
0.783,
0.785,
0.787,
0.776,
0.792,
0.807,
0.81,
0.814,
0.824,
0.833,
0.839,
0.807,
0.795,
0.795,
0.789,
0.777,
0.759,
0.745,
0.748,
0.756,
0.759,
0.753,
0.767,
0.783,
0.793,
0.787,
0.793,
0.797,
0.813,
0.826,
0.805,
0.779,
0.771,
0.762,
0.746,
0.737,
0.739,
0.745,
0.746,
0.691,
0.771,
0.765,
0.805,
0.807,
0.789,
0.806,
0.811,
0.8,
0.732,
0.798,
0.771,
0.761,
0.705,
0.775,
0.76,
0.771,
0.775,
0.777,
0.797,
0.806,
0.799,
0.786,
0.789,
0.789,
0.782,
0.769,
0.776,
0.78,
0.786,
0.77,
0.774,
0.782,
0.785,
0.787,
0.79,
0.786,
0.773,
0.779,
0.787]
from scipy import stats
import pandas as pd
import numpy as np
df_rrs = pd.DataFrame(RRs, columns=['RRs'])
mask = (np.abs(stats.zscore(df_rrs['RRs'])) > 1)
df_rrs.RRs = df_rrs.RRs.mask(mask).interpolate()
df_rrs = df_rrs['RRs'].to_list()
The code detects what it should and replaces the value with an interpolation. The problem is that this leaves a gap in the data: the cumulative sum should be 4 minutes, and in the example RR list the outlier is 8.154 seconds.
So my code above replaces it with a single value, and the dataset is effectively shortened by about 8 seconds.
Instead of replacing the outlier with one averaged/interpolated value, I therefore need to replace it with several averaged/interpolated values that together add up to 8.154 seconds. Ideally each of those values would also be interpolated.
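To make the gap concrete, here is a minimal sketch with made-up values (not the real recording): the z-score mask flags only the single 8 s beat, and interpolation replaces it with one ~0.8 s value, so roughly 7 s vanish from the cumulative sum.

```python
import numpy as np
import pandas as pd
from scipy import stats

# toy series: one 8 s outlier among ~0.8 s beats
rr = pd.Series([0.80, 0.79, 0.81, 8.00, 0.80, 0.82, 0.78])
mask = np.abs(stats.zscore(rr)) > 1    # flags only the 8.00
cleaned = rr.mask(mask).interpolate()  # 8.00 -> (0.81 + 0.80) / 2
print(rr.sum(), cleaned.sum())         # 12.8 vs ~5.6: about 7.2 s lost
```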
What is the best way to do this?
Here is one approach:
from scipy import stats
import pandas as pd
import numpy as np
np.random.seed(0) # for reproducibility
RRs = np.random.randint(40, 60, size=(8)).astype("float") / 100
desired_sum = 10
# adding 3 outliers at index `2, 3, 6`
RRs[[2, 3, 6]] += (desired_sum - RRs.sum())/3
df_rrs = pd.DataFrame(RRs, columns=['RRs'])
RRs
0 0.52
1 0.55
2 2.44 # outlier (consecutive: to be grouped)
3 2.47 # outlier (consecutive: to be grouped)
4 0.43
5 0.47
6 2.53 # outlier
7 0.59
Code
# mask for outliers (wrapped as a Series so it still aligns after the later reindex)
mask = pd.Series(np.abs(stats.zscore(df_rrs['RRs'])) > 1, index=df_rrs.index)
# store outliers
outliers = df_rrs.loc[mask, 'RRs']
# group consecutive outliers
group_outliers = (outliers.index.to_series().diff() != 1).cumsum()
# grouper values as first index value per outlier group
grouper = group_outliers.index.to_series().mask(group_outliers.duplicated()).ffill()
# get sum per group
outliers_grouped = outliers.groupby(grouper).sum()
# determine rows needed per group
rows = np.round(outliers_grouped / df_rrs.loc[~mask, 'RRs'].mean())
# isolate index values we no longer want
outliers_excluded = outliers.index.difference(outliers_grouped.index)
# get index `df_rrs` without index values we no longer want
reindex = df_rrs.index.difference(outliers_excluded)
# reindex with `np.repeat` to get appropriate repeats for index values in `rows`
df_rrs = df_rrs.reindex(
np.repeat(reindex, rows.reindex(reindex, fill_value=1))
)
# interpolate
df_rrs['RRs'] = df_rrs['RRs'].mask(mask).interpolate()
# scale interpolated values
df_rrs.loc[mask, 'RRs'] = (
df_rrs.loc[mask, 'RRs'] * (outliers_grouped
/ df_rrs.loc[mask, 'RRs'].groupby(level=0).sum())
)
Output:
RRs
0 0.520000
1 0.550000
2 0.540191 # index value 2, outlier interpolated (grouped)
2 0.529260
2 0.518328
2 0.507397
2 0.496466
2 0.485534
2 0.474603
2 0.463672
2 0.452740
2 0.441809
4 0.430000
5 0.470000
6 0.467811 # index value 6, outlier interpolated
6 0.486906
6 0.506000
6 0.525094
6 0.544189
7 0.590000
A sum check with np.isclose:
np.isclose(df_rrs['RRs'].sum(), desired_sum)
# True
Note that inherent floating-point precision issues can cause a slight difference between the sum and desired_sum, so df_rrs['RRs'].sum() == desired_sum may not always be True (as it happens, it is here). That seems unavoidable.
Explanation / intermediates
With the mask, we first store the outliers:
outliers
2    2.44
3    2.47
6    2.53
Name: RRs, dtype: float64
Two of them are consecutive (index 2, 3), so they need to be grouped. Applying index.to_series + Series.diff and checking where the difference is not equal to 1, then Series.cumsum, gives the group ids, stored as group_outliers. From group_outliers we build the grouper: again index.to_series + Series.mask, via Series.duplicated, with Series.ffill on the result. A df.groupby then gets the sum per group, stored as outliers_grouped.
grouper
2    2.0  # one group
3    2.0  # one group
6    6.0
dtype: float64
outliers.groupby(grouper).sum()
2.0    4.91
6.0    2.53
Name: RRs, dtype: float64
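The consecutive-grouping idiom can be exercised on its own; a minimal sketch with the same three outliers:

```python
import pandas as pd

s = pd.Series([2.44, 2.47, 2.53], index=[2, 3, 6])
# a new group starts wherever the index jumps by more than 1
group_ids = (s.index.to_series().diff() != 1).cumsum()
# label every member of a group with the group's first index value
grouper = group_ids.index.to_series().mask(group_ids.duplicated()).ffill()
print(s.groupby(grouper).sum())  # 2.0 -> 4.91, 6.0 -> 2.53
```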
Next, np.round + outliers_grouped divided by the mean of the correct values tells us how many rows each group needs. rows:
2.0    10.0
6.0     5.0
Name: RRs, dtype: float64
Now reindex with df.reindex. First exclude the index values we cut away while grouping: index.difference between outliers.index and outliers_grouped.index.
outliers.index.difference(outliers_grouped.index)
Index([3], dtype='int64')
Take the difference between df_rrs.index and outliers_excluded, which here drops index 3. Then reindex with np.repeat, getting 10 rows for index 2 and 5 rows for index 6 (per rows), and 1 row for every other index value (fill_value=1).
df_rrs.index # reindexed
Index([0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 5, 6, 6, 6, 6, 6, 7], dtype='int64')
Interpolate with Series.interpolate. Finally, scale the interpolated values so that each group sums to its original outlier total and the overall desired_sum is preserved. For this we need another df.groupby on the interpolated values to get the group sums as input: interpolated * (outliers_grouped / grouped interpolated sum).
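Put together, the whole procedure can be wrapped in a helper. This is a sketch under an assumed name (expand_outliers is mine, not from the code above), with the z-score computed in pandas so the mask stays an index-aligned Series after the reindex:

```python
import numpy as np
import pandas as pd

def expand_outliers(rrs, z_thresh=1.0):
    """Hypothetical helper: replace each run of consecutive outliers with
    enough interpolated rows that the run's original sum is preserved."""
    df = pd.DataFrame({'RRs': np.asarray(rrs, dtype=float)})
    s = df['RRs']
    # z-score as a pandas Series (ddof=0 matches scipy.stats.zscore)
    mask = (s - s.mean()).abs() / s.std(ddof=0) > z_thresh
    outliers = s[mask]
    if outliers.empty:
        return s
    # group consecutive outliers; label each group by its first index value
    group_ids = (outliers.index.to_series().diff() != 1).cumsum()
    grouper = group_ids.index.to_series().mask(group_ids.duplicated()).ffill().astype(int)
    grouped = outliers.groupby(grouper).sum()
    # rows needed per group, based on the mean of the non-outlier values
    rows = np.round(grouped / s[~mask].mean()).astype(int)
    # drop each group's non-leading index values, then repeat the leaders
    keep = df.index.difference(outliers.index.difference(grouped.index))
    df = df.reindex(np.repeat(keep, rows.reindex(keep, fill_value=1).astype(int)))
    df['RRs'] = df['RRs'].mask(mask).interpolate()
    # rescale each interpolated run so it sums to the original group total
    scale = grouped / df.loc[mask, 'RRs'].groupby(level=0).sum()
    df.loc[mask, 'RRs'] = df.loc[mask, 'RRs'] * scale
    return df['RRs']

np.random.seed(0)  # rebuild the toy data from earlier
RRs = np.random.randint(40, 60, size=8).astype(float) / 100
RRs[[2, 3, 6]] += (10 - RRs.sum()) / 3
out = expand_outliers(RRs)
print(np.isclose(out.sum(), 10))  # True
```

On the toy data this produces the same 20-row result as above while keeping the total at 10 (up to floating-point precision).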