How to remove duplicates from a correlation in pandas?

Question

I have a problem with my results:

dataCorr = data.corr(method='pearson')
dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()
dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]

Starting from my correlation matrix:

dataCorr = data.corr(method='pearson')

I convert this matrix into columns:

dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()

Then I remove the matrix diagonal:

dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]

But I still have duplicated pairs:

level_0             level_1             0
LiftPushSpeed       RT1EntranceSpeed    0.881714
RT1EntranceSpeed    LiftPushSpeed       0.881714

How can I avoid this problem?

Answer 1

You can convert the lower-triangle values to `NaN`, and `stack` removes them:

import numpy as np
import pandas as pd

np.random.seed(12)

data = pd.DataFrame(np.random.randint(20, size=(5,6)))
print (data)
    0   1   2  3   4   5
0  11   6  17  2   3   3
1  12  16  17  5  13   2
2  11  10   0  8  12  13
3  18   3   4  3   1   0
4  18  18  16  6  13   9

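dataCorr = data.corr(method='pearson')
# blank out the diagonal and everything below it
# (np.bool was deprecated in NumPy 1.20; Answer 3 gives a dtype=bool variant)
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(np.bool))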
print (dataCorr)
    0         1         2         3         4         5
0 NaN  0.042609 -0.041656 -0.113998 -0.173011 -0.201122
1 NaN       NaN  0.486901  0.567216  0.914260  0.403469
2 NaN       NaN       NaN -0.412853  0.157747 -0.354012
3 NaN       NaN       NaN       NaN  0.823628  0.858918
4 NaN       NaN       NaN       NaN       NaN  0.635730
5 NaN       NaN       NaN       NaN       NaN       NaN

# in your data, change 0.5 to 0.7
dataCorr = dataCorr[abs(dataCorr) >= 0.5].stack().reset_index()
print (dataCorr)
   level_0  level_1         0
0        1        3  0.567216
1        1        4  0.914260
2        3        4  0.823628
3        3        5  0.858918
4        4        5  0.635730

Details:

print (np.tril(np.ones(dataCorr.shape)))
[[ 1.  0.  0.  0.  0.  0.]
 [ 1.  1.  0.  0.  0.  0.]
 [ 1.  1.  1.  0.  0.  0.]
 [ 1.  1.  1.  1.  0.  0.]
 [ 1.  1.  1.  1.  1.  0.]
 [ 1.  1.  1.  1.  1.  1.]]
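
The same steps can be wrapped in a small helper. This is only a sketch: the function name `upper_triangle_pairs`, the `corr` column name and the demo data are made up for illustration, and it uses `dtype=bool` instead of `np.bool`. The explicit `dropna()` is an assumption on my part so that the result does not depend on whether the installed pandas version drops `NaN` inside `stack()`.

```python
import numpy as np
import pandas as pd


def upper_triangle_pairs(df, threshold=0.7):
    """Return each strongly correlated column pair exactly once."""
    corr = df.corr(method="pearson")
    # True on and below the diagonal: the cells to blank out.
    lower = np.tril(np.ones(corr.shape, dtype=bool))
    upper = corr.mask(lower)
    # Cells under the threshold become NaN as well.
    strong = upper[upper.abs() >= threshold]
    # dropna() is explicit so NaN cells are removed regardless of
    # how stack() treats them in the installed pandas version.
    return strong.stack().dropna().reset_index(name="corr")


# Hypothetical demo data: "b" is nearly a copy of "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=100),
    "c": rng.normal(size=100),
})
pairs = upper_triangle_pairs(demo)
print(pairs)
```

Each pair shows up once because only the cells above the diagonal survive the mask, so there is no mirrored copy left to filter out afterwards.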

Answer 2

Although you have removed the diagonal elements, I'm afraid that is all your code currently does.

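To solve the duplicate problem, I sorted the two column names of each row and joined them into a single key column, filtered out the duplicate keys, and then dropped the key column.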

Here is a complete example -

import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

dataCorr = data.corr(method='pearson')
dataCorr = dataCorr[abs(dataCorr) >= 0.01].stack().reset_index()
dataCorr = dataCorr[dataCorr['level_0'].astype(str)!=dataCorr['level_1'].astype(str)]

# filtering out lower/upper triangular duplicates 
dataCorr['ordered-cols'] = dataCorr.apply(lambda x: '-'.join(sorted([x['level_0'],x['level_1']])),axis=1)
dataCorr = dataCorr.drop_duplicates(['ordered-cols'])
dataCorr.drop(['ordered-cols'], axis=1, inplace=True)

print(dataCorr)
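
A possible variation on the same idea, sketched under my own assumptions: instead of a row-wise `apply` with a string join, sort the two names within each row with `np.sort` and let `duplicated()` find the repeats. The small `pairs` frame below is hand-built from the two rows shown in the question, and the names `names`, `is_repeat` and `deduped` are only illustrative. Since nothing is joined into a string, this should also cope with non-string labels such as the integer columns in Answer 1.

```python
import numpy as np
import pandas as pd

# Hand-built long-format result, shaped like the output in the question.
pairs = pd.DataFrame({
    "level_0": ["LiftPushSpeed", "RT1EntranceSpeed"],
    "level_1": ["RT1EntranceSpeed", "LiftPushSpeed"],
    0: [0.881714, 0.881714],
})

# Sort the two names within each row so (A, B) and (B, A) become identical.
names = np.sort(pairs[["level_0", "level_1"]].to_numpy(), axis=1)
# duplicated() flags every repeat of a row after its first occurrence.
is_repeat = pd.DataFrame(names, index=pairs.index).duplicated()
deduped = pairs[~is_repeat]
print(deduped)
```

No helper column is added to the frame, so there is nothing to drop afterwards.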

Answer 3

As of August 2024, the code in the accepted answer raises 2 errors (probably due to syntax changes):

`np.bool` is deprecated.
The following error, caused by `pandas` internally trying to invert the mask, `~np.tril(np.ones(dataCorr.shape))`:

TypeError: ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I found that simply moving `dtype=bool` into `np.ones()` makes it work again:

dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape, dtype=bool)))
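
To see why the dtype matters, here is a minimal sketch using NumPy alone (the variable names are mine, not from the accepted answer). The `~` operator is not defined for a float array, which is what `np.ones()` returns by default, while on a boolean array it is an element-wise NOT:

```python
import numpy as np

float_mask = np.tril(np.ones((3, 3)))             # default dtype: float64
bool_mask = np.tril(np.ones((3, 3), dtype=bool))  # dtype: bool

try:
    ~float_mask  # bitwise invert is not defined for floats
    float_failed = False
except TypeError:
    float_failed = True

inverted = ~bool_mask  # element-wise NOT: True only above the diagonal
print(float_failed)
print(inverted)
```

So any operation that inverts the mask needs it to be boolean from the start.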

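Some additional information:

If you don't want your condition to be implicitly inverted, `df.where()` can be used instead of `df.mask()` in `pandas`.
If you want to select the upper triangle with `df.where()`, just use `np.triu()` instead of `np.tril()`. This only matters, though, if you want to plot a heatmap of the correlations.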
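
A short sketch of the `df.where()` route, on a made-up 3x3 matrix (the values are for illustration only). Note that `k=1` is my addition: plain `np.triu()` keeps the diagonal, and `k=1` starts one step above it so the self-correlations are excluded too.

```python
import numpy as np
import pandas as pd

# Made-up correlation matrix, for illustration only.
corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, -0.8],
     [0.2, -0.8, 1.0]],
    index=list("ABC"), columns=list("ABC"),
)

# True strictly above the diagonal, False everywhere else.
keep = np.triu(np.ones(corr.shape, dtype=bool), k=1)
# where() keeps the cells where `keep` is True and puts NaN elsewhere.
upper = corr.where(keep)
print(upper)
```

Here the condition says what to keep rather than what to hide, so nothing has to be inverted.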
