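I recently found this answer, which provides code for a bias-corrected version of Cramer's V for measuring the association between two categorical variables: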
import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
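However, if the number of samples n equals the number of categories r of the first feature, then rcorr = r - ((r-1)**2)/(n-1) = n - (n-1) = 1, so rcorr-1 = 0. Whenever (kcorr-1) is non-negative, min((kcorr-1), (rcorr-1)) is therefore 0, and np.sqrt(phi2corr / min((kcorr-1), (rcorr-1))) divides by zero. I confirmed this with a simple example: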
import pandas as pd

data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
]
df = pd.DataFrame(data)
# n = 4 (number of samples), r = 4 (number of unique names), k = 3 (number of unique occupations)
# .to_numpy() makes confusion_matrix.sum() inside the function return a scalar total
confusion_matrix = pd.crosstab(df['name'], df['occupation']).to_numpy()
print(cramers_corrected_stat(confusion_matrix))
Output:
/tmp/ipykernel_227998/749514942.py:45: RuntimeWarning: invalid value encountered in scalar divide
  return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
nan
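To see where the nan comes from, the intermediate quantities can be printed for the same table. This is only a minimal sketch: the array `table` is the 4 x 3 crosstab from the example written out by hand (an assumption about its row and column order: names in rows, occupations sorted alphabetically in columns), and the formulas are copied from the function above.

```python
import numpy as np
import scipy.stats as ss

# Hand-written copy of the name-by-occupation crosstab (assumed layout:
# rows Alice, Bob, Carol, Doug; columns fisherman, scientist, therapist).
table = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
])

chi2 = ss.chi2_contingency(table)[0]
n = table.sum()
r, k = table.shape
phi2 = chi2 / n
phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
rcorr = r - (r - 1) ** 2 / (n - 1)
kcorr = k - (k - 1) ** 2 / (n - 1)

# Print each piece of the final ratio separately.
print(f"n={n}, r={r}, k={k}")
print(f"phi2={phi2:.3f}, phi2corr={phi2corr:.3f}")
print(f"rcorr-1={rcorr - 1:.3f}, kcorr-1={kcorr - 1:.3f}")
```

With n = r = 4, rcorr - 1 is exactly 0, and the corrected numerator phi2corr is also 0 for this table, so the final ratio is 0/0. That is consistent with the "invalid value" warning and the nan result rather than inf.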
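Is this the expected behavior?
If so, how should I use the corrected Cramer's V when n = r (or, by symmetry, n = k), for example when every sample has a unique value for some feature?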
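You can handle the division by zero that occurs when n = r by checking the denominator before dividing and returning 0 when it is not positive. I also compute n with confusion_matrix.sum().sum(), so that it is a scalar when the input is a pandas DataFrame. I modified your function as follows.
Your original function: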
def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
becomes:
def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramers V statistic for categorical-categorical association.
    Uses correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    denominator = min((kcorr-1), (rcorr-1))
    if denominator <= 0:
        return 0
    else:
        return np.sqrt(phi2corr / denominator)
With your sample data (n = 4, r = 4, k = 3):
data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
]
df = pd.DataFrame(data)
confusion_matrix = pd.crosstab(df['name'], df['occupation'])
result = cramers_corrected_stat(confusion_matrix)
print(f"Cramer's V Result: {result}")
you get:
Cramer's V Result: 0
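As a quick sanity check that the guard only changes the degenerate case, the guarded function can be run on a table where n is well above r and k. This is a minimal sketch: the `associated` table is made-up illustrative data, not taken from the question, and `degenerate` is a hand-written copy of the question's 4 x 3 crosstab.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """Bias-corrected Cramer's V, returning 0 when the denominator is not positive."""
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  # scalar total for a DataFrame input
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    denominator = min(kcorr - 1, rcorr - 1)
    if denominator <= 0:
        return 0
    return np.sqrt(phi2corr / denominator)

# Hypothetical 3 x 3 table with 36 observations concentrated on the diagonal.
associated = pd.DataFrame([[10, 1, 1], [1, 10, 1], [1, 1, 10]])
# Degenerate table from the question: n = r = 4, one observation per row.
degenerate = pd.DataFrame([[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 0]])

print(cramers_corrected_stat(associated))   # ordinary case, guard not triggered
print(cramers_corrected_stat(degenerate))   # guard triggered, returns 0
```

Returning 0 reports the degenerate case as "no association". Returning np.nan instead would mark it as undefined; which of the two fits better is a judgment call that depends on how the result is used downstream.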