When n = r

Question

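I recently found this answer, which provides code for a bias-corrected version of Cramer's V for measuring the association between two categorical variables: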

import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher, 
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))

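However, if the number of samples n equals the number of categories r of the first feature, then rcorr = n - (n-1) = 1, which causes a division by zero in np.sqrt(phi2corr / min((kcorr-1), (rcorr-1))) whenever (kcorr-1) is non-negative. I confirmed this with a simple example: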

import pandas as pd

data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
    ]

df = pd.DataFrame(data) 

confusion_matrix = pd.crosstab(df['name'], df['occupation']) # n = 4 (number of samples), r = 4 (number of unique names), k = 3 (number of unique occupations)
print(cramers_corrected_stat(confusion_matrix))

Output:

/tmp/ipykernel_227998/749514942.py:45: RuntimeWarning: invalid value encountered in scalar divide
  return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
nan

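Is this the expected behavior?

If so, how should I use the corrected Cramer's V when n = r (or, symmetrically, n = k), for example when every sample has a unique value for some feature?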
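For the 4x3 table in the example above, the intermediate quantities can be checked by hand. The following is a minimal pure-Python sketch: the table literal and the hand-rolled chi-squared sum are assumptions standing in for pd.crosstab and ss.chi2_contingency, and no continuity correction is applied, which is assumed to match the behavior for tables larger than 2x2.

```python
# Contingency table from the example: rows are names, columns are occupations
# in sorted order (fisherman, scientist, therapist). Hypothetical stand-in for
# the pd.crosstab result.
table = [
    [0, 0, 1],  # Alice
    [1, 0, 0],  # Bob
    [0, 1, 0],  # Carol
    [0, 1, 0],  # Doug
]

n = sum(sum(row) for row in table)
r, k = len(table), len(table[0])
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(r)
    for j in range(k)
)

# Same correction terms as in cramers_corrected_stat.
phi2 = chi2 / n
phi2corr = max(0, phi2 - (k - 1) * (r - 1) / (n - 1))
rcorr = r - (r - 1) ** 2 / (n - 1)
kcorr = k - (k - 1) ** 2 / (n - 1)
denominator = min(kcorr - 1, rcorr - 1)

print(chi2, phi2corr, rcorr, kcorr, denominator)
```

For this table both phi2corr and the denominator evaluate to zero, so the final division is 0/0, which is consistent with the "invalid value" warning and the nan result shown above.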

Answer 1

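You can handle the division by zero when n = r by checking the denominator before dividing and returning 0 when it is not positive. I modified your function as follows (n is also computed with .sum().sum(), so that it is a scalar when confusion_matrix is a pandas DataFrame).

Your original function: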

def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorial-categorial association.
        uses correction from Bergsma and Wicher, 
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))

becomes:

def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramers V statistic for categorical-categorical association.
       Uses correction from Bergsma and Wicher,
       Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    
    denominator = min((kcorr-1), (rcorr-1))
    if denominator <= 0:
        return 0
    else:
        return np.sqrt(phi2corr / denominator)

With your sample data (n = 4, r = 4, k = 3):

data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
]

df = pd.DataFrame(data)

confusion_matrix = pd.crosstab(df['name'], df['occupation']) 
result = cramers_corrected_stat(confusion_matrix)
print(f"Cramer's V Result: {result}")

you get:

Cramer's V Result: 0
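
As a rough check on when the guard fires, the two denominator terms can be factored using the same symbols as the code. This derivation is an added sketch, and it assumes every row and column category is observed at least once, so that r ≤ n and k ≤ n.

```latex
\begin{aligned}
r_{\text{corr}} - 1 &= (r-1) - \frac{(r-1)^2}{n-1} = \frac{(r-1)(n-r)}{n-1}, \\
k_{\text{corr}} - 1 &= (k-1) - \frac{(k-1)^2}{n-1} = \frac{(k-1)(n-k)}{n-1}.
\end{aligned}
```

Under that assumption the denominator is never negative, and it is zero exactly when r = n, k = n, r = 1, or k = 1, so the denominator <= 0 check catches these degenerate tables. In the example, r = n = 4 gives rcorr - 1 = 0, while kcorr - 1 = (2)(1)/3 ≈ 0.667.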
