I have two data frames, primary_tumor_df and healthy_tissue_df, on which I want to run the Mann-Whitney U test. I have also removed the nan values from both data frames.
Structure of primary_tumor_df.
Structure of healthy_tissue_df.
primary_tumor_df.dropna(inplace=True)
healthy_tissue_df.dropna(inplace=True)
This shows that there are no nan or null values.
But when I run the test, it gives me the following error:
from scipy.stats import mannwhitneyu
p_value_dict = {}
for gene in primary_tumor_df.columns:
    stats, p_value = mannwhitneyu(primary_tumor_df[gene], healthy_tissue_df[gene],
                                  alternative='two-sided')
    p_value_dict[gene] = p_value
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[9], line 4
2 p_value_dict = {}
3 for gene in primary_tumor_df.columns:
----> 4 stats, p_value = mannwhitneyu(primary_tumor_df[gene],
5 healthy_tissue_df[gene],
6 alternative='two-sided')
7 p_value_dict[gene] = p_value
9 # converting into DataFrame
File ~/.local/lib/python3.10/site-packages/scipy/stats/_axis_nan_policy.py:502, in _axis_nan_policy_factory.<locals>.axis_nan_policy_decorator.<locals>.axis_nan_policy_wrapper(***failed resolving arguments***)
500 if sentinel:
501 samples = _remove_sentinel(samples, paired, sentinel)
--> 502 res = hypotest_fun_out(*samples, **kwds)
503 res = result_to_tuple(res)
504 res = _add_reduced_axes(res, reduced_axes, keepdims)
File ~/.local/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py:460, in mannwhitneyu(x, y, use_continuity, alternative, axis, method)
249 @_axis_nan_policy_factory(MannwhitneyuResult, n_samples=2)
250 def mannwhitneyu(x, y, use_continuity=True, alternative="two-sided",
251 axis=0, method="auto"):
252 r'''Perform the Mann-Whitney U rank test on two independent samples.
253
254 The Mann-Whitney U test is a nonparametric test of the null hypothesis
(...)
456
457 '''
459 x, y, use_continuity, alternative, axis_int, method = (
--> 460 _mwu_input_validation(x, y, use_continuity, alternative, axis, method))
462 x, y, xy = _broadcast_concatenate(x, y, axis)
464 n1, n2 = x.shape[-1], y.shape[-1]
File ~/.local/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py:200, in _mwu_input_validation(x, y, use_continuity, alternative, axis, method)
198 # Would use np.asarray_chkfinite, but infs are OK
199 x, y = np.atleast_1d(x), np.atleast_1d(y)
--> 200 if np.isnan(x).any() or np.isnan(y).any():
201 raise ValueError('`x` and `y` must not contain NaNs.')
202 if np.size(x) == 0 or np.size(y) == 0:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Why does it raise this error even though the data frames contain no nan values?
The problem is that at least one column in primary_tumor_df or healthy_tissue_df has object dtype, not that either of them contains NaNs.
You can tell because the line that ultimately raises the error,
if np.isnan(x).any() or np.isnan(y).any():
is checking the mannwhitneyu inputs x and y for NaNs, and it complains:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Numeric dtypes do not raise this error:
import numpy as np
for dtype in [np.uint8, np.int16, np.float32, np.complex64]:
    x = np.arange(10).astype(dtype)  # actually use each dtype in turn
    np.isnan(x)  # no error
and that holds whether or not they contain NaNs:
y = x.copy()
y[0] = np.nan
np.isnan(y) # no error
After all, the purpose of isnan is to find NaNs and report their locations in a boolean array. The problem is with non-numeric dtypes:
x = np.asarray(x, dtype=object)
np.isnan(x) # error
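To find out which columns are affected, you can inspect the dtypes directly. This is a minimal sketch on a made-up frame; the column names gene_a and gene_b are placeholders, not taken from your data.

```python
import pandas as pd

# Hypothetical stand-in for one of the data frames: one float column and
# one column that holds numbers but is stored with object dtype.
df = pd.DataFrame({
    "gene_a": pd.Series([0.1, 0.2, 0.3], dtype="float64"),
    "gene_b": pd.Series([1.0, 2.0, 3.0], dtype=object),
})

# One dtype per column; anything listed as "object" will trip np.isnan
print(df.dtypes)

# Names of the columns stored as object dtype
object_columns = df.select_dtypes(include="object").columns.tolist()
print(object_columns)
```

Running the same two checks on primary_tumor_df and healthy_tissue_df should show which columns need converting.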
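If the data really is numeric but pandas is storing it as a more general object type, you should be able to fix the problem by converting it to a floating-point type before passing it to SciPy.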
import numpy as np
import pandas as pd
from scipy import stats
rng = np.random.default_rng(435982435982345)
primary_tumor_df = pd.DataFrame(rng.random((10, 3)).astype(object))
healthy_tissue_df = pd.DataFrame(rng.random((10, 3)).astype(object))
# generates your error:
# for gene in primary_tumor_df.columns:
#     res = stats.mannwhitneyu(primary_tumor_df[gene],
#                              healthy_tissue_df[gene],
#                              alternative='two-sided')

# no error
for gene in primary_tumor_df.columns:
    res = stats.mannwhitneyu(primary_tumor_df[gene].astype(np.float64),
                             healthy_tissue_df[gene].astype(np.float64),
                             alternative='two-sided')
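If astype itself fails, the column probably contains something that is not a number, such as a stray string. I can't tell from the question whether that applies to your data, so treat the following as a sketch under that assumption; the example values are invented.

```python
import pandas as pd

# Hypothetical column: numbers stored as object, plus one stray string
raw = pd.Series([0.5, "1.5", "unknown", 2.0], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# The coerced entries are now real NaNs, so drop them before testing
clean = numeric.dropna()
print(numeric.dtype, len(clean))
```

Note that coercion introduces new NaNs, so dropna has to run after the conversion, not before it.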
Also note that you don't need the for loop. mannwhitneyu is vectorized, and by default it works along axis=0, that is, down your columns.
res = stats.mannwhitneyu(primary_tumor_df.astype(np.float64),
                         healthy_tissue_df.astype(np.float64),
                         alternative='two-sided')
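Since your loop was filling a p_value_dict, here is one way the vectorized result could be mapped back to gene names. It is a sketch on random data with made-up column names; I convert with to_numpy(dtype=...) here only so that plain NumPy arrays are what reaches SciPy.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
genes = ["gene_a", "gene_b", "gene_c"]  # hypothetical gene names

# Object-dtype frames, mimicking the situation in the question
primary_tumor_df = pd.DataFrame(rng.random((10, 3)), columns=genes).astype(object)
healthy_tissue_df = pd.DataFrame(rng.random((12, 3)), columns=genes).astype(object)

# One call tests every column; the row counts may differ between frames
res = stats.mannwhitneyu(primary_tumor_df.to_numpy(dtype=np.float64),
                         healthy_tissue_df.to_numpy(dtype=np.float64),
                         alternative='two-sided')

# res.pvalue holds one p-value per column, in column order
p_values = pd.Series(res.pvalue, index=primary_tumor_df.columns, name="p_value")
p_value_dict = p_values.to_dict()
print(p_values)
```

This relies on both frames having the same columns in the same order, because the columns are paired by position rather than by name.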