11 年后我正在饶有兴趣地阅读这个...
我的代码如下所示:
import numpy as np
from scipy.stats import mannwhitneyu
from tqdm import tqdm
# Basics
runs = 10000
scores = 100000
states = 1000
# Random scores
score = np.random.rand(runs * scores).reshape(runs, scores)
score.shape
# Random states
state = np.random.choice(
[True, False],
p=[0.01, 0.99], # Roughly 1% True, 99% False (on average)
size=scores * states,
).reshape(states, scores)
state.shape
muh_ps = np.zeros((runs, states))
for i in tqdm(range(runs), desc="Runs", total=runs):
sc = score[i]
for st in tqdm(range(states), desc="States", total=states):
rp = sc[state[st]]
rn = sc[~state[st]]
u_statistic, u_pvalue = mannwhitneyu(rp, rn, alternative="two-sided")
muh_ps[i, st] = u_pvalue
print(muh_ps)
我感觉我不应该在列上循环,但我不确定应该如何做......
非常感谢您的指点/优化建议,并对初学者的问题表示歉意。
我尝试过上面显示的“天真的”方法,我原以为它会“慢”......实际上,这还不错:
test_man.py
States: 100%|███████████████████████████████████████| 1000/1000 [00:38<00:00, 25.86it/s]
States: 100%|███████████████████████████████████████| 1000/1000 [00:38<00:00, 25.74it/s]
States: 100%|███████████████████████████████████████| 1000/1000 [00:39<00:00, 25.36it/s]
States: 100%|███████████████████████████████████████| 1000/1000 [00:37<00:00, 26.43it/s]
States: 100%|███████████████████████████████████████| 1000/1000 [00:38<00:00, 26.14it/s]
Runs: 0%| | 5/10000 [03:13<106:48:42, 38.47s/it]
States: 83%|██████████████████████████████████▌ | 827/1000 [00:27<00:06, 25.06it/s]
但是按照这个速度需要5天才能完成...
我尝试将布尔矩阵传递给 mannwhitneyu,但我一定做错了什么。我想知道是否将数据更改为 CuPy,但我认为我需要指导......
你的例子的核心是这样的:
import numpy as np
from scipy import stats
rng = np.random.default_rng(2348923598235)
data = rng.random(size=100)
selectors = rng.random(size=(1000, 100)) > 0.5
def approach1():
res = np.zeros((selectors.shape[0], 2))
for j in range(selectors.shape[0]):
res[j] = stats.mannwhitneyu(data[~selectors[j]], data[selectors[j]])
return res.T
它很慢,因为你必须使用许多不同的“选择器”重复执行测试,并且一切都在Python中循环。
从概念上讲,您希望对代码进行矢量化并避免显式循环。理想情况下,您可以将 2D 数组传递给
mannwhitneyu
,并沿两个轴之一执行计算。这很棘手,因为您的“选择器”具有不同数量的 True/False,因此 2D 数组的每个切片将具有不同数量的元素;它们将是“参差不齐”的数组。
您可以通过传递矩形数组并使用 NaN 屏蔽不需要的元素并使用
nan_policy='omit'
来避免参差不齐的数组。
def approach2():
x = np.broadcast_to(data, selectors.shape)
x = x.copy()
y = x.copy()
x[selectors] = np.nan
y[~selectors] = np.nan
res = stats.mannwhitneyu(x, y, axis=-1, nan_policy='omit')
return res.statistic, res.pvalue
这不会快得多,因为在幕后,Python 循环仍然存在。 (对于大多数具有 scipy.stats
的
nan_policy='True'
函数都是如此,但是许多函数都是矢量化的,如果 nan_policy
不是 True
,则不需要循环。不过,您只能通过查看代码来判断。)ref = approach1()
res2 = approach2()
np.testing.assert_allclose(res2, ref)
%timeit approach1() # 384 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit approach2() # 151 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
但是,如果您对代码的工作原理有很好的了解,则可以将其组合在一起以避免循环,并且仍然忽略 NaN。
def approach3(data, selectors):
x, i = data, selectors
x = np.broadcast_to(x, i.shape)
x = x.copy()
y = x.copy()
x[i] = np.nan
y[~i] = np.nan
x.sort(axis=-1)
x, y, xy = _broadcast_concatenate(x, y, axis=-1)
n2 = np.count_nonzero(i, axis=-1)
n1 = i.shape[-1] - n2
ranks, t = _rankdata(xy, 'average', return_ties=True)
R1 = np.take_along_axis(ranks.cumsum(axis=-1), n1[:, np.newaxis] - 1, axis=-1)
R1 = R1[..., 0]
U1 = R1 - n1 * (n1 + 1) / 2
U2 = n1 * n2 - U1 # as U1 + U2 = n1 * n2
U = np.maximum(U1, U2) # multiply SF by two for two-sided test
z = _get_mwu_z(U, n1, n2, t, axis=-1, continuity=True)
p = stats.norm.sf(z) * 2
p = np.clip(p, 0, 1)
return U1, p
这会产生相同的结果,而且速度要快得多。
ref = approach1()
res3 = approach3(data, selectors)
np.testing.assert_allclose(res3, ref)
%timeit approach3(data, selectors) # 19.4 ms ± 95.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)