scipy 中 Mann-Whitney U 测试的优化建议?

问题描述 投票:0回答:1

11 年后我正在饶有兴趣地阅读这个...

我的代码如下所示:

import numpy as np
from scipy.stats import mannwhitneyu

from tqdm import tqdm

# Basics
runs = 10000
scores = 100000
states = 1000

# Random scores
score = np.random.rand(runs * scores).reshape(runs, scores)
score.shape

# Random states
state = np.random.choice(
    [True, False],
    p=[0.01, 0.99],  # Roughly 1% True, 99% False (on average)
    size=scores * states,
).reshape(states, scores)
state.shape

muh_ps = np.zeros((runs, states))

for i in tqdm(range(runs), desc="Runs", total=runs):
    sc = score[i]

    for st in tqdm(range(states), desc="States", total=states):
        rp = sc[state[st]]
        rn = sc[~state[st]]

        u_statistic, u_pvalue = mannwhitneyu(rp, rn, alternative="two-sided")

        muh_ps[i, st] = u_pvalue

print(muh_ps)

我感觉我不应该在列上循环,但我不确定应该如何做......

非常感谢您的指点/优化建议,并对初学者的问题表示歉意。

我尝试过上面显示的“天真的”方法,我原以为它会“慢”......实际上,这还不错:

test_man.py 
States: 100%|███████████████████████████████████████| 1000/1000 [00:38<00:00, 25.86it/s]
States: 100%|███████████████████████████████████████| 1000/1000 [00:38<00:00, 25.74it/s]
States: 100%|███████████████████████████████████████| 1000/1000 [00:39<00:00, 25.36it/s]
States: 100%|███████████████████████████████████████| 1000/1000 [00:37<00:00, 26.43it/s]
States: 100%|███████████████████████████████████████| 1000/1000 [00:38<00:00, 26.14it/s]
Runs:   0%|                                         |   5/10000 [03:13<106:48:42, 38.47s/it]
States:  83%|██████████████████████████████████▌    |  827/1000 [00:27<00:06, 25.06it/s]

但是按照这个速度需要5天才能完成...

我尝试将布尔矩阵传递给 mannwhitneyu,但我一定做错了什么。我想知道是否将数据更改为 CuPy,但我认为我需要指导......

python numpy scipy vectorization
1个回答
0
投票

你的例子的核心是这样的:

import numpy as np
from scipy import stats
rng = np.random.default_rng(2348923598235)

data = rng.random(size=100)
selectors = rng.random(size=(1000, 100)) > 0.5

def approach1():
    res = np.zeros((selectors.shape[0], 2))
    for j in range(selectors.shape[0]):
        res[j] = stats.mannwhitneyu(data[~selectors[j]], data[selectors[j]])
    return res.T

它很慢,因为你必须使用许多不同的“选择器”重复执行测试,并且一切都在Python中循环。

从概念上讲,您希望对代码进行矢量化并避免显式循环。理想情况下,您可以将 2D 数组传递给

mannwhitneyu
,并沿两个轴之一执行计算。这很棘手,因为您的“选择器”具有不同数量的 True/False,因此 2D 数组的每个切片将具有不同数量的元素;它们将是“参差不齐”的数组。

您可以通过传递矩形数组并使用 NaN 屏蔽不需要的元素并使用

nan_policy='omit'
来避免参差不齐的数组。

def approach2():
    x = np.broadcast_to(data, selectors.shape)
    x = x.copy()
    y = x.copy()
    x[selectors] = np.nan
    y[~selectors] = np.nan
    res = stats.mannwhitneyu(x, y, axis=-1, nan_policy='omit')
    return res.statistic, res.pvalue

这不会快得多,因为在幕后,Python 循环仍然存在。 (对于大多数具有 scipy.stats

nan_policy='True'
函数都是如此,但是许多函数都是矢量化的,如果
nan_policy
不是
True
,则不需要循环。不过,您只能通过查看代码来判断。)
ref = approach1()
res2 = approach2()
np.testing.assert_allclose(res2, ref)

%timeit approach1()  # 384 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit approach2()  # 151 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

但是,如果您对代码的工作原理有很好的了解,则可以将其组合在一起以避免循环,并且仍然忽略 NaN。

def approach3(data, selectors): x, i = data, selectors x = np.broadcast_to(x, i.shape) x = x.copy() y = x.copy() x[i] = np.nan y[~i] = np.nan x.sort(axis=-1) x, y, xy = _broadcast_concatenate(x, y, axis=-1) n2 = np.count_nonzero(i, axis=-1) n1 = i.shape[-1] - n2 ranks, t = _rankdata(xy, 'average', return_ties=True) R1 = np.take_along_axis(ranks.cumsum(axis=-1), n1[:, np.newaxis] - 1, axis=-1) R1 = R1[..., 0] U1 = R1 - n1 * (n1 + 1) / 2 U2 = n1 * n2 - U1 # as U1 + U2 = n1 * n2 U = np.maximum(U1, U2) # multiply SF by two for two-sided test z = _get_mwu_z(U, n1, n2, t, axis=-1, continuity=True) p = stats.norm.sf(z) * 2 p = np.clip(p, 0, 1) return U1, p

这会产生相同的结果,而且速度要快得多。

ref = approach1() res3 = approach3(data, selectors) np.testing.assert_allclose(res3, ref) %timeit approach3(data, selectors) # 19.4 ms ± 95.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

	
© www.soinside.com 2019 - 2024. All rights reserved.