Speeding up a numpy correlation routine using the fact that the lists are sorted

Question

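I am currently using Python and numpy to compute the correlation between two lists, data_0 and data_1. Each list contains the corresponding event times, t0 and t1. I want to find all events for which 0 < t1 - t0 < t_max.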

for time_0 in np.nditer(data_0):
    delta_time = np.subtract(data_1, np.full(data_1.size, time_0))
    delta_time = delta_time[delta_time >= 0]
    delta_time = delta_time[delta_time < time_max]

Because the lists are sorted, this effectively selects a subarray of data_1 of the form data_1[index_min:index_max]. So all I really need is to find two indices to get what I want.
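To make the two-index idea concrete, here is a minimal sketch for a single time_0, assuming data_1 is a sorted numpy array. It uses np.searchsorted (binary search), which is not part of the code in the question; the sample values are made up for illustration.

```python
import numpy as np

data_1 = np.array([1, 3, 4, 7, 9, 12])  # sorted event times (made-up values)
time_0 = 3
time_max = 5

# First index i with data_1[i] >= time_0
index_min = np.searchsorted(data_1, time_0, side='left')
# First index i with data_1[i] >= time_0 + time_max
index_max = np.searchsorted(data_1, time_0 + time_max, side='left')

# Same selection as the two boolean masks in the loop (>= 0 and < time_max)
delta_time = data_1[index_min:index_max] - time_0
```

The slice is a view into data_1, so no full-length temporary array is built for each time_0.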

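What is interesting is that when I move on to the next time_0, since data_0 is sorted as well, I only need to find the new index_min and index_max, with new_index_min >= index_min and new_index_max >= index_max. In other words, I do not need to scan all of data_1 again from the start.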
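The monotonic-index idea can be sketched as a two-pointer sweep. This is only an illustration under the assumption that both arrays are sorted and time_max >= 0; the function name sweep_windows is hypothetical and this is not the asker's actual implementation.

```python
import numpy as np

def sweep_windows(data_0, data_1, time_max):
    """Return (index_min, index_max) arrays, one pair per entry of data_0."""
    n0, n1 = len(data_0), len(data_1)
    index_min = np.empty(n0, dtype=int)
    index_max = np.empty(n0, dtype=int)
    lo = hi = 0
    for i in range(n0):
        t0 = data_0[i]
        # Advance lo to the first event with data_1[lo] >= t0
        while lo < n1 and data_1[lo] < t0:
            lo += 1
        # hi never needs to move backwards; keep it at or after lo
        hi = max(hi, lo)
        # Advance hi to the first event with data_1[hi] >= t0 + time_max
        while hi < n1 and data_1[hi] < t0 + time_max:
            hi += 1
        index_min[i] = lo
        index_max[i] = hi
    return index_min, index_max

data_0 = np.array([0, 3, 10])
data_1 = np.array([1, 3, 4, 7, 9, 12])
idx_min, idx_max = sweep_windows(data_0, data_1, 5)
```

Each pointer only moves forward, so the total number of steps is proportional to len(data_0) + len(data_1). The loop itself still runs in the Python interpreter, which is one plausible reason such a version can end up slower than vectorized numpy code.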

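I have implemented this approach without numpy methods (just while loops). It gives the same results as before, but it is much slower (15 times slower!). Since it should need fewer computations, I think there must be a way to make it faster using numpy methods, but I don't know how. Does anyone have an idea?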

I am not sure I was perfectly clear, so if you have any questions, please don't hesitate to ask.

Thanks in advance,

Paul

Answer 1

Here is a vectorized approach using argsort. It uses a strategy similar to your idea of avoiding a full scan:

import numpy as np

def find_gt(ref, data, incl=True):
    out = np.empty(len(ref) + len(data) + 1, int)
    total = (data, ref) if incl else (ref, data)
    out[1:] = np.argsort(np.concatenate(total), kind='mergesort')
    out[0] = -1
    split = (out < len(data)) if incl else (out >= len(ref))
    if incl:
        out[~split] -= len(data)
        split[0] = False
    return np.maximum.accumulate(np.where(split, -1, out))[split] + 1

def find_intervals(ref, data, span, incl=(True, True)):
    index_min = find_gt(ref, data, incl[0])
    index_max = len(ref) - find_gt(-ref[::-1], -span-data[::-1], incl[1])[::-1]
    return index_min, index_max

ref = np.sort(np.random.randint(0,20000,(10000,)))
data = np.sort(np.random.randint(0,20000,(10000,)))
span = 2

idmn, idmx = find_intervals(ref, data, span, (True, True))

print('checking')
for d,mn,mx in zip(data, idmn, idmx):
    assert mn == len(ref) or ref[mn] >= d
    assert mn == 0 or ref[mn-1] < d
    assert mx == len(ref) or ref[mx] > d+span
    assert mx == 0 or ref[mx-1] <= d+span
print('ok')

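How it works

Argsort the two sets together (an indirect sort of their concatenation).
For each time in one set, find the preceding time in the other set; this is done using np.maximum.accumulate.
The preceding steps are applied twice; the second time, the times in one set are shifted by span.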
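The first two steps can be traced on a tiny example. This is a simplified sketch of the technique and not the find_gt function itself: it covers only the inclusive lower bound, and the variable names and sample values are made up for illustration.

```python
import numpy as np

ref = np.array([2, 5, 5, 9])    # sorted reference times (made-up values)
data = np.array([1, 5, 6, 10])  # sorted query times (made-up values)

# Step 1: stable argsort of the concatenation. data comes first, so on ties
# a data entry is placed before equal ref entries.
order = np.argsort(np.concatenate((data, ref)), kind='stable')
is_data = order < len(data)

# Step 2: in merged order, write each ref entry's own index and -1 for data
# entries; the running maximum is then the index of the last ref entry seen.
ref_idx = np.where(is_data, -1, order - len(data))
last_ref_before = np.maximum.accumulate(ref_idx)

# For each data entry: index of the first ref element >= that entry
index_min = last_ref_before[is_data] + 1
```

On this input the result agrees with np.searchsorted(ref, data, side='left'). The upper bound follows from running the same steps on the times shifted by span, as the third step describes.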
