Most efficient way to count the frequencies of pairs of numbers in a 2D NumPy array

Question

Suppose I have the following 2D array:

import numpy as np

np.random.seed(123)
a = np.random.randint(1, 6, size=(5, 3))

which yields:

In [371]: a
Out[371]:
array([[3, 5, 3],
       [2, 4, 3],
       [4, 2, 2],
       [1, 2, 2],
       [1, 1, 2]])

Is there a more efficient way (NumPy, Pandas, etc.) to compute the frequencies of all pairs of numbers than the following solution?

from collections import Counter
from itertools import combinations

def pair_freq(a, sort=False, sort_axis=-1):
    a = np.asarray(a)
    if sort:
        a = np.sort(a, axis=sort_axis)
    res = Counter()
    for row in a:
        res.update(combinations(row, 2))
    return res

res = pair_freq(a)

which yields something like this:

In [38]: res
Out[38]:
Counter({(3, 5): 1,
         (3, 3): 1,
         (5, 3): 1,
         (2, 4): 1,
         (2, 3): 1,
         (4, 3): 1,
         (4, 2): 2,
         (2, 2): 2,
         (1, 2): 4,
         (1, 1): 1})

or:

In [39]: res.most_common()
Out[39]:
[((1, 2), 4),
 ((4, 2), 2),
 ((2, 2), 2),
 ((3, 5), 1),
 ((3, 3), 1),
 ((5, 3), 1),
 ((2, 4), 1),
 ((2, 3), 1),
 ((4, 3), 1),
 ((1, 1), 1)]

PS: the resulting dataset may look different, for example a multi-index Pandas DataFrame or something else.

I tried increasing the dimensionality of the a array and using np.isin() together with a list of all pair combinations, but I still couldn't get rid of the loop.

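UPDATE:

(a) Are you interested only in the frequencies of 2-number combinations (and not in the frequencies of 3-number combinations)?

Yes, I'm only interested in combinations of pairs (2 numbers).

(b) Do you want to treat (3, 5) as distinct from (5, 3), or do you want to treat them as two occurrences of the same thing?

Actually, both approaches are fine. I can always sort my array beforehand if needed:

a = np.sort(a, axis=1)

UPDATE2:

Do you want the distinction between (a, b) and (b, a) to arise only from the source columns of a and b, or otherwise as well? To understand this question, consider the three rows [[1,2,1], [3,1,2], [1,2,5]]. What do you think the output should be here? What should the distinct 2-tuples be, and what should their frequencies be?

In [40]: a = np.array([[1,2,1],[3,1,2],[1,2,5]])

In [41]: a
Out[41]:
array([[1, 2, 1],
       [3, 1, 2],
       [1, 2, 5]])

I would expect the following result:

In [42]: pair_freq(a).most_common()
Out[42]:
[((1, 2), 3),
 ((1, 1), 1),
 ((2, 1), 1),
 ((3, 1), 1),
 ((3, 2), 1),
 ((1, 5), 1),
 ((2, 5), 1)]

because it is more flexible. If I want to count (a, b) and (b, a) as the same pair of elements, I can do this:

In [43]: pair_freq(a, sort=True).most_common()
Out[43]:
[((1, 2), 4),
 ((1, 1), 1),
 ((1, 3), 1),
 ((2, 3), 1),
 ((1, 5), 1),
 ((2, 5), 1)]

Answer 1

If your elements are nonnegative integers that are not too large, bincount is fast: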

from collections import Counter
from itertools import combinations
import numpy as np

def pairs(a):
    M = a.max() + 1
    a = a.T
    return sum(np.bincount((M * a[j] + a[j+1:]).ravel(), None, M*M)
               for j in range(len(a) - 1)).reshape(M, M)

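def pairs_F_3(a):
    # special case for exactly 3 columns; judging by the indexing, this
    # appears to expect `a` already transposed to shape (3, n), and it
    # returns the flat counts without reshaping to (M, M)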
    M = a.max() + 1
    return (np.bincount(a[1:].ravel() + M*a[:2].ravel(), None, M*M) +
            np.bincount(a[2].ravel() + M*a[0].ravel(), None, M*M))

def pairs_F(a):
    M = a.max() + 1
    a = np.ascontiguousarray(a.T) # contiguous columns (rows after .T)
                                  # typically appear to perform better
                                  # thanks @ning chen
    return sum(np.bincount((M * a[j] + a[j+1:]).ravel(), None, M*M)
               for j in range(len(a) - 1)).reshape(M, M)

def pairs_dict(a):
    p = pairs_F(a)
    # p is a 2D table with the frequency of (y, x) at position y, x
    y, x = np.where(p)
    c = p[y, x]
    return {(yi, xi): ci for yi, xi, ci in zip(y, x, c)}

def pair_freq(a, sort=False, sort_axis=-1):
    a = np.asarray(a)
    if sort:
        a = np.sort(a, axis=sort_axis)
    res = Counter()
    for row in a:
        res.update(combinations(row, 2))
    return res


from timeit import timeit
A = [np.random.randint(0, 1000, (1000, 120)),
     np.random.randint(0, 100, (100000, 12))]
for a in A:
    print('shape:', a.shape, 'range:', a.max() + 1)
    res2 = pairs_dict(a)
    res = pair_freq(a)
    print(f'results equal: {res==res2}')
    print('bincount', timeit(lambda:pairs(a), number=10)*100, 'ms')
    print('bc(F)   ', timeit(lambda:pairs_F(a), number=10)*100, 'ms')
    print('bc->dict', timeit(lambda:pairs_dict(a), number=10)*100, 'ms')
    print('Counter ', timeit(lambda:pair_freq(a), number=4)*250,'ms')
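
To make the encoding behind pairs easier to follow, here is a minimal sketch on the question's small 5x3 array. It is not part of the original answer, and the helper name pair_table is made up for illustration. It assumes small nonnegative integers: each ordered pair (x, y) is encoded as the single integer M*x + y, the codes are counted with np.bincount, and the flat counts are reshaped into an M x M table. The last step folds the table so that (x, y) and (y, x) are counted together, which should correspond to pair_freq(a, sort=True).

```python
from collections import Counter
from itertools import combinations

import numpy as np

def pair_table(a):
    """Hypothetical helper: table[x, y] = number of ordered pairs (x, y)."""
    a = np.asarray(a)
    M = int(a.max()) + 1
    # column index pairs (i, j) with i < j, mirroring combinations(row, 2)
    i, j = np.triu_indices(a.shape[1], k=1)
    # encode each pair (x, y) as one integer in [0, M*M)
    codes = M * a[:, i] + a[:, j]
    return np.bincount(codes.ravel(), minlength=M * M).reshape(M, M)

# the array from the question, hard-coded instead of relying on the seed
a = np.array([[3, 5, 3],
              [2, 4, 3],
              [4, 2, 2],
              [1, 2, 2],
              [1, 1, 2]])

table = pair_table(a)

# fold (y, x) onto (x, y) for x < y; the diagonal is kept as is
unordered = np.triu(table) + np.tril(table, -1).T

print(table[1, 2])      # frequency of the ordered pair (1, 2)
print(unordered[2, 4])  # frequency of {2, 4} regardless of order
```

Unlike pairs, this sketch builds all codes in one shot, which uses more memory when there are many columns; the loop over columns in the answer avoids that.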

Sample run of the timing script:

shape: (1000, 120) range: 1000
results equal: True
bincount 461.14772390574217 ms
bc(F)    435.3669326752424 ms
bc->dict 932.1215840056539 ms
Counter  3473.3258984051645 ms
shape: (100000, 12) range: 100
results equal: True
bincount 89.80463854968548 ms
bc(F)    43.449611216783524 ms
bc->dict 46.470773220062256 ms
Counter  1987.6734036952257 ms

Answer 2

I have an idea, and the code is as follows. The biggest drawback of my code is that it runs very slowly as the number of columns increases, and it is slower than @Paul Panzer's code. My apologies to Paul Panzer.

If you want it to be faster, just skip the num_to_items function, because (1, 1) is equivalent to 1*2**20 + 1.
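
The code for this answer is not available here, so the following is only a guess at the idea the remark about (1, 1) and 1*2**20 + 1 points to: pack each pair into a single integer, count the integers, and decode only if readable tuples are needed. The encoder items_to_num is a hypothetical name, the signature of num_to_items is assumed, and the sketch assumes every value fits in 20 bits (0 <= v < 2**20).

```python
import numpy as np

SHIFT = 20  # assumption: every value satisfies 0 <= v < 2**20

def items_to_num(x, y):
    """Hypothetical encoder: pack the pair (x, y) into x * 2**20 + y."""
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    return (x << SHIFT) + y

def num_to_items(num):
    """Assumed decoder: recover (x, y) from a packed integer."""
    num = np.asarray(num, dtype=np.int64)
    return num >> SHIFT, num & ((1 << SHIFT) - 1)

a = np.array([[1, 2, 1],
              [3, 1, 2],
              [1, 2, 5]])

# column index pairs (i, j) with i < j, as in combinations(row, 2)
i, j = np.triu_indices(a.shape[1], k=1)
codes = items_to_num(a[:, i], a[:, j]).ravel()

# count the packed integers; decoding is optional and costs extra time
uniq, counts = np.unique(codes, return_counts=True)
xs, ys = num_to_items(uniq)
freq = {(int(x), int(y)): int(c) for x, y, c in zip(xs, ys, counts)}
print(freq)
```

Skipping the decode step means working with uniq and counts directly, which is presumably what the answer means by ignoring num_to_items.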
