Efficiently delete rows that contain repeating elements between different rows

Question

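Given a 2D array, a row at index i may contain one or more numbers that also appear in another row at index j. I need to remove both rows i and j from the array. Within any single row, the numbers are always unique. I already have a NumPy-based solution that avoids explicit loops over rows. This is the only solution I have come up with:

    import numpy as np

    def filter_array(arr):
        # Reshape to 1D without hard copy
        arr_1d = arr.ravel()
        # Make a count of only the existing numbers (faster than histogram)
        u_elem, c = np.unique(arr_1d, return_counts=True)
        # Get which elements are duplicates.
        duplicates = u_elem[c > 1]
        # Nothing to remove if no number occurs more than once
        if duplicates.size == 0:
            return arr
        # Get the rows where these duplicates belong
        dup_idx = np.concatenate([np.where(arr_1d == d)[0] for d in duplicates])
        # Flat index -> row index: divide by the column count (9 in my real data)
        dup_rows = np.unique(dup_idx // arr.shape[1])
        # Remove the rows from the array
        b = np.delete(arr, dup_rows, axis=0)
        return b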

Here is an (oversimplified) example of an input array:

    a = np.array([
        [1, 3, 23, 40, 33],
        [2, 8, 5, 35, 7],
        [9, 32, 4, 6, 3],
        [72, 85, 32, 48, 53],
        [3, 98, 101, 589, 208],
        [343, 3223, 4043, 65, 78]
    ])

The filtered array gives the expected result, although I have not exhaustively checked that this works in every possible case:

    [[   2    8    5   35    7]
     [ 343 3223 4043   65   78]]

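My typical arrays have about 10^5 to 10^6 rows and a fixed number of 9 columns. %timeit reports roughly 270 ms to filter one such array, and I have a hundred million of them. I am trying to speed this up on a single CPU before considering other means (e.g. GPUs).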

The data may already be in a Pandas dataframe.

Answer 1

We can gain some speed here by using np.isin after finding the unique values and their counts, and then using the result to index the array:
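A minimal sketch of that idea follows. The function name `filter_array_isin` and the variable names are illustrative assumptions, not taken from the answer. It relies on the question's guarantee that numbers are unique within a row, so any value counted more than once must come from different rows.

```python
import numpy as np


def filter_array_isin(arr):
    """Keep only the rows whose values each occur exactly once in the whole array.

    Assumes values are unique within a row (as stated in the question).
    """
    # Unique values and their occurrence counts over the flattened array
    u_elem, counts = np.unique(arr, return_counts=True)
    # Boolean mask with the same shape as arr: True where the value is a singleton
    is_single = np.isin(arr, u_elem[counts == 1])
    # A row survives only if every one of its entries is a singleton
    return arr[is_single.all(axis=1)]


a = np.array([
    [1, 3, 23, 40, 33],
    [2, 8, 5, 35, 7],
    [9, 32, 4, 6, 3],
    [72, 85, 32, 48, 53],
    [3, 98, 101, 589, 208],
    [343, 3223, 4043, 65, 78],
])

print(filter_array_isin(a))
```

On the example array this keeps the rows `[2, 8, 5, 35, 7]` and `[343, 3223, 4043, 65, 78]`. Compared with the question's version, the per-duplicate `np.where` calls are replaced by a single membership test and one boolean index. Whether that is actually faster on 10^5 to 10^6 rows should be checked with %timeit on real data.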
