Vectorizing nested loops for pairwise distance calculation

Question

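How can I make the script below more efficient? This is a follow-up to my previous post, "Python nested loop problem". Currently, processing input tables of about 15000 and 1500 rows takes up to two hours. Processing the data manually in Excel takes an order of magnitude less time - not ideal!

I know iterrows is a bad way to solve the problem and that vectorization is the way forward, but I'm a bit baffled by how it would work with the second for loop.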

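The script excerpt below takes two data frames, qinsy_file_2 and segy_vlookup (ignore the naming of that one).

For each row in qinsy_file_2, it iterates over segy_vlookup to calculate the distance between the coordinates in each file. If this distance is less than or equal to a preset value (here called buffer), the row is transcribed to a new data frame, out_df (otherwise the loop moves on to the next row).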

# Loop through Qinsy file
for index_qinsy,row_qinsy in qinsy_file_2.iterrows():
    # Loop through SEGY navigation
    for index_segy,row_segy in segy_vlookup.iterrows():
        # Calculate distance between points
        if ((((segy_vlookup["CDP_X"][index_segy] - qinsy_file_2["CMP Easting"][index_qinsy])**2) + ((segy_vlookup["CDP_Y"][index_segy] - qinsy_file_2["CMP Northing"][index_qinsy])**2))**0.5)<= buffer:
            # Append rows less than the distance modifier to the new dataframe 
            out_df=pd.concat([out_df,row_qinsy])
            break
        else:
                pass

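So far I have read the following:

How to iterate over rows in a Pandas DataFrame? (and others with similar titles)
Looking for a faster way to iterate over a pandas data frame
What is the most efficient way to loop through dataframes with pandas?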
https://www.learndatasci.com/solutions/how-iterate-over-rows-pandas/
https://towardsdatascience.com/you-dont-always-have-to-loop-through-rows-in-pandas-22a970b347ac

Answer 1

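It would be helpful if you could provide a minimal reproducible example with some sample data.

My first thought is a cross join using merge(how='cross'). With large amounts of data this is dangerous, because it is a Cartesian product: every row of one frame is paired with every row of the other. However, once you have the joined data, you can apply the calculation across the whole frame in a vectorized way.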
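The cross-join idea might look roughly like the sketch below. The column names follow the question, but the sample coordinates and the buffer value are made up for illustration, and how="cross" assumes pandas 1.2 or later. Note that the joined frame has len(qinsy) * len(segy) rows, so memory use grows quickly with real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
qinsy = pd.DataFrame({"CMP Easting": [0.0, 10.0, 50.0],
                      "CMP Northing": [0.0, 10.0, 50.0]})
segy = pd.DataFrame({"CDP_X": [0.5, 10.2],
                     "CDP_Y": [0.5, 9.9]})
buffer = 1.0  # assumed threshold, in the same units as the coordinates

# reset_index() keeps the original row label in a column named "index",
# so that matches can be traced back after the join.
pairs = qinsy.reset_index().merge(segy, how="cross")

# One vectorized distance computation over all (qinsy, segy) pairs.
dist = np.hypot(pairs["CDP_X"] - pairs["CMP Easting"],
                pairs["CDP_Y"] - pairs["CMP Northing"])

# Keep each qinsy row once if at least one segy point lies within the buffer.
hit_labels = pairs.loc[dist <= buffer, "index"].unique()
out_df = qinsy.loc[hit_labels]
print(out_df)
```

Here the first two rows have a nearby segy point and are kept, while the third is dropped.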

See this question for more details.

Answer 2

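Branching out from the loop vectorization you have already researched, try the keywords pairwise distance calculation, e.g.: How to efficiently calculate the Euclidean distance matrix of multiple time series.

The approach below is fully vectorized (no for loops at all). It does the dense 2D part in NumPy, then returns a pandas data frame as requested.
With inputs of 1500 and 15000 rows, computing the distances and filtering the rows by <= buffer takes less than a second.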

Mock data

Let

dfA represent qinsy_file_2, with nA = 15000 points of 2D coordinates ('xA', 'yA')
dfB represent segy_vlookup, with nB = 1500 points of 2D coordinates ('xB', 'yB')

import numpy as np
import pandas as pd

# Table sizes as stated above
nA, nB = 15000, 1500

dfA = pd.DataFrame({'xA' : np.random.default_rng(seed=0).random(nA), 
                    'yA' : np.random.default_rng(seed=1).random(nA)})
dfB = pd.DataFrame({'xB' : np.random.default_rng(seed=3).random(nB), 
                    'yB' : np.random.default_rng(seed=4).random(nB)})

With the seeds given above, the data can be reproduced if you want to replay the process:

             xA        yA
0      0.636962  0.511822
1      0.269787  0.950464
2      0.040974  0.144160
3      0.016528  0.948649
4      0.813270  0.311831
...         ...       ...
14995  0.457170  0.507611
14996  0.283829  0.828005
14997  0.679110  0.910104
14998  0.273703  0.294932
14999  0.097813  0.366295

[15000 rows x 2 columns]

            xB        yB
0     0.085649  0.943056
1     0.236811  0.511328
2     0.801274  0.976244
3     0.582162  0.080836
4     0.094129  0.607356
...        ...       ...
1495  0.684086  0.821719
1496  0.383240  0.957020
1497  0.341389  0.064735
1498  0.032523  0.234434
1499  0.500383  0.931106

[1500 rows x 2 columns]

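Here, each point in dfA is tested against each point in dfB for a distance <= buffer, to determine whether the corresponding row of dfA should be selected into df_out.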

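Distance matrix

Each column in the output below represents one row of dfA (and each row of the matrix represents one row of dfB).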

D = np.sqrt((np.atleast_2d(dfA['xA'].values) - 
             np.atleast_2d(dfB['xB'].values).T) **2 +
            (np.atleast_2d(dfA['yA'].values) - 
             np.atleast_2d(dfB['yB'].values).T) **2)
D
[[0.69993476 0.18428648 0.80014469 ... 0.59437473 0.67485498 0.57688898]
 [0.40015149 0.44037255 0.41613029 ... 0.59552573 0.21951799 0.20088488]
 [0.49263227 0.53211262 1.12712974 ... 0.13891991 0.86169468 0.93107222]
 ...
 [0.535957   0.88861864 0.3107378  ... 0.91033173 0.2399422  0.38764484]
 [0.66504838 0.75431549 0.0906694  ... 0.93520197 0.24865128 0.14713943]
 [0.44096846 0.23140718 0.91123077 ... 0.17995671 0.6753528  0.69359482]]
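To make the shape of this matrix easier to follow, here is a minimal sketch of the same broadcasting on a tiny made-up data set. The coordinates and the names with a _small suffix are illustrative assumptions, not part of the data above. Indexing with [None, :] and [:, None] is used as a shorter way to get the same row and column shapes as the np.atleast_2d(...) and .T construction.

```python
import numpy as np

# Hypothetical toy coordinates: 3 "A" points and 2 "B" points.
xA_small = np.array([0.0, 3.0, 6.0])
yA_small = np.array([0.0, 4.0, 8.0])
xB_small = np.array([0.0, 3.0])
yB_small = np.array([0.0, 0.0])

# A row of shape (1, 3) minus a column of shape (2, 1) broadcasts to (2, 3):
# one matrix row per B point, one matrix column per A point.
dx = xA_small[None, :] - xB_small[:, None]
dy = yA_small[None, :] - yB_small[:, None]
D_small = np.sqrt(dx**2 + dy**2)

print(D_small.shape)  # (2, 3)
print(D_small[0])     # distances from B point (0, 0) to each A point: 0, 5, 10
```

Reading down a column of D_small gives the distances from one A point to every B point, which is why the filtering step reduces along axis=0.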

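Now filter by threshold (distance <= buffer)

At least one value per column of D within the threshold is enough to select that column for df_out, i.e. that row of dfA.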

buffer = 0.01 # arbitrary value
b = np.sum(D<=buffer, axis=0)>0 # boolean selector, one entry per row of dfA
b
[ True False False ... False  True  True]
# In this example, at least the first, second-to-last and last rows
# will be selected, and probably many more in between.

Finally, apply the row-wise selection on dfA:

df_out = dfA[b]
             xA        yA
0      0.636962  0.511822
4      0.813270  0.311831
9      0.935072  0.027559
10     0.815854  0.753513
11     0.002739  0.538143
...         ...       ...
14988  0.039833  0.239034
14994  0.243440  0.428287
14996  0.283829  0.828005
14998  0.273703  0.294932
14999  0.097813  0.366295

[5620 rows x 2 columns]

In this mock example, 5620 of the 15000 rows made it into the selection.

Of course, any other columns your actual dfA may have will also be carried over into df_out.

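Going further (although this answer already represents a huge improvement over the nested loops): could the number of distances actually computed be reduced?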
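One possible direction, which is not part of the answer above, is a spatial index. The sketch below swaps in SciPy's cKDTree: for each A point it looks up only the nearest B point, so the full nA x nB matrix is never built. The array names, the smaller sizes and the buffer value are assumptions chosen for illustration; with real data the arrays would come from something like dfA[['xA', 'yA']].to_numpy().

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(seed=0)
ptsA = rng.random((2000, 2))  # hypothetical stand-in for the dfA coordinates
ptsB = rng.random((300, 2))   # hypothetical stand-in for the dfB coordinates
buffer = 0.05                 # arbitrary value

# Build the tree once on the smaller point set.
tree = cKDTree(ptsB)

# Distance from every A point to its single nearest B point.
nearest_dist, _ = tree.query(ptsA, k=1)

# An A point has at least one B point within the buffer exactly when
# its nearest B point is within the buffer.
b_tree = nearest_dist <= buffer
print(b_tree.sum(), "of", len(ptsA), "points selected")
```

The resulting boolean array plays the same role as b above, so a selection such as dfA[b_tree] would apply in the same way. Whether this is faster in practice depends on the data sizes and would need to be measured.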
