Iterating over a DataFrame - revisited?

Question

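I have a DataFrame of about 500,000 rows. It contains aircraft radar data: longitude, latitude, altitude, datetime (and more).

To calculate a "propensity" (a dimensionless number based on closest point of approach [cpa] theory), I do the following:

Within a certain time window (currently 3 seconds, which contains roughly 20-40 rows of data), I select a group of rows based on "datetime" and then call a function that operates on that chunk.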

    df = row[1]['df']          # the radar data being processed
    limits = row[1]['limits']  # the three limits passed to do_checks3D

    timeslice_start = df['datetime'].iloc[0]
    timeslice_end = timeslice_start + timeslice

    # While the end of the timeslice has not passed the end of the dataframe:
    while timeslice_end <= df['datetime'].iloc[-1]:
        # Take the group of datapoints that fall inside the timeslice
        in_slice = (df['datetime'] >= timeslice_start) & (df['datetime'] <= timeslice_end)
        this_group = df[in_slice]

        do_checks3D(this_group, limits[0], limits[1], limits[2])
        # (the step that slides the window forward is not shown in this excerpt)

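This is a sliding-window technique. I recently found out that pandas actually has a built-in rolling() method (DataFrame.rolling()), but I have not tried it yet because I do not expect a big performance gain from it.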

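This group of data is then passed to a function that computes the cpa for all aircraft points in that chunk by iterating over the rows with a "for" loop and a nested "for" loop, like so: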

    for i in range(0, len(dframe)):
        # extract the parameters for vector1 of aircraft 1

        # and compare it with all the other datapoints in the frame
        for j in range(i + 1, len(dframe)):
            # do more extraction for vector2 of aircraft 2
            # and call the cpa function
            cpa(vector1, vector2)

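In effect, this gives me the same pairs as itertools.combinations() would. However, using combinations() turned out to be much slower than the brute-force iteration I am doing now.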
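To illustrate that equivalence, here is a minimal sketch on a small placeholder list. The name points and its values are made up for the example and stand in for the rows of the chunk.

```python
from itertools import combinations

# Placeholder data standing in for the rows of one chunk (hypothetical values)
points = ["a", "b", "c", "d"]

# The nested-loop pattern from the question: every pair (i, j) with i < j
nested = []
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        nested.append((points[i], points[j]))

# itertools.combinations yields the same pairs in the same order
from_itertools = list(combinations(points, 2))
```

Both approaches visit n * (n - 1) / 2 pairs, so any speed difference comes from per-pair overhead, not from the number of pairs.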

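The window is then slid forward (currently by 1 second) and the function is called again. Obviously this produces many overlaps, which need to be culled later.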
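One possible way to avoid handling the same pair twice is to remember which pairs have already been seen. The sketch below assumes each row has a unique integer id and that a window is simply a list of such ids; unique_pairs is a hypothetical helper, not something from my actual code.

```python
from itertools import combinations

def unique_pairs(windows):
    """Yield each pair of row ids once, even if it occurs in several windows."""
    seen = set()
    for row_ids in windows:
        for pair in combinations(sorted(row_ids), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair

# Three overlapping windows of row ids (made-up example data)
windows = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
pairs = list(unique_pairs(windows))
```

Here the three windows contain nine pairs in total, of which seven are distinct, so two cpa calls would be saved.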

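Computing all cpas for one day of datapoints takes about 30 minutes on an ordinary PC. I am looking for ways to speed the code up significantly. Do you have any suggestions? Is pandas.rolling() faster than my sliding window? Is there a better, more Pythonic way? Is there anything faster than my two nested "for" loops?

Any suggestions are very welcome!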

Answer 1

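Using .rolling() will almost certainly be faster: the data is then handled by the C core rather than by Python. You can combine it with .apply() to run a custom function on each window.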
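As a rough sketch of what that could look like, the example below rolls a 3-second window over a datetime index. The sample data, the altitude column and the altitude_spread function are made up for illustration. Note that a time-based rolling window looks backward from each row, and that .apply() receives the values of a single column and must return one number per window, so it is not a drop-in replacement for passing a whole chunk of the frame to a function.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration
df = pd.DataFrame(
    {
        "datetime": pd.date_range("2020-01-01", periods=6, freq="1s"),
        "altitude": [1000.0, 1010.0, 1020.0, 1015.0, 1005.0, 1000.0],
    }
).set_index("datetime")

def altitude_spread(window):
    # With raw=True the window arrives as a 1-D NumPy array
    return window.max() - window.min()

# "3s" is a trailing time window: each row sees the rows of the last 3 seconds
spread = df["altitude"].rolling("3s").apply(altitude_spread, raw=True)
```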

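However, using .apply() with a Python function means the data will probably be processed by Python rather than by pandas, which is slower. The code can be rewritten to avoid this, but that may not be necessary: I would try .apply() first and see whether it is still too slow.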
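If it does turn out to be too slow, one way such a rewrite might look is to compute all pairs of a chunk at once with NumPy instead of calling a function per pair. The sketch below is only an assumption about what the cpa function does: it treats every datapoint as a position and a constant velocity in a Cartesian frame and uses the standard closest-point-of-approach formula. The function pairwise_cpa and its inputs are hypothetical.

```python
import numpy as np

def pairwise_cpa(pos, vel):
    """Time and distance of closest approach for every pair (i < j).

    pos, vel: arrays of shape (n, 3). Cartesian positions and constant
    velocities are assumed here; the question does not define them.
    """
    i, j = np.triu_indices(len(pos), k=1)  # same pairs as the nested loops
    dp = pos[j] - pos[i]                   # relative position
    dv = vel[j] - vel[i]                   # relative velocity
    dv2 = np.einsum("ij,ij->i", dv, dv)    # squared relative speed
    dot = np.einsum("ij,ij->i", dp, dv)
    # Pairs with identical velocity never get closer: use t = 0 for them
    safe = np.where(dv2 > 0, dv2, 1.0)
    t_cpa = np.where(dv2 > 0, -dot / safe, 0.0)
    t_cpa = np.maximum(t_cpa, 0.0)         # only look forward in time
    d_cpa = np.linalg.norm(dp + dv * t_cpa[:, None], axis=1)
    return i, j, t_cpa, d_cpa
```

For a chunk of 20-40 rows this replaces a few hundred Python-level calls with a handful of array operations.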
