Which is faster for filtering a large DataFrame: (= .at), (= .loc), (.drop), or (.append)?

Question

I want to sort through a DataFrame of about 400k rows and 4 columns, taking out roughly half of the rows with an if statement:

    for a in range(0, howmanytimestorunthrough):
        if 'Primary' not in DataFrameexample[a]:
            # take out row

So far I have been testing each of the following 4 approaches:

    newdf.append(emptyline,)
    newdf.at[b,'column1'] = DataFrameexample.at[a,'column1']
    newdf.at[b,'column2'] = DataFrameexample.at[a,'column2']
    newdf.at[b,'column3'] = DataFrameexample.at[a,'column3']
    newdf.at[b,'column4'] = DataFrameexample.at[a,'column4']
    b = b + 1

Or the same with .loc:

    newdf.append(emptyline,)
    newdf.loc[b,:] = DataFrameexample.loc[a,:]
    b = b + 1

Or changing the if (not in) to an if (in) and using:

    DataFrameexample = DataFrameexample.drop([k])

Or trying to give the empty line values first, and then appending it:

    notemptyline = pd.Series(DataFrameexample.loc[a,:].values, index=['column1', 'column2', ...])
    newdf.append(notemptyline, ignore_index=True)

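From what I have tested so far, they all seem to work fine on a small number of rows (2000), but once I start getting a lot more rows they take exponentially longer. .at seems much faster than .loc, even though I need to run it 4 times, but it still gets slow (10 times the rows takes more than 10 times as long). I think .drop tries to copy the dataframe each time, so it really doesn't work? I can't seem to get .append(notemptyline) to work properly; it just replaces index 0 over and over again.

I know there has to be an efficient way of doing this, I just can't quite get there. Any help?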

Answer 1

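Your speed problem has nothing to do with .loc vs .at vs ... (for a comparison between .loc and .at, have a look at this question); it comes from explicitly looping over every row of your dataframe. Pandas is all about vectorising your operations.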
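To make the difference concrete, here is a minimal sketch that builds the same result twice, once with a row loop and once with a single vectorised comparison. The toy data and the names `looped` and `vectorised` are made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical toy data; the real frame has about 400k rows and 4 columns.
df = pd.DataFrame({
    "column1": ["Primary", "Secondary", "Primary", "Other"],
    "column2": [1, 2, 3, 4],
})

# Row by row: one Python-level lookup and comparison per row.
kept_labels = []
for label in df.index:
    if df.at[label, "column1"] == "Primary":
        kept_labels.append(label)
looped = df.loc[kept_labels]

# Vectorised: one comparison over the whole column, then one selection.
vectorised = df.loc[df["column1"] == "Primary"]
```

Both versions select the same rows; the second one just leaves the per-row work to pandas instead of the Python interpreter.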

You want to filter your dataframe based on a comparison. You can turn that comparison into a boolean indexer.

    indexer = df != 'Primary'

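This gives you a dataframe of boolean values with n rows and 4 columns. Now you want to reduce it to a single boolean per row (n values in total), which is True only if all values in that row (axis 1) are True.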

    indexer = indexer.all(axis=1)

Now we can use .loc to get only the rows where the indexer is True:

    df = df.loc[indexer]

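This will be much faster than iterating over the rows, because each step works on whole columns at once instead of one row at a time in Python.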
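Put together, the three steps look roughly like this on a tiny frame. The data and the names `row_mask` and `filtered` are assumptions made for the sketch; the column names follow the ones used in the question.

```python
import pandas as pd

# Hypothetical 3-row frame with 4 string columns.
df = pd.DataFrame({
    "column1": ["Primary", "a", "b"],
    "column2": ["c", "d", "Primary"],
    "column3": ["e", "f", "g"],
    "column4": ["h", "i", "j"],
})

# Step 1: element-wise comparison, same shape as df (n rows, 4 columns).
indexer = df != "Primary"

# Step 2: collapse each row to one boolean, True only if every cell is True.
row_mask = indexer.all(axis=1)

# Step 3: keep only the rows whose mask value is True.
filtered = df.loc[row_mask]
```

Here only the middle row survives, since it is the only one with no cell equal to 'Primary'. To keep the rows that do have such a cell instead, the mask can be inverted with `~row_mask`.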

EDIT:

To check whether the df entries contain the string, you can replace the first line with:

    indexer = df.apply(lambda x: x.str.contains('Primary'))

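Note that you normally don't want to use an apply statement (which internally uses a for loop for custom functions) to iterate over a lot of elements. In this case we are only looping over the columns, which is fine if you have just a few of them.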
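One thing to watch with the substring version: the indexer is now True where a cell contains the string, which is the opposite polarity of the `!=` comparison. A hedged sketch of how the row reduction might look in that case is below. It assumes every column holds strings (the `.str` accessor needs that), and the names `has_primary`, `with_primary`, and `without_primary` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data where 'Primary' appears as a substring.
df = pd.DataFrame({
    "column1": ["Primary school", "a", "b"],
    "column2": ["c", "d", "the Primary one"],
})

# One pass per column; regex=False treats 'Primary' as a literal substring.
contains = df.apply(lambda col: col.str.contains("Primary", regex=False))

# True for rows where at least one cell contains the substring.
has_primary = contains.any(axis=1)

with_primary = df.loc[has_primary]      # rows mentioning 'Primary' somewhere
without_primary = df.loc[~has_primary]  # rows that never mention it
```

Whether `any` or `all` is the right reduction, and whether the mask needs `~`, depends on which rows are meant to be kept.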
