How can I efficiently find DataFrame rows whose items are all contained in a list?

Question

Suppose I have the following example:

items = ['milk', 'bread', 'water']

df:
name     item1    item2    item3

items_1  milk     water
items_2  milk     rubber   juice
items_3  juice    paper    wood
items_4  bread
items_5  bread    water    milk
items_6  milk     juice

In this example, I want to get all rows of df whose members are entirely contained in the items list, which means:

items_1
items_4
items_5

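The actual "df" DataFrame will contain several million rows (the items_* rows), hence the "efficiently" in the title. "df" will have between 10 and 20 columns. In addition, there will be thousands of "items" lists, each containing 10 to 20 elements.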

Can someone help me solve this?

Answer 1

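    # dflist holds the non-null values of one df row, e.g. ['milk', 'rubber', 'juice']
    for item in dflist:
        if item not in items:
            print("this df list has an item that is not in the items list")
            break  # one miss is enough to rule the row out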

I know this output may not be exactly what you want, but you did not make your ideal output clear.

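This for loop goes through every item in one df list (a row such as items_1, items_2, and so on). It looks at each item in that list and checks whether it is in the list of items you are checking against.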

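If it finds an item that is not in the list you are checking against, it reports that the item could not be found. That seems to be what you are looking for: any value that is not in the first list, the one labeled "items". The loop checks for those values, and from there you can easily discard the rows that contain them.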
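The loop described above can be made runnable roughly as follows. This is only a sketch: it assumes the sample data from the question, the names `allowed` and `matching_names` are illustrative and not part of the original answer, and converting `items` to a set is an added detail that gives fast membership checks.

```python
import pandas as pd

items = ["milk", "bread", "water"]

# Sample data from the question; missing cells are represented as None.
df = pd.DataFrame(
    {
        "name": ["items_1", "items_2", "items_3", "items_4", "items_5", "items_6"],
        "item1": ["milk", "milk", "juice", "bread", "bread", "milk"],
        "item2": ["water", "rubber", "paper", None, "water", "juice"],
        "item3": [None, "juice", "wood", None, "milk", None],
    }
)

allowed = set(items)  # hash-based lookup instead of scanning the list each time

matching_names = []
for row in df.itertuples(index=False):
    name, *values = row
    # Ignore empty cells, then require every remaining value to be allowed.
    present = [v for v in values if pd.notna(v)]
    if all(v in allowed for v in present):
        matching_names.append(name)

print(matching_names)  # ['items_1', 'items_4', 'items_5']
```

A row-by-row Python loop like this is likely to be slow on several million rows, so it is probably best treated as a reference for checking results rather than as the final solution.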

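Binary search is usually the way to go when searching large datasets, but it does not seem feasible here unless you can put the df lists in alphabetical order. If you cannot, use what I wrote above.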

Hope this makes sense!

Answer 2

#set name as index
#allows us to focus on the items columns
#and later allows easy filtering
df = df.set_index("name")

#flag every cell that is in items
#and count the True values per row
A = df.isin(items).sum(axis=1)

#count the non-null cells per row
#this lets us identify rows that are
#completely in the items list
#even when some of their cells are null
B = df.notna().sum(axis=1)

#compare A and B
#if they match, that implies complete entry in items list
cond = A.eq(B)

#let's see what cond looks like:

cond

            name
items_1     True
items_2    False
items_3    False
items_4     True
items_5     True
items_6    False
dtype: bool

#filter df with condition to get your rows
df.loc[cond]


           item1    item2   item3
name            
items_1     milk    water   None
items_4     bread   None    None
items_5     bread   water   milk
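
For reference, the same idea can be written as a self-contained sketch. It assumes the sample data from the question. The helper `rows_fully_in` and the `item_lists` dictionary (including its second list) are hypothetical names and data introduced only for illustration. Instead of comparing two counts, the helper treats a cell as acceptable when it is either in the list or null, which should give the same mask as `A.eq(B)`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["items_1", "items_2", "items_3", "items_4", "items_5", "items_6"],
        "item1": ["milk", "milk", "juice", "bread", "bread", "milk"],
        "item2": ["water", "rubber", "paper", None, "water", "juice"],
        "item3": [None, "juice", "wood", None, "milk", None],
    }
).set_index("name")


def rows_fully_in(frame, allowed):
    """Boolean mask: True where every non-null cell of the row is in `allowed`."""
    return (frame.isin(allowed) | frame.isna()).all(axis=1)


items = ["milk", "bread", "water"]
cond = rows_fully_in(df, items)
result = df.loc[cond]
print(result.index.tolist())  # ['items_1', 'items_4', 'items_5']

# Hypothetical: several items lists, checked one after another.
item_lists = {
    "list_a": ["milk", "bread", "water"],
    "list_b": ["juice", "milk"],
}
matches = {
    key: df.index[rows_fully_in(df, allowed)].tolist()
    for key, allowed in item_lists.items()
}
print(matches)
```

Note that with either formulation a row whose cells are all null counts as a match, since it contains nothing outside the list. If that is not wanted, such rows would need to be excluded separately.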
