How can I efficiently identify and categorize strings in a pandas DataFrame?

Question

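I have a pandas DataFrame (over 200K rows) with a column of strings, and I am trying to select rows according to whether the string contains particular combinations of words. For example, consider the following rows: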

dF["the_string"][1] = "The blue and white dog chases many cats "
dF["the_string"][2] = "the green shoe is too big for the cat"
dF["the_string"][3] = "the yellow cat is cute"

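I want to find all rows that contain a given word or set of words but do not contain another word or set of words (for example, rows that contain "cat" but contain neither "dog" nor "green"), and then fill a column in the DataFrame with a "category" based on the combination that matched. In this case, column["category"][3] should be "Feline". As another example, a string that contains "dog" but not "green" would give column["category"][1] = "Canine". Since I have dozens of such combinations, I am looking for an efficient way to process a large dataset. I tried using regular expressions, which led to many lines of string filters like this: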

dF["the_string"].str.contains(r'\b\w+\b [Cc]ats?\b', na=False, regex=True)

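However, since I have many combinations and many rows, and the combinations can change, I am trying to find an efficient way to manage the list and run it over the 200K rows.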

Any help or suggestions would be greatly appreciated.

Answer 1

Try this:

import numpy as np
import pandas as pd

dF = pd.DataFrame(
    ["The blue and white dog chases many cats ",
     "the green shoe is too big for the cat",
     "the yellow cat is cute"],
    columns=["the_string"],
)

# Build each boolean mask once, always from the original text column,
# so that an assigned label can never be matched by a later rule.
has_cat = dF["the_string"].str.contains("cat", na=False)
has_dog = dF["the_string"].str.contains("dog", na=False)
has_green = dF["the_string"].str.contains("green", na=False)

# Start from the original text and overwrite the rows that match a rule.
dF["category"] = dF["the_string"]
dF["category"] = np.where(has_cat & ~has_dog & ~has_green, "Feline", dF["category"])
dF["category"] = np.where(has_dog & ~has_green, "Canine", dF["category"])

Result:

                                 the_string                               category
0  The blue and white dog chases many cats                                  Canine
1     the green shoe is too big for the cat  the green shoe is too big for the cat
2                    the yellow cat is cute                                 Feline

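You can keep adding np.where clauses until you get the final result. Rows that match no rule keep their original string in the category column, as row 1 does above. Hope this helps.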
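
With dozens of combinations that may change, writing one np.where line per rule gets hard to maintain. One possible alternative, shown here only as a minimal sketch, is to keep the rules in a list and pass all the conditions to np.select in a single call. The names RULES, word_mask and categorize, the "Uncategorized" default label, and the case-insensitive matching are assumptions made for illustration and are not part of the original answer. Note that np.select uses the first matching rule for each row, and that plain substring matching means "cat" also matches "cats".

```python
import re

import numpy as np
import pandas as pd

# Hypothetical rule table: (category, words that must appear, words that must not appear).
RULES = [
    ("Feline", ["cat"], ["dog", "green"]),
    ("Canine", ["dog"], ["green"]),
]


def word_mask(series, word, cache):
    """Return a boolean array for `word`, scanning the column only once per word."""
    if word not in cache:
        # Assumption: case-insensitive substring match; re.escape keeps the word literal.
        cache[word] = series.str.contains(
            re.escape(word), case=False, na=False
        ).to_numpy()
    return cache[word]


def categorize(series, rules, default="Uncategorized"):
    """Label each string with the first rule whose include/exclude words all match."""
    cache = {}
    conditions = []
    for _, include, exclude in rules:
        cond = np.ones(len(series), dtype=bool)
        for word in include:
            cond &= word_mask(series, word, cache)
        for word in exclude:
            cond &= ~word_mask(series, word, cache)
        conditions.append(cond)
    labels = [name for name, _, _ in rules]
    return np.select(conditions, labels, default=default)


dF = pd.DataFrame(
    {
        "the_string": [
            "The blue and white dog chases many cats ",
            "the green shoe is too big for the cat",
            "the yellow cat is cute",
        ]
    }
)
dF["category"] = categorize(dF["the_string"], RULES)
print(dF)
```

Because each distinct word is searched only once and then reused across rules, the cost grows with the number of distinct words rather than the number of rules, which should help on 200K rows. Adding or changing a combination then only means editing the RULES list.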
