Filtering a pandas DataFrame string column with the "str.contains" method

Question

My DataFrame looks like this, where long_category reflects the business categories of each row:

df = pd.DataFrame({
 'long_category': {0: 'Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists',
  1: 'Shipping Centers, Local Services, Notaries, Mailbox Centers, Printing Services',
  2: 'Department Stores, Shopping, Fashion, Home & Garden, Electronics, Furniture Stores',
  3: 'Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries',
  4: 'Brewpubs, Breweries, Food',
  5: 'Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants',
  6: 'Sporting Goods, Fashion, Shoe Stores, Shopping, Sports Wear, Accessories',
  7: 'Synagogues, Religious Organizations',
  8: 'Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek',
  9: 'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food'}})

My goal is to shorten these categories to one of the following known_categories, depending on whether the long category contains a keyword from known_categories:

  known_categories = ['restaurant', 'beauty & spas', 'hotels', 'health & medical', 'shopping', 'coffee & tea','automotive', 'pets|veterinian', 'services', 'stores',  'grocery', 'ice cream']

So I do:

df['short_category'] = np.nan
for cat in known_categories:
    excluded_cats = [x for x in known_categories if x!= cat]
    df['short_category'] [ ~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False)) & (df.long_category.str.contains(cat, regex = True, case = False, na = False))] = cat

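It is important that the short categories are mutually exclusive. For example, the row with index 3 should be put into either the "restaurant" or the "coffee & tea" short category, which is why the loop above includes the condition

~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False))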

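But this does not work. For example, the last two rows both have "Restaurants" in their long category, but only the first of the two has it captured in the short category. I expected the last row to be either "restaurant" or "ice cream", since it contains keywords from known_categories. So where exactly am I going wrong?

As a side note, I would like to be able to influence the frequency of the short categories by moving a category toward the front or the back of known_categories, if possible. For example, with the current known_categories, I would like the last row to get "restaurant" as its short category. But if I move "ice cream" before "restaurant" in known_categories, I would like the last row to show "ice cream" instead of "restaurant".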

Answer 1

You need extractall to collect every keyword match, then Categorical to order the categories, and groupby.min to get the preferred category for each row:

import re

known_categories = ['restaurant', 'beauty & spas', 'hotels', 'health & medical',
                    'shopping', 'coffee & tea','automotive', 'pets|veterinian',
                    'services', 'stores',  'grocery', 'ice cream']

# craft the regex
pattern = r'(%s)' % '|'.join(map(re.escape, sorted(known_categories, key=len, reverse=True)))

cat = pd.CategoricalDtype(categories=known_categories, ordered=True)

df['short_category'] = df['long_category'].str.lower().str.extractall(pattern)[0].astype(cat).groupby(level=0).min()

Output:

                                       long_category    short_category
0  Doctors, Traditional Chinese Medicine, Naturop...  health & medical
1  Shipping Centers, Local Services, Notaries, Ma...          services
2  Department Stores, Shopping, Fashion, Home & G...          shopping
3  Restaurants, Food, Bubble Tea, Coffee & Tea, B...        restaurant
4                          Brewpubs, Breweries, Food               NaN
5  Burgers, Fast Food, Sandwiches, Food, Ice Crea...        restaurant
6  Sporting Goods, Fashion, Shoe Stores, Shopping...          shopping
7                Synagogues, Religious Organizations               NaN
8  Pubs, Restaurants, Italian, Bars, American (Tr...        restaurant
9  Ice Cream & Frozen Yogurt, Fast Food, Burgers,...        restaurant
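
To see how the list order drives the result, the one-liner above can be split into its steps. The following is a minimal sketch, not part of the original answer: the helper name shorten and the three-row sample are assumptions made for illustration, and the keyword list is cut down to three entries.

```python
import re
import pandas as pd

# Three rows borrowed from the question's data (indices 3, 4 and 9).
long_category = pd.Series([
    'Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries',
    'Brewpubs, Breweries, Food',
    'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food',
])

def shorten(text: pd.Series, categories: list) -> pd.Series:
    """Hypothetical helper: the answer's pipeline, one step per line."""
    # Longest keywords first; re.escape makes every keyword match literally.
    by_length = sorted(categories, key=len, reverse=True)
    pattern = '(%s)' % '|'.join(map(re.escape, by_length))
    # extractall yields one row per match, indexed by (original row, match number).
    matches = text.str.lower().str.extractall(pattern)[0]
    # In an ordered categorical, the list order decides which keyword is "smallest".
    dtype = pd.CategoricalDtype(categories=categories, ordered=True)
    best = matches.astype(dtype).groupby(level=0).min()
    # Rows without any match are absent from `best`; reindex brings them back as NaN.
    return best.reindex(text.index)

default_order = shorten(long_category, ['restaurant', 'coffee & tea', 'ice cream'])
reordered = shorten(long_category, ['ice cream', 'restaurant', 'coffee & tea'])

print(default_order.tolist())  # last row resolves to 'restaurant'
print(reordered.tolist())      # last row resolves to 'ice cream'
```

Moving "ice cream" ahead of "restaurant" in the list changes only the last row, which is the behavior asked for in the side note. One caveat: because every keyword passes through re.escape, an entry such as 'pets|veterinian' is matched literally, pipe included, and no longer acts as an alternation.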

Answer 2

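Go through your list of categories backwards (assuming the more specific categories come later), and only set a category as the short category if no category has been assigned yet: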

df['short_category'] = ''

for cat in known_categories[::-1]:
    contains_cat = df['long_category'].str.contains(cat, case = False, regex = True)
    no_category = df['short_category'] == ''

    df.loc[contains_cat & no_category, 'short_category'] = cat
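
Because the loop above walks the list in reverse, later entries of known_categories take priority, and unmatched rows end up as '' rather than NaN. If earlier entries should win instead, as the question's side note asks, the same idea can be run front to back. This is a minimal sketch under that assumption; the helper name first_match and the shortened sample data are illustrative and not taken from the answer.

```python
import pandas as pd

# Three rows borrowed from the question's data (indices 3, 4 and 9).
long_category = pd.Series([
    'Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries',
    'Brewpubs, Breweries, Food',
    'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food',
])

def first_match(text: pd.Series, keywords: list) -> pd.Series:
    """Hypothetical helper: per row, the first keyword (in list order) found."""
    # Start with every row unassigned (None in an object-dtype Series).
    result = pd.Series([None] * len(text), index=text.index, dtype=object)
    for keyword in keywords:
        contains = text.str.contains(keyword, case=False, regex=True, na=False)
        unassigned = result.isna()
        # Only fill rows that no earlier keyword has claimed.
        result[contains & unassigned] = keyword
    return result

first_wins = first_match(long_category, ['restaurant', 'coffee & tea', 'ice cream'])
ice_cream_first = first_match(long_category, ['ice cream', 'restaurant', 'coffee & tea'])

print(first_wins.tolist())       # ['restaurant', None, 'restaurant']
print(ice_cream_first.tolist())  # ['restaurant', None, 'ice cream']
```

This also points to what likely goes wrong in the question's loop. The exclusion mask rejects any row that matches more than one keyword, so a row containing both "Restaurants" and "Ice Cream" is skipped in every iteration and stays NaN. In addition, df['short_category'][mask] = cat is chained assignment, which the pandas documentation advises replacing with a single .loc call.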
