Filtering a pandas DataFrame string column with the "str.contains" method

Question (votes: 0, answers: 2)

My DataFrame looks like this, where long_category reflects the business categories of each row:

import numpy as np
import pandas as pd

df = pd.DataFrame({
 'long_category': {0: 'Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists',
  1: 'Shipping Centers, Local Services, Notaries, Mailbox Centers, Printing Services',
  2: 'Department Stores, Shopping, Fashion, Home & Garden, Electronics, Furniture Stores',
  3: 'Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries',
  4: 'Brewpubs, Breweries, Food',
  5: 'Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants',
  6: 'Sporting Goods, Fashion, Shoe Stores, Shopping, Sports Wear, Accessories',
  7: 'Synagogues, Religious Organizations',
  8: 'Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek',
  9: 'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food'}})


My goal is to shorten these categories to one of the following known_categories, depending on whether the long category contains a keyword from that list:

  known_categories = ['restaurant', 'beauty & spas', 'hotels', 'health & medical', 'shopping', 'coffee & tea','automotive', 'pets|veterinian', 'services', 'stores',  'grocery', 'ice cream']

So I do:

df['short_category'] = np.nan
for cat in known_categories:
    excluded_cats = [x for x in known_categories if x!= cat]
    df['short_category'] [ ~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False)) & (df.long_category.str.contains(cat, regex = True, case = False, na = False))] = cat

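It is important that the short categories are mutually exclusive. For example, the row with index 3 should go into either the 'restaurant' or the 'coffee & tea' short category, but not both, which is why the loop includes this condition:

~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False))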

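But this does not work. For example, the last two rows both have 'Restaurants' in their long category, but only the first of the two gets it as its short category. I expected the last row to be either 'restaurant' or 'ice cream', since it contains keywords from short_categories. So where exactly am I going wrong?

As a side note, if possible, I would like to be able to influence the frequency of the short categories by moving a category further toward the front or the back of known_categories. For example, with the current short_categories, I would like the last row to get 'restaurant' as its short category. But if I move 'ice cream' before 'restaurant' in short_categories, I would like the last row to show 'ice cream' instead of 'restaurant'.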

python pandas dataframe
2 Answers
0 votes
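
As for why the loop in the question leaves rows empty: each pass assigns cat only to rows that match cat and no other keyword, so a row that matches two or more keywords is never assigned. A minimal sketch to check this follows; count_keyword_hits is a hypothetical helper name (not part of pandas) that counts the matching keywords per row.

```python
import pandas as pd

def count_keyword_hits(long_category, known_categories):
    """Count, per row, how many keywords match (hypothetical diagnostic helper)."""
    hits = pd.DataFrame({
        cat: long_category.str.contains(cat, case=False, regex=True, na=False)
        for cat in known_categories
    })
    # Summing booleans across columns gives the number of matching keywords
    return hits.sum(axis=1)

# Two rows taken from the question's data (indexes 8 and 9 there)
sample = pd.Series([
    'Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek',
    'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food',
])
counts = count_keyword_hits(sample, ['restaurant', 'ice cream'])
print(counts.tolist())  # [1, 2]
```

Rows with a count of 2 or more are the ones the original loop skips. Separately, the chained assignment df['short_category'][mask] = cat is fragile; df.loc[mask, 'short_category'] = cat is the form pandas documents for setting values.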

You need extractall, then a Categorical to order the categories, and groupby.min to get the preferred category:

import re

known_categories = ['restaurant', 'beauty & spas', 'hotels', 'health & medical',
                    'shopping', 'coffee & tea','automotive', 'pets|veterinian',
                    'services', 'stores',  'grocery', 'ice cream']

# craft the regex
pattern = r'(%s)' % '|'.join(map(re.escape, sorted(known_categories, key=len, reverse=True)))

cat = pd.CategoricalDtype(categories=known_categories, ordered=True)

df['short_category'] = df['long_category'].str.lower().str.extractall(pattern)[0].astype(cat).groupby(level=0).min()

Output:

                                       long_category    short_category
0  Doctors, Traditional Chinese Medicine, Naturop...  health & medical
1  Shipping Centers, Local Services, Notaries, Ma...          services
2  Department Stores, Shopping, Fashion, Home & G...          shopping
3  Restaurants, Food, Bubble Tea, Coffee & Tea, B...        restaurant
4                          Brewpubs, Breweries, Food               NaN
5  Burgers, Fast Food, Sandwiches, Food, Ice Crea...        restaurant
6  Sporting Goods, Fashion, Shoe Stores, Shopping...          shopping
7                Synagogues, Religious Organizations               NaN
8  Pubs, Restaurants, Italian, Bars, American (Tr...        restaurant
9  Ice Cream & Frozen Yogurt, Fast Food, Burgers,...        restaurant
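
To check the side note from the question (moving a keyword toward the front of known_categories should change the result), the same extractall approach can be wrapped in a function and called with two different orderings. This is only a sketch: shorten is a hypothetical name, and the two-keyword lists are illustrative.

```python
import re
import pandas as pd

def shorten(long_category, known_categories):
    """Return, per row, the matched keyword listed earliest in known_categories.

    Hypothetical helper around the extractall approach; keywords are matched literally.
    """
    # Try longer keywords first where several could match at the same position
    ordered = sorted(known_categories, key=len, reverse=True)
    pattern = r'(%s)' % '|'.join(map(re.escape, ordered))
    dtype = pd.CategoricalDtype(categories=known_categories, ordered=True)
    matches = long_category.str.lower().str.extractall(pattern)[0].astype(dtype)
    # min() of an ordered categorical is the category that comes first in the list
    best = matches.groupby(level=0).min()
    # Rows without any match are restored as NaN
    return best.reindex(long_category.index)

sample = pd.Series([
    'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food',
    'Brewpubs, Breweries, Food',
])
print(shorten(sample, ['restaurant', 'ice cream']).tolist())  # ['restaurant', nan]
print(shorten(sample, ['ice cream', 'restaurant']).tolist())  # ['ice cream', nan]
```

One caveat: re.escape turns the | in 'pets|veterinian' into a literal character, so that entry only matches the literal text pets|veterinian rather than either word. If the alternation is wanted, the two words would need to be listed as separate keywords.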

0 votes

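Go through your category list backwards (assuming the more specific categories come later), and set the short category only where no category has been assigned yet: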

df['short_category'] = ''

for cat in known_categories[::-1]:
    contains_cat = df['long_category'].str.contains(cat, case = False, regex = True)
    no_category = df['short_category'] == ''

    df.loc[contains_cat & no_category, 'short_category'] = cat
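
Because the loop above walks the list backwards, entries later in known_categories take priority. If entries nearer the front should win instead, as in the question's side note, one alternative sketch (not the code from this answer) uses numpy.select, which picks the first condition that is true for each row. first_match is a hypothetical name, and unmatched rows get an empty string here as an assumption.

```python
import numpy as np
import pandas as pd

def first_match(long_category, known_categories):
    """Label each row with the first keyword in known_categories that it contains."""
    # One boolean mask per keyword, in priority order
    conditions = [
        long_category.str.contains(cat, case=False, regex=True, na=False).to_numpy(dtype=bool)
        for cat in known_categories
    ]
    # np.select uses the first true condition per row; '' marks rows with no match
    labels = np.select(conditions, known_categories, default='')
    return pd.Series(labels, index=long_category.index)

sample = pd.Series([
    'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food',
    'Brewpubs, Breweries, Food',
])
print(first_match(sample, ['restaurant', 'ice cream']).tolist())  # ['restaurant', '']
print(first_match(sample, ['ice cream', 'restaurant']).tolist())  # ['ice cream', '']
```

Since each keyword is passed to str.contains as a regular expression, an entry such as 'pets|veterinian' keeps working as an alternation in this variant.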