My DataFrame looks like this, where long_category lists the business categories for each row:
df = pd.DataFrame({
'long_category': {0: 'Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists',
1: 'Shipping Centers, Local Services, Notaries, Mailbox Centers, Printing Services',
2: 'Department Stores, Shopping, Fashion, Home & Garden, Electronics, Furniture Stores',
3: 'Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries',
4: 'Brewpubs, Breweries, Food',
5: 'Burgers, Fast Food, Sandwiches, Food, Ice Cream & Frozen Yogurt, Restaurants',
6: 'Sporting Goods, Fashion, Shoe Stores, Shopping, Sports Wear, Accessories',
7: 'Synagogues, Religious Organizations',
8: 'Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek',
9: 'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food'}})
My goal is to shorten each of these to one of the following known_categories, depending on whether the long category contains a keyword from known_categories:
known_categories = ['restaurant', 'beauty & spas', 'hotels', 'health & medical', 'shopping', 'coffee & tea','automotive', 'pets|veterinian', 'services', 'stores', 'grocery', 'ice cream']
So this is what I do:
df['short_category'] = np.nan
for cat in known_categories:
    excluded_cats = [x for x in known_categories if x != cat]
    df['short_category'][~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False)) & (df.long_category.str.contains(cat, regex = True, case = False, na = False))] = cat
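It is important that the short categories are mutually exclusive. For example, the row with index 3 should go into either the 'restaurant' or the 'coffee & tea' short category, but not both, hence the
~(df.long_category.str.contains('|'.join(excluded_cats), regex = True, case = False, na = False))
condition above.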
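But this does not work. For example, the last two rows both have 'restaurant' in their long category, but only the first of them gets it as its short category. I expected the last row to end up as either 'restaurant' or 'ice cream', since it contains keywords from known_categories. So where exactly am I going wrong?

As a side note, if possible I would like to be able to influence the frequency of the short categories by moving a category towards the front or the back of known_categories. For example, with the current known_categories, I would want the last row to get 'restaurant' as its short category. But if I moved 'ice cream' ahead of 'restaurant' in known_categories, I would want the last row to show 'ice cream' instead of 'restaurant'.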
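One likely reason for the missing values can be checked directly: the exclusion condition rejects a row as soon as it contains any other keyword, so a row that matches two or more keywords is rejected for every cat and is never assigned. The sketch below is only an illustration under assumptions: it uses two rows from the question and a shortened, two-entry known_categories, and n_matches is a name introduced here.

```python
import pandas as pd

# Two rows copied from the question's data (indices 8 and 9 there)
df = pd.DataFrame({'long_category': [
    'Pubs, Restaurants, Italian, Bars, American (Traditional), Nightlife, Greek',
    'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food',
]})

# Shortened keyword list, for illustration only
known_categories = ['restaurant', 'ice cream']

# Count how many known keywords each row contains
n_matches = sum(
    df['long_category'].str.contains(cat, case=False, regex=False).astype(int)
    for cat in known_categories
)

print(n_matches.tolist())  # [1, 2]
```

The first row matches exactly one keyword, so the loop in the question can assign it. The second row matches two, so for 'restaurant' it is excluded because of 'ice cream', and for 'ice cream' it is excluded because of 'restaurant'.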
Use extractall, then a Categorical to order the categories, and groupby.min to pick the preferred category:
import re
known_categories = ['restaurant', 'beauty & spas', 'hotels', 'health & medical',
'shopping', 'coffee & tea','automotive', 'pets|veterinian',
'services', 'stores', 'grocery', 'ice cream']
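# craft the regex: one alternation, longest keywords first so they win over shorter overlaps
# note: re.escape turns 'pets|veterinian' into a literal string, so that entry is not treated as an alternation
pattern = r'(%s)' % '|'.join(map(re.escape, sorted(known_categories, key=len, reverse=True)))
# ordered categorical: the position in known_categories defines the priority
cat = pd.CategoricalDtype(categories=known_categories, ordered=True)
# extract every keyword found in each row, then keep the one that comes first in known_categories
df['short_category'] = (df['long_category'].str.lower()
                        .str.extractall(pattern)[0]
                        .astype(cat)
                        .groupby(level=0).min())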
Output:
long_category short_category
0 Doctors, Traditional Chinese Medicine, Naturop... health & medical
1 Shipping Centers, Local Services, Notaries, Ma... services
2 Department Stores, Shopping, Fashion, Home & G... shopping
3 Restaurants, Food, Bubble Tea, Coffee & Tea, B... restaurant
4 Brewpubs, Breweries, Food NaN
5 Burgers, Fast Food, Sandwiches, Food, Ice Crea... restaurant
6 Sporting Goods, Fashion, Shoe Stores, Shopping... shopping
7 Synagogues, Religious Organizations NaN
8 Pubs, Restaurants, Italian, Bars, American (Tr... restaurant
9 Ice Cream & Frozen Yogurt, Fast Food, Burgers,... restaurant
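To illustrate the side note about reordering, here is a small sketch that wraps the same extractall / ordered-Categorical idea in a function and runs it with two different orderings. The helper name shorten and the two-entry keyword lists are assumptions made for this example, not part of the original code.

```python
import re
import pandas as pd

def shorten(long_categories, known_categories):
    """Hypothetical helper: return the highest-priority keyword found in each entry."""
    # Longest keywords first, so longer alternatives are tried before shorter ones
    ordered_by_len = sorted(known_categories, key=len, reverse=True)
    pattern = '(%s)' % '|'.join(map(re.escape, ordered_by_len))
    # The order of known_categories defines the priority (earlier = preferred)
    dtype = pd.CategoricalDtype(categories=known_categories, ordered=True)
    matches = long_categories.str.lower().str.extractall(pattern)[0].astype(dtype)
    # Rows without any match are restored as NaN by the reindex
    return matches.groupby(level=0).min().reindex(long_categories.index)

s = pd.Series(['Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food'])

restaurant_first = shorten(s, ['restaurant', 'ice cream'])
ice_cream_first = shorten(s, ['ice cream', 'restaurant'])

print(restaurant_first.iloc[0])  # restaurant
print(ice_cream_first.iloc[0])   # ice cream
```

Because the minimum is taken over an ordered categorical, moving a keyword towards the front of the list should be enough to make it win whenever several keywords match the same row.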
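Go through your category list backwards (assuming the later entries are the more specific ones), and only set a category as the short category if no category has been assigned to that row yet: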
df['short_category'] = ''
for cat in known_categories[::-1]:
    contains_cat = df['long_category'].str.contains(cat, case = False, regex = True)
    no_category = df['short_category'] == ''
    df.loc[contains_cat & no_category, 'short_category'] = cat
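Since the loop above walks the list backwards, later entries take precedence. If earlier entries should win instead, as the side note in the question suggests, the same loop can be run in list order. The following is a minimal sketch of that variant, assuming a three-row sample and a shortened known_categories chosen only for illustration:

```python
import pandas as pd

# Three rows copied from the question's data (indices 3, 4 and 9 there)
df = pd.DataFrame({'long_category': [
    'Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries',
    'Brewpubs, Breweries, Food',
    'Ice Cream & Frozen Yogurt, Fast Food, Burgers, Restaurants, Food',
]})

# Shortened keyword list, for illustration only
known_categories = ['restaurant', 'coffee & tea', 'ice cream']

df['short_category'] = ''
for cat in known_categories:  # forward order: earlier entries claim rows first
    contains_cat = df['long_category'].str.contains(cat, case=False, regex=True)
    no_category = df['short_category'] == ''
    # Only fill rows that match and have not been assigned yet
    df.loc[contains_cat & no_category, 'short_category'] = cat

print(df['short_category'].tolist())  # ['restaurant', '', 'restaurant']
```

Rows that match no keyword keep the empty string, and each row receives exactly one category, so the short categories stay mutually exclusive.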