I have a dataframe df that looks like this:
   id      Type                                       agent_id  created_at
0  44525   Stunning 6 bedroom villa in New Delhi      184       2018-03-09
1  44859   Villa for sale in Amritsar                 182       2017-02-19
2  45465   House in Faridabad                         154       2017-04-17
3  50685   5 Hectre land near New Delhi               113       2017-09-01
4  130728  Duplex in Mumbai                           157       2017-02-07
5  130856  Large plot with fantastic views in Mumbai  137       2018-01-16
6  130857  Modern Design Penthouse in Bangalore       199       2017-03-24
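I am trying to clean this tabular data by extracting keywords from the Type column, and then build a new dataframe with the new columns. These are the keyword lists: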
Apartment = ['apartment', 'penthouse', 'duplex']
House = ['house', 'villa', 'country estate']
Plot = ['plot', 'land']
Location = ['New Delhi','Mumbai','Bangalore','Amritsar']
So the desired dataframe should look like this:
   id      Type       Location   agent_id  created_at
0  44525   House      New Delhi  184       2018-03-09
1  44859   House      Amritsar   182       2017-02-19
2  45465   House      Faridabad  154       2017-04-17
3  50685   Plot       New Delhi  113       2017-09-01
4  130728  Apartment  Mumbai     157       2017-02-07
5  130856  Plot       Mumbai     137       2018-01-16
6  130857  Apartment  Bangalore  199       2017-03-24
This is what I have tried so far:
import pandas as pd

df = pd.read_csv('test_data.csv')

# I can extract these keywords one by one using for loops, but how
# can I do this in pandas with the fewest possible lines of code?
for index, value in df['Type'].items():
    for keyword in Apartment:
        # The keyword lists are lowercase, so compare against the lowercased title.
        if keyword in value.lower():
            print(keyword)

df_new = pd.DataFrame(df['id'])
Can someone tell me how to solve this?
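First create the Location column with str.extract, joining the entries of the Location list with | (regex OR):
pat = '|'.join(r"\b{}\b".format(x) for x in Location)
df['Location'] = df['Type'].str.extract('(' + pat + ')', expand=False)
Then create a dictionary from the other lists, swap the keys with the values, and in a loop set the values through a mask built with str.contains. The keyword lists are lowercase while the titles are capitalized, so the match has to ignore case.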
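The two steps above can be sketched end to end as follows. This is a minimal reconstruction, not necessarily the answerer's exact code: the names `categories`, `keyword_to_category`, and `titles` are hypothetical, the sample rows are copied from the question, and the `case=False`, `na=False`, and word-boundary choices are assumptions made so that lowercase keywords match capitalized titles and 'house' does not match inside 'Penthouse'.

```python
import pandas as pd

# Sample rows copied from the question (only the columns needed here).
df = pd.DataFrame({
    'id': [44525, 44859, 45465, 50685, 130728, 130856, 130857],
    'Type': ['Stunning 6 bedroom villa in New Delhi',
             'Villa for sale in Amritsar',
             'House in Faridabad',
             '5 Hectre land near New Delhi',
             'Duplex in Mumbai',
             'Large plot with fantastic views in Mumbai',
             'Modern Design Penthouse in Bangalore'],
})

Apartment = ['apartment', 'penthouse', 'duplex']
House = ['house', 'villa', 'country estate']
Plot = ['plot', 'land']
Location = ['New Delhi', 'Mumbai', 'Bangalore', 'Amritsar']

# Step 1: extract the first known location found in each title.
pat = '|'.join(r"\b{}\b".format(x) for x in Location)
df['Location'] = df['Type'].str.extract('(' + pat + ')', expand=False)

# Step 2: build category -> keywords, then invert it to keyword -> category.
categories = {'Apartment': Apartment, 'House': House, 'Plot': Plot}
keyword_to_category = {kw: cat for cat, kws in categories.items() for kw in kws}

# Match against a copy of the original titles so that earlier
# replacements cannot influence later matches.
titles = df['Type'].copy()
for kw, cat in keyword_to_category.items():
    # Whole-word, case-insensitive match; missing titles count as no match.
    mask = titles.str.contains(r"\b{}\b".format(kw), case=False, na=False)
    df.loc[mask, 'Type'] = cat

print(df)
```

Note that 'Faridabad' is not in the Location list, so row 2 ends up with a missing Location here; that city would have to be added to the list to get the desired output shown in the question.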
    106     if isna(key).any():
--> 107         raise ValueError('cannot index with vector containing '
    108                          'NA / NaN values')
    109     return False
ValueError: cannot index with vector containing NA / NaN values
I am getting the above error.
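One possible cause, assuming the real Type column contains missing values: str.contains can return NaN for those rows, and a mask that holds NaN cannot be used for boolean indexing with df.loc. Passing na=False to str.contains treats a missing title as a non-match. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data: the second title is missing.
df = pd.DataFrame({'Type': ['Duplex in Mumbai', None, 'House in Faridabad']})

# na=False makes rows with a missing title count as "no match",
# so the mask contains only True/False and is safe to use with .loc.
mask = df['Type'].str.contains('duplex', case=False, na=False)
df.loc[mask, 'Type'] = 'Apartment'

print(df)
```

Rows with a missing title are left untouched; they could also be dropped or filled beforehand if that suits the data better.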