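I converted this file from PDF to CSV in order to train a model. Three columns from the PDF (ProductID, Commodity, and Country) were merged into a single column in the CSV.
I tried to separate these columns with a regular expression, but I am not sure how to go about it.
This is the dataset I am working with: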
   country/commodity                 Unit  Quantity  Value
1  0011101 BREEDING BULLS (OXEN)     NO    NaN       75
2  DUBAI                             NaN   NaN       75
3  0011102 BREEDING BULLS (BUFFALO)  NO    248       1921
4  SRI LUNKA                         NaN   248       1921
5  0011103 BUFFALO,BREEDING          NO    NaN       90
6  SRI LUNKA                         NaN   NaN       90
7  0011104 COWS BREEDING             NO    1249      258921665
8  AJMAN                             NaN   NaN       NaN
9  CYPRUS                            NaN   NaN       NaN
I need the data in the following format:
0  ProductID  Commodity                 Country    Unit  Quantity  Value
1  0011101    BREEDING BULLS (OXEN)     DUBAI      NaN   NaN       75
3  0011102    BREEDING BULLS (BUFFALO)  SRI LUNKA  NaN   248       1921
4  0011103    BUFFALO,BREEDING          SRI LUNKA  NaN   NaN       90
7  0011104    COWS BREEDING             AJMAN      NaN   NaN       NaN
8  0011104    COWS BREEDING             CYPRUS     NaN   NaN       NaN
9  0011104    COWS BREEDING             CHINA      NaN   590       3290
First, we build the ProductID, Commodity, and Country columns by extracting information from the country/commodity column, using the following methods:
str.split
str.extract
Series.where
Series.mask
str.contains
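To make the role of each method concrete, here is a minimal sketch that applies them one at a time to a few hand-typed values taken from the question's table. The variable names (col, is_product, ids, and so on) are made up for illustration and are not part of the answer's code.

```python
import pandas as pd

# Hand-typed sample mirroring a few rows of the merged column (illustrative only).
col = pd.Series([
    "0011101 BREEDING BULLS (OXEN)",
    "DUBAI",
    "0011104 COWS BREEDING",
    "AJMAN",
    "CYPRUS",
])

# str.contains: True on product rows (they contain digits), False on country rows.
is_product = col.str.contains(r"\d+")

# str.extract: pull out the run of digits; rows without digits give NaN.
ids = col.str.extract(r"(\d+)", expand=False)

# str.split: cut at the first space; element 0 is the leading token.
first_token = col.str.split(" ", n=1).str[0]

# Series.where keeps values where the condition is True and puts NaN elsewhere.
only_products = col.where(is_product)

# Series.mask is the inverse: it replaces values where the condition is True.
only_countries = col.mask(is_product)
```

Chaining ffill() after the extraction step then carries each product ID down onto the country rows that follow it.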
Then we GroupBy on ProductID to bring the information for each product together. For this we use named aggregation, which has been available since pandas 0.25.0:
# Extract information from country/commodity
# (raw strings avoid invalid-escape warnings; n=1 must be a keyword in pandas 2.x)
df['ProductID'] = df['country/commodity'].str.split(' ', n=1).str[0].str.extract(r'(\d+)', expand=False).ffill()
df['Commodity'] = df['country/commodity'].str.split(r'\d+').str[-1].str.strip().where(df['Unit'].notna())
df['Country'] = df['country/commodity'].mask(df['country/commodity'].str.contains(r'\d+')).fillna('')
# Group by ProductID to bring each product's rows together
df_new = df.groupby(['ProductID']).agg(
    Commodity=('Commodity', 'first'),
    Country=('Country', ', '.join),
    Unit=('Unit', 'first'),
    Quantity=('Quantity', 'first'),
    Value=('Value', 'first')
).reset_index()
# Remove the leading comma left by the empty Country on each product row
df_new['Country'] = df_new['Country'].str.lstrip(', ')
Output:
  ProductID                 Commodity        Country Unit  Quantity        Value
0   0011101     BREEDING BULLS (OXEN)          DUBAI   NO       NaN         75.0
1   0011102  BREEDING BULLS (BUFFALO)      SRI LUNKA   NO     248.0       1921.0
2   0011103          BUFFALO,BREEDING      SRI LUNKA   NO       NaN         90.0
3   0011104             COWS BREEDING  AJMAN, CYPRUS   NO    1249.0  258921665.0
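Note that this result joins the countries of a product into one string, whereas the requested format has one row per country. If that layout is preferred, one possible variant (not part of the approach above, and only a sketch on hand-typed sample rows) is to carry the product fields down with ffill() and keep just the country rows. The names parts, is_product, and out are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-typed subset of the question's data (assumption: numbers parsed as floats).
df = pd.DataFrame({
    "country/commodity": [
        "0011101 BREEDING BULLS (OXEN)", "DUBAI",
        "0011104 COWS BREEDING", "AJMAN", "CYPRUS",
    ],
    "Unit": ["NO", np.nan, "NO", np.nan, np.nan],
    "Quantity": [np.nan, np.nan, 1249, np.nan, np.nan],
    "Value": [75, 75, 258921665, np.nan, np.nan],
})
col = df["country/commodity"]
is_product = col.str.contains(r"\d+")

# Split product rows into ID and name, then carry both down to the country rows.
parts = (
    col.where(is_product)
       .str.extract(r"^(?P<ProductID>\d+)\s+(?P<Commodity>.+)$")
       .ffill()
)

# Keep only the country rows, so that each country gets its own line.
columns = ["ProductID", "Commodity", "Country", "Unit", "Quantity", "Value"]
out = (
    df.loc[~is_product, ["Unit", "Quantity", "Value"]]
      .assign(Country=col[~is_product])
      .join(parts)
      .loc[:, columns]
      .reset_index(drop=True)
)
print(out)
```

Here Unit, Quantity, and Value come from the country rows themselves, which matches the requested table; taking them from the product row instead would require forward-filling those columns as well.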