I have the following pandas DataFrame.
org_id org_name location_id loc_status city country
0 100023310 advance GmbH LOC-100052061 ACTIVE Planegg Germany
1 100023310 advance GmbH LOC-100032442 ACTIVE Planegg Germany
2 100023310 advance GmbH LOC-100042003 INACTIVE Planegg Germany
3 100004261 Beacon Limited LOC-100005615 ACTIVE Tunbridge Wells United Kingdom
4 100004261 Beacon Limited LOC-100000912 ACTIVE Crowborough United Kingdom
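I want to group the rows by the org_id and org_name columns and, for each of the other columns, find the unique values and join them with the delimiter "|".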
I am using the following lines of code.
gr_columns = [x for x in df.columns if x not in ['location_id', 'loc_status','city', 'country']]
df.groupby(gr_columns).agg(lambda col: '|'.join(col))
However, the resulting DataFrame is missing some columns (city and country). I get the following output.
org_id org_name location_id loc_status
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003 ACTIVE|INACTIVE
2 100004261 Beacon Limited LOC-100005615 ACTIVE
I also get the following warning.
FutureWarning: Dropping invalid columns in DataFrameGroupBy.agg is deprecated. In a future version, a TypeError will be raised. Before calling .agg, select only columns which should be valid for the function.
df.groupby(gr_columns).agg(lambda col: ','.join(col))
The expected output is:
org_id org_name location_id loc_status city country
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003 ACTIVE|INACTIVE Planegg Germany
2 100004261 Beacon Limited LOC-100005615 ACTIVE Tunbridge Wells|Crowborough United Kingdom
Any help is greatly appreciated.
I think you are looking for:
df.groupby(['org_id', 'org_name'], as_index=False).agg(lambda x: '|'.join(x.unique()))
org_id org_name location_id \
0 100004261 Beacon Limited LOC-100005615|LOC-100000912
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003
loc_status city country
0 ACTIVE Tunbridge Wells|Crowborough United Kingdom
1 ACTIVE|INACTIVE Planegg Germany
Update
In fact, it seems you want to join the unique values of every remaining column:
join_unique = lambda x: '|'.join(x.unique())
out = df.groupby(['org_id', 'org_name'], as_index=False).agg(join_unique)
print(out)
# Output
org_id org_name location_id loc_status city country
0 100004261 Beacon Limited LOC-100000912|LOC-100005615 ACTIVE Crowborough|Tunbridge Wells United Kingdom
1 100023310 advance GmbH LOC-100032442|LOC-100042003|LOC-100052061 ACTIVE|INACTIVE Planegg Germany
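If some columns still disappear from the result, one possible cause is that they contain non-string entries such as NaN, which makes '|'.join raise a TypeError for that column. This is an assumption, since the sample data in the question shows no such values. Below is a minimal sketch of a more defensive variant. The helper name join_unique_safe is hypothetical, and the NaN in the city column is added purely for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the NaN in "city" is a hypothetical addition
df = pd.DataFrame({
    "org_id": [100023310, 100023310, 100023310, 100004261, 100004261],
    "org_name": ["advance GmbH"] * 3 + ["Beacon Limited"] * 2,
    "location_id": ["LOC-100052061", "LOC-100032442", "LOC-100042003",
                    "LOC-100005615", "LOC-100000912"],
    "loc_status": ["ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE"],
    "city": ["Planegg", np.nan, "Planegg", "Tunbridge Wells", "Crowborough"],
    "country": ["Germany", "Germany", "Germany",
                "United Kingdom", "United Kingdom"],
})

def join_unique_safe(col):
    # Drop missing values, cast the rest to str, and keep the
    # first-seen order of the unique values before joining
    return "|".join(col.dropna().astype(str).unique())

out = df.groupby(["org_id", "org_name"], as_index=False).agg(join_unique_safe)
print(out)
```

Because every remaining value is cast to str before joining, the function should not raise for numeric or missing entries, so no column is silently dropped.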
Old answer
You can use groupby followed by agg:
>>> (df.groupby(['org_id', 'org_name'], as_index=False)
...    .agg({'location_id': '|'.join, 'city': 'first', 'country': 'first'}))
org_id org_name location_id city country
0 100004261 Beacon Limited LOC-100005615|LOC-100000912 Tunbridge Wells United Kingdom
1 100023310 advance GmbH LOC-100052061|LOC-100032442|LOC-100042003 Planegg Germany
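The per-column dictionary can also be built programmatically, so that no column is left out by accident. The following is only a sketch: the names keys, join_cols and agg_spec are hypothetical, and which columns to join rather than reduce with 'first' is an assumption made for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "org_id": [100023310, 100023310, 100023310, 100004261, 100004261],
    "org_name": ["advance GmbH"] * 3 + ["Beacon Limited"] * 2,
    "location_id": ["LOC-100052061", "LOC-100032442", "LOC-100042003",
                    "LOC-100005615", "LOC-100000912"],
    "loc_status": ["ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE"],
    "city": ["Planegg", "Planegg", "Planegg", "Tunbridge Wells", "Crowborough"],
    "country": ["Germany", "Germany", "Germany",
                "United Kingdom", "United Kingdom"],
})

keys = ["org_id", "org_name"]
join_cols = ["location_id", "loc_status"]  # hypothetical choice of columns to join

# Join the selected columns with "|"; take the first value of every other column
agg_spec = {
    col: ("|".join if col in join_cols else "first")
    for col in df.columns
    if col not in keys
}

out = df.groupby(keys, as_index=False).agg(agg_spec)
print(out)
```

Note that a plain '|'.join keeps duplicates, and 'first' discards every city after the first one in each group, which is why the updated answer joins unique values instead.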
Here is one way to solve your problem:
print( df.groupby(['org_id','org_name']).apply(lambda d: d.apply(lambda col: '|'.join(set(col)))).reset_index() )
Output:
org_id org_name location_id loc_status city country
0 100004261 Beacon Limited LOC-100000912|LOC-100005615 ACTIVE Crowborough|Tunbridge Wells United Kingdom
1 100023310 advance GmbH LOC-100052061|LOC-100042003|LOC-100032442 INACTIVE|ACTIVE Planegg Germany
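One caveat with set: it does not guarantee any particular order, so the joined strings can differ between runs. If a stable order matters, a sketch along the following lines sorts the unique values before joining. The helper name join_sorted_unique is hypothetical, and it assumes every non-key column holds strings.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "org_id": [100023310, 100023310, 100023310, 100004261, 100004261],
    "org_name": ["advance GmbH"] * 3 + ["Beacon Limited"] * 2,
    "location_id": ["LOC-100052061", "LOC-100032442", "LOC-100042003",
                    "LOC-100005615", "LOC-100000912"],
    "loc_status": ["ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE"],
    "city": ["Planegg", "Planegg", "Planegg", "Tunbridge Wells", "Crowborough"],
    "country": ["Germany", "Germany", "Germany",
                "United Kingdom", "United Kingdom"],
})

def join_sorted_unique(col):
    # set() removes duplicates; sorted() makes the order deterministic
    return "|".join(sorted(set(col)))

out = df.groupby(["org_id", "org_name"], as_index=False).agg(join_sorted_unique)
print(out)
```

Using agg here instead of nested apply calls also keeps the grouping keys out of the function, so only the non-key columns are joined.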