Group by and join unique values with a delimiter in a Pandas DataFrame

Question  Votes: 0  Answers: 3

I have the following pandas DataFrame.

    org_id  org_name    location_id             loc_status  city            country
0   100023310   advance GmbH    LOC-100052061   ACTIVE      Planegg         Germany
1   100023310   advance GmbH    LOC-100032442   ACTIVE      Planegg         Germany
2   100023310   advance GmbH    LOC-100042003   INACTIVE    Planegg         Germany
3   100004261   Beacon Limited  LOC-100005615   ACTIVE      Tunbridge Wells United Kingdom
4   100004261   Beacon Limited  LOC-100000912   ACTIVE      Crowborough     United Kingdom

I want to group the rows by the org_id and org_name columns and, for each of the other columns, join the unique values with the delimiter "|".

I am using the following lines of code.

gr_columns = [x for x in df.columns if x not in ['location_id', 'loc_status','city', 'country']]
df.groupby(gr_columns).agg(lambda col: '|'.join(col))

However, the resulting DataFrame is missing some columns (city and country). I get the following output.

    org_id  org_name    location_id             loc_status
1   100023310   advance GmbH    LOC-100052061|LOC-100032442|LOC-100042003   ACTIVE|INACTIVE     
2   100004261   Beacon Limited  LOC-100005615   ACTIVE     

I also get the following warning.


FutureWarning: Dropping invalid columns in DataFrameGroupBy.agg is deprecated. In a future version, a TypeError will be raised. Before calling .agg, select only columns which should be valid for the function.
  df.groupby(gr_columns).agg(lambda col: ','.join(col))

The expected output is:

    org_id  org_name    location_id             loc_status  city            country
1   100023310   advance GmbH    LOC-100052061|LOC-100032442|LOC-100042003   ACTIVE|INACTIVE     Planegg         Germany
2   100004261   Beacon Limited  LOC-100005615   ACTIVE      Tunbridge Wells|Crowborough United Kingdom

Any help is much appreciated.

python pandas group-by aggregate
3 Answers
1
votes

I think you are looking for:

df.groupby(['org_id', 'org_name'], as_index=False).agg(lambda x: '|'.join(x.unique()))




    org_id        org_name                                location_id  \
0  100004261  Beacon Limited                LOC-100005615|LOC-100000912   
1  100023310    advance GmbH  LOC-100052061|LOC-100032442|LOC-100042003   

        loc_status                         city         country  
0           ACTIVE  Tunbridge Wells|Crowborough  United Kingdom  
1  ACTIVE|INACTIVE                      Planegg         Germany  
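
For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same idea. The helper name join_unique and its sep parameter are illustrative choices, not part of pandas.

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question.
df = pd.DataFrame({
    "org_id": [100023310, 100023310, 100023310, 100004261, 100004261],
    "org_name": ["advance GmbH"] * 3 + ["Beacon Limited"] * 2,
    "location_id": ["LOC-100052061", "LOC-100032442", "LOC-100042003",
                    "LOC-100005615", "LOC-100000912"],
    "loc_status": ["ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE"],
    "city": ["Planegg", "Planegg", "Planegg", "Tunbridge Wells", "Crowborough"],
    "country": ["Germany"] * 3 + ["United Kingdom"] * 2,
})

def join_unique(values, sep="|"):
    # Series.unique() keeps values in order of first appearance.
    return sep.join(values.unique())

# agg() calls join_unique once per non-key column within each group.
out = df.groupby(["org_id", "org_name"], as_index=False).agg(join_unique)
print(out)
```

Because groupby sorts the group keys by default, the row for org_id 100004261 comes first, while the joined values inside each cell stay in their original row order.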

1
votes

Update

In fact, it seems you want to join the unique values of every column:

join_unique = lambda x: '|'.join(x.unique())
out = df.groupby(['org_id', 'org_name'], as_index=False).agg(join_unique)
print(out)

# Output
      org_id        org_name                                location_id       loc_status                         city         country
0  100004261  Beacon Limited                LOC-100000912|LOC-100005615           ACTIVE  Crowborough|Tunbridge Wells  United Kingdom
1  100023310    advance GmbH  LOC-100032442|LOC-100042003|LOC-100052061  ACTIVE|INACTIVE                      Planegg         Germany
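
One caveat: '|'.join only accepts strings, so join_unique raises a TypeError if a column holds missing values or numbers. The question does not show such rows, so the data below is hypothetical, and the helper name join_unique_safe is my own. A possible variant that tolerates such columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one city is missing (NaN).
df = pd.DataFrame({
    "org_id": [1, 1, 2],
    "org_name": ["A", "A", "B"],
    "city": ["Planegg", np.nan, "Crowborough"],
})

def join_unique_safe(values, sep="|"):
    # Drop missing entries and cast the rest to str before joining,
    # so str.join never sees a non-string value.
    return sep.join(values.dropna().astype(str).unique())

out = df.groupby(["org_id", "org_name"], as_index=False).agg(join_unique_safe)
print(out)
```

Whether dropping missing values is the right behaviour depends on the data; another option would be to replace them with a placeholder via fillna before grouping.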


Old answer

You can use groupby and agg:

>>> (df.groupby(['org_id', 'org_name'], as_index=False)
       .agg({'location_id': '|'.join, 'city': 'first', 'country': 'first'}))

      org_id        org_name                                location_id             city         country
0  100004261  Beacon Limited                LOC-100005615|LOC-100000912  Tunbridge Wells  United Kingdom
1  100023310    advance GmbH  LOC-100052061|LOC-100032442|LOC-100042003          Planegg         Germany
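
The per-column dictionary above can also be written with named aggregation, which lets you choose the output column names. This is only a sketch: the output names location_ids and n_locations and the helper join_unique are made up for illustration.

```python
import pandas as pd

# A subset of the question's columns, rebuilt by hand.
df = pd.DataFrame({
    "org_id": [100023310, 100023310, 100023310, 100004261, 100004261],
    "org_name": ["advance GmbH"] * 3 + ["Beacon Limited"] * 2,
    "location_id": ["LOC-100052061", "LOC-100032442", "LOC-100042003",
                    "LOC-100005615", "LOC-100000912"],
    "city": ["Planegg", "Planegg", "Planegg", "Tunbridge Wells", "Crowborough"],
})

def join_unique(values):
    # Join distinct values in order of first appearance.
    return "|".join(values.unique())

# Each keyword is an output column: (input column, aggregation function).
out = df.groupby(["org_id", "org_name"], as_index=False).agg(
    location_ids=("location_id", join_unique),
    n_locations=("location_id", "nunique"),
    city=("city", "first"),
)
print(out)
```

Note that 'first' keeps only one city per group, so this form suits columns that are constant within a group; use join_unique where a group can hold several distinct values.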

0
votes

Here is one way to solve your problem:

print( df.groupby(['org_id','org_name']).apply(lambda d: d.apply(lambda col: '|'.join(set(col)))).reset_index() )

Output:

      org_id        org_name                                location_id       loc_status                         city         country
0  100004261  Beacon Limited                LOC-100000912|LOC-100005615           ACTIVE  Crowborough|Tunbridge Wells  United Kingdom
1  100023310    advance GmbH  LOC-100052061|LOC-100042003|LOC-100032442  INACTIVE|ACTIVE                      Planegg         Germany
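
Note that a Python set has no guaranteed iteration order, which is why a cell such as INACTIVE|ACTIVE can come out in a different order from one run to the next. If a stable order matters, two options are sketched below on a small made-up Series:

```python
import pandas as pd

status = pd.Series(["INACTIVE", "ACTIVE", "INACTIVE", "ACTIVE"])

# Option 1: Series.unique() keeps the order of first appearance.
first_seen = "|".join(status.unique())

# Option 2: sort the distinct values for an alphabetical result.
alphabetical = "|".join(sorted(set(status)))

print(first_seen)    # INACTIVE|ACTIVE
print(alphabetical)  # ACTIVE|INACTIVE
```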