如何将数据帧行分组到 pandas groupby 的列表中

Question

我有一个 pandas 数据框

df

，例如：

a b
A 1
A 2
B 5
B 5
B 4
C 6

我想按第一列进行分组，并将第二列作为行中的列表：

A [1,2]
B [5,5,4]
C [6]

是否可以使用 pandas groupby 执行类似的操作？

Answer 1

您可以使用

groupby

对感兴趣的列进行分组，然后对每个组使用

apply

list

：

In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

Answer 2

实现这一目标的便捷方法是：

df.groupby('a').agg({'b':lambda x: list(x)})

考虑编写自定义聚合：https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py

Answer 3

如果性能很重要，请降低到 numpy 级别：

import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
         keys, values = df.sort_values('a').values.T
         ukeys, index = np.unique(keys, True)
         arrays = np.split(values, index[1:])
         df2 = pd.DataFrame({'a':ukeys, 'b':[list(a) for a in arrays]})
         return df2

测试：

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop

Answer 4

要解决数据帧的几列的问题：

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

这个答案的灵感来自于Anamika Modi的答案。谢谢！

Answer 5

使用以下任何

groupby

和

agg

食谱。

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

要将多列聚合为列表，请使用以下任一方法：

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

要仅对单个列进行分组列出，请将 groupby 转换为

SeriesGroupBy

对象，然后调用

SeriesGroupBy.agg

。使用，

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

Answer 6

正如您所说，

groupby

对象的

pd.DataFrame

方法可以完成这项工作。

示例

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 import pandas as pd
 df = pd.DataFrame(zip(L,N),columns = list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

给出了组的按索引描述。

要获取单个组的元素，您可以这样做

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4

Answer 7

是时候使用

agg

代替

apply

了。

什么时候

df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})

如果您希望将多个列堆叠到 list 中，则结果为

pd.DataFrame

df.groupby('a')[['b', 'c']].agg(list)
# or 
df.groupby('a').agg(list)

如果您希望列表中只有单列，则结果为

ps.Series

df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)

注意，当您仅聚合单列时，

pd.DataFrame

的结果比

ps.Series

的结果慢约 10 倍，请在多列情况下使用它。

Answer 8

只是一个补充。

pandas.pivot_table

更通用，看起来更方便：

"""data"""
df = pd.DataFrame( {'a':['A','A','B','B','B','C'],
                    'b':[1,2,5,5,4,6],
                    'c':[1,2,1,1,1,6]})
print(df)

   a  b  c
0  A  1  1
1  A  2  2
2  B  5  1
3  B  5  1
4  B  4  1
5  C  6  6

"""pivot_table"""
pt = pd.pivot_table(df,
                    values=['b', 'c'],
                    index='a',
                    aggfunc={'b': list,
                             'c': set})
print(pt)
           b       c
a                   
A     [1, 2]  {1, 2}
B  [5, 5, 4]     {1}
C        [6]     {6}

Answer 9

如果在对多列进行分组时寻找唯一列表，这可能会有所帮助：

df.groupby('a').agg(lambda x: list(set(x))).reset_index()

Answer 10

基于 @B.M 答案，这是一个更通用的版本，并更新为与较新的库版本一起使用：（numpy 版本

1.19.2

，pandas 版本

1.2.1

）而且这个解决方案还可以处理多索引：

但是这还没有经过严格测试，请谨慎使用。

如果性能很重要，请降低到 numpy 级别：

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})


def f_multi(df,col_names):
    if not isinstance(col_names,list):
        col_names = [col_names]
        
    values = df.sort_values(col_names).values.T

    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]

    # split df into indexing colums(=keys) and data colums(=vals)
    keys = values[col_idcs,:]
    vals = values[other_col_idcs,:]
    
    # list of tuple of key pairs
    multikeys = list(zip(*keys))
    
    # remember unique key pairs and ther indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)
    
    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)

    # resulting list of subarrays has same number of subarrays as unique key pairs
    # each subarray has the following shape:
    #    rows = number of non-grouped data columns
    #    cols = number of data points grouped into that unique key pair
    
    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names) 

    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1] # first entries are the subarrays from above 
        col_name = tup[-1]  # last entry is data-column name
        
        list_agg_vals[col_name] = col_vals

    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2

测试：

In [227]: %timeit f_multi(df, ['a','d'])

2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [228]: %timeit df.groupby(['a','d']).agg(list)

4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

结果：

对于随机种子 0 会得到：

Answer 11

我发现实现相同目标的最简单方法，至少对于一列，类似于Anamika的答案，只是使用聚合函数的元组语法。

df.groupby('a').agg(b=('b','unique'), c=('c','unique'))

Answer 12

让我们将

df.groupby

与列表和

Series

构造函数一起使用

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

Answer 13

排序消耗

O(nlog(n))

时间，这是上面建议的解决方案中最耗时的操作

对于一个简单的解决方案（包含单列）

pd.Series.to_list

可以工作并且可以被认为更有效，除非考虑其他框架

例如

import pandas as pd
from string import ascii_lowercase
import random

def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])

df = pd.DataFrame({'num_val':[random.randint(0,100) for _ in range(20000000)],'string_val':[generate_string() for _ in range(20000000)]})


%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})

对于 2000 万条记录，大约需要

17.2 seconds

。与需要大约

apply(list)

的

19.2

和需要大约

20.6s

的 lambda 函数相比

Answer 14

这里我用“|”对元素进行了分组作为分隔符

    import pandas as pd

    df = pd.read_csv('input.csv')

    df
    Out[1]:
      Area  Keywords
    0  A  1
    1  A  2
    2  B  5
    3  B  5
    4  B  4
    5  C  6

    df.dropna(inplace =  True)
    df['Area']=df['Area'].apply(lambda x:x.lower().strip())
    print df.columns
    df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x)})

    df_op.to_csv('output.csv')
    Out[2]:
    df_op
    Area  Keywords

    A       [1| 2]
    B    [5| 5| 4]
    C          [6]

Answer 15

答案基于@EdChum对其答案的评论。评论是这样的-

groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think

首先创建一个数据框，第一列有 500k 个类别，总 df 形状为 2000 万，如问题中提到的。

df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()

# Sort data by first column 
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))

# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))

# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']

# Now create final list_b column, using min and max indexes for each category of a and filtering list of b. 
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)

print(gp_df.shape)
gp_df.head()

上面的代码需要 2 分钟才能处理第一列中的 2000 万行和 500k 个类别。

Answer 16

只是为了添加之前的答案，就我而言，我想要列表和其他功能，例如

min

和

max

。方法是：

df = pd.DataFrame({
    'a':['A','A','B','B','B','C'], 
    'b':[1,2,5,5,4,6]
})

df=df.groupby('a').agg({
    'b':['min', 'max',lambda x: list(x)]
})

#then flattening and renaming if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'},inplace=True)

如何将数据帧行分组到 pandas groupby 的列表中

问题描述投票：0回答：16

16个回答

如果性能很重要，请降低到 numpy 级别：

测试：

如果性能很重要，请降低到 numpy 级别：

测试：

结果：

最新问题

如何将数据帧行分组到 pandas groupby 的列表中

问题描述 投票：0回答：16

16个回答

如果性能很重要，请降低到 numpy 级别：

测试：

如果性能很重要，请降低到 numpy 级别：

测试：

结果：

最新问题

问题描述投票：0回答：16