app_year_start name
2012 John Smith
2012 John Smith
2012 John Smith
2012 Jane Doe
2013 Jane Doe
2012 John Snow
2015 John Snow
2014 John Smith
2015 John Snow
2012 John Snow
2012 John Smith
2012 John Smith
2012 John Smith
2012 John Smith
2012 Jane Doe
2013 Jane Doe
2012 John Snow
2015 John Snow
2014 John Smith
2015 John Snow
2012 John Snow
2012 John Smith
I would like to get the following output:
app_year_start name total_apps
2012 John Smith 8
2013 Jane Doe 2
2014 John Smith 2
2015 John Snow 4
I tried using `groupby` to organize the data by year, followed by several methods such as `value_counts()`, `count()`, `max()`, etc. This is the closest I have gotten:
df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)
However, this does not quite produce the expected output.
I consulted the following posts: 1, 2, 3, but none of them worked in my case.
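To reproduce the problem, the sample data can be rebuilt as in the sketch below. The frame is named `df3` only because the attempt above uses that name, and the 22 rows are entered as the same 11 rows repeated twice, which is how they appear in the listing. The sketch also suggests why the attempt falls short: the result is sorted globally and still holds one row per (year, name) pair rather than one row per year.

```python
import pandas as pd

# The 22 sample rows are the same 11 rows listed twice.
rows = [
    (2012, "John Smith"), (2012, "John Smith"), (2012, "John Smith"),
    (2012, "Jane Doe"), (2013, "Jane Doe"), (2012, "John Snow"),
    (2015, "John Snow"), (2014, "John Smith"), (2015, "John Snow"),
    (2012, "John Snow"), (2012, "John Smith"),
]
df3 = pd.DataFrame(rows * 2, columns=["app_year_start", "name"])

# The attempt from the question: counts per (year, name), sorted globally.
attempt = (
    df3.groupby(["app_year_start"])["name"]
    .value_counts()
    .sort_values(ascending=False)
)
print(attempt)

# One row per (year, name) pair, so 2012 contributes three rows, not one.
print(len(attempt), "rows for", df3["app_year_start"].nunique(), "years")
```

What is still missing is a step that keeps only the largest count within each year.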
You can cross-tabulate and then find the maximum in each row.
(
# cross tabulate to get each applicant's number of applications
pd.crosstab(df['app_year_start'], df['name'])
# the applicant with most applications and their counts
.agg(['idxmax', 'max'], axis=1)
# change column names
.set_axis(['name','total_apps'], axis=1)
# flatten df
.reset_index()
)
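To see what the cross-tabulation approach works with, it may help to print the intermediate table. The following is a minimal sketch, not the answer's exact code: the `counts` dictionary is just a compact, assumed way of rebuilding the question's 22 rows, and `idxmax` and `max` are called separately instead of through `agg`.

```python
import pandas as pd

# Sample data rebuilt from the per-(year, name) counts in the question.
counts = {
    (2012, "John Smith"): 8, (2012, "Jane Doe"): 2, (2012, "John Snow"): 4,
    (2013, "Jane Doe"): 2, (2014, "John Smith"): 2, (2015, "John Snow"): 4,
}
df = pd.DataFrame(
    [key for key, n in counts.items() for _ in range(n)],
    columns=["app_year_start", "name"],
)

# Rows are years, columns are applicants, cells are application counts.
table = pd.crosstab(df["app_year_start"], df["name"])
print(table)

# idxmax(axis=1) gives the column label (the name) holding each row's maximum;
# max(axis=1) gives the maximum itself.
top = table.idxmax(axis=1).to_frame("name")
top["total_apps"] = table.max(axis=1)
top = top.reset_index()
print(top)
```

Note that `idxmax` returns the first matching label when two applicants tie for a year's maximum, so ties are resolved silently.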
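Use `mode` within a `groupby` to get the most frequent name per year:
df.groupby('app_year_start')['name'].agg(lambda x: x.mode().iloc[0])
Or, if you want all of the values (in case of ties) joined together as a single string:
df.groupby('app_year_start')['name'].agg(lambda x: ', '.join(x.mode()))
Output: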
app_year_start
2012 John Smith
2013 Jane Doe
2014 John Smith
2015 John Snow
Name: name, dtype: object
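The `mode` result above contains only names, while the question also asks for a `total_apps` column. One possible way to attach the counts is sketched below. It assumes the sample data is rebuilt from a `counts` dictionary (an illustrative shortcut, not part of the original post) and uses a left merge against per-(year, name) group sizes.

```python
import pandas as pd

# Sample data rebuilt from the per-(year, name) counts in the question.
counts = {
    (2012, "John Smith"): 8, (2012, "Jane Doe"): 2, (2012, "John Snow"): 4,
    (2013, "Jane Doe"): 2, (2014, "John Smith"): 2, (2015, "John Snow"): 4,
}
df = pd.DataFrame(
    [key for key, n in counts.items() for _ in range(n)],
    columns=["app_year_start", "name"],
)

# Most frequent name per year (first mode if several names tie).
top_name = (
    df.groupby("app_year_start")["name"]
    .agg(lambda x: x.mode().iloc[0])
    .reset_index()
)

# Number of applications for every (year, name) pair.
sizes = (
    df.groupby(["app_year_start", "name"])
    .size()
    .reset_index(name="total_apps")
)

# Keep only the counts that belong to each year's most frequent name.
result = top_name.merge(sizes, on=["app_year_start", "name"], how="left")
print(result)
```

Because `Series.mode` returns tied values in sorted order, `.iloc[0]` picks the alphabetically first name when there is a tie.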
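thimiant's initial code:
(df
    # count applications per (year, name) pair
    .groupby(['app_year_start', 'name'])['name']
    .agg(total_apps='count')
    # largest counts first, so the top applicant leads each year
    .sort_values(by='total_apps', ascending=False)
    .reset_index()
    # keep the first (largest) row of each year
    .groupby('app_year_start', as_index=False)
    .first()
)
Output:
app_year_start name total_apps
0 2012 John Smith 8
1 2013 Jane Doe 2
2 2014 John Smith 2
3 2015 John Snow 4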
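Using `value_counts` and a `groupby`:
dfc = (df.value_counts().reset_index(name='total_apps')
    # value_counts sorts by count (descending), so the first row of each year is its top applicant;
    # max() would take each column's maximum separately and pair the count 8 with 'John Snow'
    .groupby('app_year_start').first()
    .sort_index(ascending=False).reset_index()
)
print(dfc)
Result:
app_year_start name total_apps
0 2015 John Snow 4
1 2014 John Smith 2
2 2013 Jane Doe 2
3 2012 John Smith 8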