I have the following DataFrame:
timeseries = pd.date_range("2018-01-01", periods=10, freq="ME")
df = pd.DataFrame(data=["a", "a", "b", "a", "a", "c", "c", "c", "a", "a"], index=timeseries, columns=['var'])
print(df)
var
2018-01-31 a
2018-02-28 a
2018-03-31 b
2018-04-30 a
2018-05-31 a
2018-06-30 c
2018-07-31 c
2018-08-31 c
2018-09-30 a
2018-10-31 a
I want to extract the minimum and maximum date of each uninterrupted run of values in the 'var' column and assign them as additional columns. The desired result is:
var min Date max Date
1/31/2018 a 1/31/2018 2/28/2018
2/28/2018 a 1/31/2018 2/28/2018
3/31/2018 b 3/31/2018 3/31/2018
4/30/2018 a 4/30/2018 5/31/2018
5/31/2018 a 4/30/2018 5/31/2018
6/30/2018 c 6/30/2018 8/31/2018
7/31/2018 c 6/30/2018 8/31/2018
8/31/2018 c 6/30/2018 8/31/2018
9/30/2018 a 9/30/2018 10/31/2018
10/31/2018 a 9/30/2018 10/31/2018
For example, in the 'var' column, the value 'a' first appears on 2018-01-31, appears again on 2018-02-28, and is then interrupted by 'b'. So for that run, the min and max dates are 1/31/2018 and 2/28/2018.
I think I could do this with a groupby operation, but I haven't managed to, because groupby aggregates all the 'a' (or 'b', or 'c') values together.
Convert the index with to_series, then form a custom grouper (using shift + cumsum) and use groupby.transform:
# group successive values
group = df['var'].ne(df['var'].shift()).cumsum()
# form grouper and get transform
t = df.index.to_series().groupby(group).transform
# compute min/max
df['min Date'] = t('min')
df['max Date'] = t('max')
Output:
var min Date max Date
2018-01-31 a 2018-01-31 2018-02-28
2018-02-28 a 2018-01-31 2018-02-28
2018-03-31 b 2018-03-31 2018-03-31
2018-04-30 a 2018-04-30 2018-05-31
2018-05-31 a 2018-04-30 2018-05-31
2018-06-30 c 2018-06-30 2018-08-31
2018-07-31 c 2018-06-30 2018-08-31
2018-08-31 c 2018-06-30 2018-08-31
2018-09-30 a 2018-09-30 2018-10-31
2018-10-31 a 2018-09-30 2018-10-31
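To see why the shift + cumsum trick works, it can help to look at the grouper on its own. A minimal sketch on the same values (standalone, without the date index):

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "a", "a", "c", "c", "c", "a", "a"])
# A new run starts wherever the value differs from its predecessor;
# cumsum turns those change points into consecutive run labels.
group = s.ne(s.shift()).cumsum()
print(group.tolist())  # [1, 1, 2, 3, 3, 4, 4, 4, 5, 5]
```

Each run of identical values gets its own label, so grouping by this series keeps separate runs of 'a' apart.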
import numpy as np
import pandas as pd
timeseries = pd.date_range("1990-01-01", periods=10, freq="ME")
# Create the DataFrame
df = pd.DataFrame(data=["a", "a", "b", "a", "a", "c", "c", "c", "a", "a"], index=timeseries, columns=['id'])
print(df)
"""
id
1990-01-31 a
1990-02-28 a
1990-03-31 b
1990-04-30 a
1990-05-31 a
1990-06-30 c
1990-07-31 c
1990-08-31 c
1990-09-30 a
1990-10-31 a
"""
# Find groups of successive values
group = df['id'].ne(df['id'].shift()).cumsum()
df['group'] = group
# Calculate min and max dates for each group
min_dates = df.index.to_series().groupby(group).min()
max_dates = df.index.to_series().groupby(group).max()
# Merge min and max dates with the original DataFrame
df['min_Date'] = group.map(min_dates)
df['max_Date'] = group.map(max_dates)
print(df)
"""
id group min_Date max_Date
1990-01-31 a 1 1990-01-31 1990-02-28
1990-02-28 a 1 1990-01-31 1990-02-28
1990-03-31 b 2 1990-03-31 1990-03-31
1990-04-30 a 3 1990-04-30 1990-05-31
1990-05-31 a 3 1990-04-30 1990-05-31
1990-06-30 c 4 1990-06-30 1990-08-31
1990-07-31 c 4 1990-06-30 1990-08-31
1990-08-31 c 4 1990-06-30 1990-08-31
1990-09-30 a 5 1990-09-30 1990-10-31
1990-10-31 a 5 1990-09-30 1990-10-31
"""
import pandas as pd
import numpy as np
timeseries = pd.date_range("2018-01-01", periods=10, freq="ME")
df = pd.DataFrame(data=["a", "a", "b", "a", "a", "c", "c", "c", "a", "a"],
index=timeseries, columns=['var'])
print(df)
seq = np.cumsum(df['var'].values != np.roll(df['var'].values, 1))
df['seq'] = seq
df1 = df.reset_index()
minMaxDate = df1.groupby(seq).agg(minDate=('index', 'min'), maxDate=('index', 'max'))
res = df1.merge(minMaxDate, left_on='seq', right_index=True)
print(res)
'''
index var seq minDate maxDate
0 2018-01-31 a 0 2018-01-31 2018-02-28
1 2018-02-28 a 0 2018-01-31 2018-02-28
2 2018-03-31 b 1 2018-03-31 2018-03-31
3 2018-04-30 a 2 2018-04-30 2018-05-31
4 2018-05-31 a 2 2018-04-30 2018-05-31
5 2018-06-30 c 3 2018-06-30 2018-08-31
6 2018-07-31 c 3 2018-06-30 2018-08-31
7 2018-08-31 c 3 2018-06-30 2018-08-31
8 2018-09-30 a 4 2018-09-30 2018-10-31
9 2018-10-31 a 4 2018-09-30 2018-10-31
'''
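One caveat with np.roll: it compares the first row against the last (the array wraps around). The cumulative sum still produces distinct labels per run here, but a sketch comparing it against the shift-based grouper shows the two mark the same run boundaries, just with labels possibly offset by one:

```python
import numpy as np
import pandas as pd

vals = pd.Series(["a", "a", "b", "a", "a", "c", "c", "c", "a", "a"])
# roll-based labels: the first element is compared with the last (wrap-around)
seq_roll = np.cumsum(vals.values != np.roll(vals.values, 1))
# shift-based labels: the first element is compared with NaN, so it always starts a run
seq_shift = vals.ne(vals.shift()).cumsum()
# factorize normalizes both labelings to 0..n-1 so they can be compared directly
print(pd.Series(seq_roll).factorize()[0].tolist())
print(seq_shift.factorize()[0].tolist())
```

Both print [0, 0, 1, 2, 2, 3, 3, 3, 4, 4], so either grouper works for this grouping.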