index date_from date_to
0 '2019-08-01' '2019-08-05'
1 '2019-08-04' '2019-08-07'
2 '2019-08-07' '2019-08-09'
I need to count how many ranges each date falls in, across all ranges. The result should look like this:
date count
'2019-08-01' 1
'2019-08-02' 1
'2019-08-03' 1
'2019-08-04' 2
'2019-08-05' 2
'2019-08-06' 1
'2019-08-07' 2
'2019-08-08' 1
'2019-08-09' 1
I solved the problem with a "for" loop, but the calculation takes a long time because the original DataFrame is large. Thanks.
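For anyone who wants to reproduce this, the sample table can be built as follows. This is only a minimal sketch: it assumes the dates are stored as plain strings and that the frame is called df, which is the name the answers use.
import pandas as pd

# Sample data from the question; the dates are plain strings at this point.
df = pd.DataFrame({
    "date_from": ["2019-08-01", "2019-08-04", "2019-08-07"],
    "date_to": ["2019-08-05", "2019-08-07", "2019-08-09"],
})
print(df)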
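One option is to repeat the rows based on the number of days, then generate the intermediate days with groupby.cumcount and pd.to_timedelta, and finally count them with value_counts: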
import pandas as pd
# ensure datetime
df[['date_from', 'date_to']] = df[['date_from', 'date_to']].apply(pd.to_datetime)
# number of days between from and to
n = df['date_to'].sub(df['date_from']).dt.days
# repeat each start date once per day in its range (both ends inclusive)
s = df.loc[df.index.repeat(n+1), 'date_from']
# increment to create intermediates and count
out = (s.add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='day'))
        .value_counts(sort=False)
      )
Output:
2019-08-01 1
2019-08-02 1
2019-08-03 1
2019-08-04 2
2019-08-05 2
2019-08-06 1
2019-08-07 2
2019-08-08 1
2019-08-09 1
Name: count, dtype: int64
Intermediates:
# n
0 4
1 3
2 2
dtype: int64
# s
0 2019-08-01
0 2019-08-01
0 2019-08-01
0 2019-08-01
0 2019-08-01
1 2019-08-04
1 2019-08-04
1 2019-08-04
1 2019-08-04
2 2019-08-07
2 2019-08-07
2 2019-08-07
Name: date_from, dtype: datetime64[ns]
# s.add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='day'))
0 2019-08-01
0 2019-08-02
0 2019-08-03
0 2019-08-04
0 2019-08-05
1 2019-08-04
1 2019-08-05
1 2019-08-06
1 2019-08-07
2 2019-08-07
2 2019-08-08
2 2019-08-09
dtype: datetime64[ns]
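Another option, using a generator:
import numpy as np
import pandas as pd
# flatten every date of every range into one array
dates = np.fromiter((d for f, t in zip(df['date_from'], df['date_to'])
                     for d in pd.date_range(f, t)),
                    'datetime64[ns]')
# count the occurrences of each date
out = pd.Series(dates).value_counts(sort=False)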
Output:
2019-08-01 1
2019-08-02 1
2019-08-03 1
2019-08-04 2
2019-08-05 2
2019-08-06 1
2019-08-07 2
2019-08-08 1
2019-08-09 1
Name: count, dtype: int64
Timings:
# pandas repeat + groupby.cumcount+timedelta + value_counts
1.6 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# numpy iterator + value_counts
604 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
30k rows:
# pandas repeat + groupby.cumcount+timedelta + value_counts
18.1 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numpy iterator + value_counts
3.52 s ± 70.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
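That said, here is a quick pandas implementation:
# assumes date_from and date_to are already datetime
res = pd.DataFrame()
res['date'] = pd.date_range(df['date_from'].min(), df['date_to'].max())
def in_interval(d):
    # boolean mask of the ranges that contain date d (both ends inclusive)
    return (df['date_from'] <= d) & (df['date_to'] >= d)
# count the matching ranges for every date; this scans df once per date
res['count'] = res['date'].apply(lambda d: df[in_interval(d)].shape[0])
You can also use the explode method to get the number of dates across all ranges.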
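The code for the explode approach mentioned above is not shown, so here is a minimal sketch of how it might look. It is not the original answer's code: the intermediate name dates and the output column names date and count are assumptions chosen to match the result the question asks for.
import pandas as pd

df = pd.DataFrame({
    "date_from": ["2019-08-01", "2019-08-04", "2019-08-07"],
    "date_to": ["2019-08-05", "2019-08-07", "2019-08-09"],
})

# One list of daily timestamps per row (both ends inclusive).
dates = pd.Series(
    [list(pd.date_range(f, t)) for f, t in zip(df["date_from"], df["date_to"])],
    index=df.index,
)

# explode() gives every list element its own row; then count per date.
out = (
    pd.to_datetime(dates.explode())
    .value_counts()
    .sort_index()
    .rename_axis("date")
    .reset_index(name="count")
)
print(out)
This builds a Python list per row, so on a large frame it is likely to be slower than the repeat + cumcount version; it is mainly useful for readability.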