df1
name date
A 14-04-05
A 14-05-08
A 14-08-09
A 15-01-05
B 18-07-05
B 18-08-09
B 18-10-02
C 19-01-03
C 19-02-04
C 19-03-30
D 16-04-01
D 16-08-04
df2
name startdate
A 14-07-07
B 18-09-09
C 19-03-15
D 16-06-28
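The first dataset records all dates; the second records one start date per name.
I want to compare each record in df1 with the matching start date in df2, and label it '0' if it is earlier than the start date and '1' if it is later.
This is the result I want: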
df1
name date Label startdate
A 14-04-05 0 14-07-07
A 14-05-08 0 14-07-07
A 14-08-09 1 14-07-07
A 15-01-05 1 14-07-07
B 18-07-05 0 18-09-09
B 18-08-09 0 18-09-09
B 18-10-02 1 18-09-09
C 19-01-03 0 19-03-15
C 19-02-04 0 19-03-15
C 19-03-30 1 19-03-15
D 16-04-01 0 16-06-28
D 16-08-04 1 16-06-28
I tried to handle this with datetime, but it didn't work.
A simple example dataset:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['A', '2015-12-21'],['A', '2015-12-22'], ['A', '2015-12-25'], ['B', '2018-01-28'],['B', '2018-02-28'],['B', '2018-03-28']]),
columns=['name', 'date'])
df2 = pd.DataFrame(np.array([['A', '2015-12-23'], ['B', '2018-03-01']]),
columns=['name', 'startdate'])
Thanks for reading.
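Use DataFrame.merge with a left join to add the startdate column, then compare the two date columns with Series.gt (greater than). DataFrame.insert creates the new column at a given position, and Series.view converts the boolean result to the numeric values 0 and 1 (on recent pandas versions, where Series.view is deprecated, Series.astype does the same job):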
df1['date'] = pd.to_datetime(df1['date'])
df2['startdate'] = pd.to_datetime(df2['startdate'])
df = df1.merge(df2, on='name', how='left')
df.insert(2, 'Label', df['date'].gt(df['startdate']).view('i1'))
print (df)
name date Label startdate
0 A 2014-04-05 0 2014-07-07
1 A 2014-05-08 0 2014-07-07
2 A 2014-08-09 1 2014-07-07
3 A 2015-01-05 1 2014-07-07
4 B 2018-07-05 0 2018-09-09
5 B 2018-08-09 0 2018-09-09
6 B 2018-10-02 1 2018-09-09
7 C 2019-01-03 0 2019-03-15
8 C 2019-02-04 0 2019-03-15
9 C 2019-03-30 1 2019-03-15
10 D 2016-04-01 0 2016-06-28
11 D 2016-08-04 1 2016-06-28
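For reference, here is a minimal self-contained sketch of the same merge approach, run on the small sample dataset from the question. It assumes a recent pandas and swaps in astype('int8') for view('i1'); that substitution is my assumption, not part of the answer as posted.

```python
import pandas as pd

# Sample data from the question, built directly from dicts.
df1 = pd.DataFrame({
    'name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': ['2015-12-21', '2015-12-22', '2015-12-25',
             '2018-01-28', '2018-02-28', '2018-03-28'],
})
df2 = pd.DataFrame({'name': ['A', 'B'],
                    'startdate': ['2015-12-23', '2018-03-01']})

# Parse the strings so the comparison is done on real timestamps.
df1['date'] = pd.to_datetime(df1['date'])
df2['startdate'] = pd.to_datetime(df2['startdate'])

# A left join keeps every row of df1 and attaches its start date.
df = df1.merge(df2, on='name', how='left')

# True where date is strictly after startdate, stored as 0/1.
df.insert(2, 'Label', df['date'].gt(df['startdate']).astype('int8'))
print(df)
```

Series.gt is a strict comparison, so a record falling exactly on the start date gets 0; Series.ge would make that case 1 instead.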
Alternatively, use Series.map instead of a merge:
df1['date'] = pd.to_datetime(df1['date'])
df2['startdate'] = pd.to_datetime(df2['startdate'])
df1['startdate'] = df1['name'].map(df2.set_index('name')['startdate'])
df1.insert(2, 'Label', df1['date'].gt(df1['startdate']).view('i1'))
print (df1)
name date Label startdate
0 A 2014-04-05 0 2014-07-07
1 A 2014-05-08 0 2014-07-07
2 A 2014-08-09 1 2014-07-07
3 A 2015-01-05 1 2014-07-07
4 B 2018-07-05 0 2018-09-09
5 B 2018-08-09 0 2018-09-09
6 B 2018-10-02 1 2018-09-09
7 C 2019-01-03 0 2019-03-15
8 C 2019-02-04 0 2019-03-15
9 C 2019-03-30 1 2019-03-15
10 D 2016-04-01 0 2016-06-28
11 D 2016-08-04 1 2016-06-28
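One thing to watch with the map version: a name that has no entry in df2 gets NaT as its start date, and any comparison against NaT is False, so such rows silently receive 0. The sketch below illustrates this with a hypothetical extra name 'E' (not in the original data) and shows one possible way to mark those rows as missing using the nullable Int8 dtype.

```python
import pandas as pd

# 'E' is a made-up name with no start date in df2, added for illustration.
df1 = pd.DataFrame({
    'name': ['A', 'A', 'E'],
    'date': pd.to_datetime(['2015-12-21', '2015-12-25', '2020-01-01']),
})
df2 = pd.DataFrame({'name': ['A'],
                    'startdate': pd.to_datetime(['2015-12-23'])})

# Look up each row's start date; unmatched names become NaT.
start = df1['name'].map(df2.set_index('name')['startdate'])

# Plain version: the unmatched row is labelled 0.
plain = df1['date'].gt(start).astype('int8')

# Nullable version: the unmatched row is labelled <NA> instead.
nullable = df1['date'].gt(start).astype('Int8').mask(start.isna())

print(df1.assign(startdate=start, Label=plain, LabelNullable=nullable))
```

Note that map also needs the names in df2 to be unique, since the lookup goes through an index built from them.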
You can use map for this:
print (df1.assign(new=(df1["date"]>df1["name"].map(df2.set_index("name")["startdate"])).astype(int),
start=df1["name"].map(df2.set_index("name")["startdate"])))
name date new start
0 A 14-04-05 0 14-07-07
1 A 14-05-08 0 14-07-07
2 A 14-08-09 1 14-07-07
3 A 15-01-05 1 14-07-07
4 B 18-07-05 0 18-09-09
5 B 18-08-09 0 18-09-09
6 B 18-10-02 1 18-09-09
7 C 19-01-03 0 19-03-15
8 C 19-02-04 0 19-03-15
9 C 19-03-30 1 19-03-15
10 D 16-04-01 0 16-06-28
11 D 16-08-04 1 16-06-28
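The snippet above compares the dates as strings, which gives the right answer here only because zero-padded YY-MM-DD strings sort in chronological order. A slightly more defensive sketch, assuming the two-digit-year format shown in the question, parses both columns with an explicit format before comparing. The variable names fmt, dates, starts and result are my own.

```python
import pandas as pd

# A subset of the question's data, kept as YY-MM-DD strings.
df1 = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B'],
    'date': ['14-04-05', '14-08-09', '18-07-05', '18-10-02'],
})
df2 = pd.DataFrame({'name': ['A', 'B'],
                    'startdate': ['14-07-07', '18-09-09']})

# An explicit format avoids any guessing about which field is the year.
fmt = '%y-%m-%d'
dates = pd.to_datetime(df1['date'], format=fmt)
starts = pd.to_datetime(
    df1['name'].map(df2.set_index('name')['startdate']), format=fmt)

# Same assign pattern as above, but on real timestamps.
result = df1.assign(new=(dates > starts).astype(int), start=starts)
print(result)
```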