I have a dataset:
BDate,Snum,ArrTime,OID,TDate,TTime,VID
1/1/2018,72,05:59:01,7214,1/1/2018,12:06:20 AM ,7206
1/1/2018,72,06:04:33,7208,1/1/2018,12:36:31 AM,7205
1/1/2018,72,06:21:07,7216,1/1/2018,5:53:49 AM,7220
1/1/2018,80,06:29:01,8026,1/1/2018,5:59:10 AM,7214
1/1/2018,72,06:30:54,7218,1/1/2018,6:04:55 AM,7208
1/1/2018,72,06:33:54,7221,1/1/2018,06:21:17 AM,7216
1/1/2018,80,06:35:26,8018,1/1/2018,06:31:04 AM,7218
1/1/2018,72,09:38:34,7211,1/1/2018,1:40:38 PM,7209
1/1/2018,72,13:39:45,7209,,,
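My goal is to match each time in the ArrTime column with the closest time in TTime, which I have already achieved in another post.
I am now trying to refine the analysis by creating a time window based on the ArrTime column. As the dataset above shows, the first ArrTime is 05:59:01 and the last ArrTime is 13:39:45. I want to use these two times (with one minute added to the last one) as boundaries and remove any TTime value that falls outside that range.
My code is shown below: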
mydataset = pd.read_csv("Test.csv", error_bad_lines=False, engine ='python', index_col= False,header = 0, sep = ",")
mydataset['Date1'] = pd.to_datetime(mydataset['BDate'] + ' ' + mydataset['ArrTime'], format='%d/%m/%Y %H:%M:%S')
datesAM = pd.to_datetime(mydataset['TDate'] + ' ' + mydataset['TTime'], format='%d/%m/%Y %I:%M:%S %p')
datesPM = pd.to_datetime(mydataset['TDate'] + ' ' + mydataset['TTime'], format='%d/%m/%Y %H:%M:%S %p')
mydataset['Date2'] = datesAM.mask(mydataset['TTime'].str.endswith('AM',na=False), datesPM)
#print(mydataset)
df1 = mydataset[['Date1','Snum', 'OID']].sort_values('Date1').dropna(subset=['I'])
df1['OID'] = df1['OID'].astype(np.int64)
a = df1['Date1'].iloc[0]
a1 = a.time().strftime('%H:%M:%S')
print(a1)
b = df1['Date1'].iloc[-1]
b1 = b.time().strftime('%H:%M:%S')
print(b1)
df2 = mydataset[['Date2','VID']].sort_values('Date2').dropna(subset=['VID'])
df2['VID'] = df2['VID'].astype(np.int64)
df2[df2['Date2'].indexer_between_time(a1,b1)]
#df2['Date2'] = pd.date_range(start = a1, end = b1)
#print(df2)
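I have tried using iloc to pick out the first and last datetimes and then stripping them down to a time format. I have also tried pd.date_range and indexer_between_time, but both gave me errors such as "'Series' object has no attribute 'indexer_between_time'" and "Length of values does not match length of index".
My end goal is to remove the details that fall outside the range (not the entire row, only the date, time, and VID) and then perform the nearest-time matching (the matching itself is already implemented). Expected output: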
BDate,Snum,ArrTime,OID,TDate,TTime,VID
1/1/2018,72,05:59:01,7214,,,
1/1/2018,72,06:04:33,7208,,,
1/1/2018,72,06:21:07,7216,,,
1/1/2018,80,06:29:01,8026,1/1/2018,5:59:10 AM,7214
1/1/2018,72,06:30:54,7218,1/1/2018,6:04:55 AM,7208
1/1/2018,72,06:33:54,7221,1/1/2018,06:21:17 AM,7216
1/1/2018,80,06:35:26,8018,1/1/2018,06:31:04 AM,7218
1/1/2018,72,09:38:34,7211,1/1/2018,1:40:38 PM,7209
1/1/2018,72,13:39:45,7209,,,
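On the first error: `indexer_between_time` is a method of `DatetimeIndex`, not of `Series`, so calling it on `df2['Date2']` fails. The second error arises because `pd.date_range` generates a new sequence of timestamps rather than filtering existing ones, so assigning its result to a column fails when the lengths differ. A minimal sketch of the time-of-day filter, assuming `Date2` has already been parsed (the small frame below is a hand-built stand-in for a few rows of the question's data, not the real file):

```python
import pandas as pd

# Stand-in for a few Date2/VID rows of the question's data (already parsed).
df2 = pd.DataFrame({
    "Date2": pd.to_datetime([
        "2018-01-01 00:06:20",
        "2018-01-01 05:53:49",
        "2018-01-01 05:59:10",
        "2018-01-01 06:31:04",
        "2018-01-01 13:40:38",
    ]),
    "VID": [7206, 7220, 7214, 7218, 7209],
})

# between_time needs a DatetimeIndex, so move Date2 into the index first.
in_window = df2.set_index("Date2").between_time("05:59:01", "13:40:45")

# Positional equivalent via DatetimeIndex.indexer_between_time.
positions = pd.DatetimeIndex(df2["Date2"]).indexer_between_time("05:59:01", "13:40:45")
same_rows = df2.iloc[positions]

print(in_window)
```

Both forms compare the time of day only and ignore the date, so they suit data that stays within a single day, as this sample does.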
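I would do this by converting the datetime columns to Unix timestamps, so that we can easily compare them and filter out the datetimes that fall outside the range.
This is how I did it: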
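import time

import numpy as np
import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
mydataset = pd.read_csv("data.csv", on_bad_lines='skip', engine='python', index_col=False, header=0, sep=",")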
mydataset['Date1'] = pd.to_datetime(mydataset['BDate'] + ' ' + mydataset['ArrTime'], format='%d/%m/%Y %H:%M:%S')
# Function to clean dates because the format is not consistent. For example: We have *6:04:55 AM* and *06:21:17 AM*
def cleanDate(x):
    if str(x) == 'nan':
        return np.nan
    else:
        temp = ''
        # Zero-pad single-digit hours so every value looks like HH:MM:SS AM/PM
        if int(x.split(':')[0]) < 10:
            temp += '0' + str(int(x.split(':')[0])) + ':'
        else:
            temp += x.split(':')[0] + ':'
        temp += x.split(':', 1)[1]
        return temp

mydataset['TTime'] = mydataset['TTime'].apply(cleanDate)
mydataset['Date2'] = pd.to_datetime(mydataset['TDate'] + ' ' + mydataset['TTime'], format='%d/%m/%Y %I:%M:%S %p', errors='ignore')
mydataset['Date2'] = pd.to_datetime(mydataset['Date2'])
# Convert Datetime to unix timestamp and create a new column
mydataset['tsArrTime'] = mydataset['Date1'].apply(lambda x: time.mktime(x.timetuple()))
mydataset['tsTTime'] = mydataset['Date2'].apply(lambda x: time.mktime(x.timetuple()) if str(x) != 'NaT' else 0)
# Get min and max timestamp from tsArrTime column
minTime = mydataset['tsArrTime'].min()
maxTime = mydataset['tsArrTime'].max() + 60 # End datetime + 1 min
# Check if tsTTime is within the range else replace with empty string (Change it to whatever you want)
mydataset.loc[(mydataset['tsTTime'] < minTime) | (mydataset['tsTTime'] > maxTime), 'TTime'] = ''
mydataset.loc[(mydataset['tsTTime'] < minTime) | (mydataset['tsTTime'] > maxTime), 'TDate'] = ''
mydataset.loc[(mydataset['tsTTime'] < minTime) | (mydataset['tsTTime'] > maxTime), 'VID'] = ''
mydataset['TTime'] = mydataset['TTime'].fillna('')
mydataset['TDate'] = mydataset['TDate'].fillna('')
mydataset['VID'] = mydataset['VID'].fillna('')
mydataset = mydataset.drop(columns=['Date1','Date2','tsArrTime','tsTTime'])
This is the output:
      BDate  Snum   ArrTime   OID     TDate        TTime   VID
0  1/1/2018    72  05:59:01  7214
1  1/1/2018    72  06:04:33  7208
2  1/1/2018    72  06:21:07  7216
3  1/1/2018    80  06:29:01  8026  1/1/2018  05:59:10 AM  7214
4  1/1/2018    72  06:30:54  7218  1/1/2018  06:04:55 AM  7208
5  1/1/2018    72  06:33:54  7221  1/1/2018  06:21:17 AM  7216
6  1/1/2018    80  06:35:26  8018  1/1/2018  06:31:04 AM  7218
7  1/1/2018    72  09:38:34  7211  1/1/2018  01:40:38 PM  7209
8  1/1/2018    72  13:39:45  7209
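The same filtering can probably be done without the Unix-timestamp round trip, by comparing pandas timestamps directly. The following is only a sketch under a few assumptions: the sample data is inlined through `io.StringIO` instead of being read from a file, stray whitespace in TTime is removed with `.str.strip()`, and out-of-range cells are set to NaN rather than to empty strings. The names `arr`, `ttime`, `lower`, `upper`, and `in_range` are illustrative.

```python
import io

import numpy as np
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
csv_text = """BDate,Snum,ArrTime,OID,TDate,TTime,VID
1/1/2018,72,05:59:01,7214,1/1/2018,12:06:20 AM ,7206
1/1/2018,72,06:04:33,7208,1/1/2018,12:36:31 AM,7205
1/1/2018,72,06:21:07,7216,1/1/2018,5:53:49 AM,7220
1/1/2018,80,06:29:01,8026,1/1/2018,5:59:10 AM,7214
1/1/2018,72,06:30:54,7218,1/1/2018,6:04:55 AM,7208
1/1/2018,72,06:33:54,7221,1/1/2018,06:21:17 AM,7216
1/1/2018,80,06:35:26,8018,1/1/2018,06:31:04 AM,7218
1/1/2018,72,09:38:34,7211,1/1/2018,1:40:38 PM,7209
1/1/2018,72,13:39:45,7209,,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parse both datetime columns; missing TDate/TTime values become NaT.
arr = pd.to_datetime(df["BDate"] + " " + df["ArrTime"], format="%d/%m/%Y %H:%M:%S")
ttime = pd.to_datetime(
    df["TDate"] + " " + df["TTime"].str.strip(),
    format="%d/%m/%Y %I:%M:%S %p",
)

# Window: first ArrTime up to last ArrTime plus one minute (both ends inclusive).
lower = arr.min()
upper = arr.max() + pd.Timedelta(minutes=1)
in_range = ttime.between(lower, upper)  # NaT compares as False

# Blank only the TDate/TTime/VID cells, keeping the rest of each row.
df.loc[~in_range, ["TDate", "TTime", "VID"]] = np.nan

print(df)
```

Because full datetimes are compared rather than times of day, this variant should also behave sensibly if the data ever spans more than one date. VID ends up as a float column here because it contains NaN.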