Using Python to keep datetimes within a range derived from the data


I have a dataset:

BDate,Snum,ArrTime,OID,TDate,TTime,VID
1/1/2018,72,05:59:01,7214,1/1/2018,12:06:20 AM ,7206
1/1/2018,72,06:04:33,7208,1/1/2018,12:36:31 AM,7205
1/1/2018,72,06:21:07,7216,1/1/2018,5:53:49 AM,7220
1/1/2018,80,06:29:01,8026,1/1/2018,5:59:10 AM,7214
1/1/2018,72,06:30:54,7218,1/1/2018,6:04:55 AM,7208
1/1/2018,72,06:33:54,7221,1/1/2018,06:21:17 AM,7216
1/1/2018,80,06:35:26,8018,1/1/2018,06:31:04 AM,7218
1/1/2018,72,09:38:34,7211,1/1/2018,1:40:38 PM,7209
1/1/2018,72,13:39:45,7209,,,

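My goal is to match each value in the ArrTime column with the closest time in TTime, which I have already implemented with help from another post.

I am now trying to improve the analysis by creating a time limit based on the ArrTime column. As the dataset above shows, the first ArrTime is 05:59:01 and the last ArrTime is 13:39:45. I want to use these two times (with 1 minute added to the last one) as boundaries and remove any TTime value that falls outside that range.

My code is shown below: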

mydataset = pd.read_csv("Test.csv", error_bad_lines=False, engine ='python', index_col= False,header = 0, sep = ",")
mydataset['Date1'] = pd.to_datetime(mydataset['BDate'] + ' ' + mydataset['ArrTime'], format='%d/%m/%Y %H:%M:%S')
datesAM = pd.to_datetime(mydataset['TDate'] + ' ' + mydataset['TTime'], format='%d/%m/%Y %I:%M:%S %p')
datesPM = pd.to_datetime(mydataset['TDate'] + ' ' + mydataset['TTime'], format='%d/%m/%Y %H:%M:%S %p')
mydataset['Date2'] = datesAM.mask(mydataset['TTime'].str.endswith('AM',na=False), datesPM)
#print(mydataset)

df1 = mydataset[['Date1','Snum', 'OID']].sort_values('Date1').dropna(subset=['OID'])
df1['OID'] = df1['OID'].astype(np.int64)

a = df1['Date1'].iloc[0]
a1 = a.time().strftime('%H:%M:%S') 
print(a1)
b = df1['Date1'].iloc[-1]
b1 = b.time().strftime('%H:%M:%S') 
print(b1)

df2 = mydataset[['Date2','VID']].sort_values('Date2').dropna(subset=['VID'])
df2['VID'] = df2['VID'].astype(np.int64)

df2[df2['Date2'].indexer_between_time(a1,b1)]

#df2['Date2'] = pd.date_range(start = a1, end = b1)
#print(df2)

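I tried using iloc to pick out the first and last datetimes and then reduce them to a time-only format. I also tried pd.date_range and indexer_between_time, but both gave me errors, such as "'Series' object has no attribute 'indexer_between_time'" and "Length of values does not match length of index".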
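For context on the first error: indexer_between_time is a method of DatetimeIndex, not of Series, which is why calling it on a column fails. The second error is probably raised because pd.date_range returns a range whose length differs from the column it is assigned to, though that is a guess from the message alone. Below is a minimal sketch of the DatetimeIndex route. The `date2` Series is a hypothetical stand-in for the Date2 column, and the bounds are typed in by hand rather than derived from the data.

```python
import pandas as pd

# Hypothetical stand-in for the Date2 column from the question
date2 = pd.Series(pd.to_datetime([
    "2018-01-01 00:06:20",
    "2018-01-01 05:59:10",
    "2018-01-01 13:40:38",
]))

# indexer_between_time is defined on DatetimeIndex, so wrap the Series first
positions = pd.DatetimeIndex(date2).indexer_between_time("05:59:01", "13:40:45")

# The result holds integer positions, so select rows with iloc
in_range = date2.iloc[positions]
print(in_range)
```

Note that this compares only the time of day and ignores the date, so it would not suit data that spans more than one day.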

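My end goal is to remove the details that fall outside the range (not the whole row, only the date, time and VID) and then run the closest-time matching (the matching itself is already implemented). The expected result: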

BDate,Snum,ArrTime,OID,TDate,TTime,VID
1/1/2018,72,05:59:01,7214,,,
1/1/2018,72,06:04:33,7208,,,
1/1/2018,72,06:21:07,7216,,,
1/1/2018,80,06:29:01,8026,1/1/2018,5:59:10 AM,7214
1/1/2018,72,06:30:54,7218,1/1/2018,6:04:55 AM,7208
1/1/2018,72,06:33:54,7221,1/1/2018,06:21:17 AM,7216
1/1/2018,80,06:35:26,8018,1/1/2018,06:31:04 AM,7218
1/1/2018,72,09:38:34,7211,1/1/2018,1:40:38 PM,7209  
1/1/2018,72,13:39:45,7209,,,
python pandas dataframe data-processing
1 Answer

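I would do this by converting the datetime columns to Unix timestamps, so that the values can be compared easily and only the datetimes inside the range are kept.

Here is how I did it: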

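import time

import numpy as np
import pandas as pd

# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) is its replacement
mydataset = pd.read_csv("data.csv", on_bad_lines='skip', engine='python', index_col=False, header=0, sep=",")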
mydataset['Date1'] = pd.to_datetime(mydataset['BDate'] + ' ' + mydataset['ArrTime'], format='%d/%m/%Y %H:%M:%S')

# Function to clean dates because the format is not consistent. For example: We have *6:04:55 AM* and *06:21:17 AM* 
def cleanDate(x):
    if str(x) == 'nan':
        return np.nan
    else:
        temp = ''
        if int(x.split(':')[0]) < 10:
            temp += '0' + str(int(x.split(':')[0])) +':'
        else:
            temp += x.split(':')[0] + ':'
        temp += x.split(':',1)[1]
        return temp

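mydataset['TTime'] = mydataset['TTime'].apply(cleanDate)
# Strip stray whitespace (e.g. "12:06:20 AM ") so the explicit format matches.
# errors='coerce' turns unparseable or missing values into NaT (errors='ignore' is deprecated in recent pandas).
mydataset['Date2'] = pd.to_datetime(mydataset['TDate'] + ' ' + mydataset['TTime'].str.strip(), format='%d/%m/%Y %I:%M:%S %p', errors='coerce')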

# Convert Datetime to unix timestamp and create a new column
mydataset['tsArrTime'] = mydataset['Date1'].apply(lambda x: time.mktime(x.timetuple()))
mydataset['tsTTime'] = mydataset['Date2'].apply(lambda x: time.mktime(x.timetuple()) if str(x) != 'NaT' else 0)

# Get min and max timestamp from tsArrTime column
minTime = mydataset['tsArrTime'].min() 
maxTime = mydataset['tsArrTime'].max() + 60  # End datetime + 1 min

# Blank out TDate, TTime and VID where tsTTime is outside the range (change '' to whatever you want)
outOfRange = (mydataset['tsTTime'] < minTime) | (mydataset['tsTTime'] > maxTime)
cols = ['TDate', 'TTime', 'VID']
mydataset[cols] = mydataset[cols].astype(object)  # VID is float because of NaN; cast so '' can be stored
mydataset.loc[outOfRange, cols] = ''
mydataset[cols] = mydataset[cols].fillna('')

mydataset = mydataset.drop(columns=['Date1','Date2','tsArrTime','tsTTime'])

Here is the output:

     BDate    Snum  ArrTime     OID     TDate       TTime           VID
0   1/1/2018    72  05:59:01    7214            
1   1/1/2018    72  06:04:33    7208            
2   1/1/2018    72  06:21:07    7216            
3   1/1/2018    80  06:29:01    8026    1/1/2018    05:59:10 AM     7214
4   1/1/2018    72  06:30:54    7218    1/1/2018    06:04:55 AM     7208
5   1/1/2018    72  06:33:54    7221    1/1/2018    06:21:17 AM     7216
6   1/1/2018    80  06:35:26    8018    1/1/2018    06:31:04 AM     7218
7   1/1/2018    72  09:38:34    7211    1/1/2018    01:40:38 PM     7209
8   1/1/2018    72  13:39:45    7209        
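
As an alternative to the Unix-timestamp conversion, the bounds can also be compared as pandas Timestamps directly. The following is only a sketch under a few assumptions: the sample data is inlined through io.StringIO instead of being read from a file, and the names `lower`, `upper`, `in_range` and `cols` are illustrative and do not come from the original post.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained
csv_text = """BDate,Snum,ArrTime,OID,TDate,TTime,VID
1/1/2018,72,05:59:01,7214,1/1/2018,12:06:20 AM ,7206
1/1/2018,72,06:04:33,7208,1/1/2018,12:36:31 AM,7205
1/1/2018,72,06:21:07,7216,1/1/2018,5:53:49 AM,7220
1/1/2018,80,06:29:01,8026,1/1/2018,5:59:10 AM,7214
1/1/2018,72,06:30:54,7218,1/1/2018,6:04:55 AM,7208
1/1/2018,72,06:33:54,7221,1/1/2018,06:21:17 AM,7216
1/1/2018,80,06:35:26,8018,1/1/2018,06:31:04 AM,7218
1/1/2018,72,09:38:34,7211,1/1/2018,1:40:38 PM,7209
1/1/2018,72,13:39:45,7209,,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parse both datetime columns; missing or unparseable TTime values become NaT
arr = pd.to_datetime(df["BDate"] + " " + df["ArrTime"], format="%d/%m/%Y %H:%M:%S")
tt = pd.to_datetime(
    df["TDate"] + " " + df["TTime"].str.strip(),
    format="%d/%m/%Y %I:%M:%S %p",
    errors="coerce",
)

# Bounds: first ArrTime, and last ArrTime plus one minute
lower = arr.min()
upper = arr.max() + pd.Timedelta(minutes=1)

# between() is inclusive on both ends by default; NaT evaluates to False
in_range = tt.between(lower, upper)

# Blank only the TDate/TTime/VID cells, leaving the rest of each row intact
cols = ["TDate", "TTime", "VID"]
df[cols] = df[cols].astype(object)  # allow '' in the numeric VID column
df.loc[~in_range, cols] = ""
print(df)
```

Comparing Timestamps directly avoids time.mktime, which interprets the values in the machine's local time zone, and it keeps the whole filter vectorized.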