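I have a data set with about 30000 records. I want to divide it into groups of 288 records (one day each) and then sort those groups into separate test_data and train_data data frames, either sequentially or randomly. Sequentially means the first 4 groups go into train_data and the 5th into test_data, and so on. Randomly means any one day out of every 5 is saved to test_data and the remaining 4 days go into train_data.
How can I do this?
Sample data: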
# timestamp var1 var2
--------------------------------------
1 01-01-2019 18:00:00 1.2 21
2 01-01-2019 18:05:00 2.3 32
3 01-01-2019 18:10:00 3.4 43
4 01-01-2019 18:15:00 4.5 54
5 01-01-2019 18:20:00 5.6 65
.
.
.
3000 .. - .. .. .. ..
Sample output:
# in case of sequential OR contiguous division
train_data = (#1,#2,#3,#4 .... #1152,#1441,......,#2592,...)
test_data = (#1153,#1154,.....,#1440,.....,#2593,....)
# in case of random division, any 288 contiguous records from each bunch of 5 go into test_data and the other 4x288 into train_data.
Currently, I split the data like this:
set.seed(100)
train <- sample(nrow(dataset1), 0.7 * nrow(dataset1), replace = FALSE)
TrainSet <- dataset1[train,]
#scale (TrainSet, center = TRUE, scale = TRUE)
ValidSet <- dataset1[-train,]
#scale (ValidSet, center = TRUE, scale = TRUE)
summary(TrainSet)
summary(ValidSet)
Here is one approach:
# assume the number of rows is divisible by 288
num_days = nrow(dataset1)/288
# Each value (True or False) indicates whether the *day* is included or not
training.days.mask = sample(rep(c(TRUE,TRUE,TRUE,TRUE,FALSE), length.out=num_days))
testing.days.mask = !training.days.mask
# To index the actual values, repeat each mask 288 times
training.samples.mask = rep(training.days.mask, each=288)
testing.samples.mask = rep(testing.days.mask, each=288)
# now use the mask to extract the data
training.samples = dataset1[training.samples.mask,]
testing.samples = dataset1[testing.samples.mask,]
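The idea is to run sample on the day indices first (not on the individual records). Then each mask is repeated 288 times so that it covers a whole day of records.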
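Note that shuffling the whole day mask gives roughly one test day in five overall, but not necessarily exactly one test day inside each consecutive block of 5 days. The following is a minimal sketch of both variants from the question (every 5th day, or one random day per block), run on synthetic data. The synthetic dataset1, the helper pick_one and the names records_per_day, block, test_days_seq and test_days_rand are assumptions made for illustration and do not come from the original post.

```r
# Synthetic stand-in for dataset1: 10 days of 288 five-minute records (assumption)
set.seed(100)
records_per_day <- 288
num_days <- 10
dataset1 <- data.frame(
  timestamp = seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
                  by = "5 min", length.out = num_days * records_per_day),
  var1 = rnorm(num_days * records_per_day),
  var2 = rnorm(num_days * records_per_day)
)

day_index <- seq_len(num_days)
block <- (day_index - 1) %/% 5   # which block of 5 days each day belongs to

# Sequential variant: the 5th day of every block goes to the test set
test_days_seq <- day_index %% 5 == 0

# Random variant: one randomly chosen day per block goes to the test set
# (hypothetical helper; sample.int avoids the sample() surprise on length-1 input)
pick_one <- function(days) days[sample.int(length(days), 1)]
test_days_rand <- day_index %in% tapply(day_index, block, pick_one)

# Expand the per-day mask to a per-record mask and split
test_mask <- rep(test_days_rand, each = records_per_day)
train_data <- dataset1[!test_mask, ]
test_data  <- dataset1[test_mask, ]
```

To get the sequential split instead, build test_mask from test_days_seq. If the number of days is not a multiple of 5, the last block is shorter and still contributes one test day in the random variant.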
Does this do what you want?
dat$day <- as.Date(dat$timestamp, "%d-%m-%Y") # Add the day for each observation
days <- unique(dat$day) # Get the days since it is the sampling unit
groups <- seq(1, 105, by=5) # First day of each group of 5, assuming 30240 observations, 105 days
daystest <- groups + sample(5, length(groups), replace=TRUE) - 1 # One random day per group (offset 0 to 4)
datetest <- days[daystest] # Days in the test set
Testing <- dat[dat$day %in% datetest,] # Test data set
Training <- dat[!dat$day %in% datetest,] # Training data set
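Testing is the data frame holding the rows of the original data used for testing, and Training is the data frame holding the rows used for training. Since you did not include a reproducible data sample, I could not test this.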
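Since no reproducible sample was available, one way to check the date-based approach is to run it on synthetic data. The sketch below assumes 105 complete days of 288 records each, starting at midnight, with timestamps stored as text in day-month-year order. The names n_days, per_day and times are hypothetical. Grouping by calendar date only matches the 288-record blocks when every date is complete; data that starts at 18:00, as in the question's sample, would begin with a partial day.

```r
# Synthetic stand-in for dat: 105 complete days of 5-minute records (assumption)
set.seed(100)
n_days <- 105
per_day <- 288
times <- seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
             by = "5 min", length.out = n_days * per_day)
dat <- data.frame(
  timestamp = format(times, "%d-%m-%Y %H:%M:%S"),
  var1 = rnorm(length(times)),
  var2 = rnorm(length(times))
)

dat$day <- as.Date(dat$timestamp, "%d-%m-%Y")  # calendar day of each record
days <- unique(dat$day)                        # the sampling units
groups <- seq(1, n_days, by = 5)               # first day of each group of 5
# One random day per group: offset 0..4 added to the group's first day
daystest <- groups + sample(5, length(groups), replace = TRUE) - 1
datetest <- days[daystest]

Testing  <- dat[dat$day %in% datetest, ]
Training <- dat[!dat$day %in% datetest, ]

# Number of test days falling in each group of 5 (expected to be 1 everywhere)
table((daystest - 1) %/% 5)
```

With these assumptions Testing should hold 21 whole days and Training the remaining 84.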