How to split a data frame in R into groups with an equal number of records, and then randomly split the data into two data frames

Question (votes: -1, answers: 2)

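I have data with about 30,000 records. I want to divide it into groups of 288 records each (one day of 5-minute readings), and then sort those groups into separate test_data and train_data data frames. In the sequential case, the first 4 groups of every 5 go to train_data and the 5th goes to test_data. In the random case, any one of every 5 days is saved to test_data and the remaining 4 days go to train_data.

How can I do this?

Sample data: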

 #   timestamp               var1      var2
    --------------------------------------
 1   01-01-2019 18:00:00      1.2       21
 2   01-01-2019 18:05:00      2.3       32
 3   01-01-2019 18:10:00      3.4       43
 4   01-01-2019 18:15:00      4.5       54
 5   01-01-2019 18:20:00      5.6       65
 . 
 .
 .
3000  ..   -    ..   ..        ..        ..

Sample output:

# in case of sequential OR contiguous division
train_data = (#1,#2,#3,#4 .... #1152,#1441,......,#2592,...)
test_data = (#1153,#1154,.....,#1440,.....,#2593,....)

# in case of random division, any 288 contiguous records from each bunch of 5 go into test_data and the other 4x288 into train_data.

Currently, I split the data like this:

    set.seed(100)

    train <- sample(nrow(dataset1), 0.7 * nrow(dataset1), replace = FALSE)
    TrainSet <- dataset1[train, ]
    # scale(TrainSet, center = TRUE, scale = TRUE)
    ValidSet <- dataset1[-train, ]
    # scale(ValidSet, center = TRUE, scale = TRUE)
    summary(TrainSet)
    summary(ValidSet)
r random sampling
2 Answers

0 votes

Here is one approach:

# assume the number of rows is divisible by 288
num_days <- nrow(dataset1) / 288

# Each value (TRUE or FALSE) indicates whether the *day* is included or not
training.days.mask <- sample(rep(c(TRUE, TRUE, TRUE, TRUE, FALSE), length.out = num_days))
testing.days.mask <- !training.days.mask

# To index the actual values, repeat each mask 288 times
training.samples.mask <- rep(training.days.mask, each = 288)
testing.samples.mask <- rep(testing.days.mask, each = 288)

# now use the mask to extract the data
training.samples <- dataset1[training.samples.mask, ]
testing.samples <- dataset1[testing.samples.mask, ]

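The idea is to first run sample on the day indices (not on the individual samples). Then each mask is repeated 288 times so that it covers a full day of samples.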
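One caveat: shuffling the whole day mask fixes the overall train/test ratio at 4:1, but it does not guarantee that every consecutive bunch of five days contributes exactly one test day. Below is a minimal sketch of a block-wise variant that covers both the sequential and the random case from the question. The helper name `split_by_day_blocks`, its parameters, and the synthetic `dataset1` are assumptions made for illustration, not part of the original post. It also assumes the rows are in time order and that the row count is divisible by 288.

```r
# Hypothetical helper (not from the original post): split a time-ordered
# data frame into train/test sets by whole days, one test day per bunch.
split_by_day_blocks <- function(df, per_day = 288, block = 5, random = TRUE) {
  stopifnot(nrow(df) %% per_day == 0)  # assumption: only complete days
  num_days <- nrow(df) %/% per_day
  days <- seq_len(num_days)
  # Bunch number of each day: days 1-5 -> 0, days 6-10 -> 1, ...
  block_id <- (days - 1) %/% block
  pick <- if (random) {
    function(d) d[sample.int(length(d), 1)]  # one random day per bunch
  } else {
    function(d) d[length(d)]                 # always the last day of the bunch
  }
  test_days <- unlist(lapply(split(days, block_id), pick), use.names = FALSE)
  # Expand the day-level choice to a row-level logical mask
  is_test <- rep(days %in% test_days, each = per_day)
  list(train = df[!is_test, , drop = FALSE],
       test  = df[is_test, , drop = FALSE])
}

# Synthetic stand-in for dataset1: 10 days of 288 readings each
set.seed(100)
dataset1 <- data.frame(id = seq_len(10 * 288), var1 = rnorm(10 * 288))

seq_split <- split_by_day_blocks(dataset1, random = FALSE)
rnd_split <- split_by_day_blocks(dataset1, random = TRUE)

range(seq_split$test$id[1:288])  # 1153 1440, i.e. day 5
```

If the number of days is not a multiple of `block`, the trailing incomplete bunch still contributes one test day in this sketch, which may or may not be what you want.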


0 votes

Does this do what you want?

dat$day <- as.Date(dat$timestamp, "%d-%m-%Y")  # Add the day for each observation
days <- unique(dat$day)                        # Get the days since it is the sampling unit
groups <- seq(1, 105, by=5)                    # Assuming 30240 observations, 105 days
daystest <- sample(0:4, length(groups), replace=TRUE) + groups  # One day (offset 0-4) per group of 5
datetest <- days[daystest]                     # Days in the test set
Testing <- dat[dat$day %in% datetest,]         # Test data set
Training <- dat[!dat$day %in% datetest,]       # Training data set

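Testing is the data frame holding the part of the original data used for testing, and Training is the data frame holding the part used for training. Since you did not include a reproducible data sample, I could not test this.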
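To try the date-based idea without the real data, here is a minimal sketch on synthetic data. Everything about the data is an assumption: `dat` is generated here with 105 days of 5-minute readings starting at midnight, and the timestamps are formatted like the question's sample. The day offset is drawn from 0:4 so that each chosen index stays inside its own group of five days. Note that if the real series starts at 18:00, as in the question's sample, calendar days will not line up with the 288-record groups, so grouping by row position may be the safer choice there.

```r
set.seed(100)
n_days <- 105

# Synthetic stand-in for the question's data (assumed, not from the post)
stamps <- seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
              by = "5 min", length.out = n_days * 288)
dat <- data.frame(
  timestamp = format(stamps, "%d-%m-%Y %H:%M:%S", tz = "UTC"),
  var1 = rnorm(n_days * 288)
)

# Day of each observation; the time-of-day part of the string is ignored
dat$day <- as.Date(dat$timestamp, format = "%d-%m-%Y")
days <- unique(dat$day)

# Index of the first day of each bunch of five: 1, 6, 11, ...
groups <- seq(1, length(days), by = 5)

# One random day per bunch: first day of the bunch plus an offset of 0-4
daystest <- groups + sample(0:4, length(groups), replace = TRUE)
datetest <- days[daystest]

Testing  <- dat[dat$day %in% datetest, ]
Training <- dat[!dat$day %in% datetest, ]

# Each test day should appear with exactly 288 rows
table(table(Testing$day))
```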
