I have a file, myfile.csv, whose columns are separated by a pipe (|). Using R, I want to split it into 2 csv files with the same number of rows, each keeping the header row.
userid|logindatetime|logoutdatetime|description
111|2024-02-10 08:00:00|2024-02-10 08:00:00|systemadministrator
112|2024-02-11 08:00:00|2024-02-10 08:00:05|user1
113|2024-02-12 08:00:00|2024-02-10 08:00:10|user2
114|2024-02-13 08:00:00|2024-02-10 08:00:15|user3
115|2024-02-14 08:00:00|2024-02-10 08:00:20|user4
116|2024-02-15 08:00:00|2024-02-10 08:00:25|user5
117|2024-02-16 08:00:00|2024-02-10 08:00:30|user6
118|2024-02-17 08:00:00|2024-02-10 08:00:35|user7
111|2024-02-18 08:00:00|2024-02-10 08:00:40|systemadministrator
119|2024-02-18 08:00:00|2024-02-10 08:00:45|user8
I followed the solution suggested in "Split a large file into smaller files in R using a loop", shown below, and it partially works.
library(tidyverse)
rowCount <- 2
data %>%
  mutate(Group = ceiling((row_number()) / rowCount)) %>%
  group_by(Group) %>%
  group_walk(
    function(.x, .y) {
      write.csv(.x, file = paste0("myfile.csv", .y$Group, ".csv"))
    }
  )
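However, the suggested solution does not preserve the pipe (|) separator. It introduces a comma (,) separator instead, and it also adds double quotes ("") around the string values and the row counter. I need the split csv files to keep the original structure and format. Below is the expected output for my 2 files after splitting.
Expected output: file 1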
userid|logindatetime|logoutdatetime|description
111|2024-02-10 08:00:00|2024-02-10 08:00:00|systemadministrator
112|2024-02-11 08:00:00|2024-02-10 08:00:05|user1
113|2024-02-12 08:00:00|2024-02-10 08:00:10|user2
114|2024-02-13 08:00:00|2024-02-10 08:00:15|user3
115|2024-02-14 08:00:00|2024-02-10 08:00:20|user4
Expected output: file 2
userid|logindatetime|logoutdatetime|description
116|2024-02-15 08:00:00|2024-02-10 08:00:25|user5
117|2024-02-16 08:00:00|2024-02-10 08:00:30|user6
118|2024-02-17 08:00:00|2024-02-10 08:00:35|user7
111|2024-02-18 08:00:00|2024-02-10 08:00:40|systemadministrator
119|2024-02-18 08:00:00|2024-02-10 08:00:45|user8
The incorrect output I actually get, for example for output 1, is as follows:
"","userid","logindatetime","logoutdatetime","description"
"1",111,"2024-02-10 08:00:00","2024-02-10 08:00:00","systemadministrator"
"2",112,"2024-02-11 08:00:00","2024-02-10 08:00:05","user1"
"3",113,"2024-02-12 08:00:00","2024-02-10 08:00:10","user2"
"4",114,"2024-02-13 08:00:00","2024-02-10 08:00:15","user3"
"5",115,"2024-02-14 08:00:00","2024-02-10 08:00:20","user4"
I tried the code below to specify the column separator (sep = "|"), disable quoting (quote = ""), and include the header (header = TRUE).
rowCount <- 2
data %>%
  mutate(Group = ceiling((row_number()) / rowCount)) %>%
  group_by(Group) %>%
  group_walk(
    function(.x, .y) {
      write.csv(.x, file = paste0("myfile.csv", .y$Group, ".csv", sep = "|", header = TRUE, quote = "", stringsAsFactors = FALSE))
    }
  )
But I get the following error. How can I get the output I want?
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
cannot open file 'myfile.csv1.csv|TRUEFALSE': Invalid argument
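Here is a simple way to split the CSV file into two separate files while preserving the original pipe delimiter and format.
- Use read.table to load the data, with the pipe (|) as the separator.
- Use write.table to save each group to a new file with the same pipe separator.
# Load necessary library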
library(dplyr)
# Read the CSV file with pipe separator
data <- read.table("myfile.csv", header = TRUE, sep = "|", stringsAsFactors = FALSE)
# Define how many rows you want per file
rowCount <- 5 # Adjust this as needed
# Create a grouping variable
data <- data %>%
  mutate(Group = ceiling(row_number() / rowCount))
# Split the data and write each group to a new file
data %>%
  group_by(Group) %>%
  group_walk(
    function(.x, .y) {
      write.table(.x, file = paste0("file", .y$Group, ".csv"), sep = "|", row.names = FALSE, quote = FALSE)
    }
  )
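As a side note, rowCount is fixed at 5 in the code above, which matches the 10-row sample. For the two-file case in the question it could instead be derived from the number of rows. The following is a minimal base-R sketch of that idea, not a replacement for the dplyr code. The helper name split_in_two and the output prefix are made up for illustration, and quote = "", comment.char = "" and colClasses = "character" are assumptions added so that values are written back exactly as they were read.

```r
# Hypothetical helper (name and paths are illustrative only):
# split a pipe-delimited file into two halves, each with the header row.
split_in_two <- function(path, out_prefix) {
  # Read every column as character so values round-trip unchanged
  data <- read.table(path, header = TRUE, sep = "|", quote = "",
                     comment.char = "", colClasses = "character")
  # Rows per file: half the total, rounded up when the count is odd
  rowCount <- ceiling(nrow(data) / 2)
  group <- ceiling(seq_len(nrow(data)) / rowCount)
  parts <- split(data, group)
  out_files <- paste0(out_prefix, names(parts), ".csv")
  for (i in seq_along(parts)) {
    write.table(parts[[i]], file = out_files[i], sep = "|",
                row.names = FALSE, quote = FALSE)
  }
  out_files
}

# Try it on a temporary copy of the sample data from the question
sample_lines <- c(
  "userid|logindatetime|logoutdatetime|description",
  "111|2024-02-10 08:00:00|2024-02-10 08:00:00|systemadministrator",
  "112|2024-02-11 08:00:00|2024-02-10 08:00:05|user1",
  "113|2024-02-12 08:00:00|2024-02-10 08:00:10|user2",
  "114|2024-02-13 08:00:00|2024-02-10 08:00:15|user3",
  "115|2024-02-14 08:00:00|2024-02-10 08:00:20|user4",
  "116|2024-02-15 08:00:00|2024-02-10 08:00:25|user5",
  "117|2024-02-16 08:00:00|2024-02-10 08:00:30|user6",
  "118|2024-02-17 08:00:00|2024-02-10 08:00:35|user7",
  "111|2024-02-18 08:00:00|2024-02-10 08:00:40|systemadministrator",
  "119|2024-02-18 08:00:00|2024-02-10 08:00:45|user8"
)
in_file <- tempfile(fileext = ".csv")
writeLines(sample_lines, in_file)
out_files <- split_in_two(in_file, file.path(tempdir(), "part"))
# Show the first output file: header plus the first five data rows
print(readLines(out_files[1]))
```

With an odd number of rows, the first file in this sketch gets one more row than the second.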
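- read.table: reads the original file, using the pipe as the separator.
- mutate: creates a new column that identifies, based on rowCount, which group each row belongs to.
- write.table: saves each group, using the pipe separator and without adding quotes around strings.
- Adjust the rowCount variable to the number of rows you want in each output file.
- Make sure the "myfile.csv" file path points to where your actual file is located.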
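As for the error in the question's second attempt: the closing parenthesis of paste0() sits at the very end of the line, so sep, header, quote and stringsAsFactors were passed to paste0() rather than to write.csv(). paste0() has no sep argument, so it treats all of them as more strings to concatenate, which is where the odd file name in the warning comes from. A small sketch that reproduces this, with the literal 1 standing in for .y$Group:

```r
# Extra named arguments to paste0() are simply pasted into the result
bad_name <- paste0("myfile.csv", 1, ".csv", sep = "|", header = TRUE,
                   quote = "", stringsAsFactors = FALSE)
print(bad_name)  # "myfile.csv1.csv|TRUEFALSE", as in the warning

# Presumably what was intended: close paste0() after the file name,
# then pass the formatting options to the writing function instead
good_name <- paste0("myfile", 1, ".csv")
print(good_name)  # "myfile1.csv"
```

Even with the parenthesis moved, write.csv() is documented to ignore attempts to change sep (with a warning), and header and stringsAsFactors are read.table() arguments rather than writing options, which is why the code above uses write.table() with sep = "|", quote = FALSE and row.names = FALSE.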