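I am trying to import a 17.6 GB csv file into R using readLines. I have tried several of the approaches discussed here, here, here and elsewhere, and readLines seems to be the only one that at least gets the data into R.
The problem is that I cannot convert the output of readLines into a data frame that I can use in my analysis. The answers to a related question here did not help me solve my problem.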
Here is my sample data:
write.csv(data.frame(myid=1:10,var=runif(10)),"temp.csv")
dt<-data.frame(myid=1:10,var=runif(10))
dt
myid var
1 1 0.5949020
2 2 0.8515591
3 3 0.8139010
4 4 0.3804234
5 5 0.4923082
6 6 0.9933775
7 7 0.1740895
8 8 0.8342808
9 9 0.3958154
10 10 0.9690561
Create the chunk:
file_in <- file("temp.csv","r")
chunk_size <- 100000 # choose the best size for you
x<- readLines(file_in, n=chunk_size)
Inspect the output of readLines in R:
View(x)
x
[1] "\"\",\"myid\",\"var\""
[2] "\"1\",1,0.594902001088485"
[3] "\"2\",2,0.851559089729562"
[4] "\"3\",3,0.81390100880526"
[5] "\"4\",4,0.380423351423815"
[6] "\"5\",5,0.492308202432469"
[7] "\"6\",6,0.993377464590594"
[8] "\"7\",7,0.174089450156316"
[9] "\"8\",8,0.834280799608678"
[10] "\"9\",9,0.395815373631194"
[11] "\"10\",10,0.969056134112179"
Thanks in advance for your help.
Here is a complete sequence of instructions that converts the data, as posted, into a data frame.
set.seed(1234) # Make the results reproducible
write.csv(data.frame(myid=1:10,var=runif(10)),"temp.csv")
dat <- readLines("temp.csv")
df1 <- strsplit(dat[-1], ",")
df1 <- do.call(rbind, df1)
df1 <- df1[,-1]
df1 <- as.data.frame(df1)
df1[] <- lapply(df1, function(x) as.numeric(as.character(x)))
names(df1) <- gsub('"', '', strsplit(dat[1], ',')[[1]][-1], fixed = TRUE)
df1
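As a possible alternative to splitting by hand, the lines returned by readLines can be passed to read.csv through its text argument, which also takes care of the quoting. The sketch below does this chunk by chunk, since the real file is large. The helper name read_chunks and its default chunk size are assumptions made for illustration; they are not part of the original post.

```r
# Hypothetical helper (name and defaults are assumptions): read a CSV written
# by write.csv() in chunks and parse each chunk with read.csv(text = ...).
read_chunks <- function(path, chunk_size = 100000) {
  con <- file(path, "r")
  on.exit(close(con))                      # always release the connection
  header <- readLines(con, n = 1)          # keep the header line for every chunk
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached
    # Prepend the header so each chunk gets the right column names;
    # row.names = 1 turns the unnamed first column from write.csv() into row names.
    pieces[[length(pieces) + 1]] <-
      read.csv(text = c(header, lines), row.names = 1)
  }
  do.call(rbind, pieces)
}

set.seed(1234)
write.csv(data.frame(myid = 1:10, var = runif(10)), "temp.csv")
df2 <- read_chunks("temp.csv", chunk_size = 4)
str(df2)
```

Note that this version keeps every chunk and binds them at the end, so it only helps if the parsed result fits in memory. For a file that does not, each chunk would probably have to be filtered or summarised inside the loop instead of being stored.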
Given the output of readLines, this must be the content of the CSV file:
"","myid","var"
"1","1","0.5949020"
"2","2","0.8515591"
"3","3","0.8139010"
"4","4","0.3804234"
"5","5","0.4923082"
"6","6","0.9933775"
"7","7","0.1740895"
"8","8","0.8342808"
"9","9","0.3958154"
"10","10","0.9690561"
That is, your values are comma-separated and enclosed in double quotes. When I read this file, I get your output:
dat
[1] "\"\",\"myid\",\"var\"" "\"1\",\"1\",\"0.5949020\""
[3] "\"2\",\"2\",\"0.8515591\"" "\"3\",\"3\",\"0.8139010\""
[5] "\"4\",\"4\",\"0.3804234\"" "\"5\",\"5\",\"0.4923082\""
[7] "\"6\",\"6\",\"0.9933775\"" "\"7\",\"7\",\"0.1740895\""
[9] "\"8\",\"8\",\"0.8342808\"" "\"9\",\"9\",\"0.3958154\""
[11] "\"10\",\"10\",\"0.9690561\""
So what you need to do is
unlist(strsplit(..., split = ","))
and
gsub("\"", "", ...)
This gives us:
unlist(strsplit(gsub("\"", "", dat), split = ","))
[1] "" "myid" "var" "1" "1" "0.5949020" "2"
[8] "2" "0.8515591" "3" "3" "0.8139010" "4" "4"
[15] "0.3804234" "5" "5" "0.4923082" "6" "6" "0.9933775"
[22] "7" "7" "0.1740895" "8" "8" "0.8342808" "9"
[29] "9" "0.3958154" "10" "10" "0.9690561"
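The result above is still a flat character vector, so one more step is needed to arrive at a data frame. A minimal sketch, assuming every line has exactly three fields and no field contains an embedded comma: fill a matrix row by row, take the first row as the header, and convert the columns to numeric. The object names flat, m and df3 are chosen here for illustration only.

```r
# Lines as readLines() would return them for the fully quoted file (first rows only)
dat <- c('"","myid","var"', '"1","1","0.5949020"', '"2","2","0.8515591"',
         '"3","3","0.8139010"')

# Strip the quotes, split on commas and flatten into one character vector
flat <- unlist(strsplit(gsub("\"", "", dat), split = ","))

# Assumption: three fields per line, so the vector fills a 3-column matrix row by row
m <- matrix(flat, ncol = 3, byrow = TRUE)

# Row 1 is the header; column 1 holds the row names written by write.csv()
df3 <- as.data.frame(m[-1, -1, drop = FALSE], stringsAsFactors = FALSE)
names(df3) <- m[1, -1]
df3[] <- lapply(df3, as.numeric)   # character columns to numeric
df3
```

This relies on the field count being constant. If some fields could contain commas inside the quotes, a real CSV parser such as read.csv would be the safer choice.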