I am trying to read the following CSV file into R:
http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv
The code I am currently using is:
url <- "http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv"
shorthistory <- read.csv(url, skip = 4)
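However, I keep getting the following warnings.
1: In readLines(file, skip) : line 1 appears to contain an embedded nul
2: In readLines(file, skip) : line 2 appears to contain an embedded nul
3: In readLines(file, skip) : line 3 appears to contain an embedded nul
4: In readLines(file, skip) : line 4 appears to contain an embedded nul
This leads me to believe I am using the function incorrectly, since it fails on every line.
Any help would be much appreciated!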
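Because of the blank in the upper-left corner, read.csv() does not seem to work. The file has to be read line by line (readLines()), and then the first 4 lines skipped.
An example is shown below. The file is opened as a file connection (file()) and then read line by line (readLines()). The first 4 lines are skipped by subsetting. The file is tab-delimited, so strsplit() is applied to each line in turn. The result is still a list of character vectors, which should be reshaped into a data frame or any other suitable type.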
# open file connection and read lines
path <- "http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv"
con <- file(path, open = "rt", raw = TRUE)
text <- readLines(con, skipNul = TRUE)
close(con)
# skip first 4 lines
text <- text[5:length(text)]
# split each line on tabs (strsplit() is vectorised, so this gives one character vector per line)
text <- strsplit(text, split = "\t")
text[[1]][1:4]
# [1] "1-PAGE LTD ORDINARY" "1PG " "1330487" "1.72"
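For the last step mentioned above (turning the list of character vectors into a data frame), something along these lines might work. This is only a sketch: the rows below are made-up stand-ins rather than real ASIC data, and the column names are hypothetical, so adjust both to whatever the file actually contains.

```r
# Hypothetical stand-in for the list produced by strsplit(); not real ASIC data
rows <- list(
  c("COMPANY A ORDINARY", "AAA ", "100", "1.5"),
  c("COMPANY B ORDINARY", "BBB ", "200"),          # shorter row: last field missing
  c("COMPANY C ORDINARY", "CCC ", "300", "0.25")
)

# Pad every row with NA to the same length so the rows can be bound together
width <- max(lengths(rows))
padded <- lapply(rows, function(r) c(r, rep(NA_character_, width - length(r))))

# Bind into a character matrix, then convert to a data frame
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(df) <- c("company", "ticker", "volume", "pct")  # hypothetical column names

# Trim stray whitespace in the ticker and convert the numeric columns
df$ticker <- trimws(df$ticker)
df$volume <- as.numeric(df$volume)
df$pct <- as.numeric(df$pct)
```

Padding first matters because rbind() would otherwise recycle values in the shorter rows without complaining loudly enough to notice.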
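I did not end up trying readLines(), but it turns out the file is Unicode (UTF-16). Yes, the file is in an awful format, but I ended up using the following code to get the short-position volume data.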
library(reshape2)  # assumption: melt() comes from reshape2; the original post does not say
# the file is UTF-16 encoded and tab-delimited
shorthistory <- read.csv("http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv", skip = 1, fileEncoding = "UTF-16", sep = "\t")
shorthistory <- shorthistory[-(1:2), ]
# the first field was read in as row names; move it into a regular column
shorthistory <- cbind(Row.Names = rownames(shorthistory), shorthistory)
rownames(shorthistory) <- NULL
# keep characters 2 to 11 of each column name (the date part, given the "%Y.%m.%d" format used below)
colnames(shorthistory) <- substr(colnames(shorthistory), 2, 11)
colnames(shorthistory)[1] <- "Company"
colnames(shorthistory)[2] <- "Ticker"
shorthist1 <- shorthistory[, 1:2]
i <- 3  # start at first volume column with short data
while (i <= length(colnames(shorthistory))) {
  # keep only the even-numbered columns
  if (i %% 2 == 0) {
    shorthist1 <- cbind(shorthist1, shorthistory[i])
  }
  i <- i + 1
}
melted <- melt(data = shorthist1, id = c("Ticker", "Company"))
melted$variable <- as.POSIXlt(x = melted$variable, format = "%Y.%m.%d")
melted$value[melted$value == ""] <- 0.00
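To see why the encoding matters without hitting the network, here is a small self-contained sketch. It writes a tiny tab-delimited file as UTF-16LE (made-up contents, hypothetical column names) and reads it back with fileEncoding. In UTF-16, every plain ASCII character is stored as two bytes, one of which is zero, which is presumably where the "embedded nul" warnings come from.

```r
# Made-up sample content; not the real ASIC file
txt <- "Ticker\tVolume\nAAA\t100\nBBB\t200\n"

# Encode the text as UTF-16LE bytes and write them to a temporary file
bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
tmp <- tempfile(fileext = ".csv")
writeBin(bytes, tmp)

# Half of the bytes are NUL, because every ASCII character takes two bytes
nul_count <- sum(bytes == as.raw(0))

# Telling read.csv() the encoding lets it decode the file properly
dat <- read.csv(tmp, fileEncoding = "UTF-16LE", sep = "\t")
```

Here dat should come back as an ordinary two-column data frame. The real file starts with a byte order mark, which is why plain "UTF-16" was used for it above.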
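After running into a lot of problems with CSV files that contain a BOM (byte order mark) and NULs, I wrote this small function. It reads the file line by line (ignoring NULs), skips empty lines, and then applies read.csv.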
# Read CSV files with BOM and NUL problems
read.csvX <- function(file, encoding = "UTF-16LE", header = TRUE, stringsAsFactors = TRUE) {
  csvLines <- readLines(file, encoding = encoding, skipNul = TRUE, warn = FALSE)
  # Remove BOM (ÿþ) from first line
  if (substr(csvLines[[1]], 1, 2) == "ÿþ") {
    csvLines[[1]] <- substr(csvLines[[1]], 3, nchar(csvLines[[1]]))
  }
  # Drop empty lines
  csvLines <- csvLines[csvLines != ""]
  if (length(csvLines) == 0) {
    warning("Empty file")
    return(NULL)
  }
  csvData <- read.csv(text = paste(csvLines, collapse = "\n"), header = header, stringsAsFactors = stringsAsFactors)
  return(csvData)
}
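If you are not sure whether a file has a BOM at all, one option is to look at its first few bytes before deciding how to read it. The helper below is a rough sketch; guess_bom() is a name I made up (it is not part of base R), and it only recognises the three most common marks.

```r
# Hypothetical helper: guess an encoding name from the first bytes of a local file
guess_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 3L)
  if (length(b) >= 3 && identical(b[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))) return("UTF-8-BOM")
  if (length(b) >= 2 && identical(b[1:2], as.raw(c(0xFF, 0xFE)))) return("UTF-16LE")
  if (length(b) >= 2 && identical(b[1:2], as.raw(c(0xFE, 0xFF)))) return("UTF-16BE")
  NA_character_  # no recognised BOM
}

# Example: a two-byte BOM (FF FE) followed by "A" encoded as UTF-16LE
tmp <- tempfile()
writeBin(as.raw(c(0xFF, 0xFE, 0x41, 0x00)), tmp)
guess_bom(tmp)
# [1] "UTF-16LE"
```

The result could then be passed on as fileEncoding to read.csv(), though a missing BOM (NA) does not by itself tell you what the encoding is.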
Hopefully this answer to an old question helps someone.