I am working with 75GB XML files, so it is impossible to load them into memory and build a DOM XML tree. I therefore resort to processing the lines in chunks (using readr::read_lines_chunked), e.g. 10k lines at a time (a sketch of that driver follows the demo below). Here is a small demo with N = 3 lines, in which I extract the data needed to build a tibble, but it is not particularly fast:
library(tidyverse)
xml <- c("<row Id=\"4\" Attrib1=\"1\" Attrib2=\"7\" Attrib3=\"2008-07-31T21:42:52.667\" Attrib4=\"645\" Attrib5=\"45103\" Attrib6=\"fjbnjahkcbvahjsvdghvadjhavsdjbaJKHFCBJHABCJKBASJHcvbjavbfcjkhabcjkhabsckajbnckjasbnckjbwjhfbvjahsdcvbzjhvcbwiebfewqkn\" Attrib7=\"8\" Attrib8=\"11652943\" Attrib9=\"Rich B\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib11=\"2019-10-21T14:03:54.607\" Attrib12=\"1\" Attrib13=\"a|b|c|d|e|f|g\" Attrib14=\"13\" Attrib15=\"3\" Attrib16=\"49\" Attrib17=\"2012-10-31T16:42:47.213\"/>",
"<row Id=\"5\" Attrib1=\"2\" Attrib2=\"8\" Attrib3=\"2008-07-31T21:42:52.999\" Attrib4=\"649\" Attrib5=\"7634\" Attrib6=\"fjbnjahkcbvahjsvdghvadjhavsdjbaJKHFCBJHABCJKBASJHcvbjavbfcjkhabcjkhabsckajbnckjasbnckjbwjhfbvjahsdcvbzjhvcbwiebfewqkn\" Attrib7=\"8\" Attrib8=\"11652943\" Attrib9=\"Rich B\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib11=\"2019-10-21T14:03:54.607\" Attrib12=\"2\" Attrib13=\"a|b|c|d|e|f|g\" Attrib14=\"342\" Attrib15=\"43\" Attrib16=\"767\" Attrib17=\"2012-10-31T16:42:47.213\"/>",
"<row Id=\"6\" Attrib1=\"3\" Attrib2=\"9\" Attrib3=\"2008-07-31T21:42:52.999\" Attrib4=\"348\" Attrib5=\"2732\" Attrib6=\"djhfbsdjhfbijhsdbfjkdbnfkjndaskjfnskjdlnfkjlsdnf\" Attrib7=\"9\" Attrib8=\"34873\" Attrib9=\"FHDHsf\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib11=\"2019-10-21T14:03:54.607\" Attrib12=\"3\" Attrib13=\"a|b|c|d|e|f|g\" Attrib14=\"342\" Attrib15=\"43\" Attrib16=\"767\" Attrib17=\"2012-10-31T16:42:47.4333\"/>")
pattern <- paste(".*(Id=\"\\d+\") ",
"(Attrib1=\"\\d+\") ",
"(Attrib2=\"\\d+\") ",
"(Attrib3=\"[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+[0-9]+.[0-9]+\") ",
"(Attrib4=\"\\d+\") ",
"(Attrib5=\"\\d+\")",
".*(Attrib8=\"\\d+\") ",
".*(Attrib10=\"[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+[0-9]+.[0-9]+\") ",
"(Attrib11=\"[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+[0-9]+.[0-9]+\")",
".*(Attrib13=\"([a-z]|[0-9]|\\||\\s)+\") ",
"(Attrib14=\"\\d+\") ",
"(Attrib15=\"\\d+\") ",
"(Attrib16=\"\\d+\")",
sep="")
# match the capture groups in pattern; drop column 1 (the full match) and
# column 12 (the nested group inside the Attrib13 pattern)
tmp <- str_match(xml, pattern)[,-c(1,12)]
# remove non matching NA rows
r <- which(is.na(tmp[,1]))
if (length(r) > 0) {
  tmp <- tmp[-r,]
}
# strip the attribute names, keeping only the values inside the double quotes
tmp <- apply(tmp, 1, function(s) {
  str_remove_all(str_match(s, "(\".*\")")[,-1], "\"")
})
# need the transposed version of tmp
tmp <- t(tmp)
tmp
# convert to a tibble
colnames(tmp) <- c("Id", "Attrib1", "Attrib2", "Attrib3", "Attrib4", "Attrib5", "Attrib8", "Attrib10", "Attrib11", "Attrib13", "Attrib14", "Attrib15", "Attrib16")
as_tibble(tmp)
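For reference, this is a minimal sketch of how I intend to wrap the extraction above into the readr::read_lines_chunked driver mentioned at the top. extract_rows() is a hypothetical helper that just bundles the demo steps (it reuses pattern from above), and DataFrameCallback rbinds the per-chunk tibbles:

library(tidyverse)

# hypothetical helper: takes a character vector of lines, returns a tibble (or NULL)
extract_rows <- function(lines) {
  m <- str_match(lines, pattern)[, -c(1, 12), drop = FALSE]
  m <- m[!is.na(m[, 1]), , drop = FALSE]
  if (nrow(m) == 0) return(NULL)  # chunk contained no matching rows
  m <- t(apply(m, 1, function(s) {
    str_remove_all(str_match(s, "(\".*\")")[, -1], "\"")
  }))
  colnames(m) <- c("Id", "Attrib1", "Attrib2", "Attrib3", "Attrib4", "Attrib5",
                   "Attrib8", "Attrib10", "Attrib11", "Attrib13", "Attrib14",
                   "Attrib15", "Attrib16")
  as_tibble(m)
}

# process the file in 10k-line chunks; the per-chunk tibbles are rbind-ed into one
result <- read_lines_chunked(
  "Huge.xml",
  callback = DataFrameCallback$new(function(lines, pos) extract_rows(lines)),
  chunk_size = 10000
)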
Is there a better approach in terms of performance?
UPDATE: I benchmarked the code above on 10k rows (instead of 3) and it took 900 seconds. I then reduced the number of attribute regex capture groups from 13 to 7 (keeping only the essential ones), and the same benchmark dropped to 128 seconds.
Extrapolating to the full 9,731,474 rows, that takes me from roughly 10 days down to roughly 35 hours. I then used the Linux command split -l1621913 -d Huge.xml Huge_split_ --verbose
to split the huge file into 6 pieces, matching the number of cores I have, and now run the code on each split file in parallel... so I am looking at 35/6 ≈ 5.8 hours... not too bad. I would do:
library(doMC) # also attaches foreach
registerDoMC(6)
resultList <- foreach(i = 0:5) %dopar% {
  file <- sprintf('Huge_split_0%d', i)
  partial <- NULL  # placeholder: run the chunk extraction on `file` and return its tibble
  return(partial)
}
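Assuming each worker returns a tibble (or NULL), the six partial results can then be stacked into a single tibble, e.g.:

result <- dplyr::bind_rows(resultList)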