What is the fastest alternative for this string regex processing in R?


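I am working with 75 GB XML files, so loading them into memory and building a DOM XML tree is not possible. I have therefore resorted to processing the lines in chunks (using readr::read_lines_chunked) of, say, 10k lines at a time. Here is a small demo with N = 3 lines in which I extract the data needed to build a tibble, but it is not very fast: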

library(tidyverse)
xml <- c("<row Id=\"4\" Attrib1=\"1\" Attrib2=\"7\" Attrib3=\"2008-07-31T21:42:52.667\" Attrib4=\"645\" Attrib5=\"45103\" Attrib6=\"fjbnjahkcbvahjsvdghvadjhavsdjbaJKHFCBJHABCJKBASJHcvbjavbfcjkhabcjkhabsckajbnckjasbnckjbwjhfbvjahsdcvbzjhvcbwiebfewqkn\" Attrib7=\"8\" Attrib8=\"11652943\" Attrib9=\"Rich B\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib11=\"2019-10-21T14:03:54.607\" Attrib12=\"1\" Attrib13=\"a|b|c|d|e|f|g\" Attrib14=\"13\" Attrib15=\"3\" Attrib16=\"49\" Attrib17=\"2012-10-31T16:42:47.213\"/>",
         "<row Id=\"5\" Attrib1=\"2\" Attrib2=\"8\" Attrib3=\"2008-07-31T21:42:52.999\" Attrib4=\"649\" Attrib5=\"7634\" Attrib6=\"fjbnjahkcbvahjsvdghvadjhavsdjbaJKHFCBJHABCJKBASJHcvbjavbfcjkhabcjkhabsckajbnckjasbnckjbwjhfbvjahsdcvbzjhvcbwiebfewqkn\" Attrib7=\"8\" Attrib8=\"11652943\" Attrib9=\"Rich B\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib11=\"2019-10-21T14:03:54.607\" Attrib12=\"2\" Attrib13=\"a|b|c|d|e|f|g\" Attrib14=\"342\" Attrib15=\"43\" Attrib16=\"767\" Attrib17=\"2012-10-31T16:42:47.213\"/>",
         "<row Id=\"6\" Attrib1=\"3\" Attrib2=\"9\" Attrib3=\"2008-07-31T21:42:52.999\" Attrib4=\"348\" Attrib5=\"2732\" Attrib6=\"djhfbsdjhfbijhsdbfjkdbnfkjndaskjfnskjdlnfkjlsdnf\" Attrib7=\"9\" Attrib8=\"34873\" Attrib9=\"FHDHsf\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib11=\"2019-10-21T14:03:54.607\" Attrib12=\"3\" Attrib13=\"a|b|c|d|e|f|g\" Attrib14=\"342\" Attrib15=\"43\" Attrib16=\"767\" Attrib17=\"2012-10-31T16:42:47.4333\"/>")
pattern <- paste(".*(Id=\"\\d+\") ",
                 "(Attrib1=\"\\d+\") ",
                 "(Attrib2=\"\\d+\") ",
                 "(Attrib3=\"[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+[0-9]+\\.[0-9]+\") ",
                 "(Attrib4=\"\\d+\") ",
                 "(Attrib5=\"\\d+\")",
                 ".*(Attrib8=\"\\d+\") ",
                 ".*(Attrib10=\"[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+[0-9]+\\.[0-9]+\") ",
                 "(Attrib11=\"[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+[0-9]+\\.[0-9]+\")",
                 ".*(Attrib13=\"([a-z]|[0-9]|\\||\\s)+\") ",
                 "(Attrib14=\"\\d+\") ",
                 "(Attrib15=\"\\d+\") ",
                 "(Attrib16=\"\\d+\")",
                 sep="")
# match the groups in pattern and extract the matches
tmp <- str_match(xml, pattern)[,-c(1,12)]
# remove non-matching NA rows; drop = FALSE keeps tmp a matrix even if one row is left
tmp <- tmp[!is.na(tmp[, 1]), , drop = FALSE]
# remove the metadata and stay with the data within the double quotes only
tmp <- apply(tmp, 1, function(s) {
  str_remove_all(str_match(s, "(\".*\")")[,-1], "\"")
})
# need the transposed version of tmp
tmp <- t(tmp)
tmp
# convert to a tibble
colnames(tmp) <- c("Id", "Attrib1", "Attrib2", "Attrib3", "Attrib4", "Attrib5", "Attrib8", "Attrib10", "Attrib11", "Attrib13", "Attrib14", "Attrib15", "Attrib16")
as_tibble(tmp)
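
The demo above works on an in-memory vector. For the chunked setting described at the top, the same kind of extraction could be wired into readr::read_lines_chunked roughly as follows. This is only a sketch: parse_chunk is a hypothetical helper, the temporary file stands in for the real XML, and only two attributes are extracted to keep it short.

```r
library(readr)
library(stringr)
library(tibble)

# Hypothetical helper: turns one chunk of lines into a tibble.
# readr calls it with the lines of the chunk and the position of its first line.
parse_chunk <- function(lines, pos) {
  m <- str_match(lines, "Id=\"(\\d+)\" Attrib1=\"(\\d+)\"")
  m <- m[!is.na(m[, 1]), , drop = FALSE]  # drop lines that do not match
  tibble(Id = m[, 2], Attrib1 = m[, 3])
}

# Stand-in for the real file: three rows written to a temporary file.
path <- tempfile(fileext = ".xml")
write_lines(sprintf("<row Id=\"%d\" Attrib1=\"%d\" />", 4:6, 1:3), path)

# chunk_size = 2 forces more than one callback call on this tiny input;
# the question uses chunks of about 10,000 lines.
result <- read_lines_chunked(path, DataFrameCallback$new(parse_chunk),
                             chunk_size = 2)
```

DataFrameCallback row-binds whatever each call returns, so memory use is driven by the extracted columns and not by the raw XML.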

Is there a better approach in terms of performance?
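
One direction that might be worth benchmarking, offered as a sketch and not as a measured improvement: extract each attribute with its own small pattern that captures the quoted value directly. That avoids the several `.*` segments of the combined pattern and the second pass that strips names and quotes. extract_attr is a hypothetical helper, and the two input lines are shortened versions of the demo rows. Unlike the original pattern, this does not validate the value format, and a line missing an attribute yields NA in that column instead of being dropped.

```r
library(stringr)

# Hypothetical helper: pull the value of one attribute out of every line.
# The leading \\s keeps "Id" from matching inside a longer attribute name.
extract_attr <- function(lines, name) {
  str_match(lines, paste0("\\s", name, "=\"([^\"]*)\""))[, 2]
}

xml <- c(
  "<row Id=\"4\" Attrib1=\"1\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib13=\"a|b|c\"/>",
  "<row Id=\"5\" Attrib1=\"2\" Attrib10=\"2019-09-03T17:25:25.207\" Attrib13=\"a|b|c\"/>"
)
wanted <- c("Id", "Attrib1", "Attrib10", "Attrib13")

# One column per attribute; values arrive already stripped of names and quotes.
cols <- lapply(wanted, function(a) extract_attr(xml, a))
names(cols) <- wanted
res <- as.data.frame(cols, stringsAsFactors = FALSE)
```

Because the closing `="` is part of each pattern, Attrib1 does not match Attrib10 or Attrib13.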

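UPDATE: I benchmarked the code above on 10k lines (instead of 3) and it took 900 seconds. I then reduced the number of attribute regex groups from 13 to 7 (only the essential ones), and the same benchmark dropped to 128 seconds.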

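Extrapolating, the full 9,731,474 lines would go from roughly 10 days down to roughly 35 hours. I then used the Linux command split -l1621913 -d Huge.xml Huge_split_ --verbose to split the big file into 6 files, matching the number of cores I have, and I now run the code on each split file in parallel... so I am looking at 35/6 = ~5.8 hours... not too bad. I do: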

library(doMC)
registerDoMC(6)
resultList <- foreach (i=0:5) %dopar% {
  file <- sprintf('Huge_split_0%d', i)  
  # placeholder: replace NULL with the result of running the chunk algorithm on file
  partial <- NULL
  partial
}
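
The figures quoted in the update can be reproduced with a few lines of arithmetic. The sketch below assumes run time scales linearly with the number of lines and that the 6 workers scale perfectly, which is the same simplification the estimate in the question makes.

```r
n_lines     <- 9731474   # total lines in the file, from the question
bench_lines <- 10000     # lines used in the benchmark

# Benchmark seconds scaled linearly to the whole file.
secs_13 <- 900 / bench_lines * n_lines   # 13 capture groups
secs_7  <- 128 / bench_lines * n_lines   # 7 capture groups

days_13 <- secs_13 / 86400   # about 10.1 days
hours_7 <- secs_7 / 3600     # about 34.6 hours

# Lines per split file for 6 cores, rounded up so no line is left over.
lines_per_file <- ceiling(n_lines / 6)   # 1621913, the value passed to split -l

# Ideal wall-clock time if the 6 workers scale perfectly.
hours_parallel <- hours_7 / 6            # about 5.8 hours
```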
r xml parsing bigdata tidyverse