Example data frame:
country <- c('Australia', 'Italy', 'Peru', 'China')
score <- c("0.091", "0.413,.", "-", "0.102,0.102,0.102,.,.,.,.,.,.,.,.")
country_scores <- data.frame(country, score)
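Each score entry can hold any number of comma-separated values, or "-" to indicate no data. I want to extract the maximum value from each string and test whether it meets a given threshold. I tried the solution from https://stackoverflow.com/a/65121200/8621123, but on my data frame of 1.3 million rows and 186 columns it is very slow (at least 8 minutes):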
library(tidyverse)
country_scores %>%
  # pull every number out of the string, then take the largest one per row
  mutate(scores = str_extract_all(score, '\\d+(\\.\\d+)?'),
         score_max = map_dbl(scores, ~max(as.numeric(.x))))
See whether this data.table approach works for your data:
library(data.table)
setDT(country_scores)
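# split on commas, drop tokens that start with "." or are exactly "-",
# then sort descending and keep the first (largest) value; NA if none remain
# (the \(x) lambda shorthand needs R >= 4.1)
country_scores[, max_score := sapply(strsplit(score, ","), \(x)
  sort(as.numeric(x[!grepl("^\\.|^-$", x)]), decreasing = TRUE)[1])]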
Output:
     country                             score max_score
      <char>                            <char>     <num>
1: Australia                             0.091     0.091
2:     Italy                           0.413,.     0.413
3:      Peru                                 -        NA
4:     China 0.102,0.102,0.102,.,.,.,.,.,.,.,.     0.102
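The question also asks for a threshold test on the extracted maximum, which the answer does not show. Below is a minimal base R sketch of that step. The helper name `max_in_string`, the `meets_threshold` column, and the cutoff of 0.1 are all assumptions made for illustration, not part of the original post. The sketch also parses every token with `as.numeric()` and discards the resulting `NA`s instead of filtering by regex, so a token such as ".5" would be read as 0.5 rather than dropped.

```r
# Hypothetical helper: largest numeric value in each comma-separated string.
# Returns NA when a string contains no numeric token at all.
max_in_string <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    # non-numeric tokens such as "." and "-" become NA; silence the coercion warning
    vals <- suppressWarnings(as.numeric(parts))
    vals <- vals[!is.na(vals)]
    if (length(vals) == 0) NA_real_ else max(vals)
  }, numeric(1))
}

country_scores <- data.frame(
  country = c("Australia", "Italy", "Peru", "China"),
  score = c("0.091", "0.413,.", "-", "0.102,0.102,0.102,.,.,.,.,.,.,.,."),
  stringsAsFactors = FALSE
)

threshold <- 0.1  # assumed cutoff, chosen only for this example

country_scores$max_score <- max_in_string(country_scores$score)
# rows with no data stay NA rather than being counted as failing the test
country_scores$meets_threshold <- country_scores$max_score >= threshold
```

With the sample data, Italy and China meet the assumed threshold, Australia does not, and Peru stays `NA` because it has no numeric value. Whether this is faster than the `sort()` version on 1.3 million rows would need to be measured on the real data.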