Suppose I have the following table:
library(data.table)
df <- data.table(col = c('{"geo":"USA","class":"A","score":"99"}'
, '{"geo":"Hawaii","class":"B","score":"83"}'
)
); df
col
<char>
1: {"geo":"USA","class":"A","score":"99"}
2: {"geo":"Hawaii","class":"B","score":"83"}
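I want to extract some of the fields into columns:
# fields to extract
x <- c('geo', 'class')
# extract
# (from_json() is not defined in this snippet; it is presumably jsonify::from_json())
df[, (x) := {
  # parse every row, then keep only the requested fields
  y = lapply(col, \(j) from_json(j)) |>
    lapply(`[`, x)
  a = sapply(y, `[`, x[1])
  b = sapply(y, `[`, x[2])
  .(a, b)
}]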
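The approach above works, but it has two problems:
First, the real data set has millions of rows with about 50 fields per row, and parsing 2 fields from 1e5 rows currently takes about 15 minutes.
Second, even after the job finishes, my computer stays sluggish. Ctrl + Alt + Del confirms that a large amount of memory is still in use, and calling gc() does not fix that either.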
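Part of the cost in the per-row version plausibly comes from calling the parser once per row and building an R list for each one. When the objects are flat and every value is a quoted string, as in the sample above, one option is to skip the parser and pull each field out with a vectorized regular expression. The following is only a sketch under those assumptions: the helper name extract_field is hypothetical, field names are assumed to contain no regex metacharacters, and nested objects, escaped quotes inside values, and non-string values are not handled.

```r
library(data.table)

df <- data.table(col = c('{"geo":"USA","class":"A","score":"99"}',
                         '{"geo":"Hawaii","class":"B","score":"83"}'))
x <- c("geo", "class")

# Hypothetical helper: value of one top-level string field for every row,
# NA where the field is absent. Assumes flat JSON with quoted string values.
extract_field <- function(json, field) {
  pattern <- sprintf('.*"%s"\\s*:\\s*"([^"]*)".*', field)
  out <- sub(pattern, "\\1", json, perl = TRUE)
  out[!grepl(pattern, json, perl = TRUE)] <- NA_character_
  out
}

# One vectorized pass over the whole column per requested field
df[, (x) := lapply(x, function(f) extract_field(col, f))]
df
```

Because no intermediate list of parsed rows is built, this should also keep memory use lower, though that has not been measured on the real data.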
Below is code to create a larger sample data set:
# Function to generate sample data
generate_sample_data <- \(n_rows = 10)
{
  # Options each field is sampled from
  geo_options <- c("USA", "Hawaii", "Canada", "Mexico", "UK")
  class_options <- c("A", "B", "C", "D")
  # Create the data table, building one JSON string per row
  df <- data.table(col = replicate(n_rows
    , paste0('{"geo":"', sample(geo_options, 1), '",'
      , '"class":"', sample(class_options, 1), '",'
      , '"score":"', sample(60:100, 1), '",'
      , '"extra_field":"', sample(LETTERS, 1), '",'
      , '"timestamp":"', as.character(Sys.time() + sample(0:100000, 1)), '"}'
    )
  )
  )
  return(df)
}
# Generate a data table with 1e6 rows and show the first three
# (use a small n_rows, e.g. 15, for a quick check)
df <- generate_sample_data(1e6); df[1:3]
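Since the generator above calls paste0() and sample() once per row, producing 1e6 rows can itself take a while. A vectorized variant, sketched below, draws all values for a field in one sample() call and assembles the strings with a single sprintf(). The function name generate_sample_data_vec is hypothetical; the field names and value ranges are copied from the generator above.

```r
library(data.table)

# Hypothetical vectorized variant of generate_sample_data()
generate_sample_data_vec <- function(n_rows = 10) {
  geo_options   <- c("USA", "Hawaii", "Canada", "Mexico", "UK")
  class_options <- c("A", "B", "C", "D")
  data.table(col = sprintf(
    '{"geo":"%s","class":"%s","score":"%d","extra_field":"%s","timestamp":"%s"}',
    sample(geo_options, n_rows, replace = TRUE),
    sample(class_options, n_rows, replace = TRUE),
    sample(60:100, n_rows, replace = TRUE),   # integer scores, hence %d
    sample(LETTERS, n_rows, replace = TRUE),
    as.character(Sys.time() + sample(0:100000, n_rows, replace = TRUE))
  ))
}

df_small <- generate_sample_data_vec(15)
df_small[1:3]
```

The rows have the same shape as those from the original generator, so either version can be used to benchmark an extraction approach.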
An attempt using the fromJSON function:
jsonlite::fromJSON(sprintf("[%s]", toString(df[[1]])))
geo class score extra_field timestamp
1 Canada C 63 R 2025-01-21 02:56:42.58203
2 UK A 71 Z 2025-01-21 12:15:20.582483
3 Hawaii D 79 K 2025-01-21 14:43:41.582883
4 Canada C 68 M 2025-01-21 01:40:20.583218
5 Mexico B 92 X 2025-01-21 05:26:35.583423
6 Hawaii D 71 R 2025-01-21 05:09:51.583672
7 Canada D 95 F 2025-01-21 10:36:04.583794
8 UK B 84 O 2025-01-21 13:11:29.583986
9 Mexico A 92 W 2025-01-21 15:21:13.58412
10 UK A 75 Z 2025-01-21 12:47:03.584297
11 Hawaii B 70 T 2025-01-20 13:31:34.584423
12 UK A 88 Q 2025-01-21 09:46:32.584641
13 Mexico B 64 K 2025-01-21 11:03:20.584838
14 Hawaii D 63 J 2025-01-21 13:53:28.585002
15 UK D 83 B 2025-01-21 04:03:55.585202
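To get from that parsed result back to the goal of adding only selected fields as columns, the data.frame returned by fromJSON can be subset by name and assigned by reference. This is a minimal sketch on the two-row example rather than a benchmarked solution; it assumes jsonlite is installed and that every row is a valid JSON object.

```r
library(data.table)
library(jsonlite)

df <- data.table(col = c('{"geo":"USA","class":"A","score":"99"}',
                         '{"geo":"Hawaii","class":"B","score":"83"}'))
x <- c("geo", "class")

# One parser call for the whole column: wrap all rows in a single JSON array
parsed <- fromJSON(sprintf("[%s]", paste(df$col, collapse = ",")))

# parsed is a data.frame with one column per field; keep only the requested ones
df[, (x) := as.list(parsed[x])]
df
```

For millions of rows with about 50 fields each, the single concatenated string may become very large, so running the same steps over chunks of rows may be necessary.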