Need to use jsonlite's stream_in() and stream_out() to process a list of ndjson messages

Question

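I have an ndjson data source. As a simple example, consider a text file with three lines, each containing a valid JSON message. I want to extract 7 variables from the messages and put them in a data frame.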

Please use the following sample data in a text file. You can paste this data into a text editor and save it as "ndjson_sample.txt":

{"ts":"1","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":12353,\"Var5\":1,\"Var6\":\"abc\",\"Var7\":\"x\"}"}
{"ts":"2","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-68,\"Var4\":4528,\"Var5\":1,\"Var6\":\"def\",\"Var7\":\"y\"}"}
{"ts":"3","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":-5409,\"Var5\":1,\"Var6\":\"ghi\",\"Var7\":\"z\"}"}

The following three lines of code accomplish what I want to do:

file1 <- "ndjson_sample.txt"
json_data1 <- ndjson::stream_in(file1)
raw_df_temp1 <- as.data.frame(ndjson::flatten(json_data1$ct))

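For reasons I won't go into, I cannot use the ndjson package. I must find a way to do the same thing with the jsonlite package, using its stream_in() and stream_out() functions. Here is what I tried: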

con_in1 <- file(file1, open = "rt")
con_out1 <- file(tmp <- tempfile(), open = "wt")
callback_func <- function(df){
  jsonlite::stream_out(df, con_out1, pagesize = 1)
}
jsonlite::stream_in(con_in1, handler = callback_func, pagesize = 1)
close(con_out1)
con_in2 <- file(tmp, open = "rt")
raw_df_temp2 <- jsonlite::stream_in(con_in2)

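This does not give me the same data frame as the final output. Can you tell me what I am doing wrong, and what I need to change so that raw_df_temp1 equals raw_df_temp2?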

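I could solve this by running fromJSON() on each line of the file, but I would like to find a way to use the stream functions. The files I will be dealing with are very large, so efficiency will be key. I need this to be as fast as possible.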

Thank you in advance.

Answer 1

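Currently, under ct you will find a string that can (subsequently) be fed to fromJSON on its own, but stream_in does not parse it as JSON. Ignoring your stream_out(stream_in(...), ...) test, here are a couple of ways to read it in: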

library(jsonlite)
json <- stream_in(file('ds_guy.ndjson'), simplifyDataFrame=FALSE)
# opening file input connection.
#  Imported 3 records. Simplifying...
# closing file input connection.
cbind(
  ts = sapply(json, `[[`, "ts"),
  do.call(rbind.data.frame, lapply(json, function(a) fromJSON(a$ct)))
)
#   ts Var1 Var2 Var3  Var4 Var5 Var6 Var7
# 1  1    6    6  -70 12353    1  abc    x
# 2  2    6    6  -68  4528    1  def    y
# 3  3    6    6  -70 -5409    1  ghi    z
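
To see why a second parse is needed, here is a minimal sketch using a shortened, made-up record in the same shape as the sample data (the variable names line, rec and inner are only illustrative). After the first fromJSON call, "ct" is still plain JSON text; only a second call turns it into values.

```r
library(jsonlite)

# A shortened, made-up record in the same shape as the sample data.
line <- '{"ts":"1","ct":"{\\"Var1\\":6,\\"Var6\\":\\"abc\\"}"}'

# First pass: parses the outer object only.
rec <- fromJSON(line)
print(is.character(rec$ct))  # "ct" is still JSON text, not a nested object

# Second pass: parses the embedded document held in "ct".
inner <- fromJSON(rec$ct)
str(inner)
```

This is also why piping stream_in straight into stream_out changes nothing: the records are written back with "ct" still stored as a string.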

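Calling fromJSON on each string can be cumbersome, and with larger data that slowdown is exactly why stream_in exists. So if we can capture the "ct" component into a stream of its own, then ...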

writeLines(sapply(json, `[[`, "ct"), 'ds_guy2.ndjson')

(Far more efficient ways to do this are possible with non-R tools, including a simple

sed -e 's/.*"ct":"\({.*}\)"}$/\1/' -e 's/\\"/"/g' ds_guy.ndjson > ds_guy2.ndjson

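although this makes some assumptions about the data that may not be entirely safe. A better solution is to use jq, which should "always" parse valid JSON correctly. Its -r (raw output) option prints the decoded string, so the escaped quotes need no separate sed step:

jq -r '.ct' ds_guy.ndjson > ds_guy2.ndjson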

You can do this from R with system(...) if needed.)
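
As a rough sketch of calling jq from R, the snippet below uses system2 (a close relative of system) on a Unix-like system. It assumes jq is installed and on the PATH, and it uses jq's -r (raw output) option so that the decoded "ct" string is written directly. The temporary file paths and the shortened record are illustrative only.

```r
# Recreate one shortened, made-up record in a temporary file.
infile  <- tempfile(fileext = ".ndjson")
outfile <- tempfile(fileext = ".ndjson")
writeLines('{"ts":"1","ct":"{\\"Var1\\":6,\\"Var6\\":\\"abc\\"}"}', infile)

if (nzchar(Sys.which("jq"))) {
  # -r prints the decoded string value of .ct instead of a quoted JSON string;
  # stdout = outfile redirects jq's output into that file.
  status <- system2("jq", c("-r", shQuote(".ct"), shQuote(infile)),
                    stdout = outfile)
  print(readLines(outfile))
} else {
  message("jq not found on PATH; skipping")
}
```

The resulting file holds one plain JSON object per line, so it can be passed to stream_in like any other ndjson file.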

From there, assuming each line contains exactly one row of data.frame data:

json2 <- stream_in(file('ds_guy2.ndjson'), simplifyDataFrame=TRUE)
# opening file input connection.
#  Imported 3 records. Simplifying...
# closing file input connection.
cbind(ts=sapply(json, `[[`, "ts"), json2)
#   ts Var1 Var2 Var3  Var4 Var5 Var6 Var7
# 1  1    6    6  -70 12353    1  abc    x
# 2  2    6    6  -68  4528    1  def    y
# 3  3    6    6  -70 -5409    1  ghi    z

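Note: in the first example, "ts" is a factor and the other strings are character, because that is what fromJSON gives. In the second example, all strings are factors. Depending on your needs, this is easily addressed through judicious use of stringsAsFactors=FALSE. (Since R 4.0.0, stringsAsFactors defaults to FALSE, so on current versions data.frame no longer converts strings to factors by default.)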
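
Since the question asks for a streaming-only approach, the same idea can also be sketched with a stream_in handler, without an intermediate file. This is one possible arrangement, not a benchmarked solution: the handler name parse_page, the pages list and the page size of 2 are illustrative assumptions. Each page's "ct" strings are fed back into stream_in through a textConnection, so the embedded JSON is parsed a page at a time rather than one fromJSON call per line.

```r
library(jsonlite)

# Recreate the three sample records in a temporary file (path is illustrative).
sample_file <- tempfile(fileext = ".ndjson")
writeLines(c(
  '{"ts":"1","ct":"{\\"Var1\\":6,\\"Var2\\":6,\\"Var3\\":-70,\\"Var4\\":12353,\\"Var5\\":1,\\"Var6\\":\\"abc\\",\\"Var7\\":\\"x\\"}"}',
  '{"ts":"2","ct":"{\\"Var1\\":6,\\"Var2\\":6,\\"Var3\\":-68,\\"Var4\\":4528,\\"Var5\\":1,\\"Var6\\":\\"def\\",\\"Var7\\":\\"y\\"}"}',
  '{"ts":"3","ct":"{\\"Var1\\":6,\\"Var2\\":6,\\"Var3\\":-70,\\"Var4\\":-5409,\\"Var5\\":1,\\"Var6\\":\\"ghi\\",\\"Var7\\":\\"z\\"}"}'
), sample_file)

pages <- list()  # collects one parsed data frame per page

# Hypothetical handler: treat the "ct" strings of a page as ndjson of their own.
parse_page <- function(df) {
  tc <- textConnection(df$ct)
  on.exit(close(tc))
  inner <- stream_in(tc, verbose = FALSE)
  pages[[length(pages) + 1L]] <<- data.frame(ts = df$ts, inner,
                                             stringsAsFactors = FALSE)
}

# pagesize = 2 only demonstrates paging on this tiny sample.
stream_in(file(sample_file), handler = parse_page, pagesize = 2, verbose = FALSE)

result <- do.call(rbind, pages)
print(result)
```

For large files a bigger pagesize is probably preferable, and the pages could be written out or aggregated inside the handler instead of being held in memory.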
