Converting nested XML to a data frame in R


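I am trying to convert XML data from clinicaltrials.gov into a data frame for analysis in R. I have a URL that lets me select the specific fields I want for each study. Each row, identified by its NCTId, is a separate study, and a study can be run at multiple facilities and locations. As a result, the facility and location columns of a single observation can hold multiple values.

My first attempt successfully converted the data into a data frame, but the result is hard to clean: sometimes hundreds of facility and location values for a single study are packed into one field with no delimiter. (I tried splitting on spaces, but too many facilities, cities, and countries, not only in the United States, have names of two or more words.)

Code: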

url1 <- "https://classic.clinicaltrials.gov/api/query/study_fields?expr=&fields=NCTId%2C+OrgFullName%2C+BriefTitle%2C+StartDate%2C+Condition%2C+Keyword%2C+InterventionName%2C+EnrollmentCount%2C+LocationFacility%2C+LocationCity%2C+LocationState%2C+LocationZip%2C+LocationCountry&min_rnk=1&max_rnk=1000&fmt=xml"
library(xml2) # read_xml()
library(XML)  # xmlParse(), getNodeSet(), xmlToDataFrame()

dat1 <- read_xml(url1)
dat_xml1 <- xmlParse(dat1)
df1 <- xmlToDataFrame(nodes = getNodeSet(dat_xml1, "//StudyFields"))

Result (an example of the problem):

glimpse(df1[13, 10:13])
Rows: 1
Columns: 4
$ city    <chr> "ScottsdaleJacksonvilleAlbert LeaMankatoRochesterEau ClaireLa Crosse"
$ state   <chr> "ArizonaFloridaMinnesotaMinnesotaMinnesotaWisconsinWisconsin"
$ zip     <chr> "8525932224-99805600756001559055470154601"
$ country <chr> "United StatesUnited StatesUnited StatesUnited StatesUnited StatesUnited StatesUnited States"
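
The run-together strings above look like what happens when the text of a parent node is read in one go: the text of every child value is concatenated with no separator. The sketch below reproduces that on a tiny, made-up fragment (the two city values and the `FieldValues`/`FieldValue` layout are assumptions for illustration, not the real response) and shows that reading the child nodes one by one keeps the values apart.

```r
library(xml2)

# Hypothetical miniature of one field that holds several values
field <- read_xml(
  "<FieldValues><FieldValue>Scottsdale</FieldValue><FieldValue>Jacksonville</FieldValue></FieldValues>"
)

# Text of the parent node: the child values are glued together
glued <- xml_text(field)

# Text of each child node: one element per value
values <- xml_text(xml_find_all(field, "./FieldValue"))

# If one string per study is still wanted, pick an explicit delimiter
joined <- paste(values, collapse = " | ")
```

A delimiter that cannot occur in a facility or city name makes the string safe to split again later, which splitting on spaces is not.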

After a few attempts at cleaning, I went back to fix the problem at import time and unnest the values listed in each column:

Code:

url1 <- "https://classic.clinicaltrials.gov/api/query/study_fields?expr=&fields=NCTId%2C+OrgFullName%2C+BriefTitle%2C+StartDate%2C+Condition%2C+Keyword%2C+InterventionName%2C+EnrollmentCount%2C+LocationFacility%2C+LocationCity%2C+LocationState%2C+LocationZip%2C+LocationCountry&min_rnk=1&max_rnk=1000&fmt=xml"
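library(xml2)      # read_xml(), as_list()
library(tidyverse) # as_tibble(), filter(), unnest_*(), type_convert()

dat1 <- as_list(read_xml(url1))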
xml_df <- as_tibble(dat1) %>%
  unnest_longer(StudyFieldsResponse)

dat_wide <- xml_df %>%
  filter(StudyFieldsResponse_id == "StudyFields") %>%
  unnest_wider(StudyFieldsResponse, names_repair = "unique")

dat_df <- dat_wide %>%
  unnest(cols = names(.)) %>%
  unnest(cols = names(.)) %>%
  type_convert()

In my dat_wide object I can see that the multiple values within a field are listed separately, so I feel this is the right path.

glimpse(dat_wide[206, 10:13])
Rows: 1
Columns: 4
$ city    <list> [["Kansas City"], ["Columbus"]]
$ state   <list> [["Kansas"], ["Ohio"]]
$ zip     <list> [["66160"], ["43210"]]
$ country <list> [["United States"], ["United States"]]

However, when I get to the last step, unnesting the values in each cell, I get an error:

dat_df <- dat_wide %>%
+   unnest(cols = names(.)) %>%
+   unnest(cols = names(.)) %>%
+   type_convert()
Error in `unnest()`:
! In row 1, can't recycle input of size 4 to size 5.
Run `rlang::last_trace()` to see where the error occurred.
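
One likely reading of this error: `unnest(cols = names(.))` unnests every list-column at once, and within a row all columns unnested together must have the same length (or length 1). A study with, say, three keywords and two locations cannot satisfy that. The sketch below reproduces the situation on a made-up one-row tibble (all names and values are hypothetical) and unnests only the columns that describe the same thing.

```r
library(tibble)
library(tidyr)

# Hypothetical study with 3 keywords but only 2 locations
df <- tibble(
  id      = "study-1",
  keyword = list(c("k1", "k2", "k3")),
  city    = list(c("Kansas City", "Columbus")),
  state   = list(c("Kansas", "Ohio"))
)

# Unnesting columns of unequal length together raises a recycling error
msg <- tryCatch(
  {
    unnest(df, c(keyword, city, state))
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Unnesting only the parallel location columns gives one row per location;
# keyword stays a list-column and is repeated on each row
by_location <- unnest(df, c(city, state))
```

This only helps when the location columns really are parallel. If one of them can be shorter than the others within a study (a location without a state, for example), the same error would come back for that row.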

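Ultimately, I want to unnest and pivot these values so that a study with multiple locations gets one row per location. Any help with this would be appreciated. Thank you!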

Tags: r, xml, api, xml-parsing

1 Answer

Here is one way to convert the XML file into a tidy tibble.

Note: I did not run into any problems with multiple responses. Maybe I am missing something.

url1 <- "https://classic.clinicaltrials.gov/api/query/study_fields?expr=&fields=NCTId%2C+OrgFullName%2C+BriefTitle%2C+StartDate%2C+Condition%2C+Keyword%2C+InterventionName%2C+EnrollmentCount%2C+LocationFacility%2C+LocationCity%2C+LocationState%2C+LocationZip%2C+LocationCountry&min_rnk=1&max_rnk=1000&fmt=xml"

library(xml2)
library(tidyr)
library(purrr)

dat1 <- read_xml(url1)

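# Convert the document to a nested R list and drop the root level
dat2 <- dat1 |>
  as_list() |>
  pluck("StudyFieldsResponse")

# Column names, in the order the fields appear in the response
field_names <- vapply(dat2$FieldList, `[[`, 1,
  FUN.VALUE = character(1), USE.NAMES = FALSE
)

# Build a one-row tibble per study, then stack the rows
dat3 <- lapply(
  dat2$StudyFieldsList,
  \(x) {
    lapply(
      x,
      \(x) {
        # Keep the first FieldValue of each field; NA if the field is empty
        field_value <- x$FieldValue[[1]]
        if (is.null(field_value)) {
          return(NA_character_)
        } else {
          return(field_value)
        }
      }
    ) |>
      setNames(field_names) |>
      as_tibble()
  }
) |>
  list_rbind()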

head(dat3)
#> # A tibble: 6 × 13
#>   NCTId      OrgFullName BriefTitle StartDate Condition Keyword InterventionName
#>   <chr>      <chr>       <chr>      <chr>     <chr>     <chr>   <chr>           
#> 1 NCT062002… Toros Univ… Effect of… December… Healthy … Carob … Germ and Carob …
#> 2 NCT062002… Thompson C… Reduction… January … Prostate… <NA>    Spot Delete pla…
#> 3 NCT062002… St. Olavs … Cardio-va… October … Heart Fa… <NA>    <NA>            
#> 4 NCT062002… Samsung Me… Rivoceran… March 1,… Thymic E… <NA>    Rivoceranib     
#> 5 NCT062002… Poitiers U… Patients … December… Waldenst… <NA>    Venetoclax      
#> 6 NCT062002… Novo Nordi… A Researc… April 1,… Heart Fa… <NA>    Ziltivekimab    
#> # ℹ 6 more variables: EnrollmentCount <chr>, LocationFacility <chr>,
#> #   LocationCity <chr>, LocationState <chr>, LocationZip <chr>,
#> #   LocationCountry <chr>
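
The code above takes `x$FieldValue[[1]]`, so each field ends up with its first value only, which may be why no multiple-response problem showed up. A possible variant that keeps every value and then produces one row per location is sketched below. It runs on a small made-up document; the element names mirror the ones used above (`FieldList`, `StudyFieldsList`, `FieldValue`), but the exact layout, the sample values, and the assumption that a study's location fields all have the same number of values are mine, not something the real response guarantees.

```r
library(xml2)
library(tibble)
library(tidyr)
library(purrr)

# Hypothetical miniature of the response; the real one would come from read_xml(url1)
sample_xml <- "<StudyFieldsResponse>
  <FieldList>
    <Field>NCTId</Field>
    <Field>Keyword</Field>
    <Field>LocationCity</Field>
    <Field>LocationState</Field>
  </FieldList>
  <StudyFieldsList>
    <StudyFields>
      <FieldValues><FieldValue>NCT00000001</FieldValue></FieldValues>
      <FieldValues><FieldValue>k1</FieldValue><FieldValue>k2</FieldValue><FieldValue>k3</FieldValue></FieldValues>
      <FieldValues><FieldValue>Kansas City</FieldValue><FieldValue>Columbus</FieldValue></FieldValues>
      <FieldValues><FieldValue>Kansas</FieldValue><FieldValue>Ohio</FieldValue></FieldValues>
    </StudyFields>
    <StudyFields>
      <FieldValues><FieldValue>NCT00000002</FieldValue></FieldValues>
      <FieldValues/>
      <FieldValues/>
      <FieldValues/>
    </StudyFields>
  </StudyFieldsList>
</StudyFieldsResponse>"

doc <- read_xml(sample_xml)
field_names <- xml_text(xml_find_all(doc, "/StudyFieldsResponse/FieldList/Field"))
studies <- xml_find_all(doc, "/StudyFieldsResponse/StudyFieldsList/StudyFields")

# One row per study; every field is a list-column holding all of its values
nested <- lapply(studies, \(study) {
  values <- lapply(xml_children(study), \(f) xml_text(xml_children(f)))
  as_tibble(lapply(setNames(values, field_names), list))
}) |>
  list_rbind()

# One row per location: unnest the single-valued id first, then the
# location columns together; keep_empty keeps studies with no location
by_location <- nested |>
  unnest(NCTId) |>
  unnest(c(LocationCity, LocationState), keep_empty = TRUE)
```

Keyword is left as a list-column on purpose, since its length is unrelated to the number of locations. Where the location fields of a study differ in length, unnesting them together would fail, and collapsing each field into a delimited string may be the safer fallback.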