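I am trying to convert XML data from clinicaltrials.gov into a data frame for analysis in R. I have a URL that lets me select the specific fields I want for each study. Each row, identified by its NCTId, is a single study, and a study can be conducted at multiple facilities and locations. As a result, the facility and location columns of a single observation hold multiple values.
My first attempt did get the data into a data frame, but it is hard to clean: sometimes hundreds of facility and location values for a single study are packed into one field with no delimiter. I tried splitting on spaces, but too many facility, city, and country names (not only in the US) consist of two or more words separated by a space.
Code: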
url1 <- "https://classic.clinicaltrials.gov/api/query/study_fields?expr=&fields=NCTId%2C+OrgFullName%2C+BriefTitle%2C+StartDate%2C+Condition%2C+Keyword%2C+InterventionName%2C+EnrollmentCount%2C+LocationFacility%2C+LocationCity%2C+LocationState%2C+LocationZip%2C+LocationCountry&min_rnk=1&max_rnk=1000&fmt=xml"
dat1 <- read_xml(url1)
dat_xml1 <- xmlParse(dat1)
df1 <- xmlToDataFrame(nodes = getNodeSet(dat_xml1, "//StudyFields"))
Result (an example of the problem):
glimpse(df1[13, 10:13])
Rows: 1
Columns: 4
$ city <chr> "ScottsdaleJacksonvilleAlbert LeaMankatoRochesterEau ClaireLa Crosse"
$ state <chr> "ArizonaFloridaMinnesotaMinnesotaMinnesotaWisconsinWisconsin"
$ zip <chr> "8525932224-99805600756001559055470154601"
$ country <chr> "United StatesUnited StatesUnited StatesUnited StatesUnited StatesUnited StatesUnited States"
After a few cleaning attempts, I went back to fix the problem at import time and unnest the values listed in each column:
Code:
url1 <- "https://classic.clinicaltrials.gov/api/query/study_fields?expr=&fields=NCTId%2C+OrgFullName%2C+BriefTitle%2C+StartDate%2C+Condition%2C+Keyword%2C+InterventionName%2C+EnrollmentCount%2C+LocationFacility%2C+LocationCity%2C+LocationState%2C+LocationZip%2C+LocationCountry&min_rnk=1&max_rnk=1000&fmt=xml"
dat1 <- as_list(read_xml(url1))
xml_df <- as_tibble(dat1) %>%
  unnest_longer(StudyFieldsResponse)
dat_wide <- xml_df %>%
  filter(StudyFieldsResponse_id == "StudyFields") %>%
  unnest_wider(StudyFieldsResponse, names_repair = "unique")
dat_df <- dat_wide %>%
  unnest(cols = names(.)) %>%
  unnest(cols = names(.)) %>%
  type_convert()
In my dat_wide object I can see that multiple responses within a field are listed separately, so I think this is the right path.
glimpse(dat_wide[206, 10:13])
Rows: 1
Columns: 4
$ city <list> [["Kansas City"], ["Columbus"]]
$ state <list> [["Kansas"], ["Ohio"]]
$ zip <list> [["66160"], ["43210"]]
$ country <list> [["United States"], ["United States"]]
However, when I get to the final step of unnesting the values in each cell, I get an error:
dat_df <- dat_wide %>%
+ unnest(cols = names(.)) %>%
+ unnest(cols = names(.)) %>%
+ type_convert()
Error in `unnest()`:
! In row 1, can't recycle input of size 4 to size 5.
Run `rlang::last_trace()` to see where the error occurred.
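Ultimately, I want to unnest and pivot these values so that a study with multiple locations gets as many rows as it has locations. Any help with this problem is appreciated, thank you!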
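For reference, the recycling error can be reproduced on a small made-up tibble. This is only a sketch: the column names and values are hypothetical, and which columns of the real data have sizes 4 and 5 is not known from the output above. It illustrates that unnesting several list-columns in one call expects the values in a row to have matching lengths, whereas unnesting them one after another produces every combination instead.

```r
library(tidyr)

# Hypothetical one-study example: two list-columns with different lengths
toy <- tibble(
  NCTId     = "NCT00000000",                        # placeholder id
  Condition = list(c("a", "b", "c", "d")),          # 4 values
  Keyword   = list(c("k1", "k2", "k3", "k4", "k5")) # 5 values
)

# Unnesting both columns in one call pairs values by position,
# so lengths 4 and 5 cannot be combined and an error is raised
together <- tryCatch(
  unnest(toy, c(Condition, Keyword)),
  error = identity
)

# Unnesting one column at a time crosses the values instead (4 * 5 rows)
crossed <- toy |>
  unnest(Condition) |>
  unnest(Keyword)
```

Crossing is rarely what is wanted for location fields, because a city has to stay paired with its own state, zip, and country.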
Here is one way to convert the XML file into a tidy tibble.
Note: I did not run into any problems with multiple responses. Maybe I missed something.
url1 <- "https://classic.clinicaltrials.gov/api/query/study_fields?expr=&fields=NCTId%2C+OrgFullName%2C+BriefTitle%2C+StartDate%2C+Condition%2C+Keyword%2C+InterventionName%2C+EnrollmentCount%2C+LocationFacility%2C+LocationCity%2C+LocationState%2C+LocationZip%2C+LocationCountry&min_rnk=1&max_rnk=1000&fmt=xml"
library(xml2)
library(tidyr)
library(purrr)
dat1 <- read_xml(url1)
dat2 <- dat1 |>
  as_list() |>
  pluck("StudyFieldsResponse")
field_names <- vapply(dat2$FieldList, `[[`, 1,
  FUN.VALUE = character(1), USE.NAMES = FALSE
)
dat3 <- lapply(
  dat2$StudyFieldsList,
  \(x) {
    lapply(
      x,
      \(x) {
        field_value <- x$FieldValue[[1]]
        if (is.null(field_value)) {
          return(NA_character_)
        } else {
          return(field_value)
        }
      }
    ) |>
      setNames(field_names) |>
      as_tibble()
  }
) |>
  list_rbind()
head(dat3)
#> # A tibble: 6 × 13
#> NCTId OrgFullName BriefTitle StartDate Condition Keyword InterventionName
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 NCT062002… Toros Univ… Effect of… December… Healthy … Carob … Germ and Carob …
#> 2 NCT062002… Thompson C… Reduction… January … Prostate… <NA> Spot Delete pla…
#> 3 NCT062002… St. Olavs … Cardio-va… October … Heart Fa… <NA> <NA>
#> 4 NCT062002… Samsung Me… Rivoceran… March 1,… Thymic E… <NA> Rivoceranib
#> 5 NCT062002… Poitiers U… Patients … December… Waldenst… <NA> Venetoclax
#> 6 NCT062002… Novo Nordi… A Researc… April 1,… Heart Fa… <NA> Ziltivekimab
#> # ℹ 6 more variables: EnrollmentCount <chr>, LocationFacility <chr>,
#> # LocationCity <chr>, LocationState <chr>, LocationZip <chr>,
#> # LocationCountry <chr>
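To get one row per location, each field has to keep all of its values; `x$FieldValue[[1]]` in the code above appears to keep only the first value of each field, which may be why no multiple responses showed up. The following is a minimal sketch, not a tested solution for the live API. It runs on a small made-up XML string whose element names are taken from the code above (the real response may differ), the NCT numbers are placeholders, and `xml_txt`, `nested`, and `by_location` are names chosen for illustration.

```r
library(xml2)
library(tidyr)
library(purrr)

# Made-up miniature response; the structure is an assumption
xml_txt <- '<StudyFieldsResponse>
  <FieldList>
    <Field>NCTId</Field>
    <Field>LocationCity</Field>
    <Field>LocationState</Field>
  </FieldList>
  <StudyFieldsList>
    <StudyFields Rank="1">
      <FieldValues><FieldValue>NCT00000001</FieldValue></FieldValues>
      <FieldValues><FieldValue>Kansas City</FieldValue><FieldValue>Columbus</FieldValue></FieldValues>
      <FieldValues><FieldValue>Kansas</FieldValue><FieldValue>Ohio</FieldValue></FieldValues>
    </StudyFields>
    <StudyFields Rank="2">
      <FieldValues><FieldValue>NCT00000002</FieldValue></FieldValues>
      <FieldValues/>
      <FieldValues/>
    </StudyFields>
  </StudyFieldsList>
</StudyFieldsResponse>'

doc <- read_xml(xml_txt)
field_names <- xml_text(xml_find_all(doc, "/StudyFieldsResponse/FieldList/Field"))

# One row per study; every field is a list-column holding all of its values
nested <- lapply(
  xml_find_all(doc, "/StudyFieldsResponse/StudyFieldsList/StudyFields"),
  \(study) {
    values <- lapply(xml_children(study), \(fv) {
      v <- xml_text(xml_children(fv))
      if (length(v) == 0) NA_character_ else v # empty field becomes NA
    })
    # Wrap each vector in list() so it fills a single list-column cell;
    # fields are matched to names by position, as in the code above
    as_tibble(lapply(setNames(values, field_names), list))
  }
) |>
  list_rbind()

# Flatten the single-valued field, then unnest the location fields together
# so that each city stays paired with its state
by_location <- nested |>
  unnest(NCTId) |>
  unnest(c(LocationCity, LocationState))
```

If the location fields of a study have different numbers of values (for example, a state that is missing for some sites), the last `unnest()` fails with the same kind of recycling error as in the question, and the values can then no longer be paired by position alone.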