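I'm having trouble with unnest_wider (from tidyr). I have a nested XML document that I'm trying to convert to a data frame/tibble. I'm following the workflow described here, which suggests converting the XML node set to an R list.
My XML is in OAI/Dublin Core format and contains several elements with the same name (e.g. "subject.other"). Simplified, my doc.xml looks like this: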
<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
<record>
<header>
<identifier>id_01</identifier>
<datestamp>2024-05</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Fake Editions</dc:publisher>
<dc:subject.other>Great subject n°1</dc:subject.other>
<dc:subject.other>Great subject n°2</dc:subject.other>
<dc:subject.other>Great subject n°3</dc:subject.other>
<dc:subject.other>Great subject n°4</dc:subject.other>
<dc:subject.other>Great subject n°5</dc:subject.other>
<dc:subject.other>Great subject n°6</dc:subject.other>
<dc:title>Random title</dc:title>
</oai_dc:dc>
</metadata>
</record>
</ListRecords>
What I tried
The code I ran is as follows:
library(xml2)   # read_xml()
library(tidyr)  # unnest_wider() and the %>% pipe

# doc.xml is turned into a list
doc_list <- xmlconvert::xml_to_list(read_xml("doc.xml"))
# the list becomes a tibble
df <- tibble::enframe(doc_list)
# unnesting the column "value", where we find the listed elements contained in <header> and <metadata> in the XML
final_df <- df %>%
  unnest_wider(value, names_repair = "universal")
Expectation...
I expected my final_df to end up looking like this:
structure(list(
identifier = "id_01",
publisher = "Fake Editions",
subject.other_1 = "Great subject n°1",
subject.other_2 = "Great subject n°2",
subject.other_3 = "Great subject n°3"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
...reality
But what I get is:
structure(list(
identifier = "id_01",
publisher = "Fake Editions",
subject.other_1 = "Great subject n°1",
subject.other_2 = "Great subject n°1",
subject.other_3 = "Great subject n°1"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
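As you can see, the actual values of the different "subject.other" elements are dropped and replaced with the value of the first one ("Great subject n°1"). I tried changing the names_repair option, but it made no difference.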
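Can you find any way to make this work? I've tried everything I can think of to get this XML into a data frame/tibble, and I'm losing hope! Thanks a lot!
(I can provide more code/details; sorry, I'm not used to asking questions on Stack Overflow.)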
See if it works with xml2::as_list:
Toy data:
# packages used below: read_xml()/as_list(), %>%/glimpse(), as_tibble_row()
library(xml2)
library(dplyr)
library(tibble)
doc_list <- read_xml("<?xml version=\"1.0\" encoding=\"utf-8\"?><ListRecords><record><header><identifier>id_01</identifier><datestamp>2024-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°1</dc:subject.other><dc:subject.other>Great subject n°2</dc:subject.other><dc:subject.other>Great subject n°3</dc:subject.other><dc:subject.other>Great subject n°4</dc:subject.other><dc:subject.other>Great subject n°5</dc:subject.other><dc:subject.other>Great subject n°6</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record></ListRecords>")
Code:
doc_df <- doc_list %>%
  xml2::as_list() %>%
  unlist() %>%
  as_tibble_row(.name_repair = "unique")
Output:
> glimpse(doc_df)
Rows: 1
Columns: 10
$ ListRecords.record.header.identifier <chr> "id_01"
$ ListRecords.record.header.datestamp <chr> "2024-05"
$ ListRecords.record.metadata.dc.publisher <chr> "Fake Editions"
$ ListRecords.record.metadata.dc.subject.other...4 <chr> "Great subject n°1"
$ ListRecords.record.metadata.dc.subject.other...5 <chr> "Great subject n°2"
$ ListRecords.record.metadata.dc.subject.other...6 <chr> "Great subject n°3"
$ ListRecords.record.metadata.dc.subject.other...7 <chr> "Great subject n°4"
$ ListRecords.record.metadata.dc.subject.other...8 <chr> "Great subject n°5"
$ ListRecords.record.metadata.dc.subject.other...9 <chr> "Great subject n°6"
$ ListRecords.record.metadata.dc.title <chr> "Random title"
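If you want column names closer to the ones in the question (subject.other_1, subject.other_2, ...) instead of the full dotted paths, one possible variation is to collect the leaf elements directly with xml2 and number the repeated names yourself. This is only a sketch under a few assumptions: it uses an abbreviated copy of the toy record (three subjects instead of six), it handles a single record, and the object names (leaves, nm, val, wide_df) are made up for illustration.

```r
library(xml2)
library(tibble)

# Abbreviated copy of the toy record (assumption: three subjects only),
# assembled without whitespace between tags.
doc <- read_xml(paste0(
  '<ListRecords><record>',
  '<header><identifier>id_01</identifier></header>',
  '<metadata>',
  '<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" ',
  'xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">',
  '<dc:publisher>Fake Editions</dc:publisher>',
  '<dc:subject.other>Great subject n°1</dc:subject.other>',
  '<dc:subject.other>Great subject n°2</dc:subject.other>',
  '<dc:subject.other>Great subject n°3</dc:subject.other>',
  '<dc:title>Random title</dc:title>',
  '</oai_dc:dc></metadata></record></ListRecords>'
))

# Leaf elements are the ones that have no child elements
leaves <- xml_find_all(doc, "//*[not(*)]")
nm  <- xml_name(leaves)  # element names without the namespace prefix
val <- xml_text(leaves)  # text content of each leaf

# Add a running index only to names that occur more than once
is_dup <- nm %in% nm[duplicated(nm)]
idx <- ave(seq_along(nm), nm, FUN = seq_along)
nm[is_dup] <- paste0(nm[is_dup], "_", idx[is_dup])

# One-row tibble with one column per leaf element
wide_df <- as_tibble_row(setNames(val, nm))
names(wide_df)
```

With several records in the same file, the same steps would have to be applied to each record node separately and the resulting rows bound together; the sketch above does not cover that case.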