Unnesting same-named list elements from a tibble: data is overwritten


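I'm having trouble with unnest_wider (from tidyr). I have a nested XML document that I'm trying to convert into a data frame/tibble. I followed the workflow described here, which suggests converting the XML node set into an R list.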

My XML is in OAI/Dublin Core format and contains several elements with the same name (e.g. "subject.other"). Simplified, my doc.xml looks like this:

<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
  <record>
    <header>
      <identifier>id_01</identifier>
      <datestamp>2024-05</datestamp>
    </header>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:publisher>Fake Editions</dc:publisher>
        <dc:subject.other>Great subject n°1</dc:subject.other>
        <dc:subject.other>Great subject n°2</dc:subject.other>
        <dc:subject.other>Great subject n°3</dc:subject.other>
        <dc:subject.other>Great subject n°4</dc:subject.other>
        <dc:subject.other>Great subject n°5</dc:subject.other>
        <dc:subject.other>Great subject n°6</dc:subject.other>
        <dc:title>Random title</dc:title>
      </oai_dc:dc>
    </metadata>
  </record>
</ListRecords>

What I tried

The code I ran is as follows:

library(xml2)
library(tidyr)
library(magrittr)

# doc.xml is turned into a list
doc_list <- xmlconvert::xml_to_list(read_xml("doc.xml"))

# the list becomes a tibble
df <- tibble::enframe(doc_list)

# unnest the column "value", which holds the listed elements found in <header> and <metadata> in the XML
final_df <- df %>%
  unnest_wider(value, names_repair = "universal")

Expectation...

I expected my final_df to end up looking like this:

structure(list(
  identifier = "id_01",
  publisher = "Fake Editions", 
  subject.other_1 = "Great subject n°1",
  subject.other_2 = "Great subject n°2", 
  subject.other_3 = "Great subject n°3"), 
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

...reality

But what I get is:

structure(list(
  identifier = "id_01",
  publisher = "Fake Editions", 
  subject.other_1 = "Great subject n°1",
  subject.other_2 = "Great subject n°1", 
  subject.other_3 = "Great subject n°1"), 
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

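As you can see, the actual values of the different "subject.other" elements are lost and replaced with the value of the first one ("Great subject n°1"). I tried changing the names_repair option, but it made no difference.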

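Can you see any way to make this work? I have tried everything I could think of to get this XML into a data frame/tibble, and I'm losing hope! Thanks a lot!

(I can provide more code/details. Sorry, I'm not used to asking questions on Stack Overflow.)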

r xml unnest oai
1 Answer

See whether it works with xml2::as_list.

Toy data:

doc_list <- read_xml("<?xml version=\"1.0\" encoding=\"utf-8\"?><ListRecords><record><header><identifier>id_01</identifier><datestamp>2024-05</datestamp></header><metadata><oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd\"><dc:publisher>Fake Editions</dc:publisher><dc:subject.other>Great subject n°1</dc:subject.other><dc:subject.other>Great subject n°2</dc:subject.other><dc:subject.other>Great subject n°3</dc:subject.other><dc:subject.other>Great subject n°4</dc:subject.other><dc:subject.other>Great subject n°5</dc:subject.other><dc:subject.other>Great subject n°6</dc:subject.other><dc:title>Random title</dc:title></oai_dc:dc></metadata></record></ListRecords>")

Code:

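# convert the XML to a nested list, flatten it into a named character vector,
# then build a one-row tibble; repeated names are made unique with position suffixes
doc_df <- doc_list %>%
  xml2::as_list() %>%
  unlist() %>%
  as_tibble_row(.name_repair = "unique")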

Output:

> glimpse(doc_df)
Rows: 1
Columns: 10
$ ListRecords.record.header.identifier             <chr> "id_01"
$ ListRecords.record.header.datestamp              <chr> "2024-05"
$ ListRecords.record.metadata.dc.publisher         <chr> "Fake Editions"
$ ListRecords.record.metadata.dc.subject.other...4 <chr> "Great subject n°1"
$ ListRecords.record.metadata.dc.subject.other...5 <chr> "Great subject n°2"
$ ListRecords.record.metadata.dc.subject.other...6 <chr> "Great subject n°3"
$ ListRecords.record.metadata.dc.subject.other...7 <chr> "Great subject n°4"
$ ListRecords.record.metadata.dc.subject.other...8 <chr> "Great subject n°5"
$ ListRecords.record.metadata.dc.subject.other...9 <chr> "Great subject n°6"
$ ListRecords.record.metadata.dc.title             <chr> "Random title"
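
If you want the shorter column names from the question (subject.other_1, subject.other_2, ...), one option is to rename the flattened vector before building the row. The following is only a sketch in base R: the vector flat is typed out by hand as a stand-in for the result of unlist(xml2::as_list(...)), and the names prefix, flat, short, is_dup, idx and result are made up for this illustration. It also assumes that every path starts with ListRecords.record.header. or ListRecords.record.metadata.dc., as in the output above.

```r
# Stand-in for unlist(xml2::as_list(doc)): a named character vector with
# repeated names, typed out by hand so the sketch runs without any packages.
prefix <- "ListRecords.record.metadata.dc."
flat <- c(
  "id_01", "2024-05", "Fake Editions",
  paste0("Great subject n°", 1:6),
  "Random title"
)
names(flat) <- c(
  "ListRecords.record.header.identifier",
  "ListRecords.record.header.datestamp",
  paste0(prefix, "publisher"),
  rep(paste0(prefix, "subject.other"), 6),
  paste0(prefix, "title")
)

# Drop the path prefix so only the element name remains.
short <- sub("^ListRecords\\.record\\.(header|metadata\\.dc)\\.", "", names(flat))

# Number only the names that occur more than once, in document order.
is_dup <- short %in% short[duplicated(short)]
idx <- ave(seq_along(short), short, FUN = seq_along)
short[is_dup] <- paste0(short[is_dup], "_", idx[is_dup])

# Build a one-row data frame from the renamed vector.
names(flat) <- short
result <- as.data.frame(as.list(flat), stringsAsFactors = FALSE)
print(names(result))
```

Because the names are already unique at that point, passing the renamed vector to as_tibble_row() instead of as.data.frame() should give the same columns as a tibble. With more than one record you would need to apply this per record and bind the rows, which this sketch does not cover.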