How do I extract data/nodes from an .xml document stored as a string?

Problem description

I have an .xml document stored as a string, xmlString, as shown below:

<entry xmlns="http://www.w3.org/2004/tom">
  <id>urn:contentItem:7WBG-8H88-Y898-B277-00000-00-1</id>
  <title>Dinn-Pixie Stares, Inc</title>
  <published>2015-12-24T00:00:00Z</published>
  <updated>2023-10-24T18:42:17Z</updated>
  <author>
    <name>AlphaNext</name>
  </author>
  <content type="application/xml">
    <baseRelatedDoc xmlns="" xmlns:xsi="http://www.w3.org/2012/XMLSchema" xsi:noNamespaceSchemaLocation="http://www.alphanext.com/xmlscemas/content/public/caseddoc/1/" documentType="socket">
      <baseRelatedDocHead>
        <baseInfo>
          <portInfo>
            <identifier idType="portIdentifier">100027579</identifier>
            <portName>United States Port, Minnesota Middle</portName>
            <jurisdiction>
              <jurisSystem/>
            </jurisdiction>
          </portInfo>
          <date dateType="filed" year="2015" month="02" day="21">2015-02-21</date>
          <classification classificationScheme="baseType">
            <classificationItem>
              <classCode>BK</classCode>
              <className>Tankruptcy</className>
            </classificationItem>
          </classification>
          <classification classificationScheme="baseNos">
            <classificationItem>
              <classCode>0</classCode>
              <className>UNKNOWN</className>
            </classificationItem>
          </classification>
        </baseInfo>
        <baseSupplement>
          <label>US Tankruptcy Port Socket</label>
          <date dateType="updated" year="2024" month="09" day="13">2024-07-11T15:08:26.450</date>
          <status>Unknown</status>
        </baseSupplement>
        <baseName>Dinn-Pixie Stares, Inc</baseName>
      </baseRelatedDocHead>
      <baseRelatedDocBody>
        <socket/>
      </baseRelatedDocBody>
      <metadata>
        <dc:metadata xmlns:dc="http://purrel.org/dc/element/1.2/">
          <dc:source sourceScheme="productContentSetIdentifier">343392</dc:source>
          <dc:creator>US Tankruptcy Port for the Last Town of Minnesota</dc:creator>
          <dc:identifier identifierScheme="PGIID">urn:contentItem:8WBG-8H70-Y892-B237-00000-00</dc:identifier>
          <dc:date dateType="last-updated">2024-07-11</dc:date>
        </dc:metadata>
      </metadata>
    </baseRelatedDoc>
  </content>
</entry>

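I need to extract the values of fields such as baseName, portName, title, classCode, and dc:creator.

However, when I try to extract them with

y = tree.findall('baseName')

where

tree = ET.fromstring(xmlString)

y is an empty list. I get the same empty list when I try nodes such as portName and dc:creator. How can I extract the values of these nodes/fields?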

python xml xml-parsing elementtree findall
1 answer

You can try:

import xml.etree.ElementTree as ET

xml_doc = """\
<entry xmlns="http://www.w3.org/2004/tom">
  <id>urn:contentItem:7WBG-8H88-Y898-B277-00000-00-1</id>
  <title>Dinn-Pixie Stares, Inc</title>
  <published>2015-12-24T00:00:00Z</published>
  <updated>2023-10-24T18:42:17Z</updated>
  <author>
    <name>AlphaNext</name>
  </author>
  <content type="application/xml">
    <baseRelatedDoc xmlns="" xmlns:xsi="http://www.w3.org/2012/XMLSchema" xsi:noNamespaceSchemaLocation="http://www.alphanext.com/xmlscemas/content/public/caseddoc/1/" documentType="socket">
      <baseRelatedDocHead>
        <baseInfo>
          <portInfo>
            <identifier idType="portIdentifier">100027579</identifier>
            <portName>United States Port, Minnesota Middle</portName>
            <jurisdiction>
              <jurisSystem/>
            </jurisdiction>
          </portInfo>
          <date dateType="filed" year="2015" month="02" day="21">2015-02-21</date>
          <classification classificationScheme="baseType">
            <classificationItem>
              <classCode>BK</classCode>
              <className>Tankruptcy</className>
            </classificationItem>
          </classification>
          <classification classificationScheme="baseNos">
            <classificationItem>
              <classCode>0</classCode>
              <className>UNKNOWN</className>
            </classificationItem>
          </classification>
        </baseInfo>
        <baseSupplement>
          <label>US Tankruptcy Port Socket</label>
          <date dateType="updated" year="2024" month="09" day="13">2024-07-11T15:08:26.450</date>
          <status>Unknown</status>
        </baseSupplement>
        <baseName>Dinn-Pixie Stares, Inc</baseName>
      </baseRelatedDocHead>
      <baseRelatedDocBody>
        <socket/>
      </baseRelatedDocBody>
      <metadata>
        <dc:metadata xmlns:dc="http://purrel.org/dc/element/1.2/">
          <dc:source sourceScheme="productContentSetIdentifier">343392</dc:source>
          <dc:creator>US Tankruptcy Port for the Last Town of Minnesota</dc:creator>
          <dc:identifier identifierScheme="PGIID">urn:contentItem:8WBG-8H70-Y892-B237-00000-00</dc:identifier>
          <dc:date dateType="last-updated">2024-07-11</dc:date>
        </dc:metadata>
      </metadata>
    </baseRelatedDoc>
  </content>
</entry>"""


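root = ET.fromstring(xml_doc)

# Map prefixes to the namespace URIs declared in the document.
ns = {"tom": "http://www.w3.org/2004/tom", "dc": "http://purrel.org/dc/element/1.2/"}

# <baseRelatedDoc xmlns=""> resets the default namespace, so its unprefixed
# descendants have no namespace; ".//" searches at any depth below root.
base_name = root.find(".//baseName").text
port_name = root.find(".//portName").text
# <title> inherits the default namespace of <entry>, so it needs the "tom" prefix.
title = root.find(".//tom:title", ns).text
class_codes = [e.text for e in root.findall(".//classCode")]
dc_creator = root.find(".//dc:creator", ns).text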

print(f"{base_name=}")
print(f"{port_name=}")
print(f"{title=}")
print(f"{class_codes=}")
print(f"{dc_creator=}")

Prints:

base_name='Dinn-Pixie Stares, Inc'
port_name='United States Port, Minnesota Middle'
title='Dinn-Pixie Stares, Inc'
class_codes=['BK', '0']
dc_creator='US Tankruptcy Port for the Last Town of Minnesota'
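
A note on why the original call returned an empty list: a bare tag name such as findall('baseName') only looks at direct children of the element it is called on, and an element that sits in a namespace (like title here) only matches when that namespace is supplied. Below is a minimal sketch of both effects on a trimmed-down sample. SAMPLE, NS, and the first_text helper are illustrative names that do not appear in the post above, and the {*} wildcard assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical stand-in for the document in the question.
SAMPLE = """\
<entry xmlns="http://www.w3.org/2004/tom">
  <title>Dinn-Pixie Stares, Inc</title>
  <content type="application/xml">
    <baseRelatedDoc xmlns="">
      <baseName>Dinn-Pixie Stares, Inc</baseName>
      <metadata>
        <dc:metadata xmlns:dc="http://purrel.org/dc/element/1.2/">
          <dc:creator>US Tankruptcy Port for the Last Town of Minnesota</dc:creator>
        </dc:metadata>
      </metadata>
    </baseRelatedDoc>
  </content>
</entry>"""

# Prefix-to-URI mapping; the prefixes are local names chosen for the queries.
NS = {"tom": "http://www.w3.org/2004/tom", "dc": "http://purrel.org/dc/element/1.2/"}

root = ET.fromstring(SAMPLE)

# A bare tag name only matches direct children of <entry>: nothing is found.
direct_children = root.findall("baseName")
# ".//" searches all descendants, so the nested <baseName> is found.
descendants = root.findall(".//baseName")

# <title> lives in the "tom" namespace, so an unqualified name does not match it.
unprefixed_title = root.findall(".//title")
prefixed_title = root.findall(".//tom:title", NS)
# Python 3.8+: "{*}" matches the tag in any namespace (or none).
wildcard_title = root.findall(".//{*}title")


def first_text(element, path, default=None):
    """Hypothetical helper: text of the first match, or `default` if no match."""
    return element.findtext(path, default=default, namespaces=NS)


print(direct_children, [e.text for e in descendants])
print(unprefixed_title, prefixed_title[0].text, wildcard_title[0].text)
print(first_text(root, ".//dc:creator"))
print(first_text(root, ".//missing", default="n/a"))
```

Using findtext with a default, as in first_text, avoids the AttributeError that find(...).text raises when a node is absent, which may be useful if some entries lack a field.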