Extracting the data from this XML into a Python DataFrame


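I can't get this XML data extracted into a DataFrame correctly. Here is a sample of my XML. In reality I have many "Entity" blocks, each of which represents one row of data in the desired result. To save space, I have pasted only one "Entity" instance here: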

<EntityCollection>
  <Entities>
    <Entity>
      <Attributes>
        <KeyValuePairOfstringanyType>
          <key>client_number</key>
          <value type="b:string">ABC123345</value>
        </KeyValuePairOfstringanyType>
        <KeyValuePairOfstringanyType>
          <key>my_data.client_type</key>
          <value type="AliasedValue">
            <AttributeLogicalName>my_type</AttributeLogicalName>
            <EntityLogicalName>my_data1</EntityLogicalName>
            <NeedFormatting>true</NeedFormatting>
            <ReturnType>123</ReturnType>
            <Value type="OptionSetValue">
              <Value>345</Value>
            </Value>
          </value>
        </KeyValuePairOfstringanyType>
        <KeyValuePairOfstringanyType>
          <key>my_data.status</key>
          <value type="AliasedValue">
            <AttributeLogicalName>status</AttributeLogicalName>
            <EntityLogicalName>my_data1</EntityLogicalName>
            <NeedFormatting>true</NeedFormatting>
            <ReturnType>123</ReturnType>
            <Value type="OptionSetValue">
              <Value>67</Value>
            </Value>
          </value>
        </KeyValuePairOfstringanyType>
        <KeyValuePairOfstringanyType>
          <key>my_data.date</key>
          <value type="AliasedValue">
            <AttributeLogicalName>date</AttributeLogicalName>
            <EntityLogicalName>my_data1</EntityLogicalName>
            <NeedFormatting>true</NeedFormatting>
            <ReturnType>89</ReturnType>
            <Value type="b:string">2024-01-01</Value>
          </value>
        </KeyValuePairOfstringanyType>
        <KeyValuePairOfstringanyType>
          <key>my_data.country</key>
          <value type="AliasedValue">
            <AttributeLogicalName>country</AttributeLogicalName>
            <EntityLogicalName>my_data1</EntityLogicalName>
            <NeedFormatting>true</NeedFormatting>
            <ReturnType>123</ReturnType>
            <Value type="OptionSetValue">
              <Value>456</Value>
            </Value>
          </value>
        </KeyValuePairOfstringanyType>
        <KeyValuePairOfstringanyType>
          <key>client_code</key>
          <value type="b:guid">some_code456</value>
        </KeyValuePairOfstringanyType>
        <KeyValuePairOfstringanyType>
          <key>my_data.my_data1id</key>
          <value type="AliasedValue">
            <AttributeLogicalName>my_data1id</AttributeLogicalName>
            <EntityLogicalName>my_data1</EntityLogicalName>
            <NeedFormatting>true</NeedFormatting>
            <ReturnType>13</ReturnType>
            <Value type="b:guid">some_code123</Value>
          </value>
        </KeyValuePairOfstringanyType>
      </Attributes>
      <EntityState nil="true"/>
      <FormattedValues>
        <KeyValuePairOfstringstring>
          <key>my_data.client_type</key>
          <value>client123</value>
        </KeyValuePairOfstringstring>
        <KeyValuePairOfstringstring>
          <key>my_data.status</key>
          <value>OK</value>
        </KeyValuePairOfstringstring>
        <KeyValuePairOfstringstring>
          <key>my_data.country</key>
          <value>some_country</value>
        </KeyValuePairOfstringstring>
      </FormattedValues>
      <Id>some_code456</Id>
      <KeyAttributes/>
      <LogicalName>some-id</LogicalName>
      <RelatedEntities/>
      <RowVersion>ID123</RowVersion>
    </Entity>
  </Entities>
</EntityCollection>

The elements in "key" (or "AttributeLogicalName") should be the column headers, and "value"/"Value" should be the actual data. This is the desired result (first row): (image of the desired result)

Because the data is deeply nested, I tried xml.etree, but I didn't get far because I'm not sure how to proceed from here:

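import xml.etree.ElementTree as ET

# my_xml is the path to (or a file object for) the XML document
for _, elem in ET.iterparse(my_xml):
    if len(elem) == 0:
        # Leaf element: print its text as well
        print(f'{elem.tag} {elem.attrib} text={elem.text}')
    else:
        print(f'{elem.tag} {elem.attrib}')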

How can I get the desired result?

python pandas xml
1 Answer

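Consider using XSLT to flatten the nested XML by transforming the specific entity, key, and value/Value nodes. Then use pandas to pivot by entity; because some keys repeat, a numeric suffix is appended to each key name. pandas read_xml supports XSLT 1.0 when it uses the lxml parser.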

XSLT (save as a .xsl file)

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
  <xsl:output method="xml" indent="yes"/>
  
  <xsl:template match="/EntityCollection/Entities">
    <root>
        <xsl:apply-templates select="Entity"/>
    </root>
  </xsl:template>
  
  <xsl:template match="Entity">
    <xsl:apply-templates select="descendant::KeyValuePairOfstringanyType"/>
    <xsl:apply-templates select="descendant::KeyValuePairOfstringstring"/>
  </xsl:template>

  <xsl:template match="KeyValuePairOfstringanyType|KeyValuePairOfstringstring">
    <data>
        <entity><xsl:value-of select="ancestor::Entity/Id"/></entity>
        <xsl:copy-of select="descendant::key"/>
        <value><xsl:value-of select="descendant::Value[not(*)]|descendant::value[not(*)]"/></value>
    </data>
  </xsl:template>

</xsl:stylesheet>

XSLT output (long format, which scales to any keys and values)

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <data>
      <entity>some_code456</entity>
      <key>client_number</key>
      <value>ABC123345</value>
   </data>
   <data>
      <entity>some_code456</entity>
      <key>my_data.client_type</key>
      <value>345</value>
   </data>
   <data>
      <entity>some_code456</entity>
      <key>my_data.status</key>
      <value>67</value>
   </data>
   <data>
      <entity>some_code456</entity>
      <key>my_data.date</key>
      <value>2024-01-01</value>
   </data>
   <data>
      <entity>some_code456</entity>
      <key>my_data.country</key>
      <value>456</value>
   </data>
   <data>
      <entity>some_code456</entity>
      <key>client_code</key>
      <value>some_code456</value>
   </data>
   <data>
      <entity>some_code456</entity>
      <key>my_data.my_data1id</key>
      <value>some_code123</value>
   </data>
   <data>
      <entity>some_code456</entity>
      <key>my_data.client_type</key>
      <value>client123</value>
   </data>
   <data>
      <entity>some_code456</entity>
      <key>my_data.status</key>
      <value>OK</value>
   </data>
   <data>
      <entity>some_code456</entity>
      <key>my_data.country</key>
      <value>some_country</value>
   </data>
</root>
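
For readers who want to see what the stylesheet does without running XSLT, the following is a minimal sketch that builds the same long format with the standard library's xml.etree.ElementTree. It is an illustration under stated assumptions, not part of the original answer: the `SAMPLE` string is an abbreviated copy of the question's XML, and `long_rows` and `PAIR_TAGS` are hypothetical names introduced here. Like the XPath union in the stylesheet, it takes the first childless value/Value element in document order.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the question's sample (assumption: same structure,
# fewer pairs): one plain value, one aliased value, one formatted value.
SAMPLE = """<EntityCollection>
  <Entities>
    <Entity>
      <Attributes>
        <KeyValuePairOfstringanyType>
          <key>client_number</key>
          <value type="b:string">ABC123345</value>
        </KeyValuePairOfstringanyType>
        <KeyValuePairOfstringanyType>
          <key>my_data.status</key>
          <value type="AliasedValue">
            <AttributeLogicalName>status</AttributeLogicalName>
            <Value type="OptionSetValue">
              <Value>67</Value>
            </Value>
          </value>
        </KeyValuePairOfstringanyType>
      </Attributes>
      <FormattedValues>
        <KeyValuePairOfstringstring>
          <key>my_data.status</key>
          <value>OK</value>
        </KeyValuePairOfstringstring>
      </FormattedValues>
      <Id>some_code456</Id>
    </Entity>
  </Entities>
</EntityCollection>"""

# Hypothetical helper constant: the two pair element names used in the sample.
PAIR_TAGS = {"KeyValuePairOfstringanyType", "KeyValuePairOfstringstring"}


def long_rows(xml_text):
    """Return (entity, key, value) tuples, one per key/value pair."""
    rows = []
    root = ET.fromstring(xml_text)
    for entity in root.iter("Entity"):
        entity_id = entity.findtext("Id")
        for pair in entity.iter():
            if pair.tag not in PAIR_TAGS:
                continue
            # First childless <value>/<Value> in document order,
            # mirroring descendant::Value[not(*)]|descendant::value[not(*)].
            leaf = next(
                (el for el in pair.iter()
                 if el.tag in ("value", "Value") and len(el) == 0),
                None,
            )
            value = leaf.text if leaf is not None else None
            rows.append((entity_id, pair.findtext("key"), value))
    return rows


for row in long_rows(SAMPLE):
    print(row)
```

The resulting tuples could be passed to pd.DataFrame with columns entity, key, and value if the XSLT route is not available.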

Python (handles duplicate keys with a suffix count)

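import pandas as pd

entities_df = (
    pd.read_xml("input.xml", stylesheet="style.xsl")
      # Number repeated keys within each entity so suffixes restart per entity
      .assign(key = lambda df: df["key"].str.cat(df.groupby(["entity", "key"]).cumcount().add(1).astype('str')))
      # One row per entity, one column per suffixed key
      .pivot(index="entity", columns="key", values="value")
)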

Python output

entities_df
# key           client_code1 client_number1 my_data.client_type1 my_data.client_type2  ... my_data.date1 my_data.my_data1id1 my_data.status1 my_data.status2
# entity                                                                               ...                                                                  
# some_code456  some_code456      ABC123345                  345            client123  ...    2024-01-01        some_code123              67              OK

# [1 rows x 10 columns]
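
Since the question mentions many "Entity" blocks, here is a small sketch of the suffix-and-pivot step on a hand-built long table with two entities. The entity IDs and values are made up for illustration, and the duplicate counter is computed per entity and key, which is an assumption about the desired column naming rather than something the sample output confirms.

```python
import pandas as pd

# Hand-built long table: two hypothetical entities, each with a repeated key.
long_df = pd.DataFrame(
    {
        "entity": ["e1", "e1", "e1", "e2", "e2", "e2"],
        "key": ["client_number", "my_data.status", "my_data.status"] * 2,
        "value": ["A1", "67", "OK", "B2", "68", "Closed"],
    }
)

# Number each key within its own entity so suffixes restart at 1 per entity.
suffix = long_df.groupby(["entity", "key"]).cumcount().add(1).astype(str)

# Append the suffix to the key, then widen to one row per entity.
wide_df = (
    long_df.assign(key=long_df["key"].str.cat(suffix))
    .pivot(index="entity", columns="key", values="value")
)

print(wide_df)
```

Counting by key alone would label the second entity's first client_number as client_number2, so the two entities would land in different columns. Grouping by entity and key keeps the column names aligned across rows.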