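I can't get this XML data extracted into a DataFrame properly. Here is a sample of my XML. In reality I have multiple "Entity" blocks, each representing one row of data in the desired result. To save space, I've pasted only one "Entity" instance here: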
<EntityCollection>
<Entities>
<Entity>
<Attributes>
<KeyValuePairOfstringanyType>
<key>client_number</key>
<value type="b:string">ABC123345</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.client_type</key>
<value type="AliasedValue">
<AttributeLogicalName>my_type</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>123</ReturnType>
<Value type="OptionSetValue">
<Value>345</Value>
</Value>
</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.status</key>
<value type="AliasedValue">
<AttributeLogicalName>status</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>123</ReturnType>
<Value type="OptionSetValue">
<Value>67</Value>
</Value>
</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.date</key>
<value type="AliasedValue">
<AttributeLogicalName>date</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>89</ReturnType>
<Value type="b:string">2024-01-01</Value>
</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.country</key>
<value type="AliasedValue">
<AttributeLogicalName>country</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>123</ReturnType>
<Value type="OptionSetValue">
<Value>456</Value>
</Value>
</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>client_code</key>
<value type="b:guid">some_code456</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.my_data1id</key>
<value type="AliasedValue">
<AttributeLogicalName>my_data1id</AttributeLogicalName>
<EntityLogicalName>my_data1</EntityLogicalName>
<NeedFormatting>true</NeedFormatting>
<ReturnType>13</ReturnType>
<Value type="b:guid">some_code123</Value>
</value>
</KeyValuePairOfstringanyType>
</Attributes>
<EntityState nil="true"/>
<FormattedValues>
<KeyValuePairOfstringstring>
<key>my_data.client_type</key>
<value>client123</value>
</KeyValuePairOfstringstring>
<KeyValuePairOfstringstring>
<key>my_data.status</key>
<value>OK</value>
</KeyValuePairOfstringstring>
<KeyValuePairOfstringstring>
<key>my_data.country</key>
<value>some_country</value>
</KeyValuePairOfstringstring>
</FormattedValues>
<Id>some_code456</Id>
<KeyAttributes/>
<LogicalName>some-id</LogicalName>
<RelatedEntities/>
<RowVersion>ID123</RowVersion>
</Entity>
</Entities>
</EntityCollection>
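The elements in "key" (or "AttributeLogicalName") should be the headers. "value"/"Value" should be the actual data. This is the desired result (first row).
Because the data is deeply nested, I tried xml.etree but didn't get far, as I'm not sure how to proceed from here: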
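import xml.etree.ElementTree as ET

# Visit each element as it finishes parsing; only leaf elements carry text
for _, elem in ET.iterparse(my_xml):
    if len(elem) == 0:
        print(f'{elem.tag} {elem.attrib} text={elem.text}')
    else:
        print(f'{elem.tag} {elem.attrib}')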
How can I get the desired result?
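Consider using XSLT to flatten the nested XML into specific entity, key, and value/Value nodes. Then use pandas to pivot by entity; because some keys repeat, a numeric suffix is appended to each key name. pandas read_xml supports XSLT 1.0 when using the lxml parser.
XSLT (save as .xsl file)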
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/EntityCollection/Entities">
    <root>
      <xsl:apply-templates select="Entity"/>
    </root>
  </xsl:template>

  <xsl:template match="Entity">
    <xsl:apply-templates select="descendant::KeyValuePairOfstringanyType"/>
    <xsl:apply-templates select="descendant::KeyValuePairOfstringstring"/>
  </xsl:template>

  <xsl:template match="KeyValuePairOfstringanyType|KeyValuePairOfstringstring">
    <data>
      <entity><xsl:value-of select="ancestor::Entity/Id"/></entity>
      <xsl:copy-of select="descendant::key"/>
      <value><xsl:value-of select="descendant::Value[not(*)]|descendant::value[not(*)]"/></value>
    </data>
  </xsl:template>
</xsl:stylesheet>
XSLT output (long format, scales to any number of keys and values)
<?xml version="1.0" encoding="UTF-8"?>
<root>
<data>
<entity>some_code456</entity>
<key>client_number</key>
<value>ABC123345</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.client_type</key>
<value>345</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.status</key>
<value>67</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.date</key>
<value>2024-01-01</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.country</key>
<value>456</value>
</data>
<data>
<entity>some_code456</entity>
<key>client_code</key>
<value>some_code456</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.my_data1id</key>
<value>some_code123</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.client_type</key>
<value>client123</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.status</key>
<value>OK</value>
</data>
<data>
<entity>some_code456</entity>
<key>my_data.country</key>
<value>some_country</value>
</data>
</root>
Python (handles repeated keys with a suffix count)
import pandas as pd

entities_df = (
    pd.read_xml("input.xml", stylesheet="style.xsl")
    # number repeated keys within each entity: key1, key2, ...
    .assign(key=lambda df: df["key"].str.cat(
        df.groupby(["entity", "key"]).cumcount().add(1).astype("str")))
    .pivot(index="entity", columns="key", values="value")
)
Python output
entities_df
# key client_code1 client_number1 my_data.client_type1 my_data.client_type2 ... my_data.date1 my_data.my_data1id1 my_data.status1 my_data.status2
# entity ...
# some_code456 some_code456 ABC123345 345 client123 ... 2024-01-01 some_code123 67 OK
# [1 rows x 10 columns]
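If lxml or XSLT is not an option, the same flattening can be sketched with only the standard library, which is what the question started from. This is a rough alternative, not part of the approach above: `entity_rows` and `SAMPLE` are hypothetical names, the sample is a trimmed copy of the question's XML, and it assumes the document has no XML namespaces (as in the pasted sample). It mirrors the XSLT logic by taking the first childless value/Value element under each key/value pair and numbering repeated keys per entity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed version of the sample from the question (assumes no XML namespaces)
SAMPLE = """<EntityCollection><Entities><Entity>
<Attributes>
<KeyValuePairOfstringanyType>
<key>client_number</key>
<value type="b:string">ABC123345</value>
</KeyValuePairOfstringanyType>
<KeyValuePairOfstringanyType>
<key>my_data.status</key>
<value type="AliasedValue">
<AttributeLogicalName>status</AttributeLogicalName>
<Value type="OptionSetValue"><Value>67</Value></Value>
</value>
</KeyValuePairOfstringanyType>
</Attributes>
<FormattedValues>
<KeyValuePairOfstringstring>
<key>my_data.status</key>
<value>OK</value>
</KeyValuePairOfstringstring>
</FormattedValues>
<Id>some_code456</Id>
</Entity></Entities></EntityCollection>"""


def entity_rows(xml_text):
    """Return one dict per <Entity>, with repeated keys numbered key1, key2, ..."""
    rows = []
    for entity in ET.fromstring(xml_text).iter("Entity"):
        row = {"entity": entity.findtext("Id")}
        seen = Counter()  # how many times each key has appeared in this entity
        for pair in entity.iter():
            if not pair.tag.startswith("KeyValuePairOf"):
                continue
            key = pair.findtext("key")
            # The first childless <value>/<Value> in document order holds the data
            leaf = next(
                (n for n in pair.iter()
                 if n.tag in ("value", "Value") and len(n) == 0),
                None,
            )
            seen[key] += 1
            row[f"{key}{seen[key]}"] = None if leaf is None else leaf.text
        rows.append(row)
    return rows


rows = entity_rows(SAMPLE)
print(rows)
```

Each dict is already one wide row, so something like pd.DataFrame(rows).set_index("entity") should give a frame comparable to entities_df, without the pivot step.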