Converting a dynamic XML file to a pandas DataFrame

Question

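Thank you for your time and attention. This is my first post on Stack Overflow, so please forgive any clumsiness.

I mostly write code in Python, but this is my first time parsing an XML file. I have been researching this for a few weeks, but I am stuck on one or more points.

My sample is: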

<record date="2019-12-02" time="19:13:40.091913" id="ALARM:AlaCtrl">
  <field name="system_inst">run</field>
  <field name="system_name">run0</field>
  <field name="flags">AB2</field>
  <field name="alias">CO3_CV_01</field>
  <field name="group">CleanWater</field>
  <field name="priority">126</field>
  <field name="text">Open feedback loss</field>
  <field name="trtext">Retour de position ouvert perdu</field>
  <field name="end">2019-12-02  19:13:40.392992</field>
  <field name="duration">0.301</field>
  <field name="ackts">2019-12-02  20:36:28.615704</field>
  <field name="user">benaissam</field>
  <field name="acktext">Denaissam</field>
  <field name="ivtext">IvTxt</field>
</record>
<record date="2019-12-02" time="20:06:04.661429" id="ALARM:SFCResetPause_TON_Q">
    <field name="system_inst">run</field>
    <field name="system_name">run</field>
    <field name="flags">AB1</field>
    <field name="alias">G7_DOSING</field>
    <field name="group">SeedsMB</field>
    <field name="priority">5</field>
    <field name="text">Dosage seeds in pause : SBX1 not ready</field>
    <field name="trtext">Dosage semence en pause : SBx1 pas prêt</field>
    <field name="end">2019-12-02  20:06:05.187379</field>
    <field name="duration">0.526</field>
</record>

The number of records is dynamic, and the field names in each record can change. Here is my code to parse this XML file into a pandas DataFrame:

import pandas as pd
import xml.etree.ElementTree as et
import re
import itertools


with open('Alarm_DYN002_SeedsMB_5.xml') as f:
    it = itertools.chain('<root>', f, '</root>')
    root = et.fromstringlist(it)

    df_cols = ["date", "time", "id", "system_inst","system_name", "flags", 'alias', 'group',
    'priority', 'text', 'trtext', 'end', 'duration', 'ackts', 'user', 'acktext', 'ivtext']
    rows = []

    system_inst = []
    system_name = []
    flags = []
    alias = []
    group = []
    priority = []
    Text = []
    trtext = []
    end = []
    duration = []
    ackts = []
    user = []
    acktext = []
    ivtext = []

    for record in root.findall('record'):

      ListDate = record.get('date')
      ListTime = record.get('time')
      ListId   = record.get('id')

      system_inst = record.getchildren()[0].text
      system_name = record.getchildren()[1].text
      flags = record.getchildren()[2].text
      alias = record.getchildren()[3].text
      group = record.getchildren()[4].text
      priority = record.getchildren()[5].text
      Text = record.getchildren()[6].text
      trtext = record.getchildren()[7].text
      end = record.getchildren()[8].text
      duration = record.getchildren()[9].text
      ackts = record.getchildren()[10].text
      user = record.getchildren()[11].text
      acktext = record.getchildren()[12].text
      ivtext = record.getchildren()[13].text

      rows.append({"date": ListDate, "time": ListTime, "id": ListId, "system_inst" : system_inst,
                  "system_name" : system_name, "flags" : flags, "alias" : alias, "group" : group,
                  "priority" : priority, "text" : Text, "trtext" : trtext, "end" : end,
                  "duration" : duration, "ackts" : ackts, "user" : user, "acktext" : acktext,
                  "ivtext" : ivtext})

    out_df = pd.DataFrame(rows, columns = df_cols)
    print(out_df)

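In the end, I would like a DataFrame like this: (image not shown)

However, each record may be missing different fields, and in that case I would like the corresponding cells in the DataFrame to be None. So far I have not found a solution.

Again, thank you for your time and your help.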

Answer 1

This is a very interesting question, thank you! You can do this with lxml and XPath. I will try to explain:

from lxml import etree
import pandas as pd
records = """[your xml above]"""

# The sample has several top-level <record> elements and no single root, so wrap it before parsing
root = etree.fromstring(f"<root>{records}</root>")
num_recs = int(root.xpath('count(//record)'))  # count the number of records; 2 in this case
rec_grid = [[] for __ in range(num_recs)]  # initialize a list of sublists (2 in this case), each holding one record's fields
fields = ["date","time","id","system_inst", "system_name", "flags", "alias", "group", "priority", "text", "trtext", "end", "duration", "ackts", "user", "acktext", "ivtext"]

paths = root.xpath('//record')  # this contains a list of the 2 locations of the records
counter = 0
for path in paths:
    for fld in fields[:3]:  # the first 3 fields are attributes of <record>, unlike the other 14
        target = f'(./@{fld})'  # using f-strings to populate the full path
        if path.xpath(target):
            rec_grid[counter].append(path.xpath(target)[0])  # start populating the current sublist with the relevant info
        else:
            rec_grid[counter].append('NA')

    for fld in fields[3:]:  # and now for the rest of the fields
        target = f'(./field[@name="{fld}"]/text())'
        if path.xpath(target):
            rec_grid[counter].append(path.xpath(target)[0])
        else:
            rec_grid[counter].append('NA')
    counter += 1

df = pd.DataFrame(rec_grid, columns=fields) #now that we have our lists, create a df
df

The output is too long to reproduce here, but it looks like the linked image in your question.
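Since the question says the field names themselves can change, a variation that does not hard-code the column list may also be useful. The following is only a minimal sketch using the standard library's ElementTree: it builds one dict per record, collects column names in order of first appearance, and fills anything a record lacks with None. The function name `records_to_rows` and the shortened `SAMPLE` string are made up for illustration and are not part of the original question or answer.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample: two records with different sets of fields.
SAMPLE = '''<record date="2019-12-02" time="19:13:40.091913" id="ALARM:AlaCtrl">
  <field name="system_inst">run</field>
  <field name="user">benaissam</field>
</record>
<record date="2019-12-02" time="20:06:04.661429" id="ALARM:SFCResetPause_TON_Q">
  <field name="system_inst">run</field>
  <field name="duration">0.526</field>
</record>'''


def records_to_rows(xml_fragment):
    """Return (columns, rows) for a fragment of top-level <record> elements."""
    # The fragment has no single root element, so wrap it before parsing.
    root = ET.fromstring(f"<root>{xml_fragment}</root>")

    columns = []   # column names, in order of first appearance
    raw_rows = []  # one dict per record, holding only the fields it has
    for record in root.iter("record"):
        row = dict(record.attrib)  # date, time, id
        for field in record.findall("field"):
            row[field.get("name")] = field.text
        for name in row:
            if name not in columns:
                columns.append(name)
        raw_rows.append(row)

    # Give every row the same keys; fields a record lacks become None.
    rows = [{name: row.get(name) for name in columns} for row in raw_rows]
    return columns, rows


columns, rows = records_to_rows(SAMPLE)
```

Passing the result to `pd.DataFrame(rows, columns=columns)` should then give one row per record, with missing values in the cells where a record had no such field.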
