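Thank you for your time and attention. This is my first post on stackoverflow, so please excuse any clumsiness.
I mostly write code in Python, but this is my first time parsing an XML file. I have been researching this for a few weeks, but I am stuck on one or more points.
My sample is: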
<record date="2019-12-02" time="19:13:40.091913" id="ALARM:AlaCtrl">
<field name="system_inst">run</field>
<field name="system_name">run0</field>
<field name="flags">AB2</field>
<field name="alias">CO3_CV_01</field>
<field name="group">CleanWater</field>
<field name="priority">126</field>
<field name="text">Open feedback loss</field>
<field name="trtext">Retour de position ouvert perdu</field>
<field name="end">2019-12-02 19:13:40.392992</field>
<field name="duration">0.301</field>
<field name="ackts">2019-12-02 20:36:28.615704</field>
<field name="user">benaissam</field>
<field name="acktext">Denaissam</field>
<field name="ivtext">IvTxt</field>
</record>
<record date="2019-12-02" time="20:06:04.661429" id="ALARM:SFCResetPause_TON_Q">
<field name="system_inst">run</field>
<field name="system_name">run</field>
<field name="flags">AB1</field>
<field name="alias">G7_DOSING</field>
<field name="group">SeedsMB</field>
<field name="priority">5</field>
<field name="text">Dosage seeds in pause : SBX1 not ready</field>
<field name="trtext">Dosage semence en pause : SBx1 pas prêt</field>
<field name="end">2019-12-02 20:06:05.187379</field>
<field name="duration">0.526</field>
</record>
The number of records is dynamic, and the field names can vary from record to record. Here is my code to parse this XML file into a pandas DataFrame:
import pandas as pd
import xml.etree.ElementTree as et
import re
import itertools
with open('Alarm_DYN002_SeedsMB_5.xml') as f:
    it = itertools.chain('<root>', f, '</root>')
    root = et.fromstringlist(it)
df_cols = ["date", "time", "id", "system_inst","system_name", "flags", 'alias', 'group',
'priority', 'text', 'trtext', 'end', 'duration', 'ackts', 'user', 'acktext', 'ivtext']
rows = []
system_inst = []
system_name = []
flags = []
alias = []
group = []
priority = []
Text = []
trtext = []
end = []
duration = []
ackts = []
user = []
acktext = []
ivtext = []
for record in root.findall('record'):
    ListDate = record.get('date')
    ListTime = record.get('time')
    ListId = record.get('id')
    system_inst = record.getchildren()[0].text
    system_name = record.getchildren()[1].text
    flags = record.getchildren()[2].text
    alias = record.getchildren()[3].text
    group = record.getchildren()[4].text
    priority = record.getchildren()[5].text
    Text = record.getchildren()[6].text
    trtext = record.getchildren()[7].text
    end = record.getchildren()[8].text
    duration = record.getchildren()[9].text
    ackts = record.getchildren()[10].text
    user = record.getchildren()[11].text
    acktext = record.getchildren()[12].text
    ivtext = record.getchildren()[13].text
    rows.append({"date": ListDate, "time": ListTime, "id": ListId, "system_inst" : system_inst,
                 "system_name" : system_name, "flags" : flags, "alias" : alias, "group" : group,
                 "priority" : priority, "text" : Text, "trtext" : trtext, "end" : end,
                 "duration" : duration, "ackts" : ackts, "user" : user, "acktext" : acktext,
                 "ivtext" : ivtext})
out_df = pd.DataFrame(rows, columns = df_cols)
print(out_df)
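In the end, I would like a DataFrame like this: [image]
However, each record may be missing different fields, and in that case I would like the corresponding cell in the DataFrame to be "None". So far I have not been able to find a solution.
Again, thank you for your time and your help.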
This is a very interesting question, thank you! You can do this with lxml and xpath. I'll try to explain:
from lxml import etree
import pandas as pd
records = """<root>[your xml above]</root>"""  # wrap the records in a single root element; the parser rejects multiple top-level elements
root = etree.fromstring(records)
num_recs = int(root.xpath('count(//record)'))  # count the number of records; 2, in this case
rec_grid = [[] for __ in range(num_recs)]  # initialize a list of sublists (2 in this case), with each sublist holding the relevant fields
fields = ["date","time","id","system_inst", "system_name", "flags", "alias", "group", "priority", "text", "trtext", "end", "duration", "ackts", "user", "acktext", "ivtext"]
paths = root.xpath('//record') #this contains a list of the 2 locations of the records
counter = 0
for path in paths:
    for fld in fields[:3]:  # the first 3 fields are in a different sub-location than the other 14
        target = f'(./@{fld})'  # using f-strings to populate the full path
        if path.xpath(target):
            rec_grid[counter].append(path.xpath(target)[0])  # we start populating our current sublist with the relevant info
        else:
            rec_grid[counter].append('NA')
    for fld in fields[3:]:  # and now for the rest of the fields
        target = f'(./field[@name="{fld}"]/text())'
        if path.xpath(target):
            rec_grid[counter].append(path.xpath(target)[0])
        else:
            rec_grid[counter].append('NA')
    counter += 1
df = pd.DataFrame(rec_grid, columns=fields) #now that we have our lists, create a df
df
The output is too long to reproduce here, but it looks like the linked image in your question.
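Since the question says the field names can change from record to record, a variant that does not hard-code the column list may also be worth a look. The following is only a sketch using the standard library's xml.etree.ElementTree: it builds one dict per record keyed by each field's name attribute, then fills any missing key with None. The names parse_records and SAMPLE are made up for this illustration, and the sample is a trimmed version of the XML in the question. It also assumes the file has no XML declaration, since the text is wrapped in a dummy root element.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the question's XML (hypothetical, for illustration only).
SAMPLE = """\
<record date="2019-12-02" time="19:13:40.091913" id="ALARM:AlaCtrl">
<field name="system_inst">run</field>
<field name="user">benaissam</field>
</record>
<record date="2019-12-02" time="20:06:04.661429" id="ALARM:SFCResetPause_TON_Q">
<field name="system_inst">run</field>
<field name="duration">0.526</field>
</record>
"""


def parse_records(xml_text):
    """Return (columns, rows); each row is a dict with None for missing fields."""
    # The input has several top-level <record> elements, so wrap it in a dummy root.
    root = ET.fromstring(f"<root>{xml_text}</root>")

    rows = []
    for record in root.iter("record"):
        row = dict(record.attrib)  # date, time, id come from the attributes
        for field in record.findall("field"):
            row[field.get("name")] = field.text  # key each value by its name attribute
        rows.append(row)

    # Collect every column name seen in any record, in first-seen order.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    # Give every row the same keys, using None where a field was absent.
    rows = [{col: row.get(col) for col in columns} for row in rows]
    return columns, rows


columns, rows = parse_records(SAMPLE)
```

Passing the result to pd.DataFrame(rows, columns=columns) should then give one row per record, with None in the cells where a record had no such field.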