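I have more than 6,000 XML spreadsheet files that I want to parse into pandas DataFrames. Ideally I would do this with pd.read_xml and the right arguments (an XPath?). I don't want to parse the files with lxml directly. I need this to be as efficient as possible.
A sample of my files: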
<?mso-application progid='Excel.Sheet'?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Styles>
<Style ss:ID="Default" ss:Name="Normal">
<Alignment ss:Vertical="Bottom"/>
</Style>
<Style ss:ID="ShortDate">
<NumberFormat ss:Format="Short Date"/>
</Style>
<Style ss:ID="Titles">
<Font ss:Bold="1"/>
<Interior ss:Color="#C0C0C0" ss:Pattern="Solid"/>
</Style>
<Style ss:ID="ColorShortDate">
<Font ss:Color="#808080"/>
<NumberFormat ss:Format="Short Date"/>
</Style>
<Style ss:ID="MakeCellColor">
<Font ss:Color="#808080"/>
</Style>
<Style ss:ID="DateTimeStamp">
<NumberFormat ss:Format="dd/mm/yyyy'' hh:mm:ss"/>
</Style>
</Styles>
<Worksheet ss:Name="Details" xml:lang="en-US" xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:Composites="Composites" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Table>
<Column ss:Width="114.75"/>
<Column ss:Width="116.25"/>
<Column ss:Width="53.25"/>
<Column ss:Width="171.75"/>
<Row ss:StyleID="Titles">
<Cell>
<Data ss:Type="String">ShortName</Data>
</Cell>
<Cell>
<Data ss:Type="String">FieldCode</Data>
</Cell>
<Cell>
<Data ss:Type="String">Date</Data>
</Cell>
<Cell>
<Data ss:Type="String">Value</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">Calculate</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">yes</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">Calendar</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">Working Days</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">FromCurrency</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">EUR</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">TranslatableName</Data>
</Cell>
<Cell>
<Data ss:Type="String">Inception</Data>
</Cell>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">ToCurrency</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">CHF</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">InceptionDate</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">1995-01-01</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">LastModificationDate</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">2011-08-19</Data>
</Cell>
</Row>
</Table>
<WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
<Selected/>
<FreezePanes/>
<SplitHorizontal>1</SplitHorizontal>
<TopRowBottomPane>1</TopRowBottomPane>
<ActivePane>2</ActivePane>
</WorksheetOptions>
</Worksheet>
</Workbook>
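The DataFrame I expect is:
ShortName | FieldCode | Date | Value |
---|---|---|---|
CHF vs EUR | Calculate | | yes |
CHF vs EUR | Calendar | | Working Days |
CHF vs EUR | FromCurrency | | EUR |
CHF vs EUR | TranslatableName | Inception | CHF vs EUR |
CHF vs EUR | ToCurrency | | CHF |
CHF vs EUR | InceptionDate | | 1995-01-01 |
CHF vs EUR | LastModificationDate | | 2011-08-19 |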
If you insist on using only pandas (even though it uses lxml under the hood), you can try:
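import numpy as np
import pandas as pd

N = 4  # number of columns in the sheet
xp = ".//ss:Row/ss:Cell"  # every cell of every row, in document order
ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
tmp = pd.read_xml("file.xml", xpath=xp, namespaces=ns).squeeze()  # single "Data" column -> Series
# The first N values are the header row; the rest are reshaped into rows of N cells.
df = pd.DataFrame(np.reshape(tmp[N:].to_numpy(), (-1, N)), columns=tmp[:N]).fillna("")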
Output:
print(df)
Data   ShortName             FieldCode       Date         Value
0     CHF vs EUR             Calculate                      yes
1     CHF vs EUR              Calendar             Working Days
2     CHF vs EUR          FromCurrency                      EUR
3     CHF vs EUR      TranslatableName  Inception    CHF vs EUR
4     CHF vs EUR            ToCurrency                      CHF
5     CHF vs EUR         InceptionDate               1995-01-01
6     CHF vs EUR  LastModificationDate               2011-08-19
[7 rows x 4 columns]
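Since the question is about thousands of files, here is a rough sketch of how the same idea could be wrapped in a function and applied to a whole folder. The names read_sheet and read_folder, the n_cols argument, and the inline SAMPLE document are made up for illustration. The sketch assumes every row has exactly n_cols cells (a sheet that skips cells, for example via ss:Index, would break the reshape) and that no cell text is one of the strings pandas treats as missing, such as "NA".

```python
from io import StringIO
from pathlib import Path

import pandas as pd

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}


def read_sheet(source, n_cols=4, parser="lxml"):
    """Parse one SpreadsheetML file (path or file-like object) into a DataFrame.

    Hypothetical helper: assumes the first row is the header and that
    every row holds exactly n_cols cells.
    """
    cells = pd.read_xml(source, xpath=".//ss:Row/ss:Cell", namespaces=NS, parser=parser)
    values = cells["Data"].to_numpy()
    if len(values) % n_cols:
        raise ValueError(f"{len(values)} cells cannot be split into rows of {n_cols}")
    header, body = values[:n_cols], values[n_cols:].reshape(-1, n_cols)
    # Empty <Data/> elements come back as missing values; show them as "".
    return pd.DataFrame(body, columns=list(header)).fillna("")


def read_folder(folder, pattern="*.xml", **kwargs):
    """Parse every matching file in a folder and stack the results."""
    frames = [read_sheet(path, **kwargs) for path in sorted(Path(folder).glob(pattern))]
    return pd.concat(frames, ignore_index=True)


# Tiny two-column stand-in for a real file, used only to try the function out.
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet ss:Name="Details"><Table>
<Row><Cell><Data ss:Type="String">FieldCode</Data></Cell><Cell><Data ss:Type="String">Value</Data></Cell></Row>
<Row><Cell><Data ss:Type="String">Calculate</Data></Cell><Cell><Data ss:Type="String">yes</Data></Cell></Row>
<Row><Cell><Data ss:Type="String">Date</Data></Cell><Cell><Data ss:Type="String"/></Cell></Row>
</Table></Worksheet></Workbook>"""

# parser="etree" uses the standard library parser, so lxml is not required here.
demo = read_sheet(StringIO(SAMPLE), n_cols=2, parser="etree")
print(demo)
```

Collecting the per-file frames in a list and calling pd.concat once at the end is usually cheaper than appending to a DataFrame inside the loop. Whether parser="lxml" or parser="etree" is faster on your files is something to measure rather than assume.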