Parsing "XML Spreadsheet 2003" into a pandas DataFrame

Question

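I have more than 6,000 XML spreadsheet files that I want to parse into pandas DataFrames. Ideally I would like to do this with the pd.read_xml method, passing the right arguments (an XPath?). I don't want to parse them with lxml myself. It needs to be as efficient as possible.

A sample of my files: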

<?mso-application progid='Excel.Sheet'?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <Styles>
        <Style ss:ID="Default" ss:Name="Normal">
            <Alignment ss:Vertical="Bottom"/>
        </Style>
        <Style ss:ID="ShortDate">
            <NumberFormat ss:Format="Short Date"/>
        </Style>
        <Style ss:ID="Titles">
            <Font ss:Bold="1"/>
            <Interior ss:Color="#C0C0C0" ss:Pattern="Solid"/>
        </Style>
        <Style ss:ID="ColorShortDate">
            <Font ss:Color="#808080"/>
            <NumberFormat ss:Format="Short Date"/>
        </Style>
        <Style ss:ID="MakeCellColor">
            <Font ss:Color="#808080"/>
        </Style>
        <Style ss:ID="DateTimeStamp">
            <NumberFormat ss:Format="dd/mm/yyyy'' hh:mm:ss"/>
        </Style>
    </Styles>
    <Worksheet ss:Name="Details" xml:lang="en-US" xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:Composites="Composites" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
        <Table>
            <Column ss:Width="114.75"/>
            <Column ss:Width="116.25"/>
            <Column ss:Width="53.25"/>
            <Column ss:Width="171.75"/>
            <Row ss:StyleID="Titles">
                <Cell>
                    <Data ss:Type="String">ShortName</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">FieldCode</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Date</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Value</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Calculate</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">yes</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Calendar</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Working Days</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">FromCurrency</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">EUR</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">TranslatableName</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Inception</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">ToCurrency</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">CHF</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">InceptionDate</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">1995-01-01</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">LastModificationDate</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">2011-08-19</Data>
                </Cell>
            </Row>
        </Table>
        <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
            <Selected/>
            <FreezePanes/>
            <SplitHorizontal>1</SplitHorizontal>
            <TopRowBottomPane>1</TopRowBottomPane>
            <ActivePane>2</ActivePane>
        </WorksheetOptions>
    </Worksheet>
</Workbook>

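My expected DataFrame is:

ShortName	FieldCode	Date	Value
CHF vs EUR	Calculate		yes
CHF vs EUR	Calendar		Working Days
CHF vs EUR	FromCurrency		EUR
CHF vs EUR	TranslatableName	Inception	CHF vs EUR
CHF vs EUR	ToCurrency		CHF
CHF vs EUR	InceptionDate		1995-01-01
CHF vs EUR	LastModificationDate		2011-08-19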

Answer 1

If you insist on using only pandas (even though it uses lxml under the hood), you can try:

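import numpy as np
import pandas as pd

N = 4  # number of columns in the sheet
xp = ".//ss:Row/ss:Cell"
ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

# Each <Cell> becomes one row with a single "Data" column; squeeze() turns it into a Series.
tmp = pd.read_xml("file.xml", xpath=xp, namespaces=ns).squeeze()
# The first N values are the header; the rest are reshaped into rows of N cells.
df = pd.DataFrame(np.reshape(tmp[N:].to_numpy(), (-1, N)), columns=tmp[:N]).fillna("")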

Output:

print(df)

Data   ShortName             FieldCode       Date         Value
0     CHF vs EUR             Calculate                      yes
1     CHF vs EUR              Calendar             Working Days
2     CHF vs EUR          FromCurrency                      EUR
3     CHF vs EUR      TranslatableName  Inception    CHF vs EUR
4     CHF vs EUR            ToCurrency                      CHF
5     CHF vs EUR         InceptionDate               1995-01-01
6     CHF vs EUR  LastModificationDate               2011-08-19

[7 rows x 4 columns]
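
The reshape above relies on a fixed column count N. As a rough alternative, and not what the answer itself proposes, the rows can be collected one <Row> at a time with the standard library's xml.etree.ElementTree, so no column count has to be hard-coded. The helper name read_rows and the shortened SAMPLE string are made up for illustration, and the sketch assumes every <Row> carries one <Cell> per column (no ss:Index gaps or merged cells), as in the sample file. Whether this is faster than pd.read_xml over 6,000 files has not been measured here.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML 2003 namespace, as declared on the sample file's <Workbook>.
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}


def read_rows(xml_text):
    """Return (header, rows) from the first <Table> in the document.

    Hypothetical helper: assumes one <Cell> per column in every <Row>.
    """
    root = ET.fromstring(xml_text)
    table = root.find(".//ss:Table", NS)
    if table is None:
        raise ValueError("no <Table> element found")
    parsed = []
    for row in table.findall("ss:Row", NS):
        # An empty <Data/> element has no text; represent it as "".
        parsed.append([
            cell.findtext("ss:Data", default="", namespaces=NS) or ""
            for cell in row.findall("ss:Cell", NS)
        ])
    if not parsed:
        return [], []
    return parsed[0], parsed[1:]


# Shortened, made-up stand-in for one of the files (two columns only).
SAMPLE = """<?mso-application progid='Excel.Sheet'?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Details">
    <Table>
      <Row><Cell><Data ss:Type="String">ShortName</Data></Cell><Cell><Data ss:Type="String">Date</Data></Cell></Row>
      <Row><Cell><Data ss:Type="String">CHF vs EUR</Data></Cell><Cell><Data ss:Type="String"/></Cell></Row>
    </Table>
  </Worksheet>
</Workbook>"""

header, rows = read_rows(SAMPLE)
print(header)
print(rows)
```

The resulting header and rows can then be handed to pd.DataFrame(rows, columns=header). For files on disk, ET.parse(path).getroot() could replace ET.fromstring.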
