Unable to parse Stack Exchange XML file

Question (votes: 1, answers: 2)

I am trying to parse the PostHistory.xml file from the Stack Exchange data dump. My code looks like this:

import xml.etree.ElementTree as eTree
with open("PostHistory.xml", 'r') as xml_file:
    xml_tree = eTree.parse(xml_file)

But I get:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 1959-1960: invalid continuation byte

I can read text from the file like this:

with open("PostHistory.xml") as xml_file:
     a = xml_file.readline()

The file * command reports the following for this file:

PostHistory.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, 
with very long lines, with CRLF line terminators

In addition, the first line of the file confirms the UTF-8 encoding:

<?xml version="1.0" encoding="utf-8"?>

I tried adding the parameter encoding="utf-8-sig" to open(), but I got the same error again.

The file is 112 GB in size. Am I missing something here?

xml python-3.x xml-parsing elementtree
2 Answers
1
vote

You could try something like this:

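    import xml.etree.ElementTree as eTree

    posts_path = "PostHistory.xml"  # path taken from the question
    # Read bytes, so a bad line fails in decode() rather than in the file iterator
    with open(posts_path, "rb") as xml_file:
        for raw_line in xml_file:
            try:
                line = raw_line.decode("utf-8-sig")
            except UnicodeDecodeError:
                # Dealing with corrupted encoded strings: fall back to latin-1
                line = raw_line.decode("latin-1")
            if not line.lstrip().startswith("<row"):
                continue  # assumes one <row .../> per line; skips declaration and wrapper tags
            xml_obj = eTree.fromstring(line)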

This way, when a line contains invalid characters, you fall back to latin-1 for that line only.
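A variation on the same idea, offered only as a sketch: instead of switching a whole line to latin-1, the standard errors="replace" option of the decoder keeps every valid character and substitutes U+FFFD for each undecodable byte. The helper name parse_rows_lenient and the sample bytes are made up for illustration, and the code assumes the dump stores one <row .../> element per line.

```python
import io
import xml.etree.ElementTree as eTree

def parse_rows_lenient(binary_stream):
    """Yield one Element per <row .../> line; undecodable bytes become U+FFFD."""
    # utf-8-sig strips the BOM; errors="replace" never raises UnicodeDecodeError
    text = io.TextIOWrapper(binary_stream, encoding="utf-8-sig", errors="replace")
    for line in text:
        stripped = line.strip()
        # Assumption: every record sits on its own line as <row ... />
        if stripped.startswith("<row"):
            yield eTree.fromstring(stripped)

# Hypothetical sample: BOM, CRLF line endings, and one stray latin-1 byte (0xE9)
sample = (
    b'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?>\r\n'
    b'<posthistory>\r\n'
    b'  <row Id="1" Text="caf\xe9 ok" />\r\n'
    b'</posthistory>\r\n'
)
rows = list(parse_rows_lenient(io.BytesIO(sample)))
print(rows[0].attrib)
```

For the real file, the stream would be open("PostHistory.xml", "rb"). Because rows are yielded one at a time, the 112 GB file is never held in memory as a single tree.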


0
votes

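The actual bytes in the file may contradict the encoding specified in the XML declaration. (Merely setting the encoding in the XML declaration does not change the rest of the bytes in the file.)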
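A small illustration of that mismatch, using made-up bytes rather than anything from the real dump: the declaration says utf-8, but the attribute value contains a single latin-1 byte (0xE9), so UTF-8 decoding fails while ISO-8859-1 decoding succeeds.

```python
# Hypothetical data: declared as utf-8, but "é" is stored as the latin-1 byte 0xE9
data = b'<?xml version="1.0" encoding="utf-8"?>\n<row Text="caf\xe9" />'

try:
    data.decode("utf-8")
    utf8_ok = True
    bad_start = None
except UnicodeDecodeError as err:
    utf8_ok = False
    bad_start = err.start  # byte offset where decoding failed

# ISO-8859-1 maps every byte to a character, so this never raises
as_latin1 = data.decode("iso-8859-1")

print(utf8_ok, bad_start)
print(as_latin1)
```

Note that ISO-8859-1 succeeding does not prove the data is ISO-8859-1; it accepts any byte sequence, so genuinely corrupted bytes would decode to wrong characters without an error.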

You could try

open("PostHistory.xml", 'r', encoding="ISO-8859-1")

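But if this is data corruption rather than a file-wide encoding problem, you may have to roll up your sleeves and repair the bad bytes at positions 1959-1960.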
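To look at the bad bytes before deciding how to repair them, one option is to read a chunk of the file in binary mode and collect the spans that fail to decode. The sketch below is only that: find_invalid_utf8 is a hypothetical helper, and it assumes the positions in the error message are offsets into the first chunk read from the file.

```python
def find_invalid_utf8(data):
    """Return (start, end) byte offsets of every span in data that is not valid UTF-8."""
    spans = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice, so shift by pos
            spans.append((pos + err.start, pos + err.end))
            pos += err.end
    return spans

# Hypothetical sample with two bad bytes
sample = b"ok \xe9 and \xff end"
print(find_invalid_utf8(sample))
```

On the real file this could be applied to a prefix, for example the bytes returned by open("PostHistory.xml", "rb").read(4096), and the reported offsets inspected with a slice such as chunk[1950:1970]. A multi-byte character cut off at the end of the chunk would also be reported, so a span touching the chunk boundary is not necessarily corruption.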
