I am trying to parse the PostHistory.xml file from a Stack Exchange dump. My code looks like this:
import xml.etree.ElementTree as eTree

with open("PostHistory.xml", 'r') as xml_file:
    xml_tree = eTree.parse(xml_file)
But I get:
UnicodeDecodeError: 'utf-8' codec can't decode
bytes in position 1959-1960: invalid continuation byte
I can read text from the file like this:
with open("PostHistory.xml") as xml_file:
    a = xml_file.readline()
The file * command returns this description of the file:
PostHistory.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text,
with very long lines, with CRLF line terminators
In addition, the first line of the file confirms the UTF-8 encoding:
<?xml version="1.0" encoding="utf-8"?>
I tried adding the parameter encoding="utf-8-sig" to open(), but I got the same error again.
The file is 112 GB. Am I missing something here?
You could try something like this:
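import xml.etree.ElementTree as eTree

# posts_path is the path to the XML file; binary mode lets us decode each line by hand
with open(posts_path, "rb") as xml_file:
    for raw_line in xml_file:
        try:
            line = raw_line.decode("utf-8")
        except UnicodeDecodeError:
            # Dealing with corrupted encoded strings: latin-1 can decode any byte
            line = raw_line.decode("latin-1")
        try:
            xml_obj = eTree.fromstring(line)
        except eTree.ParseError:
            continue  # not a complete element (XML declaration, root tags)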
So when you hit invalid characters, you fall back to latin-1 for that line.
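The actual bytes in the file may contradict the encoding specified in the XML declaration. (Merely setting the encoding in the XML declaration does not change the rest of the bytes in the file.)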
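To tell a handful of corrupted bytes apart from a file-wide encoding mismatch, it can help to scan the raw bytes directly. The following is only a minimal sketch: find_bad_utf8 and its limit parameter are hypothetical names made up for illustration, not part of any library. It reads a binary stream line by line (reasonable for UTF-8, since a newline byte never occurs inside a multi-byte sequence) and records the absolute offset of the first invalid sequence on each offending line.

```python
import io


def find_bad_utf8(stream, limit=10):
    """Return up to `limit` (absolute_offset, offending_bytes) pairs.

    `stream` must be a binary file object. Only the first invalid
    sequence on each line is reported.
    """
    found = []
    offset = 0  # absolute byte offset of the start of the current line
    for raw_line in stream:
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError as exc:
            found.append((offset + exc.start, raw_line[exc.start:exc.end]))
            if len(found) >= limit:
                break
        offset += len(raw_line)
    return found


# Small in-memory demo: the second line contains a lone latin-1 byte (0xE9).
sample = io.BytesIO(b'<row Text="ok" />\r\n<row Text="caf\xe9 x" />\r\n')
print(find_bad_utf8(sample))
```

For the real file you would pass open("PostHistory.xml", "rb") instead of the in-memory sample. If only a few offsets come back, corruption is the more likely explanation; if nearly every line with non-ASCII text is flagged, the file is probably in a different encoding altogether.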
You could try
open("PostHistory.xml", 'r', encoding="ISO-8859-1")
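But if this is data corruption rather than a file-wide encoding problem, you may have to roll up your sleeves and fix the bad bytes at positions 1959-1960.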
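If repairing the bytes by hand is not practical for a file this size, one alternative is to let open() substitute undecodable bytes via its errors="replace" argument and to parse incrementally with eTree.iterparse, so the whole tree is never held in memory. This is a rough sketch under assumptions: iter_rows is a hypothetical helper name, and the element name "row" reflects how I understand the Stack Exchange dump layout, so check it against your file. Replaced bytes show up as U+FFFD in the text, so the original characters are lost.

```python
import io
import xml.etree.ElementTree as eTree


def iter_rows(text_stream, tag="row"):
    """Yield the attributes of each <tag> element from a text-mode stream."""
    context = eTree.iterparse(text_stream, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield dict(elem.attrib)
            root.clear()  # drop already-processed children to keep memory flat


# In-memory demo with a BOM and one invalid byte (0xE9) inside an attribute.
raw = (b'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?>\r\n'
       b'<posthistory>\r\n'
       b'<row Id="1" Text="caf\xe9 x" />\r\n'
       b'<row Id="2" Text="ok" />\r\n'
       b'</posthistory>')
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", errors="replace")
for attrs in iter_rows(stream):
    print(attrs["Id"], repr(attrs["Text"]))
```

With the real file, the stream would be open("PostHistory.xml", encoding="utf-8-sig", errors="replace"). Note that this only helps when the bad bytes sit inside text or attribute values; if they damage the markup itself, the parser will still raise ParseError.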
See also: