How can I parse an XML file to get all infoTable elements?

Question (votes: 0, answers: 1)

I want to parse a 13-F form from the SEC website to get all of its infoTable elements.

Fetch the target data:

import gzip
from urllib.request import Request, urlopen

url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012324011775/0000950123-24-011775.txt"
req = Request(
    url=url,
    headers={
        "User-Agent": "[email protected]",
        "Accept-Encoding": "gzip, deflate",
        "Host": "www.sec.gov",
    },
)
webpage = urlopen(req).read()
# The request asks for a compressed body, so decompress it before decoding.
content = gzip.decompress(webpage)
data = content.decode("utf-8")

Then I tried to parse it with several libraries.

With minidom

from xml.dom import minidom
xmldoc = minidom.parseString(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/xml/dom/minidom.py", line 2000, in parseString
    return expatbuilder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 17, column 52

With xml.etree

import xml.etree.ElementTree as ET
tree = ET.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/xml/etree/ElementTree.py", line 1338, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 17, column 52

With lxml.etree

from lxml import etree
tree = etree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3306, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1995, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1875, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1105, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 633, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 743, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 672, in lxml.etree._raiseParseError
  File "<string>", line 17
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 17, column 53

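None of them can load the data. minidom and xml.etree both report "not well-formed (invalid token): line 17, column 52", and lxml reports "xmlParseEntityRef: no name, line 17, column 53".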


I don't see any strange tag at line 17, column 52 of the 13-F form. How can I fix the "not well-formed" problem?
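For readers hitting the same error, a small sketch of how to look at the exact line the parser rejects may help. It relies on the `position` attribute of `xml.etree.ElementTree.ParseError`; the helper name `show_parse_error` and the sample text are hypothetical stand-ins, not the real filing. lxml's message `xmlParseEntityRef: no name` usually points to a bare `&` that is not part of an entity reference, which is what the sample reproduces.

```python
import xml.etree.ElementTree as ET


def show_parse_error(text):
    """Return (line, column, offending_line) for the first parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # position is (line, column); the line is 1-based, the column 0-based.
        line, column = err.position
        return line, column, text.splitlines()[line - 1]
    return None


# Hypothetical stand-in for header text that is not well-formed XML.
sample = "<SEC-DOCUMENT>\n<NAME>SMITH & JONES INC\n</SEC-DOCUMENT>"

line, column, offending = show_parse_error(sample)
print(line, column, repr(offending))
```

Running the same helper on `data` from the question should show which characters on line 17 the parser cannot accept.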

python-3.x xml xml-parsing
1 Answer
0 votes

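If you only want the "XML", you might be able to treat the file as SGML, pull out the parts you care about, and then parse those with one of the more formal XML parsers.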

import requests
import re

pattern = re.compile(r"<XML>(.*?)<\/XML>", flags=re.MULTILINE | re.DOTALL)
url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012324011775/0000950123-24-011775.txt"
headers={'User-Agent': '[email protected]', "Accept-Encoding":"gzip, deflate", 'Host': 'www.sec.gov'}
req = requests.get(url, headers=headers)

for index, match in enumerate(pattern.finditer(req.text)):
    ## just the first 10 characters of the XML document
    print(index, match.group(1)[:10].strip() + "...")

This gives me:

0 <?xml ver...
1 <informat...
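Building on the regex extraction above, the following is a minimal sketch of the remaining step: parsing each extracted block and collecting the infoTable elements. It is an illustration under assumptions, not a tested solution for the real filing. The helper names `local_name` and `iter_info_tables`, the sample text, and the placeholder namespace `urn:example:informationtable` are all hypothetical. Tags are matched by local name so the code does not depend on the namespace the real document declares.

```python
import re
import xml.etree.ElementTree as ET

XML_BLOCK = re.compile(r"<XML>(.*?)</XML>", flags=re.DOTALL)


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_info_tables(submission_text):
    """Yield one dict per infoTable element found in any <XML> block."""
    for match in XML_BLOCK.finditer(submission_text):
        # strip() matters: an XML declaration must be at the very start.
        root = ET.fromstring(match.group(1).strip())
        for elem in root.iter():
            if local_name(elem.tag) == "infoTable":
                # Only direct children are collected; nested ones are not flattened.
                yield {local_name(c.tag): (c.text or "").strip() for c in elem}


# Hypothetical sample that mimics the SGML wrapper around an XML document.
sample = """<SEC-DOCUMENT>
<SEC-HEADER>
COMPANY NAME: SMITH & JONES INC
</SEC-HEADER>
<DOCUMENT>
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?>
<informationTable xmlns="urn:example:informationtable">
  <infoTable>
    <nameOfIssuer>EXAMPLE CORP</nameOfIssuer>
    <value>100</value>
  </infoTable>
  <infoTable>
    <nameOfIssuer>SAMPLE INC</nameOfIssuer>
    <value>200</value>
  </infoTable>
</informationTable>
</XML>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>"""

rows = list(iter_info_tables(sample))
print(len(rows), rows[0]["nameOfIssuer"])
```

The bare `&` in the sample header never reaches the XML parser, because only the text between `<XML>` and `</XML>` is parsed. With the real response, the same function would be called on the response text instead of `sample`.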