I want to parse a 13F form from the SEC website to get all of its infoTable elements.
Fetching the target data:
import gzip
from urllib.request import Request, urlopen

url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012324011775/0000950123-24-011775.txt"
req = Request(
    url=url,
    headers={
        'User-Agent': '[email protected]',
        'Accept-Encoding': 'gzip, deflate',
        'Host': 'www.sec.gov',
    },
)
# Raw response body; this assumes the server answered with gzip-compressed content
webpage = urlopen(req).read()
content = gzip.decompress(webpage)
data = content.decode('utf-8')
Parsing it with several libraries.
With minidom
from xml.dom import minidom
xmldoc = minidom.parseString(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/xml/dom/minidom.py", line 2000, in parseString
    return expatbuilder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 17, column 52
With xml.etree
import xml.etree.ElementTree as ET
tree = ET.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/xml/etree/ElementTree.py", line 1338, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 17, column 52
With lxml.etree
from lxml import etree
tree = etree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3306, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1995, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1875, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1105, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 633, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 743, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 672, in lxml.etree._raiseParseError
  File "<string>", line 17
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 17, column 53
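None of them can load the data. minidom and xml.etree both report "not well-formed (invalid token): line 17, column 52", and lxml reports "xmlParseEntityRef: no name" at line 17, column 53.
I don't see any strange tag at line 17, column 52 of the 13F form. How can I fix the not well-formed problem?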
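If you only want the "XML", you may be able to treat the file as SGML, pull out the embedded XML documents, and then parse just those with one of the more formal XML parsers. The pattern below captures everything between each <XML> and </XML> wrapper: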
import re

import requests

# Non-greedy match so each <XML>...</XML> wrapper is captured separately
pattern = re.compile(r"<XML>(.*?)<\/XML>", flags=re.MULTILINE | re.DOTALL)
url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012324011775/0000950123-24-011775.txt"
headers = {'User-Agent': '[email protected]', "Accept-Encoding": "gzip, deflate", 'Host': 'www.sec.gov'}
req = requests.get(url, headers=headers)
for index, match in enumerate(pattern.finditer(req.text)):
    ## just the first 10 characters of the XML document
    print(index, match.group(1)[:10].strip() + "...")
This gives me:
0 <?xml ver...
1 <informat...
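To go from those matches to the infoTable elements, something along these lines might work. It is only a sketch: the helper names `local_name` and `extract_info_tables` are made up for illustration, the sample text and its namespace URI are invented rather than taken from the real filing, and tags are matched by local name on the assumption that the embedded documents may declare a namespace.

```python
import re
import xml.etree.ElementTree as ET

# Same idea as the pattern above: capture the body of each <XML>...</XML> wrapper.
XML_BLOCK = re.compile(r"<XML>(.*?)</XML>", flags=re.DOTALL)


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag name (hypothetical helper)."""
    return tag.rsplit("}", 1)[-1]


def extract_info_tables(filing_text):
    """Return one dict of leaf values per infoTable element (hypothetical helper)."""
    rows = []
    for match in XML_BLOCK.finditer(filing_text):
        # strip() matters: an XML declaration must be the very first thing the parser sees.
        root = ET.fromstring(match.group(1).strip())
        for elem in root.iter():
            if local_name(elem.tag) != "infoTable":
                continue
            # Keep only leaf elements, so nested groups are flattened into one dict.
            rows.append({
                local_name(leaf.tag): (leaf.text or "").strip()
                for leaf in elem.iter()
                if len(leaf) == 0
            })
    return rows


# Invented sample: the header line with a bare '&' is never handed to the XML parser.
sample = """<SEC-HEADER>
CLASSIFICATION: FIRE, MARINE & CASUALTY INSURANCE
</SEC-HEADER>
<DOCUMENT>
<XML>
<?xml version="1.0" encoding="UTF-8"?>
<informationTable xmlns="http://example.com/ns">
  <infoTable>
    <nameOfIssuer>EXAMPLE CORP</nameOfIssuer>
    <shrsOrPrnAmt><sshPrnamt>100</sshPrnamt></shrsOrPrnAmt>
  </infoTable>
</informationTable>
</XML>
</DOCUMENT>
"""

print(extract_info_tables(sample))
```

With the real filing you would pass `req.text` instead of `sample`. Flattening to leaf elements loses the grouping of nested fields, so if two leaves inside one infoTable share a name, the later one overwrites the earlier one.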