Here is a link to the article about Wikipedia dumps and how to use the multistream version, so that I don't have to open the whole file to parse it. Here is the library it suggests using.
My problem is that I don't know how to correctly use the index file or that library to parse the dump. When I try to decompress it, I just read a string of empty bytes ("b"). What I want is to be able to parse the file a few thousand characters at a time so I can feed them into my NLP application.
Thanks in advance.
I found some code from the wikidump link!
The code is here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dumps/+/ariel/toys/bz2multistream
If you read through the wikiarticles.py script, you will find the snippet below:
def retrieve_text(self, title, offset):
    '''
    retrieve the page text for a given title from the xml file
    this does decompression of a bz2 stream so it's more expensive than
    other parts of this class

    arguments:
    title  -- the page title, with spaces and not underscores, case sensitive
    offset -- the offset in bytes to the bz2 stream in the xml file which contains
              the page text

    returns the page text or None if no such page was found
    '''
    self.xml_fd.seek(offset)
    unzipper = bz2.BZ2Decompressor()
    out = None
    found = False
    try:
        block = self.xml_fd.read(262144)
        out = unzipper.decompress(block)
        # hope we got enough back to have the page text
    except:
        raise

    # format of the contents (and there are multiple pages per stream):
    # <page>
    #   <title>AccessibleComputing</title>
    #   <ns>0</ns>
    #   <id>10</id>
    #   ...
    # </page>
So for your question, maybe you should follow these steps: first, find the byte offset of the page you want in the index file; then use Python's open() function to get a file descriptor, keep that fd reference in a variable, call fd.seek(offset) to jump to the offset, and call fd.read(block_size_bytes) to read the compressed page data.
Read wikiarticles.py again and you will have your answer.
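For concreteness, here is a minimal sketch of those steps. The dump path and offset below are made up for illustration; the real offset has to come from the index file entry of the page you want:

import bz2

# Hypothetical values: use your own dump path and an offset taken from the index file.
dump_path = "enwiki-pages-articles-multistream.xml.bz2"
offset = 600

with open(dump_path, "rb") as fd:
    fd.seek(offset)                      # jump to the start of one bz2 stream
    block = fd.read(256 * 1024)          # read a chunk of compressed bytes
    unzipper = bz2.BZ2Decompressor()
    data = unzipper.decompress(block)    # decompress just this stream's data
    print(data[:2000].decode("utf-8", errors="replace"))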
The question was asked almost four years ago...
Here is an implementation based on the code mentioned in KevinLoveCherry's answer. The index file gives you the offset, the page ID, and the page title. This code reads the multistream dump file and extracts the wikitext of the article you want. It runs in about 0.1 seconds on my 2022 laptop.
Call the get_wikitext() function to get the article text. Pass in offset and page_id, or offset and the page title.
import xml.etree.ElementTree as ET
import bz2

def get_wikitext(dump_filename, offset, page_id=None, title=None, namespace_id=None, verbose=True, block_size=256*1024):
    """Extract wikitext from a multistream dump file.

    Requires the offset (in bytes) from the start of the dump file.
    This can be obtained from the index file.

    Pass in at least one of page_id, namespace_id, or title to identify
    the page you're looking for.
    """
    unzipper = bz2.BZ2Decompressor()

    # Read the compressed stream, decompress the data
    uncompressed_data = b""
    with open(dump_filename, "rb") as infile:
        infile.seek(int(offset))
        while True:
            compressed_data = infile.read(block_size)
            try:
                uncompressed_data += unzipper.decompress(compressed_data)
            except EOFError:
                # We've reached the end of the stream
                break
            # If there's no more data in the file
            if not compressed_data:
                # End if we've finished reading the stream
                if unzipper.eof:
                    break
                # Otherwise we've failed to correctly read all of the stream
                raise Exception("Failed to read a complete stream")
    # Extract out the page
    # Format of the contents (and there are multiple pages per stream):
    # <page>
    #   <title>AccessibleComputing</title>
    #   <ns>0</ns>
    #   <id>10</id>
    #   ...
    # </page>
    uncompressed_text = uncompressed_data.decode("utf-8")
    xml_data = "<root>" + uncompressed_text + "</root>"
    root = ET.fromstring(xml_data)
    for page in root.findall("page"):
        if title is not None:
            if title != page.find("title").text:
                continue
        if namespace_id is not None:
            if namespace_id != int(page.find("ns").text):
                continue
        if page_id is not None:
            if page_id != int(page.find("id").text):
                continue
        # We've found what we're looking for
        revision = page.find("revision")
        wikitext = revision.find("text")
        return wikitext.text

    # We failed to find what we were looking for
    return None

def example():
    index_line = "600:12:Anarchism"
    # Index lines are offset:page_id:title; titles may contain colons,
    # so only split on the first two.
    offset, page_id, title = index_line.split(":", 2)
    dump_file = "enwiki-dump/enwiki-20231101-pages-articles-multistream.xml.bz2"
    wikitext = get_wikitext(dump_file, int(offset), page_id=int(page_id))
    print(wikitext)
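The example() above hard-codes the index line. In practice you would look it up in the multistream index file that is published alongside the dump, a bz2-compressed text file with one offset:page_id:title line per page. A rough sketch, assuming file names like the ones below (adjust them to match your download):

import bz2

def find_in_index(index_filename, wanted_title):
    """Scan the multistream index file for a title; return (offset, page_id) or None."""
    with bz2.open(index_filename, mode="rt", encoding="utf-8") as index_file:
        for line in index_file:
            # Titles may contain colons, so only split on the first two.
            offset, page_id, title = line.rstrip("\n").split(":", 2)
            if title == wanted_title:
                return int(offset), int(page_id)
    return None

# Hypothetical file names; adjust to wherever your dump and index live.
index_file = "enwiki-dump/enwiki-20231101-pages-articles-multistream-index.txt.bz2"
dump_file = "enwiki-dump/enwiki-20231101-pages-articles-multistream.xml.bz2"

found = find_in_index(index_file, "Anarchism")
if found is not None:
    offset, page_id = found
    print(get_wikitext(dump_file, offset, page_id=page_id))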