How to scrape any table from Wikipedia in Python?

I want to scrape tables from Wikipedia in Python. Wikipedia is a great source to get data from, but the data it holds is in HTML format, which is very unfriendly and cannot be used directly...


import re
import requests
from lxml import html

res = requests.get('https://en.wikipedia.org/wiki/Unicode_block').content
tree = html.fromstring(res)

UNICODE_BLOCKS = []
# Each code-point range is rendered like "U+0000..U+007F" inside a span.monospaced cell.
for block in tree.xpath(".//table[contains(@class, 'wikitable')]/tbody/tr/td/span[@class='monospaced']"):
    codes = block.text
    start, end = (int(i[2:], 16) for i in codes.split('..'))
    row = block.xpath('./ancestor::tr')[0]
    # Strip newlines and footnote markers such as "[1]" from the block name.
    block_name = re.sub(r'\n|\[\w+\]', '', row.find('./td[3]/a').text)
    assigned = int(row.find('./td[5]').text.replace(',', ''))
    scripts = row.find('./td[6]').text_content()
    if ',' in scripts:
        systems = []
        for script in scripts.split(', '):
            i = script.index('(')
            name = script[:i-1]
            count = int(script[i+1:].split(" ")[0].replace(',', ''))
            systems.append((name, count))
    else:
        systems = [(scripts.strip(), assigned)]
    UNICODE_BLOCKS.append((start, end, block_name, assigned, systems))

It does exactly what I want, and I have painstakingly verified that it is correct, but as you can see it is quite convoluted, and it only works for that specific table. While I can get by with simple tables like the ones it lists, Wikipedia has many tables with merged cells that my strategy cannot cope with. A simple example is the second table on the linked page (see the sketch after the expected output below). How would I turn it into the following:

[
    (0x1000, 0x105f, 'Tibetan', '1.0.0', '1.0.1', 'Myanmar', 'Tibetan', 96, 71, 'Tibetan'),
    (0x3400, 0x3d2d, 'Hangul', '1.0.0', '2.0', 'CJK Unified Ideographs Extension A', 'Hangul Syllables', 2350, 2350, 'Hangul'),
    (0x3d2e, 0x44b7, 'Hangul Supplementary-A', '1.1', '2.0', 'CJK Unified Ideographs Extension A', 'Hangul Syllables', 1930, 1930, 'Hangul'),
    (0x44b8, 0x4dff, 'Hangul Supplementary-B', '1.1', '2.0', 'CJK Unified Ideographs Extension A and Yijing Hexagram Symbols', 'Hangul Syllables', 2376, 2376, 'Hangul')
]
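
The usual way to cope with merged cells is to expand every rowspan/colspan into a rectangular grid before reading rows off the table. The following is only a minimal sketch of that idea with lxml; the function name table_to_grid and the choice to simply repeat the spanned text are my own, and the cell text still contains footnote markers that would need stripping afterwards:

import requests
from lxml import html

def table_to_grid(table):
    """Expand a <table> element into a list of text rows, repeating the
    value of any cell that carries rowspan/colspan so that every row
    ends up with one entry per visual column."""
    grid = []
    pending = {}  # column index -> (rows_still_covered, text)
    for tr in table.xpath('./tbody/tr | ./tr'):
        row = []
        col = 0
        for cell in tr.xpath('./th | ./td'):
            # Columns still occupied by a rowspan started in an earlier row.
            while col in pending:
                rows_left, text = pending.pop(col)
                if rows_left > 1:
                    pending[col] = (rows_left - 1, text)
                row.append(text)
                col += 1
            text = cell.text_content().strip()
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))
            for _ in range(colspan):
                if rowspan > 1:
                    pending[col] = (rowspan - 1, text)
                row.append(text)
                col += 1
        # Trailing columns covered only by rowspans.
        while col in pending:
            rows_left, text = pending.pop(col)
            if rows_left > 1:
                pending[col] = (rows_left - 1, text)
            row.append(text)
            col += 1
        grid.append(row)
    return grid

res = requests.get('https://en.wikipedia.org/wiki/Unicode_block').content
tree = html.fromstring(res)
second_table = tree.xpath(".//table[contains(@class, 'wikitable')]")[1]
for row in table_to_grid(second_table):
    print(row)

Turning those raw rows into the exact tuples above (hex ranges, version strings, counts) still needs per-table post-processing, but at least the merged cells no longer shift columns around.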

I remember running into many tables like the one above, but when I went looking for them specifically I somehow struggled to find one. I did, however, manage to find the following page:

https://en.wikipedia.org/wiki/lindsey_stirling_discography

How would I turn the singles table into the following (see the sketch after this list):

[
    ('"Crystallize"', 2012, 'Lindsey Stirling'),
    ('"Beyond the Veil"', 2014, 'Shatter Me'),
    ('"Shatter Me" featuring Lzzy Hale)', 2014, 'Shatter Me'),
    ('"Take Flight"', 2014, 'Shatter Me'),
    ('"Master of Tides"', 2014, 'Shatter Me'),
    ('"Hallelujah"', 2015, 'Non-album single'),
    ('"The Arena"', 2016, 'Brave Enough'),
    ('"Something Wild" (featuring Andrew McMahon)', 2016, "Brave Enough and Pete's Dragon"),
    ('"Prism"', 2016, 'Brave Enough'),
    ('"Hold My Heart" (featuring ZZ Ward)', 2016, 'Brave Enough'),
    ('"Love\'s Just a Feeling" (featuring Rooty)', 2017, 'Brave Enough'),
    ('"Dance of the Sugar Plum Fairy"', 2017, 'Warmer in the Winter'),
    ('"Christmas C\'mon" (featuring Becky G)', 2017, 'Warmer in the Winter'),
    ('"Warmer in the Winter" (featuring Trombone Shorty)', 2018, 'Warmer in the Winter'),
    ('"Carol of the Bells"', 2018, 'Warmer in the Winter'),
    ('"Underground"', 2019, 'Artemis'),
    ('"The Upside" (solo or featuring Elle King)', 2019, 'Artemis'),
    ('"Artemis"', 2019, 'Artemis'),
    ('"What You\'re Made Of" (featuring Kiesza)', 2020, 'Azur Lane Soundtrack'),
    ('"Lose You Now"', 2021, 'Lose You Now'),
    ('"Joy to the World"', 2022, 'Snow Waltz'),
    ('"Sleigh Ride"', 2023, 'Snow Waltz'),
    ('"Kashmir"', 2023, 'Non-album single'),
    ('"Carol of the Bells" (Live from Summer Tour 2023)', 2023, 'Non-album single'),
    ('"Heavy Weight"', 2023, 'Beat Saber Original Soundtrack Vol. 6'),
    ('"Eye of the Untold Her"', 2024, 'Duality'),
    ('"Inner Gold" (featuring Royal & the Serpent)', 2024, 'Duality'),
    ('"You\'re a Mean One, Mr. Grinch" featuring Sabrina Carpenter)', 2024, 'Warmer in the Winter'),
]
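
For the discography page the same grid expansion applies; the only new step is locating the singles table. Below is a hedged sketch that reuses table_to_grid from the sketch above. I am assuming the "Singles" heading carries id="Singles" (either on the heading element itself or on a nested span, depending on how MediaWiki rendered the page) and that the first wikitable after it is the one wanted; the XPath will need adjusting if the page is structured differently.

import requests
from lxml import html

res = requests.get('https://en.wikipedia.org/wiki/Lindsey_Stirling_discography').content
tree = html.fromstring(res)

# The anchor id may sit on the heading itself or on a child <span>.
singles_table = tree.xpath(
    "(//*[@id='Singles' or span/@id='Singles']"
    "/following::table[contains(@class, 'wikitable')])[1]"
)[0]

for row in table_to_grid(singles_table):  # table_to_grid: see the sketch above
    print(row)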

I have seen plenty of similar questions, and many of them use pandas + bs4. I don't like pandas or bs4 and I don't personally use them, and this question is not about them; but to show my research, I just downloaded pandas, which forced me to also download html5lib and beautifulsoup4. I use both of them very rarely, in fact I can't remember ever having used them; I mostly use aiohttp + lxml (although in this case I used requests).

Now, the following code does not work, and this question is about making it work:
import pandas as pd
import requests
from lxml import etree, html

res = requests.get('https://en.wikipedia.org/wiki/Unicode_block').content
tree = html.fromstring(res)

pd.read_html(etree.tostring(tree.xpath(".//table[contains(@class, 'wikitable')]/tbody")[0]))[0]
It raises the error:
ValueError: No tables found

I included that code to keep this question from being closed as a duplicate of the ones whose answer is just "use pandas". I don't like pandas.

So what is the proper way to scrape tables from Wikipedia? The answer shouldn't boil down to "just use pandas". Or, if pandas really is the preferred way, then the answer has to demonstrate how it correctly parses all tables on Wikipedia, and in particular how it parses the two example tables into the formats given above.

	
pd.read_html()

寻找

<table>

元素。您正在返回表中的
<tbody>
元素,因此找不到表。因此,从XPath字符串中删除
/tbody
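
Applied to the failing line above, that suggestion amounts to something like this (a minimal illustration of the comment, not code from the original post; it still passes raw bytes, which newer pandas versions warn about, as the answer below notes):

# Same call as before, with /tbody dropped from the XPath:
pd.read_html(etree.tostring(tree.xpath(".//table[contains(@class, 'wikitable')]")[0]))[0]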

python web-scraping wikipedia
1 Answer
Passing raw HTML bytes to pd.read_html() is deprecated; you should wrap them in io.BytesIO.

import io

wikitable = tree.xpath(".//table[contains(@class, 'wikitable')]")[0]
pd.read_html(io.BytesIO(etree.tostring(wikitable)))[0]
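
For what it's worth, pandas' HTML reader also expands rowspan/colspan by repeating the spanned value, so the same pattern should cover both example tables from the question. A rough sketch (the match string is my guess at the singles table's caption, and the resulting frames still need cleanup, e.g. footnote markers and multi-level headers, before they match the tuples asked for):

import io
import requests
import pandas as pd
from lxml import etree, html

# Second wikitable on the Unicode block page (the merged-cell example):
res = requests.get('https://en.wikipedia.org/wiki/Unicode_block').content
tree = html.fromstring(res)
tables = tree.xpath(".//table[contains(@class, 'wikitable')]")
print(pd.read_html(io.BytesIO(etree.tostring(tables[1])))[0])

# Singles table on the discography page; match= keeps only tables whose
# text matches the pattern ("List of singles" is a guess at the caption).
res = requests.get('https://en.wikipedia.org/wiki/Lindsey_Stirling_discography').content
print(pd.read_html(io.BytesIO(res), match='List of singles')[0])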
    
