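Unable to get all the data, including links, from inside a tr tag

Question (votes: 0, answers: 4)

I've written a script in Python to fetch data from some HTML elements in a table. As a rough sample, I've picked the data inside a single tr tag. My goal is to get the data together with the href of each link under the class fn. What I've tried so far parses everything except those links. How can I change my script below to get the links as well? Thanks in advance for any solution.

This is what I've tried so far: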

from bs4 import BeautifulSoup

content="""
<tr>
    <td align="center">1964</td>
    <td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
    <span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
    <span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
    <td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
    <td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
    <span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
    <td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
    <td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
    <td align="center">—</td>
</tr>
"""
soup = BeautifulSoup(content,"lxml")
for items in soup.select('tr'):
    item_name = [item.text for item in items.select(".fn a")]
    print(item_name)

My current output:

['Charles Hard Townes', 'Nikolay Basov', 'Alexander Prokhorov', 'Dorothy Hodgkin', 'Konrad Emil Bloch', 'Feodor Felix Konrad Lynen', 'Jean-Paul Sartre', 'Martin Luther King, Jr.']

To reiterate: my expected output is all of the data, including the href of each link under the class fn.

python python-3.x web-scraping beautifulsoup
4 Answers
2
votes

This modified code gets me the href along with the data:

from bs4 import BeautifulSoup

content="""
<tr>
    <td align="center">1964</td>
    <td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
    <span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
    <span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
    <td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
    <td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
    <span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
    <td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
    <td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
    <td align="center">—</td>
</tr>
"""
soup = BeautifulSoup(content,"lxml")
for items in soup.select('tr'):
    item_name = [[item.text,item.get('href')] for item in items.select(".fn a")]
    print(item_name)

OUTPUT

[['Charles Hard Townes', '/wiki/Charles_Hard_Townes'], ['Nikolay Basov', '/wiki/Nikolay_Basov'], ['Alexander Prokhorov', '/wiki/Alexander_Prokhorov'], ['Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'], ['Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'], ['Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'], ['Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'], ['Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.']]
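
The href values in this output are site-relative. If absolute URLs are needed, one option is `urllib.parse.urljoin` from the standard library. The sketch below is only illustrative: the base URL is an assumption (the question does not say which site the table came from), and `pairs` is a hand-copied subset of the output above.

```python
from urllib.parse import urljoin

# Assumed base URL; the original post does not state the source site.
BASE_URL = "https://en.wikipedia.org/"

# A few [text, href] pairs copied from the output above.
pairs = [
    ["Charles Hard Townes", "/wiki/Charles_Hard_Townes"],
    ["Dorothy Hodgkin", "/wiki/Dorothy_Hodgkin"],
    ["Martin Luther King, Jr.", "/wiki/Martin_Luther_King,_Jr."],
]

# Resolve each relative href against the base URL.
absolute = [(name, urljoin(BASE_URL, href)) for name, href in pairs]
for name, url in absolute:
    print(name, url)
```

Fragment-only links such as `#endnote_Note1D` would resolve to the base URL plus the fragment, so it may be worth filtering them out first.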

3
votes

You can use bs4 or a regular expression:

bs4

from bs4 import BeautifulSoup as soup
# `content` is the HTML string from the question.
s = soup(content, 'lxml')
# Read text and href from the same tag so the two can never get out of step.
new_data = [(i.text, i['href']) for i in s.find_all('a', href=True)]

Output:

[(u'Charles Hard Townes', '/wiki/Charles_Hard_Townes'), (u'Nikolay Basov', '/wiki/Nikolay_Basov'), (u'Alexander Prokhorov', '/wiki/Alexander_Prokhorov'), (u'Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'), (u'Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'), (u'Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'), (u'Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'), (u'[D]', '#endnote_Note1D'), (u'Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.')]

Regular expression:

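import re
# Each match is a (href, title) tuple with exactly one non-empty group; keep that group.
# next(filter(...)) replaces filter(...)[0], which only works on Python 2.
new_data = [next(filter(None, m)) for m in re.findall('href="(.*?)"|title="(.*?)">', content)]
final_data = [(new_data[i], new_data[i+1]) for i in range(0, len(new_data)-1, 2)]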

Output:

[('/wiki/Charles_Hard_Townes', 'Charles Hard Townes'), ('/wiki/Nikolay_Basov', 'Nikolay Basov'), ('/wiki/Alexander_Prokhorov', 'Alexander Prokhorov'), ('/wiki/Dorothy_Hodgkin', 'Dorothy Hodgkin'), ('/wiki/Konrad_Emil_Bloch', 'Konrad Emil Bloch'), ('/wiki/Feodor_Felix_Konrad_Lynen', 'Feodor Felix Konrad Lynen'), ('/wiki/Jean-Paul_Sartre', 'Jean-Paul Sartre'), ('#endnote_Note1D', '/wiki/Martin_Luther_King,_Jr.')]
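
Note the last pair in the regex output above: the footnote link `#endnote_Note1D` has no title, so it gets paired with the next href and the final name is lost. One way to avoid this is to capture both attributes in a single pattern. The sketch below assumes href always comes before title inside the same `<a>` tag, which holds for this sample but is not guaranteed for arbitrary HTML. The `sample` string is a shortened copy of the question's HTML.

```python
import re

# Shortened copy of the question's HTML: a titled link, an untitled
# footnote link, then another titled link.
sample = (
    '<span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span>'
    '<sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup>'
    '<span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" '
    'title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span>'
)

# [^>]* cannot cross a '>' so both groups must come from the same tag;
# an <a> without a title is skipped instead of shifting later pairs.
pattern = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*?title="([^"]*)"')
final_data = pattern.findall(sample)
print(final_data)
```

This is still fragile compared with a real HTML parser, so the bs4 variant is probably the safer choice for anything beyond a quick one-off.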

2
votes

Slightly simpler: there is no need to select the table rows separately.

from bs4 import BeautifulSoup

# `content` is the HTML string from the question.
soup = BeautifulSoup(content, "lxml")
links = soup.select('tr .fn a')
for link in links:
    print(link.attrs['href'])
    print(link.text)

0
votes

You can try bs4 instead of using a regular expression:

from bs4 import BeautifulSoup

content="""
<tr>
    <td align="center">1964</td>
    <td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
    <span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
    <span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
    <td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
    <td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
    <span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
    <td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
    <td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
    <td align="center">—</td>
</tr>
"""

soup = BeautifulSoup(content,"lxml")
for i in soup.find_all('td'):
    a = i.find('a')  # find() returns only the first link in each cell
    if a is not None:
        print((a.attrs['title'], a.attrs['href']))

Output:

('Charles Hard Townes', '/wiki/Charles_Hard_Townes')
('Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin')
('Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch')
('Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre')
('Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.')
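
The output above lists only five of the eight names because `find('a')` returns just the first link in each cell. A possible variation, assuming every `.fn` link is wanted and grouped by cell, is to call `select` inside the loop. This is a sketch on a shortened copy of the question's HTML. It uses Python's built-in `html.parser` instead of `lxml` only to avoid an extra dependency, and the names `sample`, `rows` and `cell` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML: a cell with no links, a cell with
# two laureates, and a cell with one laureate plus a footnote link.
sample = """
<tr>
    <td align="center">1964</td>
    <td><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span>;<br>
    <span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></td>
    <td><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span><sup class="reference"><a href="#endnote_Note1D">[D]</a></sup></td>
</tr>
"""

soup = BeautifulSoup(sample, "html.parser")
rows = []
for cell in soup.find_all("td"):
    # select() returns every matching link in the cell, not just the first;
    # the .fn prefix leaves out the footnote link inside <sup>.
    links = [(a.get_text(), a["href"]) for a in cell.select(".fn a[href]")]
    if links:
        rows.append(links)
print(rows)
```

Each inner list corresponds to one table cell, so co-laureates stay together.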