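I have a slightly unusual requirement: I want to get the id of a div based on the text it displays on the web page. For example, suppose I have the following HTML: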
<div class="productTabRightCompatibility">
<h2>
Product Downloads
</h2>
<ul class="listColumn">
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_0">
</div>
<a href="/-/dummy_link_one_technical_drawing" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_0" target="_blank">
ProductOne Technical Drawing
</a>
</li>
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_1">
</div>
<a href="/-/dummy_link_two_cad_drawing" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_1" target="_blank">
ProductOne CAD Drawing
</a>
</li>
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_2">
</div>
<a href="/-/dummy_link_three_installation_manual" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_2" target="_blank">
ProductOne Installation Manual
</a>
</li>
</ul>
</div>
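Unfortunately, the site does not always list these in the same order, so sometimes the technical drawing has the id ending in ResourceLink_0 and sometimes the CAD drawing does. The only constant is that the item I want is labeled with the text "[Product#] Technical Drawing". I would like to go through multiple pages and get the link that corresponds to the technical drawing, regardless of order. Currently I iterate over all the links and pick the one whose address contains "technical_drawing" somewhere, but I am wondering whether there is a better way to get the result.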
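For reference, the approach described above might look roughly like the sketch below. The helper name `hrefs_containing` and the shortened sample markup are assumptions made for illustration, not code from the actual site.

```python
from bs4 import BeautifulSoup

def hrefs_containing(html, needle="technical_drawing"):
    """Return every link address that contains `needle`."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that have no href attribute at all
    return [a["href"] for a in soup.find_all("a", href=True) if needle in a["href"]]

# Shortened, hypothetical version of the markup in the question
sample = (
    '<a href="/-/dummy_link_two_cad_drawing">ProductOne CAD Drawing</a>'
    '<a href="/-/dummy_link_one_technical_drawing">ProductOne Technical Drawing</a>'
)
print(hrefs_containing(sample))
```

This depends on the address, not on the displayed text, so it only works as long as the site keeps "technical_drawing" in the link.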
Using the BeautifulSoup and re packages, you should be able to do something like this:
from bs4 import BeautifulSoup
import re
html = """<div class="productTabRightCompatibility">
<h2>
Product Downloads
</h2>
<ul class="listColumn">
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_0">
</div>
<a href="/-/dummy_link_one_technical_drawing" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_0" target="_blank">
ProductOne Technical Drawing
</a>
</li>
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_1">
</div>
<a href="/-/dummy_link_two_cad_drawing" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_1" target="_blank">ProductOne CAD Drawing</a>
</li>
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_2">
</div>
<a href="/-/dummy_link_three_installation_manual" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_2" target="_blank">
ProductOne Installation Manual
</a>
</li>
</ul>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of the older text= argument (bs4 4.4.0 and later)
a_link = soup.find('a', string=re.compile("ProductOne Technical Drawing"))
print(a_link.get('href'))
OUTPUT:
/-/dummy_link_one_technical_drawing
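Since the product name changes from page to page, it may help to match only the constant part of the text and to handle pages that have no such link. The following is a minimal sketch under those assumptions; the function name `technical_drawing_href` and the sample markup are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Matches link text such as "ProductOne Technical Drawing", ignoring case
TECH_DRAWING = re.compile(r"Technical\s+Drawing", re.IGNORECASE)

def technical_drawing_href(html):
    """Return the href of the first <a> whose text mentions a technical drawing, or None."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", string=TECH_DRAWING)
    # find() returns None when nothing matches, so guard before reading href
    return link.get("href") if link is not None else None

# Hypothetical page for a different product, with the links in another order
sample = """
<ul>
  <li><a href="/-/dummy_link_two_cad_drawing">ProductTwo CAD Drawing</a></li>
  <li><a href="/-/dummy_link_one_technical_drawing">
    ProductTwo Technical Drawing
  </a></li>
</ul>
"""
print(technical_drawing_href(sample))
print(technical_drawing_href("<p>no links here</p>"))
```

Because the regex is applied with a search rather than a full match, the surrounding whitespace and the product name in front of "Technical Drawing" do not need to be spelled out.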
With re, you can search on a tag's text and then read that tag's href value. I have used find_all here in case the page contains more than one matching element.
import bs4
import re
html_doc='''<html><div class="productTabRightCompatibility">
<h2>
Product Downloads
</h2>
<ul class="listColumn">
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_0">
</div>
<a href="/-/dummy_link_one_technical_drawing" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_0" target="_blank">
ProductOne Technical Drawing
</a>
</li>
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_1">
</div>
<a href="/-/dummy_link_two_cad_drawing" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_1" target="_blank">
ProductOne CAD Drawing
</a>
</li>
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_2">
</div>
<a href="/-/dummy_link_three_installation_manual" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_2" target="_blank">
ProductOne Installation Manual
</a>
</li>
</ul>
</div></html>'''
soup = bs4.BeautifulSoup(html_doc, 'html.parser')
# find_all returns every <a> whose text matches the pattern
items = soup.find_all('a', string=re.compile("Technical Drawing"))
for item in items:
    print(item['href'])
Output:
/-/dummy_link_one_technical_drawing
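To cover the "multiple pages" part of the question, the same find_all call could be wrapped in a loop. This is only a sketch: the `drawing_links` helper, the example.com URLs, and the inline page contents are all hypothetical, and fetching the pages is left out. Relative links are resolved with `urllib.parse.urljoin` from the standard library.

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def drawing_links(pages, pattern=r"Technical\s+Drawing"):
    """Map each page URL to the absolute URLs of its matching links."""
    regex = re.compile(pattern)
    results = {}
    for page_url, html in pages.items():
        soup = BeautifulSoup(html, "html.parser")
        # Keep only <a> tags that have an href and whose text matches
        anchors = soup.find_all("a", string=regex, href=True)
        results[page_url] = [urljoin(page_url, a["href"]) for a in anchors]
    return results

# Hypothetical pages, already downloaded and keyed by their URL
pages = {
    "https://example.com/products/one":
        '<a href="/-/dummy_link_one_technical_drawing">\nProductOne Technical Drawing\n</a>',
    "https://example.com/products/two":
        '<a href="/-/dummy_link_two_cad_drawing">ProductTwo CAD Drawing</a>',
}
for url, links in drawing_links(pages).items():
    print(url, links)
```

A page without a technical drawing simply maps to an empty list, so the caller can tell which products are missing one.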
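You can avoid find and the regex altogether and use a CSS attribute selector instead, which is typically faster. The $= (ends with) operator matches attribute values that end with the given string. Note that this matches on the link address rather than on the displayed text:
[href$='technical_drawing']
Code: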
from bs4 import BeautifulSoup as bs
html='''<html><div class="productTabRightCompatibility">
<h2>
Product Downloads
</h2>
<ul class="listColumn">
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_0">
</div>
<a href="/-/dummy_link_one_technical_drawing" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_0" target="_blank">
ProductOne Technical Drawing
</a>
</li>
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_1">
</div>
<a href="/-/dummy_link_two_cad_drawing" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_1" target="_blank">
ProductOne CAD Drawing
</a>
</li>
<li>
<div class="iconSprite icon16 iconDownloads" id="layoutmain_1_ProductTabs1_rptResources_divResourceImage_2">
</div>
<a href="/-/dummy_link_three_installation_manual" id="layoutmain_1_ProductTabs1_rptResources_hlResourceLink_2" target="_blank">
ProductOne Installation Manual
</a>
</li>
</ul>
</div></html>'''
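# The 'lxml' parser must be installed separately; 'html.parser' also works here
soup = bs(html, 'lxml')
# select_one returns the first element that matches the selector
link = soup.select_one("[href$='technical_drawing']")['href']
print(link)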
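One caveat with the snippet above: select_one returns None when no element matches, and indexing None raises a TypeError. A small wrapper can guard against that. This is a sketch, and `select_drawing_href` and its `suffix` parameter are names invented for the example.

```python
from bs4 import BeautifulSoup

def select_drawing_href(html, suffix="technical_drawing"):
    """Return the href of the first <a> whose address ends with `suffix`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # a[href$="..."] restricts the match to <a> tags whose href ends with the suffix
    link = soup.select_one(f'a[href$="{suffix}"]')
    return link["href"] if link is not None else None

# Shortened, hypothetical markup
sample = (
    '<a href="/-/dummy_link_three_installation_manual">ProductOne Installation Manual</a>'
    '<a href="/-/dummy_link_one_technical_drawing">ProductOne Technical Drawing</a>'
)
print(select_drawing_href(sample))
print(select_drawing_href('<a href="/-/other">Other</a>'))
```

The built-in html.parser is used here only so the sketch has no extra dependency; lxml would work the same way for this markup.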