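I don't come from a web design or website/HTML background, and I am new to this field.
I am trying to scrape elements from this link, which contains containers/cards.
I tried the code below and had some success, but I don't know how to do it properly so that I get only the information content, without HTML/CSS elements in the results.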
from bs4 import BeautifulSoup as bs
import requests
url = 'https://ihgfdelhifair.in/mis/Exhibitors'
page = requests.get(url)
soup = bs(page.text, 'html')
cards = soup.find_all('div', class_="row Exhibitor-Listing-box")
cards
The output looks like this:
[<div class="row Exhibitor-Listing-box">
<div class="col-md-3">
<div class="card">
<div class="container">
<h4><b> 1 ARTIFACT DECOR (INDIA)</b></h4>
<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> [email protected]</p>
<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span> SHEENU</p>
<p style="margin-bottom: 5px!important; font-size: 13px;"><span>State : </span> UTTAR PRADESH</p>
<p style="margin-bottom: 5px!important; font-size: 13px;"><span>City : </span> AGRA</p>
<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Hall No. : </span> 12</p>
<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Stand No. : </span> G-15/43</p>
<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Mobile No. : </span> +91-5624010111, +91-7055166000</p>
<p style="margin-bottom: 5px!important; font-size: 11px;"><span>Website : </span> www.artifactdecor.com</p>
<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Source Retail : </span> Y</p>
<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Vriksh Certified : </span> N</p>
</div>
Now, when I use the code below to extract the elements:
for element in cards:
    title = element.find_all('h4')
    email = element.find_all('p')
    print(title)
    print(email)
Output: it gives me the information I need, but it includes HTML/CSS content that I don't want.
[<h4><b> 1 ARTIFACT DECOR (INDIA)</b></h4>, <h4><b> 10G HOUSE OF CRAFT</b></h4>, <h4><b> 2 S COLLECTION</b></h4>, <h4><b> ........]
[<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> [email protected]</p>, <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span> ..................]
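So how can I extract the title, email, contact person, state, and city from this without the HTML/CSS in the results?
As Manos Kounelakis suggested, what you are probably looking for is the text attribute of BeautifulSoup HTML elements. It is also more natural to split the HTML on elements with the class card, since each one corresponds to one visual card on the screen. Here is some code that prints the information nicely: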
import requests
from bs4 import BeautifulSoup as bs
url = "https://ihgfdelhifair.in/mis/Exhibitors"
page = requests.get(url)
soup = bs(page.text, features="html5lib")
cards = soup.find_all("div", class_="card")
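for element in cards:
    # .text returns only the text content, with the tags stripped
    title = element.find("h4").text
    # split() + join() collapses runs of whitespace into single spaces
    other_info = [" ".join(elem.text.split()) for elem in element.find_all("p")]
    print("Title:", title)
    for info in other_info:
        print(info)
    print("-" * 80)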
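If you need the individual fields (title, email, contact person, state, city) rather than printed lines, one possible approach is to turn each card into a dictionary by splitting every paragraph on its first colon. The following is only a sketch under some assumptions: the helper name parse_cards is hypothetical, the sample markup is trimmed from the output shown in the question, the email address is a made-up placeholder, and the built-in "html.parser" is used instead of html5lib so that no extra package is required.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the markup in the question.
# The email address is a placeholder, not real data.
SAMPLE_HTML = """
<div class="row Exhibitor-Listing-box">
<div class="col-md-3">
<div class="card">
<div class="container">
<h4><b> 1 ARTIFACT DECOR (INDIA)</b></h4>
<p><span>Email : </span> info@example.com</p>
<p><span>Contact Person : </span> SHEENU</p>
<p><span>State : </span> UTTAR PRADESH</p>
<p><span>City : </span> AGRA</p>
</div>
</div>
</div>
</div>
"""


def parse_cards(html):
    """Return one dict per card, mapping each label to its value (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.find_all("div", class_="card"):
        heading = card.find("h4")
        # Collapse whitespace so the title has no leading or trailing spaces
        title = " ".join(heading.get_text().split()) if heading else ""
        record = {"Title": title}
        for p in card.find_all("p"):
            text = " ".join(p.get_text().split())
            # Split on the first colon only, so values containing ":" stay intact
            label, sep, value = text.partition(":")
            if sep:
                record[label.strip()] = value.strip()
        records.append(record)
    return records


if __name__ == "__main__":
    for record in parse_cards(SAMPLE_HTML):
        print(record["Title"], "|", record.get("Email"), "|", record.get("City"))
```

To run this against the live page, you could pass page.text from the requests call above to parse_cards instead of SAMPLE_HTML. Using record.get(...) avoids a KeyError for cards that lack a given field.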