How do I use BeautifulSoup correctly to scrape elements?


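I don't come from a web design or website/HTML background, and I'm new to this area.

I'm trying to scrape elements from this link, which contains containers/cards.

I tried the code below with some success, but I don't know how to do it properly so that I get the information content without the HTML/CSS elements in the result.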

from bs4 import BeautifulSoup as bs
import requests

url = 'https://ihgfdelhifair.in/mis/Exhibitors'

page = requests.get(url)
soup = bs(page.text, 'html')

I'd like to extract (as practice) the information from the following: sample image

cards = soup.find_all('div', class_="row Exhibitor-Listing-box")
cards

The output looks like this:

[<div class="row Exhibitor-Listing-box">
 <div class="col-md-3">
 <div class="card">
 <div class="container">
 <h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> [email protected]</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span>                                                   SHEENU</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>State : </span> UTTAR PRADESH</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>City : </span> AGRA</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Hall No. : </span> 12</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Stand No. : </span> G-15/43</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Mobile No. : </span> +91-5624010111, +91-7055166000</p>
 <p style="margin-bottom: 5px!important; font-size: 11px;"><span>Website : </span> www.artifactdecor.com</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Source Retail : </span> Y</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Vriksh Certified : </span> N</p>
 </div>

Now, when I use the code below to extract the elements:

for element in cards:
    title = element.find_all('h4')
    email = element.find_all('p')
    print(title)
    print(email)

Output: it gives me the information I need, but it includes HTML/CSS content that I don't want.

[<h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>, <h4><b>  10G HOUSE OF CRAFT</b></h4>, <h4><b>  2 S COLLECTION</b></h4>, <h4><b>  ........]
[<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> [email protected]</p>, <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span>        ..................]

So how can I pull out the title, email, contact person, state, and city elements without the HTML/CSS in the result?

python html css beautifulsoup
1 answer

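As Manos Kounelakis suggested, what you are probably looking for is the text attribute of BeautifulSoup HTML elements. It is also more natural to split the HTML on elements with the class card, since each one corresponds to a single visual card on the screen. Here is some code that prints the information nicely: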

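import requests
from bs4 import BeautifulSoup as bs

url = "https://ihgfdelhifair.in/mis/Exhibitors"

page = requests.get(url)
# The "html5lib" parser is a separate package (pip install html5lib)
soup = bs(page.text, features="html5lib")

# Each exhibitor is rendered as one <div class="card">
cards = soup.find_all("div", class_="card")

for element in cards:
    # .text returns only the text inside the tag, without the markup
    title = element.find("h4").text.strip()
    # split() + join() collapses the runs of whitespace inside each <p>
    other_info = [" ".join(elem.text.split()) for elem in element.find_all("p")]
    print("Title:", title)
    for info in other_info:
        print(info)
    print("-" * 80)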
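
If you want the fields as data rather than printed lines, one option is to turn each card into a dictionary keyed by the label in front of the colon. The sketch below is only an illustration: parse_card is a hypothetical helper name, the sample markup is a trimmed copy of the output shown in the question (so it runs without a network request), and it assumes every field follows the "Label : value" pattern seen there. It uses the built-in "html.parser" so no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question, used instead of a live request.
SAMPLE_HTML = """
<div class="row Exhibitor-Listing-box">
 <div class="col-md-3">
  <div class="card">
   <div class="container">
    <h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>
    <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span>          SHEENU</p>
    <p style="margin-bottom: 5px!important; font-size: 13px;"><span>State : </span> UTTAR PRADESH</p>
    <p style="margin-bottom: 5px!important; font-size: 13px;"><span>City : </span> AGRA</p>
   </div>
  </div>
 </div>
</div>
"""


def parse_card(card):
    """Hypothetical helper: convert one <div class="card"> into a dict."""
    record = {"Title": card.find("h4").get_text(strip=True)}
    for p in card.find_all("p"):
        # Collapse runs of whitespace, e.g. "Contact Person : SHEENU"
        text = " ".join(p.get_text().split())
        # Split on the first colon only, so colons in the value are kept
        label, sep, value = text.partition(":")
        if sep:  # skip paragraphs that do not follow the "Label : value" pattern
            record[label.strip()] = value.strip()
    return record


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_card(card) for card in soup.find_all("div", class_="card")]

for record in records:
    print(record["Title"], "|", record.get("State"), "|", record.get("City"))
```

To run this against the real page, build soup from page.text as in the code above; the resulting list of dictionaries can then be filtered, or written out with csv.DictWriter.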