How do I use BeautifulSoup correctly to scrape elements?

Question

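I don't come from a web design or website/HTML background, and I am new to this field.

I am trying to scrape elements from this link, which lays out its content in containers/cards.

I tried the code below with some success, but I don't know how to do it properly so that I get only the information itself, without HTML/CSS elements in the result.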

from bs4 import BeautifulSoup as bs
import requests

url = 'https://ihgfdelhifair.in/mis/Exhibitors'

page = requests.get(url)
soup = bs(page.text, 'html')

As practice, I want to extract information from the following:

cards = soup.find_all('div', class_="row Exhibitor-Listing-box")
cards

This displays the following:

[<div class="row Exhibitor-Listing-box">
 <div class="col-md-3">
 <div class="card">
 <div class="container">
 <h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> [email protected]</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span>                                                   SHEENU</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>State : </span> UTTAR PRADESH</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>City : </span> AGRA</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Hall No. : </span> 12</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Stand No. : </span> G-15/43</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Mobile No. : </span> +91-5624010111, +91-7055166000</p>
 <p style="margin-bottom: 5px!important; font-size: 11px;"><span>Website : </span> www.artifactdecor.com</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Source Retail : </span> Y</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Vriksh Certified : </span> N</p>
 </div>

Now, when I use the code below to extract the elements:

for element in cards:
    title = element.find_all('h4')
    email = element.find_all('p')
    print(title)
    print(email)

Output: it gives me the information I need, but it includes HTML/CSS markup that I don't want.

[<h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>, <h4><b>  10G HOUSE OF CRAFT</b></h4>, <h4><b>  2 S COLLECTION</b></h4>, <h4><b>  ........]
[<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> [email protected]</p>, <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span>        ..................]

So how can I pull out the title, email, contact person, state, and city from this without any HTML/CSS in the result?

Answer 1

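As Manos Kounelakis suggested, what you are probably looking for is the text attribute of BeautifulSoup HTML elements. It is also more natural to split the HTML on elements with the class "card", since each of those corresponds to one visual card on the screen. Here is some code that prints the information nicely: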

import requests
from bs4 import BeautifulSoup as bs

url = "https://ihgfdelhifair.in/mis/Exhibitors"

page = requests.get(url)
soup = bs(page.text, features="html5lib")

cards = soup.find_all("div", class_="card")

for element in cards:
    title = element.find("h4").text
    other_info = [" ".join(elem.text.split()) for elem in element.find_all("p")]
    print("Title:", title)
    for info in other_info:
        print(info)
    print("-" * 80)
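
If you want each field separately (title, contact person, state, city) rather than printed lines, one possible approach is to turn every card into a dictionary keyed by the label inside each span. The following is only a sketch: SAMPLE_HTML and parse_card are hypothetical names introduced here, the sample is a trimmed copy of the markup shown in the question so that the code runs without a network request, and it assumes every p follows the "Label : value" layout. It uses the built-in "html.parser" so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the card markup shown in the question,
# so the sketch runs without requesting the live page.
SAMPLE_HTML = """
<div class="row Exhibitor-Listing-box">
  <div class="col-md-3">
    <div class="card">
      <div class="container">
        <h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>
        <p><span>Contact Person : </span>            SHEENU</p>
        <p><span>State : </span> UTTAR PRADESH</p>
        <p><span>City : </span> AGRA</p>
      </div>
    </div>
  </div>
</div>
"""


def parse_card(card):
    """Return one card's fields as a dict keyed by the <span> label."""
    record = {"Title": card.find("h4").get_text(strip=True)}
    for p in card.find_all("p"):
        label = p.find("span")
        if label is None:
            continue  # paragraph does not follow the "Label : value" layout
        # Collapse runs of whitespace in both the label and the whole paragraph.
        label_text = " ".join(label.get_text().split())
        full_text = " ".join(p.get_text().split())
        key = label_text.rstrip(":").strip()
        # Whatever follows the label in the paragraph is the value.
        record[key] = full_text[len(label_text):].strip()
    return record


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_card(card) for card in soup.find_all("div", class_="card")]
print(records)
```

To run this against the real page, replace SAMPLE_HTML with page.text from the requests call above. A field can then be read by name, for example records[0]["City"].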
