Determining whether a website is an online shop

Question    Votes: 1    Answers: 1

I am trying to determine whether each website in a list of websites is an online shop.

It seems that most online shops have:

  • an a tag with the word "cart" in its href
  • an li tag assigned to a class whose name contains the word "cart"

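I believe I have to use a regular expression and then tell BeautifulSoup's find method to search the HTML for that pattern in the a and li tags. How should I go about this?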

So far, the code below searches the HTML for an a tag whose href contains the word "cart".

import re
from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

# Configure headless Chrome; the driver itself is created once, in the with-block below.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('log-level=3')

with webdriver.Chrome(options=options) as driver:
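    # BeautifulSoup applies the pattern with search(), so a bare 'cart' is enough.
    cart = re.compile('cart', re.IGNORECASE)
    for url in websites:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Treat the site as a shop if any <a> has "cart" in its href.
        if soup.find('a', href=cart):
            shops.append(url)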

print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)

Output:

SHOPS FOUND:
https://www.nike.com/
https://www.amazon.com/
python python-3.x selenium web-scraping beautifulsoup
1 answer
0 votes

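You can use the contains operator (*=) of CSS attribute selectors to match a class attribute, or an href attribute, that contains the substring cart. Combine the two tests (class and href) with the CSS "or" syntax, a comma between the selectors. TODO: consider adding a wait condition to make sure the li and a elements are present first.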
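To see the selector in isolation, here is a small offline sketch that runs it against static HTML strings and compares it with the regular-expression approach from the question. The sample markup and the names has_cart_css and has_cart_regex are made up for illustration. The trailing i in the selector is an addition that makes the match case-insensitive, like re.IGNORECASE in the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real site.
SAMPLE_SHOP = '<ul><li class="nav-Cart-icon">Bag</li></ul><a href="/help">Help</a>'
SAMPLE_LINK = '<a href="/checkout/cart?ref=top">Basket</a>'
SAMPLE_PLAIN = '<a href="/about">About</a><li class="menu">Home</li>'

# The comma means "or"; the trailing "i" makes the value comparison case-insensitive.
CART_SELECTOR = 'a[href*="cart" i], li[class*="cart" i]'
CART_RE = re.compile('cart', re.IGNORECASE)


def has_cart_css(html):
    """Return True if the combined selector matches at least one element."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.select_one(CART_SELECTOR) is not None


def has_cart_regex(html):
    """Same check with find(): one call per tag/attribute pair."""
    soup = BeautifulSoup(html, 'html.parser')
    return (soup.find('a', href=CART_RE) is not None
            or soup.find('li', class_=CART_RE) is not None)


for sample in (SAMPLE_SHOP, SAMPLE_LINK, SAMPLE_PLAIN):
    print(has_cart_css(sample), has_cart_regex(sample))
```

Both helpers should agree on every sample. The selector version needs a single pass, while the find version answers the question as literally asked. The answer's script applies the selector (without the i flag) to the live pages: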

from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

# Configure headless Chrome; the driver itself is created once, in the with-block below.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('log-level=3')

with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # The comma is CSS "or": <a> with "cart" in href, or <li> with "cart" in class.
        items = soup.select('a[href*=cart], li[class*=cart]')
        if items:
            shops.append(url)
print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)