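I am trying to determine whether the websites in a list are web shops.
Most web shops seem to have:
- an a tag with the word "cart" in its href
- an li tag assigned to a class with the word "cart" in the class name
I believe I have to use a regular expression and then tell BeautifulSoup's find method to search the a or li tags in the HTML for that expression. How would I go about this?
So far, the code below searches the HTML for an a tag whose href contains "cart" (case-insensitively).
Code: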
import re
from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

# Compile once; BeautifulSoup searches the href for this pattern.
cart = re.compile('.*cart.*', re.IGNORECASE)

# Configure the options before creating the (single) driver.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('log-level=3')

# The context manager quits the browser when the block exits.
with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Only a tags are checked here; li tags are not covered yet.
        if soup.find('a', href=cart):
            shops.append(url)

print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)
Output:
SHOPS FOUND:
https://www.nike.com/
https://www.amazon.com/
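You can use the contains operator (*=) in a CSS attribute selector to require that the class attribute, or the href attribute, contains the substring cart. Combine the two (class and href) with the CSS "or" syntax, which is a comma between the selectors.
TODO: you could consider adding a wait condition to make sure the li and a elements are present before the page source is read.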
from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

# Configure the options before creating the (single) driver.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('log-level=3')

with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # a tags whose href contains "cart", OR li tags whose class contains "cart".
        items = soup.select('a[href*=cart], li[class*=cart]')
        if items:
            shops.append(url)

print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)
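
For completeness, the regex route the question asked about can probably be sketched as below. This is only a rough illustration, not a tested replacement for the selector above: it runs on small hand-written HTML strings instead of live pages, and the helper name looks_like_shop and the three sample snippets are made up for the example. It relies on BeautifulSoup accepting a compiled pattern as an attribute filter in find, both for href and for class_.

```python
import re
from bs4 import BeautifulSoup

# Case-insensitive, so "Cart" and "CART" match as well.
CART = re.compile(r'cart', re.IGNORECASE)

def looks_like_shop(html):
    """Hypothetical helper: True if an <a> href or an <li> class mentions 'cart'."""
    soup = BeautifulSoup(html, 'html.parser')
    # href holds a single string; the pattern is searched inside it.
    if soup.find('a', href=CART):
        return True
    # class is multi-valued; the pattern is tried against the class names.
    return soup.find('li', class_=CART) is not None

# Made-up sample snippets, standing in for driver.page_source.
li_html = '<ul><li class="nav-Cart-item">Basket</li></ul>'
link_html = '<a href="/Cart/view">Basket</a>'
plain_html = '<a href="/about">About</a><ul><li class="menu">Home</li></ul>'

for name, html in [('li', li_html), ('link', link_html), ('plain', plain_html)]:
    print(name, looks_like_shop(html))
```

In the Selenium loop, the same check would be called as looks_like_shop(driver.page_source). Either way this stays a heuristic: a page can mention "cart" without being a shop, and a shop can use a different word such as "basket" or "bag".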