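I wrote a scraper that fetches some data from 2 different URLs. The only difference between the two pages is that one has a menu dropdown and the other does not. The scraper works for the first URL in urls and gets everything, but on the second URL it only gets the data from the first dropdown option. I thought I had handled this case, but it does not work.
Here is my code: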
import requests
from bs4 import BeautifulSoup
import re
import csv

urls = ['https://untappd.com/v/other-half-brewing-co/1360488',
        'https://untappd.com/v/beer-witch/10272294']

def get_menu_beers(soup):
    beers_all = soup.find_all('ul', {'class': 'menu-section-list'})
    # Open the CSV file in append mode
    with open('scraped.csv', mode='a', newline='') as file:
        writer = csv.writer(file)
        for beer_group in beers_all:
            beers = beer_group.find_all('li')
            for beer in beers:
                details = beer.find('div', {'class': 'beer-details'})
                a_href = details.find("a",{"class":"track-click"}).get("href")
                id_num = re.findall(r'\d+', a_href)
                beer_id = int(id_num[-1])
                name_ = details.find("a",{"class":"track-click"}).text
                rating_value = details.find('div', {'class': 'caps small'})['data-rating']
                writer.writerow([str(name_).strip().replace('\n', ' '), rating_value])

for url in urls:
    response = requests.get(url, headers = {'User-agent': 'Mozilla/5.0'})
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            try:
                select_options = soup.find_all('select', {'class':'menu-selector'})
                if (len(select_options) > 0 ):
                    options_list = select_options[0].find_all('option')
                    menu_ids =[]
                    for option in options_list:
                        menu_ids.append(int(option['value']))
                    menu_urls = []
                    for menu_id in menu_ids:
                        menu_url = str(url)+ '?menu_id=' + str(menu_id)
                        menu_urls.append(menu_url)
                    for url in menu_urls:
                        res = requests.get(url, headers = {'User-agent': 'Mozilla/5.0'})
                        s = BeautifulSoup(res.text, 'html.parser')
                        get_menu_beers(s)
                else:
                    get_menu_beers(soup)
            except:
                print("Failed HERE")
        except:
            print(f"Failed: {url}")
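I thought this part of my code would find all the dropdown options, build a URL for each one, and scrape each of them separately, but nothing happens: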
if (len(select_options) > 0 ):
    options_list = select_options[0].find_all('option')
    menu_ids =[]
    for option in options_list:
        menu_ids.append(int(option['value']))
    menu_urls = []
    for menu_id in menu_ids:
        menu_url = str(url)+ '?menu_id=' + str(menu_id)
        menu_urls.append(menu_url)
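For reference, the URL-building step in the snippet above can be isolated and checked without hitting the site. This is only a minimal sketch under assumptions: `build_menu_urls` is a hypothetical helper name, the option values in the example (`"123"`, `"456"`) are made up, and the `menu_id` query parameter is the one my code already assumes the site uses. It also skips blank or non-numeric option values, since `int('')` raises `ValueError`, which would be one possible (unconfirmed) way for the loop to stop early.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_menu_urls(base_url, option_values):
    """Return one URL per numeric menu id, skipping blank or non-numeric values.

    Hypothetical helper: the `menu_id` parameter name comes from the question's
    own code and is not confirmed against the site.
    """
    parts = urlsplit(base_url)
    menu_urls = []
    for value in option_values:
        try:
            menu_id = int(value)
        except (TypeError, ValueError):
            # e.g. a placeholder <option value=""> or a missing value attribute
            continue
        query = urlencode({"menu_id": menu_id})
        # Rebuild the URL with only the menu_id query (any existing query is dropped)
        menu_urls.append(
            urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
        )
    return menu_urls


if __name__ == "__main__":
    # The ids below are invented purely for illustration.
    for menu_url in build_menu_urls(
        "https://untappd.com/v/beer-witch/10272294", ["", "123", "456"]
    ):
        print(menu_url)
```

Printing the generated URLs and opening one in a browser would show whether the page really changes with `menu_id`, or whether the menu content is loaded some other way.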
My scraper works for this URL:
https://untappd.com/v/other-half-brewing-co/1360488
but not for this one: https://untappd.com/v/beer-witch/10272294
Can anyone tell me what I'm doing wrong?
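One thing that may help narrow this down: the bare `except:` blocks in my code only print a fixed message, so whatever actually goes wrong on the second URL is hidden. Below is a minimal sketch of a wrapper that reports the real exception instead. `run_step` is a hypothetical helper name, not part of `requests` or BeautifulSoup, and this is a debugging aid rather than a fix.

```python
import traceback


def run_step(label, func, *args):
    """Call func(*args) and return (ok, result).

    On failure, print the exception type and message plus the traceback,
    instead of a fixed "Failed HERE" string.
    """
    try:
        return True, func(*args)
    except Exception as exc:  # narrower than a bare except: Ctrl-C still works
        print(f"Failed at {label}: {type(exc).__name__}: {exc}")
        traceback.print_exc()
        return False, None


if __name__ == "__main__":
    # Example: parsing an empty option value, as int(option['value']) would.
    ok, menu_id = run_step("parse menu id", int, "")
    print(ok, menu_id)
```

Wrapping the call to `get_menu_beers(s)` this way, for example `run_step(menu_url, get_menu_beers, s)`, would show which menu page and which lookup raises.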