BS4 and requests only find and scrape the first item listed in a drop-down list


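I wrote a scraper to fetch some data from 2 different URLs. The only difference between the two pages is that one has a menu drop-down list and the other does not. The scraper works on the first URL in urls and fetches everything, but on the second URL it only gets data from the first drop-down option. I thought I had accounted for this case, but it does not work.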

Here is my code:

import requests
from bs4 import BeautifulSoup
import re
import csv

urls = ['https://untappd.com/v/other-half-brewing-co/1360488',
        'https://untappd.com/v/beer-witch/10272294']



def get_menu_beers(soup):
    beers_all = soup.find_all('ul', {'class': 'menu-section-list'})
    
    # Open the CSV file in append mode
    with open('scraped.csv', mode='a', newline='') as file:
        writer = csv.writer(file)
        
        for beer_group in beers_all:
            beers = beer_group.find_all('li')
            for beer in beers:
                details = beer.find('div', {'class': 'beer-details'})
                a_href = details.find("a",{"class":"track-click"}).get("href")
                id_num = re.findall(r'\d+', a_href)
                beer_id = int(id_num[-1])
                name_ = details.find("a",{"class":"track-click"}).text
                rating_value = details.find('div', {'class': 'caps small'})['data-rating']
                
                
                writer.writerow([str(name_).strip().replace('\n', ' '), rating_value])

for url in urls:         
    response = requests.get(url, headers = {'User-agent': 'Mozilla/5.0'})
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            try:
                select_options = soup.find_all('select', {'class':'menu-selector'})

                if (len(select_options) > 0 ):
                    options_list = select_options[0].find_all('option')
                    menu_ids =[]
                    for option in options_list:
                        menu_ids.append(int(option['value']))

                    menu_urls = []
                    for menu_id in menu_ids:
                        menu_url = str(url)+ '?menu_id=' + str(menu_id)
                        menu_urls.append(menu_url)

        
                    for menu_url in menu_urls:
                        res = requests.get(menu_url, headers = {'User-agent': 'Mozilla/5.0'})
                        s = BeautifulSoup(res.text, 'html.parser')
                        get_menu_beers(s)

                else:
                    get_menu_beers(soup)
            except:
                print("Failed HERE")
        except:
            print(f"Failed: {url}")
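
Both except clauses in the code above swallow every exception without reporting it, so a failure on the second URL can look as if nothing happened. A minimal sketch of one way to surface the underlying error, using only the standard library. The names run_and_report and parse_menu_id are hypothetical helpers, not part of the original scraper, and the empty option value is an assumed failure case that has not been confirmed for this page.

```python
import traceback


def run_and_report(label, func, *args):
    """Call func(*args); on failure print the label and the full traceback.

    Returns a (result, error) pair so the caller can see what went wrong
    instead of only a generic "Failed" message.
    """
    try:
        return func(*args), None
    except Exception as exc:  # narrower than a bare except: Ctrl+C still works
        print(f"Failed: {label}: {exc!r}")
        traceback.print_exc()
        return None, exc


def parse_menu_id(value):
    # Hypothetical stand-in for int(option['value']) in the scraper.
    return int(value)


# Assumed failure case: an <option> whose value is empty or non-numeric.
result, error = run_and_report("menu option", parse_menu_id, "")
ok_result, ok_error = run_and_report("menu option", parse_menu_id, "12345")
```

Wrapping the body of the loop over urls in the same way would show whether the second page fails while parsing the options, while requesting a menu URL, or inside get_menu_beers.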

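I thought this part of my code would find all the drop-down options, build a URL for each one, and scrape each of them separately, but nothing happens: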

if (len(select_options) > 0 ):
    options_list = select_options[0].find_all('option')
    menu_ids =[]
    for option in options_list:
        menu_ids.append(int(option['value']))

    menu_urls = []
    for menu_id in menu_ids:
        menu_url = str(url)+ '?menu_id=' + str(menu_id)
        menu_urls.append(menu_url)
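
The string concatenation in the snippet above assumes the venue URL has no query string of its own. A minimal sketch of building the per-menu URLs with urllib.parse instead. The helper name build_menu_urls and the menu ids 111 and 222 are placeholders for illustration, and whether the site honours a menu_id query parameter for every venue is an assumption carried over from the question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_menu_urls(base_url, menu_ids):
    """Return one URL per menu id, setting menu_id in the query string."""
    parts = urlsplit(base_url)
    # Keep any existing query parameters except a previous menu_id.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != "menu_id"]
    urls = []
    for menu_id in menu_ids:
        query = urlencode(kept + [("menu_id", str(menu_id))])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls


# Placeholder ids; real values would come from the <option> elements.
menu_urls = build_menu_urls("https://untappd.com/v/beer-witch/10272294", [111, 222])
```

Printing menu_urls before requesting them is a quick check that the drop-down options were actually found and parsed.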

The scraper works for this URL:

https://untappd.com/v/other-half-brewing-co/1360488
but not for this one:
https://untappd.com/v/beer-witch/10272294

Can someone tell me what I am doing wrong?

web-scraping beautifulsoup python-requests drop-down-menu
1 Answer

Use a different URL, for example: [image: example URL, not preserved]

You can also use Selenium or Playwright to select each drop-down option.
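
A rough sketch of the Playwright route, assuming Playwright and a Chromium build are installed. The function name fetch_menu_pages is hypothetical, the select.menu-selector selector is taken from the question's code, and it is an assumption that choosing an option updates the menu and that waiting for the network to go idle is enough.

```python
def fetch_menu_pages(url, selector="select.menu-selector"):
    """Return the rendered HTML once per drop-down option (sketch only)."""
    # Imported here so the module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    pages_html = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Collect the option values from the drop-down, if there is one.
        values = page.eval_on_selector_all(
            f"{selector} option", "options => options.map(o => o.value)"
        )
        if not values:
            # No drop-down on this page: keep the page as loaded.
            pages_html.append(page.content())
        for value in values:
            page.select_option(selector, value)
            # Assumption: the menu has finished updating once the network is idle.
            page.wait_for_load_state("networkidle")
            pages_html.append(page.content())
        browser.close()
    return pages_html
```

Each returned HTML string could then be passed to BeautifulSoup(html, 'html.parser') and on to get_menu_beers from the question.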
