使用 Selenium 或 Beautiful soup 刮擦 hulkapps 表

问题描述 投票:0回答:1

我正在尝试抓取以下网址:https://papemelroti.com/products/live-free-badge

但是好像找不到这个表类

<table class="hulkapps-table table"><thead><tr><th style="border-top-left-radius: 0px;">Quantity</th><th style="border-top-right-radius: 0px;">Bulk Discount</th><th style="display: none">Add to Cart</th></tr></thead><tbody><tr><td style="border-bottom-left-radius: 0px;">Buy 50 +   <span class="hulk-offer-text"></span></td><td style="border-bottom-right-radius: 0px;"><span class="hulkapps-price"><span class="money"><span class="money"> ₱1.00 </span></span> Off</span></td><td style="display: none;"><button type="button" class="AddToCart_0" style="cursor: pointer; font-weight: 600; letter-spacing: .08em; font-size: 11px; padding: 5px 15px; border-color: #171515; border-width: 2px; color: #ffffff; background: #161212;" onclick="add_to_cart(50)">Add to Cart</button></td></tr></tbody></table>

我已经有了我的 Selenium 代码,但它仍然没有抓取它。这是我的代码:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

service = Service('/usr/local/bin/chromedriver')  # Adjust path if necessary
driver = webdriver.Chrome(service=service, options=chrome_options)

def get_page_html(url):
    driver.get(url)
    time.sleep(3)  # Wait for JS to load
    return driver.page_source

def scrape_discount_quantity(url):
    page_html = get_page_html(url)
    soup = BeautifulSoup(page_html, "html.parser")

    # Locate the table containing the quantity and discount
    table = soup.find('table', class_='hulkapps-table')
    print(page_html)

    if table:
        table_rows = table.find_all('tr')
        for row in table_rows:
            quantity_cells = row.find_all('td')
            if len(quantity_cells) >= 2:  # Check if there are at least two cells
                quantity_cell = quantity_cells[0].get_text(strip=True)  # Get quantity text
                discount_cell = quantity_cells[1].get_text(strip=True)  # Get discount text
                return quantity_cell, discount_cell
    return None, None

# Example usage
url = 'https://papemelroti.com/products/live-free-badge'
quantity, discount = scrape_discount_quantity(url)
print(f"Quantity: {quantity}, Discount: {discount}")

driver.quit()  # Close the browser when done

它不断返回“无”

供参考: enter image description here

python selenium-webdriver beautifulsoup
1个回答
0
投票

折扣数据从此

https://volumediscount.hulkapps.com/api/v2/shop/get_offer_table
API 端点加载,当您使用 selenium
driver.page_source
返回页面源时,bs4 没有要抓取的表名称,我尝试了您的代码并确认
hulkapps-table
不存在于回应!所以很明显的反应是
None
,

我的回答:

我使用了这个

https://volumediscount.hulkapps.com/api/v2/shop/get_offer_table
API 端点以及此请求中的
product_id
https://papemelroti.com/products/live-free-badge.json
,这是我的代码,它是基本的:

import requests
import json

def getDiscount(root_url):
    prod_resp = requests.get(f'{root_url}.json').content #Get product_id
    prod_id = json.loads(prod_resp)['product']['id']
    disc_url = 'https://volumediscount.hulkapps.com/api/v2/shop/get_offer_table' #Discount URL
    data = f'pid={prod_id}&store_id=papemelroti.myshopify.com'
    headers = {
        "User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0",
        "Content-Type":"application/x-www-form-urlencoded; charset=UTF-8"
    }
    resp = requests.post(disc_url, data=data, headers=headers).content
    data_json = json.loads(resp)
    disc_json = json.loads(data_json['eligible_offer']['offer_levels'])[0]

    #Offer has two variants: 'Price' and 'Off' so you can use condition if you like to scrape products other than 'live-free-badge'

    if 'price_discount' in disc_json[2]:
        print(f"Product ID:{prod_id} (Quantity: {disc_json[0]}, Discount: {disc_json[1]} Price discount)")
    elif 'Off' in disc_json[2]:
        print(f"Product ID:{prod_id} (Quantity: {disc_json[0]}, Discount: {disc_json[1]}% Off)")

#sample for both 'Off' and 'Price'
getDiscount('https://papemelroti.com/products/dear-me-magnet')
getDiscount('https://papemelroti.com/products/live-free-badge')

输出:

Product ID:7217967726790 (Quantity: 50, Discount: 10% Off)
Product ID:104213217289 (Quantity: 50, Discount: 1.00 Price discount)

让我知道这是否可以或者您是否想严格使用硒

© www.soinside.com 2019 - 2024. All rights reserved.