So I'm learning web scraping and practicing on the Yahoo Finance site, but iterating through the next pages of the table I'm extracting is giving me trouble.
I tried the code below, but it only works for the first page and won't move on to the other pages.
import requests
from bs4 import BeautifulSoup

for page in range(0, 201, 25):
    url = f'https://finance.yahoo.com/markets/stocks/most-active/?start={page}&count=25'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    columns = soup.find_all('div', class_='header-container yf-1dbt8wv')
    header = [name.text.strip() for name in columns]
    header.insert(1, "Name")
    data = []
    body = soup.find('tbody')
    rows = body.find_all('tr', class_='yf-1dbt8wv')
    for row in rows:
        point = row.find_all('td', class_='cell yf-1dbt8wv')
        line = [case.text.strip() for case in point]
        splitter = line[0].split(" ", 1)
        line = splitter + line[1:]
        line[1] = line[1].strip()
        line[2] = line[2].split(" ", 1)[0]
        data.append(line)
Also, since the URL takes query parameters, I tried a URL that displays all 203 rows of the table on a single page:
url = 'https://finance.yahoo.com/markets/stocks/most-active/?start=0&count=203'
# time.sleep(5)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
columns = soup.find_all('div', class_='header-container yf-1dbt8wv')
header = [name.text.strip() for name in columns]
header.insert(1, "Name")
data = []
body = soup.find('tbody')
rows = body.find_all('tr', class_='yf-1dbt8wv')
for row in rows:
    point = row.find_all('td', class_='cell yf-1dbt8wv')
    line = [case.text.strip() for case in point]
    splitter = line[0].split(" ", 1)
    line = splitter + line[1:]
    line[1] = line[1].strip()
    line[2] = line[2].split(" ", 1)[0]
    data.append(line)
...but even though I can see the entire table on one page in the browser, it still scrapes only the default 25 rows.
Am I missing something? Is there something else I need to learn to get this right? I could use some help. Thanks!
The Yahoo Finance page is fairly complex.
You may be prompted to accept or reject cookies, and you need to deal with that first.
After that, you need to realize that the page is driven by JavaScript, so the combination of requests and BeautifulSoup is unlikely to produce the results you expect. You should probably use selenium.
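You can confirm this yourself by checking how many table rows are actually present in the raw HTML the server returns (a minimal sketch; the User-Agent header is an assumption to avoid being served an error page, and per your own observation the count comes back as 25, not 203):

import requests
from bs4 import BeautifulSoup

# Request all 203 rows at once, exactly as in the question
url = 'https://finance.yahoo.com/markets/stocks/most-active/?start=0&count=203'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
body = soup.find('tbody')
# Only the server-rendered rows exist in the static HTML; the rest are
# filled in afterwards by JavaScript, so this won't print 203
print(len(body.find_all('tr')) if body else 'no <tbody> in the raw HTML')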
The way to page forward is to locate the relevant button and, if it isn't disabled, simulate a click on it. Then refresh the driver and carry on.
Here's an example of how to get all of the company names (they can be found in span elements with the longName class). You should be able to extend it easily to get whatever specific data you're after.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.action_chains import ActionChains

options = ChromeOptions()
options.add_argument("--headless=true")

url = "https://finance.yahoo.com/markets/stocks/most-active/"

def click(driver, e):
    # Click via ActionChains, which copes better with overlaid elements
    action = ActionChains(driver)
    action.click(e)
    action.perform()

def reject(driver, wait):
    # Dismiss the cookie consent dialog if and when it appears
    try:
        selector = By.CSS_SELECTOR, "button.reject-all"
        button = wait.until(EC.presence_of_element_located(selector))
        click(driver, button)
    except Exception:
        pass

def text(e):
    # .text can come back empty in headless mode; fall back to textContent
    if r := e.text:
        return r
    return e.get_attribute("textContent")

def next_page(driver, wait):
    # The third paging button is "next"; click it unless it's disabled,
    # which marks the last page
    selector = By.CSS_SELECTOR, "div.buttons button"
    buttons = wait.until(EC.presence_of_all_elements_located(selector))
    if not buttons[2].get_attribute("disabled"):
        click(driver, buttons[2])
        driver.refresh()
        return True
    return False

with webdriver.Chrome(options) as driver:
    driver.get(url)
    wait = WebDriverWait(driver, 5)
    reject(driver, wait)
    selector = By.CSS_SELECTOR, "tbody.body tr td.cell span.longName"
    while True:
        for span in wait.until(EC.presence_of_all_elements_located(selector)):
            print(text(span))
        if not next_page(driver, wait):
            break
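For example, to capture every cell in each row rather than just the company name, you could widen the selector and read the cells row by row; this fragment would replace the while loop above (a sketch along the same lines, assuming the same table markup targeted above; I haven't verified the exact cell contents):

selector = By.CSS_SELECTOR, "tbody.body tr"
while True:
    for row in wait.until(EC.presence_of_all_elements_located(selector)):
        # Collect the text of every cell in the row
        cells = row.find_elements(By.CSS_SELECTOR, "td")
        print([text(cell) for cell in cells])
    if not next_page(driver, wait):
        break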