So I'm learning web scraping and practicing on the Yahoo Finance site, but iterating through the next pages of the table I'm extracting is giving me trouble.
I tried the code below, but it only works for the first page and won't move on to the other pages.
import requests
from bs4 import BeautifulSoup

for page in range(0, 201, 25):
    url = f'https://finance.yahoo.com/markets/stocks/most-active/?start={page}&count=25'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    columns = soup.find_all('div', class_='header-container yf-1dbt8wv')
    header = [name.text.strip() for name in columns]
    header.insert(1, "Name")
    data = []
    body = soup.find('tbody')
    rows = body.find_all('tr', class_='yf-1dbt8wv')
    for row in rows:
        point = row.find_all('td', class_='cell yf-1dbt8wv')
        line = [case.text.strip() for case in point]
        splitter = line[0].split(" ", 1)
        line = splitter + line[1:]
        line[1] = line[1].strip()
        line[2] = line[2].split(" ", 1)[0]
        data.append(line)
Also, since the URL takes query parameters, I tried a URL that displays all 203 rows of the table on a single page:
url = 'https://finance.yahoo.com/markets/stocks/most-active/?start=0&count=203'
# time.sleep(5)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
columns = soup.find_all('div', class_='header-container yf-1dbt8wv')
header = [name.text.strip() for name in columns]
header.insert(1, "Name")
data = []
body = soup.find('tbody')
rows = body.find_all('tr', class_='yf-1dbt8wv')
for row in rows:
    point = row.find_all('td', class_='cell yf-1dbt8wv')
    line = [case.text.strip() for case in point]
    splitter = line[0].split(" ", 1)
    line = splitter + line[1:]
    line[1] = line[1].strip()
    line[2] = line[2].split(" ", 1)[0]
    data.append(line)
...but even though I can see the entire table on one page in the browser, it still scrapes only the default 25 rows.
Am I missing something? Is there something else I need to learn to get this right? I could use some help. Thanks!
The Yahoo Finance page is fairly complex.
You may be prompted to accept or reject cookies, and you need to deal with that first.
After that, you need to realize that the page is driven by JavaScript, so the combination of requests and BeautifulSoup is unlikely to produce the results you expect. You should probably use selenium.
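You can confirm this yourself by checking how many table rows are actually present in the raw HTML the server returns (a minimal sketch; the User-Agent header is an assumption to avoid being served an error page, and per your own observation the count comes back as 25, not 203):

import requests
from bs4 import BeautifulSoup

# Request all 203 rows at once, exactly as in the question
url = 'https://finance.yahoo.com/markets/stocks/most-active/?start=0&count=203'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
body = soup.find('tbody')
# Only the server-rendered rows exist in the static HTML; the rest are
# filled in afterwards by JavaScript, so this won't print 203
print(len(body.find_all('tr')) if body else 'no <tbody> in the raw HTML')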
The way to page forward is to locate the relevant button and, if it isn't disabled, simulate a click on it. Then refresh the driver and carry on.
Here's an example of how to get all of the company names (they can be found in span elements with the longName class). You should be able to extend it easily to get whatever specific data you're after.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.action_chains import ActionChains

options = ChromeOptions()
options.add_argument("--headless=true")

url = "https://finance.yahoo.com/markets/stocks/most-active/"

def click(driver, e):
    # Click via ActionChains, which copes better with overlaid elements
    action = ActionChains(driver)
    action.click(e)
    action.perform()

def reject(driver, wait):
    # Dismiss the cookie consent dialog if and when it appears
    try:
        selector = By.CSS_SELECTOR, "button.reject-all"
        button = wait.until(EC.presence_of_element_located(selector))
        click(driver, button)
    except Exception:
        pass

def text(e):
    # .text can come back empty in headless mode; fall back to textContent
    if r := e.text:
        return r
    return e.get_attribute("textContent")

def next_page(driver, wait):
    # The third paging button is "next"; click it unless it's disabled,
    # which marks the last page
    selector = By.CSS_SELECTOR, "div.buttons button"
    buttons = wait.until(EC.presence_of_all_elements_located(selector))
    if not buttons[2].get_attribute("disabled"):
        click(driver, buttons[2])
        driver.refresh()
        return True
    return False

with webdriver.Chrome(options) as driver:
    driver.get(url)
    wait = WebDriverWait(driver, 5)
    reject(driver, wait)
    selector = By.CSS_SELECTOR, "tbody.body tr td.cell span.longName"
    while True:
        for span in wait.until(EC.presence_of_all_elements_located(selector)):
            print(text(span))
        if not next_page(driver, wait):
            break
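For example, to capture every cell in each row rather than just the company name, you could widen the selector and read the cells row by row; this fragment would replace the while loop above (a sketch along the same lines, assuming the same table markup targeted above; I haven't verified the exact cell contents):

selector = By.CSS_SELECTOR, "tbody.body tr"
while True:
    for row in wait.until(EC.presence_of_all_elements_located(selector)):
        # Collect the text of every cell in the row
        cells = row.find_elements(By.CSS_SELECTOR, "td")
        print([text(cell) for cell in cells])
    if not next_page(driver, wait):
        break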