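I built a scraper that works, except that it doesn't scrape the last page. The URL doesn't change between pages, so I set it up to run in an infinite loop.
I set the loop to break when I can no longer click the Next button (on the last page), and it seems the script ends before the last page's results are appended.
How can I append the last page to the list?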

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup
    import pandas as pd
    from time import sleep
    import itertools

    url = "https://example.com"
    driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver")
    driver.get(url)

    # Submit the search form
    inputElement = driver.find_element_by_id("txtBusinessName")
    inputElement.send_keys("ship")
    inputElement.send_keys(Keys.ENTER)

    df2 = pd.DataFrame()
    for i in itertools.count():
        # Wait for the results table, then parse the current page
        element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, "grid_businessList")))
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find('table', id="grid_businessList")
        rows = table.findAll("tr")
        columns = [v.text.replace('\xa0', ' ') for v in rows[0].find_all('th')]
        df = pd.DataFrame(columns=columns)
        for i in range(1, len(rows)):
            tds = rows[i].find_all('td')
            if len(tds) == 5:
                values = [tds[0].text, tds[1].text, tds[2].text, tds[3].text, tds[4].text]
            else:
                values = [td.text for td in tds]
            df = df.append(pd.Series(values, index=columns), ignore_index=True)
        # Go to the next page, or stop when there is no Next button
        try:
            next_button = driver.find_element_by_css_selector("li.next:nth-child(9) > a:nth-child(1)")
            driver.execute_script("arguments[0].click();", next_button)
            sleep(5)
        except NoSuchElementException:
            break
        df2 = df2.append(df)

    df2.to_csv(r'/home/user/Documents/test/' + 'gasostest.csv', index=False)

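The problem is that the except branch breaks out of the loop before you append the last page.
What you can do is add a finally clause to your try-except statement. The code in the finally block always runs, even when the except branch executes break; see https://docs.python.org/3/tutorial/errors.html#defining-clean-up-actions
Your code can be rewritten as: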
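The behaviour of finally together with break can be checked without a browser. The following is only a minimal sketch: `collect_pages` and the `LookupError` raised on the last item are hypothetical stand-ins for the scraping loop and for `NoSuchElementException`, not part of the original script.

```python
def collect_pages(pages):
    """Collect every page, even the one on which the loop breaks."""
    collected = []
    for index, page in enumerate(pages):
        try:
            # Stand-in for "click Next": fails on the last page
            if index == len(pages) - 1:
                raise LookupError("no next button")
        except LookupError:
            break  # leave the loop, but finally still runs first
        finally:
            collected.append(page)
    return collected


result = collect_pages(["page 1", "page 2", "page 3"])
print(result)  # ['page 1', 'page 2', 'page 3']
```

Without the finally clause, and with the append placed after the try-except instead, the same loop would return only the first two pages.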
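
    for i in itertools.count():
        # ... parse the current page into df, as before ...
        try:
            next_button = driver.find_element_by_css_selector("li.next:nth-child(9) > a:nth-child(1)")
            driver.execute_script("arguments[0].click();", next_button)
            sleep(5)
        except NoSuchElementException:
            break
        finally:
            # Runs on every iteration, including the one that breaks
            df2 = df2.append(df)

    # After the loop: write all pages at once
    df2.to_csv(r'/home/user/Documents/test/' + 'gasostest.csv', index=False)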
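
An alternative that does not need finally is to store the current page before trying to move on, so the break can never skip it. The sketch below illustrates only that ordering: `FakePager`, `current_rows` and `click_next` are hypothetical stand-ins for the Selenium driver, and `LookupError` stands in for `NoSuchElementException`.

```python
class FakePager:
    """Hypothetical stand-in for the browser: serves pages in order."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def current_rows(self):
        return self.pages[self.index]

    def click_next(self):
        # Mimics a missing Next button on the last page
        if self.index + 1 >= len(self.pages):
            raise LookupError("no next button")
        self.index += 1


def scrape_all(pager):
    collected = []
    while True:
        # Store the current page first, then try to advance
        collected.append(pager.current_rows())
        try:
            pager.click_next()
        except LookupError:
            break
    return collected


pages = scrape_all(FakePager([["a", "b"], ["c"], ["d", "e"]]))
print(pages)  # [['a', 'b'], ['c'], ['d', 'e']]
```

In the real script, the list would hold one DataFrame per page and could be combined once after the loop with `pd.concat`. That also avoids `DataFrame.append`, which was removed in pandas 2.0.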