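I have written some code in Python, in combination with Selenium, to parse different questions from quora.com. My scraper is doing its job. The thing is, I have used a hardcoded delay here to make the scraper work, even though an Explicit Wait has already been defined. As the page scrolls infinitely, I tried to limit the scrolling to a fixed number of steps. Now, I have two questions:

1. Why does wait.until(EC.staleness_of(page)) not work within my scraper? It is commented out now.
2. If I use anything other than page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link"))), the scraper throws an error: can't focus element. By the way, I do not wish to go for the page = driver.find_element_by_tag_name('body') option.

This is what I have written so far: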

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://www.quora.com/topic/C-programming-language")
    wait = WebDriverWait(driver, 10)

    page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
    for scroll in range(10):
        page.send_keys(Keys.PAGE_DOWN)
        time.sleep(2)
        # wait.until(EC.staleness_of(page))

    for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
        print(item.text)

    driver.quit()

You can try the code below to trigger as many XHRs as possible and then parse the page:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.Chrome()
    driver.get("https://www.quora.com/topic/C-programming-language")
    wait = WebDriverWait(driver, 10)

    # The first question link is the element that receives the key presses
    page = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "question_link")))
    links_counter = len(wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "question_link"))))

    while True:
        page.send_keys(Keys.END)
        try:
            # Wait until more links are present than before the scroll
            wait.until(lambda driver: len(driver.find_elements(By.CLASS_NAME, "question_link")) > links_counter)
            links_counter = len(driver.find_elements(By.CLASS_NAME, "question_link"))
        except TimeoutException:
            # No new links appeared within 10 seconds, so stop scrolling
            break

    for item in wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "rendered_qtext"))):
        print(item.text)

    driver.quit()

Here we scroll down the page and wait up to 10 seconds for more links to load, or break out of the while loop if the number of links stays the same.
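
The "wait until there are more links than before" check can also be packaged as a reusable wait condition, since WebDriverWait.until() accepts any callable that takes the driver and returns its result once that result is truthy. The following is only a sketch: the class name count_greater_than and its parameters are made up for illustration and are not part of Selenium.

```python
class count_greater_than:
    """Callable wait condition for WebDriverWait.until() (hypothetical helper).

    Returns the new element count (truthy) once more than `previous`
    elements match `locator`, and False until then.
    """

    def __init__(self, locator, previous):
        self.locator = locator    # e.g. (By.CLASS_NAME, "question_link")
        self.previous = previous  # count seen before the last scroll

    def __call__(self, driver):
        # find_elements returns an empty list when nothing matches
        current = len(driver.find_elements(*self.locator))
        return current if current > self.previous else False


# Assumed usage inside a scroll loop, with `wait` being a WebDriverWait:
#     links_counter = wait.until(
#         count_greater_than((By.CLASS_NAME, "question_link"), links_counter))
```

Because until() hands back the condition's return value, the updated count comes out of the wait itself and the second find_elements call is not needed.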
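As for your questions:

1. wait.until(EC.staleness_of(page)) does not work because scrolling down the page does not give you a new DOM. You only trigger XHRs that add more links to the existing DOM, so the first link (page) never becomes stale in this case.
2. The can't focus element error probably appears because keys can only be sent to elements that can receive focus, such as links, and not to content divisions (div), paragraphs (p), etc.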
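
If the can't focus element error is the sticking point, one possible alternative is to scroll the window with JavaScript through driver.execute_script, which does not need a focusable element at all. The sketch below also caps the number of scrolls, as the question asks, and polls the element count instead of sleeping for a fixed time. The function name scroll_until_stable and its parameters are assumptions made for illustration; only find_elements and execute_script are real WebDriver calls.

```python
import time


def scroll_until_stable(driver, locator, max_scrolls=10, timeout=10.0, poll=0.5):
    """Scroll to the bottom until no new elements load or max_scrolls is reached.

    `driver` is assumed to be a Selenium WebDriver; `locator` is a
    (by, value) pair such as (By.CLASS_NAME, "question_link").
    Returns the number of matching elements found at the end.
    """
    by, value = locator
    count = len(driver.find_elements(by, value))
    for _ in range(max_scrolls):
        # Scroll the window itself, so no element has to receive focus
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        deadline = time.monotonic() + timeout
        new_count = count
        # Poll until more elements show up or the timeout expires
        while time.monotonic() < deadline:
            new_count = len(driver.find_elements(by, value))
            if new_count > count:
                break
            time.sleep(poll)
        if new_count <= count:
            break  # nothing new was loaded within the timeout
        count = new_count
    return count
```

With a real driver this might be called as scroll_until_stable(driver, (By.CLASS_NAME, "question_link"), max_scrolls=10) before collecting the rendered_qtext elements.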