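I built a small web scraper that has run successfully in Google Colab for the past few months. It downloads a set of billing codes from the CMS website. Recently, the driver started throwing timeout exceptions when retrieving some, but not all, URLs. When I run the snippet below locally, it executes successfully. In Colab, it tries to download files from two URLs, and retrieving the second URL fails.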
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def download_documents() -> None:
    """Download billing code documents from CMS"""
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(options=chrome_options)

    working_url = "https://www.cms.gov/medicare-coverage-database/view/article.aspx?articleid=59626&ver=6"
    not_working_url = "https://www.cms.gov/medicare-coverage-database/view/lcd.aspx?lcdid=36377&ver=19"

    for row in [working_url, not_working_url]:
        print(f"Retrieving from {row}...")
        driver.get(row)  # Fails on second url

        print("Wait for webdriver...")
        wait = WebDriverWait(driver, 2)

        print("Attempting license accept...")
        # Accept license
        try:
            wait.until(EC.element_to_be_clickable((By.ID, "btnAcceptLicense"))).click()
        except TimeoutException:
            pass

        wait = WebDriverWait(driver, 4)
        print("Attempting pop up close...")
        # Click on Close button of the second pop-up
        try:
            wait.until(
                EC.element_to_be_clickable(
                    (
                        By.XPATH,
                        "//button[@data-page-action='Clicked the Tracking Sheet Close button.']",
                    )
                )
            ).click()
        except TimeoutException:
            pass

        print("Attempting download...")
        driver.find_element(By.ID, "btnDownload").click()


download_documents()
Expected behavior: the code above runs successfully in Google Colab, just as it does locally.
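To narrow down which URLs time out and how long each attempt takes, something like the following could be used. This is only a sketch: `probe_urls` is a hypothetical helper of my own, not part of Selenium, and it takes the page-loading function as a parameter so that one failing URL does not abort the whole loop.

```python
import time
from typing import Callable, Dict, Iterable, Tuple, Type


def probe_urls(
    load: Callable[[str], None],
    urls: Iterable[str],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Dict[str, Tuple[bool, float]]:
    """Call load(url) for each URL and record (succeeded, seconds_elapsed).

    `load` is any callable that fetches a page (hypothetical here; with
    Selenium it could be `driver.get`). Exceptions listed in `errors` are
    recorded as failures instead of stopping the loop.
    """
    results: Dict[str, Tuple[bool, float]] = {}
    for url in urls:
        start = time.monotonic()
        try:
            load(url)
            succeeded = True
        except errors:
            succeeded = False
        results[url] = (succeeded, time.monotonic() - start)
    return results
```

With the script above, this could be called as `probe_urls(driver.get, [working_url, not_working_url], errors=(TimeoutException,))`, ideally after `driver.set_page_load_timeout(30)` so that a stalled load fails after a known delay (the 30-second value is an arbitrary assumption).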
I tried the following arguments:
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
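To confirm that the user-agent override above actually takes effect in Colab, the string the browser reports could be inspected. The helper below is a rough sketch; `describe_user_agent` is a hypothetical name, and it assumes that an unmodified headless Chrome identifies itself with a `HeadlessChrome` token, which has historically been the case but may vary by version.

```python
import re
from typing import Dict, Optional, Union


def describe_user_agent(user_agent: str) -> Dict[str, Union[bool, Optional[int]]]:
    """Summarize the parts of a user-agent string relevant to the flags above."""
    # Matches both "Chrome/91..." and "HeadlessChrome/91...".
    match = re.search(r"Chrome/(\d+)", user_agent)
    return {
        # True if the browser still advertises itself as headless.
        "headless_token": "HeadlessChrome" in user_agent,
        # Major Chrome version claimed by the string, or None if absent.
        "chrome_major": int(match.group(1)) if match else None,
    }
```

The input could come from `driver.execute_script("return navigator.userAgent")`. If `headless_token` is still true, the override was not applied; if `chrome_major` is far from the Chrome version actually installed in Colab, that mismatch may be worth ruling out as well.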