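I am trying to run Selenium on a schedule in AWS MWAA, but Chromium crashes every time with status code -5. I tried googling this status code, without success. Any ideas what causes this error? Alternatively, how should I run Selenium with AWS MWAA? One suggestion I saw was to run Selenium in a Docker container alongside Airflow, but that is not possible with AWS MWAA.
Code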
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromiumService
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.core.os_manager import ChromeType
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(
    service=ChromiumService(
        ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install()
    ),
    options=options,
)
Error: chromedriver exited with status code -5
>>> options = Options()
>>> options.add_argument("--headless=new")
>>> driver = webdriver.Chrome(
...     service=ChromiumService(
...         ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install()
...     ),
...     options=options,
... )
DEBUG:selenium.webdriver.common.driver_finder:Skipping Selenium Manager; path to chrome driver specified in Service class: /usr/local/airflow/.wdm/drivers/chromedriver/linux64/114.0.5735.90/chromedriver
DEBUG:selenium.webdriver.common.service:Started executable: `/usr/local/airflow/.wdm/drivers/chromedriver/linux64/114.0.5735.90/chromedriver` in a child process with pid: 19414 using 0 to output -3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 45, in __init__
    super().__init__(
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 55, in __init__
    self.service.start()
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/selenium/webdriver/common/service.py", line 102, in start
    self.assert_process_still_running()
  File "/usr/local/airflow/.local/lib/python3.11/site-packages/selenium/webdriver/common/service.py", line 115, in assert_process_still_running
    raise WebDriverException(f"Service {self._path} unexpectedly exited. Status code was: {return_code}")
selenium.common.exceptions.WebDriverException: Message: Service /usr/local/airflow/.wdm/drivers/chromedriver/linux64/114.0.5735.90/chromedriver unexpectedly exited. Status code was: -5
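A note on reading that status code: assuming Selenium reports the value it gets from Python's subprocess.Popen, a negative return code on Linux means the child process was terminated by a signal, and the absolute value is the signal number. The sketch below decodes such a code with the standard signal module. The helper name describe_exit is hypothetical and is not part of Selenium.

```python
import signal


def describe_exit(return_code: int) -> str:
    """Describe a return code in the form subprocess.Popen reports it."""
    if return_code < 0:
        # A negative value means "terminated by signal number -return_code".
        try:
            name = signal.Signals(-return_code).name
        except ValueError:
            # The number does not map to a named signal on this platform.
            name = f"signal {-return_code}"
        return f"killed by {name}"
    return f"exited with status {return_code}"


print(describe_exit(-5))
```

On Linux, signal 5 is SIGTRAP, so -5 points to chromedriver being killed by a signal rather than exiting with an ordinary error code of its own.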
Versions
selenium==4.21.0
webdriver-manager==4.0.2
chromedriver==114.0.5735.90
aws-mwaa-local-runner v2.8.1
To reproduce this error, you can download the AWS MWAA local runner v2.8.1, install the requirements above, open a bash shell in the container (docker exec -it {container_id} /bin/bash), and run the script.
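Due to a misunderstanding, I mostly tried to get this working without root privileges. As a result, there are now two ways to set up the environment!
The first one, I'm proud to say, does not require root privileges. foodycoder told me he could not run anything that needed root, since he could not install programs. Oh well, here is a method that works.
I have provided a Python setup script (setup.py) here. Run it in the environment and it will set everything up for you.
Basically, it downloads Chrome, chromeDriver, and the libraries they need to run, which I had previously installed with root privileges. It then extracts them, makes the binaries executable, and lets them find the libraries.
This is what it looks like: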
import os
import subprocess
import zipfile


def unzip_file(name, path):
    """
    Unzips a file, then deletes the archive

    Args:
        name (str): The name of the zip file to unzip
        path (str): The path to the extract directory
    """
    print(f"Unzipping {name} to {path}...")

    # Open the ZIP file
    with zipfile.ZipFile(name, 'r') as zip_ref:
        # Extract all contents into the specified directory
        zip_ref.extractall(path)

    print("Extraction complete!")
    delete_file(name)


def download_file(url):
    """
    Downloads the file from a given url into the current directory

    Args:
        url (str): The url to download the file from
    """
    download = subprocess.run(["wget", url], capture_output=True, text=True)

    # Print the output of the command (wget writes its progress to stderr)
    print(download.stdout)
    print(download.stderr)


def delete_file(path):
    """
    Deletes a file if it exists

    Args:
        path (str): The path to the file to delete
    """
    # Check if the file exists before attempting to delete
    if os.path.exists(path):
        os.remove(path)
        print(f"File {path} has been deleted.")
    else:
        print(f"The file {path} does not exist.")


def write_to_bashrc(line):
    """
    Appends a line to ~/.bashrc unless it is already there

    Args:
        line (str): The line to write (including the trailing newline)
    """
    # Path to the ~/.bashrc file
    bashrc_path = os.path.expanduser("~/.bashrc")

    # Read the current contents; treat a missing file as empty
    lines = []
    if os.path.exists(bashrc_path):
        with open(bashrc_path, 'r') as file:
            lines = file.readlines()

    # Check if the line is already in the file
    if line not in lines:
        with open(bashrc_path, 'a') as file:
            file.write(line)
        print(f"{line} has been added to ~/.bashrc")
    else:
        print("That is already in ~/.bashrc")


if __name__ == '__main__':
    download_file("https://storage.googleapis.com/chrome-for-testing-public/127.0.6533.119/linux64/chrome-linux64.zip")
    unzip_file("chrome-linux64.zip", ".")
    subprocess.run(["chmod", "+x", "chrome-linux64/chrome"], capture_output=True, text=True)

    download_file("http://tennessene.github.io/chrome-libs.zip")
    unzip_file("chrome-libs.zip", "libs")

    download_file("https://storage.googleapis.com/chrome-for-testing-public/127.0.6533.119/linux64/chromedriver-linux64.zip")
    unzip_file("chromedriver-linux64.zip", ".")
    subprocess.run(["chmod", "+x", "chromedriver-linux64/chromedriver"], capture_output=True, text=True)

    download_file("http://tennessene.github.io/driver-libs.zip")
    unzip_file("driver-libs.zip", "libs")

    current_directory = os.path.abspath(os.getcwd())
    library_line = f"export LD_LIBRARY_PATH={current_directory}/libs:$LD_LIBRARY_PATH\n"
    write_to_bashrc(library_line)

    # A child process cannot change the environment of the shell that launched it,
    # so the new LD_LIBRARY_PATH only takes effect in a shell that re-reads ~/.bashrc.
    print("Run 'source ~/.bashrc' or open a new shell to apply the change.")
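One caveat with the ~/.bashrc approach: that file is normally read only by interactive bash shells, so a process started some other way (an Airflow task, for example) may never see the new LD_LIBRARY_PATH. A possible workaround, sketched below under that assumption, is to build the environment in Python and hand it to the driver process directly. The helper name with_library_path is hypothetical, and the libs directory is the one created by the setup script above.

```python
import os


def with_library_path(libs_dir, base_env=None):
    """Return a copy of the environment with libs_dir prepended to LD_LIBRARY_PATH."""
    # Copy so the caller's mapping (or os.environ) is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    libs_dir = os.path.abspath(libs_dir)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the bundled libraries win over system ones with the same name.
    if existing:
        env["LD_LIBRARY_PATH"] = f"{libs_dir}{os.pathsep}{existing}"
    else:
        env["LD_LIBRARY_PATH"] = libs_dir
    return env
```

Selenium's Service class accepts an env mapping, so something like Service("chromedriver-linux64/chromedriver", env=with_library_path("libs")) should start chromedriver with the extra library path, and the Chrome process it launches should inherit it. I have not verified this on MWAA itself.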
The second method uses root privileges. First, I would install Chrome. You can download the .rpm package directly from Google.
wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
Make sure to install it
sudo rpm -i google-chrome-stable_current_x86_64.rpm
Next, I would download chromeDriver. Builds are available here.
wget https://storage.googleapis.com/chrome-for-testing-public/127.0.6533.119/linux64/chromedriver-linux64.zip
Extract it
unzip chromedriver-linux64.zip
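Note that the .rpm above is whatever the current stable Chrome is, while the chromeDriver download is pinned to 127.0.6533.119, so the two can drift apart over time. ChromeDriver generally expects a Chrome with the same major version. Here is a minimal sketch for comparing them, assuming the binaries print their version in the usual "Google Chrome X.Y.Z.W" and "ChromeDriver X.Y.Z.W (...)" form. The function names and the google-chrome binary name are assumptions.

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from `--version` output, e.g. 'ChromeDriver 127.0.6533.119 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version number found in: {version_output!r}")
    return int(match.group(1))


def read_version(binary):
    """Run `<binary> --version` and return its text output."""
    result = subprocess.run([binary, "--version"], capture_output=True, text=True)
    return result.stdout


def versions_match(chrome_output, driver_output):
    """True when Chrome and chromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)
```

For example, versions_match(read_version("google-chrome"), read_version("chromedriver-linux64/chromedriver")) would tell you whether the installed pair lines up. If it does not, pick the matching chromeDriver build from the same download page.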
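Here is some background before the last step. As you may already know, AWS MWAA uses Amazon Linux 2, which is similar to CentOS/RHEL. The way I tracked down the required libraries (the ones listed here are for Ubuntu) is that I stumbled across one of the libraries I needed, but packaged for Oracle Linux.
They have different names (for example, nss instead of libnss3). I then looked at Amazon's package repository, and they were there, under names similar to the Oracle Linux packages. The libraries I ended up needing for chromeDriver were nss, nss-utils, nspr, and libxcb.
Finally, install those pesky libraries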
sudo dnf update
sudo dnf install nss nss-utils nspr libxcb
Much better than copying them by hand!
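If you are on a different image and need to work out which libraries are missing in the first place, ldd lists the shared libraries a binary links against and marks the ones it cannot resolve as "not found". Below is a rough sketch that collects those names. The function names are hypothetical, and it assumes ldd is available in the container.

```python
import subprocess


def parse_missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "\tlibnss3.so => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def missing_libraries(binary_path):
    """Run ldd on a binary and return its unresolved shared libraries."""
    result = subprocess.run(["ldd", binary_path], capture_output=True, text=True)
    return parse_missing_libraries(result.stdout)
```

Running missing_libraries("chromedriver-linux64/chromedriver") before and after the dnf install should show the list shrinking to empty. The names it returns are .so file names, so you still have to map them to package names for your distribution.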
It should work right away after that. Make sure your main.py looks like mine.
Here is what my main Python script ended up looking like (main.py):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait


def visit_url(driver, url):
    """
    Navigates to a given url and waits for the page to finish loading.

    Args:
        driver (webdriver.Chrome): The WebDriver instance to use
        url (str): The url of the site to visit (e.g., "https://stackexchange.com/").
    """
    print(f"Visiting {url}")
    driver.get(url)

    WebDriverWait(driver, 10).until(
        lambda d: d.execute_script('return document.readyState') == 'complete'
    )


if __name__ == '__main__':
    # Set up Chrome options
    options = Options()
    options.add_argument("--headless")  # Run Chrome in headless mode
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--remote-debugging-port=9222")
    options.binary_location = "chrome-linux64/chrome"

    # Initialize the WebDriver
    driver = webdriver.Chrome(options=options, service=Service("chromedriver-linux64/chromedriver"))

    try:
        visit_url(driver, "https://stackoverflow.com/")

        # For debugging purposes (if you can even access it)
        driver.save_screenshot("stack_overflow.png")

    except Exception as e:
        print(f"An error occurred: {e}")

    finally:
        # Always close the browser; quit() closes every window and stops chromedriver
        print("Finished! Closing...")
        driver.quit()
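Getting it to recognize Chrome was very finicky, since the binary is not in its usual location. Still, this is a basic script you can build your program on. It saves a screenshot, and you can watch it work at localhost:9222, though I'm not too sure how that would work out in practice.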
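On the localhost:9222 part: with --remote-debugging-port set, Chrome serves a small HTTP interface on that port, and /json/version is one of its endpoints. A quick way to check that the port is actually reachable from inside the container is to fetch that endpoint while the browser is still running. This is only a sketch. The function names are hypothetical, and on a managed MWAA worker the port is most likely not reachable from your own machine.

```python
import json
import urllib.request


def devtools_url(port=9222, host="localhost", path="/json/version"):
    """Build the URL of a Chrome DevTools HTTP endpoint."""
    return f"http://{host}:{port}{path}"


def fetch_devtools_info(port=9222, host="localhost", timeout=5):
    """Fetch and decode the JSON served at /json/version (raises if the port is closed)."""
    with urllib.request.urlopen(devtools_url(port, host), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_devtools_info() from a second shell in the same container, before driver.quit() runs, should return a dictionary describing the browser. A connection error means nothing is listening on that port.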