I wrote a complex web scraper on my laptop that takes a long time to run. I've started using Databricks, and I'd like to run the script on my Databricks cluster so the scraper doesn't depend on my local machine.
However, I can't get the environment set up correctly. There are plenty of Stack Overflow threads about this, but I haven't been able to figure it out.
Here is how I configured it:
%pip install selenium
%pip install chromedriver
%pip install webdriver_manager
%pip install beautifulsoup4
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.core.utils import ChromeType
import time
from bs4 import BeautifulSoup
import pickle as pkl
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
Depending on which code I try, I get one of these errors:
WebDriverException: Message: unknown error: cannot find Chrome binary
WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/chromium-browser is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x55c30eb24d93......
I'm currently stuck on the second error. I'd be grateful if anyone could tell me how to configure my environment so that my code runs!
Here are some things I've already tried from Stack Overflow:
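For example, one suggestion was to install Chromium and a matching chromedriver on the cluster nodes through a cluster-scoped init script rather than pip, since chromedriver is a native binary and not a Python package. A sketch of such a script (the package names are an assumption; I haven't confirmed they exist for the Ubuntu version my Databricks runtime uses, and on newer Ubuntu releases these packages may be snap-only):

```shell
#!/bin/bash
# Cluster-scoped init script: install a Chromium binary and its matching
# driver on every node. Assumes an Ubuntu-based Databricks runtime where
# these apt packages are available (not snap-only).
sudo apt-get update
sudo apt-get install -y chromium-browser chromium-chromedriver
# The driver then typically ends up on PATH as /usr/bin/chromedriver
```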