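I'm trying to set up my small scraper as a Docker project. I'm using Selenium Wire so that I can run multiple requests at once. Now I want to add a proxy, but I'm running into several problems.
Here is my code:
requirements.txt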
selenium==4.0.0
selenium-wire==5.1.0
blinker==1.7.0
setuptools==74.0.0
requests
fake_useragent==1.5.1
My Docker Compose file:
version: '3'
services:
  chrome:
    image: selenium/node-chrome:4.10.0
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - SE_NODE_MAX_SESSIONS=10
    networks:
      - selenium-network
  selenium-hub:
    image: selenium/hub:4.10.0
    container_name: selenium-hub
    ports:
      - "4444:4444"
    networks:
      - selenium-network
  python-app:
    build:
      context: .
      dockerfile: Dockerfile.future
    depends_on:
      - selenium-hub
    networks:
      - selenium-network
networks:
  selenium-network:
    driver: bridge
Dockerfile.future
FROM python
WORKDIR /
COPY requirements.txt .
COPY test_with_futures.py .
RUN pip install -r requirements.txt
CMD ["python", "test_with_futures.py"]
And my Python code, test_with_futures.py:
print("Docker gestartet")
import time
import concurrent.futures  # a bare "import concurrent" does not load the futures submodule
from random import random

from fake_useragent import UserAgent
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
def call_function():
    driver = None  # Lets the finally block run safely if the session never starts.
    try:
        # Proxy settings
        PROXY = "http://USER:[email protected]:20000"
        # Necessary because it otherwise wants to use Firefox.
        desired_capabilities = DesiredCapabilities.CHROME.copy()
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        # If I don't use this, the proxy is not used and everything happens with my own IP.
        chrome_options.add_argument(f'--proxy-server={PROXY}')
        # Set the proxy in the Selenium Wire options
        seleniumwire_options = {
            'auto_config': False,
            'proxy': {
                'http': PROXY,
                'https': PROXY
            }
        }
        driver = webdriver.Remote(
            command_executor="http://selenium-hub:4444/wd/hub",
            options=chrome_options,
            seleniumwire_options=seleniumwire_options,
            desired_capabilities=desired_capabilities
        )
        print("----------------- AKTUELLER PROXY -------------------------------------------")
        print(driver.proxy)
        print("------------------------------------------------------------")
        driver.get("https://ip.smartproxy.com/json")
        wait = WebDriverWait(driver, 50)  # Wait up to 50 seconds
        pre_element = wait.until(EC.presence_of_element_located((By.XPATH, "/html/body/pre")))
        res = pre_element.text
        print("----------------- DRIVER -----------------------------------")
        print(res)
        print("------------------------------------------------------------")
        return "", ""  # Not used at the moment
    except Exception as e:
        print(e)
        return "", ""  # Keep the tuple shape so the caller can still unpack it.
    finally:
        if driver is not None:
            driver.quit()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = []
    for i in range(0, 5):
        print(f"Webseitenindex {i}")
        future = executor.submit(call_function)
        futures.append(future)
        time.sleep(1)
    for future in futures:
        ipv4_value, ipv6_value = future.result()  # Unpack the tuple
        print(f"IPv4: {ipv4_value}, IPv6: {ipv6_value}")
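The submit-and-collect pattern above can be tried without a browser or a Selenium Grid. The following is only a minimal sketch: `fake_worker` and `run_all` are hypothetical names that stand in for `call_function` and the executor loop, and the simulated failure for index 2 is an assumption made purely for illustration. It shows why the worker should return the same two-element tuple on both the success path and the error path, so that unpacking `future.result()` never fails.

```python
import concurrent.futures


def fake_worker(index):
    """Hypothetical stand-in for call_function(); no browser is started."""
    try:
        if index == 2:
            # Simulated failure, e.g. the remote session could not be created.
            raise RuntimeError("simulated session failure")
        return f"ipv4-{index}", f"ipv6-{index}"
    except Exception as exc:
        print(f"worker {index} failed: {exc}")
        return "", ""  # Same shape as the success case.


def run_all(count):
    """Submit `count` workers and collect their results keyed by index."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        # Map each future back to the index it was submitted with.
        futures = {executor.submit(fake_worker, i): i for i in range(count)}
        for future in concurrent.futures.as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    for index, (ipv4, ipv6) in sorted(run_all(5).items()):
        print(f"{index}: IPv4: {ipv4}, IPv6: {ipv6}")
```

Using `as_completed` instead of iterating over the list in submission order means a slow session does not hold up the results of the faster ones.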
Can someone help me please?
This is the error message I get:
2024-10-07 11:41:07 Message:
2024-10-07 11:41:07 Stacktrace:
2024-10-07 11:41:07 #0 0x55c1df1544e3 <unknown>
2024-10-07 11:41:07 #1 0x55c1dee83c76 <unknown>
2024-10-07 11:41:07 #2 0x55c1deebfc96 <unknown>
2024-10-07 11:41:07 #3 0x55c1deebfdc1 <unknown>
2024-10-07 11:41:07 #4 0x55c1deef97f4 <unknown>
2024-10-07 11:41:07 #5 0x55c1deedf03d <unknown>
2024-10-07 11:41:07 #6 0x55c1deef730e <unknown>
2024-10-07 11:41:07 #7 0x55c1deedede3 <unknown>
2024-10-07 11:41:07 #8 0x55c1deeb42dd <unknown>
2024-10-07 11:41:07 #9 0x55c1deeb534e <unknown>
2024-10-07 11:41:07 #10 0x55c1df1143e4 <unknown>
2024-10-07 11:41:07 #11 0x55c1df1183d7 <unknown>
2024-10-07 11:41:07 #12 0x55c1df122b20 <unknown>
2024-10-07 11:41:07 #13 0x55c1df119023 <unknown>
2024-10-07 11:41:07 #14 0x55c1df0e71aa <unknown>
2024-10-07 11:41:07 #15 0x55c1df13d6b8 <unknown>
2024-10-07 11:41:07 #16 0x55c1df13d847 <unknown>
2024-10-07 11:41:07 #17 0x55c1df14d243 <unknown>
2024-10-07 11:41:07 #18 0x7f358377d609 start_thread
I solved the problem. Selenium Wire does not support user/password authentication. I switched to a whitelisted IP for authentication, and it worked.
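A minimal sketch of that change, assuming the proxy endpoint stays the same and only the `USER:PASS@` part is dropped once the machine's IP is whitelisted with the proxy provider. `proxy.example.com` is a placeholder host, and `strip_credentials` and `build_seleniumwire_options` are hypothetical helper names, not part of Selenium Wire. The resulting dictionary has the same shape as the `seleniumwire_options` in the question.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(proxy_url):
    """Return proxy_url without the user:password@ part."""
    parts = urlsplit(proxy_url)
    host = parts.hostname or ""
    # Rebuild the network location from host and port only.
    netloc = f"{host}:{parts.port}" if parts.port else host
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def build_seleniumwire_options(proxy_url):
    """Build the options dict from the question, minus the credentials."""
    endpoint = strip_credentials(proxy_url)
    return {
        "auto_config": False,
        "proxy": {"http": endpoint, "https": endpoint},
    }


if __name__ == "__main__":
    # Placeholder URL; substitute the real proxy endpoint.
    options = build_seleniumwire_options("http://USER:PASS@proxy.example.com:20000")
    print(options["proxy"]["https"])  # http://proxy.example.com:20000
```

The same credential-free endpoint would then also go into the `--proxy-server` Chrome argument, since the question passes the proxy in both places.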