I'm trying to scrape professional phone numbers from a French website, but I get a 403 error and am blocked by Cloudflare. I'm using Selenium with Scrapy. I added the scrapy cloudflare middleware, but it still doesn't work. I also added some option arguments to the Selenium options.
spider.py:
import scrapy
import random
from scrapy_selenium import SeleniumRequest
from scrapy.selector import Selector
from selenium import webdriver


class ApiPbSpider(scrapy.Spider):
    name = 'api_pb'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.pagesjaunes.fr/pagesblanches/recherche?quoiqui=sylvie&ou=Saint+Beno%C3%AEt+%2886280%29&univers=pagesblanches&idOu=L08621400',
            callback=self.parse,
            wait_time=15,
        )

    def parse(self, response):
        driver = response.meta['driver']
        code_page = driver.page_source
        print(code_page)
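To tell whether the 403 is really a Cloudflare challenge page rather than an application-level block, it can help to inspect the returned HTML for challenge markers. A minimal sketch; the marker strings are heuristics I'm assuming, not an official list:

```python
def looks_like_cloudflare_challenge(html: str) -> bool:
    """Heuristic check for a Cloudflare interstitial/challenge page."""
    markers = (
        "cf-browser-verification",
        "challenge-platform",
        "checking your browser before accessing",
        "cf-chl",
    )
    lowered = html.lower()
    # Any one of these substrings strongly suggests a challenge page
    return any(marker in lowered for marker in markers)


# Hypothetical usage inside parse():
# if looks_like_cloudflare_challenge(code_page):
#     self.logger.warning("Got a Cloudflare challenge page, not real content")
```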
settings.py:
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # '--headless' if using Chrome instead of Firefox
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language':'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7',
'Origin': 'https://www.pagesjaunes.fr/pagesblanches/',
'Referer':'https://www.pagesjaunes.fr/pagesblanches/',
'Sec-Ch-Ua':'"Google Chrome";v="113", "Chromium";v="113", "Not-A.Brand";v="24"',
'Sec-Ch-Ua-Mobile':'?0',
'Sec-Ch-Ua-Platform':'"Windows"',
'Sec-Fetch-Dest':'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-site',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'api_pages_blanches.middlewares.ApiPagesBlanchesSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
# The priority of 560 is important, because we want this middleware to kick in just before the scrapy built-in `RetryMiddleware`.
'scrapy_cloudflare_middleware.middlewares.CloudFlareMiddleware': 560,
'scrapy_selenium.SeleniumMiddleware': 800
}
However, if I add a residential proxy, I get a 200 status code, but the body I receive is empty. Do you have any ideas?
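When a proxy returns 200 with an apparently empty body, it is worth logging the body length and dumping the rendered page source to a file for offline inspection, to see what was actually served. A small sketch; the file name is arbitrary:

```python
def dump_page_source(html: str, path: str = "page_dump.html") -> int:
    """Save the rendered page for offline inspection; return the character count written."""
    data = html or ""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(data)
    return len(data)


# Hypothetical usage inside parse():
# length = dump_page_source(driver.page_source)
# self.logger.info("Saved %d characters of page source", length)
```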
I've been blocked by some websites before because I was running Selenium in headless mode. First, try turning headless mode off. If that works, it means JavaScript is needed to load the website before you can continue scraping.

Just remove the line

SELENIUM_DRIVER_ARGUMENTS = ['--headless']

from settings.py.
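For reference, a minimal settings.py fragment with headless mode disabled might look like this (assuming chromedriver is on PATH; with an empty argument list, a visible browser window will open for each request):

```python
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = []  # no '--headless': run a visible browser
```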