I built a working spider with the Scrapy module in Python; however, while running it, the spider gets blocked at some point. After investigating, I learned that I could prevent this by enabling downloader middlewares in the settings that attach a fake user agent and a different proxy IP to each request. Yet even after implementing this, I still receive a 403 error at some point.
Here is my middleware implementation:
```python
import random

import requests


class ScrapeOpsFakeUserAgentMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.scrapeops_api_key = settings.get('SCRAPEOPS_API_KEY')
        self.scrapeops_endpoint = settings.get(
            'SCRAPEOPS_FAKE_USER_AGENT_ENDPOINT',
            'http://headers.scrapeops.io/v1/user-agents?')
        self.scrapeops_fake_user_agents_active = settings.get(
            'SCRAPEOPS_FAKE_USER_AGENT_ENABLED', False)
        self.scrapeops_num_results = settings.get('SCRAPEOPS_NUM_RESULTS')
        self.headers_list = []
        self._get_user_agents_list()
        self._scrapeops_fake_user_agents_enabled()

    def _get_user_agents_list(self):
        payload = {'api_key': self.scrapeops_api_key}
        if self.scrapeops_num_results is not None:
            payload['num_results'] = self.scrapeops_num_results
        # requests encodes the dict itself, so urlencode() is unnecessary here
        response = requests.get(self.scrapeops_endpoint, params=payload)
        json_response = response.json()
        self.user_agents_list = json_response.get('result', [])

    def _get_random_user_agent(self):
        # random.choice is simpler than indexing with randint and avoids
        # a ValueError when the list handling is changed later
        return random.choice(self.user_agents_list)

    def _scrapeops_fake_user_agents_enabled(self):
        # The original version set the flag to False and then unconditionally
        # overwrote it with True; the else branch restores the intended logic
        if self.scrapeops_api_key is None or self.scrapeops_api_key == '' \
                or not self.scrapeops_fake_user_agents_active:
            self.scrapeops_fake_user_agents_active = False
        else:
            self.scrapeops_fake_user_agents_active = True

    def process_request(self, request, spider):
        # Only rewrite the header when the middleware is actually enabled;
        # the original never consulted this flag
        if self.scrapeops_fake_user_agents_active:
            request.headers['User-Agent'] = self._get_random_user_agent()


class RandomProxyMiddleware:

    def __init__(self, settings):
        self.proxies = settings.get('ROTATING_PROXY_LIST', [])

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        if not self.proxies:
            return
        request.meta['proxy'] = random.choice(self.proxies)
```
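For completeness, both classes also have to be registered in `settings.py`, or Scrapy never calls their `process_request`. A minimal sketch, assuming the classes live in a hypothetical `myproject/middlewares.py` (the module path, priorities, and placeholder values are assumptions, not part of the original post):

```python
# settings.py -- register both downloader middlewares
# 'myproject.middlewares' is a hypothetical module path; adjust to your project
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ScrapeOpsFakeUserAgentMiddleware': 400,
    'myproject.middlewares.RandomProxyMiddleware': 410,
}

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'          # placeholder
SCRAPEOPS_FAKE_USER_AGENT_ENABLED = True

ROTATING_PROXY_LIST = [                      # placeholder proxies
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
]
```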
Can someone help me find a solution?
A 403 status code usually means the website has rejected your request. If the site does not require a login or session cookies, rotating (dynamic) proxies can help. If accounts are involved, you can buy several accounts, bind each one to its own residential IP, and crawl with those.
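With rotating proxies in place, it can also help to let Scrapy retry 403 responses so a blocked request gets a second chance with a fresh proxy and user agent. A minimal sketch using Scrapy's built-in retry middleware settings (the specific values are assumptions to tune for your target site):

```python
# settings.py -- retry 403s so the next attempt goes out with a new
# proxy/user-agent combination from the rotating middlewares
RETRY_ENABLED = True
RETRY_TIMES = 5                      # extra attempts per request
RETRY_HTTP_CODES = [403, 429, 503]   # treat these statuses as retryable
DOWNLOAD_DELAY = 1.0                 # aggressive request rates alone can trigger 403s
```

Note that if the site blocks on browser fingerprinting or JavaScript challenges rather than on IP/user-agent alone, rotating headers and proxies will not be enough by itself.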