Scraping a website with Python using the Scrapy module

I created a working spider using the Scrapy module in Python; however, at some point while it runs, the spider gets blocked. I investigated and learned that, to prevent this, I could enable middlewares through the settings that attach a fake user agent and a different proxy IP to every request. However, even after implementing this, I still receive a 403 error at some point.

Here is my middleware implementation:

import random

import requests


class ScrapeOpsFakeUserAgentMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.scrapeops_api_key = settings.get('SCRAPEOPS_API_KEY')
        self.scrapeops_endpoint = settings.get('SCRAPEOPS_FAKE_USER_AGENT_ENDPOINT', 'http://headers.scrapeops.io/v1/user-agents?')
        self.scrapeops_fake_user_agents_active = settings.get('SCRAPEOPS_FAKE_USER_AGENT_ENABLED', False)
        self.scrapeops_num_results = settings.get('SCRAPEOPS_NUM_RESULTS')
        self.user_agents_list = []
        self._get_user_agents_list()
        self._scrapeops_fake_user_agents_enabled()

    def _get_user_agents_list(self):
        payload = {'api_key': self.scrapeops_api_key}
        if self.scrapeops_num_results is not None:
            payload['num_results'] = self.scrapeops_num_results
        # requests encodes the params dict itself, so pass the dict directly
        # rather than a pre-encoded urlencode(payload) string
        response = requests.get(self.scrapeops_endpoint, params=payload)
        json_response = response.json()
        self.user_agents_list = json_response.get('result', [])

    def _get_random_user_agent(self):
        return random.choice(self.user_agents_list)

    def _scrapeops_fake_user_agents_enabled(self):
        # Stay enabled only when an API key is present and the setting is on
        if not self.scrapeops_api_key or not self.scrapeops_fake_user_agents_active:
            self.scrapeops_fake_user_agents_active = False
        else:
            self.scrapeops_fake_user_agents_active = True

    def process_request(self, request, spider):
        # Attach a random User-Agent only when the feature is active and the
        # user-agent list was actually fetched
        if self.scrapeops_fake_user_agents_active and self.user_agents_list:
            request.headers['User-Agent'] = self._get_random_user_agent()

class RandomProxyMiddleware:

    def __init__(self, settings):
        self.proxies = settings.get('ROTATING_PROXY_LIST', [])

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # With no proxies configured, leave the request untouched
        if not self.proxies:
            return
        # Route each request through a randomly chosen proxy
        request.meta['proxy'] = random.choice(self.proxies)
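For reference, enabling these middlewares in settings.py looks roughly like the sketch below; the myproject.middlewares module path, the priority numbers, the API key, and the proxy URLs are all placeholders, not values from the question:

# settings.py -- a sketch; module path, priorities and proxies are placeholders
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ScrapeOpsFakeUserAgentMiddleware': 400,
    'myproject.middlewares.RandomProxyMiddleware': 410,
}

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_FAKE_USER_AGENT_ENABLED = True
SCRAPEOPS_NUM_RESULTS = 50

ROTATING_PROXY_LIST = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
]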

Can someone help me find a solution?

python web-scraping scrapy http-status-code-403
1 Answer

A 403 status code usually means the website has rejected your request. If the site does not require a login or session cookies, rotating proxies can be enough. If accounts are involved, you can purchase several accounts, bind each account to its own residential IP, and then scrape.
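A minimal sketch of that account-per-residential-IP idea in Scrapy; every credential, URL, and proxy address below is a made-up placeholder, and the 'cookiejar' meta key is used to keep each account's session separate:

import scrapy


# Hypothetical pairing of purchased accounts with dedicated residential IPs
ACCOUNT_PROXIES = [
    {'username': 'user1', 'password': 'pass1', 'proxy': 'http://residential-ip-1.example.com:8000'},
    {'username': 'user2', 'password': 'pass2', 'proxy': 'http://residential-ip-2.example.com:8000'},
]


class AccountSpider(scrapy.Spider):
    name = 'account_spider'

    def start_requests(self):
        for account in ACCOUNT_PROXIES:
            # Log each account in through its own residential IP and keep its
            # cookies in a per-account cookiejar so sessions do not mix
            yield scrapy.FormRequest(
                'https://example.com/login',  # placeholder login URL
                formdata={'username': account['username'], 'password': account['password']},
                meta={'proxy': account['proxy'], 'cookiejar': account['username']},
                callback=self.after_login,
            )

    def after_login(self, response):
        # Follow-up requests reuse the same proxy and cookiejar, so the site
        # always sees a consistent (account, IP) pair
        yield response.follow(
            '/data',  # placeholder path to scrape
            meta={'proxy': response.meta['proxy'], 'cookiejar': response.meta['cookiejar']},
            callback=self.parse_data,
        )

    def parse_data(self, response):
        yield {'url': response.url, 'status': response.status}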
