Can't get the desired result using a try/except clause within scrapy


I've written a script in scrapy to make proxied requests using newly generated proxies from the get_proxies() method. I used the requests module to fetch the proxies in order to reuse them in the script. What I'm trying to do is parse all the movie links from its landing page and then fetch the name of each movie from its target page. My script below can make use of proxy rotation.

I know there is an easier way to switch proxies, like the HttpProxyMiddleware described here, but I would still like to stick with the approach I'm trying.
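(For reference, that middleware-based alternative would look roughly like the sketch below. RandomProxyMiddleware, the module path, and the priority number are placeholder names of mine, not something I'm actually using; it reuses the get_proxies() helper from the script further down.)

import random

class RandomProxyMiddleware:
    """Sketch of a rotating-proxy downloader middleware (placeholder name)."""

    def __init__(self):
        # Build the proxy pool once, using the helper defined in the script below.
        self.proxies = get_proxies()

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy before the request is downloaded.
        request.meta['proxy'] = f"http://{random.choice(self.proxies)}"

# It would then be enabled in the project settings, for example:
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomProxyMiddleware': 350}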

website link

This is my current attempt (it keeps using new proxies to try to fetch a valid response, but it gets 503 Service Unavailable every time):

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_proxies():   
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text,"lxml")
    proxy = [':'.join([item.select_one("td").text,item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxy

class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503]
    proxy_vault = get_proxies()
    check_url = "https://yts.am/browse-movies"

    def start_requests(self):
        random.shuffle(self.proxy_vault)
        proxy_url = next(cycle(self.proxy_vault))
        request = scrapy.Request(self.check_url,callback=self.parse,dont_filter=True)
        request.meta['https_proxy'] = f'http://{proxy_url}'
        yield request

    def parse(self,response):
        print(response.meta)
        if "DDoS protection by Cloudflare" in response.css(".attribution > a::text").get():
            random.shuffle(self.proxy_vault)
            proxy_url = next(cycle(self.proxy_vault))
            request = scrapy.Request(self.check_url,callback=self.parse,dont_filter=True)
            request.meta['https_proxy'] = f'http://{proxy_url}'
            yield request

        else:
            for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                nlink = response.urljoin(item)
                yield scrapy.Request(nlink,callback=self.parse_details)

    def parse_details(self,response):
        name = response.css("#movie-info h1::text").get()
        yield {"Name":name}

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT':'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()

To make sure the requests are being proxied, I printed response.meta and got results like {'https_proxy': 'http://142.93.127.126:3128', 'download_timeout': 180.0, 'download_slot': 'yts.am', 'download_latency': 0.237013578414917, 'retry_times': 2, 'depth': 0}.

As I've overused the link while checking how proxied requests work in scrapy, I'm getting the 503 Service Unavailable error at this point, and I can see the keyword DDoS protection by Cloudflare in the response. However, when I apply the same logic with the requests module, I get a valid response.
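(For comparison, the requests-based check looks roughly like this; it's a stripped-down sketch that reuses get_proxies() from above, without the rotation and error handling:)

import random
import requests

# Pick one proxy from the freshly scraped pool and fetch the landing page through it.
proxy = random.choice(get_proxies())
res = requests.get("https://yts.am/browse-movies",
                   proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
                   headers={"User-Agent": "Mozilla/5.0"})
print(res.status_code)  # a valid response here, unlike the scrapy attempt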

My earlier question: why can't I get a valid response when (I suppose) I'm using proxies in the right way? [solved]

Bounty question: How can I define a try/except clause within my script so that, once it hits a connection error with a certain proxy, it tries again with a different one?

python python-3.x web-scraping scrapy
1 Answer

4 votes

According to the scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware docs (and source), the proxy meta key is expected (not https_proxy):

#request.meta['https_proxy'] = f'http://{proxy_url}'  
request.meta['proxy'] = f'http://{proxy_url}'

Since scrapy did not receive a valid meta key, your scrapy application was not using the proxy at all.
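Applied to the start_requests() from the question, the fix would look like this (only the meta key changes; the rest is the question's own code):

    def start_requests(self):
        random.shuffle(self.proxy_vault)
        proxy_url = next(cycle(self.proxy_vault))
        request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
        request.meta['proxy'] = f'http://{proxy_url}'   # 'proxy', not 'https_proxy'
        yield request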


3 votes

The start_requests() function is just the entry point. On subsequent requests, you need to re-supply this metadata to the Request object.

In addition, errors can occur on two levels: the proxy server and the target server.

We need to handle bad response codes from both the proxy server and the target server. Proxy-server errors are returned by the downloader middleware to the errback function. Target-server responses can be handled during parsing via response.status.

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


def get_proxies():
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in
             soup.select("table.table tbody tr") if "yes" in item.text]
    # proxy = ['https://52.0.0.1:8090', 'https://52.0.0.2:8090']
    return proxy


def get_random_proxy(proxy_vault):
    random.shuffle(proxy_vault)
    proxy_url = next(cycle(proxy_vault))
    return proxy_url


class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503, 502, 401, 403]
    check_url = "https://yts.am/browse-movies"
    proxy_vault = get_proxies()

    def handle_middleware_errors(self, failure):
        # Called when the request fails before a response arrives
        # (e.g. the proxy refuses the connection or times out).
        print('Middleware Error:', failure)
        # Retry the same URL with a different proxy.
        yield self.make_request(url=failure.request.url, callback=failure.request.meta['callback'])

    def start_requests(self):
        yield self.make_request(url=self.check_url, callback=self.parse)

    def make_request(self, url, callback, dont_filter=True):
        return scrapy.Request(url,
                              meta={'proxy': f'https://{get_random_proxy(self.proxy_vault)}', 'callback': callback},
                              callback=callback,
                              dont_filter=dont_filter,
                              errback=self.handle_middleware_errors)

    def parse(self, response):
        print(response.meta)
        try:
            if response.status != 200:
                # implement server status code handling here - this loops forever
                print(f'Status code: {response.status}')
                raise
            else:
                for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                    nlink = response.urljoin(item)
                    yield self.make_request(url=nlink, callback=self.parse_details)
        except:
            # if anything goes wrong fetching the lister page, try again
            yield self.make_request(url=self.check_url, callback=self.parse)

    def parse_details(self, response):
        print(response.meta)
        try:
            if response.status != 200:
                # handle target-server status codes here - note: this retries forever
                print(f'Status code: {response.status}')
                raise ValueError(f'Bad status code: {response.status}')
            name = response.css("#movie-info h1::text").get()
            yield {"Name": name}
        except Exception:
            # if anything goes wrong fetching the detail page, try again with another proxy
            yield self.make_request(url=response.request.url, callback=self.parse_details)


if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()
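As the comments in the code note, the non-200 branches retry forever. One possible refinement, a sketch of mine rather than part of the answer above (max_proxy_retries is a made-up attribute and 5 an arbitrary cap), is to carry a retry counter in meta and give up after a few attempts. The methods below are drop-in replacements for the corresponding ProxySpider methods:

    max_proxy_retries = 5  # arbitrary cap (name and value are assumptions)

    def make_request(self, url, callback, dont_filter=True, retries=0):
        return scrapy.Request(url,
                              meta={'proxy': f'https://{get_random_proxy(self.proxy_vault)}',
                                    'callback': callback,
                                    'retries': retries},  # carry the attempt count along
                              callback=callback,
                              dont_filter=dont_filter,
                              errback=self.handle_middleware_errors)

    def parse(self, response):
        if response.status != 200:
            retries = response.meta.get('retries', 0)
            if retries >= self.max_proxy_retries:
                # Stop retrying once the cap is reached instead of looping forever.
                self.logger.error(f'Giving up on {response.url} after {retries} attempts')
                return
            yield self.make_request(url=self.check_url, callback=self.parse,
                                    retries=retries + 1)
            return
        for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
            yield self.make_request(url=response.urljoin(item), callback=self.parse_details)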