Periodic HTTP Error 413 when scraping multiple pages

Problem description (votes: 0, answers: 2)

I am scraping posts from Wykop.pl ('the Polish Reddit') by looping through the multiple pages returned when I search the site for a keyword I am interested in. I wrote a loop to iterate over the target content on each page; however, the loop consistently terminates on certain pages with the error "HTTP Error 413: Request Entity Too Large".

I have tried scraping the problematic pages individually, but the same error message keeps coming back. To work around it, I have had to set my ranges manually to collect the data, at the cost of losing a lot of it, and I would like to know whether there is a Pythonic way to handle this error. I have also tried longer pauses, in case I was risking sending too many requests, but that does not seem to be the issue.

from time import sleep, time
from random import randint
from warnings import warn
import requests
from requests import get
from bs4 import BeautifulSoup
from mtranslate import translate
from IPython.core.display import clear_output

posts = []
votes = []
dates = []
images = []
users = []

start_time = time()
requests = 0
pages = [str(i) for i in range(1,10)]

for page in pages:
    url = "https://www.wykop.pl/szukaj/wpisy/smog/strona/" + page + "/"
    response = get(url)

    # Pause the loop
    sleep(randint(8,15))

    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)
    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    # Break the loop if the number of requests is greater than expected
    if requests > 10:
        warn('Number of requests was greater than expected.')
        break


    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all('li', class_="entry iC")


    for result in results:
        # Error handling
        try:
            post = result.find('div', class_="text").text
            post = translate(post, 'en', 'auto')
            posts.append(post)

            date = result.time['title']
            dates.append(date)

            vote = result.p.b.span.text
            vote = int(vote)
            votes.append(vote)

            user = result.div.b.text
            users.append(user)

            image = result.find('img', class_='block lazy')
            images.append(image)

        except AttributeError as e:
            print(e)

If I could run the script in one go, I would set the range from 1 to 163 (since I have 163 pages of post results mentioning the keyword I am interested in). As it is, I have had to set smaller ranges and collect the data incrementally, again at the cost of losing pages of data.

My other option, as a contingency, is to scrape the flagged problematic pages from HTML documents that I have downloaded to my desktop.
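For reference, a minimal sketch of that contingency, reusing the same BeautifulSoup selectors on locally saved pages (the Desktop path and file naming pattern here are assumptions, not the actual files):

from pathlib import Path
from bs4 import BeautifulSoup

# Hypothetical location/naming of the manually downloaded result pages
saved_pages = sorted(Path.home().glob("Desktop/wykop_smog_page_*.html"))

local_posts = []
for html_file in saved_pages:
    # Parse the saved page exactly like a live response body
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    for result in soup.find_all("li", class_="entry iC"):
        text_div = result.find("div", class_="text")
        if text_div is not None:
            local_posts.append(text_div.text)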

python pandas web-scraping runtime-error
2 Answers
0 votes

You have possibly run into some kind of IP-address-based limit. When I ran your script it worked for me without any rate limiting (at the moment). I would suggest you use requests.Session() (you will need to rename your requests variable, otherwise it overwrites the import). This can also help reduce possible memory leak issues.

For example:

from bs4 import BeautifulSoup
from time import sleep
from time import time
from random import randint
import requests

posts = []
votes = []
dates = []
images = []
users = []

start_time = time()
request_count = 0
req_sess = requests.Session()

for page_num in range(1, 100):
    response = req_sess.get(f"https://www.wykop.pl/szukaj/wpisy/smog/strona/{page_num}/")

    # Pause the loop
    #sleep(randint(1,3))

    # Monitor the requests
    request_count += 1
    elapsed_time = time() - start_time
    print('Page {}; Request:{}; Frequency: {} requests/s'.format(page_num, request_count, request_count/elapsed_time))

    #clear_output(wait = True)
    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        print('Request: {}; Status code: {}'.format(request_count, response.status_code))
        print(response.headers)

    # Break the loop if the number of requests is greater than expected
    #if requests > 10:
    #    print('Number of requests was greater than expected.')
    #    break

    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all('li', class_="entry iC")

    for result in results:
        # Error handling
        try:
            post = result.find('div', class_="text").text
            #post = translate(post,'en','auto')
            posts.append(post)

            date = result.time['title']
            dates.append(date)

            vote = result.p.b.span.text
            vote = int(vote)
            votes.append(vote)

            user = result.div.b.text
            users.append(user)

            image = result.find('img',class_='block lazy')
            images.append(image)

        except AttributeError as e:
            print(e)

This gives the following output:

Page 1; Request:1; Frequency: 1.246137372973911 requests/s
Page 2; Request:2; Frequency: 1.3021880233774552 requests/s
Page 3; Request:3; Frequency: 1.2663757427416629 requests/s
Page 4; Request:4; Frequency: 1.1807827876080845 requests/s                
.
.
.
Page 96; Request:96; Frequency: 0.8888853607003809 requests/s
Page 97; Request:97; Frequency: 0.8891876183362001 requests/s
Page 98; Request:98; Frequency: 0.888801819672809 requests/s
Page 99; Request:99; Frequency: 0.8900784741536467 requests/s                

It also worked fine when I started at higher page numbers. In theory, when you do get a 413 error status code, the code above should now print the response headers. According to RFC 7231, the server ought to return a Retry-After header field, which you can use to decide how long to back off before sending your next request.
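For what it's worth, here is a minimal sketch of that back-off idea (the 60-second fallback and the max_retries value are assumptions, not anything the server is known to require):

import requests
from time import sleep

req_sess = requests.Session()

def get_with_backoff(url, max_retries=3):
    """Fetch a URL, retrying on HTTP 413 and honouring Retry-After when present."""
    response = req_sess.get(url)
    for attempt in range(max_retries):
        if response.status_code != 413:
            return response
        # Retry-After may be missing or non-numeric; fall back to a fixed wait
        retry_after = response.headers.get("Retry-After", "")
        wait = int(retry_after) if retry_after.isdigit() else 60
        print(f"HTTP 413 on attempt {attempt + 1}; backing off for {wait} s")
        sleep(wait)
        response = req_sess.get(url)
    return response

response = get_with_backoff("https://www.wykop.pl/szukaj/wpisy/smog/strona/13/")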


0 votes

OK, so here's the catch:

The 413 error had nothing to do with Wykop, the site being scraped, but with the mtranslate package, which relies on the Google Translate API. What was happening in my original code is that posts were being translated from Polish to English as Wykop was scraped. However, the Google Translate API has a limit of 100,000 characters per 100 seconds per user, so by the time the code reached page 13, mtranslate had hit Google Translate's request limit. That is why Martin's solution, with the translate call stripped out, scraped the data without any problem.

I came to this conclusion because, when I later used the module to translate posts stored in a dataframe, I ran into the same error at roughly the 8% mark of the translation loop.
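Building on that conclusion, one possible workaround is to throttle the translation step itself rather than the scraping. A rough sketch, assuming the 100,000 characters per 100 seconds quota described above and the translate(text, to_language, from_language) call used in the question:

from time import sleep, time
from mtranslate import translate

def translate_posts(posts, char_quota=100_000, window=100):
    """Translate posts while staying under a rough characters-per-window quota."""
    translated = []
    window_start = time()
    chars_in_window = 0
    for post in posts:
        if chars_in_window + len(post) > char_quota:
            # Quota reached: wait out the remainder of the current window
            sleep(max(0, window - (time() - window_start)))
            window_start = time()
            chars_in_window = 0
        translated.append(translate(post, 'en', 'auto'))
        chars_in_window += len(post)
    return translated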
