How to respect different crawl delays while looping over an array of URLs?


When using the urllib3 or requests module together with ThreadPoolExecutor from concurrent.futures, how can I respect different crawl delays while looping over an array of URLs?

In the example below, the first three URLs belong to the same server and share the same crawl delay (so after fetching one page I must not issue another request, for the same or any other page on that server, until the delay has elapsed). The remaining URLs are on different servers and have to respect a different crawl delay, or none at all.

urls = [
    "https://news.ycombinator.com/",      #Crawl-Delay 30
    "https://news.ycombinator.com/?p=2",  #Crawl-Delay 30
    "https://news.ycombinator.com/?p=3",  #Crawl-Delay 30
    "https://www.infoworld.com/",         #Crawl-Delay None
    "https://www.theregister.com",        #Crawl-Delay 5
]

Here is an excerpt of the code I am using:

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from bs4 import BeautifulSoup

def fetch_url(url):
    response = requests.get(url)
    data = response.content
    return BeautifulSoup(data, "lxml")

def main():
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(fetch_url, url): url for url in urls}

        for future in as_completed(futures):
            url = futures[future]
            try:
                soup = future.result()
            except Exception as e:
                print(f"Error processing {url}: {e}")
1 Answer

This code basically uses a queue so that each thread picks up its own URL, sleeps for its DELAY, and runs some logic. It should work, but you will have to edit it to include the scraping logic you actually need. Good luck!

import threading
import queue
import time
import requests

q = queue.Queue()
urls = [{'URL': "https://news.ycombinator.com/", 'DELAY': 30},
        {'URL': "https://news.ycombinator.com/?p=2", 'DELAY': 30},
        {'URL': "https://news.ycombinator.com/?p=3", 'DELAY': 30},
        {'URL': "https://www.infoworld.com/", 'DELAY': 0},
        {'URL': "https://www.theregister.com/", 'DELAY': 5},]

# Fill the queue so each worker thread can grab one URL/delay pair.
for url_data in urls:
    q.put(url_data)


def scrape_sites():
    global q
    while not q.empty():
        url_data = q.get()
        url = url_data['URL']
        delay = url_data['DELAY']
        while True: # replace this with some logic that will stop the loop when you want
            try:
                res = requests.get(url)
                # do your bs4 scraping stuff here
                time.sleep(delay)
            except Exception as e:
                print(e)
            

# Start one thread per queued URL.
for _ in range(q.qsize()):
    t = threading.Thread(target=scrape_sites)
    t.start()
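
If you would rather keep the ThreadPoolExecutor from the question, the same idea can be expressed as a per-host rate limit: record, for each host, the earliest time the next request is allowed, and let each worker sleep until that time while holding a per-host lock. The following is only a rough sketch; the CRAWL_DELAYS mapping, the lock bookkeeping, and the printed output are illustrative assumptions, not something requests or concurrent.futures provide for you.

import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

# Assumed crawl delays per host, taken from the comments in the question.
CRAWL_DELAYS = {
    "news.ycombinator.com": 30,
    "www.infoworld.com": 0,
    "www.theregister.com": 5,
}

urls = [
    "https://news.ycombinator.com/",
    "https://news.ycombinator.com/?p=2",
    "https://news.ycombinator.com/?p=3",
    "https://www.infoworld.com/",
    "https://www.theregister.com",
]

# One lock and one "earliest next request" timestamp per host.
host_locks = {host: threading.Lock() for host in CRAWL_DELAYS}
next_allowed = {host: 0.0 for host in CRAWL_DELAYS}

def fetch_url(url):
    host = urlparse(url).netloc
    delay = CRAWL_DELAYS.get(host, 0)
    with host_locks.setdefault(host, threading.Lock()):
        # Wait until this host's crawl delay has elapsed, then reserve the next slot.
        wait = next_allowed.get(host, 0.0) - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        next_allowed[host] = time.monotonic() + delay
    response = requests.get(url)
    return BeautifulSoup(response.content, "lxml")

def main():
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(fetch_url, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                soup = future.result()
                print(url, soup.title.string if soup.title else "")
            except Exception as e:
                print(f"Error processing {url}: {e}")

if __name__ == "__main__":
    main()

Note that workers waiting for a host's delay simply block, so with long delays and many URLs per host the pool spends most of its time sleeping; for a large crawl a real per-host scheduler would be better, but this keeps the structure of the original question.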