When using a ThreadPoolExecutor from concurrent.futures together with urllib3 or requests, how can I respect different crawl delays while looping over an array of URLs?
In the example below, the first three URLs are on the same server and share the same crawl delay (so after fetching one page, I must not issue another request for the same or any other page on that server until the delay has elapsed). The remaining URLs are on different servers and are subject to a different crawl delay, or to none at all.
urls = [
"https://news.ycombinator.com/", #Crawl-Delay 30
"https://news.ycombinator.com/?p=2", #Crawl-Delay 30
"https://news.ycombinator.com/?p=3", #Crawl-Delay 30
"https://www.infoworld.com/", #Crawl-Delay None
"https://www.theregister.com", #Crawl-Delay 5
]
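
For reference, the delays in the comments above are the kind of values a site declares in its robots.txt; they could be read with the standard-library urllib.robotparser, roughly like this (the "*" user agent is just a placeholder):

from urllib import robotparser
from urllib.parse import urlsplit

def get_crawl_delay(url, user_agent="*"):
    # Read the Crawl-delay directive for this URL's host, if robots.txt declares one.
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{urlsplit(url).netloc}/robots.txt")
    rp.read()
    return rp.crawl_delay(user_agent)  # None when no delay is declared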
Here is an excerpt of the code I am using:
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from bs4 import BeautifulSoup

def fetch_url(url):
    response = requests.get(url)
    data = response.content
    return BeautifulSoup(data, "lxml")

def main():
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(fetch_url, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                soup = future.result()
            except Exception as e:
                print(f"Error processing {url}: {e}")
This code basically uses a queue so that each thread grabs its own URL, sleeps for its "DELAY", and then runs some logic. It should work, but you will have to edit it to add the scraping logic you need. Good luck!
import threading
import queue
import time
import requests

q = queue.Queue()

urls = [{'URL': "https://news.ycombinator.com/", 'DELAY': 30},
        {'URL': "https://news.ycombinator.com/", 'DELAY': 30},
        {'URL': "https://news.ycombinator.com/?p=2", 'DELAY': 30},
        {'URL': "https://news.ycombinator.com/?p=3", 'DELAY': 30},
        {'URL': "https://news.infoworld.com/", 'DELAY': 0},
        {'URL': "https://news.theregister.com/", 'DELAY': 5}]

# fill the queue so each worker thread can pull its own URL
for url_data in urls:
    q.put(url_data)

def scrape_sites():
    while not q.empty():
        url_data = q.get()
        url = url_data['URL']
        delay = url_data['DELAY']
        while True:  # replace this with some logic that will stop the loop when you want
            try:
                res = requests.get(url)
                # do your bs4 scraping stuff here
                time.sleep(delay)  # wait out this URL's crawl delay before the next request
            except Exception as e:
                print(e)

# one worker thread per queued URL
for _ in range(q.qsize()):
    threading.Thread(target=scrape_sites).start()
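
If you would rather keep the ThreadPoolExecutor structure from the question, one option is to give every host its own lock and a "next allowed request" timestamp: same-host requests are serialized and wait out the delay, while requests to different hosts run in parallel. The sketch below is an illustration under assumptions, not a tested solution; the CRAWL_DELAYS mapping is hand-written (in practice you would fill it from robots.txt), and every URL's host is assumed to appear in it.

import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlsplit

import requests
from bs4 import BeautifulSoup

# Hand-written per-host crawl delays in seconds; None means "no delay" (assumption).
CRAWL_DELAYS = {
    "news.ycombinator.com": 30,
    "www.infoworld.com": None,
    "www.theregister.com": 5,
}

urls = [
    "https://news.ycombinator.com/",
    "https://news.ycombinator.com/?p=2",
    "https://news.ycombinator.com/?p=3",
    "https://www.infoworld.com/",
    "https://www.theregister.com",
]

# One lock and one "next allowed request time" per host.
host_locks = {host: threading.Lock() for host in CRAWL_DELAYS}
next_allowed = {host: 0.0 for host in CRAWL_DELAYS}

def fetch_url(url):
    host = urlsplit(url).netloc
    delay = CRAWL_DELAYS.get(host) or 0
    with host_locks[host]:               # serialize requests to the same host
        wait = next_allowed[host] - time.monotonic()
        if wait > 0:
            time.sleep(wait)             # honour this host's crawl delay
        response = requests.get(url)
        next_allowed[host] = time.monotonic() + delay
    return BeautifulSoup(response.content, "lxml")

def main():
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(fetch_url, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                soup = future.result()
            except Exception as e:
                print(f"Error processing {url}: {e}")

if __name__ == "__main__":
    main()

Holding the host lock across the request keeps same-host fetches strictly sequential, which is what a crawl delay requires, while a worker blocked on news.ycombinator.com never delays the infoworld.com or theregister.com fetches.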