ThreadPoolExecutor vs threading.Thread

Problem description (votes: 3, answers: 1)

I have a question about the performance of the ThreadPoolExecutor and Thread classes, and it seems to me I'm missing some basic understanding.

I have a web scraper with two functions. The first parses the links to each image on the site's home page, and the second downloads the images from the parsed links:

import threading
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup as bs
import os
from concurrent.futures import ThreadPoolExecutor

path = r'C:\Users\MyDocuments\Pythom\Networking\bbc_images_scraper_test'
url = 'https://www.bbc.co.uk'

# Function to parse link anchors for images
def img_links_parser(url, links_list):
    res = urllib.request.urlopen(url)
    soup = bs(res,'lxml')
    content = soup.findAll('div',{'class':'top-story__image'})

    for i in content:
        try:
            link = i.attrs['style']
            # Pulling the anchor from parentheses
            link = link[link.find('(')+1 : link.find(')')]
            # Putting the anchor in the list of links
            links_list.append(link)
        except KeyError:
            # links might be under the 'data-lazy' attribute without parentheses
            links_list.append(i.attrs['data-lazy'])

# Function to load images from links
def img_loader(base_url, links_list, path_location):
    for link in links_list:
        try:
            # Pulling last element off the link which is name.jpg
            file_name = link.split('/')[-1]
            # Following the link and saving content in a given directory
            urllib.request.urlretrieve(urllib.parse.urljoin(base_url, link),
                                       os.path.join(path_location, file_name))
        except Exception:
            print('Error on {}'.format(urllib.parse.urljoin(base_url, link)))
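
The links list used in both cases below isn't shown being built; presumably it is filled by calling the parser on the home page first. A minimal sketch, assuming that is how it was done:

# Build the list of image links from the home page before downloading
links = []
img_links_parser(url, links)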

The following code is split into two cases:

Case 1: using multiple threads:

threads = []
t1 = threading.Thread(target = img_loader, args = (url, links[:10], path))
t2 = threading.Thread(target = img_loader, args = (url, links[10:20], path))
t3 = threading.Thread(target = img_loader, args = (url, links[20:30], path))
t4 = threading.Thread(target = img_loader, args = (url, links[30:40], path))
t5 = threading.Thread(target = img_loader, args = (url, links[40:50], path))
t6 = threading.Thread(target = img_loader, args = (url, links[50:], path))

threads.extend([t1,t2,t3,t4,t5,t6])
for t in threads:
    t.start()
for t in threads:
    t.join()

The code above does its job in 10 seconds on my machine.

Case 2: using ThreadPoolExecutor:

with ThreadPoolExecutor(50) as exec:
    results = exec.submit(img_loader, url, links, path)

The code above takes 18 seconds.

My understanding is that ThreadPoolExecutor creates a thread for each worker. So, since I set max_workers to 50, that should result in 50 threads and therefore the job should be finished faster.

Could someone explain what I'm missing here? I admit I'm probably making a silly mistake, but I just don't understand it.

Many thanks!

multithreading python-3.x threadpoolexecutor
1 Answer
2 votes

In case 2 you are sending all the links to a single worker. Instead of

exec.submit(img_loader, url, links, path)

you need:

for link in links:
    exec.submit(img_loader, url, [link], path)

I haven't tried it myself; this comes just from reading the documentation of ThreadPoolExecutor.
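
For completeness, a minimal sketch of the per-link submission with the resulting futures collected and waited on (the futures/as_completed handling is an assumption, not part of the original answer):

from concurrent.futures import ThreadPoolExecutor, as_completed

# Submit one download task per link so the pool can spread the work across its workers
with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(img_loader, url, [link], path) for link in links]
    # Wait for every download and surface any exception raised inside a worker
    for future in as_completed(futures):
        future.result()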
