I have a function that fetches the title and content of an article and appends them to lists.
What I want: hit 1000+ URLs and run that function against each of them.
Goal: in a single run, end up with 1000+ titles and contents in my lists, without looping over the URLs and calling the function on each one sequentially.
My code so far:
Setup
import requests
from newspaper import Article
from langdetect import detect
import threading
1) This is the "do work" function:
def get_url_info(url):
    try:
        r = requests.head(url)
        if r.status_code < 400:  # URL responds
            article = Article(url)
            article.download()
            article.parse()
            if detect(article.title) == 'en':  # English only
                if len(article.text) > 50:  # filter out short permission-request pages
                    title = article.title.encode('ascii', errors='ignore')
                    text = article.text.encode('ascii', errors='ignore')
                    return title, text, url
    except Exception as e:
        print(e, url)  # log problem URLs
    return None  # signal that this URL was skipped or failed
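For a quick sanity check, a single-URL call might look like the sketch below (the URL is only a placeholder; with the version above, get_url_info returns None whenever the URL fails, is not English, or has too little text):

# Hypothetical single-URL check; "https://example.com/some-article" is a placeholder.
result = get_url_info("https://example.com/some-article")
if result is not None:
    title, text, test_url = result
    print(test_url, title)
else:
    print("URL skipped or failed")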
2) This is the function that actually appends to the lists:
def get_text_list():
    text_list = []   # article content list
    test_urls = []   # urls that work
    title_list = []  # article titles
    url_list = get_tier_3()[:8000]  # get first 8000 URLs for testing
    threads = [threading.Thread(target=get_url_info, args=(url,)) for url in url_list]
    for i, thread in enumerate(threads):
        # originally this was: for url in url_list
        thread.start()
        """
        title, text, test_url = call the do-work function here
        title_list.append(title)
        text_list.append(text)
        test_urls.append(test_url)
        """
        print(i)  # counts the number of URLs from the DB processed so far
    return text_list, test_urls, title_list
Problem: after starting the threads, I don't know how to continue, i.e. how to get the information back from each thread.
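(As a point of reference before the answer: a minimal sketch of one way the started threads could hand their results back, using a thread-safe queue.Queue plus thread.join(). It reuses the get_url_info and get_tier_3 from above; the wrapper name worker is mine, and the answer below recommends a pool instead of one thread per URL.)

import threading
from queue import Queue

results = Queue()  # thread-safe container for the per-URL results
url_list = get_tier_3()[:8000]

def worker(url):
    # hypothetical wrapper: push successful results onto the shared queue
    info = get_url_info(url)  # returns (title, text, url) or None
    if info is not None:
        results.put(info)

threads = [threading.Thread(target=worker, args=(url,)) for url in url_list]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread to finish

title_list, text_list, test_urls = [], [], []
while not results.empty():
    title, text, test_url = results.get()
    title_list.append(title)
    text_list.append(text)
    test_urls.append(test_url)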
I think the multiprocessing module may be better suited to this task than threading. Because of the way CPython is implemented (the global interpreter lock), the threading module cannot give you true parallelism for CPU-bound work; the HTTP requests themselves are I/O-bound, but the article parsing and language detection your function does are CPU-bound, so worker processes are a reasonable fit here.
I would also advise against spawning a separate thread or process for every URL you want to process. You are very unlikely to see any performance gain that way, because all of those threads or processes end up competing for the same system resources.
A better solution is usually to spawn a smaller number of threads/processes and delegate a batch of URLs to each one. A simple way to implement this approach with a multiprocessing Pool is as follows:
from multiprocessing import Pool

NUM_PROCS = 4  # example number of worker processes

def get_url_info(url):
    ...

def get_text_list():
    # Get your list of URLs
    url_list = get_tier_3()[:8000]
    # Output lists
    title_list, text_list, test_urls = [], [], []
    # Initialize a multiprocessing pool that will close after finishing execution.
    with Pool(NUM_PROCS) as pool:
        results = pool.map(get_url_info, url_list)
        for result in results:
            if result is None:  # skip URLs that failed or were filtered out
                continue
            title, text, test_url = result
            title_list.append(title)
            text_list.append(text)
            test_urls.append(test_url)
    return title_list, text_list, test_urls
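If memory use or progress reporting becomes a concern with thousands of URLs, one possible variation on the same Pool idea is sketched below; it reuses the get_url_info and get_tier_3 assumed above, and on platforms that use the spawn start method (Windows, recent macOS) the pool call should run under an if __name__ == '__main__': guard.

from multiprocessing import Pool

NUM_PROCS = 4

def get_text_list_streaming():
    url_list = get_tier_3()[:8000]
    title_list, text_list, test_urls = [], [], []
    with Pool(NUM_PROCS) as pool:
        # imap_unordered yields results as workers finish; chunksize=50 hands each
        # worker 50 URLs per task, which cuts inter-process overhead compared with
        # one task per URL.
        for n, result in enumerate(pool.imap_unordered(get_url_info, url_list, chunksize=50), 1):
            if n % 100 == 0:
                print(n, "URLs processed")  # coarse progress counter
            if result is None:  # failed or filtered-out URL
                continue
            title, text, test_url = result
            title_list.append(title)
            text_list.append(text)
            test_urls.append(test_url)
    return title_list, text_list, test_urls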
I hope this helps!