I'm writing a Python web-scraping script, and I want to do it with asyncio, so I'm using aiohttp for the async HTTP requests. That part works fine, but when I try to keep the application non-blocking (with await), beautifulsoup4 blocks it, since beautifulsoup4 has no async support.
Here is what I tried:
import asyncio, aiohttp
from bs4 import BeautifulSoup

async def extractLinks(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.select(".c-pro-box__title a")

async def getHtml(session, url):
    async with session.get(url) as response:
        return await response.text()

async def loadPage(url):
    async with aiohttp.ClientSession() as session:
        html = await getHtml(session, url)
        links = await extractLinks(html)
        return links

loop = asyncio.get_event_loop()
loop.run_until_complete(loadPage())
extractLinks() blocks the program flow.

Well, from your question I gather that you're looking to improve the performance of your code.
Since your question is short on details, I'll keep my reply short as well. An async HTML parser won't buy you anything here: parsing is CPU-bound work (the underlying code is regex-based), not an I/O-based operation. Also, BeautifulSoup used with html.parser is much slower than lxml. Instead, I strongly recommend selectolax with its Lexbor backend, since its core is written in C. Additionally, check out httpx as a replacement for aiohttp, and trio, which is an async framework. The code below should help with your use case.
import httpx
from selectolax.lexbor import LexborHTMLParser
import trio

links = []  # fill this with the URLs you want to scrape

async def get_soup(content):
    return LexborHTMLParser(content)

async def worker(channel):
    async with channel:
        async for client, link in channel:
            r = await client.get(link)
            soup = await get_soup(r.content)
            # continue from here!

async def main():
    async with httpx.AsyncClient() as client, trio.open_nursery() as nurse:
        sender, receiver = trio.open_memory_channel(0)
        async with receiver:
            # each worker gets its own clone of the receive channel
            for _ in range(10):
                nurse.start_soon(worker, receiver.clone())
        async with sender:
            for link in links:
                await sender.send([client, link])

if __name__ == "__main__":
    trio.run(main)
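If you'd rather keep the asyncio + BeautifulSoup stack from your question, a minimal sketch of the alternative is to offload the blocking parse to a worker thread with asyncio.to_thread (available since Python 3.9), so the event loop stays free for other requests. The extract_links function and the sample HTML here are hypothetical stand-ins for your own parsing code:

```python
import asyncio
from bs4 import BeautifulSoup

def extract_links(html):
    # Blocking, CPU-bound parse: meant to run in a worker thread,
    # not directly inside a coroutine.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.select("a")]

async def parse(html):
    # asyncio.to_thread hands the call to a thread pool and awaits
    # the result, so the event loop is not blocked while parsing.
    return await asyncio.to_thread(extract_links, html)

html = '<div class="c-pro-box__title"><a href="/a">A</a><a href="/b">B</a></div>'
links = asyncio.run(parse(html))
print(links)
```

This keeps the slower parser, so it only hides the latency from the event loop rather than removing it; switching to selectolax as above is still the faster option.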