Asynchronous HTML parsing with Beautifulsoup4 in Python

Question

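I am writing a Python web scraping script, and I am required to use asyncio for it. For the asynchronous HTTP requests I use AioHTTP.
That part works fine, but when I try to make the application non-blocking (using await), beautifulsoup4 blocks it, because beautifulsoup4 does not support async.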

This is what I have tried.

import asyncio, aiohttp
from bs4 import BeautifulSoup

async def extractLinks(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.select(".c-pro-box__title a")

async def getHtml(session, url):
    async with session.get(url) as response:
        return await response.text()

async def loadPage(url):
    async with aiohttp.ClientSession() as session:
        html = await getHtml(session, url)
        links = await extractLinks(html)
        return links

# loadPage() requires a url argument; the URL below is only a placeholder.
asyncio.run(loadPage("https://example.com"))

`extractLinks()` blocks the program flow.
So is it possible to make this non-blocking? Or is there a library other than beautifulsoup4 that supports async as far as possible?
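
To make the problem concrete, here is a minimal sketch that reproduces the blocking without any network access. The names `parse_without_await`, `ticker` and `demo` are made up for illustration, and the summation is only a stand-in for the parsing work. It shows that declaring a function with `async def` does not by itself yield to the event loop: with no `await` inside, the coroutine runs to completion before any other task gets a turn.

```python
import asyncio


async def parse_without_await(log):
    # Stand-in for extractLinks(): CPU-bound work with no await inside,
    # so control never returns to the event loop until it finishes.
    total = sum(i * i for i in range(200_000))
    log.append("parse done")
    return total


async def ticker(log):
    # Background task that runs whenever the event loop gets control.
    for _ in range(3):
        log.append("tick")
        await asyncio.sleep(0)


async def demo():
    log = []
    task = asyncio.create_task(ticker(log))  # scheduled, not yet started
    await parse_without_await(log)           # finishes before any tick
    await task
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the returned log, "parse done" comes before every "tick", even though the ticker task was created first.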

Answer 1

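Well, from your question I understand that you are looking to improve your code's performance.

Since you gave only brief details, I will keep my reply brief as well.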

There is no `Async` HTML parser: parsing is CPU-bound work (`regex` matching and similar in-memory operations), not I/O that can be awaited.
You are using `BeautifulSoup` with `html.parser`, which is much slower than `lxml`. I also strongly recommend `selectolax` with `Lexbor`, since it is a binding to a parser written in `C`.
Use `httpx` instead of `aiohttp`; unlike `aiohttp`, it also runs under `trio`.
Use `trio`, which is an `Async` library.
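
Since the parser itself cannot be awaited, one common workaround (not part of the original answer) is to push the synchronous parse call onto a worker thread. Below is a minimal sketch, assuming Python 3.9+ for `asyncio.to_thread`. To keep it dependency-free it uses the standard library's `html.parser.HTMLParser` as a stand-in for the bs4 call, and `LinkCollector`, `extract_links` and `parse_in_thread` are hypothetical names.

```python
import asyncio
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stdlib stand-in for a bs4 selector)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    # Plain synchronous function: this is the blocking, CPU-bound part.
    # With bs4 installed, the body could instead be a call such as
    # BeautifulSoup(html, "html.parser").select(...), as in the question.
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


async def parse_in_thread(html):
    # asyncio.to_thread runs the call in a worker thread and suspends this
    # coroutine, so other tasks on the event loop keep running meanwhile.
    return await asyncio.to_thread(extract_links, html)


if __name__ == "__main__":
    sample = '<div class="c-pro-box__title"><a href="/item/1">One</a></div>'
    print(asyncio.run(parse_in_thread(sample)))
```

A thread keeps the event loop responsive, but because of the GIL it does not make pure-Python parsing any faster. For heavy parsing loads, `loop.run_in_executor` with a `concurrent.futures.ProcessPoolExecutor` is the usual next step.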

The code below should help with your use case.

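import httpx
from selectolax.lexbor import LexborHTMLParser
import trio

# Fill this list with the URLs to scrape.
links = []


async def get_soup(content):
    # Parsing is synchronous; it occupies the event loop while it runs.
    return LexborHTMLParser(content)


async def worker(channel):
    # Each worker pulls (client, link) pairs until the channel is closed.
    async with channel:
        async for client, link in channel:
            r = await client.get(link)
            soup = await get_soup(r.content)
            # continue from here!


async def main():
    async with httpx.AsyncClient() as client, trio.open_nursery() as nurse:
        # Unbuffered channel: send() waits until a worker is ready to receive.
        sender, receiver = trio.open_memory_channel(0)
        async with receiver:
            # Start 10 workers, each with its own clone of the receive end.
            for _ in range(10):
                nurse.start_soon(worker, receiver.clone())

            # Closing the sender afterwards ends the workers' loops.
            async with sender:
                for link in links:
                    await sender.send([client, link])


if __name__ == "__main__":
    trio.run(main)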
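
If you have to stay on asyncio, as the question says, roughly the same worker-pool shape can be sketched with `asyncio.Queue`. This is an illustration under stated assumptions, not a drop-in replacement. `FAKE_PAGES`, `fetch`, `parse` and `crawl` are hypothetical names, `fetch` returns canned HTML instead of making a real request, and the regex in `parse` is only a crude stand-in for a real parser call.

```python
import asyncio
import re

# Hypothetical canned pages standing in for real HTTP responses.
FAKE_PAGES = {
    "https://example.com/a": '<a href="/1">1</a>',
    "https://example.com/b": '<a href="/2">2</a><a href="/3">3</a>',
}


async def fetch(url):
    # Stand-in for an aiohttp/httpx request; yields to the loop like real I/O.
    await asyncio.sleep(0)
    return FAKE_PAGES[url]


def parse(html):
    # Crude stand-in for the real parser call (bs4, selectolax, ...).
    return re.findall(r'href="([^"]+)"', html)


async def worker(queue, results):
    while True:
        url = await queue.get()
        try:
            html = await fetch(url)
            # Parse off the event loop so other workers can keep fetching.
            results[url] = await asyncio.to_thread(parse, html)
        finally:
            queue.task_done()


async def crawl(urls, n_workers=3):
    queue = asyncio.Queue()
    results = {}
    for url in urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    await queue.join()  # wait until every URL has been processed
    for w in workers:
        w.cancel()      # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(crawl(list(FAKE_PAGES))))
```

Here the queue plays the role of the trio memory channel, and `queue.join()` followed by cancelling the workers replaces closing the send side of the channel.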
