How can I use regular expressions in Python to extract the contents of all <div> tags with a specific class name from HTML?

Question

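I'm writing a scraper in Python and need to extract the contents of every `<div>` tag with a specific class name (for example, "class-name") from an HTML document. I know that regular expressions are usually not the best tool for parsing HTML, because HTML's complexity and nested structure can make them fail. In this particular case, however, the HTML structure is fairly simple and predictable, so I'd like to try a regular expression for the task. I tried the following code, but it doesn't seem to work as expected:

    import re

    html_content = """
    <html>
    <body>
        <div class="unwanted-class">Don't want this content</div>
        <div class="class-name">Need this content</div>
        <div class="class-name">Also need this content</div>
    </body>
    </html>
    """

    pattern = r'<div class="class-name">(.*?)</div>'
    matches = re.findall(pattern, html_content, re.DOTALL)

    for match in matches:
        print(match.strip())

My questions are:

Is my regular expression correct? If not, how should I change it so that it captures only the contents of `<div>` tags whose class is "class-name"?

If a regular expression really isn't the best way to handle this, can you recommend a more suitable Python library (for example, BeautifulSoup) and provide a short code example?

The `re.DOTALL` flag makes the `.` character match any character, including newlines.
The `re.IGNORECASE` flag is optional, but it is useful if you are unsure about the letter case of the class name.
This regular expression assumes that the target `<div>` contains no nested `<div>` tags, whether or not they share the class name: the non-greedy `(.*?)` stops at the first `</div>`, so a nested tag truncates the match.
HTML attributes can appear in any order, and a tag can carry additional attributes or whitespace, all of which make a regex solution fragile.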
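The notes on `re.DOTALL` and on nested `<div>` tags can be checked with a short experiment. This is only a minimal sketch: the sample strings `multiline` and `nested` are made up for illustration and do not come from the original post.

```python
import re

pattern = r'<div class="class-name">(.*?)</div>'

# Hypothetical sample: the content spans two lines.
multiline = '<div class="class-name">line one\nline two</div>'
without_dotall = re.findall(pattern, multiline)
with_dotall = re.findall(pattern, multiline, re.DOTALL)
print(without_dotall)  # [] because "." does not match "\n" by default
print(with_dotall)     # ['line one\nline two']

# Hypothetical sample: a <div> nested inside the target <div>.
nested = '<div class="class-name">outer <div>inner</div> tail</div>'
nested_result = re.findall(pattern, nested, re.DOTALL)
print(nested_result)   # ['outer <div>inner'] (cut off at the first </div>)
```

The truncated result in the last case is the reason nested markup is usually handed to a real HTML parser.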

Answer 1

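You can use the `[^>]*` pattern to match any characters other than the closing `>` before and after the desired `class="class-name"` attribute, which allows for other attributes in the `<div>` tag: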

    import re

    html_content = """
    <html>
    <body>
        <div class="unwanted-class">Not wanted</div>
        <div class="class-name">Wanted1</div>
        <div class="class-name" data-id="1">Wanted2</div>
        <div style="color:red;" class="class-name">Wanted3</div>
    </body>
    </html>
    """

    # [^>]* lets other attributes appear before or after the class attribute
    # without ever running past the end of the opening tag.
    pattern = r'<div\s+[^>]*class="class-name"[^>]*>(.*?)</div>'
    # re.DOTALL lets the captured content span several lines.
    matches = re.findall(pattern, html_content, re.DOTALL)

    for match in matches:
        print(match.strip())

Output:

    Wanted1
    Wanted2
    Wanted3
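The pattern above still expects the attribute to be written exactly as `class="class-name"`. As a rough sketch of how far a regex can be pushed, the variant below also accepts single quotes and a class list such as `class="box class-name wide"`, while rejecting look-alikes. The sample markup and the longer pattern are illustrative assumptions, not part of the original answer, and nested `<div>` tags still break it.

```python
import re

# Hypothetical sample markup, not from the original post.
html_content = """
<div class="class-name">A</div>
<div class='class-name'>B</div>
<div id="x" class="box class-name wide">C</div>
<div class="class-name-extra">D</div>
<div data-class="class-name">E</div>
"""

pattern = re.compile(
    r'<div\b[^>]*\sclass\s*=\s*(["\'])'         # class attribute, either quote style
    r'(?:[^"\']*\s)?class-name(?:\s[^"\']*)?'   # class-name as a whole token in the list
    r'\1[^>]*>(.*?)</div>',                     # same closing quote, then the content
    re.DOTALL,
)

# Group 1 is the quote character, so the content is group 2.
contents = [m.group(2).strip() for m in pattern.finditer(html_content)]
print(contents)  # ['A', 'B', 'C']
```

Each extra case makes the pattern harder to read, which is a practical argument for the parser-based approach.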

Using BeautifulSoup

As mentioned above, parsing HTML with regular expressions is not recommended. A better approach is to use a library such as `BeautifulSoup`:

    # Third-party package: pip install beautifulsoup4
    from bs4 import BeautifulSoup

    html_content = """
    <html>
    <body>
        <div class="unwanted-class">Not wanted</div>
        <div class="class-name">Wanted1</div>
        <div class="class-name" data-id="1">Wanted2</div>
        <div style="color:red;" class="class-name">Wanted3</div>
    </body>
    </html>
    """

    # "html.parser" is the parser bundled with the Python standard library.
    soup = BeautifulSoup(html_content, "html.parser")
    # class_ matches any <div> whose class list contains "class-name".
    divs = soup.find_all("div", class_="class-name")

    for div in divs:
        print(div.get_text(strip=True))

Output:

    Wanted1
    Wanted2
    Wanted3
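If installing a third-party package is not an option, the standard library's `html.parser` module can do a similar job. The following is a minimal sketch under that assumption: the `DivTextExtractor` class and the sample markup are hypothetical names introduced here for illustration, not something from the original answer. It counts `<div>` depth so that nested tags do not cut the text short.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # <div> nesting depth inside the current match
        self.chunks = []    # text pieces of the current match
        self.results = []   # finished matches

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a <div> nested inside a matching <div>
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical sample markup, including a nested <div>.
html_content = """
<div class="unwanted-class">Not wanted</div>
<div class="class-name">Wanted1</div>
<div style="color:red;" class="box class-name">Wanted <div>nested</div> text</div>
"""

parser = DivTextExtractor("class-name")
parser.feed(html_content)
parser.close()
print(parser.results)  # ['Wanted1', 'Wanted nested text']
```

BeautifulSoup remains the more convenient choice for real pages; this sketch only shows that a parser-based solution does not strictly require it.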
