Extra text is returned when using bs4 and lxml

Question

I have the following div element:

<div class="body">

       <div class="pull_right date details" title="21.11.2024 20:17:23 UTC+07:00">
20:17
       </div>

       <div class="from_name">
Cheki_FNS 
       </div>

       <div class="text">
Cash receipt received: from <strong>Komandor trading network</strong> (LLC &quot;TS KOMANDOR&quot;)
       </div>

      </div>

     </div>

I use the following code:


from bs4 import BeautifulSoup

with open("messages.html", "r", encoding="utf-8") as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, "xml")

div_text = soup.find("div", class_="text")


if div_text:
    print(div_text.get_text())
else:
    print("Error, class text not find")

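I expect to get the line "Cash receipt received: from Komandor trading network (LLC "TS KOMANDOR")", but I get "20:17 Cheki_FNS Cash receipt received: from Komandor trading network (LLC "TS KOMANDOR")". Part of the returned text comes from outside the div element. What is the problem?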

Answer 1

Based on your example, your selection is correct, so make sure the input you read from the file is exactly what you posted.
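One way to check what was actually matched is to print the tag itself and its attributes rather than only its text. The following is a minimal sketch, assuming a shortened inline copy of the markup instead of the real `messages.html`; the variable name `matched` is only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real file contents (an assumption for this sketch).
html_content = (
    '<div class="body">'
    '<div class="from_name">Cheki_FNS</div>'
    '<div class="text">Cash receipt received</div>'
    '</div>'
)

soup = BeautifulSoup(html_content, "html.parser")
matched = soup.find("div", class_="text")

# Show which element was selected and everything nested inside it.
print(matched.get("class"))
print(matched.prettify())
```

If the printed element contains the date and sender blocks as children, the extra text comes from how the document was parsed into a tree, not from `get_text()`.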

You should also check the following points:

Do not use the xml parser; the input is HTML.
You can use CSS selectors.
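As a side illustration of the first point, one documented difference between the two modes is how the `class` attribute is stored: the HTML parsers split it into a list, while the xml parser keeps it as a single string. This sketch only demonstrates that difference and is not necessarily the cause of the extra text in the question. It assumes lxml may be missing and falls back to a message in that case.

```python
from bs4 import BeautifulSoup, FeatureNotFound

snippet = '<div class="pull_right date details">20:17</div>'

# HTML mode: class is a multi-valued attribute, stored as a list.
html_div = BeautifulSoup(snippet, "html.parser").div
print(html_div["class"])

# XML mode: attributes are never split, so class stays one string.
try:
    xml_div = BeautifulSoup(snippet, "xml").div
    print(xml_div["class"])
except FeatureNotFound:
    print("lxml is not installed, so the xml parser is unavailable")
```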

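```
from bs4 import BeautifulSoup

html_content = '''
<div class="body">
     <div class="pull_right date details" title="21.11.2024 20:17:23 UTC+07:00">
20:17
     </div>
     <div class="from_name">
Cheki_FNS
     </div>
     <div class="text">
Cash receipt received: from <strong>Komandor trading network</strong> (LLC &quot;TS KOMANDOR&quot;)
     </div>
</div>
'''

# Name an HTML parser explicitly instead of "xml".
soup = BeautifulSoup(html_content, "html.parser")

# Both lookups select the same element: find() by class, select_one() by CSS selector.
print(soup.find("div", class_="text").get_text(strip=True))
print(soup.select_one("div.text").get_text(strip=True))
```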
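Note that `get_text(strip=True)` strips each text fragment and joins them with no separator, so the spaces around the inline `<strong>` tag are lost. A minimal sketch of the difference, assuming the same markup as in the question and the built-in `html.parser`; the names `compact` and `spaced` are only illustrative:

```python
from bs4 import BeautifulSoup

html_content = """
<div class="body">
  <div class="from_name">
Cheki_FNS
  </div>
  <div class="text">
Cash receipt received: from <strong>Komandor trading network</strong> (LLC &quot;TS KOMANDOR&quot;)
  </div>
</div>
"""

soup = BeautifulSoup(html_content, "html.parser")
text_div = soup.select_one("div.text")

# No separator: fragments are glued together after stripping.
compact = text_div.get_text(strip=True)
# A single-space separator keeps the words apart.
spaced = text_div.get_text(" ", strip=True)

print(compact)
print(spaced)
```

Passing `" "` as the separator yields the line the question expects.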
