Parsing data from a local HTML file with bs4?


I am trying to parse a local HTML document with the following code:

import os, sys
from bs4 import BeautifulSoup

path = os.path.abspath(os.path.dirname(sys.argv[0])) 
fnHTML = os.path.join(path, "inp.html")
page = open(fnHTML)
soup = BeautifulSoup (page.read(), 'lxml')  

worker = soup.find("span")
wHeadLine = worker.text.strip()
wPara = worker.find_next("td").text.strip()
print(wHeadLine)
print(wPara)

The output looks like this:

Find your faves—faster
We’ve made it easier than ever to see what’s on now and continue  watching your recordings, favorite teams and more.

But the text in the HTML looks different, as shown in the screenshot attached to the original post.

Why does the output text not have the “—” and “We’ve”?

python beautifulsoup
1 Answer

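This is just an encoding issue when reading the file. Without an explicit encoding, open() decodes the file with the platform's default encoding, which may not be UTF-8 (for example on Windows), so non-ASCII characters such as “—” and “’” come out garbled. Try passing the encoding explicitly; hopefully this helps: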

import os, sys
from bs4 import BeautifulSoup

path = os.path.abspath(os.path.dirname(sys.argv[0])) 
fnHTML = os.path.join(path, "inp.html")
with open(fnHTML, encoding='utf-8') as file:  # decode the file explicitly as UTF-8
    soup = BeautifulSoup(file.read(), 'lxml')
    worker = soup.find("span")
    wHeadLine = worker.text.strip()
    wPara = worker.find_next("td").text.strip()
    print(wHeadLine)
    print(wPara)
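
To see why the explicit encoding matters, here is a minimal sketch using only the standard library. It assumes the HTML file is stored as UTF-8 and that the default encoding on the affected machine is cp1252; both are assumptions for illustration, since the question does not state either. The sample text and the temporary file are hypothetical stand-ins for inp.html.

```python
import os
import tempfile

# Sample text with an em dash (U+2014) and a curly apostrophe (U+2019),
# the two characters the question asks about.
text = "Find your faves\u2014faster. We\u2019ve made it easier."

# Write the text to a temporary file as UTF-8 (stand-in for inp.html).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".html", delete=False
) as tmp:
    tmp.write(text)
    tmp_path = tmp.name

try:
    # Decoding UTF-8 bytes with a mismatched codec garbles non-ASCII characters.
    with open(tmp_path, encoding="cp1252") as f:
        wrong = f.read()

    # Decoding with the matching codec returns the original text.
    with open(tmp_path, encoding="utf-8") as f:
        right = f.read()
finally:
    os.remove(tmp_path)

print(repr(wrong))
print(repr(right))
```

The first line printed shows each special character replaced by a three-character sequence, because its three UTF-8 bytes were decoded one at a time. The second line matches the original text.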

Thanks
