Parsing data from a local HTML file with bs4?

Question

I am trying to parse a local HTML document with the following code:

import os, sys
from bs4 import BeautifulSoup

path = os.path.abspath(os.path.dirname(sys.argv[0])) 
fnHTML = os.path.join(path, "inp.html")
page = open(fnHTML)
soup = BeautifulSoup (page.read(), 'lxml')  

worker = soup.find("span")
wHeadLine = worker.text.strip()
wPara = worker.find_next("td").text.strip()
print(wHeadLine)
print(wPara)

The output looks like this:

Find your favesâ€”faster
Weâ€™ve made it easier than ever to see whatâ€™s on now and continue  watching your recordings, favorite teams and more.

But the text in the HTML page looks different (see the image).

Why does the output not show the "—" and "We’ve" correctly?

Answer 1

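This is just an encoding problem when reading the file. Without an explicit encoding, open() decodes the file with the platform's default encoding, so a file saved as UTF-8 can come out garbled. Try passing encoding='utf-8'; hopefully this helps: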

import os, sys
from bs4 import BeautifulSoup

path = os.path.abspath(os.path.dirname(sys.argv[0])) 
fnHTML = os.path.join(path, "inp.html")
with open(fnHTML, encoding='utf-8') as file:  # added encoding='utf-8'
    soup = BeautifulSoup(file.read(), 'lxml')
    worker = soup.find("span")
    wHeadLine = worker.text.strip()
    wPara = worker.find_next("td").text.strip()
    print(wHeadLine)
    print(wPara)
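
To see why the characters come out garbled, here is a minimal sketch that reproduces the effect with the standard library only. It assumes inp.html was saved as UTF-8 and that the platform default encoding was cp1252 (typical on Windows, but an assumption here); the sample string and the temporary file are hypothetical stand-ins for the real page.

```python
import os
import tempfile

# Hypothetical sample text standing in for the contents of inp.html.
original = "We’ve made it easier"

# Write the sample as UTF-8 (assumed to be how inp.html was saved).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".html", delete=False
) as tmp:
    tmp.write(original)
    tmp_path = tmp.name

try:
    # Wrong codec: cp1252 is assumed to be the platform default in the question.
    with open(tmp_path, encoding="cp1252") as f:
        garbled = f.read()

    # Right codec: the bytes are decoded the same way they were written.
    with open(tmp_path, encoding="utf-8") as f:
        correct = f.read()
finally:
    os.remove(tmp_path)

print(garbled)  # Weâ€™ve made it easier
print(correct)  # We’ve made it easier
```

The three bytes that UTF-8 uses for the apostrophe ’ are each read as a separate cp1252 character, which is where "â€™" comes from. Passing encoding='utf-8' to open() makes the result independent of the platform default.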

Thanks
