I am trying to parse a local HTML document with the following code:
import os, sys
from bs4 import BeautifulSoup
path = os.path.abspath(os.path.dirname(sys.argv[0]))
fnHTML = os.path.join(path, "inp.html")
page = open(fnHTML)
soup = BeautifulSoup (page.read(), 'lxml')
worker = soup.find("span")
wHeadLine = worker.text.strip()
wPara = worker.find_next("td").text.strip()
print(wHeadLine)
print(wPara)
The output looks like this:
Find your faves—faster
We’ve made it easier than ever to see what’s on now and continue watching your recordings, favorite teams and more.
But the text in the HTML looks different (see the image).
Why does the output not show the "—" and "We’ve" correctly?
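This is just an encoding problem when reading the file. Without an explicit encoding argument, open() falls back to the platform's default encoding, which may not be UTF-8 (on Windows it is often cp1252), so non-ASCII characters such as "—" and "’" get decoded incorrectly. Try passing the encoding explicitly; hopefully this helps: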
import os, sys
from bs4 import BeautifulSoup
path = os.path.abspath(os.path.dirname(sys.argv[0]))
fnHTML = os.path.join(path, "inp.html")
with open(fnHTML, encoding='utf-8') as file:  # added encoding utf-8
    soup = BeautifulSoup(file.read(), 'lxml')
worker = soup.find("span")
wHeadLine = worker.text.strip()
wPara = worker.find_next("td").text.strip()
print(wHeadLine)
print(wPara)
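If you want to see the effect in isolation, here is a minimal sketch that uses only the standard library. It assumes the file is UTF-8 and that the platform default was cp1252; both are guesses about your setup, and the sample string is just taken from your output.

```python
# Sample text containing an em dash (U+2014) and a curly apostrophe (U+2019).
text = "Find your faves\u2014faster. We\u2019ve made it easier."

# These are the bytes a UTF-8 encoded file would contain.
raw = text.encode("utf-8")

# Assumption: the platform default is cp1252. Decoding UTF-8 bytes with it
# turns each multi-byte character into several unrelated characters.
wrong = raw.decode("cp1252")

# Decoding with the encoding the file was actually written in restores the text.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

The first line printed is garbled around the dash and the apostrophe, while the second matches the original text. That is the same difference as between open(fnHTML) and open(fnHTML, encoding='utf-8').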
Thanks.