How do I extract text from a website using Jupyter?

Question

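I am trying to get the text of an article from a link, but when I import the text I also get all the other links, ad links, and image names, which I don't need for my analysis.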

import re
from nltk import word_tokenize, sent_tokenize, ngrams
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html"  # this is the link
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html,"lxml").get_text()
raw

This is the result I get (only a few lines are copied here; the actual article text is in there too, but buried among the other lines):

window.performance && window.performance.mark && window.performance.mark(\'PageStart\');Best Bites: Weeknight meal cauliflower vegetable fried rice!function(s,f,p){var a=[],e={_version:"3.6.0",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){a.push({name:e,fn:t,options:n})},addAsyncTest:function(e){a.push({name:null,fn:e})}},l=function(){};l.prototype=e,l=new l;var c=[];function v(e,t){return typeof e===t}var t="Moz O ms Webkit",u=e._config

I just want to know whether there is any way to extract the article text while ignoring all of these other values.

Answer 1

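When BS4 parses a site, it builds its own internal DOM as a tree of objects.

To access different parts of that DOM, we have to use the right accessor or tag, as shown below: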

from urllib import request
from bs4 import BeautifulSoup

url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html"  # this is the link
html = request.urlopen(url).read().decode('utf8')
parsedHTML = BeautifulSoup(html, "html.parser")
# .article is the first <article> tag in the page (None if the page has none)
readableText = parsedHTML.article.get_text()  # <- text from inside the <article> tag only
print(readableText)

You were close, but you didn't specify which tag you wanted to call get_text() on.
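To see why the tag matters, here is a minimal sketch on a made-up HTML snippet. The markup, the "Sponsored link" text, and the variable names are illustrative assumptions, not taken from the Yahoo page. Calling get_text() on the whole document returns the ad link text as well, while calling it on the <article> element returns only the article.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a title, an ad link, and the article itself.
sample_html = """
<html><head><title>Demo</title></head>
<body>
<a href="/ad">Sponsored link</a>
<article><h1>Fried rice</h1><p>Chop the cauliflower.</p></article>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Text of the entire document: includes the title and the ad link.
whole_page = soup.get_text()

# Text of the <article> element only; separator/strip tidy the whitespace.
article_only = soup.article.get_text(separator=" ", strip=True)

print("Sponsored link" in whole_page)
print(article_only)
```

On a real page the <article> tag may be missing, in which case soup.article is None and the get_text() call would raise an AttributeError, so it is worth checking for None first.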

find() and find_all() are very useful for locating tags on a page.
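As a rough sketch of how those two methods could be combined, the example below uses find() to locate the first <article> tag and find_all() to collect the paragraphs inside it. The sample markup and variable names are assumptions made up for illustration; a real news page will likely need different tag names or attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an article with two paragraphs, plus an ad in an <aside>.
sample_html = """
<div id="main">
  <article>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </article>
  <aside><p>Advertisement</p></aside>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
article = soup.find("article")
if article is None:
    raise ValueError("no <article> tag found in the page")

# find_all() returns every matching tag below the element it is called on,
# so the <p> inside <aside> is not included.
paragraphs = [p.get_text(strip=True) for p in article.find_all("p")]
print(paragraphs)
```

Calling find_all() on the article element rather than on soup is what keeps the advertisement paragraph out of the result.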
