I have an Excel file of 147 Toronto Star news articles that I've compiled into a dataframe. I've also written a Python script that extracts the text from one article at a time. However, I'd like to improve my script so that Python loops through all of the URLs in the dataframe, scrapes each article's text, adds the scraped, stopword-filtered text to the row (or perhaps to a linked text file?), and then uses that dataframe for a classification algorithm and further exploration.
Could someone help me write the loop? (I have no programming background... struggling!)
import pandas as pd

url_file = 'https://github.com/MarissaFosse/ryersoncapstone/raw/master/DailyNewsArticles.xlsx'
tstar_articles = pd.read_excel(url_file, "TorontoStar Articles", header=0)
import requests
from bs4 import BeautifulSoup

# Fetch one article and parse out the body container
URL = 'https://www.thestar.com/news/gta/2019/12/31/with-291-people-shot-2019-is-closing-as-torontos-bloodiest-year-on-record-for-overall-gun-violence.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_='c-article-body__content')

# Keep only the pieces of text that look like complete sentences
results_text = [tag.get_text().strip() for tag in results]
sentence_list = [sentence for sentence in results_text if '\n' not in sentence]
sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
article = ' '.join(sentence_list)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# requires: nltk.download('punkt'); nltk.download('stopwords')

# Tokenize the article and drop English stopwords
word_tokens = word_tokenize(article)
stop_words = set(stopwords.words('english'))
filtered_article = [w for w in word_tokens if w not in stop_words]
# The same filter written as an explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

# A third variant: remove stopwords from a copy of the token list
clean_tokens = word_tokens[:]
for token in word_tokens:
    if token in stop_words:
        clean_tokens.remove(token)
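Roughly what I have in mind is wrapping the steps above in one function and applying it to every URL in the dataframe. This is only a sketch of my idea, and I'm assuming the dataframe's link column is literally named 'URL' (mine may be named differently):

import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def scrape_and_filter(url):
    """Download one article, extract its body text, and drop stopwords."""
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(class_='c-article-body__content')
    if results is None:  # request failed or the page layout changed
        return ''
    sentences = [tag.get_text().strip() for tag in results]
    sentences = [s for s in sentences if '\n' not in s and '.' in s]
    tokens = word_tokenize(' '.join(sentences))
    return ' '.join(w for w in tokens if w not in stop_words)

# Store the cleaned text of each article in a new dataframe column
# ('URL' is an assumed column name -- match it to the actual spreadsheet)
tstar_articles['filtered_text'] = tstar_articles['URL'].apply(scrape_and_filter)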
First off, most news sites have RSS feeds; for the www.thestar.com site, there is https://www.thestar.com/about/rssfeeds.html
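If you go the RSS route, the feedparser library (my suggestion, not something from the question) can list each article's title and link. The feed URL below is a placeholder, so substitute a real one chosen from the rssfeeds.html page:

import feedparser

# Placeholder URL -- pick an actual feed from the rssfeeds.html page above
feed_url = 'https://www.thestar.com/...'
feed = feedparser.parse(feed_url)

# Each entry's link could feed the scraping loop instead of the Excel file
for entry in feed.entries:
    print(entry.title, entry.link)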