I have an Excel file of 147 Toronto Star news articles that I've compiled into a dataframe. I've also written a Python script that extracts the text from one article at a time. However, I'd like to improve my script so that Python loops through all of the URLs in the dataframe, scrapes each article's text, adds the scraped, stopword-filtered text to the row (or perhaps to a linked text file?), and then uses that dataframe for a classification algorithm and further exploration.
Could someone help me write the loop? (I have no programming background... struggling!)
import pandas as pd

url_file = 'https://github.com/MarissaFosse/ryersoncapstone/raw/master/DailyNewsArticles.xlsx'
tstar_articles = pd.read_excel(url_file, "TorontoStar Articles", header=0)
import requests
from bs4 import BeautifulSoup

# Fetch one article and parse out the body container
URL = 'https://www.thestar.com/news/gta/2019/12/31/with-291-people-shot-2019-is-closing-as-torontos-bloodiest-year-on-record-for-overall-gun-violence.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_='c-article-body__content')

# Keep only the pieces of text that look like complete sentences
results_text = [tag.get_text().strip() for tag in results]
sentence_list = [sentence for sentence in results_text if '\n' not in sentence]
sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
article = ' '.join(sentence_list)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# requires: nltk.download('punkt'); nltk.download('stopwords')

# Tokenize the article and drop English stopwords
word_tokens = word_tokenize(article)
stop_words = set(stopwords.words('english'))
filtered_article = [w for w in word_tokens if w not in stop_words]
# The same filter written as an explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

# A third variant: remove stopwords from a copy of the token list
clean_tokens = word_tokens[:]
for token in word_tokens:
    if token in stop_words:
        clean_tokens.remove(token)
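Roughly what I have in mind is wrapping the steps above in one function and applying it to every URL in the dataframe. This is only a sketch of my idea, and I'm assuming the dataframe's link column is literally named 'URL' (mine may be named differently):

import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def scrape_and_filter(url):
    """Download one article, extract its body text, and drop stopwords."""
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(class_='c-article-body__content')
    if results is None:  # request failed or the page layout changed
        return ''
    sentences = [tag.get_text().strip() for tag in results]
    sentences = [s for s in sentences if '\n' not in s and '.' in s]
    tokens = word_tokenize(' '.join(sentences))
    return ' '.join(w for w in tokens if w not in stop_words)

# Store the cleaned text of each article in a new dataframe column
# ('URL' is an assumed column name -- match it to the actual spreadsheet)
tstar_articles['filtered_text'] = tstar_articles['URL'].apply(scrape_and_filter)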
First off, most news sites have RSS feeds; for the www.thestar.com site, there is https://www.thestar.com/about/rssfeeds.html
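If you go the RSS route, the feedparser library (my suggestion, not something from the question) can list each article's title and link. The feed URL below is a placeholder, so substitute a real one chosen from the rssfeeds.html page:

import feedparser

# Placeholder URL -- pick an actual feed from the rssfeeds.html page above
feed_url = 'https://www.thestar.com/...'
feed = feedparser.parse(feed_url)

# Each entry's link could feed the scraping loop instead of the Excel file
for entry in feed.entries:
    print(entry.title, entry.link)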