I'm trying to retrieve all news articles published in 2024 on Yahoo Finance Canada about the company with ticker TECK-B.TO. The articles can be seen at this URL:
https://ca.finance.yahoo.com/quote/TECK-B.TO/news
As you can see, there are more than 50 articles about this company at that URL.
Using Databricks and the Python code below, I have only been able to retrieve 2 of those articles.
I want to retrieve all of the articles published in 2024 at the URL above.
I have tried this code:
# Install necessary libraries
# %pip install requests beautifulsoup4 pandas
import requests
from bs4 import BeautifulSoup
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from datetime import datetime

def fetch_tmx_news_2024():
    url = "https://ca.finance.yahoo.com/quote/TECK-B.TO/news/"
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    articles = []
    for item in soup.find_all('li', class_='js-stream-content'):
        link = item.find('a')['href'] if item.find('a') else None
        title = item.find('a').text if item.find('a') else None
        date_str = item.find('time')['datetime'] if item.find('time') else None
        # Debug print to check each article's details
        print(f"Title: {title}, Link: {link}, Date: {date_str}")
        if date_str:
            date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%SZ")
            print(f"Parsed Date: {date}")  # Debug print to check parsed date
            if date.year == 2024:
                articles.append({
                    'title': title,
                    'link': f"https://ca.finance.yahoo.com{link}" if link else None,
                    'date': date
                })
    # Debug print to check the articles list
    print("Articles found: ", articles)
    schema = StructType([
        StructField("title", StringType(), nullable=True),
        StructField("link", StringType(), nullable=True),
        StructField("date", TimestampType(), nullable=True)
    ])
    if articles:
        return spark.createDataFrame(pd.DataFrame(articles), schema=schema)
    else:
        print("No articles found for the year 2024.")
        return spark.createDataFrame(pd.DataFrame(columns=['title', 'link', 'date']), schema=schema)

# Invocation of the function
news_df = fetch_tmx_news_2024()

# Display Spark DataFrame
display(news_df)
I want to retrieve all articles about this company published in 2024 at the URL above (https://ca.finance.yahoo.com/quote/TECK-B.TO/news/).
Hi,

Given what you're after, I think the requests library plus a little code can get you the result you want. Here's my plan:

We can use their API endpoint
https://ca.finance.yahoo.com/caas/content/article/?uuid={UUIDS}&appid=article2_csn
to get all the information, such as the date, title, and news URL.

First, we have to find all the news UUIDs (essentially article UUIDs). For that, we need to send a POST request to this endpoint:
https://ca.finance.yahoo.com/_finance_doubledown/api/resource?bkt=finance-CA-en-CA-def&device=desktop&ecma=modern
with JSON data in the request body (using the target company's ticker, e.g. TECK-B.TO, as the value of the category parameter).
Here is a simple script that fetches at least 100+ news article URLs:
import requests
import re
from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def getUuids(companyName):
    # POST the stream query; the ticker goes into the "category" parameter
    url = 'https://ca.finance.yahoo.com/_finance_doubledown/api/resource?bkt=finance-CA-en-CA-def&device=desktop&ecma=modern'
    data = {"requests":{"g0":{"resource":"StreamService","operation":"read","params":{"forceJpg":True,"releasesParams":{"limit":50,"offset":0},"ncpParams":{"query":{"id":"tickers-news-stream","version":"v1","namespace":"finance","listAlias":"finance-CA-en-CA-ticker-news"}},"useNCP":True,"batches":{"pagination":True,"size":10,"timeout":1500,"total":170},"category":f"YFINANCE:{companyName}"}}}}
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0",
        "Content-Type": "application/json"
    }
    resp = requests.post(url, json=data, headers=headers, verify=False).json()
    # The response carries a comma-separated string of article UUIDs
    return resp['g0']['data']['stream_pagination']['gqlVariables']['tickerStream']['pagination']['uuids']

def getDetails(companyName):
    result = getUuids(companyName)
    # Strip the ":STORY" / ":VIDEO" suffixes so only the bare UUIDs remain
    remove_junk = re.sub(':STORY|:VIDEO', '', result)
    result_url = f'https://ca.finance.yahoo.com/caas/content/article/?uuid={remove_junk}&appid=article2_csn'
    result_resp = requests.get(result_url, verify=False).json()
    for i in result_resp['items']:
        try:
            news_urls = i['data']['partnerData']['finalUrl']
            news_modifiedDate = i['data']['partnerData']['modifiedDate']
            news_title = i['data']['partnerData']['pageTitle']
            print(f"News URL: {news_urls}\nModifiedDate: {news_modifiedDate}\nTitle: {news_title}\n=================")
        except Exception:
            pass

getDetails('TECK-B.TO')
I think this should help you get all the news article URLs, dates, and titles for a given company.
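Since your original goal was specifically the 2024 articles, you could also filter the parsed items by year before printing them. Here is a minimal sketch of such a filter, assuming the modifiedDate field is an ISO-8601 timestamp string (e.g. "2024-03-15T12:00:00Z"); the function name filter_items_by_year is my own, not part of any API:

```python
from datetime import datetime

def filter_items_by_year(items, year):
    """Keep only dicts whose 'modifiedDate' (assumed ISO-8601,
    e.g. '2024-03-15T12:00:00Z') falls in the given year."""
    kept = []
    for item in items:
        date_str = item.get('modifiedDate')
        if not date_str:
            continue  # skip items without a date
        try:
            # Replace the trailing 'Z' for compatibility with
            # Python versions older than 3.11
            date = datetime.fromisoformat(date_str.replace('Z', '+00:00'))
        except ValueError:
            continue  # skip unparseable dates
        if date.year == year:
            kept.append(item)
    return kept
```

You would collect the per-article dicts inside the getDetails loop into a list instead of printing them, then call filter_items_by_year(articles, 2024) on that list.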