Retrieving news articles from the Yahoo Finance Canada website

Problem description (votes: 0, answers: 1)

I am trying to retrieve all news articles published in 2024 on Yahoo Finance Canada about a company with the ticker symbol TECK-B.TO. The articles can be seen at this URL:

https://ca.finance.yahoo.com/quote/TECK-B.TO/news/

As you can see, there are more than 50 articles about this company at the URL above.

Using Databricks and the Python code below, I have only been able to retrieve 2 of those articles.

I want to retrieve all articles published in 2024 at the URL above.

This is the code I have tried:

# Install necessary libraries
# %pip install requests beautifulsoup4 pandas

import requests
from bs4 import BeautifulSoup
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from datetime import datetime

def fetch_tmx_news_2024():
    url = "https://ca.finance.yahoo.com/quote/TECK-B.TO/news/"
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    articles = []

    for item in soup.find_all('li', class_='js-stream-content'):
        link = item.find('a')['href'] if item.find('a') else None
        title = item.find('a').text if item.find('a') else None
        date_str = item.find('time')['datetime'] if item.find('time') else None

        # Debug print to check each article's details
        print(f"Title: {title}, Link: {link}, Date: {date_str}")

        if date_str:
            date = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%SZ")
            print(f"Parsed Date: {date}")  # Debug print to check parsed date
            if date.year == 2024:
                articles.append({
                    'title': title,
                    'link': f"https://ca.finance.yahoo.com{link}" if link else None,
                    'date': date
                })

    # Debug print to check the articles list
    print("Articles found: ", articles)

    schema = StructType([
        StructField("title", StringType(), nullable=True),
        StructField("link", StringType(), nullable=True),
        StructField("date", TimestampType(), nullable=True)
    ])

    if articles:
        return spark.createDataFrame(pd.DataFrame(articles), schema=schema)
    else:
        print("No articles found for the year 2024.")
        return spark.createDataFrame(pd.DataFrame(columns=['title', 'link', 'date']), schema=schema)

# Invocation of the function
news_df = fetch_tmx_news_2024()

# Display Spark DataFrame
display(news_df)
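One fragile spot in the code above is the fixed strptime format: if the `time` element's `datetime` attribute carries fractional seconds (e.g. `2024-01-05T14:30:00.000Z`), `strptime` with `"%Y-%m-%dT%H:%M:%SZ"` raises a ValueError. A more tolerant parse is sketched below (the helper name `parse_article_date` is my own, and the sample timestamps are illustrative, not taken from Yahoo's actual markup):

```python
from datetime import datetime

# Parse an ISO-8601 timestamp that may or may not carry fractional
# seconds and a trailing "Z" UTC marker (fromisoformat on Python < 3.11
# rejects a bare "Z", so replace it with an explicit UTC offset).
def parse_article_date(date_str):
    return datetime.fromisoformat(date_str.replace("Z", "+00:00"))

print(parse_article_date("2024-01-05T14:30:00.000Z").year)  # 2024
print(parse_article_date("2023-12-31T23:59:59Z").month)     # 12
```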

I would like to retrieve all articles about the company above that were published in 2024 at that URL (https://ca.finance.yahoo.com/quote/TECK-B.TO/news/).

python html databricks yahoo-finance
1 Answer

0 votes

Details:

Hello,

Based on what you're after, I think the requests library and a little code can get you the results you want. Here is my line of thinking:

We can use their API endpoint

https://ca.finance.yahoo.com/caas/content/article/?uuid={UUIDS}&appid=article2_csn

to get all the information we need, such as the date, title, and news URL.
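For reference, that endpoint accepts a comma-separated list of UUIDs in its `uuid` query parameter, so many articles can be fetched in a single request. A minimal sketch of building the URL (the helper name `build_caas_url` is my own, not part of Yahoo's API):

```python
# Build the caas content URL from a list of article UUIDs.
# The endpoint expects the UUIDs joined by commas in the "uuid" parameter.
def build_caas_url(uuids):
    joined = ",".join(uuids)
    return (
        "https://ca.finance.yahoo.com/caas/content/article/"
        f"?uuid={joined}&appid=article2_csn"
    )

print(build_caas_url(["aaa-111", "bbb-222"]))
```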

First, we have to find all the news UUIDs (article UUIDs). To do that, we send a POST request to this endpoint

https://ca.finance.yahoo.com/_finance_doubledown/api/resource?bkt=finance-CA-en-CA-def&device=desktop&ecma=modern

with JSON data in the request body, using the target ticker symbol as the value of the category parameter, e.g. TECK-B.TO.

Here is a simple script that fetches 100+ news article URLs:

import requests
import json
import re
from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def getUuids(companyName):
    # POST to Yahoo's internal StreamService endpoint; the ticker symbol
    # goes into the "category" parameter as "YFINANCE:<ticker>".
    url = 'https://ca.finance.yahoo.com/_finance_doubledown/api/resource?bkt=finance-CA-en-CA-def&device=desktop&ecma=modern'
    data = {"requests": {"g0": {
        "resource": "StreamService",
        "operation": "read",
        "params": {
            "forceJpg": True,
            "releasesParams": {"limit": 50, "offset": 0},
            "ncpParams": {"query": {"id": "tickers-news-stream",
                                    "version": "v1",
                                    "namespace": "finance",
                                    "listAlias": "finance-CA-en-CA-ticker-news"}},
            "useNCP": True,
            "batches": {"pagination": True, "size": 10, "timeout": 1500, "total": 170},
            "category": f"YFINANCE:{companyName}"
        }
    }}}
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0",
        "Content-Type": "application/json"
    }
    resp = requests.post(url, json=data, headers=headers, verify=False).json()
    # The response carries a comma-separated string of article UUIDs
    return resp['g0']['data']['stream_pagination']['gqlVariables']['tickerStream']['pagination']['uuids']


def getDetails(companyName):
    result = getUuids(companyName)
    # Strip the ":STORY" / ":VIDEO" type suffixes so only bare UUIDs remain
    remove_junk = re.sub(':STORY|:VIDEO', '', result)
    # Ask the caas content endpoint for the metadata of every UUID in one call
    result_url = f'https://ca.finance.yahoo.com/caas/content/article/?uuid={remove_junk}&appid=article2_csn'
    result_resp = requests.get(result_url, verify=False).json()
    for i in result_resp['items']:
        try:
            news_urls = i['data']['partnerData']['finalUrl']
            news_modifiedDate = i['data']['partnerData']['modifiedDate']
            news_title = i['data']['partnerData']['pageTitle']
            print(f"News URL:  {news_urls}\nModifiedDate: {news_modifiedDate}\nTitle: {news_title}\n=================")
        except Exception:
            # Some items lack partnerData; skip them
            pass

getDetails('TECK-B.TO')

I think this will help you get all the news article URLs, dates, and titles for a specific company.
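Since the original question asked specifically for 2024 articles, you could collect the results into a list of dicts and filter on the modifiedDate field before loading them into Spark. A sketch, assuming modifiedDate is an ISO-8601 timestamp string such as "2024-03-15T12:00:00Z" (the endpoint may return a different format, e.g. epoch milliseconds, in which case the parsing line needs adjusting; `filter_by_year` and the sample data are mine, not from the answer's script):

```python
from datetime import datetime

# Keep only the articles whose modifiedDate falls in the given year.
# Assumes each dict has a "modifiedDate" ISO-8601 string; the trailing "Z"
# is replaced with "+00:00" so fromisoformat accepts it on Python < 3.11.
def filter_by_year(articles, year):
    kept = []
    for a in articles:
        ts = a["modifiedDate"].replace("Z", "+00:00")
        if datetime.fromisoformat(ts).year == year:
            kept.append(a)
    return kept

sample = [
    {"title": "A", "modifiedDate": "2024-03-15T12:00:00Z"},
    {"title": "B", "modifiedDate": "2023-11-02T08:30:00Z"},
]
print(filter_by_year(sample, 2024))  # keeps only article "A"
```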

© www.soinside.com 2019 - 2024. All rights reserved.