I want to extract data from a website and refresh it every 15 days; the dataset should also update whenever the website updates. How should I approach this?

Problem description — votes: 0, answers: 1

I have to extract data from a website and schedule a refresh every 15 days. After each update, an email about that update must be sent to specific people. Any suggestions on how I should approach this?

I am using scrapy for the web scraping and storing the results in an Azure blob. I plan to put the scraping script in an Azure Function, schedule the refresh there, and trigger the email from it as well.
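For the every-15-days schedule specifically, a timer-triggered function is the natural fit. A minimal sketch using the Azure Functions Python v2 programming model — the function name is illustrative, and since NCRONTAB has no literal "every 15 days", firing on the 1st and 16th of each month is a common approximation:

```python
import azure.functions as func

app = func.FunctionApp()

# NCRONTAB format: {second} {minute} {hour} {day} {month} {day-of-week}.
# "0 0 0 1,16 * *" fires at midnight on the 1st and 16th of each month,
# i.e. roughly every 15 days.
@app.timer_trigger(schedule="0 0 0 1,16 * *", arg_name="timer",
                   run_on_startup=False, use_monitor=False)
def scheduled_scrape(timer: func.TimerRequest) -> None:
    # Run the scrapy spider, upload the results, and send the email here.
    ...
```

This removes the need to invoke the function over HTTP; the platform fires it on schedule.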

azure web-scraping automation azure-functions scheduled-tasks
1 Answer

0 votes

Your approach is correct: you can use scrapy to scrape the data, store it in an Azure storage account blob, and send an email to the specific people. Below is an HTTP-triggered function that does this:-

import subprocess
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

import azure.functions as func
from azure.storage.blob import BlobServiceClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Scraping process. Run scrapy in a subprocess rather than via
    # scrapy.cmdline.execute(), which calls sys.exit() and would stop
    # the function before the upload and email steps run.
    subprocess.run(["scrapy", "crawl", "your_spider_name"], check=True)

    # Saving data to Azure Blob Storage. Keep the real connection string
    # in an application setting or Key Vault, never hard-coded in source.
    connection_string = "<your-storage-connection-string>"
    container_name = "data"

    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)

    # Upload the scraped data to Azure Blob Storage. overwrite=True lets
    # each scheduled refresh replace the previous run's blob instead of
    # failing with a "blob already exists" error.
    with open("path/to/your/scraped/data", "rb") as data:
        container_client.upload_blob(name="scraped_data.csv", data=data, overwrite=True)

    # Sending email notification
    sender_email = "[email protected]"
    receiver_email = "[email protected]"
    smtp_server = "smtp.example.com"
    smtp_port = 587
    smtp_username = "your_smtp_username"
    smtp_password = "your_smtp_password"

    message = MIMEMultipart()
    message["From"] = sender_email
    message["To"] = receiver_email
    message["Subject"] = "Data Update Notification"

    body = "The data has been updated. You can access it from Azure Blob Storage."
    message.attach(MIMEText(body, "plain"))

    with smtplib.SMTP(smtp_server, smtp_port) as server:
        server.starttls()
        server.login(smtp_username, smtp_password)
        server.sendmail(sender_email, receiver_email, message.as_string())

    return func.HttpResponse(
        "Scraping process completed. Check your email for updates.",
        status_code=200
    )
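The requirement that the dataset (and the notification email) change only when the website actually changes can be handled by persisting a digest of the scraped payload between runs and comparing it before uploading. A small sketch — the function name and the idea of storing the digest (e.g. in a metadata blob alongside the data) are my own additions, not part of the answer above:

```python
import hashlib

def content_changed(new_data: bytes, previous_digest: str) -> tuple[bool, str]:
    """Return whether the scraped payload differs from the last run,
    plus the new digest to persist for the next comparison."""
    digest = hashlib.sha256(new_data).hexdigest()
    return digest != previous_digest, digest

# First run (no stored digest) counts as changed, so the upload and
# email fire; an identical second run reports no change.
changed, d1 = content_changed(b"<html>v1</html>", "")
changed_again, d2 = content_changed(b"<html>v1</html>", d1)
```

In the function above, you would call this before `upload_blob` and skip both the upload and the email when nothing changed.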

Another option is to use Power Automate Desktop to run the Python scraping script, or to use the web data extraction actions in a flow directly to extract data from the web page and send it as an email.


© www.soinside.com 2019 - 2024. All rights reserved.