Download all xlsx files and meta information from a website

Question · Votes: 0 · Answers: 2

I am working in a Kaggle notebook in the browser. Please check whether everything below can be done in this Kaggle notebook.

Website URL: click here

Website screenshot:

The downloadable files on the website are updated hourly and daily. As far as I can tell, nothing on the site changes except the contents of the xlsx files you see there.

I want to download two things from this URL: the meta information, and the xlsx files you see in the screenshot.

First, I want to download this meta information and turn it into a dataframe, as shown below. Right now I select the values manually and copy them here, but I want to build this from the URL.

url_meta_df = 

ID   Type   Name        URL
CAL  Region California  https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAL.xlsx
CAR  Region Carolinas   https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAR.xlsx
CENT Region Central     https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CENT.xlsx
FLA  Region Florida     https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_FLA.xlsx

Second: download each of the xlsx files and save them.

My code: I have tried the following, based on an answer here on SO.

from bs4 import BeautifulSoup
import requests

url = "..."  # URL of the page linked above
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a'):
    print(link.get('href'))

Current output:

None
https://twitter.com/EIAgov
None
https://www.facebook.com/eiagov
None
#page-sub-nav
/
#
/petroleum/
/petroleum/weekly/
/petroleum/supply/weekly/
/naturalgas/
http://ir.eia.gov/ngs/ngs.html
/naturalgas/weekly/
/electricity/
/electricity/monthly/
....
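(As an aside, the None entries in that output come from anchor tags that have no href attribute. A small guard filters those out and keeps only absolute links; this is a sketch over a toy HTML snippet, not the real page, which does not expose the xlsx links in its static HTML.)

```python
from bs4 import BeautifulSoup

# toy snippet standing in for the real page's HTML
html = """
<a>no href here</a>
<a href="#page-sub-nav">fragment link</a>
<a href="https://twitter.com/EIAgov">absolute link</a>
<a href="/electricity/">relative link</a>
"""

soup = BeautifulSoup(html, "html.parser")

# keep only anchors that have an href and point at an absolute URL
links = [
    a.get("href")
    for a in soup.find_all("a")
    if a.get("href") and a.get("href").startswith("http")
]
print(links)  # ['https://twitter.com/EIAgov']
```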
python python-3.x url beautifulsoup python-requests
2 Answers

0 votes

This should get you all the files.

Note: this can take a while, since the files are big - 20 MB+ each.

import os
import random
import time
from pathlib import Path
from shutil import copyfileobj

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203",
}

url = "https://www.eia.gov/electricity/930-api//respondents/data"
file_base_url = "https://www.eia.gov/electricity/gridmonitor/knownissues/xls"
download_dir = Path("xls_files")

params = {
    "type[0]": "BA",
    "type[1]": "BR",
    "type[2]": "ER",
}


def wait_a_bit() -> None:
    pseudo_random_wait = random.randint(1, 5)
    print(f"Waiting {pseudo_random_wait} seconds...")
    time.sleep(pseudo_random_wait)


def downloader() -> None:
    with requests.Session() as connection:
        connection.headers.update(headers)
        response = connection.get(url, params=params)
        data = response.json()[0]["data"]

        xls_files = []
        for item in data:
            if item["type"] == "BA" and item["time_zone"] is not None:
                xls_files.append(f"{file_base_url}/{item['id']}.xlsx")

            if item["type"] == "ER":
                xls_files.append(f"{file_base_url}/Region_{item['id']}.xlsx")

        os.makedirs(download_dir, exist_ok=True)
        for count, file in enumerate(xls_files, start=1):
            file_name = file.split("/")[-1]
            print(f"Downloading file {count} of {len(xls_files)}: {file_name}")
            response = connection.get(file, stream=True)
            with open(download_dir / file_name, "wb") as f:
                copyfileobj(response.raw, f)
            wait_a_bit()


if __name__ == "__main__":
    downloader()
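The question also asked for the metadata as a dataframe. The same respondents endpoint response can feed that directly; here is a minimal sketch. The column names follow the question's `url_meta_df`, the file-name patterns follow the answer above, and the sample records (including the `name` field) only mimic the API's shape - that shape is an assumption here, not taken from the live API.

```python
import pandas as pd

FILE_BASE_URL = "https://www.eia.gov/electricity/gridmonitor/knownissues/xls"


def build_meta_df(items: list) -> pd.DataFrame:
    """Turn respondent records into an ID/Type/Name/URL table."""
    rows = []
    for item in items:
        if item["type"] == "ER":
            # regions use the Region_<ID>.xlsx file-name pattern
            rows.append({
                "ID": item["id"],
                "Type": "Region",
                "Name": item["name"],
                "URL": f"{FILE_BASE_URL}/Region_{item['id']}.xlsx",
            })
        elif item["type"] == "BA" and item.get("time_zone") is not None:
            rows.append({
                "ID": item["id"],
                "Type": "BA",
                "Name": item["name"],
                "URL": f"{FILE_BASE_URL}/{item['id']}.xlsx",
            })
    return pd.DataFrame(rows, columns=["ID", "Type", "Name", "URL"])


# sample records shaped like the API data used above (fields assumed)
sample = [
    {"id": "CAL", "type": "ER", "name": "California", "time_zone": None},
    {"id": "CISO", "type": "BA", "name": "California ISO", "time_zone": "Pacific"},
]
url_meta_df = build_meta_df(sample)
print(url_meta_df)
```

In real use you would pass `response.json()[0]["data"]` from the session above instead of `sample`.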

-1 votes

The main problem with what you are trying to do is that the table with the *.xlsx information is not in the page at that URL. The table is created with Javascript when you press the button, so to actually reach the table data you would first have to use something like the python webbrowser module to click the button and generate the table's HTML.

But there is a faster and easier way to do this: the reference tables are actually static - they contain the ID, region, and a fixed link for each *.xlsx file. So you can get the HTML of the Javascript-rendered page by following this answer to obtain the metadata table, and once you have the HTML you can download the .xlsx files.

from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd
import requests

html_file = "..."  # PATH TO HTML FILE

with open(html_file) as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# there is more than one table - it's the last one
table = soup.find_all('table')[-1]

# use pandas to create a df from the table
df = pd.read_html(StringIO(str(table)))[0]

# loop over each url in the df column
for xlsx_url in df['URL']:

    # get the xlsx file
    resp = requests.get(xlsx_url)

    # get the filename to save as
    filename = xlsx_url.split('/')[-1]

    # save it
    with open(filename, 'wb') as f:
        f.write(resp.content)
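Since this answer points out that the links are fixed, the metadata table can also be assembled without any HTML at all, straight from the known IDs. This sketch uses only the four regions shown in the question's sample table; the full ID list would still have to come from the page or the API.

```python
import pandas as pd

BASE = "https://www.eia.gov/electricity/gridmonitor/knownissues/xls"

# IDs and names copied from the question's sample table
regions = {
    "CAL": "California",
    "CAR": "Carolinas",
    "CENT": "Central",
    "FLA": "Florida",
}

url_meta_df = pd.DataFrame(
    {
        "ID": list(regions),
        "Type": "Region",  # scalar is broadcast to every row
        "Name": list(regions.values()),
        "URL": [f"{BASE}/Region_{rid}.xlsx" for rid in regions],
    }
)
print(url_meta_df)
```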