I am working in the Kaggle browser. Please check whether everything below can be done in this Kaggle notebook.
Website URL: click here
Website screenshot: (not reproduced here)
The download files on the website are updated hourly/daily. I don't think anything on the site changes except the contents of the xlsx files you can see there.
I want to download two things from this URL: the meta information, and the xlsx files you see in the screenshot.
First, I want to download this meta information and turn it into a dataframe, as shown below. At the moment I select the values by hand and copy them here, but I want to do it from the URL:
url_meta_df =
ID Type Name URL
CAL Region California https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAL.xlsx
CAR Region Carolinas https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAR.xlsx
CENT Region Central https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CENT.xlsx
FLA Region Florida https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_FLA.xlsx
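(For reference, the target dataframe above corresponds to a DataFrame literal like the following; the four rows are taken verbatim from the pasted table.)

import pandas as pd

# the desired url_meta_df, built by hand from the four rows pasted above
url_meta_df = pd.DataFrame(
    [
        ("CAL", "Region", "California", "https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAL.xlsx"),
        ("CAR", "Region", "Carolinas", "https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAR.xlsx"),
        ("CENT", "Region", "Central", "https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CENT.xlsx"),
        ("FLA", "Region", "Florida", "https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_FLA.xlsx"),
    ],
    columns=["ID", "Type", "Name", "URL"],
)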
Second: download each xlsx file and save it.
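(For a single file, the basic download pattern is straightforward; a minimal sketch using one of the fixed links from the table above.)

import requests

# download one xlsx file and save it under its own name (sketch)
url = "https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAL.xlsx"
resp = requests.get(url)
resp.raise_for_status()  # fail loudly on HTTP errors
with open("Region_CAL.xlsx", "wb") as f:
    f.write(resp.content)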
My code: I have tried the following, based on an answer from SO:
from bs4 import BeautifulSoup
import requests

url = ...  # the Grid Monitor page URL linked above
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")

# print the href of every anchor on the page
for link in soup.find_all('a'):
    print(link.get('href'))
Current output:
None
https://twitter.com/EIAgov
None
https://www.facebook.com/eiagov
None
#page-sub-nav
/
#
/petroleum/
/petroleum/weekly/
/petroleum/supply/weekly/
/naturalgas/
http://ir.eia.gov/ngs/ngs.html
/naturalgas/weekly/
/electricity/
/electricity/monthly/
....
This should get you all the files.
Note: this can take a while, because the files are large (20 MB+ each).
import os
import random
import time
from pathlib import Path
from shutil import copyfileobj

import requests

# pretend to be a regular browser so the server does not reject the requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203",
}

# the page builds its tables from this JSON endpoint
url = "https://www.eia.gov/electricity/930-api//respondents/data"
file_base_url = "https://www.eia.gov/electricity/gridmonitor/knownissues/xls"
download_dir = Path("xls_files")

# ask the API for balancing authorities (BA), retired BAs (BR), and regions (ER)
params = {
    "type[0]": "BA",
    "type[1]": "BR",
    "type[2]": "ER",
}


def wait_a_bit() -> None:
    """Sleep 1-5 seconds between downloads to be polite to the server."""
    pseudo_random_wait = random.randint(1, 5)
    print(f"Waiting {pseudo_random_wait} seconds...")
    time.sleep(pseudo_random_wait)


def downloader() -> None:
    with requests.Session() as connection:
        connection.headers.update(headers)
        response = connection.get(url, params=params)
        data = response.json()[0]["data"]

        # build the list of .xlsx URLs from the API metadata
        xls_files = []
        for item in data:
            if item["type"] == "BA" and item["time_zone"] is not None:
                xls_files.append(f"{file_base_url}/{item['id']}.xlsx")
            if item["type"] == "ER":
                xls_files.append(f"{file_base_url}/Region_{item['id']}.xlsx")

        os.makedirs(download_dir, exist_ok=True)
        for count, file in enumerate(xls_files, start=1):
            file_name = file.split("/")[-1]
            print(f"Downloading file {count} of {len(xls_files)}: {file_name}")
            response = connection.get(file, stream=True)
            with open(download_dir / file_name, "wb") as f:
                copyfileobj(response.raw, f)
            wait_a_bit()


if __name__ == "__main__":
    downloader()
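The same JSON response can also be used to build the url_meta_df from the question. A minimal sketch, meant to run inside downloader() once data has been fetched; note that the "Name" column assumes the API items carry a name field, which you should verify against the actual payload:

import pandas as pd

# sketch: derive the region metadata table from the API response
rows = []
for item in data:
    if item["type"] == "ER":
        rows.append({
            "ID": item["id"],
            "Type": "Region",
            "Name": item.get("name"),  # "name" is an assumed field - check the payload
            "URL": f"{file_base_url}/Region_{item['id']}.xlsx",
        })
url_meta_df = pd.DataFrame(rows, columns=["ID", "Type", "Name", "URL"])
print(url_meta_df.head())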
The main problem with your attempt is that the table with the *.xlsx information is not in the HTML served at the page's URL. The table is created by JavaScript when you press the button, so to actually reach the table data you would first have to drive a real browser, with something like Selenium, to click the button and generate the table's HTML (the standard-library webbrowser module can only open pages, not click buttons).
But there is a faster, simpler way: the reference tables are effectively static. They hold the ID, the type/region, and a fixed link for every *.xlsx file. So you can obtain the HTML of the JavaScript-rendered page by following this answer, and once you have that HTML you can extract the metadata table and download the .xlsx files.
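(If you render the page yourself, a minimal Selenium sketch for saving the rendered HTML might look like this; the page URL is the one linked in the question, the fixed sleep is a crude stand-in for a proper WebDriverWait, and the output filename matches the parsing script below.)

from selenium import webdriver
import time

page_url = "..."  # the Grid Monitor page linked in the question

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get(page_url)
time.sleep(10)  # crude wait for the JavaScript to build the tables

# save the fully rendered HTML for the parsing script below
with open("rendered_page.html", "w") as f:
    f.write(driver.page_source)
driver.quit()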
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd
import requests

html_file = "rendered_page.html"  # path to your saved, JS-rendered HTML file

with open(html_file) as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# there is more than one table on the page - the metadata table is the last one
table = soup.find_all('table')[-1]

# use pandas to create a df from the table
df = pd.read_html(StringIO(str(table)))[0]

# loop over each URL in the df's URL column
for xlsx_url in df['URL']:
    # get the xlsx file
    resp = requests.get(xlsx_url)
    # derive the filename to save under
    filename = xlsx_url.split('/')[-1]
    # save it
    with open(filename, 'wb') as f:
        f.write(resp.content)
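The df produced by read_html here should already carry the ID/Type/Name/URL columns from the rendered table, so it doubles as the url_meta_df asked for in the question.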