I'm a student, currently working on a Python project. I need to download a CSV file named "prix-des-carburants-en-france-flux-instantane-v2" (13.8 MB) from https://data.economie.gouv.fr/explore/dataset/prix-des-carburants-en-france-flux-instantane-v2/ using Firefox.
I wrote a script to retrieve it. First I navigate to the site and retrieve the CSV name and the last update date, then I go to the "Export" section. After that, I click the CSV download link to start the download.
But sometimes (roughly 1% of the time) my download stops partway through. There are two cases:
I've read some similar questions on Stack Overflow, but I wasn't satisfied with the answers: they either suggest using "os" (which isn't the most Pythonic way) or they are about the "requests" module.
I like my approach: I just click the link the way a human would, which means I rely on Firefox's own download system (it normally works fine, there's no need for a "try/except" block, and we reuse something that has already been developed).
Note that I haven't really found the root cause of the problem, since the error doesn't occur very often!
I tried implementing a "wait_for_fully_downloaded" method. If I'm correct, a file appears in the folder before the download is complete. So my method checks once per second whether the file size is still growing; when the size stops changing, the file should be fully downloaded.
To be honest, this didn't change anything: I still have the same problem.
Could this be done with "with"? Regarding my second issue, it looks like the browser is closed before the download finishes. Should I use quit() instead? Would quit() still close my browser if the script crashes mid-execution?
For the first issue, maybe I could try a larger check_interval parameter... but that would mean my program keeps running while doing nothing in the meantime!
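To illustrate the "with" question: Selenium's WebDriver implements the context-manager protocol, so on leaving the "with" block the driver is quit, even when an exception is raised inside the block. The sketch below uses a hypothetical FakeDriver stand-in (not part of Selenium) so the contract can be shown without launching a browser:

```python
class FakeDriver:
    """Stand-in mimicking a Selenium WebDriver's context-manager contract:
    quit() is called when the 'with' block exits, including on a crash."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Runs on normal exit AND when the block raises an exception
        self.quit()
        return False  # do not swallow the exception


driver = FakeDriver()
try:
    with driver:
        raise RuntimeError("simulated crash during scraping")
except RuntimeError:
    pass

print(driver.quit_called)  # -> True: the browser is closed even on a crash
```

So a bare quit() call at the end of the script would be skipped on a crash, while the "with" form (which your perform_scraping already uses) closes the browser either way.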
Here is my code:
"""
Module which provides methods for scraping data using Firefox webdriver.
"""
import time
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import WebDriverException
class FirefoxScraperHolder:
"""
Class for scraping data using Firefox webdriver.
"""
def __init__(self, target_url):
"""
Initialize a FirefoxScraperHolder instance.
:param target_url: The URL to scrape data from.
"""
self.cwf = Path(__file__).resolve().parent
self.options = webdriver.FirefoxOptions()
self.driver = webdriver.Firefox(options=self.set_preferences())
self.target_url = target_url
self._updated_data_date = None
self._csv_id = None
def set_preferences(self):
"""
Set Firefox webdriver preferences
(Here only for downloading files)
:return: The configured Firefox options.
"""
# 2 means we use chosen directory as download folder
self.options.set_preference("browser.download.folderList", 2)
self.options.set_preference("browser.download.dir", str(self.cwf))
return self.options
@property
def updated_data_date(self):
"""
Get the last updated data date.
:return: The last updated data date.
:rtype: str
"""
return self._updated_data_date
@property
def csv_id(self):
"""
Get the filename of the downloaded CSV.
:return: The filename of the downloaded CSV.
:rtype: str
"""
return self._csv_id
def perform_scraping(self, aria_label, ng_if):
"""
Perform the scraping process.
:param aria_label: ARIA label for the CSV element.
:param ng_if: NG-if attribute for updated data date element.
"""
try:
with self.driver:
self.driver.maximize_window()
self.driver.get(self.target_url)
# Retrieves csv information
self.click_on(By.LINK_TEXT, "Informations")
self._updated_data_date = self.retrieve_text_info(
By.CSS_SELECTOR,
f"[ng-if='{ng_if}']")
self._csv_id = self.retrieve_text_info(
By.CLASS_NAME,
'ods-dataset-metadata-block__metadata-value'
) + '.csv'
# Download csv
self.click_on(By.LINK_TEXT, "Export")
self.click_on(By.CSS_SELECTOR, f"[aria-label='{aria_label}']")
self.wait_for_fully_downloaded()
except WebDriverException as exception:
print(f"An error occurred during the get operation: {exception}")
def click_on(self, find_by, value):
"""
Click on a web element identified by 'find_by' and 'value'.
:param find_by: The method used to find the element
(e.g., By.LINK_TEXT).
:param value: The value to search for.
"""
# Here 'wait' and 'EC' avoid error due to the loading of the website
wait = WebDriverWait(self.driver, 20)
element = wait.until(EC.element_to_be_clickable((find_by, value)))
element.click()
def remove_cwf_existing_csvs(self):
"""
Remove existing CSV files from the current working folder.
"""
for file in self.cwf.glob('*.csv'):
file.unlink(missing_ok=True)
def retrieve_text_info(self, find_by, value):
"""
Retrieve text information of a web element identified
by 'find_by' and 'value'.
:param find_by: The method used to find the element
(e.g., By.CSS_SELECTOR).
:param value: The value to search for.
:return: The text information of the web element.
:rtype: str
"""
# Here 'wait' and 'EC' avoid error due to the loading of the website
wait = WebDriverWait(self.driver, 20)
info = wait.until(EC.visibility_of_element_located((find_by, value)))
return info.text
def wait_for_fully_downloaded(self, timeout=60, check_interval=1):
"""
Wait for a file to be fully downloaded.
:param timeout: Maximum time to wait in seconds.
Default is 60 seconds.
:param check_interval: Interval for checking file size
in seconds. Default is 1 second.
"""
file_path = self.cwf / self._csv_id
start_time = time.time()
while time.time() - start_time < timeout:
if file_path.is_file():
initial_size = file_path.stat().st_size
time.sleep(check_interval)
# Checks if the file size changes during check_interval
if file_path.stat().st_size == initial_size:
return
return
# Here is my main code:
# target_url = ('https://data.economie.gouv.fr/explore/dataset/prix-des-'
# 'carburants-en-france-flux-instantane-v2/')
# csv_aria_label = 'Dataset export (CSV)'
# updated_data_date_ng_if = 'ctx.dataset.metas.data_processed'
#
# # Retrieves datas
# firefox_scraper = FirefoxScraperHolder(target_url)
# firefox_scraper.remove_cwf_existing_csvs()
# firefox_scraper.perform_scraping(csv_aria_label, updated_data_date_ng_if)
After starting the file download, call the following function.
import os
import time

DOWNLOAD_FILE_PATH = '/path/to/your/download/folder'  # set this to your download directory

def wait_until_download_finishes():
    dl_wait = True
    while dl_wait:
        time.sleep(5)
        dl_wait = False
        # Firefox stores an in-progress download as a '.part' file
        for file_name in os.listdir(DOWNLOAD_FILE_PATH):
            if file_name.endswith('.part'):
                dl_wait = True
DOWNLOAD_FILE_PATH is the path where your files are downloaded.
Every file that is still being downloaded ends in .part, which is the sign that Firefox is downloading the file. So when there is no .part file left in DOWNLOAD_FILE_PATH, it means your file has been fully downloaded.
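Building on the same ".part" idea, the wait can also be given a timeout so the script does not loop forever if the download stalls. This is a hypothetical variant of the function above (the name, parameters, and use of pathlib are my own, not from the original answer):

```python
import time
from pathlib import Path


def wait_until_download_finishes(download_dir, timeout=300, poll=1.0):
    """Block until no '.part' file remains in download_dir, or raise.

    Firefox writes an in-progress download as '<name>.part' and renames
    it on completion, so the absence of '.part' files signals that all
    downloads in the folder have finished.
    """
    download_dir = Path(download_dir)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not any(download_dir.glob('*.part')):
            return
        time.sleep(poll)
    raise TimeoutError(f"download still incomplete after {timeout}s")
```

Compared with the size-polling approach, this checks Firefox's own marker directly, so a momentarily stalled transfer is not mistaken for a finished one; the timeout turns a permanently stuck download into an explicit error instead of a silent truncated file.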