I have two scripts.

other.py looks like this:

    import subprocess

    # Some stuff is done here and a list of URLs is created, such as:
    urls = [
        'https://www.walmart.com/ip/Sabrina-Carpenter-Cherry-Pop-EDP-30ml-1oz/5492571361?classType=REGULAR&athbdg=L1600',
        'https://www.walmart.com/ip/Hoey-5-1-Painless-Hair-Remover-Women-Facial-Removal-Electric-Cordless-Shaver-Set-Wet-Dry-Lady-Razor-Women-Bikini-Line-Nose-Hair-Eyebrow-Arm-Leg-USB-R/647670434?classType=REGULAR',
    ]

    # Then the script runs another script called get_url.py and passes the URLs to it to be processed.
    # It is important that this does not block: the rest of this script must run
    # without waiting for get_url.py to complete.
    subprocess.Popen(['python', 'get_url.py', str(urls)])
get_url.py looks like this and downloads every URL passed to it:

    import ast
    import os
    import sys
    import time
    from datetime import datetime
    from concurrent.futures import ProcessPoolExecutor

    import pandas as pd
    import pyautogui
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()

    def get_page(url):
        file_name = f"{url[:20]}_{pd.to_datetime(datetime.now()).strftime('%Y-%m-%d %H-%M-%S')}.html"
        file_path = os.path.join(os.getcwd(), 'data', 'htmls')
        path_and_name = os.path.join(file_path, file_name)

        driver = webdriver.Chrome(options=options)
        driver.get(url)
        time.sleep(1)
        pyautogui.hotkey('ctrl', 's')  # open the "Save As" window
        time.sleep(1)
        # enter the path and file name so the page is downloaded to the desired directory
        pyautogui.typewrite(path_and_name)
        time.sleep(.5)
        pyautogui.hotkey('enter')
        time.sleep(.2)
        while True:  # wait until the download is complete, then close the driver
            files = os.listdir(file_path)
            if file_name in files:
                driver.close()
                break
            time.sleep(.1)

    if __name__ == '__main__':
        # getting the URLs from other.py and converting the string back to an actual list
        urls = ast.literal_eval(sys.argv[1])
        # multiprocessing the URLs to speed things up (necessary)
        with ProcessPoolExecutor(max_workers=10) as executor:
            executor.map(get_page, urls, chunksize=1)
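This works fine as long as only one browser is open. However, as soon as ProcessPoolExecutor opens several windows, the pyautogui.typewrite part of the function loses track of which window it is typing into. As a result, path_and_name may be typed several times into one "Save As" dialog, or typed incompletely, so the page either fails to download or is saved under the wrong name or in the wrong directory. Worse, if I click somewhere in my code editor while the function is running, pyautogui may type the path_and_name value into the editor, wherever the cursor is active. Running the browser in headless mode, so that I cannot accidentally disturb the windows, does not help.

So, basically, how do I fix the code above?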
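This happens because pyautogui operates on whichever window is currently active. If you want to run several windows in parallel, or use the browser while the program is running, you should use something that does not interact with the GUI.

So instead of clicking through the save dialog, define a download folder and configure Chrome to download into it.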
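    import os
    from selenium.webdriver.chrome.options import Options

    # download folder (the same data/htmls folder used in the question)
    downloads = os.path.join(os.getcwd(), 'data', 'htmls')

    # Chrome options
    options = Options()
    prefs = {
        "download.default_directory": downloads,
    }
    options.add_experimental_option("prefs", prefs)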
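
The download preference above only affects files that Chrome itself downloads. To save the rendered HTML of each page with no GUI involved, one option is to write driver.page_source to disk from each worker process. The following is a minimal sketch of that swapped-in technique, not the original save-as behavior: OUTPUT_DIR, build_file_name and download_all are hypothetical names introduced here, the "--headless=new" flag assumes a recent Chrome, and page_source contains only the HTML, not the images and stylesheets that a "Save As" of the complete page would include.

```python
import os
import re
from concurrent.futures import ProcessPoolExecutor
from datetime import datetime

# Hypothetical output folder, mirroring the data/htmls layout from the question.
OUTPUT_DIR = os.path.join(os.getcwd(), "data", "htmls")


def build_file_name(url, now):
    """Turn a URL and a timestamp into a file name that is safe on disk."""
    # Collapse every run of characters other than letters, digits, dots
    # and dashes into a single underscore, then cap the length.
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", url)[:60].strip("_")
    return f"{safe}_{now.strftime('%Y-%m-%d %H-%M-%S')}.html"


def get_page(url):
    """Load one URL in its own headless Chrome and write its HTML to disk."""
    # Imported here so build_file_name stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # assumes a recent Chrome version
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        html = driver.page_source  # HTML only, no images or stylesheets
    finally:
        driver.quit()  # always release the browser, even if loading fails

    os.makedirs(OUTPUT_DIR, exist_ok=True)
    path = os.path.join(OUTPUT_DIR, build_file_name(url, datetime.now()))
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path


def download_all(urls, max_workers=10):
    """Fetch all URLs in parallel, one browser per worker process."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(get_page, urls, chunksize=1))
```

In get_url.py this would be called as download_all(urls) under the if __name__ == '__main__': guard. Because no keystrokes are sent, the workers cannot type into each other's windows or into your editor, and there is no need to poll the folder for the finished file.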