Interacting with the "Save As" window when downloading web pages in Chrome with Selenium


I have two scripts:

  1. other.py
    looks like this:
import subprocess

# Some work is done here and a list of URLs is created, such as:

urls = ['https://www.walmart.com/ip/Sabrina-Carpenter-Cherry-Pop-EDP-30ml-1oz/5492571361?classType=REGULAR&athbdg=L1600', 'https://www.walmart.com/ip/Hoey-5-1-Painless-Hair-Remover-Women-Facial-Removal-Electric-Cordless-Shaver-Set-Wet-Dry-Lady-Razor-Women-Bikini-Line-Nose-Hair-Eyebrow-Arm-Leg-USB-R/647670434?classType=REGULAR']

# Then the script runs another script, get_url.py, and passes the URLs to it for processing:
subprocess.Popen(['python', 'get_url.py', str(urls)])

# It is important that this call does not block: the rest of this script must keep running without waiting for get_url.py to complete.
  2. get_url.py
    which is called above, looks like this and downloads each URL passed to it:
import ast
import os
import sys
import time
from concurrent.futures import ProcessPoolExecutor
from datetime import datetime

import pyautogui
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()

def get_page(url):
    file_name = f"{url[:20]}_{datetime.now().strftime('%Y-%m-%d %H-%M-%S')}.html"
    file_path = os.path.join(os.getcwd(), 'data', 'htmls')
    path_and_name = os.path.join(file_path, file_name)

    driver = webdriver.Chrome(options=options)
    driver.get(url)

    time.sleep(1)
    pyautogui.hotkey('ctrl', 's')  # open the "Save As" window
    time.sleep(1)
    pyautogui.typewrite(path_and_name)  # type the path and file name so the page is saved in the desired directory
    time.sleep(.5)
    pyautogui.hotkey('enter')
    time.sleep(.2)

    while True:  # wait until the download is complete, then close the driver
        files = os.listdir(file_path)
        if file_name in files:
            driver.close()
            break
        time.sleep(.1)

if __name__ == '__main__':
    # other.py passes the list as str(urls); turn that string back into a list
    urls = ast.literal_eval(sys.argv[1])
    # process the URLs in parallel to speed things up (necessary)
    with ProcessPoolExecutor(max_workers=10) as executor:
        executor.map(get_page, urls, chunksize=1)

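This works fine as long as only one browser is open. However, once multiple windows are opened through ProcessPoolExecutor, the pyautogui.typewrite part of the function loses track of the window. As a result, path_and_name may be typed into the "Save As" window several times, or typed incompletely, so the page either fails to download or is saved under the wrong name or in the wrong directory. Worse, if I click somewhere in my code editor while the function is running, pyautogui may type the path_and_name value into the editor, wherever the cursor is active. Running the browser in headless mode, so that I cannot accidentally disturb the windows, does not help.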

So, basically, how can I fix the code above?

python selenium-webdriver
1 Answer

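The cause is that pyautogui operates on whichever window is currently active. If you want to run several windows in parallel, or keep using the browser while the program runs, you should use an approach that does not interact with the GUI.

So instead of clicking through the save window, define a download folder and configure Chrome to download to that folder.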

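import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# download folder
downloads = os.path.join(os.getcwd(), "html", "data")

# Chrome options
options = Options()
prefs = {
    "download.default_directory": downloads,
}
options.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(options=options)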
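
One caveat: as far as I can tell, the download.default_directory preference only governs files that Chrome actually downloads, while navigating to an HTML page with driver.get renders the page rather than downloading it. If the goal is simply to keep the HTML of each page, one GUI-free option is to write driver.page_source to a file yourself. The following is a minimal sketch under that assumption, not a drop-in replacement: safe_file_name, save_page, main and OUTPUT_DIR are hypothetical names, the 60-character cut-off is arbitrary, and page_source is the serialized DOM only, so images and stylesheets are not saved the way "Webpage, Complete" would save them. The file name is built from a sanitized URL because url[:20] in the question still contains ":" and "/", which are not valid in file names.

```python
import ast
import re
import sys
from concurrent.futures import ProcessPoolExecutor
from datetime import datetime
from pathlib import Path

# Hypothetical output folder, mirroring the data/htmls layout from the question.
OUTPUT_DIR = Path.cwd() / "data" / "htmls"


def safe_file_name(url, now=None):
    """Build a file name from the URL without characters such as ':' or '/'."""
    now = now or datetime.now()
    stem = re.sub(r"^https?://", "", url)           # drop the scheme
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem)   # replace everything else unsafe
    stem = stem[:60].strip("_")                     # arbitrary length limit
    return f"{stem}_{now:%Y-%m-%d_%H-%M-%S}.html"


def save_page(url, output_dir=OUTPUT_DIR):
    """Load url in its own headless Chrome and write the page HTML to disk."""
    # Imported lazily so that safe_file_name can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # no window, so nothing can steal focus
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        html = driver.page_source  # serialized DOM of the loaded page
    finally:
        driver.quit()  # also shuts down the chromedriver process

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / safe_file_name(url)
    target.write_text(html, encoding="utf-8")
    return target


def main(argv):
    # other.py passes the list as str(urls), so parse it back into a list.
    urls = ast.literal_eval(argv[0]) if argv else []
    with ProcessPoolExecutor(max_workers=10) as executor:
        # Iterating over the results surfaces any exception raised in a worker.
        for path in executor.map(save_page, urls, chunksize=1):
            print(f"saved {path}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Because nothing here touches the keyboard or the focused window, the workers cannot type into each other's dialogs or into your editor. Two URLs that share the same first 60 sanitized characters and are saved within the same second would collide, so a longer stem or a counter may be needed for large batches.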