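I'm running into a problem with a web scraping task. The page I'm trying to scrape uses reCAPTCHA v3, which makes scraping it much harder. reCAPTCHA v3 is a widely used tool for blocking automated access to websites; it runs in the background without any user interaction, which makes it particularly difficult to get around.
Here is my code: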
import os
import urllib.request
import zipfile
from pathlib import Path

from selenium import webdriver


def start_driver():
    try:
        # Download the anti-captcha plugin and unpack it into ./plugin
        url = 'https://antcpt.com/anticaptcha-plugin.zip'
        filehandle, _ = urllib.request.urlretrieve(url)
        with zipfile.ZipFile(filehandle, "r") as f:
            f.extractall("plugin")

        # Write the API key into the plugin's config file
        api_key = "MY API KEY"
        file = Path('./plugin/js/config_ac_api_key.js')
        file.write_text(file.read_text().replace("antiCapthaPredefinedApiKey = ''", "antiCapthaPredefinedApiKey = '{}'".format(api_key)))

        # Re-zip the patched plugin so Chrome can load it as an extension
        zip_file = zipfile.ZipFile('./plugin.zip', 'w', zipfile.ZIP_DEFLATED)
        for root, dirs, files in os.walk("./plugin"):
            for file in files:
                path = os.path.join(root, file)
                zip_file.write(path, arcname=path.replace("./plugin/", ""))
        zip_file.close()

        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_extension('./plugin.zip')
        chrome_options.add_argument("--start-maximized")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        driver = webdriver.Chrome(options=chrome_options)
        return driver
    except Exception as e:
        print("Error starting the driver:", e)
        return None


if __name__ == '__main__':
    driver = start_driver()
But I get this error:
Message: session not created: cannot process extension #1 from unknown error: cannot read manifest
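Your plugin.zip is being created with the plugin folder itself zipped inside it. Chrome looks for manifest.json at the root of the archive, so all you need is a zip structure like this:
Before: plugin.zip -> plugin -> [all files]
After: plugin.zip -> [all files]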
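One way to get that flat layout is to compute each archive name relative to the folder being zipped, rather than stripping a hard-coded "./plugin/" prefix (which will not match if the path separator differs, for example on Windows). The following is only a minimal sketch: the helper names `zip_folder_contents` and `has_root_manifest` are made up for illustration, and it assumes manifest.json sits directly inside the folder you pass in. If the downloaded archive unpacks into an extra subfolder, you would need to point it at that subfolder instead.

```python
import os
import zipfile


def zip_folder_contents(folder, zip_path):
    """Zip the files inside `folder` so they sit at the root of the archive.

    Hypothetical helper: `zip_path` should live outside `folder`.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                # Path relative to `folder`, so no leading "plugin/" component
                arcname = os.path.relpath(full_path, folder)
                zf.write(full_path, arcname=arcname)


def has_root_manifest(zip_path):
    """Return True if manifest.json is at the top level of the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return "manifest.json" in zf.namelist()
```

In the question's code this would replace the manual `zipfile.ZipFile('./plugin.zip', 'w', ...)` loop with `zip_folder_contents('./plugin', './plugin.zip')`, and calling `has_root_manifest('./plugin.zip')` before `add_extension` gives a quick sanity check on the layout.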