How to download a captcha image when there is no direct link to it

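Question (votes: 0, answers: 2)
I'm trying to access sci-hub.io from a command-line client without defeating its captcha system. When you POST a DOI to its front page, it returns a PDF URL of the form

http://moscow.sci-hub.io/abc123blah/foo.pdf

If you then request that link, you randomly get either the PDF or a captcha. The captcha page has this source: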

    <html>
    <head>
      <title>Для просмотра статьи разгадайте капчу</title>
      <meta charset="UTF-8">
      <meta name="viewport" content="width=device-width, initial-scale=1.0">
    </head>
    <body style = "background:white">
      <div>
        <table style = "width:100%;height:100%"><tr><td style = "vertical-align:middle;text-align:center">
          <h2 style = "color:gray;font-family:sans-serif;padding:18px">для просмотра статьи разгадайте капчу</h2>
          <p></p>
          <form action = "" method = "POST">
            <p><img id="captcha" src="/captcha/securimage_show.php" /></p>
            <input type="text" maxlength="6" name="captcha_code" style = "width:256px;font-size:18px;height:36px;margin-top:18px;text-align:center" autofocus /><br>
            <a style = "color:gray;text-decoration:none" href="#" onclick="document.getElementById('captcha').src = '/captcha/securimage_show.php?' + Math.random(); return false">[ показать другую картинку ]</a>
            <p style = "margin-top:22px"><input type = "submit" value= "Продолжить"></p>
          </form>
        </td></tr></table>
      </div>
    </body>
    </html>

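All I can think of is to request securimage_show.php, save the image, display it to the user, get their reading of it, and POST that as the response. An example PDF link is

http://moscow.sci-hub.io/291193c259b69cc057d74e3eb4965c4f/ong2014.pdf

For example: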

    import io

    import requests
    from PIL import Image

    pdf_url = "http://moscow.sci-hub.io/3dcd1bf3b82ea549c0a72e9ab195ab78/walter2015.pdf"

    r1 = requests.get(pdf_url)
    if r1.headers['Content-Type'] != 'application/pdf':
        print("Looks like Sci-hub gave us a captcha")
        # This separate request makes the server generate a new captcha image
        image = requests.get("http://moscow.sci-hub.io/captcha/securimage_show.php").content
        img = io.BytesIO(image)
        im = Image.open(img)
        im.show()
        captcha_text = input("Enter captcha text: ")
        r2 = requests.post(pdf_url, data={'captcha_code': captcha_text})
        if r2.headers['Content-Type'] != 'application/pdf':
            print("Looks like Sci-hub gave us another captcha")
        else:
            with open("filename.pdf", 'wb') as f:
                f.write(r2.content)
            print("saved!")
    else:
        print("Got a PDF")
        with open("filename.pdf", 'wb') as f:
            f.write(r1.content)
        print("saved!")

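I have no way of getting the original captcha image that was generated when I first requested the PDF. When I request another captcha image from securimage_show.php, it generates a new image, so my POSTed answer is wrong. How can I solve this?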

python web-scraping captcha
2 Answers
0 votes
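Thanks to Andrew for pointing me in the right direction. I needed to set up a session and make all the requests through it. I assume the session passes a cookie back and forth so that the server can keep track of the latest captcha it sent me. That is only a guess, since this is still a bit magical to me.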

    from io import BytesIO

    import requests
    from PIL import Image

    pdf_url = "http://moscow.sci-hub.io/3dcd1bf3b82ea549c0a72e9ab195ab78/walter2015.pdf"

    # One session for every request, so cookies set by the server are sent back
    s = requests.Session()
    r1 = s.get(pdf_url)
    if r1.headers['Content-Type'] != 'application/pdf':
        print("Looks like Sci-hub gave us a captcha")
        image = s.get("http://moscow.sci-hub.io/captcha/securimage_show.php").content
        img = BytesIO(image)
        im = Image.open(img)
        im.show()
        captcha_text = input("Enter captcha text: ")
        r2 = s.post(pdf_url, data={'captcha_code': captcha_text})
        if r2.headers['Content-Type'] != 'application/pdf':
            print("Looks like Sci-hub gave us another captcha")
        else:
            with open("filename.pdf", 'wb') as f:
                f.write(r2.content)
            print("saved!")
    else:
        print("Got a PDF")
        with open("filename.pdf", 'wb') as f:
            f.write(r1.content)
        print("saved!")
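The cookie guess above can be checked in isolation. The following is only a sketch using the standard library, not Sci-hub's actual behaviour: the local server, the `sid` cookie name, and the `demo` and `fetch_twice` helpers are all made up for illustration. The server hands out a session cookie to any client it does not recognise. A client without a cookie jar looks new on every request, while a client with a cookie jar (which is roughly what `requests.Session` keeps for you) is recognised on the second request.

```python
import http.cookiejar
import threading
import urllib.request
import uuid
from http.cookies import SimpleCookie
from http.server import BaseHTTPRequestHandler, HTTPServer

SESSIONS = set()  # session ids this toy server has handed out


class Handler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a server that tracks clients by cookie."""

    def do_GET(self):
        sid = SimpleCookie(self.headers.get("Cookie", "")).get("sid")
        known = sid is not None and sid.value in SESSIONS
        body = b"same session" if known else b"new session"
        self.send_response(200)
        if not known:
            # Unknown client: create a session and ask it to remember the id.
            new_sid = uuid.uuid4().hex
            SESSIONS.add(new_sid)
            self.send_header("Set-Cookie", f"sid={new_sid}; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_twice(opener, base):
    """Request the same URL twice and return both response bodies."""
    return [opener.open(base + "/").read().decode() for _ in range(2)]


def demo():
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    no_proxy = urllib.request.ProxyHandler({})  # talk to localhost directly
    try:
        # No cookie jar: like calling requests.get() separately each time.
        plain = urllib.request.build_opener(no_proxy)
        # With a cookie jar: comparable to using one requests.Session().
        jar = http.cookiejar.CookieJar()
        with_jar = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar)
        )
        return fetch_twice(plain, base), fetch_twice(with_jar, base)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    without_cookies, with_cookies = demo()
    print("without cookie jar:", without_cookies)
    print("with cookie jar:   ", with_cookies)
```

If the captcha server works along these lines, that would explain why the captcha fetched with a bare `requests.get` never matched the one the POST was checked against: each request looked like a brand-new visitor.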
    

