How to download a captcha image without a direct link

Question

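I'm trying to access sci-hub.io from a command-line client, not to defeat its captcha system. When you post a doi to its front page, it returns a pdf url of the form

http://moscow.sci-hub.io/abc123blah/foo.pdf

If you then request that link, you randomly get either the pdf or a captcha. The captcha page has this source: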

<html>
    <head>
        <title>Для просмотра статьи разгадайте капчу</title>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
    </head>
    <body style = "background:white">
        <div>
            <table style = "width:100%;height:100%"><tr><td style = "vertical-align:middle;text-align:center">
            <h2 style = "color:gray;font-family:sans-serif;padding:18px">для просмотра статьи разгадайте капчу</h2>
            <p></p>
            <form action = "" method = "POST">
                <p><img id="captcha" src="/captcha/securimage_show.php" /></p>
                <input type="text" maxlength="6" name="captcha_code" style = "width:256px;font-size:18px;height:36px;margin-top:18px;text-align:center" autofocus /><br>
                <a style = "color:gray;text-decoration:none" href="#" onclick="document.getElementById('captcha').src = '/captcha/securimage_show.php?' + Math.random(); return false">[ показать другую картинку ]</a>
                <p style = "margin-top:22px"><input type = "submit" value= "Продолжить"></p>
            </form>
            </td></tr></table>
        </div>
    </body>
</html>
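
The image src in that markup is a relative path, so a client has to resolve it against the URL of the page that served it. A minimal sketch using only the standard library follows; the names `CaptchaImageFinder` and `captcha_image_url` are made up for illustration, and the sample markup is trimmed from the page source above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CaptchaImageFinder(HTMLParser):
    """Remember the src of the <img id="captcha"> element, if there is one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("id") == "captcha":
            self.src = attrs.get("src")


def captcha_image_url(page_url, page_html):
    """Return the absolute captcha image URL, or None if the page has none."""
    finder = CaptchaImageFinder()
    finder.feed(page_html)
    if finder.src is None:
        return None
    # A src starting with "/" is resolved against the host of page_url.
    return urljoin(page_url, finder.src)


# Trimmed copy of the form from the captcha page.
sample = (
    '<form action="" method="POST">'
    '<p><img id="captcha" src="/captcha/securimage_show.php" /></p>'
    "</form>"
)
url = captcha_image_url("http://moscow.sci-hub.io/abc123blah/foo.pdf", sample)
print(url)
```

Parsing the page rather than hard-coding the path means the client keeps working if the image location changes, as long as the img element keeps its id.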

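All I can think of is to request securimage_show.php, save the image, show it to the user, get their reading of it, and then POST the response. An example pdf link is

http://moscow.sci-hub.io/291193c259b69cc057d74e3eb4965c4f/ong2014.pdf

For example: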

import io

import requests
from PIL import Image

pdf_url = "http://moscow.sci-hub.io/3dcd1bf3b82ea549c0a72e9ab195ab78/walter2015.pdf"

# First attempt: the server answers with either the pdf or a captcha page.
r1 = requests.get(pdf_url)

if r1.headers['Content-Type'] != 'application/pdf':
    print("Looks like Sci-hub gave us a captcha")

    # Fetch a captcha image and show it so the user can read it.
    image = requests.get("http://moscow.sci-hub.io/captcha/securimage_show.php").content
    img = io.BytesIO(image)
    im = Image.open(img)
    im.show()
    captcha_text = input("Enter captcha text: ")

    # Send the user's answer back to the pdf url.
    r2 = requests.post(pdf_url, data = {'captcha_code': captcha_text})

    if r2.headers['Content-Type'] != 'application/pdf':
        print("Looks like Sci-hub gave us another captcha")
    else:
        with open("filename.pdf", 'wb') as f:
            f.write(r2.content)
        print("saved!")

else:
    print("Got a PDF")
    with open("filename.pdf", 'wb') as f:
        f.write(r1.content)
    print("saved!")

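I have no way of getting the original captcha image that was generated when I first requested the pdf. When I request another captcha image from securimage_show.php, it generates a new image, so my POST response is incorrect. How can I fix this?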

Answer 1

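Thanks to Andrew for pointing me in the right direction. I needed to set up a session and make all the requests through it. I assume the session passes a cookie back and forth so that the server can keep track of the latest captcha it sent me. That is only a guess, since this is still a bit magic to me.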

import requests
from PIL import Image
from io import BytesIO

pdf_url = "http://moscow.sci-hub.io/3dcd1bf3b82ea549c0a72e9ab195ab78/walter2015.pdf"

s = requests.Session()
r1 = s.get(pdf_url)

if r1.headers['Content-Type'] != 'application/pdf':
    print("Looks like Sci-hub gave us a captcha")

    image = s.get("http://moscow.sci-hub.io/captcha/securimage_show.php").content
    img = BytesIO(image)
    im = Image.open(img)
    im.show()
    captcha_text = input("Enter captcha text: ")

    r2 = s.post(pdf_url, data = {'captcha_code': captcha_text})

    if r2.headers['Content-Type'] != 'application/pdf':
        print("Looks like Sci-hub gave us another captcha")
    else:
        with open("filename.pdf", 'wb') as f:
            f.write(r2.content)
        print("saved!")

else:
    print("Got a PDF")
    with open("filename.pdf", 'wb') as f:
        f.write(r1.content)
    print("saved!")
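
To make the cookie guess a little more concrete, here is a minimal offline sketch of what a requests.Session does with a stored cookie. It sends nothing over the network: it only builds the request and inspects the headers. The cookie name PHPSESSID (PHP's default session cookie name) and the value abc123 are assumptions for illustration, not something confirmed about the site, and `cookie_header` is a made-up helper.

```python
import requests

HOST = "moscow.sci-hub.io"
CAPTCHA_URL = "http://" + HOST + "/captcha/securimage_show.php"


def cookie_header(session, url):
    """Return the Cookie header the session would send to url, or None."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


s = requests.Session()
# Simulate the server having answered an earlier request with Set-Cookie.
# Name and value are hypothetical.
s.cookies.set("PHPSESSID", "abc123", domain=HOST, path="/")

# The session attaches the stored cookie to every later request to that host,
# so the server can tie the captcha image and the POST to the same visitor.
with_session = cookie_header(s, CAPTCHA_URL)

# A fresh session (which is what a bare requests.get uses internally) has no
# cookies, so the server sees a new visitor and issues a new captcha.
without_session = cookie_header(requests.Session(), CAPTCHA_URL)

print(with_session)
print(without_session)
```

If the guess is right, this is why the version without a session could never succeed: the image request and the POST each looked like a first visit.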
