我有一个网页抓取项目卡住了。
我正在使用 Requests 包和 get() 方法下载网站 http,之后我想在上面使用 Beautiful Soup。
它在我的 labtop 上运行良好,但是当我将程序上传到我的 AWS Ubuntu EC2 实例时,我遇到了错误。我试过其他网站,它们都有效,我只在这个网站上遇到这些问题。
有人知道为什么会这样吗? 根据错误消息,我怀疑 SSL 问题,但即使使用 verify=False 参数,它仍然无法工作。
代码:
import requests
url = "http://www.kino.dk"
r = requests.get(url, verify=False)
print(r.text)
错误信息:
> Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/contrib/pyopenssl.py", line 485, in wrap_socket
cnx.do_handshake()
File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1915, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1647, in _raise_ssl_error
_raise_current_error()
File "/usr/lib/python3/dist-packages/OpenSSL/_util.py", line 54, in exception_from_error_queue
raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'tls12_check_peer_sigalg', 'wrong signature type')]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 376, in _make_request
self._validate_conn(conn)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 996, in _validate_conn
conn.connect()
File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 352, in connect
self.sock = ssl_wrap_socket(
File "/usr/lib/python3/dist-packages/urllib3/util/ssl_.py", line 370, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/usr/lib/python3/dist-packages/urllib3/contrib/pyopenssl.py", line 491, in wrap_socket
raise ssl.SSLError("bad handshake: %r" % e)
ssl.SSLError: ("bad handshake: Error([('SSL routines', 'tls12_check_peer_sigalg', 'wrong signature type')])",)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 436, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.kino.dk', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls12_check_peer_sigalg', 'wrong signature type')])")))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 6, in <module>
response = requests.get(url, verify=False)
File "/usr/lib/python3/dist-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/usr/lib/python3/dist-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 668, in send
history = [resp for resp in gen] if allow_redirects else []
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 668, in <listcomp>
history = [resp for resp in gen] if allow_redirects else []
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 239, in resolve_redirects
resp = self.send(
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.kino.dk', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls12_check_peer_sigalg', 'wrong signature type')])")))
看来网站只是有非常有效的反抓取对策。