Python 3: extracting HTML data from a web page


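I am trying to extract the HTML from the web page shown in the code below. The same code works for other websites, but this site raises an error. What is causing the error?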

Here is the code:

import requests

url = 'https://clasificadosonline.com/'  # URL of the webpage to scrape

try:
    response = requests.get(url)
    response.raise_for_status() 
    html_content = response.text

    # Print the HTML content
    print(html_content)

except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")



This is the error I get:

"C:\Users\17874\OneDrive - University of Puerto Rico\Desktop\WebScraping\venv\Scripts\python.exe" "C:\Users\17874\OneDrive - University of Puerto Rico\Desktop\WebScraping\venv\Clasificados Online.py" 
Request failed: HTTPSConnectionPool(host='clasificadosonline.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:997)')))
Process finished with exit code 0
python-3.x web-scraping
1 Answer

Try forcing the SSL connection with the DEFAULT@SECLEVEL=1 cipher string:

import ssl
import warnings

import requests
import requests.packages.urllib3.exceptions as urllib3_exceptions

warnings.simplefilter("ignore", urllib3_exceptions.InsecureRequestWarning)


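class TLSAdapter(requests.adapters.HTTPAdapter):
    """Transport adapter that relaxes OpenSSL's defaults for legacy servers."""

    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        # Skip hostname matching for this context.
        ctx.check_hostname = False
        # Security level 1 re-enables older ciphers and key sizes.
        ctx.set_ciphers("DEFAULT@SECLEVEL=1")
        # 0x4 is OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT flag.
        ctx.options |= 0x4
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)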


url = "https://clasificadosonline.com/"  # URL of the webpage to scrape


with requests.session() as s:
    s.mount("https://", TLSAdapter())

    response = s.get(url)
    response.raise_for_status()
    html_content = response.text

    print(html_content)

Prints:

<script type="text/javascript">

<!--

    

if (screen.width <= 480) {

document.location = "https://www.clasificadosonline.com/m/";

}



//-->

</script>

<html><!-- #BeginTemplate "/Templates/master.dwt" --><!-- DW6 -->

...
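
For comparison, the same relaxed settings can probably be applied with only the standard library. This is a minimal sketch, not a tested replacement for the adapter above: make_legacy_context and fetch_html are hypothetical helper names, the 10-second timeout is an arbitrary choice, and it assumes the server accepts urllib's default request headers, which has not been checked for this site.

```python
import ssl
import urllib.request

# SSL_OP_LEGACY_SERVER_CONNECT, written as 0x4 in the answer above.
# The named constant only exists in the ssl module on Python 3.12+.
OP_LEGACY_SERVER_CONNECT = getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4)


def make_legacy_context() -> ssl.SSLContext:
    """Build a context with the same relaxed settings as TLSAdapter (hypothetical helper)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False               # skip hostname matching
    ctx.set_ciphers("DEFAULT@SECLEVEL=1")    # allow older ciphers and key sizes
    ctx.options |= OP_LEGACY_SERVER_CONNECT  # tolerate servers without secure renegotiation
    return ctx


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page through the relaxed context and decode the body (hypothetical helper)."""
    context = make_legacy_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling fetch_html("https://clasificadosonline.com/") should then return the page source as a string, provided the handshake failure really comes from the cipher or renegotiation settings. Because the context weakens TLS checks, it seems safer to use it only for hosts that fail with the default settings.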