Server denies me access when web scraping (Python, requests)


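I'm trying to access my school's intranet to scrape it and retrieve the table of homework I have to complete. I searched online for a solution but couldn't find one. For obvious reasons I won't provide my login credentials, but I will provide the HTML I get back. Any help would be great, thanks.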

My code so far:

import requests

while True:
    Post_Login_URL = 'http://parents.netherhall.org/'
    Request_URL = 'https://parents.netherhall.org/parents/students/?admissionno=011161&page=homework'
    username = input('What is your username? ')
    password = input('What is your password? ')
    payload = {
        'username': username,
        'password': password
    }
    with requests.Session() as session:
        post = session.post(Post_Login_URL, data=payload)
        r = session.get(Request_URL)
        print(r.text)

The response I get:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML dir=ltr><HEAD><TITLE>The page cannot be displayed</TITLE>
<STYLE id=L_defaultr_1>A:link {
    FONT: 8pt/11pt verdana; COLOR: #ff0000
}
A:visited {
    FONT: 8pt/11pt verdana; COLOR: #4e4e4e
}
</STYLE>

<META content=NOINDEX name=ROBOTS>
<META http-equiv=Content-Type content="text-html; charset=UTF-8">

<META content="MSHTML 5.50.4522.1800" name=GENERATOR></HEAD>
<BODY bgColor=#ffffff>
<TABLE cellSpacing=5 cellPadding=3 width=410>
  <TBODY>
  <TR>
    <TD id=L_defaultr_0 valign=middle align=left width=360>
      <H1 id=L_defaultr_2 style="FONT: 13pt/15pt verdana; COLOR: #000000"><ID id=L_defaultr_3><!--Problem-->The page cannot be displayed
</ID></H1></TD></TR>
  <TR>
    <TD width=400 colSpan=2><FONT id=L_defaultr_4
      style="FONT: 8pt/11pt verdana; COLOR: #000000"><ID id=L_defaultr_5><B>Explanation: </B>There is a problem with the page you are trying to reach and it cannot be displayed.</ID></FONT></TD></TR>
  <TR>
    <TD width=400 colSpan=2><FONT id=L_defaultr_6 
      style="FONT: 8pt/11pt verdana; COLOR: #000000">
      <HR color=#c0c0c0 noShade>

      <P id=L_defaultr_7><B>Try the following:</B></P>
      <UL>
        <LI id=L_defaultr_8><B>Refresh page:</B> Search for the page again by clicking the Refresh button. The timeout may have occurred due to Internet congestion.
<LI id=L_defaultr_9><B>Check spelling:</B> Check that you typed the Web page address correctly. The address may have been mistyped.
<LI id=L_defaultr_10><B>Access from a link:</B> If there is a link to the page you are looking for, try accessing the page from that link.

      </UL>
      <HR color=#c0c0c0 noShade>

      <P id=L_defaultr_11>Technical Information (for support personnel)</P>
      <UL>
        <LI id=L_defaultr_12>Error Code: 401 Unauthorized. The server requires authorization to fulfill the request. Access to the Web server is denied. Contact the server administrator. (12209)

        </UL></FONT></TD></TR></TBODY></TABLE></BODY></HTML>
python-3.x python-requests
1 answer

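You have to set the request headers yourself, because by default requests sends a User-Agent like "python-requests/<version>", which some servers reject.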

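To do this, open the developer tools in your browser: in Chrome press Ctrl + Shift + I (or F12), in Firefox press Ctrl + Shift + E. Then go to the Network tab. Now log in to the site; on the left (or at the bottom) a row will appear for the request to parents.netherhall.org. Click it and copy the request headers.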
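The headers copied from the Network tab usually arrive as plain "Name: value" lines. Below is a minimal sketch of turning that text into a dict, assuming that format. The helper name parse_raw_headers and the sample header values are made up for illustration and are not part of requests or of the site.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(':'):
            continue
        # Split on the first colon only, so URLs in values stay intact.
        name, sep, value = line.partition(':')
        if sep:  # ignore lines that contain no colon at all
            headers[name.strip()] = value.strip()
    return headers


# Sample text is invented for illustration; paste your own copied headers.
raw = """
User-Agent: Mozilla/5.0 (X11; Linux x86_64)
Accept: text/html
Referer: http://parents.netherhall.org/
"""
headers = parse_raw_headers(raw)
print(headers)
```

The resulting dict can be passed as the headers argument in the code that follows.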

Then use them like this:

from requests import Session

# Create headers dict.
headers = {
    'header_name': 'header_value', # and so on
}

Post_Login_URL = 'http://parents.netherhall.org/'
Request_URL = 'https://parents.netherhall.org/parents/students/?admissionno=011161&page=homework'
username = input('What is your username? ')
password = input('What is your password? ')
payload = {
    'username': username,
    'password': password
}
with Session() as session:
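    # Log in first; the session keeps any cookies the server sets.
    post = session.post(Post_Login_URL, headers=headers, data=payload)
    # Then request the protected page with the same headers.
    r = session.get(Request_URL, headers=headers)
    print(r.text) # Page source.
    print('Logged in successfully:', r.ok)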
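To confirm which User-Agent is actually sent, the request can be built without sending it. This is a rough sketch, assuming the requests package is installed; the Mozilla string is a placeholder rather than a value taken from the site, and nothing here contacts the server.

```python
import requests

# What requests identifies itself as when no User-Agent is supplied.
print(requests.utils.default_user_agent())

session = requests.Session()
# Placeholder value; substitute the User-Agent copied from your browser.
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})

# Build the request without sending it, to inspect the final headers.
prepared = session.prepare_request(
    requests.Request('GET', 'https://parents.netherhall.org/')
)
print(prepared.headers['User-Agent'])
```

Setting the headers on the session, as done here, applies them to every request made through it, which is an alternative to passing headers= on each call.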