Python web scrape login


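I'm new to Python and am trying to use XPath and requests to log in to a site and fetch some data from it, following the method demonstrated in a tutorial. My Python script currently looks like this: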

from lxml import html
import requests

url = "http://www.londoncoffeeguide.com/Venues/Profile/26-Grains"

session_requests = requests.session()
login_url = "http://www.londoncoffeeguide.com/signin?returnurl=%2fVenues"
result = session_requests.get(login_url)

tree = html.fromstring(result.content)
authenticity_token = list(set(tree.xpath("//input[@name='__CMSCsrfToken']/@value")))[0]

payload = {
    "p$lt$ctl01$LogonForm_SignIn$Login1$UserName": 'XXX', 
    "p$lt$ctl01$LogonForm_SignIn$Login1$Password": 'XXX', 
    "__CMSCsrfToken": authenticity_token
}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}

with requests.session() as s:
    p = s.post(login_url, data=payload, headers=headers)
    print(p.text)

Unfortunately, the text returned by the POST request shows...

<head><title>
    System error
</title>

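...followed by the rest of the login page's HTML. I have tried adding the headers line shown above and double-checked that the login details I'm using are correct, and I'm satisfied that the CMSCsrfToken is correct, but the login doesn't work. Any help with this is much appreciated. I've been googling around, but the various answers I've found to similar problems haven't helped (so far!).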
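The failure described above can also be detected in code rather than by reading the printed HTML. The following is only a minimal sketch, assuming the failed-login page always carries the "System error" title shown above; `page_title` and `login_failed` are hypothetical helper names, and the sample string stands in for the real response text.

```python
import re


def page_title(page_html):
    """Return the stripped text of the first <title> element, or None."""
    match = re.search(r"<title[^>]*>(.*?)</title>", page_html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def login_failed(page_html):
    # Assumption: the error page is identified by its "System error" title.
    return page_title(page_html) == "System error"


# Stand-in for p.text from the script above.
sample = "<head><title>\n    System error\n</title>"
print(login_failed(sample))
```

In the script above, the same check would be applied as `login_failed(p.text)` after the POST.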

python python-3.x xpath web-scraping python-requests
1 Answer

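You put your username and password in the wrong fields. In addition, a few more fields, such as __VIEWSTATE and __VIEWSTATEGENERATOR, need to be added to the payload for the script to work. The following script will log you in and then fetch the titles of the different profile items.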

from lxml.html import fromstring
import requests

login_url = "http://www.londoncoffeeguide.com/signin?returnurl=%2fVenues"

username = "" #fill this in
password = "" #fill this in as well

with requests.session() as session:
    session.headers['User-Agent'] = 'Mozilla/5.0'
    result = session.get(login_url)
    tree = fromstring(result.text)
    auth_token = tree.xpath("//input[@id='__CMSCsrfToken']/@value")[0]
    viewstate = tree.xpath("//input[@id='__VIEWSTATE']/@value")[0]
    viewgen = tree.xpath("//input[@id='__VIEWSTATEGENERATOR']/@value")[0]

    payload = {
        "__CMSCsrfToken": auth_token,
        "__VIEWSTATEGENERATOR":viewgen,
        "p$lt$ctl02$pageplaceholder$p$lt$ctl00$RowLayout_Bootstrap$RowLayout_Bootstrap_2$ColumnLayout_Bootstrap1$ColumnLayout_Bootstrap1_1$LogonForm_SignIn$Login1$UserName": username, 
        "p$lt$ctl02$pageplaceholder$p$lt$ctl00$RowLayout_Bootstrap$RowLayout_Bootstrap_2$ColumnLayout_Bootstrap1$ColumnLayout_Bootstrap1_1$LogonForm_SignIn$Login1$Password": password, 
        "__VIEWSTATE":viewstate,
        "p$lt$ctl02$pageplaceholder$p$lt$ctl00$RowLayout_Bootstrap$RowLayout_Bootstrap_2$ColumnLayout_Bootstrap1$ColumnLayout_Bootstrap1_1$LogonForm_SignIn$Login1$LoginButton": "Log on"
    }

    # The User-Agent set on the session above is reused for this request.
    p = session.post(login_url, data=payload)
    root = fromstring(p.text)
    # cssselect() requires the separate "cssselect" package to be installed.
    for iteminfo in root.cssselect(".ProfileItem .ProfileItemTitle"):
        print(iteminfo.text)

Make sure to fill in the username and password fields in the script before running it.
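The script looks up each hidden field by hand. A possible generalisation is to collect every hidden input on the sign-in page in one pass, so that fields like __VIEWSTATE are not forgotten. The sketch below is illustrative only: it uses the standard library's `html.parser` instead of lxml so that it runs without extra packages or a network connection, `HiddenInputCollector` and `hidden_fields` are hypothetical names, and the sample form and its values are made up.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(page_html):
    """Return a dict of all hidden form fields found in page_html."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    return collector.fields


# Made-up sample standing in for the real sign-in page (result.text).
sample = """
<form>
  <input type="hidden" name="__CMSCsrfToken" id="__CMSCsrfToken" value="token123" />
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="state456" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="A1B2" />
  <input type="text" name="UserName" />
</form>
"""

payload = hidden_fields(sample)
payload["UserName"] = "me"  # visible fields are still filled in by hand
print(payload)
```

With the real page, `hidden_fields(result.text)` would replace the three separate xpath lookups, and the long UserName, Password and LoginButton keys would then be added to the resulting dictionary as in the script.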
