Cannot fetch a page with "Content-Disposition: attachment;" using python-requests


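Using my Firefox browser, I log in to a download site and click one of its query buttons. A small window titled "Opening report1.csv" pops up, and I can choose "Open with" or "Save File". I save the file.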

For this action, Live HTTP headers shows me:

https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR

GET /ReportPage?download&NAME=ALL&DATE=THISYEAR HTTP/1.1
Host: myserver
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3
Accept-Encoding: gzip, deflate, br
Referer: https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
Cookie: JSESSIONID=88DEDBC6880571FDB0E6E4112D71B7D6
Connection: keep-alive
Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
Date: Sat, 30 Dec 2017 22:37:40 GMT
Server: Apache-Coyote/1.1
Last-Modified: Sat, 30 Dec 2017 22:37:40 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Pragma: no-cache
Cache-Control: no-cache, no-store
Content-Disposition: attachment; filename="report1.csv"; filename*=UTF-8''report1.csv
Content-Type: text/csv
Content-Length: 332369
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive

Now I try to simulate this with requests.

$ python3
>>> import requests
>>> from lxml import html
>>>
>>> s = requests.Session()
>>> s.verify = './myserver.crt'  # certificate of myserver for https
>>>
>>> # get the login web page to enter username and password
... r = s.get( 'https://myserver' )
>>>
>>> # Get url for logging in. It's the action-attribute in the form anywhere.
... # We use xpath.
... tree = html.fromstring(r.text)
>>> loginUrl = 'https://myserver/' + list(tree.xpath("//form[@id='id4']/@action"))[0]
>>> print( loginUrl )   # it contains a session-id
https://myserver/./;jsessionid=77EA70CB95252426439097E274286966?0-1.loginForm
>>>
>>> # logging in with username and password
... r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
>>> print( r.status_code )
200
>>> # try to get the download file using url from Live HTTP headers
... downloadQueryUrl = 'https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR'
>>> r = s.get( downloadQueryUrl )
>>> print( r.status_code)
200
>>> print( r.headers )
{'Connection': 'Keep-Alive',
'Date': 'Sun, 31 Dec 2017 14:46:03 GMT',
'Cache-Control': 'no-cache, no-store',
'Keep-Alive': 'timeout=5, max=94',
'Transfer-Encoding': 'chunked',
'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT',
'Pragma': 'no-cache',
'Content-Encoding': 'gzip',
'Content-Type': 'text/html;charset=UTF-8',
'Server': 'Apache-Coyote/1.1',
'Vary': 'Accept-Encoding'}
>>> print( r.url )
https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
>>>

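The request succeeds, but I don't get the file download. There is no "Content-Disposition: attachment;" entry in the headers. I only get the page where the query starts, i.e. the page named in the Referer.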

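Does this have something to do with the session cookie? It seems requests manages that automatically. Is there special handling for csv files? Do I have to use streaming? Is the download URL shown by Live HTTP headers the right one? Maybe it is created dynamically?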

How can I get the response with "Content-Disposition: attachment;" from myserver and download its file with requests?

python python-requests
1 Answer

I got it. @Patrick Mevzek pointed me in the right direction. Thanks for that.

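After logging in, I no longer stay on the first login page and call the query from there. Instead, I request the report page, extract the query URL from it, and request that URL. Now I get a response with "Content-Disposition: attachment;" in its headers. Printing its text to stdout is then simple. I prefer this because I can redirect the output to any file. Info messages go to stderr, so they don't mess up the redirected output. A typical call is ./download >out.csv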
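One detail of this flow is that the href taken from the report page is usually relative, so it has to be resolved against the page it came from. A minimal sketch of that step using the standard library's urljoin, as an alternative to string concatenation; the helper name absolute_link and the sample href are assumptions for illustration, not values taken from myserver:

```python
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Resolve an href found on page_url the way a browser would."""
    return urljoin(page_url, href)


# Hypothetical relative link as it might appear in the report page's HTML.
page_url = 'https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR'
href = './ReportPage?download&NAME=ALL&DATE=THISYEAR'
print(absolute_link(page_url, href))
```

With requests, r.url holds the final URL of the page after any redirects, so urljoin(r.url, href) would resolve the link against the page that was actually served.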

For completeness, here is the script template, without any error checking, to clarify how it works.

#!/usr/bin/python3

import requests
import sys
from lxml import html

s = requests.Session()
s.verify = './myserver.crt'  # certificate of myserver for https

# get the login web site to enter username and password
r = s.get( 'https://myserver' )

# Get url for logging in. It's the action-attribute in the form anywhere.
# We use xpath.
tree = html.fromstring(r.text)
loginUrl = 'https://myserver/' + tree.xpath("//form[@id='id4']/@action")[0]

# logging in with username and password and go to ReportPage with queries
r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
queryUrl = 'https://myserver/ReportPage?NAME=ALL&DATE=THISYEAR'
r = s.get( queryUrl )

# Get the download link for this query from this site. It's a link anywhere
# with value 'Download (UTF8)'
tree = html.fromstring( r.text )
downloadUrl = 'https://myserver/' + tree.xpath("//a[.='Download (UTF8)']/@href")[0]

# get the download file
r = s.get( downloadUrl )
if r.headers.get('Content-Disposition'):
    print( 'Downloading ...', file=sys.stderr )
    print( r.text )

# log out
r = s.get( 'https://myserver/logout' )
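
On the streaming question: the script above works without it, so streaming is optional. For large reports it may still be worth writing the raw bytes to a file in chunks instead of printing r.text. A rough sketch under that assumption; the helper name save_attachment and its parameters are made up for illustration, and session is expected to be a logged-in requests.Session like s above:

```python
import sys


def save_attachment(session, url, out_path, chunk_size=8192):
    """Stream url to out_path if the response is marked as an attachment.

    Returns True when a file was written, False otherwise.
    """
    # stream=True defers downloading the body until iter_content() is used.
    with session.get(url, stream=True) as r:
        r.raise_for_status()
        disposition = r.headers.get('Content-Disposition', '')
        if not disposition.lower().startswith('attachment'):
            # Probably an HTML page again, e.g. after a redirect.
            print('No attachment, got:', r.headers.get('Content-Type'),
                  file=sys.stderr)
            return False
        with open(out_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return True
```

A call such as save_attachment(s, downloadUrl, 'out.csv') would replace the print( r.text ) step. Writing bytes also keeps the file in the server's encoding rather than in whatever requests guesses when decoding r.text.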