The request returns a 200 response, but the downloaded file is still blank. Please help me solve this problem.
import requests
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
r = requests.get(link, stream=True, headers=HEADERS)
with open(output_filename, 'wb') as f:
    f.write(r.content)
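The HTML page contains a form. You can either submit that form, or extract the key (the encoded string) from the page and request the PDF with a new link/URL.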
import re

import requests
from bs4 import BeautifulSoup as bs

url = "url1"

with requests.Session() as session:
    res = session.get(url)
    soup = bs(res.text, 'html.parser')
    # Get the key from the HTML page: <form> -> <script> {encodedString = dfsfsdfs}</script>
    script_tag = soup.select_one('form#SummaryForm').find('script', string=re.compile('encodedString'))
    key = re.findall(r"encodedString = '([^']*)'", script_tag.string)[0]
    # The fund id (fid) is taken from the URL string
    fid = re.findall(r'fundid=(\d+)', url)[0]
    # The download link was found in the browser devtools: Network tab, filtered by Fetch/XHR
    link = f'new_url_to_pdf?key={key}&fid={fid}'
    output_filename = 'file.pdf'
    r = requests.get(link)
    with open(output_filename, 'wb') as f:
        f.write(r.content)
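Since a 200 status does not guarantee that the body is a PDF, it may help to inspect the payload before writing it to disk. The following is a minimal sketch, not part of the original answer: `describe_payload` is a hypothetical helper name, and it only relies on the fact that PDF files begin with the `%PDF-` signature.

```python
def describe_payload(content: bytes, content_type: str = "") -> str:
    """Roughly classify a downloaded body as 'pdf', 'empty', 'html' or 'unknown'.

    Hypothetical helper for illustration; content_type is the value of the
    response's Content-Type header, if any.
    """
    # Every PDF file starts with the signature "%PDF-".
    if content.startswith(b"%PDF-"):
        return "pdf"
    # A 200 response can still carry an empty body.
    if not content:
        return "empty"
    # An HTML page (for example the form page) is a common non-PDF result.
    head = content[:512].lstrip().lower()
    if "html" in content_type.lower() or head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

With the code above, something like `describe_payload(r.content, r.headers.get("Content-Type", ""))` could be called before opening the output file; a result of `"html"` would suggest the server returned the form page rather than the document.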
There is also an alternative: Selenium or requests-html. Since you need very little interaction with the browser, you could try requests-html.
from requests_html import HTMLSession

url = 'url2'

# JavaScript run in the rendered page: click the visible iframe, then simulate Ctrl+S.
# Note: $ assumes the page itself loads jQuery.
script = """
window.addEventListener('load', function () {
    document.querySelectorAll('iframe:not([style*="display: none"])')[0].click();
    e = $.Event("keydown");
    e.which = 83; // S
    e.ctrlKey = true; // CTRL
    $(document).trigger(e);
})
"""

session = HTMLSession()
res = session.get(url)
res.html.render(sleep=2, keep_page=True, script=script)
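The first time you run the code, it automatically downloads the browser it drives (a Chromium build), so the first run may take a while; subsequent runs will be faster.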
Edit 2:
requests-html does not work here because of the iframe.