I am using the Python requests library to open a URL that points to a Word document. When I visit the URL manually in a browser, the document downloads automatically, and I can download it that way without problems.
However, when I use requests, I get a ChunkedEncodingError.
My code:
import requests
url = 'https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc'
res = requests.get(url)
print(res)
The error:
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(16834 bytes read, 87102 more expected)', IncompleteRead(16834 bytes read, 87102 more expected))
I have also tried other libraries such as aiohttp and urllib3, but they fail with errors as well.
Retrying the request does not help, because I get the error every single time.
Any help would be greatly appreciated! Some other posts suggest this could be a server-side problem, but the URL works fine in a browser, and the more technical details are beyond me.
This is indeed a server-side issue: it happens even with wget, although wget (and your browser) is smart enough to retry from the byte where the transfer failed:
wget -vvv 'https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc'
--2024-01-24 16:05:40-- https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc
Resolving legalref.judiciary.hk (legalref.judiciary.hk)... 118.143.43.114
Connecting to legalref.judiciary.hk (legalref.judiciary.hk)|118.143.43.114|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103936 (102K) [application/msword]
Saving to: ‘CACC000213A_2008.doc’
CACC000213A_2008.doc 14%[=================> ] 15,12K --.-KB/s in 0s
2024-01-24 16:05:42 (31,4 MB/s) - Read error at byte 15486/103936 (Connection reset by peer). Retrying.
--2024-01-24 16:05:43-- (try: 2) https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc
Connecting to legalref.judiciary.hk (legalref.judiciary.hk)|118.143.43.114|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 103936 (102K), 88450 (86K) remaining [application/msword]
Saving to: ‘CACC000213A_2008.doc’
CACC000213A_2008.doc 100%[++++++++++++++++++======================================================================================================>] 101,50K 25,8KB/s in 3,3s
2024-01-24 16:05:50 (25,8 KB/s) - ‘CACC000213A_2008.doc’ saved [103936/103936]
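To make the byte arithmetic in that wget log concrete, here is a small sketch (the helper name `resume_plan` is made up for illustration): given how many bytes were saved before the connection dropped and the total Content-Length, it produces the Range header to retry with and the number of bytes still outstanding.

```python
def resume_plan(bytes_written: int, total: int) -> tuple[str, int]:
    """Return the Range header for resuming a partial download and the
    number of bytes still expected, given the total Content-Length."""
    return f"bytes={bytes_written}-", total - bytes_written


# The failed wget attempt above died at byte 15486 of 103936:
header, remaining = resume_plan(15486, 103936)
print(header)     # bytes=15486-
print(remaining)  # 88450 -- wget reports this as "88450 (86K) remaining"
```

The second wget attempt sends exactly this kind of `Range` header, which is why the server answers `206 Partial Content` with 88450 bytes remaining.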
You can implement similar logic yourself with requests.get(..., stream=True): look at the Content-Length you receive and compare it with the number of bytes you have successfully written; if you hit an exception having read less than expected (according to Content-Length), retry with a Range: bytes={start_byte}- style header:
import requests


def download_with_resume(sess: requests.Session, url: str) -> bytes:
    data = b""
    expected_length = None
    for attempt in range(10):
        if len(data) == expected_length:
            break
        if len(data):
            # We already have a partial body; ask for the rest of it
            headers = {"Range": f"bytes={len(data)}-"}
            expected_status = 206
        else:
            headers = {}
            expected_status = 200
        print(f"{url}: got {len(data)} / {expected_length} bytes...")
        resp = sess.get(url, stream=True, headers=headers)
        resp.raise_for_status()
        if resp.status_code != expected_status:
            raise ValueError(f"Unexpected status code: {resp.status_code}")
        if expected_length is None:  # Only update this on the first request
            content_length = resp.headers.get("Content-Length")
            if not content_length:
                raise ValueError("Content-Length header not found")
            expected_length = int(content_length)
        try:
            for chunk in resp.iter_content(chunk_size=8192):
                data += chunk
        except requests.exceptions.ChunkedEncodingError:
            pass  # Connection dropped mid-body; loop around and resume
    if len(data) != expected_length:
        raise ValueError(f"Expected {expected_length} bytes, got {len(data)}")
    return data


with requests.Session() as sess:
    data = download_with_resume(
        sess,
        url="https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc",
    )
    print("=>", len(data))