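I'm new to web scraping and am trying to scrape some housing information from redfin.com, using the Python requests package to fetch the page source. However, the code only works some of the time: sometimes it returns the full HTML for each URL, and other times it returns nothing but an empty string.
Here is a simplified version of my code: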
import requests

headers = {
    'user-agent': 'XXX',  # placeholder: substitute a real browser User-Agent string
}
links = ['https://www.redfin.com/ID/Meridian/3642-N-Hollymount-Way-83646/home/106711385',
         'https://www.redfin.com/ID/Meridian/1506-N-Penrith-Pl-83642/home/106700395',
         'https://www.redfin.com/ID/Nampa/34-N-Middleton-Rd-83651/home/117266789',
         'https://www.redfin.com/OR/The-Dalles/1308-Harris-St-97058/home/53055510']

for link in links:
    response = requests.get(link, headers=headers)
    html = response.text
    print(html)
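The status code is always 200. Sometimes I get the HTML, but most of the time the response body is blank. This really confuses me, and I would greatly appreciate any help with this issue. Thanks!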
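The following code (with a valid User-Agent) works reliably for me.
However, because of rate limiting, running it many times within a short period may result in HTTP 429 Too Many Requests.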
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}
links = [
    "https://www.redfin.com/ID/Meridian/3642-N-Hollymount-Way-83646/home/106711385",
    "https://www.redfin.com/ID/Meridian/1506-N-Penrith-Pl-83642/home/106700395",
    "https://www.redfin.com/ID/Nampa/34-N-Middleton-Rd-83651/home/117266789",
    "https://www.redfin.com/OR/The-Dalles/1308-Harris-St-97058/home/53055510",
]

with Session() as session:
    for link in links:
        try:
            with session.get(link, headers=HEADERS) as response:
                response.raise_for_status()
                print(response.text)
        except Exception as e:
            print(e)
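
If the 429 responses (or the blank 200 bodies from the question) keep showing up, one option is to retry with a growing pause between attempts. The following is only a minimal sketch of that idea, not something the site documents: the helper name `fetch_with_retry`, its parameters, and the default delays are all assumptions made for illustration. It expects a session object like the `Session` above and honors a numeric `Retry-After` header when the server sends one.

```python
import time


def fetch_with_retry(session, url, headers, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Return the page HTML, retrying on HTTP 429 or on an empty body.

    Hypothetical helper: the name, parameters, and defaults are illustrative.
    """
    for attempt in range(max_attempts):
        response = session.get(url, headers=headers, timeout=10)
        if response.status_code == 429:
            # Use Retry-After when it is a plain number of seconds;
            # otherwise fall back to exponential backoff.
            retry_after = response.headers.get("Retry-After", "")
            delay = float(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
        else:
            # Any other 4xx/5xx status is raised immediately, not retried.
            response.raise_for_status()
            if response.text.strip():
                return response.text
            # Status 200 with a blank body: treat it as a soft failure and retry.
            delay = base_delay * 2 ** attempt
        if attempt < max_attempts - 1:
            sleep(delay)
    raise RuntimeError(f"No usable response from {url} after {max_attempts} attempts")
```

Inside the `with Session() as session:` loop above, the call would look like `html = fetch_with_retry(session, link, HEADERS)`. Whether retrying actually cures the blank responses depends on why the site returns them, so treat this as a starting point rather than a fix.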