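I'm developing a web scraper for yellowpages.com, and it generally seems to work well. However, while iterating over the pagination of a long query, requests.get(url) will randomly return <Response [503]> or <Response [404]>. Occasionally I get worse exceptions, such as: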
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.yellowpages.com', port=80): Max retries exceeded with url: /search?search_terms=florists&geo_location_terms=FL&page=22 (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10053] An established connection was aborted by the software in your host machine',))
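Using time.sleep() seems to eliminate the 503 errors, but the 404s and the exceptions remain a problem.
I'm trying to figure out how to "catch" the various responses so that I can make a change (wait, change proxy, change user agent) and then try again and/or move on. The pseudocode is something like this: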
if error/exception with requests.get:
    wait and/or change proxy and user agent
    retry requests.get
else:
    pass
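The pseudocode above could be made concrete roughly as follows. This is only a sketch under a few assumptions: `fetch_with_retries` and all of its parameters are hypothetical names, the fetch function is passed in (for example `requests.get`) so the helper can be tried without touching the network, and only 503 is retried by default. `requests.exceptions.RequestException` derives from `IOError`/`OSError`, so the default `retry_on=(OSError,)` also covers the `ConnectionError` shown earlier.

```python
import random
import time


def fetch_with_retries(get, url, max_attempts=4, base_delay=1.0,
                       retry_statuses=(503,), retry_on=(OSError,),
                       sleep=time.sleep):
    """Call get(url) until it returns a response that is not retryable.

    Returns the response, or None if every attempt failed.
    All names here are illustrative, not part of requests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = get(url)
        except retry_on as exc:
            # Connection-level failure: there is no response object at all.
            print(f"attempt {attempt}: {exc!r}")
        else:
            if response.status_code not in retry_statuses:
                return response  # success, or an error not worth retrying
            print(f"attempt {attempt}: HTTP {response.status_code}")
        if attempt < max_attempts:
            # Exponential backoff plus a little jitter: about 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))
    return None
```

In the scraper this would be called as `r = fetch_with_retries(requests.get, page_url)`, with a check for `r is None` before parsing. If the 404s really are transient, `retry_statuses=(503, 404)` would retry them as well.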
At this point, I can't even catch the problem using the following:
try:
    r = requests.get(url)
except requests.exceptions.RequestException as e:
    print(e)
    import sys  # only added here, because it's not part of my stable code below
    sys.exit()
My full code so far (also on GitHub) is below:
import requests
from bs4 import BeautifulSoup
import itertools
import csv
# Search criteria
search_terms = ["florists", "pharmacies"]
search_locations = ['CA', 'FL']
# Structure for Data
answer_list = []
csv_columns = ['Name', 'Phone Number', 'Street Address', 'City', 'State', 'Zip Code']
# Turns list of lists into csv file
def write_to_csv(csv_file, csv_columns, answer_list):
    with open(csv_file, 'w') as csvfile:
        writer = csv.writer(csvfile, lineterminator='\n')
        writer.writerow(csv_columns)
        writer.writerows(answer_list)
# Creates url from search criteria and current page
def url(search_term, location, page_number):
    template = 'http://www.yellowpages.com/search?search_terms={search_term}&geo_location_terms={location}&page={page_number}'
    return template.format(search_term=search_term, location=location, page_number=page_number)
# Finds all the contact information for a record
def find_contact_info(record):
    holder_list = []
    name = record.find(attrs={'class': 'business-name'})
    holder_list.append(name.text if name is not None else "")
    phone_number = record.find(attrs={'class': 'phones phone primary'})
    holder_list.append(phone_number.text if phone_number is not None else "")
    street_address = record.find(attrs={'class': 'street-address'})
    holder_list.append(street_address.text if street_address is not None else "")
    city = record.find(attrs={'class': 'locality'})
    holder_list.append(city.text if city is not None else "")
    state = record.find(attrs={'itemprop': 'addressRegion'})
    holder_list.append(state.text if state is not None else "")
    zip_code = record.find(attrs={'itemprop': 'postalCode'})
    holder_list.append(zip_code.text if zip_code is not None else "")
    return holder_list
# Main program
def main():
    for search_term, search_location in itertools.product(search_terms, search_locations):
        i = 0
        while True:
            i += 1
            # Use page_url/results so the url() and main() functions are not shadowed
            page_url = url(search_term, search_location, i)
            r = requests.get(page_url)
            soup = BeautifulSoup(r.text, "html.parser")
            results = soup.find(attrs={'class': 'search-results organic'})
            page_nav = soup.find(attrs={'class': 'pagination'})
            records = results.find_all(attrs={'class': 'info'})
            for record in records:
                answer_list.append(find_contact_info(record))
            if not page_nav.find(attrs={'class': 'next ajax-page'}):
                csv_file = "YP_" + search_term + "_" + search_location + ".csv"
                write_to_csv(csv_file, csv_columns, answer_list)  # output data to csv file
                break

if __name__ == '__main__':
    main()
Thanks in advance for taking the time to read this long post/reply :)
What about something like this?
try:
    req = ..
    if req.status_code == 503:
        pass
    elif ..:
        pass
    else:
        pass  # do something when the request succeeds
except requests.exceptions.ConnectionError:  # not the built-in ConnectionError
    pass
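One detail worth spelling out: requests.get returns a normal Response object for 503 and 404 and only raises for connection-level failures (unless raise_for_status() is called), which is why a bare try/except never sees those statuses. The status-code branches above could be collected into a small helper like the one below. This is a sketch, not a recommendation: `next_action` and the action names are made up for illustration, and treating 404 as "skip" is an assumption about how this site behaves.

```python
def next_action(status_code):
    """Map an HTTP status code to what the scraping loop should do next."""
    if status_code == 200:
        return "parse"
    if status_code in (429, 500, 502, 503, 504):
        return "retry"  # usually transient: wait, then try again
    if status_code == 404:
        return "skip"   # assumption: treat as a missing page and move on
    return "abort"      # anything unexpected: stop and inspect by hand
```

The loop would then branch on `next_action(r.status_code)` instead of repeating the comparisons inline.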
You can try this:
try:
    pass  # do something
except requests.exceptions.ConnectionError as exception:
    pass  # handle the NewConnectionError exception
except Exception as exception:
    pass  # handle any other exception
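As for what those handlers might actually do, the question mentions switching proxy and user agent before retrying. A minimal sketch of that, assuming placeholder values throughout: the user-agent strings, the proxy address, and the names `USER_AGENTS`, `PROXIES` and `request_options` are all invented for illustration. The generator yields keyword arguments that requests.get accepts (`headers`, `proxies`, `timeout`).

```python
import itertools

# Placeholder values: replace with real user-agent strings and proxies.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]
PROXIES = [None, "http://proxy1.example.com:8080"]  # None means no proxy


def request_options():
    """Yield an endless stream of keyword arguments for requests.get."""
    pairs = itertools.cycle(itertools.product(PROXIES, USER_AGENTS))
    for proxy, agent in pairs:
        options = {"headers": {"User-Agent": agent}, "timeout": 10}
        if proxy is not None:
            options["proxies"] = {"http": proxy, "https": proxy}
        yield options
```

Create the generator once with `options = request_options()`, then inside an except handler retry with `requests.get(page_url, **next(options))` so each new attempt uses the next proxy and user-agent combination.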