Here is my scrap.py code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
website = "https://houston.craigslist.org/search/cta"
uClient = uReq(website)
page_html = uClient.read()
uClient.close()
soup_html = soup(page_html, "html.parser")
result_html = soup_html.findAll("p", {"class":"result-info"})
filename = "products.csv"
f = open(filename, "w", encoding='utf8')
headers = "car_name, price\n"
f.write(headers)
for container in result_html:
    carname = container.a.text
    price_container = container.findAll('span', {'class':'result-price'})
    price = price_container[0].text
    f.write(carname + "," + price + "\n")
f.close()
It works fine in the terminal, but when I run it in a loop it gives the following error:
Traceback (most recent call last):
  File "scrap.py", line 23, in <module>
    price = price_container[0].text.splitlines()
IndexError: list index out of range
Please help. Thanks.
Try the one below. It fetches all the items and prices, and handles the IndexError if one occurs.
from bs4 import BeautifulSoup
from urllib.request import urlopen
response = urlopen("https://houston.craigslist.org/search/cta")
soup_html = BeautifulSoup(response.read(), "html.parser")
for container in soup_html.find_all("p", {"class":"result-info"}):
    carname = container.find_all("a")[0].text
    try:
        price = container.find_all('span', {'class':'result-price'})[0].text
    except IndexError:
        price = ""
    print(carname,price)
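I tried to shorten your code to make it look better.
This happens because some cars have no price, for example this one. If there is no price, you can set the price to unknown: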
price_container = container.findAll('span', {'class':'result-price'})
if len(price_container) > 0:
    price = price_container[0].text
else:
    price = 'unknown'
Or you can skip the ones with no price, so they won't be written to the file:
price_container = container.findAll('span', {'class':'result-price'})
if len(price_container) == 0:
    continue
price = price_container[0].text
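Both options above can also be written with find(), which returns None when nothing matches instead of an empty list. Here is a minimal sketch of that idea. SAMPLE_HTML and extract_listings are made up for illustration: the markup only imitates the class names from the question, and the real page may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the class names used in the question;
# the second listing deliberately has no price.
SAMPLE_HTML = """
<p class="result-info"><a href="#">2012 Honda Civic</a><span class="result-price">$4500</span></p>
<p class="result-info"><a href="#">2005 Ford F-150</a></p>
"""

def extract_listings(html, default_price="unknown"):
    """Return (car_name, price) pairs, using default_price when no price exists."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for container in soup.find_all("p", {"class": "result-info"}):
        carname = container.a.text
        # find() returns None when nothing matches, so no IndexError can occur.
        price_tag = container.find("span", {"class": "result-price"})
        price = price_tag.text if price_tag is not None else default_price
        listings.append((carname, price))
    return listings

print(extract_listings(SAMPLE_HTML))
```

To skip unpriced listings instead, the None check could be followed by continue rather than falling back to a default value.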
How can I sort it by price?
results = []
for container in result_html:
    carname = container.a.text
    price_container = container.findAll('span', {'class':'result-price'})
    if len(price_container) == 0:
        continue
    price = price_container[0].text.strip('$')
    results.append((int(price), carname))
for price, carname in sorted(results):
    f.write("{}, {}\n".format(carname, price))
f.close()
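One caveat with the snippet above: int() raises ValueError if a price contains a thousands separator such as "$1,500", and a car name containing a comma would break the hand-built CSV line. The following is a rough sketch of one way to handle both, assuming prices are whole dollar amounts. parse_price, write_sorted and the sample rows are made up for illustration and are not part of the original answer.

```python
import csv
import io

def parse_price(text):
    """Convert a price string such as '$1,500' to an int (hypothetical helper)."""
    return int(text.strip().lstrip("$").replace(",", ""))

def write_sorted(rows, out):
    """Write (car_name, price_text) rows to out as CSV, cheapest first."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["car_name", "price"])
    # Sort on the numeric price only, so ties keep their original order.
    for carname, price_text in sorted(rows, key=lambda row: parse_price(row[1])):
        writer.writerow([carname, parse_price(price_text)])

# Made-up sample data standing in for the scraped results.
sample_rows = [("Civic, low miles", "$4,500"), ("F-150", "$12000"), ("Corolla", "$900")]
buffer = io.StringIO()
write_sorted(sample_rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a StringIO buffer, open it with newline="" as the csv module documentation recommends, for example open("products.csv", "w", encoding="utf8", newline="").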