Hello, I can already scrape a table from one specific website and export it, but I'd like to add more websites to scrape. Right now my script only returns the table from the second URL I entered. Apologies in advance, as I'm new to Python. Thank you.
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']

for url in urls:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        "Upgrade-Insecure-Requests": "1",
        "DNT": "1",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate"
    }
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')[3]
    df = pd.read_html(str(table))[0]

print(df)
Well, the problem here is that you loop over the tables without appending anything; then you only print once the loop has finished, so you see nothing but the last result.

Example:
for item in range(1, 4):
    pass

print(item)
Now the output is:
3
because that is the last element returned by the loop.
But if we append inside the loop, like this:
result = []
for item in range(1, 4):
    result.append(item)

print(result)
then we get the following:
[1, 2, 3]
Now, on to the next point: you can read the table directly with pandas.read_html, since urllib3 is already used under the hood by pandas, like this:
import pandas as pd
df = pd.read_html(
    "http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650")[3]
print(df)
However, the website is configured to close the TCP connection after each response, i.e. Connection: close. For reference, HTTP/1.1 defines the "close" connection option for the sender to signal that the connection will be closed after completion of the response, e.g.:

Connection: close
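As a quick sanity check (a minimal sketch, not part of the original solution; it assumes the first URL from the question is still reachable), you can inspect the response headers to see whether the server really sends this option:

import requests

url = "http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
# Print the Connection response header, if the server sent one
print(r.headers.get("Connection"))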
So we will run the program with the requests library, keeping a requests.Session() object alive so the requests are not blocked by the server's firewall, append the table from each url, concatenate the tables with pd.concat, and finally convert the result to csv using pd.to_csv():
import pandas as pd
import requests

urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(urls):
    goal = []
    with requests.Session() as req:
        for url in urls:
            r = req.get(url, headers=headers)
            # the fourth table on each page holds the data we want
            df = pd.read_html(r.content)[3]
            goal.append(df)
        goal = pd.concat(goal)
        goal.to_csv("data.csv", index=False)


main(urls)
Output: View Online
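To double-check the export (a small sketch, assuming data.csv was written to the working directory by the script above):

import pandas as pd

# Load the concatenated tables back from the CSV that main(urls) produced
check = pd.read_csv("data.csv")
print(check.shape)   # total rows and columns across both scraped tables
print(check.head())  # first few rows for a quick sanity check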