Scraping tables from multiple URLs

Hello, I can already scrape a table from one specific website and export it, but I would like to add more websites to scrape. As written, it only returns the table from the second URL I entered. Apologies in advance, I am new to Python. Thank you.

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']

for url in urls:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'Upgrade-Insecure-Requests': '1',
        'DNT': '1',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate'
    }
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')[3]
    df = pd.read_html(str(table))[0]  # df is reassigned on every iteration

print(df)  # so only the table from the last URL survives the loop
1 Answer

Well, the problem here is that you loop over the tables without appending them, and then you print outside the loop.

Example:

for item in range(1, 4):
    pass

print(item)

Now the output is:

3

because it is the last element returned by the loop.

But if we append instead, like this:

result = []
for item in range(1, 4):
    result.append(item)

print(result)

Then we will get the following:

[1, 2, 3]

Now, moving to the next point: you can let pandas.read_html fetch and read the table directly from the URL, since urllib3 already works under the hood of pandas, like this:

import pandas as pd

df = pd.read_html(
    "http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650")[3]

print(df)
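
As a side note, not part of the original answer: rather than hard-coding table index [3], pandas.read_html can select tables by their content through its match parameter, which is less brittle if the page layout changes. The keyword below is hypothetical; pass text that actually appears in the target table.

import pandas as pd

url = 'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650'
# 'Required' is an assumed keyword; read_html returns only the tables whose
# text matches it, so [0] picks the first such table.
df = pd.read_html(url, match='Required')[0]
print(df)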

However, the website closes the TCP connection after each response (Connection: close; see RFC 2616, section 14.10):

HTTP/1.1 defines the "close" connection option for the sender to signal that the connection will be closed after completion of the response. For example,

   Connection: close
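
As a quick check (not part of the original answer), you can print the response's Connection header yourself; whether it actually reads close depends on the server configuration:

import requests

# Fetch one of the pages and inspect the Connection response header.
r = requests.get(
    'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
    headers={'User-Agent': 'Mozilla/5.0'})
print(r.headers.get('Connection'))  # expected to print 'close' per the note above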

So we will run the program with the requests library, keeping a requests.Session() object alive so that we are not blocked by the server's firewall, append the table from each url to a list, concatenate the tables with pd.concat, and finally write them to csv with pd.to_csv():

import pandas as pd
import requests

urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(urls):
    goal = []
    with requests.Session() as req:  # one Session reused for every request
        for url in urls:
            r = req.get(url, headers=headers)
            df = pd.read_html(r.content)[3]  # the target table is the fourth on the page
            goal.append(df)                  # collect each table instead of overwriting
    goal = pd.concat(goal)                   # stack all collected tables into one DataFrame
    goal.to_csv("data.csv", index=False)


main(urls)

Output: the two tables concatenated into data.csv (screenshot omitted).
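
If more URLs are added later, a slightly more defensive variant can help; this is a sketch, not the answer's exact code, and it assumes the target table is still at index 3 on every page:

import pandas as pd
import requests

def main_safe(urls):
    # Hypothetical variant of main() above; a minimal User-Agent stands in
    # for the full headers dict defined earlier.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}
    goal = []
    with requests.Session() as req:
        for url in urls:
            try:
                r = req.get(url, headers=headers, timeout=30)
                r.raise_for_status()  # surface HTTP errors instead of parsing error pages
                goal.append(pd.read_html(r.content)[3])
            except (requests.RequestException, ValueError, IndexError) as exc:
                # pd.read_html raises ValueError when a page contains no tables
                print(f"skipping {url}: {exc}")
    if goal:
        pd.concat(goal).to_csv("data.csv", index=False)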
