无法使用列表理解创建类似的列标题,就像 pandas 对特定表所做的那样

问题描述 投票:0回答:1

我正在尝试使用列表理解从网页中抓取表格的标题。我面临的问题是,当我使用 pandas 创建相同的标题时,外观有很大不同。只是通知您,

the headers created by padas address all columns

from bs4 import BeautifulSoup
import pandas as pd
import requests

link = 'https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/TableMatchingCriteria?GeographyType=SurveyZone&GeographyId=011002&CategoryLevel1=Primary%20Rental%20Market&CategoryLevel2=Vacancy%20Rate%20%28%25%29&ColumnField=2&RowField=TIMESERIES'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,"html.parser")

col_headers = [item.text.replace("\xa0","Unnamed: 0") for i in soup.select("table.CawdDataTable thead tr") for item in i.select("th, td")]
print(col_headers)

selector = soup.select_one("table.CawdDataTable")
df = pd.read_html(str(selector))[0]
print(list(df.columns))

输出:第一个是由列表理解创建的,第二个是由 pandas 创建的。

['Unnamed: 0', 'Bachelor', '1 Bedroom', '2 Bedroom', '3 Bedroom +', 'Total']
['Unnamed: 0', 'Bachelor', 'Bachelor.1', '1 Bedroom', '1 Bedroom.1', '2 Bedroom', '2 Bedroom.1', '3 Bedroom +', '3 Bedroom +.1', 'Total', 'Total.1']

我可以像 pandas 一样使用列表理解创建类似的列标题吗?

python pandas web-scraping beautifulsoup python-requests
1个回答
0
投票

IIUC 你可以做:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

link = "https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/TableMatchingCriteria?GeographyType=SurveyZone&GeographyId=011002&CategoryLevel1=Primary%20Rental%20Market&CategoryLevel2=Vacancy%20Rate%20%28%25%29&ColumnField=2&RowField=TIMESERIES"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")

table = soup.table

# remove colspan="2" from the <th>
for th in table.select("th[colspan]"):
    del th.attrs["colspan"]

# join every two <td>
for tr in table.select("tbody tr"):
    tds = tr.select("td")

    for i in range(0, len(tds), 2):
        td1, td2 = tds[i], tds[i + 1]
        td1.insert_before(
            BeautifulSoup(f"<td>{td1.text} {td2.text}</td>", "html.parser")
        )
        td1.extract()
        td2.extract()

df = pd.read_html(StringIO(str(table)))[0]
df = df.set_index(df.columns[0]).rename_axis(index=None)
print(df)

打印:

             Bachelor 1 Bedroom 2 Bedroom 3 Bedroom +   Total
2010 October   13.3 a     9.6 a     6.2 a          **  11.0 a
2011 October   13.3 a    14.9 a     8.7 a          **  11.1 a
2012 October   13.3 a     8.0 a     6.4 a          **   7.9 a
2013 October    6.7 a     5.1 b     4.7 a          **   4.0 a
2014 October    0.0 a     5.8 a     4.2 a          **   3.8 a
2015 October       **     4.0 b     2.7 b          **   2.6 a
2016 October    0.0 a     1.8 b     0.8 a          **   1.4 a
2017 October    0.0 a     1.5 a     0.0 d          **   0.5 a
2018 October       **     3.4 d     0.6 a          **   1.6 b
2019 October    0.0 a     0.0 c     0.5 a          **   0.2 a
2020 October       **     1.4 d     0.7 a          **   0.8 a
2021 October       **        **     2.5 b          **   2.0 b
2022 October       **     5.0 c    10.8 d          **   7.3 a
2023 October       **        **        **          **      **
© www.soinside.com 2019 - 2024. All rights reserved.