How to extract a table from Wikipedia using BeautifulSoup and pandas


I am trying to extract a table from a Wikipedia page and display it in a pandas DataFrame. Here is my code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find("table", {"class": "wikitable"})

headers = [header.text.strip() for header in table.find_all("th")]

df = pd.DataFrame(columns=headers)

rows = table.find_all("tr")
for row in rows:
    cells = row.find_all("td")
    cells = [cell.text.strip() for cell in cells]
    if len(cells) == len(headers):
        df = df.append(pd.Series(cells, index=headers), ignore_index=True)

print(df)


The DataFrame is empty. Can someone help me understand what I am doing wrong and how to fix it?
python pandas web-scraping beautifulsoup
1 Answer

Option 1: pd.read_html

pd.read_html does all the heavy lifting for you:

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"

# `list` of all tables, the one you need is at index `0`
df = pd.read_html(url)[0]

# columns will come out a bit funny, like `('Rank', 'Rank')`, adjust by
# setting level `two` to empty string if same as level `one`:
columns_tuples = [(one, '') if one == two else (one, two) for one, two in df.columns]

# correct MultiIndex columns
df.columns = pd.MultiIndex.from_tuples(columns_tuples)

Output:

df.head(2)

  Rank          Name     Industry      Revenue       Profit Employees  \
                                  USD millions USD millions             
0    1       Walmart       Retail     $611,289      $11,680   2100000   
1    2  Saudi Aramco  Oil and gas     $603,651     $159,069     70496   

  Headquarters[note 1] State-owned Ref. Revenue per worker  
                                                            
0        United States         NaN  [1]        $291,090.00  
1         Saudi Arabia         NaN  [4]      $8,562,911.37  
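If the two-level columns are awkward to work with, one possible follow-up is to flatten them into single strings. The sketch below is not part of the original answer: it uses a small hand-built frame (`sample`, a hypothetical stand-in with values copied from the output above) rather than fetching the page, and the `"Revenue (USD millions)"` naming scheme is just one assumed convention.

```python
import pandas as pd

# Hand-built stand-in with the same two-level column shape as the output above
# (values copied from that output; nothing is fetched from Wikipedia here)
columns = pd.MultiIndex.from_tuples(
    [("Rank", ""), ("Name", ""), ("Revenue", "USD millions"), ("Profit", "USD millions")]
)
sample = pd.DataFrame(
    [
        [1, "Walmart", "$611,289", "$11,680"],
        [2, "Saudi Aramco", "$603,651", "$159,069"],
    ],
    columns=columns,
)

# Join both levels into one label, skipping the second level when it is empty
flat = sample.copy()
flat.columns = [f"{one} ({two})" if two else one for one, two in sample.columns]

print(flat.columns.tolist())
# ['Rank', 'Name', 'Revenue (USD millions)', 'Profit (USD millions)']
```

With flat labels, a column can be selected with a plain string such as `flat["Revenue (USD millions)"]` instead of a tuple.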

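Option 2: extract from soup

The table's headers are in row 0, and the second-level headers for "Revenue" and "Profit" are in row 1. Note that table.find_all('th') gives you a whole bunch of elements (61), because the Rank values are also th elements. As a result, len(cells) == len(headers) is never True, no row is ever appended, and the DataFrame stays empty.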
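The mismatch can be reproduced on a tiny table. The HTML below is a made-up three-column stand-in for the Wikipedia markup (an assumption, not the real page), chosen only so that the rank cells are `th` elements as described above.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the table: the rank in each data row is a <th>, not a <td>
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><th>1</th><td>Walmart</td><td>$611,289</td></tr>
  <tr><th>2</th><td>Saudi Aramco</td><td>$603,651</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", {"class": "wikitable"})

all_th = table.find_all("th")                    # header cells plus the rank cells
header_th = table.find("tr").find_all("th")      # header cells of the first row only
td_cells = table.find_all("tr")[1].find_all("td")  # data cells of the first data row

print(len(all_th), len(header_th), len(td_cells))
# 5 3 2
```

Even when the headers are taken from the first row only, a data row has one `td` fewer than there are headers, because its rank is a `th`. That is why the adjusted code handles "Rank" separately.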

Here is a suggested adjustment:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find("table", {"class": "wikitable"})

rows = table.find_all("tr")

# get `headers` from `rows[0]`
headers = [header.text.strip() for header in rows[0].find_all('th')]

# `rows[1]` will have `USD millions`, but `soup` does not preserve the logic needed
# to add this secondary level for the correct columns automatically

# print(headers)
# ['Rank', 'Name', 'Industry', 'Revenue', 'Profit', 'Employees', 
# 'Headquarters[note 1]', 'State-owned', 'Ref.', 'Revenue per worker']
    
# create `data`, loop through `rows[2:]` and append each row (`list`) to `data`
data = []

for row in rows[2:]:
    data.append(
        [cell.text.strip() for cell in row.find_all("td")]
        )
# `df.append` was deprecated in pandas 1.4 and removed in 2.0, and it was never a good
# idea to keep appending to a `df`: each call creates an intermediate copy.
# Instead construct `df` only after the loop, once `data` is complete

df = pd.DataFrame(data, columns=headers[1:], 
                  index=pd.Index(range(1, len(data) +1), name=headers[0])
                  )
# Note that we are reconstructing "Rank" as the index (the rank numbers are `th` elems!)

Output:

df.head(2)

              Name     Industry   Revenue    Profit  Employees  \
Rank                                                             
1          Walmart       Retail  $611,289   $11,680  2,100,000   
2     Saudi Aramco  Oil and gas  $603,651  $159,069     70,496   

     Headquarters[note 1] State-owned Ref. Revenue per worker  
Rank                                                           
1           United States              [1]        $291,090.00  
2            Saudi Arabia              [4]      $8,562,911.37  
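Every value scraped this way is a string, so a likely next step is converting the money and count columns to numbers. This is a minimal sketch, not part of the original answer: `sample` is a hand-built stand-in with values copied from the output above, and `to_number` is a hypothetical helper name.

```python
import pandas as pd

# Hand-built stand-in for a few columns of the scraped frame (all values are strings)
sample = pd.DataFrame(
    {
        "Name": ["Walmart", "Saudi Aramco"],
        "Revenue": ["$611,289", "$603,651"],
        "Employees": ["2,100,000", "70,496"],
    },
    index=pd.Index([1, 2], name="Rank"),
)


def to_number(column: pd.Series) -> pd.Series:
    """Strip `$` and thousands separators, then parse the remainder as numbers."""
    cleaned = column.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")


for name in ["Revenue", "Employees"]:
    sample[name] = to_number(sample[name])

print(sample)
```

Using `errors="coerce"` is a design choice for messy scraped data: cells that do not parse become `NaN` and can be inspected afterwards rather than stopping the whole conversion.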