I am trying to extract a table from a Wikipedia page and display it in a pandas DataFrame. Here is my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find("table", {"class": "wikitable"})
headers = [header.text.strip() for header in table.find_all("th")]
df = pd.DataFrame(columns=headers)
rows = table.find_all("tr")
for row in rows:
    cells = row.find_all("td")
    cells = [cell.text.strip() for cell in cells]
    if len(cells) == len(headers):
        df = df.append(pd.Series(cells, index=headers), ignore_index=True)
print(df)
The DataFrame is empty. Can someone help me understand what I am doing wrong and how to fix it?
pd.read_html
pd.read_html does all the heavy lifting for you:
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
# `list` of all tables, the one you need is at index `0`
df = pd.read_html(url)[0]
# columns will come out a bit funny, like `('Rank', 'Rank')`, adjust by
# setting level `two` to empty string if same as level `one`:
columns_tuples = [(one, '') if one == two else (one, two) for one, two in df.columns]
# correct MultiIndex columns
df.columns = pd.MultiIndex.from_tuples(columns_tuples)
Output:
df.head(2)
   Rank          Name     Industry       Revenue        Profit  Employees  \
                                    USD millions  USD millions
0     1       Walmart       Retail      $611,289       $11,680    2100000
1     2  Saudi Aramco  Oil and gas      $603,651      $159,069      70496

   Headquarters[note 1]  State-owned  Ref.  Revenue per worker
0         United States          NaN   [1]         $291,090.00
1          Saudi Arabia          NaN   [4]       $8,562,911.37
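The column clean-up above can be tried without fetching the page. The following is a minimal sketch on a hand-built frame; `raw_columns` and `sample` are hypothetical stand-ins for what pd.read_html returns, not data taken from the real table.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_html: a two-level
# header in which most columns simply repeat their own name on level two.
raw_columns = pd.MultiIndex.from_tuples([
    ("Rank", "Rank"),
    ("Name", "Name"),
    ("Revenue", "USD millions"),
])
sample = pd.DataFrame([[1, "Walmart", "$611,289"]], columns=raw_columns)

# Blank out the second level wherever it duplicates the first level.
cleaned = [(one, "") if one == two else (one, two) for one, two in sample.columns]
sample.columns = pd.MultiIndex.from_tuples(cleaned)

print(sample.columns.tolist())
```

Only the columns with a genuine second level, here `Revenue`, keep it; the repeated names collapse to an empty string.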
soup
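The table headers are in row 0, and the second-level entries for "Revenue" and "Profit" are in row 1. Note that table.find_all('th') gives you a whole bunch of elements (61), because the Rank values are also th elements. As a result, len(cells) == len(headers) is never True, so no row is ever appended.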
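The mismatch is easy to see on a small sample. This is a sketch only: `SAMPLE_HTML` is a hypothetical miniature that mimics the structure described here (two header rows, rank stored in a th element), not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table: two header rows, and the rank in
# each body row is a <th>, not a <td>.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><th>USD millions</th></tr>
  <tr><th>1</th><td>Walmart</td><td>$611,289</td></tr>
  <tr><th>2</th><td>Saudi Aramco</td><td>$603,651</td></tr>
</table>
"""

table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table", {"class": "wikitable"})
rows = table.find_all("tr")

all_th = table.find_all("th")       # every <th> in the table, body rows included
header_th = rows[0].find_all("th")  # only the <th> cells of the first header row
td_counts = [len(row.find_all("td")) for row in rows]

print(len(all_th), len(header_th), td_counts)
```

No row has as many td cells as there are th elements in the whole table, so the equality check in the question can never pass. Even against the first header row alone, each body row is one td short, because the rank sits in a th.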
Here is a suggested adjustment:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find("table", {"class": "wikitable"})
rows = table.find_all("tr")
# get `headers` from `rows[0]`
headers = [header.text.strip() for header in rows[0].find_all('th')]
# `rows[1]` will have `USD millions`, but `soup` does not preserve the logic needed
# to add this secondary level for the correct columns automatically
# print(headers)
# ['Rank', 'Name', 'Industry', 'Revenue', 'Profit', 'Employees',
# 'Headquarters[note 1]', 'State-owned', 'Ref.', 'Revenue per worker']
# create `data`, loop through `rows[2:]` and append each row (`list`) to `data`
data = []
for row in rows[2:]:
    data.append(
        [cell.text.strip() for cell in row.find_all("td")]
    )
# `df.append` is deprecated, but it was never a good idea to keep appending to a `df`;
# it creates intermediate copies. Instead construct `df` only after the loop
# with `data` complete
df = pd.DataFrame(
    data,
    columns=headers[1:],
    index=pd.Index(range(1, len(data) + 1), name=headers[0]),
)
# Note that we are reconstructing "Rank" as the index (the rank numbers are `th` elems!)
Output:
df.head(2)
              Name     Industry   Revenue    Profit  Employees  \
Rank
1          Walmart       Retail  $611,289   $11,680  2,100,000
2     Saudi Aramco  Oil and gas  $603,651  $159,069     70,496

      Headquarters[note 1]  State-owned  Ref.  Revenue per worker
Rank
1            United States                [1]         $291,090.00
2             Saudi Arabia                [4]       $8,562,911.37
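If you would rather keep Rank as an ordinary column than rebuild it as the index, one possible variation is to collect th and td cells together for each body row. This is a sketch under assumptions: `SAMPLE_HTML` is a hypothetical miniature of the table, and on the real page any cells using rowspan or colspan would still need separate handling.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the table; the real page has more columns.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><th>USD millions</th></tr>
  <tr><th>1</th><td>Walmart</td><td>$611,289</td></tr>
  <tr><th>2</th><td>Saudi Aramco</td><td>$603,651</td></tr>
</table>
"""

table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table", {"class": "wikitable"})
rows = table.find_all("tr")
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

data = []
for row in rows[2:]:  # skip the two header rows
    # Passing a list to find_all matches both tag names, in document order,
    # so the rank <th> comes first, followed by the <td> cells.
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    if len(cells) == len(headers):
        data.append(cells)

# Build the frame once, after the loop, instead of appending row by row.
df_sample = pd.DataFrame(data, columns=headers)
print(df_sample)
```

All values come back as strings, so numeric columns would still need converting before any arithmetic.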