How to fix ValueError: cannot set a row with mismatched columns | Beautiful Soup

Question

I am getting this error:

ValueError: cannot set a row with mismatched columns

while scraping a table from Wikipedia. See the code below. How can I fix this?

from bs4 import BeautifulSoup
import pandas as pd
import requests
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')
table = soup.find_all('table')[0]

world_companies = soup.find('tr')
df = pd.DataFrame(columns = world_companies)

column_data = table.find_all('tr')
for row in column_data[2:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]    
    length = len(df)
    df.loc[length] = individual_row_data

Answer 1

You don't need Beautiful Soup to scrape a table; pandas can do it on its own:

import pandas as pd

table_MN = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue')
for df in table_MN:
    if "Rank" in df.columns:
        print(df.to_string(index=False))

The table has two header rows, so if you skip the second one, your original code should work.
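If you would rather stay with BeautifulSoup, skipping the second header row might look roughly like the sketch below. The inline HTML is a made-up, simplified stand-in for the Wikipedia table (the real markup has more columns and may differ in detail), so treat this as an illustration under those assumptions rather than a drop-in fix.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, simplified table with two header rows (not the real Wikipedia markup).
HTML = """
<table>
  <tr><th rowspan="2">Rank</th><th rowspan="2">Name</th><th>Revenue</th></tr>
  <tr><th>USD millions</th></tr>
  <tr><td>1</td><td>Walmart</td><td>$648,125</td></tr>
  <tr><td>2</td><td>Amazon</td><td>$574,785</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
rows = soup.find("table").find_all("tr")

# Take the column names from the <th> cells of the first header row only.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Skip both header rows and keep only rows whose cell count matches the header.
records = []
for row in rows[2:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == len(headers):
        records.append(cells)

# Build the DataFrame once from the collected rows.
df = pd.DataFrame(records, columns=headers)
print(df)
```

Collecting the rows in a list and constructing the DataFrame once also avoids growing it one `df.loc[...]` assignment at a time.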

Answer 2

To fix the problem with BeautifulSoup, try iterating over the elements in each <tr>:

world_companies = [e.get_text(strip=True) for e in soup.find('tr')]
df = pd.DataFrame(columns = world_companies)

column_data = table.find_all('tr')
for row in column_data[2:]:
    individual_row_data = [data.text.strip() for data in row]
    length = len(df)
    df.loc[length] = individual_row_data
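The error itself comes from pandas, not from BeautifulSoup: assigning a list to `df.loc[len(df)]` fails whenever the list length differs from the number of columns. A minimal reproduction, using hand-written rows instead of scraped ones (the column names and sample rows here are only illustrative), plus one possible guard:

```python
import pandas as pd

columns = ["Rank", "Name", "Industry"]
df = pd.DataFrame(columns=columns)

# Three values for three columns: this assignment succeeds.
df.loc[len(df)] = ["1", "Walmart", "Retail"]

# Two values for three columns: pandas raises the ValueError from the question.
try:
    df.loc[len(df)] = ["2", "Amazon"]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# One possible guard: keep only rows whose length matches the header,
# then build the DataFrame in a single step.
scraped_rows = [
    ["1", "Walmart", "Retail"],
    ["USD millions"],            # stray cell from a second header row
    ["2", "Amazon", "Retail"],
]
good_rows = [row for row in scraped_rows if len(row) == len(columns)]
clean_df = pd.DataFrame(good_rows, columns=columns)
print(clean_df)
```

Printing `len(row)` next to `len(columns)` for the first few scraped rows is usually the quickest way to see where the mismatch comes from.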

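I would also recommend using pandas to scrape the table, since you are already loading it as a module, and then fixing the column headers to suit your needs: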

import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue')[0]
df.columns = [
    f"{col[0]}  {col[1]}" if col[0] != col[1] else col[0]
    for col in df.columns.values
]

print(df)

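Rank	Name	Industry	Revenue  USD millions	Profit  USD millions	Employees	Headquarters[note 1]	State-owned	Ref.
1	Walmart	Retail	$648,125	$15,511	2100000	United States	NaN	[1]
2	Amazon	Retail	$574,785	$30,425	1525000	United States	NaN	[4]
...
49	Citigroup	Financials	$156,820	$9,228	237925	United States	NaN	[52]
50	Centene Corporation	Healthcare	$153,999	$2,702	67700	United States	NaN	[53]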
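To see what the header-flattening list comprehension does without hitting the network, here is a small sketch on a hand-built two-level header. It assumes the parsed header repeats the same text on both levels for cells that span both header rows, which is what the `col[0] != col[1]` check above relies on; the three columns and the single data row are just a sample.

```python
import pandas as pd

# Hand-built stand-in for the two-level header produced by parsing the table.
columns = pd.MultiIndex.from_tuples([
    ("Rank", "Rank"),
    ("Name", "Name"),
    ("Revenue", "USD millions"),
])
df = pd.DataFrame([[1, "Walmart", "$648,125"]], columns=columns)

# Join the two levels only when they differ; otherwise keep a single name.
df.columns = [
    top if top == sub else f"{top} {sub}"
    for top, sub in df.columns
]

print(list(df.columns))
```

After flattening, the columns can be selected with plain strings such as `df["Revenue USD millions"]`.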
