Web-scraped data is not fully loaded into the CSV file

Question

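I am trying to scrape this website to get a list of awards. When I look at the resulting CSV file, though, there are several problems: the award ceremony is not written to the CSV, reference markers are written even though I do not want them, some awards are missing the nominee, some are missing the year, and some award names have many quotation marks before and after them.

This is my code: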

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send a request to the website
url = 'https://en.wikipedia.org/wiki/List_of_awards_and_nominations_received_by_Exo'
response = requests.get(url)

# Step 2: Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Find the awards tables
awards_tables = soup.find_all('table', {'class': 'wikitable'})

# Step 4: Extract data
awards = []
for table in awards_tables:
    for row in table.find_all('tr')[1:]:  # skip the header row
        cells = row.find_all('td')
        if len(cells) >= 4:
            event = cells[0].get_text(strip=True)
            award_name = cells[1].get_text(strip=True)
            year = cells[2].get_text(strip=True)
            group_name = cells[3].get_text(strip=True)
            awards.append([event, award_name, year, group_name])

# Step 5: Create a DataFrame
df = pd.DataFrame(awards, columns=['Award Event', 'Award Name', 'Year', 'Group Name'])

# Step 6: Save to CSV
df.to_csv('exo_awards.csv', index=False)

print("Awards saved to exo_awards.csv")

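Expected output example:

The first award ceremony is the American Music Awards, where EXO was nominated in two years (2019 and 2020).

Expected output in the CSV file: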

American Music award, 2019, EXO, favorite social artist, nominated
American Music award, 2020, EXO, favorite social artist, nominated

Answer 1

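Handle references in the text: clean the text to remove any reference markers and extra quotation marks.
Handle missing data: make sure awards that do not show a nominee or a year are accounted for.

Extract the data correctly: make sure the data is extracted and formatted correctly, especially for awards with several years and nominations. Here is a revised version of the script: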

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'https://en.wikipedia.org/wiki/List_of_awards_and_nominations_received_by_Exo'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

awards_tables = soup.find_all('table', {'class': 'wikitable'})

def clean_text(text):
    # Remove references in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # Remove extra quotation marks
    text = text.replace('"', '')
    return text.strip()

awards = []
for table in awards_tables:
    for row in table.find_all('tr')[1:]:  # skip the header row
        cells = row.find_all('td')
        if len(cells) >= 4:
            event = clean_text(cells[0].get_text(strip=True))
            award_name = clean_text(cells[1].get_text(strip=True))
            year = clean_text(cells[2].get_text(strip=True))
            group_name = clean_text(cells[3].get_text(strip=True))
            # Note: the status is hard-coded; the table's result cell is not read here
            awards.append([event, year, group_name, award_name, "Nominated"])

df = pd.DataFrame(awards, columns=['Award Event', 'Year', 'Group Name', 'Award Name', 'Status'])

df.to_csv('exo_awards.csv', index=False)

print("Awards saved to exo_awards.csv")

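Text cleaning function: clean_text removes reference markers (text inside square brackets) and double quotation marks from the text.
Data extraction: the script now passes the content of every data cell through clean_text before storing it.
Data formatting: the script reorders the fields to match the expected output and writes a fixed "Nominated" value in the Status column for every row; it does not read the actual result from the table.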
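As a small illustration of the cleaning step, and of where the runs of quotation marks probably come from: when a field itself contains double quotes, a CSV writer wraps the field in quotes and doubles each inner quote, so a quoted title ends up with three quotes at its start. The sketch below uses only the standard library csv module, whose default quoting pandas' to_csv also follows. The sample cell text is made up for illustration and is not taken from the page.

```python
import csv
import io
import re

def clean_text(text):
    # Same cleaning as in the script above: drop [..] reference markers and double quotes
    text = re.sub(r'\[.*?\]', '', text)
    text = text.replace('"', '')
    return text.strip()

def to_csv_line(fields):
    # Serialize one row with the csv module's default quoting rules
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue()

raw = '"Song Title"[12]'  # hypothetical cell text with quotes and a reference marker

uncleaned_line = to_csv_line([raw, '2019'])
cleaned_line = to_csv_line([clean_text(raw), '2019'])

print(repr(uncleaned_line))  # '"""Song Title""[12]",2019\r\n'
print(repr(cleaned_line))    # 'Song Title,2019\r\n'
```

The extra quotes in the first line are valid CSV escaping rather than corrupted data, so a spreadsheet or pandas.read_csv would read the field back with its original single pair of quotes. Stripping them in clean_text is only needed if the quotes are unwanted in the data itself.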

Expected output: the script should generate a CSV file similar to the expected output:

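Award Event,Year,Group Name,Award Name,Status
American Music Awards,2019,EXO,Favorite Social Artist,Nominated
American Music Awards,2020,EXO,Favorite Social Artist,Nominated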
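If the ceremony or the year is still missing for some rows, one possible cause is merged cells. This is an assumption, since the page markup was not inspected here: Wikipedia tables often use rowspan so that one ceremony cell covers several rows, and the first column is often a th row header, which find_all('td') skips. Rows after the first then have fewer cells and fail the len(cells) >= 4 check or shift columns. A minimal sketch of carrying merged cells down, using a hypothetical helper expand_rowspans and made-up sample rows in an assumed column order:

```python
def expand_rowspans(rows):
    """Copy vertically merged cells down into the rows they cover.

    rows: list of rows, each a list of (text, rowspan) tuples in HTML order.
    Colspan is not handled in this sketch.
    """
    pending = {}  # column index -> (text, number of later rows still covered)
    grid = []
    for cells in rows:
        remaining = iter(cells)
        out, col = [], 0
        while True:
            if col in pending:
                # This column is covered by a cell merged from an earlier row
                text, left = pending.pop(col)
                if left > 1:
                    pending[col] = (text, left - 1)
            else:
                cell = next(remaining, None)
                if cell is None:
                    break
                text, rowspan = cell
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
            out.append(text)
            col += 1
        grid.append(out)
    return grid

# With BeautifulSoup, each <tr> could be turned into such tuples like this:
# [(c.get_text(strip=True), int(c.get('rowspan', 1))) for c in tr.find_all(['th', 'td'])]

# Hypothetical rows: ceremony, year, category, nominee, result (assumed order)
sample_rows = [
    [("American Music Awards", 2), ("2019", 1), ("Favorite Social Artist", 2),
     ("Exo", 2), ("Nominated", 2)],
    [("2020", 1)],
]

grid = expand_rowspans(sample_rows)
for row in grid:
    print(row)
```

After expansion every row has the same number of columns, so the ceremony, year, and result can be read by position, and the result column can replace the hard-coded "Nominated" value.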
