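I'm currently learning the basics of data analysis with Python in Colab, using my IMDb watchlist as the dataset.
A single cell in the Genres column can hold several genres for one film (which makes things harder). I'm trying to compute the proportion of each genre in this dataset and then plot it as a pie chart or perhaps a bar chart.
So I created one variable per genre that stores the value_counts() of True or False, like this: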
action = df['Genres'].str.contains('Action').value_counts()
animation = df['Genres'].str.contains('Animation').value_counts()
biography = df['Genres'].str.contains('Biography').value_counts()
comedy = df['Genres'].str.contains('Comedy').value_counts()
crime = df['Genres'].str.contains('Crime').value_counts()
drama = df['Genres'].str.contains('Drama').value_counts()
documentary = df['Genres'].str.contains('Documentary').value_counts()
family = df['Genres'].str.contains('Family').value_counts()
fantasy = df['Genres'].str.contains('Fantasy').value_counts()
film_noir = df['Genres'].str.contains('Film-Noir').value_counts()
history = df['Genres'].str.contains('History').value_counts()
horror = df['Genres'].str.contains('Horror').value_counts()
mystery = df['Genres'].str.contains('Mystery').value_counts()
music = df['Genres'].str.contains('Music').value_counts()
musical = df['Genres'].str.contains('Musical').value_counts()
romance = df['Genres'].str.contains('Romance').value_counts()
scifi = df['Genres'].str.contains('Sci-Fi').value_counts()
sport = df['Genres'].str.contains('Sport').value_counts()
thriller = df['Genres'].str.contains('Thriller').value_counts()
war = df['Genres'].str.contains('War').value_counts()
western = df['Genres'].str.contains('Western').value_counts()
Then I put these variables into a DataFrame:
genres = pd.DataFrame(
    [action, animation, biography,
     comedy, crime, drama,
     documentary, family, fantasy,
     film_noir, history, horror,
     mystery, music, musical,
     romance, scifi, sport,
     thriller, war, western],
)
genres.head(5)
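The problem is in the output:
I'd like the first column to show the variable names instead of "Genres", which is what it shows now. Is that possible?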
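I think you can achieve this by building the DataFrame from a dictionary in which the keys are the genre names and the values are the corresponding counts. Here is an example: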
import pandas as pd
# Sample DataFrame
data = {'Genres': ['Action, Drama', 'Comedy, Romance', 'Action, Comedy', 'Drama', 'Comedy']}
df = pd.DataFrame(data)
# List of genres
genre_list = ['Action', 'Animation', 'Biography', 'Comedy', 'Crime', 'Drama', 'Documentary', 'Family',
              'Fantasy', 'Film-Noir', 'History', 'Horror', 'Mystery', 'Music', 'Musical', 'Romance',
              'Sci-Fi', 'Sport', 'Thriller', 'War', 'Western']
# Create a dictionary to store genre counts
genre_counts = {}
# Populate the dictionary: count the rows whose Genres string contains each genre
for genre in genre_list:
    genre_counts[genre] = df['Genres'].str.contains(genre).sum()
# Create a DataFrame from the dictionary
genres_df = pd.DataFrame(list(genre_counts.items()), columns=['Genre', 'Count'])
# Display the DataFrame
print(genres_df)
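This code creates a dictionary (genre_counts) whose keys are the genre names and whose values are the number of rows in the Genres column that contain each genre. It then converts the dictionary into a DataFrame (genres_df) and prints it. The resulting DataFrame has "Genre" and "Count" columns, so each row carries its genre name instead of "Genres".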
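If you want to keep the False/True layout from the question, with one row per genre, a minimal sketch along the same lines could look like this. It reuses the sample data above with a shortened genre list; the names rows and genres are only illustrative, and na=False is an assumption that rows with a missing genre should count as False.

```python
import pandas as pd

# Sample data copied from the example above
df = pd.DataFrame({'Genres': ['Action, Drama', 'Comedy, Romance',
                              'Action, Comedy', 'Drama', 'Comedy']})
genre_list = ['Action', 'Comedy', 'Drama', 'Romance']  # shortened for the sketch

rows = {}
for genre in genre_list:
    # Boolean mask: True where the Genres string contains this genre
    mask = df['Genres'].str.contains(genre, regex=False, na=False)
    rows[genre] = {'False': int((~mask).sum()), 'True': int(mask.sum())}

# The outer dictionary keys become the row labels
genres = pd.DataFrame.from_dict(rows, orient='index')
print(genres)
```

Because the row labels come from the dictionary keys, each row is named after its genre.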
A faster approach that avoids the relatively slow for loop:
Assume the following DataFrame:
                       Genres
0              Comedy, Horror
1          Comedy, Drama, War
2  Mistery, Romance, Thriller
Suggested code:
import pandas as pd
# create the original DataFrame
df = pd.DataFrame({'Genres': ['Comedy, Horror', 'Comedy, Drama, War', 'Mistery, Romance, Thriller']})
# split the genres by comma and explode the list into separate rows
# (note: splitting on ',' leaves a leading space on every genre after the first)
df = df.assign(Genres=df['Genres'].str.split(',')).explode('Genres')
# collect the distinct genres (not used below, handy for inspection)
all_genres = df['Genres'].unique()
# counting matrix via crosstab: after the round trip through a dict,
# each row is a genre and each column is an original row index
genre_counts = pd.crosstab(index=df.index, columns=df['Genres'], margins=False).to_dict('index')
genre_counts = pd.DataFrame(genre_counts)
# count the number of 0s and 1s in each row
counts = ( genre_counts.apply(lambda row: [sum(row == 0), sum(row == 1)], axis=1) )
# final count with 2 columns 'False' and 'True'
counts = pd.DataFrame(counts.tolist(), index=counts.index).rename(columns={0:'False', 1:'True'})
print(counts)
Output:
          False  True
Drama         2     1
Horror        2     1
Romance       2     1
Thriller      2     1
War           2     1
Comedy        1     2
Mistery       2     1
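As a possible alternative, here is a sketch using pandas' Series.str.get_dummies, which builds the 0/1 indicator matrix in one call. It assumes the genres are always separated by ', ' as in the sample data; the names dummies, true_counts and share are only illustrative. Since it matches whole labels rather than substrings, 'Music' would not also count films tagged 'Musical', which str.contains('Music') does.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'Genres': ['Comedy, Horror',
                              'Comedy, Drama, War',
                              'Mistery, Romance, Thriller']})

# One 0/1 column per genre, split on ', ' (assumed separator)
dummies = df['Genres'].str.get_dummies(sep=', ')

# Number of films tagged with each genre
true_counts = dummies.sum()

# False/True table with one row per genre
counts = pd.DataFrame({'False': len(df) - true_counts, 'True': true_counts})

# Share of films carrying each genre (does not sum to 1,
# because one film can have several genres)
share = true_counts / len(df)

print(counts)
print(share)
```

With matplotlib installed, share.plot.bar() should draw the bar chart mentioned in the question. A pie chart is harder to read here because the shares overlap.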