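I'm trying to reshape a data frame in Pandas and running into some problems. I've looked at several variations of this question asked here. Most of them involve pivoting and end up discarding some of the existing columns, and I'd like to know whether there is a way around that.
I've created some simple data as an illustration, similar to my real data: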
import pandas as pd
raw_data = {'FirstName': ["John", "Jill", "Jack", "John", "Jill", "Jack",],
'LastName': ["Blue", "Green", "Yellow","Blue", "Green", "Yellow"],
'Building': ["Building1", "Building1", "Building2","Building1", "Building1", "Building2"],
'Month': ["November", "November", "November", "December","December", "December"],
'Sales': [100, 150, 275, 200, 150, 150]}
frame = pd.DataFrame(raw_data, columns =raw_data.keys())
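This produces a data frame that looks like this:
(image: OutputFrame)
What I'd like to do is convert the months into columns while keeping the other data. So something like this:
(image: DesiredFrame)
I've already tried the suggestions from here: Pandas long to wide reshape, by two variables
I tried pivoting on the month: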
frame.pivot(columns = 'Month')
I tried adding more columns to see whether that would clean things up:
frame.pivot(columns = ('FirstName', 'LastName','Month'), values = 'Sales' )
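In both cases I get some strange columns. I'm curious what Pandas is doing here, but I can't make sense of the result.
I suppose I could loop over the rows and rebuild the data by hand, but surely there is a better way?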
In fact, you were almost there with pivot(). Specifying index gets you nearly all the way:
import pandas as pd
raw_data = {'FirstName': ["John", "Jill", "Jack", "John", "Jill", "Jack",],
'LastName': ["Blue", "Green", "Yellow","Blue", "Green", "Yellow"],
'Building': ["Building1", "Building1", "Building2","Building1", "Building1", "Building2"],
'Month': ["November", "November", "November", "December","December", "December"],
'Sales': [100, 150, 275, 200, 150, 150]}
frame = pd.DataFrame(raw_data, columns =raw_data.keys())
df = frame.pivot(
    index=["FirstName", "LastName", "Building"],
    columns="Month",
    values="Sales",
)
df
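To see where the identifying columns end up after this call, it can help to inspect the index and the columns axis of the result. The snippet below is a minimal sketch using the same sample data; the variable names `wide`, `index_names`, `columns_name` and `month_columns` are only illustrative, and it assumes a pandas version recent enough to accept a list for `index`.

```python
import pandas as pd

raw_data = {
    "FirstName": ["John", "Jill", "Jack", "John", "Jill", "Jack"],
    "LastName": ["Blue", "Green", "Yellow", "Blue", "Green", "Yellow"],
    "Building": ["Building1", "Building1", "Building2", "Building1", "Building1", "Building2"],
    "Month": ["November", "November", "November", "December", "December", "December"],
    "Sales": [100, 150, 275, 200, 150, 150],
}
frame = pd.DataFrame(raw_data)

wide = frame.pivot(
    index=["FirstName", "LastName", "Building"],
    columns="Month",
    values="Sales",
)

# The three identifying columns are now levels of a MultiIndex on the rows.
index_names = list(wide.index.names)
# The columns axis carries the name of the pivoted column.
columns_name = wide.columns.name
# One column per distinct month, in sorted (alphabetical) order.
month_columns = list(wide.columns)

print(index_names)
print(columns_name)
print(month_columns)
```

The names, first names and building are still present, just stored in the row index rather than as ordinary columns.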
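The only difference is that your data frame will have a multi-level index. If you want exactly the desired output, you need to move the index levels back into ordinary columns and clear the name left on the columns axis (you can also chain these calls):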
import pandas as pd
raw_data = {'FirstName': ["John", "Jill", "Jack", "John", "Jill", "Jack",],
'LastName': ["Blue", "Green", "Yellow","Blue", "Green", "Yellow"],
'Building': ["Building1", "Building1", "Building2","Building1", "Building1", "Building2"],
'Month': ["November", "November", "November", "December","December", "December"],
'Sales': [100, 150, 275, 200, 150, 150]}
frame = pd.DataFrame(raw_data, columns =raw_data.keys())
df = (
    frame.pivot(
        index=["FirstName", "LastName", "Building"],
        columns="Month",
        values="Sales",
    )
    .reset_index()  # turns the multi-index levels back into ordinary columns
    .rename_axis(None, axis=1)  # clears the leftover "Month" name on the columns axis
)
df
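One detail to be aware of: pivot() sorts the new column labels, so December ends up before November. If the calendar order matters, the columns can be selected explicitly afterwards. This is a minimal sketch; `id_cols` and `month_order` are names introduced here for illustration, and the order in `month_order` is an assumption about what the desired frame looks like.

```python
import pandas as pd

raw_data = {
    "FirstName": ["John", "Jill", "Jack", "John", "Jill", "Jack"],
    "LastName": ["Blue", "Green", "Yellow", "Blue", "Green", "Yellow"],
    "Building": ["Building1", "Building1", "Building2", "Building1", "Building1", "Building2"],
    "Month": ["November", "November", "November", "December", "December", "December"],
    "Sales": [100, 150, 275, 200, 150, 150],
}
frame = pd.DataFrame(raw_data)

id_cols = ["FirstName", "LastName", "Building"]
month_order = ["November", "December"]  # assumed desired column order

wide = (
    frame.pivot(index=id_cols, columns="Month", values="Sales")
    .reset_index()
    .rename_axis(None, axis=1)
)

# Select the columns in the order we want instead of the sorted order.
wide = wide[id_cols + month_order]
print(wide)
```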
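I upvoted Murilo Cunha's answer above.
If you have a larger DataFrame and want a more general way to widen a single column, you can modify Murilo's answer as follows so that the pivot index covers all of the other columns without having to name them individually: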
import pandas as pd
raw_data = {'FirstName': ["John", "Jill", "Jack", "John", "Jill", "Jack",],
'LastName': ["Blue", "Green", "Yellow","Blue", "Green", "Yellow"],
'Building': ["Building1", "Building1", "Building2", "Building1","Building1", "Building2"],
'Month': ["November", "November", "November", "December","December", "December"],
'Sales': [100, 150, 275, 200, 150, 150]}
frame = pd.DataFrame(raw_data, columns =raw_data.keys())
col_to_wide = "Month"
vals = "Sales"
# keep all other columns
keep_cols = [col for col in frame.columns if col not in [col_to_wide, vals]]
df = (
    frame.pivot(
        index=keep_cols,
        columns=col_to_wide,
        values=vals,
    )
    .reset_index()  # turns the multi-index levels back into ordinary columns
    .rename_axis(None, axis=1)  # clears the leftover name on the columns axis
)
df
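A caveat with using every remaining column as the index: pivot() raises a ValueError if the same combination of index values and month appears more than once. In that situation pivot_table() with an aggregation function may be a workable substitute. The sketch below uses made-up data with a deliberate duplicate (it is not the data from the question), and summing the duplicates is only an assumption about what you would want.

```python
import pandas as pd

# Hypothetical data: John Blue has two November rows.
frame = pd.DataFrame({
    "FirstName": ["John", "John", "Jill"],
    "LastName": ["Blue", "Blue", "Green"],
    "Building": ["Building1", "Building1", "Building1"],
    "Month": ["November", "November", "November"],
    "Sales": [100, 50, 150],
})

col_to_wide = "Month"
vals = "Sales"
keep_cols = [col for col in frame.columns if col not in [col_to_wide, vals]]

# pivot() cannot decide which of the duplicate values to keep.
try:
    frame.pivot(index=keep_cols, columns=col_to_wide, values=vals)
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates duplicates instead (here: by summing them).
wide = (
    frame.pivot_table(index=keep_cols, columns=col_to_wide, values=vals, aggfunc="sum")
    .reset_index()
    .rename_axis(None, axis=1)
)
print(pivot_failed)
print(wide)
```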