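For this example I have 2 DataFrames. The genre column is the 3rd column in df1 but the 2nd column in df2, and its header is slightly different as well. In my actual script I have to search by column name, because the column positions differ in every sheet it reads.
How do I get different header names recognized as the same thing?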
import pandas as pd
df1 = pd.DataFrame({'TITLE': ['The Matrix','Die Hard','Kill Bill'],
                    'VENDOR ID': ['1234','4321','4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story','Shrek','Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678','8765','8576']})
column_names = ['TITLE','VENDOR ID','GENRE(S)']
appended_data = []
sheet1 = df1[df1.columns.intersection(column_names)]
appended_data.append(sheet1)
sheet2 = df2[df2.columns.intersection(column_names)]
appended_data.append(sheet2)
appended_data = pd.concat(appended_data, sort=False)
Output:
        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678        NaN
1       Shrek      8765        NaN
2      Frozen      8576        NaN
Desired output:
        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678  Animation
1       Shrek      8765  Adventure
2      Frozen      8576     Family
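Thank you for taking the time to write a good question. Asking a good question matters, and because yours was coherent I was able to find a simple solution quickly: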
import pandas as pd
df1 = pd.DataFrame({'TITLE': ['The Matrix','Die Hard','Kill Bill'],
                    'VENDOR ID': ['1234','4321','4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story','Shrek','Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678','8765','8576']})
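Simple approach: I'll stack df2 under df1 with pd.concat() below (DataFrame.append() did the same job in older pandas, but it was deprecated in 1.4 and removed in 2.0). For this to work, the columns in df1 and df2 need to match. In this case we just replace df2's 'Genre' with 'GENRE(S)':
df2.columns = ['TITLE', 'GENRE(S)', 'VENDOR ID']  # positional: must follow df2's current column order
df3 = pd.concat([df1, df2])
print(df3)
        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678  Animation
1       Shrek      8765  Adventure
2      Frozen      8576     Family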
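Since the question says column positions change from sheet to sheet, assigning to df2.columns by position can be fragile. A minimal sketch of the same step using .rename() with an explicit dict, which only touches the named column regardless of where it sits (the names df2_renamed and the use of ignore_index=True are my own choices here, not part of the original code):

```python
import pandas as pd

df1 = pd.DataFrame({'TITLE': ['The Matrix', 'Die Hard', 'Kill Bill'],
                    'VENDOR ID': ['1234', '4321', '4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story', 'Shrek', 'Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678', '8765', '8576']})

# Rename by name, not by position; columns not listed in the dict are left alone.
df2_renamed = df2.rename(columns={'Genre': 'GENRE(S)'})

# ignore_index=True gives a fresh 0..5 index instead of 0,1,2,0,1,2.
df3 = pd.concat([df1, df2_renamed], ignore_index=True)
print(df3)
```

Drop ignore_index=True if you want to keep each sheet's original row labels.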
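More detailed explanation: the above works for a single use case, but you may have many mismatched columns and/or need to do this repeatedly. Here is a solution that uses boolean indexing to find the mismatched names, then uses zip() and .rename() to map the column names: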
# Reload your original DataFrames first (df2's columns were overwritten above)
df1_find = df1.columns[~df1.columns.isin(df2.columns)]  # column names in df1 that are not in df2
df2_find = df2.columns[~df2.columns.isin(df1.columns)]  # column names in df2 that are not in df1
zipped = dict(zip(df2_find, df1_find))  # df2_find as keys, df1_find as values
df2.rename(columns=zipped, inplace=True)  # apply the mapping to df2's column names
df3 = pd.concat([df1, df2])  # DataFrame.append() was removed in pandas 2.0
print(df3)
        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678  Animation
1       Shrek      8765  Adventure
2      Frozen      8576     Family
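If you read many sheets and already know which header variants can turn up, another option is an explicit alias table instead of inferring the pairs with zip(). This is only a sketch under assumptions: ALIASES, LOOKUP and standardize_columns are hypothetical names, and the variants listed ('VENDOR_ID', 'GENRE', 'GENRES') are guesses at what your sheets might contain, so adjust them to your data.

```python
import pandas as pd

# Hypothetical alias table: canonical header -> variants seen in the sheets.
ALIASES = {
    'TITLE': ['TITLE'],
    'VENDOR ID': ['VENDOR ID', 'VENDOR_ID'],
    'GENRE(S)': ['GENRE(S)', 'GENRE', 'GENRES'],
}

# Reverse lookup keyed on the upper-cased variant, so matching ignores case.
LOOKUP = {
    variant.upper(): canonical
    for canonical, variants in ALIASES.items()
    for variant in variants
}

def standardize_columns(df):
    """Rename known header variants and return only the canonical columns."""
    mapping = {}
    for col in df.columns:
        key = str(col).strip().upper()
        if key in LOOKUP:
            mapping[col] = LOOKUP[key]
    # reindex fixes the column order; a canonical column missing from the
    # sheet comes back filled with NaN instead of raising an error.
    # Assumes a sheet does not contain two variants of the same header.
    return df.rename(columns=mapping).reindex(columns=list(ALIASES))

df1 = pd.DataFrame({'TITLE': ['The Matrix', 'Die Hard', 'Kill Bill'],
                    'VENDOR ID': ['1234', '4321', '4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story', 'Shrek', 'Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678', '8765', '8576']})

combined = pd.concat([standardize_columns(d) for d in (df1, df2)],
                     ignore_index=True)
print(combined)
```

The trade-off is that you maintain the alias list by hand, but the result no longer depends on column order or on both sheets having the same number of columns.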
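Keep in mind:
1. This approach assumes both DataFrames have the same number of columns.
2. It also assumes that df1 has the column names you want to keep.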
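Those assumptions can be checked before renaming. zip() silently stops at the shorter input, so if the two frames have a different number of unmatched columns, some would be left unrenamed without any warning. A small sketch of a guard (pair_mismatched_columns is a hypothetical helper name, not something from pandas):

```python
import pandas as pd

def pair_mismatched_columns(reference, other):
    """Map other's unmatched column names onto reference's, pairing by order."""
    ref_only = reference.columns[~reference.columns.isin(other.columns)]
    other_only = other.columns[~other.columns.isin(reference.columns)]
    if len(ref_only) != len(other_only):
        raise ValueError(
            f"cannot pair columns: {list(ref_only)} vs {list(other_only)}"
        )
    # With more than one unmatched column the pairing depends on column
    # order, so inspect the returned dict before trusting it.
    return dict(zip(other_only, ref_only))

df1 = pd.DataFrame({'TITLE': ['The Matrix', 'Die Hard', 'Kill Bill'],
                    'VENDOR ID': ['1234', '4321', '4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story', 'Shrek', 'Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678', '8765', '8576']})

mapping = pair_mismatched_columns(df1, df2)
print(mapping)
df3 = pd.concat([df1, df2.rename(columns=mapping)], ignore_index=True)
```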
I hope this helps.