How do I handle different spellings of a column name when extracting data?


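In this example I have two DataFrames. The genre column is the third column in df1 but the second column in df2, and its header is spelled slightly differently. In my real script I have to look columns up by name, because the column positions differ in every worksheet it reads.

How can I get the different header names recognized as the same thing?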

import pandas as pd

df1 = pd.DataFrame({'TITLE': ['The Matrix','Die Hard','Kill Bill'],
                    'VENDOR ID': ['1234','4321','4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})

df2 = pd.DataFrame({'TITLE': ['Toy Story','Shrek','Frozen'],
               'Genre': ['Animation', 'Adventure', 'Family'],
               'VENDOR ID': ['5678','8765','8576']})

column_names = ['TITLE','VENDOR ID','GENRE(S)']

appended_data = []

sheet1 = df1[df1.columns.intersection(column_names)]
appended_data.append(sheet1)
sheet2 = df2[df2.columns.intersection(column_names)]
appended_data.append(sheet2)

appended_data = pd.concat(appended_data, sort=False)

output:

        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678        NaN
1       Shrek      8765        NaN
2      Frozen      8576        NaN

desired output:

        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678  Animation
1       Shrek      8765  Adventure
2      Frozen      8576     Family
python excel pandas extract
1 answer

Thanks for taking the time to put this together. Asking a good question matters, and since yours is coherent, I was able to find a simple solution quickly:

import pandas as pd

df1 = pd.DataFrame({'TITLE': ['The Matrix','Die Hard','Kill Bill'],
                'VENDOR ID': ['1234','4321','4132'],
                 'GENRE(S)': ['Action', 'Adventure', 'Drama']})

df2 = pd.DataFrame({'TITLE': ['Toy Story','Shrek','Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                'VENDOR ID': ['5678','8765','8576']})

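The simple approach: I'll use pd.concat below (DataFrame.append, which older pandas versions offered for this, was removed in pandas 2.0). For it to work, the columns in df1 and df2 need to match. In this case we only need to replace df2's 'Genre' with 'GENRE(S)':

# Assign new names by position; the list must follow df2's column order
df2.columns = ['TITLE', 'GENRE(S)', 'VENDOR ID']

df3 = pd.concat([df1, df2])
print(df3)

        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678  Animation
1       Shrek      8765  Adventure
2      Frozen      8576     Family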

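In more detail: the above works for a one-off case, but you may have many mismatched columns, or need to do this repeatedly. Here is a solution that uses boolean indexing to find the mismatched names, then zip() and .rename() to map the column names: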

# Reload your original DataFrames first (df2's columns were overwritten above)

df1_find = df1.columns[~df1.columns.isin(df2.columns)] # column names in df1 that are not in df2
df2_find = df2.columns[~df2.columns.isin(df1.columns)] # column names in df2 that are not in df1

zipped = dict(zip(df2_find, df1_find)) # df2_find as key, df1_find as value

df2.rename(columns=zipped, inplace=True) # map the zipped dict onto df2's column names

df3 = pd.concat([df1, df2]) # DataFrame.append was removed in pandas 2.0
print(df3)

        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678  Animation
1       Shrek      8765  Adventure
2      Frozen      8576     Family
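
Because zip() silently stops at the shorter of its two inputs, a difference in the number of unmatched names would go unnoticed. A minimal sketch of a guard around the same pairing follows; the helper name build_rename_map is hypothetical and not part of pandas. Note that with more than one unmatched name on each side, the pairing is still purely positional and may need checking by eye.

```python
import pandas as pd

def build_rename_map(reference, other):
    """Pair the unmatched column names of `other` with those of `reference`.

    Raises ValueError when the counts differ, because zip() would
    otherwise drop the surplus names without warning.
    """
    ref_only = reference.columns[~reference.columns.isin(other.columns)]
    other_only = other.columns[~other.columns.isin(reference.columns)]
    if len(ref_only) != len(other_only):
        raise ValueError(
            f"cannot pair {list(other_only)} with {list(ref_only)}"
        )
    # Positional pairing: first unmatched name with first unmatched name, etc.
    return dict(zip(other_only, ref_only))

df1 = pd.DataFrame({'TITLE': ['The Matrix', 'Die Hard', 'Kill Bill'],
                    'VENDOR ID': ['1234', '4321', '4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story', 'Shrek', 'Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678', '8765', '8576']})

rename_map = build_rename_map(df1, df2)
# rename() returns a new DataFrame, so df2 itself is left untouched
combined = pd.concat([df1, df2.rename(columns=rename_map)])
print(combined)
```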

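Keep in mind:
1. This approach assumes both DataFrames have the same number of columns.
2. It also assumes df1 already has the column names in the desired format.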
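
If those assumptions do not hold across your worksheets, one alternative is to keep an explicit table of known spellings and normalize every header against it before selecting columns. This is only a sketch: the ALIASES table and the helper names canonical_name and standardize are made up for illustration, and the list of spellings would have to come from your own data.

```python
import pandas as pd

# Hypothetical alias table: lower-cased header spelling -> canonical name
ALIASES = {
    'title': 'TITLE',
    'vendor id': 'VENDOR ID',
    'genre': 'GENRE(S)',
    'genres': 'GENRE(S)',
    'genre(s)': 'GENRE(S)',
}

def canonical_name(header):
    """Return the canonical name for a header, or the header itself if unknown."""
    key = str(header).strip().lower()
    return ALIASES.get(key, header)

def standardize(df, wanted):
    """Rename known aliases, then keep the wanted columns in a fixed order."""
    renamed = df.rename(columns=canonical_name)
    return renamed[[col for col in wanted if col in renamed.columns]]

df1 = pd.DataFrame({'TITLE': ['The Matrix', 'Die Hard', 'Kill Bill'],
                    'VENDOR ID': ['1234', '4321', '4132'],
                    'GENRE(S)': ['Action', 'Adventure', 'Drama']})
df2 = pd.DataFrame({'TITLE': ['Toy Story', 'Shrek', 'Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                    'VENDOR ID': ['5678', '8765', '8576']})

column_names = ['TITLE', 'VENDOR ID', 'GENRE(S)']
combined = pd.concat([standardize(df, column_names) for df in (df1, df2)])
print(combined)
```

Since each sheet is matched by name against the alias table, neither the column position nor the number of columns in a sheet matters; a spelling missing from the table simply leaves that column out, so it shows up as NaN after the concat.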

I hope this helps.
