Exploding multiple columns in a pandas DataFrame


I am trying to explode the DataFrame below on the delimiter "|".

    preferred_title_symbol                      mim_number  MDR_code    MDR_term    
0   17-BETA HYDROXYSTEROID DEHYDROGENASE III    264300.0    10037122    pseudohermaphroditism   
1   2,4-DIENOYL-CoA REDUCTASE DEFICIENCY        616034.0    10027417 | 10081311 | 10059750  leukodystrophy | metabolic acidosis | muscle retention  

The expected output looks like this:

    preferred_title_symbol                      mim_number  MDR_code    MDR_term    
0   17-BETA HYDROXYSTEROID DEHYDROGENASE III    264300.0    10037122    pseudohermaphroditism   
1   2,4-DIENOYL-CoA REDUCTASE DEFICIENCY        616034.0    10027417    leukodystrophy 
1   2,4-DIENOYL-CoA REDUCTASE DEFICIENCY        616034.0    10081311    metabolic acidosis 
1   2,4-DIENOYL-CoA REDUCTASE DEFICIENCY        616034.0    10059750    muscle retention

I am using the code below to do this.

out = (
        pd.concat([df[col].str.split("\|")
                          .explode() for col in ["MDR_code", "MDR_term"]], axis=1)
             .join(df["preferred_title_symbol"])
      )

But I get the following error: ValueError: cannot reindex on an axis with duplicate labels

How can I get the desired output?

python pandas explode
1 Answer

You need to explode the whole DataFrame, passing the columns to explode as the argument.

import pandas as pd

data = {
    'preferred_title_symbol': ['17-BETA HYDROXYSTEROID DEHYDROGENASE III', '2,4-DIENOYL-CoA REDUCTASE DEFICIENCY'],
    'mim_number': [264300.0, 616034.0],
    'MDR_code': ['10037122', '10027417 | 10081311 | 10059750'],
    'MDR_term': ['pseudohermaphroditism', 'leukodystrophy | metabolic acidosis | muscle retention']
}

df = pd.DataFrame(data)


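# Split on the literal "|" plus any surrounding whitespace. A bare ' | ' pattern
# is interpreted as a regex (space OR space) and would split on every space.
df['MDR_code'] = df['MDR_code'].str.split(r'\s*\|\s*')
df['MDR_term'] = df['MDR_term'].str.split(r'\s*\|\s*')

# Explode both columns together (requires pandas >= 1.3). Chaining two
# .explode() calls would pair every code with every term instead.
df = df.explode(['MDR_code', 'MDR_term'])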

print(df)
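
If the real data has rows where the two columns do not split into the same number of items, a multi-column explode cannot pair them up. The sketch below wraps the split-and-explode steps in a helper that checks for this first. The function name `explode_delimited` and its parameters are hypothetical, and it assumes every value in the chosen columns is a string and that pandas is version 1.3 or later.

```python
import pandas as pd


def explode_delimited(df, columns, sep="|"):
    """Hypothetical helper: split `columns` on `sep` and explode them in lockstep."""
    out = df.copy()
    for col in columns:
        # Plain Python str.split treats `sep` literally; strip the padding
        # around each piece. Assumes every value in the column is a string.
        out[col] = out[col].map(lambda s: [part.strip() for part in s.split(sep)])

    # Rows whose columns split into different numbers of pieces cannot be paired.
    counts = out[columns].apply(lambda s: s.map(len))
    mismatched = counts.nunique(axis=1) > 1
    if mismatched.any():
        bad_rows = list(out.index[mismatched])
        raise ValueError(f"unequal item counts in rows: {bad_rows}")

    # Passing a list of columns to explode requires pandas >= 1.3.
    return out.explode(columns, ignore_index=True)


df = pd.DataFrame({
    "preferred_title_symbol": [
        "17-BETA HYDROXYSTEROID DEHYDROGENASE III",
        "2,4-DIENOYL-CoA REDUCTASE DEFICIENCY",
    ],
    "mim_number": [264300.0, 616034.0],
    "MDR_code": ["10037122", "10027417 | 10081311 | 10059750"],
    "MDR_term": [
        "pseudohermaphroditism",
        "leukodystrophy | metabolic acidosis | muscle retention",
    ],
})

result = explode_delimited(df, ["MDR_code", "MDR_term"])
print(result)
```

Note that `ignore_index=True` renumbers the rows from 0. Drop it if you want the repeated original labels shown in the expected output.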