I am trying to explode the DataFrame below on the delimiter "|".
   preferred_title_symbol                    mim_number  MDR_code                        MDR_term
0  17-BETA HYDROXYSTEROID DEHYDROGENASE III  264300.0    10037122                        pseudohermaphroditism
1  2,4-DIENOYL-CoA REDUCTASE DEFICIENCY      616034.0    10027417 | 10081311 | 10059750  leukodystrophy | metabolic acidosis | muscle retention
The expected output looks like this:
   preferred_title_symbol                    mim_number  MDR_code  MDR_term
0  17-BETA HYDROXYSTEROID DEHYDROGENASE III  264300.0    10037122  pseudohermaphroditism
1  2,4-DIENOYL-CoA REDUCTASE DEFICIENCY      616034.0    10027417  leukodystrophy
1  2,4-DIENOYL-CoA REDUCTASE DEFICIENCY      616034.0    10081311  metabolic acidosis
1  2,4-DIENOYL-CoA REDUCTASE DEFICIENCY      616034.0    10059750  muscle retention
I am using the code below to do this.
out = (
pd.concat([df[col].str.split("\|")
.explode() for col in ["MDR_code", "MDR_term"]], axis=1)
.join(df["preferred_title_symbol"])
)
But I get the following error: ValueError: cannot reindex on an axis with duplicate labels
How can I achieve the desired output?
You need to explode the whole DataFrame, passing the columns to explode as the argument.
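import pandas as pd
data = {
    'preferred_title_symbol': ['17-BETA HYDROXYSTEROID DEHYDROGENASE III', '2,4-DIENOYL-CoA REDUCTASE DEFICIENCY'],
    'mim_number': [264300.0, 616034.0],
    'MDR_code': ['10037122', '10027417 | 10081311 | 10059750'],
    'MDR_term': ['pseudohermaphroditism', 'leukodystrophy | metabolic acidosis | muscle retention']
}
df = pd.DataFrame(data)
# Split on a literal "|" plus any surrounding whitespace. The pipe is escaped
# because an unescaped ' | ' is read as a regex alternation and splits on spaces.
df['MDR_code'] = df['MDR_code'].str.split(r'\s*\|\s*')
df['MDR_term'] = df['MDR_term'].str.split(r'\s*\|\s*')
# Explode both columns in one call (pandas >= 1.3) so each code stays paired with
# its term; chaining two explode calls would yield every code/term combination.
df = df.explode(['MDR_code', 'MDR_term'])
print(df)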
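If this comes up for more than one DataFrame, the split-then-explode steps can be wrapped in a small function. This is only a minimal sketch: the name explode_delimited is hypothetical (it is not a pandas function), it assumes every cell in the listed columns is a string (no NaN), and it relies on pandas 1.3 or later, where DataFrame.explode accepts a list of columns.

```python
import pandas as pd


def explode_delimited(df, columns, sep="|"):
    """Hypothetical helper: split `columns` on `sep` and explode them together."""
    out = df.copy()
    for col in columns:
        # Plain Python str.split treats sep as a literal string, so "|" needs no
        # escaping; strip() removes the spaces around each piece.
        out[col] = out[col].apply(lambda s: [part.strip() for part in s.split(sep)])
    # A list argument explodes the columns in lockstep; pandas raises ValueError
    # if a row has a different number of pieces in each column.
    return out.explode(columns)


df = pd.DataFrame({
    "preferred_title_symbol": [
        "17-BETA HYDROXYSTEROID DEHYDROGENASE III",
        "2,4-DIENOYL-CoA REDUCTASE DEFICIENCY",
    ],
    "mim_number": [264300.0, 616034.0],
    "MDR_code": ["10037122", "10027417 | 10081311 | 10059750"],
    "MDR_term": [
        "pseudohermaphroditism",
        "leukodystrophy | metabolic acidosis | muscle retention",
    ],
})

result = explode_delimited(df, ["MDR_code", "MDR_term"])
print(result)
```

The original index labels are kept, so the second disease still appears as three rows labelled 1, as in the expected output. Append .reset_index(drop=True) if you would rather have a fresh 0..n-1 index.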