Suppose I have a dictionary whose values are lists:
dic = { "protein1": ["func1", "func2"],
"protein2": ["func2", "func3", "func5"],
"protein3": ["func3", "func5"]}
a list of index labels:
rows = ["protein1", "protein2", "protein3", "protein4"]
and a list of column labels:
columns = ["func1", "func2", "func3", "func4", "func5", "func6"]
I want to convert dic into a pandas DataFrame like this:
          func1  func2  func3  func4  func5  func6
protein1      1      1      0      0      0      0
protein2      0      1      1      0      1      0
protein3      0      0      1      0      1      0
protein4      0      0      0      0      0      0
What is the Pythonic way to code this? Thanks!
Use MultiLabelBinarizer together with DataFrame.reindex:
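import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# One row per key in dic, one 0/1 column per distinct function seen;
# reindex then adds the missing rows/columns (protein4, func4, func6) as zeros
df = (pd.DataFrame(mlb.fit_transform(dic.values()), columns=mlb.classes_, index=list(dic.keys()))
        .reindex(columns=columns, index=rows, fill_value=0))
print(df)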
          func1  func2  func3  func4  func5  func6
protein1      1      1      0      0      0      0
protein2      0      1      1      0      1      0
protein3      0      0      1      0      1      0
protein4      0      0      0      0      0      0
A pandas-only solution is also possible, but slower. It uses Series.str.get_dummies, which splits each string on '|' by default:
df = (pd.Series({k:'|'.join(v) for k, v in dic.items()}).str.get_dummies()
.reindex(columns=columns, index=rows, fill_value=0))
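For reference, what both approaches above compute can be spelled out with a plain nested comprehension. This is only a sketch for checking results on small inputs, not a replacement for either answer; the name `indicator` is made up for illustration.

```python
import pandas as pd

dic = {"protein1": ["func1", "func2"],
       "protein2": ["func2", "func3", "func5"],
       "protein3": ["func3", "func5"]}
rows = ["protein1", "protein2", "protein3", "protein4"]
columns = ["func1", "func2", "func3", "func4", "func5", "func6"]

# A cell is 1 if the column label appears in that row's list, else 0.
# Row labels missing from dic (protein4 here) fall back to an empty
# tuple and therefore produce an all-zero row.
indicator = pd.DataFrame(
    [[int(col in dic.get(row, ())) for col in columns] for row in rows],
    index=rows,
    columns=columns,
)
print(indicator)
```

Each membership test scans a list, so this is likely to be slow for large inputs; it is mainly useful as a readable definition of the expected result.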
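Another solution, whose output is a DataFrame of booleans (which can be treated as integers). Note that it only produces rows for keys present in dic, so protein4 is not included: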
import numpy as np
import pandas as pd

dic = { "protein1": ["func1", "func2"],
        "protein2": ["func2", "func3", "func5"],
        "protein3": ["func3", "func5"]}
columns = ["func1", "func2", "func3", "func4", "func5", "func6"]
n = len(columns)
# Replace each list of functions with a boolean row aligned to columns
for key, value in dic.items():
    # np.zeros starts every entry at False (np.empty would leave them uninitialized)
    newRow = np.zeros(n, dtype=bool)
    np.put(newRow, [columns.index(i) for i in value], True)
    dic[key] = newRow
pd.DataFrame.from_dict(dic, orient='index', columns=columns)
# Out:
#           func1  func2  func3  func4  func5  func6
# protein1   True   True  False  False  False  False
# protein2  False   True   True  False   True  False
# protein3  False  False   True  False   True  False
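If the 0/1 layout from the question is needed, the boolean frame can be cast to integers and reindexed to add the missing protein4 row. The following is a sketch of one way to do that, assuming np.isin is used to build the rows instead of np.put so that dic is left unmodified; the names `bool_df` and `int_df` are made up for illustration.

```python
import numpy as np
import pandas as pd

dic = {"protein1": ["func1", "func2"],
       "protein2": ["func2", "func3", "func5"],
       "protein3": ["func3", "func5"]}
rows = ["protein1", "protein2", "protein3", "protein4"]
columns = ["func1", "func2", "func3", "func4", "func5", "func6"]

# np.isin(columns, funcs) returns one boolean per column label:
# True where that label occurs in funcs. Stacking gives a 2-D bool array.
mask = np.array([np.isin(columns, funcs) for funcs in dic.values()])
bool_df = pd.DataFrame(mask, index=list(dic.keys()), columns=columns)

# Cast True/False to 1/0, then add rows that are absent from dic as zeros.
int_df = bool_df.astype(int).reindex(index=rows, fill_value=0)
print(int_df)
```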