I have the following DataFrame (read from a large CSV file with pd.read_csv):
import ast
import pandas as pd
sal_vcf_to_df = pd.read_csv(sal_filepath, delimiter='\t', header = 0, index_col = False,
                            low_memory=False, usecols=['listA', 'Amino_Acid_Change', 'Gene_Name'])
sal_df_wo_na = sal_vcf_to_df.dropna(axis = 0, how = 'any')
sal_df_wo_na['listA'] = sal_df_wo_na['listA'].apply(lambda x : ast.literal_eval(x))
sal_df_wo_na['listA'] = sal_df_wo_na['listA'].apply(lambda x: list(map(float, x)))
The DataFrame I get:
listA Amino_Acid_Change Gene_Name
0 "['133', '115', '3', '1']" Q637K ATM
1 "['114', '115', '2', '3']" I111 PIK3R1
2 "['51', '59', '1', '1']" T2491 KMT2C
I want to convert the 'listA' column into lists of floats. So far I have tried two steps:
sal_df_wo_na['listA'] = sal_df_wo_na['listA'].apply(lambda x : ast.literal_eval(x))
Then:
sal_df_wo_na['listA'] = sal_df_wo_na['listA'].apply(lambda x: list(map(float, x)))
But after the first step I got the following warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Does anyone know how to fix the warning, or have a better solution?
Option 1: pd.eval (works for at most 100 rows)
A very fast way to convert this awkward column is to strip all the quotes and then call pd.eval:
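# regex=True is required on pandas >= 2.0, where str.replace treats the pattern literally by default
v = pd.eval(df.listA.str.replace("['\"]", '', regex=True)).astype(float)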
v
array([[ 133., 115., 3., 1.],
[ 114., 115., 2., 3.],
[ 51., 59., 1., 1.]])
Assign the result back:
df['listA'] = v
df
listA Amino_Acid_Change Gene_Name
0 [133, 115, 3, 1] Q637K ATM
1 [114, 115, 2, 3] I111 PIK3R1
2 [51, 59, 1, 1] T2491 KMT2C
Option 2: ast.literal_eval (the reliable workhorse)
Update: pd.eval only supports up to 100 rows, so a slower but more reliable fallback is ast.literal_eval:
from ast import literal_eval
df.listA = df.listA.str.replace("'", '').apply(literal_eval)
df
listA Amino_Acid_Change Gene_Name
0 [133, 115, 3, 1] Q637K ATM
1 [114, 115, 2, 3] I111 PIK3R1
2 [51, 59, 1, 1] T2491 KMT2C
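Note that Option 2 as written yields lists of integers, while the question asks for floats. A minimal sketch of a per-cell converter that does both steps, using only the standard library; parse_float_list is a hypothetical helper name, and the sample strings are copied from the question:

```python
from ast import literal_eval


def parse_float_list(cell):
    """Parse a cell such as "['133', '115', '3', '1']" into a list of floats."""
    value = literal_eval(cell)
    # Assumption: if the cell carried an extra pair of surrounding quotes,
    # the first pass returns a string, so parse it once more.
    if isinstance(value, str):
        value = literal_eval(value)
    return [float(item) for item in value]


# Sample cells taken from the question's 'listA' column
raw_cells = [
    "['133', '115', '3', '1']",
    "['114', '115', '2', '3']",
    "['51', '59', '1', '1']",
]
parsed = [parse_float_list(cell) for cell in raw_cells]
print(parsed)
```

On a DataFrame, the same helper could be applied with df['listA'] = df['listA'].apply(parse_float_list).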
As for the SettingWithCopyWarning, the best source to read is
In short, what you are doing is creating sal_df_wo_na by taking a slice/view of a larger DataFrame, something like this:
sal_df_wo_na = df[<some condition here>]
This can result in chained assignment, which is what pandas is warning about. Instead, you need to do something like:
sal_df_wo_na = df[<some condition here>].copy()
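This uses the pd.DataFrame.copy function to create a copy of the slice. If the column holds objects, you can pass deep=True to copy explicitly; it is already the default.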
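A minimal sketch of the copy-then-assign pattern, assuming a hypothetical toy DataFrame in place of the real VCF data (the frame contents and the name clean are illustrative only):

```python
import pandas as pd

# Hypothetical toy data standing in for the frame read from the CSV
df = pd.DataFrame({
    "listA": ["['133', '115', '3', '1']", None, "['51', '59', '1', '1']"],
    "Gene_Name": ["ATM", "PIK3R1", "KMT2C"],
})

# Take an explicit copy so that `clean` is an independent DataFrame
clean = df.dropna(axis=0, how="any").copy()

# Assigning to a column of the copy does not touch `df`
clean["listA"] = clean["listA"].str.replace("'", "", regex=False)

print(clean)
```

Because clean owns its data, later column assignments apply to it alone and the original df is left unchanged.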