Transform the values of an entire column in a pandas DataFrame

Question (votes: 1, answers: 1)

I have the following DataFrame:

         chr start_position        end_position  gene_name
0        Chr       Position                 Ref  Gene_Name
1      chr22       24128945                   G        nan
2      chr19       45867080                   G      ERCC2
3       chr3       52436341                   C       BAP1
4       chr7      151875065                   G      KMT2C
5      chr19        1206633               CGGGT      STK11

I want to convert the entire 'end_position' column so that it holds 'start_position' + len('end_position'). The result should be:

         chr start_position        end_position  gene_name
0        Chr       Position                 Ref  Gene_Name
1      chr22       24128945            24128946       nan
2      chr19       45867080            45867081      ERCC2
3       chr3       52436341            52436342       BAP1
4       chr7      151875065           151875066      KMT2C
5      chr19        1206633             1206638      STK11

I wrote the following script:

patient_vcf_to_df.apply(pd.to_numeric, errors='ignore')
patient_vcf_to_df['end_position'] = patient_vcf_to_df['end_position'].map(lambda x: patient_vcf_to_df['start_position'] + len(x))

But I get the error: TypeError: must be str, not int

Does anyone know how I can fix this?

Thanks a lot!

Tags: pandas, dataframe
1 Answer
1 vote

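First, I would read your CSV so that row 0 of your DataFrame (the second line of the file) becomes the header, i.e. the column names: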

df = pd.read_csv(filename, header=1)

This gives the following DataFrame:

     Chr   Position    Ref Gene_Name
0  chr22   24128945      G       NaN
1  chr19   45867080      G     ERCC2
2   chr3   52436341      C      BAP1
3   chr7  151875065      G     KMT2C
4  chr19    1206633  CGGGT     STK11
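If you want to try this without the original file, here is a minimal sketch. The comma-separated layout and the in-memory buffer are assumptions for illustration only; your real file may use a different separator.

```python
import io
import pandas as pd

# Hypothetical file contents: two header-like lines, as in the question.
raw = io.StringIO(
    "chr,start_position,end_position,gene_name\n"
    "Chr,Position,Ref,Gene_Name\n"
    "chr22,24128945,G,\n"
    "chr19,45867080,G,ERCC2\n"
    "chr19,1206633,CGGGT,STK11\n"
)

# header=1 takes the line at 0-based index 1 as the column names
# and skips the line before it.
df = pd.read_csv(raw, header=1)

print(df.columns.tolist())
print(df["Position"].dtype)
```

Because the stray "Position" string is no longer part of the data, pandas can infer an integer dtype for that column on its own.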

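As a positive side effect, the position column is now parsed as an integer dtype (this output was taken after lowercasing the column names, which is why they appear in lowercase):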

In [99]: df.dtypes
Out[99]:
chr          object
position      int64        # <--- NOTE
ref          object
gene_name    object
dtype: object

If you want to lowercase your column names:

In [97]: df.columns = df.columns.str.lower()

In [98]: df
Out[98]:
     chr   position    ref gene_name
0  chr22   24128945      G       NaN
1  chr19   45867080      G     ERCC2
2   chr3   52436341      C      BAP1
3   chr7  151875065      G     KMT2C
4  chr19    1206633  CGGGT     STK11

Make sure the position column has a numeric dtype:

df['position'] = pd.to_numeric(df['position'], errors='coerce')
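To see what errors='coerce' does, here is a small sketch on a made-up Series (the values are only for illustration). Anything that cannot be parsed becomes NaN instead of raising, and the presence of NaN makes the result float64.

```python
import pandas as pd

# Illustrative values: one entry is a leftover header string.
s = pd.Series(["24128945", "Position", "1206633"])

# Unparseable entries become NaN rather than raising an error.
nums = pd.to_numeric(s, errors="coerce")

print(nums)
print(nums.isna().tolist())
```

If the column was already read as int64, as in the dtypes output above, this call leaves it unchanged.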

Then add the length of each ref string to the position:

In [101]: df['end_position'] = df['position'] + df['ref'].str.len()

In [102]: df
Out[102]:
     chr   position    ref gene_name  end_position
0  chr22   24128945      G       NaN      24128946
1  chr19   45867080      G     ERCC2      45867081
2   chr3   52436341      C      BAP1      52436342
3   chr7  151875065      G     KMT2C     151875066
4  chr19    1206633  CGGGT     STK11       1206638
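As for why the original script fails: the result of patient_vcf_to_df.apply(pd.to_numeric, errors='ignore') is never assigned back, and because row 0 holds the string "Position", 'start_position' stays a string column anyway. The lambda then adds an int to that whole string column for every element, which raises the TypeError. If you cannot re-read the file, the following sketch works on the frame as posted; I am assuming every column holds strings, as the error suggests.

```python
import pandas as pd

# Rebuild the question's frame: all strings, and row 0 repeats the header.
patient_vcf_to_df = pd.DataFrame({
    "chr": ["Chr", "chr22", "chr19", "chr3", "chr7", "chr19"],
    "start_position": ["Position", "24128945", "45867080",
                       "52436341", "151875065", "1206633"],
    "end_position": ["Ref", "G", "G", "C", "G", "CGGGT"],
    "gene_name": ["Gene_Name", "nan", "ERCC2", "BAP1", "KMT2C", "STK11"],
})

# Drop the stray header row, then convert and assign the result back.
fixed = patient_vcf_to_df.iloc[1:].reset_index(drop=True)
fixed["start_position"] = pd.to_numeric(fixed["start_position"])

# Vectorized: numeric start plus the length of each string, no lambda needed.
fixed["end_position"] = fixed["start_position"] + fixed["end_position"].str.len()

print(fixed)
```

Unlike the steps above, this overwrites 'end_position' in place, which is what the question asked for, so the original Ref strings are lost. Keep them in a separate column first if you still need them.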