I have the following DataFrame:
chr start_position end_position gene_name
0 Chr Position Ref Gene_Name
1 chr22 24128945 G nan
2 chr19 45867080 G ERCC2
3 chr3 52436341 C BAP1
4 chr7 151875065 G KMT2C
5 chr19 1206633 CGGGT STK11
I want to replace every value in the 'end_position' column with 'start_position' + len('end_position'), so the result should be:
chr start_position end_position gene_name
0 Chr Position Ref Gene_Name
1 chr22 24128945 24128946 nan
2 chr19 45867080 45867081 ERCC2
3 chr3 52436341 52436342 BAP1
4 chr7 151875065 151875066 KMT2C
5 chr19 1206633 1206638 STK11
I wrote the following script:
patient_vcf_to_df.apply(pd.to_numeric, errors='ignore')
patient_vcf_to_df['end_position'] = patient_vcf_to_df['end_position'].map(lambda x: patient_vcf_to_df['start_position'] + len(x))
But I get this error: TypeError: must be str, not int
Does anyone know how I can fix this?
Thank you very much!
First, I would read your CSV so that row 0 becomes the header (the column names):
df = pd.read_csv(filename, header=1)
This gives the following DataFrame:
Chr Position Ref Gene_Name
0 chr22 24128945 G NaN
1 chr19 45867080 G ERCC2
2 chr3 52436341 C BAP1
3 chr7 151875065 G KMT2C
4 chr19 1206633 CGGGT STK11
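As a positive side effect, the position column is now parsed as an integer (this output was taken after lowercasing the column names, which is shown next):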
In [99]: df.dtypes
Out[99]:
chr object
position int64 # <--- NOTE
ref object
gene_name object
dtype: object
If you want to lowercase your column names:
In [97]: df.columns = df.columns.str.lower()
In [98]: df
Out[98]:
chr position ref gene_name
0 chr22 24128945 G NaN
1 chr19 45867080 G ERCC2
2 chr3 52436341 C BAP1
3 chr7 151875065 G KMT2C
4 chr19 1206633 CGGGT STK11
Make sure the position column has a numeric dtype:
df['position'] = pd.to_numeric(df['position'], errors='coerce')
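To illustrate what errors='coerce' does, here is a minimal sketch on a hypothetical Series (not your data) that still contains a leftover header string. Values that cannot be parsed become NaN instead of raising an exception:

```python
import pandas as pd

# Hypothetical column that still contains a leftover header string.
raw = pd.Series(["Position", "24128945", "1206633"])

# errors='coerce' replaces unparseable values with NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
```

Note that a column containing NaN ends up as a float dtype, so this is best treated as a safety net rather than a substitute for reading the header correctly.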
Then:
In [101]: df['end_position'] = df['position'] + df['ref'].str.len()
In [102]: df
Out[102]:
chr position ref gene_name end_position
0 chr22 24128945 G NaN 24128946
1 chr19 45867080 G ERCC2 45867081
2 chr3 52436341 C BAP1 52436342
3 chr7 151875065 G KMT2C 151875066
4 chr19 1206633 CGGGT STK11 1206638
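The steps above can be put together in one self-contained sketch. It assumes the file is whitespace-delimited and has two header lines, as the printed frame in the question suggests. The inline text and sep=r"\s+" are stand-ins for your real file and delimiter:

```python
import io

import pandas as pd

# Stand-in for the real file: two header lines, whitespace-delimited (an assumption).
text = """chr start_position end_position gene_name
Chr Position Ref Gene_Name
chr22 24128945 G nan
chr19 45867080 G ERCC2
chr3 52436341 C BAP1
chr7 151875065 G KMT2C
chr19 1206633 CGGGT STK11
"""

# header=1 uses the second line as the column names and skips the first.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=1)
df.columns = df.columns.str.lower()

# Vectorized: integer position plus the length of each ref string.
df["end_position"] = df["position"] + df["ref"].str.len()
print(df)
```

Because both operands are whole columns, no apply or map with a lambda is needed. The addition is performed element-wise across all rows.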