How do I read a vcf.gz file in Python?

Question (votes: 0, answers: 3)

I have a file in vcf.gz format (for example, file_name.vcf.gz) and I need to read it in Python somehow.

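As I understand it, I first have to decompress the file and only then can I read it. I found this solution, but unfortunately it does not work for me. Even the first lines

bgzip file_name.vcf
tabix file_name.vcf.gz

fail with SyntaxError: invalid syntax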

Can you help me?

python bioinformatics vcf-variant-call-format
3 Answers
2 votes

Both cyvcf and pyvcf can read VCF files, but cyvcf is faster and more actively maintained.
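
As a side note on the error in the question: bgzip and tabix are command-line tools, so typing them into the Python interpreter gives a SyntaxError. Decompressing first is also not required, because a .vcf.gz file (including bgzip output) is gzip-compatible and the standard gzip module can stream it as text. Below is a minimal sketch using only the standard library; the function name iter_vcf_records is made up for illustration, and it assumes a tab-separated file with a single #CHROM header line.

```python
import gzip


def iter_vcf_records(path):
    """Yield one dict per variant record from a gzip-compressed VCF.

    Hypothetical helper for illustration; assumes a tab-separated file
    with a single '#CHROM' header line.
    """
    header = None
    with gzip.open(path, "rt") as handle:  # "rt" decodes bytes to str
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Column names, with the leading '#' removed
                header = line.lstrip("#").split("\t")
                continue
            if header is None or not line:
                continue  # nothing to map yet, or a blank line
            yield dict(zip(header, line.split("\t")))
```

Calling list(iter_vcf_records("file_name.vcf.gz")) would give a list of dicts keyed by CHROM, POS, ID and so on, with every value left as a string.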


0 votes

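The best approach is to use one of the programs that do this for you, as basesorbytes mentioned. However, if you want your own code, you can use this approach: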


# Import libraries

import gzip
import io

import pandas as pd

class ReadFile():
    '''
    This class reads a VCF file
    and does some data manipulation.
    The output is the full data found
    in the input of this class;
    the filtering process happens
    in the following step.
    '''
    def __init__(self, file_path):
        '''
        This is the built-in constructor method
        '''
        self.file_path = file_path

    def load_data(self):
        '''
        1) Convert the VCF file into a data frame
           Read the header of the body dynamically and assign dtypes
        '''

        # Open the VCF file in text mode and read it line by line,
        # dropping the '##' meta-information lines
        with gzip.open(self.file_path, 'rt') as f:
            lines = [l for l in f if not l.startswith('##')]

        # Identify the column-name line and save it into a dict
        # with dtypes as values
        dynamic_header_as_key = []
        for line in lines:
            if line.startswith('#CHROM'):
                dynamic_header_as_key = line.strip().split('\t')
                break

        # Declare dtypes; QUAL may be '.' or a decimal, so it stays str
        values = [str, int, str, str, str, str, str, str, str, str]
        columns2dtype = dict(zip(dynamic_header_as_key, values))

        vcf_df = pd.read_csv(
            io.StringIO(''.join(lines)),
            dtype=columns2dtype,
            sep='\t'
        ).rename(columns={'#CHROM': 'CHROM'})

        return vcf_df
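
If holding every line in a Python list is a concern, one possible variation is to count the leading ## lines first and let pandas read the compressed file directly. This is only a sketch under some assumptions: the file is gzip-compressed, all ## lines come before the #CHROM line, and the helper names count_meta_lines and read_vcf_gz are made up for illustration. Every column is kept as text here so that values such as "." survive unchanged.

```python
import gzip

import pandas as pd


def count_meta_lines(path):
    """Count the leading '##' meta-information lines of a gzip-compressed VCF."""
    n = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break
            n += 1
    return n


def read_vcf_gz(path):
    """Load the records of a .vcf.gz file into a DataFrame of strings.

    Hypothetical helper for illustration, not part of pandas.
    """
    df = pd.read_csv(
        path,
        sep="\t",
        skiprows=count_meta_lines(path),  # skip only the '##' block
        compression="gzip",
        dtype=str,              # keep every field as text
        keep_default_na=False,  # do not turn 'NA' or empty fields into NaN
    )
    return df.rename(columns={"#CHROM": "CHROM"})
```

Numeric columns such as POS can be converted afterwards, for example with pd.to_numeric, once you know which values your files actually contain.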

0 votes

If you don't like the OO design, here is a more traditional way:

import gzip

import pandas as pd

def vcf_loader(path: str) -> pd.DataFrame:
    # Open gzip-compressed files in text mode; plain files open as usual
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt') as f:
        lines = f.readlines()

    lines_new = []
    header = None
    for l in lines:
        if l.startswith('##'):
            continue  # skip meta-information lines
        elif l.startswith('#CHROM'):
            header = l.strip().split('\t')
        else:
            lines_new.append(l.strip().split('\t'))

    # Every column is left as str; convert POS etc. afterwards if needed
    df = pd.DataFrame(lines_new, columns=header)
    return df
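
The '.gz' suffix check works as long as files are named consistently. If that cannot be relied on, a possible alternative is to look at the first two bytes of the file, since every gzip stream starts with the bytes 0x1f 0x8b. A minimal sketch of that idea follows; the name open_vcf_text is made up for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_vcf_text(path):
    """Open a VCF as text, choosing gzip by file content, not by extension.

    Hypothetical helper for illustration; the caller closes the handle.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    return gzip.open(path, "rt") if is_gzip else open(path, "r")
```

The returned handle can be used in a with statement in the same way as the file object in the loader above.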
        