How do I read a vcf.gz file in Python?

Question

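I have a file in vcf.gz format (for example, file_name.vcf.gz) and I need to read it in Python somehow.

As I understand it, I first have to decompress the file before I can read it. I found this solution, but unfortunately it does not work for me. Even the first line (bgzip file_name.vcf or tabix file_name.vcf.gz) fails with SyntaxError: invalid syntax.

Can you help me?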

Answer 1

Both cyvcf and pyvcf can read VCF files, but cyvcf is faster and more actively maintained.

Answer 2

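The best approach is to use one of the programs that does this for you, as basesorbytes mentioned. However, if you want your own code, you can use this approach: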

# Import libraries

import gzip
import io
import pandas as pd

class ReadFile():
    '''
    This class reads a VCF file
    and does some data manipulation.
    The output is the full data found
    in the input of this class;
    the filtering process happens
    in the following step.
    '''
    def __init__(self, file_path):
        '''
        This is the built-in constructor method
        '''
        self.file_path = file_path

    def load_data(self):
        '''
        1) Convert VCF file into a data frame
           Read the header of the body dynamically and assign dtype
        '''

        # Open the VCF file in text mode ('rt') and read it line by line,
        # skipping the '##' meta-information lines
        with gzip.open(self.file_path, 'rt') as f:
            lines = [l for l in f if not l.startswith('##')]

        # Identify the column-name line (it starts with "#CHROM")
        # and map each column name to a dtype
        dynamic_header_as_key = []
        for line in lines:
            if line.startswith('#CHROM'):
                dynamic_header_as_key = line.strip().split('\t')
                break
        # Declare dtypes; QUAL is read as str because it may be '.' or a float
        values = [str, int, str, str, str, str, str, str, str, str]
        columns2dtype = dict(zip(dynamic_header_as_key, values))

        vcf_df = pd.read_csv(
            io.StringIO(''.join(lines)),
            dtype=columns2dtype,
            sep='\t'
        ).rename(columns={'#CHROM': 'CHROM'})

        return vcf_df

Answer 3

If you don't like the OO design, here is a more traditional approach:

import gzip

import pandas as pd

def vcf_loader(path: str) -> pd.DataFrame:
    # Open gzip-compressed files in text mode; plain VCF files with open()
    if path.endswith('.gz'):
        f = gzip.open(path, 'rt')
    else:
        f = open(path, 'r')

    with f:
        lines = f.readlines()

    lines_new = []
    header = None
    for l in lines:
        if l.startswith('##'):
            continue  # skip meta-information lines
        elif l.startswith('#CHROM'):
            header = l.strip().split('\t')
        else:
            lines_new.append(l.strip().split('\t'))

    # Every value is kept as a string; convert columns afterwards if needed
    df = pd.DataFrame(lines_new, columns=header)
    return df
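
For what it's worth, the file does not have to be decompressed first: bgzip output is gzip-compatible, so Python's gzip module can read it directly. The bgzip and tabix lines from the question are shell commands, which is why pasting them into Python raises SyntaxError. Below is a minimal sketch that uses only the standard library and yields one dict per variant line instead of loading everything at once. The function name iter_vcf_records and the demo file are made up for illustration, and every value is left as a string.

```python
import gzip
import os
import tempfile
from typing import Dict, Iterator


def iter_vcf_records(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per variant line, keyed by the #CHROM header columns."""
    # gzip.open in 'rt' mode decompresses and decodes to text on the fly
    opener = gzip.open if path.endswith('.gz') else open
    header = None
    with opener(path, 'rt') as handle:
        for line in handle:
            line = line.rstrip('\n')
            if not line or line.startswith('##'):
                continue  # skip blank and meta-information lines
            if line.startswith('#CHROM'):
                # Column names, with the leading '#' removed
                header = line.lstrip('#').split('\t')
                continue
            if header is None:
                raise ValueError('data line found before the #CHROM header')
            yield dict(zip(header, line.split('\t')))


# Hypothetical demo data, written to a temporary file so the sketch runs as is
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.vcf.gz')
with gzip.open(demo_path, 'wt') as out:
    out.write('##fileformat=VCFv4.2\n')
    out.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
    out.write('1\t100\trs1\tA\tG\t50\tPASS\tDP=10\n')
    out.write('1\t200\t.\tC\tT\t.\tPASS\tDP=7\n')

records = list(iter_vcf_records(demo_path))
print(records[0]['POS'], records[1]['ALT'])
```

Because the function is a generator, a large file can be filtered in a loop without holding every line in memory, which the readlines()-based versions above do not allow.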
