How should I format this regular expression to find the desired amino acid sequence?


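I am looking for an amino acid in sequences from a FASTA file. I am checking whether the 564th amino acid in each sequence is V, I, or E. The file being accessed is read in and processed as part of another function. The code is as follows: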

with open("/Users/adia/Downloads/FGFR_fusion.fasta",'r') as fasta:
    test = fasta.read()
    test = test.split(">")
    del test[0]
    out = open("/Users/adia/Desktop/PSB HW_6.txt", 'w')
    for x in test:
        if 'fibroblast growth factor receptor 1 isoform' in x:
            out.write(x)  

with open("/Users/adia/Desktop/PSB HW_6.txt", 'r') as filtered:
    test2 = filtered.read()
    out = open("/Users/adia/Desktop/564ID.txt", 'w')
    
    AA564 = re.compile('^M.{562}[IVE]')  # the line in question
    matches = re.finditer(AA564,test2)

    found = [] 
    for match in matches: 
        found.append(match.group(0))
        print(f"Found {match.group(0)} at position {match.start()}.\n")
            
    if found == []:
            print("No match found.")

No match found.

I know there is at least one match (I read through and separated out each sequence myself), but for whatever reason this does not work.

I have tried many variations of this regular expression:

  • re.compile('M.{562}[IVE]')
  • re.compile('^M.{562}[IVE]')
  • re.compile('^M.{562}[IVE]',re.M)
  • re.compile('^M.{562}[IVE]',re.DOTALL)
  • re.compile('M.{562}[IVE].*')
  • re.compile('^M.{562}[IVE].*')
  • re.compile('^M[GALMFWKQESOVICYHRNDT]{562}[IVE]')
  • re.compile('M[GALMFWKQESOVICYHRNDT]{562}[IVE]')
  • re.compile('^M[A-Z]{562}[IVE]')

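This may have to do with how the file is formatted when it is first written out, since that file is the input to the part I am having trouble with. But I checked, and it is a string both when it is written out and when it is read back in. I can't edit it any further, or I wouldn't be able to use a regular expression on it. I tried using regex101 to see where the discrepancy is, but nothing was obvious. Can anyone provide insight into how to fix this? TIA!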
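One way to test the formatting suspicion is to run a scaled-down pattern against a small record that still has its header line and wrapped sequence lines. The sketch below is only an illustration under assumptions: the `record` string is made up, it wraps at 5 residues per line, and position 10 stands in for position 564. In Python's `re`, `^` without `re.M` matches only at the start of the string, and `.` without `re.DOTALL` does not match a newline, so a pattern that counts characters cannot get past a line break.

```python
import re

# Hypothetical record in the shape left by test.split(">"): a header line,
# then the sequence wrapped over several lines (5 residues per line here).
record = "fibroblast growth factor receptor 1 isoform X\nMAAAA\nAAAAE\nAAAAA\n"

# Scaled-down version of 'M.{562}[IVE]': M, 8 residues, then I/V/E at position 10.
pattern = r"M.{8}[IVE]"

# '^' anchors at the header text, so the anchored pattern fails.
anchored = re.search("^" + pattern, record)
# '.' never matches '\n', so the pattern cannot span the wrapped lines.
unanchored = re.search(pattern, record)
# With re.M, '^' matches at the start of "MAAAA", but '.' still stops at '\n'.
multiline = re.search("^" + pattern, record, re.M)

# Drop the header line and join the wrapped lines into one string.
header, *seq_lines = record.splitlines()
sequence = "".join(seq_lines)
joined = re.search("^" + pattern, sequence)

print(anchored, unanchored, multiline, joined.group(0))
```

Printing `repr(test2)` in the original code is a quick way to see whether the text being searched still contains a header and `\n` characters.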

python regex bioinformatics
1 Answer

Input FGFR_fusion.fasta:

>fibroblast growth factor receptor 1 isoform
AAAAAAAAAE

With the code:

import re

with open("FGFR_fusion.fasta", 'r') as fasta:
    test = fasta.read()

    print('test : ', test)
    # First record only: drop the header line and keep the line after it.
    # This relies on the sequence sitting on a single line, as in the input above.
    test2 = test.split(">")[1].split('\n')[1]

    print('test2 : ', test2)
    # del test[1]

    print('test2 : ', test2)

    # The with block closes (and flushes) the file before it is read back below.
    with open("PSB HW_6.txt", 'w') as out:
        for x in test.split(">"):

            print('x : ', x)

            if 'fibroblast growth factor receptor 1 isoform' in x:

                print('test2 : ', test2)
                out.write(test2)

with open("PSB HW_6.txt", 'r') as filtered:
    test2 = filtered.read()

    print('test2 : ', test2)

    with open("10ID.txt", 'w') as out:
        # Nine arbitrary characters, then I, V or E as the 10th residue.
        AA10 = re.compile('^.{9}[IVE]')
        matches = re.finditer(AA10, test2)

        print('matches : ', matches)

        found = []
        for match in matches:
            print('match.group() : ', match.group())
            found.append(match.group())
            print(f"Found {match.group(0)} at position {match.start()}\n")
            out.write(match.group()[9])

    if not found:
        print("No match found.")

Output:

test :  >fibroblast growth factor receptor 1 isoform
AAAAAAAAAE

test2 :  AAAAAAAAAE
test2 :  AAAAAAAAAE
x :  
x :  fibroblast growth factor receptor 1 isoform
AAAAAAAAAE

test2 :  AAAAAAAAAE
test2 :  AAAAAAAAAE
matches :  <callable_iterator object at 0x7fd97064cb50>
match.group() :  AAAAAAAAAE
Found AAAAAAAAAE at position 0
And the files:

PSB HW_6.txt
AAAAAAAAAE

10ID.txt
E
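The example above uses a single record whose sequence fits on one line. Real FASTA files usually wrap each sequence over many lines, so the same idea needs the lines of each record joined before the position check. The following is a minimal sketch of that, not a definitive implementation: `read_fasta` and `residue_hits` are hypothetical helper names, the toy data stands in for FGFR_fusion.fasta, and position 10 stands in for 564.

```python
import re
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def residue_hits(handle, keyword, position, allowed="IVE"):
    """Return (header, residue) for records whose header contains keyword
    and whose residue at the 1-based position is one of the allowed letters."""
    # position - 1 arbitrary residues, then one of the allowed letters.
    pattern = re.compile(r"^.{%d}([%s])" % (position - 1, allowed))
    hits = []
    for header, sequence in read_fasta(handle):
        if keyword not in header:
            continue
        match = pattern.match(sequence)
        if match:
            hits.append((header, match.group(1)))
    return hits


# Toy data (an assumption) wrapped at 5 residues per line.
demo = StringIO(
    ">fibroblast growth factor receptor 1 isoform A\nMAAAA\nAAAAE\n"
    ">fibroblast growth factor receptor 1 isoform B\nMAAAA\nAAAAK\n"
    ">unrelated protein\nMAAAA\nAAAAV\n"
)
hits = residue_hits(demo, "fibroblast growth factor receptor 1 isoform", 10)
print(hits)
```

For the real file, the same call would presumably be `residue_hits(fasta, 'fibroblast growth factor receptor 1 isoform', 564)` inside `with open("FGFR_fusion.fasta") as fasta:`. Once the sequence is a single string, plain indexing such as `sequence[563] in "IVE"` gives the same answer without a regular expression, provided the sequence has at least 564 residues.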
