Change FASTA headers using the word that follows "Similar to"

Question (votes: 0, answers: 2)

(Cross-posted to Biostars: https://www.biostars.org/p/9562110/)

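I have an annotated FASTA file, and I want to take the word that follows "Similar to" in each header and put it in first position, right after the >. For example, given this record: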

>_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG

I would like the output to look like this:

>Chid1_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG

How can I do this? I have already tried:

sed -E 's/(Similar to )(\w+)/>CHIA_\2\1\2/' file.txt > new_file_2.txt

I stored the result in a new file and tried to paste it into the headers, but it does not work. Any ideas?

I also have a Python script:

def extract_similar_to_word(line):
    words = line.split()
    for i, word in enumerate(words):
        if word == "Similar":
            similar_to_word = words[i + 2].strip('""')
            if i + 3 < len(words) and words[i + 3].strip('""')[0].isupper():
                similar_to_word = words[i + 1].strip('""') + words[i + 2].strip('""')
            return similar_to_word
    return None

def modify_fasta_headers(input_file, output_file):
    with open(input_file, "r") as in_file, open(output_file, "w") as out_file:
        for line in in_file:
            if line.startswith(">"):
                similar_to_word = extract_similar_to_word(line)
                if similar_to_word:
                    # Find the first space in the line, then insert the similar_to_word
                    first_space_index = line.find(" ")
                    line = ">" + similar_to_word + "_" + line[1:first_space_index] + line[first_space_index:]
            out_file.write(line)



input_file = "all_chias.fasta"
output_file = "modified_output_fasta_v1.fasta"

modify_fasta_headers(input_file, output_file)
Tags: python bioinformatics fasta
2 Answers
1 vote

For Python, you can use Biopython's SeqIO to rename the records in the file:

from Bio import SeqIO

def extract_similar_to(line):
    data = line.split('Similar to ')
    if len(data) > 1:
        return data[1].split(' ')[0] + "_" + line
    else:
        return line

def modify_fasta_headers(input_file, output_file):
    with open(output_file, "w") as outputs:
        for r in SeqIO.parse(input_file, "fasta"):
            # Rewrite description with similar to word
            r.description = extract_similar_to(r.description)
            # Remove old id
            r.id = ''
            SeqIO.write(r, outputs, "fasta")


input_file = "all_chias.fasta"
output_file = "modified_output_fasta_v1.fasta"

modify_fasta_headers(input_file, output_file)

extract_similar_to() was rewritten so that it splits the line on the keyword "Similar to ". For the example you gave:

>_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG

the line data = line.split('Similar to ') returns the following:

['_Anouracaudifer_00017283-RA transcript Name:"', 'Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393']

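Since a name may not contain "Similar to", we check whether the returned list has length > 1. If it does, we return the first word of the second list element, followed by an underscore and the original line. Otherwise, we return the original line unchanged.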
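To see both branches in isolation, here is a minimal sketch that applies the same split to two header strings, without Biopython. The first header is shortened from the example; the second is a made-up header (an assumption for illustration, not taken from the question) that lacks "Similar to".

```python
def extract_similar_to(line):
    # Same logic as in the answer: split on the keyword and, if it was found,
    # prepend the first word that follows it plus an underscore.
    data = line.split('Similar to ')
    if len(data) > 1:
        return data[1].split(' ')[0] + "_" + line
    return line


with_keyword = '_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1"'
# Hypothetical header without the keyword, for illustration only.
without_keyword = '_Anouracaudifer_00000001-RA transcript Name:"Protein of unknown function"'

print(extract_similar_to(with_keyword))     # gains a "Chid1_" prefix
print(extract_similar_to(without_keyword))  # printed as it was
```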

This is the content of the modified_output_fasta_v1.fasta file:

> Chid1__Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACC
CTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTC
TCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG

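Note: the double underscore in the output name appears because the header in your example already starts with an underscore (I am not sure whether that is intentional):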

>_Anouracaudifer_00017283-RA ...
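If the double underscore, and the space after > that is visible in the output above, are not wanted, one possible variation is to edit the header text directly. The sketch below uses only the standard library. The name rename_header and the rule "do not add a second underscore when the old ID already starts with one" are my assumptions, chosen so that the example header produces the output requested in the question.

```python
def rename_header(header):
    """Return the header with the word after 'Similar to ' placed before the ID."""
    before, keyword, after = header.partition("Similar to ")
    if not keyword or not after.split():
        return header  # keyword absent, or nothing follows it: leave unchanged
    word = after.split()[0].strip('"')
    old = header[1:]  # everything after the leading '>'
    # Assumption: avoid "__" when the old ID already starts with an underscore.
    sep = "" if old.startswith("_") else "_"
    return ">" + word + sep + old


def modify_fasta_headers(input_file, output_file):
    with open(input_file) as src, open(output_file, "w") as dst:
        for line in src:
            if line.startswith(">"):
                line = rename_header(line.rstrip("\n")) + "\n"
            dst.write(line)  # sequence lines are copied as they are
```

Because sequence lines are copied verbatim, the line wrapping of the input file is kept.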

1 vote

Python is probably the best tool for this task (+1 Marcelo), but if you want to use sed, one possible option is:

sed -n 's/>\(.*\"Similar to \)\([^ ]\{0,\}\) \(.*\)/>\2\1\2 \3/p; s/\(^[^>].*\)/\1/p' file.fasta > newfile.fasta

Alternatively, using awk:

awk '!/>/ {
    print
}

/>/ {
    a = $0
    gsub(".*Similar to ", "", $0)
    gsub(" .*", "", $0)
    gsub(">", "", a)
    print ">" $0 a
}' file.fasta > newfile.fasta
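
For comparison, the same substitution can be sketched in Python with re.sub. Note one difference: the sed command above runs with -n and prints a header only when the substitution succeeds, so headers without "Similar to" are dropped. This version passes them through unchanged, which is my choice rather than part of the answer. The names PATTERN and prefix_with_similar_word are hypothetical.

```python
import re

# Group 1: text from after '>' up to and including '"Similar to '.
# Group 2: the word that follows.
PATTERN = re.compile(r'^>(.*?"Similar to )(\S+)')


def prefix_with_similar_word(line):
    # '>\2\1\2' mirrors the sed replacement: the word, the original prefix,
    # then the word again in its original position.
    return PATTERN.sub(r'>\2\1\2', line, count=1)


header = '>_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1" offset:0'
print(prefix_with_similar_word(header))
print(prefix_with_similar_word("ATGAAGGCGCTCCTG"))  # sequence lines are untouched
```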