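(Cross-posted on Biostars: https://www.biostars.org/p/9562110/)
I have an annotated FASTA file, and I would like to take the word that follows "Similar to" in each header and add it at the first position, right after the >.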
>_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG
I would like the output to look like this:
>Chid1_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG
How can I do this? I have tried
sed -E 's/(Similar to )(\w+)/>CHIA_\2\1\2/' file.txt > new_file_2.txt
storing the result in a new file and then trying to paste it into the headers, but it does not work. Any ideas?
I also have a Python script:
def extract_similar_to_word(line):
    words = line.split()
    for i, word in enumerate(words):
        if word == "Similar":
            similar_to_word = words[i + 2].strip('""')
            if i + 3 < len(words) and words[i + 3].strip('""')[0].isupper():
                similar_to_word = words[i + 1].strip('""') + words[i + 2].strip('""')
            return similar_to_word
    return None

def modify_fasta_headers(input_file, output_file):
    with open(input_file, "r") as in_file, open(output_file, "w") as out_file:
        for line in in_file:
            if line.startswith(">"):
                similar_to_word = extract_similar_to_word(line)
                if similar_to_word:
                    # Find the first space in the line, then insert the similar_to_word
                    first_space_index = line.find(" ")
                    line = ">" + similar_to_word + "_" + line[1:first_space_index] + line[first_space_index:]
            out_file.write(line)

input_file = "all_chias.fasta"
output_file = "modified_output_fasta_v1.fasta"
modify_fasta_headers(input_file, output_file)
For Python, you can use Biopython's SeqIO to rename the records:
from Bio import SeqIO

def extract_similar_to(line):
    data = line.split('Similar to ')
    if len(data) > 1:
        return data[1].split(' ')[0] + "_" + line
    else:
        return line

def modify_fasta_headers(input_file, output_file):
    with open(output_file, "w") as outputs:
        for r in SeqIO.parse(input_file, "fasta"):
            # Rewrite description with similar to word
            r.description = extract_similar_to(r.description)
            # Remove old id
            r.id = ''
            SeqIO.write(r, outputs, "fasta")

input_file = "all_chias.fasta"
output_file = "modified_output_fasta_v1.fasta"
modify_fasta_headers(input_file, output_file)
extract_similar_to() was rewritten so that it splits the line on the keyword "Similar to ". For the example you gave:
>_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG
the line data = line.split('Similar to ') returns the following:
['_Anouracaudifer_00017283-RA transcript Name:"', 'Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393']
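Since a name may not contain "Similar to", we check whether the returned list has length > 1. If it does, we return the first word of the second list element, followed by "_" and the original line. Otherwise, we just return the original line.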
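The two branches described above can be tried without Biopython. This is only a small sketch: it copies extract_similar_to() from the answer and calls it on a shortened version of the sample description and on a made-up description ("seq1 transcript offset:0") that has no "Similar to" annotation.

```python
def extract_similar_to(line):
    # Split on the keyword; a second element exists only if the keyword is present
    data = line.split('Similar to ')
    if len(data) > 1:
        # First word after the keyword + "_" + the original description
        return data[1].split(' ')[0] + "_" + line
    else:
        return line


# Shortened version of the sample description (no leading ">", as in r.description)
with_keyword = '_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase" offset:0'
# Hypothetical description without the keyword
without_keyword = "seq1 transcript offset:0"

print(extract_similar_to(with_keyword))     # prefixed with "Chid1_"
print(extract_similar_to(without_keyword))  # returned unchanged
```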
Here is the content of the modified_output_fasta_v1.fasta file:
> Chid1__Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACC
CTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTC
TCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG
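Note: the double underscore in the output name appears because the ID in your example already starts with an underscore (I am not sure whether that was intentional):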
>_Anouracaudifer_00017283-RA ...
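If the exact header from the question is wanted (no space after the > and a single underscore), one possible variant is to rewrite the header lines directly with the standard library instead of going through SeqIO. The sketch below is not part of the answer above; the names SIMILAR_TO, prefix_header and rewrite_fasta are hypothetical, and it assumes the wanted word is the first token after "Similar to " and contains no whitespace or double quote.

```python
import re

# Assumption: the word to copy is the first token after "Similar to ",
# ending at whitespace or a closing double quote.
SIMILAR_TO = re.compile(r'Similar to ([^\s"]+)')


def prefix_header(line):
    """Return the line with the 'Similar to' word prefixed to the ID (headers only)."""
    match = SIMILAR_TO.search(line)
    if not line.startswith(">") or match is None:
        # Sequence line, or header without a "Similar to" annotation: keep as-is
        return line
    rest = line[1:]
    # Add "_" only when the original ID does not already start with one
    separator = "" if rest.startswith("_") else "_"
    return ">" + match.group(1) + separator + rest


def rewrite_fasta(input_path, output_path):
    """Copy a FASTA file, rewriting only its header lines."""
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            dst.write(prefix_header(line))
```

Calling rewrite_fasta("all_chias.fasta", "modified_output_fasta_v1.fasta") would then leave the sequence lines exactly as they were wrapped in the input, since only lines starting with > are touched.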
Python is probably the best tool for this task (+1 Marcelo), but if you want to use sed, one possible option is:
sed -n 's/>\(.*\"Similar to \)\([^ ]\{0,\}\) \(.*\)/>\2\1\2 \3/p; s/\(^[^>].*\)/\1/p' file.fasta > newfile.fasta
Alternatively, using awk:
awk '!/>/ {
    print
}
/>/ {
    a = $0
    gsub(".*Similar to ", "", $0)
    gsub(" .*", "", $0)
    gsub(">", "", a)
    print ">" $0 a
}' file.fasta > newfile.fasta