Hello, I have a huge FASTA file, for example:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence2 [virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence3
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence5 hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence7 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence8 |hypothetical protein[Musca domestica salivary gland hypertrophy virus]
MNKITRTDYLLNKLCRPQDGDDNLVASFMPCERAAIRRKYTTLYAYNYTECPHRILETCK
LQRIPYFTCIEYRANVECVERHVCDIFPVHIGLRLDRQIYAFLYGDDELNSPAVQRTMYD
LYGTIFVVSPQYFSNIFTNRKEIIHSSRDSDKLYNIYMYDVHDRGHRIWMTADANKTCIF
RNSNGQEHVIEASQSFRDFIDGIEYEVDIQRHMNFERMFEAFARYQPINDIDDLSNKNIL
>sequence9 |hypothetical protein[Musca domestica salivary gland hypertrophy virus]
MNKITRTDYLLNKLCRPQDGDDNLVASFMPCERAAIRRKYTTLYAYNYTECPHRILETCK
LQRIPYFTCIEYRANVECVERHVCDIFPVHIGLRLDRQIYAFLYGDDELNSPAVQRTMYD
LYGTIFVVSPQYFSNIFTNRKEIIHSSRDSDKLYNIYMYDVHDRGHRIWMTADANKTCIF
RNSNGQEHVIEASQSFRDFIDGIEYEVDIQRHMNFERMFEAFARYQPINDIDDLSNKNIL
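and I am looking for a way to remove duplicate sequences.
For example, sequence1_CP, sequence2, sequence3, sequence6 and sequence7 have exactly the same sequence, so I want to keep only one of them. The same goes for sequence4_CP and sequence5, for sequence6 and sequence7, and for sequence8 and sequence9.
The number of sequences in the file is: 2196136
So I need a fast method...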
For the example above, this is the output I should get:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence8 |hypothetical protein[Musca domestica salivary gland hypertrophy virus]
MNKITRTDYLLNKLCRPQDGDDNLVASFMPCERAAIRRKYTTLYAYNYTECPHRILETCK
LQRIPYFTCIEYRANVECVERHVCDIFPVHIGLRLDRQIYAFLYGDDELNSPAVQRTMYD
LYGTIFVVSPQYFSNIFTNRKEIIHSSRDSDKLYNIYMYDVHDRGHRIWMTADANKTCIF
RNSNGQEHVIEASQSFRDFIDGIEYEVDIQRHMNFERMFEAFARYQPINDIDDLSNKNIL
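To make the requirement concrete, here is a minimal Python sketch of the behaviour I am after, not a tested solution for the full file. It streams the records, joins the wrapped lines of each record before comparing (so sequence1_CP and sequence6, which are wrapped at different widths, count as the same), and keeps only the first record for each distinct sequence. The names `read_fasta`, `dedupe_fasta` and `dedupe_file` are hypothetical, and storing a 16-byte hash instead of the full sequence is just an assumption to keep memory low for about 2 million records.

```python
import hashlib


def read_fasta(lines):
    """Yield (header, sequence_lines) for each record in an iterable of lines."""
    header, seq_lines = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq_lines
            header, seq_lines = line, []
        elif line and header is not None:
            seq_lines.append(line)
    if header is not None:
        yield header, seq_lines


def dedupe_fasta(lines):
    """Yield the first record seen for each distinct sequence, in input order."""
    seen = set()
    for header, seq_lines in read_fasta(lines):
        # Join wrapped lines so records wrapped at different widths compare equal.
        sequence = "".join(seq_lines)
        # Keep a 16-byte digest rather than the whole sequence to save memory
        # (assumption: hash collisions are negligible at this scale).
        key = hashlib.blake2b(sequence.encode(), digest_size=16).digest()
        if key not in seen:
            seen.add(key)
            yield header, seq_lines


def dedupe_file(in_path, out_path):
    """Write the de-duplicated records of in_path to out_path (hypothetical helper)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq_lines in dedupe_fasta(src):
            # The kept record is written with its original line wrapping.
            dst.write("\n".join([header, *seq_lines]) + "\n")
```

Calling something like `dedupe_file("input.fasta", "unique.fasta")` (both file names are placeholders) makes a single pass over the file. Comparison here is exact and case-sensitive, and only the header of the first occurrence is kept.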