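I have a FASTA file that looks like the example below, with further headers and their corresponding sequences. How can I write code using a for loop to convert it into a data frame, with the ORF name stored in column 1 and the corresponding upstream/downstream sequence stored in column 2? (I am using RStudio.)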
>YAL001C TFC3 SGDID:S000000001, Chr I from 152168-146596, reverse complement, Verified ORF, "Largest of six subunits of the RNA polymerase III transcription initiation factor complex (TFIIIC); part of the TauB domain of TFIIIC that binds DNA at the BoxB promoter sites of tRNA and similar genes; cooperates with Tfc6p in DNA binding"
ACTTGTAAATATATCTTTTATTTTCCGAGAGGAAAAAGTTTCAAAAAAAAAAAAAAAAAA
AGAAGAAAAATAACTTTCTCATGTAATAAAAGGTAACTAATGTAGACAAAAAAGTATACA
TTTAGCTTTTCTTTTTTTGATGATTTTTGAGTTTCATGTTACTAATCAGAACAATTAACG
Please try this:
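# Read every non-blank line of the FASTA file as a character string
tmp <- scan("foo.fa", sep = "\n", what = "character")
# Join all lines with tabs, then split into one chunk per record at each ">"
tmp.paste <- paste(tmp, collapse = "\t")
tmp.fa <- strsplit(tmp.paste, ">")[[1]]
# The chunk before the first ">" is empty, so drop it
tmp.fa <- tmp.fa[nzchar(tmp.fa)]
tmp.dt <- t(sapply(tmp.fa, function(x) {
  x1 <- strsplit(x, "\t")[[1]]
  # ORF name: first space-delimited token of the header line
  x1.head <- strsplit(x1[1], " ")[[1]][1]
  # Sequence: all remaining lines of the record joined together
  x1.fa <- paste(x1[-1], collapse = "")
  c(x1.head, x1.fa)
}, USE.NAMES = FALSE))
colnames(tmp.dt) <- c("ORFID", "Fasta")
# sapply() returns a character matrix; convert it to a data frame
tmp.dt <- as.data.frame(tmp.dt, stringsAsFactors = FALSE)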
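I rely on stringr for string manipulation, so I used it here. I'm not sure whether you wanted to avoid packages altogether or only FASTA-specific packages, but this version meets the for-loop requirement.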
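library(stringr)
# Read in the FASTA file, one element per line
fa <- readLines("./test.fa")
# Initialize an empty data frame
df <- data.frame()
for (line in fa) {
  # Only header lines (starting with ">") are processed
  if (startsWith(line, ">")) {
    # Quoted description from the header, including the surrounding quotes
    lin <- str_extract(line, "(\".+)(\")")
    # New row: description without its quotes, and the "start-end" coordinates
    df[nrow(df) + 1, c(1, 2)] <- c(substr(lin, 2, nchar(lin) - 1), str_extract(line, "[0-9]+-[0-9]+"))
  }
}
# Change the column names
colnames(df) <- c("ORF", "SEQloc")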
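The loop above only looks at header lines, so it collects the quoted description and the coordinates but not the sequence. For the layout asked for in the question (ORF name in column 1, sequence in column 2), the same for-loop idea can be extended to accumulate the sequence lines too. What follows is a minimal base-R sketch rather than a definitive implementation: the helper name `read_fasta_df`, the column names, and the demo file are all hypothetical, and it assumes the ORF name is the first whitespace-delimited token after `>`.

```r
# Hypothetical helper: parse a FASTA file into a two-column data frame
# using an explicit for loop (base R only).
read_fasta_df <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]   # skip blank lines
  n <- sum(startsWith(lines, ">"))        # number of records
  orf <- character(n)                     # preallocate one slot per record
  sequence <- character(n)
  i <- 0
  for (line in lines) {
    if (startsWith(line, ">")) {
      i <- i + 1
      # ORF name: text after ">" up to the first whitespace (assumption)
      orf[i] <- sub("^>(\\S+).*$", "\\1", line)
    } else if (i > 0) {
      # Sequence line: append it to the current record
      sequence[i] <- paste0(sequence[i], trimws(line))
    }
  }
  data.frame(ORF = orf, Sequence = sequence, stringsAsFactors = FALSE)
}

# Demo on a small temporary file; the second record is invented for illustration
demo <- tempfile(fileext = ".fa")
writeLines(c(">YAL001C TFC3 SGDID:S000000001, Chr I from 152168-146596",
             "ACTTGTAAAT",
             "AGAAGAAAAA",
             ">ORF2 hypothetical second record",
             "TTTAGCTTTT"), demo)
fasta_df <- read_fasta_df(demo)
print(fasta_df)
```

Preallocating the two character vectors avoids growing a data frame row by row inside the loop, which tends to get slow on files with many records.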
If you are only interested in the FASTA header lines, you can pre-process the file before loading it into R, using:
grep "^>" test.fa > header_only_test.fa
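If you would rather stay inside R, the same header-only filtering can be sketched with base R's `grep()` and `value = TRUE`. This is only an illustration under stated assumptions: the demo file and its second record are hypothetical stand-ins for `test.fa`.

```r
# Demo file standing in for test.fa (second record invented for illustration)
demo <- tempfile(fileext = ".fa")
writeLines(c(">YAL001C TFC3 SGDID:S000000001",
             "ACTTGTAAAT",
             ">ORF2 hypothetical second record",
             "TTTAGCTTTT"), demo)

# Keep only the lines that begin with ">", like grep "^>" on the command line
headers <- grep("^>", readLines(demo), value = TRUE)
print(headers)
```

The resulting character vector can be passed to the header-parsing loop directly, without writing an intermediate file.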