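I ran an analysis (background below) that produced a table whose row names are gene names. Some of these gene names have an extra ".#" suffix appended, e.g. .1, .2, and so on.
Here is a portion of the row names from the middle of the table. You can see that most have a ".#" suffix, but some do not.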
"Cck", "Pcp4.3", "Adcy1.4", "Cpe.3", "Dsp.5", "Dgkh.4", "Ncdn.4", "Marcksl1.3","Fam163b.5","Stxbp6.3","Prox1.3", "Neurod6.3", "Dock10.2","Fth1.3","Ipcef1","Pgbd5.3","Sema5a.2","Ahcyl2.2","Hpca.3","Gpr151"
These suffixes should not be there. My first guess was to remove them with a tidyverse (tidyr) solution, e.g.
df %>% rownames_to_column(var = "Gene.ID") %>% separate_wider_delim(Gene.ID, delim = ".", names = c("Gene", "num"))
But that approach returned this error:
Error in `separate_wider_delim()`:
! Expected 2 pieces in each element of `Gene.ID`.
! 3271 values were too short.
ℹ Use `too_few = "debug"` to diagnose the problem.
ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
! 2 values were too long.
ℹ Use `too_many = "debug"` to diagnose the problem.
ℹ Use `too_many = "drop"/"merge"` to silence this message.
Backtrace:
1. All_Markers %>% rownames_to_column(var = "Gene.ID") %>% ...
2. tidyr::separate_wider_delim(., Gene.ID, delim = ".", names = c("Gene", "Occur"))
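I think this happens because some elements do not contain the delimiter. Is there a way around this?
Additional background:
The software runs multiple comparisons in the same batch and writes the resulting gene names as row names. I assume that, because row names must be unique, a suffix is added to avoid duplicates. So I expect the numeric suffix = (number of occurrences) - 1.
But I want to be sure of this, so I would like to split the gene names at the delimiter, build a frequency table of each gene name, and compare it against the extracted suffixes to confirm.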
You can match the part of the string you want to keep and then replace the whole string with the matched text:
df %>% rownames_to_column(var = "Gene.ID") %>% mutate(Gene.ID = gsub("(.+?)\\..*$", "\\1", Gene.ID))
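One caveat: the lazy pattern above cuts at the first dot, so a name that itself contains a dot (the error message reports 2 values with more than two pieces) would be truncated too far. A minimal sketch of an alternative, assuming the unwanted suffix is always a trailing dot followed by digits, strips only that part and then checks the "suffix = occurrences - 1" guess with a frequency table. The vector `ids` is made-up toy data standing in for the real row names, and "mt.Co1" is a hypothetical example of a name with an internal dot.

```r
# Toy stand-in for the row names (made-up values, not the real table).
ids <- c("Cck", "Pcp4", "Cck.1", "mt.Co1", "Cck.2", "Pcp4.1", "mt.Co1.1", "Gpr151")

# Remove only a trailing ".<digits>"; dots elsewhere in the name are kept.
gene <- sub("\\.[0-9]+$", "", ids)

# Extract the numeric suffix as an integer; names without a suffix get 0.
has_suffix <- grepl("\\.[0-9]+$", ids)
suffix <- integer(length(ids))
suffix[has_suffix] <- as.integer(sub("^.*\\.([0-9]+)$", "\\1", ids[has_suffix]))

# Frequency table of the cleaned gene names.
freq <- table(gene)

# For each gene, the suffixes should be exactly 0, 1, ..., n - 1
# if they were added only to keep duplicated names unique.
check <- tapply(suffix, gene, function(s) all(sort(s) == seq_along(s) - 1L))

print(freq)
print(check)
```

In the pipeline from the question, the same pattern could be used as `mutate(Gene = sub("\\.[0-9]+$", "", Gene.ID))` after `rownames_to_column()`. Note that this assumes no genuine gene name ends in a dot plus digits; if some do, those would be stripped as well and would show up as `FALSE` in `check`.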