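I have a genetic dataset, shown below, that contains duplicates in the pos column (identical genomic positions). I want to group the data by pos and fill the missing cells (coded as "NN") in each group using the information from the corresponding non-missing cells. If the alleles differ between two duplicate rows, the filled-in cells should also be made consistent with that row's alleles value. The rule is based on DNA complementarity: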
TT ↔ AA
GG ↔ CC
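The rule above can be sketched in base R for a single sample at one position. This is only an illustration under stated assumptions: the names `complement_genotype` and `fill_pair` are hypothetical and do not come from the post, each position is assumed to have exactly two duplicate rows, and rows whose alleles differ are assumed to be reverse-strand versions of each other (the sketch does not verify that).

```r
# Hypothetical helpers for illustration; neither name comes from the original post.

# Complement every base in a genotype string: A <-> T, G <-> C.
# "N" is not in the translation table, so "NN" is returned unchanged.
complement_genotype <- function(x) {
  chartr("ACGT", "TGCA", x)
}

# Fill the missing calls ("NN") for one sample across two duplicate rows.
# `genotypes` and `alleles` are length-2 character vectors (one entry per row).
fill_pair <- function(genotypes, alleles, missing = "NN") {
  for (i in 1:2) {
    other <- 3 - i
    if (genotypes[i] == missing && genotypes[other] != missing) {
      donor <- genotypes[other]
      # Same alleles: copy as-is. Different alleles: copy the complement.
      genotypes[i] <- if (alleles[i] == alleles[other]) {
        donor
      } else {
        complement_genotype(donor)
      }
    }
  }
  genotypes
}

complement_genotype(c("TT", "GG", "GA"))
# [1] "AA" "CC" "CT"

fill_pair(c("GA", "NN"), alleles = c("G/A", "C/T"))
# [1] "GA" "CT"

fill_pair(c("NN", "CC"), alleles = c("T/C", "T/C"))
# [1] "CC" "CC"
```

`chartr()` translates character by character, so heterozygous calls such as "GA" are handled the same way as homozygous ones.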
Dataset with missing cells:
df <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "NN"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "NN", "TT", "TT"),
  B_140 = c("NN", "CC", "NN", "CC")
)
Desired output after filling:
df_filled <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "TT"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "CT", "TT", "TT"),
  B_140 = c("GG", "CC", "CC", "CC")
)
The approach below produces the desired output, but it is messy. I would love to see a more elegant way to do this.
library(tidyverse)

# replace "NN" with <NA>
df2 <- df |>
  mutate(across(-(1:3), ~ if_else(.x == "NN", NA_character_, .x))) |>
  mutate(row = row_number(), .before = 1)

# table with corresponding values for everything (lower case for record keeping)
df_corresp <- df2 |>
  mutate(across(-(1:4), ~ str_replace_all(.x, "T", "a") %>%
    str_replace_all("A", "t") %>%
    str_replace_all("G", "c") %>%
    str_replace_all("C", "g"))) |>
  group_by(chr, pos) |>
  fill(everything(), .direction = "downup") |>
  ungroup()

df2 |>
  # 1) Fill matching alleles with same values
  group_by(chr, pos, alleles) |>
  fill(everything(), .direction = "downup") |>
  # 2) fill non-matching alleles with corresponding values
  group_by(chr, pos) |>
  rows_patch(df_corresp) |>
  ungroup() |>
  mutate(across(-(1:4), str_to_upper))
Result:
# A tibble: 4 × 8
    row   chr alleles   pos B_005 B_087 B_013 B_140
  <int> <dbl> <chr>   <dbl> <chr> <chr> <chr> <chr>
1     1     1 G/A     13276 GG    AA    GA    GG
2     2     1 C/T     13276 CC    TT    CT    CC
3     3     2 T/C     56329 TT    TT    TT    CC
4     4     2 T/C     56329 TT    TT    TT    CC