How to handle complementary row pairs and fill missing values based on a reference column?


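I have a genetic dataset like the one below, which contains duplicates in the pos column (same genomic position). Missing genotypes are coded as "NN". I want to group the data by pos and fill the missing cells in each group using the information from the corresponding non-missing cells. If the alleles differ between the two duplicate rows, the filled cell should also match the alleles value of its own row. The rule is based on DNA complementarity: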

TT ↔ AA
GG ↔ CC
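
To make the rule concrete, here is a minimal sketch of the mapping in base R. The helper name complement_genotype is hypothetical, and it assumes genotypes are complemented base by base without reversing the order, which is what the desired output below implies (GA becomes CT):

```r
# Hypothetical helper: swap each base for its complement (A<->T, C<->G).
# chartr() translates character by character, so "NN" is left unchanged.
complement_genotype <- function(x) chartr("ACGT", "TGCA", x)

complement_genotype(c("TT", "GG", "GA", "NN"))
#> [1] "AA" "CC" "CT" "NN"
```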

Dataset with missing cells:

df <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "NN"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "NN", "TT", "TT"),
  B_140 = c("NN", "CC", "NN", "CC")
)

Desired output after filling:

df_filled <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "TT"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "CT", "TT", "TT"),
  B_140 = c("GG", "CC", "CC", "CC")
)
r missing-data
1 Answer

This approach gives the desired output, but it is messy. I would love to see a more elegant approach.

library(tidyverse)
# replace "NN" with <NA>
df2 <- df |>
  mutate(across(-(1:3), ~if_else(.x == "NN", NA_character_, .x))) |>
  mutate(row = row_number(), .before = 1)

# table with corresponding values for everything (lower case for record keeping)
df_corresp <- df2 |>
  mutate(across(-(1:4), ~str_replace_all(.x, "T", "a") %>%
                  str_replace_all("A", "t") %>%
                  str_replace_all("G", "c") %>%
                  str_replace_all("C", "g"))) |>
  group_by(chr, pos) |>
  fill(everything(), .direction = "downup") |>
  ungroup()

df2 |>
  # 1) Fill matching alleles with same values
  group_by(chr, pos, alleles) |>
  fill(everything(), .direction = "downup") |>
  # 2) fill non-matching alleles with corresponding values
  group_by(chr, pos) |>
  rows_patch(df_corresp) |>
  ungroup() |>
  mutate(across(-(1:4), str_to_upper))

Result:

# A tibble: 4 × 8
    row   chr alleles   pos B_005 B_087 B_013 B_140
  <int> <dbl> <chr>   <dbl> <chr> <chr> <chr> <chr>
1     1     1 G/A     13276 GG    AA    GA    GG   
2     2     1 C/T     13276 CC    TT    CT    CC   
3     3     2 T/C     56329 TT    TT    TT    CC   
4     4     2 T/C     56329 TT    TT    TT    CC  
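
As one possible tidier alternative, the sketch below fills each sample column within a chr/pos group directly. It is an illustration under stated assumptions, not a tested general solution: the helper names complement_genotype and fill_from_pair are hypothetical, missing genotypes are assumed to be coded as "NN", the first non-missing row in the group is used as the donor, and any difference in alleles between rows is assumed to be a strand flip that a base-by-base complement resolves.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "NN"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "NN", "TT", "TT"),
  B_140 = c("NN", "CC", "NN", "CC")
)

# Hypothetical helper: base-by-base complement (A<->T, C<->G)
complement_genotype <- function(x) chartr("ACGT", "TGCA", x)

# Hypothetical helper: fill "NN" entries of one sample within one group.
# gt      - genotypes of that sample for the rows in the group
# alleles - alleles strings of the same rows
fill_from_pair <- function(gt, alleles) {
  missing <- gt == "NN"
  # Nothing to fill, or nothing to fill from
  if (!any(missing) || all(missing)) return(gt)
  donor <- which(!missing)[1]
  # Copy as-is when alleles match the donor row, otherwise complement
  same_alleles <- alleles[missing] == alleles[donor]
  gt[missing] <- ifelse(same_alleles, gt[donor], complement_genotype(gt[donor]))
  gt
}

df_filled2 <- df |>
  group_by(chr, pos) |>
  mutate(across(starts_with("B_"), ~ fill_from_pair(.x, alleles))) |>
  ungroup()
```

On the example data this should reproduce the desired output from the question. Unlike the answer above, it does not need the intermediate row index or the lower-case bookkeeping, but it does rely on the sample columns sharing the B_ prefix.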