How can I rearrange the values within each row in R to avoid duplicates across columns?


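Question

I have a data frame in R in which each row contains several columns of categorical values. My goal is to rearrange the values so that no value is repeated across the columns of the same row. The original data frame may contain missing values, represented as empty strings or NA, and I want each column to keep the same number of values after the rearrangement.

The main rules are:

  1. For each time (t1, t2, t3, t4), each value should appear as close to the same number of times as possible.
  2. Within a row, values should not repeat. (Order does not matter.)
  3. For t1 and t2, every row should have a value; for t3 and t4, only a given percentage of rows should have a value. In this example I chose 40%.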
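As a rough illustration, the three rules can be turned into checks on a candidate arrangement. The sketch below uses base R only; the helper name `summarise_rules` and the small `example` data frame are made up for illustration and are not part of the original post. It assumes every column contains at least one non-missing value.

```r
# Hypothetical helper: summarises how well a data frame satisfies the three rules.
summarise_rules <- function(d) {
  m <- as.matrix(d)
  m[m %in% ""] <- NA  # treat empty strings as missing
  list(
    # Rule 1: largest minus smallest count of any value, per column (0 or 1 is balanced)
    count_spread = apply(m, 2, function(x) diff(range(table(x)))),
    # Rule 2: TRUE if no row repeats a non-missing value
    rows_without_duplicates = !any(apply(m, 1, function(x) any(duplicated(x, incomparables = NA)))),
    # Rule 3: share of rows that have a value, per column
    fill_fraction = colMeans(!is.na(m))
  )
}

# Made-up arrangement with five rows
example <- data.frame(
  t1 = c("A", "B", "C", "D", "A"),
  t2 = c("B", "C", "D", "A", "B"),
  t3 = c("C", "", "A", "", ""),
  t4 = c("", "D", "", "", "C")
)
res <- summarise_rules(example)
res
```

Here `fill_fraction` can be compared against the target percentage for t3 and t4, and `count_spread` gives a quick sense of how evenly the values are distributed.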

Here is an example of my input data:

df <- data.frame(
  t1 = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
  t2 = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
  t3 = c("A", "B", "C", "D", "", "", "", "", "", ""),
  t4 = c("A", "B", "C", "D", "", "", "", "", "", "")
)

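Expected output

I want to rearrange each row so that there are no duplicate values across its columns. Order does not matter: a row could be A, B, C, NA or A, B, C, D; as long as it has no duplicate values, I'm fine with it. I also need to preserve the number of non-missing values from the original data frame. Here is an example of the desired output: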

# Example of expected rearrangement (order may vary):
df_rearranged <- data.frame(
  t1 = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
  t2 = c("D", "A", "B", "C", "D", "A", "B", "C", "D", "A"),
  t3 = c("B", "", "A", "", "C", "", "D", "", "", ""),
  t4 = c("", "", "", "A", "", "C", "", "", "", "D")
)

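Explanation

For context, each column represents one occasion on which several coders (the values in the time columns) rate an item (RID). At each new time point, I need the coders to rate different items. Coders will code all items on the first two occasions (time 1 and time 2); for time3 and time4, however, they will code only 25% of the items (in the example above I used 40% for simplicity, but this percentage will vary, so I need the code to adjust automatically). Any help is appreciated; I'm stuck here.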
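One possible way to make the percentage adjustable is to derive the number of coded items from a fraction. This is only a sketch: the function `partial_column` and its arguments are hypothetical, and the random choice of rows is an assumption rather than something stated in the post.

```r
# Hypothetical helper: builds a column in which only `frac` of the items get a coder,
# with coders used as evenly as possible.
partial_column <- function(coders, n_items, frac) {
  n_coded <- round(frac * n_items)           # how many items receive a value
  out <- rep(NA_character_, n_items)
  rows <- sample.int(n_items, n_coded)       # which items are coded (random, an assumption)
  out[rows] <- rep_len(coders, n_coded)      # cycle through coders for balance
  out
}

set.seed(1)
coders <- c("MV", "AF", "RP", "AC", "FL", "AL")
time3 <- partial_column(coders, n_items = 80, frac = 0.25)
table(time3, useNA = "ifany")
```

A column built this way still has to be rearranged against the other columns so that no row repeats a coder.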

Real data

Input

structure(list(RID = c(2L, 9L, 14L, 24L, 44L, 64L, 95L, 116L, 
165L, 169L, 170L, 171L, 172L, 177L, 192L, 215L, 217L, 226L, 246L, 
247L, 288L, 292L, 300L, 306L, 313L, 316L, 339L, 344L, 352L, 355L, 
375L, 378L, 384L, 421L, 476L, 488L, 493L, 495L, 498L, 503L, 532L, 
553L, 581L, 588L, 604L, 605L, 608L, 639L, 640L, 642L, 664L, 669L, 
702L, 742L, 744L, 746L, 749L, 756L, 767L, 820L, 822L, 824L, 825L, 
826L, 842L, 843L, 856L, 865L, 895L, 901L, 916L, 920L, 921L, 929L, 
930L, 934L, 936L, 939L, 952L, 958L), time1 = c("MV", "MV", "AF", 
"RP", "MV", "AC", "RP", "MV", "FL", "MV", "AF", "AF", "AF", "RP", 
"MV", "AF", "RP", "RP", "MV", "AC", "AC", "FL", "MV", "AF", "FL", 
"AC", "AF", "RP", "FL", "AF", "AC", "AL", "FL", "AL", "FL", "AF", 
"RP", "AC", "RP", "RP", "FL", "AL", "FL", "FL", "RP", "MV", "MV", 
"AC", "MV", "AL", "AL", "RP", "AC", "AF", "AC", "MV", "AL", "AL", 
"RP", "AL", "FL", "MV", "RP", "AL", "AL", "AC", "RP", "FL", "AC", 
"AL", "MV", "AC", "AF", "AF", "AL", "AL", "FL", "AC", "FL", "AF"
), time2 = c("RP", NA, NA, NA, NA, NA, "AL", "RP", NA, NA, NA, 
"FL", NA, NA, NA, NA, NA, NA, NA, "MV", NA, NA, NA, "FL", "AL", 
"MV", NA, "AL", NA, NA, NA, NA, NA, NA, "RP", NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, "AF", NA, NA, NA, "AF", "AC", NA, NA, 
"AC", NA, NA, NA, NA, NA, NA, NA, NA, "AC", NA, NA, NA, NA, "MV", 
NA, "MV", NA, "AF", NA, NA, NA, NA, NA, "RP", NA, "FL"), time3 = c("MV", 
"MV", "AF", "RP", "MV", "AC", "RP", "MV", "FL", "MV", "AF", "AF", 
"AF", "RP", "MV", "AF", "RP", "RP", "MV", "AC", "AC", "FL", "MV", 
"AF", "FL", "AC", "AF", "RP", "FL", "AF", "AC", "AL", "FL", "AL", 
"FL", "AF", "RP", "AC", "RP", "RP", "FL", "AL", "FL", "FL", "RP", 
"MV", "MV", "AC", "MV", "AL", "AL", "RP", "AC", "AF", "AC", "MV", 
"AL", "AL", "RP", "AL", "FL", "MV", "RP", "AL", "AL", "AC", "RP", 
"FL", "AC", "AL", "MV", "AC", "AF", "AF", "AL", "AL", "FL", "AC", 
"FL", "AF"), time4 = c("RP", NA, NA, NA, NA, NA, "AL", "RP", 
NA, NA, NA, "FL", NA, NA, NA, NA, NA, NA, NA, "MV", NA, NA, NA, 
"FL", "AL", "MV", NA, "AL", NA, NA, NA, NA, NA, NA, "RP", NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, "AF", NA, NA, NA, "AF", "AC", 
NA, NA, "AC", NA, NA, NA, NA, NA, NA, NA, NA, "AC", NA, NA, NA, 
NA, "MV", NA, "MV", NA, "AF", NA, NA, NA, NA, NA, "RP", NA, "FL"
)), row.names = c(NA, -80L), class = c("tbl_df", "tbl", "data.frame"
))

Expected output

structure(list(RID = c(2, 9, 14, 24, 44, 64, 95, 116, 165, 169, 
170, 171, 172, 177, 192, 215, 217, 226, 246, 247, 288, 292, 300, 
306, 313, 316, 339, 344, 352, 355, 375, 378, 384, 421, 476, 488, 
493, 495, 498, 503, 532, 553, 581, 588, 604, 605, 608, 639, 640, 
642, 664, 669, 702, 742, 744, 746, 749, 756, 767, 820, 822, 824, 
825, 826, 842, 843, 856, 865, 895, 901, 916, 920, 921, 929, 930, 
934, 936, 939, 952, 958), time1 = c("MV", "MV", "AF", "RP", "MV", 
"AC", "RP", "MV", "FL", "MV", "AF", "AF", "AF", "RP", "MV", "AF", 
"RP", "RP", "MV", "AC", "AC", "FL", "MV", "AF", "FL", "AC", "AF", 
"RP", "FL", "AF", "AC", "AL", "FL", "AL", "FL", "AF", "RP", "AC", 
"RP", "RP", "FL", "AL", "FL", "FL", "RP", "MV", "MV", "AC", "MV", 
"AL", "AL", "RP", "AC", "AF", "AC", "MV", "AL", "AL", "RP", "AL", 
"FL", "MV", "RP", "AL", "AL", "AC", "RP", "FL", "AC", "AL", "MV", 
"AC", "AF", "AF", "AL", "AL", "FL", "AC", "FL", "AF"), time2 = c("RP", 
NA, NA, NA, NA, NA, "AL", "RP", NA, NA, NA, "FL", NA, NA, NA, 
NA, NA, NA, NA, "MV", NA, NA, NA, NA, "AL", "MV", NA, "AL", NA, 
"FL", NA, NA, NA, NA, "RP", NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, "AF", NA, NA, NA, "AF", "AC", NA, NA, "AC", NA, NA, NA, NA, 
NA, NA, NA, NA, "AC", NA, NA, NA, NA, "MV", NA, "MV", NA, "AF", 
NA, NA, NA, NA, NA, "RP", NA, "FL"), time3 = c("FL", "AF", "MV", 
"MV", "AF", "RP", "MV", "AC", "RP", "AL", "FL", "MV", "FL", "AF", 
"AF", "RP", "MV", "AF", "RP", "RP", "MV", "AC", "AC", "FL", "MV", 
"AF", "FL", "AC", "AF", "RP", "FL", "AF", "AC", "MV", "AF", "AL", 
"FL", "AF", "FL", "AC", "RP", "RP", "AL", "AL", "FL", "FL", "RP", 
"MV", "RP", "AC", "MV", "AL", "AL", "RP", "FL", "AF", "AC", "MV", 
"AL", "MV", "RP", "AL", "FL", "MV", "RP", "AL", "AL", "AC", "RP", 
"FL", "AC", "AL", "MV", "AC", "AF", "AF", "AL", "AL", "AC", "AC"
), time4 = c(NA, NA, NA, NA, "AL", NA, NA, NA, NA, NA, "RP", 
NA, NA, NA, "FL", NA, NA, "MV", NA, NA, NA, NA, "FL", NA, NA, 
"AL", "MV", NA, "AL", NA, NA, NA, NA, "RP", NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, "AF", NA, NA, "AF", NA, NA, NA, "AC", NA, 
NA, NA, "AC", NA, NA, NA, NA, NA, "AC", NA, NA, NA, NA, "MV", 
NA, "MV", NA, "AF", NA, NA, NA, NA, NA, "RP", "FL", "RP", NA)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -80L))
r tidyverse data-cleaning data-wrangling
1 Answer

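This looks like a multidimensional assignment problem.

In your case, you want to permute the values in columns t2, t3 and t4 so that no row contains a duplicate.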

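The Hungarian algorithm solver HungarianSolver (in the RcppHungarian package) seems like a good choice. The function "solves weighted bipartite matching problems (e.g., optimal matching of people to cars or optimal matching of students to colleges, etc...)".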

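The function is very simple to use. It takes a single argument, the cost matrix. I think that for this case we only need 0s and 1s, where 0 means no cost (the pairing is allowed) and 1 is a nonzero cost for a pairing we do not want. To build the matrix, I use outer with == as FUN (excluding NAs). The result includes a matrix of pairs whose second column holds the indices that minimise the cost.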
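To make the cost matrix concrete, here is a minimal base-R sketch. The vectors `fixed` and `to_permute` are made-up illustration data; the comparison helper matches the one used in the answer's function.

```r
# TRUE only when both values are present and equal
equal <- function(x, y) (x == y) & !is.na(x) & !is.na(y)

fixed      <- c("A", "B", "C")   # column that stays in place
to_permute <- c("A", "B", NA)    # column whose values will be reordered

# cost[i, j] is 1 if putting to_permute[j] in row i would duplicate fixed[i], else 0
cost <- outer(fixed, to_permute, FUN = equal) * 1
cost

# Total cost of the assignment row 1 -> value 2, row 2 -> value 1, row 3 -> value 3
sum(cost[cbind(1:3, c(2, 1, 3))])  # 0
```

Any assignment with a total cost of 0 places the values without creating a duplicate in any row; the solver's job is to find one.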

library(RcppHungarian)

Since you provided both a toy dataset and a real one, I'll wrap my code in a function so that I can call it twice. Its only argument is the data.

fn <- function(data) {
  # Helper function for the `outer` function.
  equal <- function(x, y) (x==y) & !is.na(x) & !is.na(y)
  
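  # Extract the four columns, treating empty strings as missing (NA)
  blank_to_na <- function(x) replace(x, x %in% "", NA)
  t1 <- blank_to_na(data[,1, drop=TRUE])
  t2 <- blank_to_na(data[,2, drop=TRUE])
  t3 <- blank_to_na(data[,3, drop=TRUE])
  t4 <- blank_to_na(data[,4, drop=TRUE])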
  
  # Create the cost matrix for t1 and t2
  cost2 <- outer(t1, t2, FUN=equal)

  # Solve the problem and assign the result
  res2 <- HungarianSolver(cost2)
  t2a <- t2[res2$pairs[,2]]
  
  # Repeat for the 3rd and 4th columns (aggregating the costs)
  cost3 <- (outer(t1, t3, equal) + outer(t2a, t3, equal))
  res3 <- HungarianSolver(cost3)
  t3a <- t3[res3$pairs[,2]]

  cost4 <- (outer(t1, t4, equal) + outer(t2a, t4, equal) + outer(t3a, t4, equal))
  res4 <- HungarianSolver(cost4)
  t4a <- t4[res4$pairs[,2]]
  
  data.frame(t1, t2=t2a, t3=t3a, t4=t4a)
}

Calling the function on the toy dataset:

fn(df)

   t1 t2   t3   t4
1   A  B    C    D
2   B  A    D    C
3   C  D    A    B
4   D  C    B    A
5   A  B <NA> <NA>
6   B  A <NA> <NA>
7   C  D <NA> <NA>
8   D  C <NA> <NA>
9   A  B <NA> <NA>
10  B  A <NA> <NA>

We can see that no row contains a repeated value. Now let's try the real data.

DF_arranged <- fn(DF[,-1])
head(DF_arranged, 10)

   t1 t2   t3   t4
1  MV AF   RP   FL
2  MV RP <NA> <NA>
3  AF MV <NA>   RP
4  RP MV   FL <NA>
5  MV AC <NA> <NA>
6  AC MV <NA> <NA>
7  RP MV   AL <NA>
8  MV RP <NA> <NA>
9  FL MV   RP <NA>
10 MV FL <NA>   RP
...

The first ten rows look good (no duplicates within any row). A further check verifies the remaining rows.

sum(apply(DF_arranged, 1, FUN=\(x) sum(duplicated(x, incomparables=NA))))
# [1] 0
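If the total were not zero, it would help to know which rows are affected, and it is also worth confirming that each column still holds the same values as before. The helpers below are a hypothetical sketch in base R (the names `dup_rows` and `same_values` and the small data frames are made up for illustration).

```r
# Hypothetical helper: indices of rows that contain a repeated non-missing value
dup_rows <- function(d) {
  which(apply(d, 1, function(x) any(duplicated(x, incomparables = NA))))
}

# Hypothetical helper: TRUE if every column holds the same values before and after
same_values <- function(before, after) {
  all(mapply(function(b, a) identical(sort(b, na.last = TRUE), sort(a, na.last = TRUE)),
             before, after))
}

before <- data.frame(t1 = c("A", "B", "C"), t2 = c("A", "B", "C"))
after  <- data.frame(t1 = c("A", "B", "C"), t2 = c("B", "C", "A"))

dup_rows(before)            # every row repeats a value
dup_rows(after)             # no rows
same_values(before, after)  # only the order within t2 changed
```

Note that `same_values` compares the values literally, so empty strings in the input should be converted to NA first if the rearranged data uses NA for missing entries.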