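I am trying to maximize the number of data points in a cross-sectional panel data set. My matrix is structured as follows, with years on the y-axis and countries on the x-axis: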
       A   B  C  D
2000  NA  50 NA 85
2001 110  75 76 86
2002 120  NA 78 87
2003 130 100 80 88
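So I am trying to find all possible combinations of yearly data points, together with the largest set of countries available for each combination. Using the example above, I am trying to produce a vector, list, or some other type of object that looks something like this: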
2000, 2001, 2002, 2003 = D
2000, 2001, 2003 = D and B
2001, 2002, 2003 = D, A and C
2000, 2001 = D and B
2001, 2002 = D, A and C
2002, 2003 = D, A and C
2000 = D and B
2001 = A, B, C and D
2002 = A, C and D
2003 = A, B, C and D
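This is an abstract problem and I can't get my head around it. I would appreciate any help.
Here is a solution that is a good starting point, but could probably be improved: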
library(RcppAlgos)
getCombs <- function(myMat, upper = NULL, minYears = NULL) {
    numRows <- nrow(myMat)
    myColNames <- colnames(myMat)
    if (is.null(minYears)) ## set default
        repZero <- numRows - 1
    else if (minYears >= numRows || minYears < 1) ## check for extreme cases
        repZero <- numRows - 1
    else
        repZero <- numRows - minYears
    combs <- comboGeneral(v = 0:numRows, m = numRows,
                          freqs = c(repZero, rep(1, numRows)),
                          upper = upper)
    ## I think this part could be improved
    out <- lapply(1:nrow(combs), \(x) {
        myRows <- myMat[combs[x,],]
        if (is.null(nrow(myRows)))
            result <- !is.na(myRows)
        else
            result <- complete.cases(t(myRows))
        myColNames[result]
    })
    myRowNames <- rownames(myMat)
    names(out) <- lapply(1:nrow(combs), \(x) {
        myRowNames[combs[x, combs[x, ] > 0]]
    })
    out
}
Here is the output for the OP's example (the OP's list is missing 5 of the results below):
testMat <- matrix(
c(NA, 50, NA, 85,
110, 75, 76, 86,
120, NA, 78, 87,
130, 100, 80, 88),
nrow = 4, byrow = TRUE
)
row.names(testMat) <- 2000:2003
colnames(testMat) <- LETTERS[1:4]
getCombs(testMat)
#> $`2000`
#> [1] "B" "D"
#>
#> $`2001`
#> [1] "A" "B" "C" "D"
#>
#> $`2002`
#> [1] "A" "C" "D"
#>
#> $`2003`
#> [1] "A" "B" "C" "D"
#>
#> $`c("2000", "2001")`
#> [1] "B" "D"
#>
#> $`c("2000", "2002")`
#> [1] "D"
#>
#> $`c("2000", "2003")`
#> [1] "B" "D"
#>
#> $`c("2001", "2002")`
#> [1] "A" "C" "D"
#>
#> $`c("2001", "2003")`
#> [1] "A" "B" "C" "D"
#>
#> $`c("2002", "2003")`
#> [1] "A" "C" "D"
#>
#> $`c("2000", "2001", "2002")`
#> [1] "D"
#>
#> $`c("2000", "2001", "2003")`
#> [1] "B" "D"
#>
#> $`c("2000", "2002", "2003")`
#> [1] "D"
#>
#> $`c("2001", "2002", "2003")`
#> [1] "A" "C" "D"
#>
#> $`c("2000", "2001", "2002", "2003")`
#> [1] "D"
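However, neither this answer nor any future answer will give you every combination, since you have 144 countries and 47 years of data. That produces a very, very large number. The set of all combinations of every length up to n is equivalent to the power set, and the number of elements in a power set is 2^n. Since we are not counting the equivalent of the empty set, we need to subtract one, thus: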
suppressWarnings(suppressMessages(library(gmp)))
sub.bigz(pow.bigz(2, 47), 1)
#> Big Integer ('bigz') :
#> [1] 140737488355327
Yes, that is over one hundred trillion!!! You may need to rethink your approach, as there are simply too many results.
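As a rough cross-check that needs no extra package, the same count can be computed in base R, since 2^47 - 1 is below 2^53 and therefore exactly representable as a double. The helper name countCombs is hypothetical (it is not part of the answer above); it is only a sketch showing how requiring a minimum number of years shrinks the count.

```r
# Hypothetical helper: number of non-empty year subsets that contain at
# least `minYears` of the `n` years, i.e. sum of choose(n, k), k = minYears..n.
# Assumes 1 <= minYears <= n.
countCombs <- function(n, minYears = 1) {
  sum(choose(n, minYears:n))
}

2^47 - 1                       # all non-empty subsets of 47 years
countCombs(4)                  # 15, the size of the OP example output
countCombs(20, minYears = 15)  # 21700
```

Printing 2^47 - 1 at the console uses scientific notation by default; format(2^47 - 1, scientific = FALSE) shows all the digits.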
All is not lost! You can use the upper argument to limit the number of results, so that you can still study possible combinations. Observe:
set.seed(11111)
biggerTest <- matrix(sample(100, 20*20, replace = TRUE), nrow = 20)
colnames(biggerTest) <- LETTERS[1:20]
rownames(biggerTest) <- 1988:2007
## set 10% of values to NA
myNAs <- sample(400, 400 / 10)
biggerTest[myNAs] <- NA
biggerTest[1:6, 1:10]
#> A B C D E F G H I J
#> 1988 95 12 29 19 47 38 48 75 68 80
#> 1989 98 NA 87 77 99 30 92 30 66 27
#> 1990 33 49 17 2 77 NA 80 67 6 93
#> 1991 70 35 49 45 52 NA 46 13 27 86
#> 1992 42 37 28 95 34 13 53 NA 6 98
#> 1993 23 14 10 96 61 6 41 8 93 NA
## Getting all 1,048,575 results takes a good bit of time
system.time(allResults <- getCombs(biggerTest))
#> user system elapsed
#> 20.488 0.330 20.852
## Using upper greatly reduces the amount of time
system.time(smallSampTest <- getCombs(biggerTest, upper = 10000))
#> user system elapsed
#> 0.083 0.001 0.084
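Alternatively, you can use the minYears argument to return only the results for combinations with at least a given number of years. For example, per the OP's comments on @CPak's answer, if you only want to see results for combinations of 15 or more years, we have: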
system.time(minYearTest <- getCombs(biggerTest, minYears = 15))
#> user system elapsed
#> 0.414 0.003 0.417
set.seed(123)
minYearTest[sample(length(minYearTest), 5)]
#> $`c("1988", "1990", "1991", "1992", "1993", "1994", "1997", "1998", "1999", "2001", "2002", "2003", "2004", "2005", "2006", "2007")`
#> [1] "E" "L" "R"
#>
#> $`c("1988", "1990", "1991", "1992", "1993", "1995", "1996", "1997", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007")`
#> [1] "E" "L" "R"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1996", "1998", "1999", "2000", "2001", "2002", "2003", "2005", "2006", "2007")`
#> [1] "E" "L" "P" "R"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1996", "1997", "1998", "2000", "2001", "2002", "2003", "2005", "2006")`
#> [1] "E" "L" "R"
#>
#> $`c("1988", "1989", "1990", "1991", "1993", "1994", "1995", "1997", "1998", "1999", "2001", "2002", "2003", "2004", "2007")`
#> [1] "C" "D" "E" "I" "L" "R"
Or use both arguments together:
system.time(
bothConstraintsTest <- getCombs(biggerTest, 10000, minYears = 10)
)
#> user system elapsed
#> 0.137 0.002 0.138
bothConstraintsTest[1:5]
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997")`
#> [1] "A" "C" "D" "E" "I" "L" "R" "S"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1998")`
#> [1] "A" "C" "D" "E" "I" "L" "P" "R" "S"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1999")`
#> [1] "C" "D" "E" "I" "L" "P" "R"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2000")`
#> [1] "A" "C" "D" "E" "I" "L" "P" "R" "S"
#>
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2001")`
#> [1] "A" "C" "D" "E" "I" "L" "P" "R"
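The first thing we need to do is determine every combination of the n years. This boils down to finding all n-combinations of the multiset c(rep(0, n-1), 1:n), or equivalently, the power set of an n-element set minus the empty set. For example, for the years 2000:2003 (a 4-year span), the possible combinations are given by: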
comboGeneral(v = 0:4, m = 4, freqs = c(3, rep(1, 4)))
#> [,1] [,2] [,3] [,4]
#> [1,] 0 0 0 1
#> [2,] 0 0 0 2
#> [3,] 0 0 0 3
#> [4,] 0 0 0 4
#> [5,] 0 0 1 2
#> [6,] 0 0 1 3
#> [7,] 0 0 1 4
#> [8,] 0 0 2 3
#> [9,] 0 0 2 4
#> [10,] 0 0 3 4
#> [11,] 0 1 2 3
#> [12,] 0 1 2 4
#> [13,] 0 1 3 4
#> [14,] 0 2 3 4
#> [15,] 1 2 3 4
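This zero-padded encoding works because R silently drops zero indices when subsetting, so each row picks out exactly its non-zero years. A small base-R illustration (the variable names and the toy matrix m are only for demonstration):

```r
yrs <- as.character(2000:2003)
m <- matrix(1:8, nrow = 4, dimnames = list(yrs, c("A", "B")))

padded <- c(0, 0, 1, 3)     # row [6,] of the output above
yrs[padded]                 # "2000" "2002": the zeros are dropped
nrow(m[padded, ])           # 2: two rows selected, still a matrix

single <- c(0, 0, 0, 2)     # row [2,] of the output above
is.null(nrow(m[single, ]))  # TRUE: a single row collapses to a plain vector
```

That collapse to a vector is why getCombs handles the single-row case in a separate branch.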
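Now, we iterate over each row of the combinations, where each row tells us which rows of the original matrix to test for NAs. If a particular combination contains only one row, we determine which indices are not NA. This is easily carried out by !is.na(. If we have more than one row, we use complete.cases(t to obtain the columns that contain only numbers (i.e. no occurrences of NA).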
After this, we simply use indexing to get the names of our results and voilà, we have the desired result.
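For the step flagged in the function as improvable, one possible simplification is to subset with drop = FALSE, so that a single year stays a matrix, and to test every column with colSums(is.na(...)) == 0. The following is a base-R sketch under that assumption, not the method of the answer above: getCombsBase is a hypothetical name, it enumerates subsets with combn rather than comboGeneral, and it has no equivalent of the upper argument.

```r
# Hypothetical base-R variant: for every subset of years with at least
# `minYears` rows, return the columns that have no NA in those rows.
getCombsBase <- function(myMat, minYears = 1) {
  n   <- nrow(myMat)
  yrs <- rownames(myMat)
  # Flat list of row-index vectors for subset sizes minYears..n
  idx <- unlist(lapply(minYears:n, function(k) combn(n, k, simplify = FALSE)),
                recursive = FALSE)
  out <- lapply(idx, function(i) {
    sub <- myMat[i, , drop = FALSE]            # stays a matrix for one row
    colnames(myMat)[colSums(is.na(sub)) == 0]  # columns complete in these years
  })
  names(out) <- vapply(idx, function(i) toString(yrs[i]), character(1))
  out
}

testMat <- matrix(c(NA, 50, NA, 85, 110, 75, 76, 86,
                    120, NA, 78, 87, 130, 100, 80, 88),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(2000:2003, LETTERS[1:4]))
res <- getCombsBase(testMat)
length(res)          # 15
res[["2000, 2002"]]  # "D"
```

Like the original, this still enumerates every subset, so it is only practical for a small number of years or a large minYears.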
library(tidyverse)
I first 1) make the row names (the years) a column, 2) convert the data to long format, and 3) discard the rows where is.na(value) == TRUE
df <- data %>%
  mutate(year = rownames(data)) %>%
  gather(countries, value, A:D) %>%
  filter(is.finite(value)) %>%
  arrange(year) %>%
  select(-value)
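valid_countries is a function that filters df for the years of interest (vec is a combination of individual years), then keeps the countries that are found in all of those years. It returns a 2-element vector of [years-of-interest as a comma-separated string, valid-countries as comma-separated string]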
valid_countries <- function(df, vec) {
  ans <- df %>%
    filter(year %in% vec) %>%
    count(countries) %>%
    filter(n == length(vec)) %>%
    pluck("countries")
  c(toString(vec), toString(unique(sort(ans))))
}
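The following lapply iterates over sizes 1:N, where N is the number of years in the data. For each size it draws the unique combinations of years of that size, then returns the valid countries per the specified condition. I use as.data.frame(t(Reduce(...))) to put the data into an easier-to-read format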
result <- lapply(
  seq_len(length(unique(df$year))),
  function(i) {
    apply(
      combn(unique(df$year), i),
      2,
      function(j) { valid_countries(df, as.numeric(j)) }
    )
  }
)
as.data.frame(t(Reduce("cbind", result)))
Result
V1 V2
1 2000 B, D
2 2001 A, B, C, D
3 2002 A, C, D
4 2003 A, B, C, D
5 2000, 2001 B, D
6 2000, 2002 D
7 2000, 2003 B, D
8 2001, 2002 A, C, D
9 2001, 2003 A, B, C, D
10 2002, 2003 A, C, D
11 2000, 2001, 2002 D
12 2000, 2001, 2003 B, D
13 2000, 2002, 2003 D
14 2001, 2002, 2003 A, C, D
15 2000, 2001, 2002, 2003 D
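Since the question is about maximizing the number of data points, one way to use this table is to score each row by years times countries and sort. This is only a sketch of that idea: the data frame below is typed in by hand from the result above, and the column names nYears, nCountries and cells are made up for illustration.

```r
# Result table from above, entered by hand (V1 = years, V2 = countries)
res <- data.frame(
  V1 = c("2000", "2001", "2002", "2003",
         "2000, 2001", "2000, 2002", "2000, 2003",
         "2001, 2002", "2001, 2003", "2002, 2003",
         "2000, 2001, 2002", "2000, 2001, 2003",
         "2000, 2002, 2003", "2001, 2002, 2003",
         "2000, 2001, 2002, 2003"),
  V2 = c("B, D", "A, B, C, D", "A, C, D", "A, B, C, D",
         "B, D", "D", "B, D",
         "A, C, D", "A, B, C, D", "A, C, D",
         "D", "B, D", "D", "A, C, D", "D"),
  stringsAsFactors = FALSE
)

# Count the comma-separated entries in each string
res$nYears     <- lengths(strsplit(res$V1, ", ", fixed = TRUE))
res$nCountries <- lengths(strsplit(res$V2, ", ", fixed = TRUE))
res$cells      <- res$nYears * res$nCountries  # complete data points

best <- res[which.max(res$cells), ]
best$V1     # "2001, 2002, 2003"
best$cells  # 9
```

Here the largest complete block is 2001 to 2003 for countries A, C and D, with 9 data points. Whether years times countries is the right objective depends on the analysis, so treat the score as an assumption.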
Data
data <- read.table(text="A B C D
NA 50 NA 85
110 75 76 86
120 NA 78 87
130 100 80 88", header=TRUE, stringsAsFactors=FALSE)
rownames(data) <- 2000:2003