Estimating combinations of panel data in R?


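I am trying to maximize the number of data points in a cross-sectional panel data set. My matrix is structured as follows, with years on the y-axis and countries on the x-axis: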

        A     B    C     D 
2000   NA    50    NA    85
2001   110   75    76    86
2002   120   NA    78    87
2003   130   100   80    88

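So I am trying to find every possible combination of years and, for each combination, the largest set of countries that have data in all of those years. Using the example above, I am trying to produce a vector, list, or some other type of object that looks something like this: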

2000, 2001, 2002, 2003 = D
2000, 2001, 2003 = D and B
2001, 2002, 2003 = D, A and C
2000, 2001 = D and B
2001, 2002 = D, A and C
2002, 2003 = D, A and C
2000 = D and B
2001 = A, B, C and D
2002 = A, C and D
2003 = A, B, C and D

This is an abstract problem and I can't wrap my head around it. I would appreciate any help.

r combinations panel
2 Answers

2 votes

Update

Here is a solution that is a good starting point, though it can probably be improved:

library(RcppAlgos)
getCombs <- function(myMat, upper = NULL, minYears = NULL) {

    numRows <- nrow(myMat)
    myColNames <- colnames(myMat)

    if (is.null(minYears))  ## set default
        repZero <- numRows - 1
    else if (minYears >= numRows || minYears < 1)  ## check for extreme cases
        repZero <- numRows - 1
    else
        repZero <- numRows - minYears

    combs <- comboGeneral(v = 0:numRows, m = numRows,
                          freqs = c(repZero, rep(1, numRows)),
                          upper = upper)

    ## I think this part could be improved
    out <- lapply(1:nrow(combs), \(x) {
        myRows <- myMat[combs[x,],]

        if (is.null(nrow(myRows)))
            result <- !is.na(myRows)
        else
            result <- complete.cases(t(myRows))

        myColNames[result]
    })

    myRowNames <- rownames(myMat)
    names(out) <- lapply(1:nrow(combs), \(x) {
        myRowNames[combs[x, combs[x, ] > 0]]
    })
    out
}

Here is the output for the OP's example (the OP's list is missing 5 of the results below):

testMat <- matrix(
    c(NA, 50, NA, 85,
      110, 75, 76, 86,
      120, NA, 78, 87,
      130, 100, 80, 88),
    nrow = 4, byrow = TRUE
)

row.names(testMat) <- 2000:2003
colnames(testMat) <- LETTERS[1:4]
getCombs(testMat)
#> $`2000`
#> [1] "B" "D"
#> 
#> $`2001`
#> [1] "A" "B" "C" "D"
#> 
#> $`2002`
#> [1] "A" "C" "D"
#> 
#> $`2003`
#> [1] "A" "B" "C" "D"
#> 
#> $`c("2000", "2001")`
#> [1] "B" "D"
#> 
#> $`c("2000", "2002")`
#> [1] "D"
#> 
#> $`c("2000", "2003")`
#> [1] "B" "D"
#> 
#> $`c("2001", "2002")`
#> [1] "A" "C" "D"
#> 
#> $`c("2001", "2003")`
#> [1] "A" "B" "C" "D"
#> 
#> $`c("2002", "2003")`
#> [1] "A" "C" "D"
#> 
#> $`c("2000", "2001", "2002")`
#> [1] "D"
#> 
#> $`c("2000", "2001", "2003")`
#> [1] "B" "D"
#> 
#> $`c("2000", "2002", "2003")`
#> [1] "D"
#> 
#> $`c("2001", "2002", "2003")`
#> [1] "A" "C" "D"
#> 
#> $`c("2000", "2001", "2002", "2003")`
#> [1] "D"
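Since the question's stated goal is to maximize the number of data points, one possible way to use a result like the one above is to score each year combination. The following is only a rough sketch: scoring a combination as (number of years) times (number of complete countries) is an assumption about what "most data points" means, and base R's combn stands in for the comboGeneral call so that the example is self-contained. The names rowSets, scores and best are introduced here for illustration.

```r
testMat <- matrix(
    c(NA, 50, NA, 85,
      110, 75, 76, 86,
      120, NA, 78, 87,
      130, 100, 80, 88),
    nrow = 4, byrow = TRUE,
    dimnames = list(as.character(2000:2003), LETTERS[1:4])
)

n <- nrow(testMat)

# All non-empty sets of row indices (base R stand-in for comboGeneral).
rowSets <- unlist(
    lapply(seq_len(n), function(k) combn(n, k, simplify = FALSE)),
    recursive = FALSE
)

# Assumed score: years in the set times countries with no NA in those years.
scores <- vapply(rowSets, function(rows) {
    complete <- complete.cases(t(testMat[rows, , drop = FALSE]))
    length(rows) * sum(complete)
}, integer(1))

best <- rowSets[[which.max(scores)]]
rownames(testMat)[best]   # years of the highest-scoring combination
max(scores)               # its score
```

For the OP's 4 x 4 example this picks the years 2001 to 2003, where A, C and D are all present. Under a different definition of "best" the scoring line would need to change.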

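However, neither this answer nor any future answer will give you every combination, because you have 144 countries and 47 years of data. That produces a very, very large number. The set of all combinations of every length up to n is equivalent to the power set, which has 2^n elements. Since we are not counting the equivalent of the empty set, we need to subtract one, thus: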

suppressWarnings(suppressMessages(library(gmp)))
sub.bigz(pow.bigz(2, 47), 1)
#> Big Integer ('bigz') :
#> [1] 140737488355327

Yes, that is over one hundred trillion!!! You may need to rethink your approach, as there are simply too many results.
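As a quick cross-check on that count, the same number can be reached in base R without gmp, because 2^47 - 1 is below 2^53 and so is exactly representable as a double. This is only a sketch; n_years, closed_form and by_size are names introduced here for illustration.

```r
# Number of non-empty subsets of n years, computed two ways in base R.
n_years <- 47  # the OP reports 47 years of data

# Closed form: size of the power set minus the empty set.
closed_form <- 2^n_years - 1

# Summing the number of k-year combinations over k = 1..n gives the same count
# (choose() may be approximate for large arguments, hence all.equal below).
by_size <- sum(choose(n_years, 1:n_years))

format(closed_form, big.mark = ",", scientific = FALSE)
isTRUE(all.equal(closed_form, by_size))
```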

All is not lost! You can use the upper argument to limit the number of results, so that you can still explore the possible combinations. Observe:

set.seed(11111)
biggerTest <- matrix(sample(100, 20*20, replace = TRUE), nrow = 20)

colnames(biggerTest) <- LETTERS[1:20]
rownames(biggerTest) <- 1988:2007

## set 10% of values to NA
myNAs <- sample(400, 400 / 10)
biggerTest[myNAs] <- NA

biggerTest[1:6, 1:10]
#>       A  B  C  D  E  F  G  H  I  J
#> 1988 95 12 29 19 47 38 48 75 68 80
#> 1989 98 NA 87 77 99 30 92 30 66 27
#> 1990 33 49 17  2 77 NA 80 67  6 93
#> 1991 70 35 49 45 52 NA 46 13 27 86
#> 1992 42 37 28 95 34 13 53 NA  6 98
#> 1993 23 14 10 96 61  6 41  8 93 NA

## Getting all 1,048,575 results takes a good bit of time
system.time(allResults <- getCombs(biggerTest))
#>    user  system elapsed 
#>  20.488   0.330  20.852

## Using upper greatly reduces the amount of time
system.time(smallSampTest <- getCombs(biggerTest, upper = 10000))
#>    user  system elapsed 
#>   0.083   0.001   0.084

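Alternatively, you can use the minYears argument to return only the results for combinations with at least a given number of years. For example, per the OP's comment on @CPak's answer, if you only want to see results for combinations of 15 or more years, we have: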

system.time(minYearTest <- getCombs(biggerTest, minYears = 15))
#>    user  system elapsed 
#>   0.414   0.003   0.417

set.seed(123)
minYearTest[sample(length(minYearTest), 5)]
#> $`c("1988", "1990", "1991", "1992", "1993", "1994", "1997", "1998", "1999", "2001", "2002", "2003", "2004", "2005", "2006", "2007")`
#> [1] "E" "L" "R"
#> 
#> $`c("1988", "1990", "1991", "1992", "1993", "1995", "1996", "1997", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007")`
#> [1] "E" "L" "R"
#> 
#> $`c("1988", "1989", "1990", "1991", "1992", "1996", "1998", "1999", "2000", "2001", "2002", "2003", "2005", "2006", "2007")`
#> [1] "E" "L" "P" "R"
#> 
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1996", "1997", "1998", "2000", "2001", "2002", "2003", "2005", "2006")`
#> [1] "E" "L" "R"
#> 
#> $`c("1988", "1989", "1990", "1991", "1993", "1994", "1995", "1997", "1998", "1999", "2001", "2002", "2003", "2004", "2007")`
#> [1] "C" "D" "E" "I" "L" "R"
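To get a feel for how much minYears shrinks the problem before running anything, the number of year combinations can be counted directly. The sketch below uses a hypothetical helper, countCombs, which is not part of getCombs or RcppAlgos; it simply sums the binomial coefficients for every allowed combination size.

```r
# How many year combinations remain when at least minYears rows are required?
# countCombs is a hypothetical helper introduced for illustration only.
countCombs <- function(numRows, minYears = 1) {
    sum(choose(numRows, minYears:numRows))
}

countCombs(20)       # all non-empty combinations of 20 years
countCombs(20, 15)   # combinations of 15 or more years
```

For the 20-year test matrix this gives 1,048,575 combinations without a restriction and 21,700 with minYears = 15, which is consistent with the large difference in timings above.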

Or use both arguments together:

system.time(
    bothConstraintsTest <- getCombs(biggerTest, 10000, minYears = 10)
)
#>    user  system elapsed 
#>   0.137   0.002   0.138

bothConstraintsTest[1:5]
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997")`
#> [1] "A" "C" "D" "E" "I" "L" "R" "S"
#> 
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1998")`
#> [1] "A" "C" "D" "E" "I" "L" "P" "R" "S"
#> 
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1999")`
#> [1] "C" "D" "E" "I" "L" "P" "R"
#> 
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2000")`
#> [1] "A" "C" "D" "E" "I" "L" "P" "R" "S"
#> 
#> $`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2001")`
#> [1] "A" "C" "D" "E" "I" "L" "P" "R"

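Explanation

The first thing we need to do is determine every combination of the n years. This boils down to finding all n-tuples of the multiset c(rep(0, n-1), 1:n), or equivalently, the power set of an n-element set minus the empty set. For example, for the years 2000:2003 (a 4-year span), the possible combinations are given by: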

comboGeneral(v = 0:4, m = 4, freqs = c(3, rep(1, 4)))
#>       [,1] [,2] [,3] [,4]
#>  [1,]    0    0    0    1
#>  [2,]    0    0    0    2
#>  [3,]    0    0    0    3
#>  [4,]    0    0    0    4
#>  [5,]    0    0    1    2
#>  [6,]    0    0    1    3
#>  [7,]    0    0    1    4
#>  [8,]    0    0    2    3
#>  [9,]    0    0    2    4
#> [10,]    0    0    3    4
#> [11,]    0    1    2    3
#> [12,]    0    1    2    4
#> [13,]    0    1    3    4
#> [14,]    0    2    3    4
#> [15,]    1    2    3    4
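For readers without RcppAlgos, roughly the same enumeration can be sketched in base R with combn, one subset size at a time. This is not what getCombs does internally and it has no counterpart to the upper argument; it is only meant to show that the zero-padded rows above correspond to the 2^n - 1 non-empty subsets. The name subsets is introduced here for illustration.

```r
# Enumerate every non-empty subset of row indices with base R's combn().
n <- 4  # number of years in the OP's example

# For each subset size k, combn(..., simplify = FALSE) returns a list of index vectors.
subsets <- unlist(
    lapply(seq_len(n), function(k) combn(n, k, simplify = FALSE)),
    recursive = FALSE
)

length(subsets)   # number of non-empty subsets, 2^n - 1
subsets[[5]]      # first two-year combination (rows 1 and 2)
```

The subsets come out in the same order as the comboGeneral rows once the zeros are dropped.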

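Now we iterate over each row of combs, where each row tells us which rows of the original matrix to test for NAs. If a particular combination contains only one year (a single row of the matrix), we determine which indices are not NA. This is easily done with !is.na(myRows). If we have more than one row, we use complete.cases(t(myRows)) to get the columns that contain only numbers (i.e. no occurrence of NA).

After that, we simply use the resulting logical index to pull out the matching column names, and voilà, we have the desired result.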
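The code comment in getCombs notes that this part could probably be improved. One possible simplification, offered only as a sketch, is to subset with drop = FALSE and count NAs per column, which removes the single-row special case. completeCols is a hypothetical helper name and is not part of the answer's code.

```r
# Hypothetical helper: countries with no NA in the chosen rows.
# drop = FALSE keeps a one-row selection as a matrix, so no special case is needed.
completeCols <- function(myMat, rows) {
    sub <- myMat[rows, , drop = FALSE]
    colnames(myMat)[colSums(is.na(sub)) == 0]
}

testMat <- matrix(
    c(NA, 50, NA, 85,
      110, 75, 76, 86,
      120, NA, 78, 87,
      130, 100, 80, 88),
    nrow = 4, byrow = TRUE,
    dimnames = list(as.character(2000:2003), LETTERS[1:4])
)

completeCols(testMat, 1)            # single year 2000
completeCols(testMat, c(2, 3, 4))   # years 2001 to 2003
```

On the OP's matrix these two calls give B and D for 2000, and A, C and D for 2001 to 2003, matching the getCombs output shown earlier.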


1 vote
library(tidyverse)

I first 1) turn the row names (the years) into a column, 2) convert the data to long format, and 3) drop the rows where is.na(value) == TRUE:
df <- data %>%
        mutate(year = rownames(data)) %>%
        gather(countries, value, A:D) %>%
        filter(is.finite(value)) %>%
        arrange(year) %>%
        select(-value)

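valid_countries is a function that filters df for the years of interest (vec is one combination of years), then keeps only the countries found in all of those years. It returns a 2-element vector of [years-of-interest as a comma-separated string, valid-countries as comma-separated string]: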
valid_countries <- function(df, vec) {
    ans <- df %>%
            filter(year %in% vec) %>%
            count(countries) %>%
            filter(n == length(vec)) %>%
            pluck("countries")
    c(toString(vec), toString(unique(sort(ans))))
}

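The following lapply iterates over 1:N, where N is the number of years in the data. For each size in 1:N it draws the unique combinations of that many years, then returns the valid countries according to the criterion above. I use as.data.frame(t(Reduce(...))) to format the data into something easier to read: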
result <- lapply(
            seq_len(length(unique(df$year))),
            function(i) {
                apply(
                    combn(unique(df$year), i), 
                    2,
                    function(j) { valid_countries(df, as.numeric(j)) }
                )
            }
        )
as.data.frame(t(Reduce("cbind", result)))

Result

                       V1         V2
1                    2000       B, D
2                    2001 A, B, C, D
3                    2002    A, C, D
4                    2003 A, B, C, D
5              2000, 2001       B, D
6              2000, 2002          D
7              2000, 2003       B, D
8              2001, 2002    A, C, D
9              2001, 2003 A, B, C, D
10             2002, 2003    A, C, D
11       2000, 2001, 2002          D
12       2000, 2001, 2003       B, D
13       2000, 2002, 2003          D
14       2001, 2002, 2003    A, C, D
15 2000, 2001, 2002, 2003          D    

Data

data <- read.table(text="A     B    C     D 
NA    50    NA    85
110   75    76    86
120   NA    78    87
130   100   80    88", header=TRUE, stringsAsFactors=FALSE)
rownames(data) <- 2000:2003