mgcv 语法中的嵌套样本设计

问题描述 投票:0回答:1

我在将示例设计转换为随机效应的正确 mgcv 包语法时遇到困难。

设置如下:我们每年连续 7 个月在每个“收集”地点拖网捕鱼一次,每月(希望如此)在相同的 12 个地点(我认为这些是我的随机拦截)。然而,有时数据库中的站点要么被意外跳过(NA),要么被备用站点替换(当条件恶劣阻碍我们在那里拖网时,备用站点会替换预期站点)。数据还存在重复测量方面,其中在同一地点记录了许多鱼的长度(这只是协变量,但认为随机截距也用于重复测量)。如果我的目标是忠实地表现这个设计,我应该如何构建我的随机效果?像这样(?):

(fSite = 因子站点;fCYR = 因子日历年)

s(fSite, bs='re') + s(fCYR, , bs='re')

s(fSite, fCYR, bs='re') <- Does this only work if the same 12 sites are present in each fCYR level?

数据:

> dput(station)
structure(list(CYR = c(2009, 2009, 2009, 2009, 2009, 2009, 2010, 
2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 
2011, 2011, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2013, 2013, 
2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014, 
2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 
2016, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 
2019, 2019, 2019, 2019, 2019, 2019), Month = c(6, 7, 8, 9, 10, 
11, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 
9, 10, 11, 4, 5, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10, 5, 6, 7, 
8, 9, 10, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 1, 
3, 5, 6, 7, 8, 9, 10, 11, 2, 3, 5, 6, 7, 8, 9, 10, 11), `Coll. Site 1` = c(21, 
23, 22, 23, 22, 23, 22, 22, 21, 68, 23, 22, 22, 70, 20, 23, 22, 
20, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 22, 22, 22, 
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 
22, 22, NA, NA, 22, 22, 22, NA, NA, 22, 22, 22, 22, 22), `Coll. Site 2` = c(40, 
70, 70, 70, 68, 68, 68, 68, 70, 101, 112, 70, 70, 143, 70, 70, 
68, 67, 70, 67, 67, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70), 
    `Coll. Site 3` = c(70, 112, 112, 122, 123, 112, 122, 122, 
    111, 136, 134, 124, 135, 158, 122, 123, 124, 124, 112, 123, 
    123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 
    123, 123, 123, 123, 123, 123, 124, 123, 123, 123, 123, 123, 
    123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 
    123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 
    123, 123, 123, NA, 123, 123, 123, 123, 123, 123, 123), `Coll. Site 4` = c(147, 
    147, 133, 134, 146, 159, 54, 134, 134, 159, 156, 145, 146, 
    167, 159, 146, 144, 146, 147, 146, 145, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146, 146, 145, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146), `Coll. Site 5` = c(176, 173, 
    168, 168, 169, 172, 168, 168, 168, 168, 168, 169, 167, 241, 
    169, 172, 146, 168, 168, 169, 167, 168, 168, 168, 168, 168, 
    168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 
    168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 
    168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 
    168, 168, 168, 168, 168, 168, 168, 168, NA, 168, 168, 168, 
    168, 168, 168, 168, 168), `Coll. Site 6` = c(241, 258, 241, 
    241, 241, 241, 144, 241, 169, 241, 241, 241, 241, 266, 241, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 240, 241, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 240, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 
    241, 241, 241, 241), `Coll. Site 7` = c(280, 279, 266, 266, 
    253, 266, 24, 266, 241, 270, 266, 266, 266, 270, 270, 266, 
    266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 257, 266, 
    266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 
    266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 
    257, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 
    266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 
    266, NA, 266), `Coll. Site 8` = c(281, 283, 270, 270, 257, 
    270, NA, 270, 266, 279, 270, 270, 270, 301, 301, 270, 270, 
    270, 270, 270, 270, NA, 270, 270, 270, 270, 265, 270, 270, 
    270, 270, 270, 270, 270, 270, NA, 270, 270, 270, 270, 270, 
    270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 266, 
    270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 
    270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 
    270, 270), `Coll. Site 9` = c(314, 302, 301, 314, 301, 301, 
    NA, 301, 270, 301, 301, 301, 301, 402, 402, 301, 301, 301, 
    314, 301, 314, 312, 301, 301, 302, 301, 301, 301, 301, 301, 
    NA, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 
    301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 
    301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 
    301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 
    301), `Coll. Site 10` = c(401, 314, 303, 609, 314, 402, NA, 
    314, 301, 314, 314, 314, 314, 618, 618, 314, 314, 314, 403, 
    402, 402, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 
    314, 314, 314, 314, 314, 314, 314, 315, 314, 314, 314, 314, 
    314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 
    314, 314, 314, 314, 314, 314, 314, 314, NA, 314, 314, 314, 
    314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314
    ), `Coll. Site 11` = c(618, 402, 402, 618, 402, 618, NA, 
    402, 314, 402, 402, 402, 402, 314, 266, 402, 402, 402, 609, 
    618, 618, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 
    402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 
    402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 
    402, NA, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 
    402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402
    ), `Coll. Site 12` = c(NA, 621, 618, NA, 621, NA, NA, 618, 
    402, 618, 619, 618, 618, NA, NA, 621, 618, 618, 616, 314, 
    301, 618, 621, 618, 618, 618, 616, 618, 618, 618, 618, 618, 
    618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 
    618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 
    618, 618, 618, 618, 618, 618, 621, 618, 618, 618, 618, 618, 
    618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618)), row.names = c(NA, 
-79L), class = c("tbl_df", "tbl", "data.frame"))
r mgcv longitudinal random-effects
1个回答
2
投票

只有在每个

fCYR
级别中存在相同的 12 个站点时,这才有效吗?

尝试一下看看。 tldr:是的,它有效;不,它不需要每年都存在相同的 12 个站点。关键是独特的站点应该进行独特的编码。然而,使用您在模型中提到的两种随机效应形式可能是有益的,因为它可以导致更简约的拟合。

假设您的数据位于

df
中并且格式正确,符合我认为您的模型的格式

library("mgcv")
library("tidyr")
library("dplyr")
library("janitor")

df2 <- pivot_longer(df, cols = c(-Month, -CYR), names_to = "site",
             values_to = "value", names_prefix = "^Coll\\. ") |>
janitor::clean_names() |>
mutate(fcyr = factor(cyr), site = factor(site))

我们可以拟合三个模型来比较随机效应的作用:

knots <- list(month = c(0.5, 12.5))
m1 <- gam(value ~ s(month, bs = "cc") +
            s(fcyr, bs = "re") + s(site, bs = "re"),
          data = df2, family = nb(), knots = knots)
m2 <- gam(value ~ s(month, bs = "cc") + s(fcyr, site, bs = "re"),
          data = df2, family = nb(), knots = knots)
m3 <- gam(value ~ s(month, bs = "cc") +
            s(fcyr, bs = "re") + s(site, bs = "re") +
            s(fcyr, site, bs = "re"),
          data = df2, family = nb(), knots = knots)

因为有 132 个独特的(地点、年份)对

> with(df2, nlevels(interaction(fcyr, site, drop = TRUE)))                    
[1] 132

m2
使用了近 131 个 EDF(由于模型截距 IIRC 丢失了一个 df)

> model_edf(m1)                                                               
# A tibble: 1 × 2
  model   edf
  <chr> <dbl>
1 m1     19.8

r$> model_edf(m2)                                                               
# A tibble: 1 × 2
  model   edf
  <chr> <dbl>
1 m2     132.

r$> model_edf(m3)                                                               
# A tibble: 1 × 2
  model   edf
  <chr> <dbl>
1 m3     58.1

m3
包含所有三个 ranef 术语,在 ~58 EDF 上使用。

正如您的评论推测的那样,如果您包含站点和年份随机效应以及它们的组合,那么您会比仅按照

site:fcyr
安装
m2
形式来更简洁地拟合数据。

m2
m3
都允许您估计所有 132 个(地点、年份)对的随机截距。那么有什么区别呢?

正如您的评论所猜测的那样,两个“主要”随机效应编码了每个

site
的平均效应和每个
year
的平均效应。然后,
s(site, fcyr, bs = "re")
项估计各个(站点、年份)对的站点和年份平均值的偏差。一旦您估计了平均
site
year
效应,各个站点因年份而产生的偏差(或等效于各个站点平均值多年来的偏差)就相对较小。因此
s(site,fcyr, bs = "re")
项仅使用 ~40 EDF;由于模型中已经考虑了随机
site
year
效应,因此许多潜在偏差都被缩小了。

所以在这种情况下你可以使用either形式:

  1. ~ s(fcyr, bs = "re") + s(site, bs = "re") + s(fcyr, site, bs = "re")
    ,或
  2. ~ s(fcyr, site, bs = "re")

但选项 1 更省钱。然而情况不一定如此。如果站点内和年份内的变化很大,则分解形式 (1.) 不会比 2 提供任何优势,这并非不可想象。

随机效应的主要作用是确保组的标签是唯一的。假设您有 3 个田地,并从每个田地内的 6 个位置重复进行测量。您不会在每个字段中将六个位置编码为

loc1
loc2
、...、
loc6
,因为模型会认为
location
等于
loc1
的任何行都是同一位置。假设字段标记为
A
B
C
,您可以将位置编码为
locA1
locA2
、...、
locB1
、...、
locC6
以避免任何歧义。在 R 的混合效果软件世界中,这种问题经常出现;在那里,通过像
lme()
lmer()
这样的公式接口的工作方式,你可以摆脱
loc1
编码形式 if 你说
location
嵌套在
field
中。因此讨论交叉或嵌套随机效应。如果您正确编码分组变量,整个讨论就会消失。

对于 mgcv,如果您使用

loc1
形式,然后拟合模型的公式 1. 形式,您就会出错

y ~ s(field, bs="re") + s(location, bs="re") + s(field, location, bs="re")

因为

s(location, bs="re")
术语会认为所有标记为
loc1
的观察结果都是同一位置(“主题”),但事实并非如此。在这种情况下,模型的 2. 形式将是正确的形式

y ~ s(field, location, bs="re")

总而言之,您使用哪种形式实际上取决于您如何编码分组因素,而不是设计是嵌套还是交叉。

© www.soinside.com 2019 - 2024. All rights reserved.