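I have a small dataset (n = 28) of count data for three seabird species, and I fitted hurdle GAMs (using mgcv::gam()): first a binomial model for presence/absence, then a negative binomial model for the counts at points where the species is present. Restricting to presences leaves sample sizes of 12, 12 and 22 for the three seabirds. The seabird data are also overdispersed, with many zeros and generally low counts (< 10) at presence points. These are the models for each seabird:
# three seabirds: prion, storm petrel, sooty shearwater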
prion_binary <- mgcv::gam(prion_binary ~ s(avg_SST) +
                            s(avg_SSS) +
                            s(delta_SST) +
                            s(delta_SSS) +
                            s(distance, k = 8) +  # 9 different distances
                            s(total_zp) +         # total zooplankton
                            s(trip_factor, bs = "re"),
                          method = "ML",
                          family = binomial(link = "logit"),
                          data = seabird)
prion_count <- mgcv::gam(prion ~ s(avg_SST) +
                           s(avg_SSS) +
                           s(delta_SST) +
                           s(delta_SSS) +
                           s(distance, k = 5) +  # 6 different distances
                           s(total_zp) +         # total zooplankton
                           s(trip_factor, bs = "re"),
                         method = "ML",
                         family = "ziP",
                         data = seabird[seabird$prion > 0, ])
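My problem is that the model output shows very high deviance explained, yet almost no significant predictors. In one case the adjusted R-squared is also negative. I think I may have too many predictors, but when I run single-variable models, every predictor explains some deviance with p < 0.05, so I am not sure which to remove. The residual plots also don't align with such high deviance explained.
I'm not sure where to go next, so any help would be appreciated.
Here is the output for the three seabird models: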
prion_binary
Family: binomial
Link function: logit
Formula:
prion_binary ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) + s(delta_SSS) +
s(distance, k = 8) + s(total_zp) + s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.557 4.021 -0.636 0.525
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000 1.000 0.687 0.407
s(avg_SSS) 1.000 1.000 0.324 0.569
s(delta_SST) 1.000 1.000 0.282 0.596
s(delta_SSS) 1.000 1.000 0.440 0.507
s(distance) 1.000 1.000 0.963 0.326
s(total_zp) 1.742 2.051 0.782 0.736
s(trip_factor) 1.349 3.000 3.615 0.120
R-sq.(adj) = 0.995 Deviance explained = 97.5%
-ML = 6.8535 Scale est. = 1 n = 28
prion count:
Family: Negative Binomial(2277108.965)
Link function: log
Formula:
prion ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) + s(delta_SSS) +
s(distance, k = 5) + s(total_zp) + s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8218 0.1979 4.153 3.29e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000e+00 1 0.017 0.896
s(avg_SSS) 1.000e+00 1 0.675 0.411
s(delta_SST) 1.000e+00 1 0.085 0.771
s(delta_SSS) 1.000e+00 1 0.148 0.700
s(distance) 1.000e+00 1 0.727 0.394
s(total_zp) 1.000e+00 1 0.059 0.809
s(trip_factor) 1.016e-07 2 0.000 0.508
R-sq.(adj) = -0.177 Deviance explained = 55.1%
-ML = 17.514 Scale est. = 1 n = 12
sooty shearwater binary
Family: binomial
Link function: logit
Formula:
shearwater_binary ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) +
s(delta_SSS, k = 15) + s(distance, k = 8) + s(total_zp) +
s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.611 2.656 1.36 0.174
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000 1 0.917 0.33829
s(avg_SSS) 1.000 1 0.914 0.33915
s(delta_SST) 1.000 1 0.000 0.99210
s(delta_SSS) 1.000 1 0.017 0.89504
s(distance) 1.000 1 0.004 0.94848
s(total_zp) 1.000 1 0.113 0.73652
s(trip_factor) 1.141 3 11.683 0.00514 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.472 Deviance explained = 59.4%
-ML = 9.4597 Scale est. = 1 n = 28
shearwater count
Family: Negative Binomial(6642419.022)
Link function: log
Formula:
shearwater ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) + s(delta_SSS) +
s(distance, k = 8) + s(total_zp) + s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9002 0.1211 15.7 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000e+00 1.000 1.841 0.1748
s(avg_SSS) 3.606e+00 4.272 125.264 < 2e-16 ***
s(delta_SST) 1.000e+00 1.000 5.619 0.0178 *
s(delta_SSS) 1.000e+00 1.000 18.094 2.11e-05 ***
s(distance) 4.657e+00 5.328 277.393 < 2e-16 ***
s(total_zp) 1.000e+00 1.000 10.490 0.0012 **
s(trip_factor) 9.002e-07 3.000 0.000 0.4375
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.999 Deviance explained = 99.5%
-ML = 67.138 Scale est. = 1 n = 22
storm petrel binary
Family: binomial
Link function: logit
Formula:
storm_petrel_binary ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) +
s(delta_SSS) + s(distance, k = 8) + s(total_zp) + s(trip_factor,
bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.7884 0.9928 -0.794 0.427
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000e+00 1.00 0.174 0.676
s(avg_SSS) 1.000e+00 1.00 0.190 0.663
s(delta_SST) 1.000e+00 1.00 1.038 0.308
s(delta_SSS) 1.000e+00 1.00 0.000 0.996
s(distance) 3.003e+00 3.69 5.302 0.213
s(total_zp) 1.000e+00 1.00 0.039 0.844
s(trip_factor) 5.115e-07 3.00 0.000 0.369
R-sq.(adj) = 0.595 Deviance explained = 64.9%
-ML = 12.629 Scale est. = 1 n = 28
storm petrel count
Family: Negative Binomial(1572380.699)
Link function: log
Formula:
storm_petrel ~ s(avg_SST) + s(avg_SSS) + s(delta_SST) + s(delta_SSS) +
s(distance, k = 5) + s(total_zp) + s(trip_factor, bs = "re")
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8340 0.2065 4.039 5.36e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(avg_SST) 1.000e+00 1 2.654 0.1033
s(avg_SSS) 1.000e+00 1 0.861 0.3535
s(delta_SST) 1.000e+00 1 1.389 0.2386
s(delta_SSS) 1.000e+00 1 4.626 0.0315 *
s(distance) 1.000e+00 1 0.562 0.4534
s(total_zp) 1.000e+00 1 0.580 0.4463
s(trip_factor) 1.018e-07 2 0.000 0.2196
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.647 Deviance explained = 73.8%
-ML = 19.901 Scale est. = 1 n = 12
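My guess as to what is happening (though it is impossible to know without seeing the data) is that your explanatory variables are highly correlated with one another. The significance of each variable is based on the additional variance explained when that variable is added to a reduced model containing all the variables except that one. So if your explanatory variables are collinear, adding one more explains little variance that the others do not already explain, and each term can look non-significant even though the model as a whole fits well.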
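To make the collinearity point concrete, here is a minimal sketch in base R. It uses simulated stand-in data, not the real `seabird` data frame, and a plain linear model rather than a GAM, so treat it as an illustration of the mechanism only. The variable names are borrowed from the question purely for readability.

```r
# Simulated stand-in data (assumption: the real `seabird` data are not available here).
set.seed(1)
n <- 28
avg_SST   <- rnorm(n)
delta_SST <- avg_SST + rnorm(n, sd = 0.05)  # nearly a copy of avg_SST
y <- 1 + 2 * avg_SST + rnorm(n)

# Pairwise correlation between the two predictors
r <- cor(avg_SST, delta_SST)

# p-value of each predictor when it is the only term in the model
p_alone <- c(
  avg_SST   = summary(lm(y ~ avg_SST))$coefficients["avg_SST", "Pr(>|t|)"],
  delta_SST = summary(lm(y ~ delta_SST))$coefficients["delta_SST", "Pr(>|t|)"]
)

# p-values when both predictors are in the model together
p_joint <- summary(lm(y ~ avg_SST + delta_SST))$coefficients[-1, "Pr(>|t|)"]

# Variance inflation factor for the two-predictor case: 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

print(round(c(correlation = r, VIF = vif), 3))
print(rbind(alone = p_alone, together = p_joint))
```

On your own data, a quick first check would be `cor()` on the predictor columns. For a fitted GAM, mgcv also provides `concurvity()`, which is the smooth-term analogue of this check.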
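In addition, you definitely have too many predictors for the amount of data you have. With only 12 data points, you probably don't want more than one or two predictors (though do read other opinions on this elsewhere).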
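One possible way forward is to run a principal component analysis on the explanatory variables, or on naturally grouped subsets of them. If one or two principal components explain a large fraction of the variance in the explanatory variables, use those components as predictors.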
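As a rough sketch of the PCA route, the following base R code reduces four simulated predictors to two component scores. The data are invented for illustration (two latent gradients standing in for temperature and salinity), and the column names are borrowed from the question only for readability; nothing here reflects the real data.

```r
# Simulated stand-ins for the temperature and salinity predictors (assumption).
set.seed(2)
n <- 28
temp <- rnorm(n)  # hypothetical latent temperature gradient
sal  <- rnorm(n)  # hypothetical latent salinity gradient
env <- data.frame(
  avg_SST   = temp + rnorm(n, sd = 0.2),
  delta_SST = temp + rnorm(n, sd = 0.2),
  avg_SSS   = sal  + rnorm(n, sd = 0.2),
  delta_SSS = sal  + rnorm(n, sd = 0.2)
)

# PCA on centred, unit-variance predictors
pca <- prcomp(env, center = TRUE, scale. = TRUE)

# Proportion of predictor variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Keep the first two component scores as candidate predictors
scores <- as.data.frame(pca$x[, 1:2])
head(scores)
```

If the first one or two components carry most of the variance, the score columns (`PC1`, `PC2`) could be bound onto the data frame and used in place of the original predictors, for example as `s(PC1) + s(PC2)` in the `gam()` formula. The cost is that the components are harder to interpret ecologically than the original variables.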
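Another possibility is to drop any predictors that seem less important a priori (and not post hoc, unless you are only doing exploratory data analysis).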