从独立负二项式构造狄利克雷多项式

Question

我有一个数据集，其中包含 ISO3 国家代码、年份（从 2010 年到 2019 年）、死因类别（总共 24 个类别）和死亡人数。各种原因造成的死亡总和是固定的。我的目标是预测 2020-2023 年期间每个国家/原因组合的预期死亡人数。

我不被允许分享数据，但这里有一个只有一个国家和 4 个死因类别的数据样本：

# A tibble: 40 × 3
    Year Cause                      Deaths
   <dbl> <chr>                       <dbl>
 1  2010 Communicable diseaes       97203.
 2  2010 Non-communicable diseases   2743.
 3  2010 Injuries                    7128.
 4  2010 Other                      12712.
 5  2011 Communicable diseaes       96536.
 6  2011 Non-communicable diseases   2716.
 7  2011 Injuries                    7264.
 8  2011 Other                      13088.
 9  2012 Communicable diseaes       95786.
10  2012 Non-communicable diseases   2642.
11  2012 Injuries                    7140.
12  2012 Other                      13556.
13  2013 Communicable diseaes       94153.
14  2013 Non-communicable diseases   2652.
15  2013 Injuries                    7191.
16  2013 Other                      14058.
17  2014 Communicable diseaes       92192.
18  2014 Non-communicable diseases   2828.
19  2014 Injuries                    7288.
20  2014 Other                      14825.
21  2015 Communicable diseaes       96254.
22  2015 Non-communicable diseases   2617.
23  2015 Injuries                    8128.
24  2015 Other                      16339.
25  2016 Communicable diseaes      101976.
26  2016 Non-communicable diseases   2470.
27  2016 Injuries                    8366.
28  2016 Other                      17007.
29  2017 Communicable diseaes       97219 
30  2017 Non-communicable diseases   2365 
31  2017 Injuries                    7675 
32  2017 Other                      15686 
33  2018 Communicable diseaes       95826 
34  2018 Non-communicable diseases   2180 
35  2018 Injuries                    7937 
36  2018 Other                      15807 
37  2019 Communicable diseaes       97264 
38  2019 Non-communicable diseases   2080 
39  2019 Injuries                    8031 
40  2019 Other                      15660

我的想法是使用 R 中的 gam 函数为每个国家/原因组合运行独立的负二项式回归模型，因为我的数据明显存在过度分散。以下是数据样本的代码示例（例如只有一个国家/地区）：

# original dataset
df <- df %>%
  mutate(expected_cause = NA,
         expected_cause_se = NA) %>%
  arrange(Year)

# create a dataset to stored predictions for 2020-2023
pred_pand <- expand.grid(Year = rep(c(2020, 2021,2022,2023), each = 1),
                               Cause = rep(unique(df$Cause))) %>%
  mutate(expected_cause = NA,
         expected_cause_se = NA) %>%
  arrange(Year)

for (i in unique(df$Cause)) {
  # create temp dataset
  whichs <- which(df$Cause == i)
  temp <- df[whichs, ]

  # run the model
  model <- gam(Deaths ~ s(Year), data = temp, family = nb(theta = NULL, link = "log"))
  pred <- predict(model, se.fit = TRUE, type = "response", newdata = 
  data.frame(Year = c(2020,2021,2022,2023)))

  # store predictions for 2020-2023
  whichs_pred <- which(pred_pand$Cause == i)
  pred_pand[whichs_pred, "expected_cause"] <- pred$fit
  pred_pand[whichs_pred, "expected_cause_se"] <- pred$se.fit
  
  # store predictions for years prior to 2019
  years <- unique(temp$Year)
  range_years <- min(years):max(years)
  pred_hist <- predict(model, se.fit = TRUE, type = "response",
                       newdata = data.frame(Year = range_years))
  df[whichs, "expected_cause"] <- pred_hist$fit
  df[whichs, "expected_cause_se"] <- pred_hist$se.fit

}

为了确保各种原因造成的死亡总和等于已知的固定总数，我似乎需要从这些独立的负二项式构造一个狄利克雷多项式。请参阅此参考文献：https://arxiv.org/pdf/2001.04343“狄利克雷多项分布相当于独立负二项分布的集合，其具有以其总和为条件的相同尺度参数”。这是正确的方法吗？

如果是这样，我不确定我需要在 R 中做什么来应用它。我在下面的代码中尝试了不同的 theta 值，但各种原因的死亡总和并没有达到固定的总计 + 设置定义的 theta 值会影响模型拟合。

model <- gam(Deaths ~ s(Year), data = temp, family = nb(theta = **NULL**, link = "log"))

我该怎么办？如果我手动重新调整按原因划分的预期死亡人数，以使各种原因的总和与固定总数匹配，则会影响模型拟合。如果我不重新调整比例，各种原因的总和与固定总数不匹配。

Answer 1

我认为 arxiv 论文中提到的方法是完全错误的。对于负二项式的总和，作者指出总和 M 也分布为负二项式。

M ~ NB(a₀, theta)，其中 a₀=Sum_i a_i

这就是为什么它对你不起作用，M 不是一个固定值，只有出于某种巧合，它的采样才可能等于你的数据值。

狄利克雷多项式条件和分布应从一开始就施加，而不是稍后尝试将其附加。

不确定千篇一律的解决方案，但我会从 Bioconductor 的 Dirichlet-Multinomial 开始 https://bioconductor.org/packages/release/bioc/html/DirichletMultinomial.html

看起来是合理的封装，不需要采样狄利克雷分量的数量，你已经知道了。可能需要挖掘内部以提供惩罚等

从独立负二项式构造狄利克雷多项式

问题描述投票：0回答：1

1个回答

最新问题

从独立负二项式构造狄利克雷多项式

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1