Simulating a data sample

Question

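I have the following probability for each group, and each group represents a range of values. My goal is to simulate 1,234 rows of data that match the groups and their percentages: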

ages = c(21:29, 30:39,40:49, 50:59, 60:69, 70:79, 80:89, 90:99)
age_probs = c(10.85,12.64,14.02,25.00,19.01,11.45,7.01,0.01) / 100

age_bins = sapply(list(21:29, 30:39,40:49, 50:59, 60:69, 70:79, 80:89, 90:99), length)
age_weighted = rep(age_probs/age_bins, age_bins)

set.seed(1)
n = 1234
data = data.frame(ID = sample(n),
                  Age = sample(ages, size = n, prob = age_weighted, replace = TRUE))

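However, the percentages in the simulated data do not match the targets, and sometimes the difference is too large (I think because the data set is not big enough). I found another post, which mentions that this happens because our "view" of the randomness is effectively "one cell at a time" instead of "one column at a time"; this refers to the sample() function.

How can I change the sample function so that it better represents the population percentages?

Oh, and this is how I check the data frame column: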

to_export = data[order(data$ID),]


for (i in (1:length(to_export$Age))) {
  if (to_export$Age[i] >= 21 & to_export$Age[i] <= 29) to_export$block[i] = "21-29"
  if (to_export$Age[i] >= 30 & to_export$Age[i] <= 39) to_export$block[i] = "30-39"
  if (to_export$Age[i] >= 40 & to_export$Age[i] <= 49) to_export$block[i] = "40-49"
  if (to_export$Age[i] >= 50 & to_export$Age[i] <= 59) to_export$block[i] = "50-59"
  if (to_export$Age[i] >= 60 & to_export$Age[i] <= 69) to_export$block[i] = "60-69"
  if (to_export$Age[i] >= 70 & to_export$Age[i] <= 79) to_export$block[i] = "70-79"
  if (to_export$Age[i] >= 80 & to_export$Age[i] <= 89) to_export$block[i] = "80-89"
  if (to_export$Age[i] >= 90) to_export$block[i] = "90+"

}

#to_export

library(dplyr)  # provides %>%, group_by(), summarise() and n()
age_table = to_export %>% group_by(block) %>% summarise(percentage = round(n()/1234 * 100,2))

age_table

Answer 1

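I suggest a redesign. I am using dplyr and ggplot, but they are not essential:

library(dplyr)    # only needed for the percentage summary
library(ggplot2)  # only needed for the plot

set.seed(1)
n = 1234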

# Definition of the age buckets
ages = c("21:29", "30:39","40:49", "50:59", "60:69", "70:79", "80:89", "90:99")

# probability for each bucket
age_probs = c(10.85,12.64,14.02,25.00,19.01,11.45,7.01,0.01)

# normalise the probabilities since they don't add up to 1
c_age_probs = cumsum(age_probs)/sum(age_probs)

# create the data.frame
data = data.frame(ID = 1:n,
                  Age = ages[findInterval(runif(n), c_age_probs) + 1])

# plotting the data
ggplot(data, aes(x=Age)) + 
  geom_bar()

Given the specified probabilities, the plot of the data looks reasonable. Let's look at the percentages:

# getting the percentage
data %>%
  group_by(Age) %>%
  summarise(percentage = n()/n)

#   A tibble: 7 x 2
#   Age   percentage
#   <chr>      <dbl>
# 1 21:29     0.0989
# 2 30:39     0.105 
# 3 40:49     0.133 
# 4 50:59     0.269 
# 5 60:69     0.198 
# 6 70:79     0.126 
# 7 80:89     0.0705
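
The shares above differ from the targets by up to a couple of percentage points, and the "90:99" bucket does not appear at all. Deviations of this order are what independent random draws tend to produce at n = 1234. As a rough check, assuming each bucket count is approximately binomial (an assumption of this sketch, not something stated in the original answer), the standard error of each observed share can be estimated as follows; `p` and `se` are illustrative names:

```r
n <- 1234
age_probs <- c(10.85, 12.64, 14.02, 25.00, 19.01, 11.45, 7.01, 0.01)

# normalise, since the percentages do not add up to exactly 100
p <- age_probs / sum(age_probs)

# binomial standard error of an observed share for each bucket
se <- sqrt(p * (1 - p) / n)

# target share and its standard error, both in percentage points
data.frame(target_pct = round(100 * p, 2),
           se_pct     = round(100 * se, 2))
```

For most buckets the standard error comes out at around one percentage point, so a larger n narrows the gap, but no purely random draw will hit the targets exactly.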

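The key part is ages[findInterval(runif(n), c_age_probs) + 1]. I generate uniformly distributed random numbers and use the cumulative (and normalised) probabilities to look up the corresponding age bucket. That way I don't even need to write multiple case_when statements.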
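
To make the lookup concrete, here is a minimal sketch that replaces runif(n) with four hand-picked values (the vector `u` is hypothetical, chosen only for illustration) and shows which bucket each one falls into:

```r
ages <- c("21:29", "30:39", "40:49", "50:59", "60:69", "70:79", "80:89", "90:99")
age_probs <- c(10.85, 12.64, 14.02, 25.00, 19.01, 11.45, 7.01, 0.01)

# cumulative, normalised probabilities: upper edge of each bucket on [0, 1]
c_age_probs <- cumsum(age_probs) / sum(age_probs)

# hand-picked stand-ins for uniform random numbers
u <- c(0.05, 0.20, 0.50, 0.95)

# findInterval() counts how many upper edges lie at or below each value;
# adding 1 turns that count into the index of the bucket containing the value
idx <- findInterval(u, c_age_probs) + 1
picked <- ages[idx]
picked
# "21:29" "30:39" "50:59" "80:89"
```

A value below the first cumulative edge (about 0.1085) lands in the first bucket, a value between the first and second edges lands in the second, and so on, so each bucket is selected with a probability equal to its width on the [0, 1] scale.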
