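I have the following probabilities for each group, and each group represents a range of values. My goal is to simulate 1,234 rows of data that match the groups and percentages: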
ages = c(21:29, 30:39,40:49, 50:59, 60:69, 70:79, 80:89, 90:99)
age_probs = c(10.85,12.64,14.02,25.00,19.01,11.45,7.01,0.01) / 100
age_bins = sapply(list(21:29, 30:39,40:49, 50:59, 60:69, 70:79, 80:89, 90:99), length)
age_weighted = rep(age_probs/age_bins, age_bins)
set.seed(1)
n = 1234
data = data.frame(ID = sample(n),
Age = sample(ages, size = n, prob = age_weighted, replace = TRUE))
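However, the percentages in the simulated data don't match the targets, and sometimes the difference is too large (I think because the data set isn't big enough). I found another post, which mentions that this happens because our "view" of the randomness is effectively "one cell at a time", instead of "one column at a time". That remark refers to the sample() function.
How can I change the sampling so that it better represents the population percentages?
Oh, and this is how I check the data frame column: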
to_export = data[order(data$ID),]
for (i in (1:length(to_export$Age))) {
if (to_export$Age[i] >= 21 & to_export$Age[i] <= 29) to_export$block[i] = "21-29"
if (to_export$Age[i] >= 30 & to_export$Age[i] <= 39) to_export$block[i] = "30-39"
if (to_export$Age[i] >= 40 & to_export$Age[i] <= 49) to_export$block[i] = "40-49"
if (to_export$Age[i] >= 50 & to_export$Age[i] <= 59) to_export$block[i] = "50-59"
if (to_export$Age[i] >= 60 & to_export$Age[i] <= 69) to_export$block[i] = "60-69"
if (to_export$Age[i] >= 70 & to_export$Age[i] <= 79) to_export$block[i] = "70-79"
if (to_export$Age[i] >= 80 & to_export$Age[i] <= 89) to_export$block[i] = "80-89"
if (to_export$Age[i] >= 90) to_export$block[i] = "90+"
}
#to_export
library(dplyr)  # provides %>%, group_by(), summarise() and n()
age_table = to_export %>% group_by(block) %>% summarise(percentage = round(n()/1234 * 100,2))
age_table
I suggest a redesign. I'm using dplyr and ggplot2, but they aren't strictly needed:
library(dplyr)    # group_by(), summarise(), %>%
library(ggplot2)  # ggplot(), geom_bar()
set.seed(1)
n = 1234
# Definition of the age buckets
ages = c("21:29", "30:39","40:49", "50:59", "60:69", "70:79", "80:89", "90:99")
# probability for each bucket
age_probs = c(10.85,12.64,14.02,25.00,19.01,11.45,7.01,0.01)
# normalise the probabilities since they don't add up to 1
c_age_probs = cumsum(age_probs)/sum(age_probs)
# create the data.frame
data = data.frame(ID = 1:n,
Age = ages[findInterval(runif(n), c_age_probs) + 1])
# plotting the data
ggplot(data, aes(x=Age)) +
geom_bar()
Given the probabilities, the plot of the simulated data looks fine. Let's look at the percentages:
# getting the percentage
data %>%
group_by(Age) %>%
summarise(percentage = n()/n)
# A tibble: 7 x 2
# Age percentage
# <chr> <dbl>
# 1 21:29 0.0989
# 2 30:39 0.105
# 3 40:49 0.133
# 4 50:59 0.269
# 5 60:69 0.198
# 6 70:79 0.126
# 7 80:89 0.0705
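As a rough check on how much of the remaining gap is plain sampling noise, the sketch below computes the binomial standard error of each bucket's share at n = 1234. The names p and se are made up for this illustration, and treating each bucket on its own as a binomial proportion is an assumption of the sketch, not something taken from the code above.

```r
n = 1234
age_probs = c(10.85, 12.64, 14.02, 25.00, 19.01, 11.45, 7.01, 0.01)

# target share of each bucket, normalised to sum to 1
p = age_probs / sum(age_probs)

# standard error of an observed share when drawing n independent rows
se = sqrt(p * (1 - p) / n)

# target share and its standard error, both in percent
round(data.frame(target = 100 * p, se = 100 * se), 2)
```

With these inputs every standard error is below 1.3 percentage points, so a deviation of one or two points in a single bucket is within what random sampling at this n can produce.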
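The key part is ages[findInterval(runif(n), c_age_probs) + 1]. I generate uniformly distributed random numbers and use the cumulative (and normalised) probabilities to look up the corresponding age bucket. This way I don't even need multiple case_when statements.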
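To make the lookup concrete, here is a minimal sketch that replaces runif(n) with a few hand-picked values. The vector u and the name idx are invented for this illustration; everything else mirrors the code above.

```r
ages = c("21:29", "30:39", "40:49", "50:59", "60:69", "70:79", "80:89", "90:99")
age_probs = c(10.85, 12.64, 14.02, 25.00, 19.01, 11.45, 7.01, 0.01)

# upper edge of each bucket on the 0..1 scale
c_age_probs = cumsum(age_probs) / sum(age_probs)

# hand-picked stand-ins for runif(n) (illustrative values only)
u = c(0.05, 0.20, 0.50, 0.95)

# findInterval() counts how many edges are <= u, so adding 1 gives the bucket index
idx = findInterval(u, c_age_probs) + 1
ages[idx]
# [1] "21:29" "30:39" "50:59" "80:89"
```

A uniform value falls into a bucket with probability equal to that bucket's width on the 0 to 1 scale, which is why the bucket shares follow age_probs on average, though not exactly in any single draw.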