我有一个针对不同状态的序数数据集。由于每个州都有不同的人口和参与者。如何以适当的速率推广它们以确定每个州的最大发生率。例如,吸烟的人/在R的10,000人口
sitecode AK AL AR AZB CA CO DE FL HI IA IL KS KY MI MO MS MT NE NJ NV NY OK PA RI SC SD TN WI WV WY
class sex
No female 1591 2324 3948 1784 555 885 6853 2363 383 1633 3787 1840 2305 9653 3373 4215 8966 2738 655 1705 6204 3951 1602 3448 2745 527 3858 767 4092 6414
male 1425 1951 3243 1630 544 847 6117 2115 338 1419 3193 1712 2013 8548 3030 3300 8143 2424 529 1505 5513 3643 1517 2880 2031 459 3436 654 3624 5774
Yes female 146 223 508 304 68 83 615 297 63 155 394 154 233 988 324 405 903 281 66 219 650 381 148 320 304 51 410 87 437 704
male 73 98 167 92 22 37 255 92 18 52 175 74 121 444 136 140 377 98 29 78 218 166 49 116 101 11 147 22 179 309
我会通过将这些数据重新整形为长格式(例如3列:类,性别,状态;以及每个组合的~400行)来处理此问题,然后连接到具有州群的单独表(最好包括相同的州缩写),然后按州人口标准化每个数字。
这是一个例子。
library(tidyverse)
df %>%
gather(state, count, AK:WY) %>% # easier to work with if we gather into long form
left_join( # left_join will pull in matching data for each row
state.x77 %>% # ...using data included in R's base datasets,
as.data.frame() %>% # updated in this case to just show state abbrev
rownames_to_column() %>% # and a population number
mutate(state = state.abb) %>%
select(state, Population)
) %>%
mutate(rate_10k = count / (Population * 10)) %>% # Find rate per 10k
filter(class == "Yes") %>% # Only include the "Yes" answers
ggplot(aes(state %>% fct_reorder(rate_10k),
rate_10k,
color = sex)) +
geom_point() +
coord_flip() +
labs(x = "Rate per 10,000 population",
y = "State")