Suppose I have the iris data.
I know I can create a variable that shows which quantile-based range each value falls into:
library(tidyverse)
iris %>% mutate(Range = cut(Sepal.Length, quantile(Sepal.Length, probs=c(0,.2,.4,.6,.8,1)),include.lowest=TRUE))
This produces:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Range
1 4.3 3.0 1.1 0.1 setosa [4.3,4.6]
2 4.4 2.9 1.4 0.2 setosa [4.3,4.6]
3 4.6 3.1 1.5 0.2 setosa [4.3,4.6]
4 4.6 3.4 1.4 0.3 setosa [4.3,4.6]
5 4.7 3.2 1.3 0.2 setosa (4.6,4.8]
6 4.8 3.4 1.6 0.2 setosa (4.6,4.8]
7 4.8 3.0 1.4 0.1 setosa (4.6,4.8]
8 4.9 3.0 1.4 0.2 setosa (4.8,5]
9 4.9 3.1 1.5 0.1 setosa (4.8,5]
10 5.0 3.6 1.4 0.2 setosa (4.8,5]
11 5.0 3.4 1.5 0.2 setosa (4.8,5]
12 5.1 3.5 1.4 0.2 setosa (5,5.4]
13 5.4 3.9 1.7 0.4 setosa (5,5.4]
14 5.4 3.7 1.5 0.2 setosa (5,5.4]
15 5.7 4.4 1.5 0.4 setosa (5.4,5.8]
16 5.8 4.0 1.2 0.2 setosa (5.4,5.8]
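How can I create another variable that shows the percentile range each observation belongs to? I don't want to build it by hand with ifelse statements and the like; I'm hoping there is a function that can create it automatically.
I'm looking for something that produces a table like this: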
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Percent Range
1 4.3 3.0 1.1 0.1 setosa [4.3,4.6] [0,.2]
2 4.4 2.9 1.4 0.2 setosa [4.3,4.6] [0,.2]
3 4.6 3.1 1.5 0.2 setosa [4.3,4.6] [0,.2]
4 4.6 3.4 1.4 0.3 setosa [4.3,4.6] [0,.2]
5 4.7 3.2 1.3 0.2 setosa (4.6,4.8] (.2,.4]
6 4.8 3.4 1.6 0.2 setosa (4.6,4.8] (.2,.4]
7 4.8 3.0 1.4 0.1 setosa (4.6,4.8] (.2,.4]
8 4.9 3.0 1.4 0.2 setosa (4.8,5] (.4,.6]
9 4.9 3.1 1.5 0.1 setosa (4.8,5] (.4,.6]
10 5.0 3.6 1.4 0.2 setosa (4.8,5] (.4,.6]
11 5.0 3.4 1.5 0.2 setosa (4.8,5] (.4,.6]
12 5.1 3.5 1.4 0.2 setosa (5,5.4] (.6,.8]
13 5.4 3.9 1.7 0.4 setosa (5,5.4] (.6,.8]
14 5.4 3.7 1.5 0.2 setosa (5,5.4] (.6,.8]
15 5.7 4.4 1.5 0.4 setosa (5.4,5.8] (.8,1]
16 5.8 4.0 1.2 0.2 setosa (5.4,5.8] (.8,1]
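Yes, the See Also section of the ?quantile help page points you to the ecdf function, "for empirical distributions of which quantile is an inverse".
Interestingly, ecdf() itself returns a function, so we have to create a function with it and then call that function on the input. We can then cut the result just as we did with the quantiles.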
iris %>%
mutate(
Range = cut(Sepal.Length, quantile(Sepal.Length, probs=c(0,.2,.4,.6,.8,1)),include.lowest=TRUE),
ecdf = cut(ecdf(Sepal.Length)(Sepal.Length), breaks = c(0, 0.2, .4, .6, .8, 1), include.lowest = TRUE)
)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Range ecdf
# 1 5.1 3.5 1.4 0.2 setosa (5,5.6] (0.2,0.4]
# 2 4.9 3.0 1.4 0.2 setosa [4.3,5] [0,0.2]
# 3 4.7 3.2 1.3 0.2 setosa [4.3,5] [0,0.2]
# 4 4.6 3.1 1.5 0.2 setosa [4.3,5] [0,0.2]
# 5 5.0 3.6 1.4 0.2 setosa [4.3,5] (0.2,0.4]
# 6 5.4 3.9 1.7 0.4 setosa (5,5.6] (0.2,0.4]
# 7 4.6 3.4 1.4 0.3 setosa [4.3,5] [0,0.2]
# 8 5.0 3.4 1.5 0.2 setosa [4.3,5] (0.2,0.4]
# ...
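Note that rows 5 and 8 above (Sepal.Length 5.0) land in the first Range bin but the second ecdf bin. The sketch below is one way to see why, using only base R. The variable names (Fn, by_value, by_ecdf) are made up for illustration, and it assumes the default quantile type.

```r
x <- iris$Sepal.Length
prob <- c(0, 0.2, 0.4, 0.6, 0.8, 1)

# ecdf() returns a function; calling it gives the share of observations <= its argument
Fn <- ecdf(x)
Fn(5.0)  # 32 of the 150 values are <= 5.0, so this is 32/150

# The 20% quantile is itself 5.0, so a value of 5.0 sits on the
# closed upper edge of the first value-based bin ...
q <- quantile(x, probs = prob)
by_value <- cut(x, breaks = q, include.lowest = TRUE)

# ... while its ECDF value (about 0.213) is already past the 0.2 break
by_ecdf <- cut(Fn(x), breaks = prob, include.lowest = TRUE)

# Cross-tabulate to see where the two binnings disagree
table(by_value, by_ecdf)
```

With heavily tied data like this, the two columns only agree away from the break points, so pick whichever definition matches what you mean by "percentile range".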
Here is an approach using cut with labels=, which ensures you get the corresponding matching ranges:
prob <- c(0, .2, .4, .6, .8, 1)
# one label per interval, so there is one fewer label than break points
lab <- paste(head(prob, -1), tail(prob, -1), sep = ",")
# quantile() has no labels argument; the labels belong in cut()
quant <- quantile(iris$Sepal.Length, probs = prob)
iris %>%
  mutate(Percent = cut(Sepal.Length, breaks = quant, include.lowest = TRUE),
         Range = cut(Sepal.Length, breaks = quant, labels = lab, include.lowest = TRUE))
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Percent Range
1 5.1 3.5 1.4 0.2 setosa (5,5.6] 0.2,0.4
2 4.9 3.0 1.4 0.2 setosa [4.3,5] 0,0.2
3 4.7 3.2 1.3 0.2 setosa [4.3,5] 0,0.2
4 4.6 3.1 1.5 0.2 setosa [4.3,5] 0,0.2
5 5.0 3.6 1.4 0.2 setosa [4.3,5] 0,0.2
6 5.4 3.9 1.7 0.4 setosa (5,5.6] 0.2,0.4
7 4.6 3.4 1.4 0.3 setosa [4.3,5] 0,0.2
8 5.0 3.4 1.5 0.2 setosa [4.3,5] 0,0.2
9 4.4 2.9 1.4 0.2 setosa [4.3,5] 0,0.2
10 4.9 3.1 1.5 0.1 setosa [4.3,5] 0,0.2
11 5.4 3.7 1.5 0.2 setosa (5,5.6] 0.2,0.4
...
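If you want the labels in the interval notation from the question, such as [0,0.2] and (0.2,0.4], one possible variation is to let cut() generate them from the probabilities themselves. This is a sketch rather than part of the answer above: building lab with levels(cut(...)) is my own assumption about what reads best, and it uses base R assignment instead of mutate() so it runs without extra packages.

```r
prob <- c(0, 0.2, 0.4, 0.6, 0.8, 1)

# Cutting the probabilities by themselves yields one interval label per bin
lab <- levels(cut(prob, breaks = prob, include.lowest = TRUE))

# Value breaks at the same probabilities; cut() needs these to be unique
quant <- quantile(iris$Sepal.Length, probs = prob)

df <- iris
# Percent: the value interval; Range: the matching probability interval
df$Percent <- cut(df$Sepal.Length, breaks = quant, include.lowest = TRUE)
df$Range <- cut(df$Sepal.Length, breaks = quant, labels = lab, include.lowest = TRUE)

head(df)
```

Because both columns come from the same breaks, each Percent interval always pairs with the same Range label. If two quantiles coincide (possible with many ties), cut() will stop with an error about non-unique breaks, so that case would need handling separately.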