How to create a variable showing the percentile range an observation falls in

Question (votes: 0, answers: 2)

Suppose I have the iris data.

I know I can create a variable showing which quantile-based interval of values each observation falls into:

library(tidyverse)
iris %>% mutate(Range = cut(Sepal.Length, quantile(Sepal.Length, probs=c(0,.2,.4,.6,.8,1)),include.lowest=TRUE))

This produces:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   Range
1           4.3         3.0          1.1         0.1  setosa [4.3,4.6]
2           4.4         2.9          1.4         0.2  setosa [4.3,4.6]
3           4.6         3.1          1.5         0.2  setosa [4.3,4.6]
4           4.6         3.4          1.4         0.3  setosa [4.3,4.6]
5           4.7         3.2          1.3         0.2  setosa (4.6,4.8]
6           4.8         3.4          1.6         0.2  setosa (4.6,4.8]
7           4.8         3.0          1.4         0.1  setosa (4.6,4.8]
8           4.9         3.0          1.4         0.2  setosa   (4.8,5]
9           4.9         3.1          1.5         0.1  setosa   (4.8,5]
10          5.0         3.6          1.4         0.2  setosa   (4.8,5]
11          5.0         3.4          1.5         0.2  setosa   (4.8,5]
12          5.1         3.5          1.4         0.2  setosa   (5,5.4]
13          5.4         3.9          1.7         0.4  setosa   (5,5.4]
14          5.4         3.7          1.5         0.2  setosa   (5,5.4]
15          5.7         4.4          1.5         0.4  setosa (5.4,5.8]
16          5.8         4.0          1.2         0.2  setosa (5.4,5.8]

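How can I create another variable showing the percentile range each observation belongs to? I don't want to build it by hand with ifelse statements and the like; I'd prefer a function that creates it automatically.

I'm looking for something that produces a table like this: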

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   Percent  Range
1           4.3         3.0          1.1         0.1  setosa [4.3,4.6]  [0,.2]
2           4.4         2.9          1.4         0.2  setosa [4.3,4.6]  [0,.2]
3           4.6         3.1          1.5         0.2  setosa [4.3,4.6]  [0,.2]
4           4.6         3.4          1.4         0.3  setosa [4.3,4.6]  [0,.2]
5           4.7         3.2          1.3         0.2  setosa (4.6,4.8]  (.2,.4]
6           4.8         3.4          1.6         0.2  setosa (4.6,4.8]  (.2,.4]
7           4.8         3.0          1.4         0.1  setosa (4.6,4.8]  (.2,.4]
8           4.9         3.0          1.4         0.2  setosa   (4.8,5]  (.4,.6]
9           4.9         3.1          1.5         0.1  setosa   (4.8,5]  (.4,.6]
10          5.0         3.6          1.4         0.2  setosa   (4.8,5]  (.4,.6]
11          5.0         3.4          1.5         0.2  setosa   (4.8,5]  (.4,.6]
12          5.1         3.5          1.4         0.2  setosa   (5,5.4]  (.6,.8]
13          5.4         3.9          1.7         0.4  setosa   (5,5.4]  (.6,.8]
14          5.4         3.7          1.5         0.2  setosa   (5,5.4]  (.6,.8]
15          5.7         4.4          1.5         0.4  setosa (5.4,5.8]  [.8,1]
16          5.8         4.0          1.2         0.2  setosa (5.4,5.8]  [.8,1]
r dplyr range percentile
2 Answers
2 votes

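Yes. The See Also section of the ?quantile help page points you to the ecdf function, "for empirical distributions of which quantile is an inverse".

Interestingly, ecdf() does not return values directly; it returns a function. So we first create that function from the data and then call it on the input. We can then cut the result just as we did with the quantiles.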

iris %>%
  mutate(
    Range = cut(Sepal.Length, quantile(Sepal.Length, probs=c(0,.2,.4,.6,.8,1)),include.lowest=TRUE),
    ecdf = cut(ecdf(Sepal.Length)(Sepal.Length), breaks = c(0, 0.2, .4, .6, .8, 1), include.lowest = TRUE)
  )

#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species      Range      ecdf
# 1            5.1         3.5          1.4         0.2     setosa    (5,5.6] (0.2,0.4]
# 2            4.9         3.0          1.4         0.2     setosa    [4.3,5]   [0,0.2]
# 3            4.7         3.2          1.3         0.2     setosa    [4.3,5]   [0,0.2]
# 4            4.6         3.1          1.5         0.2     setosa    [4.3,5]   [0,0.2]
# 5            5.0         3.6          1.4         0.2     setosa    [4.3,5] (0.2,0.4]
# 6            5.4         3.9          1.7         0.4     setosa    (5,5.6] (0.2,0.4]
# 7            4.6         3.4          1.4         0.3     setosa    [4.3,5]   [0,0.2]
# 8            5.0         3.4          1.5         0.2     setosa    [4.3,5] (0.2,0.4]
# ...
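
As a rough sketch of the two points above (base R only, using the built-in iris data): ecdf() hands back a function, and calling that function on a value returns the share of observations less than or equal to it. This also suggests why row 5 in the output above lands in [4.3,5] for Range but in (0.2,0.4] for ecdf: several observations are tied at 5.0, so the share of values at or below 5.0 is already above 0.2. The names Fn and share_at_5 are only illustrative.

```r
x <- iris$Sepal.Length

# ecdf() returns a function built from the data
Fn <- ecdf(x)
is.function(Fn)

# Calling it gives the proportion of observations <= the argument
share_at_5 <- Fn(5)
share_at_5

# The same proportion computed directly, for comparison
mean(x <= 5)

# Cutting the ecdf values on the probability scale, as in the answer above
ecdf_range <- cut(Fn(x), breaks = c(0, .2, .4, .6, .8, 1), include.lowest = TRUE)
table(ecdf_range)
```

If the two columns need to agree row by row, cutting once on the quantile breaks and relabelling the intervals avoids this tie effect.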

0 votes

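Here is an approach using the labels= argument of cut. Because both columns are cut on the same breaks, the percentile labels are guaranteed to line up with the value ranges.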

prob <- c(0, .2, .4, .6, .8, 1)

# one label per interval, so length(prob) - 1 labels: "0,0.2", "0.2,0.4", ...
lab <- apply(cbind(seq(length(prob)-1), seq(length(prob))[-1]), 1, \(x)
  paste(prob[x[1]:x[2]], collapse=","))

# quantile() has no labels argument; the labels are passed to cut() below
quant <- quantile(iris$Sepal.Length, probs=prob)

iris %>%
  mutate(Percent = cut(Sepal.Length, breaks=quant, include.lowest=TRUE),
         Range = cut(Sepal.Length, breaks=quant, labels=lab, include.lowest=TRUE))

Output:

    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Percent   Range
1            5.1         3.5          1.4         0.2  setosa (5,5.6] 0.2,0.4
2            4.9         3.0          1.4         0.2  setosa [4.3,5]   0,0.2
3            4.7         3.2          1.3         0.2  setosa [4.3,5]   0,0.2
4            4.6         3.1          1.5         0.2  setosa [4.3,5]   0,0.2
5            5.0         3.6          1.4         0.2  setosa [4.3,5]   0,0.2
6            5.4         3.9          1.7         0.4  setosa (5,5.6] 0.2,0.4
7            4.6         3.4          1.4         0.3  setosa [4.3,5]   0,0.2
8            5.0         3.4          1.5         0.2  setosa [4.3,5]   0,0.2
9            4.4         2.9          1.4         0.2  setosa [4.3,5]   0,0.2
10           4.9         3.1          1.5         0.1  setosa [4.3,5]   0,0.2
11           5.4         3.7          1.5         0.2  setosa (5,5.6] 0.2,0.4
...
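
The labels= idea can be wrapped in a small helper, sketched here under a few assumptions. The function name percentile_range is hypothetical and not part of any package. It builds the labels by calling cut() on the probabilities themselves, which yields interval notation such as (0.2,0.4], close to what the question asks for. It also assumes the quantile breaks are distinct; with heavily tied data cut() stops with an error about non-unique breaks.

```r
# Hypothetical helper: label each value with the percentile interval it falls in
percentile_range <- function(x, probs = c(0, .2, .4, .6, .8, 1)) {
  # value breaks at the requested probabilities
  breaks <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # interval labels on the probability scale, e.g. "[0,0.2]", "(0.2,0.4]"
  labs <- levels(cut(probs, breaks = probs, include.lowest = TRUE))
  # cut the data on the value breaks, but show the probability labels
  cut(x, breaks = breaks, labels = labs, include.lowest = TRUE)
}

iris2 <- iris
iris2$Percent_range <- percentile_range(iris2$Sepal.Length)
head(iris2)
```

Since the helper takes and returns a plain vector, it should also work inside mutate(), for example mutate(Percent_range = percentile_range(Sepal.Length)).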