How do I remove rows by date based on a quantile?


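My problem is the following: I want to remove the rows of a data frame whose values fall below the 50th percentile, where the percentile is computed separately for each date. The example below illustrates this.

I have the following data frame: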

date <- c("01.02.2011","01.02.2011","01.02.2011","01.02.2011","01.02.2011","01.02.2011",
          "01.02.2011","01.02.2011","01.02.2011","01.02.2011",
          "02.02.2011","02.02.2011","02.02.2011","02.02.2011","02.02.2011","02.02.2011",
          "02.02.2011","02.02.2011","02.02.2011","02.02.2011")
date <- as.Date(date, format="%d.%m.%Y")
ID <- c("A","B","C","D","E","F","G","H","I","J",
        "A","B","C","D","E","F","G","H","I","J")
values <- c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
            4, 2, 12, 16, 8, 1, 7, 11, 2, 10)

df <- data.frame(ID, date, values)

which looks like this:

   ID       date values
1   A 2011-02-01      1
2   B 2011-02-01      8
3   C 2011-02-01      2
4   D 2011-02-01      3
5   E 2011-02-01      5
6   F 2011-02-01     13
7   G 2011-02-01      2
8   H 2011-02-01      4
9   I 2011-02-01      1
10  J 2011-02-01     16
11  A 2011-02-02      4
12  B 2011-02-02      2
13  C 2011-02-02     12
14  D 2011-02-02     16
15  E 2011-02-02      8
16  F 2011-02-02      1
17  G 2011-02-02      7
18  H 2011-02-02     11
19  I 2011-02-02      2
20  J 2011-02-02     10

For each date, I want to remove all rows whose value is below that date's 50th percentile, to obtain:

   ID       date values
2   B 2011-02-01      8
5   E 2011-02-01      5
6   F 2011-02-01     13
8   H 2011-02-01      4
10  J 2011-02-01     16
13  C 2011-02-02     12
14  D 2011-02-02     16
15  E 2011-02-02      8
18  H 2011-02-02     11
20  J 2011-02-02     10

If my question needs any edits, please let me know.

r dataframe quantile
1 Answer

There are several ways to do this. Here are two of them; both rely on the same idea: first compute the median for each date, then filter the data.
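
To make the first step concrete, here is a minimal base R sketch that computes the per-date 50th percentile for the example data. It rebuilds `df` from the question so that it runs on its own; `thresholds` is a name chosen here purely for illustration.

```r
# Rebuild the example data from the question
df <- data.frame(
  ID = rep(LETTERS[1:10], times = 2),
  date = as.Date(rep(c("2011-02-01", "2011-02-02"), each = 10)),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

# 50th percentile (the median) of values for each date
thresholds <- aggregate(values ~ date, data = df, FUN = quantile, probs = 0.5)
thresholds
```

For the example data the thresholds are 3.5 for 2011-02-01 and 7.5 for 2011-02-02, which is why H (value 4) is kept on the first day while D (value 3) is dropped.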

data.table

With data.table, first use := to add the per-date quantile to the data by reference, then filter. data.table is a very efficient option if your data is large.

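library(data.table)
setDT(df)  # convert df to a data.table by reference

# Add the per-date 50th percentile as a helper column (df keeps this column)
df[, quant := quantile(values, probs = .5), by = "date"]
# Keep the rows above the threshold, then drop the helper column from the result
df2 <- df[values > quant]
df2[, quant := NULL]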

df2
    ID       date values
 1:  B 2011-02-01      8
 2:  E 2011-02-01      5
 3:  F 2011-02-01     13
 4:  H 2011-02-01      4
 5:  J 2011-02-01     16
 6:  C 2011-02-02     12
 7:  D 2011-02-02     16
 8:  E 2011-02-02      8
 9:  H 2011-02-02     11
10:  J 2011-02-02     10
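
If you would rather not add and then drop a helper column, the same filter can be sketched as a single grouped step. This is an alternative formulation rather than the approach shown above: it uses `.SD` to subset the rows within each date, and the names `dt` and `res` are chosen here for illustration. The data is rebuilt so the snippet runs on its own.

```r
library(data.table)

# Rebuild the example data from the question as a data.table
dt <- data.table(
  ID = rep(LETTERS[1:10], times = 2),
  date = as.Date(rep(c("2011-02-01", "2011-02-02"), each = 10)),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

# Within each date, keep the rows whose value exceeds that date's median
res <- dt[, .SD[values > quantile(values, probs = 0.5)], by = date]
res
```

Note that the grouping column `date` comes first in this result, and that subsetting `.SD` per group may be slower than the `:=` version when there are many groups.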

dplyr

With dplyr, you can chain the operations with the pipe: compute the quantile by group, then filter.

library(dplyr)
df %>%
   group_by(date) %>%
   mutate(quant = quantile(values, .5)) %>%
   filter(values>quant) %>%
   select(-quant)

# Groups:   date [2]
   ID    date       values
   <fct> <date>      <dbl>
 1 B     2011-02-01      8
 2 E     2011-02-01      5
 3 F     2011-02-01     13
 4 H     2011-02-01      4
 5 J     2011-02-01     16
 6 C     2011-02-02     12
 7 D     2011-02-02     16
 8 E     2011-02-02      8
 9 H     2011-02-02     11
10 J     2011-02-02     10
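
Neither package is strictly required. As a hedged sketch of the same two steps in base R, `ave()` returns the per-date quantile aligned with each row, so it can be compared against `values` directly. The names `keep` and `result` are introduced here for illustration. Because the question asks to drop rows below the 50th percentile, this version uses `>=`, which also keeps rows exactly equal to the median; in the example data no value equals its date's median, so the rows kept are the same as in the outputs above.

```r
# Rebuild the example data from the question
df <- data.frame(
  ID = rep(LETTERS[1:10], times = 2),
  date = as.Date(rep(c("2011-02-01", "2011-02-02"), each = 10)),
  values = c(1, 8, 2, 3, 5, 13, 2, 4, 1, 16,
             4, 2, 12, 16, 8, 1, 7, 11, 2, 10)
)

# Per-date 50th percentile, repeated for every row of its date
quant <- ave(df$values, df$date, FUN = function(v) quantile(v, probs = 0.5))

# Keep rows at or above their date's threshold
keep <- df$values >= quant
result <- df[keep, ]
result
```

Base R subsetting preserves the original row names, so `result` is labelled 2, 5, 6, 8, 10, 13, ... just like the desired output in the question.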