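I am using the code found here (https://stackoverflow.com/a/60814481/2939136). It has worked very well for my purposes, but now I want to go back and figure out which rows were removed, and I don't know how to edit the code to do that.
Here is the original code I used: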
# IQR method functions
# @param x A numeric vector
# @param na.rm Whether to exclude NAs when computing quantiles
is_outlier <- function(x, na.rm = TRUE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  lowerq <- qs[1]
  upperq <- qs[2]
  iqr <- upperq - lowerq
  extreme.threshold.upper <- (iqr * 1.5) + upperq
  extreme.threshold.lower <- lowerq - (iqr * 1.5)
  # Return logical vector
  x > extreme.threshold.upper | x < extreme.threshold.lower
}

# Remove rows with outliers in given columns
# Any row with at least 1 outlier will be removed
# @param df A data.frame
# @param cols Names of the columns of interest, defaults to all columns.
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    df <- df[!is_outlier(df[[col]]), ]
  }
  df
}
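The problem is that if I run only the first function, it also covers columns I don't want checked for outliers, including the date column and, more importantly, the ID column. The ID column is what I will use to figure out who was removed and which information was omitted from downstream analyses.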
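Example dataset (I only care about finding outliers in columns P-C, P-D, and P-E):

ID | Date | Sex | Diagnosis | P-A | P-B | P-C | P-D | P-E
---|---|---|---|---|---|---|---|---
1 | 1/2/23 | 1 | 1 | 105 | 70 | 200 | 15 | 50
2 | 1/4/18 | 1 | 1 | 40 | 50 | 150 | 15 | 12
3 | 1/9/20 | 1 | 1 | 70 | 20 | 70 | 10 | 12
4 | NA | 2 | NA | 150 | 150 | 80 | 6 | 44
5 | 7/9/15 | 2 | 3 | 148 | 60 | 900 | 7 | 56
6 | 6/10/24 | 2 | NA | 115 | 10 | 1200 | 40 | 46
7 | 8/10/11 | 1 | 5 | 110 | 90 | 15 | 5 | 23
8 | NA | 2 | 2 | 120 | 40 | 60 | 12 | 44
9 | 9/23/22 | 1 | 2 | 99 | 30 | 70 | 15 | 35

When I run remove_outliers (the second and final function), I do something like: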
columns_of_interest <- c("P-C", "P-D", "P-E")
remove_outliers(df, columns_of_interest)
Now I would like to know what those outliers are.

You can modify your remove_outliers function to print the IDs of the outliers in each column:
remove_outliers <- function(df, cols = names(df), id_col) {
  outliers <- c()
  for (col in cols) {
    cat("Removing outliers in column: ", col, " \n")
    removed_row_id <- df[is_outlier(df[[col]]), id_col]
    cat(id_col, "of rows removed:", removed_row_id, "\n")
    outliers <- append(outliers, removed_row_id)
  }
  outliers <- unique(outliers)
  df[!df[[id_col]] %in% outliers, ]
}
> remove_outliers(dat, cols = c("P-C", "P-D", "P-E"), id_col = "ID")
# output
# Removing outliers in column: P-C
# ID of rows removed: 5 6
# Removing outliers in column: P-D
# ID of rows removed: 6
# Removing outliers in column: P-E
# ID of rows removed:
  ID    Date Sex Diagnosis P-A P-B P-C P-D P-E
1  1  1/2/23   1         1 105  70 200  15  50
2  2  1/4/18   1         1  40  50 150  15  12
3  3  1/9/20   1         1  70  20  70  10  12
4  4    <NA>   2        NA 150 150  80   6  44
7  7 8/10/11   1         5 110  90  15   5  23
8  8    <NA>   2         2 120  40  60  12  44
9  9 9/23/22   1         2  99  30  70  15  35
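To see why IDs 5 and 6 are flagged for P-C, the IQR fences can be computed by hand. This is only a small check: it copies the P-C values from the example data into a standalone vector (the names p_c and ids are just for illustration) and relies on the default quantile() method, the same one is_outlier uses.

```r
# P-C values and IDs copied from the example data
p_c <- c(200, 150, 70, 80, 900, 1200, 15, 60, 70)
ids <- 1:9

# First and third quartiles (default quantile type), stripped of their names
qs <- unname(quantile(p_c, probs = c(0.25, 0.75)))
iqr <- qs[2] - qs[1]

# 1.5 * IQR fences, as in is_outlier
lower <- qs[1] - 1.5 * iqr
upper <- qs[2] + 1.5 * iqr
c(lower = lower, upper = upper)
# lower = -125, upper = 395

# IDs whose P-C value falls outside the fences
ids[p_c < lower | p_c > upper]
# 5 6
```

The quartiles here are 70 and 200, so the IQR is 130 and anything above 395 counts as an outlier, which is why 900 and 1200 are flagged.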
dat <- structure(list(ID = 1:9, Date = c("1/2/23", "1/4/18", "1/9/20",
NA, "7/9/15", "6/10/24", "8/10/11", NA, "9/23/22"), Sex = c(1L,
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), Diagnosis = c(1L, 1L, 1L, NA,
3L, NA, 5L, 2L, 2L), `P-A` = c(105L, 40L, 70L, 150L, 148L, 115L,
110L, 120L, 99L), `P-B` = c(70L, 50L, 20L, 150L, 60L, 10L, 90L,
40L, 30L), `P-C` = c(200L, 150L, 70L, 80L, 900L, 1200L, 15L,
60L, 70L), `P-D` = c(15L, 15L, 10L, 6L, 7L, 40L, 5L, 12L, 15L
), `P-E` = c(50L, 12L, 12L, 44L, 56L, 46L, 23L, 44L, 35L)), class = "data.frame", row.names = c(NA,
-9L))
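If printing the IDs is not enough and you want the removed rows as an object for later inspection, one possible variation is sketched below. The helper name split_outliers and the outlier_in column are hypothetical and not part of the original code. Like the modified remove_outliers above, it checks every column against the full data, whereas the original loop filtered the data column by column, so quantiles for later columns were computed on already-filtered rows. The sketch also assumes that a missing value should not count as an outlier.

```r
# IQR rule from the question, repeated so the sketch runs on its own
is_outlier <- function(x, na.rm = TRUE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  iqr <- qs[2] - qs[1]
  x > qs[2] + 1.5 * iqr | x < qs[1] - 1.5 * iqr
}

# Hypothetical helper: returns kept and removed rows as separate data frames
split_outliers <- function(df, cols = names(df)) {
  # One logical column per checked variable;
  # NA comparisons are treated as "not an outlier" (an assumption)
  flags <- do.call(cbind, lapply(cols, function(col) {
    out <- is_outlier(df[[col]])
    !is.na(out) & out
  }))
  flagged <- rowSums(flags) > 0
  removed <- df[flagged, , drop = FALSE]
  # Record which of the checked columns triggered each removal
  removed$outlier_in <- vapply(
    which(flagged),
    function(i) paste(cols[flags[i, ]], collapse = ", "),
    character(1)
  )
  list(kept = df[!flagged, , drop = FALSE], removed = removed)
}

# Reduced copy of the example data (ID plus the three columns of interest)
dat_small <- data.frame(
  ID = 1:9,
  `P-C` = c(200, 150, 70, 80, 900, 1200, 15, 60, 70),
  `P-D` = c(15, 15, 10, 6, 7, 40, 5, 12, 15),
  `P-E` = c(50, 12, 12, 44, 56, 46, 23, 44, 35),
  check.names = FALSE
)

res <- split_outliers(dat_small, cols = c("P-C", "P-D", "P-E"))
res$removed[, c("ID", "outlier_in")]
#   ID outlier_in
# 5  5        P-C
# 6  6   P-C, P-D
```

Here res$kept can go on to the downstream analysis, while res$removed keeps every column of the dropped rows, so no ID column argument is needed.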