How do I visualize which rows of my data frame were kicked out as outliers, while only looking at specific columns?

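I have all my code set up to filter outliers out of my data set using the IQR method. I found the code in another post (https://stackoverflow.com/a/60814481/2939136). It works very well for my purposes, but now I want to go back and figure out which rows were kicked out, and I don't know how to edit the code to do that.

Here is the original code I used: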

    # IQR method functions
    # @param x A numeric vector
    # @param na.rm Whether to exclude NAs when computing quantiles
    is_outlier <- function(x, na.rm = TRUE) {
      qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

      lowerq <- qs[1]
      upperq <- qs[2]
      iqr = upperq - lowerq

      extreme.threshold.upper = (iqr * 1.5) + upperq
      extreme.threshold.lower = lowerq - (iqr * 1.5)

      # Return logical vector
      x > extreme.threshold.upper | x < extreme.threshold.lower
    }

    # Remove rows with outliers in given columns
    # Any row with at least 1 outlier will be removed
    # @param df A data.frame
    # @param cols Names of the columns of interest, defaults to all columns.
    remove_outliers <- function(df, cols = names(df)) {
      for (col in cols) {
        cat("Removing outliers in column: ", col, " \n")
        df <- df[!is_outlier(df[[col]]),]
      }
      df
    }

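The problem is that if I run only the first function, it is applied to columns I don't want to analyze for outliers, including the date column and, more importantly, the ID column. The ID column is what I will use to figure out who was kicked out and which information was omitted from downstream analysis.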
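Example data set (I only care about finding outliers in columns P-C, P-D, and P-E):

    Id  Date     Sex  Diagnosis  P-a  P-b  P-c   P-d  P-e
    1   1/2/23   1    1          105  70   200   15   50
    2   1/4/18   1    1          40   50   150   15   12
    3   1/9/20   1    1          70   20   70    10   12
    4            2               150  150  80    6    44
    5   7/9/15   2    3          148  60   900   7    56
    6   6/10/24  2               115  10   1200  40   46
    7   8/10/11  1    5          110  90   15    5    23
    8            2    2          120  40   60    12   44
    9   9/23/22  1    2          99   30   70    15   35

When running remove_outliers (the second and last function), I do something like:

    columns_of_interest <- c("P-C", "P-D", "P-E")
    remove_outliers(df, columns_of_interest)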

Now I want to know which rows those outliers were.
    

You can modify your remove_outliers function so that it prints the IDs of the outliers found in each column.

    remove_outliers <- function(df, cols = names(df), id_col) {
      outliers <- c()
      for(col in cols){
        cat("Removing outliers in column: ", col, " \n")
        removed_row_id <- df[is_outlier(df[[col]]), id_col]
        cat(id_col, "of rows removed:", removed_row_id, "\n")
        outliers <- append(outliers, removed_row_id)
      }
      outliers <- unique(outliers)
      df[!df[[id_col]] %in% outliers,]
    }


    > remove_outliers(dat, cols = c("P-C", "P-D", "P-E"), id_col = "ID")

    # output
    # Removing outliers in column: P-C
    # ID of rows removed: 5 6
    # Removing outliers in column: P-D
    # ID of rows removed: 6
    # Removing outliers in column: P-E
    # ID of rows removed:

      ID    Date Sex Diagnosis P-A P-B P-C P-D P-E
    1  1  1/2/23   1         1 105  70 200  15  50
    2  2  1/4/18   1         1  40  50 150  15  12
    3  3  1/9/20   1         1  70  20  70  10  12
    4  4    <NA>   2        NA 150 150  80   6  44
    7  7 8/10/11   1         5 110  90  15   5  23
    8  8    <NA>   2         2 120  40  60  12  44
    9  9 9/23/22   1         2  99  30  70  15  35
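To see why IDs 5 and 6 are reported for P-C, it can help to compute the IQR fences by hand. This is a small sketch using the P-C values from the example data and R's default quantile method (type 7); the variable names p_c and fences are only illustrative.

```r
# P-C values copied from the example data (IDs 1 to 9, in order).
p_c <- c(200, 150, 70, 80, 900, 1200, 15, 60, 70)

# Quartiles with the default quantile type, as in is_outlier().
qs  <- quantile(p_c, probs = c(0.25, 0.75))
iqr <- qs[[2]] - qs[[1]]   # 200 - 70 = 130

# Fences: 1.5 * IQR below Q1 and above Q3.
fences <- c(lower = qs[[1]] - 1.5 * iqr, upper = qs[[2]] + 1.5 * iqr)
print(fences)              # lower = -125, upper = 395

# Values outside the fences; these belong to IDs 5 and 6.
print(p_c[p_c < fences[["lower"]] | p_c > fences[["upper"]]])  # 900 and 1200
```

Note that the function above computes the fences for every column on the full data frame, whereas the original loop recomputes them after each column's rows have been dropped. On this example both approaches flag the same rows, but that is not guaranteed for other data.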

data

dat <- structure(list(ID = 1:9, Date = c("1/2/23", "1/4/18", "1/9/20", 
NA, "7/9/15", "6/10/24", "8/10/11", NA, "9/23/22"), Sex = c(1L, 
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L), Diagnosis = c(1L, 1L, 1L, NA, 
3L, NA, 5L, 2L, 2L), `P-A` = c(105L, 40L, 70L, 150L, 148L, 115L, 
110L, 120L, 99L), `P-B` = c(70L, 50L, 20L, 150L, 60L, 10L, 90L, 
40L, 30L), `P-C` = c(200L, 150L, 70L, 80L, 900L, 1200L, 15L, 
60L, 70L), `P-D` = c(15L, 15L, 10L, 6L, 7L, 40L, 5L, 12L, 15L
), `P-E` = c(50L, 12L, 12L, 44L, 56L, 46L, 23L, 44L, 35L)), class = "data.frame", row.names = c(NA, 
-9L))
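If printing the IDs is not enough, one option is to return the dropped rows themselves along with the columns that flagged them. The following is a minimal sketch and not part of the answer above: flag_outliers is a hypothetical helper name, the IQR rule is restated so the block runs on its own, only the ID and P-C to P-E columns of the data are reproduced, and treating NA values as non-outliers is an assumption.

```r
# Same IQR rule as is_outlier() in the question, restated so this runs alone.
is_outlier <- function(x, na.rm = TRUE) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  iqr <- qs[[2]] - qs[[1]]
  x > qs[[2]] + 1.5 * iqr | x < qs[[1]] - 1.5 * iqr
}

# Hypothetical helper: splits df into kept and removed rows and records,
# for each removed row, which of the checked columns flagged it.
flag_outliers <- function(df, cols) {
  flags <- vapply(df[cols], function(x) {
    out <- is_outlier(x)
    out[is.na(out)] <- FALSE  # assumption: NA is not treated as an outlier
    out
  }, logical(nrow(df)))
  # Keep a rows-by-columns matrix even when df has a single row.
  flags <- matrix(flags, nrow = nrow(df), dimnames = list(NULL, cols))
  any_flag <- rowSums(flags) > 0
  removed <- df[any_flag, , drop = FALSE]
  removed$flagged_in <- vapply(
    which(any_flag),
    function(i) paste(cols[flags[i, ]], collapse = ", "),
    character(1)
  )
  list(kept = df[!any_flag, , drop = FALSE], removed = removed)
}

# Subset of the example data: ID plus the three columns of interest.
dat_small <- data.frame(
  ID    = 1:9,
  `P-C` = c(200, 150, 70, 80, 900, 1200, 15, 60, 70),
  `P-D` = c(15, 15, 10, 6, 7, 40, 5, 12, 15),
  `P-E` = c(50, 12, 12, 44, 56, 46, 23, 44, 35),
  check.names = FALSE
)

res <- flag_outliers(dat_small, cols = c("P-C", "P-D", "P-E"))
print(res$removed)  # rows that would be dropped, with a flagged_in column
print(res$kept)     # rows that remain for downstream analysis
```

On this data res$removed holds IDs 5 and 6, matching the printed output above, with ID 5 flagged in P-C and ID 6 flagged in both P-C and P-D. Returning a list keeps the removed rows available for inspection or plotting instead of only printing their IDs.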
