我创建了一个函数,在 lapply() 中使用一个已经分割的数据框。问题是,如果分割数据帧的某些条件缺失,则 lapply() 不会完全运行该函数。或者有没有办法通过 lapply() 删除包含 NA 的部分? 另一个问题是,在使用 lapply() 代码之前我无法删除带有 NA 的部分,因为 lapply() 希望 split() 函数中的所有条件都存在,否则它会给出与存在 NA 部分相同的错误.
我尝试使用以下解决方案删除一个拆分部分,并且它有效:
数据结构:
Participant visit trial tl tr br bl time CoPx CoPy
1 006 FU1 lc 81 7 6 1 1 -0.8 -13
2 006 FU1 lo 79 7 8 1 2 -0.7 -15
3 006 FU1 rc 7 9 9 0 3 -0.7 -51
4 006 FU1 ro 9 8 3 1 4 -0.4 -15
5 006 FU2 lc 9 4 8 1 5 -0.5 -17
6 006 FU2 lo 79 7 2 10 6 -0.5 -17
7 006 FU2 rc 7 4 9 1 7 -0.3 -86
8 006 FU2 ro 7 4 7 13 8 -0.8 -200
9 009 FU1 lc 81 7 6 1 1 -0.8 -13
10 009 FU1 lo 79 7 8 1 2 -0.7 -15
11 009 FU1 rc 7 9 9 0 3 -0.7 -51
12 009 FU1 ro 9 8 3 1 4 -0.4 -15
13 009 FU2 lc 9 4 8 1 5 -0.5 -17
14 009 FU2 lo 79 7 2 10 6 -0.5 -17
15 009 FU2 rc NA NA NA NA NA NA NA
16 009 FU2 ro 7 4 7 13 8 -0.8 -200
代码:
`minifunc.area <- function(df){
# This mini function is designed to extract the center of the ellipse and the area of the 95% confidence interval
# The code in this function comes from https://stackoverflow.com/questions/38782051/how-to- calculate-the-area-of-ellipse-drawn-by-ggplot2
# remove NA's before running filter
#df <- drop_na(df)
# Plot object
p <- ggplot(df, aes(x = CoPx, y = CoPy))+
geom_point()+
stat_ellipse(level = 0.95) # 95% confidence interval
# Get ellipse coordinates from plot
pb <- ggplot_build(p)
pb <- as.data.frame(pb$data[[2]])
el <- pb[,1:2]
# Center of ellipse
ctr <- MASS::cov.trob(el)$center
# Calculate distance to center from each point on the ellipse
dist2center <- sqrt(rowSums((t(t(el) - ctr))^2))
# Calculate area of ellipse from semi-major and semi-minor axes.
# These are, respectively, the largest and smallest values of dist2center.
answer <- pi * min(dist2center) * max(dist2center)
} # end of mini function
splt.df <- split(df, list(df$trial, df$Participant))
`
输出示例:
$lo.FU2.006 # A tibble: 6,041 × 12 # Groups: Participant, visit, trial [1] Participant visit trial tl tr br bl time CoPx CoPy <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> 1 006 FU2 lo 99.4 75.6 67.1 88.3 1 -0.678 -1.198 2 006 FU2 lo 67.5 98.3 44.8 85.6 2 -0.375 -0.375 $rc.FU2.009 # A tibble: 6,041 × 12 # Groups: Participant, visit, trial [1] Participant visit trial tl tr br bl time CoPx CoPy <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> 1 009 FU2 rc NA NA NA NA NA NA NA
# Figure out which df has missing data and define which row is missing
# removing duplicated trials per participant
temp.df <- df %>%
distinct(Participant, trial)
# creating a new column with assenting numbers to count row number
temp.df$row.nr <- c(1:nrow(temp.df))
# identify which row has the missing value
save.na <- temp.df %>%
dplyr::filter(Participant == p.nr & trial == trial.name)
na.row.nr <- unlist(save.na1[3]) # chr with the section number that needs to be removed from the split section (example: '15')
# apply the mini function to the splt data
splt <- lapply(splt.df[-na.row.nr], function(splt.df) minifunc.area(splt.df))
上面的方法有效,但是,现在我想做同样的事情,但缺少两个部分。
我已经尝试过以下尝试:
# Attempt 1)
na.row.nr <- unlist(save.na1[3])
# where na.row.nr[1] is the first section to be removed and na.row.nr[2] is the second
# apply the mini function to the splt data
splt <- lapply(splt.df[-c(na.row.nr[1], na.row.nr[2])], function(splt.df) minifunc.area(splt.df))
# Attempt 2)
na.row.nr <- unlist(save.na1[3])
# where na.row.nr[1] is the first section to be removed and na.row.nr[2] is the second
# apply the mini function to the splt data
splt <- lapply(splt.df[-na.row.nr[1], -na.row.nr[2]], function(splt.df) minifunc.area(splt.df))
# Attempt 3)
na.row.nr <- unlist(save.na1[3])
# where na.row.nr[1] is the first section to be removed and na.row.nr[2] is the second
missing.rows <- c(na.row.nr[1], na.row.nr[2])
# apply the mini function to the splt data
splt <- lapply(splt.df[-na.row.nr[1] & -na.row.nr[2]], function(splt.df) minifunc.area(splt.df))
输出:
Error in `[.data.frame`(pb, , 1:2) : undefined columns selected In addition: Warning message: Removed 1 rows containing non-finite values (`stat_ellipse()`).
根据我的理解,这意味着 lapply() 无法遍历所有部分,并且一旦到达包含 NA 或不存在的部分就会停止
您可以使用
if()
在错误发生之前捕获错误,或者如评论中所述,使用 purrr中的
safely()
。
我认为椭圆至少需要 3 个点,所以你可以用
if()
来捕捉这种情况:
minifunc.area <- function(df){
if(nrow(df)<3) {
cat("Not enough rows in data to calculate an ellipse\n")
return(NA_real_)
}
如果没有足够的独特点,
stat_ellipse()
功能也可能会失败,因此您可以使用if()
再次捕获:
# Get ellipse coordinates from plot
pb <- ggplot_build(p)
pb <- as.data.frame(pb$data[[2]])
if(nrow(pb)<3 | ncol(pb)<2) return(NA_real_)
这应该足以处理大多数情况。让我们尝试使用整个数据集。
minifunc.area(df)
[1] 1031.753
Warning message:
Removed 1 row containing non-finite outside the scale range (`stat_ellipse()`).
现在让我们按一些变量进行划分。我将在这里使用
sapply
,以便输出更适合屏幕。
按参与者划分:
splt.df <- split(df, df$Participant)
sapply(splt.df, minifunc.area)
6 9
291.9459 175.9905
按试验分割:
splt.df <- split(df, df$trial)
sapply(splt.df, minifunc.area)
lc lo rc ro
12.24941 NA NA NA
按参与者和试验划分:
splt.df <- split(df, list(df$trial, df$Participant))
lapply(splt.df, minifunc.area)
lc.6 lo.6 rc.6 ro.6 lc.9 lo.9 rc.9 ro.9
NA NA NA NA NA NA NA NA