R: Replace NA with the mean of the two nearest non-NA values

Question

Given a data frame:

x <- c(3,4,8,10,NA,NA,NA,8,10,10,NA,22)
y <- c(1,6,3,5,NA,44,23,NA,NA,5,34,33)
df <- data.frame(x,y)

x   y
<dbl>   <dbl>
3   1           
4   6           
8   3           
10  5           
NA  NA          
NA  44          
NA  23          
8   NA          
10  NA          
10  5           
NA  34          
22  33

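I want to replace each NA with the mean of its two nearest neighbouring values. For example, df[5,2] is NA, but we can replace it with the mean of 5 and 44: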

df[5,2] <- (df[4,2]+df[6,2])/2

df[5,2]
[1] 24.5

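However, this cannot be done when the neighbouring values are themselves NA. Replacing df[7,1] with the mean of df[6,1] and df[5,1] does not work, because they are NA as well.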

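What I want is to make sure that the values used to compute the mean are the two nearest values that are not NA. I wrote a for loop around a data frame holding the indices where NAs are found. For each NA, I create variables for the indices next to it and test whether those neighbours are NA. If they are, the index is decreased or increased, depending on which side of the NA it sits: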

x <- as.data.frame(which(is.na(df), arr.ind = TRUE))
str(x)

  'data.frame': 7 obs. of  2 variables:
   $ row: int  5 6 7 11 5 8 9
   $ col: int  1 1 1 1 2 2 2

This gives a data frame with the row and column of every NA in the data set. Now I try to overwrite them:

for (i in 1:dim(x)[1]) {

    row <- x[i,1]          # First for loop assigns row and column values using the location of NA
    col <- x[i,2]

    b <- row - 1           # Create a list of the indices that precede the NA
    a <- row + 1           # Create a list of the indices that go after the NA

    ifelse(is.na(df[b[i],col]), b[i]-1, b[i])    # If the value in the list is also an NA, keep looking
    ifelse(is.na(df[a[i],col]), a[i]+1, a[i])

    df[row,col] <- (df[b,col]+df[a,col])/2       # Replace the NA with the mean of values where we could 
                                                 # find integers

}

Alas, I could not get through all the NAs. I have not come up with a better solution, so I am turning to you for better ideas. Many thanks!

y <- as.data.frame(which(is.na(df), arr.ind = TRUE))
str(y)

'data.frame':   5 obs. of  2 variables:
 $ row: int  5 6 7 8 9
 $ col: int  1 1 1 2 2
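
For reference, the loop above stalls on runs of NAs for two reasons: `b` and `a` are single numbers, so `b[i]` and `a[i]` are NA whenever `i > 1`, and the results of the two `ifelse()` calls are never assigned. Only the immediate neighbours are ever used. Below is a minimal sketch of the same search-outward idea, assuming a `while` loop on each side is acceptable. The names `na_pos` and `filled` are illustrative and not from the original post.

```r
x <- c(3, 4, 8, 10, NA, NA, NA, 8, 10, 10, NA, 22)
y <- c(1, 6, 3, 5, NA, 44, 23, NA, NA, 5, 34, 33)
df <- data.frame(x, y)

# One (row, col) pair per NA in the data frame
na_pos <- which(is.na(df), arr.ind = TRUE)

# Read from the untouched df, write into a copy, so that values
# filled earlier never feed into later means
filled <- df

for (i in seq_len(nrow(na_pos))) {
  row <- na_pos[i, "row"]
  col <- na_pos[i, "col"]

  # Walk upwards until a non-NA value or the top of the column
  b <- row - 1
  while (b >= 1 && is.na(df[b, col])) b <- b - 1

  # Walk downwards until a non-NA value or the bottom of the column
  a <- row + 1
  while (a <= nrow(df) && is.na(df[a, col])) a <- a + 1

  # Fill only when a non-NA neighbour exists on both sides
  if (b >= 1 && a <= nrow(df)) {
    filled[row, col] <- (df[b, col] + df[a, col]) / 2
  }
}
```

An NA at the very start or end of a column has a neighbour on one side only, so this sketch leaves it as NA rather than guessing.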

Answer 1

We can use the zoo::na.locf() function for this:

x <- c(3,4,8,10,NA,NA,NA,8,10,10,NA,22)
y <- c(1,6,3,5,NA,44,23,NA,NA,5,34,33)
df <- data.frame(x,y)

contiguous_mean <- function(vec) {
    return( (zoo::na.locf(vec) + zoo::na.locf(vec, fromLast = TRUE)) / 2 )
}

apply(df, 2, contiguous_mean)

#        x    y
#  [1,]  3  1.0
#  [2,]  4  6.0
#  [3,]  8  3.0
#  [4,] 10  5.0
#  [5,]  9 24.5
#  [6,]  9 44.0
#  [7,]  9 23.0
#  [8,]  8 14.0
#  [9,] 10 14.0
# [10,] 10  5.0
# [11,] 16 34.0
# [12,] 22 33.0

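Here, "locf" stands for "last observation carried forward": it replaces each NA with the most recent observed value. The fromLast argument controls the direction, so you can fill from the nearest previous observation or from the nearest following one. We want the mean of the nearest previous and the nearest following observation, so we add the results for fromLast = FALSE and fromLast = TRUE and divide by 2.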
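
One caveat, as far as I can tell: na.locf() defaults to na.rm = TRUE, which drops leading NAs it cannot fill, so the forward and backward results can differ in length if a column starts or ends with NA. Also, apply() returns a matrix rather than a data frame. The sketch below does the same forward/backward averaging in base R, assuming you would rather avoid the zoo dependency. The helper names `fill_forward`, `fill_backward` and `nearest_mean` are made up for this example.

```r
# Carry the last non-NA value forward; leading NAs stay NA
fill_forward <- function(v) {
  # Position of the most recent non-NA at or before each element (0 if none)
  idx <- cummax(seq_along(v) * !is.na(v))
  idx[idx == 0] <- NA
  v[idx]
}

# Carry the next non-NA value backward; trailing NAs stay NA
fill_backward <- function(v) rev(fill_forward(rev(v)))

# Replace each NA with the mean of its nearest non-NA neighbours
nearest_mean <- function(v) {
  miss <- is.na(v)
  v[miss] <- ((fill_forward(v) + fill_backward(v)) / 2)[miss]
  v
}

x <- c(3, 4, 8, 10, NA, NA, NA, 8, 10, 10, NA, 22)
y <- c(1, 6, 3, 5, NA, 44, 23, NA, NA, 5, 34, 33)
df <- data.frame(x, y)

# Assigning into df[] keeps the result a data frame
df[] <- lapply(df, nearest_mean)
```

With this version a vector such as c(NA, 1, NA, 3) keeps its leading NA and only the interior gap is filled.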
