Getting only NA responses from my random forest model


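This is my first post on Stack Overflow, so please ask below if you need more information!

Situation: I have compiled water chemistry data for freshwater ecosystems in the Maritimes (Atlantic Canada) because I am trying to build a predictive species distribution model for an invasive species using a random forest model (RFM). Unfortunately, Atlantic Canada lacks a consistent water monitoring program, and the programs that do exist do not monitor the same parameters as one another. As a result, my datasets (both training and testing) contain many NAs.

Problem: This is the response I keep getting from the RFM: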

> p1 <- predict(model2, newdata=Test_Dataset,type="prob")[,2]
> p1

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
35 36 37
NA NA NA

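What I have tried:

  1. I built the RFM (i.e. model2) using a variety of predictor variables. I did include:

    model2

**Note that the long list of variables is the set of predictors, and CMS is the species.

  2. I tried matching the test dataset (Test_Dataset) to the training dataset (Validation_Dataset).

    Test_Dataset

  3. I have searched and read multiple resources (including the obvious R pages and the references linked from there).

  4. I have mutated the data frames as follows (I will only show Validation_Dataset, since the mutation is identical for both):

    # Alter the dataset to fix how R reads the NA cells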

    Validation_Dataset <- Validation_Dataset %>%
      dplyr::mutate(
        # Convert Year to a categorical variable
        Year = factor(Year),
        # Convert chlorophyll concentration from character to numeric
        # Convert "NA" to missing data where appropriate
        Chlorophyll = dplyr::na_if(Chlorophyll, "NA"),
        Chlorophyll = factor(Chlorophyll),
        Hardness = dplyr::na_if(Hardness, "NA"),
        Hardness = factor(Hardness),
        Alkalinity = dplyr::na_if(Alkalinity, "NA"),
        Alkalinity = factor(Alkalinity),
        Ca = dplyr::na_if(Ca, "NA"),
        Ca = factor(Ca),
        TOC = dplyr::na_if(TOC, "NA"),
        TOC = factor(TOC),
        Cond = dplyr::na_if(Cond, "NA"),
        Cond = factor(Cond),
        Na = dplyr::na_if(Na, "NA"),
        Na = factor(Cond),
        NH4 = dplyr::na_if(NH4, "NA"),
        NH4 = factor(NH4),
        NO3 = dplyr::na_if(NO3, "NA"),
        NO3 = factor(NO3),
        pH = dplyr::na_if(pH, "NA"),
        pH = factor(pH),
        T_N = dplyr::na_if(T_N, "NA"),
        T_N = factor(T_N),
        T_P = dplyr::na_if(T_P, "NA"),
        T_P = factor(T_P),
        DO = dplyr::na_if(DO, "NA"),
        DO = factor(DO),
        Salinity = dplyr::na_if(Salinity, "NA"),
        Salinity = factor(Salinity),
        No_Stocking = dplyr::na_if(No_Stocking, "NA"),
        No_Stocking = factor(No_Stocking),
        No_Fish_Species = dplyr::na_if(No_Fish_Species, "NA"),
        No_Fish_Species = factor(No_Fish_Species),
        Dist_Hwy = dplyr::na_if(Dist_Hwy, "NA"),
        Dist_Hwy = factor(Dist_Hwy),
        No_Boat_Launches = dplyr::na_if(No_Boat_Launches, "NA"),
        No_Boat_Launches = factor(No_Boat_Launches),
        Connected_Lakes = dplyr::na_if(Connected_Lakes, "NA"),
        Connected_Lakes = factor(Connected_Lakes),
        Invasives = dplyr::na_if(Invasives, "NA"),
        Invasives = factor(Invasives),
        Lat = factor(Lat),
        Lon = factor(Lon),
        CMS = factor(CMS)
      )
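
The comments in the code above talk about converting concentrations to numeric, while factor() turns every distinct reading into its own category (the str() output further down shows, for example, pH as a factor with 35 levels). In case a numeric conversion was the intent, here is a minimal base-R sketch on a made-up vector. The name chlorophyll_raw and its values are hypothetical and only stand in for one column of the real data.

```r
# Hypothetical column as it might arrive from a CSV: numbers stored as text,
# with the literal string "NA" marking missing values.
chlorophyll_raw <- c("0.423601", "NA", "0.453791", "NA")

# Replace the "NA" strings with real missing values, then convert to numeric.
chlorophyll_chr <- chlorophyll_raw
chlorophyll_chr[chlorophyll_chr == "NA"] <- NA
chlorophyll_num <- as.numeric(chlorophyll_chr)

# For comparison: factor() keeps one level per distinct reading.
chlorophyll_fac <- factor(chlorophyll_chr)

str(chlorophyll_num)
str(chlorophyll_fac)
```

With a factor, a reading that occurs only in the test data has no matching level in the training data, whereas a numeric column does not have that restriction. Separately, the str() output further down lists identical levels for Na and Cond, which suggests Na may be getting filled from Cond in the mutate call; that could be worth double-checking.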

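Question: Does anyone know how to actually get the code working so that model2 makes predictions on Test_Dataset? I think the problem may actually be quite small, but I am not seeing it.

Here is the training dataset (Validation_Dataset):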

> str(Validation_Dataset)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    37 obs. of  31 variables:
 $ Name            : chr  "Canard River" "Cedar Creek" "Holland River" "Speed River" ...
 $ STN #/COUNTY    : chr  "10000200202" "16001800202" "3007700202" "16018403402" ...
 $ Province        : chr  "ON" "ON" "ON" "ON" ...
 $ Lat             : Factor w/ 37 levels "42.03204214",..: 2 1 11 9 10 8 7 5 6 3 ...
 $ Lon             : Factor w/ 37 levels "-83.01879548",..: 1 2 11 8 10 6 7 9 5 4 ...
 $ Year            : Factor w/ 9 levels "2007, 2011","2010, 2015, 2011",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ Month           : chr  "4" "4" "4" "4" ...
 $ Day             : chr  "11" "12" "26" "27" ...
 $ Data Source     : chr  "ON Provincial (Streams) Water Quality Monitoring Network" "ON Provincial (Streams) Water Quality Monitoring Network" "ON Provincial (Streams) Water Quality Monitoring Network" "ON Provincial (Streams) Water Quality Monitoring Network" ...
 $ pH              : Factor w/ 35 levels "6.073333","6.13",..: 18 21 28 29 25 34 30 32 19 26 ...
 $ Alkalinity      : Factor w/ 31 levels "1.8","2.8","3.933333333",..: 19 22 31 30 27 NA NA 26 NA 21 ...
 $ Hardness        : Factor w/ 13 levels "14.8","36.8",..: 7 8 11 10 9 NA NA 13 NA NA ...
 $ Ca              : Factor w/ 24 levels "3.833333333",..: 18 19 24 20 21 NA NA 22 NA NA ...
 $ Chlorophyll     : Factor w/ 15 levels "0.423601","0.453791",..: NA NA NA NA NA NA NA NA NA NA ...
 $ DO              : Factor w/ 26 levels "0.27","6.2","6.96",..: 21 24 18 16 4 25 17 14 2 7 ...
 $ TOC             : Factor w/ 3 levels "4.8","5.5","8.8": NA NA NA NA NA NA NA NA NA NA ...
 $ T_P             : Factor w/ 24 levels "0.002","0.003",..: 23 22 18 10 15 14 16 13 21 20 ...
 $ T_N             : Factor w/ 32 levels "0.006","0.13",..: 30 31 27 28 17 29 24 32 21 25 ...
 $ NO3+NO2         : num  2.173 2.292 1.092 1.695 0.426 ...
 $ NO3             : Factor w/ 32 levels "0.027","0.035",..: 30 31 26 27 11 29 24 32 22 8 ...
 $ NH4             : Factor w/ 27 levels "0.005","0.006",..: 26 25 22 17 9 11 13 19 23 27 ...
 $ Cond            : Factor w/ 34 levels "41","97","134",..: 24 21 29 23 22 14 34 31 21 17 ...
 $ Salinity        : Factor w/ 9 levels "0.11","0.15",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Na              : Factor w/ 34 levels "41","97","134",..: 24 21 29 23 22 14 34 31 21 17 ...
 $ No_Stocking     : Factor w/ 3 levels "0","1","2": 1 2 2 3 1 2 1 2 1 2 ...
 $ No_Fish_Species : Factor w/ 9 levels "0","1","2","3",..: 1 4 6 4 1 5 1 9 1 9 ...
 $ Dist_Hwy        : Factor w/ 16 levels "0.003","0.006",..: NA NA 16 NA NA NA NA 8 NA 5 ...
 $ No_Boat_Launches: Factor w/ 8 levels "0","1","2","3",..: 1 1 5 1 1 1 1 8 1 3 ...
 $ Connected_Lakes : Factor w/ 11 levels "0","1","2","3",..: 7 2 3 4 9 6 2 3 2 5 ...
 $ Invasives       : Factor w/ 3 levels "0","1","2": NA NA NA NA NA NA NA NA NA NA ...
 $ CMS             : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
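
The str() output above shows several columns (Chlorophyll, TOC, Salinity, Invasives) that are NA in each of the first ten rows. Assuming model2 was fitted with the formula interface (the model call is not shown), one plausible explanation for the all-NA predictions is that every row of Test_Dataset contains at least one missing predictor, so predict() has no complete row to score. A quick way to check this with base R is sketched below. The data frame toy_test is made up for illustration and is not the real data.

```r
# Hypothetical toy data standing in for Test_Dataset (not the real data).
toy_test <- data.frame(
  pH          = c(7.1, 6.8, 7.4),
  Chlorophyll = c(NA, NA, NA),   # a predictor that is missing everywhere
  DO          = c(8.2, NA, 9.0)
)

# Number of rows with no missing value in any column.
n_complete <- sum(complete.cases(toy_test))

# Number of missing values per column, to see which predictors are responsible.
na_per_column <- colSums(is.na(toy_test))

print(n_complete)
print(na_per_column)
```

If the same check on the real Test_Dataset reports zero complete rows, dropping the mostly empty predictors or imputing them before calling predict() would be the next things to try.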
r dataframe dplyr linear-regression random-forest
1 Answer

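Use na.roughfix. You can either call it on a data frame yourself, outside of randomForest, or pass it to randomForest through the na.action argument. I will use the iris dataset as an example.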

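library(randomForest)
# iris.na is assumed to be a copy of iris with some values set to NA
iris.roughfix <- na.roughfix(iris.na)  # impute the data frame directly
iris.narf <- randomForest(Species ~ ., iris.na, na.action = na.roughfix)  # or impute at fit time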
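
The snippet above leaves iris.na undefined and stops at fitting, so here is a fuller sketch that also covers the prediction step from the question. It assumes the randomForest package is installed. The way the NAs are injected and the train/test split are arbitrary choices made for illustration, and the names train, test, test_fixed, fit and p are hypothetical. na.roughfix replaces missing numeric values with the column median and missing factor values with the most frequent level. The sketch imputes the test predictors explicitly before calling predict(), on the assumption that the na.action given at fit time is not reapplied to newdata.

```r
library(randomForest)

set.seed(111)

# Stand-in data: iris with 20 values knocked out of each predictor column.
iris.na <- iris
for (i in 1:4) iris.na[sample(150, 20), i] <- NA

# Arbitrary split into training and test rows.
test_idx <- sample(150, 30)
train <- iris.na[-test_idx, ]
test  <- iris.na[test_idx, ]

# Impute the training data at fit time through na.action.
fit <- randomForest(Species ~ ., data = train, na.action = na.roughfix)

# Impute the test predictors before predicting.
test_fixed <- na.roughfix(test)
p <- predict(fit, newdata = test_fixed, type = "prob")

head(p)
```

Note that na.roughfix(test) fills gaps with medians and modes computed from the test rows themselves. It also only accepts numeric and factor columns, so character columns such as Name or Province in the question's data would need to be dropped or converted first.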