I am trying to train a naive Bayes classifier on the Kaggle Titanic dataset (https://www.kaggle.com/c/titanic/data, files "train.csv" and "test.csv").
The code I have come up with so far is:
library(e1071)
train_d <- read.csv("train.csv", stringsAsFactors = TRUE)
# columns chosen for training data-
# colnames(train_d) OR names(train_d)
# "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"
train_data <- train_d[, c(2:3, 5:8, 12)]
# to find out which columns contain NA (missing values)-
colnames(train_data)[apply(is.na(train_data), 2, any)]
# mean(train_data$Age, na.rm = TRUE) # to find mean of 'Age' which contains 'NA'
# which(is.na(train_data$Age))
# fill in missing value (NA) with mean of 'Age' column-
train_data$Age[which(is.na(train_data$Age))] <- mean(train_data$Age, na.rm = TRUE)
# check whether there are any existing NAs-
which(is.na(train_data$Age))
# OR-
colnames(train_data)[apply(is.na(train_data), 2, any)]
test_d <- read.csv("test.csv", stringsAsFactors = TRUE)
# columns chosen for test data-
# "Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"
test_data <- test_d[, c(2, 4:7, 11)]
# find out missing values (NA)-
colnames(test_data)[apply(is.na(test_data), 2, any)]
# fill in missing value (NA) with mean of 'Age' column-
test_data$Age[which(is.na(test_data$Age))] <- mean(test_data$Age, na.rm = TRUE)
# check whether there are any existing NAs in the test data-
which(is.na(test_data$Age))
# OR-
colnames(test_data)[apply(is.na(test_data), 2, any)]
# training a naive-bayes classifier-
titanic_nb <- naiveBayes(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked, data = train_data)
# predict using trained naive-bayes classifier-
output <- predict(titanic_nb, test_data, type = "class")
However, output does not actually contain anything. Printing the output variable gives:
> output
factor(0)
Levels:
What is going wrong?
Thanks!
Here is the answer: The original question was deleted, hence the web-cache link.
The reason is that the model does not really know how to handle character columns, as you can see if you run data.matrix(test_data).
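The solution is to first convert your character columns to factors, making sure the factor levels are consistent between train and test.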
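A minimal sketch of that conversion, using base R only. The helper name `align_factors` and the toy data frames are made up for illustration and are not part of the question. It assumes the level set should be the union of the values seen in train and test.

```r
# Hypothetical helper: convert character/factor columns shared by both data
# frames to factors that use one common set of levels.
align_factors <- function(train, test) {
  shared <- intersect(names(train), names(test))
  for (col in shared) {
    if (is.character(train[[col]]) || is.factor(train[[col]])) {
      # Union of the values seen in either data frame.
      lv <- sort(unique(c(as.character(train[[col]]),
                          as.character(test[[col]]))))
      train[[col]] <- factor(as.character(train[[col]]), levels = lv)
      test[[col]]  <- factor(as.character(test[[col]]),  levels = lv)
    }
  }
  list(train = train, test = test)
}

# Toy data (not the Titanic files), read as plain character columns.
toy_train <- data.frame(Survived = c(0, 1, 1, 0),
                        Sex = c("male", "female", "female", "male"),
                        Embarked = c("S", "C", "", "Q"),
                        stringsAsFactors = FALSE)
toy_test <- data.frame(Sex = c("female", "male"),
                       Embarked = c("Q", "S"),
                       stringsAsFactors = FALSE)

aligned <- align_factors(toy_train, toy_test)

# Assumption: a classifier also needs a categorical response, so the
# response is converted to a factor here as well.
aligned$train$Survived <- factor(aligned$train$Survived)

str(aligned$train)
str(aligned$test)
```

With the real data this would be called as align_factors(train_data, test_data) before fitting. Note that read.csv reads Survived as an integer column, so it may be worth converting it to a factor too, since naiveBayes is a classifier and takes its class labels from the levels of the response.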
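As a side note, I would suggest starting with a random forest, since it usually performs well without any parameter tuning and does not care about the distribution of the variables (unlike naive Bayes, which in e1071 assumes that numeric predictors are Gaussian within each class).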
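A rough sketch of the random forest suggestion, assuming the randomForest package is installed (it is not used in the question) and that the data frames have no missing values and use factor columns with matching levels. The function name `fit_titanic_rf` is hypothetical.

```r
library(randomForest)

# Hypothetical wrapper: fit a random forest with default settings and
# predict a class for each row of the test data.
fit_titanic_rf <- function(train_data, test_data) {
  # A factor response makes randomForest do classification, not regression.
  train_data$Survived <- factor(train_data$Survived)
  rf <- randomForest(Survived ~ ., data = train_data)
  predict(rf, newdata = test_data)
}
```

Called as fit_titanic_rf(train_data, test_data), this should return one predicted class per test row. The defaults are used on purpose here, in line with the point about not needing parameter tuning to get started.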