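My model's accuracy is very poor. The data come from https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original, which reports 96% accuracy for a logistic regression model, so the problem must be in my model. I built the following model in R.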
# Importing dataset
tumor_study <- read.csv("breast-cancer-wisconsin.data", header = FALSE, na.strings = "NA")
# Adding column names
features <- c("id_number", "ClumpThickness", "Uniformity_CellSize",
"Uniformity_CellShape", "MarginalAdhesion",
"SingleEpithelial_CellSize", "BareNuclei", "Bland_Chromatin",
"Normal_Nucleoli", "Mitoses", "Class")
colnames(tumor_study) <- features
# Cleaning data
# Remove the 1st column (id_number)
tumor_study <- tumor_study[,-1]
# Convert "?" to NA in BareNuclei column and then to numeric
tumor_study$BareNuclei[tumor_study$BareNuclei == "?"] <- NA
tumor_study$BareNuclei <- as.numeric(tumor_study$BareNuclei)
# Remove rows with missing values in BareNuclei
tumor_study <- tumor_study[!is.na(tumor_study$BareNuclei),]
# Convert Class to a factor
tumor_study$Class <- factor(tumor_study$Class, levels = c(2, 4), labels = c("Benign", "Malignant"))
# Splitting the dataset into training and test sets
library(caTools)
set.seed(123)
split <- sample.split(tumor_study$Class, SplitRatio = 0.8)
training_set <- tumor_study[split == TRUE,]
test_set <- tumor_study[split == FALSE,]
# Applying feature scaling
training_set[, 1:9] <- scale(training_set[, 1:9])
test_set[, 1:9] <- scale(test_set[, 1:9])
# Building the logistic regression model
classifier <- glm(formula = Class ~ ., family = binomial, data = training_set)
# Predicting probabilities for the training set
prob_y_train <- predict(classifier, type = 'response', newdata = training_set[,-10])
predicted_y_training <- ifelse(prob_y_train >= 0.5, "Benign", "Malignant")
# prediction using test_set
prob_y_test <- predict(classifier, type = 'response', newdata = test_set[,-10])
predicted_y_test <- ifelse(prob_y_test >= 0.5, "Benign", "Malignant")
# Checking the accuracy with confusion matrix
cm_test <- table(test_set[,10], predicted_y_test)
print(cm_test)
## if you check the accuracy ... it is close to 2%
I can't find the problem in the model, but something must be wrong. I hope someone can tell me where the error is.
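Your probabilities are for "Malignant", not "Benign". The outcome is a factor with 2 levels, and when R's glm function fits a logistic regression it treats the first level as failure and models the probability of the second level.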
levels(training_set$Class)
[1] "Benign" "Malignant"
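To see this behaviour in isolation, here is a minimal sketch on made-up data (the names toy, fit, p, mean_p and pred are only illustrative and are not part of the question's code). It fits a binomial glm to a two-level factor and checks which class receives the higher fitted probabilities. It also shows one possible way to build the labels from levels() so the order cannot be swapped by hand.

```r
# Toy data: the outcome is a factor with levels ordered Benign, Malignant
set.seed(1)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))
y <- factor(rep(c("Benign", "Malignant"), each = 50),
            levels = c("Benign", "Malignant"))
toy <- data.frame(x = x, y = y)

# Logistic regression: the first factor level is "failure", the second "success"
fit <- glm(y ~ x, family = binomial, data = toy)
p <- predict(fit, type = "response")

# Average fitted probability within each true class;
# it is higher for the second level, "Malignant"
mean_p <- tapply(p, toy$y, mean)
print(mean_p)

# Derive the labels from the factor levels instead of typing them in
pred <- factor(ifelse(p >= 0.5, levels(toy$y)[2], levels(toy$y)[1]),
               levels = levels(toy$y))
print(table(toy$y, pred))
```

Taking the labels from levels(toy$y) keeps the mapping correct even if the level order is changed later.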
Just swap "Benign" and "Malignant" in the two ifelse commands and you will get much better accuracy.
predicted_y_training <- ifelse(prob_y_train >= 0.5, "Malignant", "Benign")
predicted_y_test <- ifelse(prob_y_test >= 0.5, "Malignant", "Benign")
...
print(cm_test)
           predicted_y_test
            Benign Malignant
  Benign        86         3
  Malignant      2        46
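For completeness, accuracy can be read off the confusion matrix as the sum of the diagonal divided by the total. The sketch below re-enters the four counts printed above by hand (cm and accuracy are illustrative names). In your own session you could apply the same formula directly to cm_test.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
cm <- matrix(c(86, 2, 3, 46), nrow = 2,
             dimnames = list(actual = c("Benign", "Malignant"),
                             predicted = c("Benign", "Malignant")))

# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 3))  # about 0.964, i.e. 132 of 137 test cases
```

That is in line with the roughly 96% quoted for logistic regression on the dataset page.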