My logistic regression machine learning model has a problem. Please help me correct my code


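My model's accuracy is very poor. The data comes from https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original, which reports 96% accuracy for a logistic regression model, so the problem really is in my model. I built the following model in R.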

# Importing dataset 
tumor_study <- read.csv("breast-cancer-wisconsin.data", header = FALSE, na.strings = "NA")

# Adding column names
features <- c("id_number", "ClumpThickness", "Uniformity_CellSize",
              "Uniformity_CellShape", "MarginalAdhesion",
              "SingleEpithelial_CellSize", "BareNuclei", "Bland_Chromatin",
              "Normal_Nucleoli", "Mitoses", "Class")

colnames(tumor_study) <- features 

# Cleaning data
# Remove the 1st column (id_number)
tumor_study <- tumor_study[,-1]

# Convert "?" to NA in BareNuclei column and then to numeric
tumor_study$BareNuclei[tumor_study$BareNuclei == "?"] <- NA
tumor_study$BareNuclei <- as.numeric(tumor_study$BareNuclei)

# Remove rows with missing values in BareNuclei
tumor_study <- tumor_study[!is.na(tumor_study$BareNuclei),]

# Convert Class to a factor
tumor_study$Class <- factor(tumor_study$Class, levels = c(2, 4), labels = c("Benign", "Malignant"))

# Splitting the dataset into training and test sets
library(caTools)
set.seed(123)
split <- sample.split(tumor_study$Class, SplitRatio = 0.8)
training_set <- tumor_study[split == TRUE,]
test_set <- tumor_study[split == FALSE,]

# Applying feature scaling
training_set[, 1:9] <- scale(training_set[, 1:9])
test_set[, 1:9] <- scale(test_set[, 1:9])

# Building the logistic regression model
classifier <- glm(formula = Class ~ ., family = binomial, data = training_set)

# Predicting probabilities for the training set
prob_y_train <- predict(classifier, type = 'response', newdata = training_set[,-10])
predicted_y_training <- ifelse(prob_y_train >= 0.5, "Benign", "Malignant")

# prediction using test_set
prob_y_test <- predict(classifier, type = 'response', newdata = test_set[,-10])
predicted_y_test <- ifelse(prob_y_test >= 0.5, "Benign", "Malignant")

# Checking the accuracy with confusion matrix
cm_test <- table(test_set[,10], predicted_y_test)
print(cm_test)

## if you check the accuracy ... it is close to 2%

I cannot find the problem in my model, but something must be wrong. I am hoping someone can tell me where the mistake is.

r machine-learning logistic-regression
1 Answer

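Your probabilities are for "Malignant", not "Benign". The outcome is a factor with 2 levels, and when R's glm function fits a logistic regression it models the probability of the second level.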

levels(training_set$Class)
[1] "Benign"    "Malignant"
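
To see this behaviour in isolation, here is a minimal sketch on made-up toy data (the variables x and y below are simulated for illustration and are not part of the question's dataset). It checks that the fitted probabilities are higher for observations in the second factor level, and it derives the predicted labels from levels(y) rather than hard-coding their order, which is one way to avoid this kind of swap.

```r
# Toy data (simulated, for illustration only): larger x makes "Malignant" more likely
set.seed(1)
x <- c(rnorm(50, mean = -1), rnorm(50, mean = 1))
y <- factor(rep(c("Benign", "Malignant"), each = 50),
            levels = c("Benign", "Malignant"))

fit <- glm(y ~ x, family = binomial)
p <- predict(fit, type = "response")  # fitted P(y == levels(y)[2])

# Fitted probabilities are higher, on average, for the second level
mean(p[y == "Malignant"]) > mean(p[y == "Benign"])

# Take the labels from the factor levels instead of typing them in a fixed order
pred <- ifelse(p >= 0.5, levels(y)[2], levels(y)[1])
mean(pred == y)  # training accuracy on the toy data
```
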

Just swap "Benign" and "Malignant" in the two ifelse calls and you will get much better accuracy.

predicted_y_training <- ifelse(prob_y_train >= 0.5, "Malignant", "Benign")
predicted_y_test <- ifelse(prob_y_test >= 0.5, "Malignant", "Benign")
...

print(cm_test)
           predicted_y_test
            Benign Malignant
  Benign        86         3
  Malignant      2        46
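
The accuracy implied by this table can be checked directly. The sketch below retypes the four counts by hand (the object name cm is only illustrative) and divides the diagonal by the total; with the real objects, the same expression applied to cm_test should give the same figure.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
cm <- matrix(c(86, 2, 3, 46), nrow = 2,
             dimnames = list(actual = c("Benign", "Malignant"),
                             predicted = c("Benign", "Malignant")))

accuracy <- sum(diag(cm)) / sum(cm)  # (86 + 46) / 137
round(accuracy, 3)                   # 0.964
```

That is roughly 96%, in line with the figure quoted in the question. With the labels swapped, the same five misclassified rows are the only ones counted as correct, which gives the very low accuracy the question describes.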