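Is there a way to quantify the overlap between groups in a scatter plot, in order to judge whether the data can be classified with an SVM model?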


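To make this concrete, I use a few datasets that illustrate different variants of 2D data.

The datasets can be accessed at: https://drive.google.com/file/d/14-VivVlGSlaJo6BXlYMqn-1leorSU6ET/view?usp=sharing

There is also a helper function: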

scatterplot_check <- function(data, dependent_col, x_column, y_column, legend_pos = "topright") {
  # Open a new graphics window (dev.new() is the portable form of x11())
  dev.new()
  group <- data[[dependent_col]]
  # Character labels are treated exactly like factor labels
  if (is.character(group)) group <- as.factor(group)
  if (is.factor(group)) {
    legend_labels <- levels(group)
    point_cols <- as.numeric(group)
  } else if (is.integer(group)) {
    # Colour index 0 is the background colour, so shift 0-based class codes by one
    offset <- if (min(group) == 0) 1L else 0L
    legend_labels <- sort(unique(group))
    point_cols <- group + offset
  } else {
    stop("dependent_col must be a factor, character or integer column")
  }
  plot(data[[x_column]], data[[y_column]],
       col = point_cols, pch = 18,
       xlab = x_column, ylab = y_column)
  legend(legend_pos, legend = legend_labels,
         col = sort(unique(point_cols)), pch = 18)
}

Suppose I read all the data into the environment:

dataset1 <- read.csv("dataset1.csv")
dataset2 <- read.csv("dataset2.csv")
dataset3 <- read.csv("dataset3.csv")

Here are some variants of the scatter plot:

scatterplot_check(dataset1, "y","x.1","x.2")

This one is most likely to be classifiable with an SVM model.

scatterplot_check(dataset2, "Purchased","Age","EstimatedSalary")

This one is also likely to be classifiable with an SVM model.

scatterplot_check(dataset3, "grades","english","math")

This one is not likely to be classifiable with an SVM model.

scatterplot_check(dataset3, "grades","read","math", legend_pos="topleft")

This one is unlikely to be classifiable with an SVM model.

Is there a good way to calculate how likely it is that a 2D scatter plot can be modeled with an SVM?

r svm scatter-plot
1 Answer

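Calculate, for each class, the percentage of the X and Y values that fall into each interval of a range sequence (that is, bin both variables).
  1. Define a percentage threshold (I used 5% in my case).
  2. Check the X and Y distributions that remain after filtering by the 5% threshold. If every X and Y variable has the same bin distribution in each class, the data is unlikely to be modeled well by an SVM, because the variables appear independent of the chosen class. Conversely, if any X or Y variable has a different bin distribution across classes, the data is likely to be modeled well by an SVM, because its distribution differs by class.
  3. Here are the results when I applied this to the 4 cases: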

d1_compare <- dataset_class_comparison(dataset1, "y", "x.1", "x.2")
============================================================================
Class = -1
SeqX(-10,10,1) SeqY(-10,10,1)
x.1_-2 to -1 (pct) x.1_-1 to 0 (pct) x.1_0 to 1 (pct) x.1_1 to 2 (pct)
              0.16              0.38             0.30             0.10
x.2_-2 to -1 (pct) x.2_-1 to 0 (pct) x.2_0 to 1 (pct) x.2_1 to 2 (pct)
              0.14              0.28             0.46             0.08
============================================================================
============================================================================
Class = 1
SeqX(-10,10,1) SeqY(-10,10,1)
x.1_-1 to 0 (pct) x.1_1 to 2 (pct) x.1_2 to 3 (pct) x.1_3 to 4 (pct)
             0.08             0.42             0.36             0.08
x.2_-1 to 0 (pct) x.2_0 to 1 (pct) x.2_1 to 2 (pct) x.2_2 to 3 (pct) x.2_3 to 4 (pct)
             0.06             0.26             0.38             0.20             0.06
============================================================================
Conclusion: Since each class within a 5% threshold not having similiar distribution from x.1 or x.2 SVM Likely can be modeled

d2_compare <- dataset_class_comparison(dataset2, "Purchased", "Age", "EstimatedSalary")
============================================================================
Class = 0
SeqX(10,100,10) SeqY(10000,1e+06,10000)
Age_10 to 20 (pct) Age_20 to 30 (pct) Age_30 to 40 (pct) Age_40 to 50 (pct)
             0.066              0.325              0.413              0.178
EstimatedSalary_10000 to 20000 (pct) EstimatedSalary_20000 to 30000 (pct) EstimatedSalary_30000 to 40000 (pct)
                               0.063                                0.077                                0.059
EstimatedSalary_40000 to 50000 (pct) EstimatedSalary_50000 to 60000 (pct) EstimatedSalary_60000 to 70000 (pct)
                               0.098                                0.182                                0.112
EstimatedSalary_70000 to 80000 (pct) EstimatedSalary_80000 to 90000 (pct)
                               0.210                                0.150
============================================================================
============================================================================
Class = 1
SeqX(10,100,10) SeqY(10000,1e+06,10000)
Age_30 to 40 (pct) Age_40 to 50 (pct) Age_50 to 60 (pct)
             0.222              0.392              0.304
EstimatedSalary_20000 to 30000 (pct) EstimatedSalary_30000 to 40000 (pct) EstimatedSalary_40000 to 50000 (pct)
                               0.123                                0.105                                0.056
EstimatedSalary_70000 to 80000 (pct) EstimatedSalary_80000 to 90000 (pct) EstimatedSalary_90000 to 1e+05 (pct)
                               0.080                                0.080                                0.074
EstimatedSalary_1e+05 to 110000 (pct) EstimatedSalary_110000 to 120000 (pct) EstimatedSalary_120000 to 130000 (pct)
                                0.093                                  0.062                                  0.062
EstimatedSalary_130000 to 140000 (pct) EstimatedSalary_140000 to 150000 (pct)
                                 0.093                                  0.099
============================================================================
Conclusion: Since each class within a 5% threshold not having similiar distribution from Age or EstimatedSalary SVM Likely can be modeled

d3_compare <- dataset_class_comparison(dataset3, "grades", "english", "math")
============================================================================
Class = KK-08
SeqX(0,100,10) SeqY(100,1000,100)
english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct)
                0.571                  0.162                  0.061                  0.084                  0.056
math_600 to 700 (pct)
                0.989
============================================================================
============================================================================
Class = KK-06
SeqX(0,100,10) SeqY(100,1000,100)
english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct)
                0.377                  0.262                  0.098                  0.131                  0.066
math_600 to 700 (pct)
                0.984
============================================================================
Conclusion: Since each class within a 5% threshold having similiar distribution either from english and math SVM Unlikely can be modeled

d4_compare <- dataset_class_comparison(dataset3, "grades", "math", "read")
============================================================================
Class = KK-08
SeqX(100,1000,100) SeqY(100,1000,100)
math_600 to 700 (pct)
                0.989
read_600 to 700 (pct)
                0.992
============================================================================
============================================================================
Class = KK-06
SeqX(100,1000,100) SeqY(100,1000,100)
math_600 to 700 (pct)
                0.984
read_600 to 700 (pct)
                    1
============================================================================
Conclusion: Since each class within a 5% threshold having similiar distribution either from math and read SVM Unlikely can be modeled

dataset_class_comparison is a custom function of more than 300 lines, which can be found at https://drive.google.com/file/d/1RmIhbNnKZWS2jFIsS9p4LWjhcbikpOga/view?usp=sharing
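The steps in this answer can be sketched in a few lines of base R. This is only a minimal illustration and not the linked dataset_class_comparison function: the helper names bin_shares and same_bins_across_classes and the toy data are made up here, and the sketch assumes that "same distribution" means the same set of bins survives the threshold in every class, which is how the printed results read but is not stated explicitly.

```r
# Hypothetical helper (not the author's code): share of values per bin,
# keeping only the bins whose share reaches the threshold.
bin_shares <- function(values, breaks, threshold = 0.05) {
  bins <- cut(values, breaks = breaks, include.lowest = TRUE)
  shares <- prop.table(table(bins))
  shares[shares >= threshold]
}

# Hypothetical helper: TRUE when every class keeps exactly the same set of
# bins for one variable (assumed reading of "same distribution").
same_bins_across_classes <- function(data, class_col, var_col, breaks,
                                     threshold = 0.05) {
  per_class <- split(data[[var_col]], data[[class_col]])
  kept <- lapply(per_class, function(v) names(bin_shares(v, breaks, threshold)))
  all(vapply(kept, function(k) setequal(k, kept[[1]]), logical(1)))
}

# Toy data invented for illustration: x.1 differs by class, x.2 does not.
toy <- data.frame(
  y   = rep(c(-1, 1), each = 50),
  x.1 = c(rep(c(-0.5, 0.5), 25), rep(c(2.5, 3.5), 25)),
  x.2 = rep(c(0.5, 1.5), 50)
)
breaks <- seq(-10, 10, by = 1)

same_x1 <- same_bins_across_classes(toy, "y", "x.1", breaks)
same_x2 <- same_bins_across_classes(toy, "y", "x.2", breaks)

# Rule from the answer: "unlikely" only if all variables look the same per class.
verdict <- if (same_x1 && same_x2) "SVM unlikely" else "SVM likely"
print(verdict)
```

Because the comparison only looks at which bins survive the threshold, the outcome depends heavily on the chosen bin width and threshold, so it is best treated as a rough screening heuristic rather than a measure of how well an SVM will actually perform.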
