How to count decision tree rules in R

Question

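I use rpart to build decision trees, and that part works fine. However, I need to know (or compute) how many times the tree has been split. In other words, how many rules (if-else statements) does the tree have? For example: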

                  X
                 - - 
        if (a<9)-   - if(a>=9)
               Y     H
              -
      if(b>2)- 
            Z

There are 3 rules.

When I run summary(model):

summary(model_dt)

Call:
rpart(formula = Alert ~ ., data = train)
  n= 18576811 

         CP nsplit  rel error     xerror         xstd
1 0.9597394      0 1.00000000 1.00000000 0.0012360956
2 0.0100000      1 0.04026061 0.05290522 0.0002890205

Variable importance
         ip.src frame.protocols   tcp.flags.ack tcp.flags.reset       frame.len 
             20              17              17              17              16 
         ip.ttl 
             12 

Node number 1: 18576811 observations,    complexity param=0.9597394
  predicted class=yes  expected loss=0.034032  P(node) =1
    class counts: 632206 1.79446e+07
   probabilities: 0.034 0.966 
  left son=2 (627091 obs) right son=3 (17949720 obs)
  Primary splits:
      ip.src          splits as  LLLLLLLRRRLLRR ............ LLRLRLRRRRRRRRRRRRRRRR, improve=1170831.0, (0 missing)
      ip.dts          splits as  LLLLLLLLLLLLLLLLLLLRLLLLLLLLLLL, improve=1013082.0, (0 missing)
      tcp.flags.ctl   < 1.5   to the right, improve=1007953.0, (2645 missing)
      tcp.flags.syn   < 1.5   to the right, improve=1007953.0, (2645 missing)
      frame.len       < 68    to the right, improve= 972871.3, (30 missing)
  Surrogate splits:
      frame.protocols splits as  LLLLLLLLLLLLLLLLLLLRLLLLLLLLLLL, agree=0.995, adj=0.841, (0 split)
      tcp.flags.ack   < 1.5   to the right, agree=0.994, adj=0.836, (0 split)
      tcp.flags.reset < 1.5   to the right, agree=0.994, adj=0.836, (0 split)
      frame.len       < 68    to the right, agree=0.994, adj=0.809, (0 split)
      ip.ttl          < 230.5 to the right, agree=0.987, adj=0.612, (0 split)

Node number 2: 627091 observations
  predicted class=no   expected loss=0.01621615  P(node) =0.03375666
    class counts: 616922 10169
   probabilities: 0.984 0.016 

Node number 3: 17949720 observations
  predicted class=yes  expected loss=0.0008514896  P(node) =0.9662433
    class counts: 15284 1.79344e+07
   probabilities: 0.001 0.999

I would appreciate it if someone could help me understand.

Sincerely,
伊雷

Answer 1

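There are a few ways to do this once you know what the fitted tree object contains (see ?rpart.object).

I'll show two approaches using the kyphosis data set, following the first example in ?rpart: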

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

Option 1

> tail(fit$cptable[, "nsplit"], 1)
3 
4
> unname(tail(fit$cptable[, "nsplit"], 1)) ## or
[1] 4

This comes from the cptable component, which holds the cost-complexity information for trees of each size:

> fit$cptable
          CP nsplit rel error   xerror      xstd
1 0.17647059      0 1.0000000 1.000000 0.2155872
2 0.01960784      1 0.8235294 1.176471 0.2282908
3 0.01000000      4 0.7647059 1.176471 0.2282908

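As far as I know, the last row of this table refers to the current largest tree. If you prune the tree to a particular size based on CP, the last row of the pruned model's cptable holds the information for a tree of that size: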

> fit2 <- prune(fit, cp = 0.02)
> fit2$cptable
         CP nsplit rel error   xerror      xstd
1 0.1764706      0 1.0000000 1.000000 0.2155872
2 0.0200000      1 0.8235294 1.176471 0.2282908
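
As a minimal sketch of Option 1, the lookup can be wrapped in a small helper. The name count_splits is hypothetical (it is not part of rpart), and the sketch assumes, as the answer does, that the last row of cptable describes the tree as currently fitted or pruned.

```r
library(rpart)

# Hypothetical helper (not part of rpart): read the number of splits
# from the last row of the model's cptable.
count_splits <- function(model) {
  unname(tail(model$cptable[, "nsplit"], 1))
}

fit  <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit2 <- prune(fit, cp = 0.02)

count_splits(fit)   # 4 for the full tree, as in the cptable above
count_splits(fit2)  # 1 after pruning with cp = 0.02
```

Because the value is read from the model object itself, the same call works on both the original and the pruned tree.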

Option 2

A second option is to count the number of occurrences of <leaf> in the var column of the frame component of the fitted model:

> fit$frame
      var  n wt dev yval complexity ncompete nsurrogate    yval2.V1    yval2.V2
1   Start 81 81  17    1 0.17647059        2          1  1.00000000 64.00000000
2   Start 62 62   6    1 0.01960784        2          2  1.00000000 56.00000000
4  <leaf> 29 29   0    1 0.01000000        0          0  1.00000000 29.00000000
5     Age 33 33   6    1 0.01960784        2          2  1.00000000 27.00000000
10 <leaf> 12 12   0    1 0.01000000        0          0  1.00000000 12.00000000
11    Age 21 21   6    1 0.01960784        2          0  1.00000000 15.00000000
22 <leaf> 14 14   2    1 0.01000000        0          0  1.00000000 12.00000000
23 <leaf>  7  7   3    2 0.01000000        0          0  2.00000000  3.00000000
3  <leaf> 19 19   8    2 0.01000000        0          0  2.00000000  8.00000000
      yval2.V3    yval2.V4    yval2.V5 yval2.nodeprob
1  17.00000000  0.79012346  0.20987654     1.00000000
2   6.00000000  0.90322581  0.09677419     0.76543210
4   0.00000000  1.00000000  0.00000000     0.35802469
5   6.00000000  0.81818182  0.18181818     0.40740741
10  0.00000000  1.00000000  0.00000000     0.14814815
11  6.00000000  0.71428571  0.28571429     0.25925926
22  2.00000000  0.85714286  0.14285714     0.17283951
23  4.00000000  0.42857143  0.57142857     0.08641975
3  11.00000000  0.42105263  0.57894737     0.23456790

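That count minus 1 is the number of splits, because every split in a binary tree adds exactly one leaf. To do the counting we can use: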

> grepl("^<leaf>$", as.character(fit$frame$var))
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
> sum(grepl("^<leaf>$", as.character(fit$frame$var))) - 1
[1] 4

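The regular expression I used may be overkill, but it means: match a string that starts (^) and ends ($) with "<leaf>", i.e. "<leaf>" is the whole string. I use grepl() to return the matches on the var column as a logical vector, so we can sum the TRUE values and subtract 1. Because var is stored as a factor, I convert it to a character vector inside the grepl() call.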

You can also use grep() to return the indices of the matches and length() to count them:

> grep("^<leaf>$", as.character(fit$frame$var))
[1] 3 5 7 8 9
> length(grep("^<leaf>$", as.character(fit$frame$var))) - 1
[1] 4
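
For comparison, here is a sketch of the same count without a regular expression. It assumes only that leaf rows of frame carry the value "<leaf>" in var, as in the output above. Comparing with == works whether var is stored as a factor or as a character vector.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Rows whose var is "<leaf>" are terminal nodes; all other rows are splits.
is_leaf  <- fit$frame$var == "<leaf>"
n_leaves <- sum(is_leaf)    # terminal nodes
n_splits <- sum(!is_leaf)   # internal nodes, i.e. splits

n_leaves  # 5 for this fit
n_splits  # 4 for this fit, the same as n_leaves - 1
```

Counting the non-leaf rows directly avoids the subtraction and makes the relationship between leaves and splits explicit.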

Answer 2

Empirically, I've found that I can build a data frame of the rules with the rpart.plot package and count its rows:

library(rpart.plot)

> fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
> data.frame(rpart.rules(fit))
   Kyphosis Var.2 Var.3 Var.4 Var.5 Var.6 Var.7 Var.8 Var.9 Var.10 Var.11 Var.12 Var.13
4      0.00  when Start    >=                15                                        
10     0.00  when Start    is     9    to    15     &   Age     <      55              
22     0.14  when Start    is     9    to    15     &   Age     >=                  111
23     0.57  when Start    is     9    to    15     &   Age     is     55     to    111
3      0.58  when Start    <      9                                                    
> nrow(data.frame(rpart.rules(fit)))
[1] 5

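My result is 5 while the other answer gives 4, so perhaps I'm not counting exactly what you want. rpart.rules() returns one row per leaf, so this counts leaves (complete root-to-leaf rules), which is always the number of splits plus one. That tends to be the number I want.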
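
The two counts can be reconciled using rpart alone. The sketch below, offered as one possible approach rather than either answerer's method, lists the root-to-leaf path for every leaf with path.rpart() and compares the number of paths with the number of splits. The variable names are illustrative.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Node numbers are stored as the row names of fit$frame.
is_leaf    <- fit$frame$var == "<leaf>"
leaf_nodes <- as.numeric(rownames(fit$frame)[is_leaf])

# One character vector of conditions per leaf, each starting at "root".
rules <- path.rpart(fit, nodes = leaf_nodes, print.it = FALSE)

n_rules  <- length(rules)   # one rule per leaf: 5 for this fit
n_splits <- sum(!is_leaf)   # 4 for this fit

n_rules == n_splits + 1     # TRUE
```

So "4" answers "how many times was the tree split?", while "5" answers "how many distinct if-else paths end in a prediction?".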
