KeyError in a decision tree during prediction


I want to add predict and predict_proba methods to my DecisionTreeClassifier implementation, but calling predict raises an error:

Traceback (most recent call last):
  File "c:\Users\Nijat\project.py", line 136, in <module>
    print(model.predict(X))
          ^^^^^^^^^^^^^^^^
  File "c:\Users\Nijat\project.py", line 128, in predict
    return [1 if p[0] > 0.5 else 0 for p in self.predict_proba(X)]
                                            ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Nijat\project.py", line 121, in predict_proba
    class1_proba = self.bypass_tree(self.tree, sample)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Nijat\project.py", line 105, in bypass_tree
    while node['type'] == 'node':
          ~~~~^^^^^^^^
KeyError: 'type'

Here is my code:

import numpy as np
import pandas as pd

class MyTreeClf:
    def __init__(self, max_depth=5, min_samples_split=2, max_leafs=20):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_leafs = max_leafs
        self.tree = None
        self.leafs_cnt = 0

    def node_entropy(self, probs):
        return -np.sum([p * np.log2(p) for p in probs if p > 0])

    def node_ig(self, x_col, y, split_value):
        left_mask = x_col <= split_value
        right_mask = x_col > split_value

        if len(x_col[left_mask]) == 0 or len(x_col[right_mask]) == 0:
            return 0

        left_counts = np.bincount(y[left_mask])
        right_counts = np.bincount(y[right_mask])

        left_probs = left_counts / len(y[left_mask]) if len(y[left_mask]) > 0 else np.zeros_like(left_counts)
        right_probs = right_counts / len(y[right_mask]) if len(y[right_mask]) > 0 else np.zeros_like(right_counts)

        entropy_after = (len(y[left_mask]) / len(y) * self.node_entropy(left_probs) +
                         len(y[right_mask]) / len(y) * self.node_entropy(right_probs))
        entropy_before = self.node_entropy(np.bincount(y) / len(y))

        return entropy_before - entropy_after

    def get_best_split(self, X: pd.DataFrame, y: pd.Series):
        best_col, best_split_value, best_ig = None, None, -np.inf

        for col in X.columns:
            sorted_unique_values = np.sort(X[col].unique())

            for i in range(1, len(sorted_unique_values)):
                split_value = (sorted_unique_values[i - 1] + sorted_unique_values[i]) / 2

                ig = self.node_ig(X[col], y, split_value)

                if ig > best_ig:
                    best_ig = ig
                    best_col = col
                    best_split_value = split_value

        return best_col, best_split_value

    def fit(self, X: pd.DataFrame, y: pd.Series, depth=1, node=None):
        if self.max_leafs < 2:
            self.leafs_cnt = 2
            return

        if node is None:
            node = {}
            self.tree = node

        best_col, best_split_value = self.get_best_split(X, y)

        node['type'] = None
        node['feature'] = best_col
        node['threshold'] = best_split_value

        if len(y.unique()) == 1:
            self.leafs_cnt += 1
            node['type'] = 'leaf'
            node['class_counts'] = {y.unique()[0]: len(y)}
            return

        if len(y) == 1:
            self.leafs_cnt += 1
            node['type'] = 'leaf'
            node['class_counts'] = {y.values[0]: 1}
            return

        if depth >= self.max_depth or len(y) < self.min_samples_split or (self.leafs_cnt + 2 > self.max_leafs):
            self.leafs_cnt += 1
            node['type'] = 'leaf'
            node['class_counts'] = y.value_counts().to_dict()
            return

        if best_col is None:
            node['type'] = 'leaf'
            node['class_counts'] = y.value_counts().to_dict()
            self.leafs_cnt += 1
            return

        node['type'] = 'node'
        node['feature'] = best_col
        node['threshold'] = best_split_value

        left_mask = X[best_col] <= best_split_value
        right_mask = X[best_col] > best_split_value

        node['left'] = {}
        node['right'] = {}

        self.fit(X[left_mask], y[left_mask], depth + 1, node['left'])
        self.fit(X[right_mask], y[right_mask], depth + 1, node['right'])

    def bypass_tree(self, node, sample):
        while node['type'] == 'node':
            feature_value = sample[node['feature']]
            if feature_value <= node['threshold']:
                node = node['left']
            else:
                node = node['right']
    
        total_count = sum(node['class_counts'].values())
        class_1_count = node['class_counts'].get(1, 0)
        class1_proba = class_1_count / total_count if total_count > 0 else 0

        return class1_proba

    def predict_proba(self, X: pd.DataFrame):
        proba = []
        for _, sample in X.iterrows():
            class1_proba = self.bypass_tree(self.tree, sample)

            proba.append(class1_proba)
    
        return np.array(proba)

    def predict(self, X: pd.DataFrame):
        return [1 if p[0] > 0.5 else 0 for p in self.predict_proba(X)]

df = pd.read_csv('c:\\Users\\Nijat\\Downloads\\banknote+authentication.zip', header=None)
df.columns = ['variance', 'skewness', 'curtosis', 'entropy', 'target']
X, y = df.iloc[:, :4], df['target']

model = MyTreeClf(max_depth=3, min_samples_split=2, max_leafs=1)
model.fit(X, y)
print(model.predict(X))

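The predict and predict_proba methods take the feature matrix as a pandas DataFrame. For each row of the DataFrame, traverse the tree's nodes until a leaf is reached and record the probability of class 1. Return a list of predictions with as many entries as the DataFrame has rows.

predict_proba returns the probabilities (for class 1).

predict converts the probabilities to binary classes using the threshold > 0.5.

Validation:

Input: two sets of decision tree parameters.

Output: the returned probabilities and label predictions (their sum).

I don't know exactly which dataset is used to check the code, but I think it is "banknote authentication".

Example input: {"max_depth": 3, "min_samples_split": 2, "max_leafs": 1}
Example output: 11.2443438914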

python machine-learning decision-tree
1 Answer

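There are two issues here. The first is the constructor call:

model = MyTreeClf(max_depth=3, min_samples_split=2, max_leafs=1)

Because of the max_leafs < 2 check at the top of fit, the method returns immediately, so no tree is built and self.tree stays None. Setting max_leafs to 2 allows the tree to be built.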
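To make the early return concrete, here is a minimal sketch. TinyTree is a hypothetical stand-in, not your MyTreeClf: it keeps only the max_leafs guard from fit and replaces the real tree building with a single placeholder leaf.

```python
class TinyTree:
    """Hypothetical stand-in that keeps only the max_leafs guard from fit."""

    def __init__(self, max_leafs=20):
        self.max_leafs = max_leafs
        self.tree = None

    def fit(self):
        # Same guard as in the question: return before any node is created.
        if self.max_leafs < 2:
            return
        # Placeholder for the real tree building: a single leaf node.
        self.tree = {'type': 'leaf', 'class_counts': {0: 1, 1: 1}}


unfitted = TinyTree(max_leafs=1)
unfitted.fit()
print(unfitted.tree)        # None, the guard returned before building anything

fitted = TinyTree(max_leafs=2)
fitted.fit()
print(fitted.tree['type'])  # leaf
```

Any prediction code that starts from self.tree has nothing to traverse in the first case, so it fails as soon as it tries to read a key from the root.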

I don't have your dataset, so we'll test with a random one:

df = pd.DataFrame(np.random.random(size=(500, 5)),
                  columns=['variance', 'skewness', 'curtosis', 'entropy', 'target'])
df['target'] = df['target'].apply(lambda x: 0 if x < 0.5 else 1)
X, y = df.iloc[:, :4], df['target']

Then inspect the model's tree to check whether it was built:

model = MyTreeClf(max_depth=3, min_samples_split=2, max_leafs=2)
model.fit(X, y)
print(model.tree)
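The raw dict gets hard to read once the tree has more than a couple of levels. A small helper along these lines renders it as indented text. Note that format_tree is a hypothetical name, and the hand-built example_tree only mimics the node keys used in the question ('type', 'feature', 'threshold', 'left', 'right', 'class_counts'):

```python
def format_tree(node, indent=0):
    """Return an indented text view of a nested-dict tree."""
    pad = '  ' * indent
    if node['type'] == 'leaf':
        return f"{pad}leaf: {node['class_counts']}\n"
    # Internal node: show the split, then both subtrees one level deeper.
    text = f"{pad}{node['feature']} <= {node['threshold']}\n"
    text += format_tree(node['left'], indent + 1)
    text += format_tree(node['right'], indent + 1)
    return text


# Hand-built tree with made-up values, for illustration only.
example_tree = {
    'type': 'node', 'feature': 'variance', 'threshold': 0.5,
    'left': {'type': 'leaf', 'class_counts': {0: 3, 1: 1}},
    'right': {'type': 'leaf', 'class_counts': {1: 4}},
}
print(format_tree(example_tree), end='')
```

With a fitted model you would call it as print(format_tree(model.tree)), assuming the tree was actually built.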

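This prints a populated tree, so the node structure works. You can now run

model.predict_proba(X)

which returns a one-dimensional array, so each p is a single value rather than a list. That is the second issue: predict indexes p[0], so it needs to be changed to compare p directly: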

def predict(self, X: pd.DataFrame):
    return [1 if p > 0.5 else 0 for p in self.predict_proba(X)]
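As an alternative to the list comprehension, the threshold can be applied to the whole array at once. This is a sketch with made-up probabilities standing in for the output of predict_proba, and it also shows why p[0] fails on the elements of a one-dimensional array:

```python
import numpy as np

# Made-up probabilities standing in for the output of predict_proba.
proba = np.array([0.1, 0.5, 0.9])

# Each element is a NumPy scalar, so indexing it with [0] raises IndexError.
try:
    proba[0][0]
    indexing_failed = False
except IndexError:
    indexing_failed = True

# Thresholding the whole array avoids the per-element loop.
labels = (proba > 0.5).astype(int)
print(indexing_failed, labels.tolist())  # True [0, 0, 1]
```

Inside the class this would amount to returning (self.predict_proba(X) > 0.5).astype(int), which gives a NumPy array instead of a list; add .tolist() if the checker expects a list.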