Problem training XGBoost on categorical columns

Question (votes: 2, answers: 4)

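I am trying to run a Python notebook (link). At cell [446] below, where the author trains XGBoost, I get this error:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment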

# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Here is the minimal test code:

import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

with open('train_store', 'rb') as f:
    train_store = pickle.load(f)

train_store.shape

predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day', 
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen', 
              'PromoOpen']

y = np.log(train_store.Sales) # log transformation of Sales
X = train_store

# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, # 30% for the evaluation set
                                                    random_state = 42)

# base parameters
params = {
    'booster': 'gbtree', 
    'objective': 'reg:linear', # regression task
    'subsample': 0.8,          # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85,  # 85% of features used
    'eta': 0.1, 
    'max_depth': 10, 
    'seed': 42} # for reproducible results

num_round = 60 # default 300

dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest  = xgb.DMatrix(X_test[predictors],  y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
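Note that the minimal code above passes feval = rmspe_xg without defining it; the definition lives in the linked notebook and is not reproduced here. The following is only a sketch of what such a metric might look like, assuming it computes the root mean squared percentage error on the original Sales scale (the target was log-transformed above, so the log is undone first). The helper name rmspe and the exact formula are assumptions, not the notebook author's code.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error (assumes no zero in y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

def rmspe_xg(yhat, dmatrix):
    """Custom metric in the (predictions, DMatrix) -> (name, value) form.

    Assumption: labels are log(Sales), so both labels and predictions
    are mapped back with exp before the error is computed.
    """
    y_true = np.exp(dmatrix.get_label())
    y_pred = np.exp(yhat)
    return "rmspe", rmspe(y_true, y_pred)
```

With a definition like this in scope, the xgb.train call no longer fails on an undefined name, although the dtype error described in the question still has to be fixed separately.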

Link to the train_store data file: Link 1

python xgboost categorical-data
4 Answers
1
vote

Try this:

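import pandas as pd

# Convert both columns to a numeric dtype (this only works if every value parses as a number)
train_store['StateHoliday'] = pd.to_numeric(train_store['StateHoliday'])
train_store['Assortment'] = pd.to_numeric(train_store['Assortment'])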
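One caveat with this approach: pd.to_numeric raises a ValueError as soon as a value cannot be parsed as a number. The sketch below uses a made-up sample column (the real contents of StateHoliday are not shown in the question, so the letter codes here are an assumption) to show the failure and what errors="coerce" does instead.

```python
import pandas as pd

# Hypothetical sample: a mix of numeric strings and letter codes.
state_holiday = pd.Series(["0", "a", "0", "b"], name="StateHoliday")

# Strict conversion fails on the first non-numeric string.
try:
    pd.to_numeric(state_holiday)
    converted_ok = True
except ValueError:
    converted_ok = False

# errors="coerce" turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(state_holiday, errors="coerce")
print(converted_ok)
print(coerced)
```

If the columns really do contain letter codes, coercing would discard that information as NaN, so an explicit encoding is probably the safer choice.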

1
vote

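I ran into exactly the same problem while working on the Rossmann sales forecasting project. It seems that newer versions of xgboost do not accept the data types of StateHoliday, Assortment, and StoreType. As Mykhailo Lisovyi suggested, you can check the data types with: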

print(test_train.dtypes)

Here you need to replace test_train with X_train.

You will probably get something like:

DayOfWeek                      int64
Promo                          int64
StateHoliday                   int64
SchoolHoliday                  int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
Year                           int64
Month                          int64
Day                            int64

The error is raised for the columns of type object. You can convert them with:
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))

After these steps, everything should run without the error.
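The same idea can be sketched with pandas alone: find every column that is neither numeric nor boolean, build the code mapping from the training data, and apply that same mapping to the test data so both splits share one encoding. The small frames below are hypothetical stand-ins for X_train and X_test, and mapping unseen test categories to -1 is just one possible convention.

```python
import pandas as pd

# Hypothetical stand-ins for the real train/test frames.
X_train = pd.DataFrame({"StoreType": ["a", "c", "b", "a"],
                        "Assortment": ["a", "c", "a", "b"],
                        "Promo": [1, 0, 1, 0]})
X_test = pd.DataFrame({"StoreType": ["b", "d"],
                       "Assortment": ["c", "a"],
                       "Promo": [0, 1]})

# Columns xgboost would reject: anything that is not numeric or bool.
bad_cols = X_train.select_dtypes(exclude=["number", "bool"]).columns.tolist()

for col in bad_cols:
    # Build the category -> integer mapping from the training data only.
    categories = sorted(X_train[col].astype(str).unique())
    mapping = {cat: code for code, cat in enumerate(categories)}
    X_train[col] = X_train[col].astype(str).map(mapping).astype("int64")
    # Categories never seen in training get the placeholder code -1.
    X_test[col] = X_test[col].astype(str).map(mapping).fillna(-1).astype("int64")

print(bad_cols)
print(X_train.dtypes)
```

Fitting the mapping on the training split only avoids the situation where the same letter ends up with different integer codes in train and test.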


0
votes

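As the error message suggests, xgboost is unhappy because you are trying to feed it types it does not recognize. It is telling you that it cannot handle categorical or datetime features. Check the types of the StateHoliday and Assortment features and encode them as numbers in some way (for example, one-hot encoding, label encoding (which works for tree-based models), or target encoding).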
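As a rough illustration of two of these options, the sketch below applies one-hot encoding and a very plain form of target encoding to a made-up frame. The column values and the Assortment_te column name are invented for the example; in practice the target means should be computed on training data only (ideally out of fold) to avoid leaking the target.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({"Assortment": ["a", "b", "a", "c"],
                   "Sales": [10.0, 20.0, 30.0, 40.0]})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["Assortment"], prefix="Assortment", dtype=int)

# Target encoding: replace each category with the mean target for that category.
category_means = df.groupby("Assortment")["Sales"].mean()
df["Assortment_te"] = df["Assortment"].map(category_means)

print(one_hot)
print(df)
```

One-hot encoding keeps categories unordered at the cost of more columns, while target encoding stays at a single column but depends on the target values.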


0
votes

The version of XGBoost in the H2O package can handle categorical variables (but not too many!), but it seems that XGBoost as a standalone package cannot.
