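I am trying to run a Python notebook (link). At cell In [446], where the author trains XGBoost, I get the following error:
ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment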
# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]
xgb_model = xgb.train(params, dtrain, 300, evals=watchlist,
                      early_stopping_rounds=50, feval=rmspe_xg, verbose_eval=True)
Here is the minimal test code:
import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
with open('train_store', 'rb') as f:
    train_store = pickle.load(f)
train_store.shape
predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day',
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth',
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen',
              'PromoOpen']
y = np.log(train_store.Sales) # log transformation of Sales
X = train_store
# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,  # 30% for the evaluation set
                                                    random_state=42)
# base parameters
params = {
    'booster': 'gbtree',
    'objective': 'reg:linear',  # regression task
    'subsample': 0.8,  # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85,  # 85% of features used
    'eta': 0.1,
    'max_depth': 10,
    'seed': 42}  # for reproducible results
num_round = 60 # default 300
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]
xgb_model = xgb.train(params, dtrain, num_round, evals=watchlist,
                      early_stopping_rounds=50, feval=rmspe_xg, verbose_eval=True)
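The minimal code passes rmspe_xg as feval but never defines it. The following is only a sketch of what such a metric could look like, not the notebook author's definition. It assumes the metric is the root mean square percentage error and that the labels were built with np.log(Sales), so np.exp is used to return to the original scale.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error on the original (non-log) scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))

def rmspe_xg(preds, dmatrix):
    """Custom metric in the (predictions, DMatrix) -> (name, value) form used by feval."""
    # Assumption: labels are np.log(Sales), so np.exp undoes the transform.
    y_true = np.exp(dmatrix.get_label())
    y_pred = np.exp(preds)
    return "rmspe", rmspe(y_true, y_pred)
```

The function has to be defined before the call to xgb.train.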
Link to the train_store data file: Link 1
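Try this (note that pd.to_numeric only works if the columns hold numeric strings; letter codes such as 'a' raise a ValueError):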
train_store['StateHoliday'] = pd.to_numeric(train_store['StateHoliday'])
train_store['Assortment'] = pd.to_numeric(train_store['Assortment'])
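To see how pd.to_numeric behaves on the kind of values these columns may hold, here is a small sketch on toy data. The column contents are an assumption (the real file is not shown here), and numeric_text is a hypothetical column added for contrast.

```python
import pandas as pd

# Toy data; the values are an assumption about what the real columns contain.
df = pd.DataFrame({
    "StateHoliday": ["0", "0", "a", "b"],
    "numeric_text": ["1", "14", "22", "40"],  # hypothetical column
})

# Purely numeric strings convert cleanly to integers.
df["numeric_text"] = pd.to_numeric(df["numeric_text"])

# Letters make the default call raise ValueError.
try:
    pd.to_numeric(df["StateHoliday"])
    raised = False
except ValueError:
    raised = True

# errors="coerce" avoids the exception but turns the letters into NaN,
# which silently throws away the category information.
coerced = pd.to_numeric(df["StateHoliday"], errors="coerce")
print(raised)
print(coerced)
```

If the columns do contain letter codes, an explicit encoding is likely a better fit than coercion.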
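I ran into exactly the same problem while working on the Rossmann sales prediction project. It seems that newer versions of xgboost do not accept the data types of StateHoliday, Assortment and StoreType. As Mykhailo Lisovyi suggested, you can check the data types with: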
print(test_train.dtypes)
Here you need to replace test_train with X_train.
You will probably get something like:
DayOfWeek int64
Promo int64
StateHoliday int64
SchoolHoliday int64
StoreType object
Assortment object
CompetitionDistance float64
CompetitionOpenSinceMonth float64
CompetitionOpenSinceYear float64
Promo2 int64
Promo2SinceWeek float64
Promo2SinceYear float64
Year int64
Month int64
Day int64
The error is raised for the columns of type object. You can convert them using:
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))
After these steps, everything should run smoothly.
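The same idea can be applied to every offending column at once instead of naming each one. The sketch below uses a toy frame with made-up values in place of the real data. It swaps in pandas category codes for LabelEncoder so that it depends on pandas only; both assign integers in sorted order of the labels.

```python
import pandas as pd

# Toy frame standing in for the real feature table; values are illustrative only.
X = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "StoreType": ["c", "a", "a", "d"],
    "Assortment": ["a", "c", "b", "a"],
    "CompetitionDistance": [1270.0, 570.0, 14130.0, 620.0],
})

# Columns that are neither numeric nor boolean are the ones xgboost rejects.
bad_cols = X.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Replace each of them with integer codes (categories are sorted: 'a' -> 0, 'b' -> 1, ...).
for col in bad_cols:
    X[col] = X[col].astype(str).astype("category").cat.codes.astype("int64")

print(bad_cols)
print(X.dtypes)
```

Encoding the full table before train_test_split keeps the codes consistent between the training and evaluation sets.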
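As the error message suggests, xgboost is unhappy because you are trying to feed it column types it does not know. In other words, it cannot handle categorical or datetime features directly. Check the types of the StateHoliday and Assortment features and encode them as numbers in some way (e.g. one-hot encoding, label encoding (which works for tree-based models), or target encoding).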
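As a rough illustration of two of the encodings mentioned above, here is a sketch on toy data. The Assortment values and the Sales target are made up, and Assortment_te is a hypothetical column name.

```python
import pandas as pd

# Toy data; column values and the Sales target are illustrative only.
df = pd.DataFrame({
    "Assortment": ["a", "c", "a", "b"],
    "Sales": [100.0, 300.0, 200.0, 400.0],
})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["Assortment"], prefix="Assortment", dtype=int)

# Target (mean) encoding: replace each category with the mean target of that category.
# In practice the means should be computed on the training rows only, to avoid leakage.
category_means = df.groupby("Assortment")["Sales"].mean()
df["Assortment_te"] = df["Assortment"].map(category_means)

print(one_hot)
print(df)
```

One-hot encoding adds a column per category, so it suits low-cardinality features such as these; target encoding keeps a single column but needs care to avoid leaking the target.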
The XGBoost version in the H2O package can handle categorical variables (but not too many!), but it seems that XGBoost as a standalone package cannot.