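I'm trying to improve my submission to the Kaggle House Prices competition found here. I'm using the Iowa data here. I'm trying to train and test my model with a pipeline (sklearn.pipeline.Pipeline), cross-validate with GridSearchCV (sklearn.model_selection.GridSearchCV), and use XGBRegressor (xgboost.XGBRegressor) as the model. The selected features include categorical data and NaN values that must be imputed (sklearn.impute.SimpleImputer). Initial setup: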
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.impute import SimpleImputer
# Path of the file to read.
iowa_file_path = '../input/train.csv'
original_home_data = pd.read_csv(iowa_file_path)
home_data = original_home_data.copy()
# Drop rows where SalePrice is NaN
home_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
# Create a target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
extra_features = ['OverallCond', 'GarageArea', 'LotFrontage', 'OverallQual', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', 'GrLivArea', 'MoSold']
categorical_data = ['LotShape', 'MSZoning', 'Neighborhood', 'BldgType', 'HouseStyle', 'Foundation', 'KitchenQual']
features.extend(extra_features)
features.extend(categorical_data)
X = home_data[features]
The categorical data is one-hot encoded:
X = pd.get_dummies(X, prefix='OHE', columns=categorical_data)
Columns with missing values are collected and flagged as follows:
cols_with_missing = (col for col in X.columns if X[col].isnull().any())
for col in cols_with_missing:
    X[col + '_was_missing'] = X[col].isnull()
Then the training and validation data are split:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, test_size=0.25)
train_X, val_X = train_X.align(val_X, join='left', axis=1)
Then I created a pipeline that imputes NaNs with the mean and passes the result to the regressor:
my_pipeline = Pipeline([('imputer', SimpleImputer()), ('xgbrg', XGBRegressor())])
param_grid = {
    'xgbrg__n_estimators': [10, 50, 100, 500, 1000],
    'xgbrg__learning_rate': [0.01, 0.04, 0.05, 0.1, 0.5, 1]
}
fit_params = {
    'xgbrg__early_stopping_rounds': 10,
    'xgbrg__verbose': False,
    'xgbrg__eval_set': [(np.array(val_X), val_y)]
}
Then I initialized the cross-validator:
searchCV = GridSearchCV(my_pipeline, cv=5, param_grid=param_grid, return_train_score=True, scoring='neg_mean_absolute_error')
Then I fit the model (note the next line):
searchCV.fit(X=np.array(train_X), y=train_y, **fit_params)
Then I did the same for the test data (one-hot encoding, flagging the columns with NaN):
# path to file you will use for predictions
test_data_path = '../input/test.csv'
# read test data file using pandas
test_data = pd.read_csv(test_data_path)
# create test_X which comes from test_data but includes only the columns you used for prediction.
original_test_X = test_data[features]
test_X = original_test_X.copy()
# to one hot encode the data
test_X = pd.get_dummies(test_X, prefix='OHE', columns=categorical_data)
for col in cols_with_missing:
    test_X[col + '_was_missing'] = test_X[col].isnull()
# to align the training and test data and discard columns not in the training data
X, test_X = X.align(test_X, join='inner', axis=1)
Then I tried to transform the test data with the imputer, so that NaNs in the test data are filled with the means from the training data:
test_X = my_pipeline.named_steps['imputer'].transform(test_X)
Then I get this error:
NotFittedError: This SimpleImputer instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
So I can't even get predictions with this line:
test_preds = searchCV.predict(test_X)
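A possible explanation for the NotFittedError, offered as a sketch rather than a diagnosis of the exact run above: GridSearchCV fits clones of the estimator it is given, so the original my_pipeline object (and its 'imputer' step) stays unfitted, while the refit copy is exposed as best_estimator_. The example below uses synthetic data and Ridge as a stand-in for XGBRegressor; both are assumptions made only to keep it small.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny synthetic data with a few NaNs (illustrative only).
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
X[::7, 1] = np.nan
y = rng.rand(40)

# Ridge is a stand-in for XGBRegressor (assumption, to limit dependencies).
pipe = Pipeline([("imputer", SimpleImputer()), ("model", Ridge())])
search = GridSearchCV(pipe, param_grid={"model__alpha": [0.1, 1.0]}, cv=5)
search.fit(X, y)

# The pipeline object that was passed in is left unfitted...
try:
    pipe.named_steps["imputer"].transform(X)
    original_is_fitted = True
except NotFittedError:
    original_is_fitted = False

# ...while the refit copy inside the search object has fitted steps.
fitted_imputer = search.best_estimator_.named_steps["imputer"]
X_imputed = fitted_imputer.transform(X)

# predict() on the search object runs the whole fitted pipeline,
# so features that still contain NaNs can be passed directly.
preds = search.predict(X)
print(original_is_fitted, np.isnan(X_imputed).any(), preds.shape)
```

Because searchCV.predict runs every step of the fitted pipeline, imputing the test data by hand before calling it should not be necessary.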
If I instead create a new SimpleImputer() instance for the test data and call fit_transform to impute the NaNs:
test_pipeline = SimpleImputer()
test_X = test_pipeline.fit_transform(test_X)
and then run:
test_preds = searchCV.predict(test_X)
I get the following error:
ValueError: X has 72 features per sample, expected 74
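A feature-count error like this one usually means the test matrix does not have the same columns as the matrix the model was trained on. One way that can happen (an assumption about this data, not something verified here) is that pd.get_dummies creates a column only for categories present in the frame it is given. The sketch below uses made-up rows and reindexes the test frame to the training columns with DataFrame.reindex.

```python
import pandas as pd

# Hypothetical rows: the test frame lacks the '2.5Fin' category.
train = pd.DataFrame({"HouseStyle": ["1Story", "2Story", "2.5Fin"],
                      "LotArea": [8450, 9600, 11250]})
test = pd.DataFrame({"HouseStyle": ["1Story", "2Story"],
                     "LotArea": [9550, 14260]})

train_ohe = pd.get_dummies(train, prefix="OHE", columns=["HouseStyle"])
test_ohe = pd.get_dummies(test, prefix="OHE", columns=["HouseStyle"])

# Encoded separately, the two frames end up with different widths.
print(train_ohe.shape[1], test_ohe.shape[1])

# Force the test frame onto the training columns, in the same order;
# columns missing from the test frame are created and filled with 0.
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(list(test_aligned.columns) == list(train_ohe.columns))
```

In the code above, that would mean aligning test_X to the columns of train_X (the frame actually passed to fit) rather than to X.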
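I ran into the same "This SimpleImputer instance is not fitted yet" error while improving my model at the Missing Data stage. After a lot of trial and error, the following did the trick for me:
Prepare the test data in the same loop that prepares the training data. Basically, the "for col in cols_with_missing" loop should process the training and test data at the same time. I'm new to this field too (I only started last week), but my guess is that the error comes from a mismatch in the columns when the loop is run separately for the training and test data.
The code snippet that worked for me: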
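One concrete way such a mismatch can arise in the question's code: cols_with_missing is a generator expression, and a generator can be iterated only once. The first loop consumes it, so a second loop over the same object adds no columns. The sketch below reproduces this on made-up data and shows a list as one possible fix.

```python
import numpy as np
import pandas as pd

# Made-up frames; only LotFrontage has a missing value in train.
train = pd.DataFrame({"LotFrontage": [65.0, np.nan], "LotArea": [8450, 9600]})
test = pd.DataFrame({"LotFrontage": [np.nan, 70.0], "LotArea": [11250, 9550]})

# Generator expression: exhausted after the first full iteration.
cols_gen = (col for col in train.columns if train[col].isnull().any())

train_gen, test_gen = train.copy(), test.copy()
for col in cols_gen:
    train_gen[col + "_was_missing"] = train_gen[col].isnull()
for col in cols_gen:  # already exhausted, so this body never runs
    test_gen[col + "_was_missing"] = test_gen[col].isnull()

# A list can be iterated as many times as needed.
cols_list = [col for col in train.columns if train[col].isnull().any()]

train_fix, test_fix = train.copy(), test.copy()
for frame in (train_fix, test_fix):
    for col in cols_list:
        frame[col + "_was_missing"] = frame[col].isnull()

print(train_gen.shape[1], test_gen.shape[1])  # 3 2
print(train_fix.shape[1], test_fix.shape[1])  # 3 3
```

This would also explain why handling every frame inside a single loop works: each column name is used for all frames before the generator moves on.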
cols_with_missing = (col for col in X_train.columns
                     if X_train[col].isnull().any())
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
    imputed_final_test_plus[col + '_was_missing'] = imputed_final_test_plus[col].isnull()
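As an alternative to the manual loop (a different technique from the one above, sketched on made-up data), SimpleImputer accepts add_indicator=True. The imputer then appends the "was missing" indicator columns itself, based on the columns that had missing values when it was fitted, so the training and test outputs have the same width.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frames: LotArea has a NaN only in the test data.
train = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0],
                      "LotArea": [8450.0, 9600.0, 11250.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 70.0],
                     "LotArea": [np.nan, 10000.0]})

# Mean imputation plus indicator columns for features that had
# missing values in the data the imputer was fitted on.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
train_imputed = imputer.fit_transform(train)
test_imputed = imputer.transform(test)

# Both outputs: 2 imputed columns + 1 indicator (for LotFrontage).
print(train_imputed.shape, test_imputed.shape)
```

Placed as the 'imputer' step of the pipeline, this would keep the indicator logic inside the cross-validated estimator instead of in separate preprocessing code.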