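I have one dataset that I use for training and testing, and a second dataset that serves as the final test set.
Using the training dataset I obtained the best model, and I want to apply it to the test set to make predictions, but I get the following error message:
ValueError: The feature names should match those that were passed during fit.
How can I resolve this error?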
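import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
df = pd.read_csv('training.csv')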
df.drop(['USAGE(0)', 'CUSTOMERID'], axis=1, inplace=True)
# Initialize encoder and fit on training data
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoder.fit(df.select_dtypes(include=['object']))
# Encode string-type variables in training data
for column in df.select_dtypes(include=['object']).columns:
    encoded_result = encoder.transform(df[[column]])
    encoded_df = pd.DataFrame(encoded_result, columns=encoder.get_feature_names_out([column]))
    df.drop(column, axis=1, inplace=True)
    df = pd.concat([df, encoded_df], axis=1)
# Separate features and label
label = 'PAYMENT(0)'
excluded_columns = [label]
features = [feature for feature in df.columns if feature not in excluded_columns]
X = df[features]
y = df[label]
# Train-test split
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
# Build and train initial decision tree model
model = DecisionTreeClassifier(criterion='gini', min_samples_leaf=3000)
model.fit(X_train, y_train)
# Hyperparameter tuning with grid search and cross-validation
param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [10, 20, 30, 40, 50, 60, 70, 80]
}
cv = KFold(n_splits=10, shuffle=True)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='accuracy')
grid_search.fit(X_train, y_train)
# Train optimal model
optimal_model = grid_search.best_estimator_
# Visualize the optimal decision tree
plt.figure(figsize=(100, 20))
plot_tree(optimal_model, filled=True, feature_names=features)
plt.show()
# Load the new test data
new_test_df = pd.read_csv('trial.csv')
# Keep the 'ID' column for the submission file
submission_ids = new_test_df['CUSTOMERID'].copy()
new_test_df.drop('USAGE(0)', axis=1, inplace=True)
# Ensure all necessary columns are present
for column in df.select_dtypes(include=['object']).columns:
    if column not in new_test_df.columns:
        new_test_df[column] = 0
# Encode string-type variables in new test data
for column in new_test_df.select_dtypes(include=['object']).columns:
    encoded_result = encoder.transform(new_test_df[[column]])  # Use the fitted encoder to transform
    encoded_df = pd.DataFrame(encoded_result, columns=encoder.get_feature_names_out([column]))
    new_test_df.drop(column, axis=1, inplace=True)
    new_test_df = pd.concat([new_test_df, encoded_df], axis=1)
# Ensure that the new test data has the same feature columns as the training data
for feature in features:
    if feature not in new_test_df.columns:
        new_test_df[feature] = 0
X_new_test = new_test_df[features]
# Apply the optimal model to the new test data
y_new_test_pred = optimal_model.predict(X_new_test)
# Save the predictions to a CSV file named "mapping.csv"
submission = pd.DataFrame({
    'CUSTOMERID': submission_ids,
    'PAYMENT(0)': y_new_test_pred
})
submission.to_csv('mapping.csv', index=False)
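You need to make sure that the feature names in the test set, and their order, exactly match those used when the model was trained.
Add the following lines after the line "new_test_df.drop('USAGE(0)', axis=1, inplace=True)":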
# Encode only the columns the encoder was fitted on, in a single call.
# Other string columns (for example an ID column) must not be passed to
# encoder.transform, because their names were never seen during fit.
categorical_columns = list(encoder.feature_names_in_)
encoded_result = encoder.transform(new_test_df[categorical_columns])
encoded_df = pd.DataFrame(encoded_result, columns=encoder.get_feature_names_out(), index=new_test_df.index)
new_test_df = pd.concat([new_test_df.drop(columns=categorical_columns), encoded_df], axis=1)
# Ensure that the new test data has the same feature columns as the training data
missing_features = set(features) - set(new_test_df.columns)
for feature in missing_features:
    new_test_df[feature] = 0
# Reorder the columns to match the training set
new_test_df = new_test_df[features]