I am currently running a random forest model with the code below. I set random_state to 100.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module is gone from current scikit-learn releases
from sklearn.model_selection import train_test_split

X_train_RIA_INST_PWM, X_test_RIA_INST_PWM, y_train_RIA_INST_PWM, y_test_RIA_INST_PWM = train_test_split(
    X_RIA_INST_PWM, Y_RIA_INST_PWM, test_size=0.3, random_state=100
)

# Random Forest Regressor for RIA_INST_PWM accounts
regressor_RIA_INST_PWM = RandomForestRegressor(n_estimators=100, min_samples_split=10)
# Note: this fits on the full data set, not only on the training split
regressor_RIA_INST_PWM.fit(X_RIA_INST_PWM, Y_RIA_INST_PWM)

print("R^2 for training set:", regressor_RIA_INST_PWM.score(X_train_RIA_INST_PWM, y_train_RIA_INST_PWM))
print('-' * 50)
print("R^2 for test set:", regressor_RIA_INST_PWM.score(X_test_RIA_INST_PWM, y_test_RIA_INST_PWM))
I then use the following code to compute the predicted values.
def predict_AUM(df, features, regressor):
    # Reset index for later merge of predicted target values with Account IDs
    # (reset_index returns a new DataFrame, so the result must be assigned)
    df = df.reset_index()
    # Set predictor variables
    X_Predict = df[features]
    # Clean inputs: treat infinities as missing, then fill missing values with 0
    X_Predict = X_Predict.replace([np.inf, -np.inf], np.nan)
    X_Predict = X_Predict.fillna(0)
    # Predict Current_AUM
    Y_AUM_Snapshot_1yr_Predict = regressor.predict(X_Predict)
    df['PREDICTED_SPAN'] = Y_AUM_Snapshot_1yr_Predict
    return df

df_EVENT5_20 = predict_AUM(df_EVENT5_19, dfzip_features_AUM_RIA_INST_PWM, regressor_RIA_INST_PWM)
Finally, I compute the RMSE of the results:
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(df_EVENT5_20['SPAN_DAYS'], df_EVENT5_20['PREDICTED_SPAN']))
rmse
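Every time I run my code, my RMSE changes. It ranges from 7.75 to 16.4. Why does this happen? How can I get the same RMSE on every run? Also, how can I optimize my model for RMSE?
You have only seeded train_test_split, which makes the random assignment of the data to the training and test sets reproducible.
As the name suggests, RandomForestRegressor also contains parts of the algorithm that depend on random numbers (in particular, which subset of the data and which features are used to train each individual decision tree). If you want reproducible results, you need to seed it as well. To do that, initialize it with random_state: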
regressor_RIA_INST_PWM = RandomForestRegressor(
    n_estimators=100,
    min_samples_split=10,
    random_state=100
)
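To check the effect yourself, here is a minimal sketch that fits the same forest twice with a fixed random_state and twice without one, then compares the test-set RMSE. The data comes from make_regression and is only a stand-in for your own features and target; the helper name fit_and_rmse and the smaller n_estimators are assumptions made for illustration, not part of your code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; replace with your own X and y).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)


def fit_and_rmse(seed):
    """Fit a forest with the given random_state and return the test-set RMSE."""
    model = RandomForestRegressor(
        n_estimators=50, min_samples_split=10, random_state=seed
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, predictions)))


# Same seed twice: the two forests are built identically.
rmse_seeded_a = fit_and_rmse(100)
rmse_seeded_b = fit_and_rmse(100)

# No seed: each fit draws fresh bootstrap samples and feature subsets.
rmse_unseeded_a = fit_and_rmse(None)
rmse_unseeded_b = fit_and_rmse(None)

print("seeded:  ", rmse_seeded_a, rmse_seeded_b)
print("unseeded:", rmse_unseeded_a, rmse_unseeded_b)
```

The two seeded values should match, while the unseeded ones will typically differ from run to run. How large that difference is depends on the data and on the number of trees.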