How can I reduce the gap between training and test scores across different machine learning models?

Question

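I am using several machine learning models for AQI prediction. The data is daily, with 1850 records. My training R2 score is about 99 and my test score is about 91. Is this gap acceptable? If not, how can I improve my test score?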

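import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

# `data` is a pandas DataFrame loaded earlier (loading code not shown)
X = data[['Year', 'Month', 'Day', 'Raw Conc.', 'NowCast Conc.']]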
y = data['AQI']

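# Split data into training and test sets using time series splitting.
# Note: only the variables from the last split survive this loop, so the
# models below are trained and evaluated on the final fold only.
tscv = TimeSeriesSplit(n_splits=2)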

for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Standardize the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

# Define parameter grids for each model
param_grids = {
    "Decision Tree": {'max_depth': [3, 5, 7, 10]},
    "Random Forest": {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 7, 10]},
    "Gradient Boosting": {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 7]},
    "AdaBoost": {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.5]},
    "XGBoost": {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 7]},
    "CatBoost": {'iterations': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'depth': [3, 5, 7]},
}

# List of models to evaluate
models = [
    ("Decision Tree", DecisionTreeRegressor(random_state=42)),
    ("Random Forest", RandomForestRegressor(random_state=42)),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=42)),
    ("AdaBoost", AdaBoostRegressor(random_state=42)),
    ("XGBoost", XGBRegressor(random_state=42)),
    ("CatBoost", CatBoostRegressor(verbose=0)),
]

# Dictionaries to store model performance, feature importances, and predictions
model_performance = {}
feature_importance_dict = {}
predictions = {}

for name, model in models:
    param_grid = param_grids[name]
    
    if param_grid:
        grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error')
        grid_search.fit(X_train_scaled, y_train)
        best_model = grid_search.best_estimator_
    else:
        best_model = model
        best_model.fit(X_train_scaled, y_train)
    
    # Calculate predictions
    y_train_pred = best_model.predict(X_train_scaled)
    y_test_pred = best_model.predict(X_test_scaled)

    # Store predictions
    predictions[name] = {'model_name': name, 'y_test_pred': y_test_pred}

    
    # Calculate evaluation metrics for train set
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    train_r2 = r2_score(y_train, y_train_pred)
    train_mae = mean_absolute_error(y_train, y_train_pred)
    
    # Calculate evaluation metrics for test set
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    
    # Store model performance metrics
    model_performance[name] = {
        "Train_RMSE": train_rmse, 
        "Train_R2": train_r2, 
        "Train_MAE": train_mae,
        "Test_RMSE": test_rmse,
        "Test_R2": test_r2,
        "Test_MAE": test_mae
    }
    
    # For all models, try to extract feature importances
    if hasattr(best_model, 'feature_importances_') or hasattr(best_model, 'coef_'):
        feature_importances = best_model.feature_importances_ if hasattr(best_model, 'feature_importances_') else best_model.coef_
        
        # Take feature names from the original DataFrame so they line up
        # with the five columns the model was actually trained on
        feature_names = list(X.columns)

        # Store feature importances with feature names
        feature_importance_dict[name] = dict(zip(feature_names, feature_importances))

# Convert model performance dictionary to DataFrame
model_performance_df = pd.DataFrame.from_dict(model_performance, orient='index')

# Print model performance
print(model_performance_df)

I reduced the number of splits here (tscv = TimeSeriesSplit(n_splits=2)), and my test score improved from 91 to 94. What else can I do?

Answer 1

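There is no single answer to this question. In theory, the gap between accuracy/R² on the training set and the test set cannot be smaller than the so-called `Bayes` error, which in domains where natural human perception is strong (e.g. `NLP` and `Vision`) is sometimes equivalent to human-level performance. In `time series`, however, it is hard to predict how far this gap can be reduced. Here are some steps I would suggest: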

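Use an experiment tracking tool: Start by using a tool that logs metadata (such as cross-validation results) and saves models as artifacts. I prefer `Weights & Biases`, which lets you run multiple experiments with `sweep`, using `Grid Search` or `Bayesian Optimization` to maximize the metric you define on your `cross-validation` for `HPO` (hyperparameter optimization).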
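As a rough local illustration of the idea (not a Weights & Biases sweep), the sketch below uses scikit-learn's `GridSearchCV` with a `TimeSeriesSplit` as the `cv` argument and prints the mean train and validation error for each configuration, which is the kind of metadata a tracking tool would log. The synthetic data and the small parameter grid are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for a time-ordered dataset (assumption: not the asker's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)

# Time-ordered folds: every validation fold lies after its training fold.
cv = TimeSeriesSplit(n_splits=3)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    cv=cv,
    scoring="neg_mean_squared_error",
    return_train_score=True,  # keep train scores so the gap is visible
)
search.fit(X, y)

# Report mean train vs. validation MSE for each configuration.
results = search.cv_results_
for params, train_score, val_score in zip(
    results["params"], results["mean_train_score"], results["mean_test_score"]
):
    print(params, "train MSE:", round(-train_score, 4), "val MSE:", round(-val_score, 4))
print("best:", search.best_params_)
```

Passing a `TimeSeriesSplit` object as `cv` keeps the hyperparameter search itself time-aware; an integer such as `cv=3` would instead use ordinary K-fold splits for a regressor.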
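Start with simple models: Avoid starting with irrelevant or overly complex models. Begin with simple models and monitor their `bias` and `variance`. If you observe `underfitting`, you may want to use models that can capture nonlinear relationships and work well with tabular time series data, such as `Random Forest` and `XGBoost`. Avoid jumping straight to complex `RNN` models such as `LSTM`, which were originally developed for `NLP` applications and have not performed well in time series competitions.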
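A minimal sketch of this workflow, assuming synthetic data with a deliberately nonlinear target: a linear baseline and a random forest are fitted on the same chronological split, and train and test R² are compared side by side. A low score on both sets suggests underfitting (high bias), while a high train score with a much lower test score suggests high variance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic nonlinear target (assumption: illustrative data only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.2, size=400)

# Chronological-style split: first 80% for training, last 20% for testing.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

scores = {}
for name, model in [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=42)),
]:
    model.fit(X_train, y_train)
    scores[name] = (
        r2_score(y_train, model.predict(X_train)),  # train R2
        r2_score(y_test, model.predict(X_test)),    # test R2
    )
    print(name, "train R2: %.3f, test R2: %.3f" % scores[name])
```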
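Address overfitting: Once you have resolved `underfitting`, you may arrive at a model that can learn the nonlinear relationships in the training data. At that point, your model may show high `variance` and `overfitting` on the training data. There are several ways to mitigate `overfitting`: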

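Add more training data or use `data augmentation` techniques. For example, the winning solution of a 2017 Kaggle tabular-data competition used a DAE for augmentation and representation learning link.
`Regularization` techniques: apply `L1` and `L2` regularization (called `reg_alpha` and `reg_lambda` in XGBoost, respectively) to penalize large weights and coefficients.
`Early stopping`, `Dropout`, and `Reduce Learning Rate on Plateau` are other techniques commonly used with `neural networks`.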
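Use ensemble methods: Combine multiple models using techniques such as `soft voting`.
Blending and stacking: Implement `blending` and `stacking` techniques to leverage the strengths of different models.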
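The two ensemble ideas above can be sketched with scikit-learn as follows. For regression, the closest analogue of soft voting is `VotingRegressor`, which averages the base models' predictions, while `StackingRegressor` trains a final estimator on their out-of-fold predictions. The base models, the synthetic data, and `cv=5` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic data (assumption: illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.2, size=400)
X_train, X_test = X[:320], X[320:]
y_train, y_test = y[:320], y[320:]

base_models = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingRegressor(random_state=42)),
    ("ridge", Ridge(alpha=1.0)),
]

# Averaging ensemble: the prediction is the mean of the base predictions.
voter = VotingRegressor(estimators=base_models)
voter.fit(X_train, y_train)

# Stacking: a Ridge meta-model is fitted on out-of-fold base predictions.
# cv=5 uses ordinary K-fold here, which is not time-aware.
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(alpha=1.0), cv=5)
stack.fit(X_train, y_train)

voter_pred = voter.predict(X_test)
stack_pred = stack.predict(X_test)
print("voting test R2:", round(r2_score(y_test, voter_pred), 3))
print("stacking test R2:", round(r2_score(y_test, stack_pred), 3))
```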
Advanced time series representations: Explore advanced methods such as `signature kernels` and `wavelets` to create better features and representations of your data.
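As a small taste of the wavelet idea, the sketch below implements a single level of the Haar wavelet transform in plain NumPy: the series is split into a smooth approximation and a detail component, either of which could be summarized into features. The helper `haar_step` is a hypothetical name introduced here; a dedicated wavelet library would be the usual choice in practice.

```python
import numpy as np

def haar_step(signal):
    """One level of the orthonormal Haar wavelet transform (even-length input)."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:
        raise ValueError("signal length must be even")
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # scaled pairwise sums (trend)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # scaled pairwise differences
    return approx, detail

# Synthetic noisy periodic series (assumption: illustrative only).
t = np.arange(64)
series = np.sin(2 * np.pi * t / 16) + 0.1 * np.random.default_rng(0).normal(size=64)

approx, detail = haar_step(series)
print("approximation length:", len(approx), "detail length:", len(detail))
print("detail energy share:", np.sum(detail ** 2) / np.sum(series ** 2))
```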
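Advanced tabular ML models: Look into newer models such as `GRANDE`, which combines the advantages of `tree-based` models and `neural networks`. Note that if you want to use `RF`, `XGB`, or `GRANDE` for time series problems, you should do some shape transformation first.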
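One common reading of this shape transformation is turning the series into a supervised table of lagged values, so that each row holds the previous observations as features and the current value as the target. The sketch below assumes that interpretation; `make_lag_features` is a hypothetical helper, and the toy series stands in for a daily AQI column.

```python
import numpy as np
import pandas as pd

def make_lag_features(series, n_lags):
    """Build a table with the current value as target and n_lags past values as features."""
    frame = pd.DataFrame({"target": series})
    for lag in range(1, n_lags + 1):
        frame[f"lag_{lag}"] = frame["target"].shift(lag)
    # The first n_lags rows have no complete history, so drop them.
    return frame.dropna().reset_index(drop=True)

# Toy series standing in for daily AQI values (assumption: illustrative only).
aqi = pd.Series(np.arange(10, dtype=float))
table = make_lag_features(aqi, n_lags=3)

X = table.drop(columns="target")  # lagged values only, no same-day information
y = table["target"]
print(table.head())
```

A table built this way can be passed to any tabular regressor, and it uses only past values as inputs.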
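Improved time series CV: You can use more advanced time series cross-validation techniques, such as `Embargo & Purge`, which is commonly used in quantitative finance.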
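A full purge-and-embargo scheme is more involved, but a simplified stand-in is available in scikit-learn: the `gap` parameter of `TimeSeriesSplit` drops a fixed number of samples between each training fold and its test fold, which limits leakage from observations that overlap in time. The gap of 5 samples below is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 100 time-ordered samples (assumption: illustrative only).
X = np.arange(100).reshape(-1, 1)

# Leave 5 samples unused between the end of training and the start of testing.
cv = TimeSeriesSplit(n_splits=4, gap=5)
folds = list(cv.split(X))

for fold, (train_idx, test_idx) in enumerate(folds):
    print(
        f"fold {fold}: train ends at {train_idx[-1]}, "
        f"test covers {test_idx[0]}..{test_idx[-1]}"
    )
```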
