I am training an XGBoost model on roughly 8 million samples with 1356 features each. During fitting, the Python process gets killed with SIGKILL, most likely due to excessive memory consumption.
To avoid this, I tried fitting the model iteratively, like this:
import glob
import pandas as pd
from sklearn.model_selection import train_test_split

def train(model, df):
    # Split the current chunk; the last column holds the label.
    X_train, X_val, y_train, y_val = train_test_split(
        df.values[:, :-1], df.values[:, -1], test_size=0.1, random_state=RANDOM_STATE)
    # Continue boosting from the previous model (None for the first chunk).
    return model_xgbc.fit(X_train, y_train, eval_set=[(X_val, y_val)],
                          verbose=True,
                          early_stopping_rounds=3,
                          xgb_model=model)
# Load training set
files_train = glob.glob(mounted_input_path + "/*.parquet")
files_iter = iter(files_train)
file = next(files_iter, None)

model = None
count = 0
number_of_samples = 0

while file is not None:
    print(f'Processing parquet ({count+1}/{len(files_train)}) {file}')
    df = pd.read_parquet(file)
    if count == 0:
        num_features = len(df.columns) - 1
    number_of_samples += len(df.index)
    # Fit on this chunk, warm-starting from the model trained on the previous chunks.
    model = train(model, df)
    count += 1
    file = next(files_iter, None)

print('Classifier fitted')
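(`model_xgbc`, `RANDOM_STATE` and `mounted_input_path` are defined earlier in my script; the values below are only placeholders so the snippet is self-contained, not my actual settings:)

from xgboost import XGBClassifier

# Placeholders -- my real path and hyperparameters differ.
RANDOM_STATE = 42
mounted_input_path = '/mnt/training_data'
model_xgbc = XGBClassifier(n_estimators=100, random_state=RANDOM_STATE)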
While this works, I have learned that it does not give me the result I want, i.e. the same result as if I had fit the model on all of the data at once.
Assuming the feature vectors cannot be shrunk, is there a way to fit my model on this data more efficiently?