With deep learning I get many predictions with nearly identical values, which shows up as a horizontal line in the correlation plot.
I generated a small dataset (data) that reproduces the problem, but my real dataset is much larger. That is why the layers are so large; if I shrink them to match this reduced case, I run into the same problem.
If I predict the target with another algorithm such as random forest, I get an R of 0.4 on this small dataset. With the full dataset, if I run the deep learning approach and then drop all the samples on the horizontal line, I get an R similar to the random forest's. I don't understand why it fails to predict the horizontal-line samples in the same way. Any clues?
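For reference, here is a minimal sketch of the random-forest baseline I am comparing against (the split mirrors the deep learning code below; the forest's hyperparameters are illustrative, not the exact ones I used):

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor

var = 'target'
data = pd.read_csv('data800.csv', index_col=0)
train_dataset = data.sample(frac=0.8, random_state=1)
test_dataset = data.drop(train_dataset.index)
train_labels = train_dataset.pop(var)
test_labels = test_dataset.pop(var)

# Fit a default-ish forest and score it with the same Pearson R metric
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(train_dataset, train_labels)
pred = rf.predict(test_dataset)
print('R:', np.corrcoef(test_labels, pred)[0, 1])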
Here is the code that reproduces the problem, along with the resulting correlation plot:
import torch, torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
var='target'
data = pd.read_csv('data800.csv', index_col=0)
train_dataset = data.sample(frac=0.8,random_state=1)
test_dataset = data.drop(train_dataset.index)
train_labels = train_dataset.pop(var)
test_labels = test_dataset.pop(var)
model = nn.Sequential(nn.Linear(train_dataset.shape[1], 1024), nn.ReLU(), nn.BatchNorm1d(1024),
                      nn.Linear(1024, 128), nn.ReLU(), nn.BatchNorm1d(128),
                      nn.Linear(128, 64), nn.ReLU(), nn.BatchNorm1d(64),
                      nn.Linear(64, 1))
optim = torch.optim.Adam(model.parameters(), 0.01)
for epoch in range(200):
    yhat = model(torch.tensor(train_dataset.values).to(torch.float32))
    loss = nn.MSELoss()(yhat.ravel(), torch.tensor(train_labels.values).to(torch.float32))
    optim.zero_grad()
    loss.backward()
    optim.step()
    # Evaluate on the test set every epoch
    yhatt = model(torch.tensor(test_dataset.values).to(torch.float32))
    yhatt = yhatt.detach().numpy()
    score = np.corrcoef(test_labels, yhatt.reshape(test_labels.shape))
    if epoch % 20 == 0:
        print('epoch', epoch, '| loss:', loss.item(), '| R:', score[0, 1])
yhat = model(torch.tensor(test_dataset.values).to(torch.float32))
yhat = yhat.detach().numpy()
plt.scatter(test_labels, yhat)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.axis('equal')
plt.axis('square')
_ = plt.plot([-1000, 1000], [-1000, 1000])
plt.show()
I think the problem is that each feature typically follows a mixture distribution. Machine learning algorithms usually work best when the features are symmetrically distributed and on similar scales. So I converted every feature to a roughly uniform distribution by replacing each value with its percentile, which flattens the distributions:
With this the model converges well. I then also adjusted the architecture: it originally jumped from about 50 input features straight to 1024 units. I changed it to a tapered architecture that gradually narrows from the input size, which also improved the results. The final training RMSE is 0.14 and the test-set r = 0.42. The code follows.
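As an aside, the same flattening could also be done with scikit-learn's QuantileTransformer instead of the manual percentile binning used below; a minimal sketch, with n_quantiles chosen arbitrarily:

from sklearn.preprocessing import QuantileTransformer

# Map each feature to an approximately uniform distribution,
# estimating the quantiles from the training split only
qt = QuantileTransformer(n_quantiles=500, output_distribution='uniform')
train_flat = qt.fit_transform(train_dataset)
test_flat = qt.transform(test_dataset)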
import torch, torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
var = 'target'
data = pd.read_csv('data800.csv', index_col=0)
train_dataset = data.sample(frac=0.8, random_state=1)
test_dataset = data.drop(train_dataset.index)
train_labels = train_dataset.pop(var)
test_labels = test_dataset.pop(var)
# Flatten distribution by replacing each value with its percentile
train_dataset_transformed = train_dataset.copy()
test_dataset_transformed = test_dataset.copy()
for feature in train_dataset.columns:
    # Percentiles estimated from train data
    bin_res = 0.2
    eval_percentiles = np.arange(bin_res, 100, bin_res)
    percentiles = [
        np.percentile(train_dataset[feature], p)
        for p in eval_percentiles
    ]
    # Apply to both train and test data
    train_dataset_transformed[feature] = pd.cut(
        train_dataset[feature],
        bins=[-np.inf] + percentiles + [np.inf],
        labels=False
    ).astype(np.float32)
    test_dataset_transformed[feature] = pd.cut(
        test_dataset[feature],
        bins=[-np.inf] + percentiles + [np.inf],
        labels=False
    ).astype(np.float32)
# Hist before and after:
# plt.hist(train_dataset.iloc[:, 0])
# plt.hist(train_dataset_transformed.iloc[:, 0], bins=100)
n_feat = train_dataset.shape[1]
model = nn.Sequential(
    nn.Linear(n_feat, n_feat), nn.ReLU(), nn.BatchNorm1d(n_feat),
    nn.Linear(n_feat, n_feat // 2), nn.ReLU(), nn.BatchNorm1d(n_feat // 2),
    # nn.Linear(n_feat // 2, n_feat // 2), nn.ReLU(), nn.BatchNorm1d(n_feat // 2),
    nn.Linear(n_feat // 2, n_feat // 4), nn.ReLU(), nn.BatchNorm1d(n_feat // 4),
    # nn.Linear(n_feat // 4, n_feat // 4), nn.ReLU(), nn.BatchNorm1d(n_feat // 4),
    nn.Linear(n_feat // 4, 1)
)
optim = torch.optim.Adam(model.parameters(), 0.01)
# Standardize features (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(train_dataset_transformed)
X_train = scaler.transform(train_dataset_transformed)
X_test = scaler.transform(test_dataset_transformed)
# Convert to tensors
X_train = torch.tensor(X_train).float()
y_train = torch.tensor(train_labels.values).float()
X_test = torch.tensor(X_test).float()
y_test = torch.tensor(test_labels.values).float()
torch.manual_seed(0)
for epoch in range(1770):
    yhat = model(X_train)
    loss = nn.MSELoss()(yhat.ravel(), y_train)
    optim.zero_grad()
    loss.backward()
    optim.step()
    with torch.no_grad():
        yhatt = model(X_test)
        score = np.corrcoef(y_test, yhatt.ravel())
    if epoch % 30 == 0:
        print('epoch', epoch, '| loss:', loss.item(), '| R:', score[0, 1])
yhat = model(X_test)
yhat = yhat.detach().numpy()
plt.scatter(test_labels, yhat)
ax_lims = plt.gca().axis()
plt.plot([0, 100], [0, 100], 'k:', label='y=x')
plt.gca().axis(ax_lims)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.legend()
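Note that the network contains BatchNorm1d layers, and the snippets above run inference while the model is still in training mode, so test predictions are normalized with batch statistics. A minimal sketch of the standard PyTorch evaluation pattern, using the model and X_test defined above:

model.eval()  # BatchNorm now uses its running statistics
with torch.no_grad():
    yhat_eval = model(X_test).ravel().numpy()
model.train()  # switch back if training continues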