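I am running the following code in Python 3.7 to iterate over every combination of data columns in the DataFrame 'Rawdata', fit a regression model with the statsmodels library for each subset, and return the best model. The code raises no errors until the last line, best_subset(X, Y), which fails with "IndexingError: Too many indexers".
Any idea what is wrong and how to fix it?
Any help would be great. Thanks!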
from itertools import chain, combinations
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Data
Rawdata = pd.read_csv(r'C:\Users\Lucas\Documents\sample.csv')

# Main code
def best_subset(X, Y):
    n_features = X.shape[1]
    # Every non-empty combination of column positions, each one a tuple
    subsets = chain.from_iterable(combinations(range(n_features), k+1) for k in range(n_features))
    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        lin_reg = sm.OLS(Y, X.iloc[:, subset]).fit()
        score = lin_reg.rsquared_adj
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score

# Define data inputs and call code above
X = Rawdata.iloc[:, 1:10]
Y = Rawdata.iloc[:, 0]

# To return best model
best_subset(X, Y)
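Because each subset is a tuple, the failing line ends up being evaluated as something like:
lin_reg = sm.OLS(Y, X.iloc[:, (0, 1)]).fit()
pandas does not know how to handle a tuple in that position (see here). One solution is to convert the subset from a tuple to a list: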
for subset in subsets:
    subset = list(subset)
    lin_reg = sm.OLS(Y, X.iloc[:, subset]).fit()
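To make the tuple-versus-list point concrete, here is a minimal standard-library sketch of what the subsets iterator produces and how the conversion can be folded into the generator itself. The helper name column_subsets is hypothetical (it is not part of the original code), and the example uses 3 columns only to keep the output short.

```python
from itertools import chain, combinations


def column_subsets(n_features):
    """Yield every non-empty subset of column positions as a list.

    Hypothetical helper: same enumeration as in best_subset(), but each
    subset is converted from a tuple to a list before it is yielded.
    """
    tuples = chain.from_iterable(
        combinations(range(n_features), k + 1) for k in range(n_features)
    )
    for subset in tuples:
        # combinations() yields tuples such as (0, 1); convert to [0, 1]
        yield list(subset)


subsets = list(column_subsets(3))
print(subsets)
# [[0], [1], [2], [0, 1], [0, 2], [1, 2], [0, 1, 2]]
```

Each yielded list could then be passed directly as X.iloc[:, subset] inside the loop. Note that n columns give 2**n - 1 subsets, so the 9 columns in the question mean 511 model fits.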