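I'm trying to work out which features of a dataset (i.e. columns of a Pandas DataFrame) to use for a linear regression, and I want to select those that are not strongly correlated with each other (on the assumption that the independent variables must be uncorrelated with one another, so anything that looks highly correlated should be removed).
I've isolated an initial list of features that correlate with the target variable as follows: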
# get the absolute correlations for the target variable
correlations_target = abs(df.corr()[target_variable_name])
# filter out those that are below our threshold
correlated_features = correlations_target[correlations_target >= correlation_threshold]
# drop the target variable's column
correlated_features.drop(target_variable_name, inplace=True)
# get the column names for later use
correlated_feature_variable_names = correlated_features.index
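I now want to loop over each of these correlated feature variables and make sure that no two of them are strongly correlated with each other. Where they are, I want to drop the one with the weaker correlation to the target variable. Here's what I have for that: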
# the collection of feature variable names we'll drop due to their being correlated to other features
correlated_feature_variable_names_to_drop = []
# loop over the feature combinations
for name_1 in correlated_feature_variable_names:
    for name_2 in correlated_feature_variable_names:
        # only look at correlations between separate feature variables
        if name_1 != name_2:
            # drop one of the feature variables if there's a strong correlation
            if abs(df[[name_1, name_2]].corr()[name_1][name_2]) > 0.6:
                # only worry about it if neither of the variables have been added to the drop list
                if (name_1 not in correlated_feature_variable_names_to_drop) and \
                        (name_2 not in correlated_feature_variable_names_to_drop):
                    # drop the one which has the least correlation to the target variable
                    if correlated_features[name_1] >= correlated_features[name_2]:
                        correlated_feature_variable_names_to_drop.append(name_2)
                    else:
                        correlated_feature_variable_names_to_drop.append(name_1)
# drop the variables we've found that qualify
correlated_features.drop(correlated_feature_variable_names_to_drop, inplace=True)
# get the remaining variables' column names for later use
filtered_feature_variable_names = correlated_features.index
The filtered feature set will then be used as the input to a simple regression model. For example:
# imports needed for the model fit below
import numpy as np
import statsmodels.api as sm
# fit a simple ordinary least squares model to the features
X = df[filtered_feature_variable_names]
y = df[target_variable_name]
estimate = sm.OLS(y, np.asarray(X)).fit()
# display the regression results
estimate.summary()
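As this is my first time trying this, I'm not sure whether it's the right way to go about it and, if it is, whether there's a smarter or more efficient (or more "Pythonic") way of doing it than the looping approach I've used above.
I take it your intent is to find and filter out things that are highly correlated with each other but are themselves being used to explain some third thing. For example, age and number of days are really describing the same thing (time). Either of the two (but not both) could describe some relationship to a third thing (such as tree diameter). The concept here is collinearity, and it is definitely worth reducing.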
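As a rough illustration of that idea, here is a minimal sketch of one way to do the filtering with a single correlation matrix instead of recomputing a correlation for every pair. It is a greedy variant rather than an exact equivalent of the nested loop in the question: features are visited from the strongest to the weakest correlation with the target, and a feature is kept only if it is not strongly correlated with one already kept. The function name `drop_collinear_features` and both threshold defaults are assumptions made for the example, not something taken from the question.

```python
import numpy as np
import pandas as pd


def drop_collinear_features(df, target, target_threshold=0.3, pair_threshold=0.6):
    """Return feature names correlated with `target`, skipping any feature
    that is strongly correlated with a feature already kept.

    Hypothetical helper; both thresholds are illustrative defaults.
    """
    # compute the absolute correlation matrix once, over numeric columns only
    corr = df.select_dtypes(include="number").corr().abs()
    # correlation of every other column with the target
    target_corr = corr[target].drop(target)
    # candidates above the threshold, strongest correlation to the target first
    candidates = target_corr[target_corr >= target_threshold]
    candidates = candidates.sort_values(ascending=False)
    kept = []
    for name in candidates.index:
        # keep the feature only if no already-kept feature is strongly correlated with it
        if all(corr.at[name, other] <= pair_threshold for other in kept):
            kept.append(name)
    return kept
```

With the age/days example, calling `drop_collinear_features(df, "diameter")` on a frame holding `age`, `days` and `diameter` columns would keep whichever of `age` or `days` correlates more strongly with `diameter` and skip the other. Pairwise correlation only catches collinearity between two columns at a time, so this is a first pass rather than a complete check.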