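I want to extract the relationship between two features in my CSV file. I want to use LinearRegression to determine the obesity trend over the years. Here is my code: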
CODE
#Analysis of obesity by country
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn import metrics
from sklearn.linear_model import LinearRegression
address = 'C:/Users/Andre/Desktop/Python/firstMN/obesity-cleaned.csv'
dt = pd.read_csv(address)
#eliminate superfluous data
dt.drop(dt['Obesity (%)'][dt['Obesity (%)'].values == 'No data'].index, inplace=True)
for i in range(len(dt)):
    dt['Obesity (%)'].values[i] = float(dt['Obesity (%)'].values[i].split()[0])
obMean = dt['Obesity (%)'].mean()
print('%0.3f' %obMean, '\n')
dt['Obesity (%)'] = dt['Obesity (%)'].astype(float) #convert the column to float
group = dt.groupby('Country')
print(group[['Year', 'Obesity (%)']].mean(), '\n')
dt1 = dt[dt['Sex'] == 'Both sexes']
print(dt1[dt1['Obesity (%)'] == dt1['Obesity (%)'].max()], '\n')
sb.lmplot(x='Year', y='Obesity (%)', data=dt1)
plt.show()
#linear regression predictions
group1 = dt1.groupby('Year')
x = np.array(np.linspace(1975, 2016, 2016-1975+1)).tolist()
y = np.array([group1['Obesity (%)'].mean()]).tolist()[0]
x1 = np.array(np.linspace(1975, 2016, 2016-1975+1)).reshape(1, -1)
y1 = np.array([group1['Obesity (%)'].mean()])
lr = LinearRegression(fit_intercept=False)
lr.fit(x1, y1)
plt.plot(x, y)
plt.show()
print('Coefficients: ', lr.coef_)
print("Intercept: ", lr.intercept_ )
y_hat = lr.predict(x1)
print('MSE: ', sklearn.metrics.mean_squared_error(y_hat, y1))
print('R^2: ', model.score(x1, y1) )
print('var: ', y1.var())
The problem is that I get multiple coefficients, whereas I expected only one coefficient and one intercept. Why does this happen?
Output
Coefficients: [[7.68857169e-05 7.69246464e-05 7.69635759e-05 ... 7.84039665e-05
7.84428960e-05 7.84818255e-05]
[7.95627446e-05 7.96030295e-05 7.96433144e-05 ... 8.11338570e-05
8.11741419e-05 8.12144269e-05]
[8.22150421e-05 8.22566700e-05 8.22982979e-05 ... 8.38385290e-05
8.38801569e-05 8.39217848e-05]
...
[2.24882685e-04 2.24996549e-04 2.25110414e-04 ... 2.29323406e-04
2.29437271e-04 2.29551135e-04]
[2.30366573e-04 2.30483214e-04 2.30599855e-04 ... 2.34915584e-04
2.35032225e-04 2.35148866e-04]
[2.35708263e-04 2.35827609e-04 2.35946955e-04 ... 2.40362755e-04
2.40482101e-04 2.40601447e-04]]
Intercept: 0.0
MSE: 7.099748146989106e-30
As you can see, my intercept is 0, which I assume is because I set fit_intercept=False, but I get more than one coefficient. Why?
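It produces multiple coefficients because you asked it to. With reshape(1, -1), x1 has shape (1, 42), i.e. one sample with 42 features, and y1 likewise holds one sample with 42 targets, so coef_ comes out as a 42 x 42 matrix. Have you considered trying to build your own regressor to get the result? I built my own by following a tutorial I found on Enlight: https://enlight.nyc/projects/linear-regression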
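To make the shape issue concrete, here is a minimal sketch. It does not use the obesity CSV: the `obesity` array is synthetic data I made up as a stand-in for the yearly means from `group1['Obesity (%)'].mean()`, and the names `X_wide`, `y_wide`, `own_slope` and `own_intercept` are just illustrative. It fits once with the shapes from the question and once with one row per year, then compares against a hand-written least-squares slope, which is roughly what a home-made regressor would compute.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the yearly mean obesity values (assumption: in the
# question these come from group1['Obesity (%)'].mean()).
years = np.arange(1975, 2017)                  # 42 years, 1975..2016
rng = np.random.default_rng(0)
obesity = 0.3 * (years - 1975) + 5.0 + rng.normal(0.0, 0.2, years.size)

# Shapes as in the question: a single sample with 42 features and 42 targets.
X_wide = years.reshape(1, -1).astype(float)
y_wide = obesity.reshape(1, -1)
wide = LinearRegression(fit_intercept=False).fit(X_wide, y_wide)
print(wide.coef_.shape)                        # (42, 42)

# One row per year: 42 samples, one feature, and a 1-D target.
X = years.reshape(-1, 1).astype(float)
y = obesity
lr = LinearRegression().fit(X, y)              # fit_intercept defaults to True
slope, intercept = lr.coef_[0], lr.intercept_
print('Coefficient:', slope)
print('Intercept:', intercept)
print('R^2:', lr.score(X, y))

# Hand-written ordinary least squares for a single feature, for comparison.
x_c = years - years.mean()
own_slope = (x_c * (obesity - obesity.mean())).sum() / (x_c ** 2).sum()
own_intercept = obesity.mean() - own_slope * years.mean()
print('Own slope / intercept:', own_slope, own_intercept)
```

With the second layout you should get a single coefficient and, as long as fit_intercept is left at its default, a non-zero intercept. Also note that `model` is never defined in the question's code, which is presumably why the output stops after the MSE line; the sketch calls `lr.score` instead.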