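I'm currently working in Python, trying to learn how to do linear regression with the Fortune 500 dataset. So far I have cleaned my dataset by removing the N.As. However, now that I have reached Question D, I'm not sure how to build the model. From the instructions I assume I should use Revenues (in millions) for x, but I don't know what else should go into X. How do I go about building this model?
Part B: Clean the dataset by removing the records (rows) whose profit is N.A., and study the relationship between revenue and profit.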
dfCleanX = df[ df['Profit (in millions)']!='N.A.']
dfCleanX.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25131 entries, 0 to 25499
Data columns (total 5 columns):
Year 25131 non-null int64
Rank 25131 non-null int64
Revenue (in millions) 25131 non-null float64
Profit (in millions) 25131 non-null object
Company 25131 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 1.2+ MB
dfClean = dfCleanX.astype({'Profit (in millions)': 'float64'})
print(dfClean.values.shape )
dfClean.info()
(25131, 5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25131 entries, 0 to 25499
Data columns (total 5 columns):
Year 25131 non-null int64
Rank 25131 non-null int64
Revenue (in millions) 25131 non-null float64
Profit (in millions) 25131 non-null float64
Company 25131 non-null object
dtypes: float64(2), int64(2), object(1)
memory usage: 1.2+ MB
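As an aside on the cleaning step above, a possible alternative is to let `pd.to_numeric` with `errors='coerce'` turn the 'N.A.' strings into NaN and then drop those rows. This is only a sketch on a small made-up frame (`df_demo` and its values are invented for illustration, not taken from the real file):

```python
import pandas as pd

# Hypothetical miniature stand-in for the Fortune 500 data (values are made up).
col = 'Profit (in millions)'
df_demo = pd.DataFrame({
    'Revenue (in millions)': [100.0, 250.0, 80.0, 40.0],
    col: ['12.5', 'N.A.', '-3.0', '4.1'],
})

# Non-numeric entries such as 'N.A.' become NaN instead of raising an error.
df_demo_clean = df_demo.copy()
df_demo_clean[col] = pd.to_numeric(df_demo_clean[col], errors='coerce')

# Keep only the rows whose profit parsed to a number.
df_demo_clean = df_demo_clean.dropna(subset=[col])

print(df_demo_clean.dtypes)
print(len(df_demo_clean))
```

This combines the filtering and the dtype conversion, and it would also catch other non-numeric placeholders if the column contained any.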
dfClean.plot.scatter(x='Revenue (in millions)', y='Profit (in millions)')
<matplotlib.axes._subplots.AxesSubplot at 0x23e0222a3c8>
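Part C: In this part we focus only on the cases with "positive profit". We want to study the relationship between revenue (i.e. x) and profit (i.e. y) in order to build a linear model y = a * x + b
Visualize the relationship between y and x, where y and x are profit (> 0) and revenue.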
positiveProfitMask = dfClean['Profit (in millions)'] > 0
dfClean[ positiveProfitMask ].plot.scatter(
x='Revenue (in millions)',
y='Profit (in millions)'
)
<matplotlib.axes._subplots.AxesSubplot at 0x23e023b8358>
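For a one-feature model of the form y = a * x + b, the least-squares slope and intercept have a simple closed form, which may help make clear what the fit in the next part computes. The sketch below uses invented numbers, not the Fortune 500 data, and cross-checks the result against `np.polyfit`:

```python
import numpy as np

# Made-up revenue/profit pairs, purely for illustration.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Closed-form least squares for a single feature:
#   a = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   b = mean(y) - a * mean(x)
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# np.polyfit with degree 1 returns the coefficients highest power first.
a_check, b_check = np.polyfit(x, y, 1)

print("a =", a, "b =", b)
```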
Question D: Focus only on the cases with "positive profit". Fill in the missing code in the cell below.
from sklearn.linear_model import LinearRegression
x = dfClean[(Revenues (in millions) )][??? ]
y = dfClean[( Profits (in millions) )][??? ]
model = LinearRegression(fit_intercept=True)
model.fit(positiveProfitMask , y)
print( "model.coef_ =", model.coef_ )
print( "model.intercept_ =", model.intercept_ )
print( "Linear model about y(profit) and x(revenue): y=",
model.coef_, '* x +', model.intercept_)
yfit = model.predict(??? )
plt.scatter(x, y)
plt.plot(x, yfit, 'r');
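If you only need to fill in the yfit = model.predict(??? ) line, just pass it the X you fitted on to see what the model predicts for those values. Since you only want positive profits, you need to filter the non-positive rows out of your X first.
Here is how to do that in pandas: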
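cleaned_df = df[df['profit'] > 0]  # keep only the positive-profit rows
y = cleaned_df['profit'].values  # target; 'profit' is a placeholder column name
X = cleaned_df.drop(columns=['profit']).values  # remaining numeric columns as features
yfit = model.predict(X)  # model must already be fitted on the same feature columns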
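Applied to the column names from the question, one way to fill in the blanks might look like the sketch below. It assumes revenue is the only feature, since the target model is y = a * x + b, and it uses a tiny made-up `dfClean` so that it runs on its own; the real frame would come from the cleaning in Part B. The main thing to watch is that scikit-learn expects X to be 2-D with shape (n_samples, n_features), and that `fit` takes the feature array, not the boolean mask.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for dfClean (values invented for illustration).
dfClean = pd.DataFrame({
    'Revenue (in millions)': [100.0, 200.0, 300.0, 400.0, 500.0],
    'Profit (in millions)': [10.0, -5.0, 30.0, 40.0, 50.0],
})

positiveProfitMask = dfClean['Profit (in millions)'] > 0

# Selecting with a list of column names keeps X two-dimensional: (n_samples, 1).
X = dfClean.loc[positiveProfitMask, ['Revenue (in millions)']].values
y = dfClean.loc[positiveProfitMask, 'Profit (in millions)'].values

model = LinearRegression(fit_intercept=True)
model.fit(X, y)  # fit on the feature array, not on the mask

yfit = model.predict(X)  # predictions for the same positive-profit rows

print("model.coef_ =", model.coef_)
print("model.intercept_ =", model.intercept_)
```

To draw the fitted line as in the question's cell, `plt.scatter(X[:, 0], y)` followed by `plt.plot(X[:, 0], yfit, 'r')` should work once `matplotlib.pyplot` is imported as `plt`.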