Learning a linear model of the form y = a * x + b in Python


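I'm currently working in Python, trying to learn how to run a linear regression on the Fortune 500 dataset. So far I have cleaned my dataset by removing the N.A. values. However, when I get to Question D, I'm not sure how to build the model. Based on the instructions I assume x should be Revenue (in millions), but I don't know what else should go into X. How do I go on to build this model?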

Part B: Clean the dataset by removing the records (rows) whose profit is N.A., and study the relationship between revenue and profit.

dfCleanX = df[ df['Profit (in millions)']!='N.A.']
dfCleanX.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25131 entries, 0 to 25499
Data columns (total 5 columns):
Year                     25131 non-null int64
Rank                     25131 non-null int64
Revenue (in millions)    25131 non-null float64
Profit (in millions)     25131 non-null object
Company                  25131 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 1.2+ MB

dfClean = dfCleanX.astype({'Profit (in millions)': 'float64'})

print(dfClean.values.shape )

dfClean.info()

(25131, 5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25131 entries, 0 to 25499
Data columns (total 5 columns):
Year                     25131 non-null int64
Rank                     25131 non-null int64
Revenue (in millions)    25131 non-null float64
Profit (in millions)     25131 non-null float64
Company                  25131 non-null object
dtypes: float64(2), int64(2), object(1)
memory usage: 1.2+ MB

dfClean.plot.scatter(x='Revenue (in millions)', y='Profit (in millions)')

<matplotlib.axes._subplots.AxesSubplot at 0x23e0222a3c8>

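Part C: In this part we focus only on the cases with positive profit. We want to study the relationship between revenue (i.e. x) and profit (i.e. y) in order to build a linear model y = a * x + b.

Visualize y against x, where y is profit (> 0) and x is revenue.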

positiveProfitMask = dfClean['Profit (in millions)'] > 0
dfClean[ positiveProfitMask  ].plot.scatter(
    x='Revenue (in millions)', 
    y='Profit (in millions)'
    )

<matplotlib.axes._subplots.AxesSubplot at 0x23e023b8358>

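Question D: Focus only on the cases with positive profit. Fill in the missing code in the cell below to

  1. learn a linear model of the form y = a * x + b that models the relationship between revenue (i.e. x) and positive profit (i.e. y),
  2. use the model to find the predicted profit for these cases, and then
  3. plot the predictions together with the data to see how well the model fits.

from sklearn.linear_model import LinearRegression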

x = dfClean[(Revenues (in millions) )][??? ]
y = dfClean[( Profits (in millions) )][??? ]

model = LinearRegression(fit_intercept=True)
model.fit(positiveProfitMask  , y)

print( "model.coef_ =", model.coef_ )
print( "model.intercept_ =", model.intercept_ )
print( "Linear model about y(profit) and x(revenue): y=",  
       model.coef_, '* x +', model.intercept_)
yfit = model.predict(???  )

plt.scatter(x, y)
plt.plot(x, yfit, 'r');

Tags: python scikit-learn linear-regression

1 Answer

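If you only need to fill in the line yfit = model.predict(??? ), just pass it the feature matrix X to see what the model predicts for those values. Since you only want the positive profits, you first need to filter the other rows out of your data before building X and y.

Here is how to do that in pandas: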

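 # 'profit' and 'y' are placeholder column names; substitute your own.
 cleaned_df = df[df['profit'] > 0]            # keep only the rows with positive profit
 y = cleaned_df['y'].values                   # target vector, taken from the filtered frame
 X = cleaned_df.drop(columns=['y']).values    # remaining columns as the feature matrix

 yfit = model.predict(X)                      # model must already be fitted with model.fit(X, y)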
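
To make this concrete for Question D, here is a minimal end-to-end sketch. It assumes the column names shown in the question's info() output ('Revenue (in millions)' and 'Profit (in millions)'), and it uses a small synthetic frame with made-up numbers as a stand-in for dfClean, since the real dataset is not included here. The point to notice is that model.fit takes the revenue values as a 2-D array and not the boolean mask; the mask is only used to select rows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the cleaned frame (assumption: made-up numbers,
# only the column names follow the question).
rng = np.random.default_rng(0)
revenue = rng.uniform(100, 10_000, size=200)
profit = 0.05 * revenue + 20 + rng.normal(0, 50, size=200)
dfClean = pd.DataFrame({
    'Revenue (in millions)': revenue,
    'Profit (in millions)': profit,
})

# Use the mask to select rows with positive profit.
positiveProfitMask = dfClean['Profit (in millions)'] > 0
positive = dfClean[positiveProfitMask]

# scikit-learn expects X with shape (n_samples, n_features);
# the double brackets keep the single column two-dimensional.
x = positive[['Revenue (in millions)']].values
y = positive['Profit (in millions)'].values

model = LinearRegression(fit_intercept=True)
model.fit(x, y)

a = model.coef_[0]      # slope
b = model.intercept_    # intercept
print("Linear model about y(profit) and x(revenue): y =", a, "* x +", b)

# Predicted profit for the same positive-profit cases.
yfit = model.predict(x)
```

With matplotlib imported as plt, the question's last two lines, plt.scatter(x, y) and plt.plot(x, yfit, 'r'), should then draw the data and the fitted line. Revenue is the only feature here because the assignment asks for a model of the form y = a * x + b; nothing else needs to go into X.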