I am working on a linear regression problem. The analysis done with statsmodels gives an R-squared of 0.907, which is very high. I therefore expected the R² score of the model computed with sklearn to be similarly large, but the score I get is only 0.6478154705337766, which is rather low.
Am I missing something? In the statsmodels output, all variables have p-values below 0.05. I did not check the other statistics, such as the coefficients, because I have heard from many people that it is not usually necessary. The details of the problem are as follows.
Problem statement and the related dataset: https://datahack.analyticsvidhya.com/contest/black-friday/
Sklearn score: 0.6478154705337766
Statsmodels summary:
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.907
Model:                            OLS   Adj. R-squared (uncentered):              0.907
Method:                 Least Squares   F-statistic:                          6.458e+04
Date:                Mon, 21 Oct 2019   Prob (F-statistic):                        0.00
Time:                        18:57:44   Log-Likelihood:                     -5.2226e+06
No. Observations:              550068   AIC:                                  1.045e+07
Df Residuals:                  549985   BIC:                                  1.045e+07
Df Model:                                  83
Covariance Type:            nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 -59.0946 9.426 -6.269 0.000 -77.569 -40.620
x2 401.0189 10.441 38.409 0.000 380.555 421.483
x3 1e+04 23.599 423.786 0.000 9954.518 1e+04
x4 1.035e+04 21.740 475.990 0.000 1.03e+04 1.04e+04
x5 1.04e+04 23.309 446.356 0.000 1.04e+04 1.04e+04
x6 1.041e+04 26.858 387.693 0.000 1.04e+04 1.05e+04
x7 1.065e+04 27.715 384.315 0.000 1.06e+04 1.07e+04
x8 1.041e+04 32.580 319.469 0.000 1.03e+04 1.05e+04
x9 614.8732 19.178 32.061 0.000 577.285 652.462
x10 710.7823 23.135 30.723 0.000 665.438 756.126
x11 865.8851 27.138 31.906 0.000 812.695 919.076
x12 849.9004 18.358 46.296 0.000 813.919 885.881
x13 596.9014 31.632 18.870 0.000 534.904 658.899
x14 762.7278 25.809 29.553 0.000 712.143 813.312
x15 638.7214 18.085 35.319 0.000 603.276 674.166
x16 450.8858 82.928 5.437 0.000 288.349 613.423
x17 831.6309 43.033 19.325 0.000 747.287 915.975
x18 9266.9203 32.520 284.958 0.000 9203.182 9330.659
x19 548.8524 32.358 16.962 0.000 485.432 612.273
x20 819.7812 21.937 37.370 0.000 776.786 862.776
x21 575.2436 41.598 13.829 0.000 493.713 656.775
x22 780.1032 22.922 34.032 0.000 735.176 825.030
x23 854.8429 31.605 27.048 0.000 792.898 916.788
x24 603.5181 23.772 25.388 0.000 556.926 650.111
x25 635.8521 20.312 31.305 0.000 596.042 675.662
x26 455.0734 41.495 10.967 0.000 373.745 536.402
x27 1241.9456 36.844 33.708 0.000 1169.732 1314.160
x28 491.6905 21.378 23.000 0.000 449.791 533.590
x29 599.4075 10.701 56.014 0.000 578.434 620.381
x30 1024.8516 11.618 88.210 0.000 1002.080 1047.623
x31 282.3561 11.849 23.830 0.000 259.133 305.579
x32 218.2959 12.181 17.921 0.000 194.421 242.171
x33 194.9270 12.699 15.350 0.000 170.037 219.817
x34 -1038.1290 29.412 -35.296 0.000 -1095.776 -980.482
x35 -1429.4546 40.730 -35.096 0.000 -1509.284 -1349.625
x36 -1.021e+04 36.784 -277.658 0.000 -1.03e+04 -1.01e+04
x37 -5982.2095 15.651 -382.220 0.000 -6012.885 -5951.534
x38 3004.0730 28.298 106.159 0.000 2948.610 3059.536
x39 4535.2965 54.872 82.652 0.000 4427.749 4642.844
x40 -4645.1924 16.698 -278.195 0.000 -4677.919 -4612.466
x41 3110.6592 160.033 19.438 0.000 2797.000 3424.318
x42 7195.3346 48.059 149.718 0.000 7101.140 7289.529
x43 -7488.9490 24.289 -308.323 0.000 -7536.555 -7441.343
x44 -1.068e+04 53.542 -199.516 0.000 -1.08e+04 -1.06e+04
x45 -1.19e+04 45.546 -261.177 0.000 -1.2e+04 -1.18e+04
x46 1175.4639 83.574 14.065 0.000 1011.662 1339.266
x47 2354.8546 42.888 54.907 0.000 2270.795 2438.914
x48 2935.1657 35.917 81.721 0.000 2864.769 3005.562
x49 -1895.0141 134.688 -14.070 0.000 -2158.999 -1631.029
x50 -9003.5945 59.618 -151.022 0.000 -9120.444 -8886.745
x51 -1.194e+04 81.812 -145.944 0.000 -1.21e+04 -1.18e+04
x52 -1.158e+04 65.553 -176.632 0.000 -1.17e+04 -1.15e+04
x53 1489.1716 24.670 60.364 0.000 1440.819 1537.524
x54 2238.5714 93.608 23.914 0.000 2055.102 2422.041
x55 -732.7678 41.730 -17.560 0.000 -814.558 -650.978
x56 480.2321 29.776 16.128 0.000 421.872 538.592
x57 1076.8803 30.482 35.328 0.000 1017.136 1136.624
x58 1023.1860 128.939 7.935 0.000 770.470 1275.902
x59 987.0863 17.776 55.530 0.000 952.246 1021.926
x60 307.3852 45.456 6.762 0.000 218.293 396.478
x61 1979.9974 67.180 29.473 0.000 1848.327 2111.667
x62 441.5194 29.476 14.979 0.000 383.746 499.292
x63 203.3906 34.692 5.863 0.000 135.396 271.386
x64 250.2751 16.466 15.200 0.000 218.003 282.547
x65 653.6979 20.591 31.747 0.000 613.340 694.055
x66 893.8433 18.950 47.168 0.000 856.702 930.985
x67 1052.2746 29.336 35.870 0.000 994.777 1109.772
x68 1211.0301 61.789 19.599 0.000 1089.925 1332.135
x69 626.3778 131.545 4.762 0.000 368.553 884.202
x70 -3303.6544 99.019 -33.364 0.000 -3497.728 -3109.581
x71 678.0397 31.709 21.383 0.000 615.891 740.188
x72 449.4691 50.429 8.913 0.000 350.631 548.308
x73 1881.4959 33.873 55.546 0.000 1815.106 1947.886
x74 488.1976 34.729 14.057 0.000 420.130 556.266
x75 -818.2759 94.178 -8.689 0.000 -1002.861 -633.690
x76 -476.0159 78.144 -6.091 0.000 -629.176 -322.855
x77 369.1793 37.992 9.717 0.000 294.716 443.642
x78 -610.9179 49.224 -12.411 0.000 -707.395 -514.441
x79 217.0498 26.327 8.244 0.000 165.450 268.650
x80 -144.8580 24.612 -5.886 0.000 -193.097 -96.619
x81 475.4497 21.298 22.323 0.000 433.705 517.194
x82 1404.9458 27.294 51.474 0.000 1351.450 1458.442
x83 329.1859 49.154 6.697 0.000 232.846 425.526
==============================================================================
Omnibus: 27530.062 Durbin-Watson: 1.533
Prob(Omnibus): 0.000 Jarque-Bera (JB): 81968.349
Skew: -0.223 Prob(JB): 0.00
Kurtosis: 4.838 Cond. No. 48.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Please let me know if you need any other information. I have not shared the exact code I used with sklearn and statsmodels, because I thought it might overcomplicate the question; I am happy to share it if necessary.
The basic form of linear regression is the same in statsmodels and scikit-learn. However, the implementations differ and can produce different results in corner cases, and scikit-learn generally has more support for larger models. For example, statsmodels currently makes only limited use of sparse matrices.
The most important differences are in the surrounding infrastructure and in the use cases that are directly supported.
Statsmodels largely follows the traditional statistics model, where we want to know how well a given model fits the data, which variables "explain" or affect the outcome, and how large the effect is. Scikit-learn follows the machine-learning tradition, where the main supported task is choosing the "best" model for prediction.
Consequently, statsmodels' supporting features emphasize analyzing the training data, including hypothesis tests and goodness-of-fit measures, while scikit-learn's supporting infrastructure emphasizes model selection and out-of-sample prediction, and therefore cross-validation on "test data".
Side note: your question would be a better fit for https://stats.stackexchange.com.