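I wrote the following code to learn how scoring works in machine learning methods, but I get the error below. What could be causing it?
ValueError: Found input variables with inconsistent numbers of samples: [6396, 1599]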
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Armenian Market Car Prices.csv')
df['Car Name'] = df['Car Name'].astype('category').cat.codes
df = df.join(pd.get_dummies(df.FuelType, dtype=int))
df = df.drop('FuelType', axis=1)
df['Region'] = df['Region'].astype('category').cat.codes
df['Price'] = df.pop('Price')
X = df.drop('Price', axis=1)
y = df['Price']
df
[1]: https://i.sstatic.net/vTKsYZro.jpg
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[358], line 1
----> 1 model.fit(X_train, y_train)
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1466 estimator._validate_params()
1468 with config_context(
1469 skip_parameter_validation=(
1470 prefer_skip_nested_validation or global_skip_validation
1471 )
1472 ):
-> 1473 return fit_method(estimator, *args, **kwargs)
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\linear_model\_base.py:609, in LinearRegression.fit(self, X, y, sample_weight)
605 n_jobs_ = self.n_jobs
607 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 609 X, y = self._validate_data(
610 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
611 )
613 has_sw = sample_weight is not None
614 if has_sw:
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py:650, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
648 y = check_array(y, input_name="y", **check_y_params)
649 else:
--> 650 X, y = check_X_y(X, y, **check_params)
651 out = X, y
653 if not no_val_X and check_params.get("ensure_2d", True):
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py:1291, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1273 X = check_array(
1274 X,
1275 accept_sparse=accept_sparse,
(...)
1286 input_name="X",
1287 )
1289 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
-> 1291 check_consistent_length(X, y)
1293 return X, y
File ~\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py:460, in check_consistent_length(*arrays)
458 uniques = np.unique(lengths)
459 if len(uniques) > 1:
--> 460 raise ValueError(
461 "Found input variables with inconsistent numbers of samples: %r"
462 % [int(l) for l in lengths]
463 )
ValueError: Found input variables with inconsistent numbers of samples: [6396, 1599]
I have tried everything I could think of, but nothing worked, and I don't know how to solve the problem.
Jupyternaut
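The error message you provided indicates a problem with the input data. Specifically, there appear to be two different versions of the input data, one with 6396 samples and another with 1599 samples. This can cause problems when trying to fit a model or perform other operations on the data.
To resolve this, you may need to review your code and make sure each operation uses the correct version of the input data. You may also want to try cleaning the input data by removing any duplicates or inconsistencies.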
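The outputs of train_test_split are unpacked in the wrong order in your code. train_test_split returns the train and test parts of X first, followed by the train and test parts of y. Here is the correct code: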
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
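To see why the message reports exactly [6396, 1599], here is a minimal sketch on synthetic data. The array sizes are assumptions: 7995 rows is inferred from 6396 + 1599 in the error message, and the 5 feature columns, the random values, and the `wrong_*` names are purely illustrative. With the question's unpacking order, the name `y_train` ends up bound to the test split of X, so `fit` receives 6396 rows of features and 1599 rows of "targets".

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car-price data (assumed sizes, random values).
rng = np.random.default_rng(0)
X = rng.normal(size=(7995, 5))
y = rng.normal(size=7995)

# train_test_split returns the splits of each input in order:
# X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, y_train.shape)  # same number of rows
print(X_test.shape, y_test.shape)    # same number of rows

# The unpacking order from the question: the second name receives X's test split.
wrong_X_train, wrong_y_train, wrong_X_test, wrong_y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(wrong_X_train), len(wrong_y_train))  # mismatched row counts
```

Printing the shapes right after the split, as above, is a quick way to catch this kind of mix-up before calling `model.fit`.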