I am working on a machine learning project in Jupyter Notebook to predict the price of electric vehicles.
I run these cells:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']
for col in cols:
    le.fit(t[col])
    x[col] = le.transform(x[col])
print(le.classes_)
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.5, random_state = 0)
r2_score(y_test, lm.predict(x_test))
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(x_train, y_train)
r2_score(y_test, regressor.predict(x_test))
r2_score(y_train, regressor.predict(x_train))
uv = np.nanpercentile(df2['Base MSRP'], [99])[0]*2
df2.loc[df2['Base MSRP'] > uv, 'Base MSRP'] = uv  # cap outliers; .loc avoids chained assignment
df2 = df2[df2['Model Year'] != 'N/'] # Filter out rows where 'Model Year' is 'N/'
for col in cols:
    df2[col] = df2[col].replace('N/', -1)
    le.fit(df2[col])
    df2[col] = le.transform(df2[col])
print(le.classes_)
le = preprocessing.LabelEncoder()
cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']
for col in cols:
    le.fit(t[col])
    df2[col] = le.transform(df2[col])
print(le.classes_)
I get this error:
TypeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_16424\1094749331.py in <module>
1 for col in cols:
2 le.fit(t[col])
----> 3 df2[col] = le.transform(df2[col])
4 print(le.classes_)
~\.conda\envs\electricvehiclepriceprediction\lib\site-packages\sklearn\preprocessing\_label.py in transform(self, y)
136 return np.array([])
137
--> 138 return _encode(y, uniques=self.classes_)
139
140 def inverse_transform(self, y):
~\.conda\envs\electricvehiclepriceprediction\lib\site-packages\sklearn\utils\_encode.py in _encode(values, uniques, check_unknown)
185 else:
186 if check_unknown:
--> 187 diff = _check_unknown(values, uniques)
188 if diff:
189 raise ValueError(f"y contains previously unseen labels: {str(diff)}")
~\.conda\envs\electricvehiclepriceprediction\lib\site-packages\sklearn\utils\_encode.py in _check_unknown(values, known_values, return_mask)
259
260 # check for nans in the known_values
--> 261 if np.isnan(known_values).any():
262 diff_is_nan = np.isnan(diff)
263 if diff_is_nan.any():
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
What have I tried?
I tried using the following code:
le = preprocessing.LabelEncoder()
cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']
for col in cols:
    le.fit(t[col])
    df2[col] = le.transform(df2[col])
print(le.classes_)
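This code gave me the error shown above.
To work around it, I tried imputing the missing values ('N/') instead of dropping them, using the following code: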
for col in cols:
    le.fit(t[col].fillna('Missing'))  # Impute missing values with 'Missing'
    df2[col] = le.transform(df2[col].fillna('Missing'))
print(le.classes_)
However, I still get the same error.
Here is the link to the dataset: https://www.kaggle.com/datasets/rithurajnambiar/electric-vehicle-data
How can I solve this problem?
Don't mix the dataframes df/df2/t when processing the training set, and encode the training set only once. The code below does not raise any Python error:
import pandas as pd
from sklearn import preprocessing

df_train = pd.read_csv('train.csv', header=0)
le = preprocessing.LabelEncoder()
cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model',
        'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']
for col in cols:
    le.fit(df_train[col])  # fit and transform on the same dataframe
    df_train[col] = le.transform(df_train[col])
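However, as the scikit-learn documentation for LabelEncoder says, "This transformer should be used to encode target values, i.e. y, and not the input X." Instead, you can use OrdinalEncoder(), which gives the same codes as above (returned as floats by default). It also makes the code for the training set shorter, since you only need one line: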
df_train[cols] = oe.fit_transform(df_train[cols])
where oe = preprocessing.OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
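As a quick sanity check of the "same codes" claim, here is a minimal sketch on toy data. The values are made up for illustration and are not taken from the Kaggle dataset. It fits one LabelEncoder per column and compares the result with a single OrdinalEncoder call:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy training data (hypothetical values, for illustration only)
df_train = pd.DataFrame({"Make": ["TESLA", "NISSAN", "BMW", "TESLA"],
                         "State": ["WA", "CA", "WA", "OR"]})
cols = ["Make", "State"]

# Per-column approach: a fresh LabelEncoder for each column
label_codes = np.column_stack(
    [LabelEncoder().fit_transform(df_train[c]) for c in cols]
)

# Single-call approach: one OrdinalEncoder for all columns
ordinal_codes = OrdinalEncoder().fit_transform(df_train[cols])

# Compare values only: LabelEncoder returns ints, OrdinalEncoder floats
same_codes = np.array_equal(label_codes, ordinal_codes)
print(same_codes)
```

Both encoders sort the categories of each column before numbering them, which is why the codes line up.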
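Note that you must do the same for the test set, but with the encoder that was fitted on the training set (to avoid leaking information from the test set into the training set), i.e. just call oe.transform.
I added two parameters when creating oe so that any category in the test set that was not seen during training is encoded as -1 (without them, an error is raised).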
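To make the train/test workflow concrete, here is a small sketch, again on made-up toy data. The names df_test, train_encoded and test_encoded are mine, for illustration only:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical values, not from the Kaggle dataset)
df_train = pd.DataFrame({"Make": ["TESLA", "NISSAN", "TESLA"],
                         "State": ["WA", "WA", "CA"]})
df_test = pd.DataFrame({"Make": ["NISSAN", "BMW"],  # "BMW" never appears in training
                        "State": ["CA", "WA"]})
cols = ["Make", "State"]

oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the categories from the training set only
train_encoded = oe.fit_transform(df_train[cols])
# Reuse the fitted encoder on the test set; unseen categories become -1
test_encoded = oe.transform(df_test[cols])

df_train[cols] = train_encoded
df_test[cols] = test_encoded
print(df_test)
```

The important part is that fit_transform is called only on the training set, and the test set only ever sees transform.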
See the OrdinalEncoder documentation at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
PS: You should handle missing values (imputation/removal/...) before encoding.
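One possible way to do that, as a rough sketch on toy data: treat both NaN and the 'N/' placeholder from the question as an explicit "Missing" category, then cast to str so that every column holds a single type before encoding. The "Missing" label and the decision to impute rather than drop rows are assumptions on my part, so adapt them to your data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data with mixed types, NaN and the 'N/' placeholder (hypothetical values)
df_train = pd.DataFrame({"Model Year": [2019, "N/", 2021, np.nan],
                         "Make": ["TESLA", "NISSAN", np.nan, "TESLA"]})
cols = ["Model Year", "Make"]

# Map NaN and 'N/' to one explicit label, then make every value a string
cleaned = (df_train[cols]
           .fillna("Missing")
           .replace("N/", "Missing")
           .astype(str))

oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoded = oe.fit_transform(cleaned)

print(oe.categories_)
print(encoded)
```

Apply the same cleaning steps to the test set before calling oe.transform on it.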