Getting "TypeError: ufunc 'isnan' not supported for the input types"


I am working on a machine learning project in a Jupyter Notebook to predict the price of electric vehicles.

I run these cells:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']
for col in cols:
    le.fit(t[col])
    x[col] = le.transform(x[col]) 
    print(le.classes_)

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.5, random_state = 0)

r2_score(y_test, lm.predict(x_test))


from sklearn.tree import DecisionTreeRegressor 
regressor = DecisionTreeRegressor(random_state = 0) 
regressor.fit(x_train, y_train)
r2_score(y_test, regressor.predict(x_test))


r2_score(y_train, regressor.predict(x_train))

uv = np.nanpercentile(df2['Base MSRP'], [99])[0]*2


df2['Base MSRP'][(df2['Base MSRP']>uv)] = uv


df2 = df2[df2['Model Year'] != 'N/']  # Filter out rows where 'Model Year' is 'N/'

for col in cols:
    df2[col] = df2[col].replace('N/', -1)
    le.fit(df2[col])
    df2[col] = le.transform(df2[col]) 
    print(le.classes_)

le = preprocessing.LabelEncoder()

cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']

for col in cols:
    le.fit(t[col])
    df2[col] = le.transform(df2[col]) 
    print(le.classes_)

I get this error:

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_16424\1094749331.py in <module>
      1 for col in cols:
      2     le.fit(t[col])
----> 3     df2[col] = le.transform(df2[col])
      4     print(le.classes_)

~\.conda\envs\electricvehiclepriceprediction\lib\site-packages\sklearn\preprocessing\_label.py in transform(self, y)
    136             return np.array([])
    137 
--> 138         return _encode(y, uniques=self.classes_)
    139 
    140     def inverse_transform(self, y):

~\.conda\envs\electricvehiclepriceprediction\lib\site-packages\sklearn\utils\_encode.py in _encode(values, uniques, check_unknown)
    185     else:
    186         if check_unknown:
--> 187             diff = _check_unknown(values, uniques)
    188             if diff:
    189                 raise ValueError(f"y contains previously unseen labels: {str(diff)}")

~\.conda\envs\electricvehiclepriceprediction\lib\site-packages\sklearn\utils\_encode.py in _check_unknown(values, known_values, return_mask)
    259 
    260         # check for nans in the known_values
--> 261         if np.isnan(known_values).any():
    262             diff_is_nan = np.isnan(diff)
    263             if diff_is_nan.any():

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

What have I tried?

I tried the following code:

le = preprocessing.LabelEncoder()
cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']
for col in cols:
    le.fit(t[col])
    df2[col] = le.transform(df2[col]) 
    print(le.classes_)

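That code gave me the error shown above.

To work around this, I tried imputing the missing values ('N/') instead of dropping them, using the following code: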

for col in cols:
  le.fit(t[col].fillna('Missing'))  # Impute missing values with 'Missing'
  df2[col] = le.transform(df2[col].fillna('Missing'))
  print(le.classes_)

However, I still get the same error.

Here is the link to my notebook: https://github.com/SteveAustin583/electric-vehicle-price-prediction-revengers/blob/main/revengers.ipynb

Here is the link to the dataset: https://www.kaggle.com/datasets/rithurajnambiar/electric-vehicle-data

How can I fix this?

python machine-learning scikit-learn
1 Answer

Don't mix the DataFrames df/df2/t when working with the training set, and encode the training set only once. The code below does not raise any Python errors:

import pandas as pd
from sklearn import preprocessing

df_train = pd.read_csv('train.csv', header=0)
le = preprocessing.LabelEncoder()
cols = ['County', 'City', 'State', 'ZIP Code', 'Model Year', 'Make', 'Model', 
        'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility']

for col in cols:
    le.fit(df_train[col])
    df_train[col] = le.transform(df_train[col]) 
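
As a side note on why the original loop may have failed: the traceback ends in np.isnan(known_values), and NumPy's isnan is not defined for arrays of strings. One plausible reading (an inference from the traceback, not something confirmed in the question) is that df2 had already been encoded to integers by an earlier cell, while the encoder was re-fitted on string labels from t. The sketch below only reproduces the NumPy behaviour; the array contents are hypothetical stand-ins for le.classes_.

```python
import numpy as np

# Hypothetical stand-in for le.classes_ after fitting on a string column.
known_values = np.array(["King", "Pierce", "Snohomish"], dtype=object)

try:
    # np.isnan only supports numeric inputs, so this call fails on strings.
    np.isnan(known_values)
    raised = False
except TypeError as exc:
    raised = True
    print(type(exc).__name__, exc)
```

If this is indeed the cause, fitting and transforming the same untouched DataFrame exactly once, as in the loop above, avoids the mismatch.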

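However, as the scikit-learn documentation says about LabelEncoder: "This transformer should be used to encode target values, i.e. y, and not the input X." You can use OrdinalEncoder() instead, which will give the same encoding as above (returned as floats). It also makes the code for the training set shorter, since you only need 1 line: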

df_train[cols] = oe.fit_transform(df_train[cols]) 

with oe = preprocessing.OrdinalEncoder(handle_unknown="use_encoded_value",unknown_value=-1)

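Note that you have to do the same for the test set, but using the encoder fitted on the training set to encode the test set (to avoid leaking information from the test set into the training set), i.e. just call oe.transform.

I added 2 parameters when creating oe so that -1 is assigned whenever an unknown category is found in the test set (without them, an error is raised).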
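
The fit-on-train, transform-on-test pattern described above can be sketched as follows. The two small DataFrames and their values are made up for illustration and are not taken from the Kaggle dataset; only the OrdinalEncoder parameters come from the answer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cols = ["Make", "Model"]

# Toy data (assumed for illustration only).
df_train = pd.DataFrame({"Make": ["TESLA", "NISSAN", "TESLA"],
                         "Model": ["MODEL 3", "LEAF", "MODEL S"]})
df_test = pd.DataFrame({"Make": ["NISSAN", "BMW"],
                        "Model": ["LEAF", "I3"]})

oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the categories from the training set only.
df_train[cols] = oe.fit_transform(df_train[cols])

# Reuse the fitted encoder on the test set; unseen categories become -1.
df_test[cols] = oe.transform(df_test[cols])

print(df_test)
```

Here "BMW" and "I3" never appear in the training data, so they are encoded as -1 rather than raising an error.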

See the OrdinalEncoder documentation at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

PS: You should handle missing values (imputation/removal/...) before encoding.
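
A minimal sketch of one way to do this, assuming that 'N/' is a placeholder for missing entries (as the question suggests) and that a constant 'Missing' label is an acceptable imputation. The toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cols = ["County", "Model Year"]

# Toy data (assumed): one real NaN and one 'N/' placeholder.
df_train = pd.DataFrame({"County": ["King", np.nan, "Pierce"],
                         "Model Year": ["2020", "N/", "2019"]})

# Map both kinds of missing entries to a single explicit category
# so every column holds only strings before encoding.
cleaned = df_train[cols].replace("N/", "Missing").fillna("Missing")

oe = OrdinalEncoder()
df_train[cols] = oe.fit_transform(cleaned)

print(df_train)
```

Dropping the affected rows instead of imputing them is equally valid; which option fits better depends on how many rows are affected.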
