Code:
num_features = [feature for feature in x.columns if x[feature].dtypes != 'O']
x[num_features].replace('N',value=0)
from sklearn.preprocessing import StandardScaler
stds = StandardScaler()
x[num_features]= stds.fit(x)
Error:
ValueError: could not convert string to float: 'N'
You need to select the numeric columns.
numeric_features = df.select_dtypes(include=['number'])
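To make the fix concrete, here is a minimal sketch of the snippet from the question, scaling only the numeric columns. The toy frame and its column names (age, income, flag) are made up for illustration and stand in for your own x. Note two changes from the question: the scaler is given only the numeric columns rather than all of x, and fit_transform is used because fit on its own returns the scaler object, not the scaled values.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the question's `x`.
x = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30000.0, 52000.0, 81000.0, 64000.0],
    "flag": pd.Series(["N", "Y", "N", "Y"], dtype=object),
})

# Names of the numeric columns only; the string column "flag" is left out.
num_features = x.select_dtypes(include=["number"]).columns

# fit_transform returns the scaled array; fit alone would return the scaler.
stds = StandardScaler()
x[num_features] = stds.fit_transform(x[num_features])

print(x)
```

The string column is never passed to the scaler, so the 'N' values no longer trigger the ValueError.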
Your current approach does not exclude booleans (True and False), category values, or dates and times.
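A small sketch of the difference, on a made-up frame (all column names here are hypothetical). It compares the dtypes != 'O' filter from the question with select_dtypes(include=['number']).

```python
import pandas as pd

# Hypothetical frame with one column per kind of dtype.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [30000.0, 52000.0, 81000.0],
    "is_member": [True, False, True],
    "flag": pd.Series(["N", "Y", "N"], dtype=object),
    "segment": pd.Categorical(["a", "b", "a"]),
    "signup": pd.to_datetime(["2021-01-01", "2021-06-15", "2022-03-09"]),
})

# Filter from the question: keeps everything that is not an object column.
not_object_cols = [c for c in df.columns if df[c].dtypes != "O"]

# Filter suggested here: keeps only integer and float columns.
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()

print(not_object_cols)
print(numeric_cols)
```

The first list still contains the boolean, category and datetime columns, while the second contains only age and income.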
Having only numeric values is not enough. We may also have missing values, and we need to handle them before standardizing.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# numeric transformer: impute missing values, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
# fit_transform first fills missing values with each column's median, then scales
normalise_numeric_features = numeric_transformer.fit_transform(numeric_features)
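To see what the two steps do, here is a self-contained sketch on a tiny made-up frame (the values and column names are assumptions, not data from the question). It prints the medians the imputer learned and checks that the result has no missing values left.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with one missing value per column.
numeric_features = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 50.0],
    "income": [1000.0, np.nan, 3000.0, 5000.0],
})

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Returns a NumPy array: NaNs replaced by the column median, then scaled.
scaled = numeric_transformer.fit_transform(numeric_features)

# Medians learned by the imputer, one per column.
print(numeric_transformer.named_steps["imputer"].statistics_)
# Whether any missing values remain after the pipeline.
print(np.isnan(scaled).any())
# Per-column mean after scaling (approximately zero).
print(scaled.mean(axis=0))
```

The order of the steps matters: StandardScaler would simply pass NaN through, so the imputer has to come first.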
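This may still not be enough, because your dataset can contain several data types, and each type needs its own transformation. You can do it like this.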
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
# select the column names for each feature type
# (ColumnTransformer expects column names, not DataFrames)
numeric_features = df.select_dtypes(include=['number']).columns
categorical_features = df.select_dtypes(include=['category']).columns
# create a transformer for each feature type
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# combine them into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
# if you have a model, you can add it to the pipeline too
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
# now we can train the model
# (train_features, train_target and test_features are placeholders for your own columns)
clf.fit(df[train_features], df[train_target])
# fit takes the features, runs the preprocessing (imputing, scaling, etc.) with fit_transform, and trains your model.
clf.predict(df[test_features])
# predict only transforms your features, using the statistics learned from train_features.
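Since df, train_features, train_target and test_features are not defined above, here is a runnable sketch of the same idea on made-up data. Every column name and value below is an assumption for illustration only. One choice to point out: the target column is dropped before selecting feature types, so a numeric target does not end up among the numeric features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with missing numeric values.
train = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0, 41.0, 29.0],
    "income": [20.0, 48.0, 75.0, np.nan, 52.0, 31.0],
    "segment": pd.Categorical(["a", "b", "b", "c", "a", "a"]),
    "bought": [0, 1, 1, 1, 0, 0],
})
# Hypothetical new data; segment "d" never appears in training.
test = pd.DataFrame({
    "age": [33.0, np.nan],
    "income": [45.0, 90.0],
    "segment": pd.Categorical(["b", "d"]),
})

target = "bought"
features = train.drop(columns=[target])

# Column names per feature type, taken from the features only.
numeric_features = features.select_dtypes(include=["number"]).columns
categorical_features = features.select_dtypes(include=["category"]).columns

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(solver="lbfgs")),
])

# fit learns medians, scaling statistics and categories from train only.
clf.fit(features, train[target])

# predict reuses those learned statistics on the new rows.
preds = clf.predict(test)
print(preds)
```

Because the encoder uses handle_unknown='ignore', the unseen segment "d" is encoded as all zeros in the one-hot columns instead of raising an error.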
Hope this helps you get the most out of scikit-learn.