After preprocessing, I split my data with train_test_split.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.2,random_state=42)
Then I apply robust scaling to the numeric columns of the train and test sets separately:
from sklearn.preprocessing import RobustScaler
robust = RobustScaler()
X_train_ = robust.fit_transform(X_train[numeric_columns])
X_test_ = robust.transform(X_test[numeric_columns])
X_train_sc_num=pd.DataFrame(X_train_,columns=[numeric_columns])
X_test_sc_num=pd.DataFrame(X_test_,columns=[numeric_columns])
Then I concatenate:
X_train_scaled=pd.concat([X_train_sc_num,X_train[categoric_columns]],axis=1)
X_test_scaled=pd.concat([X_test_sc_num,X_test[categoric_columns]],axis=1)
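But the shape is broken, and a lot of NaN values appear in the categorical columns of the output. The shapes are (466, 17) + (466, 11), so the result should be (466, 28), but it comes out as (560, 28).
How can I fix this?
I want to apply robust scaling to the data after train_test_split without touching my OHE columns.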
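Your problem is probably caused by the following two things.
1. You pass columns=[numeric_columns], which wraps the list in another list, so pandas does not read it as a flat list of column names (it builds a MultiIndex instead). It should be just columns=numeric_columns.
2. You don't preserve the index. RobustScaler returns a plain NumPy array, so the new DataFrame gets a fresh 0..n-1 index, while X_train and X_test keep their original row labels after the split. pd.concat with axis=1 aligns rows on index labels, and labels that don't match produce extra rows filled with NaN. Adding the extra argument index=X_train.index (or index=X_test.index, as appropriate) to the pd.DataFrame() call fixes this.
Here is a reproducible example on sample data that walks through the steps you need to follow: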
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
# Create example data
np.random.seed(42)
n_samples = 466
numerical_data = {"temperature": np.random.normal(14, 3, n_samples), "moisture": np.random.normal(96, 2, n_samples)}
categorical_data = {"color": np.random.choice(["green", "yellow", "purple"], size=n_samples, p=[0.8, 0.1, 0.1])}
# Create DataFrame
df = pd.DataFrame({**numerical_data, **categorical_data})
# Define numeric and categorical columns
numerical_columns = list(numerical_data)
categorical_columns = list(categorical_data)
# One-hot encode categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_columns)
# Split features and target (creating dummy target for example)
y = np.random.randint(0, 2, n_samples)
X = df_encoded
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Get the one-hot encoded column names
# In general, these will be different than categorical_columns, since you're doing OHE
categorical_columns_encoded = [col for col in X_train.columns if col not in numerical_columns]
# Initialize RobustScaler
robust = RobustScaler()
# Scale only numeric columns
X_train_scaled_numeric = robust.fit_transform(X_train[numerical_columns])
X_test_scaled_numeric = robust.transform(X_test[numerical_columns])
# Create DataFrames with correct column names for scaled numeric data
X_train_scaled_numeric_df = pd.DataFrame(
X_train_scaled_numeric,
columns=numerical_columns,
index=X_train.index, # Preserve the index
)
X_test_scaled_numeric_df = pd.DataFrame(
X_test_scaled_numeric,
columns=numerical_columns,
index=X_test.index, # Preserve the index
)
# Concatenate with categorical columns
X_train_scaled = pd.concat([X_train_scaled_numeric_df, X_train[categorical_columns_encoded]], axis=1)
X_test_scaled = pd.concat([X_test_scaled_numeric_df, X_test[categorical_columns_encoded]], axis=1)
# Verify the shapes
print("Original shapes:")
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print("\nScaled shapes:")
print(f"X_train_scaled: {X_train_scaled.shape} = {X_train_scaled_numeric.shape} + {X_train[categorical_columns_encoded].shape}")
print(f"X_test_scaled: {X_test_scaled.shape} = {X_test_scaled_numeric.shape} + {X_test[categorical_columns_encoded].shape}")
# Verify no NaN values
print("\nNaN check:")
print("NaN in X_train_scaled:", X_train_scaled.isna().sum().sum())
print("NaN in X_test_scaled:", X_test_scaled.isna().sum().sum())
This prints:
Original shapes:
X_train: (372, 5)
X_test: (94, 5)
Scaled shapes:
X_train_scaled: (372, 5) = (372, 2) + (372, 3)
X_test_scaled: (94, 5) = (94, 2) + (94, 3)
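To see why the missing index inflates the row count and introduces NaN, the misalignment can be reproduced in isolation. The sketch below is a toy illustration only: the three-row frame, the column names num and cat_a, and the division by 10 standing in for a scaler are all made up for the demo and are not taken from the question's data.

```python
import pandas as pd

# Hypothetical toy frame whose index is not 0..n-1, as happens after train_test_split
X_train = pd.DataFrame(
    {"num": [10.0, 20.0, 30.0], "cat_a": [1, 0, 1]},
    index=[7, 2, 5],
)

# Stand-in for a scaler: like fit_transform, this yields a bare NumPy array,
# so the original row labels are lost
scaled = X_train[["num"]].to_numpy() / 10.0

# Without index=..., the new frame gets a fresh RangeIndex (0, 1, 2)
misaligned = pd.DataFrame(scaled, columns=["num"])
broken = pd.concat([misaligned, X_train[["cat_a"]]], axis=1)

# With index=X_train.index, the rows line up again
aligned = pd.DataFrame(scaled, columns=["num"], index=X_train.index)
fixed = pd.concat([aligned, X_train[["cat_a"]]], axis=1)

print(broken.shape, int(broken.isna().sum().sum()))  # (5, 2) 4
print(fixed.shape, int(fixed.isna().sum().sum()))    # (3, 2) 0
```

In the broken case only label 2 is shared between the two indexes, so the result holds the union of labels (0, 1, 2, 5, 7) and every unmatched cell becomes NaN. The same mechanism would explain 466 rows growing to 560 in the question.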