Why is my data not concatenating correctly?


After preprocessing, I split my data using train_test_split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Then I apply robust scaling to the numeric columns of the train and test sets separately:

import pandas as pd
from sklearn.preprocessing import RobustScaler

robust = RobustScaler()
X_train_ = robust.fit_transform(X_train[numeric_columns])
X_test_ = robust.transform(X_test[numeric_columns])
X_train_sc_num = pd.DataFrame(X_train_, columns=[numeric_columns])
X_test_sc_num = pd.DataFrame(X_test_, columns=[numeric_columns])

Then I concatenate:

X_train_scaled = pd.concat([X_train_sc_num, X_train[categoric_columns]], axis=1)
X_test_scaled = pd.concat([X_test_sc_num, X_test[categoric_columns]], axis=1)

But the shapes get broken, and many NaN values appear in the categorical columns of the output. The shape should be (466, 17) + (466, 11) = (466, 28), but instead it becomes (560, 28).

How can I fix this?

I want to robust-scale my data after train_test_split without touching my one-hot-encoded (OHE) columns.

1 Answer

Your problem is most likely caused by two things.

  1. First, you use columns=[numeric_columns], which treats the whole list as a single column name. It should simply be columns=numeric_columns.
  2. When creating the scaled DataFrames, you are not preserving the index of the original data. To do that, just add the extra argument index=X_train.index (or index=X_test.index, as the case may be) to the pd.DataFrame() call.
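To see point 2 in isolation, here is a minimal sketch (toy data and made-up column names, not your dataset) of how mismatched indexes inflate the result of pd.concat: after train_test_split, the train rows keep their original labels, so a rebuilt frame with a fresh 0..n-1 index no longer lines up with them.

```python
import pandas as pd

# Two rows whose original positions were 3 and 7 (as after train_test_split)
X_train = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]}, index=[3, 7])

# Rebuilding a frame WITHOUT passing the index resets it to 0, 1
scaled_bad = pd.DataFrame([[0.1], [0.2]], columns=["a"])
scaled_good = pd.DataFrame([[0.1], [0.2]], columns=["a"], index=X_train.index)

# pd.concat(axis=1) aligns on the index: mismatched labels
# produce the union of both indexes, padded with NaN
bad = pd.concat([scaled_bad, X_train[["b"]]], axis=1)
good = pd.concat([scaled_good, X_train[["b"]]], axis=1)

print(bad.shape)   # (4, 2) - union of {0, 1} and {3, 7}, half NaN
print(good.shape)  # (2, 2) - indexes match, no NaN
```

This is exactly the mechanism behind your (560, 28): the row count becomes the size of the union of the two indexes, not 466.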

Here is a reproducible example with sample data that illustrates the steps you need to follow:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Create example data
np.random.seed(42)
n_samples = 466

numerical_data = {"temperature": np.random.normal(14, 3, n_samples), "moisture": np.random.normal(96, 2, n_samples)}
categorical_data = {"color": np.random.choice(["green", "yellow", "purple"], size=n_samples, p=[0.8, 0.1, 0.1])}

# Create DataFrame
df = pd.DataFrame({**numerical_data, **categorical_data})

# Define numeric and categorical columns
numerical_columns = list(numerical_data.keys())
categorical_columns = list(categorical_data.keys())

# One-hot encode categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_columns)

# Split features and target (creating dummy target for example)
y = np.random.randint(0, 2, n_samples)
X = df_encoded

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Get the one-hot encoded column names
# In general, these will be different than categorical_columns, since you're doing OHE
categorical_columns_encoded = [col for col in X_train.columns if col not in numerical_columns]

# Initialize RobustScaler
robust = RobustScaler()

# Scale only numeric columns
X_train_scaled_numeric = robust.fit_transform(X_train[numerical_columns])
X_test_scaled_numeric = robust.transform(X_test[numerical_columns])

# Create DataFrames with correct column names for scaled numeric data
X_train_scaled_numeric_df = pd.DataFrame(
    X_train_scaled_numeric,
    columns=numerical_columns,
    index=X_train.index,  # Preserve the index
)

X_test_scaled_numeric_df = pd.DataFrame(
    X_test_scaled_numeric,
    columns=numerical_columns,
    index=X_test.index,  # Preserve the index
)

# Concatenate with categorical columns
X_train_scaled = pd.concat([X_train_scaled_numeric_df, X_train[categorical_columns_encoded]], axis=1)
X_test_scaled = pd.concat([X_test_scaled_numeric_df, X_test[categorical_columns_encoded]], axis=1)

# Verify the shapes
print("Original shapes:")
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print("\nScaled shapes:")
print(f"X_train_scaled: {X_train_scaled.shape} = {X_train_scaled_numeric.shape} + {X_train[categorical_columns_encoded].shape}")
print(f"X_test_scaled: {X_test_scaled.shape} = {X_test_scaled_numeric.shape} + {X_test[categorical_columns_encoded].shape}")

# Verify no NaN values
print("\nNaN check:")
print("NaN in X_train_scaled:", X_train_scaled.isna().sum().sum())
print("NaN in X_test_scaled:", X_test_scaled.isna().sum().sum())

This prints:

Original shapes:
X_train: (372, 5)
X_test: (94, 5)

Scaled shapes:
X_train_scaled: (372, 5) = (372, 2) + (372, 3)
X_test_scaled: (94, 5) = (94, 2) + (94, 3)

NaN check:
NaN in X_train_scaled: 0
NaN in X_test_scaled: 0