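How do I preprocess time-series data for a classification problem and feed it into a PyTorch LSTM model? I have a dataset like the one shown in the figure below. How should I preprocess it so that it can be fed into an LSTM model?
Here, event_type is the target column, and this is a binary classification problem. I want to train an LSTM on this dataset.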
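This question needs a long answer. First, drop all rows where the target column is null.
The features need to be on the same scale, so start by normalizing the data with a scaler. MinMaxScaler is a good choice.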
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# `data` is the dataset shown in the question
df = pd.DataFrame(data)

# Drop rows whose target is missing
df = df.dropna(subset=['event_type']).reset_index(drop=True)

# Select numerical columns (every numeric column except the target)
numeric_columns = [c for c in df.select_dtypes(include='number').columns if c != 'event_type']
print(numeric_columns)

# Scale features to the [0, 1] range (in place)
scaler = MinMaxScaler()
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Keep the scaled features together with the untouched target column
df_normalized = df.copy()
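As a side note on the scaling step: because the data is a time series that later gets split chronologically, one common precaution is to fit the scaler on the training period only and reuse its statistics for the later rows. The sketch below illustrates this on made-up values; the array `values` and the cut-off `split` are assumptions for illustration, not part of the original dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: 5 time steps, 2 features (made-up values)
values = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0],
                   [5.0, 50.0]])

split = 4  # hypothetical cut-off: the first 4 rows act as the training period

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(values[:split])  # learn min/max from training rows only
test_scaled = scaler.transform(values[split:])       # reuse the training min/max

print(train_scaled)
print(test_scaled)  # later rows can fall outside [0, 1]; that is expected
```

Whether this matters in practice depends on how much the feature ranges drift over time; for a quick experiment, scaling the whole frame at once as above is simpler.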
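Next, for each row, build an array holding the seq_length most recent rows (the window ending at that row), and add these sequences to the DataFrame: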
# Define sequence length
seq_length = 3  # Use a smaller sequence length for simplicity

def create_sequence_col(df, col_name, seq_length):
    # For each row, store the list of the seq_length values ending at that row.
    # The first seq_length - 1 rows have no full window and get None.
    values = df[col_name].to_numpy()
    seqs = [
        values[i - seq_length + 1:i + 1].tolist() if i >= seq_length - 1 else None
        for i in range(len(values))
    ]
    df[col_name + '_seq'] = pd.Series(seqs, index=df.index, dtype=object)
    return df

# Create sequence columns
for col_name in numeric_columns:
    df = create_sequence_col(df, col_name, seq_length)

# Filter out rows that do not have a full window
seq_cols = [f"{col_name}_seq" for col_name in numeric_columns]
pdf_sequences = df.dropna(subset=seq_cols)

# Combine sequences into a feature array of shape (samples, seq_length, n_features)
X = np.stack([np.array(pdf_sequences[c].tolist(), dtype=np.float32) for c in seq_cols], axis=-1)
y = pdf_sequences['event_type'].values

# Drop windows that contain NaN values
keep = ~np.isnan(X).any(axis=(1, 2))
X, y = X[keep], y[keep]
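The windowing step can also be done directly on NumPy arrays, without storing lists in the DataFrame. The following is a minimal sketch on toy data, assuming NumPy 1.20 or newer (for `sliding_window_view`); the names `features`, `labels`, `X_toy`, and `y_toy` are made up for illustration. It shows the `(samples, seq_length, n_features)` layout an LSTM with `batch_first=True` expects.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

seq_length = 3

# Toy data: 5 time steps, 2 features (made-up values)
features = np.arange(10, dtype=float).reshape(5, 2)
labels = np.array([0, 1, 0, 1, 1])

# sliding_window_view appends the window axis last: (n_windows, n_features, seq_length)
windows = sliding_window_view(features, window_shape=seq_length, axis=0)

# Reorder to (n_windows, seq_length, n_features)
X_toy = windows.transpose(0, 2, 1)

# Each window is labelled with the target of its last row
y_toy = labels[seq_length - 1:]

print(X_toy.shape, y_toy.shape)
print(X_toy[0])  # first window: rows 0, 1 and 2 of `features`
```

With 5 rows and a window of 3 there are 5 - 3 + 1 = 3 windows, so the first `seq_length - 1` labels have no matching window and are dropped.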
# Split into training and testing data
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
y_train = y_train.reshape((-1, 1))
y_test = y_test.reshape((-1, 1))
# Print shapes of resulting arrays
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
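Once the arrays have the shape `(samples, seq_length, n_features)`, they can be converted to tensors and passed to an LSTM. The sketch below is one possible setup, not the only way to do it: the class name `LSTMClassifier`, the hidden size, the learning rate, and the number of epochs are all assumptions, and random stand-in arrays are used in place of the real `X_train` and `y_train` so the snippet runs on its own.

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in arrays with the same layout as X_train / y_train (random, for illustration only)
rng = np.random.default_rng(0)
X_train = rng.random((64, 3, 2)).astype(np.float32)             # (samples, seq_length, n_features)
y_train = rng.integers(0, 2, size=(64, 1)).astype(np.float32)   # (samples, 1), values 0.0 or 1.0


class LSTMClassifier(nn.Module):  # hypothetical name
    def __init__(self, n_features, hidden_size=16):
        super().__init__()
        # batch_first=True makes the LSTM accept (batch, seq_length, n_features)
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1])    # one logit per sequence: (batch, 1)


torch.manual_seed(0)
model = LSTMClassifier(n_features=X_train.shape[2])
loss_fn = nn.BCEWithLogitsLoss()     # expects raw logits and float targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

dataset = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for epoch in range(2):               # a couple of epochs, just to show the loop
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

# Turn logits into probabilities for the positive class
with torch.no_grad():
    probs = torch.sigmoid(model(torch.from_numpy(X_train[:4])))
print(probs.shape)
```

The labels are cast to `float32` because `BCEWithLogitsLoss` works on float targets; the sigmoid is applied only at prediction time, since the loss already includes it.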