How do I format my time series data for PyTorch LSTM classification?

Question

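How do I preprocess time series data for a classification problem and feed it into a PyTorch LSTM model? My dataset looks like the one in the figure below.

Here, event_type is the target column, and this is a binary classification problem. I want to train an LSTM on this dataset.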

Answer 1

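This question needs a long answer. First, drop every row that has a null value in the target column.

The features need to be on the same scale, so start by normalizing your data with a scaler. MinMaxScaler is a good choice.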

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# `data` is your raw dataset (for example, a dict of columns or a list of records)
df = pd.DataFrame(data)

# Drop rows whose target is missing
df = df.dropna(subset=['event_type']).reset_index(drop=True)

# Select the feature columns (everything except the target; assumed numeric)
numeric_columns = [c for c in df.columns if c not in ['event_type']]
print(numeric_columns)

# Scale features to the [0, 1] range
scaler = MinMaxScaler()
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Keep a copy that holds the scaled features together with the target column
df_normalized = df.copy()

Next, for each row, build an array holding the previous seq_length rows (the row itself included), and add those sequences to the DataFrame:

# Define sequence length
seq_length = 3  # Use a smaller sequence length for simplicity

def create_sequence_col(df, col_name, seq_length):
    # For each row, collect the window of seq_length values ending at that row.
    # The first seq_length - 1 rows have no full window and get NaN instead.
    values = df[col_name].to_numpy()
    seqs = [values[i - seq_length + 1:i + 1].tolist() if i >= seq_length - 1 else np.nan
            for i in range(len(values))]
    df[col_name + '_seq'] = pd.Series(seqs, index=df.index, dtype=object)
    return df

# Create sequence columns
for col_name in numeric_columns:
    df = create_sequence_col(df, col_name, seq_length)

# Filter out rows that do not have a full window
seq_columns = [f"{col_name}_seq" for col_name in numeric_columns]
pdf_sequences = df.dropna(subset=seq_columns)

# Filter out rows whose windows contain NaN values
for col_name in numeric_columns:
    has_nan = pdf_sequences[f"{col_name}_seq"].apply(lambda seq: np.isnan(seq).any())
    pdf_sequences = pdf_sequences[~has_nan]

# Combine sequences into a feature array of shape (samples, seq_length, n_features)
X = np.stack([np.array(pdf_sequences[f"{col_name}_seq"].tolist()) for col_name in numeric_columns], axis=-1)
y = pdf_sequences['event_type'].values

# Split into training and testing data
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

y_train = y_train.reshape((-1, 1))
y_test = y_test.reshape((-1, 1))

X_train.shape, X_test.shape, y_train.shape, y_test.shape

# Print shapes of resulting arrays
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
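
The windowing step can also be done directly on NumPy arrays, which makes the resulting (samples, seq_length, n_features) layout easier to see. The following is a minimal sketch, not part of the original answer: make_windows is a hypothetical helper name, the toy values are made up, and it assumes each window is labeled with the event_type of its last row.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(features, labels, seq_length):
    """Return (X, y) where X has shape (samples, seq_length, n_features)
    and y[i] is the label of the last row in window i."""
    features = np.asarray(features, dtype=np.float32)
    labels = np.asarray(labels)
    # sliding_window_view appends the window axis last: (samples, n_features, seq_length)
    windows = sliding_window_view(features, window_shape=seq_length, axis=0)
    # Reorder to (samples, seq_length, n_features) and copy out of the read-only view
    X = np.swapaxes(windows, 1, 2).copy()
    y = labels[seq_length - 1:]
    return X, y

# Toy data: 5 time steps, 2 features (values are made up for illustration)
toy_features = np.array([[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])
toy_labels = np.array([0, 0, 1, 0, 1])

X_toy, y_toy = make_windows(toy_features, toy_labels, seq_length=3)
print(X_toy.shape, y_toy.shape)  # (3, 3, 2) (3,)
```

Five rows with a window of three give three samples; the first window covers rows 0 to 2 and carries the label of row 2.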

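Once the arrays have the shape (samples, seq_length, n_features), they can be wrapped in tensors and passed to an LSTM. The original answer stops before this step, so the following is only a sketch under assumptions: LSTMClassifier, hidden_size=32, the batch size, and the random stand-in arrays are illustrative choices, and the real X_train / y_train would replace X_demo / y_demo.

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class LSTMClassifier(nn.Module):
    """Hypothetical single-layer LSTM followed by a linear layer (one logit)."""
    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        # batch_first=True matches arrays shaped (batch, seq_length, n_features)
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1])    # one logit per sequence: (batch, 1)

# Stand-in arrays with the same layout as X_train / y_train (random, illustration only)
rng = np.random.default_rng(0)
X_demo = rng.random((16, 3, 2)).astype(np.float32)
y_demo = rng.integers(0, 2, size=(16, 1)).astype(np.float32)

dataset = TensorDataset(torch.from_numpy(X_demo), torch.from_numpy(y_demo))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

model = LSTMClassifier(n_features=X_demo.shape[-1])
criterion = nn.BCEWithLogitsLoss()  # takes raw logits and float targets in {0, 1}
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):  # two epochs, just to show the loop
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

# Predicted probability of the positive class for each sequence
with torch.no_grad():
    probs = torch.sigmoid(model(torch.from_numpy(X_demo)))
print(probs.shape)  # torch.Size([16, 1])
```

The features must be float32 and the targets float with shape (batch, 1) for BCEWithLogitsLoss, which is why the reshape to (-1, 1) above is useful; thresholding probs at 0.5 gives the predicted class.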