I encoded my data with a FeatureHasher and get about 90% accuracy on the training data, but I still can't get the model to predict new data points. I suspect the encoding of the new data point is the problem. Could you take a look at the training-data encoding code:
columns_to_hash = ['port_cluster', 'org', 'asn', 'protocol', 'event_type', 'os', 'country_name', 'city_name', 'class']
attribute_weights = {
    'port_cluster': 0.7,
    'protocol': 0.4,
    'city_name': 0.4,
    'org': 0.5,
    'asn': 0.5,
    'event_type': 0.6,
    'os': 0.3,
    'country_name': 0.4,
    'class': 0.3,
}
# Create a dictionary to store the hashed features
hashed_feature_dict = {}
# Iterate over unique src_ips
unique_src_ips = df['src_ip'].unique()
for src_ip in unique_src_ips:
    src_ip_data = df[df['src_ip'] == src_ip]
    # Initialize the FeatureHasher for the current src_ip
    hasher = FeatureHasher(n_features=20, input_type='string')
    src_ip_hashed_feature_dict = {}
    # Iterate over columns to hash and store hashed features
    for column in columns_to_hash:
        hashed_features = hasher.fit_transform(src_ip_data[column].astype(str).values.reshape(-1, 1))
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        src_ip_hashed_feature_dict[column] = weighted_hashed_features
    hashed_features_array = np.concatenate([src_ip_hashed_feature_dict[column] for column in columns_to_hash], axis=1)
    hashed_feature_dict[src_ip] = hashed_features_array
# Concatenate all hashed features
all_hashed_features = np.concatenate(list(hashed_feature_dict.values()), axis=0)
# Split the data into training and testing sets
X = all_hashed_features
y = df['cluster_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
And the code for predicting a new data point:
# Create a dictionary to store the hashed features for the new data point
new_data = {
    'src_ip': '65.21.234.90',
    'asn': 24940,
    'country_name': 'Finland',
    'city_name': 'Helsinki',
    'open_ports': ['5060/sip', '2000/ikettle'],
    'protocol': 'UDP',
    'ip_rep': None,
    'first_seen': '2023-08-22T16:56:45.733Z',
    'last_time': '2023-08-22T16:56:45.733Z',
    'class': 'A',
    'event_type': 'sip',
    'event_data': {
        'request_line': 'OPTIONS sip:[email protected] SIP/2.0',
        'uri': 'sip:[email protected]',
        'version': 'SIP/2.0',
        'method': 'OPTIONS'
    },
    'link': None,
    'os': 'None',
    'org': 'Hetzner Online GmbH',
    'port_cluster': -1
}
# Initialize the FeatureHasher for the new data point
# hasher = FeatureHasher(n_features=11, input_type='string')
new_data_hashed_feature_dict = {}
# Iterate over columns to hash and store hashed features for the new data point
for column in columns_to_hash:
    if column in new_data:
        # Convert a single string to a list containing that string
        feature_value = [str(new_data[column])]
        hashed_features = hasher.transform([feature_value])
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        new_data_hashed_feature_dict[column] = weighted_hashed_features
# Concatenate all hashed features for the new data point
new_data_features = np.concatenate([new_data_hashed_feature_dict[column] for column in columns_to_hash], axis=1)
# Predict the cluster label for the new data point
predicted_label = model.predict(new_data_features)
print(f"Predicted Cluster Label: {predicted_label[0]}")
In your training-data encoding you create a new FeatureHasher for every unique src_ip, while in your prediction code you reuse a generic hasher object. The FeatureHasher class doesn't learn anything from the data, so hashing is only consistent when both sides use the same configuration; note that the commented-out hasher uses n_features=11 while training uses n_features=20. You need the same hashing process, with the same n_features, for training and prediction to get consistent features.
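To make this concrete, here's a quick standalone check (nothing here depends on your dataframe): two FeatureHasher instances with the same n_features produce identical vectors, while a different n_features, like the commented-out 11, changes the vector width and breaks a model trained on 20-wide blocks.

```python
from sklearn.feature_extraction import FeatureHasher

# Stateless: two instances with identical configuration hash identically.
h1 = FeatureHasher(n_features=20, input_type='string')
h2 = FeatureHasher(n_features=20, input_type='string')
a = h1.transform([['UDP']]).toarray()
b = h2.transform([['UDP']]).toarray()
print((a == b).all())  # True

# A different n_features changes the vector width, so a model trained
# on 20-wide blocks cannot score an 11-wide vector.
h3 = FeatureHasher(n_features=11, input_type='string')
print(h3.transform([['UDP']]).toarray().shape)  # (1, 11)
```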
Also, in your training data you concatenate the hashed features of every src_ip, whereas in your prediction code you hash and concatenate the features of only one src_ip. If the number of hashed features differs between training and prediction, I would expect that to make your model fail.
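As a sanity check you could fail fast before calling predict; this is a standalone sketch (check_width is a helper name I made up; 9 and 20 are the column count and n_features from your code):

```python
import numpy as np

N_FEATURES = 20
N_COLUMNS = 9  # len(columns_to_hash)
EXPECTED_WIDTH = N_COLUMNS * N_FEATURES  # 180

def check_width(features):
    """Raise if the encoded vector doesn't match the training width."""
    if features.shape[1] != EXPECTED_WIDTH:
        raise ValueError(
            f"expected {EXPECTED_WIDTH} features, got {features.shape[1]}"
        )

check_width(np.zeros((1, 180)))    # passes
# check_width(np.zeros((1, 160)))  # would raise: one column was skipped
```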
One more thing: you iterate over the columns_to_hash list against the new_data dictionary. You need to make sure that every key listed in columns_to_hash actually exists in new_data; since your code silently skips missing keys, the input data would no longer have the same shape for training and prediction. Something like this:
new_data_hashed_feature_dict = {}
# Here I iterate over columns to hash and store hashed features for the new data point
for column in columns_to_hash:
    if column in new_data:
        feature_value = [str(new_data[column])]  # Convert a single string to a list containing that string
        hashed_features = hasher.transform([feature_value])
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        new_data_hashed_feature_dict[column] = weighted_hashed_features
    else:
        print(f"Warning: Missing key {column} in new data")
# Then I concatenate all hashed features for the new data point
new_data_features = np.concatenate([new_data_hashed_feature_dict[column] for column in columns_to_hash if column in new_data], axis=1)
# Here you will need to predict the cluster label for the new data point
predicted_label = model.predict(new_data_features)
print(f"Predicted Cluster Label: {predicted_label[0]}")
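Putting the points together, one way to guarantee consistency is a single encoding function, with one fixed-configuration hasher, used for both the training rows and new_data. This is a sketch under the assumption that zero-filling a missing column is acceptable for your model (encode_record is a name I made up); the column list and weights are copied from your code:

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

N_FEATURES = 20
columns_to_hash = ['port_cluster', 'org', 'asn', 'protocol', 'event_type',
                   'os', 'country_name', 'city_name', 'class']
attribute_weights = {'port_cluster': 0.7, 'org': 0.5, 'asn': 0.5,
                     'protocol': 0.4, 'event_type': 0.6, 'os': 0.3,
                     'country_name': 0.4, 'city_name': 0.4, 'class': 0.3}

# One hasher, one configuration, shared by training and prediction.
hasher = FeatureHasher(n_features=N_FEATURES, input_type='string')

def encode_record(record):
    """Hash one record into a fixed-width weighted feature vector."""
    blocks = []
    for column in columns_to_hash:
        if column in record:
            hashed = hasher.transform([[str(record[column])]]).toarray()
            blocks.append(hashed * attribute_weights[column])
        else:
            # Zero-fill missing keys instead of skipping them, so the
            # width always equals len(columns_to_hash) * N_FEATURES.
            blocks.append(np.zeros((1, N_FEATURES)))
    return np.concatenate(blocks, axis=1)

vec = encode_record({'protocol': 'UDP', 'class': 'A'})
print(vec.shape)  # (1, 180)
```

If you encode every training row and every new record through this one function, the feature width can never drift between fit time and predict time.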