How to predict with a model trained on sklearn FeatureHasher features

Votes: 0 · Answers: 1

I used FeatureHasher to encode my data and got about 90% accuracy on the training data, but I still cannot get the model to predict new data points. I suspect something is wrong with how I encode the new points. Please take a look at the code that encodes the training data:


import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import train_test_split

columns_to_hash = ['port_cluster', 'org', 'asn', 'protocol', 'event_type', 'os', 'country_name', 'city_name', 'class']

attribute_weights = {
    'port_cluster': 0.7,
    "protocol": 0.4,
    "city_name": 0.4,
    'org': 0.5,
    'asn': 0.5,
    'event_type': 0.6,
    'os': 0.3,
    'country_name': 0.4,
    'class': 0.3,
}

# Create a dictionary to store the hashed features
hashed_feature_dict = {}

# Iterate over unique src_ips
unique_src_ips = df['src_ip'].unique()
for src_ip in unique_src_ips:
    src_ip_data = df[df['src_ip'] == src_ip]
    
    # Initialize the FeatureHasher for the current src_ip
    hasher = FeatureHasher(n_features=20, input_type='string')
    
    src_ip_hashed_feature_dict = {}
    
    # Iterate over columns to hash and store hashed features
    # Hash each column and apply its attribute weight
    for column in columns_to_hash:
        hashed_features = hasher.fit_transform(src_ip_data[column].astype(str).values.reshape(-1, 1))
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        src_ip_hashed_feature_dict[column] = weighted_hashed_features

    hashed_features_array = np.concatenate([src_ip_hashed_feature_dict[column] for column in columns_to_hash], axis=1)
    hashed_feature_dict[src_ip] = hashed_features_array

# Concatenate all hashed features
all_hashed_features = np.concatenate(list(hashed_feature_dict.values()), axis=0)

# Split the data into training and testing sets
X = all_hashed_features
y = df['cluster_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
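The classifier that later appears as model.predict is not shown in the question. As a stand-in, here is a minimal sketch with synthetic data and a RandomForestClassifier (both assumptions, not from the original post), just to make the train/predict flow concrete:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hashed feature matrix: 9 hashed columns
# at n_features=20 each gives 9 * 20 = 180 columns per row.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 180))
y = rng.integers(0, 3, size=100)  # made-up cluster labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical choice of classifier; any estimator with fit/predict would do
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(X_test.shape, model.predict(X_test).shape)  # (20, 180) (20,)
```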

Prediction code for the new data point:

# Create a dictionary to store the hashed features for the new data point
new_data = {
    'src_ip': '65.21.234.90',
    'asn': 24940,
    'country_name': 'Finland',
    'city_name': 'Helsinki',
    'open_ports': ['5060/sip', '2000/ikettle'],
    'protocol': 'UDP',
    'ip_rep': None,
    'first_seen': '2023-08-22T16:56:45.733Z',
    'last_time': '2023-08-22T16:56:45.733Z',
    'class': 'A',
    'event_type': 'sip',
    'event_data': {
        'request_line': 'OPTIONS sip:[email protected] SIP/2.0',
        'uri': 'sip:[email protected]',
        'version': 'SIP/2.0',
        'method': 'OPTIONS'
    },
    'link': None,
    'os': 'None',
    'org': 'Hetzner Online GmbH',
    'port_cluster': -1
}

# The hasher left over from the training loop (n_features=20) is reused here;
# the line below with n_features=11 would produce a different feature width
# hasher = FeatureHasher(n_features=11, input_type='string')

new_data_hashed_feature_dict = {}

# Iterate over columns to hash and store hashed features for the new data point
for column in columns_to_hash:
    if column in new_data:
        # Convert a single string to a list containing that string
        feature_value = [str(new_data[column])]
        hashed_features = hasher.transform([feature_value])
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        new_data_hashed_feature_dict[column] = weighted_hashed_features

# Concatenate all hashed features for the new data point
new_data_features = np.concatenate([new_data_hashed_feature_dict[column] for column in columns_to_hash], axis=1)

# Predict the cluster label for the new data point
predicted_label = model.predict(new_data_features)

print(f"Predicted Cluster Label: {predicted_label[0]}")

python pandas machine-learning scikit-learn hash
1 Answer

0 votes

In your training-data encoding you create a new FeatureHasher for every unique src_ip, while in the prediction code you reuse one generic hasher object. I believe the FeatureHasher class does not learn anything from the data (the hashing trick is stateless), so different instances with the same settings hash identically; even so, you should use the exact same hashing setup (same n_features and input_type) for training and prediction to guarantee consistency.
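That statelessness is easy to verify: two independently constructed hashers with identical settings produce identical outputs, and fit_transform learns nothing from the data. A small demonstration (sample values are made up):

```python
from sklearn.feature_extraction import FeatureHasher

# Two independent hashers with the same n_features hash identically,
# because the hashing trick is stateless; fit() is a no-op.
h1 = FeatureHasher(n_features=20, input_type='string')
h2 = FeatureHasher(n_features=20, input_type='string')

samples = [['UDP'], ['sip']]  # two samples, one string feature each
a = h1.fit_transform(samples).toarray()
b = h2.transform(samples).toarray()
print((a == b).all())  # True
```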

Also, in your training code you concatenate the hashed features of every column for each src_ip, whereas in your prediction code you hash and concatenate the features of a single data point. If training and prediction end up with different numbers of hashed feature columns, the model will fail.
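A quick way to catch such a mismatch before calling model.predict is to check the width of the concatenated row: with the hashing trick it must equal len(columns_to_hash) * n_features. A minimal illustration (shortened column list and made-up values):

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Shortened column list and made-up values, just to illustrate the check
columns_to_hash = ['port_cluster', 'org', 'asn']
values = ['-1', 'Hetzner Online GmbH', '24940']

hasher = FeatureHasher(n_features=20, input_type='string')

# Hash one value per column and concatenate, as the prediction code does
parts = [hasher.transform([[v]]).toarray() for v in values]
row = np.concatenate(parts, axis=1)

# The row width must equal len(columns_to_hash) * n_features,
# i.e. 3 * 20 = 60 here, or the model will reject the input
print(row.shape)  # (1, 60)
```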

One more thing: you iterate over the columns_to_hash list, but you need to make sure that every key listed in columns_to_hash actually exists in the new_data dictionary. Because your code silently skips missing keys, the input can end up with a different shape from the one the model was trained on; training and prediction must feed the model identically shaped data:

n_features = 20  # must match the n_features used when hashing the training data

new_data_hashed_feature_dict = {}

# Here I iterate over the columns to hash and store hashed features for the new data point
for column in columns_to_hash:
    if column in new_data:
        feature_value = [str(new_data[column])]  # wrap the single string in a list
        hashed_features = hasher.transform([feature_value])
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        new_data_hashed_feature_dict[column] = weighted_hashed_features
    else:
        print(f"Warning: Missing key {column} in new data")
        # Pad with zeros so the feature vector keeps its training-time width
        new_data_hashed_feature_dict[column] = np.zeros((1, n_features))

# Then I concatenated all hashed features for the new data point
new_data_features = np.concatenate([new_data_hashed_feature_dict[column] for column in columns_to_hash], axis=1)

# Here you will need to predict the cluster label for the new data point
predicted_label = model.predict(new_data_features)

print(f"Predicted Cluster Label: {predicted_label[0]}")