Using np.array_equal() in nested loops to identify stocks with identical feature values


I want to check whether my code is working correctly.

The dataframe df2 is a vertically stacked time series of stock features.

stock_id log_target_vol_corr_32_clusters_stnd
1 0.4
1 0.8
1 0.7
2 0.3
2 0.4
2 0.0
3 0.4
3 0.8
3 0.7
4 0.9
4 0.9
4 0.1
5 0.9
5 0.9
5 0.1

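Note that stocks (1 and 3) and (4 and 5) have identical feature values, so I want to group each pair into one cluster. Ultimately, I want to find all the stock IDs that belong to each cluster.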

## find stock ids of clusters having same feature values
column = 'log_target_vol_corr_32_clusters_stnd'
remaining_stocks = df2['stock_id'].unique().astype(int)
clusters = {}
for s in remaining_stocks:
    print(s)
    clusters[s] = []
    a1 = df2[df2['stock_id'] == s ][column]
    remaining_stocks = np.delete(remaining_stocks,np.where(remaining_stocks==s))
    for s1 in remaining_stocks:
        a2 = df2[df2['stock_id'] == s1 ][column]
        if np.array_equal(a1,a2):
            print(s1)
            remaining_stocks = np.delete(remaining_stocks,np.where(remaining_stocks==s1))
            clusters[s].append(s1)
            print(remaining_stocks)

Can you explain what is wrong with this code?

I wrote the code above, and it seems to produce more clusters than the dataframe actually contains.

python nested-loops feature-clustering
1 Answer

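The problem is that you modify the data while iterating over it! np.delete returns a new array, so reassigning remaining_stocks inside the loop does not change what the outer for loop iterates over: it still walks the original array. Stocks that were already placed in a cluster (3 and 5 in your example) are therefore visited again and each gets its own empty entry in clusters.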
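To make that concrete, here is a minimal sketch of the nested loop with the bookkeeping fixed. It tracks assigned stocks in a set instead of reassigning the array being iterated. The sample frame is rebuilt from the table in the question; the names `find_clusters`, `seen`, and `values` are illustrative and not from the original code.

```python
import numpy as np
import pandas as pd

column = 'log_target_vol_corr_32_clusters_stnd'

# Sample data rebuilt from the table in the question.
df2 = pd.DataFrame({
    'stock_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    column: [0.4, 0.8, 0.7, 0.3, 0.4, 0.0, 0.4, 0.8, 0.7,
             0.9, 0.9, 0.1, 0.9, 0.9, 0.1],
})


def find_clusters(df, column):
    """Map each cluster's first stock id to the other ids with identical values."""
    stock_ids = [int(s) for s in df['stock_id'].unique()]
    # Extract each stock's values once instead of re-filtering inside the loops.
    values = {s: df.loc[df['stock_id'] == s, column].to_numpy() for s in stock_ids}
    seen = set()   # stocks already assigned to some cluster
    clusters = {}
    for i, s in enumerate(stock_ids):
        if s in seen:
            continue  # this is the check the original loop was missing
        seen.add(s)
        clusters[s] = []
        for s1 in stock_ids[i + 1:]:
            if s1 not in seen and np.array_equal(values[s], values[s1]):
                clusters[s].append(s1)
                seen.add(s1)
    return clusters


clusters = find_clusters(df2, column)
print(clusters)  # {1: [3], 2: [], 4: [5]}
```

With the sample data this yields three clusters, keyed by the first stock in each, rather than one entry per stock.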

A groupby-based approach avoids the nested loops entirely. Try this:

import pandas as pd
import numpy as np

# Same feature column as in the question
column = 'log_target_vol_corr_32_clusters_stnd'

# Hash each stock's sequence of feature values (as a tuple) and write that hash on every row of the stock.
# Identical sequences always get the same hash; distinct sequences could in principle collide.
df2['features_hash'] = df2.groupby('stock_id')[column].transform(lambda x: hash(tuple(x)))

# Now, group by this new hash and list the stock_ids in each group
clustered_stocks = df2.groupby('features_hash')['stock_id'].unique()

# Convert the grouped object into a dictionary: hash -> array of stock_ids
clusters = clustered_stocks.to_dict()

# If you need to, invert the dictionary so that stock_id is the key and the cluster identifier is the value
# This step might need adjustments based on how you want to use the clusters
clusters_by_stock_id = {}
for cluster_hash, stocks in clusters.items():
    for stock in stocks:
        clusters_by_stock_id[stock] = cluster_hash
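If the possibility of hash collisions is a concern, one variation is to key a dictionary on the tuple of values itself rather than on its hash. The following is a sketch under that assumption, not part of the code above; the sample frame is rebuilt from the table in the question, and `groups` and `exact_clusters` are illustrative names.

```python
from collections import defaultdict

import pandas as pd

column = 'log_target_vol_corr_32_clusters_stnd'

# Sample data rebuilt from the table in the question.
df2 = pd.DataFrame({
    'stock_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    column: [0.4, 0.8, 0.7, 0.3, 0.4, 0.0, 0.4, 0.8, 0.7,
             0.9, 0.9, 0.1, 0.9, 0.9, 0.1],
})

# Key each stock by the full tuple of its feature values.
# Dictionary lookup compares tuples element by element, so two stocks
# share a key only when their sequences are exactly equal.
groups = defaultdict(list)
for stock_id, values in df2.groupby('stock_id')[column]:
    groups[tuple(values)].append(int(stock_id))

exact_clusters = list(groups.values())
print(exact_clusters)  # [[1, 3], [2], [4, 5]]
```

Like np.array_equal, this relies on exact floating-point equality, so values that differ only by rounding error end up in different clusters.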
