I would like to know whether my code works correctly.
The DataFrame df2 is a vertically stacked time series of stock features.
stock_id | log_target_vol_corr_32_clusters_stnd |
---|---|
1 | 0.4 |
1 | 0.8 |
1 | 0.7 |
2 | 0.3 |
2 | 0.4 |
2 | 0.0 |
3 | 0.4 |
3 | 0.8 |
3 | 0.7 |
4 | 0.9 |
4 | 0.9 |
4 | 0.1 |
5 | 0.9 |
5 | 0.9 |
5 | 0.1 |
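Note that stocks (1 and 3) and (4 and 5) have identical feature values, so I want to group each pair into one cluster. Ultimately, I want to find all the stock IDs that belong to each cluster.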
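For anyone who wants to reproduce this, the sample table above can be built roughly as follows. This is only a sketch: the values are copied from the table, and the real df2 presumably has more rows and columns.

```python
import pandas as pd

column = 'log_target_vol_corr_32_clusters_stnd'

# Sample data copied from the table above: three observations per stock.
df2 = pd.DataFrame({
    'stock_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    column: [0.4, 0.8, 0.7, 0.3, 0.4, 0.0, 0.4, 0.8, 0.7,
             0.9, 0.9, 0.1, 0.9, 0.9, 0.1],
})

# Desired grouping: stocks 1 and 3 together, 4 and 5 together, 2 on its own.
for stock_id, values in df2.groupby('stock_id')[column]:
    print(stock_id, values.tolist())
```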
## find stock ids of clusters having same feature values
column = 'log_target_vol_corr_32_clusters_stnd'
remaining_stocks = df2['stock_id'].unique().astype(int)
clusters = {}
for s in remaining_stocks:
    print(s)
    clusters[s] = []
    a1 = df2[df2['stock_id'] == s ][column]
    remaining_stocks = np.delete(remaining_stocks,np.where(remaining_stocks==s))
    for s1 in remaining_stocks:
        a2 = df2[df2['stock_id'] == s1 ][column]
        if np.array_equal(a1,a2):
            print(s1)
            remaining_stocks = np.delete(remaining_stocks,np.where(remaining_stocks==s1))
            clusters[s].append(s1)
    print(remaining_stocks)
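Could you explain what is wrong with this code?
I wrote the code above, and it seems to produce more clusters than there actually are in the DataFrame.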
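The problem is that you are modifying the data while iterating over it! The outer loop still visits every stock in the original array, including stocks that were already assigned to a cluster, so each of those gets an extra entry of its own in clusters.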
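A minimal illustration of the effect, using made-up ids and demo variable names: np.delete returns a new array, so reassigning the name inside the loop does not change the array the for loop is already walking.

```python
import numpy as np

remaining = np.array([1, 2, 3, 4, 5])
visited = []

for s in remaining:
    visited.append(int(s))
    # np.delete returns a copy; the loop keeps iterating over the original array.
    remaining = np.delete(remaining, np.where(remaining == 3)[0])

print(visited)    # [1, 2, 3, 4, 5] (3 is still visited)
print(remaining)  # [1 2 4 5]
```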
Try this:
import pandas as pd
import numpy as np

column = 'log_target_vol_corr_32_clusters_stnd'

# Hash each stock's sequence of feature values (as a hashable tuple)
# and broadcast that hash to every row of the stock
df2['features_hash'] = df2.groupby('stock_id')[column].transform(lambda x: hash(tuple(x)))

# Now, group by this new hash and list stock_ids in each group
clustered_stocks = df2.groupby('features_hash')['stock_id'].unique()

# Convert the grouped object into a dictionary for easier handling
clusters = clustered_stocks.to_dict()

# If you need to, invert the dictionary so that stock_id is the key and the cluster identifier is the value
# This step might need adjustments based on how you want to use the clusters
clusters_by_stock_id = {}
for cluster_hash, stocks in clusters.items():
    for stock in stocks:
        clusters_by_stock_id[stock] = cluster_hash
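One caveat with hashing is that two different sequences could, in principle, share a hash value. A variant that sidesteps this is to use the tuple of values itself as the dictionary key. The sketch below uses the sample data from the question and assumes each stock's rows are in the same time order; the name `stock_clusters` is only illustrative.

```python
import pandas as pd

column = 'log_target_vol_corr_32_clusters_stnd'
df2 = pd.DataFrame({
    'stock_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    column: [0.4, 0.8, 0.7, 0.3, 0.4, 0.0, 0.4, 0.8, 0.7,
             0.9, 0.9, 0.1, 0.9, 0.9, 0.1],
})

# Map each distinct sequence of feature values to the stocks that share it.
stock_clusters = {}
for stock_id, values in df2.groupby('stock_id')[column]:
    stock_clusters.setdefault(tuple(values), []).append(int(stock_id))

print(list(stock_clusters.values()))  # [[1, 3], [2], [4, 5]]
```

Note that this compares floats for exact equality. If the feature values come out of a computation, rounding them first (for example with `values.round(6)`) may be necessary.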