I have 9,000 US points (i.e. accounts) with a variety of different string and numeric columns/attributes. I am trying to divide these points/accounts evenly into fair groupings that are both spatially grouped and weighted (in a gravitational sense) by employee count, which is one of the columns/attributes. I used sklearn K-means clustering for the grouping and it seems to work fine, but I noticed that the groupings are not equal. Some groups have ~600 accounts and some have ~70. This is somewhat logical, since some areas simply have more data. The problem is that I need these groups to be more equal. Here is the code I used:
kmeans = KMeans(n_clusters=30, max_iter=1000, init='k-means++')
lat_long = dftobeclustered[dftobeclustered.columns[1:3]]
_employees = dftobeclustered[dftobeclustered.columns[3]]
weighted_kmeans_clusters = kmeans.fit(lat_long, sample_weight=_employees)
dftobeclustered['cluster_label'] = kmeans.predict(lat_long, sample_weight=_employees)
centers = kmeans.cluster_centers_
labels = dftobeclustered['cluster_label']
Is it possible to divide the K-means clusters in a more equal way? I think the core problem is that it splits sparsely populated areas like Montana or Hawaii into their own groups, when what I really need is for those areas to be merged into larger groups. But I don't know.
K-means is not written to work that way. Observations are assigned to clusters based on their actual measured distance from the centroids.
If you try to force the number of members in a cluster, it completely undoes that distance-measurement component, especially when you are talking geographically with latitude and longitude.
You may need to consider another method for subsetting your observations, or reconsider whether the clusters really need to be of equivalent size.
Honestly, most of the time clustering by geographic distance correlates directly with how similar the observations are in other ways (think of the housing styles, demographics, or income of a neighborhood and how those translate into a zip code, or the types of trees in a local area). These kinds of things do not respect our need for them to form equal-size groups.
Clustering on qualities other than geography is more likely to even out large differences in group size than clustering on raw latitude/longitude, because with lat/lon the observations will simply be sorted by distance... there is no way around that.
So areas with dense observations will end up with more members than areas with sparse observations. And MT and HI will never be clustered together geographically, because the distance between MT and HI is always greater than the distance between MT and NYC.
I understand that you want equal groupings... but is grouping by geography necessary? Given that MT and HI would end up together anyway, a geographic label would not mean much. It would be better to cluster on all of the non-geographic numeric values, to create groups of contextually similar observations.
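A minimal sketch of that idea, assuming the non-geographic numeric attributes live in columns such as the hypothetical 'employees', 'annual_revenue', and 'locations' (substitute whatever the accounts actually have):
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# hypothetical non-geographic numeric columns -- substitute your own attribute names
feature_cols = ['employees', 'annual_revenue', 'locations']
# scale the features so no single attribute dominates the distance, then cluster on them
features = StandardScaler().fit_transform(dftobeclustered[feature_cols])
kmeans = KMeans(n_clusters=30, init='k-means++', n_init=10)
dftobeclustered['context_cluster'] = kmeans.fit_predict(features)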
Otherwise, you can dissect the observations with business rules (I mean something like if var_x > 7 & var_y < 227 & .... then label = 1) and make some groups yourself. You can use groupby() and describe() in pandas to build cross-tabs and see which values might make good split points, as in the sketch below.
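A minimal sketch of that, using hypothetical column names such as 'state' and 'employees' on the questioner's dataframe:
import pandas as pd
# per-group summary statistics: the quartiles suggest where business-rule cut points might sit
print(dftobeclustered.groupby('state')['employees'].describe())
# a quick cross-tab of counts across two candidate splits
print(pd.crosstab(dftobeclustered['state'], dftobeclustered['employees'] > 100))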
Try DBSCAN. See my example code below.
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
# load the data set
df = pd.read_csv('C:\\your_path\\summer-travel-gps-full.csv', encoding = "ISO-8859-1")
df.head()
# how many rows are in this data set?
len(df)
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
# represent points consistently as (lat, lon)
df_coords = df[['lat', 'lon']]
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    size = 150
    if cluster_label == -1:  # make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    # plot only the points that match the current cluster label
    cluster_points = df_coords[cluster_labels == cluster_label]
    ax.scatter(x=cluster_points['lon'], y=cluster_points['lat'], c=[color], edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
# Result:
Silhouette coefficient: 0.854
Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels))  # - (1 if -1 in cluster_labels else 0)
print('Number of clusters: {}'.format(num_clusters))
Result:
Number of clusters: 138
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
Result:
0 lat lon
1587 37.921659 22...
1 lat lon
1658 37.933609 23...
2 lat lon
1607 37.966766 23...
3 lat lon
1586 38.149019 22...
4 lat lon
1584 38.374766 21...
133 lat lon
662 50.37369 18.889205
134 lat lon
561 50.448704 19.0...
135 lat lon
661 50.462271 19.0...
136 lat lon
559 50.489304 19.0...
137 lat lon
1 51.474005 -0.450999
Data source:
https://github.com/gboeing/2014-summer-travels/tree/master/data
Related resource:
https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
You can also add a "frequency penalty" to the distance during cluster assignment. This is described in "Frequency Sensitive Competitive Learning for Balanced Clustering on High-dimensional Hyperspheres" - Arindam Banerjee and Joydeep Ghosh - IEEE Transactions on Neural Networks:
http://www.ideal.ece.utexas.edu/papers/arindam04tnn.pdf
They also have an online/streaming version.
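As a rough, minimal sketch of the frequency-penalty idea (not the paper's exact formulation; the multiplicative penalty, the function name, and the parameters below are my own illustrative choices), you can scale each centroid's distance by how many points it has already absorbed, so crowded clusters become progressively less attractive:
import numpy as np

def frequency_penalized_kmeans(X, n_clusters, max_iter=100, seed=0):
    # sketch of frequency-sensitive assignment: points are assigned one at a time
    # to the centroid minimizing distance * current_cluster_size, which pushes
    # assignments toward under-filled clusters
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        counts = np.ones(n_clusters)        # running cluster sizes (start at 1 so the penalty is never zero)
        for i in rng.permutation(len(X)):    # stream the points in random order
            dists = np.linalg.norm(centers - X[i], axis=1)
            labels[i] = np.argmin(dists * counts)   # frequency penalty: busy clusters look farther away
            counts[labels[i]] += 1
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(n_clusters)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers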
You can use linear_sum_assignment() from scipy; it works up to roughly 10k points. The original idea comes from this related Stack Overflow answer. The trick is to repeat each centroid enough times that there is one centroid slot per point and then solve an N x N assignment problem, so every cluster can receive only its equal share of the points; that N x N cost matrix is also what limits how far the approach scales.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
def get_balanced_clusters(X, n_clusters, max_iter=100, tol=1e-4):
    # start from randomly chosen points as the initial centroids
    centers = X[np.random.choice(X.shape[0], n_clusters, replace=False)]
    for i in range(max_iter):
        # repeat the centroids so there is one centroid "slot" per point, then solve
        # the assignment problem: each point gets exactly one slot, which caps every
        # cluster at its (near-)equal share of the points
        centers_repeated = centers[np.arange(len(X)) % n_clusters]
        labels = linear_sum_assignment(cdist(X, centers_repeated))[1] % n_clusters
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(n_clusters)])
        if np.all(np.linalg.norm(centers - new_centers, axis=1) < tol):
            break
        print(i)  # progress: iteration counter
        centers = new_centers
    return labels, new_centers
np.random.seed(42)
X, _ = make_blobs(n_samples=500, centers=20)
labels, centers = get_balanced_clusters(X, 10)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab20', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1], c='black', marker='X')
plt.title('10 Balanced Clusters')
plt.show()