I have a pandas DataFrame defined as
Alejandro Ana Beatriz Jose Juan Luz Maria Ruben
Alejandro 0.0 0.0 1000.0 0.0 1037.0 1014.0 100.0 0.0
Ana 0.0 0.0 15.0 0.0 100.0 0.0 16.0 1100.0
Beatriz 1000.0 15.0 0.0 100.0 1000.0 1100.0 15.0 0.0
Jose 0.0 0.0 100.0 0.0 0.0 100.0 1000.0 14.0
Juan 1037.0 100.0 1000.0 0.0 0.0 1014.0 0.0 100.0
Luz 1014.0 0.0 1100.0 100.0 1014.0 0.0 0.0 0.0
Maria 100.0 16.0 15.0 1000.0 0.0 0.0 0.0 0.0
Ruben 0.0 1100.0 0.0 14.0 100.0 0.0 0.0 0.0
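This DataFrame contains compatibility measurements. I want to split these people into groups of two or three, for example [Alejandro, Ana, Beatriz], [Jose, Juan, Luz], [Maria, Ruben].
To do this, I need to maximise the compatibility within each group.
Can anyone point me to a method?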
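It looks like you are starting from a distance matrix rather than the actual sample values. AgglomerativeClustering can use a distance matrix to group the samples into however many clusters you specify.
In the code below, I run AgglomerativeClustering on the data to get the cluster assigned to each name. Then, for visualisation, I represent the original distance matrix in 2D and colour the points by their assigned cluster.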
Data:
import pandas as pd
import matplotlib.pyplot as plt
#Data: distance matrix
data = {
'Alejandro': [0.0, 0.0, 1000.0, 0.0, 1037.0, 1014.0, 100.0, 0.0],
'Ana': [0.0, 0.0, 15.0, 0.0, 100.0, 0.0, 16.0, 1100.0],
'Beatriz': [1000.0, 15.0, 0.0, 100.0, 1000.0, 1100.0, 15.0, 0.0],
'Jose': [0.0, 0.0, 100.0, 0.0, 0.0, 100.0, 1000.0, 14.0],
'Juan': [1037.0, 100.0, 1000.0, 0.0, 0.0, 1014.0, 0.0, 100.0],
'Luz': [1014.0, 0.0, 1100.0, 100.0, 1014.0, 0.0, 0.0, 0.0],
'Maria': [100.0, 16.0, 15.0, 1000.0, 0.0, 0.0, 0.0, 0.0],
'Ruben': [0.0, 1100.0, 0.0, 14.0, 100.0, 0.0, 0.0, 0.0]
}
df = pd.DataFrame(
data,
index=['Alejandro', 'Ana', 'Beatriz', 'Jose', 'Juan', 'Luz', 'Maria', 'Ruben']
)
df
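One caveat: the question describes the values as compatibility scores, where a larger number presumably means a better match, whereas a precomputed metric is read as a dissimilarity, where a larger number means further apart. If that reading is right, the matrix could be flipped before clustering. Below is a minimal sketch of one such conversion, using a small hypothetical matrix (the names `compat` and `dist` and the "max minus value" rule are assumptions, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 compatibility matrix (higher = more compatible).
names = ['A', 'B', 'C']
compat = pd.DataFrame(
    [[0.0, 1000.0, 15.0],
     [1000.0, 0.0, 100.0],
     [15.0, 100.0, 0.0]],
    index=names, columns=names,
)

# Flip the scale so the most compatible pair gets the smallest distance.
dist_values = compat.to_numpy().max() - compat.to_numpy()
# A sample is at distance 0 from itself.
np.fill_diagonal(dist_values, 0.0)
dist = pd.DataFrame(dist_values, index=names, columns=names)
```

With this conversion, `dist` rather than the raw compatibility matrix would be what gets passed to the clustering step. Other monotone transformations would work too; which one suits the data is a judgment call.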
Perform the clustering:
#
#Cluster based on distances
#
from sklearn.cluster import AgglomerativeClustering
n_clusters = 3
clusterer = AgglomerativeClustering(
n_clusters=n_clusters, metric='precomputed', linkage='average'
)
label_foreach_name = clusterer.fit(df).labels_
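# View results as a table.
# Labels follow the row order of df, so take the names from df.index.
# display() is a Jupyter/IPython helper; use print() in a plain script.
cluster_info = pd.DataFrame(
    {'name': df.index,
     'cluster': label_foreach_name}
).set_index('cluster').sort_index()
display(cluster_info)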
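Note that AgglomerativeClustering does not constrain how many names land in each cluster, so it may not return groups of exactly two or three. For a handful of people, an exhaustive search over fixed group sizes is a feasible alternative. The sketch below is a different technique from the clustering above, not part of the original answer: it assumes the matrix holds compatibility scores (higher is better), that the group sizes add up to the number of people, and the helper names `within_group_score` and `best_grouping` are hypothetical.

```python
from itertools import combinations

import pandas as pd


def within_group_score(compat, group):
    """Sum of pairwise compatibility between the members of one group."""
    return sum(compat.loc[a, b] for a, b in combinations(group, 2))


def best_grouping(compat, sizes):
    """Try every split into groups of the given sizes; return (score, groups)."""
    def search(remaining, sizes_left):
        if not sizes_left:
            return 0.0, []
        best = None
        # Pick the next group, then recurse on the people left over.
        # Equal-sized groups are visited in both orders; harmless for small inputs.
        for group in combinations(remaining, sizes_left[0]):
            rest = [n for n in remaining if n not in group]
            score, groups = search(rest, sizes_left[1:])
            score += within_group_score(compat, group)
            if best is None or score > best[0]:
                best = (score, [list(group)] + groups)
        return best

    return search(list(compat.index), list(sizes))


# Hypothetical 4-person example (higher = more compatible).
names = ['A', 'B', 'C', 'D']
compat = pd.DataFrame(
    [[0, 9, 1, 1],
     [9, 0, 1, 1],
     [1, 1, 0, 8],
     [1, 1, 8, 0]],
    index=names, columns=names,
)
score, groups = best_grouping(compat, sizes=(2, 2))
print(score, groups)
```

For the eight people in the question, `best_grouping(df, sizes=(3, 3, 2))` would walk through 560 candidate splits, which is cheap. The search grows very quickly with the number of people, so this only suits small inputs.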
Visualise in 2D coordinate space:
#
# Visualisation
# Represent in 2D, coloured by cluster
#
#Invert distance matrix to 2D coordinate space
from sklearn.manifold import MDS
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords_2d = mds.fit_transform(df)
#Plot points, coloured by assigned cluster
import matplotlib.pyplot as plt
cluster_cmap = plt.get_cmap('Set1', n_clusters)
f, ax = plt.subplots(figsize=(4, 4))
ax.scatter(coords_2d[:, 0], coords_2d[:, 1],
c=cluster_cmap(label_foreach_name),
marker='s', s=100)
ax.set_xlabel('x')
ax.set_ylabel('y')
f.suptitle('Inversion of the distance matrix to 2D', fontsize=11)
ax.set_title('Samples coloured by cluster', fontsize=10)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# Add text annotations (rows of coords_2d follow the row order of df)
for row_num, name in enumerate(df.index):
    ax.annotate(name, xy=coords_2d[row_num], xytext=(10, -5), textcoords='offset pixels')
# Add legend: empty scatter calls create one legend entry per cluster
for colour, label in zip(cluster_cmap(range(n_clusters)),
                         [f'cluster {i}' for i in range(n_clusters)]):
    ax.scatter([], [], marker='s', s=50, c=colour, label=label)
f.legend(bbox_to_anchor=(1.3, 1))