Are there any clustering evaluation methods implemented in PyTorch?

Question  Votes: 0  Answers: 2

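I found that a GPU implementation of k-means in PyTorch is 30 times faster than on the CPU. Are there any evaluation methods (e.g. Silhouette score, Dunn index) implemented in PyTorch, preferably using the GPU?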

pytorch gpu cluster-analysis evaluation silhouette
2 Answers
3
votes

I implemented the silhouette score in PyTorch, based on Alexandre Abraham's numpy implementation (https://gist.github.com/AlexandreAbraham/5544803):

https://github.com/maxschelski/pytorch-cluster-metrics

After installing the package, you can calculate the silhouette score as follows:

from torchclustermetrics import silhouette
score = silhouette.score(X, labels)

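X is the multi-dimensional data (a NumPy array or PyTorch tensor, with samples along the first dimension), and labels is a 1-D array holding the label of each sample.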
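To make the expected input shapes concrete, here is a minimal sketch with made-up toy data. The values, the `device` variable, and the assumption that the inputs should sit on the GPU when one is available are illustrative only; the package call itself is left commented out because it requires the package to be installed.

```python
import torch

# Hypothetical toy data: 6 samples with 2 features each, samples along the first dimension.
device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.tensor(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]],
    device=device,
)

# One integer cluster label per sample, e.g. the assignments produced by k-means.
labels = torch.tensor([0, 0, 0, 1, 1, 1], device=device)

# With the package installed, the call from the answer would then be:
# from torchclustermetrics import silhouette
# score = silhouette.score(X, labels)
```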

I tested the code with PyTorch 1.10.1 (cuda11.3_cudnn8_0).

In my hands, it runs roughly 30 times faster on the GPU than the scikit-learn implementation.


0
votes

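After looking around, I didn't want to install any extra packages, so I wrote this short, optimized, and simple implementation. Hope it helps!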

import torch
from torch import Tensor


def silhouette_scores(X: Tensor, labels: Tensor) -> Tensor:
    """Silhouette score implemented in PyTorch for GPU acceleration

    .. seealso::

        `Wikipedia's entry on Silhouette score <https://en.wikipedia.org/wiki/Silhouette_(clustering)>`_

    Parameters
    ----------
    X : Tensor
        Data points of shape (n_samples, n_features)
    labels : Tensor
        Predicted label for each sample, shape (n_samples)

    Returns
    -------
    Tensor
        Silhouette score for each sample, shape (n_samples). Same device and dtype as X
    """
    unique_labels = torch.unique(labels)

    intra_dist = torch.zeros(labels.size(), dtype=X.dtype, device=X.device)
    for label in unique_labels:
        where = torch.where(labels == label)[0]
        distances = torch.cdist(X[where], X[where])
        intra_dist[where] = distances.sum(dim=1) / (distances.shape[0] - 1)

    inter_dist = torch.full(labels.size(), torch.inf, dtype=X.dtype, device=X.device)

    for label_a, label_b in torch.combinations(unique_labels, 2):
        where_a = torch.where(labels == label_a)[0]
        where_b = torch.where(labels == label_b)[0]

        dist = torch.cdist(X[where_a], X[where_b])
        dist_a = dist.mean(dim=1)
        dist_b = dist.mean(dim=0)

        inter_dist[where_a] = torch.minimum(dist_a, inter_dist[where_a])
        inter_dist[where_b] = torch.minimum(dist_b, inter_dist[where_b])

    sil_samples = (inter_dist - intra_dist) / torch.maximum(intra_dist, inter_dist)

    return sil_samples.nan_to_num()
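The function above returns one score per sample, whereas scikit-learn's `silhouette_score` reports the mean over all samples. As a rough alternative sketch (not the code from either answer), the same quantities can be computed without Python loops from a single dense distance matrix. The name `mean_silhouette` is hypothetical, and the sketch assumes floating-point `X`, 1-D `labels` on the same device, at least two clusters, and enough memory for an (n_samples, n_samples) matrix.

```python
import torch
import torch.nn.functional as F
from torch import Tensor


def mean_silhouette(X: Tensor, labels: Tensor) -> Tensor:
    """Mean silhouette coefficient from one dense (n, n) distance matrix."""
    # Remap arbitrary label values to 0..k-1 and build an (n, k) membership matrix.
    _, inverse = torch.unique(labels, return_inverse=True)
    onehot = F.one_hot(inverse).to(X.dtype)
    counts = onehot.sum(dim=0)  # cluster sizes, shape (k,)

    # sums[i, c] = total distance from sample i to every member of cluster c.
    sums = torch.cdist(X, X) @ onehot

    # a(i): mean distance to the other members of the sample's own cluster.
    own_size = counts[inverse]
    a = sums.gather(1, inverse.unsqueeze(1)).squeeze(1) / (own_size - 1)

    # b(i): smallest mean distance to any other cluster (own cluster masked out).
    means = (sums / counts).masked_fill(onehot.bool(), float("inf"))
    b = means.min(dim=1).values

    s = (b - a) / torch.maximum(a, b)
    # Singleton clusters give 0/0 = NaN for a(i); map those scores to 0.
    return s.nan_to_num().mean()
```

Called as `mean_silhouette(X, labels)`, it returns a 0-dimensional tensor on the same device as `X`. The trade-off is memory: the loop-based version only ever holds distances between two clusters at a time, while this one materializes all pairwise distances at once.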
