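I have a training dataset, ran K-means on it with K = 4, and got four cluster centers. For a new data point, I want to know not only the predicted cluster but also the point's distance from that cluster's center. Is there an API that computes the Euclidean distance from the center? I can make two API calls if necessary. I am using Scala and could not find an example anywhere.
Since Spark 2.0, Vectors.sqdist (in org.apache.spark.ml.linalg) can be used to compute the squared distance between two vectors.
You can use a UDF to compute each point's distance from its assigned center, like this: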
import org.apache.spark.ml.linalg.{Vectors, Vector}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.functions.udf
// Sample points
val points = Seq(Vectors.dense(1,0), Vectors.dense(2,-3), Vectors.dense(0.5, -1), Vectors.dense(1.5, -1.5))
val df = points.map(Tuple1.apply).toDF("features")
// K-means
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setK(2)
val kmeansModel = kmeans.fit(df)
val predictedDF = kmeansModel.transform(df)
// predictedDF.schema = (features: Vector, prediction: Int)
// Cluster Centers
kmeansModel.clusterCenters foreach println
/*
[1.75,-2.25]
[0.75,-0.5]
*/
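// UDF that computes each point's squared distance from its assigned cluster center
// (the centers are read once here rather than on every UDF call)
val centers = kmeansModel.clusterCenters
val distFromCenter = udf((features: Vector, c: Int) => Vectors.sqdist(features, centers(c)))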
val distancesDF = predictedDF.withColumn("distanceFromCenter", distFromCenter($"features", $"prediction"))
distancesDF.show(false)
/*
+----------+----------+------------------+
|features  |prediction|distanceFromCenter|
+----------+----------+------------------+
|[1.0,0.0] |1         |0.3125            |
|[2.0,-3.0]|0         |0.625             |
|[0.5,-1.0]|1         |0.3125            |
|[1.5,-1.5]|0         |0.625             |
+----------+----------+------------------+
*/
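Note: Vectors.sqdist computes the squared distance between two vectors (no square root is taken), so the distanceFromCenter column above holds squared distances. If you need the Euclidean distance, use Math.sqrt(Vectors.sqdist(...)).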
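To tie this back to the original question (a new data point, its predicted cluster, and its Euclidean distance from that cluster's center), here is a minimal sketch. The object name DistanceToCenter, the helper withDistance, the sample values, and the local SparkSession settings are assumptions made for illustration; only Vectors.sqdist, KMeansModel.clusterCenters, and transform come from the answer above.

```scala
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object DistanceToCenter {
  // Hypothetical helper: adds the predicted cluster and the Euclidean
  // distance from each point to that cluster's center.
  def withDistance(model: KMeansModel, points: DataFrame): DataFrame = {
    val centers = model.clusterCenters // read once, captured by the UDF
    val euclidean = udf((features: Vector, cluster: Int) =>
      math.sqrt(Vectors.sqdist(features, centers(cluster))))
    model.transform(points)
      .withColumn("distanceFromCenter",
        euclidean(col(model.getFeaturesCol), col(model.getPredictionCol)))
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distance-to-center")
      .getOrCreate()
    import spark.implicits._

    // Same sample points as in the answer above
    val training = Seq(
      Vectors.dense(1.0, 0.0), Vectors.dense(2.0, -3.0),
      Vectors.dense(0.5, -1.0), Vectors.dense(1.5, -1.5)
    ).map(Tuple1.apply).toDF("features")

    val model = new KMeans().setFeaturesCol("features").setK(2).fit(training)

    // A new, unseen point (illustrative value)
    val newPoints = Seq(Vectors.dense(1.2, -0.2)).map(Tuple1.apply).toDF("features")
    withDistance(model, newPoints).show(false)

    spark.stop()
  }
}
```

This keeps everything in a single pass: transform supplies the prediction column and the UDF looks up the matching center, so no second API call is needed.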
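The following worked for me:
def EuclideanDistance(xs: Array[Double], ys: Array[Double]): Double = {
  // Pair up the coordinates, square each difference, sum, then take the root
  scala.math.sqrt((xs zip ys).map { case (x, y) => scala.math.pow(y - x, 2.0) }.sum)
}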
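As a usage sketch for a plain-array distance function like this one, the snippet below finds the closest center for a new point without involving Spark at all. The object name NearestCenter and the method nearest are hypothetical, and the two centers are simply the values printed by the first answer.

```scala
object NearestCenter {
  // Euclidean distance between two equal-length arrays.
  def euclideanDistance(xs: Array[Double], ys: Array[Double]): Double = {
    require(xs.length == ys.length, "vectors must have the same dimension")
    math.sqrt(xs.zip(ys).map { case (x, y) => (y - x) * (y - x) }.sum)
  }

  // Returns (index of the closest center, distance to that center).
  def nearest(point: Array[Double], centers: Seq[Array[Double]]): (Int, Double) =
    centers.zipWithIndex
      .map { case (center, i) => (i, euclideanDistance(point, center)) }
      .minBy(_._2)

  def main(args: Array[String]): Unit = {
    // Centers copied from the clusterCenters output in the first answer
    val centers = Seq(Array(1.75, -2.25), Array(0.75, -0.5))
    val (cluster, distance) = nearest(Array(1.0, 0.0), centers)
    println(s"cluster=$cluster distance=$distance")
  }
}
```

For the point [1.0, 0.0] this picks cluster 1, and the distance is the square root of the 0.3125 shown in the table above (about 0.559).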