Recursively clustering geometries exceeds the cluster size limit


I want each cluster to contain at most 20 items. This is my code, running on PostgreSQL with the PostGIS extension:

WITH RECURSIVE clustered_data AS (-- Step 1: Perform initial clustering
    SELECT pma.*
         , ST_ClusterKMeans(ST_Transform("centerPoint", 24313), 1, 20000) 
             OVER (PARTITION BY "areaID" ORDER BY "ID") AS "nta_cluster_id" -- Initial clustering
         , 1 AS "recursion_depth" -- Initialize recursion depth
    FROM "potential_missed_areas" pma 
    WHERE pma."campaignid" = -1
      AND pma."status" = 'approved'
),cluster_sizes AS (-- Step 2: Calculate cluster sizes
    SELECT "nta_cluster_id"
         , COUNT(*) AS "cluster_size"
    FROM clustered_data
    GROUP BY "nta_cluster_id"
),recursive_clusters AS (-- Base case: Start with clusters from previous CTE
    SELECT cd.*--column list skipped for brevity
         , cd."nta_cluster_id" -- Include nta_cluster_id
         , cs."cluster_size" -- Include cluster size from the previous part
         , cd."recursion_depth" -- Include recursion_depth from base case
    FROM clustered_data cd
    JOIN cluster_sizes cs ON cd."nta_cluster_id" = cs."nta_cluster_id"
    UNION ALL-- Recursive case: Re-cluster if any cluster has more than 20 items
    SELECT rc.* --column list skipped for brevity
         , rc."nta_cluster_id" -- Include nta_cluster_id
         , cs."cluster_size" -- Include cluster size from the previous part
         , rc."recursion_depth" + 1 AS "recursion_depth" -- Increment recursion depth
    FROM recursive_clusters rc
    JOIN cluster_sizes cs ON rc."nta_cluster_id" = cs."nta_cluster_id"
    WHERE cs."cluster_size" > 20 -- Only re-cluster if there are more than 20 items
      AND rc."recursion_depth" < 100 -- Limit recursion depth to avoid infinite loops
),dataset as (-- Final selection
SELECT *
     , ST_ClusterKMeans(ST_Transform("centerPoint", 24313), GREATEST(1, CEIL("cluster_size" / 20.0)::int)) 
         OVER (PARTITION BY "areaID") AS "new_cluster_id" -- Re-cluster based on size
FROM recursive_clusters
WHERE "recursion_depth" <= 100
)SELECT "areaID"
      , new_cluster_id
      , count(*)
 FROM dataset
 GROUP BY 1,2;

Most rows in the result have a count well above 20; in particular, the query returns values such as 900, 100, and 200.

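This discrepancy concerns me, because counts this high suggest a problem in the query's recursive logic. I would like to understand whether the recursive approach behaves as intended or whether it is producing inaccurate results.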

I would also welcome alternative approaches or strategies that achieve the same goal and yield more reliable, accurate counts.
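For comparison, here is a minimal non-recursive sketch of one possible alternative. It is an assumption-laden illustration, not a verified fix. It reuses the table and column names from the query above and introduces two hypothetical names, "area_size" and "cluster_size" (the latter as the output column). The idea is to count the rows per "areaID" first and then ask ST_ClusterKMeans for CEIL(area_size / 20.0) clusters in each area, so the requested number of clusters is the same for every row of a partition. Reading the original query, the recursive branch appears only to re-emit the rows of oversized clusters with a higher "recursion_depth" rather than re-cluster them, and cluster_sizes groups by "nta_cluster_id" alone, which may merge clusters from different areas; both could inflate the final counts.

```sql
-- Sketch only: assumes the same table, columns, and SRID as the query above.
-- "area_size" and the output "cluster_size" are hypothetical names added for illustration.
WITH base AS (
    SELECT pma.*
         , COUNT(*) OVER (PARTITION BY "areaID") AS "area_size" -- rows per area
    FROM "potential_missed_areas" pma
    WHERE pma."campaignid" = -1
      AND pma."status" = 'approved'
), clustered AS (
    SELECT base.*
         , ST_ClusterKMeans(
               ST_Transform("centerPoint", 24313),
               CEIL("area_size" / 20.0)::int -- identical for every row in the partition
           ) OVER (PARTITION BY "areaID") AS "new_cluster_id"
    FROM base
)
SELECT "areaID"
     , "new_cluster_id"
     , COUNT(*) AS "cluster_size"
FROM clustered
GROUP BY "areaID", "new_cluster_id"
ORDER BY "cluster_size" DESC; -- oversized clusters, if any, appear first
```

K-means groups points by distance rather than by count, so this only targets an average of 20 items per cluster; individual clusters can still exceed 20. The final ordering makes any such clusters easy to spot, and they would need a further split (for example, by re-running the same step on just those clusters) to enforce a hard cap.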

sql postgresql cluster-analysis postgis