I want each cluster to contain at most 20 items. This is my code, which uses PostgreSQL with the PostGIS extension:
WITH RECURSIVE clustered_data AS (
    -- Step 1: Perform initial clustering
    SELECT pma.*
         , ST_ClusterKMeans(ST_Transform("centerPoint", 24313), 1, 20000)
               OVER (PARTITION BY "areaID" ORDER BY "ID") AS "nta_cluster_id" -- Initial clustering
         , 1 AS "recursion_depth" -- Initialize recursion depth
    FROM "potential_missed_areas" pma
    WHERE pma."campaignid" = -1
      AND pma."status" = 'approved'
),
cluster_sizes AS (
    -- Step 2: Calculate cluster sizes
    SELECT "nta_cluster_id"
         , COUNT(*) AS "cluster_size"
    FROM clustered_data
    GROUP BY "nta_cluster_id"
),
recursive_clusters AS (
    -- Base case: Start with clusters from previous CTE
    SELECT cd.* -- column list skipped for brevity
         , cd."nta_cluster_id" -- Include nta_cluster_id
         , cs."cluster_size" -- Include cluster size from the previous part
         , cd."recursion_depth" -- Include recursion_depth from base case
    FROM clustered_data cd
    JOIN cluster_sizes cs ON cd."nta_cluster_id" = cs."nta_cluster_id"
    UNION ALL
    -- Recursive case: Re-cluster if any cluster has more than 20 items
    SELECT rc.* -- column list skipped for brevity
         , rc."nta_cluster_id" -- Include nta_cluster_id
         , cs."cluster_size" -- Include cluster size from the previous part
         , rc."recursion_depth" + 1 AS "recursion_depth" -- Increment recursion depth
    FROM recursive_clusters rc
    JOIN cluster_sizes cs ON rc."nta_cluster_id" = cs."nta_cluster_id"
    WHERE cs."cluster_size" > 20 -- Only re-cluster if there are more than 20 items
      AND rc."recursion_depth" < 100 -- Limit recursion depth to avoid infinite loops
),
dataset AS (
    -- Final selection
    SELECT *
         , ST_ClusterKMeans(ST_Transform("centerPoint", 24313), GREATEST(1, CEIL("cluster_size" / 20.0)::int))
               OVER (PARTITION BY "areaID") AS "new_cluster_id" -- Re-cluster based on size
    FROM recursive_clusters
    WHERE "recursion_depth" <= 100
)
SELECT "areaID"
     , new_cluster_id
     , count(*)
FROM dataset
GROUP BY 1, 2;
I observed that most rows have a count well above 20; the query returns values such as 900, 100, and 200.
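This discrepancy worries me, because counts this high suggest a problem with the recursive logic in the query. I would particularly like to understand whether the recursive approach behaves as intended or whether it can produce inaccurate results.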
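To make the question about the recursion concrete, here is a reduced sketch with the same shape as my recursive CTE but without PostGIS. The 30 synthetic items and the names items, sizes, and rec are assumptions for illustration only, not part of my real schema. Because the recursive term re-joins the same rows to a fixed cluster_size and never re-clusters them, I would expect each item to be emitted once per depth level:

```sql
-- Reduced, PostGIS-free sketch of the recursion shape used above.
-- "items" is a synthetic stand-in: 30 rows that all sit in cluster 0.
WITH RECURSIVE items AS (
    SELECT g.id, 0 AS cluster_id
    FROM generate_series(1, 30) AS g(id)
),
sizes AS (
    -- Computed once from the initial clustering; never updated afterwards
    SELECT cluster_id, count(*) AS cluster_size
    FROM items
    GROUP BY cluster_id
),
rec AS (
    -- Base case: every item at depth 1
    SELECT i.id, i.cluster_id, s.cluster_size, 1 AS recursion_depth
    FROM items i
    JOIN sizes s ON i.cluster_id = s.cluster_id
    UNION ALL
    -- Recursive case: same rows again, only the depth changes
    SELECT r.id, r.cluster_id, s.cluster_size, r.recursion_depth + 1
    FROM rec r
    JOIN sizes s ON r.cluster_id = s.cluster_id
    WHERE s.cluster_size > 20
      AND r.recursion_depth < 100
)
SELECT count(*)             AS total_rows
     , count(DISTINCT id)   AS distinct_items
     , max(recursion_depth) AS max_depth
FROM rec;
```

If that reasoning holds, this returns 3000 rows for 30 distinct items with a maximum depth of 100, which would be consistent with the inflated counts in my real query.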
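I would also appreciate suggestions for alternative approaches or strategies that achieve the same goal and produce more reliable and accurate counts.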
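One alternative I am considering, shown only as an untested sketch: instead of k-means, order each area's points along a geohash curve and cut the ordered list into consecutive chunks of 20. This assumes that "centerPoint" can be transformed to EPSG:4326 (ST_GeoHash expects longitude/latitude coordinates) and that roughly compact groups are acceptable in place of true k-means clusters. The names ordered, chunked, and rn are mine and do not appear in the original query:

```sql
-- Sketch only: fixed-size spatial chunks instead of k-means clusters.
WITH ordered AS (
    SELECT pma.*
         , row_number() OVER (
               PARTITION BY "areaID"
               -- Geohash order keeps nearby points mostly adjacent;
               -- "ID" is a tie-breaker so the ordering is deterministic
               ORDER BY ST_GeoHash(ST_Transform("centerPoint", 4326)), "ID"
           ) AS rn
    FROM "potential_missed_areas" pma
    WHERE pma."campaignid" = -1
      AND pma."status" = 'approved'
),
chunked AS (
    SELECT ordered.*
         -- Integer division: rows 1-20 -> 0, rows 21-40 -> 1, and so on
         , (rn - 1) / 20 AS "new_cluster_id"
    FROM ordered
)
SELECT "areaID"
     , "new_cluster_id"
     , count(*) AS cluster_size
FROM chunked
GROUP BY 1, 2
ORDER BY 1, 2;
```

By construction no chunk can exceed 20 rows, but neighbouring points can still land in different chunks where the geohash curve jumps, so I am unsure whether the resulting groups are compact enough for my use case.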