如何有效地从数据集中删除重叠的圆圈

Question

我有一个包含约 20,000 条记录的数据集，代表人口超过 20,000 的全球城市。我估计了半径，或多或少描述了城市的大小。这并不完全准确，但就我的目的而言，这已经足够了。

我正在将其加载到我的 Panda 数据框对象中。以下是样本

name_city,country_code,latitude,longitude,geohash,estimated_radius,population
Vitry-sur-Seine,FR,48.78716,2.40332,u09tw9qjc3v3,1000,81001
Vincennes,FR,48.8486,2.43769,u09tzkx5dr13,500,45923
Villeneuve-Saint-Georges,FR,48.73219,2.44925,u09tpxrmxdth,500,30881
Villejuif,FR,48.7939,2.35992,u09ttdwmn45z,500,48048
Vigneux-sur-Seine,FR,48.70291,2.41357,u09tnfje022n,500,26692
Versailles,FR,48.80359,2.13424,u09t8s6je2sd,1000,85416
Vélizy-Villacoublay,FR,48.78198,2.19395,u09t9bmxdspt,500,21741
Vanves,FR,48.82345,2.29025,u09tu059nwwp,500,26068
Thiais,FR,48.76496,2.3961,u09tqt2u3pmt,500,29724
Sèvres,FR,48.82292,2.21757,u09tdryy15un,500,23724
Sceaux,FR,48.77644,2.29026,u09tkp7xqgmw,500,21511
Saint-Mandé,FR,48.83864,2.41579,u09tyfz1eyre,500,21261
Saint-Cloud,FR,48.84598,2.20289,u09tfhhh7n9u,500,28839
Paris,FR,48.85341,2.3488,u09tvmqrep8n,12000,2138551
Orly,FR,48.74792,2.39253,u09tq6q1jyzt,500,20528
Montrouge,FR,48.8162,2.31393,u09tswsyyrpr,500,38708
Montreuil,FR,48.86415,2.44322,u09tzx7n71ub,2000,111240
Montgeron,FR,48.70543,2.45039,u09tpf83dnpn,500,22843
Meudon,FR,48.81381,2.235,u09tdy73p38y,500,44652
Massy,FR,48.72692,2.28301,u09t5yqqvupx,500,38768
Malakoff,FR,48.81999,2.29998,u09tsr6v13tr,500,29420
Maisons-Alfort,FR,48.81171,2.43945,u09txtbkg61z,1000,53964
Longjumeau,FR,48.69307,2.29431,u09th0q9tq1s,500,20771
Le Plessis-Robinson,FR,48.78889,2.27078,u09te9txch23,500,22510
Le Kremlin-Bicêtre,FR,48.81471,2.36073,u09ttwrn2crz,500,27867
Le Chesnay,FR,48.8222,2.12213,u09t8rc3cjwz,500,29154
La Celle-Saint-Cloud,FR,48.85029,2.14523,u09tbufje6p6,500,21539
Ivry-sur-Seine,FR,48.81568,2.38487,u09twq8egqrc,1000,57897
Issy-les-Moulineaux,FR,48.82104,2.27718,u09tezd5njkr,1000,61447
Fresnes,FR,48.75568,2.32241,u09tkgenkj6r,500,24803
Fontenay-aux-Roses,FR,48.79325,2.29275,u09ts4t92cn3,500,24680
Clamart,FR,48.80299,2.26692,u09tes6dp0dn,1000,51400
Choisy-le-Roi,FR,48.76846,2.41874,u09trn12bez7,500,35590
Chevilly-Larue,FR,48.76476,2.3503,u09tmmr7mfns,500,20125
Châtillon,FR,48.8024,2.29346,u09tshnn96xx,500,32383
Châtenay-Malabry,FR,48.76507,2.26655,u09t7t6mn7yj,500,32715
Charenton-le-Pont,FR,48.82209,2.41217,u09twzu3r9hq,500,30910
Cachan,FR,48.79632,2.33661,u09tt5j7nvqd,500,26540
Bagnolet,FR,48.86667,2.41667,u09tyzzubrxb,500,33504
Bagneux,FR,48.79565,2.30796,u09tsdbx727w,500,38900
Athis-Mons,FR,48.70522,2.39147,u09tn6t2mr16,500,31225
Alfortville,FR,48.80575,2.4204,u09txhf6p7jp,500,37290
Quinze-Vingts,FR,48.84656,2.37439,u09tyh0zz6c8,500,26265
Croulebarbe,FR,48.81003,2.35403,u09tttd5hc5f,500,20062
Gare,FR,48.83337,2.37513,u09ty1cdbxcq,1000,75580
Maison Blanche,FR,48.82586,2.3508,u09tv2rz1xgx,1000,64302

以下是数据样本的直观表示：

我的目标是找到一种有效的算法，可以删除相交的圆，只保留最大的

population

。

我最初的方法是使用半正矢公式确定哪些圆相交。问题是，要检查每条记录是否与其他记录相交，需要遍历整个数据集。这个时间复杂度太高了。

我的第二种方法是按国家/地区代码分隔数据集，并按块运行比较：

  def _remove_intersecting_circles_for_country(df_country):
    """Helper function to remove intersections within a single country."""
    indices_to_remove = set()
    for i in range(len(df_country)):
      for j in range(i + 1, len(df_country)):
        distance = haversine(df_country['latitude'].iloc[i], df_country['longitude'].iloc[i],
                            df_country['latitude'].iloc[j], df_country['longitude'].iloc[j])
        if distance < df_country['estimated_radius'].iloc[i] + df_country['estimated_radius'].iloc[j]:
          if df_country['population'].iloc[i] < df_country['population'].iloc[j]:
            indices_to_remove.add(df_country.index[i]) 
          else:
            indices_to_remove.add(df_country.index[j])
    return indices_to_remove

  all_indices_to_remove = set()
  for country_code in df['country_code'].unique():
    df_country = df[df['country_code'] == country_code]
    indices_to_remove = _remove_intersecting_circles_for_country(df_country)
    all_indices_to_remove.update(indices_to_remove)

  new_df = df.drop(index=all_indices_to_remove)
  return new_df

这显着提高了性能，因为要检查每条记录，我们只需要检查具有相同

country_code

的所有记录。但这仍然会产生很多不必要的比较

Answer 1

从视觉表现来看，首先想到的是使用具有最外层轮廓的图像处理方法（并将地理坐标映射到像素坐标）。不确定是否适合城市之间间距很大的大型数据集（但为什么要查看里昂是否与巴黎重叠？），但在提供的巴黎附近采样上，其时间比当前解决方案更快。

与当前的 O(n^2) 相比，时间复杂度为 O(n log(n)) + O(n)。然而，主要的权衡可能是创建具有足够分辨率的图像。考虑到底层优化的 C++ OpenCV，它的轮廓处理应该没问题

import numpy as np
import cv2
import pandas as pd
from typing import Tuple
from io import StringIO

def detect_non_overlapping_circles(df: pd.DataFrame, image_size:Tuple[int, int] = (128, 128)):
    img = np.zeros(image_size, dtype=np.uint8) # Black image

    lat_min, lat_max = df['latitude'].min(), df['latitude'].max()
    lon_min, lon_max = df['longitude'].min(), df['longitude'].max()

    def geo_to_pixel(lat: float, lon: float):
        x = int((lon - lon_min) / (lon_max - lon_min) * (image_size[0] - 100) + 50)
        y = int((lat - lat_min) / (lat_max - lat_min) * (image_size[1] - 100) + 50)
        return (x, y)

    # Convert radius to pixels based on the image size and coordinate range
    def radius_to_pixel(radius_meters: float):
        # Using an approximate conversion factor
        earth_circumference = 40075000  # meters
        meters_per_degree = earth_circumference / 360
        lat_range = lat_max - lat_min
        pixels_per_meter = (image_size[1] - 100) / (lat_range * meters_per_degree)
        return int(radius_meters * pixels_per_meter)

    df_sorted = df.sort_values('population', ascending=False)

    kept_indices = []

    for idx, row in df_sorted.iterrows():
        test_img = img.copy()
    
        # Current circle
        center = geo_to_pixel(row['latitude'], row['longitude'])
        radius = radius_to_pixel(row['estimated_radius'])
    
        # Find outermost contour at each time
        cv2.circle(test_img, center, radius, 255, -1)    
        contours, _ = cv2.findContours(test_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
        # If exactly one contour, this circle dont overlap with any one
        if len(contours) == len(kept_indices) + 1:
            kept_indices.append(idx)
            img = test_img  # Update the main image

    return df.loc[kept_indices]

def main(df):
    result_df = detect_non_overlapping_circles(df)
    
    print(f"Original number of cities: {len(df)}")
    print(f"Cities after removing overlaps: {len(result_df)}")
    print(result_df[['name_city', 'population', 'estimated_radius']].to_string())

    return result_df

# Not sure if you're working with csv but from what provided
data = 
"""name_city,country_code,latitude,longitude,geohash,estimated_radius,population
Vitry-sur-Seine,FR,48.78716,2.40332,u09tw9qjc3v3,1000,81001
Vincennes,FR,48.8486,2.43769,u09tzkx5dr13,500,45923
Villeneuve-Saint-Georges,FR,48.73219,2.44925,u09tpxrmxdth,500,30881
Villejuif,FR,48.7939,2.35992,u09ttdwmn45z,500,48048
Vigneux-sur-Seine,FR,48.70291,2.41357,u09tnfje022n,500,26692
"""

df = pd.read_csv(StringIO(data))
df.dropna()

proc_df = main(df)

使用提供的 csv 和城市间距，图像分辨率可以很好地降低到 (128, 128)。在我的机器上，处理 (1024, 1024) 图像的执行时间仍然比

_remove_intersecting_circles_for_country

更快

如何有效地从数据集中删除重叠的圆圈

问题描述投票：0回答：1

1个回答

最新问题

如何有效地从数据集中删除重叠的圆圈

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1