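I'm working on a correlation study in Python and need a distance matrix between every pair of coordinates in a dataset of 60,000 points. I've tried vectorizing and using geopandas, but the problem with geopandas is that to run the distance function I have to repeat the x and y data in lists (the x data repeats the whole set of 60,000 towers, and the y data repeats each coordinate 60,000 times consecutively), which makes each list 3.6e9 values long. My computer runs out of memory before it finishes, and when I try it on my school's remote desktop it takes more than half an hour and I still haven't been able to get it to run successfully. Here is the code I'm running: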
#Florida Tower Matrix
#take coordinates of Florida towers
#CHECK THE LAT/LONG order
import geojson
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
features = []
with open("/Users/katcn/Desktop/Spring 2024/Research/PleaseWork.geojson") as f:
    gj = geojson.load(f)
for i in range(59629):
    features.append(gj['features'][i]["geometry"]['coordinates'])
#OR make the X matrix all in one column
#make the Y matrix repeat each value 59000 times
longitude = []
latitude = []
for i in range(len(features)):
    for j in range(len(features)):
        longitude.append(features[j][0])
    for k in range(len(features)):
        latitude.append(features[i][0])
dict = {"longitude" : longitude, "latitude" : latitude}
df = pd.DataFrame(dict)
dict2 = {"longitude" : longitude, "latitude" : latitude}
df2 = pd.DataFrame(dict2)
#calculate distance between two towers
geometry = [Point(xy) for xy in zip(df.longitude, df.latitude)]
gdf = gpd.GeoDataFrame(df, crs={'init': 'epsg:4326'}, geometry=geometry)
geometry2 = [Point(xy) for xy in zip(df2.longitude, df2.latitude)]
gdf2 = gpd.GeoDataFrame(df2, crs={'init': 'epsg:4326'}, geometry=geometry2)
distances = gdf.geometry.distance(gdf2.geometry)
print(distances)
Any suggestions on how to approach this differently so that it runs in a more reasonable time would be great.
You don't actually need the geopandas distance function for this. Essentially, the only thing you need is scipy. First, put the coordinates into an array:
import numpy as np
import pandas as pd
import geojson
from scipy.spatial import distance
with open("/Users/katcn/Desktop/Spring 2024/Research/PleaseWork.geojson") as f:
    gj = geojson.load(f)
coords = np.array([feature["geometry"]["coordinates"] for feature in gj['features']])
dist_matrix = distance.cdist(coords, coords, 'euclidean')
dist_df = pd.DataFrame(dist_matrix)
print(dist_df)
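One caveat on memory: a dense 60,000 x 60,000 matrix of float64 values needs about 28.8 GB, so the full result may not fit in RAM on a typical laptop. If you only need to reduce or filter the distances, here is a minimal sketch that computes the matrix one block of rows at a time. The helper names `full_matrix_bytes` and `iter_distance_blocks`, the block size, and the random demo data are my own assumptions for illustration; they are not part of scipy.

```python
import numpy as np
from scipy.spatial import distance


def full_matrix_bytes(n_points, dtype=np.float64):
    """Bytes needed to hold a dense n_points x n_points matrix."""
    return n_points * n_points * np.dtype(dtype).itemsize


def iter_distance_blocks(coords, block_size=1000):
    """Yield (start_row, block) pairs (hypothetical helper).

    Each block holds the distances from rows
    start_row .. start_row + len(block) - 1 to every point in coords.
    """
    for start in range(0, len(coords), block_size):
        rows = coords[start:start + block_size]
        yield start, distance.cdist(rows, coords, 'euclidean')


# Small made-up dataset so the example runs quickly.
rng = np.random.default_rng(0)
demo_coords = rng.uniform(size=(2500, 2))

# Example reduction: sum of distances from each point to all others,
# computed without ever holding the full matrix in memory.
row_sums = np.empty(len(demo_coords))
for start, block in iter_distance_blocks(demo_coords, block_size=1000):
    row_sums[start:start + len(block)] = block.sum(axis=1)

print(full_matrix_bytes(60000) / 1e9, "GB for the full 60,000-point matrix")
```

Only one block (block_size x 60,000 values) is alive at a time, so peak memory is set by the block size rather than by the number of points.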
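Since you didn't provide your data, I created a sample dataset of points across the continental US (60,000 points here, ranging from Key West, Florida up to the Canadian border and from the West Coast to the East Coast):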
import numpy as np
import pandas as pd
from scipy.spatial import distance
import time
np.random.seed(42)
latitudes = np.random.uniform(low=25.0, high=49.0, size=60000)
longitudes = np.random.uniform(low=-125.0, high=-66.0, size=60000)
coords = np.column_stack((latitudes, longitudes))
start_time = time.time()
dist_matrix = distance.cdist(coords, coords, 'euclidean')
end_time = time.time()
elapsed_time = end_time - start_time
print("Distance matrix computed in {:.2f} seconds".format(elapsed_time))
print("Shape of the distance matrix:", dist_matrix.shape)
This prints:
Distance matrix computed in 185.64 seconds
Shape of the distance matrix: (60000, 60000)
So it takes about 3 minutes.
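Keep in mind that 'euclidean' applied to latitude/longitude pairs gives distances in degrees, not in kilometres. If you need great-circle distances instead, here is a hedged sketch of the haversine formula using NumPy broadcasting. The function name `haversine_matrix_km` and the mean Earth radius of 6371 km (a spherical approximation) are my assumptions, and the two city coordinates are approximate values for illustration only. It expects (latitude, longitude) in degrees, the same column order as the sample data above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a spherical Earth


def haversine_matrix_km(coords_a, coords_b):
    """Great-circle distances in km between two arrays of (lat, lon) in degrees.

    Returns an array of shape (len(coords_a), len(coords_b)).
    """
    a = np.radians(np.asarray(coords_a, dtype=float))
    b = np.radians(np.asarray(coords_b, dtype=float))
    # Column vectors for the first set, row vectors for the second,
    # so that broadcasting produces every pairwise combination.
    lat1, lon1 = a[:, 0][:, None], a[:, 1][:, None]
    lat2, lon2 = b[:, 0][None, :], b[:, 1][None, :]
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))


# Approximate (lat, lon) for Key West and Miami, for illustration only.
cities = np.array([[24.5551, -81.7800],
                   [25.7617, -80.1918]])
print(haversine_matrix_km(cities, cities))
```

GeoJSON stores each position as [longitude, latitude], so swap the columns before passing real data in. For 60,000 points, call the function on slices of rows rather than on the whole array at once, because the intermediate arrays are as large as the result.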