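I have a DataFrame with a column named "geometry" that contains MultiPolygon and Polygon values. Using PySpark or Python, I want to find and return the rows whose coordinates contain [X, Y, Z]. I also want a code block that removes the Z value when it is present. How can I do this? In the sample below, I want to return only the first coordinate value and none of the others.
I want to do something like the following, but I don't know the code needed to append a new column to the DataFrame and to find and return only the rows that have X, Y, Z geometries:
I was thinking of using code like this to find the Z values, but it doesn't work: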
for row in source_df.collect():
    z_val = [len(x) == 3 for x in row.geometry["coordinates"]]
Sample data:
{
  "type": "MultiPolygon",
  "coordinates": [
    [-120.92484404138442,35.54577502278743,0.0],
    [-120.92484170835023,35.545764670080004],
    [-120.92470946198651,35.54517811398435],
    [-120.92373579577058,35.54476080459215],
    [-120.92224560209857,35.544644824151],
    [-120.91471743922112,35.54405891151482],
    [-120.9137131887035,35.541405607829184],
    [-120.91370267246779,35.54138005556737],
    [-120.91368022915093,35.54133577314701],
    [-120.91365314934913,35.54129325687539],
    [-120.91364620938849,35.541283659095036],
    [-120.91019544280519,35.53661949063082],
    [-120.91016692865233,35.536584105321104],
    [-120.91013516362523,35.53655061941634],
    [-120.9101046793985,35.53652289241281],
    [-120.90545581970368,35.53257237955164],
    [-120.90540343303125,35.53253236763702]
  ]
}
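You can use a UDF to parse the JSON string, which makes it easy to check whether the geometry is a 'Polygon' or a 'MultiPolygon' and to iterate over the coordinates to find any point that contains three values (X, Y, Z).
Finally, add the result as a boolean column (z_column) and filter df to keep only the rows where that column is True.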
import json

from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# Create the DataFrame; this assumes `spark` is an active SparkSession and
# `data` is a list of rows whose "geometry" field is a GeoJSON string
df = spark.createDataFrame(data)


# Define a UDF that checks for Z coordinates
def contains_z_coordinate(geometry):
    try:
        geometry_object = json.loads(geometry)
        # Check if it's a Polygon or MultiPolygon
        if geometry_object["type"] in ["Polygon", "MultiPolygon"]:
            coords = geometry_object["coordinates"]
            if geometry_object["type"] == "Polygon":
                coords = [coords]  # Wrap a Polygon so it nests like a MultiPolygon
            # Check if any point has a Z coordinate
            for polygon_coords in coords:
                for ring_coordinates in polygon_coords:
                    for point in ring_coordinates:
                        if len(point) == 3:
                            return True
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    return False


# Register the UDF
z_udf = udf(contains_z_coordinate, BooleanType())

# Add a boolean column with the UDF and keep only the rows where it is True
result_df_using_filter = df.withColumn("z_column", z_udf(col("geometry"))).filter(
    col("z_column")
)

# Show the results
result_df_using_filter.show(truncate=False)
Result:
+--------------------------------------------------------------------------------------------------------------------------------+--------+
|geometry                                                                                                                        |z_column|
+--------------------------------------------------------------------------------------------------------------------------------+--------+
|{"type":"MultiPolygon","coordinates":[[[[-120.92484404138442,35.54577502278743,0.0],[-120.92484170835023,35.545764670080004]]]]}|true    |
+--------------------------------------------------------------------------------------------------------------------------------+--------+
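
The UDF above only detects Z values, and its three nested loops expect fully nested GeoJSON (MultiPolygon > polygon > ring > point). For the other half of the question, removing Z when it is present, here is a rough plain-Python sketch. The helper names (`has_z`, `drop_z`, `strip_z_from_geometry`) are made up for illustration and are not part of any library. It walks the coordinates recursively, so it does not depend on the nesting depth, and it treats any list made only of numbers as a point, which is an assumption about the data.

```python
import json


def _is_point(node):
    # Assumption: a point is a non-empty list that contains only numbers.
    return (
        isinstance(node, list)
        and len(node) > 0
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in node)
    )


def has_z(node):
    """Return True if any point under `node` has at least three values."""
    if _is_point(node):
        return len(node) >= 3
    if isinstance(node, list):
        return any(has_z(child) for child in node)
    return False


def drop_z(node):
    """Return a copy of `node` with every point truncated to [X, Y]."""
    if _is_point(node):
        return node[:2]
    if isinstance(node, list):
        return [drop_z(child) for child in node]
    return node


def strip_z_from_geometry(geometry_json):
    """Parse a GeoJSON geometry string and return it without Z values.

    Returns None if the input is not valid JSON or has no "coordinates" key.
    """
    try:
        geometry = json.loads(geometry_json)
        geometry["coordinates"] = drop_z(geometry["coordinates"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return None
    return json.dumps(geometry, separators=(",", ":"))


sample = (
    '{"type":"MultiPolygon","coordinates":'
    "[[[[-120.92484404138442,35.54577502278743,0.0],"
    "[-120.92484170835023,35.545764670080004]]]]}"
)
cleaned = strip_z_from_geometry(sample)
print(has_z(json.loads(sample)["coordinates"]))   # True
print(has_z(json.loads(cleaned)["coordinates"]))  # False
```

In PySpark, `strip_z_from_geometry` could probably be wrapped the same way as the check above, for example `udf(strip_z_from_geometry, StringType())` with `StringType` imported from `pyspark.sql.types`, and applied with `withColumn`. I have not run that against your data, so treat it as a starting point.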