Efficiency of a SQL query that joins on temporal/spatial bounds

Question

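I'm using PostgreSQL (v. 12.14) and PostGIS to model GPS tracks that pass through "areas". I've created a materialized view that maintains the list of areas each track passes through, and its refresh performance is terrible. I'm trying to understand how the (rather naive) query works and how to improve it.

Here are the relevant parts of the schema: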

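-- "end" and "user" are reserved words in PostgreSQL, so they must be
-- double-quoted when used as column names in a table definition.
create table track (
    start timestamp,
    "end" timestamp,
    "user" text
);

create table gps_point (
    create_time timestamp,
    point geometry(point, 4326),
    "user" text
);

create table area (
    name text,
    polygon geometry(polygon, 4326)
);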

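Note that there are no foreign keys between these tables. I didn't create this schema, and I would at least have added a foreign key from gps_point to track.

The core of the view is generated as: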

select track.start, track.end, array_agg(distinct area.name)
  from track
    join gps_point on (gps_point.create_time between track.start and track.end
                       and gps_point.user = track.user)
    join area on st_covers(area.polygon, gps_point.point)
  group by track.start, track.end

Effectively, I'm using the temporal between to do one half of the join and the spatial st_covers to do the other half.

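This is far too slow (I have indexes on basically every column shown above except area.name). In principle, I understand that the query piles up far more rows than it needs (hence the distinct area.name), and that there may be a way to "short-circuit" it: the gps_point table holds a huge amount of data, but all I need is a single ping inside an area to know that the area should be included. This feels like an "exists", but I can't see how to get an "exists" in there.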

Here is the output of explain analyze. Clearly the outer nested loop is where the whole thing falls apart, but I can't tell what it represents:

 GroupAggregate  (cost=28495760.61..28768028.29 rows=3768 width=108) (actual time=51901.377..53247.436 rows=3055 loops=1)
   Group Key: track.user, track.start, track.end
   ->  Sort  (cost=28495760.61..28550204.73 rows=21777646 width=80) (actual time=51900.803..52624.699 rows=689231 loops=1)
         Sort Key: track.user, track.start, track.end
         Sort Method: external merge  Disk: 63488kB
         ->  Nested Loop  (cost=0.70..22938476.09 rows=21777646 width=80) (actual time=17.638..48055.263 rows=689231 loops=1)
               ->  Nested Loop  (cost=0.41..8420387.25 rows=46063701 width=72) (actual time=7.599..36250.753 rows=843071 loops=1)
                     ->  Seq Scan on area  (cost=0.00..6.75 rows=75 width=135) (actual time=0.028..0.892 rows=109 loops=1)
                     ->  Index Scan using point_idx on gps_point g  (cost=0.41..112267.42 rows=432 width=100) (actual time=3.387..320.934 rows=7735 loops=109)
                           Index Cond: (point @ area.polygon)
                           Filter: st_covers(area.polygon, point)
                           Rows Removed by Filter: 2048
               ->  Index Scan using track_user_start_idx on golfround gr  (cost=0.28..0.31 rows=1 width=76) (actual time=0.010..0.011 rows=1 loops=843071)
                     Index Cond: (((user)::text = (g.user)::text) AND (start <= g.create_time))
                     Filter: (g.create_time <= end)
                     Rows Removed by Filter: 2
 Planning Time: 12.674 ms
 Execution Time: 53259.611 ms

Finally, some statistics about the tables:

Rows in track table: 4427
Average GPS points per track: 1000

How can I improve this?

Answer 1

To use an EXISTS subquery, you need a cross join between track and area. That sounds like a terrible idea, but in this case I'd say it is actually worth trying.

select track.start, track.end, area.name
  from track cross join area
  where exists (select 1 from gps_point 
      where gps_point.create_time between track.start and track.end
      and gps_point.user = track.user
      and  st_covers(area.polygon, gps_point.point)
  )
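The query above returns one row per (track, area) pair rather than one array per track. With the 4427 tracks from the question and the 109 areas the plan reports, the cross join gives roughly 4427 × 109 = 482,543 pairs, each answered by a single EXISTS probe that can stop at the first matching point. A minimal sketch of folding the result back into the shape of the original view follows. It assumes the column names from the question's schema and that area names are unique, it groups by user as well (as the Group Key in the posted plan does), and it has not been run against the real data:

```sql
-- Sketch: one row per track with the array of area names it touches.
-- Each area row appears at most once per track after the cross join,
-- so DISTINCT inside array_agg is unnecessary if area.name is unique
-- (an assumption; keep DISTINCT otherwise).
select t.start, t."end", array_agg(a.name) as area_names
  from track t
    cross join area a
  where exists (
      select 1
        from gps_point g
        where g."user" = t."user"
          and g.create_time between t.start and t."end"
          and st_covers(a.polygon, g.point)
  )
  group by t."user", t.start, t."end";
```

As with the original inner joins, tracks that touch no area do not appear in the result.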

This can be supported by a multicolumn GiST index, for example:

create index on gps_point using gist (point, "user", create_time)

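You will probably need the btree_gist extension to include scalar types in the index. It is also not obvious which column order in the index is best; you can try different orders and see which one performs best.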
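That experiment might look like the sketch below. It assumes you are allowed to create extensions in this database, and the index names are made up for illustration:

```sql
-- btree_gist provides GiST operator classes for scalar types
-- such as text and timestamp.
create extension if not exists btree_gist;

-- Two candidate column orders (hypothetical index names).
create index gps_point_geom_user_time_gist
    on gps_point using gist (point, "user", create_time);
create index gps_point_user_time_geom_gist
    on gps_point using gist ("user", create_time, point);

-- Refresh planner statistics before comparing plans.
analyze gps_point;
```

Rerun the EXISTS query under explain (analyze, buffers) to see which index the planner picks and how long the probes take, then drop the index that loses, since every extra GiST index adds cost to inserts into gps_point.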
