I am trying to optimize the following query for a school assignment:
SELECT
DATE(b.book_date),
SUM(b.total_amount) revenue,
COUNT(DISTINCT(t.passenger_id)) count_passengers
FROM bookings b
JOIN tickets t ON t.book_ref = b.book_ref
GROUP BY
DATE(b.book_date)
ORDER BY
COUNT(DISTINCT(t.passenger_id)) DESC,
SUM(b.total_amount) DESC;
Without any indexes created, the planner gives me the following plan:
EXPLAIN ANALYZE
|QUERY PLAN |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|Sort (cost=20001452121.83..20001453243.21 rows=448552 width=44) (actual time=10862.729..10862.752 rows=392 loops=1) |
| Sort Key: (count(DISTINCT t.passenger_id)) DESC, (sum(b.total_amount)) DESC |
| Sort Method: quicksort Memory: 55kB |
| -> GroupAggregate (cost=20001359986.85..20001396213.70 rows=448552 width=44) (actual time=8131.457..10862.442 rows=392 loops=1) |
| Group Key: (date(b.book_date)) |
| -> Sort (cost=20001359986.85..20001367361.50 rows=2949857 width=22) (actual time=8131.394..8329.844 rows=2949857 loops=1) |
| Sort Key: (date(b.book_date)) |
| Sort Method: external merge Disk: 95136kB |
| -> Merge Join (cost=20000859819.02..20000921997.07 rows=2949857 width=22) (actual time=6064.572..7486.794 rows=2949857 loops=1) |
| Merge Cond: (b.book_ref = t.book_ref) |
| -> Sort (cost=10000342915.67..10000348193.45 rows=2111110 width=21) (actual time=854.300..1138.458 rows=2111110 loops=1) |
| Sort Key: b.book_ref |
| Sort Method: external merge Disk: 66024kB |
| -> Seq Scan on bookings b (cost=10000000000.00..10000034558.10 rows=2111110 width=21) (actual time=90.339..215.696 rows=2111110 loops=1)|
| -> Sort (cost=10000516903.35..10000524278.00 rows=2949857 width=19) (actual time=5210.197..5407.427 rows=2949857 loops=1) |
| Sort Key: t.book_ref |
| Sort Method: external sort Disk: 95320kB |
| -> Seq Scan on tickets t (cost=10000000000.00..10000078913.57 rows=2949857 width=19) (actual time=0.134..241.174 rows=2949857 loops=1) |
|Planning Time: 0.121 ms |
|JIT: |
| Functions: 16 |
| Options: Inlining true, Optimization true, Expressions true, Deforming true |
| Timing: Generation 0.834 ms, Inlining 7.090 ms, Optimization 51.113 ms, Emission 32.119 ms, Total 91.156 ms |
|Execution Time: 10890.239 ms |
To optimize the join, I created the following indexes:
CREATE INDEX idx_bookings_bd_ta_bref ON bookings USING btree(book_date, total_amount, book_ref);
CREATE INDEX idx_tickets_bref ON tickets USING hash(book_ref);
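The main idea is that the planner can perform the join by iterating over the index on bookings and, for each bookings row, fetching the matching rows from tickets through the hash index. The joined result set would then already be sorted by book_date, which should eliminate the sort on the result set. After disabling seqscan and some indexes, I finally got the planner to do what I wanted, which produces the following EXPLAIN ANALYZE output: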
|QUERY PLAN |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|Sort (cost=966377.54..967498.92 rows=448552 width=44) (actual time=6961.514..6961.535 rows=392 loops=1) |
| Sort Key: (count(DISTINCT t.passenger_id)) DESC, (sum(b.total_amount)) DESC |
| Sort Method: quicksort Memory: 55kB |
| -> GroupAggregate (cost=874242.56..910469.41 rows=448552 width=44) (actual time=4283.854..6961.216 rows=392 loops=1) |
| Group Key: (date(b.book_date)) |
| -> Sort (cost=874242.56..881617.20 rows=2949857 width=22) (actual time=4283.792..4475.416 rows=2949857 loops=1) |
| Sort Key: (date(b.book_date)) |
| Sort Method: external merge Disk: 95136kB |
| -> Nested Loop (cost=0.43..436252.77 rows=2949857 width=22) (actual time=64.342..3727.674 rows=2949857 loops=1) |
| -> Index Only Scan using idx_bookings_bd_ta_bref on bookings b (cost=0.43..73467.08 rows=2111110 width=21) (actual time=0.049..172.735 rows=2111110 loops=1)|
| Heap Fetches: 0 |
| -> Index Scan using idx_tickets_bref on tickets t (cost=0.00..0.15 rows=2 width=19) (actual time=0.001..0.001 rows=1 loops=2111110) |
| Index Cond: (book_ref = b.book_ref) |
| Rows Removed by Index Recheck: 0 |
|Planning Time: 0.139 ms |
|JIT: |
| Functions: 10 |
| Options: Inlining true, Optimization true, Expressions true, Deforming true |
| Timing: Generation 0.448 ms, Inlining 6.415 ms, Optimization 34.103 ms, Emission 23.769 ms, Total 64.734 ms |
|Execution Time: 6971.385 ms |
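What confuses me is that, even though the result set is already sorted by book_date, the planner insists on performing an external on-disk sort anyway, which slows things down considerably. Why is PostgreSQL sorting something that should already be sorted, and how can I stop it from doing so? Note that I am not asking for the answer to the assignment; I just want to know why the planner is doing something it seemingly should not need to do.
In case it is relevant, I am using PostgreSQL 14, and the book_date column that causes the problem is a timestamptz column.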
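The sort is required because the query groups by DATE(b.book_date), but that expression is not the leading key of the index on bookings. The planner determined that an index scan was cheaper than a table scan, but it cannot use the order of the key b.book_date as a substitute for sorting by DATE(b.book_date), because it does not know that DATE() preserves the ordering of its argument. Changing the type of book_date from TIMESTAMPTZ to DATE and then grouping by b.book_date might eliminate the additional sort. Changing the index into an expression (functional) index, with DATE(book_date) in place of book_date as the leading key, might also achieve the desired result.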
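As a rough sketch of the expression-index suggestion, and not a tested solution for this dataset: DATE() applied to a timestamptz depends on the session's TimeZone setting, so PostgreSQL treats it as STABLE rather than IMMUTABLE and will refuse to index it directly. The example below therefore pins the time zone with AT TIME ZONE. The zone 'UTC' and the index name idx_bookings_bday_ta_bref are assumptions made for illustration; neither appears in the question.

```sql
-- Assumption: days are bucketed in UTC; substitute whatever zone the assignment expects.
-- Pinning the zone turns the expression into an IMMUTABLE one, which an index requires.
CREATE INDEX idx_bookings_bday_ta_bref   -- hypothetical index name
    ON bookings ((DATE(book_date AT TIME ZONE 'UTC')), total_amount, book_ref);

-- The GROUP BY must use exactly the indexed expression, otherwise the planner
-- cannot match the index ordering to the grouping key.
EXPLAIN (ANALYZE)
SELECT
    DATE(b.book_date AT TIME ZONE 'UTC') AS book_day,
    SUM(b.total_amount)                  AS revenue,
    COUNT(DISTINCT t.passenger_id)       AS count_passengers
FROM bookings b
JOIN tickets t ON t.book_ref = b.book_ref
GROUP BY DATE(b.book_date AT TIME ZONE 'UTC')
ORDER BY count_passengers DESC, revenue DESC;
```

Whether the Sort node under GroupAggregate actually disappears still depends on the planner's cost estimates, and the question already needed enable_seqscan turned off to get the nested-loop plan. The scan on bookings may also show up as a plain Index Scan instead of an Index Only Scan, since the planner is conservative about index-only scans when the underlying column is stored only inside an indexed expression.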