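I want to count the occurrences of some values in 2 big tables and return a single result set. In theory, there is no difference between counting the values in each table and then joining the results, and joining the tables on the values and then counting them. In practice, joining the tables first makes the query less amenable to DBMS optimization. In particular, the Postgres 16.3 planner fails to push conditions of the form WHERE col IN (1,2,3,...) into the branches of an OUTER JOIN so that the proper indexes can be used. This means I have to apply the filter manually in each joined table (subquery or view).
Is there a way to avoid manually applying the filter to each joined table? If not, what is the best practice or the idiomatic approach?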
DROP TABLE IF EXISTS t1 CASCADE;
DROP TABLE IF EXISTS t2 CASCADE;
CREATE TABLE t1 (id INT);
CREATE TABLE t2 (id INT);
-- t1 contains some negative ids, while t2 does not:
INSERT INTO t1
SELECT (SELECT (- random()*20)::int + i % 10000 LIMIT 1)
FROM generate_series(1,1000000) i;
INSERT INTO t2
SELECT (SELECT (random()*20)::int + i % 10000 LIMIT 1)
FROM generate_series(1,1000000) i;
CREATE INDEX "t1_idx" ON t1 USING btree ("id");
CREATE INDEX "t2_idx" ON t2 USING btree ("id");
-- views for counts:
CREATE VIEW c1 AS (
SELECT id, count(1) as cnt1
FROM t1
GROUP BY id
);
CREATE VIEW c2 AS (
SELECT id, count(1) as cnt2
FROM t2
GROUP BY id
);
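Now I want the counts of "selected values" from both tables in a single result set. The following query is very slow in the real database. Let's see why: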
explain analyze
SELECT *
FROM c1
NATURAL FULL OUTER JOIN c2
WHERE id IN (-5, 11);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Hash Full Join (cost=27979.91..28107.22 rows=5011 width=20) (actual time=630.949..631.763 rows=2 loops=1)
Hash Cond: (t2.id = t1.id)
Filter: (COALESCE(t1.id, t2.id) = ANY ('{-5,11}'::integer[]))
Rows Removed by Filter: 10038
-> Finalize HashAggregate (cost=13881.60..13981.90 rows=10030 width=12) (actual time=306.364..307.877 rows=10020 loops=1)
Group Key: t2.id
Batches: 1 Memory Usage: 1169kB
-> Gather (cost=11675.00..13781.30 rows=20060 width=12) (actual time=287.054..294.920 rows=30056 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=10675.00..10775.30 rows=10030 width=12) (actual time=276.957..280.021 rows=10019 loops=3)
Group Key: t2.id
Batches: 1 Memory Usage: 1169kB
Worker 0: Batches: 1 Memory Usage: 1169kB
Worker 1: Batches: 1 Memory Usage: 1169kB
-> Parallel Seq Scan on t2 (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.008..73.377 rows=333333 loops=3)
-> Hash (cost=13973.39..13973.39 rows=9993 width=12) (actual time=321.427..321.465 rows=10020 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 598kB
-> Finalize HashAggregate (cost=13873.46..13973.39 rows=9993 width=12) (actual time=318.759..320.219 rows=10020 loops=1)
Group Key: t1.id
Batches: 1 Memory Usage: 1169kB
-> Gather (cost=11675.00..13773.53 rows=19986 width=12) (actual time=300.414..306.962 rows=30057 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=10675.00..10774.93 rows=9993 width=12) (actual time=291.697..297.355 rows=10019 loops=3)
Group Key: t1.id
Batches: 1 Memory Usage: 1169kB
Worker 0: Batches: 1 Memory Usage: 1169kB
Worker 1: Batches: 1 Memory Usage: 1169kB
-> Parallel Seq Scan on t1 (cost=0.00..8591.67 rows=416667 width=4) (actual time=0.013..73.148 rows=333333 loops=3)
Planning Time: 0.298 ms
Execution Time: 632.066 ms
(32 rows)
We see that Filter: (COALESCE(t1.id, t2.id) = ANY ('{-5,11}'::integer[])) is applied to the join result only after both tables have been scanned sequentially, which is far from optimal.
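For reference, the filter mentions COALESCE because a NATURAL FULL OUTER JOIN merges the two id columns into one output column. As far as I understand the join semantics, the slow query above is equivalent to the explicit form below. This is only an illustrative rewrite, not something taken from the plan itself:

```sql
-- Explicit spelling of: SELECT * FROM c1 NATURAL FULL OUTER JOIN c2 WHERE id IN (-5, 11);
-- The merged "id" column is COALESCE(c1.id, c2.id), so the WHERE clause
-- refers to a value that only exists after the join has been computed.
SELECT COALESCE(c1.id, c2.id) AS id,
       c1.cnt1,
       c2.cnt2
FROM c1
FULL OUTER JOIN c2 ON c1.id = c2.id
WHERE COALESCE(c1.id, c2.id) IN (-5, 11);
```

Written this way, it is easier to see that the condition is not a plain restriction on t1.id or t2.id, so the planner would have to prove that it can be split into one condition per branch.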
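If I clone the WHERE ... clause into each argument of the JOIN, I gain several orders of magnitude in performance. The planner pushes the filter condition into each joined table and is able to replace the sequential scans with index-only scans: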
explain analyze
SELECT *
FROM (
SELECT * FROM c1
WHERE id IN (-5, 11)
)
NATURAL FULL OUTER JOIN (
SELECT * FROM c2
WHERE id IN (-5, 11)
);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Merge Full Join (cost=0.85..33.58 rows=198 width=20) (actual time=0.051..0.068 rows=2 loops=1)
Merge Cond: (t1.id = t2.id)
-> GroupAggregate (cost=0.42..15.33 rows=198 width=12) (actual time=0.029..0.045 rows=2 loops=1)
Group Key: t1.id
-> Index Only Scan using t1_idx on t1 (cost=0.42..12.35 rows=200 width=4) (actual time=0.012..0.029 rows=178 loops=1)
Index Cond: (id = ANY ('{-5,11}'::integer[]))
Heap Fetches: 0
-> GroupAggregate (cost=0.42..15.29 rows=197 width=12) (actual time=0.020..0.020 rows=1 loops=1)
Group Key: t2.id
-> Index Only Scan using t2_idx on t2 (cost=0.42..12.32 rows=199 width=4) (actual time=0.009..0.015 rows=64 loops=1)
Index Cond: (id = ANY ('{-5,11}'::integer[]))
Heap Fetches: 0
Planning Time: 0.165 ms
Execution Time: 0.110 ms
(14 rows)
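In the slow query, it is precisely the condition WHERE id IN (-5, 11) that prevents the planner from producing a better execution plan. If I replace it with WHERE id IN (11) or WHERE id = 11, there is no difference between the two queries.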
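In this toy example we only have to add a few lines of code. In the real database I have a large number of views and more complex queries, and the inconvenience of manual optimization multiplies. Instead of creating a single aggregating view inside the database and sending a simple query from the client, I have to expose several intermediate views and construct a fairly complex query on the client side. Can I do better?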
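You hinted at the problem yourself by noticing that the condition has to be changed to coalesce(t1.id, t2.id) IN (-5,11) if you apply it after the outer join and want the same result. The optimizer tries to be as smart as it can while still being fast, but the kind of reasoning required to prove that the above condition can be modified and pushed down into the join branches is evidently beyond its reach here.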
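One possible way to keep a single view and a simple client query is to avoid the outer join altogether: stack the two tables with UNION ALL and count per source with aggregate FILTER clauses. The sketch below is an assumption on my part rather than a tested recommendation. The view name counts and the src marker column are made up for illustration, and I would expect, but have not verified on 16.3, that the planner pushes a condition on id into both UNION ALL branches, so check the plan with EXPLAIN.

```sql
-- Hypothetical combined view: one row per id with the counts from both tables.
-- "src" is an illustrative marker that records which table a row came from.
CREATE VIEW counts AS
SELECT id,
       count(*) FILTER (WHERE src = 1) AS cnt1,
       count(*) FILTER (WHERE src = 2) AS cnt2
FROM (
    SELECT id, 1 AS src FROM t1
    UNION ALL
    SELECT id, 2 AS src FROM t2
) AS u
GROUP BY id;

-- The client sends a simple query; the filter is written only once.
EXPLAIN ANALYZE
SELECT *
FROM counts
WHERE id IN (-5, 11);
```

Note one difference from the FULL OUTER JOIN version: an id that is missing from one of the tables gets a count of 0 in this view instead of NULL.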