我正在努力解决 Postgres 慢速查询问题。
考虑以下固定装置:
DROP TABLE IF EXISTS expectation;
DROP TABLE IF EXISTS actual;
CREATE TABLE expectation (
set_id int NOT NULL,
value int NOT NULL
);
INSERT INTO expectation (set_id, value)
SELECT floor(random() * 1000)::int AS set_id, floor(random() * 1000)::int AS value FROM generate_series(1, 2000);
CREATE TABLE actual (
user_id int NOT NULL,
value int NOT NULL
);
INSERT INTO actual (user_id, value)
SELECT floor(random() * 200000)::int AS user_id, floor(random() * 1000)::int AS value FROM generate_series(1, 1000000);
我们有一个期望表,它代表一系列值和相应的
set_id
。一个 set_id
可以有多个 value
。
# SELECT * FROM "expectation" ORDER BY "set_id" LIMIT 10;
set_id | value
--------+-------
0 | 641
1 | 560
2 | 872
3 | 56
3 | 608
4 | 652
5 | 439
5 | 145
6 | 510
6 | 515
我们有一个为用户分配值的实际数据表。一个
user_id
也可以有多个 value
。
# SELECT * FROM "actual" ORDER BY "user_id" LIMIT 10;
user_id | value
---------+-------
0 | 128
0 | 177
0 | 591
0 | 219
0 | 785
0 | 837
0 | 782
1 | 502
1 | 521
1 | 210
现在我们需要让所有用户拥有他们拥有所有值的所有
set_id
。换句话说,用户必须拥有一组(可能更多)的所有值才能匹配它。
我的解决方案是:
# WITH
expected AS (SELECT set_id, array_agg(value) as values FROM expectation GROUP BY set_id),
gotten AS (SELECT user_id, array_agg(value) as values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id LIMIT 10;
user_id | array_agg
---------+-------------------------
0 | {525}
1 | {175,840}
2 | {336}
3 | {98,260}
7 | {416}
8 | {2,251,261,352,682,808}
9 | {971}
10 | {163,485}
11 | {793}
12 | {157,332,539,582,617}
(10 rows)
Time: 18960.143 ms (00:18.960)
它返回预期结果,但花费的时间太长:对于给定的装置大约需要 18 秒。
Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
很慢,但我看不出更好的方法来进行检查配合。GroupAggregate (cost=127896.00..2339483.85 rows=200 width=36) (actual time=502.126..23381.752 rows=139712 loops=1)
Group Key: actual.user_id
-> Nested Loop (cost=127896.00..2335820.38 rows=732194 width=8) (actual time=501.614..23332.035 rows=277930 loops=1)
Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
Rows Removed by Join Filter: 171755568
-> GroupAggregate (cost=127757.34..137371.07 rows=169098 width=36) (actual time=500.499..762.447 rows=198653 loops=1)
Group Key: actual.user_id
-> Sort (cost=127757.34..130257.34 rows=1000000 width=8) (actual time=329.909..476.859 rows=1000000 loops=1)
Sort Key: actual.user_id
Sort Method: external merge Disk: 17696kB
-> Seq Scan on actual (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.014..41.334 rows=1000000 loops=1)
-> Materialize (cost=138.66..177.47 rows=866 width=36) (actual time=0.000..0.019 rows=866 loops=198653)
-> GroupAggregate (cost=138.66..164.48 rows=866 width=36) (actual time=0.551..1.164 rows=866 loops=1)
Group Key: expectation.set_id
-> Sort (cost=138.66..143.66 rows=2000 width=8) (actual time=0.538..0.652 rows=2000 loops=1)
Sort Key: expectation.set_id
Sort Method: quicksort Memory: 142kB
-> Seq Scan on expectation (cost=0.00..29.00 rows=2000 width=8) (actual time=0.020..0.146 rows=2000 loops=1)
Planning Time: 0.243 ms
JIT:
Functions: 17
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.831 ms, Inlining 43.440 ms, Optimization 61.965 ms, Emission 64.892 ms, Total 172.129 ms
Execution Time: 23406.950 ms
创建索引:
CREATE INDEX ON actual(user_id, value);
这给了我这个查询计划:
Limit (cost=39.42..114665.29 rows=10 width=36) (actual time=3.037..7.219 rows=10 loops=1)
" Output: actual.user_id, (array_agg(expectation.set_id))"
Buffers: shared hit=10 read=3
-> GroupAggregate (cost=39.42..2292556.70 rows=200 width=36) (actual time=3.035..7.214 rows=10 loops=1)
" Output: actual.user_id, array_agg(expectation.set_id)"
Group Key: actual.user_id
Buffers: shared hit=10 read=3
-> Nested Loop (cost=39.42..2288794.85 rows=751869 width=8) (actual time=1.243..7.193 rows=23 loops=1)
" Output: actual.user_id, expectation.set_id"
Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
Rows Removed by Join Filter: 14435
Buffers: shared hit=10 read=3
-> GroupAggregate (cost=0.42..33136.01 rows=172447 width=36) (actual time=0.105..0.204 rows=17 loops=1)
" Output: actual.user_id, array_agg(actual.value)"
Group Key: actual.user_id
Buffers: shared hit=1 read=3
-> Index Only Scan using actual_user_id_value_idx on public.actual (cost=0.42..25980.42 rows=1000000 width=8) (actual time=0.092..0.117 rows=95 loops=1)
" Output: actual.user_id, actual.value"
Heap Fetches: 0
Buffers: shared hit=1 read=3
-> Materialize (cost=39.00..54.26 rows=872 width=36) (actual time=0.057..0.168 rows=850 loops=17)
" Output: expectation.set_id, (array_agg(expectation.value))"
Buffers: shared hit=9
-> HashAggregate (cost=39.00..49.90 rows=872 width=36) (actual time=0.965..1.236 rows=872 loops=1)
" Output: expectation.set_id, array_agg(expectation.value)"
Group Key: expectation.set_id
Batches: 1 Memory Usage: 297kB
Buffers: shared hit=9
-> Seq Scan on public.expectation (cost=0.00..29.00 rows=2000 width=8) (actual time=0.006..0.204 rows=2000 loops=1)
" Output: expectation.set_id, expectation.value"
Buffers: shared hit=9
Settings: enable_partitionwise_join = 'on'
Planning:
Buffers: shared hit=18 read=1
Planning Time: 1.761 ms
Execution Time: 7.303 ms
在 PostgreSQL 版本 16 上运行,总共 9 毫秒。
请注意,由于聚合限制,它不会赢得时间来限制查询。
这不一定是真的。如果您构建排序数组(或者添加像 Frank 所示的索引),Postgres 会选择不同的查询计划,其中较小的
LIMIT
速度更快:
WITH expected AS (
SELECT set_id, array_agg(value) as values
FROM (
SELECT set_id, value
FROM expectation
ORDER BY 1, 2
) sub
GROUP BY 1
)
, gotten AS (
SELECT user_id, array_agg(value) as values
FROM (
SELECT user_id, value
FROM actual
ORDER BY 1, 2
) sub
GROUP BY 1
)
SELECT g.user_id, array_agg(set_id)
FROM gotten g
JOIN expected e ON g.values @> e.values
GROUP BY 1
LIMIT 10;
但这对于没有
LIMIT
的
全套几乎没有帮助。索引也没有多大帮助。无全套查询可以非常
快。但是,具有“递归 CTE”的查询可以使用索引,其速度“至少比现在快 10 倍”。从本质上讲,它是关系划分的动态案例。
这在清理样本数据并在 PRIMARY KEY
和 expectation (set_id, value)
上添加
actual (user_id, value)
约束后起作用:
EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE rcte AS (
SELECT a.user_id, e.set_id, value
FROM (
SELECT DISTINCT ON (1)
set_id, value
FROM expectation e
ORDER BY 1, 2
) e
JOIN actual a USING (value)
UNION ALL
SELECT r.user_id, r.set_id, e.value
FROM rcte r
CROSS JOIN LATERAL (
SELECT e.value
FROM expectation e
WHERE e.set_id = r.set_id
AND e.value > r.value
ORDER BY e.value
LIMIT 1
) e
JOIN actual a ON (a.user_id, a.value) = (r.user_id, e.value)
)
SELECT user_id, array_agg(set_id)
FROM rcte
GROUP BY 1;
小提琴
相关: