Make a combined query faster

Votes: 0 · Answers: 2

I am struggling with a slow Postgres query.

Fixtures

Consider the following fixtures:

DROP TABLE IF EXISTS expectation;
DROP TABLE IF EXISTS actual;

CREATE TABLE expectation (
  set_id int NOT NULL,
  value int NOT NULL
);
INSERT INTO expectation (set_id, value) 
  SELECT floor(random() * 1000)::int AS set_id, floor(random() * 1000)::int AS value FROM generate_series(1, 2000);

CREATE TABLE actual (
  user_id int NOT NULL,
  value int NOT NULL
);
INSERT INTO actual (user_id, value) 
  SELECT floor(random() * 200000)::int AS user_id, floor(random() * 1000)::int AS value FROM generate_series(1, 1000000);

Characteristics

We have an expectation table that represents sets of values with a corresponding set_id. A set_id can have several values:
# SELECT * FROM "expectation" ORDER BY "set_id" LIMIT 10;
 set_id | value 
--------+-------
      0 |   641
      1 |   560
      2 |   872
      3 |    56
      3 |   608
      4 |   652
      5 |   439
      5 |   145
      6 |   510
      6 |   515

We have an actual table of data that assigns values to users. A user_id can also have several values:
# SELECT * FROM "actual" ORDER BY "user_id" LIMIT 10;
 user_id | value 
---------+-------
       0 |   128
       0 |   177
       0 |   591
       0 |   219
       0 |   785
       0 |   837
       0 |   782
       1 |   502
       1 |   521
       1 |   210

Problem

Now we need to get, for every user, all the set_ids for which the user has all the values. In other words, a user must have all the values of a set (and possibly more) in order to match it.
My solution is:

# WITH
  expected AS (SELECT set_id, array_agg(value) as values FROM expectation GROUP BY set_id),
  gotten AS (SELECT user_id, array_agg(value) as values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id LIMIT 10;
 user_id |        array_agg        
---------+-------------------------
       0 | {525}
       1 | {175,840}
       2 | {336}
       3 | {98,260}
       7 | {416}
       8 | {2,251,261,352,682,808}
       9 | {971}
      10 | {163,485}
      11 | {793}
      12 | {157,332,539,582,617}
(10 rows)
Time: 18960.143 ms (00:18.960)

It returns the expected result, but it takes far too long: about 18 seconds for the given fixtures.

Already explored

  • Note that putting a LIMIT on the query does not save time, because of the aggregation.
  • A materialized view of the result could help, but the data in my application changes frequently, and I am not sure it is a good fit for me.
  • The query plan looks fair to me and I cannot see what to index. The
    Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
    is slow, but I cannot see a better way to perform the containment check.
GroupAggregate  (cost=127896.00..2339483.85 rows=200 width=36) (actual time=502.126..23381.752 rows=139712 loops=1)
  Group Key: actual.user_id
  ->  Nested Loop  (cost=127896.00..2335820.38 rows=732194 width=8) (actual time=501.614..23332.035 rows=277930 loops=1)
        Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
        Rows Removed by Join Filter: 171755568
        ->  GroupAggregate  (cost=127757.34..137371.07 rows=169098 width=36) (actual time=500.499..762.447 rows=198653 loops=1)
              Group Key: actual.user_id
              ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=8) (actual time=329.909..476.859 rows=1000000 loops=1)
                    Sort Key: actual.user_id
                    Sort Method: external merge  Disk: 17696kB
                    ->  Seq Scan on actual  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.014..41.334 rows=1000000 loops=1)
        ->  Materialize  (cost=138.66..177.47 rows=866 width=36) (actual time=0.000..0.019 rows=866 loops=198653)
              ->  GroupAggregate  (cost=138.66..164.48 rows=866 width=36) (actual time=0.551..1.164 rows=866 loops=1)
                    Group Key: expectation.set_id
                    ->  Sort  (cost=138.66..143.66 rows=2000 width=8) (actual time=0.538..0.652 rows=2000 loops=1)
                          Sort Key: expectation.set_id
                          Sort Method: quicksort  Memory: 142kB
                          ->  Seq Scan on expectation  (cost=0.00..29.00 rows=2000 width=8) (actual time=0.020..0.146 rows=2000 loops=1)
Planning Time: 0.243 ms
JIT:
  Functions: 17
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.831 ms, Inlining 43.440 ms, Optimization 61.965 ms, Emission 64.892 ms, Total 172.129 ms
Execution Time: 23406.950 ms
Tags: sql · postgresql · relational-division
2 Answers

0 votes

Create an index:

CREATE INDEX ON actual(user_id, value);

This gives me the following query plan:

Limit  (cost=39.42..114665.29 rows=10 width=36) (actual time=3.037..7.219 rows=10 loops=1)
  Output: actual.user_id, (array_agg(expectation.set_id))
  Buffers: shared hit=10 read=3
  ->  GroupAggregate  (cost=39.42..2292556.70 rows=200 width=36) (actual time=3.035..7.214 rows=10 loops=1)
        Output: actual.user_id, array_agg(expectation.set_id)
        Group Key: actual.user_id
        Buffers: shared hit=10 read=3
        ->  Nested Loop  (cost=39.42..2288794.85 rows=751869 width=8) (actual time=1.243..7.193 rows=23 loops=1)
              Output: actual.user_id, expectation.set_id
              Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
              Rows Removed by Join Filter: 14435
              Buffers: shared hit=10 read=3
              ->  GroupAggregate  (cost=0.42..33136.01 rows=172447 width=36) (actual time=0.105..0.204 rows=17 loops=1)
                    Output: actual.user_id, array_agg(actual.value)
                    Group Key: actual.user_id
                    Buffers: shared hit=1 read=3
                    ->  Index Only Scan using actual_user_id_value_idx on public.actual  (cost=0.42..25980.42 rows=1000000 width=8) (actual time=0.092..0.117 rows=95 loops=1)
                          Output: actual.user_id, actual.value
                          Heap Fetches: 0
                          Buffers: shared hit=1 read=3
              ->  Materialize  (cost=39.00..54.26 rows=872 width=36) (actual time=0.057..0.168 rows=850 loops=17)
                    Output: expectation.set_id, (array_agg(expectation.value))
                    Buffers: shared hit=9
                    ->  HashAggregate  (cost=39.00..49.90 rows=872 width=36) (actual time=0.965..1.236 rows=872 loops=1)
                          Output: expectation.set_id, array_agg(expectation.value)
                          Group Key: expectation.set_id
                          Batches: 1  Memory Usage: 297kB
                          Buffers: shared hit=9
                          ->  Seq Scan on public.expectation  (cost=0.00..29.00 rows=2000 width=8) (actual time=0.006..0.204 rows=2000 loops=1)
                                Output: expectation.set_id, expectation.value
                                Buffers: shared hit=9
Settings: enable_partitionwise_join = 'on'
Planning:
  Buffers: shared hit=18 read=1
Planning Time: 1.761 ms
Execution Time: 7.303 ms

That runs in 9 ms total, on PostgreSQL version 16.
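Why the index pays off so dramatically with a LIMIT: the index-only scan on actual (user_id, value) returns rows already ordered by user_id, so the GroupAggregate can stream one user's array at a time and the LIMIT can stop after the first 10 groups, instead of aggregating all users up front. A rough Python analogue of that streaming behavior (toy data, standard library only):

```python
from itertools import groupby, islice

# Rows as they come off an index on (user_id, value): already sorted by user_id.
rows = [(0, 128), (0, 177), (1, 502), (1, 521), (2, 336), (3, 98)]

def first_n_users(rows, n):
    # Stream one group per user (like GroupAggregate over sorted input)
    # and stop after n groups (like LIMIT) without touching the rest.
    grouped = ((uid, [v for _, v in grp])
               for uid, grp in groupby(rows, key=lambda r: r[0]))
    return list(islice(grouped, n))

print(first_n_users(rows, 2))  # [(0, [128, 177]), (1, [502, 521])]
```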


0 votes

"Note that putting a LIMIT on the query does not save time, because of the aggregation."

That is not necessarily true. If you build sorted arrays (or add an index like Frank demonstrated), Postgres chooses a different query plan, where a small LIMIT is faster:

WITH expected AS (
   SELECT set_id, array_agg(value) AS values
   FROM  (SELECT set_id, value FROM expectation ORDER BY 1, 2) sub
   GROUP  BY 1
   )
, gotten AS (
   SELECT user_id, array_agg(value) AS values
   FROM  (SELECT user_id, value FROM actual ORDER BY 1, 2) sub
   GROUP  BY 1
   )
SELECT g.user_id, array_agg(set_id)
FROM   gotten g
JOIN   expected e ON g.values @> e.values
GROUP  BY 1
LIMIT  10;

But that barely helps for the full set without LIMIT. An index does not help much there, either.
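For intuition on why sorted arrays can make the containment check cheaper: once both arrays are sorted, deciding whether one contains the other takes a single linear merge pass instead of repeated lookups. A small Python sketch of that two-pointer merge (illustrative only, not what Postgres literally executes for `<@`):

```python
def sorted_contains(haystack, needle):
    """True iff every element of sorted list `needle` occurs in sorted list `haystack`."""
    i = 0
    for x in needle:
        # Advance through the haystack only once overall: O(len(haystack) + len(needle)).
        while i < len(haystack) and haystack[i] < x:
            i += 1
        if i == len(haystack) or haystack[i] != x:
            return False  # x is missing from the haystack
        i += 1
    return True

print(sorted_contains([56, 219, 608], [56, 608]))   # True
print(sorted_contains([56, 219, 608], [56, 999]))   # False
```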
No query for the full result set can be very fast. But a query with a recursive CTE can make use of indexes and is at least 10x faster than what you have now. Essentially, it is a dynamic case of relational division. It works after cleaning up the sample data and adding a PRIMARY KEY constraint on actual (user_id, value):

EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE rcte AS (
   SELECT a.user_id, e.set_id, value
   FROM  (
      SELECT DISTINCT ON (1)
             set_id, value
      FROM   expectation e
      ORDER  BY 1, 2
      ) e
   JOIN   actual a USING (value)

   UNION ALL
   SELECT r.user_id, r.set_id, e.value
   FROM   rcte r
   CROSS  JOIN LATERAL (
      SELECT e.value
      FROM   expectation e
      WHERE  e.set_id = r.set_id
      AND    e.value > r.value
      ORDER  BY e.value
      LIMIT  1
      ) e
   JOIN   actual a ON (a.user_id, a.value) = (r.user_id, e.value)
   )
SELECT user_id, array_agg(set_id)
FROM   rcte
GROUP  BY 1;

fiddle
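The idea behind the recursive CTE, sketched in Python with toy data (invented for illustration): for each candidate (user, set) pair, walk the set's values in ascending order, and abandon the walk at the first value the user lacks, each step being a cheap probe of the (user_id, value) primary key. A user matches a set only if the walk reaches the set's last value:

```python
# Toy data mirroring the fixtures: set values kept sorted ascending,
# actual modeled as the (user_id, value) primary-key set we can probe.
expectation = {3: [56, 608], 4: [652]}
actual = {(0, 56), (0, 219), (0, 608), (1, 652)}

def matching_sets(user_id):
    matched = []
    for set_id, values in expectation.items():
        for v in values:              # walk the set's values in ascending order
            if (user_id, v) not in actual:
                break                 # user lacks this value: stop, like the CTE not recursing
        else:
            matched.append(set_id)    # walk reached the end: user has the full set
    return sorted(matched)

print(matching_sets(0))  # [3]
print(matching_sets(1))  # [4]
```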

Related:

  • Using the same column multiple times in the WHERE clause
  • Optimize GROUP BY query to retrieve latest row per user
  • Delete duplicate rows from small table