Make combinatory query faster

Question

I'm struggling with a slow Postgres query.

Fixtures

Consider the following fixtures:

DROP TABLE IF EXISTS expectation;
DROP TABLE IF EXISTS actual;

CREATE TABLE expectation (
  set_id int NOT NULL,
  value int NOT NULL
);
INSERT INTO expectation (set_id, value) 
  SELECT floor(random() * 1000)::int AS set_id, floor(random() * 1000)::int AS value FROM generate_series(1, 2000);

CREATE TABLE actual (
  user_id int NOT NULL,
  value int NOT NULL
);
INSERT INTO actual (user_id, value) 
  SELECT floor(random() * 200000)::int AS user_id, floor(random() * 1000)::int AS value FROM generate_series(1, 1000000);

Characteristics

We have an `expectation` table that represents sets of values, each identified by a `set_id`. One `set_id` can have multiple `value`s.

# SELECT * FROM "expectation" ORDER BY "set_id" LIMIT 10;
 set_id | value 
--------+-------
      0 |   641
      1 |   560
      2 |   872
      3 |    56
      3 |   608
      4 |   652
      5 |   439
      5 |   145
      6 |   510
      6 |   515

We have an `actual` table that assigns values to users. One `user_id` can also have multiple `value`s.

# SELECT * FROM "actual" ORDER BY "user_id" LIMIT 10;
 user_id | value 
---------+-------
       0 |   128
       0 |   177
       0 |   591
       0 |   219
       0 |   785
       0 |   837
       0 |   782
       1 |   502
       1 |   521
       1 |   210

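Problem

Now we need to find, for every user, all the `set_id`s for which the user has all the values. In other words, a user must have every value of a set (and possibly more) to match it.

My solution is: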

# WITH
  expected AS (SELECT set_id, array_agg(value) as values FROM expectation GROUP BY set_id),
  gotten AS (SELECT user_id, array_agg(value) as values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id LIMIT 10;
 user_id |        array_agg        
---------+-------------------------
       0 | {525}
       1 | {175,840}
       2 | {336}
       3 | {98,260}
       7 | {416}
       8 | {2,251,261,352,682,808}
       9 | {971}
      10 | {163,485}
      11 | {793}
      12 | {157,332,539,582,617}
(10 rows)
Time: 18960.143 ms (00:18.960)

It returns the expected result, but it takes far too long: about 18 seconds for the given fixtures.
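
To make the join condition concrete: `<@` is PostgreSQL's array "is contained by" operator, so `expected.values <@ gotten.values` holds when every expected value appears somewhere in the user's array. A minimal illustration with literal arrays (the values are made up for the example, not taken from the fixtures):

```sql
SELECT ARRAY[56, 608] <@ ARRAY[56, 128, 608] AS all_present,  -- true: 56 and 608 are both present
       ARRAY[56, 608] <@ ARRAY[56, 128]      AS one_missing,  -- false: 608 is missing
       ARRAY[56, 56]  <@ ARRAY[56]           AS dup_ignored;  -- true: duplicates are not treated specially
```

Element order and duplicates do not affect the result, which is why a plain `array_agg` without sorting or deduplication is enough for correctness here.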

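Already explored

Note that, because of the aggregation, limiting the query does not save any time.
A materialized, indexed view of the result might help, but the data in my application changes often and I am not sure that would suit me.
The query plan looks fair to me, and I cannot see how I could index anything.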
```
Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
```
This join filter is slow, but I cannot see a better way to perform the containment check.

GroupAggregate  (cost=127896.00..2339483.85 rows=200 width=36) (actual time=502.126..23381.752 rows=139712 loops=1)
  Group Key: actual.user_id
  ->  Nested Loop  (cost=127896.00..2335820.38 rows=732194 width=8) (actual time=501.614..23332.035 rows=277930 loops=1)
        Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
        Rows Removed by Join Filter: 171755568
        ->  GroupAggregate  (cost=127757.34..137371.07 rows=169098 width=36) (actual time=500.499..762.447 rows=198653 loops=1)
              Group Key: actual.user_id
              ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=8) (actual time=329.909..476.859 rows=1000000 loops=1)
                    Sort Key: actual.user_id
                    Sort Method: external merge  Disk: 17696kB
                    ->  Seq Scan on actual  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.014..41.334 rows=1000000 loops=1)
        ->  Materialize  (cost=138.66..177.47 rows=866 width=36) (actual time=0.000..0.019 rows=866 loops=198653)
              ->  GroupAggregate  (cost=138.66..164.48 rows=866 width=36) (actual time=0.551..1.164 rows=866 loops=1)
                    Group Key: expectation.set_id
                    ->  Sort  (cost=138.66..143.66 rows=2000 width=8) (actual time=0.538..0.652 rows=2000 loops=1)
                          Sort Key: expectation.set_id
                          Sort Method: quicksort  Memory: 142kB
                          ->  Seq Scan on expectation  (cost=0.00..29.00 rows=2000 width=8) (actual time=0.020..0.146 rows=2000 loops=1)
Planning Time: 0.243 ms
JIT:
  Functions: 17
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.831 ms, Inlining 43.440 ms, Optimization 61.965 ms, Emission 64.892 ms, Total 172.129 ms
Execution Time: 23406.950 ms
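
Reading the plan, the nested loop evaluates the join filter once for every pair of aggregated user and aggregated set, that is 198653 × 866 comparisons, of which only 277930 pass. A quick check of that arithmetic, using only figures copied from the plan above:

```sql
SELECT 198653 * 866       AS pairs_compared,     -- rows from gotten × rows from expected
       171755568 + 277930 AS removed_plus_kept;  -- "Rows Removed by Join Filter" + rows the join returned
-- Both columns come out as 172033498.
```

So almost all of the run time goes into roughly 172 million array containment checks, not into the sorts or the sequential scans.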

Answer 1

Create an index:

CREATE INDEX ON actual(user_id, value);

That gives me this query plan:

Limit  (cost=39.42..114665.29 rows=10 width=36) (actual time=3.037..7.219 rows=10 loops=1)
  Output: actual.user_id, (array_agg(expectation.set_id))
  Buffers: shared hit=10 read=3
  ->  GroupAggregate  (cost=39.42..2292556.70 rows=200 width=36) (actual time=3.035..7.214 rows=10 loops=1)
        Output: actual.user_id, array_agg(expectation.set_id)
        Group Key: actual.user_id
        Buffers: shared hit=10 read=3
        ->  Nested Loop  (cost=39.42..2288794.85 rows=751869 width=8) (actual time=1.243..7.193 rows=23 loops=1)
              Output: actual.user_id, expectation.set_id
              Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
              Rows Removed by Join Filter: 14435
              Buffers: shared hit=10 read=3
              ->  GroupAggregate  (cost=0.42..33136.01 rows=172447 width=36) (actual time=0.105..0.204 rows=17 loops=1)
                    Output: actual.user_id, array_agg(actual.value)
                    Group Key: actual.user_id
                    Buffers: shared hit=1 read=3
                    ->  Index Only Scan using actual_user_id_value_idx on public.actual  (cost=0.42..25980.42 rows=1000000 width=8) (actual time=0.092..0.117 rows=95 loops=1)
                          Output: actual.user_id, actual.value
                          Heap Fetches: 0
                          Buffers: shared hit=1 read=3
              ->  Materialize  (cost=39.00..54.26 rows=872 width=36) (actual time=0.057..0.168 rows=850 loops=17)
                    Output: expectation.set_id, (array_agg(expectation.value))
                    Buffers: shared hit=9
                    ->  HashAggregate  (cost=39.00..49.90 rows=872 width=36) (actual time=0.965..1.236 rows=872 loops=1)
                          Output: expectation.set_id, array_agg(expectation.value)
                          Group Key: expectation.set_id
                          Batches: 1  Memory Usage: 297kB
                          Buffers: shared hit=9
                          ->  Seq Scan on public.expectation  (cost=0.00..29.00 rows=2000 width=8) (actual time=0.006..0.204 rows=2000 loops=1)
                                Output: expectation.set_id, expectation.value
                                Buffers: shared hit=9
Settings: enable_partitionwise_join = 'on'
Planning:
  Buffers: shared hit=18 read=1
Planning Time: 1.761 ms
Execution Time: 7.303 ms

Running on PostgreSQL version 16, this takes about 9 ms in total.
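
A sketch of how a plan like this one could be reproduced. The `EXPLAIN` options are an assumption inferred from the output (the `Output:`, `Buffers:` and `Settings:` lines correspond to `VERBOSE`, `BUFFERS` and `SETTINGS`). The `VACUUM` is likewise assumed, since an index-only scan can only report `Heap Fetches: 0` when the visibility map is up to date.

```sql
-- Same index as above, spelled out with the name PostgreSQL generates by default
CREATE INDEX IF NOT EXISTS actual_user_id_value_idx ON actual (user_id, value);

-- Refresh statistics and the visibility map so an index-only scan can skip the heap
VACUUM (ANALYZE) actual;

-- The question's query, unchanged, wrapped in EXPLAIN
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS)
WITH
  expected AS (SELECT set_id, array_agg(value) AS values FROM expectation GROUP BY set_id),
  gotten AS (SELECT user_id, array_agg(value) AS values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id LIMIT 10;
```

Because the index delivers rows already ordered by `user_id`, the aggregation can stop as soon as `LIMIT 10` is satisfied; the plan above read only 95 index entries instead of sorting all 1000000 rows.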

Answer 2

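> Note that, because of the aggregation, limiting the query does not save any time.

That is not necessarily true. If you build sorted arrays (or add an index like Frank demonstrated), Postgres chooses a different query plan, in which a small `LIMIT` is faster: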

WITH expected AS (
   SELECT set_id, array_agg(value) as values
   FROM  (
      SELECT set_id, value
      FROM   expectation
      ORDER  BY 1, 2
      ) sub
   GROUP  BY 1
   )
, gotten AS (
   SELECT user_id, array_agg(value) as values
   FROM  (
      SELECT user_id, value
      FROM   actual
      ORDER  BY 1, 2
      ) sub
   GROUP  BY 1
   )
SELECT g.user_id, array_agg(set_id)
FROM   gotten   g
JOIN   expected e ON g.values @> e.values
GROUP  BY 1
LIMIT  10;

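But that hardly helps for the full set without `LIMIT`. The index does not help much, either.

No query for the full set can be very fast. But a query with a recursive CTE can use indexes, and it is at least 10 times faster than what you have now. In essence, it is a dynamic case of relational division. This works after cleaning up the sample data and adding `PRIMARY KEY` constraints on `expectation (set_id, value)` and `actual (user_id, value)`: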

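EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE rcte AS (
   -- Anchor: pair each set's smallest value with every user who has that value
   SELECT a.user_id, e.set_id, value
   FROM  (
      SELECT DISTINCT ON (1)
             set_id, value
      FROM   expectation e
      ORDER  BY 1, 2
      ) e
   JOIN   actual a USING (value)

   UNION ALL
   -- Step: move to the set's next larger value; the row survives only if the user has it too
   SELECT r.user_id, r.set_id, e.value
   FROM   rcte r
   CROSS  JOIN LATERAL (
      SELECT e.value
      FROM   expectation e
      WHERE  e.set_id = r.set_id
      AND    e.value > r.value
      ORDER  BY e.value
      LIMIT  1
      ) e
   JOIN   actual a ON (a.user_id, a.value) = (r.user_id, e.value)
   )
SELECT user_id, array_agg(set_id)
FROM   rcte r
-- Keep only rows that reached the set's largest value, i.e. complete matches
WHERE  NOT EXISTS (
   SELECT FROM expectation e
   WHERE  e.set_id = r.set_id
   AND    e.value  > r.value
   )
GROUP  BY 1;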
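
The text mentions cleaning up the sample data and adding primary keys but does not show those statements. One possible way to do it, offered as a sketch under that assumption and not as the answer's actual script: delete duplicate pairs by comparing the system column `ctid`, then add the constraints.

```sql
-- Remove duplicate (set_id, value) pairs, keeping the row with the lowest ctid
DELETE FROM expectation e
USING  expectation d
WHERE  e.set_id = d.set_id
AND    e.value  = d.value
AND    e.ctid   > d.ctid;

-- Same for duplicate (user_id, value) pairs
DELETE FROM actual a
USING  actual d
WHERE  a.user_id = d.user_id
AND    a.value   = d.value
AND    a.ctid    > d.ctid;

-- The primary keys also create the unique indexes the recursive query can use
ALTER TABLE expectation ADD PRIMARY KEY (set_id, value);
ALTER TABLE actual      ADD PRIMARY KEY (user_id, value);

-- Refresh planner statistics after the bulk changes
ANALYZE expectation;
ANALYZE actual;
```

With unique pairs in place, each step of the recursion is a single index lookup per table, and every complete match yields exactly one row per user and set.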

Related:

Using same column multiple times in WHERE clause
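
For comparison, a plain (non-recursive) formulation of the same relational division counts the matching values per user and set and compares that count with the size of the set. This is a generic sketch, not part of the answer above. It assumes the `(set_id, value)` and `(user_id, value)` pairs are unique, as they are once the primary keys exist, and no timing claim is made for it.

```sql
SELECT user_id, array_agg(set_id) AS set_ids
FROM  (
   SELECT a.user_id, e.set_id
   FROM   expectation e
   JOIN   actual a USING (value)              -- every value a user shares with a set
   JOIN  (
      SELECT set_id, count(*) AS set_size     -- number of values each set requires
      FROM   expectation
      GROUP  BY set_id
      ) s USING (set_id)
   GROUP  BY a.user_id, e.set_id, s.set_size
   HAVING count(*) = s.set_size               -- user has all values of the set
   ) m
GROUP  BY user_id;
```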
