How to limit the number of parent values, but return all child values for each?


I have a `transactions` table and a `logs` table. `logs` is linked to `transactions` via `transaction_id`. I need to query `logs` by `address`, join them to `transactions`, aggregate the logs into an array, limit the number of transactions (`LIMIT 2` in the example), and get all logs of those transactions (even though only a single `address` is filtered on).

create table transactions
(id int,
 hash varchar);
 
create table logs
(transaction_id int,
 address varchar,
 value varchar
);

create index on logs(address);

insert into transactions values 
(1, 'h1'),
(2, 'h2'),
(3, 'h3'),
(4, 'h4'),
(5, 'h5')
;
 
insert into logs values 
(1, 'a1', 'h1.a1.1'),
(1, 'a1', 'h1.a1.2'),
(1, 'a3', 'h1.a3.1'),
(2, 'a1', 'h2.a1.1'),
(2, 'a2', 'h2.a2.1'),
(2, 'a2', 'h2.a2.2'),
(2, 'a3', 'h2.a3.1'),
(3, 'a2', 'h3.a2.1'),
(4, 'a1', 'h4.a1.1'),
(5, 'a2', 'h5.a2.1'),
(5, 'a3', 'h5.a3.1')
;

The required result for the query `WHERE log.address='a2' LIMIT 2`:

id  logs_array
2   [{"address":"a1","value":"h2.a1.1"},{"address":"a2","value":"h2.a2.1"},{"address":"a2","value":"h2.a2.2"},{"address":"a3","value":"h2.a3.1"}]
3   [{"address":"a2","value":"h3.a2.1"}]

The problem: the SQL query below works correctly, but with a large number of logs (100k+ logs for one address) the search can take many minutes. One idea would be to put a `LIMIT` inside the `MATERIALIZED` CTE, but then I can get transactions with incomplete log lists. How can this be fixed? Either rewrite the query without `MATERIALIZED`, using nested `SELECT`s (I don't know how), or fix it while keeping `MATERIALIZED`.

So the problem is that Postgres does not understand, inside the `MATERIALIZED` CTE, that I only need a limited number of transactions: it first finds all matching logs and only then joins them to the limited set of transactions (as far as I can tell). The index on `logs(address)` is in place.

WITH b AS MATERIALIZED (
        SELECT lg.transaction_id
        FROM logs lg
        WHERE lg.address='a2'
      
        -- this must be commented, otherwise not correct results, although fast execution
        -- LIMIT 2
    )
SELECT 
    id,
    (SELECT array_agg(JSON_BUILD_OBJECT('address',address,'value',value)) FROM logs WHERE transaction_id = t.id) logs_array
FROM transactions t 
WHERE t.id IN 
    (SELECT transaction_id FROM b)
LIMIT 2
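To make the failure mode concrete, here is a small, self-contained reproduction using Python's built-in sqlite3 module as a stand-in for Postgres (SQLite has no `MATERIALIZED` clause, but the logic is identical). With this sample data, the first two log rows matching `'a2'` both belong to transaction 2, so a `LIMIT` inside the CTE caps log *rows*, not transactions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (id INT, hash TEXT);
CREATE TABLE logs (transaction_id INT, address TEXT, value TEXT);
INSERT INTO transactions VALUES (1,'h1'),(2,'h2'),(3,'h3'),(4,'h4'),(5,'h5');
INSERT INTO logs VALUES
 (1,'a1','h1.a1.1'),(1,'a1','h1.a1.2'),(1,'a3','h1.a3.1'),
 (2,'a1','h2.a1.1'),(2,'a2','h2.a2.1'),(2,'a2','h2.a2.2'),(2,'a3','h2.a3.1'),
 (3,'a2','h3.a2.1'),(4,'a1','h4.a1.1'),(5,'a2','h5.a2.1'),(5,'a3','h5.a3.1');
""")

# The LIMIT cuts off log rows: the first two rows matching address 'a2'
# both belong to transaction 2, so only one transaction survives.
ids = conn.execute("""
    WITH b AS (
        SELECT transaction_id FROM logs WHERE address = 'a2' LIMIT 2
    )
    SELECT id FROM transactions t
    WHERE t.id IN (SELECT transaction_id FROM b)
    LIMIT 2
""").fetchall()
print(ids)  # one transaction comes back instead of the expected two
```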

A real example, where the query runs for ~30 seconds:

EXPLAIN WITH 
    b AS MATERIALIZED (
        SELECT lg.transaction_id
        FROM _logs lg
        WHERE lg.address in ('0xca530408c3e552b020a2300debc7bd18820fb42f', '0x68e78497a7b0db7718ccc833c164a18d8e626816')
    )
SELECT 
    (SELECT array_agg(JSON_BUILD_OBJECT('address',address)) FROM _logs WHERE transaction_id = t.id) logs_array
FROM _transactions t 
WHERE t.id IN 
    (SELECT transaction_id FROM b)
LIMIT 5000;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=87540.62..3180266.26 rows=5000 width=32)
   CTE b
     ->  Index Scan using _logs_address_idx on _logs lg  (cost=0.70..85820.98 rows=76403 width=8)
           Index Cond: ((address)::text = ANY ('{0xca530408c3e552b020a2300debc7bd18820fb42f,0x68e78497a7b0db7718ccc833c164a18d8e626816}'::text[]))
   ->  Nested Loop  (cost=1719.64..47260423.09 rows=76403 width=32)
         ->  HashAggregate  (cost=1719.07..1721.07 rows=200 width=8)
               Group Key: b.transaction_id
               ->  CTE Scan on b  (cost=0.00..1528.06 rows=76403 width=8)
         ->  Index Only Scan using _transactions_pkey on _transactions t  (cost=0.57..2.79 rows=1 width=8)
               Index Cond: (id = b.transaction_id)
         SubPlan 2
           ->  Aggregate  (cost=618.53..618.54 rows=1 width=32)
                 ->  Index Scan using _logs_transaction_id_idx on _logs  (cost=0.57..584.99 rows=6707 width=43)
                       Index Cond: (transaction_id = t.id)
 JIT:
   Functions: 17
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(17 rows)
1 Answer

Simple solution

SELECT a.transaction_id AS id
     , ARRAY(SELECT json_build_object('address',address,'value',value)
             FROM logs l WHERE l.transaction_id = a.transaction_id) AS logs_array
FROM  (
   SELECT DISTINCT l.transaction_id
   FROM   logs l
   WHERE  l.address = 'a2'  -- your address here (or parameterize)
   LIMIT  2  -- your LIMIT here
   ) a

No `MATERIALIZED` CTE needed. Use the cheaper ARRAY constructor for the simple task.

This returns at most `LIMIT 2` transactions, with all logs of each, where the given address appears at least once.

But this is only fastest with few rows per address. Your performance problem kicks in here:

"But the number of logs is very large (100k+ logs for one address), so the search can take many minutes."

Fast solution for many rows per address

Make sure you have an index on logs (address, transaction_id).

If your example is realistic and just one small additional column satisfies your query, add it to the index as an `INCLUDE` column to get index-only scans throughout - if your table is vacuumed enough.

CREATE INDEX logs_address_transaction_id ON logs (address, transaction_id) INCLUDE (value);

(The rCTE will use index-only scans anyway, so the `INCLUDE` clause is only a minor improvement for the outer `SELECT`, after most of the hard work is already done.)

Then emulate an index skip scan with a recursive CTE (rCTE). See:

  • Optimize GROUP BY query to retrieve latest row per user
  • SELECT DISTINCT is slower than expected on my table in PostgreSQL
WITH RECURSIVE adr_trans AS (
   (
   SELECT l.transaction_id
   FROM   logs l
   WHERE  l.address = 'a2'  -- your address here (or parameterize)
   ORDER  BY l.transaction_id
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l.transaction_id
           FROM   logs l
           WHERE  l.address = 'a2'  -- your address here (or parameterize)
           AND    l.transaction_id > a.transaction_id
           ORDER  BY l.transaction_id
           LIMIT  1)
   FROM   adr_trans a
   WHERE  a.transaction_id IS NOT NULL
   )
SELECT a.transaction_id AS id
--   , ARRAY(SELECT json_build_object('address',address,'value',value)
--           FROM logs l WHERE l.transaction_id = a.transaction_id) AS logs_array
     , (SELECT json_agg(l.*)
        FROM  (SELECT address, value FROM logs l
               WHERE l.transaction_id = a.transaction_id) l) AS logs_json_array
FROM   adr_trans a
WHERE  a.transaction_id IS NOT NULL  -- eliminate possible dangling null row
LIMIT  2;  -- your LIMIT here

fiddle (also works for Postgres 12).
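The skip-scan logic can be sanity-checked outside Postgres. A minimal sketch with Python's sqlite3, using `min()` as an equivalent of `ORDER BY ... LIMIT 1` in each step - every recursive iteration jumps to the next higher `transaction_id` for the address, one cheap index probe per distinct transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs (transaction_id INT, address TEXT, value TEXT);
CREATE INDEX logs_addr_txn ON logs (address, transaction_id);
INSERT INTO logs VALUES
 (1,'a1','h1.a1.1'),(1,'a1','h1.a1.2'),(1,'a3','h1.a3.1'),
 (2,'a1','h2.a1.1'),(2,'a2','h2.a2.1'),(2,'a2','h2.a2.2'),(2,'a3','h2.a3.1'),
 (3,'a2','h3.a2.1'),(4,'a1','h4.a1.1'),(5,'a2','h5.a2.1'),(5,'a3','h5.a3.1');
""")

# Base case: smallest transaction_id for the address.
# Recursive case: smallest transaction_id strictly greater than the last one.
# Recursion stops once the scalar subquery yields NULL.
ids = conn.execute("""
    WITH RECURSIVE adr_trans AS (
        SELECT (SELECT min(transaction_id) FROM logs
                WHERE address = 'a2') AS transaction_id
        UNION ALL
        SELECT (SELECT min(transaction_id) FROM logs
                WHERE address = 'a2'
                AND   transaction_id > a.transaction_id)
        FROM adr_trans a
        WHERE a.transaction_id IS NOT NULL
    )
    SELECT transaction_id FROM adr_trans
    WHERE transaction_id IS NOT NULL  -- drop the dangling null row
    LIMIT 2
""").fetchall()
print(ids)  # the first two distinct transactions for address 'a2'
```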

Now we are talking milliseconds instead of seconds or minutes.

Also, the rCTE stops execution as soon as the outer `LIMIT` is satisfied. It doesn't get any faster than this.

Note that I build a JSON array of objects (type `json`) rather than an array of JSON values (type `json[]`). Just a suggestion.
