There are transactions and logs tables. logs is linked to transactions via transaction_id. I need to search logs by address, join them to transactions, aggregate the logs into an array, limit the number of transactions (LIMIT 2 in the example), and return all logs of each of those transactions (while the search itself filters on just one address).
create table transactions
(id int,
hash varchar);
create table logs
(transaction_id int,
address varchar,
value varchar
);
create index on logs(address);
insert into transactions values
(1, 'h1'),
(2, 'h2'),
(3, 'h3'),
(4, 'h4'),
(5, 'h5')
;
insert into logs values
(1, 'a1', 'h1.a1.1'),
(1, 'a1', 'h1.a1.2'),
(1, 'a3', 'h1.a3.1'),
(2, 'a1', 'h2.a1.1'),
(2, 'a2', 'h2.a2.1'),
(2, 'a2', 'h2.a2.2'),
(2, 'a3', 'h2.a3.1'),
(3, 'a2', 'h3.a2.1'),
(4, 'a1', 'h4.a1.1'),
(5, 'a2', 'h5.a2.1'),
(5, 'a3', 'h5.a3.1')
;
The expected result for the query WHERE log.address='a2' LIMIT 2:
id logs_array
2 [{"address":"a1","value":"h2.a1.1"},{"address":"a2","value":"h2.a2.1"},{"address":"a2","value":"h2.a2.2"},{"address":"a3","value":"h2.a3.1"}]
3 [{"address":"a2","value":"h3.a2.1"}]
Problem: the SQL query below works correctly, but with a large number of logs (100k+ logs for a single address) the search can take many minutes. A solution would be to put a LIMIT inside the MATERIALIZED CTE, but then I can get transactions with an incomplete list of logs. How can this be fixed? Either rewrite the query without MATERIALIZED, using several SELECTs nested inside each other (but I don't know how), or fix it with MATERIALIZED.
So the problem is that with MATERIALIZED Postgres does not understand that I only need a limited number of transactions: it first searches all the logs and only then attaches them to the limited set of transactions (as I guess). The index on logs(address) is already in place.
WITH b AS MATERIALIZED (
SELECT lg.transaction_id
FROM logs lg
WHERE lg.address='a2'
-- this must be commented, otherwise not correct results, although fast execution
-- LIMIT 2
)
SELECT
id,
(SELECT array_agg(JSON_BUILD_OBJECT('address',address,'value',value)) FROM logs WHERE transaction_id = t.id) logs_array
FROM transactions t
WHERE t.id IN
(SELECT transaction_id FROM b)
LIMIT 2
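For illustration, this is the fast-but-wrong variant described above, sketched against the sample data (derived from the query above by uncommenting the LIMIT): the LIMIT inside the CTE limits log rows, not transactions, so both returned rows can belong to transaction 2 and transaction 3 is silently dropped.
WITH b AS MATERIALIZED (
  SELECT lg.transaction_id
  FROM logs lg
  WHERE lg.address = 'a2'
  LIMIT 2  -- limits log rows, not transactions
)
SELECT
  id,
  (SELECT array_agg(json_build_object('address', address, 'value', value))
   FROM logs WHERE transaction_id = t.id) AS logs_array
FROM transactions t
WHERE t.id IN (SELECT transaction_id FROM b)
LIMIT 2;
-- On the sample data this can return only id = 2, missing id = 3 from the expected result.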
A real example, where the query executes for about 30 seconds:
EXPLAIN WITH
b AS MATERIALIZED (
SELECT lg.transaction_id
FROM _logs lg
WHERE lg.address in ('0xca530408c3e552b020a2300debc7bd18820fb42f', '0x68e78497a7b0db7718ccc833c164a18d8e626816')
)
SELECT
(SELECT array_agg(JSON_BUILD_OBJECT('address',address)) FROM _logs WHERE transaction_id = t.id) logs_array
FROM _transactions t
WHERE t.id IN
(SELECT transaction_id FROM b)
LIMIT 5000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=87540.62..3180266.26 rows=5000 width=32)
   CTE b
     ->  Index Scan using _logs_address_idx on _logs lg  (cost=0.70..85820.98 rows=76403 width=8)
           Index Cond: ((address)::text = ANY ('{0xca530408c3e552b020a2300debc7bd18820fb42f,0x68e78497a7b0db7718ccc833c164a18d8e626816}'::text[]))
   ->  Nested Loop  (cost=1719.64..47260423.09 rows=76403 width=32)
         ->  HashAggregate  (cost=1719.07..1721.07 rows=200 width=8)
               Group Key: b.transaction_id
               ->  CTE Scan on b  (cost=0.00..1528.06 rows=76403 width=8)
         ->  Index Only Scan using _transactions_pkey on _transactions t  (cost=0.57..2.79 rows=1 width=8)
               Index Cond: (id = b.transaction_id)
         SubPlan 2
           ->  Aggregate  (cost=618.53..618.54 rows=1 width=32)
                 ->  Index Scan using _logs_transaction_id_idx on _logs  (cost=0.57..584.99 rows=6707 width=43)
                       Index Cond: (transaction_id = t.id)
 JIT:
   Functions: 17
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(17 rows)
SELECT a.transaction_id AS id
, ARRAY(SELECT json_build_object('address',address,'value',value)
FROM logs l WHERE l.transaction_id = a.transaction_id) AS logs_array
FROM (
SELECT DISTINCT l.transaction_id
FROM logs l
WHERE l.address = 'a2' -- your address here (or parameterize)
LIMIT 2 -- your LIMIT here
) a
No MATERIALIZED CTE needed. This returns up to LIMIT 2 transactions, with all their logs, where the given address shows up at least once.
But it is only fastest for few rows per address. Here comes your performance problem:
"But with a very large volume of logs (100k+ logs for one address) the search can take many minutes."
Make sure you have an index on logs (address, transaction_id)! You can also add the value column to the index with INCLUDE to get index-only scans throughout - if your table is vacuumed enough.
CREATE INDEX logs_address_transaction_id ON logs (address, transaction_id) INCLUDE (value);
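To verify the index pays off (a quick sketch; VACUUM keeps the visibility map current, which index-only scans depend on):
VACUUM ANALYZE logs;
EXPLAIN (ANALYZE, BUFFERS)
SELECT transaction_id
FROM logs
WHERE address = 'a2'
ORDER BY transaction_id
LIMIT 1;
-- The plan should show an Index Only Scan using logs_address_transaction_id,
-- ideally with "Heap Fetches: 0".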
(The rCTE below will use index-only scans either way, so the INCLUDE clause is only a minor improvement for the outer SELECT, after most of the hard work is already done.)
Then emulate an index skip scan with a recursive CTE (rCTE):
WITH RECURSIVE adr_trans AS (
(
SELECT l.transaction_id
FROM logs l
WHERE l.address = 'a2' -- your address here (or parameterize)
ORDER BY l.transaction_id
LIMIT 1
)
UNION ALL
SELECT (SELECT l.transaction_id
FROM logs l
WHERE l.address = 'a2' -- your address here (or parameterize)
AND l.transaction_id > a.transaction_id
ORDER BY l.transaction_id
LIMIT 1
)
FROM adr_trans a
WHERE a.transaction_id IS NOT NULL
)
SELECT a.transaction_id AS id
-- , ARRAY(SELECT json_build_object('address',address,'value',value) FROM logs l WHERE l.transaction_id = a.transaction_id) AS logs_array
, (SELECT json_agg(l.*) FROM (SELECT address, value FROM logs l WHERE l.transaction_id = a.transaction_id) l) AS logs_json_array
FROM adr_trans a
WHERE a.transaction_id IS NOT NULL -- eliminate possible dangling null row
LIMIT 2; -- your LIMIT here
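Since the address and the limit are hard-coded in the examples, one possible way to parameterize them (the function name and signature below are my own, not part of the answer) is to wrap the rCTE in a SQL function:
CREATE OR REPLACE FUNCTION trans_with_logs(_address varchar, _limit int)
  RETURNS TABLE (id int, logs_json_array json)
  LANGUAGE sql STABLE AS
$$
WITH RECURSIVE adr_trans AS (
   (
   SELECT l.transaction_id
   FROM   logs l
   WHERE  l.address = _address
   ORDER  BY l.transaction_id
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l.transaction_id
           FROM   logs l
           WHERE  l.address = _address
           AND    l.transaction_id > a.transaction_id
           ORDER  BY l.transaction_id
           LIMIT  1)
   FROM   adr_trans a
   WHERE  a.transaction_id IS NOT NULL
   )
SELECT a.transaction_id
     , (SELECT json_agg(l)
        FROM  (SELECT address, value FROM logs l WHERE l.transaction_id = a.transaction_id) l)
FROM   adr_trans a
WHERE  a.transaction_id IS NOT NULL
LIMIT  _limit;
$$;

SELECT * FROM trans_with_logs('a2', 2);  -- should return the same two rows as the query above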
Now we are talking milliseconds, not seconds or minutes.
Also, the rCTE stops executing as soon as the outer LIMIT is satisfied. It hardly gets faster than that. I build a JSON array of objects (data type json) instead of an array of JSON values (data type json[]). Just a suggestion.
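To make the difference between the two result types concrete (a sketch against the sample data; pg_typeof just reports the type of an expression):
-- json_agg() returns one json value: a JSON array of objects
SELECT pg_typeof(json_agg(l)) AS agg_type
FROM  (SELECT address, value FROM logs WHERE transaction_id = 2) l;
-- -> json

-- ARRAY(SELECT json_build_object(...)) returns a Postgres array of json values
SELECT pg_typeof(ARRAY(SELECT json_build_object('address', address, 'value', value)
                       FROM logs WHERE transaction_id = 2)) AS array_type;
-- -> json[]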