PostgreSQL shows a strange performance difference between similar queries

Question

I have two Postgres tables defined by this schema:

create table tableA (id int not null primary key);
create table tableB (id int not null primary key, tableA_id int null, 
foreign key(tableA_id) references tableA(id));

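I also have an index on tableB.tableA_id.

Each table has about 50 million rows.

I have to get all rows from tableA that are not in tableB. I wrote this query (note that the real query includes more columns and joins, but I simplified it to make it clearer):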

select a.* from tableA a left join tableB b on a.id = b.tableA_id where b.id is null

This query is correct, but it took 11 hours to complete. After many attempts, I changed the query like this:

select a.* from tableA a left join tableB b on a.id = b.tableA_id where b.tableA_id is null

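I only changed the where condition: instead of b.id is null I wrote b.tableA_id is null. This query takes only 15 minutes to run.

Why do I get this difference? What is the real difference between these queries?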

Answer 1

Based on the details you provided, this is reproducible to some extent.

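The first one probably results in a plain old left join whose output then has to be filtered down to the rows where b.id is null. You're referencing two different columns of the right table, joining on one and filtering on the other. Unless the index carries the other column as payload (include), or you have an index covering both columns, you won't get a quick index-only scan: one of the columns isn't available in the index, so the database has to jump from the index to the heap (the table) to fetch it separately.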

 Gather  (cost=8984.12..20659.61 rows=1 width=4) (actual time=695.053..1301.412 rows=224552 loops=1)
   ->  Parallel Hash Left Join  (cost=7984.12..19659.51 rows=1 width=4) (actual time=627.898..932.214 rows=112276 loops=2)
         Output: a.id
         Hash Cond: (a.id = b.tablea_id)
         Filter: (b.id IS NULL)
         ->  Parallel Seq Scan on public.tablea a  (cost=0.00..5532.50 rows=331950 width=4) (actual time=0.008..90.720 rows=250000 loops=2)
               Output: a.id
         ->  Parallel Hash  (cost=4122.94..4122.94 rows=235294 width=8) (actual time=252.246..252.246 rows=200000 loops=2)
               Output: b.tablea_id, b.id
               ->  Parallel Seq Scan on public.tableb b  (cost=0.00..4122.94 rows=235294 width=8) (actual time=0.015..95.427 rows=200000 loops=2)
                     Output: b.tablea_id, b.id
 Execution Time: 1321.167 ms
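
As a rough sketch of the covering-index idea mentioned above: an index on the join column that also carries id as payload would make both referenced columns available from the index alone. The index name below is hypothetical, and whether the planner actually picks an index-only scan (or a different join strategy) for the first query is an assumption to verify with explain on your own data.

```sql
-- Hypothetical index name; INCLUDE requires PostgreSQL 11 or later.
-- Key column: tableA_id (used by the join); payload column: id (used by the filter).
create index tableb_tablea_id_incl_id_idx
    on tableB (tableA_id) include (id);

-- Refresh statistics and the visibility map so index-only scans can avoid heap fetches.
vacuum analyze tableB;
```

On a 50-million-row table, building this index takes time and disk space, so it is worth testing on a copy first.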

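The second one results in an "anti join", which only has to do half of the work. You're also referencing only one column of the right table, and it's the one you have indexed, which might let you upgrade from a sequential scan to an index-only scan.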

 Merge Anti Join  (cost=1.59..28842.63 rows=273998 width=4) (actual time=0.099..413.614 rows=224552 loops=1)
   Output: a.id
   Merge Cond: (a.id = b.tablea_id)
   ->  Index Only Scan using tablea_pkey on public.tablea a  (cost=0.42..12996.42 rows=500000 width=4) (actual time=0.082..118.190 rows=500000 loops=1)
         Output: a.id
         Heap Fetches: 0
   ->  Index Only Scan using tableb_tablea_id_idx on public.tableb b  (cost=0.42..9596.42 rows=400000 width=4) (actual time=0.014..124.488 rows=400000 loops=1)
         Output: b.tablea_id
         Heap Fetches: 0
 Execution Time: 441.053 ms
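
To compare the two plans yourself, something along these lines should work on a scratch database. The row counts and the random distribution of tableA_id are assumptions chosen for a quick test, not the data behind the plans shown here, and the plans you get may differ with your version and settings.

```sql
-- Scratch schema, same shape as in the question.
create table tableA (id int not null primary key);
create table tableB (id int not null primary key, tableA_id int null,
  foreign key (tableA_id) references tableA(id));
create index on tableB (tableA_id);

-- Assumed test data: 500k parent rows, 400k child rows pointing at random parents.
insert into tableA select generate_series(1, 500000);
insert into tableB
select n, 1 + (random() * 499999)::int
from generate_series(1, 400000) as n;

-- Update statistics and the visibility map before looking at plans.
vacuum analyze tableA;
vacuum analyze tableB;

-- Filter on the primary key of the right table.
explain (analyze, verbose)
select a.* from tableA a left join tableB b on a.id = b.tableA_id where b.id is null;

-- Filter on the join column of the right table.
explain (analyze, verbose)
select a.* from tableA a left join tableB b on a.id = b.tableA_id where b.tableA_id is null;
```

Look for whether the top join node reads "Left Join" with a Filter line or "Anti Join", and whether the scans underneath are sequential or index-only.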

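Ideally, PostgreSQL would notice that b.id is a non-nullable column, so the only case where it holds null is an unmatched row in the left join, which means you're effectively asking for the same anti join. Unfortunately, as of version 16, the planner doesn't read that deep into your statement.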
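
If you'd rather not depend on which column the planner sees in the where clause, one option is to state the anti join directly with not exists. This is a sketch against the simplified schema from the question; PostgreSQL generally plans not exists as an anti join, but it is worth confirming with explain on the real query, which has more columns and joins.

```sql
-- Rows of tableA that have no referencing row in tableB.
select a.*
from tableA a
where not exists (
    select 1
    from tableB b
    where b.tableA_id = a.id
);
```

This form touches only b.tableA_id, the indexed column, so it leaves the same room for an index-only scan as the faster of the two queries.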
