Redshift EXCEPT is much slower than LEFT JOIN

Question (votes: 1, answers: 1)

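I'm trying to compare a temp table ("new data") with another table ("existing data") to identify added/changed/removed rows, and ultimately to perform an upsert. This is an expensive operation: a full diff on a large dataset. I really wanted to use the EXCEPT command for its clear syntax, but I'm running into serious performance problems with it and find that a LEFT JOIN performs much better.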

The two tables have a similar number of rows and (almost) the same schema: the "second" table has one extra created_date column.

Both share distkey(date) sortkey(date, id1, id2); I even list the columns in the "correct" order in the EXCEPT statement to help the optimizer.

The query plans for each, on a test-sized subset of the data, are below.

explain
select date, id1, id2, id3, value, attr1, attr2, attr3 from new_data
except select date, id1, id2, id3, value, attr1, attr2, attr3 from existing_data;

XN SetOp Except  (cost=1000002817944.78..1000003266822.61 rows=1995013 width=1637)
  ->  XN Sort  (cost=1000002817944.78..1000002867820.09 rows=19950126 width=1637)
        Sort Key: date, id1, id2, id3, value, attr1, attr2, attr3
        ->  XN Append  (cost=0.00..399002.52 rows=19950126 width=1637)
              ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..199501.26 rows=9975063 width=1637)
                    ->  XN Seq Scan on new_data  (cost=0.00..99750.63 rows=9975063 width=1637)
              ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..199501.26 rows=9975063 width=1636)
                    ->  XN Seq Scan on existing_data  (cost=0.00..99750.63 rows=9975063 width=1636)

Compare that with my much uglier LEFT JOIN:

explain
select t1.* from new_data t1 
left outer join existing_data t2 on     
    t1.date = t2.date
    and t1.id1 = t2.id1
    and coalesce(t1.id2, -1) = coalesce(t2.id2, -1)
    and coalesce(t1.id3, -1) = coalesce(t2.id3, -1)
    and coalesce(t1.value, -1) = coalesce(t2.value, -1) 
    and coalesce(t1.attr1, '') = coalesce(t2.attr1, '')
    and coalesce(t1.attr2, '') = coalesce(t2.attr2, '')
    and coalesce(t1.attr3, '') = coalesce(t2.attr3, '')
where t2.id1 is null;

XN Merge Left Join DS_DIST_NONE  (cost=0.00..68706795.68 rows=9975063 width=1637)
  Merge Cond: (("outer".date = "inner".date) AND (("outer".id1)::bigint = "inner".id1))
  Join Filter: (((COALESCE("outer".id2, -1))::bigint = COALESCE("inner".id2, -1::bigint)) AND ((COALESCE("outer".id3, -1))::bigint = COALESCE("inner".id3, -1::bigint)) AND ((COALESCE("outer".value, -1::numeric))::double precision = COALESCE("inner".value, -1::double precision)) AND ((COALESCE("outer".attr1, ''::character varying))::text = (COALESCE("inner".attr1, ''::character varying))::text) AND ((COALESCE("outer".attr2, ''::character varying))::text = (COALESCE("inner".attr2, ''::character varying))::text) AND ((COALESCE("outer".attr3, ''::character varying))::text = (COALESCE("inner".attr3, ''::character varying))::text))
  Filter: ("inner".id1 IS NULL)
  ->  XN Seq Scan on new_data t1  (cost=0.00..99750.63 rows=9975063 width=1637)
  ->  XN Seq Scan on existing_data t2  (cost=0.00..99750.63 rows=9975063 width=1636)
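
The coalesce wrappers in the join above exist because of NULL handling: EXCEPT treats two NULLs as the same value, while `=` in a join condition does not. A minimal illustration, using made-up one-row inline subqueries (the column `k` and the aliases are hypothetical, not part of the tables in the question):

```sql
-- EXCEPT: the two NULLs are treated as not distinct, so no row is returned.
select 1 as k, cast(null as integer) as id2
except
select 1 as k, cast(null as integer) as id2;

-- Plain equality anti-join: NULL = NULL is unknown, so nothing matches and
-- the row is reported as missing from the right-hand side.
select t1.k, t1.id2
from (select 1 as k, cast(null as integer) as id2) t1
left outer join (select 1 as k, cast(null as integer) as id2) t2
    on t1.k = t2.k
    and t1.id2 = t2.id2
where t2.k is null;

-- Wrapping both sides in coalesce restores the match, so no row is returned.
select t1.k, t1.id2
from (select 1 as k, cast(null as integer) as id2) t1
left outer join (select 1 as k, cast(null as integer) as id2) t2
    on t1.k = t2.k
    and coalesce(t1.id2, -1) = coalesce(t2.id2, -1)
where t2.k is null;
```

One consequence of the sentinel approach is that a real -1 (or a real empty string) compares equal to NULL, which EXCEPT would keep apart.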

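The query costs are 1000003266822.61 vs 68706795.68. I know I'm not supposed to compare costs across queries, but the execution times bear this out. Any ideas why the EXCEPT statement is so much slower than the LEFT JOIN here?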

performance left-join amazon-redshift database-performance sql-except
1 answer
2 votes

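The left join generates a set of cross-joined rows for each (possibly ordered) key value and then filters out the ones it doesn't want via the on conditions. It can also stop once the (possibly ordered) old key value passes the new key value, since there can be no further matches; getting there also involves some reasoning through the coalesce expressions (SARG smarts). The except sorts everything first. In this case the sort costs more than generating and discarding rows and walking through every key's rows in the right-hand table. The optimizer could of course recognize the except idiom in its outer join planning, but apparently it doesn't.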
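
For comparison, the same anti-join can also be written with NOT EXISTS, which some find easier to read than the LEFT JOIN ... IS NULL form. This is only a sketch reusing the table and column names from the question; whether Redshift plans it as the same merge join is an assumption that would need to be checked with EXPLAIN on the real tables.

```sql
-- Sketch: rows in new_data with no counterpart in existing_data.
-- Join conditions are copied from the question's LEFT JOIN; the resulting
-- plan is NOT known here, so inspect the EXPLAIN output before relying on it.
explain
select t1.*
from new_data t1
where not exists (
    select 1
    from existing_data t2
    where t1.date = t2.date
      and t1.id1 = t2.id1
      and coalesce(t1.id2, -1) = coalesce(t2.id2, -1)
      and coalesce(t1.id3, -1) = coalesce(t2.id3, -1)
      and coalesce(t1.value, -1) = coalesce(t2.value, -1)
      and coalesce(t1.attr1, '') = coalesce(t2.attr1, '')
      and coalesce(t1.attr2, '') = coalesce(t2.attr2, '')
      and coalesce(t1.attr3, '') = coalesce(t2.attr3, '')
);
```

Note that neither join form is an exact replacement for EXCEPT: EXCEPT also removes duplicate rows from its result, while the join forms return every unmatched row of new_data, duplicates included.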

Related: PostgreSQL: NOT IN versus EXCEPT performance difference
