I have two tables with the columns shown below.
table1
col1 col2 col3 val
11 221 38 10
null 90 null 989
78 90 null 77
table2
col1 col2 col3
12 221 78
23 null 67
78 90 null
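I want to join the two tables on col1 first and stop if the value matches; if it does not, join on col2 and stop if that matches; otherwise join on col3. If any of the columns matches, populate val, otherwise leave it null, and put the name of the column that matched in the matchingcol column. So the output should look like this: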
col1 col2 col3 val matchingcol
11 221 38 10 col2
null 90 null null null
78 90 null 77 col1
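I was able to do this with the query below, but it performs very slowly. Please let me know if there is a better way to write it that would run faster.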
select *
from table1 t1 left join
table2 t2_1
on t2_1.col1 = t1.col1 left join
table2 t2_2
on t2_2.col2 = t1.col2 and t2_1.col1 is null
left join table2 t2_3 on t2_3.col3 = t1.col3 and t2_2.col2 is null
P.S. I asked the same question before but did not get a better answer.
What you are describing is:
select t1.col1, t1.col2, t1.col3,
       -- keep val only when one of the three joins found a match
       (case when t2_1.col1 is not null or t2_2.col2 is not null or t2_3.col3 is not null
             then t1.val
        end) as val,
       -- report the first column that matched, in priority order
       (case when t2_1.col1 is not null then 'col1'
             when t2_2.col2 is not null then 'col2'
             when t2_3.col3 is not null then 'col3'
        end) as matchingcol
from table1 t1 left join
     table2 t2_1
     on t2_1.col1 = t1.col1 left join
     table2 t2_2
     on t2_2.col2 = t1.col2 and t2_1.col1 is null left join
     table2 t2_3
     on t2_3.col3 = t1.col3 and t2_1.col1 is null and t2_2.col2 is null;
This is probably the best approach.
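To check the cascading join against the sample rows, here is a minimal sketch that loads them into an in-memory SQLite database from Python. SQLite is only a stand-in chosen for convenience and is an assumption; it is not the engine from the question, so this shows the matching logic and says nothing about performance. The names `conn`, `CASCADE_SQL` and `rows` are made up for the example.

```python
import sqlite3

# In-memory SQLite database loaded with the sample rows from the question.
# SQLite is an assumption here; it only stands in for the real engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (col1 int, col2 int, col3 int, val int);
    create table table2 (col1 int, col2 int, col3 int);
    insert into table1 values (11, 221, 38, 10), (null, 90, null, 989), (78, 90, null, 77);
    insert into table2 values (12, 221, 78), (23, null, 67), (78, 90, null);
""")

# Each later join is attempted only when all earlier joins found nothing.
CASCADE_SQL = """
select t1.col1, t1.col2, t1.col3,
       case when t2_1.col1 is not null or t2_2.col2 is not null or t2_3.col3 is not null
            then t1.val end as val,
       case when t2_1.col1 is not null then 'col1'
            when t2_2.col2 is not null then 'col2'
            when t2_3.col3 is not null then 'col3' end as matchingcol
from table1 t1
left join table2 t2_1 on t2_1.col1 = t1.col1
left join table2 t2_2 on t2_2.col2 = t1.col2 and t2_1.col1 is null
left join table2 t2_3 on t2_3.col3 = t1.col3 and t2_1.col1 is null and t2_2.col2 is null
order by t1.val
"""

rows = conn.execute(CASCADE_SQL).fetchall()
for row in rows:
    print(row)
```

On this data the row (null, 90, null) matches on col2, because 90 appears in table2.col2, so it comes back with val 989 and matchingcol 'col2'. The expected output in the question shows null for that row; if that is really what is wanted, the rule needs an extra condition that is not stated in the question.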
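If you rewrite the query as a series of INNER JOINs combined with UNION ALL and then rank the rows within each col1-colN partition, you may get better performance (at the cost of using additional resources). Something like: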
select x.col1, x.col2, x.col3, x.val, x.matchingcol
from (
    -- rank the candidates for each table1 row; lower preference wins
    -- (this assumes col1, col2, col3 together identify a table1 row)
    select col1, col2, col3, val, matchingcol,
           row_number() over (partition by col1, col2, col3 order by preference) bestmatch
    from (
        select t1.col1 col1, t1.col2 col2, t1.col3 col3, t1.val val,
               'col1' matchingcol, 1 preference
        from table1 t1 inner join table2 t2
             on t1.col1 = t2.col1
        union all
        select t1.col1 col1, t1.col2 col2, t1.col3 col3, t1.val val,
               'col2' matchingcol, 2 preference
        from table1 t1 inner join table2 t2
             on t1.col2 = t2.col2
        union all
        select t1.col1 col1, t1.col2 col2, t1.col3 col3, t1.val val,
               'col3' matchingcol, 3 preference
        from table1 t1 inner join table2 t2
             on t1.col3 = t2.col3
        union all
        -- fallback row so that unmatched table1 rows still appear
        select t1.col1 col1, t1.col2 col2, t1.col3 col3, cast(null as int) val,
               cast(null as string) matchingcol, 4 preference
        from table1 t1
    ) q
) x
where x.bestmatch = 1
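I think it may perform better because all branches of the UNION run in parallel, and a single final shuffle should beat the multiple sequential shuffles that your original query incurs. But of course other factors can affect the final result, such as resource availability, data volume, data shape, storage format and so on.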
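As a sanity check of the logic only (not of the performance claim), the UNION ALL plus row_number() variant can be run on the sample rows in an in-memory SQLite database from Python. This is a sketch under assumptions: SQLite stands in for the real engine, a SQLite build with window function support (3.25 or later) is assumed, the `cast(null as ...)` expressions are written as plain `null`, and `conn`, `UNION_SQL` and `rows` are names made up for the example.

```python
import sqlite3

# In-memory SQLite database loaded with the sample rows from the question.
# Assumes the bundled SQLite supports window functions (3.25 or later).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (col1 int, col2 int, col3 int, val int);
    create table table2 (col1 int, col2 int, col3 int);
    insert into table1 values (11, 221, 38, 10), (null, 90, null, 989), (78, 90, null, 77);
    insert into table2 values (12, 221, 78), (23, null, 67), (78, 90, null);
""")

# One inner join per candidate column, plus a fallback branch, then keep the
# lowest-preference candidate per (col1, col2, col3) partition.
UNION_SQL = """
select x.col1, x.col2, x.col3, x.val, x.matchingcol
from (
    select col1, col2, col3, val, matchingcol,
           row_number() over (partition by col1, col2, col3 order by preference) bestmatch
    from (
        select t1.col1 col1, t1.col2 col2, t1.col3 col3, t1.val val,
               'col1' matchingcol, 1 preference
        from table1 t1 inner join table2 t2 on t1.col1 = t2.col1
        union all
        select t1.col1, t1.col2, t1.col3, t1.val, 'col2', 2
        from table1 t1 inner join table2 t2 on t1.col2 = t2.col2
        union all
        select t1.col1, t1.col2, t1.col3, t1.val, 'col3', 3
        from table1 t1 inner join table2 t2 on t1.col3 = t2.col3
        union all
        select t1.col1, t1.col2, t1.col3, null, null, 4
        from table1 t1
    ) q
) x
where x.bestmatch = 1
order by x.val
"""

rows = conn.execute(UNION_SQL).fetchall()
for row in rows:
    print(row)
```

On the sample data this returns the same three rows as the cascading left join. The two forms can differ on other data: the ranking keeps exactly one row per (col1, col2, col3) combination, so table1 rows that share all three values would be collapsed, whereas the left join version returns one row per matching table2 row.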