I have a Hive table as follows:
select id, id_2, val from test order by id;
234 974 0.5
234 457 0.7
234 236 0.5
234 859 0.6
123 859 0.7
123 236 0.6
123 974 0.5
123 457 0.5
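I am trying to collect the data grouped by `id`. The collected values need to follow the same order in every row. My expected output is as follows (any order is fine, as long as it is the same for all rows):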
234 [974,457,236,859] [0.5,0.7,0.5,0.6]
123 [974,457,236,859] [0.5,0.5,0.6,0.7]
I used the `collect` UDF from Brickhouse.
select tmp.id, collect(id_2), collect(tmp.val) from
(select id, id_2, val from test
order by id) tmp
group by tmp.id
;
234 [974,457,236,859] [0.5,0.7,0.5,0.6]
123 [859,236,974,457] [0.7,0.6,0.5,0.5]
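As you can see, the order of the collected values is not consistent across rows. Is there any way to keep the order the same throughout the output? Any hints would be appreciated.
Use this query: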
select tmp.id, collect(id_2), collect(tmp.val) from
(select id, id_2, val from test
order by id desc, id_2 desc) tmp
group by tmp.id
;
The output is as follows:
234 [974,457,236,859] [0.5,0.7,0.5,0.6]
123 [974,457,236,859] [0.5,0.5,0.6,0.7]
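Basically, the only change is from `order by id` to `order by id desc, id_2 desc`.
Note that in SQL (and in Hive), a subquery or CTE (common table expression) does not preserve row order.
This query (using only standard Hive functions):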
with
cte_data_test as (
    select 234 as id, 974 as id_2, 0.5 as val union all
    select 234 as id, 457 as id_2, 0.7 as val union all
    select 234 as id, 236 as id_2, 0.5 as val union all
    select 234 as id, 859 as id_2, 0.6 as val union all
    select 123 as id, 859 as id_2, 0.7 as val union all
    select 123 as id, 236 as id_2, 0.6 as val union all
    select 123 as id, 974 as id_2, 0.5 as val union all
    select 123 as id, 457 as id_2, 0.5 as val
    order by rand() -- just to simulate that a CTE does not preserve order
)
select
    id,
    regexp_replace( -- remove the temporary prefix
        concat_ws( -- join the array elements with a separator
            ',',
            sort_array( -- sort on the temporary prefix
                collect_list(
                    concat(
                        '<<<',
                        lpad(id_2, 9, '0'), -- add a temporary, lexicographically sortable prefix
                        '>>>',
                        val
                    )
                )
            )
        ),
        '<<<[0-9]{9}>>>',
        ''
    ) as ordered_collect
from
    cte_data_test
group by
    id
will produce:
id | ordered_collect |
---|---|
123 | 0.6,0.5,0.7,0.5 |
234 | 0.5,0.7,0.6,0.5 |
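The result above contains only the `val` list, while the question asks for `id_2` and `val` in matching order. One possible extension of the same prefix-sorting idea is sketched below. It is only a sketch under assumptions: it reads from the `test` table in the question, it assumes `id_2` is a non-negative integer of at most 9 digits, and the names `sorted`, `pairs`, `id_2_ordered` and `val_ordered` are made up for illustration.

```sql
-- Sketch only: assumes the `test` table from the question and that id_2
-- is a non-negative integer with at most 9 digits.
with sorted as (
    select
        id,
        sort_array(                       -- sorts on the zero-padded id_2 key
            collect_list(
                concat(
                    lpad(cast(id_2 as string), 9, '0'),
                    ':',
                    cast(val as string)
                )
            )
        ) as pairs                        -- array of 'paddedkey:val' strings
    from test
    group by id
)
select
    id,
    -- drop ':val' from every element, leaving the (still zero-padded) keys
    regexp_replace(concat_ws(',', pairs), ':[^,]*', '')    as id_2_ordered,
    -- drop 'paddedkey:' from every element, leaving the values
    regexp_replace(concat_ws(',', pairs), '[0-9]{9}:', '') as val_ordered
from sorted
```

Because both columns are derived from the same sorted array, the position of each `id_2` key lines up with the position of its `val`. The keys keep their zero padding in this sketch; stripping it is left out to keep the example short.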
Note: wrap the result in the `split` function to get an array instead of a comma-separated string.
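As a minimal sketch of that last step, the expression from the query can be wrapped in `split` with the same `,` separator. The example assumes the `test` table from the question; the alias `ordered_collect_array` is hypothetical.

```sql
-- Sketch only: same prefix-sort technique, with split() applied at the end
-- so the column is an array<string> rather than one delimited string.
select
    id,
    split(
        regexp_replace(                   -- remove the temporary prefix
            concat_ws(
                ',',
                sort_array(               -- sort on the temporary prefix
                    collect_list(
                        concat(
                            '<<<',
                            lpad(cast(id_2 as string), 9, '0'),
                            '>>>',
                            cast(val as string)
                        )
                    )
                )
            ),
            '<<<[0-9]{9}>>>',
            ''
        ),
        ','                               -- same separator used in concat_ws
    ) as ordered_collect_array
from test
group by id
```

The elements of the resulting array are strings, so any numeric use of the values would need a cast on the individual elements.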