Suppose we have two files: prices and transactions. Prices has two columns, price and publishedTime (the time at which that price was published), like this:
price publishedTime
5.05 2020-01-01 11:00:06.122356
9.87 2020-01-01 11:00:05.289655
6.37 2020-01-01 11:00:05.111234
8.22 2020-01-01 11:00:04.242103
... (millions of rows)
Transactions has two columns, transactionID and transactionTime:
transactionID transactionTime
1001 2020-01-01 11:00:07.005477
2001 2020-01-01 11:00:06.110982
3005 2020-01-01 11:00:05.175564
4002 2020-01-01 11:00:05.152234
... (millions of rows)
For each transactionID, we want to find the latest price whose publishedTime is at or before the transactionTime. For the transactions above, the output should look like this:
transactionID transactionTime Price
1001 2020-01-01 11:00:07.005477 5.05
2001 2020-01-01 11:00:06.110982 9.87
3005 2020-01-01 11:00:05.175564 6.37
4002 2020-01-01 11:00:05.152234 6.37
... (millions of rows)
I solved this by unioning prices and transactions, sorting the combined timestamp column in descending order, and then walking the whole array with a tail-recursive function, picking up for each transaction the first price that follows it in the sorted order. My question is open-ended: what alternative or "better" solutions are there for this problem? SQL, Spark, etc.?
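A minimal sketch of that merge-scan idea in Python (my illustration, not the asker's actual tail-recursive code; for simplicity it sorts ascending and carries the latest price forward, which is equivalent to scanning the descending order):

```python
def asof_prices(prices, transactions):
    """For each (transactionID, transactionTime), return the latest price
    with publishedTime <= transactionTime (None if no such price exists).
    Timestamps are ISO-formatted strings, so string order == time order."""
    # Merge both tables into one event stream, sorted ascending by time.
    # A price row sorts before a transaction row with the same timestamp
    # (kind 0 < kind 1), so an equal-time price is still picked up.
    events = sorted(
        [(t, 0, p) for p, t in prices] +
        [(t, 1, tid) for tid, t in transactions]
    )
    out, last_price = [], None
    for time, kind, payload in events:
        if kind == 0:
            last_price = payload              # latest price seen so far
        else:
            out.append((payload, time, last_price))
    return out

prices = [(5.05, '2020-01-01 11:00:06.122356'),
          (9.87, '2020-01-01 11:00:05.289655'),
          (6.37, '2020-01-01 11:00:05.111234'),
          (8.22, '2020-01-01 11:00:04.242103')]
transactions = [(1001, '2020-01-01 11:00:07.005477'),
                (2001, '2020-01-01 11:00:06.110982'),
                (3005, '2020-01-01 11:00:05.175564'),
                (4002, '2020-01-01 11:00:05.152234')]
result = asof_prices(prices, transactions)
```

After the sort, this is a single O(n) pass over the combined stream.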
This is quite tricky in Spark. I think a union all approach with window functions will work:
with t as (
      select t.transactionTime, t.transactionId, null as price
      from transactions t
      union all
      select p.publishedTime as transactionTime, null as transactionId, p.price
      from prices p
     )
select t.*
from (select transactionTime, transactionId, max(price) over (partition by grp) as price
      from (select t.*, count(price) over (order by transactionTime) as grp
            from t
           ) t
     ) t
where transactionId is not null;
This interleaves the rows from the two tables, assigns the appropriate price to every transaction, and then filters back to just the transactions. It works because count(price) increments only on price rows, so each transaction row lands in the same grp as the most recent price published at or before it.
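Not part of the original answer, but the query can be sanity-checked on the sample data with SQLite, which has supported window functions since version 3.25 (table and column names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table prices(price real, publishedTime text);
    create table transactions(transactionID int, transactionTime text);
    insert into prices values
        (5.05, '2020-01-01 11:00:06.122356'),
        (9.87, '2020-01-01 11:00:05.289655'),
        (6.37, '2020-01-01 11:00:05.111234'),
        (8.22, '2020-01-01 11:00:04.242103');
    insert into transactions values
        (1001, '2020-01-01 11:00:07.005477'),
        (2001, '2020-01-01 11:00:06.110982'),
        (3005, '2020-01-01 11:00:05.175564'),
        (4002, '2020-01-01 11:00:05.152234');
""")

# Same union-all / window-function query as above: count(price) increments
# only on price rows, so each transaction shares a grp with the most recent
# price published at or before it.
rows = conn.execute("""
    with t as (
        select t.transactionTime, t.transactionID, null as price
        from transactions t
        union all
        select p.publishedTime as transactionTime, null as transactionID, p.price
        from prices p
    )
    select transactionID, transactionTime, price
    from (select transactionID, transactionTime,
                 max(price) over (partition by grp) as price
          from (select t.*, count(price) over (order by transactionTime) as grp
                from t) t
         ) t
    where transactionID is not null
    order by transactionID
""").fetchall()
```

The ISO-formatted timestamp strings sort lexicographically in chronological order, so storing them as text is sufficient for this demonstration.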