Using a condition with PARTITION BY in a window function in Snowflake

Question (votes: 0, answers: 2)

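Somewhat related: Filter a query based on date and window function in Snowflake

I need to create a query that counts how many times an id appears within a +/- 90 day window, similar to the query below but written as a window function. Is that possible?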

WITH fake_data(id, DATE) as (
    SELECT * FROM VALUES
    -- this id has visited once
    (1, '2022-04-14'::date),
    -- this id has visited 3 times
    (3, '2022-01-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-05-13'::date),
    -- this id is a frequent visitor
    (5, '2022-01-01'::date),
    (5, '2022-02-01'::date),
    (5, '2022-05-01'::date),
    (5, '2022-06-01'::date),
    (5, '2022-08-01'::date)
)
select * from (
    select 
    count_if("change" between -90 and 90) over (partition by ID, t1.DATE) "c",
    *
    from fake_data as t1
    left outer join lateral (
        select t1.DATE - t.DATE "change", t.DATE "t_DATE" 
        from fake_data AS t
        where t1.id = t.id and t1.DATE - t.DATE between -90 and 90
     ) as t2
    order by ID, t1.DATE, "change"
)
where "change" = 0;

Results (change and t_DATE are shown for reference only):

c  ID  DATE        change  t_DATE
1  1   2022-04-14  0       2022-04-14
2  3   2022-01-13  0       2022-01-13
3  3   2022-03-13  0       2022-03-13
2  3   2022-05-13  0       2022-05-13
2  5   2022-01-01  0       2022-01-01
3  5   2022-02-01  0       2022-02-01
3  5   2022-05-01  0       2022-05-01
3  5   2022-06-01  0       2022-06-01
2  5   2022-08-01  0       2022-08-01

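This is what I would like to write, but the date of the current row does not seem to be available inside the window (or is there a way to alias it?):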

select
  count_if(DATE - d between -90 and 90) over (partition by id, DATE  as d) as "c",
  id,
  date
from fake_data;

Tags: sql, snowflake-cloud-data-platform

2 answers

2 votes

Well,

your SQL, even if it is not in the form you want, can be written like this:

select 
    t1.DATE - t.DATE as change
    ,count_if(abs(t1.DATE - t.DATE) <= 90) over (partition by t1.ID, t1.DATE) as c
    ,t1.*
    ,t.date as t_date
from fake_data as t1
left join fake_data as t
    on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change

But since the join condition is the same as your count_if condition, it can also be written as:

select 
    t1.DATE - t.DATE as change
    ,count(*) over (partition by t1.ID, t1.DATE) as c
    ,t1.*
    ,t.date as t_date
from fake_data as t1
left join fake_data as t
    on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change

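But window functions have no notion of "this row" for a date range the way you want. You could work around that with a JavaScript UDTF: build an in-memory set of the rows, count against it for each row, emit the results in the finalize step, and then join back to them.

At that point you might as well explode the data and do an equi-join in plain SQL instead of the self-join on the +90/-90 day range, since for huge data the equi-join is still likely to be reasonably fast.

So for huge data this should perform better: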

WITH fake_data(id, DATE) as (
    SELECT * FROM VALUES
    -- this id has visited once
    (1, '2022-04-14'::date),
    -- this id has visited 3 times
    (3, '2022-01-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-05-13'::date),
    -- this id is a frequent visitor
    (5, '2022-01-01'::date),
    (5, '2022-02-01'::date),
    (5, '2022-05-01'::date),
    (5, '2022-06-01'::date),
    (5, '2022-08-01'::date)
), range as (
    select row_number() over (order by null)-91 as rn
    from table(generator(ROWCOUNT => 181))
), exploded as (
    select
        id, 
        dateadd('day', e.rn, d.date) as t_date
    from fake_data as d
    cross join range as e
)
select
    f.*
    ,count(t_date) as c
from fake_data as f
join exploded as e
    on f.id = e.id and f.date = t_date
group by f.id, f.date
order by f.id, f.DATE
;

ID  DATE        C
1   2022-04-14  1
3   2022-01-13  2
3   2022-03-13  3
3   2022-05-13  2
5   2022-01-01  2
5   2022-02-01  3
5   2022-05-01  3
5   2022-06-01  3
5   2022-08-01  2

0 votes

Snowflake now supports the RANGE clause in window functions. Sample data:

Input table: visitor data (visitor_data)

VISITOR_ID DATE_VISITED
1 2022-04-14
3 2022-01-13
3 2022-03-13
3 2022-05-13
5 2022-01-01
5 2022-02-01
5 2022-05-01
5 2022-06-01
5 2022-08-01

SQL query:

select 
  *
  , count(*) over (partition by visitor_id
      order by date_visited
      range between interval '90 day' preceding and current row
    ) as count_90_days
from visitor_data;

Query output:

VISITOR_ID DATE_VISITED COUNT_90_DAYS
3 2022-01-13 1
3 2022-03-13 2
3 2022-05-13 2
1 2022-04-14 1
5 2022-01-01 1
5 2022-02-01 2
5 2022-05-01 2
5 2022-06-01 2
5 2022-08-01 2
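
Note that the frame above looks back only 90 days, so its counts differ from the +/- 90 day counts in the question. A minimal sketch of a symmetric frame follows. It assumes the account supports interval offsets for both PRECEDING and FOLLOWING in a RANGE frame ordered by a date column. The inline visitor_data CTE and the count_plus_minus_90_days alias are only there to make the example self-contained.

```sql
-- Sketch only: symmetric +/- 90 day frame, assuming RANGE frames with
-- interval offsets are available for a date ORDER BY column.
with visitor_data(visitor_id, date_visited) as (
    select * from values
    (1, '2022-04-14'::date),
    (3, '2022-01-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-05-13'::date),
    (5, '2022-01-01'::date),
    (5, '2022-02-01'::date),
    (5, '2022-05-01'::date),
    (5, '2022-06-01'::date),
    (5, '2022-08-01'::date)
)
select
    visitor_id
    , date_visited
    -- counts every visit by the same visitor within 90 days before or
    -- after the current row's date, including the current row itself
    , count(*) over (partition by visitor_id
        order by date_visited
        range between interval '90 days' preceding
                  and interval '90 days' following
      ) as count_plus_minus_90_days
from visitor_data
order by visitor_id, date_visited;
```

With the sample data this should give the same counts as the self-join in the question: 1 for visitor 1, then 2, 3, 2 for visitor 3, and 2, 3, 3, 3, 2 for visitor 5.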