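Somewhat related: Filtering queries based on dates and window functions in Snowflake
I need to create a query that counts how many times each `id` appears within a ±90-day window. The query below does this, but I would like to write it as a window function. Is that possible?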
WITH fake_data(id, DATE) as (
SELECT * FROM VALUES
-- this id has visited once
(1, '2022-04-14'::date),
-- this id has visited 3 times
(3, '2022-01-13'::date),
(3, '2022-03-13'::date),
(3, '2022-05-13'::date),
-- this id is a heavy visitor
(5, '2022-01-01'::date),
(5, '2022-02-01'::date),
(5, '2022-05-01'::date),
(5, '2022-06-01'::date),
(5, '2022-08-01'::date)
)
select * from (
select
count_if("change" between -90 and 90) over (partition by ID, t1.DATE) "c",
*
from fake_data as t1
left outer join lateral (
select t1.DATE - t.DATE "change", t.DATE "t_DATE"
from fake_data AS t
where t1.id = t.id and t1.DATE - t.DATE between -90 and 90
) as t2
order by ID, t1.DATE, "change"
)
where "change" = 0;
Result (`change` and `t_DATE` are shown for reference only):
c | ID | DATE | change | t_DATE |
---|---|---|---|---|
1 | 1 | 2022-04-14 | 0 | 2022-04-14 |
2 | 3 | 2022-01-13 | 0 | 2022-01-13 |
3 | 3 | 2022-03-13 | 0 | 2022-03-13 |
2 | 3 | 2022-05-13 | 0 | 2022-05-13 |
2 | 5 | 2022-01-01 | 0 | 2022-01-01 |
3 | 5 | 2022-02-01 | 0 | 2022-02-01 |
3 | 5 | 2022-05-01 | 0 | 2022-05-01 |
3 | 5 | 2022-06-01 | 0 | 2022-06-01 |
2 | 5 | 2022-08-01 | 0 | 2022-08-01 |
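This is what I would like to write, but the current row's date does not seem to be available inside the window (or is there an alias I can use?):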
select
count_if(DATE - d between -90 and 90) over (partition by id, DATE as d) as "c",
id,
date
from fake_data;
Well, even if it is not quite what you want, your SQL can be written like this:
select
t1.DATE - t.DATE as change
,count_if(abs(t1.DATE - t.DATE) <= 90) over (partition by t1.ID, t1.DATE) as c
,t1.*
,t.date as t_date
from fake_data as t1
left join fake_data as t
on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change
But since the join condition is the same as your count_if condition, it can also be written as:
select
t1.DATE - t.DATE as change
,count(*) over (partition by t1.ID, t1.DATE) as c
,t1.*
,t.date as t_date
from fake_data as t1
left join fake_data as t
on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change
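However, window functions give you no way to refer to "this row" for the time range you want. You could work around that with a JavaScript UDTF: collect each partition's rows into an in-memory set, count against it, emit the results in finalize, and then join to that output.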
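For illustration, the UDTF idea could look roughly like the following. This is an untested sketch, not the answerer's code. The function name `count_within_90_days` is hypothetical, and passing each date as a day number since 1970-01-01 is an assumption made here (JavaScript UDFs have no integer type, so `FLOAT` is used). It also assumes `fake_data` is available as a table, or that the CTE from the question is prepended to the final query.

```sql
-- Hypothetical helper: for each row of a partition, count the rows within +/-90 days.
-- Dates are passed in as day numbers since 1970-01-01 (an assumption of this sketch).
create or replace function count_within_90_days(day_num float)
returns table (day_num float, c float)
language javascript
as
$$
{
    initialize: function (argumentInfo, context) {
        this.days = [];                  // in-memory store for the current partition
    },
    processRow: function (row, rowWriter, context) {
        this.days.push(row.DAY_NUM);     // collect only; nothing is emitted per row
    },
    finalize: function (rowWriter, context) {
        // Once the whole partition has been seen, count the neighbours of every row.
        for (var i = 0; i < this.days.length; i++) {
            var c = 0;
            for (var j = 0; j < this.days.length; j++) {
                if (Math.abs(this.days[i] - this.days[j]) <= 90) { c++; }
            }
            rowWriter.writeRow({DAY_NUM: this.days[i], C: c});
        }
    }
}
$$;

-- One partition per id; the emitted day number is converted back to a date.
select
    f.id
    ,dateadd('day', u.day_num::int, '1970-01-01'::date) as date
    ,u.c
from (
    select id, datediff('day', '1970-01-01'::date, date)::float as day_num
    from fake_data
) as f,
table(count_within_90_days(f.day_num) over (partition by f.id)) as u
order by 1, 2;
```

The nested loop is quadratic in the size of each partition, so this is unlikely to beat a plain SQL join on large partitions.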
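At that point you might as well explode the data and do an equi-join in plain SQL. On massive data that will probably still be rather fast compared with the self-join over the +90/-90-day range.
So for massive data, this should perform better: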
WITH fake_data(id, DATE) as (
SELECT * FROM VALUES
-- this id has visited once
(1, '2022-04-14'::date),
-- this id has visited 3 times
(3, '2022-01-13'::date),
(3, '2022-03-13'::date),
(3, '2022-05-13'::date),
-- this id is a heavy visitor
(5, '2022-01-01'::date),
(5, '2022-02-01'::date),
(5, '2022-05-01'::date),
(5, '2022-06-01'::date),
(5, '2022-08-01'::date)
), range as (
select row_number() over (order by null)-91 as rn
from table(generator(ROWCOUNT => 181))
), exploded as (
select
id,
dateadd('day', e.rn, d.date) as t_date
from fake_data as d
cross join range as e
)
select
f.*
,count(t_date) as c
from fake_data as f
join exploded as e
on f.id = e.id and f.date = t_date
group by f.id, f.date
order by f.id, f.DATE
;
ID | DATE | C |
---|---|---|
1 | 2022-04-14 | 1 |
3 | 2022-01-13 | 2 |
3 | 2022-03-13 | 3 |
3 | 2022-05-13 | 2 |
5 | 2022-01-01 | 2 |
5 | 2022-02-01 | 3 |
5 | 2022-05-01 | 3 |
5 | 2022-06-01 | 3 |
5 | 2022-08-01 | 2 |
Snowflake now supports the RANGE clause in window functions. Sample data (`visitor_data`):
VISITOR_ID | DATE_VISITED |
---|---|
1 | 2022-04-14 |
3 | 2022-01-13 |
3 | 2022-03-13 |
3 | 2022-05-13 |
5 | 2022-01-01 |
5 | 2022-02-01 |
5 | 2022-05-01 |
5 | 2022-06-01 |
5 | 2022-08-01 |
select
*
, count(*) over (partition by visitor_id
order by date_visited
range between interval '90 day' preceding and current row
) as count_90_days
from visitor_data;
VISITOR_ID | DATE_VISITED | COUNT_90_DAYS |
---|---|---|
3 | 2022-01-13 | 1 |
3 | 2022-03-13 | 2 |
3 | 2022-05-13 | 2 |
1 | 2022-04-14 | 1 |
5 | 2022-01-01 | 1 |
5 | 2022-02-01 | 2 |
5 | 2022-05-01 | 2 |
5 | 2022-06-01 | 2 |
5 | 2022-08-01 | 2 |
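Note that the frame above only looks back 90 days, which is why these counts differ from the ±90-day results earlier in the thread. A minimal sketch of the symmetric version follows. It assumes the same `visitor_data` table and that the RANGE frame accepts an interval offset on the FOLLOWING side as well as the PRECEDING side. The alias `count_plus_minus_90_days` is illustrative only.

```sql
select
    *
  , count(*) over (partition by visitor_id
                   order by date_visited
                   -- frame spans 90 days before through 90 days after the current row, inclusive
                   range between interval '90 day' preceding and interval '90 day' following
                  ) as count_plus_minus_90_days
from visitor_data
order by visitor_id, date_visited;
```

On the sample data this should give the same counts as the `c` column of the self-join and exploded-join queries, without joining the table to itself.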