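I have a table partitioned by ds: a list of dates with many features. Not all columns change every day, so most rows are just duplicates of the previous ones. I want to implement SCD2 from the existing (already partitioned) table
and get dt_start, the start of the period during which the record is current, and dt_end, the end of that period.
If the record is still current, then dt_end = NULL.
I thought of something like window functions:
ds as dt_start, __(ds) over(partition by user_id, country_id order by ds) as dt_end, ... group by all columns in the table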
CREATE TABLE public.app(
ds date NULL,
user_id int4 NULL,
country_id int2 NULL,
n_sessions_1d int2 NULL,
n_sessions_3d int2 NULL,
n_sessions_1w int2 NULL,
n_sessions_2w int2 NULL,
n_sessions_1m int2 NULL,
total_time_spent_1d int4 NULL,
total_time_spent_3d int4 NULL,
total_time_spent_1w int4 NULL,
total_time_spent_2w int4 NULL,
total_time_spent_1m int4 NULL,
is_subscription_1d int2 NULL,
is_subscription_3d int2 NULL
)
PARTITION BY RANGE (ds);
CREATE INDEX idx ON ONLY public.app USING btree (user_id, country_id);
You can use a fairly simple aggregation:
demo at db<>fiddle
select min(ds) as dt_start
, max(ds) as dt_end
, user_id,country_id,n_sessions_1d,n_sessions_3d,n_sessions_1w,n_sessions_2w,n_sessions_1m,total_time_spent_1d,total_time_spent_3d,total_time_spent_1w,total_time_spent_2w,total_time_spent_1m,is_subscription_1d,is_subscription_3d
from public.app
group by user_id,country_id,n_sessions_1d,n_sessions_3d,n_sessions_1w,n_sessions_2w,n_sessions_1m,total_time_spent_1d,total_time_spent_3d,total_time_spent_1w,total_time_spent_2w,total_time_spent_1m,is_subscription_1d,is_subscription_3d;
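The question also asks for dt_end = NULL while a record is still current, which the aggregation above does not produce. A minimal sketch of one way to get it, assuming that "current" means the group's last ds equals the most recent ds in the whole table (an assumption of this sketch, not something the question states):

```sql
-- Sketch: same grouping as above, but dt_end becomes NULL for groups
-- whose last ds is the latest ds in the table (assumed meaning of "current").
select min(ds) as dt_start
     , nullif(max(ds), (select max(ds) from public.app)) as dt_end
     , user_id, country_id
     , n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m
     , total_time_spent_1d, total_time_spent_3d, total_time_spent_1w
     , total_time_spent_2w, total_time_spent_1m
     , is_subscription_1d, is_subscription_3d
from public.app
group by user_id, country_id
       , n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m
       , total_time_spent_1d, total_time_spent_3d, total_time_spent_1w
       , total_time_spent_2w, total_time_spent_1m
       , is_subscription_1d, is_subscription_3d;
```

nullif(a, b) returns NULL when both arguments are equal and a otherwise, so closed periods keep their max(ds) and only the still-open ones are nulled.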
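Technically, your window function idea is doable, it's just much more complicated to pull off:
- Use first_value() as dt_start and last_value() as dt_end.
- Partition by every column except ds. That way, all identical rows on different dates share a partition.
- Specify the frame between unbounded preceding and unbounded following to override the default, between unbounded preceding and current row, which applies when there is an order by and no frame definition.
- Add distinct to keep only one row per group.
select distinct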
first_value(ds)over w1 as dt_start
, last_value(ds)over w1 as dt_end
, user_id,country_id,n_sessions_1d,n_sessions_3d,n_sessions_1w,n_sessions_2w,n_sessions_1m,total_time_spent_1d,total_time_spent_3d,total_time_spent_1w,total_time_spent_2w,total_time_spent_1m,is_subscription_1d,is_subscription_3d
from public.app
window w1 as (partition by user_id, country_id, n_sessions_1d,n_sessions_3d,n_sessions_1w,n_sessions_2w,n_sessions_1m,total_time_spent_1d,total_time_spent_3d,total_time_spent_1w,total_time_spent_2w,total_time_spent_1m,is_subscription_1d,is_subscription_3d
order by ds
rows between unbounded preceding
and unbounded following);
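One caveat applies to both queries: because they group (or partition) on the values alone, a set of values that goes away and later comes back ends up as a single period spanning the gap. If that matters, a gaps-and-islands variant can split the data into consecutive runs first. The following is only a sketch under stated assumptions: it treats (user_id, country_id) as the entity key, as in the question's own over(...) clause, assumes at most one row per key per ds, treats "consecutive" as adjacent rows present in the table rather than adjacent calendar days, and reuses the assumed "latest ds in the table means current" rule for dt_end. The names runs, run_id and w_same are made up for illustration.

```sql
-- Sketch: the difference between the two row numbers stays constant
-- while the values do not change, and shifts whenever another set of
-- values appears in between, so it identifies each consecutive run.
with runs as (
  select *
       , row_number() over (partition by user_id, country_id order by ds)
       - row_number() over w_same as run_id
  from public.app
  window w_same as (
    partition by user_id, country_id
               , n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m
               , total_time_spent_1d, total_time_spent_3d, total_time_spent_1w
               , total_time_spent_2w, total_time_spent_1m
               , is_subscription_1d, is_subscription_3d
    order by ds)
)
select min(ds) as dt_start
     -- NULL when the run reaches the latest ds in the table (assumed "current")
     , nullif(max(ds), (select max(ds) from public.app)) as dt_end
     , user_id, country_id
     , n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m
     , total_time_spent_1d, total_time_spent_3d, total_time_spent_1w
     , total_time_spent_2w, total_time_spent_1m
     , is_subscription_1d, is_subscription_3d
from runs
group by run_id, user_id, country_id
       , n_sessions_1d, n_sessions_3d, n_sessions_1w, n_sessions_2w, n_sessions_1m
       , total_time_spent_1d, total_time_spent_3d, total_time_spent_1w
       , total_time_spent_2w, total_time_spent_1m
       , is_subscription_1d, is_subscription_3d;
```

Grouping by run_id together with all the value columns keeps separated runs of identical values as separate SCD2 rows, each with its own dt_start and dt_end.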