I want to get the row number by frame (rather than by partition) from a rolling window in SQL/DuckDB.
Given this data:
customer_id,date
ca,2024-04-03
ca,2024-04-04
ca,2024-04-04
ca,2024-04-11
cb,2024-04-02
cb,2024-04-02
cb,2024-04-03
cb,2024-05-13
and this query:
SELECT
    customer_id,
    date,
    row_number() OVER win AS row_by_partition
FROM 'example.csv'
WINDOW win AS (
    PARTITION BY customer_id
    ORDER BY date ASC
    RANGE BETWEEN CURRENT ROW
        AND INTERVAL 1 WEEK FOLLOWING)
I get the row number by partition:
┌─────────────┬────────────┬──────────────────┐
│ customer_id │    date    │ row_by_partition │
│   varchar   │    date    │      int64       │
├─────────────┼────────────┼──────────────────┤
│ ca          │ 2024-04-03 │                1 │
│ ca          │ 2024-04-04 │                2 │
│ ca          │ 2024-04-04 │                3 │
│ ca          │ 2024-04-11 │                4 │
│ cb          │ 2024-04-02 │                1 │
│ cb          │ 2024-04-02 │                2 │
│ cb          │ 2024-04-03 │                3 │
│ cb          │ 2024-05-13 │                4 │
└─────────────┴────────────┴──────────────────┘
However, I want the row number by frame:
┌─────────────┬────────────┬──────────────┐
│ customer_id │    date    │ row_by_frame │
│   varchar   │    date    │    int64     │
├─────────────┼────────────┼──────────────┤
│ ca          │ 2024-04-03 │            1 │
│ ca          │ 2024-04-04 │            1 │
│ ca          │ 2024-04-04 │            2 │
│ ca          │ 2024-04-11 │            1 │
│ cb          │ 2024-04-02 │            1 │
│ cb          │ 2024-04-02 │            2 │
│ cb          │ 2024-04-03 │            1 │
│ cb          │ 2024-05-13 │            1 │
└─────────────┴────────────┴──────────────┘
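You can probably compute this in two steps: first, collect all the data within the frame; then compute the row index within that frame. I'm adding an example based on your data; you may need to adjust it depending on how unique your data is, because rows are grouped by the contents of their frame.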
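import duckdb

# Assumption: df holds the question's data, loaded here from example.csv.
# duckdb.sql() can refer to this relation by its Python variable name.
df = duckdb.read_csv("example.csv")

duckdb.sql("""
    with cte as (
        select
            customer_id,
            date,
            -- step 1: collect every date that falls in the current row's frame
            array_agg(date) over win as dates
        from df
        window win as (
            partition by customer_id
            order by date asc
            range between current row and interval 1 week following
        )
    )
    select
        customer_id,
        date,
        -- step 2: rows with identical frame contents share a partition
        row_number() over (partition by customer_id, dates) as row_by_frame
    from cte
""")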
┌─────────────┬────────────┬──────────────┐
│ customer_id │    date    │ row_by_frame │
│   varchar   │    date    │    int64     │
├─────────────┼────────────┼──────────────┤
│ cb          │ 2024-04-02 │            1 │
│ cb          │ 2024-04-02 │            2 │
│ cb          │ 2024-04-03 │            1 │
│ ca          │ 2024-04-04 │            1 │
│ ca          │ 2024-04-04 │            2 │
│ cb          │ 2024-05-13 │            1 │
│ ca          │ 2024-04-11 │            1 │
│ ca          │ 2024-04-03 │            1 │
└─────────────┴────────────┴──────────────┘
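If you want to sanity-check the numbers without DuckDB, the same two-step idea can be sketched in plain Python. This is only an illustration of what "row number within the frame" means for this window definition, not how DuckDB evaluates it; the function name `row_by_frame` and its `width` parameter are made up for this example.

```python
from datetime import date, timedelta

# Sample rows from the question, as (customer_id, date) tuples.
rows = [
    ("ca", date(2024, 4, 3)),
    ("ca", date(2024, 4, 4)),
    ("ca", date(2024, 4, 4)),
    ("ca", date(2024, 4, 11)),
    ("cb", date(2024, 4, 2)),
    ("cb", date(2024, 4, 2)),
    ("cb", date(2024, 4, 3)),
    ("cb", date(2024, 5, 13)),
]


def row_by_frame(rows, width=timedelta(weeks=1)):
    """Return (customer_id, date, 1-based index of the row inside its own frame)."""
    # Sort by the partition key, then by the ORDER BY column.
    # sorted() is stable, so rows with equal keys keep their input order.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    result = []
    for i, (cust, d) in enumerate(ordered):
        # Step 1: the frame is every row of the same customer whose date lies
        # in [d, d + width], mirroring
        # RANGE BETWEEN CURRENT ROW AND INTERVAL 1 WEEK FOLLOWING.
        frame = [
            j
            for j, (c, x) in enumerate(ordered)
            if c == cust and d <= x <= d + width
        ]
        # Step 2: position of the current row inside that frame.
        result.append((cust, d, frame.index(i) + 1))
    return result


for cust, d, n in row_by_frame(rows):
    print(cust, d, n)
```

For this particular frame, which starts at CURRENT ROW under RANGE, the frame begins at the first row sharing the current row's date, so the index is the row's position among rows with the same customer_id and date.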