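I'm looking to clean up event data that happens to have "duplicate" rows for a given date. For days that have more than one status, I want to remove rows based on the next day's `status` value. Currently I'm using BigQuery and multiple CTE steps with self-joins to iterate over the days with multiple events, eventually "correcting" each day so that it has a single `status` value.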
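I've tried recursive CTEs with self-joins, various window functions, and so on, but with no luck. BigQuery doesn't allow analytic functions, or GROUP BY, in recursive CTEs :(
See the example below, which walks through 2 iterations: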
# data has multiple instances of days with more than one status (* = duplicate)
| date | status |
|------------|----------|
| 2024-11-01 | active |*
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active |*
| 2024-11-03 | inactive |*
| 2024-11-04 | active |*
| 2024-11-04 | inactive |*
| 2024-11-05 | active |
# first iteration with removed rows (**)
| date | status |
|------------|----------|
| 2024-11-01 | active |** (2024-11-02 is inactive, so remove this row)
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-03 | inactive |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-04 | active |*
| 2024-11-04 | inactive |** (2024-11-05 is active, so remove this row)
| 2024-11-05 | active |
# second iteration with removed rows (***)
| date | status |
|------------|----------|
| 2024-11-01 | active |**
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active |*
| 2024-11-03 | inactive |*** (2024-11-04 has been deduped to active, so remove this row)
| 2024-11-04 | active |*
| 2024-11-04 | inactive |**
| 2024-11-05 | active |
# final desired set of deduplicated rows
| date | status |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active |
| 2024-11-04 | active |
| 2024-11-05 | active |
I can imagine that, depending on the size of the data, this would have to iterate N times. Is there a recursive way to solve this in SQL? Thanks!
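For dates that have more than one distinct status, set the status to `NULL`.
Use `FIRST_VALUE` to find the next known status for dates with a `NULL` status.
WITH a AS (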
  -- One row per date: keep the status if it is unambiguous, otherwise NULL.
  SELECT date, IF(COUNT(DISTINCT status) = 1, MIN(status), NULL) AS status
  FROM sample_data
  GROUP BY date
),
b AS (
  SELECT
    date,
    -- For NULL dates, take the first non-NULL status among the following dates.
    COALESCE(
      status,
      FIRST_VALUE(status IGNORE NULLS) OVER (
        ORDER BY date
        ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
      )
    ) AS final_status
  FROM a
)
SELECT date, final_status AS status
FROM b
ORDER BY date;
Output:
| date | status |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active |
| 2024-11-04 | active |
| 2024-11-05 | active |
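
Why a single pass can stand in for N iterations: each ambiguous date ends up with the status of the nearest later date that has exactly one status, so no repeated self-join is needed. The sketch below is not BigQuery code. It is a small Python check, on the sample data from the question, that the iterative procedure and a single backward pass agree. The function names `dedupe_iterative` and `dedupe_backward` are made up for illustration, and it assumes that the next date's status is always one of the candidate statuses of the ambiguous date, as in the example.

```python
from collections import defaultdict

# Sample rows from the question: (date, status).
rows = [
    ("2024-11-01", "active"), ("2024-11-01", "inactive"),
    ("2024-11-02", "inactive"),
    ("2024-11-03", "active"), ("2024-11-03", "inactive"),
    ("2024-11-04", "active"), ("2024-11-04", "inactive"),
    ("2024-11-05", "active"),
]


def group_by_date(rows):
    """Collect the set of distinct statuses seen on each date."""
    statuses = defaultdict(set)
    for date, status in rows:
        statuses[date].add(status)
    return statuses


def dedupe_iterative(rows):
    """Repeat the question's procedure until nothing changes."""
    statuses = group_by_date(rows)
    dates = sorted(statuses)
    changed = True
    while changed:
        changed = False
        for today, following in zip(dates, dates[1:]):
            # Resolve an ambiguous date once the following date is unambiguous.
            # Assumption: the following status is one of today's candidates.
            if len(statuses[today]) > 1 and len(statuses[following]) == 1:
                statuses[today] = set(statuses[following])
                changed = True
    # Dates that never resolve (ambiguous at the end of the data) become None.
    return {d: next(iter(statuses[d])) if len(statuses[d]) == 1 else None
            for d in dates}


def dedupe_backward(rows):
    """Single pass from the latest date back, carrying the next known status."""
    statuses = group_by_date(rows)
    result = {}
    next_known = None
    for date in sorted(statuses, reverse=True):
        if len(statuses[date]) == 1:
            next_known = next(iter(statuses[date]))
        result[date] = next_known
    return dict(sorted(result.items()))


for date, status in dedupe_backward(rows).items():
    print(date, status)
```

In both versions, a date with several statuses and no unambiguous date after it stays unresolved (`None` here, `NULL` in the query), so trailing duplicates would need a separate rule.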