Recursive CTE to remove duplicates


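I'm trying to clean up event data that happens to have "duplicate" rows on certain dates. For days that have more than one status, I want to remove rows based on the next day's `status` value. Currently I'm using BigQuery with several CTE steps and self-joins to iterate over the days with multiple events, eventually "correcting" each day so that it has a single `status` value.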

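I've tried recursive CTEs with self-joins, various window functions, and so on, without much luck. BigQuery doesn't allow analytic functions, or GROUP BY, inside a recursive CTE :(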

See the example below, which takes 2 iterations:

# data has multiple instances of days with more than one status (* = duplicate)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |*
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |*
| 2024-11-05 | active   |

# first iteration with removed rows (**)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |** (2024-11-02 is inactive, so remove this row)
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-03 | inactive |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |** (2024-11-05 is active, so remove this row)
| 2024-11-05 | active   |

# second iteration with removed rows (***)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |**
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*** (2024-11-04 has been deduped to active, so remove this row)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |**
| 2024-11-05 | active   |

# final desired set of deduplicated rows
| date       | status   |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active   |
| 2024-11-04 | active   |
| 2024-11-05 | active   |
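
To pin down the rule the tables above illustrate, here is a minimal procedural sketch in Python (not SQL). The function name `dedupe_by_next_day` is hypothetical, and the sketch makes two assumptions: "next day" means the next date present in the data, and a day is only resolved when the next day's single status is one of that day's own statuses.

```python
from collections import defaultdict


def dedupe_by_next_day(rows):
    """Repeatedly drop rows from ambiguous days until a pass removes nothing.

    rows: iterable of (date_string, status) pairs; ISO dates sort correctly as text.
    Returns (resolved_rows, number_of_passes_that_removed_rows).
    """
    statuses = defaultdict(set)
    for date, status in rows:
        statuses[date].add(status)
    dates = sorted(statuses)

    passes = 0
    while True:
        # Collect removals first so every decision in a pass sees the same snapshot.
        removals = {}
        for today, tomorrow in zip(dates, dates[1:]):
            if len(statuses[today]) > 1 and len(statuses[tomorrow]) == 1:
                keep = statuses[today] & statuses[tomorrow]
                if keep:  # assumption: skip days that lack the next day's status
                    removals[today] = keep
        if not removals:
            break
        statuses.update(removals)
        passes += 1

    return [(d, s) for d in dates for s in sorted(statuses[d])], passes


sample = [
    ("2024-11-01", "active"), ("2024-11-01", "inactive"),
    ("2024-11-02", "inactive"),
    ("2024-11-03", "active"), ("2024-11-03", "inactive"),
    ("2024-11-04", "active"), ("2024-11-04", "inactive"),
    ("2024-11-05", "active"),
]
resolved, passes = dedupe_by_next_day(sample)
```

On the sample data this needs two passes that remove rows, matching the two iterations shown, because 2024-11-03 can only be resolved after 2024-11-04 has been.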

I can imagine that, depending on the size of the data, this would have to iterate N times. Is there a recursive way to solve this in SQL? Thanks!

sql recursion google-bigquery recursive-query recursive-cte
1 Answer
  • CTE "a" sets the status to NULL for dates that have multiple statuses.
  • CTE "b" uses FIRST_VALUE to find the next known status for dates whose status is NULL.
WITH a AS (
  -- One row per date; status is NULL when the date has more than one distinct status.
  SELECT date, IF(COUNT(DISTINCT status) = 1, MIN(status), NULL) AS status
  FROM sample_data
  GROUP BY date
),
b AS (
  SELECT
    date,
    -- Keep a known status; otherwise take the first non-NULL status on a later date.
    COALESCE(
      status,
      FIRST_VALUE(status IGNORE NULLS) OVER (
        ORDER BY date
        ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
      )
    ) AS final_status
  FROM a
)
SELECT date, final_status AS status
FROM b
ORDER BY date;
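
For readers who want to check the logic outside BigQuery, the two steps can be mirrored in a few lines of Python. This is only a sketch of the same idea, not a translation of the query engine's behavior; the function name `resolve_statuses` is hypothetical.

```python
from collections import defaultdict


def resolve_statuses(rows):
    """Blank out ambiguous days, then fill each from the next known day."""
    by_date = defaultdict(set)
    for date, status in rows:
        by_date[date].add(status)

    # Step "a": one entry per date; None when the date has several distinct statuses.
    known = {d: (next(iter(s)) if len(s) == 1 else None) for d, s in by_date.items()}

    # Step "b": walk from the latest date to the earliest, carrying the
    # nearest later known status into any date that has none of its own.
    resolved = {}
    next_known = None
    for date in sorted(known, reverse=True):
        if known[date] is not None:
            next_known = known[date]
        resolved[date] = next_known
    return sorted(resolved.items())


sample = [
    ("2024-11-01", "active"), ("2024-11-01", "inactive"),
    ("2024-11-02", "inactive"),
    ("2024-11-03", "active"), ("2024-11-03", "inactive"),
    ("2024-11-04", "active"), ("2024-11-04", "inactive"),
    ("2024-11-05", "active"),
]
result = resolve_statuses(sample)
```

One consequence worth noting: an ambiguous date with no later unambiguous date stays unresolved (None here, NULL in the query), since there is no following status to borrow.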

Output:

| date       | status   |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active   |
| 2024-11-04 | active   |
| 2024-11-05 | active   |