How do I find the first row in every two-minute window?

Question

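I am using Snowflake, and I have a dataset containing every call received in a day. Some of these calls are considered invalid because they are repeated attempts made within 2 minutes.

Here is an example: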

User	Call Date
1	2024-01-19 16:24:11
1	2024-01-19 16:27:29
1	2024-01-19 16:27:34
1	2024-01-19 16:27:38
1	2024-01-19 16:27:43
1	2024-01-19 16:29:29
2	2024-01-19 11:08:49
2	2024-01-19 11:09:32
2	2024-01-19 11:12:24
2	2024-01-19 11:14:49

The desired result looks like this:

User	Call Date
1	2024-01-19 16:24:11
1	2024-01-19 16:27:29
1	2024-01-19 16:29:29
2	2024-01-19 11:08:49
2	2024-01-19 11:12:24
2	2024-01-19 11:14:49

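Each user's first call is counted, and any call within 2 minutes after that first call is treated as invalid.

I am currently iterating over the data in Python, but I would like to know whether there is a way to do this in Snowflake to save time.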
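
For reference, a minimal sketch of what such a Python iteration might look like. It assumes the rows are available in memory as (user, timestamp) tuples; the function name `keep_first_per_window` and the helper `t` are illustrative and not taken from the asker's code. A call is kept when it is the user's first, or when at least 2 minutes have passed since that user's last kept call.

```python
from datetime import datetime, timedelta

def keep_first_per_window(calls, window=timedelta(minutes=2)):
    """Keep a call if it is the user's first, or at least `window` after the last kept call."""
    last_kept = {}  # user -> timestamp of that user's most recent kept call
    kept = []
    for user, ts in sorted(calls):  # sort by user, then by time
        anchor = last_kept.get(user)
        if anchor is None or ts - anchor >= window:
            kept.append((user, ts))
            last_kept[user] = ts
    return kept

def t(hms):
    # Hypothetical helper: all sample calls are on 2024-01-19.
    return datetime.strptime("2024-01-19 " + hms, "%Y-%m-%d %H:%M:%S")

calls = [
    (1, t("16:24:11")), (1, t("16:27:29")), (1, t("16:27:34")),
    (1, t("16:27:38")), (1, t("16:27:43")), (1, t("16:29:29")),
    (2, t("11:08:49")), (2, t("11:09:32")), (2, t("11:12:24")),
    (2, t("11:14:49")),
]
result = keep_first_per_window(calls)
```

On the sample above this keeps the six rows listed in the desired result.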

Answer 1

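In fact, you can do this directly in Snowflake with a SQL query, with no need to iterate in Python. The idea is:

Handle each user's calls separately: partition the calls by user so that each call is compared only with that user's own calls.
Sort the calls by time: this lets us identify the call immediately before each call.
Compute the time difference: for each call, check how much time has passed since the previous call. If it is the first call, or at least 2 minutes (120 seconds) have passed, treat it as valid.
Filter out invalid calls: drop any call that comes less than 2 minutes after the call before it.

Here is the SQL query: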

WITH RankedCalls AS (
    SELECT
        user,
        "Call Date",
        LAG("Call Date") OVER (PARTITION BY user ORDER BY "Call Date") AS prev_call,
        CASE
            WHEN LAG("Call Date") OVER (PARTITION BY user ORDER BY "Call Date") IS NULL THEN 'valid'
            WHEN DATEDIFF('second', LAG("Call Date") OVER (PARTITION BY user ORDER BY "Call Date"), "Call Date") >= 120 THEN 'valid'
            ELSE 'invalid'
        END AS call_status
    FROM your_table_name
),
FilteredCalls AS (
    SELECT user, "Call Date"
    FROM RankedCalls
    WHERE call_status = 'valid'
)
SELECT * 
FROM FilteredCalls
ORDER BY user, "Call Date";

This returns only the calls that are a user's first call or that come at least 2 minutes after that user's previous call.
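
To see how a previous-call comparison behaves on the question's sample, here is a small Python sketch that mirrors the CASE logic of the query (the function and helper names are illustrative). On user 1's calls, 16:29:29 is compared with 16:27:43, which is only 106 seconds earlier, so it is dropped, whereas the desired result in the question keeps it because it is 120 seconds after 16:27:29.

```python
from datetime import datetime

def valid_by_previous_call(timestamps, min_gap_seconds=120):
    """Mirror of the LAG-based CASE: compare each call with the one right before it."""
    ordered = sorted(timestamps)
    kept = []
    for i, ts in enumerate(ordered):
        # First call has no predecessor (LAG is NULL), so it is valid.
        if i == 0 or (ts - ordered[i - 1]).total_seconds() >= min_gap_seconds:
            kept.append(ts)
    return kept

def t(hms):
    # Hypothetical helper: all sample calls are on 2024-01-19.
    return datetime.strptime("2024-01-19 " + hms, "%Y-%m-%d %H:%M:%S")

user1 = [t("16:24:11"), t("16:27:29"), t("16:27:34"),
         t("16:27:38"), t("16:27:43"), t("16:29:29")]
kept = valid_by_previous_call(user1)
```

If each call should instead be measured against the last kept call, the filter needs running state, which is what the other answers in this thread aim for.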

Answer 2

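With a recursive CTE

Another, possibly more portable, option is to use a recursive CTE.
Here is a PostgreSQL-flavored recursion (a fully working example is implemented as solution 3 in my SQLFiddle):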

with recursive
    -- Index our entries:
    i as (select row_number() over (partition by u order by d) id, * from t),
    -- Know whose election as a "first" will make each entry redundant for good.
    l as
    (
        -- "pass" increments to keep hesitating entries queued for the next iteration
        -- "kind":
        --   1: first of a series
        --   0: don't know yet if first of a series or not
        --  -1: confirmed duplicate (follows a confirmed first)
        --  -2: first, but finished (has no more followers to evaluate)
        select *, 0 pass, 0 kind from i
            union
        select id, u, d, pass + 1,
            case
                when kind = 0 then
                    case
                        -- Is the preceding entry more than 2 mn before? We're a first!
                        when coalesce(lag(d) over byu < d - interval '2 minutes', true) then 1
                        -- Else (if preceding is less than 2 mn away), if said preceding entry is itself a first, we're a duplicate.
                        when lag(kind) over byu = 1 then -1
                        -- Still not sure.
                        else 0
                    end
                -- If we are a first but have no more followers waiting for us, get away.
                when kind = 1 and coalesce(lead(kind) over byu <> 0, true) then -2
                else kind
            end
        from l
        where kind >= 0 -- Only work with confirmed firsts, and still hesitating ones.
        and pass < 99 -- In case I missed something...
        window byu as (partition by u order by d)
    )
--select * from l;
select u, d from i where (u, id) in (select u, id from l where kind = -2)
order by u, d;

Note that it is somewhat complex and horribly unoptimized because of the recursion (… not exists (select 1 from accu where alreadyConfirmedAsAFirst)).
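
The underlying rule can also be pictured as a simpler recursion that jumps from one "first" to the next. The Python sketch below is not a translation of the CTE above; it only illustrates the rule, assuming one user's timestamps fit in a sorted in-memory list. The names `firsts` and `t` are illustrative.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def firsts(sorted_ts, start=0, window=timedelta(minutes=2)):
    """Keep sorted_ts[start], then recurse from the earliest timestamp
    that is at least `window` later. Recursion depth equals the number
    of kept rows, so this is only suitable for small lists."""
    if start >= len(sorted_ts):
        return []
    anchor = sorted_ts[start]
    nxt = bisect_left(sorted_ts, anchor + window, lo=start + 1)
    return [anchor] + firsts(sorted_ts, nxt, window)

def t(hms):
    # Hypothetical helper: all sample calls are on 2024-01-19.
    return datetime.strptime("2024-01-19 " + hms, "%Y-%m-%d %H:%M:%S")

user2 = [t("11:08:49"), t("11:09:32"), t("11:12:24"), t("11:14:49")]
kept = firsts(user2)
```

For user 2 of the question this keeps 11:08:49, 11:12:24 and 11:14:49.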

Answer 3

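With a custom aggregate

By defining a custom aggregate function, the goal can be achieved in SQL: the aggregate (quite intuitively) remembers the last time an entry was awarded the "first of the series" badge,
so that for each following entry it can decide whether that entry becomes a new "first" or still falls within the 2-minute DMZ after a "first".

You can see it validated against edge cases by the added test cases (solution 2 in the fiddle).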

create type firstAndWhen as (first bool, lastFirst timestamp);
create function firstSinceNSecondsAgg(accu firstAndWhen, new timestamp, nSeconds int) returns firstAndWhen language sql as
$$
    select
        coalesce(new >= accu.lastFirst + cast(nSeconds||' seconds' as interval), true),
        case when coalesce(new >= accu.lastFirst + cast(nSeconds||' seconds' as interval), true)
            then new
            else accu.lastFirst
        end
$$;
create function firstSinceNSecondsPop(accu firstAndWhen) returns bool language sql as 'select accu.first';
create aggregate firstSinceNSeconds(timestamp, int)
(
    stype = firstAndWhen,
    initcond = '(false,)',
    sfunc = firstSinceNSecondsAgg,
    finalfunc = firstSinceNSecondsPop
);

with l as (select *, firstSinceNSeconds(d, 120) over (partition by u order by d) ok from t)
select u, d from l where ok order by u, d;
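
The state transition of the aggregate above can be pictured as a fold over one user's ordered timestamps. The following Python sketch is an analogy under that assumption, not a Snowflake or PostgreSQL feature; `step` plays the role of firstSinceNSecondsAgg, and the names `step` and `t` are illustrative.

```python
from datetime import datetime, timedelta
from itertools import accumulate

def step(state, new, n_seconds=120):
    """Analogous to firstSinceNSecondsAgg: state is (is_first, last_first)."""
    _, last_first = state
    # No previous "first" yet, or the window since the last "first" has elapsed.
    is_first = last_first is None or new >= last_first + timedelta(seconds=n_seconds)
    return (is_first, new if is_first else last_first)

def t(hms):
    # Hypothetical helper: all sample calls are on 2024-01-19.
    return datetime.strptime("2024-01-19 " + hms, "%Y-%m-%d %H:%M:%S")

user1 = [t("16:24:11"), t("16:27:29"), t("16:27:34"),
         t("16:27:38"), t("16:27:43"), t("16:29:29")]

# initial=(False, None) mirrors initcond '(false,)'; drop it from the output.
states = list(accumulate(user1, step, initial=(False, None)))[1:]
kept = [ts for ts, (is_first, _) in zip(user1, states) if is_first]
```

Each row's flag corresponds to the `ok` column of the windowed aggregate, so filtering on it keeps 16:24:11, 16:27:29 and 16:29:29 for user 1.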

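Pros and cons:

+++ It always keeps the first entry once the 2-minute period has elapsed (as expected)
+++ And it does not keep too much either: given
A1 - A2 - A3 - B1 - B2 - B3 (where A and B are "valid" timestamps, and 1 - 2 - 3 are the occurrences, of which we only want to keep 1), where B1 >= A1 + 2 min but B1, B2 and B3 < A3 + 2 min, it correctly keeps only A1, B1,
not A1, B1, B2, B3,
nor A1, B3,
as our first attempts did.
--- PostgreSQL only for now. I could not find a generic SQL way (and I do not know other dialects well enough, so if someone wants to try...)
--- I did not look at optimization (I hope PostgreSQL "naturally" iterates over the whole set only once, rather than restarting from the beginning for each entry)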
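
The A/B case above can be made concrete with made-up offsets in seconds that satisfy the stated conditions (B1 >= A1 + 120, and B1, B2, B3 < A3 + 120). In this Python sketch, `keep_vs_last_first` follows the rule of the aggregate, while `keep_vs_very_first` is only one guess at a rule that keeps too much; it is not necessarily what the earlier attempts did. All names and numbers are illustrative.

```python
def keep_vs_last_first(offsets, window=120):
    """Keep an entry if it is at least `window` seconds after the last kept entry."""
    kept = []
    for s in sorted(offsets):
        if not kept or s >= kept[-1] + window:
            kept.append(s)
    return kept

def keep_vs_very_first(offsets, window=120):
    """Keep the first entry plus every entry at least `window` seconds after it."""
    ordered = sorted(offsets)
    return [s for s in ordered if s == ordered[0] or s >= ordered[0] + window]

# Hypothetical offsets: A1=0, A2=50, A3=100, B1=130, B2=150, B3=210
offsets = [0, 50, 100, 130, 150, 210]

by_last_first = keep_vs_last_first(offsets)   # A1, B1
by_very_first = keep_vs_very_first(offsets)   # A1, B1, B2, B3
```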
