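I have a few million sensors that continuously run health checks and send their data to a server every 5 minutes.
My task is to store these data points and, once an hour, generate a report of the sensors that failed to report.
Problem:
Sample data: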
"point1" "12-2-19T00:00"
"point2" "12-2-19T00:00"
"point1" "12-2-19T00:05" #missing point2
"point1" "12-2-19T00:10"
"point2" "12-2-19T00:10"
I need to find point2, which did not report at 00:05.
Below is for BigQuery Standard SQL:
#standardSQL
WITH temp AS (
  SELECT point, PARSE_TIMESTAMP('%d-%m-%yT%H:%M', dt) dt
  FROM `project.dataset.table`
), points AS (
  SELECT DISTINCT point FROM temp
), times AS (
  SELECT dt
  FROM (SELECT MIN(dt) min_dt, MAX(dt) max_dt FROM temp),
  UNNEST(GENERATE_TIMESTAMP_ARRAY(min_dt, max_dt, INTERVAL 5 MINUTE)) dt
)
SELECT
  point,
  FORMAT_DATETIME('%d-%m-%yT%H:%M', DATETIME(dt)) dt,
  IF(t.point IS NULL, 'missing', 'ok') status
FROM times CROSS JOIN points
LEFT JOIN temp t USING(dt, point)
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'point1' point, '12-2-19T00:00' dt UNION ALL
  SELECT 'point2', '12-2-19T00:00' UNION ALL
  SELECT 'point1', '12-2-19T00:05' UNION ALL -- #missing point2
  SELECT 'point1', '12-2-19T00:10' UNION ALL
  SELECT 'point2', '12-2-19T00:10'
), temp AS (
  SELECT point, PARSE_TIMESTAMP('%d-%m-%yT%H:%M', dt) dt
  FROM `project.dataset.table`
), points AS (
  SELECT DISTINCT point FROM temp
), times AS (
  SELECT dt
  FROM (SELECT MIN(dt) min_dt, MAX(dt) max_dt FROM temp),
  UNNEST(GENERATE_TIMESTAMP_ARRAY(min_dt, max_dt, INTERVAL 5 MINUTE)) dt
)
SELECT
  point,
  FORMAT_DATETIME('%d-%m-%yT%H:%M', DATETIME(dt)) dt,
  IF(t.point IS NULL, 'missing', 'ok') status
FROM times CROSS JOIN points
LEFT JOIN temp t USING(dt, point)
-- ORDER BY dt, point
with result
Row  point   dt              status
1    point1  12-02-19T00:00  ok
2    point2  12-02-19T00:00  ok
3    point1  12-02-19T00:05  ok
4    point2  12-02-19T00:05  missing
5    point1  12-02-19T00:10  ok
6    point2  12-02-19T00:10  ok
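
If only the gaps matter for the report, the same approach can be narrowed to the missing rows. The following is a minimal sketch under the same assumptions as the query above: `project.dataset.table` is a placeholder table name, and `dt` is stored as a string in `%d-%m-%yT%H:%M` format. The aliases `s`, `p`, and `missing_at` are introduced here for illustration only.

```sql
#standardSQL
-- Sketch: return only the (point, time slot) pairs with no report.
-- `project.dataset.table` is a placeholder; s, p, missing_at are illustrative names.
WITH temp AS (
  SELECT point, PARSE_TIMESTAMP('%d-%m-%yT%H:%M', dt) dt
  FROM `project.dataset.table`
), points AS (
  -- every sensor that reported at least once
  SELECT DISTINCT point FROM temp
), times AS (
  -- every expected 5-minute slot between the first and last report
  SELECT dt
  FROM (SELECT MIN(dt) min_dt, MAX(dt) max_dt FROM temp),
  UNNEST(GENERATE_TIMESTAMP_ARRAY(min_dt, max_dt, INTERVAL 5 MINUTE)) dt
)
SELECT
  p.point,
  FORMAT_TIMESTAMP('%d-%m-%yT%H:%M', s.dt) missing_at
FROM times s
CROSS JOIN points p
LEFT JOIN temp t
  ON t.dt = s.dt AND t.point = p.point
WHERE t.point IS NULL  -- keep only slots with no matching report
ORDER BY s.dt, p.point
```

Run against the sample data from the question, this should return a single row: point2 at 12-02-19T00:05. Two caveats apply. A sensor that never reported at all does not appear in `points`, so catching it would need a separate list of expected sensors. For an hourly report, the MIN/MAX bounds in `times` could presumably be replaced with the boundaries of the hour being reported, for example derived from TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR).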