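I am looking for the best X% of a data set, where "best" is defined as having the smallest sum of values. I can get the result I want by running a series of tests like this: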
SELECT
-- Analyze results manually, looking for a testXX value which is close to X%.
-- If needed, edit the query for higher precision and try again.
1.0 * count_if(f1+f2 < 0.1)/count(1) AS test01,
1.0 * count_if(f1+f2 < 0.2)/count(1) AS test02,
...
FROM table1
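I tried joining against a SEQUENCE to cut down on the copy and paste, but I couldn't get it to work; all it did was make the query use more memory. Here is what I tried: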
SELECT 1.0 * count_if(f1+f2 < threshold)/count(1) AS test
FROM table1
JOIN (SELECT t.v/100.0 AS threshold FROM UNNEST(SEQUENCE(20, 80, 1)) t(v))
ON true
What I really want is a query that automatically finds a threshold equal to X +/- some epsilon or, better still, a threshold that is as close to X as possible.
Simplified sample data
f1 f2
0.04 0.05
0.02 0.07
0.02 0.69
0.1 0.1
0.1 0.3
0.1 0.4
0.1 0.5
0.1 0.6
0.1 0.7
0.1 0.8
If my target X is 0.3, I would expect a threshold of around 0.09, because 30% of the f1+f2 values are <= 0.09. The real data set has tens of millions of rows with far more random values. If I want a 30% slice, it's okay if it's actually 30.2% or 29.8%.
CREATE TABLE sample (
f1 DECIMAL(4,3),
f2 DECIMAL(4,3)
);
INSERT INTO
sample
VALUES
(0.04, 0.05),
(0.02, 0.07),
(0.02, 0.069), -- I changed this value
(0.1 , 0.1),
(0.1 , 0.3),
(0.1 , 0.4),
(0.1 , 0.5),
(0.1 , 0.6),
(0.1 , 0.7),
(0.1 , 0.8);
WITH
ranked AS
(
SELECT
*,
f1+f2 AS x,
ROW_NUMBER()
OVER (ORDER BY f1+f2, f1, f2)
*
1.0
/
COUNT(*) OVER ()
AS percentile
FROM
sample
)
SELECT
MAX(CASE WHEN percentile <= 0.3 THEN x END),
MIN(CASE WHEN percentile > 0.3 THEN x END)
FROM
ranked
max | min |
---|---|
0.090 | 0.200 |
The 30% cutoff can be any value from 0.090 up to (but not including) 0.200.
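For anyone who wants to try the ranking idea without a database server, here is a minimal sketch that runs an equivalent query through Python's built-in sqlite3 module. Several things are assumptions of this sketch rather than part of the query above: the helper name cutoff_range, storing the values as integer thousandths (SQLite has no exact DECIMAL type), passing the target as a whole percentage, and comparing ranks with integer arithmetic instead of dividing to get a fractional percentile. It needs an SQLite build with window functions (3.25 or newer).

```python
import sqlite3

# The ten sample rows from the INSERT above, scaled to integer thousandths
# (0.04 -> 40). The scaling is an assumption of this sketch to avoid
# floating-point sums in SQLite.
ROWS = [
    (40, 50), (20, 70), (20, 69), (100, 100), (100, 300),
    (100, 400), (100, 500), (100, 600), (100, 700), (100, 800),
]

# Same shape as the ranked CTE above, but the rank test is done on integers:
# rn / n <= pct / 100 is rewritten as rn * 100 <= pct * n.
QUERY = """
WITH ranked AS (
    SELECT
        f1 + f2 AS x,
        ROW_NUMBER() OVER (ORDER BY f1 + f2, f1, f2) AS rn,
        COUNT(*) OVER () AS n
    FROM sample
)
SELECT
    MAX(CASE WHEN rn * 100 <= :pct * n THEN x END) AS lower_bound,
    MIN(CASE WHEN rn * 100 >  :pct * n THEN x END) AS upper_bound
FROM ranked
"""


def cutoff_range(rows, pct):
    """Return (lower, upper): the largest sum inside the best pct% of rows
    and the smallest sum outside it. Either side is None when empty."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sample (f1 INTEGER, f2 INTEGER)")
        conn.executemany("INSERT INTO sample VALUES (?, ?)", rows)
        lower, upper = conn.execute(QUERY, {"pct": pct}).fetchone()
    finally:
        conn.close()
    return lower, upper


if __name__ == "__main__":
    print(cutoff_range(ROWS, 30))
```

With the ten sample rows and a target of 30, this returns (90, 200), i.e. 0.090 and 0.200, which matches the result above. Any threshold in that half-open range selects exactly the three best rows.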