Generating a histogram from column values in a database

Question

Say I have a database column "grade" like this:

|grade|
|    1|
|    2|
|    1|
|    3|
|    4|
|    5|

Is there a non-trivial way in SQL to generate a histogram like this?

|2,1,1,1,1,0|

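where 2 means grade 1 occurs twice, the 1s mean grades {2..5} occur once each, and 0 means grade 6 does not occur at all.

I don't mind if the histogram comes back as one row per count.

If it matters, the database is SQL Server, accessed by a Perl CGI through unixODBC/FreeTDS.

EDIT: Thanks for the quick replies! It's fine if non-existent values (like grade 6 in the example above) don't show up, as long as I can tell which histogram value belongs to which grade.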

Answer 1

SELECT grade, COUNT(grade) FROM [table] GROUP BY grade ORDER BY grade

Not verified, but it should work. It won't show a count for grade 6, however, since that value doesn't exist in the table at all...

Answer 2

If there are a lot of data points, you can also group ranges together, like this:

SELECT FLOOR(grade/5.00)*5 As Grade, 
       COUNT(*) AS [Grade Count]
FROM TableName
GROUP BY FLOOR(Grade/5.00)*5
ORDER BY 1

Additionally, if you want to label the full range, you can get the floor and ceiling ahead of time with a CTE.

With GradeRanges As (
  SELECT FLOOR(grade/5.00)*5     As GradeFloor, 
         FLOOR(grade/5.00)*5 + 4 As GradeCeiling
  FROM TableName
)
SELECT GradeFloor,
       CONCAT(GradeFloor, ' to ', GradeCeiling) AS GradeRange,
       COUNT(*) AS [Grade Count]
FROM GradeRanges
GROUP BY GradeFloor, CONCAT(GradeFloor, ' to ', GradeCeiling)
ORDER BY GradeFloor

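Note: in some SQL engines you can GROUP BY an ordinal column index, but in MS SQL, if you want an expression in the SELECT statement, you need to group by it as well, which is why the range expression is copied into the GROUP BY clause.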

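Option 2: you can use CASE expressions to selectively count values into arbitrary bins, then unpivot them to get a row-by-row count of the included values.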
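Option 2 comes without code, so here is a minimal sketch of the idea. It runs against an in-memory SQLite database through Python's sqlite3 module rather than SQL Server, so it can be tried without a server. The table name grades, the three bin boundaries and the bin labels are all assumptions made up for illustration, and the unpivot is done by hand with UNION ALL because SQLite has no UNPIVOT operator.

```python
import sqlite3

# Hypothetical table and sample data (the grades from the question).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (grade INTEGER)")
conn.executemany("INSERT INTO grades VALUES (?)", [(g,) for g in (1, 2, 1, 3, 4, 5)])

# Step 1: one SUM(CASE ...) per bin produces a single wide row of counts.
# Step 2: UNION ALL turns that row into one row per bin (a manual unpivot).
query = """
WITH wide AS (
    SELECT SUM(CASE WHEN grade BETWEEN 1 AND 2 THEN 1 ELSE 0 END) AS low,
           SUM(CASE WHEN grade BETWEEN 3 AND 4 THEN 1 ELSE 0 END) AS mid,
           SUM(CASE WHEN grade >= 5 THEN 1 ELSE 0 END)            AS high
    FROM grades
)
SELECT '1-2' AS bin, low AS cnt FROM wide
UNION ALL SELECT '3-4', mid FROM wide
UNION ALL SELECT '5+', high FROM wide
"""
case_bins = conn.execute(query).fetchall()
for bin_label, cnt in case_bins:
    print(bin_label, cnt)
```

On SQL Server the same wide row could presumably be fed to UNPIVOT instead of the UNION ALL chain; the CASE part carries over unchanged.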

Answer 3

Use a temp table to get the missing values:

CREATE TABLE #tmp(num int)
DECLARE @num int
SET @num = 0
WHILE @num < 10
BEGIN
  INSERT INTO #tmp VALUES (@num)
  SET @num = @num + 1
END


SELECT t.num as [Grade], count(g.Grade) FROM gradeTable g
RIGHT JOIN #tmp t on g.Grade = t.num
GROUP by t.num
ORDER BY 1

Answer 4

According to Shlomo Priymak's article How to Quickly Create a Histogram in MySQL, you can use the following query:

SELECT grade, 
       COUNT(*) AS 'Count',
       RPAD('', COUNT(*), '*') AS 'Bar' 
FROM grades 
GROUP BY grade

Which will produce the following table:

grade   Count   Bar
1       2       **
2       1       *
3       1       *
4       1       *
5       1       *
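RPAD is MySQL-specific. As a rough sketch of the same bar idea on an engine without it, the snippet below uses SQLite through Python's sqlite3 module. The table name grades is taken from the query above; the replace(hex(zeroblob(n)), '00', '*') expression is a workaround I am assuming is acceptable here, not something from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (grade INTEGER)")
conn.executemany("INSERT INTO grades VALUES (?)", [(g,) for g in (1, 2, 1, 3, 4, 5)])

# hex(zeroblob(n)) is a string of n '00' pairs; replacing each pair with '*'
# yields a bar of n asterisks, standing in for RPAD('', n, '*').
bars = conn.execute("""
    SELECT grade,
           COUNT(*) AS cnt,
           replace(hex(zeroblob(COUNT(*))), '00', '*') AS bar
    FROM grades
    GROUP BY grade
    ORDER BY grade
""").fetchall()
for grade, cnt, bar in bars:
    print(grade, cnt, bar)
```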

Answer 5

Gamecat's use of DISTINCT seems a little odd to me; I'll have to try it out when I'm back in the office...

The way I would do it is similar, though...

SELECT
    [table].grade        AS [grade],
    COUNT(*)             AS [occurances]
FROM
    [table]
GROUP BY
    [table].grade
ORDER BY
    [table].grade

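To overcome the lack of rows for grades with 0 occurrences, you can LEFT JOIN onto a table containing all valid grades. COUNT(*) would count the NULL row that the join produces for an unmatched grade, whereas COUNT(grade) does not count NULLs.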

DECLARE @grades TABLE (
   val INT
   )  

INSERT INTO @grades VALUES (1)  
INSERT INTO @grades VALUES (2)  
INSERT INTO @grades VALUES (3)  
INSERT INTO @grades VALUES (4)  
INSERT INTO @grades VALUES (5)  
INSERT INTO @grades VALUES (6)  

SELECT
    [grades].val         AS [grade],
    COUNT([table].grade) AS [occurances]
FROM
    @grades   AS [grades]
LEFT JOIN
    [table]
        ON [table].grade = [grades].val
GROUP BY
    [grades].val
ORDER BY
    [grades].val
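To see the COUNT(*) versus COUNT(grade) difference side by side, here is a small sketch that runs the same LEFT JOIN against an in-memory SQLite database from Python. The table names valid_grades and grade_rows are placeholders I made up, and SQLite stands in for SQL Server only so the snippet can be run as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE valid_grades (val INTEGER);   -- hypothetical lookup table
CREATE TABLE grade_rows (grade INTEGER);   -- hypothetical data table
INSERT INTO valid_grades VALUES (1), (2), (3), (4), (5), (6);
INSERT INTO grade_rows VALUES (1), (2), (1), (3), (4), (5);
""")

# For grade 6 the join yields one row whose g.grade is NULL:
# COUNT(*) counts that row, COUNT(g.grade) skips it.
rows = conn.execute("""
    SELECT v.val,
           COUNT(*)       AS count_star,
           COUNT(g.grade) AS count_grade
    FROM valid_grades AS v
    LEFT JOIN grade_rows AS g ON g.grade = v.val
    GROUP BY v.val
    ORDER BY v.val
""").fetchall()
for row in rows:
    print(row)
```

The last row comes back as (6, 1, 0), which is why the histogram query counts the joined column rather than using COUNT(*).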

Answer 6

select Grade, count(Grade)
from MyTable
group by Grade

Answer 7

Group by range and also account for empty ranges

This is an extension of https://stackoverflow.com/a/41275222/895245 that also creates bins for empty ranges, with 0 entries in them:

select x, sum(cnt) from (
  select floor(x/5)*5 as x,
         count(*) as cnt
    from t
    group by 1
  union
  select *, 0 as cnt from generate_series(0, 15, 5)
)
group by x

Tested with:

create table t(x integer);
insert into t values
  (0),
  (2),
  (2),
  (3),

  (5),
  (6),
  (6),
  (8),
  (9),

  (17);

Output:

0|4
5|5
10|0
15|1

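The trick is to use generate_series to create a series of zero counts, one per bin, and then use sum to add them to the populated ranges. This doesn't change the counts of the populated ranges, but it produces the 0 entries.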
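generate_series is an optional SQLite extension and is not always available (for example, it is usually missing from Python's bundled sqlite3). As a sketch under that assumption, the same zero-filling trick can be written with a recursive CTE in place of generate_series(0, 15, 5), and with integer division in place of floor(), which is equivalent here only because x is a non-negative integer. The CTE name bins is my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(v,) for v in (0, 2, 2, 3, 5, 6, 6, 8, 9, 17)])

query = """
WITH RECURSIVE bins(x) AS (      -- stands in for generate_series(0, 15, 5)
    SELECT 0
    UNION ALL
    SELECT x + 5 FROM bins WHERE x + 5 <= 15
)
SELECT x, SUM(cnt) FROM (
    SELECT (x / 5) * 5 AS x,     -- integer division: floors non-negative x
           COUNT(*) AS cnt
    FROM t
    GROUP BY 1
    UNION ALL
    SELECT x, 0 FROM bins        -- one zero count per bin
)
GROUP BY x
ORDER BY x
"""
filled = conn.execute(query).fetchall()
for x, cnt in filled:
    print(f"{x}|{cnt}")
```

This prints the same four rows as the output shown above, including the empty 10 bin.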

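Multiple range queries are faster than the floor() calculation

While using floor() is convenient and self-contained, it can be much slower on a large database than range queries against an indexed column. Presumably this is because, with floor(), SQLite loses the ability to count how many values fall in each range from the index alone, which is a fast operation.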

For example, I create a test database with 10m rows:

f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" 'create table t(x integer)'
time sqlite3 "$f" 'insert into t select value as x from generate_series(0,9999999)'
time sqlite3 "$f" 'create index tx on t(x)'

Then, running one query per bin for bins of size 1m:

bin=1000000
time (
i=0
while [ $i -lt 10 ]; do
  start="$((i * $bin))"
  printf "${start}|"
  sqlite3 10m.sqlite "select count(*) from t where x >= $start and x < $(((i + 1) * $bin))"
  i=$((i + 1))
done
)

finishes in 0.28 s and outputs:

0|1000000
1000000|1000000
2000000|1000000
3000000|1000000
4000000|1000000
5000000|1000000
6000000|1000000
7000000|1000000
8000000|1000000
9000000|1000000

Doing it with floor, however:

time sqlite3 10m.sqlite <<EOF
select floor(x/1000000)*1000000 as x,
       count(*) as cnt
from t
group by 1
order by 1
EOF

takes 1.7 s!
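For anyone driving this from a program rather than a shell loop, here is a sketch of the per-bin range-query approach using Python's sqlite3 module. The table is deliberately tiny (N_ROWS and BIN are hypothetical values, far smaller than in the benchmark above), so it illustrates the query shape only and says nothing about timing.

```python
import sqlite3

BIN = 1000       # hypothetical bin width
N_ROWS = 10000   # hypothetical table size

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(N_ROWS)))
conn.execute("CREATE INDEX tx ON t(x)")

# One range query per bin; the half-open interval [start, start + BIN)
# is a plain comparison on the indexed column.
range_counts = []
for start in range(0, N_ROWS, BIN):
    (cnt,) = conn.execute(
        "SELECT COUNT(*) FROM t WHERE x >= ? AND x < ?",
        (start, start + BIN),
    ).fetchone()
    range_counts.append((start, cnt))

for start, cnt in range_counts:
    print(f"{start}|{cnt}")
```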

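Computed column overkill

I'm not sure this has any concrete advantage over just running multiple queries, but it does let you get the result efficiently in a single query.

We create a computed column holding the floor, index it, and then use it in the GROUP BY: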

f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" <<EOF
create table t(
  x integer,
  x_floor integer generated always as (floor(x/1000000)*1000000) STORED
)
EOF
time sqlite3 "$f" 'insert into t select value as x from generate_series(0,9999999)'
time sqlite3 "$f" 'create index tx_floor on t(x_floor)'
time sqlite3 10m.sqlite <<EOF
select x_floor, count(*) as cnt
from t
group by x_floor
order by x_floor
EOF

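Run time: 0.48 s. Hmm, so it is slower than just running multiple queries. Weird. One theory is that the new stored column increased the database size by 2x, which has an IO cost.

Tested on Ubuntu 24.10, SQLite 3.46.1, Lenovo ThinkPad P14s.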

Answer 8

I'm building on what Ilya Volodin did above; this should allow you to choose the range of grades you want grouped together in the result:

DECLARE @cnt INT = 0;

WHILE @cnt < 100 -- Set max value
BEGIN
SELECT @cnt,COUNT(fe) FROM dbo.GEODATA_CB where fe >= @cnt-0.999 and fe <= @cnt+0.999 -- set tolerance
SET @cnt = @cnt + 1; -- set step
END;

Answer 9

SELECT FLOOR(grade/5.00)*5 As Grade_Lower, 
       FLOOR(grade/5.00)*5+5 As Grade_Upper,
       COUNT(*) AS [Grade Count]
FROM TableName
GROUP BY FLOOR(Grade/5.00)*5, FLOOR(grade/5.00)*5+5
ORDER BY 1

Video tutorial, if you prefer:

https://www.youtube.com/watch?v=ioc-NU4meu8
