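I am using the ml100k dataset and writing a query that returns the most-rated movie (the one with the most ratings) for each age group.
This is how my tables are defined.
users
id | age | gender | occupation | zipcode
ratings
userid | movieid | rating | ts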
What I have done so far:
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
This gives me the rating count for each movie at each age (columns: age, movieid, mcount).
10 1 1
11 1 1
13 1 3
14 1 1
15 1 2
16 1 4
17 1 4
18 1 6
19 1 15
20 1 22
21 1 14
SELECT age, MAX(mcount) AS mc
FROM (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t1
GROUP BY age
7 1
10 1
11 1
13 5
14 3
15 5
16 5
17 11
18 16
19 21
20 25
21 23
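This gives me each age and its maximum count. However, I also want the corresponding movieid, and that is where I am stuck. My idea was to join these results back to the first result set, but it does not work. Is there another approach I could try? Here is the query I used.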
SELECT users.age, ratings.movieid, count(*) as mc2
FROM ratings JOIN users ON id = userid
INNER JOIN
(
SELECT age, MAX(mcount) AS mc
FROM (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t1
GROUP BY age
)t2
ON t2.age = users.age
WHERE mc2=t2.mc
GROUP BY users.age, ratings.movieid;
You can do it like this:
SELECT t.age, t.movieid, t.mcount
FROM (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t
LEFT JOIN (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t2
ON t.age = t2.age AND
t.movieid <> t2.movieid AND
t.mcount < t2.mcount
WHERE t2.age IS NULL
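Explanation:
- we join ratings and users on the id of users matching userid
- we group by age and movieid
- we LEFT JOIN two such sets on matching age and different movieid (so that the count comparisons are meaningful), with the mcount of the first being smaller than that of the second
- in the WHERE clause we exclude the rows where a match with a higher count was found in the second set, so only the movies with the maximum count for each age remain

It's 8.2.0 now – vnk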
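The LEFT JOIN ... IS NULL pattern above can be checked on a toy dataset. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for the real database, and the rows are invented for illustration (they are not taken from ml100k). The `COUNTS` string and the variable names are hypothetical helpers, not part of the original answer.

```python
import sqlite3

# Toy data, invented for illustration (not taken from ml100k).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE ratings (userid INTEGER, movieid INTEGER, rating INTEGER);
    INSERT INTO users VALUES (1, 20), (2, 20), (3, 20), (4, 30), (5, 30);
    INSERT INTO ratings VALUES
        (1, 100, 5), (2, 100, 4), (3, 100, 3),  -- age 20: movie 100 rated 3 times
        (1, 200, 2),                            -- age 20: movie 200 rated once
        (4, 200, 5), (5, 200, 4),               -- age 30: movie 200 rated twice
        (4, 300, 1);                            -- age 30: movie 300 rated once
""")

# Per-(age, movie) rating counts, the same subquery the answer uses twice.
COUNTS = """
    SELECT age, movieid, COUNT(*) AS mcount
    FROM ratings
    JOIN users ON id = userid
    GROUP BY age, movieid
"""

# Keep a row only if no other movie of the same age has a higher count.
anti_join = f"""
    SELECT t.age, t.movieid, t.mcount
    FROM ({COUNTS}) t
    LEFT JOIN ({COUNTS}) t2
      ON t.age = t2.age
     AND t.movieid <> t2.movieid
     AND t.mcount < t2.mcount
    WHERE t2.age IS NULL
    ORDER BY t.age, t.movieid
"""
top_by_anti_join = conn.execute(anti_join).fetchall()
print(top_by_anti_join)
```

Note that if two movies tie for the highest count within an age, both rows survive the filter, because neither one has a strictly larger partner.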
WITH cte AS (
SELECT age,
movieid,
COUNT(*) AS mcount,
RANK() OVER (PARTITION BY age ORDER BY COUNT(*) DESC) rnk
FROM ratings
JOIN users ON id = userid
GROUP BY 1, 2
)
SELECT age,
GROUP_CONCAT(movieid) movie_ids,
mcount
FROM cte
WHERE rnk = 1
GROUP BY 1, 3
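To see how the ranking approach handles ties, here is a small sketch run through Python's sqlite3 module. It rests on a few assumptions: SQLite stands in for MySQL 8 (window functions need SQLite 3.25 or newer), the rows are made up for illustration, and the counting step is split into its own CTE (`counts`, a hypothetical name) so that RANK() orders by a plain column. That is a restructuring of the query above, not the answer's exact text.

```python
import sqlite3

# Toy data, invented for illustration: at age 20, movies 100 and 200 tie.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE ratings (userid INTEGER, movieid INTEGER);
    INSERT INTO users VALUES (1, 20), (2, 20), (3, 30);
    INSERT INTO ratings VALUES
        (1, 100), (2, 100),   -- age 20: movie 100 rated twice
        (1, 200), (2, 200),   -- age 20: movie 200 rated twice (tie)
        (1, 300),             -- age 20: movie 300 rated once
        (3, 200);             -- age 30: movie 200 rated once
""")

query = """
    WITH counts AS (
        SELECT age, movieid, COUNT(*) AS mcount
        FROM ratings
        JOIN users ON id = userid
        GROUP BY age, movieid
    ),
    ranked AS (
        SELECT age, movieid, mcount,
               RANK() OVER (PARTITION BY age ORDER BY mcount DESC) AS rnk
        FROM counts
    )
    SELECT age, GROUP_CONCAT(movieid) AS movie_ids, mcount
    FROM ranked
    WHERE rnk = 1
    GROUP BY age, mcount
    ORDER BY age
"""
rows = conn.execute(query).fetchall()

# GROUP_CONCAT does not guarantee an order here, so sort the ids per age.
top = {
    age: (sorted(int(m) for m in ids.split(",")), mcount)
    for age, ids, mcount in rows
}
print(top)
```

RANK() gives every tied movie rank 1, so GROUP_CONCAT collapses them into one row per age instead of returning several rows as the LEFT JOIN version does.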