Another one joining the #greatest-n-per-group party!
My previous code:
select count(*)
from revisions join files on rev_file = file_id
where rev_parent_id like 0
and rev_timestamp between '20011231230000' and '20191231225959'
and file_namespace like 0
and file_is_redirect like 0
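The problem is that some files have more than one entry with rev_parent_id = 0. I want to count only the one with the earliest rev_timestamp, but my attempts based on the answers to "SQL select only rows with max value on a column" and "Select Earliest Date and Time from List of Distinct User Sessions" gave me about 9,000 and 11,000,000 results. The correct number should be about 422,000. Maybe I failed to join the three tables correctly. Here is one of my attempts (the one that returned 9,000 results):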
select count(r1.rev_file)
from revisions r1
left outer join revisions r2 on (r1.rev_file = r2.rev_file
and r1.rev_timestamp < r2.rev_timestamp)
join files on r1.rev_file = file_id
where r2.rev_file is NULL
and r1.rev_parent_id like 0
and r1.rev_timestamp between '20011231230000' and '20191231225959'
and file_namespace like 0
and file_is_redirect like 0
Table structure:
files
file_id, file_namespace, file_is_redirect
1234, 0, 0
1235, 3, 1
1236, 3, 0
revisions
rev_file, rev_id, rev_parent_id, rev_timestamp
1234, 19, 16, 20170302061522
1234, 16, 0, 20170302061428
1234, 14, 12, 20170302061422
1234, 12, 0, 20170302061237
1235, 21, 18, 20170302061815
1235, 18, 13, 20170302061501
1235, 13, 8, 20170302061355
1235, 8, 3, 20170302061213
1235, 3, 0, 20170302061002
1236, 6, 0, 20170302061014
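file_id = rev_file = the id of the file. file_namespace = the mimetype of the file, where 0 is plain text. rev_id = the id of the revision. rev_parent_id = the id of the parent revision. rev_timestamp = the timestamp of the revision.
The only valid file is 1234. It was deleted and recreated, so it has two entries with rev_parent_id = 0. I want to count a file only if its older rev_parent_id = 0 revision falls within the selected time range.
You should join a subquery that returns the min rev_timestamp for each rev_file: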
select count(*)
from revisions
join files on rev_file = file_id
join (
select rev_file, min(rev_timestamp) min_time
from revisions
where rev_parent_id = 0
group by rev_file
) t on t.min_time = revisions.rev_timestamp
and t.rev_file = revisions.rev_file
where rev_parent_id like 0
and rev_timestamp between '20011231230000' and '20191231225959'
and file_namespace like 0
and file_is_redirect like 0
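To see the effect on the sample rows from the question, here is a minimal sketch that loads them into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for the real server (an assumption), the timestamp column is stored as text (also an assumption), and the queries use = 0 in place of like 0.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (file_id INTEGER, file_namespace INTEGER, file_is_redirect INTEGER);
CREATE TABLE revisions (rev_file INTEGER, rev_id INTEGER, rev_parent_id INTEGER, rev_timestamp TEXT);

-- Sample rows copied from the question.
INSERT INTO files VALUES (1234, 0, 0), (1235, 3, 1), (1236, 3, 0);
INSERT INTO revisions VALUES
  (1234, 19, 16, '20170302061522'),
  (1234, 16,  0, '20170302061428'),
  (1234, 14, 12, '20170302061422'),
  (1234, 12,  0, '20170302061237'),
  (1235, 21, 18, '20170302061815'),
  (1235, 18, 13, '20170302061501'),
  (1235, 13,  8, '20170302061355'),
  (1235,  8,  3, '20170302061213'),
  (1235,  3,  0, '20170302061002'),
  (1236,  6,  0, '20170302061014');
""")

# The original query: counts every rev_parent_id = 0 row, so a recreated file counts twice.
NAIVE = """
SELECT COUNT(*)
FROM revisions r
JOIN files f ON r.rev_file = f.file_id
WHERE r.rev_parent_id = 0
  AND r.rev_timestamp BETWEEN '20011231230000' AND '20191231225959'
  AND f.file_namespace = 0
  AND f.file_is_redirect = 0
"""

# Joining the per-file MIN(rev_timestamp) keeps only the earliest creation of each file.
EARLIEST_ONLY = """
SELECT COUNT(*)
FROM revisions r
JOIN files f ON r.rev_file = f.file_id
JOIN (
    SELECT rev_file, MIN(rev_timestamp) AS min_time
    FROM revisions
    WHERE rev_parent_id = 0
    GROUP BY rev_file
) t ON t.rev_file = r.rev_file AND t.min_time = r.rev_timestamp
WHERE r.rev_parent_id = 0
  AND r.rev_timestamp BETWEEN '20011231230000' AND '20191231225959'
  AND f.file_namespace = 0
  AND f.file_is_redirect = 0
"""

naive_count = conn.execute(NAIVE).fetchone()[0]
earliest_count = conn.execute(EARLIEST_ONLY).fetchone()[0]
print(naive_count, earliest_count)  # prints: 2 1
```

File 1234 is counted twice by the original query and once after the join, which is the behavior the question asks for.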
First, let's use a subquery to find the earliest timestamp for each rev_file in revisions that meets your criteria.
SELECT MIN(rev_timestamp) rev_timestamp, rev_file
FROM revisions
WHERE rev_parent_id like 0
AND rev_timestamp between '20011231230000' and '20191231225959'
GROUP BY rev_file
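This gives you a virtual table containing the earliest timestamp for each file that matches your criteria.
Next, join that table to the other tables, like this: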
SELECT COUNT(*) count
FROM revisions r1
JOIN (
SELECT MIN(rev_timestamp) rev_timestamp, rev_file
FROM revisions
WHERE rev_parent_id like 0
AND rev_timestamp between '20011231230000' and '20191231225959'
GROUP BY rev_file
) rmin ON r1.rev_timestamp = rmin.rev_timestamp
AND r1.rev_file = rmin.rev_file
JOIN files f ON r1.rev_file = file_id
and f.file_namespace like 0
and f.file_is_redirect like 0
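One detail may be worth checking against your data: with the time range inside the subquery, MIN picks the earliest creation within the range, not the earliest creation overall. The sketch below uses SQLite through Python's sqlite3 module and a single hypothetical file (not from the question) that was first created before the range and recreated inside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (file_id INTEGER, file_namespace INTEGER, file_is_redirect INTEGER);
CREATE TABLE revisions (rev_file INTEGER, rev_id INTEGER, rev_parent_id INTEGER, rev_timestamp TEXT);

-- Hypothetical data: file 1 was created before the range, deleted, then recreated inside it.
INSERT INTO files VALUES (1, 0, 0);
INSERT INTO revisions VALUES
  (1, 1, 0, '20010615120000'),
  (1, 5, 0, '20050101000000');
""")

# Range filter inside the subquery: MIN is taken over in-range rows only.
FILTER_INSIDE = """
SELECT COUNT(*)
FROM revisions r1
JOIN (
    SELECT MIN(rev_timestamp) AS rev_timestamp, rev_file
    FROM revisions
    WHERE rev_parent_id = 0
      AND rev_timestamp BETWEEN '20011231230000' AND '20191231225959'
    GROUP BY rev_file
) rmin ON r1.rev_timestamp = rmin.rev_timestamp
      AND r1.rev_file = rmin.rev_file
JOIN files f ON r1.rev_file = f.file_id
            AND f.file_namespace = 0
            AND f.file_is_redirect = 0
"""

# Range filter outside the subquery: MIN is the true first creation, tested afterwards.
FILTER_OUTSIDE = """
SELECT COUNT(*)
FROM (
    SELECT MIN(rev_timestamp) AS rev_timestamp, rev_file
    FROM revisions
    WHERE rev_parent_id = 0
    GROUP BY rev_file
) rmin
JOIN files f ON rmin.rev_file = f.file_id
            AND f.file_namespace = 0
            AND f.file_is_redirect = 0
WHERE rmin.rev_timestamp BETWEEN '20011231230000' AND '20191231225959'
"""

inside_count = conn.execute(FILTER_INSIDE).fetchone()[0]
outside_count = conn.execute(FILTER_OUTSIDE).fetchone()[0]
print(inside_count, outside_count)  # prints: 1 0
```

If the goal is to count a file only when its very first creation falls in the range, the second form matches that intent. If recreations inside the range should count, the first form does.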
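Pro tip: Formatting your queries so they are readable is worth the effort.
Pro tip: Use COUNT(*) rather than COUNT(col) wherever you can. It can be faster, and it yields the same result unless the col you mention can contain NULL values. That is not the case for the queries in your question.
Pro tip: Always qualify your columns in JOIN operations (f.file_is_redirect rather than file_is_redirect). Again, query readability is the motive. If you're lucky enough to have somebody else maintain your code someday, that person will be glad to see this. It is an important part of professional, as opposed to hobbyist, programming.
Pro tip: numeric_col LIKE 0 kills performance. LIKE is meant for matching text (column LIKE '%verflo%' matches Stack Overflow). When you use LIKE on a numeric column, every value in the column has to be converted to a string before the LIKE operator runs on it, which defeats any index you have on that numeric column.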
Thanks @scaisedge and @o-jones. In the end I used the core of both your answers and removed the redundant code, and this finally worked for me:
select count(*)
from (select rev_file, min(rev_timestamp) rev_timestamp from revision where rev_parent_id like 0 group by rev_file) revision
join file on rev_file = file_id
where rev_timestamp between '20011231230000' and '20191231225959'
and file_namespace like 0
and not file_is_redirect;
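Maybe I could also save some running time by moving the file_namespace and file_is_redirect conditions into another subquery in the join, but maybe not. I'm not sure.
scaisedge's answer is more concise and readable, so I understood it immediately and preferred it. scaisedge's code had a few errors (which I fixed). o-jones's answer is more cluttered with unnecessary things, but it is more detailed in case any reader needs the explanation, and thanks for the pro tips: they taught me about some timing problems in my code.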
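For anyone curious about the idea of moving the file conditions into their own subquery, here is a rough sketch of that variant next to the query above, run on the sample rows from the question. It uses SQLite through Python's sqlite3 module as a stand-in for the real server, the singular table names from the final query, and = 0 in place of like 0 (all assumptions). Whether the variant is faster depends on the optimizer, so comparing the execution plans on the real server is the only reliable check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (file_id INTEGER, file_namespace INTEGER, file_is_redirect INTEGER);
CREATE TABLE revision (rev_file INTEGER, rev_id INTEGER, rev_parent_id INTEGER, rev_timestamp TEXT);

-- Sample rows copied from the question.
INSERT INTO file VALUES (1234, 0, 0), (1235, 3, 1), (1236, 3, 0);
INSERT INTO revision VALUES
  (1234, 19, 16, '20170302061522'), (1234, 16, 0, '20170302061428'),
  (1234, 14, 12, '20170302061422'), (1234, 12, 0, '20170302061237'),
  (1235, 21, 18, '20170302061815'), (1235, 18, 13, '20170302061501'),
  (1235, 13, 8, '20170302061355'),  (1235, 8, 3, '20170302061213'),
  (1235, 3, 0, '20170302061002'),   (1236, 6, 0, '20170302061014');
""")

# The final query: first creation per file, then filter by range and file flags.
FINAL = """
SELECT COUNT(*)
FROM (
    SELECT rev_file, MIN(rev_timestamp) AS rev_timestamp
    FROM revision
    WHERE rev_parent_id = 0
    GROUP BY rev_file
) first_rev
JOIN file f ON first_rev.rev_file = f.file_id
WHERE first_rev.rev_timestamp BETWEEN '20011231230000' AND '20191231225959'
  AND f.file_namespace = 0
  AND f.file_is_redirect = 0
"""

# Variant: the file conditions are applied in their own subquery before the join.
PREFILTERED = """
SELECT COUNT(*)
FROM (
    SELECT rev_file, MIN(rev_timestamp) AS rev_timestamp
    FROM revision
    WHERE rev_parent_id = 0
    GROUP BY rev_file
) first_rev
JOIN (
    SELECT file_id
    FROM file
    WHERE file_namespace = 0
      AND file_is_redirect = 0
) f ON first_rev.rev_file = f.file_id
WHERE first_rev.rev_timestamp BETWEEN '20011231230000' AND '20191231225959'
"""

final_count = conn.execute(FINAL).fetchone()[0]
prefiltered_count = conn.execute(PREFILTERED).fetchone()[0]
print(final_count, prefiltered_count)  # prints: 1 1
```

Both forms return the same count on the sample data, so the choice between them is about readability and the plan the server picks.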