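I have a huge query with a lot of JOINs, and it is producing duplicates.
I am using the technique below, which I found on SO, to identify which table the duplicates come from: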
SELECT
TableA = '----------', TableA.*,
TableB = '----------', TableB.*
FROM ...
Here is a sample of the data:
TABLE_A | USER_ID    | TABLE_B              | LOCATION                  | USER_CODE | LOCATION_CODE | TABLE_C                    | SCI_YEAR_CODE
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | University Of N. Maryland | NULL      | ND            | BIO_PATHS_SCIENCE_RESEARCH | 2016_AAB
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | University Of N. Maryland | NULL      | ND            | BIO_PATHS_SCIENCE_RESEARCH | 2017_RRT
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | University Of N. Maryland | NULL      | ND            | BIO_PATHS_SCIENCE_RESEARCH | 2016_AAB
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | University Of N. Maryland | NULL      | ND            | BIO_PATHS_SCIENCE_RESEARCH | 2017_RRT
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | California of College     | NULL      | MH            | BIO_PATHS_SCIENCE_RESEARCH | 2016_AAB
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | California of College     | NULL      | MH            | BIO_PATHS_SCIENCE_RESEARCH | 2017_RRT
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | California of College     | NULL      | MH            | BIO_PATHS_SCIENCE_RESEARCH | 2016_AAB
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | California of College     | NULL      | MH            | BIO_PATHS_SCIENCE_RESEARCH | 2017_RRT
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | New York City Tech        | NULL      | BS            | BIO_PATHS_SCIENCE_RESEARCH | 2016_AAB
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | New York City Tech        | NULL      | BS            | BIO_PATHS_SCIENCE_RESEARCH | 2017_RRT
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | New York City Tech        | NULL      | BS            | BIO_PATHS_SCIENCE_RESEARCH | 2016_AAB
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | New York City Tech        | NULL      | BS            | BIO_PATHS_SCIENCE_RESEARCH | 2017_RRT
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | New York City Tech        | NULL      | BS            | BIO_PATHS_SCIENCE_RESEARCH | 2016_AAB
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | New York City Tech        | NULL      | BS            | BIO_PATHS_SCIENCE_RESEARCH | 2017_RRT
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | New York City Tech        | NULL      | BS            | BIO_PATHS_SCIENCE_RESEARCH | 2016_AAB
USER    | 1092993811 | COL_PATHS_SCIENCE_ED | New York City Tech        | NULL      | BS            | BIO_PATHS_SCIENCE_RESEARCH | 2017_RRT
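You can see that the column causing the most duplicates comes from TABLE_C, BIO_PATHS_SCIENCE_RESEARCH.
For SCI_YEAR_CODE, I only need the most recent date, and only the SCI_YEAR_CODE values that end in RRT.
Is there a way to "clear out" these duplicates?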
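You can use ROW_NUMBER() to assign a sequence number to each row within each USER_ID and LOCATION_CODE partition, then filter the results to keep only the rows where RowNum = 1. TABLE_A, TABLE_B, and TABLE_C below stand in for your real tables, and the join conditions are a guess, since the original query is not shown:
SELECT *
FROM (
    SELECT
        -- Number the rows in each user/location group, newest SCI_YEAR_CODE first
        ROW_NUMBER() OVER (
            PARTITION BY TABLE_A.USER_ID, TABLE_B.LOCATION_CODE
            ORDER BY TABLE_C.SCI_YEAR_CODE DESC
        ) AS RowNum,
        -- List columns explicitly: a derived table cannot expose two columns with the same name
        TABLE_A.USER_ID,
        TABLE_B.LOCATION,
        TABLE_B.USER_CODE,
        TABLE_B.LOCATION_CODE,
        TABLE_C.SCI_YEAR_CODE
    FROM
        TABLE_A
    JOIN
        TABLE_B ON TABLE_A.USER_ID = TABLE_B.USER_ID
    JOIN
        TABLE_C ON TABLE_B.LOCATION_CODE = TABLE_C.LOCATION_CODE
    WHERE
        -- Filter to RRT codes before ranking, so a newer non-RRT code cannot take RowNum = 1
        TABLE_C.SCI_YEAR_CODE LIKE '%RRT'
) AS sub
WHERE
    sub.RowNum = 1;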
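To try the ROW_NUMBER() approach without touching the real database, here is a minimal sketch that reproduces the fan-out from the sample data in an in-memory SQLite database driven from Python. The table names (users, education, research), their columns, and the join keys are all hypothetical, since the question does not show the real schema, and it assumes the bundled SQLite is version 3.25 or later (the first release with window functions).

```python
import sqlite3

# Hypothetical, simplified schema: the real tables and join keys are not shown
# in the question, so every name below is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER);
    CREATE TABLE education (user_id INTEGER, location TEXT, location_code TEXT);
    CREATE TABLE research (user_id INTEGER, sci_year_code TEXT);

    INSERT INTO users VALUES (1092993811);
    -- Repeated rows on purpose, to reproduce the duplicates seen in the question.
    INSERT INTO education VALUES
        (1092993811, 'University Of N. Maryland', 'ND'),
        (1092993811, 'University Of N. Maryland', 'ND'),
        (1092993811, 'California of College', 'MH'),
        (1092993811, 'California of College', 'MH'),
        (1092993811, 'New York City Tech', 'BS'),
        (1092993811, 'New York City Tech', 'BS'),
        (1092993811, 'New York City Tech', 'BS'),
        (1092993811, 'New York City Tech', 'BS');
    INSERT INTO research VALUES
        (1092993811, '2016_AAB'),
        (1092993811, '2017_RRT');
""")

# The plain join multiplies every education row by every research row.
RAW_COUNT_SQL = """
    SELECT COUNT(*)
    FROM users AS u
    JOIN education AS e ON e.user_id = u.user_id
    JOIN research AS r ON r.user_id = u.user_id
"""

# Keep only RRT codes, rank within each user/location, and take the newest row.
DEDUP_SQL = """
    SELECT user_id, location, location_code, sci_year_code
    FROM (
        SELECT
            u.user_id, e.location, e.location_code, r.sci_year_code,
            ROW_NUMBER() OVER (
                PARTITION BY u.user_id, e.location_code
                ORDER BY r.sci_year_code DESC
            ) AS row_num
        FROM users AS u
        JOIN education AS e ON e.user_id = u.user_id
        JOIN research AS r ON r.user_id = u.user_id
        WHERE r.sci_year_code LIKE '%RRT'
    ) AS sub
    WHERE row_num = 1
    ORDER BY location_code
"""

raw_count = conn.execute(RAW_COUNT_SQL).fetchone()[0]
rows = conn.execute(DEDUP_SQL).fetchall()

print("rows before de-duplication:", raw_count)
for row in rows:
    print(row)
```

With this sample data the plain join returns one row per education/research pair, while the ranked query keeps a single row per location. If the duplicates in the real query come from repeated rows in one source table, as simulated here, it may also be worth checking whether a join condition is missing.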