I am trying to delete some duplicate data in my Redshift table.
Below is my query:
With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by record_indicator Order by record_indicator) as Duplicate From table_name)
delete from duplicates
Where Duplicate > 1 ;
This query gives me an error:
Amazon Invalid operation: syntax error at or near "delete";
Not sure what the issue is, as the syntax of the WITH clause looks correct. Has anyone run into this before?
Redshift being what it is (no uniqueness enforced on any column), Ziggy's third option is probably best. Once you decide to go the temp table route, it is more efficient to swap the whole thing out. Deletes and inserts are expensive in Redshift.
begin;
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
commit;
If space isn't an issue, you can keep the old table around for a while and use the other methods described here to verify that the row count in the original, accounting for the duplicates, matches the row count in the new table.
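A sketch of that verification (assuming the swap above was run without the final drop, so table_name_old is still around): the new table's total row count should match the old table's distinct row count.

```sql
-- Compare counts: new table total vs. old table distinct
SELECT
    (SELECT COUNT(*) FROM table_name) AS new_total,
    (SELECT COUNT(*)
     FROM (SELECT DISTINCT * FROM table_name_old) t) AS old_distinct;
```

If the two numbers differ, some rows were not exact duplicates and DISTINCT removed more (or fewer) rows than expected.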
If you have ongoing loads into a table like this, you will need to pause that process during this operation.
If the duplicates are only a small fraction of a large table, you might want to try copying the distinct records of the duplicates into a temp table, then delete all records from the original that join with the temp table. Then append the temp table back to the original. Make sure you vacuum the original table afterwards (which you should be doing on a schedule for large tables anyway).
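A sketch of that staging approach, assuming (as in the question) that record_indicator identifies the duplicates; dup_singles is an illustrative name:

```sql
BEGIN;

-- Keep one distinct copy of every record that has duplicates
CREATE TEMP TABLE dup_singles AS
SELECT DISTINCT t.*
FROM table_name t
JOIN (
    SELECT record_indicator
    FROM table_name
    GROUP BY record_indicator
    HAVING COUNT(*) > 1
) d ON t.record_indicator = d.record_indicator;

-- Remove every copy of those records from the original
DELETE FROM table_name
USING dup_singles
WHERE table_name.record_indicator = dup_singles.record_indicator;

-- Append the single copies back
INSERT INTO table_name SELECT * FROM dup_singles;

COMMIT;

-- Reclaim space and restore sort order (cannot run inside a transaction)
VACUUM table_name;
```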
If you are dealing with a lot of data, it is not always possible or smart to recreate the whole table. It may be easier to locate and delete just those rows:
-- First identify all the rows that are duplicate
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;
-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales(LIKE sales);
INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;
-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;
COMMIT;
Full article: https://elliot.land/post/removing-duplicate-data-in-redshift
That should work. Alternatives you can choose from:
With
duplicates As (
Select *, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);
Or
delete from table_name
where id in (
select id
from (
Select id, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name) x
Where Duplicate > 1);
If you have no primary key, you can do the following:
BEGIN;
CREATE TEMP TABLE mydups ON COMMIT DROP AS
SELECT DISTINCT ON (record_indicator) *
FROM table_name
ORDER BY record_indicator --, other_optional_priority_field DESC
;
DELETE FROM table_name
WHERE record_indicator IN (
SELECT record_indicator FROM mydups);
INSERT INTO table_name SELECT * FROM mydups;
COMMIT;
This method preserves the permissions and table definition of original_table.
Make a de-duplicated copy of original_table:
CREATE TABLE unique_table as
(
  SELECT DISTINCT * FROM original_table
)
;
Back up original_table (optional):
CREATE TABLE backup_table as
(
  SELECT * FROM original_table
)
;
Truncate original_table:
TRUNCATE original_table;
Insert the records from unique_table back into original_table:
INSERT INTO original_table
(
  SELECT * FROM unique_table
)
;
Or, as a single transaction (using DELETE instead of TRUNCATE, since TRUNCATE commits immediately):
BEGIN transaction;
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
DELETE FROM original_table;
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
END transaction;
A simple answer to this question:
First, create a temporary table from the main table holding only the rows where row_number = 1. Then delete from the main table all rows for which we have duplicates. Finally, insert the values from the temporary table back into the main table.
Queries:
1. Create the temporary table:
select id, date into #temp_a
from
(select *
 from (select a.*,
              row_number() over (partition by id order by etl_createdon desc) as rn
       from table a
       where a.id between 59 and 75 and a.date = '2018-05-24') b
 where rn = 1) a
2. Delete all those rows from the main table:
delete from table a
where a.id between 59 and 75 and a.date = '2018-05-24'
3. Insert all values from the temporary table back into the main table:
insert into table a select * from #temp_a
The following command deletes all the records in "tablename" that have a duplicate; it will not deduplicate the table:
DELETE FROM tablename
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename
) t
WHERE t.rnum > 1);
Your query does not work because Redshift does not allow a DELETE after the WITH clause. Only SELECT and UPDATE and a few others are allowed (see WITH clause in the Redshift documentation).
Solution (in my case):
My table events, which contained the duplicate rows, did have an id column that uniquely identifies each record. This column id is the same as your record_indicator.
Unfortunately, I was unable to create a temporary table, because using SELECT DISTINCT I ran into the following error:
ERROR: Intermediate result row exceeds database block size
But this worked like a charm:
CREATE TABLE temp as (
SELECT *,ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber
FROM events
);
which produces the temp table:
id | rownumber | ...
---|-----------|----
 1 |         1 | ...
 1 |         2 | ...
 2 |         1 | ...
 2 |         2 | ...
Now the duplicates can be removed by deleting the rows whose rownumber is larger than 1:
DELETE FROM temp WHERE rownumber > 1
After that, rename the tables and you are done.
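That last step could look like this (a sketch, using the events/temp names from above; the helper column rownumber is dropped first so the swapped-in table matches the original definition):

```sql
-- Drop the helper column used for deduplication
ALTER TABLE temp DROP COLUMN rownumber;

-- Swap the deduplicated table in for the original
ALTER TABLE events RENAME TO events_old;
ALTER TABLE temp RENAME TO events;
DROP TABLE events_old;
```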
with duplicates as
(
    select a.*, row_number() over (partition by first_name, last_name, email
                                   order by first_name, last_name, email) as rn
    from contacts a
)
delete from contacts
where contact_id in (
    select contact_id from duplicates where rn > 1
);
Altering and dropping the existing table can cause dependency issues.
I suggest doing it the other way around instead;
just replace ___ with your table name:
CREATE TABLE ____dedupped AS SELECT DISTINCT * FROM ___;
DELETE FROM ___;
INSERT INTO ___ SELECT * FROM ____dedupped;
DROP TABLE ____dedupped;
SELECT * FROM ___ ORDER BY id LIMIT 50;
commit;