How to deduplicate a large table in ClickHouse?

Problem description

I have a huge single-column table with ENGINE = Log:

SELECT * FROM addresses_tmp LIMIT 5

   ┌─address──────────────────────────────────┐
1. │ 18a0a8bdcbd1fec1785224cfc486ccf02dc3ef5d │
2. │ 3ca0a8d9744b229f81fae2f59892b546c20a744e │
3. │ 4456058ebd1ae161348b5aae51d86aef423513a6 │
4. │ a3230a93a31f924a2713af72733d522873434025 │
5. │ 4960323c0fbd63ae068ea313c67bb2a3bc133baf │
   └──────────────────────────────────────────┘

I tried to insert it into a ReplacingMergeTree table:

CREATE TABLE addresses ENGINE = ReplacingMergeTree() PRIMARY KEY address AS
SELECT row_number() OVER () AS id, *
FROM (SELECT * FROM addresses_tmp)

But it fails with an out-of-memory error:

Code: 241. DB::Exception: Received from localhost:9000. DB::Exception: Memory limit (total) exceeded: would use 27.89 GiB (attempt to allocate chunk of 5248943 bytes), maximum: 27.86 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker.

How else can I convert this table to MergeTree or ReplacingMergeTree and deduplicate it?

out-of-memory bigdata clickhouse
1 Answer
CREATE TABLE addresses (address String)
ENGINE = ReplacingMergeTree ORDER BY address;

INSERT INTO addresses SELECT * FROM addresses_tmp;

OPTIMIZE TABLE addresses FINAL DEDUPLICATE;
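The `row_number()` window in the original query forces ClickHouse to materialize the whole table in memory, which is what triggered the OOM. If `OPTIMIZE ... FINAL` is too heavy for a table this size, a sketch of an alternative (the table name `addresses_dedup` is illustrative) is to deduplicate at insert time with `DISTINCT`, which avoids the window function entirely:

```sql
-- Plain MergeTree is enough here, since the data is deduplicated on insert.
CREATE TABLE addresses_dedup (address String)
ENGINE = MergeTree ORDER BY address;

-- DISTINCT deduplicates during the insert; no row_number() window needed,
-- so memory usage stays far lower (and can spill to disk if configured).
INSERT INTO addresses_dedup
SELECT DISTINCT address FROM addresses_tmp;

-- Sanity check: count() should equal uniqExact(address).
SELECT count(), uniqExact(address) FROM addresses_dedup;
```

If even the `DISTINCT` hash table exceeds memory, settings such as `max_bytes_before_external_group_by` can let the aggregation spill to disk at the cost of speed.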