I'm trying to find out whether any character in a string is repeated five or more times in a row, in ClickHouse Cloud. Examples:
'12344444156'
'abcrrrrrggds'
I know the regular expression that works in general:
.*(.)\1{4,}.*
But ClickHouse uses the RE2 engine, which does not support backreferences. How else can I do this?
What I tried:
WITH '12344444156' as str
SELECT str, extract(str, '.*(.)\\1{4,}.*');
Expected output:
12344444156
Got:
SQL Error [427] [07000]: Code: 427. DB::Exception: OptimizedRegularExpression: cannot compile re2: .*(.)\1{4,}.*, error: invalid escape sequence: \1. Look at https://github.com/google/re2/wiki/Syntax for reference. Please note that if you specify regex as an SQL string literal, the slashes have to be additionally escaped. For example, to match an opening brace, write '\(' -- the first slash is for SQL and the second one is for regex: While processing '12344444156' AS str, extract(str, '.*(.)\\1{4,}.*'). (CANNOT_COMPILE_REGEXP) (version 24.6.1.4410 (official build))
, server ClickHouseNode [uri=https://w2z74jyoma.ap-southeast-2.aws.clickhouse.cloud:8443/default, options={use_server_time_zone=false,use_time_zone=false}]@248459710
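Since ClickHouse's RE2 engine does not support backreferences, use extractAll to find the repeated characters:
WITH '12344444156' AS str SELECT arrayExists(x -> length(x) >= 5, extractAll(str, '(.)\1*')) AS has_repeated_chars;
This is meant to check whether any character is repeated five or more times. Note, however, that \1 in this pattern is itself a backreference, so RE2 will likely reject it with the same CANNOT_COMPILE_REGEXP error.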
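A sketch of one way to do the check without any regular expression, so RE2's lack of backreferences does not matter. For each start position it compares the next five bytes with the first of those bytes repeated five times. It uses only the built-in range, substring, repeat and arrayExists functions; the alias has_repeated_chars and the threshold of 5 are illustrative choices. substring works on bytes, so this assumes single-byte (ASCII) input; for multi-byte UTF-8 text, substringUTF8 and lengthUTF8 would be the natural substitutes.

```sql
-- For each start position i, compare the 5 bytes starting at i
-- with the byte at i repeated 5 times.
WITH '12344444156' AS str
SELECT
    str,
    arrayExists(
        i -> substring(str, i, 5) = repeat(substring(str, i, 1), 5),
        range(1, length(str) + 1)  -- positions 1..length(str); substring is 1-based
    ) AS has_repeated_chars;
```

For '12344444156' this should yield 1, because the run of five 4s starts at position 4, while a string such as '1234156' should yield 0. Near the end of the string substring returns fewer than five bytes, so those positions can never match and no separate bounds check is needed.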