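I have a BigQuery table with about 1000 columns in which each `uuid` should have only one row. Some duplicate data has crept in, so that a `uuid` has been inserted twice (never more than twice), with `message_timestamp` duplicated as well. The other ~998 columns will have different values. Recreating the table is an expensive job, so I would like to delete one of the duplicate rows. It does not matter which one.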
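The solution would be something like: where the table has two rows with the same `uuid` and `message_timestamp`, delete either one of them. It seems simple, but I am struggling with it.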
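We can build a JSON string (`jdata`) that contains all of the table's columns and use that value to tell the rows apart.
See the example. There are 3 rows that differ only in the `email` column.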
WITH mytable AS (
  SELECT 1 id, 'code1' code, 'description1' description, '[email protected]' email, 'address:Client1' contacts, 'cf35d47-0370-467f-b946-f2a1286d8fad' uid
  UNION ALL
  SELECT 1 id, 'code1' code, 'description1' description, '[email protected]' email, 'address:Client1' contacts, 'cf35d47-0370-467f-b946-f2a1286d8fad' uid
  UNION ALL
  SELECT 1 id, 'code1' code, 'description1' description, '[email protected]' email, 'address:Client1' contacts, 'cf35d47-0370-467f-b946-f2a1286d8fad' uid
),
data AS (
  -- Serialize each whole row to JSON and number the rows within each uid
  SELECT uid,
         TO_JSON_STRING(a) AS jdata,
         ROW_NUMBER() OVER (PARTITION BY uid ORDER BY TO_JSON_STRING(a)) AS rn
  FROM mytable a
)
SELECT targ.email, src.*
FROM mytable targ
LEFT JOIN data src ON src.jdata = TO_JSON_STRING(targ)
Output
email | uid | jdata | rn |
---|---|---|---|
[email protected] | cf35d47-0370-467f-b946-f2a1286d8fad | {"id":1,"code":"code1","description":"description1","email":"[email protected]","contacts":"address:Client1","uid":"cf35d47-0370-467f-b946-f2a1286d8fad"} | 1 |
[email protected] | cf35d47-0370-467f-b946-f2a1286d8fad | {"id":1,"code":"code1","description":"description1","email":"[email protected]","contacts":"address:Client1","uid":"cf35d47-0370-467f-b946-f2a1286d8fad"} | 3 |
[email protected] | cf35d47-0370-467f-b946-f2a1286d8fad | {"id":1,"code":"code1","description":"description1","email":"[email protected]","contacts":"address:Client1","uid":"cf35d47-0370-467f-b946-f2a1286d8fad"} | 2 |
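The same idea could be turned into a delete along the following lines. This is only a sketch and has not been run against the real table: `my_project.my_dataset.my_table` is a placeholder name, the partition key uses the `uuid` and `message_timestamp` columns from the question, and it assumes that the table alias can be passed to `TO_JSON_STRING` inside a `DELETE ... WHERE` the same way it can in a `SELECT`.

```sql
-- Sketch only: the table name is a placeholder, adjust it to your own table.
-- Deletes every row whose serialized form is ranked second or later
-- within its (uuid, message_timestamp) group, keeping the rn = 1 row.
DELETE FROM `my_project.my_dataset.my_table` AS targ
WHERE TO_JSON_STRING(targ) IN (
  SELECT jdata
  FROM (
    SELECT
      TO_JSON_STRING(a) AS jdata,
      ROW_NUMBER() OVER (
        PARTITION BY uuid, message_timestamp
        ORDER BY TO_JSON_STRING(a)
      ) AS rn
    FROM `my_project.my_dataset.my_table` AS a
  )
  WHERE rn > 1
);
```

Two caveats. If two rows are identical in every column, they serialize to the same `jdata`, so both would match and both would be deleted; the approach relies on the other columns differing, as described in the question. It is also worth running the inner `SELECT` on its own first and checking that the number of rows with `rn > 1` matches the number of duplicates you expect before executing the `DELETE`.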