Delete rows based on duplicate uuid values in a table with ~1000 columns

Question (votes: 0, answers: 1)

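I have a BigQuery table with ~1000 columns in which each uuid should have exactly one row. Some duplicate data has crept in, so that some uuid values were inserted twice (never more than twice), with message_timestamp duplicated as well. The other ~998 columns will have different values. Recreating the table is an expensive operation, so I would like to delete one of the duplicate rows; it does not matter which one.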

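The solution would be something like: where the table has two rows with the same uuid and message_timestamp, delete either one of them. It seems simple, but I am stuck.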

sql google-bigquery
1 answer
0 votes

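We can build a JSON string (jdata) that contains all of the table's columns and use that value to tell the rows apart.

See the example. There are 3 rows that differ in the email column.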

WITH mytable AS (
  -- Sample data: three rows that share one uid (per the answer, they differ in email).
  SELECT 1 id, 'code1' code, 'description1' description, '[email protected]' email, 'address:Client1' contacts, 'cf35d47-0370-467f-b946-f2a1286d8fad' uid
  UNION ALL
  SELECT 1 id, 'code1' code, 'description1' description, '[email protected]' email, 'address:Client1' contacts, 'cf35d47-0370-467f-b946-f2a1286d8fad' uid
  UNION ALL
  SELECT 1 id, 'code1' code, 'description1' description, '[email protected]' email, 'address:Client1' contacts, 'cf35d47-0370-467f-b946-f2a1286d8fad' uid
)
, data AS (
  -- Serialize each whole row to JSON and number the rows within each uid.
  SELECT uid, TO_JSON_STRING(a) AS jdata,
    ROW_NUMBER() OVER (PARTITION BY uid ORDER BY TO_JSON_STRING(a)) rn
  FROM mytable a
)
-- Join back on the JSON string so that every source row is shown with its rn.
SELECT targ.email, src.*
FROM mytable targ
LEFT JOIN data src ON src.jdata = TO_JSON_STRING(targ)

Output

email | uid | jdata | rn
[email protected] | cf35d47-0370-467f-b946-f2a1286d8fad | {"id":1,"code":"code1","description":"description1","email":"[email protected]","contacts":"address:Client1","uid":"cf35d47-0370-467f-b946-f2a1286d8fad"} | 1
[email protected] | cf35d47-0370-467f-b946-f2a1286d8fad | {"id":1,"code":"code1","description":"description1","email":"[email protected]","contacts":"address:Client1","uid":"cf35d47-0370-467f-b946-f2a1286d8fad"} | 3
[email protected] | cf35d47-0370-467f-b946-f2a1286d8fad | {"id":1,"code":"code1","description":"description1","email":"[email protected]","contacts":"address:Client1","uid":"cf35d47-0370-467f-b946-f2a1286d8fad"} | 2
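
The query above only ranks the rows. To actually remove one row from each duplicated pair, the same jdata string could drive a DELETE. What follows is an untested sketch rather than a definitive implementation: my_dataset.my_table is a placeholder name, the grouping on uuid and message_timestamp is taken from the question, and it assumes BigQuery accepts the table alias as a whole-row value inside a DELETE's WHERE clause in the same way it does in a SELECT.

```sql
-- Sketch only: my_dataset.my_table is a hypothetical table name.
-- Assumption: the alias t can be passed whole to TO_JSON_STRING in a DELETE.
DELETE FROM my_dataset.my_table t
WHERE TO_JSON_STRING(t) IN (
  SELECT jdata
  FROM (
    SELECT
      TO_JSON_STRING(a) AS jdata,
      -- Number the rows inside each (uuid, message_timestamp) group.
      ROW_NUMBER() OVER (
        PARTITION BY a.uuid, a.message_timestamp
        ORDER BY TO_JSON_STRING(a)
      ) AS rn
    FROM my_dataset.my_table a
  )
  WHERE rn > 1  -- rows numbered 1 are kept, the rest are deleted
);
```

Because the match is made on the full row string, two rows that are identical in every column would both be deleted; the question states that the other columns differ, so that case should not arise here. Running the inner SELECT on its own first shows which rows would be removed.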