I need to parse nested XML in Snowflake. Example:
WITH xml_table
AS (
    SELECT 1 AS ID
         , PARSE_XML(
'<root>
<docs>
<doc>
<Id>1</Id>
<Name>
<Kvp>
<Key>k1</Key>
<Value>v1</Value>
</Kvp>
<Kvp>
<Key>k2</Key>
<Value>v2</Value>
</Kvp>
<Kvp>
<Key>k3</Key>
<Value>v3</Value>
</Kvp>
</Name>
</doc>
<doc>
<Id>2</Id>
<Name>
<Kvp>
<Key>k4</Key>
<Value>v4</Value>
</Kvp>
<Kvp>
<Key>k5</Key>
<Value>v5</Value>
</Kvp>
</Name>
</doc>
<doc>
<Id>3</Id>
<Name>
<Kvp>
<Key>k6</Key>
<Value>v6</Value>
</Kvp>
</Name>
</doc>
</docs>
</root>'
           ) AS XML_COL
)
SELECT docs.ID
     , docs.docId
     --, docs.docName
     , GET(XMLGET(kvps.VALUE, 'Key'), '$')::STRING AS docNameKey
     , GET(XMLGET(kvps.VALUE, 'Value'), '$')::STRING AS docNameValue
     --, kvps.*
FROM --docs
     (
       SELECT xml_table.ID
            , xml_table.XML_COL
            , GET(xml_table.XML_COL, '@')::STRING AS ROOT_NODE_NAME
            , XMLGET(docs.VALUE, 'Id') : "$"::STRING AS docId
            , XMLGET(docs.VALUE, 'Name') AS docName
       FROM xml_table
          , LATERAL FLATTEN(GET(XMLGET(xml_table.XML_COL, 'docs'), '$')) AS docs
       WHERE 1 = 1
     ) docs
   , LATERAL FLATTEN(GET(docs.docName, '$')) AS kvps
WHERE 1 = 1
So I need to parse it into a tabular format:
doc.Id AS docId
doc.Name.Kvp.Key AS docNameKey
doc.Name.Kvp.Value AS docNameValue
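The example above appears to work as expected when there are multiple Kvp key-value pairs under Name (as with docId IN (1,2) in the example), but when there is only one key-value pair the result is not what I expect. Why?
The query returns: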
ID  DOCID  DOCNAMEKEY  DOCNAMEVALUE
1   1      k1          v1
1   1      k2          v2
1   1      k3          v3
1   2      k4          v4
1   2      k5          v5
1   3      null        null
1   3      null        null
1   3      null        null
1   3      null        null
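As you can see, docId=3 has been "blown up" into 4 rows, and no key-value pair was found, even though it does have one key-value pair: k6-v6.
So the expected output (what I want to get) is: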
ID  DOCID  DOCNAMEKEY  DOCNAMEVALUE
1   1      k1          v1
1   1      k2          v2
1   1      k3          v3
1   2      k4          v4
1   2      k5          v5
1   3      k6          v6
Why does the XML parsing depend on the number of child elements?
My current workaround:
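A diagnostic sketch that may help narrow this down: it prints what FLATTEN emits for each Name element, alongside a check of whether the element's content is an ARRAY. It uses a trimmed copy of the sample XML (docs 1 and 3 only), and the output column names (nameContentIsArray, flattenKey, flattenIndex, flattenValue) are made up for illustration. The working assumption, which this query is meant to confirm or refute, is that GET(..., '$') returns an ARRAY when there are several child elements but a single OBJECT when there is only one, so FLATTEN ends up iterating over that object's fields instead of over a list of Kvp elements.

```sql
-- Diagnostic sketch (assumption: '$' is an ARRAY for many children, an OBJECT for one).
WITH xml_table AS (
    SELECT 1 AS ID
         , PARSE_XML(
'<root>
<docs>
<doc><Id>1</Id><Name><Kvp><Key>k1</Key><Value>v1</Value></Kvp><Kvp><Key>k2</Key><Value>v2</Value></Kvp></Name></doc>
<doc><Id>3</Id><Name><Kvp><Key>k6</Key><Value>v6</Value></Kvp></Name></doc>
</docs>
</root>'
           ) AS XML_COL
)
SELECT GET(XMLGET(docs.VALUE, 'Id'), '$')::STRING       AS docId
     -- TRUE if the content of <Name> is an ARRAY of child elements
     , IS_ARRAY(GET(XMLGET(docs.VALUE, 'Name'), '$'))   AS nameContentIsArray
     -- KEY is populated when FLATTEN walks an OBJECT, INDEX when it walks an ARRAY
     , kvps.KEY                                         AS flattenKey
     , kvps.INDEX                                       AS flattenIndex
     , kvps.VALUE                                       AS flattenValue
FROM xml_table
   , LATERAL FLATTEN(GET(XMLGET(xml_table.XML_COL, 'docs'), '$')) AS docs
   , LATERAL FLATTEN(GET(XMLGET(docs.VALUE, 'Name'), '$')) AS kvps
ORDER BY docId, flattenIndex, flattenKey;
```

If the rows for docId 3 show a non-null flattenKey and a null flattenIndex, that would be consistent with FLATTEN having walked a single object rather than an array.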
WITH xml_table
AS (
    SELECT 1 AS ID
         , PARSE_XML(
'<root>
<docs>
<doc>
<Id>1</Id>
<Name>
<Kvp>
<Key>k1</Key>
<Value>v1</Value>
</Kvp>
<Kvp>
<Key>k2</Key>
<Value>v2</Value>
</Kvp>
<Kvp>
<Key>k3</Key>
<Value>v3</Value>
</Kvp>
</Name>
</doc>
<doc>
<Id>2</Id>
<Name>
<Kvp>
<Key>k4</Key>
<Value>v4</Value>
</Kvp>
<Kvp>
<Key>k5</Key>
<Value>v5</Value>
</Kvp>
</Name>
</doc>
<doc>
<Id>3</Id>
<Name>
<Kvp>
<Key>k6</Key>
<Value>v6</Value>
</Kvp>
</Name>
</doc>
</docs>
</root>'
           ) AS XML_COL
)
SELECT docs.ID
     , docs.docId
     , CASE
         WHEN kvps.KEY = '$' --For some reason Snowflake works differently if there is 1 sub element than when multiple
           THEN CAST(GET(XMLGET(kvps.THIS, 'Key'), '$') AS STRING)
         ELSE CAST(GET(XMLGET(kvps.VALUE, 'Key'), '$') AS STRING)
       END AS docNameKey
     , CASE
         WHEN kvps.KEY = '$' --For some reason Snowflake works differently if there is 1 sub element than when multiple
           THEN CAST(GET(XMLGET(kvps.THIS, 'Value'), '$') AS STRING)
         ELSE CAST(GET(XMLGET(kvps.VALUE, 'Value'), '$') AS STRING)
       END AS docNameValue
FROM --docs
     (
       SELECT xml_table.ID
            , xml_table.XML_COL
            , CAST(GET(xml_table.XML_COL, '@') AS STRING) AS ROOT_NODE_NAME
            , CAST(GET(XMLGET(docs.VALUE, 'Id'), '$') AS STRING) AS docId
            , XMLGET(docs.VALUE, 'Name') AS docName
       FROM xml_table
          , LATERAL FLATTEN(GET(XMLGET(xml_table.XML_COL, 'docs'), '$')) AS docs
       WHERE 1 = 1
     ) docs
   , LATERAL FLATTEN(GET(docs.docName, '$')) AS kvps
WHERE 1 = 1
  AND (
        kvps.KEY IS NULL
        OR kvps.KEY = '$'
      ) --For some reason Snowflake works differently if there is 1 sub element than when multiple
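So I added a WHERE clause condition to exclude the "extra" rows that Snowflake adds when there is only one key-value pair. Then, in the SELECT clause, additional CASE expressions extract the values from either the THIS or the VALUE column, depending on whether there is one key-value pair or several.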
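A possibly simpler alternative to the CASE/WHERE workaround, offered as a sketch rather than a confirmed fix: wrap the flattened expression in TO_ARRAY. This rests on the assumption that the single-child case hands FLATTEN one OBJECT instead of an ARRAY. TO_ARRAY leaves an existing array unchanged and wraps any other value in a one-element array, so FLATTEN would see the same shape in both cases and kvps.VALUE would always be a Kvp element. The sample XML is trimmed to docs 1 and 3 to keep the block short.

```sql
-- Sketch: normalize "one child" and "many children" to an ARRAY before FLATTEN.
-- Assumption: GET(..., '$') yields an OBJECT (not an ARRAY) when there is a single child.
WITH xml_table AS (
    SELECT 1 AS ID
         , PARSE_XML(
'<root>
<docs>
<doc><Id>1</Id><Name><Kvp><Key>k1</Key><Value>v1</Value></Kvp><Kvp><Key>k2</Key><Value>v2</Value></Kvp></Name></doc>
<doc><Id>3</Id><Name><Kvp><Key>k6</Key><Value>v6</Value></Kvp></Name></doc>
</docs>
</root>'
           ) AS XML_COL
)
SELECT xml_table.ID
     , GET(XMLGET(docs.VALUE, 'Id'), '$')::STRING    AS docId
     , GET(XMLGET(kvps.VALUE, 'Key'), '$')::STRING   AS docNameKey
     , GET(XMLGET(kvps.VALUE, 'Value'), '$')::STRING AS docNameValue
FROM xml_table
     -- TO_ARRAY is applied at the doc level too, in case <docs> holds a single <doc>
   , LATERAL FLATTEN(TO_ARRAY(GET(XMLGET(xml_table.XML_COL, 'docs'), '$'))) AS docs
   , LATERAL FLATTEN(TO_ARRAY(GET(XMLGET(docs.VALUE, 'Name'), '$'))) AS kvps
ORDER BY docId, docNameKey;
```

With this shape the CASE expressions and the kvps.KEY filter should no longer be needed, since every flattened row comes from an array element. It is worth checking against real data, including documents where Name has no Kvp children at all.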