Parsing nested XML with key-value pairs in Snowflake

Question

I need to parse nested XML in Snowflake. Example:


    WITH xml_table
    AS (
        SELECT 1 AS ID
            , PARSE_XML(
                '<root>
                    <docs>
                        <doc>
                            <Id>1</Id>
                            <Name>
                                <Kvp>
                                    <Key>k1</Key>
                                    <Value>v1</Value>
                                </Kvp>
                                <Kvp>
                                    <Key>k2</Key>
                                    <Value>v2</Value>
                                </Kvp>
                                <Kvp>
                                    <Key>k3</Key>
                                    <Value>v3</Value>
                                </Kvp>
                            </Name>
                        </doc>
                        <doc>
                            <Id>2</Id>
                            <Name>
                                <Kvp>
                                    <Key>k4</Key>
                                    <Value>v4</Value>
                                </Kvp>
                                <Kvp>
                                    <Key>k5</Key>
                                    <Value>v5</Value>
                                </Kvp>
                            </Name>
                        </doc>
                        <doc>
                            <Id>3</Id>
                            <Name>
                                <Kvp>
                                    <Key>k6</Key>
                                    <Value>v6</Value>
                                </Kvp>
                            </Name>
                        </doc>
                    </docs>
                </root>'
            ) AS XML_COL
        )
    SELECT docs.ID
        , docs.docId
        --, docs.docName
        , GET(XMLGET(kvps.VALUE, 'Key'), '$')::STRING AS docNameKey
        , GET(XMLGET(kvps.VALUE, 'Value'), '$')::STRING AS docNameValue
        --, kvps.*
    FROM --docs
        (
        SELECT xml_table.ID
            , xml_table.XML_COL
            , GET(xml_table.XML_COL, '@')::STRING AS ROOT_NODE_NAME
            , XMLGET(docs.VALUE, 'Id') : "$"::STRING AS docId
            , XMLGET(docs.VALUE, 'Name') AS docName
        FROM xml_table
            , LATERAL FLATTEN(GET(XMLGET(xml_table.XML_COL, 'docs'), '$')) AS docs
        WHERE 1 = 1
        ) docs
        , LATERAL FLATTEN(GET(docs.docName, '$')) AS kvps
    WHERE 1 = 1

So I need to parse it into a tabular format:

    doc.Id AS docId
    doc.Name.Kvp.Key AS docNameKey
    doc.Name.Kvp.Value AS docNameValue

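The example seems to work as expected when there are multiple key-value pairs under Name (as for docId IN (1, 2) in the example), but when there is only a single key-value pair, the result is not what I expect. Why?

The query returns: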

    ID  DOCID  DOCNAMEKEY  DOCNAMEVALUE
    1   1      k1          v1
    1   1      k2          v2
    1   1      k3          v3
    1   2      k4          v4
    1   2      k5          v5
    1   3      null        null
    1   3      null        null
    1   3      null        null
    1   3      null        null

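As you can see, docId=3 has been "blown up" into 4 records and no key-value pair was found, even though it does have one: k6-v6.

So the expected output (what I want to get) is: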

    ID  DOCID  DOCNAMEKEY  DOCNAMEVALUE
    1   1      k1          v1
    1   1      k2          v2
    1   1      k3          v3
    1   2      k4          v4
    1   2      k5          v5
    1   3      k6          v6

Why does the XML parsing depend on the number of child elements?

xml-parsing snowflake-cloud-data-platform
1 Answer

This is how I currently work around it:

    WITH xml_table
    AS (
        SELECT 1 AS ID
            , PARSE_XML(
                '<root>
                        <docs>
                            <doc>
                                <Id>1</Id>
                                <Name>
                                    <Kvp>
                                        <Key>k1</Key>
                                        <Value>v1</Value>
                                    </Kvp>
                                    <Kvp>
                                        <Key>k2</Key>
                                        <Value>v2</Value>
                                    </Kvp>
                                    <Kvp>
                                        <Key>k3</Key>
                                        <Value>v3</Value>
                                    </Kvp>
                                </Name>
                            </doc>
                            <doc>
                                <Id>2</Id>
                                <Name>
                                    <Kvp>
                                        <Key>k4</Key>
                                        <Value>v4</Value>
                                    </Kvp>
                                    <Kvp>
                                        <Key>k5</Key>
                                        <Value>v5</Value>
                                    </Kvp>
                                </Name>
                            </doc>
                            <doc>
                                <Id>3</Id>
                                <Name>
                                    <Kvp>
                                        <Key>k6</Key>
                                        <Value>v6</Value>
                                    </Kvp>
                                </Name>
                            </doc>
                        </docs>
                    </root>'
            ) AS XML_COL
        )
    SELECT docs.ID
        , docs.docId
        , CASE 
            WHEN kvps.KEY = '$' --For some reason Snowflake works differently if there is 1 sub element than when multiple
                THEN CAST(GET(XMLGET(kvps.THIS, 'Key'), '$') AS STRING)
            ELSE CAST(GET(XMLGET(kvps.VALUE, 'Key'), '$') AS STRING)
            END AS docNameKey
        , CASE 
            WHEN kvps.KEY = '$' --For some reason Snowflake works differently if there is 1 sub element than when multiple
                THEN CAST(GET(XMLGET(kvps.THIS, 'Value'), '$') AS STRING)
            ELSE CAST(GET(XMLGET(kvps.VALUE, 'Value'), '$') AS STRING)
            END AS docNameValue
    FROM --docs
        (
        SELECT xml_table.ID
            , xml_table.XML_COL
            , CAST(GET(xml_table.XML_COL, '@') AS STRING) AS ROOT_NODE_NAME
            , CAST(GET(XMLGET(docs.VALUE, 'Id'), '$') AS STRING) AS docId
            , XMLGET(docs.VALUE, 'Name') AS docName
        FROM xml_table
            , LATERAL FLATTEN(GET(XMLGET(xml_table.XML_COL, 'docs'), '$')) AS docs
        WHERE 1 = 1
        ) docs
        , LATERAL FLATTEN(GET(docs.docName, '$')) AS kvps
    WHERE 1 = 1
        AND (
            kvps.KEY IS NULL
            OR kvps.KEY = '$'
            ) --For some reason Snowflake works differently if there is 1 sub element than when multiple

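So I added a WHERE clause condition to exclude the "extra" rows that Snowflake produces when there is only one key-value pair. The CASE expressions in the SELECT clause then extract the values from either the THIS or the VALUE column, depending on whether there is one key-value pair or several.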
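A likely explanation for the difference, stated as an assumption rather than something confirmed above: when an element has several children, GET(..., '$') returns an ARRAY of child elements, but when it has exactly one child it returns that child as a single OBJECT. FLATTEN then walks the fields of that one element instead of a list of Kvp elements, which would account for the extra rows and the NULLs. If that is the cause, wrapping the FLATTEN input in TO_ARRAY should normalize both cases, since TO_ARRAY leaves an array unchanged and wraps any other value in a one-element array. The following is a minimal sketch along those lines, using a shortened copy of the sample data, not a verified replacement for the query above:

```sql
-- Sketch only: assumes GET(..., '$') yields an OBJECT for a single child
-- and an ARRAY for several children, and relies on TO_ARRAY to normalize both.
WITH xml_table
AS (
    SELECT 1 AS ID
        , PARSE_XML(
            '<root>
                <docs>
                    <doc>
                        <Id>2</Id>
                        <Name>
                            <Kvp><Key>k4</Key><Value>v4</Value></Kvp>
                            <Kvp><Key>k5</Key><Value>v5</Value></Kvp>
                        </Name>
                    </doc>
                    <doc>
                        <Id>3</Id>
                        <Name>
                            <Kvp><Key>k6</Key><Value>v6</Value></Kvp>
                        </Name>
                    </doc>
                </docs>
            </root>'
        ) AS XML_COL
    )
SELECT xml_table.ID
    , GET(XMLGET(docs.VALUE, 'Id'), '$')::STRING AS docId
    , GET(XMLGET(kvps.VALUE, 'Key'), '$')::STRING AS docNameKey
    , GET(XMLGET(kvps.VALUE, 'Value'), '$')::STRING AS docNameValue
FROM xml_table
    -- TO_ARRAY also covers the case of a single <doc> under <docs>
    , LATERAL FLATTEN(INPUT => TO_ARRAY(GET(XMLGET(xml_table.XML_COL, 'docs'), '$'))) AS docs
    -- With an array input, each row's VALUE is one <Kvp> element
    , LATERAL FLATTEN(INPUT => TO_ARRAY(GET(XMLGET(docs.VALUE, 'Name'), '$'))) AS kvps
```

With this shape, the CASE expressions and the filter on kvps.KEY should no longer be needed, because every FLATTEN input is an array and the element is always read from kvps.VALUE.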
