Performing data validation with hash comparison on Snowflake tables

Question. Votes: 0. Answers: 1.

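Here is the scenario:

I have data in two schemas in Snowflake, and I want to verify that two tables contain the same data values. I have set up a dynamic SQL approach in Snowflake to do this.

My code first uses CTEs to retrieve the columns of both tables, then applies the MD5 hash function to generate hash values for comparison. Finally, I build a dynamic SQL query to display the comparison results.

However, the dynamic SQL only displays the SELECT statement itself instead of returning the actual output, so I am not sure whether my approach is entirely correct.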

Here is my code snippet:

-- Step 1: Extract the column list for both tables (schema_1 & schema_2)
WITH column_list_schema_1 AS (
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'schema_1'  
      AND TABLE_NAME = 'table_1'
),
column_list_schema_2 AS (
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'schema_2'
      AND TABLE_NAME = 'table_2'
),
-- Step 2: Generate dynamic SQL for comparing values in each row and generating hash values for comparison
dynamic_sql AS (
    SELECT 
        LISTAGG(
            'CASE WHEN h."' || COLUMN_NAME || '" != s."' || COLUMN_NAME || '" THEN ''' || COLUMN_NAME || ''' END AS "' || COLUMN_NAME || '"', 
            ', '
        ) WITHIN GROUP (ORDER BY COLUMN_NAME) AS column_comparisons,
        -- Generate MD5 hashes for comparison (on a row level)
        LISTAGG(
            'MD5(CAST(h."' || COLUMN_NAME || '" AS STRING))', ', '
        ) WITHIN GROUP (ORDER BY COLUMN_NAME) AS h_hash,
        LISTAGG(
            'MD5(CAST(s."' || COLUMN_NAME || '" AS STRING))', ', '
        ) WITHIN GROUP (ORDER BY COLUMN_NAME) AS s_hash
    FROM column_list_schema_1
    WHERE COLUMN_NAME IN (SELECT COLUMN_NAME FROM column_list_schema_2)
)
-- Step 3: Generate the final comparison query string
SELECT 
    'SELECT h.row_id AS h_row_id, 
            s.row_id AS s_row_id, 
            h.parent_id AS parent_id, 
            MD5(CONCAT(' || ds.h_hash || ')) AS h_hash, 
            MD5(CONCAT(' || ds.s_hash || ')) AS s_hash, 
            CASE WHEN MD5(CONCAT(' || ds.h_hash || ')) = MD5(CONCAT(' || ds.s_hash || ')) 
                 THEN ''Match'' ELSE ''Mismatch'' END AS comparison_result, 
            ' || ds.column_comparisons || '
     FROM 
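         /* Assumption: parent_id gives a stable row order; Snowflake requires ORDER BY in ROW_NUMBER. */
         (SELECT *, ROW_NUMBER() OVER (ORDER BY parent_id) AS row_id FROM schema_1.your_table_name) h
     FULL OUTER JOIN 
         (SELECT *, ROW_NUMBER() OVER (ORDER BY parent_id) AS row_id FROM schema_2.your_table_name) s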
     ON h.row_id = s.row_id
     WHERE MD5(CONCAT(' || ds.h_hash || ')) != MD5(CONCAT(' || ds.s_hash || '));' 
FROM dynamic_sql ds;
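
On the symptom described above: the final SELECT only builds a string, so the statement text is what comes back. A minimal sketch, assuming Snowflake Scripting is available, of running generated text with EXECUTE IMMEDIATE and returning its rows. The generated query here is a small stand-in (one MD5 per row over all columns), not the full comparison query; `stmt`, `res`, and `row_hash` are illustrative names, and `SCHEMA_1` / `YOUR_TABLE_NAME` assume the placeholder objects were created with unquoted identifiers.

```sql
DECLARE
    stmt STRING;      -- holds the generated statement text
    res  RESULTSET;   -- holds the rows the generated statement returns
BEGIN
    -- Build the statement text from the column list.
    -- COALESCE keeps a single NULL column from turning the whole row hash into NULL;
    -- note that NULL and the empty string then hash alike.
    SELECT 'SELECT MD5(CONCAT_WS(''|'', '
           || LISTAGG('COALESCE(CAST("' || COLUMN_NAME || '" AS STRING), '''')', ', ')
                  WITHIN GROUP (ORDER BY ORDINAL_POSITION)
           || ')) AS row_hash FROM schema_1.your_table_name'
      INTO :stmt
      FROM INFORMATION_SCHEMA.COLUMNS
     WHERE TABLE_SCHEMA = 'SCHEMA_1'          -- unquoted identifiers are stored in upper case
       AND TABLE_NAME   = 'YOUR_TABLE_NAME';

    -- Run the text instead of returning it.
    res := (EXECUTE IMMEDIATE :stmt);
    RETURN TABLE(res);
END;
```

In clients that do not accept an anonymous block directly, the whole block may need to be wrapped as `EXECUTE IMMEDIATE $$ ... $$;`.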

sql stored-procedures snowflake-cloud-data-platform md5 heroku-postgres
1 answer
0 votes

Wouldn't it be simpler to let the database do this work for you:

select 
    (
        select count(*) from (
            select * from so.so.tablename_a /* A */
            minus 
            select * from so.so.tablename_b /* B */
        )
    ) as a_minus_b,
    (
        select count(*) from (
            select * from so.so.tablename_b /* B */
            minus 
            select * from so.so.tablename_a /* A */
        )
    ) as b_minus_a;
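
If both counts are 0, neither table has a row that is missing from the other. A possible extension, sketched under the assumption that both tables have the same columns in the same order: compare one order-independent fingerprint per table with `HASH_AGG(*)`, then list the differing rows themselves rather than only counting them. The aliases `hash_a`, `hash_b`, `tables_match`, `side`, and `d` are illustrative.

```sql
-- Quick check: one fingerprint per table, independent of row order.
-- An empty table hashes to NULL, so tables_match is NULL in that case.
SELECT hash_a, hash_b, hash_a = hash_b AS tables_match
FROM (
    SELECT
        (SELECT HASH_AGG(*) FROM so.so.tablename_a) AS hash_a,
        (SELECT HASH_AGG(*) FROM so.so.tablename_b) AS hash_b
);

-- Detail: the rows that exist on one side only, tagged with their side.
SELECT 'only_in_a' AS side, d.*
FROM (
    SELECT * FROM so.so.tablename_a
    MINUS
    SELECT * FROM so.so.tablename_b
) d
UNION ALL
SELECT 'only_in_b' AS side, d.*
FROM (
    SELECT * FROM so.so.tablename_b
    MINUS
    SELECT * FROM so.so.tablename_a
) d;
```

MINUS removes duplicate rows, so a row that appears twice in one table and once in the other does not show up in the detail query; the fingerprint comparison is the one to check for that case.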
