How to check similarity between two datasets and return a score in Snowflake (is this possible?)


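I have two datasets containing the full names of my company's customers. Both are fairly large (40-70k rows). I want to check whether names in one set have close matches in the other. For example, if one set has a record with the value "John Smith" and the other has "John W. Smith", I would probably treat that as a match.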

Column A        Column B         Similarity score
John Smith      John W. Smith    0.97
James Bond      Andrew Bond      0.5

The datasets are different sizes, and both are in Snowflake. Let me know if any information is missing.

Also, do you think this can be solved with SQL, or would Python be a better fit?

I tried using Jaro-Winkler similarity, but it only works on "specific" strings, not on whole datasets.

python sql snowflake-cloud-data-platform bigdata
1 Answer

Before comparing for similarity, you should prepare and clean the data. Here is one way to do it in Snowflake SQL.

WITH table_a_prepped AS (
  SELECT
      Employee_Name as name,
      -- Remove suffixes/titles, characters not alphabet or space, normalize case and trim
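      -- Punctuation is stripped first, so titles/suffixes are matched without periods; longer titles are listed first so MRS is not cut to S
      TRIM(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(UPPER(TRIM(name)), '[^A-Z ]', ''), '^(MRS|MISS|MR|MS|DR|PROF) +', ''), ' +(JR|SR|III|II|IV)$', '')) AS first_clean,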
      -- Remove middle names or initials
      CASE
        WHEN ARRAY_SIZE(SPLIT(first_clean, ' ')) > 2 THEN SPLIT_PART(first_clean, ' ', 1) || ' ' || SPLIT_PART(first_clean, ' ', -1)
        ELSE first_clean
      END AS cleaned_name,
      -- Create blocking key using the first letters of the first and last names from the cleansed name
       SUBSTRING(SPLIT_PART(cleaned_name,' ', 1), 1, 1) || SUBSTRING(SPLIT_PART(cleaned_name,' ', -1), 1, 1) AS blocking_key
    FROM
      table_a
  ),
  table_b_prepped AS (
    SELECT
      name,
      -- Remove suffixes/titles, characters not alphabet or space, normalize case and trim
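      -- Punctuation is stripped first, so titles/suffixes are matched without periods; longer titles are listed first so MRS is not cut to S
      TRIM(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(UPPER(TRIM(name)), '[^A-Z ]', ''), '^(MRS|MISS|MR|MS|DR|PROF) +', ''), ' +(JR|SR|III|II|IV)$', '')) AS first_clean,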
      -- Remove middle names or initials
      CASE
        WHEN ARRAY_SIZE(SPLIT(first_clean, ' ')) > 2 THEN SPLIT_PART(first_clean, ' ', 1) || ' ' || SPLIT_PART(first_clean, ' ', -1)
        ELSE first_clean
      END AS cleaned_name,
      -- Create blocking key using the first letters of the first and last names from the cleansed name
       SUBSTRING(SPLIT_PART(cleaned_name,' ', 1), 1, 1) || SUBSTRING(SPLIT_PART(cleaned_name,' ', -1), 1, 1) AS blocking_key
    FROM
      table_b
  )
  SELECT
    a.name AS "Column A",
    b.name AS "Column B",
    -- Levenshtein Distance converted to a similarity score between 0 and 1
    -- NULLIF guards against division by zero when both cleaned names are empty
    1 - (EDITDISTANCE(a.cleaned_name, b.cleaned_name)::FLOAT / NULLIF(GREATEST(LENGTH(a.cleaned_name), LENGTH(b.cleaned_name)), 0)) AS "Similarity Score"
  FROM
    table_a_prepped AS a
  INNER JOIN
    table_b_prepped b ON a.blocking_key = b.blocking_key
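
Since you also asked about Python: below is a rough sketch of the same idea (clean, block on initials, score with a normalized Levenshtein distance) using only the standard library. It assumes the names have already been pulled out of Snowflake into two plain lists. All function names here (`clean_name`, `blocking_key`, `edit_distance`, `similarity`, `match`) are made up for illustration, and the title/suffix lists are just the ones from the query above. Treat it as a starting point, not a tuned implementation.

```python
import re
from collections import defaultdict

# Titles and suffixes are matched after punctuation has been stripped.
TITLE_RE = re.compile(r"^(MRS|MISS|MR|MS|DR|PROF) +")
SUFFIX_RE = re.compile(r" +(JR|SR|III|II|IV)$")


def clean_name(raw):
    """Upper-case, drop non-letters, titles, suffixes and middle names."""
    s = re.sub(r"[^A-Z ]", "", raw.strip().upper())
    s = SUFFIX_RE.sub("", TITLE_RE.sub("", s)).strip()
    parts = s.split()
    # Keep only the first and last token when there are middle names.
    return f"{parts[0]} {parts[-1]}" if len(parts) > 2 else " ".join(parts)


def blocking_key(cleaned):
    """First letter of the first token plus first letter of the last token."""
    parts = cleaned.split()
    return parts[0][0] + parts[-1][0] if parts else ""


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance rescaled to a 0..1 score (1 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def match(names_a, names_b):
    """Yield (name_a, name_b, score) for pairs sharing a blocking key."""
    blocks = defaultdict(list)
    for name in names_b:
        cleaned = clean_name(name)
        if cleaned:  # skip names that clean down to nothing
            blocks[blocking_key(cleaned)].append((name, cleaned))
    for name in names_a:
        cleaned = clean_name(name)
        if not cleaned:
            continue
        for other, other_cleaned in blocks.get(blocking_key(cleaned), []):
            yield name, other, similarity(cleaned, other_cleaned)


if __name__ == "__main__":
    for row in match(["John Smith", "James Bond"],
                     ["John W. Smith", "Andrew Bond"]):
        print(row)
```

One thing to keep in mind with blocking on initials, in either SQL or Python: pairs whose initials differ are never scored at all. With the example from the question, "James Bond" (key JB) and "Andrew Bond" (key AB) land in different blocks, so that pair would not appear in the output. If you need those pairs too, you would have to loosen the key (for example, block on the last-name initial only) at the cost of more comparisons.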
  
      