How to check the similarity between two datasets and return a score in Snowflake (is this possible?)

Question

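I have two datasets containing the full names of my company's clients. Both are fairly large (40-70k rows). I want to check whether names in one set have similar counterparts in the other. For example, if one set has a record with the value "John Smith" and the other has "John W. Smith", I would probably treat that as a match.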

Column A	Column B	Similarity Score
John Smith	John W. Smith	0.97
James Bond	Andrew Bond	0.5

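The datasets have different sizes, and both are in Snowflake. Let me know if any information is missing.

Also, do you think this can be solved with SQL, or would Python be a better fit?

I tried using Jarowinkler similarity, but it only compares two specific strings and does not apply to whole datasets.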

Answer 1

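You should prepare and clean the data before comparing for similarity. Here is one approach using Snowflake SQL. It normalizes the names, derives a blocking key from the initials so that only plausible pairs are compared rather than every row against every row, and then converts the edit distance into a score between 0 and 1.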

WITH table_a_prepped AS (
  SELECT
      Employee_Name AS name,
      -- Normalize case, trim, and drop every character that is not a letter or a space.
      -- Then strip a leading title and a trailing suffix. Punctuation is already gone at
      -- that point, so the patterns match whole words delimited by spaces.
      TRIM(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(UPPER(TRIM(Employee_Name)), '[^A-Z ]', ''), '^(MR|MRS|MISS|MS|DR|PROF) +', ''), ' +(JR|SR|II|III|IV)$', '')) AS first_clean,
      -- Remove middle names or initials by keeping only the first and last words
      CASE
        WHEN ARRAY_SIZE(SPLIT(first_clean, ' ')) > 2 THEN SPLIT_PART(first_clean, ' ', 1) || ' ' || SPLIT_PART(first_clean, ' ', -1)
        ELSE first_clean
      END AS cleaned_name,
      -- Create blocking key using the first letters of the first and last names from the cleansed name
      SUBSTRING(SPLIT_PART(cleaned_name, ' ', 1), 1, 1) || SUBSTRING(SPLIT_PART(cleaned_name, ' ', -1), 1, 1) AS blocking_key
    FROM
      table_a
  ),
  table_b_prepped AS (
    SELECT
      name,
      -- Same cleaning as for table_a, so both sides are normalized identically
      TRIM(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(UPPER(TRIM(name)), '[^A-Z ]', ''), '^(MR|MRS|MISS|MS|DR|PROF) +', ''), ' +(JR|SR|II|III|IV)$', '')) AS first_clean,
      -- Remove middle names or initials by keeping only the first and last words
      CASE
        WHEN ARRAY_SIZE(SPLIT(first_clean, ' ')) > 2 THEN SPLIT_PART(first_clean, ' ', 1) || ' ' || SPLIT_PART(first_clean, ' ', -1)
        ELSE first_clean
      END AS cleaned_name,
      -- Create blocking key using the first letters of the first and last names from the cleansed name
      SUBSTRING(SPLIT_PART(cleaned_name, ' ', 1), 1, 1) || SUBSTRING(SPLIT_PART(cleaned_name, ' ', -1), 1, 1) AS blocking_key
    FROM
      table_b
  )
  SELECT
    a.name AS "Column A",
    b.name AS "Column B",
    -- Levenshtein distance converted to a similarity score between 0 and 1.
    -- NULLIF avoids a division by zero when both cleaned names are empty (score is then NULL).
    1 - (EDITDISTANCE(a.cleaned_name, b.cleaned_name)::FLOAT / NULLIF(GREATEST(LENGTH(a.cleaned_name), LENGTH(b.cleaned_name)), 0)) AS "Similarity Score"
  FROM
    table_a_prepped AS a
  INNER JOIN
    -- Only pairs that share a blocking key are compared
    table_b_prepped AS b ON a.blocking_key = b.blocking_key;
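
On the SQL-versus-Python part of the question: the same steps can be mirrored in plain Python to spot-check a handful of names locally before running the query on the full tables. This is only a sketch under assumptions, not code from the answer. The function names (clean_name, blocking_key, edit_distance, similarity, match) are made up for illustration, and the hand-written Levenshtein routine stands in for Snowflake's EDITDISTANCE.

```python
import re
from collections import defaultdict

# Hypothetical helpers that mirror the cleaning steps of the SQL query.
TITLE_RE = re.compile(r"^(MR|MRS|MISS|MS|DR|PROF) +")
SUFFIX_RE = re.compile(r" +(JR|SR|II|III|IV)$")


def clean_name(raw):
    """Upper-case, drop non-letters, strip title/suffix, keep first and last word."""
    name = re.sub(r"[^A-Z ]", "", raw.strip().upper())
    name = TITLE_RE.sub("", name)
    name = SUFFIX_RE.sub("", name)
    parts = name.split()
    if len(parts) > 2:
        return parts[0] + " " + parts[-1]
    return " ".join(parts)


def blocking_key(cleaned):
    """First letter of the first word plus first letter of the last word."""
    parts = cleaned.split()
    if not parts:
        return ""
    return parts[0][0] + parts[-1][0]


def edit_distance(a, b):
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Edit distance scaled to 0..1; None when both strings are empty."""
    longest = max(len(a), len(b))
    if longest == 0:
        return None
    return 1 - edit_distance(a, b) / longest


def match(names_a, names_b):
    """Score only the pairs that share a blocking key."""
    blocks = defaultdict(list)
    for original_b in names_b:
        cleaned_b = clean_name(original_b)
        blocks[blocking_key(cleaned_b)].append((original_b, cleaned_b))
    rows = []
    for original_a in names_a:
        cleaned_a = clean_name(original_a)
        for original_b, cleaned_b in blocks.get(blocking_key(cleaned_a), []):
            rows.append((original_a, original_b, similarity(cleaned_a, cleaned_b)))
    return rows


if __name__ == "__main__":
    for row in match(["John Smith", "James Bond"], ["John W. Smith", "Andrew Bond"]):
        print(row)
```

With the two example rows from the question, this yields a single pair, John Smith / John W. Smith, with a score of 1.0, because the middle initial is dropped during cleaning. James Bond and Andrew Bond are never scored, since their blocking keys (JB and AB) differ. That is the trade-off of blocking on initials: far fewer comparisons, at the cost of missing pairs whose first letters disagree.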
