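I have two datasets containing the full names of my company's customers. Both are fairly large (40-70k rows). I want to check whether similar names appear across the two sets. For example, if one set has a record with the value "John Smith" and the other has "John W. Smith", I would probably treat that as a match.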
Column A | Column B | Similarity Score |
---|---|---|
John Smith | John W. Smith | 0.97 |
James Bond | Andrew Bond | 0.5 |
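The datasets are different sizes, and both are in Snowflake. Let me know if any information is missing.
Also, do you think this can be solved in SQL, or is Python a better fit?
I tried Jaro-Winkler similarity, but I could only get it to work on individual strings, not on whole datasets.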
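Before comparing similarity, you should prepare and clean the data. Here is one approach using Snowflake SQL. It normalizes each name, builds a blocking key from the first and last initials so that only plausible pairs are compared instead of every row against every row, and then scores each pair by edit distance.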
WITH table_a_prepped AS (
SELECT
Employee_Name as name,
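-- Normalize case and trim, drop every character that is not a letter or space (periods included),
-- then strip a leading title and a trailing suffix; requiring the adjacent space keeps names like DREW intact
TRIM(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(UPPER(TRIM(name)), '[^A-Z ]', ''), '^(MRS|MR|MISS|MS|DR|PROF) +', ''), ' +(JR|SR|III|II|IV)$', '')) AS first_clean,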
-- Remove middle names or initials
CASE
WHEN ARRAY_SIZE(SPLIT(first_clean, ' ')) > 2 THEN SPLIT_PART(first_clean, ' ', 1) || ' ' || SPLIT_PART(first_clean, ' ', -1)
ELSE first_clean
END AS cleaned_name,
-- Create blocking key using the first letters of the first and last names from the cleansed name
SUBSTRING(SPLIT_PART(cleaned_name,' ', 1), 1, 1) || SUBSTRING(SPLIT_PART(cleaned_name,' ', -1), 1, 1) AS blocking_key
FROM
table_a
),
table_b_prepped AS (
SELECT
name,
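-- Normalize case and trim, drop every character that is not a letter or space (periods included),
-- then strip a leading title and a trailing suffix; requiring the adjacent space keeps names like DREW intact
TRIM(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(UPPER(TRIM(name)), '[^A-Z ]', ''), '^(MRS|MR|MISS|MS|DR|PROF) +', ''), ' +(JR|SR|III|II|IV)$', '')) AS first_clean,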
-- Remove middle names or initials
CASE
WHEN ARRAY_SIZE(SPLIT(first_clean, ' ')) > 2 THEN SPLIT_PART(first_clean, ' ', 1) || ' ' || SPLIT_PART(first_clean, ' ', -1)
ELSE first_clean
END AS cleaned_name,
-- Create blocking key using the first letters of the first and last names from the cleansed name
SUBSTRING(SPLIT_PART(cleaned_name,' ', 1), 1, 1) || SUBSTRING(SPLIT_PART(cleaned_name,' ', -1), 1, 1) AS blocking_key
FROM
table_b
)
SELECT
a.name AS "Column A",
b.name AS "Column B",
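-- Levenshtein distance converted to a similarity score between 0 and 1;
-- NULLIF returns NULL instead of raising a division-by-zero error when both cleaned names are empty
1 - (EDITDISTANCE(a.cleaned_name, b.cleaned_name)::FLOAT / NULLIF(GREATEST(LENGTH(a.cleaned_name), LENGTH(b.cleaned_name)), 0)) AS "Similarity Score"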
FROM
table_a_prepped AS a
INNER JOIN
table_b_prepped b ON a.blocking_key = b.blocking_key
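
On the SQL-versus-Python part of the question: the same clean, block, and score steps can also be sketched in plain Python if you would rather pull the names out of Snowflake first. This is only a rough illustration using the standard library. The function names (`clean_name`, `blocking_key`, `edit_distance`, `similarity`, `match_names`) and the 0.8 threshold are my own assumptions, not part of any library, and the title and suffix lists simply mirror the ones in the query above.

```python
import re
from collections import defaultdict

# Title and suffix lists mirror the SQL above; extend them for your own data.
TITLE_RE = re.compile(r"^(MRS|MR|MISS|MS|DR|PROF) +")
SUFFIX_RE = re.compile(r" +(JR|SR|III|II|IV)$")


def clean_name(raw):
    """Upper-case, drop non-letters, strip title/suffix, keep first and last token."""
    s = re.sub(r"[^A-Z ]", "", raw.upper().strip())
    s = TITLE_RE.sub("", s)
    s = SUFFIX_RE.sub("", s)
    parts = s.split()
    if len(parts) > 2:
        parts = [parts[0], parts[-1]]  # drop middle names and initials
    return " ".join(parts)


def blocking_key(cleaned):
    """First letter of the first token plus first letter of the last token."""
    parts = cleaned.split()
    if not parts:
        return ""
    return parts[0][0] + parts[-1][0]


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Edit distance scaled to 0..1; None when both strings are empty."""
    longest = max(len(a), len(b))
    if longest == 0:
        return None
    return 1 - edit_distance(a, b) / longest


def match_names(names_a, names_b, threshold=0.8):
    """Compare only names that share a blocking key; threshold is an assumed cutoff."""
    blocks = defaultdict(list)
    for nb in names_b:
        cb = clean_name(nb)
        blocks[blocking_key(cb)].append((nb, cb))

    results = []
    for na in names_a:
        ca = clean_name(na)
        for nb, cb in blocks.get(blocking_key(ca), []):
            score = similarity(ca, cb)
            if score is not None and score >= threshold:
                results.append((na, nb, score))
    return results


if __name__ == "__main__":
    pairs = match_names(["John Smith", "James Bond"],
                        ["John W. Smith", "Andrew Bond"])
    for a, b, score in pairs:
        print(a, "|", b, "|", round(score, 2))
```

Keep in mind that blocking on initials, in either version, means pairs with different first initials (such as "James Bond" and "Andrew Bond") are never compared at all. That is what keeps the comparison count manageable at 40-70k rows per side, but if you need scores for those pairs too, you would have to loosen the key, for example by blocking on the last-name initial only.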