I am using Oracle 11g and want to compare two tables to find the records that match between them.
Example:
Table 1    Table 2
George     Micheal
Michael    Paul
The records "Micheal" and "Michael" match each other, so they count as good records.
To check whether two records match, I use the Oracle function utl_match.edit_distance_similarity.
I tried the code below, but I ran into a performance problem (it is too slow):
SELECT *
FROM table1
JOIN table2
ON utl_match.edit_distance_similarity(table1.name, table2.name) > 75;
Is there a better solution?
Thank you.
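This is a hard problem. In general, a join on a function like this results in a nested loops join, which is slow because every row of one table is compared with every row of the other. You can use SOUNDEX() to get "close" matches first and then apply the edit distance function as the final filter. This may or may not solve your problem.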
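Although I am not a big fan of the function, you might find that soundex() works for your purposes.
The idea is to add an index on this value in both tables: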
create index idx_table1_soundexname on table1(soundex(name));
create index idx_table2_soundexname on table2(soundex(name));
Then you can query like this:
SELECT *
FROM table1 t1 JOIN
table2 t2
ON soundex(t1.name) = soundex(t2.name)
WHERE utl_match.edit_distance_similarity(t1.name, t2.name) > 75;
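The idea is that Oracle will use the indexes to find names that are "close" and then apply the edit distance only to those candidate pairs to get a better match. This may not solve your problem; it is just an idea that might work.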
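Before relying on this, it may be worth checking two things on your own data: that the example pair really lands in the same soundex() bucket, and what similarity score it gets relative to the `> 75` threshold. The following is a minimal sketch using only the literals from the question; the column aliases are made up for illustration.

```sql
-- Compare the SOUNDEX codes and the similarity score of the example pair.
-- The aliases sx_table1, sx_table2 and similarity are illustrative only.
SELECT soundex('Michael') AS sx_table1,
       soundex('Micheal') AS sx_table2,
       utl_match.edit_distance_similarity('Michael', 'Micheal') AS similarity
FROM dual;

-- Check whether the optimizer actually uses the soundex indexes.
EXPLAIN PLAN FOR
SELECT *
FROM table1 t1 JOIN
     table2 t2
     ON soundex(t1.name) = soundex(t2.name)
WHERE utl_match.edit_distance_similarity(t1.name, t2.name) > 75;

SELECT * FROM TABLE(dbms_xplan.display);
```

One trade-off to keep in mind: soundex() keeps the first letter of the name, so two spellings that differ in their first letter get different codes and are never compared by the edit distance step at all.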
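If the name values in table1 and table2 are highly redundant (many repeated names), this may be a solution: run the expensive comparison once per distinct pair of names, then join the result back to the tables.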
-- Test data set
select count(*) from table1;
--> 10.000
select count(*) from table2;
--> 10.000
select count(distinct(name)) from table1;
--> ~ 2500
select count(distinct(name)) from table2;
--> ~ 2500
/* a) Join with function compare */
select table1.name, table2.name
from table1, table2
where utl_match.edit_distance_similarity(table1.name, table2.name) > 35;
/*
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5000000 | 270000000 | 37364 | 00:09:21 |
| 1 | NESTED LOOPS | | 5000000 | 270000000 | 37364 | 00:09:21 |
| 2 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| * 3 | TABLE ACCESS FULL | TABLE2 | 500 | 13500 | 4 | 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 3 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("TABLE1"."NAME","TABLE2"."NAME")>35)
Note
-----
- dynamic sampling used for this statement
*/
/* b) Join with function, only distinct values */
-- A Set of all existing names (in table1 and table2)
with names as
(select name from table1 union select name from table2),
-- Compare only once because utl_match.edit_distance_similarity(name1, name2) = utl_match.edit_distance_similarity(name2, name1)
table_cmp(name1, name2) as
(select n1.name, n2.name
from names n1
join names n2
on n1.name <= n2.name
and utl_match.edit_distance_similarity(n1.name, n2.name) > 35)
select t1.*, t2.*
from table_cmp c
join table1 t1
on t1.name = c.name1
join table2 t2
on t2.name = c.name2
union all
select t1.*, t2.*
from table_cmp c
join table1 t1
on t1.name = c.name2
join table2 t2
on t2.name = c.name1;
/*
--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
--------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 30469950 | 3290754600 | 2495 | 00:00:38 |
| 1 | TEMP TABLE TRANSFORMATION | | | | | |
| 2 | LOAD AS SELECT | SYS_TEMP_0FD9D663E_B39FC2B6 | | | | |
| 3 | SORT UNIQUE | | 20000 | 540000 | 12 | 00:00:01 |
| 4 | UNION-ALL | | | | | |
| 5 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| 6 | TABLE ACCESS FULL | TABLE2 | 10000 | 270000 | 5 | 00:00:01 |
| 7 | LOAD AS SELECT | SYS_TEMP_0FD9D663F_B39FC2B6 | | | | |
| 8 | MERGE JOIN | | 1000000 | 54000000 | 62 | 00:00:01 |
| 9 | SORT JOIN | | 20000 | 540000 | 3 | 00:00:01 |
| 10 | VIEW | | 20000 | 540000 | 2 | 00:00:01 |
| 11 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663E_B39FC2B6 | 20000 | 540000 | 2 | 00:00:01 |
| * 12 | FILTER | | | | | |
| * 13 | SORT JOIN | | 20000 | 540000 | 3 | 00:00:01 |
| 14 | VIEW | | 20000 | 540000 | 2 | 00:00:01 |
| 15 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663E_B39FC2B6 | 20000 | 540000 | 2 | 00:00:01 |
| 16 | UNION-ALL | | | | | |
| * 17 | HASH JOIN | | 15234975 | 1645377300 | 1248 | 00:00:19 |
| 18 | TABLE ACCESS FULL | TABLE2 | 10000 | 270000 | 5 | 00:00:01 |
| * 19 | HASH JOIN | | 3903201 | 316159281 | 1200 | 00:00:18 |
| 20 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| 21 | VIEW | | 1000000 | 54000000 | 1183 | 00:00:18 |
| 22 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663F_B39FC2B6 | 1000000 | 54000000 | 1183 | 00:00:18 |
| * 23 | HASH JOIN | | 15234975 | 1645377300 | 1248 | 00:00:19 |
| 24 | TABLE ACCESS FULL | TABLE2 | 10000 | 270000 | 5 | 00:00:01 |
| * 25 | HASH JOIN | | 3903201 | 316159281 | 1200 | 00:00:18 |
| 26 | TABLE ACCESS FULL | TABLE1 | 10000 | 270000 | 5 | 00:00:01 |
| 27 | VIEW | | 1000000 | 54000000 | 1183 | 00:00:18 |
| 28 | TABLE ACCESS FULL | SYS_TEMP_0FD9D663F_B39FC2B6 | 1000000 | 54000000 | 1183 | 00:00:18 |
--------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 12 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("N1"."NAME","N2"."NAME")>35)
* 13 - access("N1"."NAME"<="N2"."NAME")
* 13 - filter("N1"."NAME"<="N2"."NAME")
* 17 - access("T2"."NAME"="C"."NAME2")
* 19 - access("T1"."NAME"="C"."NAME1")
* 23 - access("T2"."NAME"="C"."NAME1")
* 25 - access("T1"."NAME"="C"."NAME2")
Note
-----
- dynamic sampling used for this statement
*/
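One thing to watch in variant b), as far as I can tell: because the pair list is built with `n1.name <= n2.name`, every name is also paired with itself, so a name that occurs in both tables is returned by both branches of the `union all` and shows up twice. Below is a hedged sketch of a variant that keeps the "compare distinct names only" idea but deduplicates each table separately, so each (table1 name, table2 name) pair is evaluated exactly once. The CTE names `names1` and `names2` are assumptions for illustration, and this version has not been timed against the plans above.

```sql
-- Sketch: compare distinct names of table1 against distinct names of table2.
with names1 as
 (select distinct name from table1),
names2 as
 (select distinct name from table2),
-- One row per matching (table1 name, table2 name) pair
table_cmp as
 (select n1.name as name1, n2.name as name2
    from names1 n1
    join names2 n2
      on utl_match.edit_distance_similarity(n1.name, n2.name) > 35)
-- Expand the matching name pairs back to the full rows
select t1.*, t2.*
  from table_cmp c
  join table1 t1
    on t1.name = c.name1
  join table2 t2
    on t2.name = c.name2;
```

This gives up the symmetry trick (comparing each unordered pair once), so it only pays off when the two tables' distinct name sets are small compared with the row counts, as in the test data set above.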
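I wrote a custom function to compare the names instead of using edit distance. As suggested, I created soundex indexes on the name columns of both tables and joined on soundex(custname). My question is: does the join on soundex(custname) affect the result of the custom function, or is it only used for the index? Is there another way to create an index without using soundex?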
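On the first part: the join condition is a real filter, not just an index hint. Only pairs with equal soundex codes reach the comparison function, so pairs with different codes are dropped no matter what the function would have returned. On the second part: Oracle allows a function-based index on other deterministic expressions, so a different blocking key is possible. The sketch below uses the first two characters of the upper-cased name as a purely illustrative key (the index names are made up), and it keeps utl_match.edit_distance_similarity where your custom function would go.

```sql
-- Hypothetical alternative blocking key: upper-cased two-character prefix.
create index idx_table1_prefix on table1(upper(substr(name, 1, 2)));
create index idx_table2_prefix on table2(upper(substr(name, 1, 2)));

-- Only pairs sharing the prefix are passed to the similarity function.
SELECT *
FROM table1 t1 JOIN
     table2 t2
     ON upper(substr(t1.name, 1, 2)) = upper(substr(t2.name, 1, 2))
WHERE utl_match.edit_distance_similarity(t1.name, t2.name) > 75;
```

Any such key trades recall for speed: a prefix key misses pairs whose first two characters differ, just as soundex misses pairs with different codes, so pick the key based on the kind of spelling variation you expect.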