Quickly comparing data between two tables

Question  Votes: 0  Answers: 3

I am using Oracle11g, and I want to compare two tables to find the records that match between them.

Example:

Table 1        Table 2

George         Micheal
Michael        Paul

The records "Micheal" and "Michael" match each other, so they are good records.

To check whether two records match, I use the Oracle function utl_match.edit_distance_similarity.
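As a quick sanity check, the function can be run on the two example names directly. This is only a sketch against DUAL; the column aliases are illustrative. The similarity is a normalized score from 0 to 100, so it is worth confirming that the example pair actually clears the chosen threshold before tuning the join.

```sql
-- Raw edit distance and normalized similarity (0 to 100) for the example pair
SELECT utl_match.edit_distance('Micheal', 'Michael')            AS distance,
       utl_match.edit_distance_similarity('Micheal', 'Michael') AS similarity
FROM dual;
```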

I tried the code below, but I ran into performance problems (it is too slow):

SELECT * 
FROM table1
JOIN table2
ON utl_match.edit_distance_similarity(table1.name, table2.name) > 75;

Is there a better solution?

Thank you

sql oracle oracle11g utl-match
3 Answers
1
vote

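This is a hard problem. In general, a join on a function of both tables forces a nested loop join that evaluates the function for every pair of rows, which is slow. You can use SOUNDEX() to get "close" matches and then use the character distance function for the final filtering. This may not solve your problem, but it might.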

Although I am not a big fan of the function, you may find that soundex() suits your purposes (see here).
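Before relying on soundex() as a pre-filter, it may help to check that the names you expect to match share a code. A minimal sketch using the example names from the question (the aliases are illustrative):

```sql
-- If the two codes differ, a join on soundex() drops the pair
-- before the edit distance is ever evaluated.
SELECT soundex('Micheal') AS code1,
       soundex('Michael') AS code2
FROM dual;
```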

The idea is to add an index on this value:

create index idx_table1_soundexname on table1(soundex(name));
create index idx_table2_soundexname on table2(soundex(name));

Then you can query like this:

SELECT * 
FROM table1 t1 JOIN
     table2 t2
     ON soundex(t1.name) = soundex(t2.name)
WHERE utl_match.edit_distance_similarity(t1.name, t2.name) > 75;

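The idea is that Oracle uses the index to find names that are "close" and then applies the edit distance to narrow them down to better matches. This may not solve your problem. It is just an idea that might work.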
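Whether the optimizer actually picks the indexes (or, say, a hash join on the soundex values) depends on your data and statistics, so it is worth looking at the plan. A sketch using the standard EXPLAIN PLAN and DBMS_XPLAN.DISPLAY, with the same query as above:

```sql
EXPLAIN PLAN FOR
SELECT *
FROM table1 t1 JOIN
     table2 t2
     ON soundex(t1.name) = soundex(t2.name)
WHERE utl_match.edit_distance_similarity(t1.name, t2.name) > 75;

-- Show the plan that was just stored in PLAN_TABLE
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

If the plan still shows a nested loop over two full table scans with the soundex comparison as a filter, the pre-filter is not helping.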


1
vote

If the name values in table1 and table2 are highly redundant (many repeated names), this could be a solution:

-- Test data set

select count(*) from table1;
--> 10.000

select count(*) from table2;
--> 10.000

select count(distinct(name)) from table1;
--> ~ 2500

select count(distinct(name)) from table2;
--> ~ 2500

/* a) Join with function compare */

select table1.name, table2.name
  from table1, table2
 where utl_match.edit_distance_similarity(table1.name, table2.name) > 35;

/*

--------------------------------------------------------------------------------
| Id  | Operation            | Name   | Rows    | Bytes     | Cost  | Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |        | 5000000 | 270000000 | 37364 | 00:09:21 |
|   1 |   NESTED LOOPS       |        | 5000000 | 270000000 | 37364 | 00:09:21 |
|   2 |    TABLE ACCESS FULL | TABLE1 |   10000 |    270000 |     5 | 00:00:01 |
| * 3 |    TABLE ACCESS FULL | TABLE2 |     500 |     13500 |     4 | 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 3 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("TABLE1"."NAME","TABLE2"."NAME")>35)


Note
-----
- dynamic sampling used for this statement

*/

/* b) Join with function, only distinct values */

-- A Set of all existing names (in table1 and table2)
 with names as
 (select name from table1 union select name from table2),

-- Compare only once because utl_match.edit_distance_similarity(name1, name2) = utl_match.edit_distance_similarity(name2, name1)
 table_cmp(name1, name2) as
 (select n1.name, n2.name
          from names n1
          join names n2
            on n1.name <= n2.name
           and utl_match.edit_distance_similarity(n1.name, n2.name) > 35)

  select t1.*, t2.*
          from table_cmp c
          join table1 t1
            on t1.name = c.name1
          join table2 t2
            on t2.name = c.name2
        union all
        select t1.*, t2.*
          from table_cmp c
          join table1 t1
            on t1.name = c.name2
          join table2 t2
            on t2.name = c.name1;


/*

--------------------------------------------------------------------------------------------------------------
| Id   | Operation                   | Name                        | Rows     | Bytes      | Cost | Time     |
--------------------------------------------------------------------------------------------------------------
|    0 | SELECT STATEMENT            |                             | 30469950 | 3290754600 | 2495 | 00:00:38 |
|    1 |   TEMP TABLE TRANSFORMATION |                             |          |            |      |          |
|    2 |    LOAD AS SELECT           | SYS_TEMP_0FD9D663E_B39FC2B6 |          |            |      |          |
|    3 |     SORT UNIQUE             |                             |    20000 |     540000 |   12 | 00:00:01 |
|    4 |      UNION-ALL              |                             |          |            |      |          |
|    5 |       TABLE ACCESS FULL     | TABLE1                      |    10000 |     270000 |    5 | 00:00:01 |
|    6 |       TABLE ACCESS FULL     | TABLE2                      |    10000 |     270000 |    5 | 00:00:01 |
|    7 |    LOAD AS SELECT           | SYS_TEMP_0FD9D663F_B39FC2B6 |          |            |      |          |
|    8 |     MERGE JOIN              |                             |  1000000 |   54000000 |   62 | 00:00:01 |
|    9 |      SORT JOIN              |                             |    20000 |     540000 |    3 | 00:00:01 |
|   10 |       VIEW                  |                             |    20000 |     540000 |    2 | 00:00:01 |
|   11 |        TABLE ACCESS FULL    | SYS_TEMP_0FD9D663E_B39FC2B6 |    20000 |     540000 |    2 | 00:00:01 |
| * 12 |      FILTER                 |                             |          |            |      |          |
| * 13 |       SORT JOIN             |                             |    20000 |     540000 |    3 | 00:00:01 |
|   14 |        VIEW                 |                             |    20000 |     540000 |    2 | 00:00:01 |
|   15 |         TABLE ACCESS FULL   | SYS_TEMP_0FD9D663E_B39FC2B6 |    20000 |     540000 |    2 | 00:00:01 |
|   16 |    UNION-ALL                |                             |          |            |      |          |
| * 17 |     HASH JOIN               |                             | 15234975 | 1645377300 | 1248 | 00:00:19 |
|   18 |      TABLE ACCESS FULL      | TABLE2                      |    10000 |     270000 |    5 | 00:00:01 |
| * 19 |      HASH JOIN              |                             |  3903201 |  316159281 | 1200 | 00:00:18 |
|   20 |       TABLE ACCESS FULL     | TABLE1                      |    10000 |     270000 |    5 | 00:00:01 |
|   21 |       VIEW                  |                             |  1000000 |   54000000 | 1183 | 00:00:18 |
|   22 |        TABLE ACCESS FULL    | SYS_TEMP_0FD9D663F_B39FC2B6 |  1000000 |   54000000 | 1183 | 00:00:18 |
| * 23 |     HASH JOIN               |                             | 15234975 | 1645377300 | 1248 | 00:00:19 |
|   24 |      TABLE ACCESS FULL      | TABLE2                      |    10000 |     270000 |    5 | 00:00:01 |
| * 25 |      HASH JOIN              |                             |  3903201 |  316159281 | 1200 | 00:00:18 |
|   26 |       TABLE ACCESS FULL     | TABLE1                      |    10000 |     270000 |    5 | 00:00:01 |
|   27 |       VIEW                  |                             |  1000000 |   54000000 | 1183 | 00:00:18 |
|   28 |        TABLE ACCESS FULL    | SYS_TEMP_0FD9D663F_B39FC2B6 |  1000000 |   54000000 | 1183 | 00:00:18 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 12 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("N1"."NAME","N2"."NAME")>35)
* 13 - access("N1"."NAME"<="N2"."NAME")
* 13 - filter("N1"."NAME"<="N2"."NAME")
* 17 - access("T2"."NAME"="C"."NAME2")
* 19 - access("T1"."NAME"="C"."NAME1")
* 23 - access("T2"."NAME"="C"."NAME1")
* 25 - access("T1"."NAME"="C"."NAME2")


Note
-----
- dynamic sampling used for this statement

*/
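One detail worth checking in variant b): the self-join uses n1.name <= n2.name, so every name is also paired with itself (identical strings have a similarity of 100), and such pairs satisfy both branches of the UNION ALL. Rows from table1 and table2 that share an identical name would therefore come back twice, unlike in variant a). A minimal sketch of one way to avoid that, assuming the same tables and threshold, is to skip self-pairs in the second branch:

```sql
with names as
 (select name from table1 union select name from table2),
table_cmp(name1, name2) as
 (select n1.name, n2.name
    from names n1
    join names n2
      on n1.name <= n2.name
     and utl_match.edit_distance_similarity(n1.name, n2.name) > 35)
select t1.*, t2.*
  from table_cmp c
  join table1 t1 on t1.name = c.name1
  join table2 t2 on t2.name = c.name2
union all
select t1.*, t2.*
  from table_cmp c
  join table1 t1 on t1.name = c.name2
  join table2 t2 on t2.name = c.name1
 where c.name1 <> c.name2;  -- self-pairs are already returned by the first branch
```

The execution plan for this version was not measured here; the figures above belong to the original query.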

0
votes

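I wrote a custom function to compare names instead of using the edit distance. As suggested, I created soundex indexes on the name columns of both tables and joined on soundex(custname). My question is whether the join on soundex(custname) affects the result of the custom function or is only there so the index can be used. Is there another way to create an index without using soundex?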
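To make the question concrete: the join condition decides which pairs of rows are ever compared, so the custom function only sees pairs that already share the join key. Any function-based index on a deterministic expression can play that role. The sketch below is an assumption-laden illustration, not a recommendation: the first-letter key, the index names, and the function my_name_match (standing in for the custom comparison and assumed to return 1 on a match) are all hypothetical.

```sql
-- Hypothetical blocking key: upper-cased first letter of the name
create index idx_table1_name_initial on table1 (upper(substr(name, 1, 1)));
create index idx_table2_name_initial on table2 (upper(substr(name, 1, 1)));

select t1.name as name1, t2.name as name2
  from table1 t1
  join table2 t2
    on upper(substr(t1.name, 1, 1)) = upper(substr(t2.name, 1, 1))
 where my_name_match(t1.name, t2.name) = 1;  -- hypothetical custom function
```

As with soundex, pairs that differ in the key (here, names whose first letters differ) are never passed to the comparison function, so the choice of key trades recall for speed.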
