SQL Server 中的模糊短语相似度

问题描述 投票:0回答:3

使用 SQL Server,我应该如何在大型表中的所有行中执行模糊排名搜索,以查找与一列上的长短语的相似性?

换句话说,如果我的数据如下所示:

身份证 数据
1 敏捷的棕色狐狸跳过了懒狗
2 敏捷的棕色猫跳过了懒惰的青蛙
3 懒惰而敏捷的棕色青蛙跳过猫
4 lorem ipsum dolor 坐 amet

我搜索“快速的棕色牛跳过一只懒狗”,我想要的结果大致类似于:

身份证 分数
1 95
2 80
3 40
4 0

实际数据会有更多的行和更长的短语。

显然我不想要精确的字符串匹配,因此使用

LIKE
CONTAINS
显然行不通。

词序很重要,因此单独搜索每个单词也是行不通的。

全文索引和类似声音索引似乎只对子字符串相似性有用,所以我还没有找到将其应用于短语相似性的方法。例如,您如何以一种方式对缺少或添加单词的相似短语给出不错的分数进行查询?

我已经使用编辑距离(Lavenshtein、Jaro-Winkler 等)进行了测试,但对于一大组长字符串来说它太慢了。一个查询需要几分钟的时间。听起来它应该只用于较小的数据,所以我认为这里需要一种不同的方法。

我已经看到提到了 TFIDF 和余弦相似度,但我不确定这是否适合在这里使用,或者它如何在 SQL Server 上实现。

此外,由于我们在 Linux 上使用 SQL Server,因此 CLR 支持受到限制。看起来只要不需要不安全或外部权限就可以。

sql sql-server t-sql search fuzzy-search
3个回答
2
投票

使用模糊匹配逻辑快速找到最佳匹配字符串的相对快速方法可以基于对字符串中匹配的 3-gram 进行计数。

它可以利用预先构建的 SQL 函数和索引表来加快搜索速度。特别是,它不必检查从搜索字符串到数据集中每个字符串的距离。

首先,为了方便起见,创建一个表函数,将字符串分解为 3 个字母的标记。

drop function dbo.get_triplets;
go
CREATE FUNCTION dbo.get_triplets
(   
    @data varchar(1000)
)
RETURNS TABLE AS RETURN 
(
    WITH Nums AS
    (
      SELECT n = ROW_NUMBER() OVER (ORDER BY [object_id])  FROM sys.all_objects 
    )
    select triplet,count(*) c, len(@data)-2 triplet_count
    from (
        select SUBSTRING(@data,n,3) triplet
        from (select top (len(@data)-2) n from nums) n
    ) triplets
    group by triplet
)
GO

创建字符串数据集

drop table if exists #data;
select * into #data
from (
values
(1, 'the quick brown fox jumps over the lazy dog'),
(2, 'the quick brown cat jumps over the lazy frog'),
(3, 'the lazy quick brown frog jumps over the cat'),
(4, 'lorem ipsum dolor sit amet')
) a(id,data);

创建 3 字母标记的索引表

drop table if exists #triplets;
select id,triplet,c,triplet_count data_triplet_count 
into #triplets
from #data d
cross apply dbo.get_triplets(d.data);

CREATE unique CLUSTERED INDEX IX_triplet_index ON #triplets(triplet,id); 

然后我希望使用类似于

的查询对给定字符串的匹配进行有效的模糊搜索
declare @string_to_search varchar(1000) = 'the quick brown ox jumps over a lazy dog';

select  matched.*,d.data,
cast(
cast(matched_triplets as float)
/
cast(case when data_triplet_count>string_triplet_count 
          then data_triplet_count 
          else string_triplet_count
          end as float) 
as decimal(4,3)) score 
from (
    select id,sum(case when a.c<b.c then a.c else b.c end) matched_triplets,
    max(a.data_triplet_count) data_triplet_count,
    max(b.triplet_count) string_triplet_count
    from #triplets a
    join dbo.get_triplets(@string_to_search) b
    on a.triplet = b.triplet
    group by id
) matched
join #data d
on d.id = matched.id;

2
投票

执行模糊字符串比较的方法和算法有很多。

在我的系统中,我使用 CLR 函数来计算 Jaro–Winkler 距离。当用户尝试创建新公司时我会使用它。在创建新条目之前,我计算新公司名称与数据库中所有现有公司名称之间的 Jaro-Winkler 距离,以查看它是否已存在并允许一些错误输入和略有不同的拼写。

我向用户展示了一些最热门的匹配项,希望他们不会创建重复项。

这就是您的示例的工作原理:

DECLARE @T TABLE (id int, val VARCHAR(256));
INSERT INTO @T VALUES
(1,  'the quick brown fox jumps over the lazy dog'),
(2,  'the quick brown cat jumps over the lazy frog'),
(3,  'the lazy quick brown frog jumps over the cat'),
(4,  'lorem ipsum dolor sit amet'),
(12, 'the quick brown ox jumps over the lazy dog'),
(13, 'the quick brown fox jumps over a lazy dog'),
(14, 'the quick brown ox jumps over a lazy dog')
;

SELECT
    T.*
    ,dbo.StringSimilarityJaroWinkler(T.val, 
        'the quick brown ox jumps over a lazy dog') AS dist
FROM @T AS T
ORDER BY dist desc
;
+----+----------------------------------------------+-------------------+
| id |                     val                      |       dist        |
+----+----------------------------------------------+-------------------+
| 14 | the quick brown ox jumps over a lazy dog     | 1                 |
| 13 | the quick brown fox jumps over a lazy dog    | 0.995121951219512 |
| 12 | the quick brown ox jumps over the lazy dog   | 0.975586080586081 |
|  1 | the quick brown fox jumps over the lazy dog  | 0.971267143709004 |
|  2 | the quick brown cat jumps over the lazy frog | 0.931560196560197 |
|  3 | the lazy quick brown frog jumps over the cat | 0.836212121212121 |
|  4 | lorem ipsum dolor sit amet                   | 0.584472934472934 |
+----+----------------------------------------------+-------------------+

这是 CLR 函数的 C# 代码:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    /*
    The Winkler modification will not be applied unless the percent match
    was at or above the WeightThreshold percent without the modification.
    Winkler's paper used a default value of 0.7
    */
    private static readonly double m_dWeightThreshold = 0.7;

    /*
    Size of the prefix to be concidered by the Winkler modification.
    Winkler's paper used a default value of 4
    */
    private static readonly int m_iNumChars = 4;

    [Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
    public static SqlDouble StringSimilarityJaroWinkler(SqlString string1, SqlString string2)
    {
        if (string1.IsNull || string2.IsNull)
        {
            return 0.0;
        }

        return GetStringSimilarityJaroWinkler(string1.Value, string2.Value);
    }

    private static double GetStringSimilarityJaroWinkler(string string1, string string2)
    {
        int iLen1 = string1.Length;
        int iLen2 = string2.Length;
        if (iLen1 == 0)
        {
            return iLen2 == 0 ? 1.0 : 0.0;
        }

        int iSearchRange = Math.Max(0, Math.Max(iLen1, iLen2) / 2 - 1);

        bool[] Matched1 = new bool[iLen1];
        for (int i = 0; i < Matched1.Length; ++i)
        {
            Matched1[i] = false;
        }
        bool[] Matched2 = new bool[iLen2];
        for (int i = 0; i < Matched2.Length; ++i)
        {
            Matched2[i] = false;
        }

        int iNumCommon = 0;
        for (int i = 0; i < iLen1; ++i)
        {
            int iStart = Math.Max(0, i - iSearchRange);
            int iEnd = Math.Min(i + iSearchRange + 1, iLen2);
            for (int j = iStart; j < iEnd; ++j)
            {
                if (Matched2[j]) continue;
                if (string1[i] != string2[j]) continue;

                Matched1[i] = true;
                Matched2[j] = true;
                ++iNumCommon;
                break;
            }
        }
        if (iNumCommon == 0) return 0.0;

        int iNumHalfTransposed = 0;
        int k = 0;
        for (int i = 0; i < iLen1; ++i)
        {
            if (!Matched1[i]) continue;
            while (!Matched2[k])
            {
                ++k;
            }
            if (string1[i] != string2[k])
            {
                ++iNumHalfTransposed;
            }
            ++k;
            // even though length of Matched1 and Matched2 can be different,
            // number of elements with true flag is the same in both arrays
            // so, k will never go outside the array boundary
        }
        int iNumTransposed = iNumHalfTransposed / 2;

        double dWeight =
            (
                (double)iNumCommon / (double)iLen1 +
                (double)iNumCommon / (double)iLen2 +
                (double)(iNumCommon - iNumTransposed) / (double)iNumCommon
            ) / 3.0;

        if (dWeight > m_dWeightThreshold)
        {
            int iComparisonLength = Math.Min(m_iNumChars, Math.Min(iLen1, iLen2));
            int iCommonChars = 0;
            while (iCommonChars < iComparisonLength && string1[iCommonChars] == string2[iCommonChars])
            {
                ++iCommonChars;
            }
            dWeight = dWeight + 0.1 * iCommonChars * (1.0 - dWeight);
        }
        return dWeight;
    }

};

1
投票

使用函数 F_INFERENCE_BASIQUE

演示:

CREATE TABLE T_TEST (id int, val VARCHAR(256));
INSERT INTO T_TEST VALUES
(1,     'the quick brown fox jumps over the lazy dog'),
(2,     'the quick brown cat jumps over the lazy frog'),
(3,     'the lazy quick brown frog jumps over the cat'),
(4,     'lorem ipsum dolor sit amet');

SELECT id, 100.0 * dbo.F_INFERENCE_BASIQUE(val, 'the quick brown fox jumps over the lazy dog') 
       / LEN('the quick brown fox jumps over the lazy dog') AS PERCENT_MATCH
FROM   T_TEST

id          PERCENT_MATCH
----------- ---------------------------------------
1           100.000000000000
2           46.511627906976
3           81.395348837209
4           6.976744186046

您可以根据自己的方便调整代码,例如消除两种方式与仅一种方式的比较......在这种情况下,匹配索引是:

id          PERCENT_MATCH
----------- ---------------------------------------
1           100.000000000000
2           46.511627906976
3           25.581395348837
4           4.651162790697

这几乎接近你的需求了!

© www.soinside.com 2019 - 2024. All rights reserved.