如何计算给定2个字符串的距离相似性度量？

Question

我需要计算2个字符串之间的相似度。那究竟是什么意思呢？让我用一个例子来解释一下：

真实的词：hospital
误区：haspita

现在我的目标是确定修改错误单词以获得真实单词所需的字符数。在这个例子中，我需要修改2个字母。那么百分比是多少？我总是把真正的词长度。因此它变为2/8 = 25％所以这两个给定的字符串DSM是75％。

如何以性能为关键考虑因素来实现这一目标？

Answer 1

您正在寻找的是编辑距离或Levenshtein distance。维基百科的文章解释了它是如何计算的，底部有一段很好的伪代码，可以帮助你很容易地在C＃中编写这个算法。

这是以下链接的第一个站点的实现：

private static int  CalcLevenshteinDistance(string a, string b)
    {
    if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
        return 0;
    }
    if (String.IsNullOrEmpty(a)) {
        return b.Length;
    }
    if (String.IsNullOrEmpty(b)) {
        return a.Length;
    }
    int  lengthA   = a.Length;
    int  lengthB   = b.Length;
    var  distances = new int[lengthA + 1, lengthB + 1];
    for (int i = 0;  i <= lengthA;  distances[i, 0] = i++);
    for (int j = 0;  j <= lengthB;  distances[0, j] = j++);

    for (int i = 1;  i <= lengthA;  i++)
        for (int j = 1;  j <= lengthB;  j++)
            {
            int  cost = b[j - 1] == a[i - 1] ? 0 : 1;
            distances[i, j] = Math.Min
                (
                Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
                distances[i - 1, j - 1] + cost
                );
            }
    return distances[lengthA, lengthB];
    }

Answer 2

几周前我刚刚解决了这个问题。既然有人现在问，我会分享代码。在我的详尽测试中，即使没有提供最大距离，我的代码也比维基百科上的C＃示例快10倍。当提供最大距离时，此性能增益会增加到30x - 100x +。请注意性能的几个关键点：

如果需要反复比较相同的单词，首先将单词转换为整数数组。 Damerau-Levenshtein算法包括许多>，<，==比较，而ints比chars要快得多。
它包括一个短路机制，如果距离超过提供的最大值，则退出
在我在其他地方看到的所有实现中，使用一组三个阵列而不是一个大型矩阵
确保您的数组切换到较短的字宽。

代码（如果在参数声明中用int[]替换String，它的工作方式完全相同：

/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {

    int length1 = source.Length;
    int length2 = target.Length;

    // Return trivial case - difference in string lengths exceeds threshhold
    if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }

    // Ensure arrays [i] / length1 use shorter length 
    if (length1 > length2) {
        Swap(ref target, ref source);
        Swap(ref length1, ref length2);
    }

    int maxi = length1;
    int maxj = length2;

    int[] dCurrent = new int[maxi + 1];
    int[] dMinus1 = new int[maxi + 1];
    int[] dMinus2 = new int[maxi + 1];
    int[] dSwap;

    for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }

    int jm1 = 0, im1 = 0, im2 = -1;

    for (int j = 1; j <= maxj; j++) {

        // Rotate
        dSwap = dMinus2;
        dMinus2 = dMinus1;
        dMinus1 = dCurrent;
        dCurrent = dSwap;

        // Initialize
        int minDistance = int.MaxValue;
        dCurrent[0] = j;
        im1 = 0;
        im2 = -1;

        for (int i = 1; i <= maxi; i++) {

            int cost = source[im1] == target[jm1] ? 0 : 1;

            int del = dCurrent[im1] + 1;
            int ins = dMinus1[i] + 1;
            int sub = dMinus1[im1] + cost;

            //Fastest execution for min value of 3 integers
            int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);

            if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
                min = Math.Min(min, dMinus2[im2] + cost);

            dCurrent[i] = min;
            if (min < minDistance) { minDistance = min; }
            im1++;
            im2++;
        }
        jm1++;
        if (minDistance > threshold) { return int.MaxValue; }
    }

    int result = dCurrent[maxi];
    return (result > threshold) ? int.MaxValue : result;
}

Swap在哪里：

static void Swap<T>(ref T arg1,ref T arg2) {
    T temp = arg1;
    arg1 = arg2;
    arg2 = temp;
}

Answer 3

可以使用大量的字符串相似性距离算法。这里列出的一些（但没有详尽列出）：

Levenstein
Needleman Wunsch
史密斯沃特曼
Smith Waterman Gotoh
Jarro，Coxspoi
Jaro Winkler
Jaccard Similarity
骰子相似度
Euclidean Distance
Monge Elkan

包含所有这些实现的库称为Cosine Similarity，它同时具有java和c＃实现。

Answer 4

我发现SimMetrics和Levenshtein非常适合字符串之间的小差异，例如：

拼写错误;要么
ö而不是人名中的o。

然而，当比较像文章标题的东西时，文本的重要块将是相同的但是边缘有“噪音”，Jaro Winkler非常棒：

比较这两个标题（虽然相同，但措辞不同，来自不同来源）：

来自大肠杆菌的内切核酸酶，其在紫外线照射的DNA中引入单个多核苷酸链断裂

内切核酸酶III：来自大肠杆菌的核酸内切酶，在紫外线照射的DNA中引入单个多核苷酸链断裂

Smith-Waterman-Gotoh的字符串显示：

Levenshtein：81
Smith-Waterman Gotoh 94
Jaro Winkler 78

Jaro Winkler和Levenshtein在检测相似性方面不如Smith Waterman Gotoh能干。如果我们比较两篇不同文章的标题，但有一些匹配的文字：

高等植物的脂肪代谢。酰基硫酯酶在酰基辅酶A和酰基 - 酰基载体蛋白代谢中的作用

高等植物的脂肪代谢。复合脂质混合物中酰基 - 酰基载体蛋白和酰基辅酶A的测定

Jaro Winkler给出了误报，但Smith Waterman Gotoh没有：

Levenshtein：54
Smith-Waterman Gotoh 49
Jaro Winkler 89

正如This site that provides algorithm comparison指出的那样，Anastasiosyal拥有这些算法的java代码。我使用SimMetrics取得了成功。

Answer 5

这是我对Damerau Levenshtein Distance的实现，它不仅返回相似系数，还返回更正后的单词中的错误位置（此功能可用于文本编辑器）。我的实现也支持不同的错误权重（替换，删除，插入，转置）。

SmithWatermanGotoh java code from SimMetrics

这段代码是我项目的一部分：public static List<Mistake> OptimalStringAlignmentDistance( string word, string correctedWord, bool transposition = true, int substitutionCost = 1, int insertionCost = 1, int deletionCost = 1, int transpositionCost = 1) { int w_length = word.Length; int cw_length = correctedWord.Length; var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1]; var result = new List<Mistake>(Math.Max(w_length, cw_length)); if (w_length == 0) { for (int i = 0; i < cw_length; i++) result.Add(new Mistake(i, CharMistakeType.Insertion)); return result; } for (int i = 0; i <= w_length; i++) d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None); for (int j = 0; j <= cw_length; j++) d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None); for (int i = 1; i <= w_length; i++) { for (int j = 1; j <= cw_length; j++) { bool equal = correctedWord[j - 1] == word[i - 1]; int delCost = d[i - 1, j].Key + deletionCost; int insCost = d[i, j - 1].Key + insertionCost; int subCost = d[i - 1, j - 1].Key; if (!equal) subCost += substitutionCost; int transCost = int.MaxValue; if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1]) { transCost = d[i - 2, j - 2].Key; if (!equal) transCost += transpositionCost; } int min = delCost; CharMistakeType mistakeType = CharMistakeType.Deletion; if (insCost < min) { min = insCost; mistakeType = CharMistakeType.Insertion; } if (subCost < min) { min = subCost; mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution; } if (transCost < min) { min = transCost; mistakeType = CharMistakeType.Transposition; } d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType); } } int w_ind = w_length; int cw_ind = cw_length; while (w_ind >= 0 && cw_ind >= 0) { switch (d[w_ind, cw_ind].Value) { case CharMistakeType.None: w_ind--; cw_ind--; break; case CharMistakeType.Substitution: result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution)); w_ind--; cw_ind--; break; case CharMistakeType.Deletion: result.Add(new Mistake(cw_ind, CharMistakeType.Deletion)); w_ind--; break; case CharMistakeType.Insertion: result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion)); cw_ind--; break; case CharMistakeType.Transposition: result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition)); w_ind -= 2; cw_ind -= 2; break; } } if (d[w_length, cw_length].Key > result.Count) { int delMistakesCount = d[w_length, cw_length].Key - result.Count; for (int i = 0; i < delMistakesCount; i++) result.Add(new Mistake(0, CharMistakeType.Deletion)); } result.Reverse(); return result; } public struct Mistake { public int Position; public CharMistakeType Type; public Mistake(int position, CharMistakeType type) { Position = position; Type = type; } public override string ToString() { return Position + ", " + Type; } } public enum CharMistakeType { None, Substitution, Insertion, Deletion, Transposition }。

我写了一些Yandex-Linguistics.NET，在我看来，这种方法是有效的。

但欢迎提出意见和评论。

Answer 6

这是另一种方法：

这个评论太长了。

找到相似性的典型方法是Levenshtein距离，毫无疑问是一个可用代码的库。

不幸的是，这需要与每个字符串进行比较如果距离大于某个阈值，您可能可以编写专用版本的代码来使计算短路，您仍然需要进行所有比较。

另一个想法是使用一些三卦或n-gram的变体。这些是n个字符的序列（或n个字或n个基因组序列或n个）。保持三元组到字符串的映射，并选择具有最大重叠的那些。 n的典型选择是“3”，因此得名。

例如，英语会有这些三元组：

工程
NGL
该
百合
前

而且英格兰会有：

工程
NGL
GLA
蓝
和

好吧，7个中有2个（或10个中的4个）匹配。如果这适合您，您可以索引trigram / string表并获得更快的搜索。

您还可以将其与Levenshtein结合使用，以将比较集减少到具有一些共同的最小n-gram数的那些。

Answer 7

0
投票

这是一个VB.net实现：

tests

如何计算给定2个字符串的距离相似性度量？

问题描述投票：54回答：7

7个回答

最新问题

如何计算给定2个字符串的距离相似性度量？

问题描述 投票：54回答：7

7个回答

最新问题

问题描述投票：54回答：7