计算编辑距离的最有效方法

问题描述 投票:0回答:7

我刚刚实现了最佳匹配文件搜索算法来查找与字典中的字符串最接近的匹配。对我的代码进行分析后,我发现绝大多数时间都花在计算查询与可能结果之间的距离上。我目前正在实现使用二维数组计算编辑距离的算法,这使得实现成为 O(n^2) 操作。我希望有人能建议一种更快的方法来完成同样的事情。

这是我的实现:

public int calculate(String root, String query)
{
  int arr[][] = new int[root.length() + 2][query.length() + 2];

  for (int i = 2; i < root.length() + 2; i++)
  {
    arr[i][0] = (int) root.charAt(i - 2);
    arr[i][1] = (i - 1);
  }

  for (int i = 2; i < query.length() + 2; i++)
  {
    arr[0][i] = (int) query.charAt(i - 2);
    arr[1][i] = (i - 1);
  }

  for (int i = 2; i < root.length() + 2; i++)
  {
    for (int j = 2; j < query.length() + 2; j++)
    {
      int diff = 0;
      if (arr[0][j] != arr[i][0])
      {
        diff = 1;
      }
      arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
    }
  }
  return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
  return (int) Math.min(n1, Math.min(n2, n3));
}
algorithm optimization levenshtein-distance
7个回答
29
投票

Levenshtein 距离上的 wikipedia 条目对优化计算提供了有用的建议——在您的情况下最适用的建议是,如果您可以在感兴趣的最大距离上设置一个界限

k
(超出该范围的任何距离也可能是无穷大!)您可以将计算量减少到
O(n times k)
而不是
O(n squared)
(基本上是在最小可能距离变为
> k
时立即放弃)。

由于您正在寻找最接近的匹配,因此您可以逐渐减少

k
到迄今为止找到的最佳匹配的距离 - 这不会影响最坏情况的行为(因为匹配 可能 是按递减顺序排列的)距离,这意味着你永远不会更快地逃生)但平均情况应该会有所改善。

我相信,如果您需要获得大幅更好的性能,您可能必须接受一些强有力的妥协来计算更近似的距离(因此获得“相当好的匹配”而不一定是最佳匹配)。


8
投票

根据此博客上的评论,加速 Levenshtein,您可以使用 VP 树并实现 O(nlogn)。 同一篇博客上的另一条评论指出了 VP-Trees 和 Levenshtein 的 python 实现。 请告诉我们这是否有效。


4
投票
我修改了在此

post上找到的编辑距离VBA函数以使用一维数组。 它的执行速度要快得多。

'Calculate the Levenshtein Distance between two strings (the number of insertions, 'deletions, and substitutions needed to transform the first string into the second) Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2 L1 = Len(s1): L2 = Len(s2) L1p1 = L1 + 1 L1p2 = L1 + 2 LD = (((L1 + 1) * (L2 + 1))) - 1 ReDim D(0 To LD) ss2 = L1 + 1 For i = 0 To L1 Step 1: D(i) = i: Next i 'setup array positions 0,1,2,3,4,... For j = 0 To LD Step ss2: D(j) = j / ss2: Next j 'setup array positions 0,1,2,3,4,... For j = 1 To L2 ssL = (L1 + 1) * j For i = (ssL + 1) To (ssL + L1) If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0 cI = D(i - 1) + 1 cD = D(i - L1p1) + 1 cS = D(i - L1p2) + cost If cI <= cD Then 'Insertion or Substitution If cI <= cS Then D(i) = cI Else D(i) = cS Else 'Deletion or Substitution If cD <= cS Then D(i) = cD Else D(i) = cS End If Next i Next j LevenshteinDistance2 = D(LD) End Function

我已经使用长度为 11,304 的字符串“s1”和长度为 5,665 的字符串“s2”测试了此函数(> 6400 万个字符比较)。 使用上述单维版本的函数,在我的机器上执行时间约为 24 秒。 我在上面的链接中引用的原始二维函数对于相同的字符串需要约 37 秒。 我进一步优化了单维函数,如下所示,对于相同的字符串,它需要 ~10 秒。

'Calculate the Levenshtein Distance between two strings (the number of insertions, 'deletions, and substitutions needed to transform the first string into the second) Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix Dim i As Long, j As Long, ss2 As Long 'loop counters, loop step Dim ssL As Long, cost As Long 'loop start, and cost of substitution for current letter Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2 Dim sss1() As String, sss2() As String 'Character arrays for string S1 & S2 L1 = Len(s1): L2 = Len(s2) L1p1 = L1 + 1 L1p2 = L1 + 2 LD = (((L1 + 1) * (L2 + 1))) - 1 ReDim D(0 To LD) ss2 = L1 + 1 For i = 0 To L1 Step 1: D(i) = i: Next i 'setup array positions 0,1,2,3,4,... For j = 0 To LD Step ss2: D(j) = j / ss2: Next j 'setup array positions 0,1,2,3,4,... ReDim sss1(1 To L1) 'Size character array S1 ReDim sss2(1 To L2) 'Size character array S2 For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i 'Fill S1 character array For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i 'Fill S2 character array For j = 1 To L2 ssL = (L1 + 1) * j For i = (ssL + 1) To (ssL + L1) If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0 cI = D(i - 1) + 1 cD = D(i - L1p1) + 1 cS = D(i - L1p2) + cost If cI <= cD Then 'Insertion or Substitution If cI <= cS Then D(i) = cI Else D(i) = cS Else 'Deletion or Substitution If cD <= cS Then D(i) = cD Else D(i) = cS End If Next i Next j LevenshteinDistance = D(LD) End Function
    

3
投票
维基百科文章

讨论了您的算法和各种改进。然而,至少在一般情况下,O(n^2) 是你能得到的最好结果。 但是,如果您可以限制您的问题,则可以进行一些改进(例如,如果您只对小于

d

的距离感兴趣,则复杂性为 O(dn) - 这可能是有意义的,因为距离很近的匹配字符串长度可能不是很有趣)。看看您是否可以利用您问题的具体情况...


3
投票
http://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htm

这是我将其翻译成 Scala:

// The code below is based on code from the Apache Commons lang project. /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with this * work for additional information regarding copyright ownership. The ASF * licenses this file to You under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance with the * License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the * License for the specific language governing permissions and limitations * under the License. */ /** * assert(levenshtein("algorithm", "altruistic")==6) * assert(levenshtein("1638452297", "444488444")==9) * assert(levenshtein("", "") == 0) * assert(levenshtein("", "a") == 1) * assert(levenshtein("aaapppp", "") == 7) * assert(levenshtein("frog", "fog") == 1) * assert(levenshtein("fly", "ant") == 3) * assert(levenshtein("elephant", "hippo") == 7) * assert(levenshtein("hippo", "elephant") == 7) * assert(levenshtein("hippo", "zzzzzzzz") == 8) * assert(levenshtein("hello", "hallo") == 1) * */ def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = { import scala.annotation.tailrec def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = { // Inside impl n <= m! val p = new Array[Int](n + 1) // 'previous' cost array, horizontally val d = new Array[Int](n + 1) // cost array, horizontally @tailrec def fillP(i: Int) { p(i) = i if (i < n) fillP(i + 1) } fillP(0) @tailrec def eachJ(j: Int, t_j: Char, d: Array[Int], p: Array[Int]): Int = { d(0) = j @tailrec def eachI(i: Int) { val a = d(i - 1) + 1 val b = p(i) + 1 d(i) = if (a < b) a else { val c = if (s.charAt(i - 1) == t_j) p(i - 1) else p(i - 1) + 1 if (b < c) b else c } if (i < n) eachI(i + 1) } eachI(1) if (j < m) eachJ(j + 1, t.charAt(j), p, d) else d(n) } eachJ(1, t.charAt(0), d, p) } val n = s.length val m = t.length if (n == 0) m else if (m == 0) n else { if (n > m) impl(t, s, m, n) else impl(s, t, n, m) }

}


2
投票

正如其他人提到的,如果您只想检查两个字符串之间的编辑距离是否在某个阈值 k 内,则可以将时间复杂度降低到

O(kn)

。更精确的表达式是 O((2k+1)n)。您取一个横跨对角单元两侧 k 个单元的条带(条带长度 2k+1),并计算位于该条带上的单元格的值。 有趣的是,Li 等人进行了“改进”。等人。 并且这已进一步减少到 O((k+1)n)。

Rosetta 代码

0
投票
Ada

zklC99版本使用缓存来降低递归的复杂性。 由此产生的复杂性是多少?

© www.soinside.com 2019 - 2024. All rights reserved.