`
Tonyguxu
  • 浏览: 272318 次
  • 性别: Icon_minigender_1
  • 来自: 北京
社区版块
存档分类
最新评论

字符串匹配算法——Edit distance

 
阅读更多

如何比较两个字符串之间的相似程度(或者差异)?


想要比较两个字符串之间的相似程度,可以看其中一个字符串通过几步操作可以转换为另一个字符串,通过度量转换操作的步数可以来衡量两个串的相似程度,如果转换步数越少,则两者越匹配。这里转换操作的度量就称为:

edit distance。该值越小,则两个字符串越匹配。

 

但是对edit distance有不同的definition

 

 http://en.wikipedia.org/wiki/Edit_distance写道
edit distance between two strings of characters generally refers to the Levenshtein distance.

However,The term ‘edit distance’ is sometimes used to refer to the distance in which insertions and deletions have equal cost and replacements have twice the cost of an insertion”

There are several different ways to define an edit distance, depending on which edit operations are allowed: replace, delete, insert, transpose, and so on.
  

 

相应的也有不同的算法来计算这个值,常见的有Levenshtein distance

 

Levenshtein distance

 

http://en.wikipedia.org/wiki/Levenshtein_distance 写道
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
 

算法基本步骤

 

 

Step Description
1 Set n to be the length of s.
Set m to be the length of t.
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct a matrix containing 0..m rows and 0..n columns.
2 Initialize the first row to 0..n.
Initialize the first column to 0..m.
3 Examine each character of s (i from 1 to n).
4 Examine each character of t (j from 1 to m).
5 If s[i] equals t[j], the cost is 0.
If s[i] doesn't equal t[j], the cost is 1.
6 Set cell d[i,j] of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d[i-1,j] + 1.
b. The cell immediately to the left plus 1: d[i,j-1] + 1.
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7 After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

 

 

比较"GUMBO"和"GAMBOL"的相似程度,计算两者的edit distance


 

参考

 

1.Levenshtein_distance

http://www.merriampark.com/ld.htm

http://en.wikipedia.org/wiki/Levenshtein_distance

2.Edit_distance

http://en.wikipedia.org/wiki/Edit_distance

 

 

 

  • 描述: Levenshtein distance算法示例
  • 大小: 387.1 KB
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics