- 浏览: 171533 次
- 性别:
- 来自: 北京
文章分类
最新评论
-
hyt5926:
谢谢啊,终于到找解决方法了。
Dropbox同步制定文件夹外的目录文件 -
alan066:
你好,请教您一个问题,最近在学习SSH,是在MyEclipse ...
SSH的JAR包全介绍(转) -
zhile005:
不错的。。。
(转)流行Scrum工具介绍 -
liuxuejin:
请教一下:Yong区的大小模式是多少?
JVM及其内存分配的设置 -
Magician8421:
最近刚开始学哨笛 很精致的管乐器,可惜学的人好少,偶然搜了搜就 ...
爱尔兰哨笛
Refrence : Dynamic Programming Algorithm (DPA) for Edit-Distance
编辑距离
关于两个字符串s1,s2的差别,可以通过计算他们的最小编辑距离来决定。
所谓的编辑距离: 让s1和s2变成相同字符串需要下面操作的最小次数。
1. 把某个字符ch1变成ch2
2. 删除某个字符
3. 插入某个字符
例如 s1 = “12433” 和s2=”1233”;
则可以通过在s2中间插入4得到12433与s1一致。
即 d(s1,s2) = 1 (进行了一次插入操作)
编辑距离的性质
计算两个字符串s1+ch1, s2+ch2的编辑距离有这样的性质:
1. d(s1,””) = d(“”,s1) = |s1| d(“ch1”,”ch2”) = ch1 == ch2 ? 0 : 1;
2. d(s1+ch1,s2+ch2) = min( d(s1,s2)+ ch1==ch2 ? 0 : 1 ,
d(s1+ch1,s2)+1,
d(s1,s2+ch2) +1);
第一个性质是显然的。
第二个性质: 由于我们定义的三个操作来作为编辑距离的一种衡量方法。
于是对ch1,ch2可能的操作只有
1. 把ch1变成ch2
2. s1+ch1后删除ch1 d = (1+d(s1,s2+ch2))
3. s1+ch1后插入ch2 d = (1 + d(s1+ch1,s2))
对于2和3的操作可以等价于:
_2. s2+ch2后添加ch1 d=(1+d(s1,s2+ch2))
_3. s2+ch2后删除ch2 d=(1+d(s1+ch1,s2))
因此可以得到计算编辑距离的性质2。
复杂度分析
从上面性质2可以看出计算过程呈现这样的一种结构(假设各个层用当前计算的串长度标记,并假设两个串长度都为 n )
可以看到,该问题的复杂度为指数级别 3 的 n 次方,对于较长的串,时间上是无法让人忍受的。
分析: 在上面的结构中,我们发现多次出现了 (n-1,n-1), (n-1,n-2)……。换句话说该结构具有重叠子问题。再加上前面性质2所具有的最优子结构。符合动态规划算法基本要素。因此可以使用动态规划算法把复杂度降低到多项式级别。
动态规划求解
首先为了避免重复计算子问题,添加两个辅助数组。
一. 保存子问题结果。
M[ |s1| ,|s2| ] , 其中M[ i , j ] 表示子串 s1(0->i) 与 s2(0->j) 的编辑距离
二. 保存字符之间的编辑距离.
E[ |s1|, |s2| ] , 其中 E[ i, j ] = s[i] = s[j] ? 0 : 1
三. 新的计算表达式
根据性质1得到
M[ 0,0] = 0;
M[ s1i, 0 ] = |s1i|;
M[ 0, s2j ] = |s2j|;
根据性质2得到
M[ i, j ] = min( m[i-1,j-1] + E[ i, j ] ,
m[i, j-1]+1 ,
m[i-1, j] +1 );
复杂度
从新的计算式看出,计算过程为
i=1 -> |s1|
j=1 -> |s2|
M[i][j] = ……
因此复杂度为 O( |s1| * |s2| ) ,如果假设他们的长度都为n,则复杂度为 O(n^2)
Reference: Dynamic Programming Algorithm (DPA) for Edit-DistanceThe words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `spot' by the deletion of the `p', or equivalently, `spot' can be changed into `sport' by the insertion of `p'.
The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:
change a letter, insert a letter or delete a letter
The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:
d('', '') = 0
d(s, '') = d('', s) = |s| -- i.e. length of s
d(s1+ch1, s2+ch2)
= min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
d(s1+ch1, s2) + 1,
d(s1, s2+ch2) + 1 )
The first two rules above are obviously true, so it is only necessary consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2, i.e. 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.
The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.
Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.
A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:
m[i,j] = d(s1[1..i], s2[1..j])
m[0, 0] = 0
m[i, 0] = i, i=1..|s1|
m[0, j] = j, j=1..|s2|
m[i,j] = min( m[i-1,j-1]
+ if s1[i]=s2[j] then 0 else 1 fi,
m[i-1, j] + 1,
m[i, j-1] + 1 ), i=1..|s1|, j=1..|s2|
m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n2), much better than exponential!
The words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `spot' by the deletion of the `p', or equivalently, `spot' can be changed into `sport' by the insertion of `p'.
The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:
change a letter, insert a letter or delete a letter
The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:
d('', '') = 0
d(s, '') = d('', s) = |s| -- i.e. length of s
d(s1+ch1, s2+ch2)
= min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
d(s1+ch1, s2) + 1,
d(s1, s2+ch2) + 1 )
The first two rules above are obviously true, so it is only necessary consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2, i.e. 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.
The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.
Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.
A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:
m[i,j] = d(s1[1..i], s2[1..j])
m[0, 0] = 0
m[i, 0] = i, i=1..|s1|
m[0, j] = j, j=1..|s2|
m[i,j] = min( m[i-1,j-1]
+ if s1[i]=s2[j] then 0 else 1 fi,
m[i-1, j] + 1,
m[i, j-1] + 1 ), i=1..|s1|, j=1..|s2|
m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n2), much better than exponential!
发表评论
-
Dropbox同步制定文件夹外的目录文件
2011-12-15 20:28 2273Dropbox的默认设置为自动同步My Dropbox ... -
(转)正则表达式中贪婪和非贪婪(惰性)匹配的区别与效率问题:
2010-09-18 01:22 1322转自:http://hi.baidu.com/%B3%BF%B ... -
(转)正则表达式解释器实现原理
2010-09-18 01:15 1090转自:http://blog.csdn.net/qipnx/a ... -
(转)Symbian 活动对象彻底理解
2010-08-22 14:47 956Symbian编程总结-基础篇- ... -
Eclipse 常用设置
2010-05-06 20:53 876ECLIPSE常用设置 0. 显 ... -
开源邮件服务器(转)
2010-04-06 12:33 25921.开源.NET邮件服务器 LumiSoft Ma ... -
全球最受欢迎的邮件服务器
2010-04-06 11:59 851转自:http://www.263g.cn/blog/post ... -
Apana Radrails2.0 安装到Ubuntu9.10上后创建项目点finish按钮没反应
2009-12-31 11:25 714下面是解决方法, 不过是德语,可以google一下,呵呵。为了 ... -
ACROBAT7.0实现手写矢量签名的三种方法
2009-11-18 11:32 1106http://design.yesky.com/artist/ ... -
Install Ruby on Rails and it's IDE on Unbuntu8.10
2009-11-13 16:25 874在Ubuntu 上安装Ruby on Rails有四种方法: ... -
团队发展的一些看法
2009-10-12 17:00 592现今社会讲究的都是团队与团队之间的竞争,个人单打独斗的年代已经 ... -
申请域名及建网总结
2009-09-18 00:01 911简单来讲分以下步: 1. 在一个域名代理商上申请一个域名(我 ... -
Forest and Domain Knowledge
2009-08-18 00:00 796In general, forests are used as ... -
WCF 与 WebService的异同
2009-08-05 16:37 30731.WebService:严格来说是 ... -
让人郁闷的WMI
2009-06-27 20:45 7492009.06.27 太郁闷了,写了一天WMI的code, ... -
XML家族
2009-05-09 12:34 579XML, XSD, DTD, XSL -
NTLM验证过程及漏洞
2009-03-25 23:40 1828NT Lan Manager(NTLM)是一种authenti ... -
powershell notes II
2009-03-18 14:36 1930学习疑问:1. 如何任意显示某个对象的metho ... -
powershell notes
2009-03-18 14:20 1845命令模式 //字符串不需要加引号,除变量和 ... -
一些地址
2009-01-30 00:23 695CODERS at WORK http://www.coder ...
相关推荐
SQL SERVER实现编辑距离(Edit Distance)算法,可进行模糊匹配查询
动态规划 编辑距离 可以用来判别字符串的差异
算法中的edit distance问题 给出原序列 再给出目的序列 程序描述出源到目的的转换 编译通过了 本人的算法作业!
用python 实现的编辑距离算法,支持权重的定义,和词语之间的关联
对任给的两个字符串A,B,用动态规划算法算出他们的最小编辑距离
编辑距离 快速实现编辑距离(Levenshtein距离)。 该库仅使用C ++和Cython实现。 该库中使用的算法由提出 。 二元轮 多亏了 ,Linux,Mac OS和Windows上都有二进制轮子。 安装 您可以通过pip安装: pip install ...
1.矩阵初始化 2.计算最小值 3.类推完成,取右下角的值,即为最短编辑距离
莱文斯坦距离,是编辑距离(Edit Distance)的一种。 编辑距离一般是指两个字串之间,由一个转成另一个所需的最少编辑操作次数。允许的编辑操作包括将一个字符替换成另一个字符,插入一个字符,删除一个字符。 比如...
试设计一个有效算法,对任给的2 个字符串A和B,计算出它们的编辑距离d(A,B)。 编程任务: 对于给定的字符串A和字符串B,编程计算其编辑距离d(A,B)。 Input 输入由多组测试数据组成。 每组测试数据输入的第1 行...
编辑距离(EditDistance)定义字符串的相似度 编辑距离就是用来计算从原串(s)转换到目标串 t 所需要的最少的插入 删除和替换的数目 在NLP中应用比较广泛 如一些评测方法中就用到了(wer mWer等) 同时也常用来计算...
编辑距离(EditDistance)定义 编辑距离就是用来计算从原串(s)转换到目标串(t)所需要的最少的插入,删除和替换的数目,在NLP中应用比较广泛,如一些评测方法中就用到了(wer,mWer等),同时也常用来计算你对原...
edit-distance-web:给定两个字符串,该算法将计算编辑距离-动态编程
编辑距离线性工作台这是库的基准。 另请参阅以获取详细信息。 该项目使用stack 。 因此,要构建: stack build 要运行标准基准: stack bench 或者,要生成精美HTML报告: stack bench --benchmark-arguments ' --...
编辑距离的perl语言算法,基于词的增加、修改、删除的编辑距离
这个算法就是通过插入、删除、替换方法将句子A中的字符串变成句子B中的字符串所有的步骤就是算法中提到的编辑距离,最简单的相似度即编辑距离的倒数。
LD算法(Levenshtein Distance)又称编辑距离算法(Edit Distance)。以字符串A通过插入字符、删除字符、替换字符变成另一个字符串B,那么操作的过程的次数表示两个字符串的差异。本资源为此算法的python实现。...
列文斯基距离(LD,Levenshtein Distance)又成为编辑距离算法(Edit Distance)。他是以字符串A通过插入字符、删除字符、替换字符变成另一个字符串B,那么操作的过程的次数表示两个字符串的差异。 文件1里面是需要...
我们使用以下算法计算对数半环上两个非循环概率有限状态自动机X和Y之间的预期编辑距离: D(X,Y) = ShortestDistLog(-DetTrop(RmEpsTrop(Sync(-DetLog(X o T o Y)))))其中T是表示编辑成本函数的对数半环上的加权转换...
最小编辑距离 计算2个字之间的最小编辑距离 ... 例如,如果用户键入“ Jaavscript”,则对单词运行算法可能会确定它与Javascript之间的最小编辑距离足够小,以致UI可以将其建议为“您是故意的”方案
编辑距离 字符串距离算法的实现。 描述 编辑模糊匹配的距离算法。 具体来说,这个库提供: 例子 Levenshtein d = new Levenshtein (); print (d. distance ( 'witch' , 'kitsch' )); // 2