KMP字符串搜索

deepnighttwo

浏览: 49815 次
性别:
来自: 上海

最近访客更多访客>>

sea_wave2011

hae

wh831019

booood

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

KMP 号称是效率最高的字符串搜索算法，大学数据结构的时候老师说这玩意不考，然后我就直接趴下睡觉了-_-||| 。这阵子正好要做点字符串搜索的东西，顺便抄起数据结构的书看看。KMP 算法是Knuth 、Morris 、Pratt 仨牛人发明的，不过俺也只听说过Knuth 。算法的思想不难，不过还是有点小绕脑子的。

Java SDK 的String 类中的indexOf 方法没有使用KMP 搜索，基本上算是最简单的搜索

    /**
     * Code shared by String and StringBuffer to do searches. The source is the
     * character array being searched, and the target is the string being
     * searched for.
     *
     * @param source
     *            the characters being searched.
     * @param sourceOffset
     *            offset of the source string.
     * @param sourceCount
     *            count of the source string.
     * @param target
     *            the characters being searched for.
     * @param targetOffset
     *            offset of the target string.
     * @param targetCount
     *            count of the target string.
     * @param fromIndex
     *            the index to begin searching from.
     */
    static int indexOf( char [] source, int sourceOffset, int sourceCount,
            char [] target, int targetOffset, int targetCount, int fromIndex) {

        // if start from a position that is beyond the source string
        if (fromIndex >= sourceCount) {
            // return the string length if target string is empty, otherwise,
            // return -1 which means match fails
            return (targetCount == 0 ? sourceCount : -1);
        }

        // correct the fromIndex
        if (fromIndex < 0) {
            fromIndex = 0;
        }

        // if target string is empty, return fromIndex
        if (targetCount == 0) {
            return fromIndex;
        }

        // first char to match
        char first = target[targetOffset];

        /*
         * a little optimize. let's say the source string length is 9 and the
         * target String length is 7. Then starting from 3 (index is 2) of
         * source string is the last change to match the whole target sting.
         * Otherwise, there are only 6 characters in source string and it would
         * definitely not going to match the target string whose length is 7.
         */
        int max = sourceOffset + (sourceCount - targetCount);

        // loop from the first to the max
        for ( int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            if (source[i] != first) {
                // using i <= max, not i < max
                while (++i <= max && source[i] != first)
                    ;
            }

            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                // using j < end, not j <= end
                for ( int k = targetOffset + 1; j < end
                        && source[j] == target[k]; j++, k++)
                     ;

                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
                // if match fails, i++ and loop again, there are to iterators
                // for two loops. i and j.
            }
        }
        return -1;
    }

这是Java String 中的搜索算法，对于原字符串使用了两个指针来进行搜索。但是实质上来讲，这个算法还是有回溯的，可以看出来，每次搜索的时候，j 都会搜索到一个大于i 的位置，而如果搜索失败，则下次搜索将是从i++ 开始，这就是回溯了。

KMP 的优势就是没有回溯，这对于只能够使用一个指针进行搜索的情况下，不仅仅有效率上的优势，实现起来也更自然。当然对于数组来说，使用俩指针并没有什么不便，如果是对于文件或者输入流进行搜索，那回溯起来就会很麻烦了。下面是KMP 搜索。

KMP 算法的核心就是不回溯原字符串指针，这点其实不难做到，重要的是要想到这一点—— 对于回溯的字符，其实都是已知的。解释一下就是，比如在"abcdefg" 中搜索"abcdeg" ，前五个字符"abcdeg" 都是匹配的，第六个字符f 和g 不匹配，这时候，对于上面的搜索算法，i 将会+1 ，整个匹配重新开始一次，这就是回溯了。但是仔细想一下，回溯其实完全可以避免的，因为如果知道是在第六个字符不匹配，那就说明前五个字符都是匹配的，从而说明“ 知道回溯之后的字符是什么” ，对于这个例子来说，我们肯定知道源字符串前面五个字符是"abcde" 。这是KMP 搜索的根基。

好，下面让我们抛开源字符串吧！我们只关心目标字符串，也就是"abcdeg" 。下面我们来设想，如果在搜索中发现源字符串的【n 】字符和目标字符串的【m 】字符匹配失败，那说明什么呢？说明之前的字符都是匹配的，否则也不会走到这里。也就是源字符串的【n-m 】到【n-1 】这m 个字符与目标字符串的【0 】到【m-1 】这m 个字符匹配。既然已经在搜索之前知道这个相等关系，那何苦在搜索的时候一次又一次的回溯呢？这个本来就是可以预测的，是搞一次就得的事情。因为源字符串的【n-m 】到【n-1 】是已知的。所以不用每次都死板的回溯到源字符串的n-m+1 。

举例来说，对于在"abababc" 中搜索"ababc" ，第一次不匹配的情况如下

0 1 2 3 4 5 6

a b a b a b c

a b a b c

这时候，如果把指针回溯到源字符串的1 位置，其实没有意义的，因为它是b ，和目标字符串的a 不匹配。而且，我们其实已经知道源字符串0 到3 这四个字符的值是跟目标字符串的四个字符一样的，都是abab 。KMP 的思想就是，充分利用这个已知条件，“ 源字符串不回溯，尽量让目标字符串少回溯，然后继续进行搜索” 。那应该让目标字符串回溯到什么地方呢？这就看已经匹配的字符串的内容了。

使用S 代表源字符串，T 代表目标字符串，S[n] 和T[m] 失配（注意，因为失配了，这时候S[n] 是什么是不知道的）。对于源字符串已知的只有 S[n-m+1] 到S[n-1] 这m-1 个字符。假设能够找到这样一个k ，使得S[n-k]...S[n-1]=T[0]....T[k-1] (0<k<m) ，那么就只需要保持S 不回溯，让T 回溯到K ，然后继续匹配就好了。而如果能够找到一个最大的K 值，那么效率则是最高的.

对于上面的例子，k 的值是2 ，KMP 搜索的下一个状态是：

0 1 2 3 4 5 6

a b a b a b c

a b a b c

然后继续匹配就成功啦。

所以，KMP 算法的核心是，如何为目标字符串的每个位置的找到一个k 值，组成一个数组F ，好在每次匹配到目标字符串的m 失配的时候，将目标字符串回溯到F[m] ，然后继续进行匹配。找到这个数组之后，KMP 搜索就算是完成80% 了。

下面是构建这个数组F 的方法。

这时候目标字符串身兼源字符串和目标字符串两个角色。构建数组T 可以说是一个步进的过程，需要用到之前的结果。首先是F[0] ，F[0] 的意思是第一个字符就不匹配，也就是说对源字符串一无所知，这时候没得搞了，直接要源字符串向前挪动一个。在F 里，我们使用-1 来标记第一个字符就匹配失败的情况。也就是F[0]=-1 。F[1] 其实肯定是0 。我们真正需要计算的是从F[2] 到最后的。下面是>=2 的时候的计算方法。注意，F[i] 代表S 的第 i 个字符匹配“ 失败” 的时候，T 需要回溯到的索引的值。如何求F[i] 的值呢？首先取得F[i-1] 的值，然后看S[i-1] 是否=T[F[i-1]] ，如果等于，那么F[i]=F[i-1]+1 。这个原理是递归的。F[i-1] 的值是在i-1 失配的时候，T 索引回溯到的值，如果这时候，这个值与S[i- 1] 相等，那就说明F[i] 可以在F[i-1] 的基础上增加1 了。否则继续检查S[i-1] 是否等于T[[F[i-1]]] ，直到没有的搜索了，就是0 。下面是具体的代码：

/**
     * each value of array rollback means: when source[i] mismatch pattern[i],
     * KMP will restart match process form rollback[j] of pattern with
     * source[i]. And if rollback[i] == -1, it means the current source[i] will
     * never match pattern. then i should be added by 1 and j should be set to
     * 0, which means restart match process from source[i+1] with pattern from
     * pattern[0].
     *
     * @param pattern
     * @return
     */
    private static int [] getRollbackArray( char [] pattern) {
        int [] rollback = new int [pattern.length];
        for ( int i = 0; i < pattern.length; i++) {
            rollback[i] = 0;
        }
        rollback[0] = -1;
        for ( int i = 1; i < rollback.length; i++) {
            char prevChar = pattern[i - 1];
            int prevRollback = i - 1;
            while (prevRollback >= 0) {
                int previousRollBackIdx = rollback[prevRollback];
                if ((previousRollBackIdx == -1)
                        || (prevChar == pattern[previousRollBackIdx])) {
                     rollback[i] = previousRollBackIdx + 1;
                    break ;
                } else {
                    prevRollback = rollback[prevRollback];
                }
            }
        }
        return rollback;
    }

上面并没有吧F[1]=1 写成固定的，不过根据计算，F[1] 始终是=0 的。

有了这个rollback 数组，KMP 搜索就是水到渠成了：

/**
     * search pattern chars in source chars.
     *
     * @param source
     * @param pattern
     * @return
     */
    public static int searchKMP( char [] source, char [] pattern) {
        // validation
        if (source == null || source.length == 0 || pattern == null
                || pattern.length == 0) {
            return -1;
        }

        // get the rollback array.
        int [] rollback = getRollbackArray(pattern);

        // incremental index of pattern. pointing the char to compare with.
        int currMatch = 0;
        int len = pattern.length;
        // i point the char to compare with
        for ( int i = 0; i < source.length;) {
            // if current char match
            if ((currMatch == -1) || (source[i] == pattern[currMatch])) {
                /*
                 * then each of the indexes adding by one, moving to the next
                 * char for comparation. notice that if currMatch is -1, it
                 * means the first char in pattern can not be matched. so i add
                 * by one to move on. and currMatch add by one so its value is
                 * 0.
                 */
                i++;
                currMatch++;
                /*
                 * if reaches the end of pattern, then match success, return the
                 * index of first matched char.
                 */
                if (currMatch == len) {
                    return i - len;
                }
            } else {
                 /*
                 * if current char mismatch, then rollback the next char to
                 * compare in pattern.
                 */
                currMatch = rollback[currMatch];
            }
        }
        return -1;
    }

下面是几个测试方法：

   @Test
    public void testRollBackArray() {
        int [] expectedRollback = new int [] { -1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0,
                0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0 };
        int [] rollback = getRollbackArray("PARTICIPATE IN PARACHUTE"
                 .toCharArray());
        Assert.assertArrayEquals("Rollback array compare failed to match!",
                expectedRollback, rollback);
    }

    @Test
    public void testKMPSearchMatch() {
        int matchIndex = searchKMP(
               "aaaaaababacbaslierjalsdzmflkasjf".toCharArray(),
                "ababacb".toCharArray());
        Assert.assertEquals(5, matchIndex);

        matchIndex = searchKMP(
                "aaaaaababacbaslierjalsdzmflkasjf".toCharArray(),
                "aaaaaababacbaslierjalsdzmflkasjf".toCharArray());
        Assert.assertEquals(0, matchIndex);
    }

    @Test
    public void testKMPSearchNoMatch() {
        int matchIndex = searchKMP("ABCABCDABABCDABCDABDE".toCharArray(),
                "hjABCDABD".toCharArray());
        Assert.assertEquals(-1, matchIndex);

    }

把这三段代码放在一个类里，KMP 搜索就算是完事儿了。

============================= 正文无关内容============================================

在自己看KMP 算法之前，很多文章都说神马KMP 有代价，只适合目标字符串很长很长，搜索字符串也很长很长的case 。但是就我看下来，KMP 对于日常一般的搜索也是有优势的。首先，构建rollback 数组计算并不复杂，当然需要一个额外的数组空间。但是对于匹配来说，还是有很大的加速优势的，而且目标字符串不需要回溯。所以KMP 唯一的代价就是需要一个额外的数组，实际占用的内存应该是目标字符串的两倍（String 是char 的数组，char=short ，int 是char 的两倍）。难道，真的是为了节省内存所以不采用KMP 搜索？