A Regular Expression Exercise from a Project

jlj008


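Today I ran into the following requirement in a project:
A certain kind of expression has to be replaced when it appears inside a comment in a piece of HTML code, while the same expression outside comments must be left alone. For example: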
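%(/images/a.jpg)
<!-- %(/images/a.jpg) -->
%(/images/a.jpg)
<!-- %(/images/a.jpg) -->
%(/images/a.jpg)
Here, each %(/images/a.jpg) inside a comment should become ${contextPath}/images/a.jpg.
After the replacement the result should look like this:
%(/images/a.jpg)
<!-- ${contextPath}/images/a.jpg -->
%(/images/a.jpg)
<!-- ${contextPath}/images/a.jpg -->
%(/images/a.jpg)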

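After some digging I got a much better understanding of how greedy, reluctant, and possessive quantifiers differ (see the appendix at the end of this post) and of what the special constructs (non-capturing), such as lookahead and lookbehind, are for. Until now I had not studied regular expressions much and had only used simple ones. The method I ended up with is recorded in this post for future reference.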
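The lookaround constructs are easiest to see on a small input. The following is a minimal sketch, not part of the project code; the class name LookaroundDemo, the helper findAll, and the sample string are made up for illustration. It compares a pattern that consumes the comment delimiters with one that only asserts them, and shows what happens when the reluctant quantifier is swapped for a greedy one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<!-- a --> b <!-- c -->";

        // Delimiters are consumed, so they are part of each match:
        // [<!-- a -->, <!-- c -->]
        System.out.println(findAll("(?s)<!--.*?-->", html));

        // Lookbehind (?<=...) and lookahead (?=...) only assert the delimiters,
        // so each match is just the comment body: [ a ,  c ]
        System.out.println(findAll("(?s)(?<=<!--).*?(?=-->)", html));

        // With a greedy .* the match runs to the last "-->" and swallows
        // the text between the two comments: [ a --> b <!-- c ]
        System.out.println(findAll("(?s)(?<=<!--).*(?=-->)", html));
    }
}
```

Because the lookaround version matches only the comment body, Matcher.start() and Matcher.end() point directly at the text that may need rewriting.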

	// Requires java.util.regex.Matcher and java.util.regex.Pattern;
	// "log" is the logger of the enclosing class.

	/**
	 * Replaces special symbols inside HTML comments, e.g. turns
	 * "<!-- %(/images/a.jpg) -->" into "<!-- ${contextPath}/images/a.jpg -->".
	 * Symbols outside comments are left untouched.
	 * @param text the HTML source, may be null
	 * @return the text with in-comment symbols replaced, or null if text is null
	 */
	protected String replaceSpecialSymbolInComments(String text) {
		log.debug("call replaceSpecialSymbolInComments(" + text + ")");
		if (text == null) return text;
		
		// (?s) lets "." match line breaks. The lookbehind and lookahead keep
		// "<!--" and "-->" out of the match, and the reluctant ".*?" stops at
		// the first "-->" instead of the last one.
		String commentRegex = "(?s)(?<=<!--).*?(?=-->)";
		// Group 1 captures the path inside "%(...)"; ".*?" stops at the first ")".
		String specialRegex01 = "\\%\\((.*?)\\)";
		
		Pattern commentPattern = Pattern.compile(commentRegex);
		Pattern specialPattern01 = Pattern.compile(specialRegex01);
		Matcher commentMatcher = commentPattern.matcher(text);
		
		StringBuilder sb = new StringBuilder(text);
		// Net change in length caused by the replacements made so far. Matcher
		// indexes refer to the original text, so they are shifted by this amount.
		int offset = 0;
		while (commentMatcher.find()) {
			log.debug("comment match result: " + commentMatcher.group());
			log.debug((offset + commentMatcher.start()) + " -- " + (offset + commentMatcher.end()));
			Matcher specialMatcher01 = specialPattern01.matcher(commentMatcher.group());
			
			while (specialMatcher01.find()) {
				String replacement = "${contextPath}" + specialMatcher01.group(1);
				log.debug("special match result01: " + specialMatcher01.group());
				log.debug("special match result01 should be: " + replacement);
				// Position of this symbol in sb: comment start in the original text,
				// plus the accumulated shift, plus the position inside the comment.
				int commentMatchStart = offset + commentMatcher.start();
				int specialMatchStart01 = commentMatchStart + specialMatcher01.start();
				int specialMatchEnd01 = commentMatchStart + specialMatcher01.end();
				sb.replace(specialMatchStart01, specialMatchEnd01, replacement);
				offset += replacement.length() - specialMatcher01.group().length();
			}
			log.debug("temp result: " + sb.toString());
		}
		
		return sb.toString();
	}
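The manual offset bookkeeping in the method above can be avoided with Matcher.appendReplacement and Matcher.appendTail. What follows is a sketch of that alternative, not the code used in the project; the class name CommentSymbolReplacer and the method name replaceInComments are hypothetical, and logging is left out. It uses the same two patterns as the method above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentSymbolReplacer {
    // Same two patterns as in the method above.
    private static final Pattern COMMENT = Pattern.compile("(?s)(?<=<!--).*?(?=-->)");
    private static final Pattern SPECIAL = Pattern.compile("\\%\\((.*?)\\)");

    static String replaceInComments(String text) {
        if (text == null) return null;

        Matcher comment = COMMENT.matcher(text);
        // StringBuffer keeps this usable before Java 9, where
        // appendReplacement has no StringBuilder overload.
        StringBuffer out = new StringBuffer();
        while (comment.find()) {
            // Rewrite only the comment body. quoteReplacement makes the "$" in
            // "${contextPath}" literal, while "$1" still refers to group 1.
            String body = SPECIAL.matcher(comment.group())
                    .replaceAll(Matcher.quoteReplacement("${contextPath}") + "$1");
            // Quote the rewritten body too, since it now contains "$".
            comment.appendReplacement(out, Matcher.quoteReplacement(body));
        }
        // Copy whatever follows the last comment body, including its "-->".
        comment.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "%(/images/a.jpg)\n<!-- %(/images/a.jpg) -->\n%(/images/a.jpg)";
        // Only the middle line changes, to "<!-- ${contextPath}/images/a.jpg -->".
        System.out.println(replaceInComments(html));
    }
}
```

appendReplacement copies the text between matches unchanged, so the comment delimiters and everything outside the comments pass through without any index arithmetic.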

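Appendix, quoted from http://zzg810314.iteye.com/blog/194643

Differences among greedy, reluctant, and possessive quantifiers

Greedy, reluctant, and possessive quantifiers differ in subtle ways.

Greedy quantifiers are called "greedy" because they read (or eat) as much of the input as they can, here the entire input string, before the first match is attempted. If that first attempt (against the entire input string) fails, the matcher backs off by one character and tries again, and it repeats this until a match is found or there are no characters left to give back. Depending on the quantifier used in the expression, the last thing it tries to match is 1 or 0 characters.

Reluctant quantifiers take the opposite approach: they start at the beginning of the input string and read one character at a time while looking for a match. The last thing they try to match is the entire input string.

Finally, possessive quantifiers always read the entire input string and try for a match once, and only once. Unlike greedy quantifiers, possessive quantifiers never back off, even when doing so would let the overall match succeed.

To illustrate, consider the input string xfooxxxxxxfoo: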

Enter your regex: .*foo // greedy quantifier

Enter input string to search: xfooxxxxxxfoo

I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // reluctant quantifier

Enter input string to search: xfooxxxxxxfoo

I found the text "xfoo" starting at index 0 and ending at index 4.

I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier

Enter input string to search: xfooxxxxxxfoo

No match found.

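The first example uses the greedy quantifier .* to look for "anything", zero or more times, followed by the letters f, o, o. Because the quantifier is greedy, the .* part of the expression first reads the entire string. At that point the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed. The matcher therefore backs off one letter at a time until the rightmost occurrence of "foo" has been given back, at which point the match succeeds and the search stops.

The second example uses a reluctant quantifier, so it starts by consuming "nothing". Because "foo" does not appear at the beginning of the string, it is forced to consume the first letter (x), which produces the first match, from index 0 to index 4. The test harness keeps going until the input string is exhausted, and it finds a second match from index 4 to index 13.

The third example finds no match because the quantifier is possessive. Here .*+ consumes the entire input string, leaving nothing to satisfy the "foo" at the end of the expression. Use a possessive quantifier when you want to take everything without ever backing off; in cases where a match is not found immediately, it outperforms the equivalent greedy quantifier.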
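The three runs above can be reproduced with a few lines of Java. This is a small sketch rather than the test harness used in the quoted text; the class name QuantifierDemo and the helper findAll are made up for illustration. Each match is reported with its start index and its (exclusive) end index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns each match formatted as text[start,end).
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "[" + m.start() + "," + m.end() + ")");
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";

        // Greedy: one match covering the whole input.
        // [xfooxxxxxxfoo[0,13)]
        System.out.println("greedy:     " + findAll(".*foo", input));

        // Reluctant: stops at the first "foo", then finds a second match.
        // [xfoo[0,4), xxxxxxfoo[4,13)]
        System.out.println("reluctant:  " + findAll(".*?foo", input));

        // Possessive: .*+ keeps everything it consumed, so "foo" can never match.
        // []
        System.out.println("possessive: " + findAll(".*+foo", input));
    }
}
```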


2009-10-23 17:13