JavaScript RegExp

fixopen


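In the description above I deliberately left out the details of RegExp. RegExp is the well-known regular expression, and I am giving it a chapter of its own.
A regular expression is a small language for searching and replacing text efficiently. In formal terms, it corresponds to a regular grammar.
As usual, let's start with the RegExp literal.
/.../... is the RegExp literal: the body of the pattern sits between the two slashes, and some optional flags may follow.
I will first try to describe the contents and patterns of JavaScript's RegExp, and then its interface and usage.
JavaScript's RegExp extends the classic regular expression syntax and is modelled closely on Perl's regular expressions, although it does not support every Perl feature.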
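Pattern overview
A pattern is a sequence of characters or, put simply, a string. Some of the characters have a special meaning, while the rest simply mean themselves. For example, the pattern /good/ is just the character sequence good, whereas /good./ stands for good followed by any single character. We say that "." is a character with a special meaning: it stands for any character (strictly, any character except a line terminator). What we need to learn is the full set of these special characters.
Whatever follows the pattern (that is, after the second /) is called the pattern's options, or its attributes. There are three attributes: i, g and m, meaning case-insensitive matching, global search, and multiline matching respectively. With m, ^ and $ match at the start and end of every line of a string that contains \n, not only at the start and end of the whole string. The attributes can be combined in any order, and the result is the combination of their individual effects.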
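
To make the three attributes concrete, here is a small sketch. The sample strings and variable names are my own and are only for illustration.

```javascript
// i: the same pattern, with and without case-insensitive matching.
const caseSensitive = /good/.test("GOOD morning");
const caseInsensitive = /good/i.test("GOOD morning");

// g: without it, match() reports only the first match; with it, all matches.
const firstOnly = "good, Good, GOOD".match(/good/i);
const allMatches = "good, Good, GOOD".match(/good/gi);

// m: ^ and $ also match at line breaks, not just at the ends of the input.
const text = "first line\nsecond line";
const withoutM = text.match(/^\w+/g);
const withM = text.match(/^\w+/gm);

console.log(caseSensitive, caseInsensitive); // false true
console.log(firstOnly[0], allMatches);       // good [ 'good', 'Good', 'GOOD' ]
console.log(withoutM, withM);                // [ 'first' ] [ 'first', 'second' ]
```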

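Now let me describe in detail the special characters and character sequences that can appear in a pattern. Roughly, they fall into the following five categories:
character literal
character class --- [] or alternation --- |
grouping or sub-patterns --- () and references --- \n, where n is the group number
repetition --- {m,n} and its shorthands * + ?
position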

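A character literal is simply a single character that stands for itself; what it mainly contributes is information about position. Recall the /good/ example above: g, o, o and d are character literals, and they match the same characters in the same order. Specifically, A-Z, a-z and 0-9 all stand for themselves, as do most other characters that have no special meaning. Characters that do have a special meaning must be prefixed with '\' to be matched literally, and '\' also introduces a number of shorthand and encoded characters. Here is the list: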
\$ A single dollar sign ($)
\* A single asterisk (*)
\+ A single plus sign (+)
\, A single comma (,)
\. A single period (.)
\/ A single slash (/)
\? A single question mark (?)
\\ A single backslash (\)
\^ A single circumflex (^)
\d Any digit character as per [0-9]
\D Any non-digit character as per [^0-9]
\f A form feed
\n A newline
\r A carriage return
\S A non-whitespace character
\s A whitespace character
\t A tab character
\v A vertical tab
\w An alphanumeric character or underscore as per [0-9a-zA-Z_]
\W A character that is neither alphanumeric nor an underscore as per [^0-9a-zA-Z_]
\| A single vertical bar (|)
\( A single opening parenthesis (()
\) A single closing parenthesis ())
\[ A single opening square bracket ([)
\] A single closing square bracket (])
\{ A single opening curly brace ({)
\} A single closing curly brace (})
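\nnn The ASCII character encoded by the octal value nnn (a legacy form; when nnn is a valid group number it is read as a reference to that group instead)
\onnn The ASCII character encoded by the octal value nnn (listed in the reference this table comes from, but not part of standard JavaScript syntax, where \o simply matches the letter o)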
\uhhhh The Unicode character encoded by the hexadecimal value hhhh
\xhh The ASCII character encoded by the hexadecimal value hh
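\cX The control character equivalent to ^X, where X is a letter
The next six entries follow the traditional caret notation. ECMAScript only defines \c followed by a letter, so do not rely on them in JavaScript: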
\c@ (NUL) – Null character
\c[ (ESC) – Escape
\c\ (FS) – File separator (Form separator)
\c] (GS) – Group separator
\c^ (RS) – Record separator
\c_ (US) – Unit separator
\cA (SOH) – Start of header
\cB (STX) – Start of text
\cC (ETX) – End of text
\cD (EOT) – End of transmission
\cE (ENQ) – Enquiry
\cF (ACK) – Positive acknowledge
\cG (BEL) – Alert (bell)
\cH (BS) – Backspace
\cI (HT) – Horizontal tab
\cJ (LF) – Line feed
\cK (VT) – Vertical tab
\cL (FF) – Form feed
\cM (CR) – Carriage return
\cN (SO) – Shift out
\cO (SI) – Shift in
\cP (DLE) – Data link escape
\cQ (DC1) – Device control 1 (XON)
\cR (DC2) – Device control 2 (tape on)
\cS (DC3) – Device control 3 (XOFF)
\cT (DC4) – Device control 4 (tape off)
\cU (NAK) – Negative acknowledgement
\cV (SYN) – Synchronous idle
\cW (ETB) – End of transmission block
\cX (CAN) – Cancel
\cY (EM) – End of medium
\cZ (SUB) – Substitute
\1 to \9 The last remembered substring of the corresponding group, as per the $n property (\0 on its own matches the NUL character)
[\b] A literal backspace not to be confused with a word boundary match (using the \b outside of square brackets)
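
A few of the entries in this table can be checked directly. The following is only a sketch; the test strings and variable names are mine.

```javascript
// An escaped dot matches only a literal period; a bare dot matches any character.
const dotEscaped = /^3\.14$/.test("3x14");
const dotUnescaped = /^3.14$/.test("3x14");

// \d is shorthand for [0-9].
const digits = "room 42".match(/\d+/)[0];

// Three ways to write the letter A, and \cJ for line feed.
const hexA = /\x41/.test("A");
const unicodeA = /\u0041/.test("A");
const ctrlJ = /\cJ/.test("\n");

// Inside brackets \b is a backspace (U+0008); outside, it is a word boundary.
const backspaceInClass = /[\b]/.test("\u0008");
const wordBoundary = /\bcat\b/.test("a cat sat");

console.log(dotEscaped, dotUnescaped, digits);    // false true 42
console.log(hexA, unicodeA, ctrlJ);               // true true true
console.log(backspaceInClass, wordBoundary);      // true true
```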

character class or alternation
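A character class is a set of characters enclosed in []. The whole set stands for a single character. For example, [abcd] means that a, b, c or d may appear at that position. If the contents of [] begin with ^, the class is negated. There is also a range notation and some shorthands. For example, [0-9] stands for the ten characters 0123456789, with - indicating a range. [0-9] can also be abbreviated to \d, and [^0-9] to \D. Here is a list of character classes: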
[ ... ] Any single character that is one of the set enclosed in the square brackets.
[^ ... ] Any single character that is not one of the set enclosed in the square brackets.
[^abcd] Any character that is not one of the letters "a", "b", "c" or "d".
[abcd] Any one of the letters "a", "b", "c" or "d".
[a-z] Any single lower case character.
[A-Z] Any single upper case character.
[a-zA-Z] Any single alphabetic character.
[0-7] Any octal numeric digit.
. Any character apart from newline.
\d Any decimal digit character.
\s Any whitespace character.
\w Any word character (which is any letter, number or underscore). This does not mean a whitespace character.
\D Any non-digit character.
\S Any non-whitespace character. This is not necessarily a valid word character.
\W Any non-word character.
[\b] A literal backspace not to be confused with a word boundary match (using the \b outside of the square brackets)
[0-1] Any binary numeric digit.
[0-9A-F] Any hexadecimal numeric digit.
[\dA-F] Any hexadecimal numeric digit.
[a-zA-Z0-9] Any single alphanumeric character.
[a-zA-Z\d] Any single alphanumeric character.
[^a-zA-Z0-9_\$] Any character that is not valid within an identifier name.
[a-zA-Z0-9_\$] Any character that is valid within an identifier name.
[0-9] Any decimal numeric digit.
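[^0-9] Any character that is not a digit.
[ \t\n\r\f\v] Any common ASCII whitespace character (\s additionally covers Unicode spaces and line terminators).
[^ \t\n\r\f\v] Any character that is not a common ASCII whitespace character.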
[^\n] Any character apart from newline.
[^a-zA-Z0-9_] Any non-word character.
[a-zA-Z0-9_] Any word character.
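
Here is a brief sketch of how a few of these classes behave. The inputs and names are illustrative assumptions of mine, not taken from the table.

```javascript
// A range-based class: only the hexadecimal digits 0-9 and A-F are accepted.
const hexDigits = /^[0-9A-F]+$/;
const validHex = hexDigits.test("1A2F");
const invalidHex = hexDigits.test("1G");

// A negated class: strip everything that is not a digit.
const onlyDigits = "a1b2c3".replace(/[^0-9]/g, "");

// The shorthand \d and the explicit range [0-9] find the same characters here.
const viaShorthand = "x7y8".match(/\d/g);
const viaRange = "x7y8".match(/[0-9]/g);

console.log(validHex, invalidHex);   // true false
console.log(onlyDigits);             // 123
console.log(viaShorthand, viaRange); // [ '7', '8' ] [ '7', '8' ]
```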

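Alternation is really a variation on []. For example, a|b|c|d is equivalent to [abcd]. It is usually called alternation or choice. Unlike a character class, each alternative may be longer than a single character, as in cat|dog.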
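
A short sketch of the equivalence, with example words that are my own. The parentheses keep the ^ and $ anchors applying to every alternative rather than only the first and last.

```javascript
// For single characters, alternation and a character class accept the same input.
const viaAlternation = /^(a|b|c|d)$/;
const viaClass = /^[abcd]$/;
const sameForC = viaAlternation.test("c") === viaClass.test("c");
const sameForE = viaAlternation.test("e") === viaClass.test("e");

// Alternatives can also be whole words, which a character class cannot express.
const animal = /^(cat|dog|bird)$/;
const isDog = animal.test("dog");
const isCow = animal.test("cow");

console.log(sameForC, sameForE); // true true
console.log(isDog, isCow);       // true false
```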

grouping or sub-patterns and references
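Groups and references go together: a reference is a reference to a group. Groups and sub-patterns have exactly the same syntax and semantics and differ only in how they are used: one mainly to specify repetition, the other mainly as the target of a reference. A pair of () defines a group. There can be several groups, numbered 1, 2, 3 and so on by the position of their opening parenthesis from left to right; the reference syntax is \1, \2, \3 and so on, and the whole pattern counts as group 0. For example, in /(['"])[^'"]*\1/ the \1 refers to the group (['"]), so the closing quote must be the same character as the opening one.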
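
The quoted-string pattern above can be tried out as follows. This is a minimal sketch; the sample sentences are assumptions of mine.

```javascript
// \1 must repeat whatever quote character group 1 actually matched.
const quoted = /(['"])[^'"]*\1/;

const result = quoted.exec('say "hello" now');
const wholeMatch = result[0]; // group 0: the entire match
const quoteUsed = result[1];  // group 1: the opening quote

// An opening double quote cannot be closed by a single quote.
const mismatched = quoted.test("\"oops'");

// The other use of a group: applying repetition to a whole sub-pattern.
const repeated = /^(ab){2}$/.test("abab");

console.log(wholeMatch, quoteUsed); // "hello" "
console.log(mismatched, repeated);  // false true
```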

repetition
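Repetition expresses how many times an element may repeat. Its syntax is {m,n}, placed after the element to be repeated. For example:
/go{1,2}d/ matches a sequence that starts with g, followed by o one or two times, followed by d. Concretely, that means god and good. {m,n} has several variants: {m} means exactly m times, and {m,} means at least m times. In particular, {0,1} can be written as ?, {0,} as * and {1,} as +. In addition, + and * have versions followed by ?, which mean non-greedy matching, that is, matching as few repetitions as possible. Note that JavaScript does not allow spaces inside the braces: {1, 2} is read as literal text rather than as a repetition count.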
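
The following sketch checks the repetition forms described above; the test strings are my own choice.

```javascript
// {1,2}: one or two occurrences of o.
const oneOrTwo = /^go{1,2}d$/;
const results = ["gd", "god", "good", "goood"].map((s) => oneOrTwo.test(s));

// Greedy + takes as much as it can; +? takes as little as it can.
const greedy = "<a><b>".match(/<.+>/)[0];
const lazy = "<a><b>".match(/<.+?>/)[0];

// With a space inside the braces, the braces are treated as literal text.
const withSpace = /^go{1, 2}d$/;
const matchesGood = withSpace.test("good");
const matchesLiteral = withSpace.test("go{1, 2}d");

console.log(results);                     // [ false, true, true, false ]
console.log(greedy, lazy);                // <a><b> <a>
console.log(matchesGood, matchesLiteral); // false true
```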

position
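A position indicates where in the text a match may occur. The following kinds of position information are available:
^ Indicate the start of the input (with the m attribute, the start of each line)
$ Indicate the end of the input (with the m attribute, the end of each line)
\b Indicate a word boundary. Note that this cannot be used in a bracketed character class: [\b] means backspace, not word boundary
\B Indicate any non-word boundary location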
.$ The last character at the end of the line (the dot matches one character)
\b\d+\b A complete word composed only of numeric digits
\b\w+\b A complete word
\s*$ All of the trailing whitespace
^$ A line with nothing between the start and end, an empty line
^. The first character at the beginning of the line (the dot matches one character)
^.*$ The entire line regardless of its contents
^\s A leading whitespace character
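
Some of these position patterns in action, as a rough sketch with made-up input. The digit pattern uses + rather than * so that it cannot match an empty string.

```javascript
// First and last character of a single-line string.
const firstChar = "hello".match(/^./)[0];
const lastChar = "hello".match(/.$/)[0];

// Remove all trailing whitespace.
const trimmed = "text   ".replace(/\s*$/, "");

// Words made up only of digits; the 1 in a1 is not a whole word.
const numbers = "abc 123 a1 45".match(/\b\d+\b/g);

// Count empty lines by testing each line against ^$.
const emptyLines = "a\n\nb".split("\n").filter((line) => /^$/.test(line)).length;

console.log(firstChar, lastChar);               // h o
console.log(JSON.stringify(trimmed), numbers);  // "text" [ '123', '45' ]
console.log(emptyLines);                        // 1
```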

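Now for the programming interface
The interface was fully enumerated in the previous chapter, so here I will talk about how to use it.
The following are the non-inheritable properties of the RegExp object (also called class properties). If you do not know how a property is made non-inheritable, see the earlier chapters. Class properties are shared globally rather than owned by each individual RegExp, which is important to keep in mind. They are legacy features and are not recommended for new code.
$n, input, $_, lastMatch, $&, lastParen, $+, leftContext, $`, multiline, $*, rightContext, $'
The following are listed as properties of every RegExp child object (an object that has RegExp as its prototype), in other words the inheritable properties.
constructor, global, ignoreCase, index, input, lastIndex, lastMatch, lastParen, leftContext, multiline, prototype, rightContext, source
These are the inheritable methods.
compile(), exec(), test(), toSource(), toString()
Details follow: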
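$n is not a single property but a group of properties: RegExp.$1 RegExp.$2 RegExp.$3 RegExp.$4 RegExp.$5 RegExp.$6 RegExp.$7 RegExp.$8 RegExp.$9. Each holds the text most recently matched by the corresponding sub-pattern (sub-expression, group).
input == $_ is a class property. In the early implementations this description is based on, it supplied the default argument for exec and test when they were called without one. In current engines it holds the string that the most recent successful match was run against, and exec and test do not fall back to it.
lastMatch == $& is a class property holding the most recently matched text.
lastParen == $+ is the text matched by the last sub-pattern, which is a part of the most recent match.
leftContext == $` is the text to the left of the most recent match.
multiline == $* specifies single-line or multi-line matching. This class-level switch belongs to early implementations; current code should use the m attribute on each pattern instead.
rightContext == $' is the text to the right of the most recent match.
As you can see, each $x is just an alias for the corresponding property.
Note that the aliases built from punctuation ($&, $+, $`, $' and $*) cannot be referenced as RegExp.$x, because that is a syntax error; use RegExp["$x"] instead. $_ and $1 to $9 are valid identifiers and can be used with dot notation.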
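
The class properties can be observed after any successful match. The sketch below assumes an engine that still provides these legacy properties, which is an assumption about the runtime and not something the language standard guarantees; the sample string is mine. The values are copied into local variables straight away, because the next match anywhere in the program overwrites them.

```javascript
// Run one match, then read the shared (class-level) properties.
const matched = /(\w+)@(\w+)/.test("mail bob@example now");

const group1 = RegExp.$1;          // text of the first group
const group2 = RegExp.$2;          // text of the second group
const last = RegExp.lastMatch;     // the whole match
const lastAlias = RegExp["$&"];    // punctuation alias needs bracket notation
const left = RegExp.leftContext;   // text before the match
const right = RegExp["$'"];        // alias of rightContext
const lastGroup = RegExp.lastParen;

console.log(matched, group1, group2);    // true bob example
console.log(last, lastAlias, lastGroup); // bob@example bob@example example
console.log(JSON.stringify(left), JSON.stringify(right)); // "mail " " now"
```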
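Object properties that share a name with a class property have the same meaning, except that they apply to a single object, so they need little explanation. In current engines, though, only multiline survives on the object itself (it reports the m attribute); lastMatch, lastParen, leftContext and rightContext exist only on RegExp, and index and input are found on the array returned by exec().
Now for the ones with different names.
constructor is the constructor. Note that almost every object has this property. It plays a role similar to getType or getClass.
global indicates whether the search is global.
ignoreCase indicates whether case is ignored.
index is the starting position of the match.
lastIndex is the position at which the next search starts, which after a successful global match is the position just past the end of the previous match. It is used to iterate over all the matches.
prototype is an inheritance-related property, so I will not say more about it.
source is the text of the regular expression's pattern itself.
compile() recompiles the pattern held by the object so that it can be reused for later searches. It is a legacy method; current engines compile a pattern when it is created.
exec() carries out the search and returns an array describing the match, or null if there is no match. test() carries out the same search but returns only true or false.
toSource() returns a source-text representation of the object. It is a non-standard method and is not available in most current engines.
toString() returns a string representing the RegExp, in the form /pattern/attributes.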
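
To tie the properties and methods together, here is a sketch of the usual loop over all matches with exec() and lastIndex. The pattern, input and variable names are illustrative assumptions of mine.

```javascript
// A global pattern with two groups: the first letter and the rest of each word.
const wordPattern = /\b(\w)(\w*)\b/g;
const input = "red green blue";

const found = [];
let match;
// With the g attribute, each exec() call resumes at wordPattern.lastIndex.
while ((match = wordPattern.exec(input)) !== null) {
  found.push({
    word: match[0],             // group 0: the whole match
    initial: match[1],          // group 1
    index: match.index,         // where this match starts
    next: wordPattern.lastIndex // where the next search will start
  });
}

// test() answers only yes or no; source and toString() describe the pattern.
const hasWord = /\w+/.test(input);
const source = wordPattern.source;
const asString = wordPattern.toString();

console.log(found.map((f) => `${f.word}@${f.index}-${f.next}`)); // [ 'red@0-3', 'green@4-9', 'blue@10-14' ]
console.log(hasWord, wordPattern.global, wordPattern.ignoreCase); // true true false
console.log(source, asString); // \b(\w)(\w*)\b /\b(\w)(\w*)\b/g
```

After the loop ends with a failed match, lastIndex is reset to 0, so the same object can be used for another pass.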
