`
wenbois2000
  • 浏览: 44670 次
  • 性别: Icon_minigender_1
  • 来自: 湖南
社区版块
存档分类
最新评论

High Performance JavaScript 读书笔记(五)

    博客分类:
  • Web
阅读更多

五.Strings and Regular Expressions
   Practically all JavaScript programs are intimately tied to strings. For example, many applications use Ajax to fetch strings from a server, convert those strings into more easily usable JavaScript objects, and then generate strings of HTML from the data. A typical program deals with numerous tasks like these that require you to merge, split, rearrange, search, iterate over, and otherwise handle strings; and as web applications become more complex, progressively more of this processing is done in the browser.
   In JavaScript, regular expressions are essential for anything more than trivial string processing. A lot of this chapter is therefore dedicated to helping you understand how regular expression engines internally process your strings and teaching you how to write regular expressions that take advantage of this knowledge.  Also in this chapter, you’ll learn about the fastest cross-browser methods for concatenating and trimming strings, discover how to increase regex performance by reducing backtracking, and pick up plenty of other tips and tricks for efficiently processing strings and regular expressions.

  • String Concatenation
       String concatenation can be surprisingly performance intensive. It’s a common task to build a string by continually adding to the end of it in a loop (e.g., when building up an HTML table or an XML document), but this sort of processing is notorious for its poor performance in some browsers.
       So how can you optimize these kinds of tasks? For starters, there is more than one way to merge strings.
    Method Example
    The + operator str = "a" + "b" + "c";
    The += operator

    str = "a";

    str += "b";

    str += "c"; 

    array.join() str = ["a", "b", "c"].join("");
    string.concat()

    tr = "a"; 

    str = str.concat("b", "c");

       All of these methods are fast when concatenating a few strings here and there, so for casual use, you should go with whatever is the most practical. As the length and number of strings that must be merged increases, however, some methods start to show their strength.

    a.Plus (+) and Plus-Equals (+=) Operators
      These operators provide the simplest method for concatenating strings and, in fact, all modern browsers except IE7 and earlier optimize them well enough that you don’t really need to look at other options. However, several techniques maximize the efficiency of these operators.
    First, an example. Here’s a common way to assign a concatenated string:
    str += "one" + "two";
     When evaluating this code, four steps are taken:

       1. A temporary string is created in memory.
       2. The concatenated value "onetwo" is assigned to the temporary string.
       3. The temporary string is concatenated with the current value of str.
       4. The result is assigned to str.

       This is actually an approximation of how browsers implement this task, but it’s close. The following code avoids the temporary string (steps 1 and 2 in the list) by directly appending to str using two discrete statements. This ends up running about 10%–40% faster in most browsers:
    str += "one";
    str += "two";
    
       In fact, you can get the same performance improvement using one statement, as follows:
    str = str + "one" + "two"; // equivalent to str = ((str + "one") + "two")
        This avoids the temporary string because the assignment expression starts with str as the base and appends one string to it at a time, with each intermediary concatenation performed from left to right. If the concatenation were performed in a different order (e.g., str = "one" + str + "two"), you would lose this optimization. This is because of the way that browsers allocate memory when merging strings. Apart from IE, browsers try to expand the memory allocation for the string on the left of an expression and simply copy the second string to the end of it (see Figure 5-1). If, in a loop, the base string is furthest to the left, you avoid repeatedly copying a progressively larger base string.
       These techniques don’t apply to IE. They have little, if any, effect in IE8 and can actually make things slower in IE7 and earlier. That’s because of how IE executes concatenation under the hood. In IE8’s implementation, concatenating strings merely stores references to the existing string parts that compose the new string. At the last possible moment (when you actually use the concatenated string), the string parts are each copied into a new “real” string, which then replaces the previously stored string references so that this assembly doesn’t have to be performed every time the string is used.
      IE8’s implementation can throw off synthetic benchmarks—making concatenation appear faster than it really is—unless you force concatenation to occur after you’ve finished building your test string. For example,this can be done by calling the toString() method on your final string, checking its length property, or inserting it into the DOM.
      IE7 and earlier use an inferior implementation of concatenation in which each pair of concatenated strings must always be copied to a new memory location. You’ll see the potentially dramatic impact of this in the upcoming section “Array Joining”. With the pre-IE8 implementation, the advice in this section can make things slower since it’s faster to concatenate short strings before merging them with a larger base string (thereby avoiding the need to copy the larger string multiple times). For instance, with largeStr = largeStr + s1 + s2, IE7 and earlier must copy the large string twice, first to merge it with s1, then with s2. Conversely, largeStr += s1 + s2 first merges the two smaller strings and then concatenates the result with the large string. Creating the intermediary string of s1 + s2 is a much lighter performance hit than copying the large string twice.

    b.Firefox and compile-time folding
       When all strings concatenated in an assignment expression are compile-time constants,Firefox automatically merges them at compile time. Here’s a way to see this in action:
    function foldingDemo() {
    	var str = "compile" + "time" + "folding";
    	str += "this" + "works" + "too";
    	str = str + "but" + "not" + "this";
    }
    alert(foldingDemo.toString());
    /* In Firefox, you'll see this:
    function foldingDemo() {
    	var str = "compiletimefolding";
    	str += "thisworkstoo";
    	str = str + "but" + "not" + "this";
    } * /
    
       When strings are folded together like this, there are no intermediary strings at runtime and the time and memory that would be spent concatenating them is reduced to zero. This is great when it occurs, but it doesn’t help very often because it’s much more common to build strings from runtime data than from compile-time constants.

  • Array Joining
      The Array.prototype.join method merges all elements of an array into a string and accepts a separator string to insert between each element. By passing in an empty string as the separator, you can perform a simple concatenation of all elements in an array. Array joining is slower than other methods of concatenation in most browsers, but this is more than compensated for by the fact that it is the only efficient way to concatenate lots of strings in IE7 and earlier.
       IE7’s naive concatenation algorithm requires that the browser repeatedly copy and allocate memory for larger and larger strings each time through the loop. The result is quadratic running time and memory consumption.
      The good news is that all other modern browsers (including IE8) perform far better in this test and do not exhibit the quadratic complexity that is the real killer here. However, this demonstrates the impact that seemingly simple string concatenation can have; 226 milliseconds for 5,000 concatenations is already a significant performance hit that would be nice to reduce as much as possible, but locking up a user’s browser for more than 32 seconds in order to concatenate 20,000 short strings is unacceptable for nearly any application.
      This dramatic improvement results from avoiding repeatedly allocating memory for and copying progressively larger and larger strings. When joining an array, the browser allocates enough memory to hold the complete string, and never copies the same part of the final string more than once.

  • String.prototype.concat
       The native string concat method accepts any number of arguments and appends each  to the string that the method is called on. This is the most flexible way to concatenate strings because you can use it to append just one string, a few strings at a time, or an entire array of strings.
    // append one string
    str = str.concat(s1);
    // append three strings
    str = str.concat(s1, s2, s3);
    // append every string in an array by using the array
    // as the list of arguments
    str = String.prototype.concat.apply(str, array);
    
       Unfortunately, concat is a little slower than simple + and += operators in most cases, and can be substantially slower in IE, Opera, and Chrome. Moreover, although using concat to merge all strings in an array appears similar to the array joining approach discussed previously, it’s usually slower (except in Opera), and it suffers from the same potentially catastrophic performance problem as + and += when building large strings in IE7 and earlier.

Regular Expression Optimization
     Incautiously crafted regexes can be a major performance bottleneck (the upcoming section, “Runaway Backtracking” on page 91, contains several examples showing  how severe this can be), but there is a lot you can do to improve regex efficiency. Just  because two regexes match the same text doesn’t mean they do so at the same speed.  Many factors affect a regex’s efficiency. For starters, the text a regex is applied to makes a big difference because regexes spend more time on partial matches than obvious nonmatches. Each browser’s regex engine also has different internal optimizations.
   Regex optimization is a fairly broad and nuanced topic. There’s only so much that can be covered in this section, but what’s included should put you well on your way to understanding the kinds of issues that affect regex performance and mastering the art of crafting efficient regexes.

  • How Regular Expressions Work
       In order to use regular expressions efficiently, it’s important to understand how they work their magic. The following is a quick rundown of the basic steps a regex goes through:

    Step 1: Compilation
       When you create a regex object (using a regex literal or the RegExp constructor), the browser checks your pattern for errors and then converts it into a native code routine that is used to actually perform matches. If you assign your regex to a variable, you can avoid performing this step more than once for a given pattern.

    Step 2: Setting the starting position
       When a regex is put to use, the first step is to determine the position within the target string where the search should start. This is initially the start of the string or the position specified by the regex’s lastIndex property,(§) but when returning here from step 4 (due to a failed match attempt), the position is one character after where the last attempt started.
       Optimizations that browser makers build into their regex engines can help avoid a lot of unnecessary work at this stage by deciding early that certain work can be skipped. For instance, if a regex starts with ^, IE and Chrome can usually determine that a match cannot be found after the start of a string and avoid foolishly searching subsequent positions. Another example is that if all possible matches contain x as the third character, a smart implementation may be able to determine this, quickly search for the next x, and set the starting position two characters back from where it’s found (e.g., recent versions of Chrome include this optimization).

    Step 3: Matching each regex token
       Once the regex knows where to start, it steps through the text and the regex pattern. When a particular token fails to match, the regex tries to backtrack to a prior point in the match attempt and follow other possible paths through the regex.

    Step 4: Success or failure
        If a complete match is found at the current position in the string, the regex declares success. If all possible paths through the regex have been attempted but a match was not found, the regex engine goes back to step 2 to try again at the next character in the string. Only after this cycle completes for every character in the string (as well as the position after the last character) and no matches have been found does the regex declare overall failure.

        Keeping this process in mind will help you make informed decisions about the types of issues that affect regex performance. Next up is a deeper look into a key feature of the matching process in step 3: backtracking.

  • Understanding Backtracking
       In most modern regex implementations (including those required by JavaScript), backtracking is a fundamental component of the matching process. It’s also a big part of what makes regular expressions so expressive and powerful. However, backtracking is computationally expensive and can easily get out of hand if you’re not careful. Although backtracking is only part of the overall performance equation, understanding how it works and how to minimize its use is perhaps the most important key to writing efficient regexes. The next few sections therefore cover the topic at some length.  As a regex works its way through a target string, it tests whether a match can be found at each position by stepping through the components in the regex from left to right. For each quantifier and alternation,(‖) a decision must be made about how to proceed. With a quantifier (such as *, +?, or {2,}), the regex must decide when to try matching additional characters, and with alternation (via the | operator), it must try one option from those available.
       Each time the regex makes such a decision, it remembers the other options to return to later if necessary. If the chosen option is successful, the regex continues through the regex pattern, and if the remainder of the regex is also successful, the match is complete. But if the chosen option can’t find a match or anything later in the regex fails, the regex backtracks to the last decision point where untried options remain and chooses one. It continues on like this until a match is found or all possible permutations of the quantifiers and alternation options in the regex have been tried unsuccessfully, at which point it gives up and moves on to start this process all over at the next character in the string.

    a.Alternation and backtracking
       Here’s an example that demonstrates how this process plays out with alternation.
    /h(ello|appy) hippo/.test("hello there, happy hippo");
       This regex matches “hello hippo” or “happy hippo”. It starts this test by searching for an h, which it finds immediately as the first character in the target string. Next, the subexpression (ello|appy) provides two ways to proceed. The regex chooses the leftmost option (alternation always works from left to right), and checks whether ello matches the next characters in the string. It does, and the regex is also able to match the following space character. At that point, though, it reaches a dead end because the h in hippo cannot match the t that comes next in the string. The regex can’t give up yet, though, because it hasn’t tried all of its options, so it backtracks to the last decision point (just after it matched the leading h) and tries to match the second alternation option. That doesn’t work, and since there are no more options to try, the regex determines that a match cannot be found starting from the first character in the string
    and moves on to try again at the second character. It doesn’t find an h there, so it continues searching until it reaches the 14th character, where it matches the h in “happy”. It then steps through the alternatives again. This time ello doesn’t match, but after backtracking and trying the second alternative, it’s able to continue until it matches the full string “happy hippo” (see Figure 5-4). Success.

    b.Repetition and backtracking
       This next example shows how backtracking works with repetition quantifiers.
    var str = "<p>Para 1.</p>" +
    	"<img src='smiley.jpg'>" +
    	"<p>Para 2.</p>" +
                    "<div>Div.</div>";
    
    /<p>.*<\/p>/i.test(str);
    
     
       Here, the regex starts by matching the three literal characters <p> at the start of the  string. Next up is '.*' The dot matches any character except line breaks, and the greedy  asterisk quantifier repeats it zero or more times—as many times as possible. Since there  are no line breaks in the target string, this gobbles up the rest of the string! There’s still more to match in the regex pattern, though, so the regex tries to match '<' This doesn’t  work at the end of the string, so the regex backtracks one character at a time, continually trying to match <, until it gets back to the < at the beginning of the </div> tag. It then tries to match \/ (an escaped backslash), which works, followed by p, which doesn’t.The regex backtracks again, repeating this process until it eventually matches the </p> at the end of the second paragraph. The match is returned successfully, spanning from the start of the first paragraph until the end of the last one, which is probably not what you wanted.
       You can change the regex to match individual paragraphs by replacing the greedy * quantifier with the lazy (aka nongreedy) *?. Backtracking for lazy quantifiers works in the opposite way. When the regex /<p>.*?<\/p>/ comes to the .*?, it first tries to skip this altogether and move on to matching <\/p>. It does so because *? repeats its preceding element zero or more times, as few times as possible, and the fewest possible times it can repeat is zero. However, when the following < fails to match at this point in the string, the regex backtracks and tries to match the next fewest number of characters: one. It continues backtracking forward like this until the <\/p> that follows the quantifier is able to fully match at the end of the first paragraph.
       You can see that even if there was only one paragraph in the target string and therefore the greedy and lazy versions of this regex were equivalent, they would go about finding their matches differently.

  • Runaway Backtracking
       When a regular expression stalls your browser for seconds, minutes, or longer, the problem is most likely a bad case of runaway backtracking. To demonstrate the problem, consider the following regex, which is designed to match an entire HTML file. The regex is wrapped across multiple lines in order to fit the page. Unlike most other regex flavors, JavaScript does not have an option to make dots match any character, including line breaks, so this example uses [\s\S] to match any character.
    /<html>[\s\S]*?<head>[\s\S]*?<title>[\s\S]*?<\/title>[\s\S]*?<\/head>
    [\s\S]*?<body>[\s\S]*?<\/body>[\s\S]*?<\/html>/
    
       This regex works fine when matching a suitable HTML string, but it turns ugly when the string is missing one or more required tags. If the </html> tag is missing, for instance, the last [\s\S]*? expands to the end of the string since there is no </html> tag to be found, and then, instead of giving up, the regex sees that each of the previous [\s\S]*? sequences remembered backtracking positions that allow them to expand further. The regex tries expanding the second-to-last [\s\S]*?—using it to match the </body> tag that was previously matched by the literal <\/body> pattern in the regex— and continues to expand it in search of a second </body> tag until the end of the string is reached again. When all of that fails, the third-to-last [\s\S]*? expands to the end of the string, and so on.

    a.The solution: Be specific
      The way around a problem like this is to be as specific as possible about what characters can be matched between your required delimiters. Take the pattern ".*?", which is intended to match a string delimited by double-quotes. By replacing the overly permissive .*? with the more specific [^"\r\n]*, you remove the possibility that backtracking will force the dot to match a double-quote and expand beyond what was intended.
      With the HTML example, this workaround is not as simple. You can’t use a negated character class like [^<] in place of [\s\S] because there may be other tags between those you’re searching for. However, you can reproduce the effect by repeating a noncapturing group that contains a negative lookahead (blocking the next required tag) and the [\s\S] (any character) metasequence. This ensures that the tags you’re looking for fail at every intermediate position, and, more importantly, that the [\s\S] patterns cannot expand beyond where the tags you are blocking via negative lookahead are found. Here’s how the regex ends up looking using this approach:
    /<html>(?:(?!<head>)[\s\S])*<head>(?:(?!<title>)[\s\S])*<title>
    (?:(?!<\/title>)[\s\S])*<\/title>(?:(?!<\/head>)[\s\S])*<\/head>
    (?:(?!<body>)[\s\S])*<body>(?:(?!<\/body>)[\s\S])*<\/body>
    (?:(?!<\/html>)[\s\S])*<\/html>/
    
       Although this removes the potential for runaway backtracking and allows the regex to fail at matching incomplete HTML strings in linear time, it’s not going to win any awards for efficiency. Repeating a lookahead for each matched character like this is rather inefficient in its own right and significantly slows down successful matches. This approach works well enough when matching short strings, but since in this case the lookaheads may need to be tested thousands of times in order to match an HTML file, there’s another solution that works better. It relies on a little trick, and it’s described next.

    b.Emulating atomic groups using lookahead and backreferences
       Some regex flavors, including .NET, Java, Oniguruma, PCRE, and Perl, support a feature called atomic grouping. Atomic groups—written as (?>…), where the ellipsis represents any regex pattern—are noncapturing groups with a special twist. As soon as a regex exits an atomic group, any backtracking positions within the group are thrown away. This provides a much better solution to the HTML regex’s backtracking problem: if you were to place each [\s\S]*? sequence and its following HTML tag together inside an atomic group, then every time one of the required HTML tags was found, the match thus far would essentially be locked in. If a later part of the regex failed to match, no backtracking positions would be remembered for the quantifiers within the atomic groups, and thus the [\s\S]*? sequences could not attempt to expand beyond what they already matched.
      That’s great, but JavaScript does not support atomic groups or provide any other feature to eliminate needless backtracking. It turns out, though, that you can emulate atomic groups by exploiting a little-known behavior of lookahead: that lookaheads are atomic groups.# The difference is that lookaheads don’t consume any characters as part of the overall match; they merely check whether the pattern they contain can be matched at that position. However, you can get around this by wrapping a lookahead’s pattern inside a capturing group and adding a backreference to it just outside the lookahead. Here’s what this looks like:
    (?=(pattern to make atomic))\1
       This construct is reusable in any pattern where you want to use an atomic group. Just keep in mind that you need to use the appropriate backreference number if your regex contains more than one capturing group. Here’s how this looks when applied to the HTML regex:
    /<html>(?=([\s\S]*?<head>))\1(?=([\s\S]*?<title>))\2(?=([\s\S]*?
    <\/title>))\3(?=([\s\S]*?<\/head>))\4(?=([\s\S]*?<body>))\5
    (?=([\s\S]*?<\/body>))\6[\s\S]*?<\/html>/
    
       Now, if there is no trailing </html> and the last [\s\S]*? expands to the end of the string, the regex immediately fails because there are no backtracking points to return to. Each time the regex finds an intermediate tag and exits a lookahead, it throws away all backtracking positions from within the lookahead. The following backreference simply rematches the literal characters found within the lookahead, making them a part of the actual match.

    c.Nested quantifiers and runaway backtracking
      So-called nested quantifiers always warrant extra attention and care in order to ensure that you’re not creating the potential for runaway backtracking. A quantifier is nested when it occurs within a grouping that is itself repeated by a quantifier (e.g., (x+)*). Nesting quantifiers is not actually a performance hazard in and of itself. However, if you’re not careful, it can easily create a massive number of ways to divide text between the inner and outer quantifiers while attempting to match a string.  As an example, let’s say you want to match HTML tags, and you come up with the following regex:
    /<(?:[^>"']|"[^"]*"|'[^']*')*>/
      This is perhaps overly simplistic, as it does not handle all cases of valid and invalid markup correctly, but it might work OK if used to process only snippets of valid HTML. Its advantage over even more naive solutions such as /<[^>]*>/ is that it accounts for > characters that occur within attribute values. It does so using the second and third alternatives in the noncapturing group, which match entire double- and single-quoted attribute values in single steps, allowing all characters except their respective quote type to occur within them.
      So far, there’s no risk of runaway backtracking, despite the nested * quantifiers. The second and third alternation options match exactly one quoted string sequence per repetition of the group, so the potential number of backtracking points increases linearly with the length of the target string.
      However, look at the first alternative in the noncapturing group: [^>"']. This can match only one character at a time, which seems a little inefficient. You might think it would be better to add a + quantifier at the end of this character class so that more than one suitable character can be matched during each repetition of the group—and at positions within the target string where the regex finds a match—and you’d be right. By matching more than one character at a time, you’d let the regex skip many unnecessary steps on the way to a successful match.
      What might not be as readily apparent is the negative consequence such a change could lead to. If the regex matches an opening < character, but there is no following > that would allow the match attempt to complete successfully, runaway backtracking will kick into high gear because of the huge number of ways the new inner quantifier can be combined with the outer quantifier (following the noncapturing group) to match the text that follows <. The regex must try all of these permutations before giving up on the match attempt. Watch out!
       For an even more extreme example of nested quantifiers resulting in runaway backtracking, apply the regex /(A+A+)+B/ to a string containing only As. Although this regex would be better written as /AA+B/, for the sake of discussion imagine that the two As represent different patterns that are capable of matching some of the same strings.
      When applied to a string composed of 10 As ("AAAAAAAAAA"), the regex starts by using the first A+ to match all 10 characters. The regex then backtracks one character, letting the second A+ match the last one. The grouping then tries to repeat, but since there are no more As and the group’s + quantifier has already met its requirement of matching at least once, the regex then looks for the B. It doesn’t find it, but it can’t give up yet, since there are more paths through the regex that haven’t been tried. What if the first A+ matched eight characters and the second matched two? Or if the first matched three characters, the second matched two, and the group repeated twice? How about if during the first repetition of the group, the first A+ matched two characters and the second matched three; then on the second repetition the first matched one and the second matched four? Although to you and me it’s obviously silly to think that any amount of backtracking will produce the missing B, the regex will dutifully check all of these futile
    options and a lot more. The worst-case complexity of this regex is an appalling O(2n), or two to the nth power, where n is the length of the string. With the 10 As used here, the regex requires 1,024 backtracking steps for the match to fail, and with 20 As, that number explodes to more than a million. Thirty-five As should be enough to hang Chrome, IE, Firefox, and Opera for at least 10 minutes (if not permanently) while they process the more than 34 billion backtracking steps required to invalidate all permutations of the regex. The exception is recent versions of Safari, which are able to detect that the regex is going in circles and quickly abort the match (Safari also imposes a cap of allowed backtracking steps, and aborts match attempts when this is exceeded). The key to preventing this kind of problem is to make sure that two parts of a regex cannot match the same part of a string. For this regex, the fix is to rewrite it as /AA+B/, but the issue may be harder to avoid with complex regexes. Adding an emulated atomic group often works well as a last resort, although other solutions, when possible, will most likely keep your regexes easier to understand. Doing so for this regex looks like /((?=(A+A+))\2)+B/, and completely removes the backtracking problem.

  • A Note on Benchmarking
       Because a regex’s performance can be wildly different depending on the text it’s applied to, there’s no straightforward way to benchmark regexes against each other. For the best result, you need to benchmark your regexes on test strings of varying lengths that match, don’t match, and nearly match.
      That’s one reason for this chapter’s lengthy backtracking coverage. Without a firm understanding of backtracking, you won’t be able to anticipate and identify backtracking-related problems. To help you catch runaway backtracking early, always test your regexes with long strings that contain partial matches. Think about the kinds of strings that your regexes will nearly but not quite match, and include those in your tests.

  • More Ways to Improve Regular Expression Efficiency
       The following are a variety of additional regex efficiency techniques. Several of the points here have already been touched upon during the backtracking discussion.

    Focus on failing faster
           Slow regex processing is usually caused by slow failure rather than slow matching.  This is compounded by the fact that if you’re using a regex to match small parts of  a large string, the regex will fail at many more positions than it will succeed. A  change that makes a regex match faster but fail slower (e.g., by increasing the  number of backtracking steps needed to try all regex permutations) is usually a  losing trade.

    Start regexes with simple, required tokens
        Ideally, the leading token in a regex should be fast to test and rule out as many obviously nonmatching positions as possible. Good starting tokens for this purpose are anchors (^ or $), specific characters (e.g., x or \u263A), character classes (e.g., [a-z] or shorthands like \d), and word boundaries (\b). If possible, avoid starting regexes with groupings or optional tokens, and avoid top-level alternation such as /one|two/ since that forces the regex to consider multiple leading tokens. Firefox is sensitive to the use of any quantifier on leading tokens, and is better able to optimize, e.g., \s\s* than \s+ or \s{1,}. Other browsers mostly optimize away such differences.

    Make quantified patterns and their following token mutually exclusive
        When the characters that adjacent tokens or subexpressions are able to match overlap, the number of ways a regex will try to divide text between them increases. To help avoid this, make your patterns as specific as possible. Don’t use ".*?" (which relies on backtracking) when you really mean "[^"\r\n]*".

    Reduce the amount and reach of alternation
        Alternation using the | vertical bar may require that all alternation options be tested at every position in a string. You can often reduce the need for alternation by using character classes and optional components, or by pushing the alternation further back into the regex (allowing some match attempts to fail before reaching the alternation). The following table shows examples of these techniques.
    Instead of  Use
    cat|bat [cb]at
    red|read rea?d
    red|raw r(?:ed|aw)
    (.|\r|\n) [\s\S]
       Character classes are faster than alternation because they are implemented using bit vectors (or other fast implementations) rather than backtracking. When alternation is necessary, put frequently occurring alternatives first if this doesn’t affect what the regex matches. Alternation options are attempted from left to right, so the more frequently an option is expected to match, the sooner you want it to be considered.
       Note that Chrome and Firefox perform some of these optimizations automatically,and are therefore less affected by techniques for hand-tuning alternation.

    Use noncapturing groups
       Capturing groups spend time and memory remembering backreferences and keeping them up to date. If you don’t need a backreference, avoid this overhead by using a noncapturing group—i.e., (?:…) instead of (…). Some people like to wrap their regexes in a capturing group when they need a backreference to the entire match. This is unnecessary since you can reference full matches via, e.g., element zero in arrays returned by regex.exec() or $& in replacement strings.
    Replacing capturing groups with their noncapturing kin has minimal impact in Firefox, but can make a big difference in other browsers when dealing with long strings.

    Capture interesting text to reduce postprocessing
       As a caveat to the last tip, if you need to reference parts of a match, then, by all means, capture those parts and use the backreferences produced. For example, if you’re writing code to process the contents of quoted strings matched by a regex, use /"([^"]*)"/ and work with backreference one, rather than using /"[^"]*"/ and manually stripping the quote marks from the result. When used in a loop, this kind of work reduction can save significant time.

    Expose required tokens
        In order to help regex engines make smart decisions about how to optimize a search routine, try to make it easy to determine which tokens are required. When tokens are used within subexpressions or alternation, it’s harder for regex engines to determine whether they are required, and some won’t make the effort to do so. For instance, the regex /^(ab|cd)/ exposes its start-of-string anchor. IE and Chrome see this and prevent the regex from trying to find matches after the start of a string, thereby making this search near instantaneous regardless of string length. However, because the equivalent regex /(^ab|^cd)/ doesn’t expose its ^ anchor, IE doesn’t apply the same optimization and ends up pointlessly searching for matches at every position in the string.

    Use appropriate quantifiers
       As described in the earlier section “Repetition and backtracking” on page 90,greedy and lazy quantifiers go about finding matches differently, even when they match the same strings. Using the more appropriate quantifier type (based on the anticipated amount of backtracking) in cases where they are equally correct can significantly improve performance, especially with long strings. Lazy quantifiers are particularly slow in Opera 9.x and earlier, but Opera 10 removes
    this weakness.

    Reuse regexes by assigning them to variables
       Assigning regexes to variables lets you avoid repeatedly compiling them. Some people go overboard, using regex caching schemes that aim to avoid ever compiling a given pattern and flag combination more than once. Don’t bother; regex compilation is fast, and such schemes likely add more overhead than they evade. The important thing is to avoid repeatedly recompiling regexes within loops. In other words, don’t do this:
    while (/regex1/.test(str1)) {
    	/regex2/.exec(str2);
    	...
    }
    
     Do this instead:
    var regex1 = /regex1/,
    regex2 = /regex2/;
    while (regex1.test(str1)) {
    	regex2.exec(str2);
    	...
    }
    
     
    Split complex regexes into simpler pieces
      Try to avoid doing too much with a single regex. Complicated search problems that require conditional logic are easier to solve and usually more efficient when broken into two or more regexes, with each regex searching within the matches of the last. Regex monstrosities that do everything in one pattern are difficult to maintain, and are prone to backtracking-related problems.

  • When Not to Use Regular Expressions
       When used with care, regexes are very fast. However, they’re usually overkill when you are merely searching for literal strings. This is especially true if you know in advance which part of a string you want to test. For instance, if you want to check whether a string ends with a semicolon, you could use something like this:
    endsWithSemicolon = /;$/.test(str);
       You might be surprised to learn, though, that none of the big browsers are currently smart enough to realize in advance that this regex can match only at the end of the string. What they end up doing is stepping through the entire string. Each time a semicolon is found, the regex advances to the next token ($), which checks whether the match is at the end of the string. If not, the regex continues searching for a match until it finally makes its way through the entire string. The longer your string (and the more semicolons it contains), the longer this takes.
       In this case, a better approach is to skip all the intermediate steps required by a regex and simply check whether the last character is a semicolon:
    endsWithSemicolon = str.charAt(str.length - 1) == ";";
        This is just a bit faster than the regex-based test with small target strings, but, more importantly, the string’s length no longer affects the time needed to perform the test. This example used the charAt method to read the character at a specific position. The string methods slice, substr, and substring work well when you want to extract and check the value of more than one character at a specific position. Additionally, the indexOf and lastIndexOf methods are great for finding the position of literal strings or checking for their presence. All of these string methods are fast and can help you avoid invoking the overhead of regular expressions when searching for literal strings that don’t rely on fancy regex features.


String Trimming

    Removing leading and trailing whitespace from a string is a simple but common task.Although ECMAScript 5 adds a native string trim method (and you should therefore start to see this method in upcoming browsers), JavaScript has not historically included it. For the current browser crop, it’s still necessary to implement a trim method yourself or rely on a library that includes it.
   Trimming strings is not a common performance bottleneck, but it serves as a decent case study for regex optimization since there are a variety of ways to implement it.

  • Trimming with Regular Expressions
       Regular expressions allow you to implement a trim method with very little code, which is important for JavaScript libraries that focus on file size. Probably the best all-around solution is to use two substitutions—one to remove leading whitespace and another to remove trailing whitespace. This keeps things simple and fast, especially with long strings.
    if (!String.prototype.trim) {
    	String.prototype.trim = function() {
    		return this.replace(/^\s+/, "").replace(/\s+$/, "");
    	}
    }
    
    // test the new method...
    // tab (\t) and line feed (\n) characters are
    // included in the leading whitespace.
    var str = " \t\n test string ".trim();
    alert(str == "test string"); // alerts "true"
    
       The if block surrounding this code avoids overriding the trim method if it already exists, since native methods are optimized and usually far faster than anything you can implement yourself using a JavaScript function. Subsequent implementations of this example assume that this conditional is in place, though it is not written out each time.  You can give Firefox a performance boost of roughly 35% (less or more depending on the target string’s length and content)* by replacing /\s+$/ (the second regex) with /\s\s*$/. Although these two regexes are functionally identical, Firefox provides additional optimization for regexes that start with a nonquantified token. In other browsers, the difference is less significant or is optimized differently altogether. However, changing the regex that matches at the beginning of strings to /^\s\s* / does not
    produce a measurable difference, because the leading ^ anchor takes care of quickly invalidating nonmatching positions (precluding a slight performance difference from compounding over thousands of match attempts within a long string).
       Following are several more regex-based trim implementations, which are some of the more common alternatives you might encounter. You can see cross-browser performance numbers for all trim implementations described here in Table 5-2 at the end of this section. There are, in fact, many ways beyond those listed here that you can write a regular expression to help you trim strings, but they are invariably slower (or at least less consistently decent cross-browser) than using two simple substitutions when working with long strings.
    // trim 2
    String.prototype.trim = function() {
    	return this.replace(/^\s+|\s+$/g, "");
    }
    
       This is probably the most common solution. It combines the two simple regexes via  alternation, and uses the /g (global) flag to replace all matches rather than just the first  (it will match twice when its target contains both leading and trailing whitespace). This  isn’t a terrible approach, but it’s slower than using two simple substitutions when  working with long strings since the two alternation options need to be tested at every  character position.
    // trim 3
    String.prototype.trim = function() {
    	return this.replace(/^\s*([\s\S]*?)\s*$/, "$1");
    }
    
       This regex works by matching the entire string and capturing the sequence from the first to the last nonwhitespace characters (if any) to backreference one. By replacing the entire string with backreference one, you’re left with a trimmed version of the string.  This approach is conceptually simple, but the lazy quantifier inside the capturing group makes the regex do a lot of extra work (i.e., backtracking), and therefore tends to make this option slow with long target strings. After the regex enters the capturing group, the [\s\S] class’s lazy *? quantifier requires that it be repeated as few times as possible. Thus, the regex matches one character at a time, stopping after each character to try to match the remaining \s*$ pattern. If that fails because nonwhitespace characters remain somewhere after the current position in the string, the regex matches one more character, updates the backreference, and then tries the remainder of the pattern again.  Lazy repetition is particularly slow in Opera 9.x and earlier. Consequently, trimming long strings with this method in Opera 9.64 performs about 10 to 100 times slower than in the other big browsers. Opera 10 fixes this longstanding weakness, bringing this method’s performance in line with other browsers.
    // trim 4
    String.prototype.trim = function() {
    	return this.replace(/^\s*([\s\S]*\S)?\s*$/, "$1");
    }
    
       This is similar to the last regex, but it replaces the lazy quantifier with a greedy one for performance reasons. To make sure that the capturing group still only matches up to the last nonwhitespace character, a trailing \S is required. However, since the regex must be able to match whitespace-only strings, the entire capturing group is made optional by adding a trailing question mark quantifier. Here, the greedy asterisk in [\s\S]* repeats its any-character pattern to the end of the string. The regex then backtracks one character at a time until it’s able to match the following \S, or until it backtracks to the first character matched within the group (after which it skips the group).
       Unless there’s more trailing whitespace than other text, this generally ends up being faster than the previous solution that used a lazy quantifier. In fact, it’s so much faster that in IE, Safari, Chrome, and Opera 10, it even beats using two substitutions. That’s because those browsers contain special optimization for greedy repetition of character classes that match any character. The regex engine jumps to the end of the string without evaluating intermediate characters (although backtracking positions must still be recorded), and then backtracks as appropriate. Unfortunately, this method is considerably slower in Firefox and Opera 9, so at least for now, using two substitutions still holds up better cross-browser
    // trim 5
    String.prototype.trim = function() {
    	return this.replace(/^\s*(\S*(\s+\S+)*)\s*$/, "$1");
    }
    
       This is a relatively common approach, but there’s no good reason to use it since it’s consistently one of the slowest of the options shown here, in all browsers. It’s similar to the last two regexes in that it matches the entire string and replaces it with the part you want to keep, but because the inner group matches only one word at a time, there are a lot of discrete steps the regex must take. The performance hit may be unnoticeable when trimming short strings, but with long strings that contain many words, this regex can become a performance problem.
       Changing the inner group to a noncapturing group—i.e., changing (\s+\S+) to (?:\s+\S+)—helps a bit, slashing roughly 20%–45% off the time needed in Opera, IE, and Chrome, along with much slighter improvements in Safari and Firefox. Still, a noncapturing group can’t redeem this implementation. Note that the outer group cannot be converted to a noncapturing group since it is referenced in the replacement string.

  • Trimming Without Regular Expressions
      Although regular expressions are fast, it’s worth considering the performance of trimming without their help. Here’s one way to do so:
    // trim 6
    String.prototype.trim = function() {
    	var start = 0,
    	end = this.length - 1,
    	ws = " \n\r\t\f\x0b\xa0\u1680\u180e\u2000\u2001\u2002\u2003
    	\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f
    	\u205f\u3000\ufeff";
    
    	while (ws.indexOf(this.charAt(start)) > -1) {
    		start++;
    	}
    	while (end > start && ws.indexOf(this.charAt(end)) > -1) {
    		end--;
    	}
    	return this.slice(start, end + 1);
    }
    
       The ws variable in this code includes all whitespace characters as defined by ECMAScript 5. For efficiency reasons, copying any part of the string is avoided until the trimmed version’s start and end positions are known.
      It turns out that this smokes the regex competition when there is only a bit of whitespace on the ends of the string. The reason is that although regular expressions are well suited for removing whitespace from the beginning of a string, they’re not as fast at trimming from the end of long strings. As noted in the section “When Not to Use Regular Expressions” on page 99, a regex cannot jump to the end of a string without considering characters along the way. However, this implementation does just that, with the second while loop working backward from the end of the string until it finds a nonwhitespace character.
       Although this version is not affected by the overall length of the string, it has its own weakness: long leading and trailing whitespace. That’s because looping over characters to check whether they are whitespace can’t match the efficiency of a regex’s optimized search code.

  • A Hybrid Solution(混合的解决方案)
      The final approach for this section is to combine a regex’s universal efficiency at trimming leading whitespace with the nonregex method’s speed at trimming trailing characters.
    // trim 7
    String.prototype.trim = function() {
    	var str = this.replace(/^\s+/, ""),
    	end = str.length - 1,
    	ws = /\s/;
    
    	while (ws.test(str.charAt(end))) {
    		end--;
    	}
    	return str.slice(0, end + 1);
    }
    
       This hybrid method remains insanely fast when trimming only a bit of whitespace, and removes the performance risk of strings with long leading whitespace and whitespaceonly strings (although it maintains the weakness for strings with long trailing whitespace). Note that this solution uses a regex in the loop to check whether characters at the end of the string are whitespace. Although using a regex for this adds a bit of performance overhead, it lets you defer the list of whitespace characters to the browser for the sake of brevity and compatibility.
      The general trend for all trim methods described here is that overall string length has more impact than the number of characters to be trimmed in regex-based solutions, whereas nonregex solutions that work backward from the end of the string are unaffected by overall string length but more significantly affected by the amount of whitespace to trim. The simplicity of using two regex substitutions provides consistently respectable performance cross-browser with varying string contents and lengths, and therefore it’s arguably the best all-around solution. The hybrid solution is exceptionally fast with long strings at the cost of slightly longer code and a weakness in some browsers for long, trailing whitespace. See Table 5-2 for all the gory details.

Summary
     Intensive string operations and incautiously crafted regexes can be major performance  obstructions, but the advice in this chapter helps you avoid common pitfalls.

  • When concatenating numerous or large strings, array joining is the only method with reasonable performance in IE7 and earlier.
  • If you don’t need to worry about IE7 and earlier, array joining is one of the slowest ways to concatenate strings. Use simple + and += operators instead, and avoid unnecessary intermediate strings.
  • Backtracking is both a fundamental component of regex matching and a frequent source of regex inefficiency.
  • Runaway backtracking can cause a regex that usually finds matches quickly to run slowly or even crash your browser when applied to partially matching strings.Techniques for avoiding this problem include making adjacent tokens mutually exclusive, avoiding nested quantifiers that allow matching the same part of a string more than one way, and eliminating needless backtracking by repurposing the   atomic nature of lookahead.
  • A variety of techniques exist for improving regex efficiency by helping regexes find matches faster and spend less time considering nonmatching positions (see “More Ways to Improve Regular Expression Efficiency” on page 96).
  • Regexes are not always the best tool for the job, especially when you are merely searching for literal strings.
  • Although there are many ways to trim a string, using two simple regexes (one to remove leading whitespace and another for trailing whitespace) offers a good mix of brevity and cross-browser efficiency with varying string contents and lengths.Looping from the end of the string in search of the first nonwhitespace characters, or combining this technique with regexes in a hybrid approach, offers a good alternative that is less affected by overall string length.


分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics