1. Pattern matching: Find one of a specified set of strings in text.
2. Regular expression : a notation to specify a set of strings.
-- basic operations
-- additional operations: can be expressed by basic operations. eg. [A-E]+ is shorthand for (A|B|C|D|E)(A|B|C|D|E)*
3. Duality between REs and DFAs
-- RE: Concise way to describe a set of strings.
-- DFA: Machine to recognize whether a given string is in a given set.
-- Kleene's theorem.
- For any DFA, there exists a RE that describes the same set of strings.
- For any RE, there exists a DFA that recognizes the same set of strings.
-- Bad news : DFA may have exponential # of states.
4. Regular-expression-matching NFA.
-- RE enclosed in parentheses.
-- One state per RE character (start = 0, accept = M).
-- ε-transition: change state, but don't scan text.
-- match transition: change state and scan to next text char.
-- Accept if any sequence of transitions ends in accept state.(after scanning all text characters)
5. How to determine whether a string is matched by an automaton?
-- DFA: Deterministic ⇒ easy because exactly one applicable transition.
-- NFA: Nondeterministic ⇒ can be several applicable transitions; need to select the right one!
6. NFA representation
-- State names: Integers from 0 to M(number of symbols in RE).
-- Match-transitions: Keep regular expression in array re[]. Use (re[position] == input character) to verify the match transition.
-- ε-transitions. Store in a digraph G.
7. NFA simulation
-- Maintain set of all possible states that NFA could be in after reading in the first i text characters.
-- perform reachability in ε-transitions diagraph : Run DFS from each source (possible states after match transition), without unmarking vertices. (Runs in time proportional to E + V)
-- Implementation
public boolean recognizes(String txt) { Bag<Integer> ps = new Bag<Integer>(); // possible states after ε-transitions DirectedDFS dfs = new DirectedDFS(G, 0); // G is ε-transitions diagraph for (int v = 0; v < G.V(); v++) //states reachable from start by ε-transitions if (dfs.marked(v)) pc.add(v); for (int i = 0; i < txt.length(); i++) { Bag<Integer> match = new Bag<Integer>(); //possible states after match transition for (int v : pc) { if (v == M) continue; if ((re[v] == txt.charAt(i)) || re[v] == '.') // match transition match.add(v+1); } dfs = new DirectedDFS(G, match); // DFS search from set of vertices ps = new Bag<Integer>(); for (int v = 0; v < G.V(); v++) //follow ε-transitions if (dfs.marked(v)) pc.add(v); } for (int v : pc) if (v == M) return true; return false; }
-- Determining whether an N-character text is recognized by the NFA corresponding to an M-character pattern takes time proportional to MN in the worst case.
Pf. For each of the N text characters, we iterate through a set of states of size no more than M and run DFS on the graph of ε-transitions. [The NFA construction we will consider ensures the number of edges ≤ 3M.]
8. Building an NFA corresponding to an RE
-- States: Include a state for each symbol in the RE, plus an accept state.
-- Concatenation: Add match-transition edge from state corresponding to characters in the alphabet to next state.
-- Parentheses: Add ε-transition edge from parentheses to next state.
-- Closure: Add three ε-transition edges for each * operator.
-- Or: Add two ε-transition edges for each | operator.
-- Implementation
private Digraph buildEpsilonTransitionDigraph() { Digraph G = new Digraph(M+1); Stack<Integer> ops = new Stack<Integer>(); for (int i = 0; i < M; i++) { int lp = i; if (re[i] == '(' || re[i] == '|') ops.push(i); //left parentheses and |,push onto stack else if (re[i] == ')') { int or = ops.pop(); if (re[or] == '|') { lp = ops.pop(); G.addEdge(lp, or+1); G.addEdge(or, i); } else lp = or; } if (i < M-1 && re[i+1] == '*') { //closure,needs 1-character lookahead G.addEdge(lp, i+1); G.addEdge(i+1, lp); } if (re[i] == '(' || re[i] == '*' || re[i] == ')') G.addEdge(i, i+1); } return G; }
-- Building the NFA corresponding to an M-character RE takes time and space proportional to M.
Pf. For each of the M characters in the RE, we add at most three ε-transitions and execute at most two stack operations.
9. Generalized regular expression print
-- Take a RE as a command-line argument and print the lines from standard input having some substring that is matched by the RE.
-- Implementation:
public class GREP { public static void main(String[] args) { String re = "(.*" + args[0] + ".*)"; NFA nfa = new NFA(re); while (StdIn.hasNextLine()) { String line = StdIn.readLine(); if (nfa.recognizes(line)) StdOut.println(line); } } }
-- Worst-case for grep (proportional to M N ) is the same as for brute-force substring search.
10. Regular expressions in Java
-- String.matches(String regex)
//basic RE matching public class Validate { public static void main(String[] args) { String regexp = args[0]; String input = args[1]; StdOut.println(input.matches(re)); } }
-- java.util.regexp.Pattern and java.util.regexp.Matcher classes:
//Print all substrings of input that match a RE. public class Harvester { public static void main(String[] args) { String regexp = args[0]; In in = new In(args[1]); String input = in.readAll(); Pattern pattern = Pattern.compile(regexp);//compile() creates a Pattern (NFA) from RE Matcher matcher = pattern.matcher(input);//matcher() creates a Matcher (NFA simulator) from NFA and text while (matcher.find()) // find() looks for the next match { //group() returns the substring most recently found by find() StdOut.println(matcher.group()); } } }
相关推荐
书名:Mastering Regular Expressions, 3rd Edition 格式:CHM 语言:English 简介: Regular expressions are an extremely powerful tool for manipulating text and data. They are now standard ...
regular expressions.
Introducing Regular Expressions pdf 高清 有目錄
Mastering Regular Expressions 3rd
Mastering Regular Expressions(3rd) 英文无水印pdf 第3版 pdf所有页面使用FoxitReader和PDF-XChangeViewer测试都可以打开 本资源转载自网络,如有侵权,请联系上传者或csdn删除 本资源转载自网络,如有侵权,...
Mastering Regular Expressions, 2nd Edition
JavaScript Regular Expressions 英文无水印pdf pdf所有页面使用FoxitReader和PDF-XChangeViewer测试都可以打开 本资源转载自网络,如有侵权,请联系上传者或csdn删除 本资源转载自网络,如有侵权,请联系上传...
Mastering Regular Expressions, Third Edition, now includes a full chapter devoted to PHP and its powerful and expressive suite of regular expression functions, in addition to enhanced ...
Regular expressions are used to describe patterns in text, and they are an invaluable aid when working with loosely formatted textual data. This little booklet describes Oracle's regular expression ...
Mastering Regular Expressions.pdf
Mastering Python Regular Expressions 英文无水印pdf pdf所有页面使用FoxitReader和PDF-XChangeViewer测试都可以打开 本资源转载自网络,如有侵权,请联系上传者或csdn删除 本资源转载自网络,如有侵权,请...
Java 9 Regular Expressions_Code 源码 本资源转载自网络,如有侵权,请联系上传者或csdn删除 本资源转载自网络,如有侵权,请联系上传者或csdn删除
Delphi 比较第三方正则组件 PerlRegEx 和 官方的 RegularExpressions 速度测试
To introduce readers to regular expressions in several technologies. While the material is primarily for people who have little or no experience with regular expressions, there is also some content ...
Mastering Regular Expressions 3e 掌握正则表达式 英文版,我个人把水印去掉了,非常清晰
SANS AUD 507,SANS AUD 507,SANS AUD 507,SANS AUD 507,
Regular expressions are patterns or templates that allow you to define a set of rules in a natural yet vague way, giving you the ability to match and validate text. Therefore, they have been ...