`
leonzhx
  • 浏览: 768200 次
  • 性别: Icon_minigender_1
  • 来自: 上海
社区版块
存档分类
最新评论
阅读更多

1.  Pattern matching: Find one of a specified set of strings in text.

 

2.  Regular expression :  a notation to specify a set of strings.

    --  basic operations

    --  additional operations: can be expressed by basic operations. eg. [A-E]+ is shorthand for (A|B|C|D|E)(A|B|C|D|E)*


 

3.  Duality between REs and DFAs

    --  RE: Concise way to describe a set of strings.

    --  DFA: Machine to recognize whether a given string is in a given set.

    --  Kleene's theorem.

        -  For any DFA, there exists a RE that describes the same set of strings.

        -  For any RE, there exists a DFA that recognizes the same set of strings.


    --  Bad news : DFA may have exponential # of states.

 

4.  Regular-expression-matching NFA.

    --  RE enclosed in parentheses.

    --  One state per RE character (start = 0, accept = M).

    --  ε-transition: change state, but don't scan text.

    --  match transition: change state and scan to next text char.

    --  Accept if any sequence of transitions ends in accept state.(after scanning all text characters)


 

5.  How to determine whether a string is matched by an automaton?

    --  DFA: Deterministic ⇒ easy because exactly one applicable transition.

    --  NFA: Nondeterministic ⇒ can be several applicable transitions; need to select the right one!

 

6.  NFA representation

    --  State names: Integers from 0 to M(number of symbols in RE).

    --  Match-transitions: Keep regular expression in array re[]. Use (re[position] == input character) to verify the match transition.

    --  ε-transitions. Store in a digraph G.

 

7.  NFA simulation

    --  Maintain set of all possible states that NFA could be in after reading in the first i text characters.

    --  perform reachability in ε-transitions diagraph : Run DFS from each source (possible states after match transition), without unmarking vertices. (Runs in time proportional to E + V)

    --  Implementation

public boolean recognizes(String txt)
{
    Bag<Integer> ps = new Bag<Integer>(); // possible states after ε-transitions 
    DirectedDFS dfs = new DirectedDFS(G, 0); // G is ε-transitions diagraph
    
    for (int v = 0; v < G.V(); v++) //states reachable from start by ε-transitions
        if (dfs.marked(v)) pc.add(v);

    for (int i = 0; i < txt.length(); i++)
    {
        Bag<Integer> match = new Bag<Integer>(); //possible states after match transition
        for (int v : pc)
        {
            if (v == M) continue;
            if ((re[v] == txt.charAt(i)) || re[v] == '.') // match transition
                match.add(v+1);
        }
        dfs = new DirectedDFS(G, match); // DFS search from set of vertices
        ps = new Bag<Integer>();
        for (int v = 0; v < G.V(); v++) //follow ε-transitions
            if (dfs.marked(v)) pc.add(v);
    }
    for (int v : pc)
        if (v == M) return true;
    return false;
}

    --  Determining whether an N-character text is recognized by the NFA corresponding to an M-character pattern takes time proportional to MN in the worst case.

        Pf. For each of the N text characters, we iterate through a set of states of size no more than M and run DFS on the graph of ε-transitions. [The NFA construction we will consider ensures the number of edges ≤ 3M.]

 

8.   Building an NFA corresponding to an RE

    --  States: Include a state for each symbol in the RE, plus an accept state.

    --  Concatenation: Add match-transition edge from state corresponding to characters in the alphabet to next state.

    --  Parentheses: Add ε-transition edge from parentheses to next state.

    --  Closure: Add three ε-transition edges for each * operator.


     --  Or:  Add two ε-transition edges for each | operator.

    --  Implementation

private Digraph buildEpsilonTransitionDigraph() {
    Digraph G = new Digraph(M+1);
    Stack<Integer> ops = new Stack<Integer>();
    for (int i = 0; i < M; i++) {
        int lp = i;
        if (re[i] == '(' || re[i] == '|') ops.push(i); //left parentheses and |,push onto stack
        else if (re[i] == ')') {
            int or = ops.pop();
            if (re[or] == '|') {
                lp = ops.pop();
                G.addEdge(lp, or+1);
                G.addEdge(or, i);
            }
            else lp = or;
        }
        if (i < M-1 && re[i+1] == '*') { //closure,needs 1-character lookahead
            G.addEdge(lp, i+1);
            G.addEdge(i+1, lp);
        }
        if (re[i] == '(' || re[i] == '*' || re[i] == ')')
            G.addEdge(i, i+1);
    }
    return G;
}

     --  Building the NFA corresponding to an M-character RE takes time and space proportional to M.

        Pf. For each of the M characters in the RE, we add at most three ε-transitions and execute at most two stack operations.

 

9.  Generalized regular expression print

    --  Take a RE as a command-line argument and print the lines from standard input having some substring that is matched by the RE.

    --  Implementation:

public class GREP
{
    public static void main(String[] args)
    {
        String re = "(.*" + args[0] + ".*)";
        NFA nfa = new NFA(re);
        while (StdIn.hasNextLine())
        {
                String line = StdIn.readLine();
                if (nfa.recognizes(line))
                        StdOut.println(line);
                }
         }
}

    --  Worst-case for grep (proportional to M N ) is the same as for brute-force substring search.

 

10.  Regular expressions in Java

    --  String.matches(String regex)

//basic RE matching
public class Validate
{
    public static void main(String[] args)
    {
        String regexp = args[0];
        String input = args[1];
        StdOut.println(input.matches(re));
    }
}

    --  java.util.regexp.Pattern and java.util.regexp.Matcher classes:

//Print all substrings of input that match a RE.
public class Harvester
{
    public static void main(String[] args)
    {
        String regexp = args[0];
        In in = new In(args[1]);
        String input = in.readAll();
        Pattern pattern = Pattern.compile(regexp);//compile() creates a
Pattern (NFA) from RE
        Matcher matcher = pattern.matcher(input);//matcher() creates a Matcher (NFA simulator) from NFA and text
        while (matcher.find()) // find() looks for the next match
        {
            //group() returns the substring most recently found by find()
            StdOut.println(matcher.group());
        }
    }
}

 

  • 大小: 37.9 KB
  • 大小: 33.8 KB
  • 大小: 19 KB
  • 大小: 19.2 KB
  • 大小: 100.3 KB
  • 大小: 11.4 KB
  • 大小: 8.2 KB
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics