Regular expression made by To Minh Hoang - Portal team

Post on 24-Jan-2015

This is a presentation from eXo Platform SEA.

Regular Expressions

Minh Hoang TO

Portal Team

Agenda

» Finite State Machine

» Pattern Parser

» Java Regex

» Parsers in GateIn

» Advanced Theory

Finite State Machine

State Diagram

JIRA Issue Lifecycle

Java Thread Lifecycle

Java Compilation Flow

Finite State Machine - FSM

» A behavioral model that describes the workflow of a system

Finite State Machine - FSM

» Directed graph with labeled edges

Pattern Parser

Classic Problem

» A – a finite set of characters

Ex:

A = {a, b, c, d,..., z} or A = { a, b, c,..., z, public, class, extends, implements, while, if,...}

» Pattern P and input sequence INPUT made of A's elements

Ex:

P = “a.*b” or P = “class.*extends.*”

INPUT = “aaabbbcc” or INPUT = a Java source file

→ Parser reads character-by-character INPUT and recognizes all subsequences matching pattern P

Classic Problem - Samples

» Split a sequence of characters into an array of subsequences

String path = "/portal/en/classic/home";
String[] segments = path.split("/");

» Handle comment block encountered in a file

» Override readLine() in BufferedReader

» Extract data from REST response

» Write an XML parser from scratch

Finite State Machine & Classic Problem

» Acceptor FSM?

» How can the Classic Problem be transformed into a graph traversal problem with a well-known generic solution?

Find pattern occurrences ↔ Traversing directed graph with labeled edges

FSM – Word Accepting

» Consider a word W – sequence of characters from character set A

W = “abcd...xyz”

An FSM whose graph edges are labeled with characters from A accepts W if there exists a path connecting the START node to one of the END nodes

START = S1 → S2 → … → Sn = END

1. Intermediate nodes may be visited more than once

2. The transition S_i → S_(i+1) is determined (labeled) by the i-th character of W

Acceptor FSM

» Given a pattern P, an FSM is called an Acceptor FSM of P if it accepts exactly the words matching P.

Ex:

The Acceptor FSM of “a[0-9]b” accepts every element of the word set

{ “a0b”, “a1b”, “a2b”, “a3b”, “a4b”, “a5b”, “a6b”, “a7b”, “a8b”, “a9b”}
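
A minimal sketch of such an acceptor, written by hand rather than taken from the slides. The class name, the state names and the `step` helper are assumptions made for illustration; each `case` plays the role of one labeled edge of the graph.

```java
class DigitAcceptor {
    // States of a hand-written acceptor for a[0-9]b (illustrative names).
    enum State { START, SEEN_A, SEEN_DIGIT, END, FAILURE }

    // One transition: current state + next character -> next state.
    static State step(State s, char c) {
        switch (s) {
            case START:      return c == 'a' ? State.SEEN_A : State.FAILURE;
            case SEEN_A:     return (c >= '0' && c <= '9') ? State.SEEN_DIGIT : State.FAILURE;
            case SEEN_DIGIT: return c == 'b' ? State.END : State.FAILURE;
            default:         return State.FAILURE; // END and FAILURE have no outgoing edges
        }
    }

    // A word is accepted if reading all of it leads from START to END.
    static boolean accepts(String word) {
        State s = State.START;
        for (int i = 0; i < word.length(); i++) {
            s = step(s, word.charAt(i));
        }
        return s == State.END;
    }

    public static void main(String[] args) {
        System.out.println(accepts("a7b"));  // true
        System.out.println(accepts("axb"));  // false
        System.out.println(accepts("a7bb")); // false
    }
}
```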

How Does a Pattern Parser Work?

Traversing directed graph associated with Acceptor FSM

1. Start from the root node

2. Read the next character from INPUT, then move according to the transition rules

3. Repeat the second step until a leaf node is reached or INPUT is exhausted

4. Return OK if the leaf node represents a successful match.

Example One

» Recognize pattern

eXo.*er

in:

AAAeXo123erBBBeXoerCCCeXoeXoerDDD

Example One

» Acceptor FSM with 8 states:

START – Start reading input sequence

e – encounter e

eX – encounter eX

eXo – encounter eXo

eXo.* – encounter eXo.*

eXo.*e – encounter eXo.*e

END – subsequence matching eXo.*er found

FAILURE
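
The state list above can be sketched as a hand-written scanner. This is an illustration under assumptions, not the diagram from the slides: the class and state names are made up, the eXo and eXo.* states are merged into one `BODY` state, and END/FAILURE are folded into "record the match" and "go back to START". Unlike a greedy `.*`, this sketch reports the shortest match each time.

```java
import java.util.ArrayList;
import java.util.List;

class ExoScanner {
    // START: nothing seen; E: "e"; EX: "eX"; BODY: "eXo" plus any characters;
    // BODY_E: "eXo", any characters, then "e".
    enum State { START, E, EX, BODY, BODY_E }

    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        State state = State.START;
        int start = -1; // index of the 'e' that may begin a match
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                    if (c == 'e') { state = State.E; start = i; }
                    break;
                case E:
                    if (c == 'X') state = State.EX;
                    else if (c == 'e') start = i;          // restart on the newer 'e'
                    else state = State.START;
                    break;
                case EX:
                    if (c == 'o') state = State.BODY;
                    else if (c == 'e') { state = State.E; start = i; }
                    else state = State.START;
                    break;
                case BODY:
                    if (c == 'e') state = State.BODY_E;
                    break;
                case BODY_E:
                    if (c == 'r') {                        // END reached: record the match
                        matches.add(input.substring(start, i + 1));
                        state = State.START;
                    } else if (c != 'e') {
                        state = State.BODY;
                    }
                    break;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [eXo123er, eXoer, eXoeXoer]
        System.out.println(findAll("AAAeXo123erBBBeXoerCCCeXoeXoerDDD"));
    }
}
```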

Example Two

» Recognize comment block

/* */

in:

/* Don't ask */
final int innerClassVariable;

21

Example Two

» Acceptor FSM with 5 states:

START – start reading input sequence

OUT – stay away from comment blocks

ENTERING – at the beginning of comment block

IN – stay inside a comment block

LEAVING – at the end of comment block
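
A minimal sketch of these states as code, assuming the goal is to drop block comments and keep everything else. The class name and the `strip` method are hypothetical; START is merged into OUT, and string literals and `//` line comments are deliberately not handled.

```java
class CommentStripper {
    enum State { OUT, ENTERING, IN, LEAVING }

    // Returns the source text with /* ... */ blocks removed.
    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        State state = State.OUT;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (state) {
                case OUT:
                    if (c == '/') state = State.ENTERING;   // might open a comment
                    else out.append(c);
                    break;
                case ENTERING:
                    if (c == '*') state = State.IN;         // "/*" confirmed
                    else if (c == '/') out.append('/');     // earlier '/' was ordinary text
                    else { out.append('/').append(c); state = State.OUT; }
                    break;
                case IN:
                    if (c == '*') state = State.LEAVING;    // might close the comment
                    break;
                case LEAVING:
                    if (c == '/') state = State.OUT;        // "*/" confirmed
                    else if (c != '*') state = State.IN;
                    break;
            }
        }
        if (state == State.ENTERING) out.append('/');       // trailing lone '/'
        return out.toString();
    }

    public static void main(String[] args) {
        // prints final int innerClassVariable;
        System.out.println(strip("/* Don't ask */final int innerClassVariable;"));
    }
}
```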

Finite State Machine With Stack

» Example Two is slightly harder than Example One because the transition decision depends on past information → we must keep something in memory

» FSM with Stack = Ordinary FSM + stack structure storing past information

Each contextual transition is determined by (next input character, stack state)
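
As an illustration of a transition that depends on (next input character, stack state), here is a small sketch that accepts words whose brackets are properly nested. The bracket example and the class name are assumptions chosen for brevity; they do not come from the slides.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BracketAcceptor {
    // Accepts the input if every (, [ and { is closed in the right order.
    static boolean accepts(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(' || c == '[' || c == '{') {
                stack.push(c);                       // remember the open bracket
            } else if (c == ')' || c == ']' || c == '}') {
                if (stack.isEmpty()) return false;   // nothing to close
                char open = stack.pop();
                boolean pair = (open == '(' && c == ')')
                        || (open == '[' && c == ']')
                        || (open == '{' && c == '}');
                if (!pair) return false;             // the stack top decides the move
            }
            // any other character leaves the stack unchanged
        }
        return stack.isEmpty();                      // all brackets closed
    }

    public static void main(String[] args) {
        System.out.println(accepts("f(a[0], {b})")); // true
        System.out.println(accepts("f(a[0)]"));      // false
    }
}
```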

Java Regex

Model

» Pattern: Acceptor Finite State Machine

» Matcher: Parser

java.util.regex.Pattern

» Construct FSM accepting pattern

Pattern p = Pattern.compile("a.*b");

FSM states are instances of java.util.regex.Pattern$Node

» Generate parser working on input sequence

Matcher matcher = p.matcher("aaabbbb");

java.util.regex.Matcher

» Find next subsequence matching pattern

find()

» Get capturing groups from latest match

group()
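
A short sketch of the usual find()/group() loop. The helper name `findAll` is hypothetical; the reluctant variant `eXo.*?er` is added only to contrast it with the greedy `.*` used in Example One.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Collects every subsequence of input that matches regex.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {        // advance to the next match
            out.add(m.group());   // group() is the whole match, same as group(0)
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "AAAeXo123erBBBeXoerCCCeXoeXoerDDD";
        // Greedy .* runs to the last "er": prints [eXo123erBBBeXoerCCCeXoeXoer]
        System.out.println(findAll("eXo.*er", input));
        // Reluctant .*? stops at the first "er": prints [eXo123er, eXoer, eXoeXoer]
        System.out.println(findAll("eXo.*?er", input));
    }
}
```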

Capturing Group

Two Pattern objects

Pattern p = Pattern.compile("abcd.*efgh");
Pattern q = Pattern.compile("abcd(.*)efgh");
String text = "abcd12345efgh";
Matcher pM = p.matcher(text);
Matcher qM = q.matcher(text);

» pM.find() == qM.find();

» pM.group(1) throws IndexOutOfBoundsException (p has no group 1), while qM.group(1) returns "12345"
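
The difference between the two patterns can be checked with a small sketch. The helper `firstGroup` is a made-up name; it returns group 1 of the first match and lets the exception propagate when the pattern defines no such group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns group(1) of the first match, or null when nothing matches.
    // Throws IndexOutOfBoundsException if the pattern has no group 1.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null;
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        String text = "abcd12345efgh";
        System.out.println(firstGroup("abcd(.*)efgh", text)); // prints 12345
        try {
            firstGroup("abcd.*efgh", text);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("no group 1 in a pattern without parentheses");
        }
    }
}
```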

Capturing Group

» Hold additional information on each match

while (matcher.find()) {
    matcher.group(index);
}

» Pattern P = (A)(B(C))

matcher.group(0) = the whole sequence ABC
matcher.group(1) = A
matcher.group(2) = BC
matcher.group(3) = C
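
Groups are numbered by counting opening parentheses from left to right. A quick sketch to confirm the numbering for `(A)(B(C))`; the class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroups {
    // Returns group(0)..group(groupCount()) when the whole text matches.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [ABC, A, BC, C]
        System.out.println(Arrays.toString(groups("(A)(B(C))", "ABC")));
    }
}
```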

Capturing Group

» Pattern.compile("abc(defgh");
Pattern.compile("abcdef)gh");

→ PatternSyntaxException

» Pattern.compile("abc\\(defgh");
Pattern.compile("abcdef\\)gh");

→ Success thanks to escape character '\'
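
A small sketch that checks the four patterns above. The `compiles` helper is a hypothetical name; it only reports whether `Pattern.compile` throws `PatternSyntaxException`.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxDemo {
    // True if the regex compiles, false if it is syntactically invalid.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("abc(defgh"));    // false: unclosed group
        System.out.println(compiles("abcdef)gh"));    // false: unmatched ')'
        System.out.println(compiles("abc\\(defgh"));  // true: '(' is a literal
        System.out.println(compiles("abcdef\\)gh"));  // true: ')' is a literal
        // The escaped pattern matches the text that contains the parenthesis.
        System.out.println(Pattern.compile("abc\\(defgh").matcher("abc(defgh").matches()); // true
    }
}
```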

Operators

» Union

[a-zA-Z-0-9]

» Negation

[^abc]

[^X]

Contextual Match

» X(?=Y)

Once X is matched, look ahead and expect to find Y

» X(?!Y)

Once X is matched, look ahead and expect not to find Y

» X(?<=Y)

Once X is matched, look behind and expect to find Y

» X(?<!Y)

Once X is matched, look behind and expect not to find Y
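
A sketch of the four lookaround forms on toy inputs. The inputs and the `firstIndex` helper are assumptions; the two lookbehind examples place the assertion before the token, which is the more common way to write them, instead of after X as on the slide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Index of the first match of regex in input, or -1 if there is none.
    static int firstIndex(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Positive lookahead: "foo" only when followed by "bar".
        System.out.println(firstIndex("foo(?=bar)", "foobaz foobar")); // 7
        // Negative lookahead: "foo" only when NOT followed by "bar".
        System.out.println(firstIndex("foo(?!bar)", "foobar foobaz")); // 7
        // Positive lookbehind: "foo" only when preceded by "x".
        System.out.println(firstIndex("(?<=x)foo", "yfoo xfoo"));      // 6
        // Negative lookbehind: "foo" only when NOT preceded by "x".
        System.out.println(firstIndex("(?<!x)foo", "xfoo yfoo"));      // 6
    }
}
```

Lookarounds are zero-width: in each case the matched text is just "foo", and the context that was checked is not part of group().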

Tips

» Pattern is immutable and thread-safe (Matcher is not) → compile once and reuse

We often see:

static final Pattern p = Pattern.compile("a*b");

» Be careful with String.split

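String.split treats its argument as a regex and drops trailing empty strings; in hot code paths, compare it with a plain Java loop + String.charAt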
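
A sketch contrasting the two approaches. The class, the `SLASH` constant and the `splitWithLoop` helper are illustrative names; no performance figures are implied, only the difference in results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once and reused, as suggested in the first tip.
    private static final Pattern SLASH = Pattern.compile("/");

    static String[] splitWithRegex(String path) {
        return SLASH.split(path); // same rules as path.split("/")
    }

    // Hand-written alternative: keeps every segment, including empty ones.
    static List<String> splitWithLoop(String path, char sep) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == sep) {
                parts.add(path.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(path.substring(start)); // last segment
        return parts;
    }

    public static void main(String[] args) {
        // Leading empty string is kept: [, portal, en, classic, home]
        System.out.println(Arrays.toString(splitWithRegex("/portal/en/classic/home")));
        // Trailing empty string is dropped by the regex split: [, a, b]
        System.out.println(Arrays.toString(splitWithRegex("/a/b/")));
        // The loop keeps it: [, a, b, ]
        System.out.println(splitWithLoop("/a/b/", '/'));
    }
}
```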

Parsers in GateIn

Parsers in GateIn

» JavaScript Compressor

» CSS Compressor

» Groovy Template Optimizer

» Navigation Controller

Extracting URL param = Regex matching + Backtracking algorithm

» StaxNavigator (Nice XML parser based on StAX)

Advanced Theory

Grammar & Language

» Any word matching pattern eXo.*er is produced by a combination of transforms (production rules), starting from S

S → eXoQer
Q → RQT
Q → ''
R → {a,b,c,d,...}
T → {a,b,c,d,...}

» Language of a Grammar = the set of words generated by a finite combination of transforms, starting from S

Ex: Any valid Java source file is generated by a finite number of transforms mentioned in Java Grammar (JLS)

Finite State Machine & Language

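» A language accepted by an FSM with Stack is generated by some context-free grammar

Explicit steps to build such a context-free grammar come from the standard equivalence proof between FSMs with a stack (pushdown automata) and context-free grammars

» Conversely, the language of any context-free grammar is accepted by some FSM with Stack

Explicit steps to build such a Finite State Machine come from the same equivalence proof (Kleene's theorem is the analogous result relating regular expressions and ordinary FSMs)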