Claus Brabrand (ITU Copenhagen) & Jakob G. Thomsen (Aarhus University), PPDP 2010
Typed and Unambiguous Pattern Matching on Strings
using Regular Expressions
[http://xkcd.com/208/]
Introduction & Motivation
Parsing dynamic input is a ubiquitous problem
URLs:
http://www.cs.au.dk/index.php?id=141&view=details
(protocol, host, path, query-string: a list of key-value pairs)
Log files:
13/02/2010 66.249.65.107 get /support.html
20/02/2010 42.116.32.64 post /search.html
The solution is pattern matching
Motivating example
Example:
<day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} >
Matching against string:
"26/06/1992"
yields:
day = 26
month = 06
year = 1992
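The date example can be approximated in plain Java with named capturing groups. This is only a sketch of the idea, not the authors' tool: the class name `DateRecordings` and its `match` method are hypothetical, and `java.util.regex` gives none of the static guarantees discussed in this talk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the authors' tool): emulates the three
// recordings <day>, <month>, <year> with java.util.regex named groups.
final class DateRecordings {
    private static final Pattern DATE =
        Pattern.compile("(?<day>[0-9]{2})/(?<month>[0-9]{2})/(?<year>[0-9]{4})");

    // Returns the recorded substrings, or null if the whole input does not match.
    static Map<String, String> match(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> recordings = new LinkedHashMap<>();
        recordings.put("day", m.group("day"));
        recordings.put("month", m.group("month"));
        recordings.put("year", m.group("year"));
        return recordings;
    }

    public static void main(String[] args) {
        System.out.println(match("26/06/1992")); // {day=26, month=06, year=1992}
    }
}
```

Every recording comes back as a `String` here, which is the situation the type mapping later in the talk is meant to improve on.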
Our setup
url.rex: <URL = [a-z]*>; ...
Compile (our tool): url.rex -> URL.java
Foo.java: import URL; class Foo { ... }
Compile (javac): URL.java, Foo.java, ... -> URL.class, Foo.class, ...
Outline
The Chomsky Hierarchy (1956)
Regular Expressions:
The Recording Construction
Ambiguity:
Disambiguation
Type Mapping
Conclusion
The Chomsky Hierarchy (1956)
Oldie but goodie
Language classes (+formalisms):
Not widely used.
No static guarantees. Example: java.net.URL has had 88 bugs spanning a decade, and its source code still contains a //fixme.
Conceptually harder than regular expressions (regular expressions plus recursion).
Simple and declarative, with decidable properties (containment, ambiguity, etc.).
Type-3 regular expressions are "enough" for: URLs, log files, ...
"Trade" (excess) expressivity for: declarativity, simplicity, and static safety!
Regular Expressions
Syntax:
Semantics:
where:
$L_1 L_2$ is concatenation (i.e., $\{ \omega_1 \omega_2 \mid \omega_1 \in L_1, \omega_2 \in L_2 \}$)
$L^* = \bigcup_{i \geq 0} L^i$, where $L^0 = \{ \varepsilon \}$ and $L^i = L L^{i-1}$
Usual extensions:
Any character "." as c1|c2|...|cn, for all $c_i \in \Sigma$
Character ranges "[a-z]" as a|b|...|z
Repetitions "R{2,3}" as RR|RRR
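The claim that the extensions are only shorthand can be spot-checked by comparing a sugared and a desugared expression on every string up to a small length. The sketch below uses `java.util.regex`. The helper name `agreeUpTo` is hypothetical, and a bounded check like this is evidence rather than a proof of equivalence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: bounded comparison of two regular expressions.
final class DesugarCheck {
    // True if both expressions accept exactly the same strings among all
    // strings over `alphabet` of length 0..maxLen.
    static boolean agreeUpTo(String sugared, String desugared, String alphabet, int maxLen) {
        Pattern p = Pattern.compile(sugared);
        Pattern q = Pattern.compile(desugared);
        List<String> frontier = new ArrayList<>();
        frontier.add(""); // all strings of the current length
        for (int len = 0; len <= maxLen; len++) {
            List<String> next = new ArrayList<>();
            for (String w : frontier) {
                if (p.matcher(w).matches() != q.matcher(w).matches()) {
                    return false;
                }
                for (char c : alphabet.toCharArray()) {
                    next.add(w + c);
                }
            }
            frontier = next;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeUpTo("a{2,3}", "aa|aaa", "ab", 4)); // true
        System.out.println(agreeUpTo("[a-c]", "a|b|c", "abcd", 2)); // true
    }
}
```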
Recording
Syntax:
"x" is a recording identifier (it "remembers" the substring it matches)
Semantics:
Example (simplified emails):
<user = [a-z]+ > "@" <domain = [a-z]+ ("." [a-z]+)* >
Matching against string:
"obama@whitehouse.gov"
yields:
user = "obama" & domain = "whitehouse.gov"
Related: "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP
Recording (lists)
Another example (yielding lists):
<name = [a-z]+ > " & " <name = [a-z]+ >
Matching against string:
"obama & bush"
yields a list structure:
name = [obama,bush]
Further list-yielding patterns:
( <name = [a-z]+ > "\n" )*
<name = [a-z]+ > (" & " <name = [a-z]+ > )*
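For comparison, `java.util.regex` keeps only the last capture of a group that sits under a repetition, so a list-valued recording has to be assembled by hand. The sketch below does this for the pattern `<name = [a-z]+ > (" & " <name = [a-z]+ > )*`. The class `NameList` is hypothetical and is not what the authors' tool generates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: emulates a list-valued recording in plain Java.
final class NameList {
    // Shape of the whole input: name (" & " name)*
    private static final Pattern WHOLE = Pattern.compile("[a-z]+( & [a-z]+)*");
    // A single recorded name.
    private static final Pattern NAME = Pattern.compile("[a-z]+");

    // Returns all recorded names in order, or null if the input does not match.
    static List<String> match(String input) {
        if (!WHOLE.matcher(input).matches()) {
            return null;
        }
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(input);
        while (m.find()) { // second pass: collect every name occurrence
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(match("obama & bush")); // [obama, bush]
    }
}
```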
Recording (structured)
Yet another example:
<person = <name = [a-z]+ > ", " <age = [0-9]+ >>
Matching against string:
"obama, 48"
yields:
person = obama, 48
person.name = obama
person.age = 48
Ambiguity
Some regular expressions are ambiguous:
<day = [0-9]{1,2} > <month = [0-9]{1,2} >
matched on the string "101" gives rise to:
day = 1 and month = 01 (i.e., 1st of January)
day = 10 and month = 1 (i.e., 10th of January)
Multiple ways of matching => ambiguous
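The two readings of "101" can be reproduced by brute force: try every cut point and keep those where both halves match `[0-9]{1,2}`. This is a sketch for this one expression only, with the hypothetical class name `DateSplits`. It is not the static ambiguity analysis presented in the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: enumerates all ways <day = [0-9]{1,2}> <month = [0-9]{1,2}>
// can match a given string.
final class DateSplits {
    private static final Pattern PART = Pattern.compile("[0-9]{1,2}");

    static List<String> parses(String input) {
        List<String> result = new ArrayList<>();
        for (int cut = 0; cut <= input.length(); cut++) {
            String day = input.substring(0, cut);
            String month = input.substring(cut);
            if (PART.matcher(day).matches() && PART.matcher(month).matches()) {
                result.add("day = " + day + ", month = " + month);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two entries for "101", so the expression is ambiguous.
        System.out.println(parses("101")); // [day = 1, month = 01, day = 10, month = 1]
    }
}
```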
Characterization of Ambiguity
Theorem: R unambiguous iff
NB: sound & complete!
Choice example: <foo = a > | <bar = a* >
For the string "a", 2 ways: foo = "a" or bar = "a"
Concatenation example: <foo = a* > <bar = a* >
For the string "a", 2 ways: foo = "a" or bar = "a"
Star example (recall R* = ε | RR*): <foo = a|aa >*
For the string "aa", 2 ways: foo = [a,a] or foo = [aa]
Related work: [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFAs, not directly (syntax-directed).
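To make the star example concrete, the number of distinct matches of `<foo = a|aa >*` on a string of a's can be counted with a small dynamic program: each match is a way to cut the string into pieces "a" and "aa". The class `StarParses` is a hypothetical illustration of the ambiguity, not the characterization used by the authors.

```java
// Hypothetical helper: counts the distinct ways <foo = a|aa>* matches a string.
final class StarParses {
    static long count(String input) {
        if (!input.matches("a*")) {
            return 0; // not in the language at all
        }
        int n = input.length();
        long[] ways = new long[n + 1];
        ways[0] = 1; // the empty string: zero iterations of the star
        for (int i = 1; i <= n; i++) {
            ways[i] = ways[i - 1];       // last piece is "a"
            if (i >= 2) {
                ways[i] += ways[i - 2];  // last piece is "aa"
            }
        }
        return ways[n];
    }

    public static void main(String[] args) {
        System.out.println(count("aa")); // 2, i.e. foo = [a,a] or foo = [aa]
    }
}
```

Any count above 1 witnesses ambiguity, and the count grows with the length of the input.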
What to do about it?
1) Manual rewriting: Always possible :-) Tedious :-( Error-prone :-( Not structure-preserving :-(
<foo = a > | <bar = a* > is rewritten to <foo = a > | <bar = ε|aaa* >
2) Restriction: $R_1 - R_2$
And then encode: $R^C$ as $\Sigma^* - R$, and $R_1 \,\&\, R_2$ as $(R_1^C \mid R_2^C)^C$
<foo = a > | <bar = a* > using restriction becomes <foo = a > | <bar = a*-a >
3) Disambiguators: three basic operators: choice: '|L', '|R'; concat: 'L', 'R'; star: '*L', '*R'
<foo = a > | <bar = a* > using disambiguators becomes <foo = a > |L <bar = a* >
4) Default disambiguation: concat, choice, and star are all left-biased (by default)! (Our tool does this)
<foo = a > | <bar = a* > needs no rewriting
Related work: [Vansummeren'06], but with global, not local disambiguation
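Left-biased choice and restriction both have rough analogues in `java.util.regex`, which tries alternatives in order and supports negative lookahead. The sketch below only illustrates the behaviour on the running example. The class `BiasedChoice` and its constants are hypothetical, and Java's backtracking semantics is assumed here as a stand-in for the tool's disambiguation, not a description of it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: shows which alternative "wins" for a given input.
final class BiasedChoice {
    // <foo = a> | <bar = a*> with the alternatives tried left to right.
    static final String LEFT_FIRST = "(?<foo>a)|(?<bar>a*)";
    // Same choice with the alternatives swapped (a right-biased reading).
    static final String RIGHT_FIRST = "(?<bar>a*)|(?<foo>a)";
    // Restriction a* - a emulated by a lookahead that rejects exactly one "a".
    static final String RESTRICTED_RIGHT_FIRST = "(?<bar>(?!a$)a*)|(?<foo>a)";

    // Returns "foo" or "bar" depending on which recording matched,
    // or null if the input does not match at all.
    static String winner(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return m.group("foo") != null ? "foo" : "bar";
    }

    public static void main(String[] args) {
        System.out.println(winner(LEFT_FIRST, "a"));             // foo
        System.out.println(winner(RIGHT_FIRST, "a"));            // bar
        System.out.println(winner(RESTRICTED_RIGHT_FIRST, "a")); // foo
    }
}
```

With the restriction in place the order of the alternatives no longer matters for "a", because only one of them can match it.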
Type Mapping
Our date example:
<date = <day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} >>
Type of the recordings date, day, month, and year?
Strings (=> many type casts)
Infer the type
Type Mapping
A recording has three type components:
a linguistic type (language of the recording - maps to String, int, float, etc).
a structural type (nested recordings – maps to (nested) classes).
a type modifier (maps to lists).
Related work: Exact type inference in XDuce & CDuce (soundness+completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java)
Type Mapping
Example:
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
compile (our tool)
class Person { // auto-generated
  String name;
  int age;
  static Person match(String s) { ... }
  public String toString() { ... }
}
Usage:
String s = "obama (48)";
Person p = Person.match(s);
print(p.name + " is " + p.age + "y old");
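The bodies of the generated methods are elided on the slide. As a rough idea of what such a class could do, here is a hand-written sketch built on `java.util.regex`. The name `PersonSketch`, the null return on mismatch, and the use of `Integer.parseInt` are all assumptions made for illustration. The code the tool actually generates may look quite different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written approximation of a class for
// Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
final class PersonSketch {
    final String name; // recording over [a-z]+ is kept as a String
    final int age;     // recording over [0-9]+ is mapped to int (assumes it fits)

    private static final Pattern PERSON =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    private PersonSketch(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Returns the parsed person, or null if the input does not match (assumed
    // failure convention; the slide does not say what the tool does).
    static PersonSketch match(String s) {
        Matcher m = PERSON.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new PersonSketch(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        PersonSketch p = match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old"); // obama is 48y old
    }
}
```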
Type Mapping
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( $Person "\n" )*
compile (our tool)
class People { // auto-generated
  String[] name;
  int[] age;
  static People match(String s) { ... }
  public String toString() { ... }
}
Usage:
String s = "obama (48)\nbush (63)\n";
People p = People.match(s);
println("Second name is " + p.name[1]);
Type Mapping
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( <person = $Person > "\n" )* ;
compile (our tool)
class People { // auto-generated
  Person[] person;
  class Person { // nested class
    String name;
    int age;
  }
  ...
}
Usage:
String s = "obama (48)\nbush (63)\n";
People people = People.match(s);
for (People.Person p : people.person) println(p.name);
Conclusion
Regular expressions are alive and well.
This paper:
Precise ambiguity analysis
Type mapping
Future work: improve performance, subtyping of recordings
"Trade (excess) expressivity for safety+simplicity"
Thank you. Questions?
Abstract Syntax Trees (ASTs)
Ambiguity
Definition:
$R$ ambiguous iff $\exists T, T' \in AST_R : T \neq T' \wedge \|T\| = \|T'\|$
where $\|\cdot\| : AST \to \Sigma^*$ is the flattening.
Type Inference:
R : (L,S)