PROGRAMMING USING AUTOMATA AND TRANSDUCERS. Loris D’Antoni, Margus Veanes.

  • Slide 1
  • PROGRAMMING USING AUTOMATA AND TRANSDUCERS. Loris D'Antoni, Margus Veanes
  • Slide 2
  • 2
  • Slide 3
  • 3
  • Slide 4
  • 4
  • Slide 5
  • 5
  • Slide 6
  • All features of a general-purpose language vs. the features needed: replace, match, char
  • Slide 7
  • FOR EACH DOMAIN-SPECIFIC TASK: design a language that only has the features required by the task, so that it is simple to use, enables automated reasoning about what the programs do, and compiles into efficient code.
  • Slide 8
  • OUTLINE: (1) Automata, transducers, and programs; (2) BEK and string sanitizers; (3) BEX and string encoders; (4) FAST and tree-manipulating programs; (5) What's next?
  • Slide 9
  • AUTOMATA, TRANSDUCERS, AND PROGRAMS 9
  • Slide 10
  • FOR EACH DOMAIN-SPECIFIC TASK: design a language that only has the features required by the task, so that it is simple to use, enables automated reasoning about what the programs do, and compiles into efficient code.
  • Slide 11
  • DNA example in OCaml (finite alphabet; languages of strings; transformations from strings to strings):
        type base = A | T | C | G

        (* Languages of strings *)
        let rec all_TG (l : base list) : bool =
          match l with
          | [] -> true
          | h :: t -> (h = T || h = G) && all_TG t

        let rec all_AC (l : base list) : bool =
          match l with
          | [] -> true
          | h :: t -> (h = A || h = C) && all_AC t

        (* Transformations from strings to strings *)
        let rec map_base (l : base list) : base list =
          match l with
          | [] -> []
          | A :: t -> T :: map_base t
          | T :: t -> A :: map_base t
          | G :: t -> C :: map_base t
          | C :: t -> G :: map_base t

        let rec filter_AC (l : base list) : base list =
          match l with
          | [] -> []
          | A :: t -> A :: filter_AC t
          | T :: t -> filter_AC t
          | G :: t -> filter_AC t
          | C :: t -> C :: filter_AC t
    Automata view: all_TG is a one-state automaton (q0) looping on T and G; all_AC is a one-state automaton looping on A and C. Transducer view: map_base is a one-state transducer with transitions A/T, T/A, G/C, C/G; filter_AC is a one-state transducer with transitions A/A, T/ε, G/ε, C/C.
  • Slide 12
  • FINITE AUTOMATA. [Diagram: automaton with transitions labeled a and b.] Example inputs: abab → Yes; aba → No; bb → Yes; a → No
  • Slide 13
  • FINITE STATE TRANSDUCERS. [Diagram: transducer with transitions labeled a/aa, b/bb and a, and final output zz.] Example runs: ab → aabbzz; b → bbzz; aba → UNDEFINED
  • Slide 14
  • BENEFITS OF AUTOMATA AND TRANSDUCERS. Closure and decidability for automata: closed under intersection, union, and complement; emptiness is decidable; equivalence is decidable; automata can be minimized.
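    The closure and emptiness properties listed here can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used in the talk: the dictionary-based DFA layout and the helper names (symbols_in, intersect, complement, is_empty) are assumptions made for the example. It builds all_TG and all_AC from the DNA slide, intersects them with the product construction, and decides emptiness by reachability.

```python
from collections import deque

ALPHABET = ("A", "T", "C", "G")

def symbols_in(allowed):
    """Total DFA accepting the strings whose symbols all lie in `allowed`."""
    delta = {}
    for sym in ALPHABET:
        delta[("ok", sym)] = "ok" if sym in allowed else "dead"
        delta[("dead", sym)] = "dead"
    return {"start": "ok", "accepting": {"ok"}, "delta": delta}

def accepts(dfa, word):
    state = dfa["start"]
    for sym in word:
        state = dfa["delta"][(state, sym)]
    return state in dfa["accepting"]

def reachable(dfa):
    """States reachable from the start state (breadth-first search)."""
    seen = {dfa["start"]}
    queue = deque([dfa["start"]])
    while queue:
        state = queue.popleft()
        for sym in ALPHABET:
            nxt = dfa["delta"][(state, sym)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def intersect(a, b):
    """Product construction, exploring only reachable state pairs."""
    start = (a["start"], b["start"])
    delta, seen, queue = {}, {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        for sym in ALPHABET:
            nxt = (a["delta"][(p, sym)], b["delta"][(q, sym)])
            delta[((p, q), sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    accepting = {s for s in seen
                 if s[0] in a["accepting"] and s[1] in b["accepting"]}
    return {"start": start, "accepting": accepting, "delta": delta}

def complement(dfa):
    """Flip accepting and non-accepting states of a total DFA."""
    return {"start": dfa["start"],
            "accepting": reachable(dfa) - dfa["accepting"],
            "delta": dfa["delta"]}

def is_empty(dfa):
    """The language is empty iff no accepting state is reachable."""
    return not (reachable(dfa) & dfa["accepting"])

all_TG = symbols_in({"T", "G"})
all_AC = symbols_in({"A", "C"})
both = intersect(all_TG, all_AC)
```

    Here `both` accepts only the empty string, so it is not empty, whereas intersecting all_TG with its own complement yields an empty language.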
  • Slide 15
  • BENEFITS OF AUTOMATA AND TRANSDUCERS: transducer composition. let m_f_DNA l : base list = filter_AC (map_base l). Composing map_base (A/T, T/A, G/C, C/G) with filter_AC (A/A, T/ε, G/ε, C/C) gives the single one-state transducer m_f_DNA (A/ε, T/A, G/C, C/ε).
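    For one-state transducers such as map_base and filter_AC, composition can be sketched as composing per-symbol output tables. The table representation and the names run and compose are assumptions made for this example; general transducers need a product over states as well.

```python
EPS = ""  # empty output (the symbol is deleted)

# One-state transducers written as per-symbol output tables.
map_base = {"A": "T", "T": "A", "G": "C", "C": "G"}
filter_AC = {"A": "A", "T": EPS, "G": EPS, "C": "C"}

def run(table, word):
    """Apply a one-state transducer to a word."""
    return "".join(table[sym] for sym in word)

def compose(first, second):
    """Table of the transducer that runs `first`, then `second`."""
    return {sym: run(second, out) for sym, out in first.items()}

# filter_AC (map_base l) as a single transducer
m_f_DNA = compose(map_base, filter_AC)
```

    The composed table reads the input once, with no intermediate string between the two steps.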
  • Slide 16
  • BENEFITS OF AUTOMATA AND TRANSDUCERS: type-checking. Goal: every input in all_TG is mapped by map_base to an output in all_AC. map_base ∘ (¬all_AC) is map_base restricted so that it is only defined if the output is in ¬all_AC.
  • Slide 17
  • BENEFITS OF AUTOMATA AND TRANSDUCERS: type-checking. Goal: every input in all_TG is mapped by map_base to an output in all_AC. dom(map_base ∘ (¬all_AC)) is the set of inputs for which map_base does not produce an output in all_AC.
  • Slide 18
  • BENEFITS OF AUTOMATA AND TRANSDUCERS: type-checking. Every input in all_TG is mapped by map_base to an output in all_AC iff dom(map_base ∘ (¬all_AC)) ∩ all_TG = ∅.
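    A minimal sketch of this type-check, under a strong simplifying assumption: the transducer has one state and both languages have the form "all symbols come from a given set", so the domain-emptiness question reduces to a per-symbol check. The function names are hypothetical; a general implementation would compose the transducer with the complement automaton and test emptiness.

```python
map_base = {"A": "T", "T": "A", "G": "C", "C": "G"}
ALL_TG = {"T", "G"}   # symbols allowed by all_TG
ALL_AC = {"A", "C"}   # symbols allowed by all_AC

def offending_symbols(transducer, input_symbols, output_symbols):
    """Input symbols whose output leaves the output language."""
    return {s for s in input_symbols
            if any(o not in output_symbols for o in transducer[s])}

def type_checks(transducer, input_symbols, output_symbols):
    """True iff no allowed input can produce a disallowed output."""
    return not offending_symbols(transducer, input_symbols, output_symbols)
```

    With these tables, type_checks(map_base, ALL_TG, ALL_AC) holds, while using ALL_AC as the input type fails because A is mapped to T.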
  • Slide 19
  • BENEFITS OF AUTOMATA AND TRANSDUCERS: transducer equivalence. let m_f_DNA l : base list = filter_AC (map_base l); let f_m_DNA l : base list = map_base (filter_AC l). Is m_f_DNA equivalent to f_m_DNA?
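    The question on this slide can be answered mechanically. The sketch below again assumes one-state transducers written as per-symbol tables (an illustrative encoding, not the authors' algorithm): two such transducers are equivalent exactly when their tables agree, and any symbol on which they disagree is a differentiating input.

```python
map_base = {"A": "T", "T": "A", "G": "C", "C": "G"}
filter_AC = {"A": "A", "T": "", "G": "", "C": "C"}

def run(table, word):
    return "".join(table[sym] for sym in word)

def compose(first, second):
    """Table of the transducer that runs `first`, then `second`."""
    return {sym: run(second, out) for sym, out in first.items()}

m_f_DNA = compose(map_base, filter_AC)   # filter_AC (map_base l)
f_m_DNA = compose(filter_AC, map_base)   # map_base (filter_AC l)

def differentiating_input(t1, t2):
    """Return a one-symbol input on which t1 and t2 differ, or None."""
    for sym in sorted(t1):
        if t1[sym] != t2[sym]:
            return sym
    return None

witness = differentiating_input(m_f_DNA, f_m_DNA)
```

    The two programs are not equivalent: on the witness input, m_f_DNA deletes the symbol while f_m_DNA keeps and maps it.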
  • Slide 20
  • FOR EACH DOMAIN-SPECIFIC TASK: design a language that only has the features required by the task, so that it is simple to use, enables automated reasoning about what the programs do, and compiles into efficient code.
  • Slide 21
  • OUTLINE: (1) Automata, transducers, and programs; (2) BEK and string sanitizers; (3) BEX and string encoders; (4) FAST and tree-manipulating programs; (5) What's next?
  • Slide 22
  • BEK: analysis of string sanitizers. P. Hooimeijer, M. Veanes, B. Livshits, D. Molnar, P. Saxena [USENIX11, POPL12]
  • Slide 23
  • 23
  • Slide 24
  • 24
  • Slide 25
  • QUESTION: What could possibly go wrong?
  • Slide 26
  • 26 Attacker: gollum.png' onload='javascript:...
  • Slide 27
  • 27 Attacker: gollum.png' onload='javascript:... Result:
  • 28 Attacker: im.png' onload='javascript:... Result:
  • Slide 30
  • FIRST LINE OF DEFENSE: SANITIZERS. Sanitizer: a string transformation function that turns untrusted data into sanitized data. Example strings on the slide: im.png' and img.png'. [Slide footer: PLDI'12 submission presentations, Dec 8, 2011]
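    To make the attack and the role of a sanitizer concrete, here is a small sketch. It uses Python's standard html.escape as a stand-in sanitizer; that choice, and the render_img helper, are assumptions for illustration and are not among the sanitizers analyzed in the talk.

```python
import html

def render_img(src, sanitize=None):
    """Build an <img> tag naively, optionally sanitizing `src` first."""
    if sanitize is not None:
        src = sanitize(src)
    return "<img src='" + src + "'>"

attack = "im.png' onload='javascript:..."

unsafe = render_img(attack)               # attacker closes the attribute
safe = render_img(attack, html.escape)    # quotes become &#x27;
```

    In the unsafe result the single quote ends the src attribute and injects an onload handler; in the sanitized result the quote is encoded and stays inside the attribute value.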
  • Slide 31
  • COMPARING SANITIZERS 31
  • Slide 32
  • 32 ' ' single quote html entity
  • Slide 33
  • 33 some untrusted input
  • Slide 34
  • Library A. Name: HtmlEncode. Around for: years. Availability: readily available to C# developers. Input: some untrusted input
  • Slide 35
  • Library A. Name: HtmlEncode. Around for: years. Availability: readily available to C# developers. Library B. Name: HtmlEncode. Around for: years. Availability: readily available to C# developers. Input: some untrusted input
  • Slide 36
  • 36 Library A Name: Around for: Availability: Library B Name: Around for: Availability: HtmlEncode Years Readily available to C# developers HtmlEncode Years Readily available to C# developers ' ' ' '
  • Slide 37
  • .NET WebUtility:
        public static string HtmlEncode(string s)
        {
            if (s == null) return null;
            int num = IndexOfHtmlEncodingChars(s, 0);
            if (num == -1) return s;
            StringBuilder builder = new StringBuilder(s.Length + 5);
            int length = s.Length;
            int startIndex = 0;
        Label_002A:
            if (num > startIndex)
            {
                builder.Append(s, startIndex, num - startIndex);
            }
            char ch = s[num];
            if (ch > '>')
            {
                // Entity literals and the '<' case below are restored; the
                // transcript had decoded or stripped them.
                builder.Append("&#");
                builder.Append(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
                builder.Append(';');
            }
            else
            {
                char ch2 = ch;
                if (ch2 != '"')
                {
                    switch (ch2)
                    {
                        case '<': builder.Append("&lt;"); goto Label_00D5;
                        case '>': builder.Append("&gt;"); goto Label_00D5;
                        case '&': builder.Append("&amp;"); goto Label_00D5;
                    }
                }
                else
                {
                    builder.Append("&quot;");
                }
            }
        Label_00D5:
            startIndex = num + 1;
            if (startIndex < length)
            {
                num = IndexOfHtmlEncodingChars(s, startIndex);
                if (num != -1) { goto Label_002A; }
                builder.Append(s, startIndex, length - startIndex);
            }
            return builder.ToString();
        }
  • MS AntiXSS:
        private static string HtmlEncode(string input, bool useNamedEntities, MethodSpecificEncoder encoderTweak)
        {
            if (string.IsNullOrEmpty(input)) { return input; }
            if (characterValues == null) { InitialiseSafeList(); }
            if (useNamedEntities && namedEntities == null) { InitialiseNamedEntityList(); }

            // Setup a new character array for output.
            char[] inputAsArray = input.ToCharArray();
            int outputLength = 0;
            int inputLength = inputAsArray.Length;
            char[] encodedInput = new char[inputLength * 10];
            SyncLock.EnterReadLock();
            try
            {
                for (int i = 0; i < inputLength; i++)
                {
                    char currentCharacter = inputAsArray[i];
                    int currentCodePoint = inputAsArray[i];
                    char[] tweekedValue;

                    // Check for invalid values
                    if (currentCodePoint == 0xFFFE || currentCodePoint == 0xFFFF)
                    {
                        throw new InvalidUnicodeValueException(currentCodePoint);
                    }
                    else if (char.IsHighSurrogate(currentCharacter))
                    {
                        if (i + 1 == inputLength)
                        {
                            throw new InvalidSurrogatePairException(currentCharacter, '\0');
                        }
                        // Now peek ahead and check if the following character is a low surrogate.
                        char nextCharacter = inputAsArray[i + 1];
                        char nextCodePoint = inputAsArray[i + 1];
                        if (!char.IsLowSurrogate(nextCharacter))
                        {
                            throw new InvalidSurrogatePairException(currentCharacter, nextCharacter);
                        }
                        // Look-ahead was good, so skip.
                        i++;
                        // Calculate the combined code point
                        long combinedCodePoint = 0x10000 + ((currentCodePoint - 0xD800) * 0x400) + (nextCodePoint - 0xDC00);
                        char[] encodedCharacter = SafeList.HashThenValueGenerator(combinedCodePoint);
                        encodedInput[outputLength++] = '&';
                        for (int j = 0; j < encodedCharacter.Length; j++) { encodedInput[outputLength++] = encodedCharacter[j]; }
                        encodedInput[outputLength++] = ';';
                    }
                    else if (char.IsLowSurrogate(currentCharacter))
                    {
                        throw new InvalidSurrogatePairException('\0', currentCharacter);
                    }
                    else if (encoderTweak != null && encoderTweak(currentCharacter, out tweekedValue))
                    {
                        for (int j = 0; j < tweekedValue.Length; j++) { encodedInput[outputLength++] = tweekedValue[j]; }
                    }
                    else if (useNamedEntities && namedEntities[currentCodePoint] != null)
                    {
                        char[] encodedCharacter = namedEntities[currentCodePoint];
                        encodedInput[outputLength++] = '&';
                        for (int j = 0; j < encodedCharacter.Length; j++) { encodedInput[outputLength++] = encodedCharacter[j]; }
                        encodedInput[outputLength++] = ';';
                    }
                    else if (characterValues[currentCodePoint] != null)
                    {
                        // character needs to be encoded
                        char[] encodedCharacter = characterValues[currentCodePoint];
                        encodedInput[outputLength++] = '&';
                        for (int j = 0; j < encodedCharacter.Length; j++) { encodedInput[outputLength++] = encodedCharacter[j]; }
                        encodedInput[outputLength++] = ';';
                    }
                    else
                    {
                        // character does not need encoding
                        encodedInput[outputLength++] = currentCharacter;
                    }
                }
            }
            finally
            {
                SyncLock.ExitReadLock();
            }
            return new string(encodedInput, 0, outputLength);
        }
  • Slide 38
  • .NET WebUtility vs. MS AntiXSS (the same two HtmlEncode implementations as on Slide 37). Same behavior on all inputs? If not, what is a differentiating input? Can it generate any known bad outputs?
  • Slide 39
  • PHP Trunk: changes to html.c, 1999-2011
  • Slide 40
  • PHP Trunk: changes to html.c, 1999-2011. R7,841 (April 1999): 135 LOC. R309,482 (March 2011): 1693 LOC.
  • Slide 41
  • PHP Trunk: changes to html.c, 1999-2011. R7,841 (April 1999): 135 LOC. R32,564 (September 2000): ENT_QUOTES introduced. R309,482 (March 2011): 1693 LOC.
  • Slide 42
  • PHP Trunk: changes to html.c, 1999-2011. R7,841 (April 1999): 135 LOC. R32,564 (September 2000): ENT_QUOTES introduced. R242,949 (September 2007): $double_encode=true. R309,482 (March 2011): 1693 LOC.
  • Slide 43
  • PHP Trunk: changes to html.c, 1999-2011. Safe to apply twice? Safe to combine with other sanitizers?
  • Slide 44
  • MOTIVATION. Writing string sanitizers correctly is difficult. There is no cheap way to identify problems with sanitizers. Correctness is a moving target. What if we could say more about sanitizer behavior?
  • Slide 45
  • CONTRIBUTIONS. BEK frontend: a small language for string manipulation, similar to how sanitizers are written today. BEK backend: a model based on symbolic finite transducers, with algorithms for analysis and code generation.
  • Slide 46
  • CONTRIBUTIONS. BEK frontend: a small language for string manipulation, similar to how sanitizers are written today. BEK backend: a model based on symbolic finite transducers, with algorithms for analysis and code generation. Evaluation: converted sanitizers from a variety of sources; checked properties like reversibility, idempotence, equivalence, and commutativity.
  • Slide 47
  • BEK ARCHITECTURE. Bek program: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 48
  • BEK ARCHITECTURE. Bek program → Transformation → Symbolic Finite Transducers (Microsoft.Automata, Z3). Bek program: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 49
  • BEK ARCHITECTURE. Bek program → Transformation → Symbolic Finite Transducers (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Counterexample: \' vs. \\'). Bek program: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 50
  • BEK ARCHITECTURE. Bek program → Transformation → Symbolic Finite Transducers (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Counterexample: \' vs. \\') and Code Gen (C#, JavaScript, C). Bek program: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 51
  • BEK ARCHITECTURE. Bek program → Transformation → Symbolic Finite Transducers (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Counterexample: \' vs. \\') and Code Gen (C#, JavaScript, C). Bek program: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 52
  • A BEK PROGRAM: ESCAPE QUOTES. escape := iter(c in s)[b := false;] { case (!b && c in "['\"]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 53
  • A BEK PROGRAM: ESCAPE QUOTES. escape := iter(c in s)[b := false;] { case (!b && c in "['\"]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Annotation: iterate over the characters in string s.
  • Slide 54
  • A BEK PROGRAM: ESCAPE QUOTES. escape := iter(c in s)[b := false;] { case (!b && c in "['\"]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Annotations: iterate over the characters in string s while updating one boolean variable b; simple dedicated syntax.
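    For readers unfamiliar with Bek syntax, the escape program can be read as the following Python loop. This is a hand translation for illustration only (Bek itself compiles to C#, JavaScript and C according to the architecture slides); the cases are tried in order, and b records whether the previous character was an unescaped backslash.

```python
def escape(s):
    """Hand translation of the Bek `escape` program shown on the slide."""
    out = []
    b = False  # True right after an odd run of backslashes
    for c in s:
        if not b and c in "'\"":
            # Unescaped quote: emit a backslash in front of it.
            b = False
            out.append("\\" + c)
        elif c == "\\":
            # Backslash: toggle the flag and copy it through.
            b = not b
            out.append(c)
        else:
            # Any other character (or an already escaped quote).
            b = False
            out.append(c)
    return "".join(out)
```

    On the inputs tried below, applying escape twice gives the same result as applying it once, which is the kind of idempotence property the talk proposes to check automatically for all inputs.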
  • Slide 55
  • BEK ARCHITECTURE. Bek program → Transformation → Symbolic Finite Transducers (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Counterexample: \' vs. \\') and Code Gen (C#, JavaScript, C). Bek program: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 56
  • FINITE STATE TRANSDUCERS. Transitions a/A, b/B, ..., z/Z, &/&, ... Problem: the alphabet has 2^16 characters, so there are TOO MANY TRANSITIONS.
  • Slide 57
  • SYMBOLIC FINITE TRANSDUCERS. Only two transitions: x in [a-z] / x-32 and x not in [a-z] / x
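    A symbolic transition pairs a predicate over characters with a list of output functions, so two transitions cover the whole 2^16-character alphabet. The sketch below is an assumed, simplified encoding (tuples of Python lambdas and a run_sft helper invented for this example); the real implementation represents predicates symbolically and checks them with Z3.

```python
def is_lower(x):
    return ord("a") <= x <= ord("z")

# (source state, guard predicate, output functions, target state)
TO_UPPER = [
    (0, is_lower, [lambda x: x - 32], 0),            # x in [a-z] / x-32
    (0, lambda x: not is_lower(x), [lambda x: x], 0),  # x not in [a-z] / x
]

def run_sft(transitions, word, start=0):
    """Run a deterministic symbolic finite transducer on a string."""
    state, out = start, []
    for ch in word:
        x = ord(ch)
        for src, guard, outputs, dst in transitions:
            if src == state and guard(x):
                out.extend(chr(f(x)) for f in outputs)
                state = dst
                break
        else:
            raise ValueError("no transition for %r in state %r" % (ch, state))
    return "".join(out)
```

    For example, run_sft(TO_UPPER, "abz&A") upper-cases the lowercase letters and copies everything else.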
  • Slide 58
  • SYMBOLIC FINITE TRANSDUCERS. Example transitions (predicate / sequence of output functions): x>5 / x+1, x; x%2=1 / x-1, x, x+4; true / 5; true / x-4. The alphabet theory has to be DECIDABLE; we'll use Z3 to check predicate satisfiability.
  • Slide 59
  • BEK ARCHITECTURE. Bek program → Transformation → Symbolic Finite Transducers (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Counterexample: \' vs. \\') and Code Gen (C#, JavaScript, C). Bek program: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 60
  • BEK ARCHITECTURE. Bek program → Transformation → Symbolic Finite Transducers (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Counterexample: \' vs. \\') and Code Gen (C#, JavaScript, C). Now what? Bek program: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 61
  • SFT Algorithms. EQUIVALENCE CHECKING IS DECIDABLE! The alphabet theory has to be DECIDABLE; we'll use Z3 to check predicate satisfiability.
  • Slide 62
  • SFT Algorithms 62 AntiXSS.HtmlEncode = WebUtility.HtmlEncode EQUIVALENCE CHECKING
  • Slide 63
  • CLOSED UNDER COMPOSITION. Running SFT A and then SFT B (in → A → B → out) can be expressed as a single SFT A∘B (in → A∘B → out).
  • Slide 64
  • SFT Algorithms: COMPOSITION. SFT A followed by SFT B becomes a single SFT A∘B. Example question: JavaScriptEncode(HtmlEncode(w)) = HtmlEncode(JavaScriptEncode(w))
  • Slide 65
  • PRE-IMAGE COMPUTATION. Given SFT A and a regular language O of outputs, compute the regular language I of inputs that A maps into O.
  • Slide 66
  • PRE-IMAGE COMPUTATION. When the output language is a vulnerability signature, its pre-image under SFT A is the set of MALICIOUS INPUTS.
  • Slide 67
  • CONTRIBUTIONS. BEK frontend: a small language for string manipulation, similar to how sanitizers are written today. BEK backend: a model based on symbolic finite transducers, with algorithms for analysis and code generation. Evaluation: converted sanitizers from a variety of sources; checked properties like reversibility, idempotence, equivalence, and commutativity.
  • Slide 68
  • QUESTIONS. Can BEK model existing sanitizers? Can we use it to check interesting properties on real sanitizers?
  • Slide 69
  • Language Features: WHAT FEATURES ARE NEEDED? Data: 1x OWASP HTMLencode, 13x Google AutoEscape, 21x IE 8 XSS Filter, 7x synthetic. Method: inspect feature counts.
  • Slide 70
  • Language Features 70 Majority (76%) of sanitizers can be ported without extending the language With multi-character lookahead: 90% WHAT FEATURES ARE NEEDED?
  • Slide 71
  • CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS? Data: 4x MS internal HtmlEncode; 3x for-hire HtmlEncode based on an English-language specification (C#). Questions: Commutative? Equivalent?
  • Slide 72
  • 72 Short answer: Yes! CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
  • Slide 73
  • CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS? Short answer: Yes! EQ results take less than a minute to obtain. [Table: pairwise equivalence results for implementations 1-7.]
  • Slide 74
  • DOES IT SCALE? [Plots: Commutativity; Self-Equivalence.]
  • Slide 75
  • The Cheat Sheet: WERE ALL SANITIZERS BROKEN? One out of seven implementations correctly encodes all strings for use in both HTML and attribute contexts.
  • Slide 76
  • BEK IN A NUTSHELL (Conclusion). BEK is a domain-specific language for writing string sanitizers. BEK can model programs without approximation using symbolic finite transducers, enabling e.g. equivalence checks. BEK was evaluated using real-world sanitizers from a variety of different sources.
  • Slide 77
  • OUTLINE: (1) Automata, transducers, and programs; (2) BEK and string sanitizers; (3) BEX and string encoders; (4) FAST and tree-manipulating programs; (5) What's next?
  • Slide 78
  • BEX: ANALYSIS OF STRING ENCODERS. Loris D'Antoni, Margus Veanes [VMCAI13, CAV13]
  • Slide 79
  • Hi, I'm plain text! Nice to meet you! → Encoder → SGkgSSdtIHBsYWluIHRleHQsIG5pY2UgdG8gbWVldCB5b3Uh → Decoder
  • Slide 80
  • NOT SO EASY TO GET RIGHT 80
  • Slide 81
  • WHEN ARE THEY CORRECT? 81 T Encoder T Decoder T Encoder TT
  • Slide 82
  • CAN WE USE TRANSDUCERS? 82 T Encoder T Decoder T Encoder o Decoder = Identity
  • Slide 83
  • Language Features 83 Majority (76%) of sanitizers can be ported without extending Bek With multi-character lookahead: 90% BEK: WHAT FEATURES WERE NEEDED?
  • Slide 84
  • BASE64 encoder: 3 bytes → 4 Base64 characters
        Text content:    M         a         n
        Bytes:           77        97        110
        Bit pattern:     01001101  01100001  01101110
        Index:           19   22   5   46
        Base64 encoded:  T    W    F   u
  • Slide 85
  • 85 HOW DO WE EXTEND BEK?
  • Slide 86
  • BEK ARCHITECTURE. Bek program → Transformation → Symbolic Finite Transducers (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Counterexample: \' vs. \\') and Code Gen (C#, JavaScript, C). Symbolic finite transducers don't have registers. Bek program: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 87
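  • TRANSDUCERS WITH REGISTERS. Three states (0, 1, 2) and one register r. State 0 → 1: x / [ x>>2 ], r := (x&3)<<4. State 1 → 2: x / [ r | (x>>4) ], r := (x&0xF)<<2. State 2 → 0: x / [ r | (x>>6), x&0x3F ], r := 0.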
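    A register transducer for Base64 can be simulated directly. The sketch below assumes the three-transition reading of this slide (x / [x>>2], r := (x&3)<<4; then x / [r | (x>>4)], r := (x&0xF)<<2; then x / [r | (x>>6), x&0x3F], r := 0); the function name, the state numbering and the decision to reject inputs that would need padding are choices made for the example.

```python
import string

B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/")

def base64_sextets(data):
    """Simulate a 3-state transducer with one register r over bytes."""
    state, r, out = 0, 0, []
    for x in data:
        if state == 0:
            out.append(x >> 2)           # top 6 bits of byte 1
            r = (x & 3) << 4             # keep the low 2 bits
            state = 1
        elif state == 1:
            out.append(r | (x >> 4))     # 2 saved bits + top 4 bits
            r = (x & 0xF) << 2           # keep the low 4 bits
            state = 2
        else:
            out.append(r | (x >> 6))     # 4 saved bits + top 2 bits
            out.append(x & 0x3F)         # low 6 bits of byte 3
            r = 0
            state = 0
    if state != 0:
        raise ValueError("padding is not modeled in this sketch")
    return out

sextets = base64_sextets(b"Man")
encoded = "".join(B64_ALPHABET[i] for i in sextets)
```

    On the input "Man" this reproduces the indices and characters from the BASE64 encoder slide.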
  • Slide 93
  • EXTENDED SYMBOLIC FINITE TRANSDUCERS. Input: Man. A single transition p → q with look-ahead 3 reads x1, x2, x3: x1 ≤ FF ∧ x2 ≤ FF ∧ x3 ≤ FF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
  • Slide 94
  • EXTENDED SYMBOLIC FINITE TRANSDUCERS. Input: Man; output: TWFu. A single transition p → q with look-ahead 3 reads x1, x2, x3: x1 ≤ FF ∧ x2 ≤ FF ∧ x3 ≤ FF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
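    With look-ahead, the same encoder needs no register: one transition consumes three bytes at a time. The sketch below assumes the output expression [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F] and a guard requiring each input to fit in one byte; the function names are hypothetical, and padding for inputs whose length is not a multiple of 3 is left out. Python's standard base64 module is used only as a cross-check.

```python
import base64
import string

B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/")

def esft_rule(x1, x2, x3):
    """One look-ahead-3 transition: three bytes in, four sextets out."""
    if not all(0 <= x <= 0xFF for x in (x1, x2, x3)):
        raise ValueError("guard failed: inputs must be bytes")
    return [
        x1 >> 2,
        ((x1 & 3) << 4) | (x2 >> 4),
        ((x2 & 0xF) << 2) | (x3 >> 6),
        x3 & 0x3F,
    ]

def encode(data):
    """Apply the transition to consecutive 3-byte windows."""
    if len(data) % 3 != 0:
        raise ValueError("padding is not modeled in this sketch")
    out = []
    for i in range(0, len(data), 3):
        out.extend(esft_rule(data[i], data[i + 1], data[i + 2]))
    return "".join(B64_ALPHABET[i] for i in out)

def agrees_with_stdlib(data):
    """Compare against base64.b64encode from the standard library."""
    return encode(data) == base64.b64encode(data).decode("ascii")
```

    The comparison with the standard library plays the role of the correctness question the talk asks about encoders, although here it is only tested on sample inputs rather than proved for all of them.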
  • Slide 95
  • MORE EXPRESSIVE THAN SYMBOLIC FINITE TRANSDUCERS. Example transition: x1 > x2 / [x1 + x2], whose guard relates two adjacent input symbols. Do they still have nice properties?
  • Slide 96
  • WHAT DO WE NEED? 96 T Encoder T Decoder T Encoder o Decoder = Identity CompositionEquivalence
  • Slide 97
  • NEGATIVE RESULTS. ESFAs: equivalence is undecidable; not closed under intersection; not closed under complement. ESFTs: equivalence is undecidable; not closed under composition.
  • Slide 98
  • A FRIENDLIER RESTRICTION 98
  • Slide 99
  • CARTESIAN EXTENDED SYMBOLIC FINITE TRANSDUCERS. The negative results use binary predicates, and encoders do not use this feature. Restriction: only allow conjunctions of unary predicates, e.g. x1 > 5 ∧ x2 = 1 / [x1 + x2, x1]
  • Slide 100
  • CARTESIAN ESFA = SFA. Cartesian ESFAs are now equivalent to SFAs: the guard x1 > 5 ∧ x2 = 1 on one look-ahead transition becomes two SFA transitions, x > 5 followed by x = 1.
  • Slide 101
  • STILL MORE EXPRESSIVE THAN SFTS. Cartesian ESFTs are strictly more expressive than SFTs, e.g. x1 > 5 ∧ x2 = 1 / [x1 + x2] has no SFT counterpart.
  • Slide 102
  • WHAT DO WE NEED? 102 T Encoder T Decoder T Encoder o Decoder = Identity CompositionEquivalence
  • Slide 103
  • RESULTS. Cartesian ESFTs: equivalence is decidable; not closed under composition.
  • Slide 104
  • COMPOSITION IN PRACTICE 104
  • Slide 105
  • 105 BEK WITH REGISTERS?
  • Slide 106
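  • TRANSDUCERS WITH REGISTERS. Three states (0, 1, 2) and one register r. State 0 → 1: x / [ x>>2 ], r := (x&3)<<4. State 1 → 2: x / [ r | (x>>4) ], r := (x&0xF)<<2. State 2 → 0: x / [ r | (x>>6), x&0x3F ], r := 0.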
  • Slide 131
  • SYMBOLIC TREE TRANSDUCERS [PSI11]. Rule: q(a.a>3, (x1, x2)) → a.a+1, (a.a-2, q1(x1)). Example: a node labeled 5 (such that 5>3 is true) becomes a node labeled 5+1 with a child labeled 5-2 and q1 applied to x1. Decidable properties: type-checking, etc. Domain expressiveness: infinite alphabets using predicates and functions. Structural expressiveness: can't delete a node without reading it first. The alphabet theory has to be DECIDABLE; we'll use Z3 to check predicate satisfiability.
  • Slide 132
  • IMPROVING STRUCTURAL EXPRESSIVENESS. Transformation: delete the left child if it contains a script. If we delete the node, we can't check that the left child contained a script. Proposed fix: Regular Look-Ahead (RLA).
  • Slide 133
  • REGULAR LOOK-AHEAD. Transformation: delete the left child if it contains a script. Rules can ask whether the children are in particular languages: p1 is the language of trees that contain a script node, and p2 is the language of all trees. The transformation is now safe. Decidable properties: type-checking, etc. Domain expressiveness: infinite alphabets. Structural expressiveness: good enough to express our examples.
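    The effect of regular look-ahead can be imitated in plain Python: before a rule fires, it asks whether a child belongs to a language (here, trees containing a script node) without consuming that child. The tree encoding as (label, children) tuples and both function names are assumptions for this sketch; a real implementation would compute the look-ahead once, bottom-up, rather than by repeated recursion.

```python
def contains_script(tree):
    """Look-ahead language p1: trees that contain a 'script' node."""
    label, children = tree
    return label == "script" or any(contains_script(c) for c in children)

def sanitize(tree):
    """State q: delete the left child of a binary div if it is in p1."""
    label, children = tree
    if label == "div" and len(children) == 2 and contains_script(children[0]):
        # Left child is in p1, right child is in p2 (all trees).
        return (label, [sanitize(children[1])])
    return (label, [sanitize(c) for c in children])

dirty = ("div", [("p", [("script", [])]), ("b", [])])
clean = ("div", [("p", []), ("b", [])])
```

    Only the tree whose left child contains a script is changed; the other is returned as is.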
  • Slide 134
  • Comparison of tree transducer models (V = yes, X = no, ~ = partial):
        Model                                                          | Decidability | Complexity | Structural expressiveness | Infinite alphabets
        Top Down Tree Transducers [Engelfriet75]                       | V            | V          | X                         | X
        Top Down Tree Transducers with Regular Look-ahead [Engelfriet76] | V          | V          | ~                         | X
        Streaming Tree Transducers [AlurDantoni12]                     | V            | X          | V                         | X
        Data Automata [Bojanczyk98]                                    | ~            | X          | X                         | V
        Symbolic Tree Transducers [VeanesBjoerner11]                   | V            | V          | X                         | V
        Symbolic Tree Transducers with RLA                             | V            | V          | ~                         | V
  • Slide 135
  • COMPOSITION OF STTR. Given T1 and T2, build a single transducer T1 ∘ T2. This is not always possible! Goal: find the biggest class for which it is possible.
  • Slide 136
  • WHEN CAN WE COMPOSE? Theorem: T(x) = T2(T1(x)) is definable by a Symbolic Tree Transducer with RLA if T1 is deterministic. All our examples fall into this category. The alphabet theory has to be DECIDABLE; we'll use Z3 to check predicate satisfiability.
  • Slide 137
  • FAST ARCHITECTURE. Fast program → Transformation → Symbolic Tree Transducers with RLA (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Counterexample: \' vs. \\') and Code Gen (C#, JavaScript, C). Program shown on the slide: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 138
  • CASE STUDIES AND EXPERIMENTS 138
  • Slide 139
  • CASE STUDIES AND EXPERIMENTS. Program optimization: deforestation of functional programs. Verification: HTML sanitization; analysis of functional programs; augmented reality app store. Infinite alphabets: integer data types.
  • Slide 140
  • DEFORESTATION. Removing intermediate data structures from programs. ADVANTAGE: the program is a single transducer that reads the input list only once, thanks to transducer composition. alphabet IList [i : int] { nil(0), cons(1) } trans mapC: IList -> IList { nil() to nil [0] | cons(x) to cons [(i+5)%26] (mapC x) } def mapC2 : IList -> IList := compose mapC mapC
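    The same idea can be shown on ordinary Python lists: instead of mapping twice and building an intermediate list, fuse the per-element functions and map once. This is only an analogy to composing the mapC transducer with itself; shift, map_with and fuse are names invented for the sketch.

```python
def shift(i):
    """Per-element function of mapC: (i + 5) % 26."""
    return (i + 5) % 26

def map_with(f):
    """Lift a per-element function to a single pass over a list."""
    return lambda xs: [f(i) for i in xs]

def fuse(f, g):
    """Compose per-element functions: first f, then g."""
    return lambda i: g(f(i))

map_c = map_with(shift)
# One traversal, no intermediate list (analogous to compose mapC mapC).
map_c2 = map_with(fuse(shift, shift))
```

    map_c2(xs) returns the same list as map_c(map_c(xs)) while traversing the input only once.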
  • Slide 141
  • STAGES BY EXAMPLE 141 mapCmapC2 Transducers
  • Slide 142
  • DEFORESTATION: SPEEDUP. f(f(f(...f(x)...))) vs. (f;f;f;...;f)(x)
  • Slide 143
  • ANALYSIS OF FUNCTIONAL PROGRAMS 143
  • Slide 144
  • AR INTERFERENCE ANALYSIS Recognizers output data that can be seen as a tree structure Spine Hip Neck HeadKnee Ankle Foot . 144
  • Slide 145
  • APPS AS TREE TRANSFORMATIONS Applications that use recognizers can be modeled as FAST programs 145 trans addHat: STree -> STree Spine(x,y) to Spine(addHat(x), y) | Neck(h,l,r) to Neck(addHat(h), l, r) | Head(a) to Head(Hat(a))
  • Slide 146
  • COMPOSITION OF PROGRAMS. Two FAST programs p1 and p2 can be composed into a single FAST program p1;p2
  • Slide 147
  • ANOTHER RECOGNIZER 147 Room Floor Wall Table Chair . Chair .
  • Slide 148
  • INTERFERENCE ANALYSIS. Apps can be malicious: they may try to overwrite the outputs of other apps. Apps interfere when they annotate the same node of a recognizer's output. We can compose them and check statically whether they interfere. Put the checker in the app store and analyze apps before approval. Interfering apps: add cat ears vs. add hat; add pin to a city vs. blur a city; Amazon Buy Now button vs. malicious Buy Now button.
  • Slide 149
  • INTERFERENCE ANALYSIS IN PRACTICE. 100 generated FAST programs, up to 85 functions each. Check statically whether they conflict pairwise for ANY possible input. Checked 99% of program pairs in less than 0.5 sec! For an app store these times are perfectly fine.
  • Slide 150
  • TWO PENDING PATENTS 150
  • Slide 151
  • FAST IN A NUTSHELL (Conclusion). FAST is a domain-specific language for writing tree-manipulating programs. FAST can model programs without approximation using symbolic tree transducers with regular look-ahead. FAST was evaluated using real-world programs.
  • Slide 152
  • OUTLINE: (1) Automata, transducers, and programs; (2) BEK and string sanitizers; (3) BEX and string encoders; (4) FAST and tree-manipulating programs; (5) What's next?
  • Slide 153
  • WHAT'S NEXT
  • Slide 154
  • FOR EACH DOMAIN-SPECIFIC TASK: design a language that only has the features required by the task, so that it is simple to use, enables automated reasoning about what the programs do, and compiles into efficient code.
  • Slide 155
  • DREX: EFFICIENT STRING MANIPULATION. Loris D'Antoni, Mukund Raghothaman, Rajeev Alur. Here at POPL15!
  • Slide 156
  • DECLARATIVE LANGUAGE FOR STRING SCRIPTS (15/1, 2PM, SEC. 2B). Example: the regular expression (a|b)*b and the program iterate(choice(a->a, b->b)), drawn as a transducer with transitions a/a and b/b. Execute this code in a linear-time, left-to-right pass over the input string!
  • Slide 157
  • BEX 2.0: PARALLEL EXECUTION OF STRING ENCODERS. Margus Veanes, David Molnar, Ben Livshits, Todd Mytkowicz. Here at POPL 15!
  • Slide 158
  • FROM TRANSDUCERS TO PARALLEL EXECUTIONS (15/1, 2PM, SEC. 2B) Efficient data-parallel code 158 12 x / [ r+x, x+1], r := 0 x / [ x+4 ], r := (x-2) 02
  • Slide 159
  • PROGRAM BOOSTING, OR CROWD-SOURCING FOR CORRECTNESS. Loris D'Antoni, David Molnar, Benjamin Livshits, Margus Veanes, Robert Cochran. Here at POPL 15!
  • Slide 160
  • CROWD-SOURCING PROGRAMS WITH AUTOMATA (17/1, 4PM, SEC. 9B) 160 Specification
  • Slide 161
  • YOU CAN HELP TOO! 161
  • Slide 162
  • INTERESTING DIRECTIONS. A transducer-based language for: web scrapers, spreadsheet transformations, compiler optimizations, XML processing, HTML rendering.
  • Slide 163
  • SUMMARIZING 163
  • Slide 164
  • OUR RECIPE FOR EACH TASK. DSL → Transformation → Transducer Model (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Analysis question) and Code Gen (C#, JavaScript, C). DSL program shown on the slide: s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); };
  • Slide 165
  • BEK: "Fast and precise sanitizer analysis with BEK", Hooimeijer, Livshits, Molnar, Saxena, Veanes, USENIX11. "Symbolic finite state transducers: algorithms and applications", Veanes, Hooimeijer, Livshits, Molnar, Bjorner, POPL12.
  • BEX: "Static analysis of string encoders and decoders", D'Antoni, Veanes, VMCAI13. "Equivalence of extended symbolic finite transducers", D'Antoni, Veanes, CAV13. "Data parallel string manipulating programs", Veanes, Mytkowicz, Molnar, Livshits, POPL15.
  • FAST: "Fast: a transducer based language for tree manipulation", D'Antoni, Veanes, Livshits, Molnar, PLDI14.