Sanitizing HTML 5 with Perl 5

Transcript of Sanitizing HTML 5 with Perl 5

Page 1: Sanitizing HTML 5 with Perl 5

Introduction | HTML parser choice | HTML5::Sanitizer internals | HTML5::Sanitizer usage | Conclusion

HTML5::Sanitizer: Sanitizing HTML 5 with Perl 5

Uwe Voelker

XING AG

August 16th 2011

Uwe Voelker HTML5::Sanitizer

Page 2: Sanitizing HTML 5 with Perl 5

1 Introduction

2 HTML parser choice

3 HTML5::Sanitizer internals

4 HTML5::Sanitizer usage

5 Conclusion

Page 3: Sanitizing HTML 5 with Perl 5

1 Introduction

  Task: WYSIWYG editor

  Team

  Live example

2 HTML parser choice

3 HTML5::Sanitizer internals

4 HTML5::Sanitizer usage

5 Conclusion

Page 4: Sanitizing HTML 5 with Perl 5

Task: WYSIWYG editor

integrate WYSIWYG editor in XING

frontend architect researched open source solutions

none was suitable, mostly for security reasons

the decision was made to build it in-house

goals: secure; share profiles (allowed tags) between frontend and backend

Page 7: Sanitizing HTML 5 with Perl 5

Team

Christopher Blum: JavaScript

Ingo Chao: QA (HTML5/CSS)

Uwe Voelker: Perl

Page 8: Sanitizing HTML 5 with Perl 5

Live example

Page 9: Sanitizing HTML 5 with Perl 5

1 Introduction

2 HTML parser choice

  CPAN modules

  Evaluation

  Final decision

3 HTML5::Sanitizer internals

4 HTML5::Sanitizer usage

5 Conclusion

Page 10: Sanitizing HTML 5 with Perl 5

HTML parser on CPAN

HTML::Parser

HTML::TreeBuilder

HTML::TreeBuilder::LibXML

XML::LibXML

HTML::HTML5::Parser

Marpa::HTML

...

Page 12: Sanitizing HTML 5 with Perl 5

started with HTML::HTML5::Parser (HH5P)

because it understands the semantics of HTML 5 tags

but it also did this (the "&copy" in a query string was decoded as the © entity):

input: http://example.com/?section=2&copy=3&lang=en

output: http://example.com/?section=2©=3&lang=en

final choice: XML::LibXML
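The mangled URL is the typical symptom of a decoder that accepts named entities without a terminating semicolon. The following is a minimal sketch of that failure mode only; the two-entry entity table and the helper names `decode_lenient` and `decode_strict` are made up for illustration, and this is not the code of HTML::HTML5::Parser.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny entity table (illustration only).
my %ENTITY = (
    copy => "\x{A9}",    # copyright sign
    amp  => '&',
);

# Lenient decoding: the trailing semicolon is optional, so "&copy=3"
# inside a query string is treated as the entity "&copy".
sub decode_lenient {
    my ($text) = @_;
    $text =~ s/&(\w+)(;?)/exists $ENTITY{$1} ? $ENTITY{$1} : "&$1$2"/ge;
    return $text;
}

# Strict decoding: only "&name;" with a semicolon counts as an entity.
sub decode_strict {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $ENTITY{$1} ? $ENTITY{$1} : "&$1;"/ge;
    return $text;
}

my $url = 'http://example.com/?section=2&copy=3&lang=en';

binmode STDOUT, ':encoding(UTF-8)';
print decode_lenient($url), "\n";    # ...?section=2©=3&lang=en
print decode_strict($url),  "\n";    # URL is left unchanged
```

Writing the ampersands as `&amp;` in the HTML source avoids the ambiguity regardless of which decoder is used.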

Page 16: Sanitizing HTML 5 with Perl 5

1 Introduction

2 HTML parser choice

3 HTML5::Sanitizer internals

  Processing phases

  Parsing

  Converting

  Writing

4 HTML5::Sanitizer usage

5 Conclusion

Page 17: Sanitizing HTML 5 with Perl 5

Processing phases

preprocessing (e. g. migration)

parsing (HTML → DOM tree)

converting (rebuild tree according to profile)

writing (DOM tree → HTML)
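The four phases can be pictured as a chain in which each phase consumes the previous phase's result. The sketch below shows only that chaining and the bookkeeping of intermediate results; the function `run_phases` and all four phase bodies are hypothetical stand-ins, not the module's real code.

```perl
use strict;
use warnings;

# Run the phases in order and remember every intermediate result.
# Each phase is a [ name => coderef ] pair.
sub run_phases {
    my ($input, @phases) = @_;
    my %result = (input => $input);
    my $data   = $input;
    for my $phase (@phases) {
        my ($name, $code) = @$phase;
        $data = $code->($data);
        $result{$name} = $data;
    }
    $result{output} = $data;
    return \%result;
}

my $result = run_phases(
    "<p>Hello</p>\r\n",

    # preprocessing: hypothetical clean-up (normalise line endings)
    [ preprocessed => sub { my ($html) = @_; $html =~ s/\r\n?/\n/g; return $html } ],

    # parsing: a real implementation would build a DOM tree here
    [ parsed => sub { my ($html) = @_; return { source => $html } } ],

    # converting: a real implementation would rebuild the tree per profile
    [ converted => sub { my ($tree) = @_; $tree->{converted} = 1; return $tree } ],

    # writing: a real implementation would serialise the tree to HTML
    [ written => sub { my ($tree) = @_; return $tree->{source} } ],
);

print $result->{output};
```

Keeping every intermediate value around makes it possible to inspect where an unexpected result was introduced.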

Page 21: Sanitizing HTML 5 with Perl 5

Parsing HTML with XML::LibXML

    use XML::LibXML;

    my $parser = XML::LibXML->new(
        encoding          => 'UTF-8',
        recover           => 2,
        keep_blanks       => 1,
        no_cdata          => 1,
        expand_entities   => 1,
        no_network        => 1,
        suppress_errors   => 1,
        suppress_warnings => 1,
    );

Page 22: Sanitizing HTML 5 with Perl 5

Parsing HTML with XML::LibXML

    my $doc = $parser->parse_html_string(
        $html,
        {
            no_cdata          => 1,
            suppress_errors   => 1,
            suppress_warnings => 1,
        },
    );

Page 23: Sanitizing HTML 5 with Perl 5

Converting - rebuilding DOM tree

loop through every node (only ELEMENT and TEXT)

drop unwanted elements completely (e. g. <script>)

change unknown elements to <span>

optionally change the tag name (as defined by the profile)

transform (or copy) attributes

proceed recursively with child nodes
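The rebuilding steps can be sketched on a plain Perl data structure. This is an illustration under stated assumptions, not the module's code: nodes are hashrefs instead of XML::LibXML nodes, `%PROFILE` and `convert` are hypothetical names, and attribute handling is simplified to "discard the originals, set what the rule says".

```perl
use strict;
use warnings;

# Hypothetical profile: tag name => rule. Tags without an entry are "unknown".
my %PROFILE = (
    script => { remove => 1 },
    p      => {},
    big    => { rename_tag => 'span', set_class => 'big' },
    a      => { set_attributes => { rel => 'nofollow' } },
);

# Rebuild a node (and, recursively, its children) according to %PROFILE.
# Returns a new node, or an empty list if the node is dropped.
sub convert {
    my ($node) = @_;

    # TEXT nodes are copied unchanged.
    return { type => 'text', text => $node->{text} }
        if $node->{type} eq 'text';

    # Only ELEMENT and TEXT nodes survive (comments etc. are skipped).
    return unless $node->{type} eq 'element';

    my $rule = $PROFILE{ $node->{tag} };

    # Unwanted element: drop it together with its whole subtree.
    return if $rule && $rule->{remove};

    # Unknown elements become <span>; known ones may be renamed.
    my $tag = $rule ? ( $rule->{rename_tag} // $node->{tag} ) : 'span';

    # Simplification: original attributes are discarded, only attributes
    # named in the rule are set.
    my %attrs = $rule ? %{ $rule->{set_attributes} // {} } : ();
    $attrs{class} = $rule->{set_class} if $rule && defined $rule->{set_class};

    return {
        type     => 'element',
        tag      => $tag,
        attrs    => \%attrs,
        children => [ map { convert($_) } @{ $node->{children} // [] } ],
    };
}

# <blink>hi <script>alert(1)</script><big>there</big></blink>
my $tree = {
    type => 'element', tag => 'blink', children => [
        { type => 'text', text => 'hi ' },
        { type => 'element', tag => 'script', children => [
            { type => 'text', text => 'alert(1)' },
        ] },
        { type => 'element', tag => 'big', children => [
            { type => 'text', text => 'there' },
        ] },
    ],
};

my $clean = convert($tree);
print "$clean->{tag} with ", scalar @{ $clean->{children} }, " children\n";
```

Building a fresh tree rather than editing the parsed one in place means nothing from the input reaches the output unless a rule explicitly let it through.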

Page 27: Sanitizing HTML 5 with Perl 5

Writing HTML

mainly for additional escapes

could not find a nice way to integrate this in XML::LibXML

    $text =~ s/&/&amp;/g;
    $text =~ s/'/&#39;/g;    # '
    $text =~ s/"/&quot;/g;   # "
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/`/&#96;/g;
    $text =~ s/\{/&#123;/g;
    $text =~ s/\}/&#125;/g;
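In the substitution chain the ampersand has to be handled first, otherwise the `&` of an already inserted entity would be escaped a second time. A possible alternative, sketched here with a hypothetical helper `escape_text` and lookup table `%ESCAPE`, escapes the same eight characters in a single pass so that the order no longer matters:

```perl
use strict;
use warnings;

# Characters to escape and their replacements (same set as on the slide).
my %ESCAPE = (
    '&' => '&amp;',
    "'" => '&#39;',
    '"' => '&quot;',
    '<' => '&lt;',
    '>' => '&gt;',
    '`' => '&#96;',
    '{' => '&#123;',
    '}' => '&#125;',
);

# Hypothetical helper: escape a text string in one pass.
sub escape_text {
    my ($text) = @_;
    $text =~ s/([&'"<>`{}])/$ESCAPE{$1}/g;
    return $text;
}

print escape_text(q(Tom & "Jerry" <b>{x}</b>)), "\n";
# prints: Tom &amp; &quot;Jerry&quot; &lt;b&gt;&#123;x&#125;&lt;/b&gt;
```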

Page 29: Sanitizing HTML 5 with Perl 5

1 Introduction

2 HTML parser choice

3 HTML5::Sanitizer internals

4 HTML5::Sanitizer usage

  Usage

  Profile

  Examples

  Debugging

5 Conclusion

Page 30: Sanitizing HTML 5 with Perl 5

Usage

    # construct object
    my $sanitizer = HTML5::Sanitizer->new(
        profile => 'My::Profile',
    );

    # call process()
    my $clean = $sanitizer->process($html);

Page 31: Sanitizing HTML 5 with Perl 5

Profile

you have to build your own

class with just one method: element($tag)

return undef or a hashref with:

remove: remove complete sub tree (boolean)

rename_tag: rename tag (string)

set_attributes: set these attributes (hashref)

check_attributes: check/transform these attributes (hashref)

set_class: set class (string)

add_class: add class from other attributes (hashref)
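A profile along these lines might look like the sketch below. Several details are assumptions rather than documented behaviour: that `element` is invoked as a class method, that the tag name should be lower-cased before lookup, and that an empty hashref means "keep the element as it is". The rule table `%RULES` is a hypothetical helper.

```perl
use strict;
use warnings;

{
    package My::Profile;

    # Rules per tag, using the keys listed above.
    my %RULES = (
        p      => {},
        script => { remove => 1 },
        big    => { rename_tag => 'span', set_class => 'big' },
        a      => { set_attributes => { rel => 'nofollow', target => '_blank' } },
    );

    # Return the rule hashref for a tag, or undef if the tag is unknown.
    sub element {
        my ($class, $tag) = @_;
        return $RULES{ lc $tag };
    }
}

my $rule = My::Profile->element('big');
print "$rule->{rename_tag}.$rule->{set_class}\n";    # prints "span.big"
```

Keeping the rules in one table makes it straightforward to export the same list of allowed tags to the frontend.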

Page 34: Sanitizing HTML 5 with Perl 5

Examples - script

completely remove <script> (including all children)

    {
        remove => 1,
    }

otherwise it would be converted to <span> and all its children processed recursively

Page 37: Sanitizing HTML 5 with Perl 5

Examples - big

<big> → <span class="big">

    {
        rename_tag => 'span',
        set_class  => 'big',
    }

Page 39: Sanitizing HTML 5 with Perl 5

Examples - a

add rel="nofollow" and target="_blank" to every link

    {
        set_attributes => {
            rel    => 'nofollow',
            target => '_blank',
        },
    }

Page 41: Sanitizing HTML 5 with Perl 5

Examples - font

    rename_tag => 'span',
    add_class  => { size => 'size_font' },

    sub class_size_font {
        my ($self, $val) = @_;
        return unless $val;
        return 'size-xx-large' if $val eq '7';
        # ...
        return 'size-xx-small' if $val eq '1';

        return 'size-larger'  if $val =~ /^\+/;
        return 'size-smaller' if $val =~ /^-/;
        return;
    }
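How `add_class => { size => 'size_font' }` ends up calling `class_size_font` is not shown on the slide. One plausible reading is that the value names a profile method with a `class_` prefix, which receives the attribute's value. The sketch below works under that assumption; `Demo::Profile` and `classes_for` are hypothetical names.

```perl
use strict;
use warnings;

{
    package Demo::Profile;    # hypothetical profile class

    # Map the value of a <font size="..."> attribute to a CSS class name.
    sub class_size_font {
        my ($self, $val) = @_;
        return unless $val;
        return 'size-xx-large' if $val eq '7';
        # sizes 2 to 6 are elided on the slide and left out here as well
        return 'size-xx-small' if $val eq '1';
        return 'size-larger'   if $val =~ /^\+/;
        return 'size-smaller'  if $val =~ /^-/;
        return;
    }
}

# Assumed dispatch: for every attribute => name pair in add_class, call
# the profile method "class_<name>" with the attribute's value.
sub classes_for {
    my ($profile, $add_class, $attrs) = @_;
    my @classes;
    for my $attr (sort keys %$add_class) {
        my $method = $profile->can( 'class_' . $add_class->{$attr} ) or next;
        my $class  = $profile->$method( $attrs->{$attr} );
        push @classes, $class if defined $class;
    }
    return join ' ', @classes;
}

# <font size="+1"> would yield the class "size-larger"
print classes_for('Demo::Profile', { size => 'size_font' }, { size => '+1' }), "\n";
```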

Page 43: Sanitizing HTML 5 with Perl 5

Debugging

if the result is not as expected, you can access intermediate results:

    my $res = $sanitizer->process($html, { return_result => 1 });

    # see HTML5::Sanitizer::Result
    say $res->input;
    say $res->preprocessed;
    say $res->parsed_doc->toString;
    say $res->converted_doc->toString;
    say $res->output;

    print $res->debug_output;

Page 44: Sanitizing HTML 5 with Perl 5

Repositories

HTML5::Sanitizer (backend)

http://github.com/xing/html5-sanitizer

wysihtml5 (javascript frontend)

http://github.com/xing/wysihtml5

Feedback? [email protected]

Page 47: Sanitizing HTML 5 with Perl 5

Questions?
