Sanitizing HTML 5 with Perl 5
HTML5::Sanitizer: Sanitizing HTML 5 with Perl 5
Uwe Voelker
XING AG
August 16th 2011
Uwe Voelker HTML5::Sanitizer
1 Introduction
2 HTML parser choice
3 HTML5::Sanitizer internals
4 HTML5::Sanitizer usage
5 Conclusion
Task: WYSIWYG editor
integrate a WYSIWYG editor into XING
the frontend architect researched open source solutions
none was suitable, mostly for security reasons
the decision was made to build it in-house
goals: secure, share profiles (allowed tags) between frontend and backend
Team
Christopher Blum: JavaScript
Ingo Chao: QA (HTML5/CSS)
Uwe Voelker: Perl
Live example
HTML parsers on CPAN
HTML::Parser
HTML::TreeBuilder
HTML::TreeBuilder::LibXML
XML::LibXML
HTML::HTML5::Parser
Marpa::HTML
...
started with HTML::HTML5::Parser (HH5P)
because it understands the semantics of HTML 5 tags
but it also did this:
input: http://example.com/?section=2&copy=3&lang=en
output: http://example.com/?section=2©=3&lang=en
final choice: XML::LibXML
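The mangled URL comes from HTML's legacy named character references: a handful of names such as `copy` are recognized even without a trailing semicolon, so a query string containing `&copy=3` can be read as the copyright sign followed by `=3`. The following is a minimal sketch of that effect, and of the usual remedy of writing `&` as `&amp;` before a URL is embedded in HTML. The `%legacy` table and `naive_decode` are hypothetical names written for this illustration; this is not HH5P code.

```perl
use strict;
use warnings;

# Tiny, illustrative subset of legacy entity names that HTML parsers
# accept without a trailing semicolon (hypothetical table for this sketch).
my %legacy = (copy => "\x{A9}", reg => "\x{AE}", amp => '&');

# Naive decoder: replaces "&name" (semicolon optional) when the name is known.
sub naive_decode {
    my ($text) = @_;
    $text =~ s/&(copy|reg|amp);?/$legacy{$1}/g;
    return $text;
}

my $url = 'http://example.com/?section=2&copy=3&lang=en';

# Unescaped: "&copy" is swallowed and becomes the copyright sign.
my $broken = naive_decode($url);

# Escaped first: every "&" is written as "&amp;", so decoding restores the URL.
(my $escaped = $url) =~ s/&/&amp;/g;
my $intact = naive_decode($escaped);
```

Here `$broken` no longer contains the `copy` parameter, while `$intact` is identical to the original `$url`.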
Processing phases
preprocessing (e.g. migration)
parsing (HTML → DOM tree)
converting (rebuild tree according to profile)
writing (DOM tree → HTML)
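The four phases can be read as a pipeline in which each phase consumes the result of the previous one. The sketch below is not HTML5::Sanitizer's implementation. `run_phases` is a hypothetical helper, and the phase bodies are deliberately crude stand-ins so that the data flow runs with core Perl only: a line-ending fix, a tag-splitting "parser", a filter that strips `<blink>` tags, and a join. Keeping every intermediate value is an assumption made here because it helps when a result looks wrong.

```perl
use strict;
use warnings;

# Run named phases in order, feeding each one the previous result and
# remembering every intermediate value.
sub run_phases {
    my ($input, @phases) = @_;
    my %seen  = (input => $input);
    my $value = $input;
    while (my ($name, $code) = splice @phases, 0, 2) {
        $value = $code->($value);
        $seen{$name} = $value;
    }
    return ($value, \%seen);
}

# Hypothetical stand-ins for the real phases: the "tree" is just a list of
# tokens, which is enough to show how data moves between phases.
my @phases = (
    preprocess => sub { my ($html) = @_; $html =~ s/\r\n/\n/g; $html },
    parse      => sub { [ split /(<[^>]+>)/, $_[0] ] },
    convert    => sub { [ grep { length($_) && !/^<\/?blink>$/ } @{ $_[0] } ] },
    write      => sub { join '', @{ $_[0] } },
);

my ($clean, $seen) = run_phases("<b>hi</b><blink>!</blink>\r\n", @phases);
```

After the call, `$clean` holds the final string and `$seen` maps each phase name to what that phase produced.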
Parsing HTML with XML::LibXML
    use XML::LibXML;

    my $parser = XML::LibXML->new(
        encoding          => 'UTF-8',
        recover           => 2,
        keep_blanks       => 1,
        no_cdata          => 1,
        expand_entities   => 1,
        no_network        => 1,
        suppress_errors   => 1,
        suppress_warnings => 1,
    );
    my $doc = $parser->parse_html_string(
        $html,
        {
            no_cdata          => 1,
            suppress_errors   => 1,
            suppress_warnings => 1,
        },
    );
Converting - rebuilding DOM tree
loop through every node (only ELEMENT and TEXT nodes)
drop unwanted elements completely (e.g. <script>)
change unknown elements to <span>
change the tag name if the profile says so
transform (or copy) attributes
proceed recursively with child nodes
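These steps can be sketched as a recursive function. To stay runnable with core Perl, the sketch works on a plain nested-hash tree rather than XML::LibXML nodes. `%profile`, `convert`, and the node layout are assumptions made for illustration, not the module's internals. The sketch also assumes that an unknown element is one without a profile entry and that an empty rule keeps the tag unchanged.

```perl
use strict;
use warnings;

# Hypothetical profile: tag name => rule. A missing entry means
# "unknown element"; an empty rule is assumed to keep the tag as-is.
my %profile = (
    p      => {},
    b      => { rename_tag => 'strong' },
    a      => { set_attributes => { rel => 'nofollow' } },
    script => { remove => 1 },
);

# Rebuild a node. Elements are hashrefs { tag, attrs, children };
# text nodes are plain strings. Returns the new node, or nothing when dropped.
sub convert {
    my ($node) = @_;
    return $node unless ref $node;              # text node: copy as-is

    my $rule = $profile{ $node->{tag} };
    return if $rule && $rule->{remove};         # drop the whole subtree

    my $tag = !$rule              ? 'span'      # unknown element
            : $rule->{rename_tag} ? $rule->{rename_tag}
            :                       $node->{tag};

    # This sketch only sets attributes from the rule; checking and copying
    # the original attributes is left out.
    my %attrs = $rule ? %{ $rule->{set_attributes} || {} } : ();

    return {
        tag      => $tag,
        attrs    => \%attrs,
        children => [ map { convert($_) } @{ $node->{children} || [] } ],
    };
}

my $tree = {
    tag => 'p', attrs => {}, children => [
        'Hello ',
        { tag => 'b',       attrs => {}, children => ['world'] },
        { tag => 'script',  attrs => {}, children => ['alert(1)'] },
        { tag => 'marquee', attrs => {}, children => ['!'] },
    ],
};
my $clean = convert($tree);
```

In the rebuilt tree the `<script>` subtree is gone, `<b>` has become `<strong>`, and the unknown `<marquee>` has become a `<span>` whose text child was processed recursively.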
Writing HTML
mainly for additional escapes
could not find a nice way to integrate this in XML::LibXML
    $text =~ s/&/&amp;/g;
    $text =~ s/'/&#39;/g;     # '
    $text =~ s/"/&quot;/g;    # "
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/`/&#96;/g;
    $text =~ s/\{/&#123;/g;
    $text =~ s/\}/&#125;/g;
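A chain of substitutions like this depends on its order: `&` has to be escaped first, otherwise the ampersands introduced by the later replacements would be escaped again. Below is a sketch of an order-independent alternative that escapes every character in a single pass through a lookup table. `%escape` and `escape_text` are names invented here, and the decimal character references are one possible spelling, not necessarily the one the module uses.

```perl
use strict;
use warnings;

# Characters to escape and their replacements (decimal references are an
# assumption; named or hexadecimal references would work as well).
my %escape = (
    '&' => '&amp;',
    "'" => '&#39;',
    '"' => '&quot;',
    '<' => '&lt;',
    '>' => '&gt;',
    '`' => '&#96;',
    '{' => '&#123;',
    '}' => '&#125;',
);

# Escape all listed characters in one pass, so no replacement is re-escaped.
sub escape_text {
    my ($text) = @_;
    $text =~ s/([&'"<>`{}])/$escape{$1}/g;
    return $text;
}

my $out = escape_text(q{<a href="x">Tom & 'Jerry' {`}</a>});
```

Because each input character is looked at exactly once, adding another character to the table never requires thinking about ordering.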
Usage
    # construct object
    my $sanitizer = HTML5::Sanitizer->new(
        profile => 'My::Profile',
    );

    # call process()
    my $clean = $sanitizer->process($html);
Profile
you have to build your own
class with just one method: element($tag)
return undef or a hashref with:
    remove            remove complete sub tree (boolean)
    rename_tag        rename tag (string)
    set_attributes    set these attributes (hashref)
    check_attributes  check/transform these attributes (hashref)
    set_class         set class (string)
    add_class         add class from other attributes (hashref)
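A sketch of what such a profile class could look like. The package name `My::Profile` is taken from the usage example, but the rule table is invented for illustration and the underscore spelling of the keys is assumed. Whether HTML5::Sanitizer calls `element` on a class or on an instance is also an assumption, as is treating an empty rule as "keep the tag as it is". The last lines call the method directly, without the sanitizer.

```perl
use strict;
use warnings;

package My::Profile;

# Hypothetical rule table: tag name => rule hashref.
my %rules = (
    p      => {},                            # assumed: empty rule keeps the tag
    script => { remove => 1 },               # drop whole subtree
    big    => { rename_tag => 'span', set_class => 'big' },
    a      => { set_attributes => { rel => 'nofollow', target => '_blank' } },
);

sub new { my ($class) = @_; return bless {}, $class }

# The single method a profile provides: a rule hashref for known tags,
# undef for unknown ones.
sub element {
    my ($self, $tag) = @_;
    return $rules{ lc $tag };
}

package main;

my $profile = My::Profile->new;
my $big     = $profile->element('BIG');      # rule for <big>
my $unknown = $profile->element('marquee');  # no rule for this tag
```

Keeping the rules in one table makes it easier to share the list of allowed tags with other code, which was one of the stated goals.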
Examples - script
completely remove <script> (including all children)
    {
        remove => 1,
    }
otherwise it would be converted to <span>
and all children processed recursively
Examples - big
<big> → <span class="big">
    {
        rename_tag => 'span',
        set_class  => 'big',
    }
Examples - a
add rel="nofollow" and target="_blank" to every link
    {
        set_attributes => {
            rel    => 'nofollow',
            target => '_blank',
        },
    }
Examples - font
    rename_tag => 'span',
    add_class  => { size => 'size_font' },

    sub class_size_font {
        my ($self, $val) = @_;
        return unless $val;
        return 'size-xx-large' if $val eq '7';
        # ...
        return 'size-xx-small' if $val eq '1';

        return 'size-larger'  if $val =~ /^\+/;
        return 'size-smaller' if $val =~ /^-/;
        return;
    }
Debugging
if the result is not as expected, you can access intermediate results:
    my $res = $sanitizer->process($html, { return_result => 1 });

    # see HTML5::Sanitizer::Result
    say $res->input;
    say $res->preprocessed;
    say $res->parsed_doc->toString;
    say $res->converted_doc->toString;
    say $res->output;

    print $res->debug_output;
Repositories
HTML5::Sanitizer (backend)
http://github.com/xing/html5-sanitizer
wysihtml5 (javascript frontend)
http://github.com/xing/wysihtml5
Feedback? [email protected]
Questions?