Structurally tackling input problems
Erik Poll
Digital Security
Radboud University Nijmegen
This week
Approaches to structurally root out / tackle input problems:
• LangSec (Language-theoretic Security)
• Using types to keep track of
• different kinds of data and
• different trust levels
As exemplified by Google’s Trusted Types initiative,
which allows API hardening in the browser.
Two related, recurring themes:
• different languages/formats/protocols/notations/encodings/…
• parsing of such formats, notations, encodings, …
Story so far
Most security problems arise due to input:
• memory corruption (buffer overflows, NULL dereferencing, use-after-free, …)
• integer overflows
• race conditions aka TOCTOU aka non-atomic check & use
• ‘injection attacks’:
SQL injection, path traversal, command injection, HTML injection, XSS,
format string attacks, SSI injection, LDAP injection, XPath injection,
XXE, deserialization attacks, Macros in Word and Excel, XML/Zip bombs,
uploading .exe files, injecting PHP files, …
• …
Solutions so far (discussed last week)
We can prevent some input problems with
• Validation of inputs
• Sanitisation of inputs or outputs
• Making sure to do canonicalisation whenever we do this
• Better still: use safe(r) APIs
– eg Prepared Statements aka Parameterised Queries
Of course, in addition to trying to prevent problems,
we should also try to detect them and mitigate potential impact
Last week: validation ≠ sanitisation
• Sanitisation (eg replacing < with &lt;) is a compensation for a
weakness in the design
– Need to sanitise comes from choice to use certain technologies/APIs,
eg SQL, XML, HTML, JavaScript
– Need is external to the use case, but intrinsic to technologies/APIs
used
• Validation (eg rejecting 31/11/2021 as a date) is not such a compensation
– Need to reject invalid data stems from the use case/application
• So validation of input is needed even if our APIs are immune to
injection attacks
– Need is inherent to the use case, external to the software
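The difference can be made concrete with a minimal sketch. The function names isValidDate and escapeHtml are illustrative assumptions, not part of any API mentioned in these slides. Validation answers a question about the use case (is this a real date?), whereas sanitisation only exists because the output ends up in HTML.

```javascript
// Validation: does this DD/MM/YYYY string denote a real calendar date?
// The need for this check comes from the use case, not from any technology.
function isValidDate(text) {
  const m = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(text);
  if (m === null) return false;
  const day = Number(m[1]);
  const month = Number(m[2]);
  const year = Number(m[3]);
  // Date.UTC silently normalises impossible dates (31 Nov becomes 1 Dec),
  // so build the date and check that nothing was changed.
  // Side effect of Date.UTC: years 0000-0099 are rejected as well.
  const d = new Date(Date.UTC(year, month - 1, day));
  return d.getUTCFullYear() === year &&
         d.getUTCMonth() === month - 1 &&
         d.getUTCDate() === day;
}

// Sanitisation: only needed because the data will be embedded in HTML.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;")
          .replace(/</g, "&lt;")
          .replace(/>/g, "&gt;")
          .replace(/"/g, "&quot;")
          .replace(/'/g, "&#39;");
}

console.log(isValidDate("31/11/2021")); // false: November has 30 days
console.log(escapeHtml("<b>"));         // &lt;b&gt;
```

Even if every API in the application were immune to injection, isValidDate would still be needed; escapeHtml would not.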
Two classes of input flaws
The I/O attacker model ( = ‘hacking’)
Garbage In, Garbage Out (GIGO) quickly becomes
Malicious Garbage In, Security Incident Out
Attacker goals:
• Remote code execution, DoS, or anything in between
Means: exploiting unwanted behaviour, which can be
– weird, buggy behaviour (eg buffer overflow overwriting return address, integer overflow, ROP, …)
– unwanted access to/triggering of normal behaviour (eg SQLi, XSS, Word Macros, …)
[Figure: malicious input reaches the application via I/O]
Two types of security flaws
[Figure: two types of flaws.
1. Processing Flaws: malicious input → application; a bug! Eg buffer overflow in PDF viewer.
2. Forwarding/Injection Flaws: malicious input → application → back-end service (eg SQL query); (abuse of) a feature!]
Bugs vs features
[Figure: 1. Processing Flaws are a bug (eg buffer overflow in PDF viewer); 2. Forwarding/Injection Flaws are (abuse of) a feature (eg SQL query forwarded by the application to a back-end service)]
Two types of input problems
1. Buggy parsing & processing
– Bug in processing input causes application to go off the rails
– Classic example: buffer overflow in a PDF viewer, leading to remote
code execution
This is unintended behaviour, introduced by mistake
2. Flawed forwarding (aka injection attacks)
– Input is forwarded to back-end service/system/API, to cause damage
there
– Classic examples: SQL injection, path traversal, XSS, Word macros
This is intended behaviour of the back-end, introduced
deliberately, but exposed by mistake by the front-end
Parsing is always involved
[Figure: 1. Processing Flaws: buggy parsing, eg of a pdf file, in the application. 2. Forwarding/Injection Flaws: unwanted parsing, eg of user input as SQL, in the back-end service.]
[Figure: input enters the application through a parser. Sometimes the application unparses data before forwarding it, and sometimes the back-end service then does additional parsing.]
Theme: (not) treating user input as code
Key observation: safe APIs such as Parameterised Queries
• prevent treating user data as ‘commands’
• do not parse/interpret/execute user data as some form of code
– In a classic buffer overflow user data is treated as code,
as user-supplied shell code is executed as binary code.
– Word macros are clearly commands, but unzipping is also a limited
form of execution
Anti-pattern: treating user data as commands/code
More back-ends, more languages, more problems
[Figure: malicious input reaches a web server, which forwards it to several back-ends: the SQL database (SQL injection), the OS (command injection), the web browser (XSS), the file system (path traversal) and the C library (format string attack)]
LangSec
(language-theoretic security)
LangSec
• Interesting look at root causes of large class of security problems,
namely problems with input
• Useful suggestions for dos and don’ts
• The ‘Lang’ in ‘LangSec’ refers to input languages,
not programming languages,
though some input languages will include programming languages…
[Sergey Bratus & Meredith Patterson, ‘The science of insecurity’, CCC 2012]
Motivation: the never ending story where
• Attackers keep discovering
– new bugs or features to exploit
– new input possibilities or data flows to trigger these
– new ways to by-pass protection mechanisms
• Defenders keep
– patching bugs
– adding or tweaking protection mechanisms
    • adding validation and sanitisation
    • adjusting allow- & deny-lists used to validate or sanitise
    • adding VPNs, firewalls, WAF, anti-virus, …
– adding features
Can’t we write code and use components, technologies, and APIs that are inherently robust & secure, by construction?
Example issue with sanitisation aka escaping aka encoding
Example: Chrome used to crash on the URL http://%%30%30
• %30 is the URL-encoding of the character 0
• So %%30%30 is the URL-encoding of %00
• %00 is the URL-encoding of the null character
So %%30%30 is a double-encoded null character
Some code deep inside Chrome performed a second URL-decoding (as a
well-intended ‘service’ to its client code?) and then some other code crashed
on the resulting null character.
How could this bug have been detected, statically or dynamically?
Or prevented by better design?
Moral: having encoded data around makes validation harder!
Note that encoding is the opposite of canonicalisation:
it introduces different representations of the same data.
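The double-decoding problem can be reproduced in a few lines. This is a sketch under assumptions: percentDecodeOnce is a deliberately lenient hand-written decoder (the built-in decodeURIComponent would throw on this malformed input), and canonicalise is a hypothetical helper, not Chrome's actual code. It shows that a check for null characters passes on the raw input and after one decoding, and only fails after the second decoding.

```javascript
// Lenient %-decoder: decodes every %XX it finds and leaves everything else alone.
function percentDecodeOnce(s) {
  return s.replace(/%([0-9A-Fa-f]{2})/g,
                   (match, hex) => String.fromCharCode(parseInt(hex, 16)));
}

function containsNul(s) {
  return s.includes("\u0000");
}

const input = "%%30%30";
const once = percentDecodeOnce(input);  // "%00": still looks harmless
const twice = percentDecodeOnce(once);  // the null character itself

console.log(containsNul(input), containsNul(once), containsNul(twice));

// One possible defence: canonicalise first (decode until nothing changes),
// and only then validate the canonical form.
function canonicalise(s) {
  let current = s;
  for (let round = 0; round < 8; round++) {
    const next = percentDecodeOnce(current);
    if (next === current) return current;
    current = next;
  }
  throw new Error("too many layers of encoding");
}

console.log(containsNul(canonicalise(input)));
```

Decoding to a fixpoint is only one design choice; rejecting any input that is still encoded after one decoding step is stricter and may be preferable.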
Example: surprising complexity & expressivity – file names
Windows supports many notations for file names
• classic MS-DOS notation C:\MyData\file.txt
• file URLs file:///C|/MyData/file.txt
• UNC (Uniform Naming Convention) \\192.1.1.1\MyData\file.txt
which can be combined in fun ways, eg file://///192.1.1.1/MyData/file.txt
Some notations cause unexpected behaviour by involving other protocols, eg
• UNC paths to remote servers are handled by SMB protocol
• SMB sends password hash to remote server to authenticate: pass the hash
This can be exploited by SMB relay attacks
– CVE-2000-0834 in Windows telnet
– CVE-2008-4037 in Windows XP/Server/Vista
– CVE-2016-5166 in Chromium
– CVE-2017-3085 & CVE-2016-4271 in Adobe Flash
– ZDI-16-395 in Foxit PDF viewer
[Example thanks to Björn Ruytenberg, https://blog.bjornweb.nl]
Root cause: The Tower of Babel
A typical interaction with software, say on the Web, involves
many languages, formats, and protocols:
HTTP, HTML, CSS, JavaScript, URLs
DNS, TLS, X509 certificates, TCP/IP (IPv4 or IPv6),
jpeg, mpeg, mp4, png, gif, pdf, .docx,
user names, email addresses, .ics,
file names, directories, OS commands,
SQL, LDAP, JSP, PHP, XML, JSON, …
ASCII, Unicode, UTF-8, …
Ethernet, Wifi, Bluetooth, GSM/3G/4G/5G, …
Some handled – parsed - by app or browser,
some by lower protocol layers,
some by external programs & services
This provides a HUGE attack surface of HUGE complexity
Data gets en/decoded or (un)parsed as it moves through the
technology stack
[Figure: information (eg the date 19th of November) is held as a Date in the App, serialised/unparsed/pretty printed (eg with toString), and sent down through HTTP, TLS, TCP/IP and Wifi/4G. At the Server it comes up through Ethernet, TCP/IP, TLS and HTTP, is de-serialised/parsed into a Date again, and is passed on to the database, OS and file system. Every layer is attack surface, for eg buffer overflow or injection.]
[Figure: the same stack, with an HTML renderer, image library and viewer as further components that process the data.]
Root causes identified by LangSec
1. (Too) many languages & formats
2. These languages are often complex & unclearly defined and combined
3. The code handles all these languages & formats in a sloppy way,
– as the success of fuzzing demonstrates
– as the prevalence of memory corruption & injection attacks shows
Processing input
Processing involves
1) parsing/lexing
2) interpreting/executing
Eg interpreting a string as filename, URL, or email address,
or executing an OS command, a piece of JavaScript, or an SQL statement
This relies on some language or format
Step 1) above relies on syntax of this language
Step 2) above relies also on semantics of this language
Processing input is dangerous!
Different ways for an attacker to abuse input
• wasting resources (e.g. a zip-bomb)
• crashing things (and causing DoS)
• abusing unwanted functionality that is accidentally exposed
Such functionality can be
– Normal functionality of eg. SQL database or the OS,
– Bizarre, buggy functionality caused by eg. a buffer overflow
Buggy processing of inputs provides a weird machine that the attacker can
‘program’ to abuse the system
Classic example: ROP (return oriented programming)
Example problem: combining languages
X509 certificates involve several languages & formats.
Differences in interpretation caused various security flaws:
• Multiple Common Names
Eg certificate for facebook.com, mafia.com
Handled differently in different browsers. Why is this even allowed?
• ASN.1 attacks
A null terminator in an ASN.1 BER-encoded string in a CommonName can trick a CA into issuing a certificate for unauthorized parties
• PKCS#10-tunneled SQL injection
You could have an SQL command inside a BMPString, UTF8String or
UniversalString used as PKCS#10 Subject Name
[Dan Kaminsky, Meredith L. Patterson, and Len Sassaman,
PKI Layer Cake: New Collision Attacks Against the Global X.509 Infrastructure]
Anti-pattern: shotgun parsers
Handwritten code that incrementally parses & interprets input, in a
piecemeal fashion
An example shotgun parser: spot the defect
char buf1[MAX_SIZE], buf2[MAX_SIZE];
// make sure url is valid URL and fits in buf1 and buf2:
if (!isValid(url)) return;
if (strlen(url) > MAX_SIZE - 1) return;
// copy url excluding spaces, up to first separator, ie. first '/', into buf1
out = buf1;
do { // skip spaces
    if (*url != ' ') *out++ = *url;
} while (*url++ != '/');
strcpy(buf2, buf1);
...
[Code sample from presentation by Jon Pincus]
Flaw: the loop fails to terminate for URLs without a '/'.
Exploited by the Blaster worm.
Why not parse the URL in one go into some URL object or datatype?
Eg as part of the isValid()
method?
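As a hedged illustration of that suggestion, the sketch below parses the whole URL once and hands a structured value to the rest of the code. It relies on the standard URL class available in browsers and Node.js; parseHttpUrl and the shape of its result are assumptions made for this example.

```javascript
// Parse once, up front: either we get a fully parsed URL value, or we get null.
// No later code needs to scan the raw string for separators again.
function parseHttpUrl(text) {
  let url;
  try {
    url = new URL(text); // throws a TypeError on input it cannot parse
  } catch (e) {
    return null;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;
  return Object.freeze({
    scheme: url.protocol.slice(0, -1),
    host: url.hostname,
    path: url.pathname,
  });
}

console.log(parseHttpUrl("http://example.com"));    // the path is "/" even without a slash in the input
console.log(parseHttpUrl("no separator anywhere")); // null
```

The case that trips up the shotgun parser (a URL without a '/') is handled by the parser itself: the caller never has to search for a separator that might not be there.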
Root causes identified by LangSec (revisited)
Obstacles / anti-patterns in producing code without input
vulnerabilities
• Input languages that are too complex
– often caused by unchecked development of input languages,
with eg. standards evolving and adding new features over time
• Ad-hoc & imprecise notion of validity
– causing parser differentials
eg web-browsers parsing X509 certificates in different ways
• Mixing input recognition & processing in shotgun parsers
All this results in weird machines
LangSec principles to prevent input problems
No more hand-coded shotgun parsers, but
1. precisely defined input languages
eg with EBNF grammar
2. generated parser
3. complete parsing before processing
Also, don’t substitute strings & then parse,
but parse & then substitute in parse tree
(eg parameterised query instead of dynamic SQL)
4. keep the input language simple & clear
So that bugs are less likely
So that you give minimal processing power to attackers
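The principles above can be sketched for a toy input language. Everything here is an assumption made for illustration: the two-command language, the names parseCommand and execute, and the use of a single regular expression in place of a generated parser (reasonable only because this toy language is regular). In EBNF the language is: command = "GET ", key | "SET ", key, " ", number; where key is 1 to 16 lowercase letters and number is 1 to 9 digits.

```javascript
// Recogniser for the complete input language; anything else is rejected.
const COMMAND = /^(?:GET ([a-z]{1,16})|SET ([a-z]{1,16}) ([0-9]{1,9}))$/;

// Step 1: complete parsing. The result is a parse tree, not a string.
function parseCommand(line) {
  const m = COMMAND.exec(line);
  if (m === null) throw new SyntaxError("input is not in the language");
  return m[1] !== undefined
    ? { kind: "get", key: m[1] }
    : { kind: "set", key: m[2], value: Number(m[3]) };
}

// Step 2: processing. This code only ever sees parse trees.
function execute(store, command) {
  if (command.kind === "set") {
    store.set(command.key, command.value);
    return command.value;
  }
  return store.has(command.key) ? store.get(command.key) : null;
}

const store = new Map();
execute(store, parseCommand("SET x 42"));
console.log(execute(store, parseCommand("GET x")));
```

Because the language is this small, an attacker gets very little to ‘program’ with, and input such as "SET x 42; GET y" is rejected before any processing happens.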
Preventing input problems the LangSec way
[Figure: malicious input passes through a parser before reaching the application]
LangSec approach:
• Simple & clear language spec
• Generated parser code
• Complete parsing before processing
Weird machine = the strange functionality accidentally
exposed by code that (incorrectly) processes input
Attackers can ‘program’ this weird machine with their
malicious input!
Minimise the resources & computing power that input handling gives
to attackers
All parsers should be equivalent.
And parsers should be the exact inverse of the pretty printers /
unparsers
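A small sketch of what ‘exact inverse’ means, using a comma-separated record format invented for this example (the escaping scheme and all function names are assumptions). The naive join/split pair is not an inverse pair, so a field containing a comma is silently turned into two fields; the second pair round-trips in both directions.

```javascript
// Naive pair: unparsing and parsing disagree as soon as a field contains ",".
const unparseNaive = (fields) => fields.join(",");
const parseNaive = (line) => line.split(",");

// Inverse pair: "\" escapes itself and ",". Records have at least one field.
function unparse(fields) {
  return fields.map((f) => f.replace(/[\\,]/g, (c) => "\\" + c)).join(",");
}

function parse(line) {
  const fields = [""];
  for (let i = 0; i < line.length; i++) {
    const c = line[i];
    if (c === "\\") {
      const next = line[i + 1];
      // Reject anything the unparser could never have produced.
      if (next !== "\\" && next !== ",") throw new SyntaxError("invalid escape");
      fields[fields.length - 1] += next;
      i++;
    } else if (c === ",") {
      fields.push("");
    } else {
      fields[fields.length - 1] += c;
    }
  }
  return fields;
}

const record = ["a,b", "c\\d"];
console.log(parseNaive(unparseNaive(record)).length); // 3 fields instead of 2
console.log(parse(unparse(record)));                  // the original 2 fields
```

Rejecting escapes that unparse never emits is what makes the pair inverse in both directions: every accepted line has exactly one spelling.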
Tackling injection attacks
(aka forwarding flaws)
Recall forwarding flaws
Anti-patterns & patterns ?
[Figure: malicious input → application → back-end service (eg SQL query): (abuse of) a feature!]
Anti-pattern: input escaping
• Input escaping, ie. processing inputs to escape dangerous
meta-characters, is a bad idea
– at the point of input, the context in which inputs will be used
(eg as path name, in SQL query, or as HTML) is unclear, and
different contexts require different solutions
• Output escaping makes more sense, because there the context is
known
– but there it can be unclear which data originates from input;
we come back to that later
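[Figure: input validation (rejecting invalid input) happens in the parser where input enters the application; output sanitisation (aka escaping, to make output harmless) happens where the application forwards data to the back-end service]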
Anti-pattern: string concatenation
• Standard recipe for security disaster:
1. concatenate several pieces of data, some of them user input,
2. pass the result to some API
• Classic example: SQL injection
• Note: string concatenation is the inverse of parsing
Anti-pattern: strings
The use of strings in itself is already troublesome
be it char*, char[], String, string, StringBuilder, ...
• Strings are useful, because you use them to represent many things:
eg. name, file name, email address, URL, bit of SQL, HTML,…
• This also makes strings dangerous:
1. Strings are unstructured & unparsed data, and processing
often involves some interpretation (incl. parsing)
2. The same string may be handled & interpreted in many
– possibly unexpected – ways
3. A single string parameter in an API call can – and often does –
hide a very expressive & powerful language
Remedy: Parameterised queries
Note that parameterised queries
• reduce the expressive power of the interface to the back-end
• avoid unparsing in the front-end
– and hence avoid re-introducing parsing in the back-end
• replace an API call that takes a single string as argument
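To make the contrast concrete without depending on a particular database driver, the sketch below uses a hypothetical tagged-template helper called sql. It never splices values into the query text: it produces the text with numbered placeholders (the $1 style is an assumption borrowed from PostgreSQL) plus a separate list of values, which is the shape a parameterised-query API takes instead of a single string.

```javascript
// Hypothetical helper: keeps the query text (code) and the values (data) apart.
function sql(strings, ...values) {
  // strings[0] + "$1" + strings[1] + "$2" + ...
  const text = strings.reduce((acc, part, i) => acc + "$" + i + part);
  return Object.freeze({ text, values });
}

const userName = "x' OR '1'='1";

// Anti-pattern: unparse into one string, which the back-end must parse again.
const concatenated = "SELECT * FROM users WHERE name = '" + userName + "'";

// Remedy: the user input never becomes part of the text that is parsed as SQL.
const parameterised = sql`SELECT * FROM users WHERE name = ${userName}`;

console.log(concatenated);       // the attacker's quote has changed the query structure
console.log(parameterised.text); // SELECT * FROM users WHERE name = $1
console.log(parameterised.values);
```

The interface has become less expressive: the caller can no longer influence the structure of the query through userName, only a value inside it.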
Remedy: Types (1) to distinguish languages
• Instead of using strings for everything,
use different types to distinguish different kinds of data
Eg different types for HTML, URLs, file names, user names, …
but also URL-encoded vs URL-decoded data,
HTML-encoded vs HTML-decoded data, etc.
• Advantages
– Types provide structured data
– No ambiguity about the intended use of data
Remedy: Types (2) to distinguish trust levels
• Use information flow types to track the origins of data
and/or to control destinations
– Eg distinguish untrusted user input vs compile-time constants
The two uses of types, to distinguish (1) languages or (2) trust levels,
are orthogonal and can be combined.
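A minimal sketch of both uses of types, with the caveat that JavaScript has no static types: the class below enforces at run time what a statically typed language could enforce at compile time. SafeHtml, render and escapeHtml are names invented for this example. The type records the language (this is HTML, not just any string), and the restricted constructor records the trust level (untrusted data can only get in via escape).

```javascript
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;")
          .replace(/"/g, "&quot;").replace(/'/g, "&#39;");
}

// Only code in this module knows TOKEN, so only it can call the constructor.
const TOKEN = Symbol("SafeHtml constructor token");

class SafeHtml {
  #html;
  constructor(token, html) {
    if (token !== TOKEN) throw new TypeError("use SafeHtml.escape() or SafeHtml.concat()");
    this.#html = html;
  }
  // The only way in for untrusted data: it is escaped on entry.
  static escape(untrusted) {
    return new SafeHtml(TOKEN, escapeHtml(String(untrusted)));
  }
  // Combining safe pieces stays safe; plain strings are refused.
  static concat(...parts) {
    for (const part of parts) {
      if (!(part instanceof SafeHtml)) throw new TypeError("concat expects SafeHtml values only");
    }
    return new SafeHtml(TOKEN, parts.map((part) => part.toString()).join(""));
  }
  toString() {
    return this.#html;
  }
}

// A 'hardened' sink: it accepts SafeHtml and rejects plain strings.
function render(html) {
  if (!(html instanceof SafeHtml)) throw new TypeError("render expects SafeHtml, not a plain string");
  return html.toString();
}

console.log(render(SafeHtml.escape("<script>")));
```

With this in place, forgetting to escape is no longer a silent bug: passing a plain string to render fails immediately.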
Example: fighting XSS
HTML injection & XSS
HTML injection: sneaking malicious payload into a victim’s browser &
let it be rendered as HTML
XSS: special case of HTML injection, where the payload is or contains
JavaScript, which is executed by the JavaScript engine in the victim’s
browser
For XSS it is normal that input is forwarded multiple times before it
does its damage, in both reflected and stored XSS attacks
• most XSS attacks are at least 2nd order
Reflected, stored & DOM-based XSS
• Reflected XSS
Malicious payload goes from the victim’s machine to the web server and back,
to the JavaScript engine in the browser
• Stored XSS
Attacker lets the web server store their malicious payload, and the server
passes it on to the victim (another user of the same website) later
• DOM-based XSS
The malicious payload is passed around by JavaScript inside
browser/app, using DOM API and other APIs, to ultimately end up
in a place where it is rendered as HTML or executed as JavaScript
Reflected XSS attack
Attacker crafts malicious URL containing JavaScript
https://google.com/search?q=<script>...</script>
and tempts victim to click on this link
[Figure: malicious URL → web server → HTML response containing <script> </script> → browser]
Stored XSS attack
Attacker injects HTML, incl. scripts, into a web site; this is stored at that
web site and echoed back later when a victim visits the same site
[Figure: malicious input → web server → database; later, HTML containing malicious content → browser of another user of the same website]
DOM-based XSS
Attacker somehow (reflected or stored) feeds malicious input as parameters
into JavaScript code in the victim’s browser, where it ultimately ends up
being interpreted as HTML / executed as JavaScript via the DOM API
[Figure: inside the browser, lots of JavaScript (eg library.js) passes data to the DOM API]
Background: common encodings on the web
• HTML encoding
< > & " ' replaced by &lt; &gt; &amp; &quot; &#39;
• URL encoding aka %-encoding
/ ? = % # replaced by %2F %3F %3D %25 %23
space replaced by %20 or +
Try this out at https://duckduckgo.com/?q=%2F+%3F%3D
• JavaScript string literal encoding
' replaced by \'
Eg 'this is a JS string with a \' in the middle'
• CSS encoding
• Unicode encoding
• base64 encoding
• …
Example modern website code
Code from a photo sharing website, with the user-chosen album name in the variable name
var escapedName = goog.string.htmlEscape(name); // HTML-encoding
var jsEscapedName = goog.string.escapeString(escapedName); // JS string literal encoding
elem.innerHTML = '<a onclick="createAlbum(\' ' + jsEscapedName + '\')">' + escapedName + '</a>';
Idea:
• HTML-encoding with htmlEscape gets rid of scripts in malicious album name
• JavaScript string literal encoding with escapeString makes sure album name
John's Birthday works ok when fed as JS string to JS function createAlbum
[Example from Christoph Kern, Securing the Tangled Web, CACM 2014]
The second use of escapedName is rendered as HTML; jsEscapedName is the JS string argument
to createAlbum( … )
Spot the XSS bug!
var escapedName = goog.string.htmlEscape(name); // HTML-encoding
var jsEscapedName = goog.string.escapeString(escapedName); // JS string literal encoding
elem.innerHTML = '<a onclick="createAlbum(\' ' + jsEscapedName + '\')">' + escapedName + '</a>';
Attack: malicious name ');attackScript();//
HTML-escaped this becomes &#39;);attackScript();//
JS-escaped this remains &#39;);attackScript();//
So innerHTML becomes
<a onclick="createAlbum(' &#39;);attackScript();// ')">&#39;);attackScript();//</a>
The browser HTML-unescapes the value of the onclick attribute before evaluating it as JS:
createAlbum(' ');attackScript();//')
so attackScript(); will be executed
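The effect of the escaping order can be replayed outside the browser. The three functions below are simplified stand-ins written for this sketch: they are not goog.string.htmlEscape or goog.string.escapeString, and htmlUnescape only imitates what an HTML parser does to an attribute value before the JavaScript engine sees it. Under those assumptions, the order used on the slide lets the quote through, while the reverse order (JS-escape first, then HTML-escape) keeps the whole name inside the string literal.

```javascript
function htmlEscape(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;")
          .replace(/"/g, "&quot;").replace(/'/g, "&#39;");
}

function jsStringEscape(s) {
  return s.replace(/\\/g, "\\\\").replace(/'/g, "\\'").replace(/"/g, '\\"');
}

// Imitates the browser decoding an attribute value (&amp; is decoded last).
function htmlUnescape(s) {
  return s.replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&quot;/g, '"')
          .replace(/&#39;/g, "'").replace(/&amp;/g, "&");
}

// Returns the code that the JavaScript engine finally gets to evaluate.
function onclickSeenByJsEngine(name, firstEscape, secondEscape) {
  const attributeValue = "createAlbum('" + secondEscape(firstEscape(name)) + "')";
  return htmlUnescape(attributeValue);
}

const attack = "');attackScript();//";

// Order from the slide: HTML-escape first, then JS-escape.
const buggy = onclickSeenByJsEngine(attack, htmlEscape, jsStringEscape);
// Reverse order: JS-escape first, then HTML-escape.
const fixed = onclickSeenByJsEngine(attack, jsStringEscape, htmlEscape);

console.log(buggy); // createAlbum('');attackScript();//')
console.log(fixed); // createAlbum('\');attackScript();//')
```

In the fixed variant the quote reaches the JavaScript engine with its backslash intact, so the payload stays a single string argument. The escaping has to mirror, in reverse, the order in which the browser decodes the nested languages.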
Why XSS is so hard to prevent
• Many sources & sinks, with complex data flows between them
• Many different types of data
• URLs, URL parameters, HTML snippets, JavaScript snippets
JavaScript strings, …
with different trust levels
• safe HTML without scripts (ie. already validated and/or escaped)
• unsafe HTML, possibly with scripts
• HTML with scripts that we trust
and different forms of encoding (aka escaping)
• URL-encoding
• HTML-encoding
• JavaScript string literal encoding
Moral of this example
• Complex data flows are impossible to keep track of:
– Which strings have been escaped/encoded/validated, and how?
– Which strings can be trusted and which can come from an
attacker?
• Programmer cannot keep track of this
• Compiler & type-checker could keep track of this
– but then it needs more type information
Trusted Types for DOM Manipulation
Google’s Trusted Types initiative [https://github.com/WICG/trusted-types]
replaces string-based DOM API with a typed API
– using types such as TrustedHTML, TrustedScript and TrustedScriptURL
(the names in the current API; earlier drafts also had TrustedURL)
– ‘hardened’ aka ‘safe’ APIs for back-ends which auto-escape
inputs or reject untrusted inputs
Released as a Chrome browser feature
[https://developers.google.com/web/updates/2019/02/trusted-types]
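A hedged sketch of how application code can use this API. trustedTypes.createPolicy and the createHTML callback are part of the Trusted Types browser API; the policy name "album-name", the helper names and the fallback for environments without the API are assumptions made for this example.

```javascript
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;")
          .replace(/"/g, "&quot;").replace(/'/g, "&#39;");
}

// `trustedTypes` only exists in browsers that implement the API;
// elsewhere this sketch falls back to a plain object with the same method.
const albumPolicy =
  typeof trustedTypes !== "undefined"
    ? trustedTypes.createPolicy("album-name", { createHTML: escapeHtml })
    : { createHTML: escapeHtml };

// In a supporting browser this returns a TrustedHTML value, not a string.
function albumLabel(name) {
  return albumPolicy.createHTML(name);
}

// Intended use in a page:
//   elem.innerHTML = albumLabel(name);
console.log(String(albumLabel("<img src=x onerror=alert(1)>")));
```

When the page is served with the Content-Security-Policy directive require-trusted-types-for 'script', supporting browsers refuse to assign a plain string to innerHTML, so every value that reaches that sink has to come through a policy like this one.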
Beyond types: extending programming language
Wyvern programming language by Jonathan Aldrich et al.
allows domain-specific extensions, eg extensions
where HTML and SQL are ‘built-in’ types of the programming language
Added advantage over types: more convenient syntax
[D. Kurilova et al, Wyvern: Impacting Software Security via
Programming Language Design, PLATEAU 2014, ACM]
Conclusions
Many security problems arise in handling input:
• buggy parsing
• unintended parsing due to forwarding
Ironically, parsing is a well-understood area of computer science
Constructive remedies to tackle this
• Have clear, simple & well-specified input languages
• Generate parser code
• Don’t use strings
• Do use TYPES, to distinguish languages & trust levels
• Have ‘safe’ or ‘hardened’ APIs, that are immune to injection and
prevent untrusted input being used incorrectly
To read
• The LangSec manifesto LangSec: Recognition, Validation, and
Compositional Correctness for Real World Security
• Wang et al., If It's Not Secure, It Should Not Compile: Preventing DOM-
Based XSS in Large-Scale Web Development with API Hardening, ICSE'21,
ACM/IEEE, 2021
• Poll, Strings considered harmful, USENIX ;login:, 2018