SWAN WG - Year 2011 Activity Report - WIDE

13
SWAN WG - Year 2011 Activity Report Gregory BLANC ([email protected]) Youki KADOBAYASHI ([email protected]) 2011/12/31 1 Introduction The SWAN WG carries out research in the field of Web 2.0 application security. SWAN stands for Security for Web 2.0 ApplicatioN. It was founded and started its activities in June 2010. It aims to bring forward Web 2.0 application security issues in the WIDE project as these issues are becoming more and more critical to services offered on top of the Internet. As a matter of fact, the IETF has also witnessed the recent foundation of the WebSec WG this year and SWAN actually preempted this creation. 1.1 What is Web 2.0? There is no strict difference between Web 1.0 and Web 2.0 but it is universally understood that Web 1.0 applications rely mainly on the HTTP protocol to download pages in a synchronous pattern. On the other hand, Web 2.0 applications do involve abundant processing on the client side through embedded scripts transferring data to the server, even asynchronously, without the user experienc- ing delays. Web 2.0 does not rely on any partic- ularly recent technology but on technologies that have been spreading the Web since its early years. JavaScript and XML, at the origin of the coined word Ajax (Asynchronous JavaScript with XML, 2005)[15], were technologies designed in the mid- 1990 s. However what characterize Web 2.0 ap- plications are their content-richness, their collab- oration features (user participation, folksonomies, social networks), their ability to syndicate contents (aggregate sites, feeds, mashups) as well as their ex- tensive use of the Ajax framework to perform dy- namic, asynchronous HTTP transactions. Other essential objects comprise the Document Object Model (DOM), which is usually modified dynam- ically to avoid reloading web pages, JS Object No- tation (JSON) objects, which are used for the seri- alization of structured data during XmlHTTPRe- quest (XHR) object transactions. Further informa- tion can be found in [19]. 1.2 Web 2.0 Security Issues For many years, web applications have been threat- ened by attacks such as SQL injections, direc- tory traversals, buffer overflows, command injec- tions and the likes. Then, the increasing popu- larity of BBS promote stored cross-site scripting (XSS) among attackers. Classic attacks usually at- tempt to abuse the targeted web server in order to take control of it. On the contrary, new genera- tion attacks concentrate follow the paradigm shift prompted by Web 2.0 applications which are user- centric, cross-domain and rely on dynamic scripting technologies. As a matter of fact, new generation attacks often leverage scripting languages and trig- ger cross-domain bypasses to harm the user. Not to mention that Web 2.0 applications also often push much of the application logic to the browser allow- ing attackers to test their security as a whitebox. And vulnerabilities are not only found in the tar- geted application but also in script libraries, APIs or plugins the application uses, or in contents the application mashes from external origins. Attackers have therefore learnt how to take ad- vantage of such security holes to make their at- tacks stealthier and more massive. These attacks can leverage the user’s browser to carry out a num- ber of purposes ranging from information leakage, live keylogging, session riding or more elaborated schemes such as internal network fingerprinting, worm propagation or botnet management. Such sophisticated attacks are carried out through com- plex scripts hidden through layers of redirection and obfuscation and commonly known as malicious JavaScript or JS malware[23]. More information on Web 2.0 vulnerabilities and related attacks can be found in [8, 18] 1.3 Research Approaches The SWAN WG is basically approach-agnostic in that it allows anyone to join and to propose its own approach to contribute to the fight against the proliferation of web-based attacks. Of course, on- going projects do implement one specific approach 1

Transcript of SWAN WG - Year 2011 Activity Report - WIDE

Page 1: SWAN WG - Year 2011 Activity Report - WIDE

SWAN WG - Year 2011 Activity Report

Gregory BLANC ([email protected])Youki KADOBAYASHI ([email protected])

2011/12/31

1 Introduction

The SWAN WG carries out research in the fieldof Web 2.0 application security. SWAN stands forSecurity for Web 2.0 ApplicatioN. It was foundedand started its activities in June 2010. It aims tobring forward Web 2.0 application security issuesin the WIDE project as these issues are becomingmore and more critical to services offered on topof the Internet. As a matter of fact, the IETF hasalso witnessed the recent foundation of the WebSecWG this year and SWAN actually preempted thiscreation.

1.1 What is Web 2.0?

There is no strict difference between Web 1.0 andWeb 2.0 but it is universally understood that Web1.0 applications rely mainly on the HTTP protocolto download pages in a synchronous pattern. Onthe other hand, Web 2.0 applications do involveabundant processing on the client side throughembedded scripts transferring data to the server,even asynchronously, without the user experienc-ing delays. Web 2.0 does not rely on any partic-ularly recent technology but on technologies thathave been spreading the Web since its early years.JavaScript and XML, at the origin of the coinedword Ajax (Asynchronous JavaScript with XML,2005)[15], were technologies designed in the mid-1990  s. However what characterize Web 2.0 ap-plications are their content-richness, their collab-oration features (user participation, folksonomies,social networks), their ability to syndicate contents(aggregate sites, feeds, mashups) as well as their ex-tensive use of the Ajax framework to perform dy-namic, asynchronous HTTP transactions. Otheressential objects comprise the Document ObjectModel (DOM), which is usually modified dynam-ically to avoid reloading web pages, JS Object No-tation (JSON) objects, which are used for the seri-alization of structured data during XmlHTTPRe-quest (XHR) object transactions. Further informa-tion can be found in [19].

1.2 Web 2.0 Security Issues

For many years, web applications have been threat-ened by attacks such as SQL injections, direc-tory traversals, buffer overflows, command injec-tions and the likes. Then, the increasing popu-larity of BBS promote stored cross-site scripting(XSS) among attackers. Classic attacks usually at-tempt to abuse the targeted web server in order totake control of it. On the contrary, new genera-tion attacks concentrate follow the paradigm shiftprompted by Web 2.0 applications which are user-centric, cross-domain and rely on dynamic scriptingtechnologies. As a matter of fact, new generationattacks often leverage scripting languages and trig-ger cross-domain bypasses to harm the user. Not tomention that Web 2.0 applications also often pushmuch of the application logic to the browser allow-ing attackers to test their security as a whitebox.And vulnerabilities are not only found in the tar-geted application but also in script libraries, APIsor plugins the application uses, or in contents theapplication mashes from external origins.

Attackers have therefore learnt how to take ad-vantage of such security holes to make their at-tacks stealthier and more massive. These attackscan leverage the user’s browser to carry out a num-ber of purposes ranging from information leakage,live keylogging, session riding or more elaboratedschemes such as internal network fingerprinting,worm propagation or botnet management. Suchsophisticated attacks are carried out through com-plex scripts hidden through layers of redirectionand obfuscation and commonly known as maliciousJavaScript or JS malware[23]. More information onWeb 2.0 vulnerabilities and related attacks can befound in [8, 18]

1.3 Research Approaches

The SWAN WG is basically approach-agnostic inthat it allows anyone to join and to propose itsown approach to contribute to the fight against theproliferation of web-based attacks. Of course, on-going projects do implement one specific approach

1

Page 2: SWAN WG - Year 2011 Activity Report - WIDE

but this does not constrain all members to followthe same path; parallel, complementary or concur-rent approaches are also welcomed. Indeed, thereare many ways to deal with security issues in Web2.0 applications either from the client or server’sviewpoint or through the collaboration of the twosides. However it is usually agreed that server-sidecountermeasures are ineffective when attacks targetusers unless one is able to protect every web appli-cation on the Internet. Therefore, recent researchworks have been carried out on the client-side ex-clusively, except for secure mashup schemes[21, 25].

Some researchers have concentrated on thebrowser itself where many vulnerabilities lie andwhere exploitation takes place: initiatives aiming tosandbox the browser[16] or to enforce security poli-cies in the browser[22] have been proposed. Otherworks focused on how to prevent some kind of at-tacks such as XSS[20] or CSRF[24] by designingsome heuristics to characterize these attacks. To amuch more granular level, JavaScript being the de-facto standard in AJAX and in other programmingframeworks, attackers have naturally took advan-tage of it being enabled in the victims’ browsers.Consequently, some researchers took interest inhow to mitigate such attacks by trying to mini-mize the impact of the JavaScript language: securesubsets[29], interposition[35, 39] or other harden-ing techniques have been elaborated. An adverseapproach is not to restrain JavaScript in any waybut rather to analyze programs to detect any mali-cious behavior: VM-based execution[31, 33], datatainting to prevent XSS[38] or even control flowanalysis[17].

Though, we have favorized the latter approach insome of our works, we do not restrain WG membersfrom contributing using any other approach.

2 Second Year Activities

From its inception, the SWAN WG has decided toconcentrate on a rather precise objective in orderto appeal to the community and launch projectsas fast as possible. We launched a first project, theWeb2Sec Testbed, a testbed to accommodate large-scale practical Web 2.0 application security-relatedexperiments. The project does not only intend tobuild a practical testbed on top of the WIDE cloudbut also to develop tools to be used within thetestbed. While building a testbed seems relativelyeasy given the efforts deployed in other WIDE WGssuch as WIDE-Cloud or ds1, the WebSec Testbed isyet to be constructed. However, the SWAN WG hasbeen concentrating on developing analysis tools, es-pecially to analyze JS malware which is a hot topic

in Web 2.0 security.For that purpose, we designed and started de-

veloping a proxy-based solution to analyze JS mal-ware. This led to several publications on how to de-tect and extract obfuscation in JS malware[3] andhow to automate deobfuscation to provide staticanalysis of JS malware[4, 5].

3 Web2Sec Testbed

The testbed is an all-purpose tool in that it allowsdesigning large-scale experiments on both the of-fensive and the defensive side. By collaboratingwith the WIDE-cloud WG, we propose to deploy atestbed to perform experiments on Web 2.0 secu-rity issues. Similar to previous works from StanfordUniversity[7], the deployment is two-fold. We pro-pose on one end cloud services that will simulate avulnerable and/or hostile Web 2.0 environment andon the other end, users will be provided with virtualmachine (VM) images comprising several browsers(and several personalities) equipped with some plu-gins for defense and analysis, as well as several toolsto perform security audit or attacks.

3.1 Motivation

The main motivation for such testbed is the one forany testbed: we cannot perform large-scale experi-ments in the wild and need to recreate a practicalenvironment to provide containment of our exper-iments. Furthermore, large-scale attacks on socialnetworks have already been witnessed and it is le-gitimate to try to understand propagation schemesbetter in order to mitigate these. But this cannotbe reproduced on a small network, thereby the needfor a larger experiment environment.

3.2 Overview

Figure 1 shows a simple overview of the tools wewill deploy and develop for the Web2Sec testbed.As stated previously, the deployment is two-fold.On the server-side are present Web services we willprovide to create the experiment environment andon the client-side are listed tools that will be in-cluded in the VM image: aside from browsers, prox-ies, vulnerability scanners, automation engines, at-tack frameworks will also be featured. On the rightside of Figure 1, we present some tools we wish todevelop in order to support experiments carried outin the testbed. At least 3 tools are to be engineered:a web security scanner for Web 2.0 applications, anAJAX monitor to track AJAX transactions gener-ated by the browser and a JavaScript debugger to

2

Page 3: SWAN WG - Year 2011 Activity Report - WIDE

Figure 1: Overview of the tools to be deployed inthe Web2Sec testbed

analyze JS programs and their behavior.

3.3 Applications

The Web2Sec testbed will support both offensiveand defensive experiments in the realm of the Web2.0. It can be used to perform sophisticated webapplication security evaluations and measure theirimpact on different browser personalities. We canalso evaluate independently the security of differ-ent browsers against different type of attacks. Itcan also be used to foresee next-generation attacksand monitor the propagation of Web 2.0 attack vec-tors. Instrumentation of browsers and applicationscan also provide several interesting measures. Notto mention that such tool can also be used for edu-cation purposes as it was done in previous researchworks cited in Section 3.4

3.4 Related Works

Though testbed is not a recent research topic, therehave not been any proposal on web security ori-ented testbeds until 2010. Contributions are notrevolutionary but rather attempt to shed light onthe critical issues the web community incurs. Un-surprisingly, most of the proposed testbeds were de-signed and used for educational purpose, the mostoutstanding being the Webseclab[7] used in websecurity classes in both Stanford and CMU. Ourtestbed basically follows the same design with onone end the cloud service and on the other endVM images. Webseclab’s cloud service does actu-ally perform class administration tasks while everystudent gets a VM distribution that contains theclass exercises and tools to resolve them. Connect-ing to the cloud service allows for rating, download-ing class materials and updating contents. Othernotable testbeds have been the Blunderdome[27]

which is rather an academic multi-layer service of-fensive security testbed simulating a university net-work and moth[6], a VM image that contains a setof vulnerable web applications and scripts for test-ing web application security scanners or static codeanalysis tools.

4 JavaScript Analysis Proxy

One of the other main projects of the SWAN WGis the development of a JS analysis proxy.

4.1 Motivation and Approach

Web 2.0 applications make an important use of theJavaScript (JS) language through the AJAX frame-work and it has proven to be a critical componentof modern applications in terms of security, to theextent that is it often considered safer for users todisable JS in the browser. However, there is animportant shortfall regarding user-experience andsome popular applications are simply not accessiblewithout JS enabled, not to mention the addictionof users to eye-candy interfaces.

Therefore, it has become obvious that such usersare in need of systems able to provide them witha safe and usable Web experience. Earlier workshave already tackled how to prevent attacks eitheron the server-side or on the client-side but oftenrely on heuristics an attacker can mimick. Lately,it has been understood that Web 2.0 security is-sues are more and more exploited through client-side vulnerabilities, notably by abusing the Same-Origin Policy (SOP) weakness and injecting remoteJS payloads through Cross-Site Scripting (XSS)vulnerabilities. Attack payloads are often obfus-cated and infected pages use many anti-analysistechniques, making straightforwared dynamic anal-ysis techniques somehow difficult to use. Manyproposals have also focused on analysis but fail toprovide a consistent context, especially in context-dependent obfuscations against which forensic anal-ysis is useless.

Additionally, a recent technical report fromGoogle[34] pointed out several weaknesses of cur-rent web malware detectors. In particular, itdistinguishes four prevalent classes of detectors:virtual machine-based detectors (web honeypots),emulation-based detectors which are the succes-sors of virtual machine-based detectors, reputation-based detectors (blacklists) and signature-based de-tectors (antiviruses). The authors of this reportsurveyed circumvention techniques in modern webmalware:

3

Page 4: SWAN WG - Year 2011 Activity Report - WIDE

• social engineering are a set of techniques thatrequire the end-user interaction in order to cir-cumvent passive web honeypots;

• cloaking are a set of techniques that preventanalysis from emulation-based environmentsby detecting their presence;

• redirection are a set of techniques to hide mal-ware distribution websites from blacklists byleveraging a redirection network made of nu-merous domains redirecting to each other;

• obfuscation are a set of techniques that renderthe code of program unintelligible to a humanuser and thwart signature-matching.

The authors propose to combine existing ap-proaches in order to increase effectiveness.

On the other hand, we remark that actualweb malware detectors that propose to analyzeJavaScript are often execution-based which leadto undesired effects. First, the side-effects of ex-ecution have prompted attackers to design anti-analysis techniques to thwart dynamic analysis suchas cloaking and obfuscation. Second, executing amalware can result in exploitation if the execu-tion is not properly contained, prompting an offlineanalysis, that is analysis does not take place dur-ing browsing the Web. This has the consequence torender offline analyzers less attractive to end-usersdue to a low usability. Indeed, it is difficult to co-erce users into analyzing scripts on a web malwareanalysis website each time they suspect a page ofcontaining malware.

We also remark obfuscation is not really con-sidered an issue by past research: either it isseen as an indicator of malice[9, 26] (while it isnot[33]) or it is seen as trivially deobfuscated duringexecution[11, 36, 12]. However, recent advances incloaking have demonstrated a hardening of obfusca-tion techniques that are able to stop deobfuscationin the case an analysis environment is detected. Of-ten, benign contents are produced in order to con-fuse analysis.

We propose to apply a static, that is execution-less, approach to JS malware. The above remarksindicate that such analysis should take place on un-obfuscated malware in order to be successful. Sincestatic examination of obfuscated JS code cannotyield any result, it is necessary that we design amethod to reliably cancel obfuscation while pre-venting execution of the JS code.

4.2 Threat Model

A common scenario unfolds as follows (cf. Fig. 2):

Figure 2: Typical Web infection threat model sce-nario

• step 1: the victim browses an infected web ap-plication (infected at step 0);

• step 2: the server will process requests fromthe user and may include infected contents inits response;

• step 3: the malware content possesses somespecificities that makes it possible to bypassdeployed security devices;

• step 4: the browser parses the web contentsand eventually executes embedded scripts ordownload linked scripts that will be then ex-ecuted. The user can also be trapped into aredirection or have a malicious iframe injectedinto the page she is browsing;

• step 5: upon execution, the script does some-thing harmful and might communicate resultsback to an attacker’s remote server or simplytake silent actions that will profit the attackeror else make the victim download some mal-ware;

• in most advanced scenarios, several attacks arecombined to produce more massive harm lever-aging either the participatory or social charac-teristics of Web 2.0 applications, or the looselysecured internal network of the victim.

One may opt that such scenarios do seldom occur.However, isolated infected users might not noticethey are part of a massive attack, and companiesoften fail (sometimes on purpose) to report on suchevents. Hence, the publicity of large-scale Web 2.0attacks remain lower than reality.

4.3 Overview

JS malware are stealthy, polymorphic and highlyuse obfuscation to conceal its malicious intent. Itis often hidden through several layers of redirec-tion and obfuscation. Since, we cannot rule thatan obfuscated script is malicious[33], we need to

4

Page 5: SWAN WG - Year 2011 Activity Report - WIDE

Domain A (malicious)

Domain C (vulnerable)

Domain P (target)

remote

inclusion

dynamically

generated

requests

Prefetching

Deobfuscation

mashing

Decision

11

1111

11

12

13

141414

15

16

17

18

18

ICAP server

End-user PC

Web server

Application

cluster

Malware

database

HTTP

request

HTTP

response

Cross-domain

inclusion

Proxy

communication

Proxy

13

Internal

process

obfuscatedscript

webpage

aggregatedscript page

decodingroutine

safeweb page

13

maliciousscript

benignscript

Figure 3: Overview of the Proposed Solution

deobfuscate it to ensure a sounder and more com-plete analysis. Besides, we require static analysismethods in order to maximize code coverage in asingle pass. However to fight against different anti-analysis tricks specific to obfuscation schemes, weneed to convey enough usable context to the an-alyzer. This can be done by parsing suspiciouslinked contents and fetching potential decipheringscripts. In order to tackle client-side JavaScript, wedecided to deploy our solution on the client-side andto avoid putting any trust on the server-side sinceit is likely to be infected. However, since we needa consistent context in order to conduct JS pro-gram analysis, a realtime processing becomes nec-essary, which implies new constraints on our designsolution. On one hand, program analysis can be acomputation-intensive task and we should not im-pose it on the browser which is already busy withWeb 2.0 application rendering. Additionally, if wewere to partly execute suspicious code, we wouldrather avoid the browser performing such a riskytask. On the other hand, we are looking for a re-alistic browser context in order to analyse the JSprogram as precisely as possible, but since we areaiming at performing static analysis, the context of

the application and the personality of the browsercan be provided to a proxy.

For these reasons, we advocate the use of a proxy-based solution, we named (sak mis) which wouldsustain the load of realtime static analysis on JScandidates. The system is designed to work as fol-lows (please refer to Fig. 3):

1. (sak mis) is a proxy that intercepts HTTP re-quests issued by the end user and subsequentresponses returned by the server;

2. upon reception of an HTTP response, theproxy kicks off the prefetching stage:

3. the requested web page is parsed to detectscript inclusions, links and potential maliciouslocations (iframes, images, etc.). Script con-tents are then retrieved (prefetching) and in-lined into the original web page;

4. if newly downloaded contents also contain linksor inclusions to contents of interest, prefetch-ing is also performed on these contents. Thisscenario also applies when new inclusions areuncovered after deobfuscation;

5

Page 6: SWAN WG - Year 2011 Activity Report - WIDE

5. once prefetching is completed, the aggregatedscript page is sent to an external applicationserver;

6. the application server is responsible for au-tomating deobfuscation: obfuscated contentsand decoding routines are extracted first. Thedeobfuscation stage of the attack scenario isemulated by the server. The process is re-peated in case the deobfuscation yielded ob-fuscated contents as well;

7. once scripting contents can be directly inter-preted by the machine, the decision moduleapplies static analysis to extract a model ofthe script’s intents. This model is comparedto a knowledge base in order to infer whetherit is a model of malicious intents or not;

8. in case the script is benign, the deobfuscatedscript is injected back into the original webpage and served to the end user.

Since the goal is ultimately to decide on the mal-ice of a given JS program, the proposed solutionshould prepare the JS candidate to be analyzed inthe most efficient way. To do so, we need to de-obfuscate obfuscated contents in order to providereadable code from which we can compute abstractsemantics. But again, to achieve such task, weshould ensure enough code is actually available toreverse the cipher (in case of a cipher obfuscation).Obfuscation techniques actually rely on several lay-ers of links and redirections, as well as layers ofobfuscation, therefore we need to ensure that alllinked contents (scripts) are prefetched and mashedin order to perform deobfuscation. Therefore, theproposed system revolves around 3 modules: aggre-gation, deobfuscation, decision (analysis).

4.4 Recursive Pre-fetching of Suspi-cious Linked Contents

A malicious script is not expected to be monolithicand is often scattered across script files, DOM-embedded contents or iframes. In particular, linkedcontents injected via script or iframe tags may orig-inate from different domains. Therefore, the scriptfound in the original web page may not be mali-cious itself, or hard to be decided on. It is nec-essary to fetch these additional or linked contentsin order to have an accurate view of the script-ing contents involved in a web page. Another ex-ample is on-demand loading, where additional con-tents are only downloaded later during execution.Such behavior is implemented through the Xml-HTTPRequest (XHR) object which allows asyn-chronous HTTP communication. Pre-fetching here

will look for XHR and download script contents inadvance to evaluate its impact on the current scriptcontents. Prototype hijacking[32] has the potentialto override a benign function with a malicious one.

Since malicious scripts may be obfuscatedthrough several layers of obfuscations and redirec-tions; the two first processing steps are recursive.Once a deobfuscation step has been taken, the out-come may be still obfuscated but the decipheringroutine may not be directly present in the outputbut rather as a link, a DOM-embedded string oran iframe. Therefore, it is necessary to run thepre-fetching step again to gather scripting contentsused in the deobfuscation step. These steps are re-peated until the outcome is a plain interpretablescript with no linked or DOM-embedded contents.

4.5 Emulation-based Deobfuscation

Obfuscation is any transformation that render apiece of code unreadable, hence hindering analysisby a human or an automated analyst. Obfuscat-ing transformations span a wide set of techniquesranging from simple string splitting to encryption.

Obfuscated scripts, in particular, malicious ob-fuscated scripts carry out 4 common stages asstated in [11]: 1) redirection and cloaking, 2) de-obfuscation, 3) environment preparation and 4) ex-ploitation. This highlights the fact that some ob-fuscated scripts, e.g., encrypted ones, will eventu-ally deobfuscate themselves before further execu-tion. Such argument is always pointed out by dy-namic analysis advocates to support the fact thatobfuscation is not an issue, if not completely dis-carded by researchers that confuse obfuscation withmalice. But the boundary between deobfuscationand the following stages is not always clear andsome recently witnessed obfuscation schemes do in-terleave obfuscation with environment preparationand/or exploitation.

Contrary to past research that emulated a com-plete browser environment and just integrated aJavaScript engine, sometimes instrumented, wepropose to emulate down to the execution ofJavaScript, precisely at the deobfuscation stage(stage 2).

4.5.1 Approach

Since static methods cannot overcome the obstacleof obfuscation, emulation allows to reproduce theoutcome of the deobfuscation stage without incur-ring side-effects inherent to a real execution. Here,the emulation is not aimed at the whole script andshould only be carried out on the obfuscated path.The obfuscated path is comprised of all the instruc-

6

Page 7: SWAN WG - Year 2011 Activity Report - WIDE

tions involved in deobfuscating the script: it canbe further decomposed into the obfuscated stringsand the deciphering routine (or decoder), when itexists. The first step of deobfuscation is thereforeto extract the obfuscated path from the aggregatedscript parse tree, obtained after pre-fetching. Butonly the decoder is actually emulated, obfuscatedstrings being used as input.

In this research, we focus on JavaScript,which has been widely used in attack scenarios.JavaScript is both an object-oriented and func-tional language but the obfuscation schemes we areconsidering seldom make use of functional proper-ties of JavaScript (variable promotion, variable sub-stitution). Given that the deobfuscation stage can-cels a prior obfuscation to provide an executablescript to the next stage, the deobfuscation processis bound to terminate. We assume that the de-obfuscation stage ultimately converges to a uniquenormal form: the unobfuscated original script.

Concerned with issues of performance, soud-ness and completeness, we considered several ap-proaches to automate the deduction of obfuscatedstrings by the deciphering routine. Emulating JSinstructions through another scripting language ob-viously suffers from lack of completeness as well aspoor performance inherent to scripting languagesin general. To satisfy properties of soundnessand completeness, we considered formal approachessuch as theorem proving, but state-of-the-art the-orem provers can not completely emulate JS ob-fuscating transformations. Eventually, we turnedto rewriting systems. Maude[10] is such rewritingframework whose underlying logic is membershipequational logic. Meseguer[30] observed that equa-tional logic is very well suited to give executableaxiomatizations of imperative sequential languages.It was suggested to us that functional modules inMaude[10] could fit our requirements for sound andcomplete deduction of the outcome of JS instruc-tions. As a matter of fact, computation in func-tional modules is accomplished using the equationsas rewrite rules.

4.5.2 Maude and Membership EquationalLogic

Functional modules satisfy the membership equa-tional logic as well as the additional requirementof being confluent and terminating. Functionalmodules are thereby used to emulate deobfusca-tion: variables and objects are mapped to Maude’ssorts, instructions are emulated through equations.Computation is realized by using these equations asrewrite rules applied to the obfuscated strings, un-til a canonical form is found, i.e., the deobfuscated

script. The system supports the extraction andconversion steps to deobfuscate a script: at first, apreliminary analysis should allow the system to dis-ambiguate obfuscated contents from the rest of thecode, and then further isolate the deciphering rou-tine and the obfuscated strings. The decipheringroutine is then converted to a functional module’sequations which are then used to rewrite the obfus-cated string through reduction. Algorithm 1 givesa more detailed description of the processing thattakes place just after pre-fetching. Once the em-ployed obfuscation scheme has been detected, theobfuscated path is extracted and the decipheringroutine is converted into a Maude functional mod-ule. Upon deobfuscation, an additional step verifieswhether deobfuscation is still needed.

An advantage of Maude is that it employsterm-indexing techniques to achieve high speeds ofrewriting[37]. However, it does not support everyJS native construct, such as loops. Yet, we cantake advantage of conditional equations to emulaterecursions. Doing so requires additional process-ing to transform JS loops to recursive functions.With a similar approach to some previous works inbinary deobfuscation[1], we propose to apply auto-mated deduction to the deobfuscation of scriptinglanguages. In particular, for this work, we focusour efforts on JavaScript, a Web 2.0 de-facto stan-dard language embedded natively in every browserand actively used in attack scripts. However, we be-lieve that such approach can be applied to scriptinglanguages bearing similar properties to JS, such asActionScript (Flash) or VBScript.

Algorithm 1 Automated deduction of script in-structions1: obfstring, decroutine = extract(script)2: if script contains loops then3: script = loopToRecursion(loops)4: end if5: fmod = convert(decroutine)6: output = Maude.reduce(fmod,obfstring)7: if output contains links then8: output += prefetch(links)9: end if

10: if output is obfuscated then11: script = output12: repeat from line 113: end if

4.5.3 AST-based Extraction of ObfuscatedContents

Techniques to detect obfuscation in web scripts[9,26] have been proposed recently. The goal is usu-

7

Page 8: SWAN WG - Year 2011 Activity Report - WIDE

ally to detect malware assuming obfuscation is anindicator of malice. Both proposals make use of ma-chine learning methods whether on statistical stringfeatures (byte occurence, entropy, word size) or se-mantic features (JS keywords and symbols).

Contrary to these related works, we are not con-sidering the obfuscated string itself, but rather theobfuscating transformation or combination of ob-fuscating transformations, often implemented asautomated tools. This proposal is not a methodfor deobfuscation however; it seeks to exhibit sig-nificant features of obfuscating transformations inorder to automate their detection and deobfusca-tion.

In our approach, we consider the abstract syn-tax trees of scripts to analyze. We distinguish twodistinct phases: first, we attempt to learn obfus-cating transformations as subtrees in obfuscatedscripts’ ASTs; then, we try to identify the presenceof such obfuscating transformations by matchinglearned patterns in candidate scripts’ ASTs. Theexpression of a script as an AST is common to bothphases and is obtained by simply parsing the script.Our subtree matching prototype developed in Rubytakes advantage of the johnson library [2] whichsupports the SpiderMonkey [13] JavaScript engine.

Since we are concerned with reducing the entropyof script contents for the purpose of analysis, it be-came necessary to abstract the script code in orderto get rid of the randomization introduced in theidentifiers and values. An accurate and abstractrepresentation of a program is its abstract syntaxtree. However, the AST used here is different insome aspects from similar approaches [12]:

• values are discarded and replaced by generictypes: NUM for numeric values, STR forstring values, ID for identifiers;

• some identifiers for core objects,e.g., document, and functions, e.g.,String.fromCharCode(), are preservedin order to make explicit operations such asoverriding or aliasing of core objects;

• conditions of branching and looping are dis-carded and the whole construct is replaced bya couple 〈S , I〉 where S represents a symbol(either BRANCH or LOOP) and I the set ofchild nodes, which are instructions that formthe body of the branch or loop;

• all instructions in a block are represented atthe same level by sibling nodes, children of anode representing the containing block (whichcan be a branching control, a loop, a functiondefinition, etc.).

In this approach, we focus on the hierarchical prop-erties of tree-like structures.

Obfuscated contents represent part of the codeof JS malware and can therefore be considered assubtrees of the malware’s AST. Provided we canlearn recurring subtrees from obfuscted JS samples,it is possible to match such patterns in other JSASTs by subtree matching.

We can build a pushdown automaton (PDA) ac-cepting the selected subtrees. Then, any analyzedscript is transformed into an AST and fed to thePDA in order to match occurrences of the learnedsubtrees (the ones expressing obfuscation). Thistechnique has been proposed by Flouri et al. [14],who aim to apply or adapt algorithms, commonlyapplied to strings and sequences, to the field of treestructures. Strings are usually processed using fi-nite state machines as the model of computationand this naturally led to the use of pushdown au-tomata towards trees, since the addition of a stackaccommodates recursion. Thanks to syntactic rulesthat limit the number of combination between sym-bols in a programming language, employing a sub-tree automaton seems an effective solution that canscale with a large number of subtrees. Still, the sizeof these subtrees can badly affect performance andshould be limited.

Constructing a deterministic pushdown au-tomata accomodating a set of subtrees is done inthree distinct steps:

1. construction of a PDA accepting a set of sub-trees in their prefix notation

2. construction of a nondeterministic subtreematching PDA for a set of subtrees in theirprefix notation

3. transformation of the nondeterministic subtreematching PDA to an equivalent deterministicPDA

Each step employs well-known PDA constructionalgorithms which are further detailed in [14]. Inthis research, we implemented a script that parsesan XML file listing recurring subtrees and gener-ates the transition rules for the deterministic sub-tree matching PDA. We also modified a programemulating a finite state machine to accommodate astack. This program would parse the file containingthe transition rules, as well as a file containing thescript to be analyzed. The script to be analyzedis transformed into an AST and then expressed asits prefix notation to be fed to the subtree match-ing PDA. Whenever, there is a match, the matchedsubtree is reported, allowing to identify character-istic subtrees while traversing the script’s AST.

8

Page 9: SWAN WG - Year 2011 Activity Report - WIDE

<html><t i t l e>404 Not Found</ t i t l e></head><body><h1>Not Found</h1><p>The requested URL / index . php wasnot found on t h i s s e r v e r .</p><p>Addit iona l ly , a 404 Not Founde r r o r was encountered whi le t ry ing to usean ErrorDocument to handle the reques t .</p><hr></body></html><script language=JavaScr ipt>s t r = ”qndy ‘mh) ( : ” // the obfuscated s t r i n g i s// abbrev iated f o r the purpose o f b r ev i tys t r 2=”” ; f o r ( i = 0 ; i < s t r . l ength ; i ++){s t r 2=s t r 2+String . fromCharCode ( s t r . charCodeAt ( i ) ˆ 1 ) ; } ;eva l ( s t r 2 );</ script></html>

Figure 4: Original HTML code

4.5.4 Example

This example (see Fig. 4) is a simple eval unfoldingfeaturing a single loop that deciphers an obfuscatedstring via XOR operations. The script is includedinto an HTML file that displays a 404 error page toan unsuspecting user.

In the previous learning stage, we have learntone instance of eval unfolding loop and its AST.Figure 5 displays a sample AST for the followingeval unfolding loop:

for(i=0;i<str.length;i++){str2 = str2 +String.fromCharCode(str.charCodeAt(i)^1);str3 = str3 + str3};

eval (str2);

The LOOP statement contains two instructionsrepresented by two paths stemming from the LOOPnode: one being the actual decoder using theString.fromCharCode function and the other pathbeing a dummy operation on an unrelated variable.

We first extract the script contents from theHTML file, i.e., instructions comprised between the<script> tags, and parses the contents. The parsetree is analyzed to detect the obfuscation schemeby PDA-based subtree matching. Here, the ob-fuscation scheme uses a loop to process the obfus-cated string. This loop is converted to a recursivefunction whose body is the decoding routine. TheMaude system readily provides a predefined func-tional module that defines the string data type aswell as operators to manipulate string objects: thefromCharCode() function is mapped to Maude’schar operator, which converts an ASCII code to thecorresponding character; the charCodeAt() func-tion is emulated by the combination of two basicoperators, ascii, the inverse of char, and substr,the substring operator. The result of the conver-sion to a Maude functional module is displayed in

LOOP

ASSIGN

body

ASSIGN

body

root

CALL

ID

left

ADD

right

ID

op

CALL

op

ACCESS

function

XOR

param

String

caller

fromCharCode

callee

CALL

op

NUM

op

ACCESS

function

ID

param

ID

caller

charCodeAt

callee

ID

left

ADD

right

ID

op

ID

op

eval

function

ID

param

Figure 5: Abstract syntax tree of an eval unfoldingtransformation.

fmod TEST i sp ro t e c t i ng INT .p ro t e c t i ng STRING .op t e s t : Int S t r ing St r ing −> St r ing .var I : Int .vars S1 S2 : S t r ing .ceq t e s t ( I , S1 , S2 ) = S2 i f l ength ( S1 ) <= I .ceq t e s t ( I , S1 , S2 ) = t e s t ( ( I + 1) , S1 , S2 )+ char ( a s c i i ( subs t r (S1 , ( l ength ( S1 ) − I − 1 ) , 1 ) ) xor 1)

i f I < l ength ( S1 ) .endfm

Figure 6: Maude functional module

Figure 6. The workflow of the recursion is realizedthrough conditional equations.

4.6 Decision

Static methods are difficult to apply to obfuscatedstrings. Our method here applies to a deobfuscatedstring obtained after emulation. The intuition isthat we are able to tell what the script intends todo without executing it. By looking to what thescript offers to do, i.e., its functionalities, it is pos-sible to understand the actions a script might per-form upon execution. To that end, script contentsundergo static flow analysis on function calls or op-erators that are traced back to the root of an in-struction block. This analysis allows building func-

9

Page 10: SWAN WG - Year 2011 Activity Report - WIDE

tional units that decompose the script into blocksthat express a single functionality. The combina-tion or sequence of functionalities indicate what arethe intentions of the script.

4.6.1 Static Functional Unit Decomposi-tion

Algorithm 2 Block-based forward flow tracingfunctions = []params = []transitions = []for blk ∈ blocks do

for instr ∈ instructions doif instr ⊇ call ∈ calls then

functions[call].add(instr,blk)for par ∈ params do

params[par].add(instr,blk)end for

end ifend for

end forfor f ∈ functions do

for instr, blk ∈ f dofor instr′, blk′ ∈ f ′ do

if (instr, blk) = (instr′, blk′) thentransitions.add(instr,blk)

end ifend for

end forend for

transitions.sort()

A functional unit is a set of instructions thatexpress a single functionality. The term was firstcoined by Lu and Kan[28] and originally refers toa JavaScript instance, combined with all of (poten-tially) called subprocedures. Algorithm 2 describesthe steps taken to cluster instructions to function-alities and link these through their interaction. Ablock-based forward flow tracing approach is usedto list native function calls in every block of the pro-gram as well as instructions related to these func-tions. Functions manipulating common variablesare then seen as interacting, providing a logical linkbetween two clusters.

While Lu and Kan favored a top-down approach,we advocate a bottom-up linking approach in orderto focus on the functional characteristic of the clus-tered instructions. Besides, a JS function is not ex-pected to perform a single functionality, especiallymalicious ones. Our method applies on the parsetree (or its abstract syntax tree (AST) representa-tion) of a deobfuscated script and directly targetsexplicit functions. Explicit functions are functionsof which functionality is obvious and that have beenclassified by us. By tracing the flow of these func-tions, we can identify clusters of instructions thatexpress a single functionality (Fig. 7) and eventu-ally trace the flow of variables through these dif-ferent clusters. Such abstract model can be thencompared with a database of models. At the time

Figure 7: The proposed JavaScript functional unit

Figure 8: Colored output after processing

of writing, we are not able to predict the averagesize of such models. Examples usually feature 3 or4 functional units.

4.6.2 Example

The output of the Maude functional module (seeFig. 6) obtained in the previous step is then parsedand decomposed into functional units. This out-put is actually the result of emulating the decod-ing routine on the obfuscated string, which nor-mally yield the original unobfuscated code. Inthis deobfuscated code, the open() call issuedby the variable asq, which is an instance of themsxml2.XMLHTTP object allows clustering instruc-tions dedicated to the download of some data, whilethe SaveToFile() call from the ADODB streaminstance, asst isolates storage-related instructions.Finally, the shellexecute() function is linked to aShellApplication instance, ass that expresses anexecution functionality. The clustered instructionshave been colored differently in a modified output(see Fig. 8) of the Maude framework.

This output can be further abstracted to a func-

10

Page 11: SWAN WG - Year 2011 Activity Report - WIDE

Figure 9: Functional model computed from thesample file

tional model as shown in Figure 9. This ab-stract view offers a more straightforward modelof what activities the malicious script is actu-ally carrying out: after downloading contents froma remote URL, the contents of the response,asq.responseBody, are passed to a storage func-tionality that saves these into a file called imya andthis file is then inputted into an execution function-ality.

5 Conclusion and FutureWorks

For its second year of activity, the SWAN WG issteadily progressing. So far, we have proposed someapproaches to tackle Web 2.0 security issues withour JS analysis proxy, as well as a federating projectthat is the Web2Sec testbed.

In particular, the proxy is entering an advancedstage in its design since all modules are completelydefined. So far, its implementation has yieldedpromising proof-of-concepts. However, this stillneeds a little more work to reach completion, andobviously further evaluation is needed. Numericalresults and discussion can be found in the producedpublications[4, 3, 5].

As for the Testbed, it is in stand-by mode butshould be resumed in a near future.

We are still calling for other participants to pro-pose new projects or contribute on exisiting ones.In particular, we are interested in developing a JSdebugger, as well as other analysis tools targetedto other aspects of web applications (Flash, PDF,etc.) and web services. New approaches to analysissuch as formal methods or symbolic execution alsooffer plenty of research opportunities and perspec-tives.

References

[1] R. Ando. Parallel Analysis of PolymorphicViral Code Using Automated Deduction Sys-tem. In Proceedings of the 8th ACIS Inter-national Conference on Software Engineering,

Artificial Intelligence, Networking and Paral-lel/Distributed Computing, 2007.

[2] J. Barnette. johnson. Available at: http://github.com/jbarnette/johnson/.

[3] G. Blanc, M. Akiyama, D. Miyamoto, andY. Kadobayashi. Identifying CharacteristicSyntactic Structures in Obfuscated Scripts bySubtree Matching. In Proceedings of the antiMalware engineering WorkShop (MWS 2011),Oct. 2011.

[4] G. Blanc, R. Ando, and Y. Kadobayashi.Term-Rewriting Deobfuscation for StaticClient-Side Scripting Malware Detection. InProceedings of the 4th IFIP InternationalConference on New Technologies, Mobilityand Security (NTMS 2011), Feb. 2011.

[5] G. Blanc and Y. Kadobayashi. A Step TowardsStatic Script Malware Abstraction: RewritingObfuscated Script with Maude. IEICE Trans-actions on Information and Systems, E94-D(11):2159–2166, Nov. 2011.

[6] Bonsai Information Security. moth.http://www.bonsai-sec.com/en/research/moth.php.

[7] E. Bursztein et al. Webseclab Security Edu-cation Workbench. In Proceedings of the 3rdWorkshop on Cyber Security Experimentationand Test, Aug. 2010.

[8] R. Cannings, H. Dwidedi, and Z. Lackey.Hacking Exposed Web 2.0: Web 2.0 SecuritySecrets and Solutions. McGraw-Hill, 2007.

[9] Y. Choi, T. Kim, S. Choi, and C. Lee. Auto-matic detection for javascript obfuscation at-tacks in web pages through string pattern anal-ysis. In Proceedings of the 1st InternationalConference on Future Generation InformationTechnology, pages 160–172. Springer-Verlag,2009.

[10] M. Clavel, F. Duran, S. Eker, P. Lincoln,N. Martı Oliet, J. Meseguer, and C. Talcott.The Maude System. Available at: http://maude.cs.uiuc.edu.

[11] M. Cova, C. Kruegel, and G. Vigna. Detec-tion and Analysis of Drive-by Download At-tacks and Malicious JavaScript Code. In Pro-ceedings of the 19th International WWW Con-ference, 2010.

11

Page 12: SWAN WG - Year 2011 Activity Report - WIDE

[12] C. Curtsinger, B. Livshits, B. Zorn, andC. Seifert. ZOZZLE: Fast and Precise In-Browser JavaScript Malware Detection. InProceedings of the 20th USENIX SecuritySymposium, Aug. 2011.

[13] B. Eich. SpiderMonkey (JavaScript-C) Engine.Available at: http://www.mozilla.org/js/spidermonkey/.

[14] T. Flouri, J. Janousek, and B. Melichar.Subtree Matching by Pushdown Automata.Computer Science and Information Systems,7(2):331–357, Apr. 2010.

[15] J. J. Garrett. Ajax: A New Approach to WebApplications. http://www.adaptivepath.com/ideas/essays/archives/000385.php,2005.

[16] C. Grier, S. Tang, and S. T. King. Secure WebBrowsing with the OP Web Browser. In Pro-ceedings of the 2008 IEEE Symposium on Se-curity and Privacy (S&P 2008), May 2008.

[17] A. Guha, S. Krishnamurthi, and T. Jim. Us-ing Static Analysis for Ajax Intrusion Detec-tion. In Proceedings of the 18th InternationalWorld Web Wide Conference (WWW 2009),Apr. 2009.

[18] B. Hoffman and B. Sullivan. Ajax Security.Addison-Wesley (Pearson), 2007.

[19] A. Holdoner. Ajax: The Definitive Guide.O’Reilly, 2008.

[20] O. Ismail, M. Etoh, Y. Kadobayashi, andS. Yamaguchi. A Proposal and Implementa-tion of Automatic Detection/Collection Sys-tem for Cross-Site Scripting Vulnerability. InProceedings of 18th International Conferenceon Advanced Information Networking and Ap-plications (AINA 2004), Mar. 2004.

[21] C. Jackson and H. J. Wang. Subspace: Se-cure Cross-Domain Communication for WebMashups. In Proceedings of the 16th Interna-tional World Web Wide Conference (WWW2007), May 2007.

[22] T. Jim, N. Swamy, and M. Hicks. BEEP:Browser-Enforced Embedded Policies. In Pro-ceedings of the 16th International World WebWide Conference (WWW 2007), May 2007.

[23] M. Johns. On JavaScript Malware and Re-lated Threats. Journal in Computer Virology,4(3):161–178, Aug. 2008.

[24] M. Johns and J. Winter. RequestRodeo:Client Side Protection against Session Riding.In Proceedings of the OWASP Europe 2006Conference, May 2006.

[25] F. D. Keukelaere, S. Bhola, M. Steiner,S. Chari, and S. Yoshihama. SMash: SecureComponent Model for Cross-Domain Mashupson Unmodified Browsers. In Proceedings of the17th International World Web Wide Confer-ence (WWW 2008), Apr. 2008.

[26] P. Likarish, E. Jung, and I. Jo. Obfuscated ma-licious javascript detection using classificationtechniques. In 4th International Conference onMalicious and Unwanted Software, pages 47 –54, Oct. 2009.

[27] G. Louthan, W. Roberts, M. Butler, andJ. Hale. The Blunderdome: An Offensive Ex-ercise for Building Network, Systems, and WebSecurity Awareness. In Proceedings of the 3rdWorkshop on Cyber Security Experimentationand Test, Aug. 2010.

[28] W. Lu and M.-Y. Kan. Supervised Catego-rization of JavaScript using Program AnalysisFeatures. In Asian Information Retrieval Sym-posium. Springer-Verlag, 2005.

[29] S. Maffeis and A. Taly. Language-Based Iso-lation of Untrusted JavaScript. In Proceedingsof the 22nd IEEE Computer Security Founda-tions Symposium, 2009.

[30] J. Meseguer. Software Specification and Veri-fication in Rewriting Logic. Models, Algebrasand Logic of Engineering Software, 2003.

[31] A. Moshchuk, T. Bragin, D. Deville, S. Grib-ble, and H. Levy. SpyProxy: Execution-basedDetection of Malicious Web Content. In Pro-ceedings of the 16th Annual USENIX SecuritySymposium, 2007.

[32] S. D. Paola and G. Fedon. Subverting Ajaxfor Fun and Profit. In Proceedings of the 23rdChaos Communication Congress, 2006.

[33] N. Provos, D. McNamee, P. Mavrommatis,K. Wang, and N. Modadugu. The Ghost in theBrowser: Analysis of Web-based Malware. InProceedings of the 1st Workshop on Hot Topicsin Understanding Botnets, Apr. 2007.

[34] M. Rajab, L. Ballard, N. Jagpal, P. Mavrom-matis, D. Nojiri, N. Provos, and L. Schmidt.Trends in Circumventing Web-Malware Detec-tion. Technical Report rajab-2011a, Google,Inc,, July 2011.

12

Page 13: SWAN WG - Year 2011 Activity Report - WIDE

[35] C. Reis, J. Dunagan, H. J. Wang, andO. Dubrovsky. BrowserShield: Vulnerability-Driven Filtering of Dynamic HTML. In Pro-ceedings of the 7th USENIX Symposium onOperating Systems Design and Implementa-tion, Nov. 2006.

[36] K. Rieck, T. Krueger, and A. Dewald. Cujo:Efficient Detection and Prevention of Drive-by-download Attacks. In Proceedings of the26th Annual Computer Security ApplicationsConference, pages 31–39. ACM, 2010.

[37] N. Shankar. Automated Deduction for Verifi-cation. ACM Computing Surveys, 41(4), Oct.2009.

[38] P. Vogt et al. Cross Site Scripting Preven-tion with Dynamic Data Tainting and StaticAnalysis. In Proceedings of the 14th AnnualNetwork & Distributed System Security Sym-posium (NDSS 2007), Feb. 2007.

[39] D. Yu, A. Chander, N. Islam, and I. Serikov.JavaScript Instrumentation for Browser Secu-rity. In Proceedings of the 34th Annual ACMSIGPLAN - SIGACT Symposium on Princi-ples of Programming Languages, Jan. 2007.

13