Text Analytics for Mobile App Security and Beyond


Tao Xie, University of Illinois at Urbana-Champaign


Mobile App Markets

Apple App Store, Google Play, Microsoft Windows Phone

App Store beyond Mobile Apps!

What If Formal Specs Are Written?!


APP DEVELOPERS

APP USERS

App Functional Requirements

App Security Requirements

User Functional Requirements

User Security Requirements

informal: app description, etc. permission list, etc.

Informal App Functional Requirements: App Description


App Code

App Permissions

App Security Requirements: Permission List


Example Android App: Angry Birds


In reality, few of these requirements are (formally) specified!

Hope?! Bring humans into the loop: user perception + judgment


Our Yin-Yang View on Mobile App Security


App Description

App Code

App Permissions

User-Perceived Information

App Security Behavior

o Reason about user-perceived info, e.g., WHYPER

o Push app security behavior across the boundary

o Check consistency across the boundary

o Reduce user judgment effort

App UIs, App categories, App metadata, User forums, …

[functional]

[security]

Assuring Market Security/Privacy

o Apple (Market’s Responsibility)
  o Apple performs manual inspection
o Google (User’s Responsibility)
  o Users approve permissions for security/privacy
  o Bouncer (static/dynamic malware analysis)
o Windows Phone (Hybrid)
  o Permissions / manual inspection

Need More Than Program Analysis

o Previous approaches look at permissions vs. code (runtime behaviors)
o What do users expect?
  o GPS Tracker: record and send location
  o Phone-Call Recorder: record audio during phone call

App Description

App Code

App Permissions

Vision: “Bridging the gap between user expectation and app behaviors”

o User expectations
  o user perception + user judgment
o Focus on permissions and app descriptions
  o permissions (protecting user-understandable resources) should be discussed

Linkage: app description sentence to permission


WHYPER Overview

Application Market

WHYPER

DEVELOPERS

USERS

• Enhance user experience while installing apps
• Enforce functionality disclosure on developers
• Complement program analysis to ensure justifications

Pandita et al. WHYPER: Towards Automating Risk Assessment of Mobile Applications. USENIX Security 2013. http://web.engr.illinois.edu/~taoxie/publications/usenixsec13-whyper.pdf

Example Sentence in App Desc.

• E.g., “Also you can share the yoga exercise to your friends via Email and SMS.”
  – Implication of using the contact permission
  – Permission sentences

Keyword-based search on application descriptions

Problems with Ctrl + F

• Confounding effects:
  – Certain keywords such as “contact” have a confounding meaning
  – E.g., “... displays user contacts, ...” vs “... contact me at [email protected]”.

• Semantic inference:
  – Sentences often describe a sensitive operation without actually referring to keywords
  – E.g., “share yoga exercises with your friends via Email and SMS”
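Both problems can be reproduced with a few lines of Python. This is a minimal sketch of a Ctrl + F style check, not the keyword baseline used in the evaluation; the function name `mentions_keyword` and the shortened example sentences are assumptions made for illustration.

```python
import re

def mentions_keyword(sentence, keyword="contact"):
    """Naive Ctrl+F style check: does any word start with the keyword?"""
    words = re.findall(r"[a-z]+", sentence.lower())
    return any(word.startswith(keyword) for word in words)

# Each sentence is paired with whether it really implies the contact permission.
examples = {
    "displays user contacts": True,            # genuine use of the contact list
    "contact me at the address below": False,  # confounding meaning of "contact"
    "share yoga exercises with your friends via Email and SMS": True,  # implicit use
}

for text, needs_permission in examples.items():
    flagged = mentions_keyword(text)
    print(f"{text!r}: flagged={flagged}, needs_permission={needs_permission}")
```

The second sentence is flagged although it needs no permission (confounding effect), and the third is missed although it does (semantic inference).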


Natural Language Processing

• Natural Language Processing (NLP) techniques help computers understand NL artifacts

• In general, NLP is still difficult

• NLP on domain specific sentences with specific styles is feasible
  – Text2Policy: extraction of security policies from use cases [FSE 12]
  – APIInfer: inferring contracts from API docs [ICSE 12]
  – WHYPER: domain knowledge from API docs [USENIX Security 13]

Overview of WHYPER

[Figure: WHYPER architecture. Inputs: APP Description, APP Permission, API Docs. Components: Preprocessor, NLP Parser, Intermediate-Representation Generator (FOL Representation), Semantic Engine, Semantic-Graph Generator (Semantic Graphs as Domain Knowledge). Output: Annotated Description.]

Preprocessor

• Period Handling
  – Decimals, ellipsis, shorthand notations (Mr., Dr.)

• Sentence Boundaries
  – Tabs, bullet points, delimiters (:)
  – Symbols (*,-) and enumeration sentence

• Named Entity Handling
  – E.g., “Pandora internet radio”

• Abbreviation Handling
  – E.g., “Instant Message (IM)”
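The period, sentence-boundary, and abbreviation steps can be sketched as below. This is a rough illustration under stated assumptions, not WHYPER's preprocessor: the `SHORTHAND` list is hypothetical, the short form of an abbreviation is simply dropped (the slide does not say what WHYPER does with it), and named-entity handling and the ":" delimiter are left out.

```python
import re

# Hypothetical shorthand list; a real preprocessor would need a larger one.
SHORTHAND = ("Mr.", "Dr.", "e.g.", "i.e.")

def preprocess(description):
    """Split an app description into sentences after protecting periods."""
    text = description
    # Abbreviation handling (assumption): drop a parenthesised short form, e.g. "(IM)".
    text = re.sub(r"\s*\(([A-Z]{2,})\)", "", text)
    # Period handling: protect decimals, ellipses, and shorthand notations.
    text = re.sub(r"(\d)\.(\d)", r"\1<DOT>\2", text)
    text = text.replace("...", "<ELLIPSIS>")
    for short in SHORTHAND:
        text = text.replace(short, short.replace(".", "<DOT>"))
    # Sentence boundaries: tabs, newlines, and leading bullet symbols also split.
    text = re.sub(r"[\t\n]+\s*[*\-•]?\s*", ". ", text)
    sentences = []
    for piece in re.split(r"(?<=[.!?])\s+", text):
        piece = piece.strip(" .")
        # Restore the protected periods only after trimming the sentence ends.
        piece = piece.replace("<DOT>", ".").replace("<ELLIPSIS>", "...")
        if piece:
            sentences.append(piece)
    return sentences

sample = "Version 2.0 by Dr. Smith.\n* Send Instant Message (IM) to friends\n* Record audio"
print(preprocess(sample))
```

Protecting periods before splitting keeps "2.0" and "Dr." from being mistaken for sentence ends, which is the point of the period-handling step.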

Intermediate-Representation Generator

[Figure: the example sentence “Also you can share the yoga exercise to your friends via Email and SMS” shown three ways: with part-of-speech tags (RB, PRP, MD, VB, DT, NN, NNS, NNP), as a typed dependency graph (advmod, nsubj, aux, dobj, det, nn, prep_to, poss, prep_via, conj_and), and as a tree rooted at the predicate “share” with the nodes to, you, yoga exercise, owned, via, friends, and, Email, SMS. Legend: Predicate, Governing Entity, Dependent Entity.]

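Semantic Engine

[Figure: the tree for the example sentence (share, to, you, yoga exercise, owned, via, friends, and, Email, SMS) is compared with a semantic graph inferred from API docs. The labels “Email” and “share” are called out next to “WordNet Similarity”. Legend: Governing Entity, Dependent Entity.]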

Semantic-Graph Generator

Systematic approach to infer graphs
o Identify resource associated with the permissions from the API class name
  o ContactsContract.Contacts
o Inspect the member variables and member methods to identify actions and subordinate resources
  o ContactsContract.CommonDataKinds.Email
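One way to picture such a graph is a resource node carrying its actions and subordinate resources. The sketch below is only an illustration: the `SemanticGraph` class, its field names, and the action lists are assumptions, not the graph WHYPER derives from the API docs.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticGraph:
    """A resource, the actions applicable to it, and its subordinate resources."""
    resource: str
    actions: set = field(default_factory=set)
    subordinates: dict = field(default_factory=dict)  # name -> SemanticGraph

    def matches(self, action, resource):
        """True if the (action, resource) pair appears anywhere in the graph."""
        if resource == self.resource and action in self.actions:
            return True
        return any(sub.matches(action, resource) for sub in self.subordinates.values())

# Illustrative entries only, loosely modelled on ContactsContract.Contacts
# and ContactsContract.CommonDataKinds.Email.
email = SemanticGraph("email", {"send", "share"})
contacts = SemanticGraph("contact", {"read", "display", "share"}, {"email": email})

print(contacts.matches("share", "email"))          # subordinate resource matches
print(contacts.matches("share", "yoga exercise"))  # unrelated resource does not
```

A sentence whose extracted (action, resource) pair matches the graph would be treated as a candidate permission sentence.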

Evaluation

• Subjects
  – Permissions:
    • READ_CONTACTS
    • READ_CALENDAR
    • RECORD_AUDIO
  – 581 application descriptions
  – 9,953 sentences

• Evaluation setup
  – Manual annotation of the sentences
  – WHYPER for identifying permission sentences
  – Comparison to keyword-based searching

Evaluation Results

• Precision and recall of WHYPER
  – Average precision (82.8%) and recall (81.5%)

• Comparison to keyword-based searching
  – Improving precision (41.6%) and recall (-1.2%)
  – E.g., microphone-blow into and call-record

Permission      Keywords
READ_CONTACTS   contact, data, number, name, email
READ_CALENDAR   calendar, event, date, month, day, year
RECORD_AUDIO    record, audio, voice, capture, microphone

Access Control Policies (ACP) in Requirements Document

• Access control is often governed by security policies called Access Control Policies (ACP)
  – Includes rules to control which principals have access to which resources

• A policy rule includes four elements
  – Subject: HCP
  – Action: edit
  – Resource: patient's account
  – Effect: deny

ex. “The Health Care Personnel (HCP) does not have the ability to edit the patient's account.”
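The four elements map naturally onto a small record type. The sketch below is an assumption-laden illustration: the class `ACPRule`, the helper `lookup`, and the first-match behaviour are hypothetical and not part of Text2Policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ACPRule:
    subject: str
    action: str
    resource: str
    effect: str  # "permit" or "deny"

def lookup(rules, subject, action, resource):
    """Return the effect of the first matching rule, or None if no rule applies."""
    for rule in rules:
        if (rule.subject, rule.action, rule.resource) == (subject, action, resource):
            return rule.effect
    return None

# The rule stated by the example sentence on this slide.
rule = ACPRule(subject="HCP", action="edit", resource="patient's account", effect="deny")

print(lookup([rule], "HCP", "edit", "patient's account"))
print(lookup([rule], "patient", "view", "access log"))
```

What to do when no rule matches (default permit or default deny) is a policy decision the slide does not address, so the sketch returns None.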

Overview of Text2Policy

Steps: Linguistic Analysis, Model-Instance Construction, Transformation

An HCP should not change patient’s account.

An [subject: HCP] should not [action: change] [resource: patient’s account].

ACP Rule
Subject   Action            Resource             Effect
HCP       UPDATE - change   patient’s account    deny

Xiao et al. Automated Extraction of Security Policies from Natural-Language Software Documents. FSE 2012. http://web.engr.illinois.edu/~taoxie/publications/fse12-nlp.pdf

Example Technical Challenges in ACP Extraction

• Semantic Structure Variance
  – different ways to specify the same rule

• Negative Meaning Implicitness
  – verb could have negative meaning

ACP 1: An HCP cannot change patient’s account.
ACP 2: An HCP is disallowed to change patient’s account.

Road Ahead: Yin-Yang View


App Description

App Code

App Permissions

User-Perceived Information

App Security Behavior

o Reason about user-perceived info, e.g., WHYPER

o Push app security behavior across the boundary

o Check consistency across the boundary

o Reduce user judgment effort


[functional]

[security]



Acknowledgments: Supported in part by NSA Science of Security (SoS) Lablet, NSF SaTC, NSF SHF, NSF CAREER


Problems with Ctrl + F

o Confounding effects:
  o Certain keywords such as “contact” have a confounding meaning.
  o For instance, “... displays user contacts, ...” vs “... contact me at [email protected]”.

o Semantic Inference:
  o Sentences often describe a sensitive operation such as reading contacts without actually referring to keyword “contact”.
  o For instance, “share yoga exercises with your friends via email, sms”.


Natural Language Processing (NLP)

• NLP techniques help computers understand NL artifacts
• NLP is still difficult
• NLP on domain specific sentences with specific styles is feasible

RQ1 Results: Effectiveness of WHYPER

• Low FPs and FNs
  • out of 9,061 sentences, only 129 are flagged as FPs
  • among 581 applications, 109 applications (18.8%) contain at least one FP
  • among 581 applications, 86 applications (14.8%) contain at least one FN

Permission      SI    TP    FP    FN    TN      Prec.  Recall  F-Score  Acc
READ_CONTACTS   204   186   18    49    2,930   91.2   79.2    84.8     97.9
READ_CALENDAR   288   241   47    42    2,422   83.7   85.2    84.5     96.8
RECORD_AUDIO    259   195   64    50    3,470   75.3   79.6    77.4     97.0
TOTAL           751   622   129   141   9,061   82.8   81.5    82.2     97.3
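The derived columns follow from the four counts by the standard definitions. A minimal sketch using the TOTAL row as printed (the function name `metrics` is an assumption):

```python
def metrics(tp, fp, fn, tn):
    """Standard precision, recall, F-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy

# TOTAL row of the table: TP=622, FP=129, FN=141, TN=9,061.
total = metrics(tp=622, fp=129, fn=141, tn=9061)
print(", ".join(f"{100 * m:.1f}" for m in total))  # 82.8, 81.5, 82.2, 97.3
```

Note that SI (sentences identified by WHYPER) equals TP + FP, so precision is TP / SI.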

Result Analysis (False Positives)

• Incorrect parsing
  • “MyLink Advanced provides full synchronization of all Microsoft Outlook emails (inbox, sent, outbox and drafts), contacts, calendar, tasks and notes with all Android phones via USB”

• Synonym analysis
  • “You can now turn recordings into ringtones.”

Result Analysis (False Negatives)

• Incorrect parsing
  • Incorrect identification of sentence boundaries and limitations of underlying NLP infrastructure

• Limitations of Semantic Graphs
  • Manual Augmentation
    • microphone-blow into and call-record
    • significant improvement of Delta Recalls: -6.6% to 0.6%
  • Automatic mining from user comments and forums





Linguistic Analysis

• Incorporate syntactic and semantic analysis
  – syntactic structure -> noun group, verb group, etc.
  – semantic meaning -> subject, action, resource, negative meaning, etc.

• Provide new techniques for model extraction
  – Identify ACP and AS sentences
  – Infer semantic meaning

Common Techniques

• Shallow parsing
• Domain dictionary
• Anaphora resolution

ex. An HCP can view patient’s account. He is disallowed to change the patient’s account.

[Figure: annotated parse of the example. Labels: Subject, Main Verb Group, Object, NP, PNP, VG, HCP, UPDATE.]

Technical Challenges (TC) in ACP Extraction

• TC1: Semantic Structure Variance
  – different ways to specify the same rule

• TC2: Negative Meaning Implicitness
  – verb could have negative meaning

ACP 1: An HCP cannot change patient’s account.
ACP 2: An HCP is disallowed to change patient’s account.

Semantic-Pattern Matching

• Address TC1 Semantic Structure Variance

• Compose pattern based on grammatical function

ex. An HCP is disallowed to change the patient’s account.
    Pattern: passive voice followed by to-infinitive phrase

Negative-Expression Identification

• Address TC2 Negative Meaning Implicitness

• Negative expression
  – “not” in subject:
    ex. No HCP can edit patient’s account.
  – “not” in verb group:
    ex. HCP can not edit patient’s account.
        HCP can never edit patient’s account.

• Negative meaning words in main verb group
  ex. An HCP is disallowed to change the patient’s account.
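The effect inference can be approximated with word lists. This is a flattened sketch rather than Text2Policy's technique: it scans the whole sentence instead of checking the subject and verb group from a shallow parse, and both word lists are hypothetical.

```python
import re

# Hypothetical word lists; a real system would derive these from a domain dictionary.
NEGATIVE_EXPRESSIONS = {"no", "not", "never", "cannot"}
NEGATIVE_VERBS = {"disallowed", "prohibited", "forbidden"}

def infer_effect(sentence):
    """Return ("deny", evidence) if a negative word is found, else ("permit", None)."""
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in NEGATIVE_EXPRESSIONS or word in NEGATIVE_VERBS:
            return "deny", word
    return "permit", None

for sentence in (
    "No HCP can edit patient's account.",
    "HCP can never edit patient's account.",
    "An HCP is disallowed to change the patient's account.",
    "An HCP can view patient's account.",
):
    print(sentence, "->", infer_effect(sentence))
```

Double negation and negative words outside the subject or main verb group would be misclassified by this shortcut, which is why the slides rely on grammatical structure.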

AS: Syntactic-Pattern Matching

• Syntactic elements
  – Subject, Main verb, Object

• Subject and Object Checking
  – subject is not a user or object is not a resource

• Filtering negative-meaning sentences
  – Negative sentences tend not to describe ASs

ex. The prescription list should include medication, the name of the doctor. . .





ACP Model-Instance Construction

ex. An HCP is disallowed to change the patient’s account.

• Identify subject, action, and resource:
  – Subject: HCP
  – Action: change
  – Resource: patient’s account

• Infer effect:
  – Negative Expression: none
  – Negative Verb: disallow
  – Inferred Effect: deny

ACP Rule
Subject   Action            Resource             Effect
HCP       UPDATE - change   patient’s account    deny

AS Model-Instance Construction

• Use case patterns
  – industry use cases [DSN’09]
  – public use cases

• Model-Instance Construction
  ex. The patient views access log.

Action Step
Actor     Action          Resource
patient   OUTPUT – view   access log

Technical Challenges in Action-Step Extraction

• TC4: Transitive Subject

• TC5: Perspective Variance

AS 1: He edits the account. (He = HCP)
AS 2: The system updates the account.
AS 3: The system displays the updated account. (HCP views the updated account.)

Subject Flow Tracking

• Address TC4 Transitive Subject
• Apply data flow to track non-system subject:

AS 1: The HCP edits the account.
AS 2: The system updates the account.

Tracking: AS 2 has only the system as subject, so its subject is replaced with HCP

Perspective Conversion

• Address TC5 Perspective Variance
• Apply data flow to track non-system subject:

AS 1: The HCP edits the account.
AS 2: The system shows the updated account.

Tracking: AS 2 has only the system as subject and its action is output

Converting to “HCP views the updated account”
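Subject flow tracking and perspective conversion can be sketched together as one pass over the action steps. The tuple representation, the function name `rewrite_steps`, and the `OUTPUT_VERBS` mapping are assumptions for illustration, not Text2Policy's data model.

```python
# Hypothetical mapping from system output verbs to the user's perspective.
OUTPUT_VERBS = {"shows": "views", "displays": "views"}

def rewrite_steps(steps):
    """steps: list of (subject, verb, obj) tuples in use-case order."""
    tracked = None  # most recent non-system subject seen so far
    rewritten = []
    for subject, verb, obj in steps:
        if subject.lower() != "system":
            tracked = subject
            rewritten.append((subject, verb, obj))
        elif tracked is None:
            # No user subject to propagate yet; leave the step unchanged.
            rewritten.append((subject, verb, obj))
        elif verb in OUTPUT_VERBS:
            # Perspective conversion: system output becomes the user's view.
            rewritten.append((tracked, OUTPUT_VERBS[verb], obj))
        else:
            # Subject flow tracking: the system acts on behalf of the user.
            rewritten.append((tracked, verb, obj))
    return rewritten

steps = [
    ("HCP", "edits", "the account"),
    ("system", "updates", "the account"),
    ("system", "displays", "the updated account"),
]
for step in rewrite_steps(steps):
    print(step)
```

After rewriting, every step names HCP as its subject, so the steps can be checked directly against ACP rules whose subject is HCP.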

Evaluation – RQs

• RQ1: How effectively does Text2Policy identify ACP sentences in NL documents?

• RQ2: How effectively does Text2Policy extract ACP rules from ACP sentences?

• RQ3: How effectively does Text2Policy extract action steps from action-step sentences?

Evaluation – Subject

• iTrust open source project
  – http://agile.csc.ncsu.edu/iTrust/wiki/
  – 448 use-case sentences (37 use cases)
  – preprocessed use cases

• Collected ACP sentences
  – 100 ACP sentences
  – From 17 sources (published papers and websites)

• A module of an IBMApp (financial domain)
  – 25 use cases

RQ1 ACP Sentence Identification

• Apply Text2Policy to identify ACP sentences in iTrust use cases and IBMApp use cases

• Text2Policy effectively identifies ACP sentences with precision and recall more than 88%

• Precision on IBMApp use cases is better
  – proprietary use cases are often of higher quality compared to open-source use cases

Evaluation – RQ2 Accuracy of Policy Extraction

• Apply Text2Policy to extract ACP rules from ACP sentences

• Text2Policy effectively extracts ACP model instances with accuracy above 86%

Evaluation – RQ3 Accuracy of Action-Step Extraction

• Apply Text2Policy to extract action steps from iTrust and IBMApp use cases

• Text2Policy effectively extracts AS model instances with accuracy above 81%

• Limitations:
  – Subordinate conjunction or else and long phrases

Detected Inconsistencies

• No violation between ASs against the extracted ACPs

• Inconsistent names used for referring to the same entity (e.g., user) across different use cases

ex. editor used in UC 4 of iTrust use cases actually refers to HCP, admin, and all users in UCs 1, 2, and 4

Summary

• Natural Language Processing (NLP) for domain-specific purposes is feasible
  – Challenging for general documents
  – Feasible for domain-specific sentences with specific styles

• New techniques are required
  – Addressing unique challenges in software engineering

http://research.csc.ncsu.edu/ase/projects/text2policy/
