Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST


Outline

• Overview
  – Potential applications
  – Intermediate Representation
  – Probabilistic model
• Statistical Code Completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in program analysis
• Conclusion

Marriage of ML and PL

• SLANG: Code completion [PLDI 14]
• PL Translation [Onward 14]
• JSNice: Type Prediction [POPL 15]

More applications:

• Program bug detection
• Program invariant inference
• PL design: Probabilistic PL
• Binary analysis
• ...

Intermediate Representation

• Sequences
• Trees
• Graphical Models
• Feature Vectors

Other IRs in PL research: AST, CFG, CDG, DDG, PDG, SSA, CPS, ...

Extract program representation with Program Analysis

• SLANG: alias and typestate analysis
• JSNice: scope and alias analysis, type analysis
• ...
• Other applications:
  – Use type inference to get training labels
  – Use a SAT solver to check path conditions
  – ...
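As a rough illustration of turning a program into a sequence representation, the sketch below walks a Python AST and collects, per receiver variable, the ordered method calls made on it. This is only a stand-in under simplifying assumptions: the sample program and the helper name `call_sequences` are hypothetical, and matching receivers by variable name is far cruder than the alias and typestate analyses that SLANG uses.

```python
import ast
from collections import defaultdict

# Hypothetical input program, used only for illustration.
SOURCE = """
f = open("log.txt")
data = f.read()
f.close()
g = open("out.txt", "w")
g.write(data)
g.close()
"""

def call_sequences(source):
    """Map each receiver variable to the ordered list of methods called on it."""
    tree = ast.parse(source)
    calls = []
    for node in ast.walk(tree):
        # Keep only calls of the form <name>.<method>(...).
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)):
            calls.append((node.lineno, node.col_offset,
                          node.func.value.id, node.func.attr))
    sequences = defaultdict(list)
    # Sort by source position so each sequence follows program order.
    for _, _, receiver, method in sorted(calls):
        sequences[receiver].append(method)
    return dict(sequences)

sequences = call_sequences(SOURCE)
print(sequences)
```

Each resulting list can be treated as one "sentence" for a sequence model.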

What's the suitable probabilistic model?

• N-gram language model [PLDI 14]
• Probabilistic context-free grammars [ICSE 12]
• Neural networks
• Support vector machines
• Conditional Random Fields [POPL 15]
• ...

As with the IR, the choice depends on the application.

ML for PL

[Picture from Martin Vechev's slides]

Statistical Code Completion

[SLANG, V. Raychev et al., PLDI 14]

Key insight: Regularities in code are similar to regularities in natural language

Techniques in SLANG

• IR: Sequences (sentences)
• Program Analysis: typestate analysis, alias analysis
• Trained Model: Neural Network, N-gram language model
• Some smoothing techniques

N-gram language model

The probability of each word is conditioned only on the previous n-1 words.

Training is achieved by counting n-grams.

The time spent on each word encountered in training is constant, so training is usually fast.

Other models used: Recurrent Neural Network (RNN). An RNN can learn dependencies beyond the previous few words, but it is usually slower.
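The counting-based training described above can be sketched in a few lines. This is a minimal illustration, not SLANG's implementation: the training "sentences" are hypothetical API call sequences, the function names are made up for this example, n is assumed to be at least 2, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_ngram(sentences, n=3):
    """Count n-grams: context of n-1 tokens -> Counter of next tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            counts[context][padded[i]] += 1  # constant work per token
    return counts

def probability(counts, context, token):
    """Maximum-likelihood estimate of P(token | context), without smoothing."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return followers[token] / total if total else 0.0

def complete(counts, prefix, n=3, k=3):
    """Rank candidate next tokens given the last n-1 tokens of the prefix."""
    context = tuple((["<s>"] * (n - 1) + list(prefix))[-(n - 1):])
    followers = counts.get(context, Counter())
    return [token for token, _ in followers.most_common(k)]

# Hypothetical training data: call sequences on a file-like object.
training = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]
counts = train_ngram(training, n=3)
print(complete(counts, ["open"]))
print(probability(counts, ("<s>", "open"), "read"))
```

Unseen contexts get probability zero here, which is the gap that smoothing techniques are meant to fill.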

Learning to Recognize Functions in Binary Code

[Tiffany Bao et al., USENIX Security 14]

When a binary is compiled with gcc -O3 and stripped, its function information (symbols and boundaries) is lost.

Can we automatically and accurately recover function information from binaries?

Example: GCC

#include <stdio.h>

int fac(int x)
{
    if (x == 1)
        return 1;
    else
        return x * fac(x - 1);
}

void main(int argc, char **argv)
{
    printf("%d", fac(10));
}

Example: GCC

• default -O0

08048443 <main>:
    push %ebp
    mov %esp,%ebp
    and $0xfffffff0,%esp
    sub $0x10,%esp
    …

0804841c <fac>:
    push %ebp
    mov %esp,%ebp
    sub $0x18,%esp
    cmpl $0x1,0x8(%ebp)
    jne 804842f <fac+0x13>
    mov $0x1,%eax
    …

• -O1 -O2

0804841c <fac>:
    push %ebx
    sub $0x18,%esp
    mov 0x20(%esp),%ebx
    mov $0x1,%eax
    cmp $0x1,%ebx
    …

08048330 <main>:
    mov $0x1,%edx
    mov $0xa,%eax
    lea 0x0(%esi),%esi
    …
    push %ebp
    mov %esp,%ebp
    and $0xfffffff0,%esp
    sub $0x10,%esp
    …

ByteWeight

A machine learning + program analysis approach to function identification

Training:
• Create a model of function start patterns using supervised learning

Usage:
– Function Start Identification: use the trained model to match function starts in stripped binaries
– Function Identification: use program analysis to identify all bytes associated with a function

[Picture from Bao's slide]
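One way to picture the learned model of function start patterns is a weighted prefix tree: each node records how often a prefix began a function versus appeared elsewhere, and a candidate is scored by its longest matching prefix. The sketch below is a simplified illustration under stated assumptions, not the ByteWeight implementation: it uses instruction strings as tokens, a tiny hand-made training set, hypothetical names (`TrieNode`, `insert`, `score`, `is_function_start`), and an assumed threshold of 0.5.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.starts = 0   # times this prefix began a function
        self.others = 0   # times this prefix appeared elsewhere

    @property
    def weight(self):
        total = self.starts + self.others
        return self.starts / total if total else 0.0

def insert(root, seq, is_start, max_len=3):
    """Add one training sequence, updating counts along its prefix path."""
    node = root
    for tok in seq[:max_len]:
        node = node.children.setdefault(tok, TrieNode())
        if is_start:
            node.starts += 1
        else:
            node.others += 1

def score(root, seq):
    """Weight of the longest trained prefix that matches seq."""
    node, weight = root, 0.0
    for tok in seq:
        node = node.children.get(tok)
        if node is None:
            break
        weight = node.weight
    return weight

def is_function_start(root, seq, threshold=0.5):
    # The threshold is an assumption of this sketch.
    return score(root, seq) > threshold

# Toy training data (hypothetical), loosely based on the listings above.
starts = [
    ["push %ebp", "mov %esp,%ebp", "sub $0x18,%esp"],
    ["push %ebp", "mov %esp,%ebp", "and $0xfffffff0,%esp"],
    ["push %ebx", "sub $0x18,%esp", "mov 0x20(%esp),%ebx"],
]
others = [
    ["mov $0x1,%eax", "cmp $0x1,%ebx", "push %ebx"],
    ["push %ebx", "mov $0x1,%eax", "cmp $0x1,%ebx"],
]
root = TrieNode()
for seq in starts:
    insert(root, seq, is_start=True)
for seq in others:
    insert(root, seq, is_start=False)

print(is_function_start(root, ["push %ebp", "mov %esp,%ebp", "sub $0x10,%esp"]))
print(is_function_start(root, ["push %ebx", "mov $0x1,%eax"]))
```

Longer prefixes disambiguate: `push %ebx` alone is ambiguous in this toy data, while the instruction that follows it settles the score.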

Problems of Program analysis

• Programs have unbounded behaviors

• Program analysis
  – Analyze all behaviors
  – Run for a finite time

• In finite time, observe only finite behaviors

• Need to generalize

Generalization in Program Analysis

• Abstract interpretation: widening operator

• CEGAR: interpolants

• Parameter tuning of tools (flow sensitivity, path sensitivity, etc.)

• Lots of folk knowledge, heuristics,...
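To make the widening operator concrete, the sketch below computes a loop-head invariant for `i = 0; while i < 100: i = i + 1` in the interval domain, with and without widening. It is a minimal illustration under stated assumptions: the function names are hypothetical, the guard filter assumes the lower bound stays below 100, and no narrowing pass is applied afterwards.

```python
import math

def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    """Standard interval widening: any unstable bound jumps to infinity."""
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)

def loop_head_invariant(use_widening):
    """Iterate to a fixpoint at the loop head; return (interval, steps)."""
    init = (0, 0)          # i = 0 on loop entry
    state = init
    steps = 0
    while True:
        steps += 1
        # Filter by the guard i < 100, then apply the body i = i + 1.
        guarded = (state[0], min(state[1], 99))
        after_body = (guarded[0] + 1, guarded[1] + 1)
        new = join(init, after_body)
        nxt = widen(state, new) if use_widening else join(state, new)
        if nxt == state:
            return state, steps
        state = nxt

print(loop_head_invariant(use_widening=False))
print(loop_head_invariant(use_widening=True))
```

Plain iteration finds the precise interval [0, 100] but needs about as many steps as the loop bound; widening converges in two steps by generalizing to [0, +inf), trading precision for speed.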

Generalization in Machine Learning

• “It’s all about generalization”

• A famous concept in computational learning theory
  – Complexity and feasibility of learning

• Learn a function from observations

• Hope that the function generalizes

Bias-Variance Tradeoffs in Program Analysis

[Aiken, POPL 14]

• Model the generalization process
  – Probably Approximately Correct (PAC) model
  – Bias: empirical error of the best available hypothesis
  – Variance: O(VC-d), where VC-d is the VC dimension of the hypothesis class

• Explain known observations with this model

• Use this model to obtain better tools (in ASTREE, Yogi Project, ...)
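For reference, the bias and variance terms above line up with the two terms of the standard VC generalization bound from learning theory. The form below is the textbook statement up to constants, and is not necessarily the exact formulation used in the cited paper. Here R(h) is the true error of hypothesis h, \hat{R}_m(h) its empirical error on m samples, d the VC dimension of the hypothesis class, and the bound holds with probability at least 1 - \delta.

```latex
\[
  R(h) \;\le\;
  \underbrace{\hat{R}_m(h)}_{\text{empirical error (bias)}}
  \;+\;
  \underbrace{O\!\left(\sqrt{\frac{d\,\log(m/d) + \log(1/\delta)}{m}}\right)}_{\text{grows with VC dimension } d \text{ (variance)}}
\]
```

A richer hypothesis class (for example, a more precise abstraction) can lower the first term while raising the second.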

Combine ML and PL Research

• Already lots of work in: POPL, PLDI, SOSP, OSDI, USENIX Security, ...

• Lots of applications and theories to be found.

• Combination with other fields: Systems, Security, ...

Thank you!
