Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST

Page 1:

Machine Learning for Program Language Research

Yao Peisen

Prism Group, HKUST

[email protected]

Page 2:

Outline

• Overview
  – Potential applications
  – Intermediate Representation
  – Probabilistic model
• Statistical Code Completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in program analysis
• Conclusion

Page 3:

Marriage of ML and PL

• SLANG: Code completion [PLDI 14]
• PL translation [Onward 14]
• JSNice: Type prediction [POPL 15]

More applications:

• Program bug detection
• Program invariant inference
• PL design: probabilistic PL
• Binary analysis
• ...

Page 4:

Intermediate Representation

• Sequences
• Trees
• Graphical models
• Feature vectors

Other IRs in PL research: AST, CFG, CDG, DDG, PDG, SSA, CPS, ...
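As a rough illustration of three of these representations, the sketch below views one small function as a token sequence, as a tree of AST node types, and as a feature vector of node-type counts. It uses Python's standard `ast` and `tokenize` modules; the sample function and the chosen feature vocabulary are assumptions made for this example, not something taken from the talk.

```python
import ast
import io
import tokenize
from collections import Counter

# Sample program, chosen for illustration only.
SOURCE = "def fac(x):\n    return 1 if x == 1 else x * fac(x - 1)\n"

def token_sequence(source):
    """Sequence view: the program as a flat list of token strings."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}
    return [tok.string for tok in tokens if tok.type not in skip]

def node_types(source):
    """Tree view: AST node type names, in the order ast.walk yields them."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def feature_vector(source, vocabulary):
    """Feature-vector view: how often each chosen node type occurs."""
    counts = Counter(node_types(source))
    return [counts[name] for name in vocabulary]

tokens = token_sequence(SOURCE)
vector = feature_vector(SOURCE, ["FunctionDef", "Call", "BinOp", "Compare"])
print(tokens[:5])   # ['def', 'fac', '(', 'x', ')']
print(vector)       # [1, 1, 2, 1]
```

Which view is useful depends on the model: sequence models consume the token list, tree-based models consume the AST, and classifiers such as SVMs consume the fixed-length vector.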

Page 5:

Extract program representation with Program Analysis

• SLANG: alias and typestate analysis
• JSNice: scope and alias analysis, type analysis
• ...
• Other applications:
  – Use type inference to get training labels
  – Use a SAT solver to check path conditions
  – ...

Page 6:

What's the suitable probabilistic model?

• N-gram language model [PLDI 14]
• Probabilistic context-free grammars [ICSE 12]
• Neural networks
• Support vector machines
• Conditional Random Fields [POPL 15]
• ...

As with the IR, the choice depends on the application.

Page 7:

ML for PL

[Picture from Martin Vechev's slide]

Page 8:

Outline

• Overview
  – Potential applications
  – Intermediate Representation
  – Probabilistic model
• Statistical code completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in program analysis
• Conclusions

Page 9:

Statistical Code Completion

[SLANG, V. Raychev et al., PLDI 14]

Key insight: regularities in code are similar to regularities in natural language.

Page 10:

Techniques in SLANG

• IR: sequences (sentences)
• Program analysis: typestate analysis, alias analysis
• Trained models: neural network, N-gram language model
• Some smoothing techniques

Page 11:

N-gram language model

The probability of each word is conditioned only on the previous n-1 words.

Training is done by counting n-grams.

The time spent on each word encountered in training is constant, so training is usually fast.

Other model used: Recurrent Neural Network (RNN). An RNN can learn dependencies beyond the previous few words, but it is usually slower.
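The counting idea can be sketched in a few lines. The code below is a minimal, unsmoothed n-gram model over sequences of API-call names; it is not SLANG's implementation, and the tiny corpus of call sequences is invented for illustration.

```python
from collections import Counter, defaultdict

def train_ngram(sentences, n=3):
    """Count n-grams: each word is tallied under its previous n-1 words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            counts[context][padded[i]] += 1
    return counts

def probability(counts, context, word):
    """Maximum-likelihood estimate of P(word | context), no smoothing."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())

def complete(counts, context):
    """Most frequent next word for a context, or None if never seen."""
    seen = counts.get(tuple(context))
    return seen.most_common(1)[0][0] if seen else None

# Hypothetical API-call "sentences", invented for this example.
corpus = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]
model = train_ngram(corpus, n=2)
print(complete(model, ["open"]))             # read
print(probability(model, ["open"], "read"))  # 2 of the 3 calls after "open"
```

Training touches each word once and does a constant amount of work per word, which is why it is fast. Unseen contexts get probability zero here; smoothing techniques exist to redistribute some probability mass to them.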

Page 12:

Outline

• Overview
  – Potential applications
  – Intermediate Representation
  – Probabilistic model
• Statistical code completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in program analysis
• Conclusion

Page 13:

Learning to Recognize Functions in Binary Code

[Tiffany Bao et al., USENIX Security 14]

When a binary is compiled with gcc -O3 and stripped, its function information may be lost.

Can we automatically and accurately recover function information from binaries?

Page 14:

Example: GCC

#include <stdio.h>

int fac(int x)
{
    if (x == 1)
        return 1;
    else
        return x * fac(x - 1);
}

int main(int argc, char **argv)
{
    printf("%d", fac(10));
    return 0;
}

Page 15:

Example: GCC

• default -O0

08048443 <main>:
    push %ebp
    mov %esp,%ebp
    and $0xfffffff0,%esp
    sub $0x10,%esp
    …

0804841c <fac>:
    push %ebp
    mov %esp,%ebp
    sub $0x18,%esp
    cmpl $0x1,0x8(%ebp)
    jne 804842f <fac+0x13>
    mov $0x1,%eax
    …

Page 16:

-O1 -O2

0804841c <fac>:
    push %ebx
    sub $0x18,%esp
    mov 0x20(%esp),%ebx
    mov $0x1,%eax
    cmp $0x1,%ebx
    …

08048330 <main>:
    mov $0x1,%edx
    mov $0xa,%eax
    lea 0x0(%esi),%esi
    …
    push %ebp
    mov %esp,%ebp
    and $0xfffffff0,%esp
    sub $0x10,%esp
    …

Page 17:

ByteWeight

A machine learning + program analysis approach to function identification

Training:
• Create a model of function start patterns using supervised learning

Usage:
– Function Start Identification: use the trained model to match function starts in stripped binaries
– Function Identification: use program analysis to identify all bytes associated with a function
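The training and matching steps can be sketched with a prefix tree over byte sequences, where each node records how often its prefix appeared at a known function start. This is a simplified sketch in the spirit of a weighted prefix tree, not the ByteWeight implementation: the prefix length, the 0.5 threshold, and the toy byte strings are assumptions for illustration, and the later step of recovering all bytes of each function is not covered.

```python
class PrefixTree:
    """Prefix tree over bytes; each node counts how often the prefix
    ending there was seen at a function start versus elsewhere."""

    def __init__(self, max_len=4):
        self.max_len = max_len
        self.root = {}

    def add(self, code, is_start):
        node = self.root
        for byte in code[:self.max_len]:
            entry = node.setdefault(byte, {"pos": 0, "neg": 0, "next": {}})
            entry["pos" if is_start else "neg"] += 1
            node = entry["next"]

    def score(self, code):
        """Weight of the longest known prefix: pos / (pos + neg)."""
        node, weight = self.root, 0.0
        for byte in code[:self.max_len]:
            entry = node.get(byte)
            if entry is None:
                break
            weight = entry["pos"] / (entry["pos"] + entry["neg"])
            node = entry["next"]
        return weight

def train(tree, binary, starts):
    """Label every offset of a training binary using its known starts."""
    for offset in range(len(binary)):
        tree.add(binary[offset:offset + tree.max_len], offset in starts)

def find_starts(tree, binary, threshold=0.5):
    """Offsets whose leading bytes score above the threshold."""
    return [off for off in range(len(binary))
            if tree.score(binary[off:off + tree.max_len]) > threshold]

# Toy training data: two functions that begin with push %ebp; mov %esp,%ebp.
training = b"\x55\x89\xe5\x83\xec\x18\xc3" + b"\x55\x89\xe5\x83\xec\x10\xc3"
tree = PrefixTree()
train(tree, training, starts={0, 7})

unknown = b"\x90\x55\x89\xe5\x83\xec\x20\xc3"   # a nop, then one function
print(find_starts(tree, unknown))               # [1]
```

A model trained only on -O0 prologues like these would miss the -O1 and -O2 function starts shown earlier, which is why the training set has to cover the compilers and optimization levels of interest.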

Page 18:

[Picture from Bao's slide]

Page 19:

Outline

• Overview
  – Potential applications
  – Intermediate Representation
  – Probabilistic model
• Statistical code completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in program analysis
• Conclusions

Page 20:

Problems of Program Analysis

• Programs have unbounded behaviors
• Program analysis
  – Analyze all behaviors
  – Run for a finite time
• In finite time, we can observe only finitely many behaviors
• Need to generalize

Page 21:

Generalization in Program Analysis

• Abstract interpretation: widening operator
• CEGAR: interpolants
• Parameter tuning of tools (flow sensitivity, path sensitivity, etc.)
• Lots of folk knowledge, heuristics, ...
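To make the first item concrete, here is a small sketch of the standard widening operator on intervals, applied to the loop `i = 0; while unknown(): i = i + 1`. The function names and the 100-step cutoff are assumptions for this example. Without widening, the analysis only ever sees the finitely many bounds it has computed so far; with widening, it generalizes a growing bound to infinity and stops.

```python
import math

def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    """Interval widening: any bound that grew jumps to infinity."""
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)

def analyze_counter_loop(max_steps=100, use_widening=True):
    """Abstractly run `i = 0; while unknown(): i = i + 1` on intervals.

    Returns the interval for i at the loop head and the steps taken.
    """
    state = (0, 0)                                  # i = 0 on loop entry
    for step in range(1, max_steps + 1):
        after_body = (state[0] + 1, state[1] + 1)   # effect of i = i + 1
        merged = join(state, after_body)
        new_state = widen(state, merged) if use_widening else merged
        if new_state == state:                      # fixpoint reached
            return state, step
        state = new_state
    return state, max_steps                         # gave up, no fixpoint

print(analyze_counter_loop(use_widening=True))    # ((0, inf), 2)
print(analyze_counter_loop(use_widening=False))   # ((0, 100), 100)
```

The widened result [0, inf) is sound for every execution of the loop, while the run without widening is cut off at the step limit without reaching a fixpoint.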

Page 22:

Generalization in Machine Learning

• “It’s all about generalization”
• A famous concept in computational learning theory
  – Complexity and feasibility of learning
• Learn a function from observations
• Hope that the function generalizes

Page 23:

Bias-Variance Tradeoffs in Program Analysis

[Aiken, POPL 14]

• Model the generalization process
  – Probably Approximately Correct (PAC) model
  – Bias: empirical error of the best available hypothesis
  – Variance: O(VC-d), where VC-d is the VC dimension of the hypothesis class
• Explain known observations with this model
• Use this model to obtain better tools (in ASTREE, Yogi Project, ...)
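As background, one standard form of the VC generalization bound from learning theory shows how such a bound splits into an empirical-error term and a capacity term that grows with the VC dimension. It is given here only as an illustration and is not claimed to be the exact statement used in the cited paper. The symbols are assumptions of this note: h is the learned hypothesis, m the number of samples, d the VC dimension (with d < m), and δ the allowed failure probability.

```latex
\[
\operatorname{err}(h) \;\le\; \widehat{\operatorname{err}}_m(h)
  + \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) + \ln\frac{4}{\delta}}{m}}
\qquad \text{with probability at least } 1 - \delta
\]
```

A richer hypothesis class can lower the first term (bias) but raises d and with it the second term (variance), which is the tradeoff the slide refers to.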

Page 24:

Outline

• Overview
  – Potential applications
  – Intermediate Representation
  – Probabilistic model
• Statistical code completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in program analysis
• Conclusions

Page 25:

Combine ML and PL Research

• Already lots of work in POPL, PLDI, SOSP, OSDI, USENIX Security, ...
• Lots of applications and theories to be found.
• Combination with other fields: systems, security, ...

Page 26:

Thank you!