Machine Learning for Program Language Research Yao Peisen Prism Group, HKUST
Outline
• Overview
  – Potential applications
  – Intermediate Representation
  – Probabilistic model
• Statistical Code Completion
• Learning to Recognize Functions in Binary Code
• Bias-Variance Tradeoffs in Program Analysis
• Conclusion
Marriage of ML and PL
SLANG: Code completion [PLDI 14]
PL translation [Onward 14]
More applications:
Program bug detection
Program invariant inference
PL design: probabilistic PL
Binary analysis
......
JSNice: Type prediction [POPL 15]
Intermediate Representation
Sequences
Trees
Graphical Models
Feature Vectors
Other IRs in PL research: AST, CFG, CDG, DDG, PDG, SSA, CPS, ...
Extract program representation with Program Analysis
• SLANG: alias and typestate analysis
• JSNice: scope and alias analysis, type analysis
......
• Other applications:
  – Use type inference to obtain training labels
  – Use a SAT solver to check path conditions
....
What is a suitable probabilistic model?
N-gram language model [PLDI 14]
Probabilistic context-free grammars [ICSE 12]
Neural networks
Support vector machines
Conditional Random Fields [POPL 15]
......
As with the IR, the choice depends on the application.
ML for PL
[Picture from Martin Vechev's slides]
Statistical Code Completion
[SLANG, V. Raychev et al., PLDI 14]
Key insight: Regularities in code are similar to regularities in natural language
Techniques in SLANG
• IR: Sequences (sentences)
• Program analysis: typestate analysis, alias analysis
• Trained models: neural network, N-gram language model
• Some smoothing techniques
N-gram language model
The probability of each word is conditioned only on the previous n-1 words.
Training is done by counting n-grams.
The time spent on each word encountered in training is constant, so training is usually fast.
Other models used: Recurrent Neural Network (RNN). An RNN can learn dependencies that reach beyond the previous few words, but it is usually slower.
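The counting procedure above can be sketched in a few lines of Python. This is a minimal illustration, not SLANG's implementation: the function names (train_ngram, probability), the padding tokens and the tiny corpus of API-call "sentences" are all made up for the example, and no smoothing is applied.

```python
from collections import Counter

def train_ngram(sentences, n=3):
    """Count n-grams and their (n-1)-word contexts in a single pass."""
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # Pad so that the first real word also has a full context.
        padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1  # constant work per word
            context_counts[context] += 1
    return ngram_counts, context_counts

def probability(word, context, ngram_counts, context_counts):
    """Maximum-likelihood estimate of P(word | context), without smoothing."""
    context = tuple(context)
    total = context_counts[context]
    if total == 0:
        return 0.0  # unseen context; a real model would smooth or back off
    return ngram_counts[context + (word,)] / total

# Hypothetical corpus: each "sentence" is a sequence of API calls on one object.
CORPUS = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]
NGRAMS, CONTEXTS = train_ngram(CORPUS, n=3)

# Rank candidate completions after the calls "open", "read".
CANDIDATES = ["read", "write", "close"]
RANKED = sorted(
    CANDIDATES,
    key=lambda w: probability(w, ("open", "read"), NGRAMS, CONTEXTS),
    reverse=True,
)
print(RANKED)
```

The zero probability for unseen contexts is why the smoothing techniques listed earlier matter in practice.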
Learning to Recognize Functions in Binary Code
[Tiffany Bao et al., USENIX Security 14]
When we compile with gcc -O3, function information may be lost: the usual prologue can be optimized away, and stripped binaries carry no symbols.
Can we automatically and accurately recover function information from binaries?
Example: GCC
#include <stdio.h>

/* Recursive factorial; assumes x >= 1. */
int fac(int x)
{
    if (x == 1)
        return 1;
    else
        return x * fac(x - 1);
}

int main(int argc, char **argv)
{
    printf("%d", fac(10));
    return 0;
}
Example: GCC
• default -O0

08048443 <main>:
    push %ebp
    mov  %esp,%ebp
    and  $0xfffffff0,%esp
    sub  $0x10,%esp
    …

0804841c <fac>:
    push %ebp
    mov  %esp,%ebp
    sub  $0x18,%esp
    cmpl $0x1,0x8(%ebp)
    jne  804842f <fac+0x13>
    mov  $0x1,%eax
    …

• -O1 -O2

0804841c <fac>:
    push %ebx
    sub  $0x18,%esp
    mov  0x20(%esp),%ebx
    mov  $0x1,%eax
    cmp  $0x1,%ebx
    …

08048330 <main>:
    mov  $0x1,%edx
    mov  $0xa,%eax
    lea  0x0(%esi),%esi
    …
    push %ebp
    mov  %esp,%ebp
    and  $0xfffffff0,%esp
    sub  $0x10,%esp
    …
ByteWeight
A machine learning + program analysis approach to function identification
Training:
• Creates a model of function start patterns using supervised learning
Usage:
  – Function Start Identification: use the trained model to match function starts in stripped binaries
  – Function Identification: use program analysis to identify all bytes associated with a function
[Picture from Bao's slide]
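A rough sketch of the training and matching steps, assuming (as I understand the ByteWeight paper) that the model is a weighted prefix tree over instruction sequences. The class and function names, the toy training sequences and the 0.5 threshold are assumptions made for this example, and the real system works on far larger corpora with normalized instructions.

```python
class Node:
    """One node of a weighted prefix tree over instruction tokens."""
    def __init__(self):
        self.children = {}
        self.positive = 0  # times this prefix began a function
        self.negative = 0  # times this prefix appeared elsewhere

def insert(root, tokens, is_start, max_len=3):
    """Record one training sequence and whether it was a function start."""
    node = root
    for token in tokens[:max_len]:
        node = node.children.setdefault(token, Node())
        if is_start:
            node.positive += 1
        else:
            node.negative += 1

def score(root, tokens):
    """Weight of the longest prefix of `tokens` found in the tree."""
    node, weight = root, 0.0
    for token in tokens:
        node = node.children.get(token)
        if node is None:
            break
        weight = node.positive / (node.positive + node.negative)
    return weight

def is_function_start(root, tokens, threshold=0.5):
    """Classify a candidate address; the threshold is an assumed value."""
    return score(root, tokens) >= threshold

# Toy training data (hypothetical labels, instructions borrowed from the GCC example).
STARTS = [
    ["push %ebp", "mov %esp,%ebp", "sub $0x18,%esp"],
    ["push %ebp", "mov %esp,%ebp", "and $0xfffffff0,%esp"],
    ["push %ebx", "sub $0x18,%esp", "mov 0x20(%esp),%ebx"],
]
NON_STARTS = [
    ["mov %esp,%ebp", "sub $0x18,%esp", "cmpl $0x1,0x8(%ebp)"],
    ["push %ebx", "mov $0x1,%eax", "cmp $0x1,%ebx"],
]

TREE = Node()
for seq in STARTS:
    insert(TREE, seq, is_start=True)
for seq in NON_STARTS:
    insert(TREE, seq, is_start=False)

print(is_function_start(TREE, ["push %ebp", "mov %esp,%ebp", "sub $0x10,%esp"]))
print(is_function_start(TREE, ["push %ebx", "mov $0x1,%eax"]))
```

Only function start identification is sketched here; recovering all bytes of a function is the separate program analysis step.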
Problems of Program analysis
• Programs have unbounded behaviors
• Program analysis
  – Analyze all behaviors
  – Run for a finite time
• In finite time, observe only finite behaviors
• Need to generalize
Generalization in Program Analysis
• Abstract interpretation: widening operator
• CEGAR: interpolants
• Parameter tuning of tools (flow sensitivity, path sensitivity, etc.)
• Lots of folk knowledge, heuristics, ...
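As a concrete case of generalizing from finitely many observations, here is a small sketch of the textbook interval widening operator applied to the loop "x = 0; while (*) x += 1". The helper names and the iteration cap are assumptions for the example; real analyzers use richer domains and widening strategies.

```python
import math

def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    """Textbook interval widening: a bound that grew jumps to infinity."""
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)

def analyze_loop(init, step, use_widening, max_iters=50):
    """Compute an interval for x at the loop head of: x = init; while (*) x += step.

    Returns (interval, converged).
    """
    state = init
    for _ in range(max_iters):
        after_body = (state[0] + step, state[1] + step)
        candidate = join(state, after_body)
        nxt = widen(state, candidate) if use_widening else candidate
        if nxt == state:
            return state, True  # fixpoint reached
        state = nxt
    return state, False  # gave up after observing finitely many behaviors

WITH_WIDENING = analyze_loop((0, 0), 1, use_widening=True)
WITHOUT_WIDENING = analyze_loop((0, 0), 1, use_widening=False)
print(WITH_WIDENING)
print(WITHOUT_WIDENING)
```

With widening the analysis generalizes from two observed iterations to "x is at least 0"; without it the iteration never reaches a fixpoint within the cap.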
Generalization in Machine Learning
• “It’s all about generalization”
• A famous concept in computational learning theory
  – Complexity and feasibility of learning
• Learn a function from observations
• Hope that the function generalizes
Bias-Variance Tradeoffs in Program Analysis
[Aiken, POPL 14]
• Model the generalization process
  – Probably Approximately Correct (PAC) model
  – Bias: empirical error of the best available hypothesis
  – Variance: O(VC-d), where VC-d is the VC dimension of the hypothesis class
• Explain known observations with this model
• Use this model to obtain better tools (in ASTREE, the Yogi project, ...)
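For reference, one common textbook form of a VC-style generalization bound is given below. This is an assumed illustration of how the two terms fit together, not a formula taken from the slides or the paper: h is a hypothesis from a class with VC dimension d, m is the number of i.i.d. observations, and the bound holds with probability at least 1 - \delta.

```latex
% Assumed textbook form; the slides only state "Variance: O(VC-d)".
\[
  \mathrm{err}(h) \;\le\;
  \underbrace{\widehat{\mathrm{err}}_m(h)}_{\text{bias: empirical error}}
  \;+\;
  \underbrace{O\!\left(\sqrt{\frac{d \,\log(m/d) + \log(1/\delta)}{m}}\right)}_{\text{variance: grows with the VC dimension } d}
\]
```

A richer hypothesis class can lower the first term but raises the second, which is the tradeoff the title refers to.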
Combine ML and PL Research
• Already lots of work in POPL, PLDI, SOSP, OSDI, USENIX Security, ...
• Lots of applications and theories still to be found.
• Combination with other fields: systems, security, ...
Thank you!