Transcript of "Probabilistic Programming; Ways Forward" (fwood/talks/2015/dali-keynote)
Probabilistic Programming; Ways Forward
Frank Wood
Outline
• What is probabilistic programming?
• What are the goals of the field?
• What are some challenges?
• Where are we now?
• Ways forward…
What is probabilistic programming?
An Emerging Field
[Figure: Venn diagram placing probabilistic programming at the intersection of ML (algorithms & applications), statistics (inference & theory), and programming languages (compilers, semantics, analysis).]
Conceptualization
CS view: Parameters → Program → Output
Probabilistic Programming / Statistics view: Parameters → Program → Observations x, with inference by
p(θ|x) ∝ p(x|θ) p(θ)
Operative Definition
"Probabilistic programs are usual functional or imperative programs with two added constructs:
(1) the ability to draw values at random from distributions, and
(2) the ability to condition values of variables in a program via observations."
Gordon et al., 2014
What are the goals of probabilistic programming?
Increased Productivity
[Figure: scatter plot of lines of Anglican code vs. lines of Matlab/Java code for equivalent models: HPYP [Wood 2007], DDPMO [Neiswanger et al. 2014], PDIA [Pfau 2010], collapsed LDA, and a DP conjugate mixture; the trend is fit by (fn [x] (logb 1.04 (+ 1 x))), with log and linear axes shown for p(·|data).]
Automatic Inference
[Figure: layered architecture. Models sit on a programming-language representation / abstraction layer, which sits on one or more inference engines.]
CARON ET AL. (paper excerpt)
This lack of consistency is shared by other models based on the Polya urn construction (Zhu et al., 2005; Ahmed and Xing, 2008; Blei and Frazier, 2011). Blei and Frazier (2011) provide a detailed discussion of this issue and describe cases where one should or should not bother about it. It is possible to define a slightly modified version of our model that is consistent under marginalisation, at the expense of an additional set of latent variables. This is described in Appendix C.

3.2 Stationary Models for Cluster Locations
To ensure we obtain a first-order stationary Pitman-Yor process mixture model, we also need to satisfy (B). This can be easily achieved if, for k ∈ I(m^t_t),

    U_{k,t} ∼ p(·|U_{k,t−1})  if k ∈ I(m^t_{t−1}),
    U_{k,t} ∼ H               otherwise,

where H is the invariant distribution of the Markov transition kernel p(·|·). In the time series literature, many approaches are available to build such transition kernels based on copulas (Joe, 1997) or Gibbs sampling techniques (Pitt and Walker, 2005).
Combining the stationary Pitman-Yor and cluster location models, we can summarize the full model by the Bayesian network in Figure 1. It can also be summarized using a Chinese restaurant metaphor (see Figure 2).
Figure 1: A representation of the time-varying Pitman-Yor process mixture as a directed graphical model, representing conditional independencies between variables. All assignment variables and observations at time t are denoted c_t and z_t, respectively.

3.3 Properties of the Models
Under the uniform deletion model, the number A_t = Σ_i m^t_{i,t−1} of alive allocation variables at time t can be written as

    A_t = Σ_{j=1}^{t−1} Σ_{k=1}^{n} X_{j,k}
[Figure: directed graphical model for the full time-series model, with hyperparameters and per-time-step variables r_t, s_t, y_t for t = 0 … T.]
Gaussian Mixture Model
[Figure: from left to right, graphical models for a finite Gaussian mixture model (GMM), a Bayesian GMM, and an infinite GMM.]

    c_i | π ∼ Discrete(π)
    y_i | c_i = k, Θ ∼ Gaussian(·|θ_k)
    π | α ∼ Dirichlet(·| α/K, …, α/K)
    Θ ∼ G_0
Wood (University of Oxford) Unsupervised Machine Learning January, 2014 16 / 19
Latent Dirichlet Allocation
Figure 1: Graphical model for the LDA model, with plates d = 1 … D, i = 1 … N_d, and k = 1 … K over variables α, θ_d, z_di, w_di, φ_k, β.

Lecture: LDA
LDA is a hierarchical model used to model text documents. Each document is modeled as a mixture of topics. Each topic is defined as a distribution over the words in the vocabulary. Here, we will denote by K the number of topics in the model. We use D to indicate the number of documents, M to denote the number of words in the vocabulary, and N_d to denote the number of words in document d. We will assume that the words have been translated to the set of integers {1, …, M} through the use of a static dictionary. This is for convenience only and the integer mapping will contain no semantic information. The generative model for the D documents can be thought of as sequentially drawing a topic mixture θ_d for each document independently from a Dir_K(α·1) distribution, where Dir_K(β) is a Dirichlet distribution over the K-dimensional simplex with parameters [β_1, β_2, …, β_K]. Each of the K topics {φ_k}_{k=1}^K is drawn independently from Dir_M(β·1). Then, for each of the i = 1 … N_d words in document d, an assignment variable z_di is drawn from Mult(θ_d). Conditional on the assignment variable z_di, word i in document d, denoted w_di, is drawn independently from Mult(φ_{z_di}). The graphical model for the process can be seen in Figure 1.
The model is parameterized by the vector-valued parameters {θ_d}_{d=1}^D and {φ_k}_{k=1}^K, the parameters {z_di}_{d=1…D, i=1…N_d}, and the scalar positive parameters α and β. The model is formally written as:

    θ_d ∼ Dir_K(α·1)
    φ_k ∼ Dir_M(β·1)
    z_di ∼ Mult(θ_d)
    w_di ∼ Mult(φ_{z_di})
    θ_d ∼ Dir_K(α·1)
    φ_k ∼ Dir_M(β·1)
    z_di ∼ Discrete(θ_d)
    w_di ∼ Discrete(φ_{z_di})
What are some challenges?
Challenges
• Unbounded recursion
• Equality and continuous variables
Unbounded Recursion
    (defn geometric
      "generates geometrically distributed values in {0,1,2,...}"
      ([p] (geometric p 0))
      ([p n] (if (sample (flip p))
               n
               (geometric p (+ n 1)))))
[Figure: state diagram for the sampler; from state n, emit n with probability p or move to state n+1 with probability 1−p.]
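The same generative process can be sketched in plain Python (a hedged illustration, not Anglican's actual sampler; the function name mirrors the slide):

```python
import random

def geometric(p, n=0, rng=random):
    """Return a geometrically distributed value in {0, 1, 2, ...}:
    count failures of a p-coin until the first success."""
    if rng.random() < p:        # success: stop and return the count so far
        return n
    return geometric(p, n + 1, rng)   # failure: recurse with n + 1

rng = random.Random(0)
samples = [geometric(0.5, rng=rng) for _ in range(10000)]
# For p = 0.5 this distribution has mean (1 - p) / p = 1.
```

The unbounded recursion is the point: the program is a perfectly ordinary recursive function, yet it denotes a distribution with countably infinite support.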
Defining Distributions
    (defm pick-a-stick [stick v l k]
      "picks a stick given a stick generator
       given a value v ~ uniform-continuous(0,1)
       should be called with l = 0.0, k = 1"
      (let [u (+ l (stick k))]
        (if (> u v)
          k
          (pick-a-stick stick v u (+ k 1)))))
[Figure: unit interval divided into stick lengths (stick 1), (stick 2), (stick 3), …; a draw v ~ (sample (uniform-continuous 0 1)) selects the stick it lands in.]
Semantics and Termination
    (def infinite-loop #(loop [] (recur)))

    (defn p []
      (if (sample (flip 0.5))
        1
        (if (sample (flip 0.5))
          (p)
          (infinite-loop))))
[Figure: probability tree; each level returns 1 with probability 0.5, recurses with probability 0.25, or loops forever with probability 0.25.]

    p(x = 1) = Σ_{n=1}^∞ (1/2)^{2n−1} = 2/3 ?
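The series on the slide can be checked with a one-line recursion: writing q = p(x = 1), the program returns 1 immediately with probability 1/2 and reaches a fresh copy of itself with probability 1/4, so

```latex
q = \tfrac{1}{2} + \tfrac{1}{4}\,q
\quad\Longrightarrow\quad
q = \tfrac{2}{3},
\qquad\text{matching}\qquad
\sum_{n=1}^{\infty} \left(\tfrac{1}{2}\right)^{2n-1}
= \frac{1/2}{1 - 1/4} = \tfrac{2}{3}.
```

The remaining 1/3 of the probability mass is absorbed by the non-terminating branch, which is exactly the semantic question the slide's trailing "?" is raising.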
Equality and Continuous Variables
Why are your probabilistic programming systems anti-equality?

Equality
    (defquery bayes-net []
      (let [is-cloudy  (sample (flip 0.5))
            is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                             (= is-cloudy false) (sample (flip 0.2)))
            sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                             (= is-cloudy false) (sample (flip 0.5)))
            wet-grass  (cond (and (= sprinkler true) (= is-raining true))
                               (sample (flip 0.99))
                             (and (= sprinkler false) (= is-raining false))
                               (sample (flip 0.0))
                             (or (= sprinkler true) (= is-raining true))
                               (sample (flip 0.9)))]
        (observe (= wet-grass true))
        (predict :s (hash-map :is-cloudy is-cloudy
                              :is-raining is-raining
                              :sprinkler sprinkler))))
Dirac Observe
The same bayes-net query, with the observation expressed through an explicit Dirac distribution:

    (observe (dirac wet-grass) true)

    p(x | y = o) ∝ δ(y − o) p(x, y) = p(x, y = o)
ABC Observe
The same query, with a soft ABC-style observation: the distance d between the simulated and observed values is scored under a normal kernel with standard deviation tolerance:

    (observe (normal 0.0 tolerance) (d wet-grass true))

    p(x | y = o) ∝ p(d(y, o)) p(x, y)
Noisy Observe
Here wet-grass is bound to a distribution rather than a sampled value, and observed directly:

    (defquery bayes-net []
      (let [is-cloudy  (sample (flip 0.5))
            is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                             (= is-cloudy false) (sample (flip 0.2)))
            sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                             (= is-cloudy false) (sample (flip 0.5)))
            wet-grass  (cond (and (= sprinkler true) (= is-raining true))
                               (flip 0.99)
                             (and (= sprinkler false) (= is-raining false))
                               (flip 0.0)
                             (or (= sprinkler true) (= is-raining true))
                               (flip 0.9))]
        (observe wet-grass true)
        (predict :s (hash-map :is-cloudy is-cloudy
                              :is-raining is-raining
                              :sprinkler sprinkler))))

    p(x | y = o) ∝ p(o | x) p(x)
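As a plain-Python illustration of the noisy-observe reading (a hedged sketch, not any system's actual inference engine), likelihood weighting scores each forward run of the sprinkler network by the probability that the observed node would come out true:

```python
import random

def run_trace(rng):
    """One forward run of the sprinkler network; returns the latent
    assignment and the weight p(wet-grass = true | latents)."""
    is_cloudy = rng.random() < 0.5
    is_raining = rng.random() < (0.8 if is_cloudy else 0.2)
    sprinkler = rng.random() < (0.1 if is_cloudy else 0.5)
    if sprinkler and is_raining:
        w = 0.99
    elif not sprinkler and not is_raining:
        w = 0.0
    else:
        w = 0.9
    return {"is-cloudy": is_cloudy, "is-raining": is_raining,
            "sprinkler": sprinkler}, w

def posterior_raining(n, seed=0):
    """Weighted estimate of p(is-raining | wet-grass = true)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        trace, w = run_trace(rng)
        num += w * trace["is-raining"]
        den += w
    return num / den
```

Exact enumeration of this network gives p(is-raining | wet-grass) ≈ 0.71, which the weighted estimate approaches as n grows.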
Continuous Variables
    (defquery unknown-mean []
      (let [sigma (sqrt 2)
            mu    (marsaglia-normal 1 5)]
        (observe (normal mu sigma) 9)
        (observe (normal mu sigma) 8)
        (predict :mu mu)))
Measure Theoretic Challenges
The "Indian GPA problem"
    (defquery which-nationality [gpa]
      (let [nationality (sample (categorical [["USA" 0.25] ["India" 0.75]]))
            simulated_gpa (if (= nationality "USA")
                            (american-gpa)
                            (indian-gpa))]
        (observe (dirac simulated_gpa) gpa)
        (predict :nationality nationality)))

    p(nationality = "USA" | gpa = 4.0) = ?
American GPA Distribution [0, 4]
    (defn american-gpa []
      (if (sample (flip 0.95))
        (* 4 (sample (beta 8 2)))
        (if (sample (flip 0.85)) 4.0 0.0)))

Indian GPA Distribution [0, 10]
    (defn indian-gpa []
      (if (sample (flip 0.99))
        (* 10 (sample (beta 5 5)))
        (if (sample (flip 0.1)) 0.0 10.0)))

Mixed GPA Distribution
    (defn student-gpa []
      (if (sample (flip 0.25))
        (american-gpa)
        (indian-gpa)))
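A plain-Python rendering of these generators (a hedged sketch; names mirror the slide) makes the measure-theoretic point concrete: the mixed distribution has point masses at 0.0, 4.0, and 10.0, and a draw of exactly 4.0 can only come from the American branch, so conditioning on gpa = 4.0 should answer "USA" with probability 1:

```python
import random

def american_gpa(rng):
    # 95%: continuous draw on [0, 4]; else an atom at 4.0 (85%) or 0.0
    if rng.random() < 0.95:
        return 4 * rng.betavariate(8, 2)
    return 4.0 if rng.random() < 0.85 else 0.0

def indian_gpa(rng):
    # 99%: continuous draw on [0, 10]; else an atom at 0.0 (10%) or 10.0
    if rng.random() < 0.99:
        return 10 * rng.betavariate(5, 5)
    return 0.0 if rng.random() < 0.1 else 10.0

def student(rng):
    """Sample (nationality, gpa) from the mixed model."""
    if rng.random() < 0.25:
        return "USA", american_gpa(rng)
    return "India", indian_gpa(rng)

rng = random.Random(1)
draws = [student(rng) for _ in range(100000)]
# Every draw that lands exactly on the atom at 4.0 is American:
exact_four = [nat for nat, gpa in draws if gpa == 4.0]
```

A soft (ABC-style) observe with a nonzero tolerance instead mixes continuous density near 4.0 from both nationalities, which is why the two queries on these slides give different answers.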
Measure Theoretic Challenges
The "Indian GPA problem", by Russell
    (defquery which-nationality [gpa tolerance]
      (let [nationality (sample (categorical [["USA" 0.25] ["India" 0.75]]))
            simulated_gpa (if (= nationality "USA")
                            (american-gpa)
                            (indian-gpa))]
        (observe (normal simulated_gpa tolerance) gpa)
        (predict :nationality nationality)))

    p(nationality = "USA" | gpa = 4.0) = ?
Where are we now?
[Figure: timeline, 1990–2010, of probabilistic programming systems, many initially supporting discrete RVs only or bounded recursion. From PL: HANSEI, IBAL, Figaro. From ML/STATS: BUGS, WinBUGS, JAGS, STAN, LibBi, infer.NET, Factorie, Blog. From AI/logic: Prolog, Prism, KMP, Problog, Simula. Church family: Church, webChurch, Venture, Anglican, Probabilistic-C.]
Ways forward...
Trace Probability
• observed data points y_{1:N}
• internal random choices x_{1:N}
• simulate from f(x_n | x_{1:n−1}) by running the program forward
• weight execution traces by g(y_n | x_{1:n})

    p(y_{1:N}, x_{1:N}) = Π_{n=1}^N g(y_n | x_{1:n}) f(x_n | x_{1:n−1})

[Figure: an execution trace unrolled as a state-space model, with latent states x_1, x_2, x_3, … and observations y_1, y_2, y_3 attached at the observe points.]

SMC
Iteratively, for n = 1, 2, …:
- simulate
- weight
- resample
[Figure: particles propagated between observes and resampled at each one.]
Intuitively:
- run
- wait
- fork

SMC for Probabilistic Programming
[Figure: program threads run until an observe delimiter, then fork continuations at resampling.]

SMC Inner Loop
[Figure: animation of the inner loop across particles for n = 1, 2, …]
• Sequential Monte Carlo is now a building block for other inference techniques
• Particle MCMC
  - PIMH: "particle independent Metropolis-Hastings"
  - iCSMC: "iterated conditional SMC"
[Andrieu, Doucet, Holenstein 2010]
[W., van de Meent, Mansinghka 2014]
[Animation: SMC slowed down for clarity, sweeps s = 1, 2, 3.]
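The simulate/weight/resample loop can be written out in plain Python for a toy state-space model (a hedged sketch of the SMC inner loop, not Anglican's implementation; the Gaussian random-walk model is invented for the example):

```python
import math
import random

def smc(num_particles, ys, seed=0):
    """Bootstrap SMC for a toy model: simulate x_n ~ Normal(x_{n-1}, 1),
    weight by g(y_n | x_n) = Normal(y_n; x_n, 1) at each observe,
    then resample (fork) particles in proportion to their weights."""
    rng = random.Random(seed)
    particles = [0.0] * num_particles
    log_marginal = 0.0
    for y in ys:
        # simulate: run every particle forward to the next observe
        particles = [x + rng.gauss(0.0, 1.0) for x in particles]
        # weight: score each particle against the observed value
        weights = [math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)
                   for x in particles]
        # the average weight accumulates the marginal-likelihood estimate
        log_marginal += math.log(sum(weights) / num_particles)
        # resample: duplicate good particles, kill bad ones
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return particles, log_marginal
```

In the probabilistic-programming setting the "simulate" step is literally running the program between observe delimiters, and "resample" forks continuations, which is the run/wait/fork intuition above.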
SMC Parallelism Bottleneck
[Figure: synchronous resampling barrier across all particles at n = 1, 2.]
Asynchronously:
- simulate
- weight
- branch

Particle Cascade
Paige, W., Doucet, Teh; NIPS 2014
Theoretical Properties
The particle cascade provides an unbiased estimator of the marginal likelihood, whose variance decreases proportionally to the number of initial particles K_0:

    p̂(y_{0:n}) := (1/K_0) Σ_{k=1}^{K_n} W_n^k

Theorem: For any K_0 ≥ 1 and n ≥ 0, E[p̂(y_{0:n})] = p(y_{0:n}).
Theorem: For any n ≥ 0, there exists a constant a_n such that V[p̂(y_{0:n})] < a_n / K_0.
Conclusion
Bubble Up
[Figure: layered stack of a probabilistic programming system: applications on top of models, models on top of the probabilistic programming language, the language on top of inference.]
Thank You
• Questions?
• Funding: DARPA, Amazon, Microsoft
Opportunities
• Parallelism: "Asynchronous Anytime Sequential Monte Carlo" [Paige, W., Doucet, Teh; NIPS 2014]
• Backwards passing: "Particle Gibbs with Ancestor Sampling for Probabilistic Programs" [van de Meent, Yang, Mansinghka, W.; AISTATS 2015]
• Search: "Maximum a Posteriori Estimation by Search in Probabilistic Models" [Tolpin, W.; SOCS 2015]
• Adaptation: "Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs" [Tolpin, van de Meent, Paige, W.; in submission]
• Novel proposals: "Adaptive PMCMC" [Paige, W.; in submission]
Probabilistic-C
    z_0 ∼ Discrete([1/K, …, 1/K])
    z_n | z_{n−1} ∼ Discrete(T_{z_{n−1}})
    y_n | z_n ∼ Normal(μ_{z_n}, σ²)
Paige & W.; ICML 2014
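The HMM on the slide can be forward-sampled in a few lines of plain Python (an illustrative sketch; the parameter values are invented for the example):

```python
import random

def sample_hmm(T, mus, sigma, n_steps, seed=0):
    """Forward-sample the HMM from the slide:
    z_0 ~ Discrete([1/K, ..., 1/K]),
    z_n | z_{n-1} ~ Discrete(T[z_{n-1}]),
    y_n | z_n ~ Normal(mus[z_n], sigma)."""
    rng = random.Random(seed)
    K = len(mus)
    z = rng.randrange(K)                     # uniform initial state
    zs, ys = [z], [rng.gauss(mus[z], sigma)]
    for _ in range(n_steps - 1):
        z = rng.choices(range(K), weights=T[z])[0]   # row T_{z_{n-1}}
        zs.append(z)
        ys.append(rng.gauss(mus[z], sigma))
    return zs, ys

# invented example parameters: 2 sticky states
T = [[0.9, 0.1], [0.2, 0.8]]
zs, ys = sample_hmm(T, mus=[-1.0, 1.0], sigma=0.5, n_steps=100)
```

In Probabilistic-C this forward simulator is exactly the program handed to the inference engine; conditioning on the y_n is what the observe construct adds.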
How can you participate?
Ways to Participate
• Contribute applications: https://bitbucket.org/fwood/anglican-examples
• Contribute inference algorithms: https://bitbucket.org/dtolpin/embang
An Analogy
Automatic Differentiation : Supervised Learning ::
Probabilistic Programming : Unsupervised Learning
General Purpose Inference
    (defquery sat-solver [N formula]
      "explores an N-dimensional universe for worlds that satisfy the formula"
      (let [state (repeatedly N (fn [] (sample (flip 0.5))))]
        (observe (dirac (formula state)) true)
        (predict :state state)))

    (defdist dirac
      "Dirac distribution"
      [x] []
      (sample [this] x)
      (observe [this value] (if (= x value) 0.0 NegInf)))

    (defm satisfiable-3cnf-formula [state]
      (let [v (fn [i] (nth state i))]
        (and (or (v 0) (not (v 1)) (not (v 2)))
             (or (not (v 0)) (v 1) (v 2))
             (or (not (v 0)) (not (v 1)) (not (v 2))))))

General Purpose Inference
    (defquery md5-inverse [L md5str]
      "conditional distribution of strings that map to the same MD5 hashed string"
      (let [mesg (sample (string-generative-model L))]
        (observe (dirac md5str) (md5 mesg))
        (predict :message mesg)))
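Read operationally, the sat-solver query is rejection sampling: propose random worlds and keep only those where the formula holds. A plain-Python sketch of that reading (hedged; the 3-CNF formula mirrors the slide):

```python
import random

def formula(v):
    """The 3-CNF formula from the slide, over booleans v[0..2]."""
    return ((v[0] or not v[1] or not v[2]) and
            (not v[0] or v[1] or v[2]) and
            (not v[0] or not v[1] or not v[2]))

def sat_rejection(n_dims, formula, n_tries, seed=0):
    """Rejection sampling: draw each variable from flip(0.5) and keep
    the state only when the formula is true; the hard dirac observe
    assigns every other state zero probability."""
    rng = random.Random(seed)
    hits = []
    for _ in range(n_tries):
        state = [rng.random() < 0.5 for _ in range(n_dims)]
        if formula(state):
            hits.append(state)
    return hits

worlds = sat_rejection(3, formula, 1000)
```

The md5-inverse query has the same shape, but its acceptance region is so small that rejection is hopeless, which is the slide's point: "general purpose inference" subsumes arbitrarily hard search problems.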
Not Sum-Product: Bayesian HMM
Suppose the transition matrix is unknown: T_k ∼ Dirichlet(α_k)
Paige & W.; ICML 2014
Range of Effectiveness
[Figure: the same 1990–2010 systems timeline (HANSEI, IBAL, Figaro; BUGS, WinBUGS, JAGS, STAN, LibBi, infer.NET, Factorie, Blog; Prolog, Prism, KMP; Church, webChurch, Venture, Anglican, Probabilistic-C), annotated with each system's range of effectiveness.]
Continuous Variables
    (defm marsaglia-normal [mean var]
      (let [d (uniform-continuous -1.0 1.0)
            x (sample d)
            y (sample d)
            s (+ (* x x) (* y y))]
        (if (< s 1)
          (+ mean (* (sqrt var)
                     (* x (sqrt (* -2 (/ (log s) s))))))
          (marsaglia-normal mean var))))
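For reference, the same Marsaglia polar method in plain Python (a hedged sketch mirroring the Anglican code above; the guard against s = 0 is an added safety check not present on the slide):

```python
import math
import random

def marsaglia_normal(mean, var, rng=random):
    """Marsaglia polar method: rejection-sample (x, y) in the unit disc,
    then transform x into a Normal(mean, var) draw."""
    x = rng.uniform(-1.0, 1.0)
    y = rng.uniform(-1.0, 1.0)
    s = x * x + y * y
    if 0 < s < 1:
        return mean + math.sqrt(var) * x * math.sqrt(-2 * math.log(s) / s)
    return marsaglia_normal(mean, var, rng)   # reject and retry

rng = random.Random(0)
samples = [marsaglia_normal(1.0, 5.0, rng) for _ in range(20000)]
```

The rejection loop is itself an unbounded recursion, which is why this definition doubles as an example of a continuous primitive built from simpler samples.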
Scalability: Particle Count
• Comparison across particle-based inference approaches: raw speed of drawing samples
[Figure: expressivity vs. efficiency trade-off; unbounded recursion sits at the expressive extreme.]
Credits
• Code highlighting: http://hilite.me
Forward Inference (SMC)
[Figure: particles / continuations forked at each observe.]
Bayesian Nonparametrics
    (defm pick-a-stick [stick v l k]
      ; picks a stick given a stick generator
      ; given a value v ~ uniform-continuous(0,1)
      ; should be called with l = 0.0, k = 1
      (let [u (+ l (stick k))]
        (if (> u v)
          k
          (pick-a-stick stick v u (+ k 1)))))

    (defm remaining [b k]
      (if (<= k 0)
        1
        (* (- 1 (b k)) (remaining b (- k 1)))))

    (defm polya [stick]
      ; given a stick generating function,
      ; polya returns a function that samples
      ; stick indexes from the stick lengths
      (let [uc01 (uniform-continuous 0 1)]
        (fn []
          (let [v (sample uc01)]
            (pick-a-stick stick v 0.0 1)))))

    (defm dirichlet-process-breaking-rule [alpha k]
      (sample (beta 1.0 alpha)))

    (defm stick [breaking-rule]
      ; given a breaking-rule function which
      ; returns a value between 1 and 0 given a
      ; stick index k, returns a function that
      ; returns the stick length for index k
      (let [b (mem breaking-rule)]
        (fn [k]
          (if (< 0 k)
            (* (b k) (remaining b (- k 1)))
            0))))
[Figure: unit interval divided into stick lengths (stick 1), (stick 2), (stick 3), …; a draw v ~ (sample (uniform-continuous 0 1)) selects a stick index.]
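The same stick-breaking construction in plain Python (a hedged sketch; memoization plays the role of Anglican's mem, and the names mirror the slide):

```python
import functools
import random

rng = random.Random(0)
alpha = 1.0

@functools.lru_cache(maxsize=None)
def b(k):
    """Memoized breaking rule: Beta(1, alpha) proportion broken off at index k."""
    return rng.betavariate(1.0, alpha)

def stick(k):
    """Length of stick k: its break proportion times what remained before it."""
    if k < 1:
        return 0.0
    length = b(k)
    for j in range(1, k):
        length *= 1.0 - b(j)
    return length

def pick_a_stick(v, l=0.0, k=1):
    """Walk the sticks until the cumulative length passes v ~ Uniform(0, 1)."""
    u = l + stick(k)
    if u > v:
        return k
    return pick_a_stick(v, u, k + 1)

draws = [pick_a_stick(rng.random()) for _ in range(1000)]
```

Because b is memoized, sticks are materialized lazily: only the finitely many indices a particular run actually visits ever get sampled, which is what makes this an executable representation of an infinite-dimensional object.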
Syntax & Implementation Considerations
• Embedded vs. standalone
• Imperative vs. functional
• Lisp vs. Python vs. C vs. …