Bayes Net Learning

Description

Bayes Net Learning. Oliver Schulte, Machine Learning 726. Learning Bayes Nets. Structure Learning Example: Sleep Disorder Network. Source: "Development of Bayesian Network models for obstructive sleep apnea syndrome assessment." Fouron, Anne Gisèle (2006). M.Sc. Thesis, SFU. - PowerPoint PPT Presentation

Transcript of Bayes Net Learning

Slide 1: Bayes Net Learning
Oliver Schulte, Machine Learning 726
[Speaker note: If you use "insert slide number" under Footer, that text box only displays the slide number, not the total number of slides. So I use a new text box for the slide number in the master.]

Slide 2: Learning Bayes Nets

Slide 3: Structure Learning Example: Sleep Disorder Network

Source: "Development of Bayesian Network models for obstructive sleep apnea syndrome assessment." Fouron, Anne Gisèle (2006). M.Sc. Thesis, SFU.
[Speaker note: Generally we don't get into structure learning in this course.]

Slide 4: Parameter Learning Scenarios
Complete data (today). Later: missing data (EM).

Parent \ Child   Discrete                          Continuous
Discrete         Maximum Likelihood,               logit distribution
                 Decision Trees                    (logistic regression)
Continuous       conditional Gaussian              linear Gaussian
                 (not discussed)                   (linear regression)

Slide 5: The Parameter Learning Problem
Input: a data table X_{N x D}.
One column per node (random variable); one row per instance.
How to fill in the Bayes net parameters?
[Diagram: Bayes net with nodes including Humidity and PlayTennis.]

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    sunny     hot          high      weak    no
2    sunny     hot          high      strong  no
3    overcast  hot          high      weak    yes
4    rain      mild         high      weak    yes
5    rain      cool         normal    weak    yes
6    rain      cool         normal    strong  no
7    overcast  cool         normal    strong  yes
8    sunny     mild         high      weak    no
9    sunny     cool         normal    weak    yes
10   rain      mild         normal    weak    yes
11   sunny     mild         normal    strong  yes
12   overcast  mild         high      strong  yes
13   overcast  hot          normal    weak    yes
14   rain      mild         high      strong  no

[Speaker note: What is N? What is D? PlayTennis: do you play tennis Saturday morning? For now, complete data; incomplete data another day (EM).]

Slide 6: Start Small: Single Node
What value would you choose for P(Humidity = high)?
How about P(Humidity = high) = 50%?

Day  Humidity
1    high
2    high
3    high
4    high
5    normal
6    normal
7    normal
8    high
9    normal
10   normal
11   normal
12   high
13   normal
14   high

Slide 7: Parameters for Two Nodes
Network: Humidity -> PlayTennis.
P(Humidity = high) = θ
P(PlayTennis = yes | Humidity = high) = θ1
P(PlayTennis = yes | Humidity = normal) = θ2

Day  Humidity  PlayTennis
1    high      no
2    high      no
3    high      yes
4    high      yes
5    normal    yes
6    normal    no
7    normal    yes
8    high      no
9    normal    yes
10   normal    yes
11   normal    yes
12   high      yes
13   normal    yes
14   high      no
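As a minimal sketch (helper names are my own, not from the slides), the two-node parameters can be read off the table above as sample frequencies:

```python
# (Humidity, PlayTennis) pairs, copied from the 14-day table above.
data = [
    ("high", "no"), ("high", "no"), ("high", "yes"), ("high", "yes"),
    ("normal", "yes"), ("normal", "no"), ("normal", "yes"), ("high", "no"),
    ("normal", "yes"), ("normal", "yes"), ("normal", "yes"), ("high", "yes"),
    ("normal", "yes"), ("high", "no"),
]

def cond_freq(data, parent_value):
    """Sample frequency of PlayTennis = yes among rows with the given Humidity."""
    rows = [pt for h, pt in data if h == parent_value]
    return sum(1 for pt in rows if pt == "yes") / len(rows)

theta = sum(1 for h, _ in data if h == "high") / len(data)  # P(Humidity = high)
theta1 = cond_freq(data, "high")    # P(PlayTennis = yes | Humidity = high)
theta2 = cond_freq(data, "normal")  # P(PlayTennis = yes | Humidity = normal)
print(theta, theta1, theta2)  # 0.5 0.42857142857142855 0.8571428571428571
```

These frequencies (1/2, 3/7, 6/7) are exactly the maximum likelihood estimates, as the following slides derive.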

Is θ the same as in the single-node model? How about θ1 = 3/7? How about θ2 = 6/7?

Slide 8: Maximum Likelihood Estimation

Slide 9: MLE
An important general principle: choose parameter values that maximize the likelihood of the data.
Intuition: explain the data as well as possible.
Recall from Bayes' theorem that the likelihood is P(data | parameters) = P(D | θ).
[Speaker note: calligraphic font D in the book.]

Slide 10: Finding the Maximum Likelihood Solution: Single Node
Write down the likelihood of the Humidity data.
In the example, P(D | θ) = θ^7 (1 − θ)^7.
Maximize this function of θ.

Day  Humidity  P(H_i | θ)
1    high      θ
2    high      θ
3    high      θ
4    high      θ
5    normal    1 − θ
6    normal    1 − θ
7    normal    1 − θ
8    high      θ
9    normal    1 − θ
10   normal    1 − θ
11   normal    1 − θ
12   high      θ
13   normal    1 − θ
14   high      θ

P(Humidity = high) = θ

Assumes independent, identically distributed (i.i.d.) data, so the likelihood is the product of the per-row probabilities.
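As a quick numeric sanity check of this single-node likelihood (a sketch; the grid resolution of 1/1000 is an arbitrary choice):

```python
# Likelihood of the Humidity data: 7 "high" and 7 "normal" observations.
def likelihood(theta, heads=7, tails=7):
    return theta ** heads * (1 - theta) ** tails

# Brute-force maximization over a grid instead of calculus.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=likelihood)
print(best)  # 0.5 -- the sample frequency 7/14
```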

[Speaker note: binomial MLE.]

Slide 11: Solving the Equation
Set the derivative of the log-likelihood ln P(D | θ) = h ln θ + t ln(1 − θ) to zero and solve: θ = h/(h + t), the sample frequency (7/14 = 0.5 in the example).

Slide 12: Finding the Maximum Likelihood Solution: Two Nodes
In a Bayes net, we can maximize each parameter separately.
Fix a parent condition, and we get a single-node problem.

Slide 13: Finding the Maximum Likelihood Solution: Single Node, >2 Possible Values
Use Lagrange multipliers to enforce the constraint that the parameter values sum to 1.

Slide 14: Problems With MLE
The 0/0 problem: what if there are no data for a given parent-child configuration?
Single point estimate: does not quantify uncertainty. Is 6/10 the same as 6000/10000?
[Slide note: show a Bayes net with PlayTennis as child and three parents.]
[Speaker note: Discuss first; do they see the problems? Curse of dimensionality. Discussion: how to solve this problem?]

Slide 15: Classical Statistics and MLE
To quantify uncertainty, specify a confidence interval.
For the 0/0 problem, use data smoothing.

Slide 16: Bayesian Parameter Learning

Slide 17: Parameter Probabilities
Intuition: quantify uncertainty about parameter values by assigning a prior probability to parameter values. Not based on data.
[Slide note: give the Russell and Norvig example.]

Slide 18: Bayesian Prediction/Inference
What probability does the Bayesian assign to PlayTennis = true? That is, how should we bet on PlayTennis = true?
Answer: make a prediction for each parameter value, then average the predictions using the prior as weights.
[Slide note: Russell and Norvig example.]

Slide 19: Mean
Bayesian prediction can be seen as the expected value of a probability distribution P, a.k.a. the average or mean of P. Notation: E(P), μ.
[Speaker note: give the example of grades.]

Slide 20: Variance
Define: variance of a parameter estimate = uncertainty. It decreases with learning.

Slide 21: Continuous Priors
Probabilities usually range over a continuous interval; then probabilities of probabilities are probabilities of continuous variables.
The probability of a continuous variable is given by a probability density function. p(x) behaves like the probability of a discrete value, but with integrals replacing sums, e.g. the integral of p(x) over [0, 1] equals 1.
Exercise: find the p.d.f. of the uniform distribution over an interval [a, b].

Slide 22: Bayesian Prediction With P.D.F.s

Slide 23: Bayesian Learning

Slide 24: Bayesian Updating
Update the prior using Bayes' theorem.
Exercise: find the posterior of the uniform distribution given 10 heads and 20 tails.
[Speaker note: Answer: proportional to θ^10 (1 − θ)^20, a Beta(11, 21) density. Notice that the posterior has a different form than the prior.]

Slide 25: The Laplace Correction
Start with a uniform prior: the probability of PlayTennis could be any value in [0, 1], with equal prior probability.
Suppose I have observed n data points. Find the posterior distribution.
Predict the probability of heads using the posterior distribution. The required integral was solved by Laplace; with h heads in n trials, the resulting prediction is (h + 1)/(n + 2).
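The Laplace correction can be sketched numerically (`laplace_estimate` is a hypothetical helper name; the rule (h + 1)/(n + 2) is the posterior-mean prediction under a uniform prior):

```python
# With a uniform prior over theta, after h heads in n trials the posterior-mean
# prediction is (h + 1) / (n + 2): Laplace's rule of succession.
def laplace_estimate(h, n):
    return (h + 1) / (n + 2)

def mle(h, n):
    return h / n  # raises ZeroDivisionError when n == 0: the 0/0 problem

print(laplace_estimate(10, 30))  # 0.34375, vs. MLE 10/30 = 0.333...
print(laplace_estimate(0, 0))    # 0.5: well defined even with no data
```

Note how the Laplace estimate both smooths the sample frequency toward 1/2 and stays defined in the 0/0 case that breaks MLE.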

Slide 26: Parametrized Priors
Motivation: suppose I don't want a uniform prior, e.g. to smooth with m > 0 or to express prior knowledge.
Use parameters for the prior distribution, called hyperparameters, chosen so that updating the prior is easy.

Slide 27: Beta Distribution: Definition
The Beta distribution with hyperparameters a, b > 0 has density proportional to θ^(a−1) (1 − θ)^(b−1) on [0, 1].

Slide 28: Beta Distribution: Examples

Slide 29: Updating the Beta Distribution
A Beta(a, b) prior updated with h heads and t tails yields a Beta(a + h, b + t) posterior: the Beta family is conjugate for binary data.

Slide 30: Conjugate Prior for Non-Binary Variables
The Dirichlet distribution generalizes the Beta distribution to variables with more than two values.

Slide 31: Summary
Maximum likelihood: a general parameter estimation method.
- Choose parameters that make the data as likely as possible.
- For Bayes net parameters: MLE = match the sample frequencies. A typical result!
- Problems: not defined in the 0/0 situation; does not quantify uncertainty in the estimate.
Bayesian approach:
- Assume a prior probability for the parameters; the prior has hyperparameters (e.g., the Beta distribution).
- Problems: the prior choice is not based on data; inferences (averaging) can be hard to compute.
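The conjugate Beta update described above can be sketched as follows (the prior values a = b = 2 are an arbitrary illustrative choice, equivalent to smoothing with m = 4 pseudo-observations):

```python
# Conjugate updating: a Beta(a, b) prior plus h heads and t tails gives a
# Beta(a + h, b + t) posterior; the Bayesian prediction is its mean a/(a + b).
def update_beta(a, b, heads, tails):
    return a + heads, b + tails

def predict(a, b):
    return a / (a + b)  # posterior mean = probability assigned to heads

a, b = 2, 2  # illustrative prior (the uniform prior would be a = b = 1)
a, b = update_beta(a, b, heads=10, tails=20)
print(a, b)           # 12 22
print(predict(a, b))  # 12/34, about 0.3529
```

Updating reduces to addition, which is exactly why hyperparameters of a conjugate prior are "chosen so that updating the prior is easy."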