Probability in Machine Learning

Transcript of Probability in Machine Learning

Page 1: Probability in Machine Learning

Page 2: Probability in Machine Learning

Three Axioms of Probability

• Given an event $E$ in a sample space $S$, where $S = \bigcup_{i=1}^{n} E_i$

• First axiom: $P(E) \in \mathbb{R}$, $0 \le P(E) \le 1$

• Second axiom: $P(S) = 1$

• Third axiom (additivity): for any countable sequence of mutually exclusive events $E_i$, $P\left(\bigcup_{i=1}^{n} E_i\right) = P(E_1) + P(E_2) + \cdots + P(E_n) = \sum_{i=1}^{n} P(E_i)$

Page 3: Probability in Machine Learning

Random Variables

• A random variable is a variable whose values are numerical outcomes of a random phenomenon.

• Discrete variables and the Probability Mass Function (PMF)

• Continuous variables and the Probability Density Function (PDF)

$\sum_x p_X(x) = 1$

$\Pr(a \le X \le b) = \int_a^b f_X(x)\,dx$

Page 4: Probability in Machine Learning

Expected Value (Expectation)

• Expectation of a random variable $X$: $E[X] = \sum_x x\,p_X(x)$ for discrete $X$, or $\int x f_X(x)\,dx$ for continuous $X$

• What is the expected value of rolling one die? (See the sketch below.)

https://en.wikipedia.org/wiki/Expected_value
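For the die question above, a direct application of the definition (my sketch) recovers the classic answer of 3.5:

```python
# E[X] = sum over outcomes of x * p(x) for a fair six-sided die.
expected = sum(x * (1 / 6) for x in range(1, 7))
print(expected)   # 3.5
```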

Page 5: Probability in Machine Learning

Expected Value of Playing Roulette

• Bet $1 on a single number (0 to 36), and get a $35 payoff if you win. What is the expected value?

๐ธ ๐บ๐‘Ž๐‘–๐‘› ๐‘“๐‘Ÿ๐‘œ๐‘š $1 ๐‘๐‘’๐‘ก = โˆ’1 ร—36

37+ 35 ร—

1

37=โˆ’1

37
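Checking the arithmetic with exact fractions (my sketch): one number out of 37 wins, the other 36 lose the stake:

```python
from fractions import Fraction

p_win = Fraction(1, 37)     # the single number you bet on
p_lose = Fraction(36, 37)   # every other number, including 0
expected_gain = (-1) * p_lose + 35 * p_win
print(expected_gain, float(expected_gain))   # -1/37, about -0.027 per $1 bet
```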

Page 6: Probability in Machine Learning

Variance

• The variance of a random variable $X$ is the expected value of the squared deviation from the mean of $X$:

$\mathrm{Var}(X) = E[(X - \mu)^2]$

Page 7: Probability in Machine Learning

Bias and Variance

Page 8: Probability in Machine Learning

Covariance

• Covariance is a measure of the joint variability of two random variables.

$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$

https://programmathically.com/covariance-and-correlation/
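The two forms of the identity can be checked on synthetic data (my sketch; the linear relationship between x and y is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)   # y co-varies with x by construction

# Cov(X, Y) computed both ways (sample estimates):
cov_def = np.mean((x - x.mean()) * (y - y.mean()))
cov_alt = np.mean(x * y) - x.mean() * y.mean()
print(cov_def, cov_alt)   # both close to 2, since Cov(X, 2X + noise) = 2 Var(X)
```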

Page 9: Probability in Machine Learning

Standard Deviation

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} = \sqrt{\mathrm{Var}(X)}$

68–95–99.7 rule: about 68%, 95%, and 99.7% of values drawn from a normal distribution fall within one, two, and three standard deviations of the mean.
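The rule is easy to confirm empirically (my sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=1_000_000)   # standard normal: mu = 0, sigma = 1

# Fraction of samples within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(k, np.mean(np.abs(samples) <= k))   # ~0.6827, ~0.9545, ~0.9973
```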

Page 10: Probability in Machine Learning

6 Sigma

• A product has a 99.99966% chance of being free of defects

https://www.leansixsigmadefinition.com/glossary/six-sigma/

Page 11: Probability in Machine Learning

Union, Intersection, and Conditional Probability

• $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

• $P(A \cap B)$ is simplified as $P(AB)$

• Conditional probability $P(A|B)$: the probability of event A given that B has occurred

− $P(A|B) = \frac{P(AB)}{P(B)}$

− $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$

Page 12: Probability in Machine Learning

Chain Rule of Probability

• The joint probability can be expressed with the chain rule: $P(A_1 A_2 \cdots A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 A_2)\cdots P(A_n|A_1 \cdots A_{n-1})$

Page 13: Probability in Machine Learning

Mutually Exclusive

โ€ข ๐‘ƒ ๐ด๐ต = 0

โ€ข ๐‘ƒ ๐ด โˆช ๐ต = ๐‘ƒ ๐ด + ๐‘ƒ ๐ต

Page 14: Probability in Machine Learning

Independence of Events

• Two events A and B are said to be independent if the probability of their intersection is equal to the product of their individual probabilities

− $P(AB) = P(A)\,P(B)$

− $P(A|B) = P(A)$

Page 15: Probability in Machine Learning

Bayes Rule

๐‘ƒ ๐ด|๐ต =๐‘ƒ ๐ต|๐ด ๐‘ƒ(๐ด)

๐‘ƒ(๐ต)

Proof:

Remember ๐‘ƒ ๐ด|๐ต = ๐‘ƒ๐ด๐ต

๐ต

So ๐‘ƒ ๐ด๐ต = ๐‘ƒ ๐ด|๐ต ๐‘ƒ ๐ต = ๐‘ƒ ๐ต|๐ด ๐‘ƒ(๐ด)Then Bayes ๐‘ƒ ๐ด|๐ต = ๐‘ƒ ๐ต|๐ด ๐‘ƒ(๐ด)/๐‘ƒ ๐ต

Page 16: Probability in Machine Learning

Naïve Bayes Classifier

Page 17: Probability in Machine Learning

Naïve = Assume All Features Independent
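A minimal Bernoulli Naive Bayes sketch under that independence assumption (my own toy data and smoothing choice, not the slides' example):

```python
import numpy as np

# Toy data: binary feature matrix X, binary labels y.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])

def fit(X, y, alpha=1.0):
    """Estimate class priors and per-class feature probabilities
    with Laplace smoothing alpha."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)
        theta = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        params[c] = (prior, theta)
    return params

def predict(params, x):
    """Highest-posterior class, with log P(x|c) factored feature by
    feature -- the 'naive' independence assumption."""
    scores = {}
    for c, (prior, theta) in params.items():
        loglik = np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
        scores[c] = np.log(prior) + loglik
    return max(scores, key=scores.get)

print(predict(fit(X, y), np.array([1, 0, 0])))
```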

Page 18: Probability in Machine Learning

Graphical Model

๐‘ ๐‘Ž, ๐‘, ๐‘, ๐‘‘, ๐‘’ = ๐‘ ๐‘Ž ๐‘ ๐‘ ๐‘Ž ๐‘ ๐‘ ๐‘Ž, ๐‘ ๐‘ ๐‘‘ ๐‘ ๐‘(๐‘’|๐‘)

Page 19: Probability in Machine Learning

Normal (Gaussian) Distribution

• A type of continuous probability distribution for a real-valued random variable.

• One of the most important distributions

Page 20: Probability in Machine Learning

Central Limit Theorem

• Averages of samples of observations of random variables independently drawn from independent distributions converge to the normal distribution as the sample size grows

https://corporatefinanceinstitute.com/resources/knowledge/other/central-limit-theorem/
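A quick empirical check (my sketch): means of uniform samples, which are far from normal individually, land tightly around the normal prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# 50,000 samples, each the mean of 1,000 draws from Uniform(0, 1).
sample_means = rng.uniform(0, 1, size=(50_000, 1_000)).mean(axis=1)

# CLT prediction: mean 0.5, std = sqrt(1/12) / sqrt(1000), about 0.00913.
print(sample_means.mean(), sample_means.std())
```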

Page 21: Probability in Machine Learning

Bernoulli Distribution

• Definition: a random variable that takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$

• PMF: $P(X = 1) = p$, $P(X = 0) = q$

• $E[X] = p$

• $\mathrm{Var}(X) = pq$

https://en.wikipedia.org/wiki/Bernoulli_distribution

https://acegif.com/flipping-coin-gifs/
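Both moments follow directly from the PMF (my sketch):

```python
# E[X] = p and Var(X) = pq for a Bernoulli(p) variable, from the definitions.
p = 0.3
q = 1 - p
mean = 1 * p + 0 * q                              # E[X] = p
var = (1 - mean) ** 2 * p + (0 - mean) ** 2 * q   # E[(X - mu)^2]
print(mean, var, p * q)                           # 0.3, 0.21, 0.21
```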

Page 22: Probability in Machine Learning

Information Theory

• Self-information: $I(x) = -\log P(x)$

• Shannon entropy: $H = -\sum_i p_i \log_2 p_i$
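A few values of the entropy formula (my sketch):

```python
import math

def entropy(ps):
    """Shannon entropy in bits: H = -sum_i p_i log2 p_i."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin surprises less
print(entropy([1/6] * 6))    # ~2.585 bits: a fair die
```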

Page 23: Probability in Machine Learning

Entropy of Bernoulli Variable

โ€ข๐ป ๐‘ฅ = ๐ธ ๐ผ ๐‘ฅ = โˆ’๐ธ[log ๐‘ƒ(๐‘ฅ)]

https://en.wikipedia.org/wiki/Information_theory

Page 24: Probability in Machine Learning

Kullback-Leibler (KL) Divergence

โ€ข ๐ท๐พ๐ฟ(๐‘| ๐‘ž = ๐ธ log ๐‘ƒ ๐‘‹ โˆ’ log๐‘„ ๐‘‹ = ๐ธ log๐‘ƒ ๐‘ฅ

๐‘„ ๐‘ฅ

KL Divergence is Asymmetric!

๐ท๐พ๐ฟ(๐‘| ๐‘ž != ๐ท๐พ๐ฟ(๐‘ž| ๐‘

https://www.deeplearningbook.org/slides/03_prob.pdf
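The asymmetry shows up in a short computation over discrete distributions (my sketch):

```python
import math

def kl(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.8, 0.2], [0.5, 0.5]
print(kl(p, q), kl(q, p))   # ~0.193 vs ~0.223: not equal
```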

Page 25: Probability in Machine Learning

Key Takeaways

• Expected value (expectation) is the mean (weighted average) of a random variable

• Events A and B are independent if $P(AB) = P(A)\,P(B)$

• Events A and B are mutually exclusive if $P(AB) = 0$

• The central limit theorem tells us that the normal distribution is the one to reach for when the data's probability distribution is unknown

• Entropy is the expected value of self-information: $-E[\log P(x)]$

• KL divergence measures the difference between two probability distributions and is asymmetric