A Quick Overview of Probability William W. Cohen Machine Learning 10-605.
Probability in Machine Learning
Three Axioms of Probability
• Given an event $E$ in a sample space $S$, $S = \bigcup_{i=1}^{n} E_i$
• First axiom: $P(E) \in \mathbb{R}$, $0 \le P(E) \le 1$
• Second axiom: $P(S) = 1$
• Third axiom (additivity): for any countable sequence of mutually exclusive events $E_1, E_2, \ldots$
  $P\left(\bigcup_{i=1}^{\infty} E_i\right) = P(E_1) + P(E_2) + \cdots = \sum_{i=1}^{\infty} P(E_i)$
Random Variables
• A random variable is a variable whose values are numerical outcomes of a random phenomenon.
• Discrete variables and Probability Mass Function (PMF): $\sum_{x} p_X(x) = 1$
• Continuous variables and Probability Density Function (PDF): $\Pr(a \le X \le b) = \int_{a}^{b} f_X(x)\,dx$
Expected Value (Expectation)
• Expectation of a random variable X: $E[X] = \sum_{x} x\,P(X = x)$ for a discrete variable, $E[X] = \int x\,f_X(x)\,dx$ for a continuous one
• Expected value of rolling one die? (worked out below)
https://en.wikipedia.org/wiki/Expected_value
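For a fair six-sided die, each face has probability 1/6, so the worked calculation is:
$E[X] = \sum_{x=1}^{6} x \cdot \tfrac{1}{6} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac{21}{6} = 3.5$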
Expected Value of Playing Roulette
• Bet $1 on a single number (0 to 36) and get a $35 payoff if you win. What's the expected value?
$E[\text{Gain from } \$1 \text{ bet}] = -1 \times \frac{36}{37} + 35 \times \frac{1}{37} = -\frac{1}{37}$
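A minimal Monte Carlo check of this expectation, as a sketch under the assumptions above (a fair 37-pocket European wheel and a repeated $1 single-number bet):

```python
import random

def simulate_roulette(n_spins=1_000_000, bet_number=7, seed=0):
    """Estimate the expected gain of repeatedly betting $1 on a single number."""
    rng = random.Random(seed)
    total_gain = 0
    for _ in range(n_spins):
        pocket = rng.randint(0, 36)  # 37 equally likely pockets: 0..36
        total_gain += 35 if pocket == bet_number else -1
    return total_gain / n_spins

print(simulate_roulette())  # close to -1/37 (about -0.027)
```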
Variance
• The variance of a random variable X is the expected value of the squared deviation from the mean of X:
$\mathrm{Var}(X) = E[(X - \mu)^2]$, where $\mu = E[X]$
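Expanding the square gives the equivalent computational form (standard algebra, not shown on the slide):
$\mathrm{Var}(X) = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - (E[X])^2$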
Bias and Variance
Covariance
• Covariance is a measure of the joint variability of two random variables.
$\mathrm{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] = E[XY] - E[X]\,E[Y]$
https://programmathically.com/covariance-and-correlation/
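A quick numeric check of the two equivalent forms with NumPy (the data below is made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly 2*x, so the covariance is positive

# Population covariance: E[(X - E[X])(Y - E[Y])]
cov_deviation = np.mean((x - x.mean()) * (y - y.mean()))

# Same quantity via the identity E[XY] - E[X]E[Y]
cov_identity = np.mean(x * y) - x.mean() * y.mean()

print(cov_deviation, cov_identity)  # the two values agree
```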
Standard Deviation
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} = \sqrt{\mathrm{Var}(X)}$
68–95–99.7 rule
• About 68%, 95%, and 99.7% of the values of a normally distributed variable lie within 1, 2, and 3 standard deviations of the mean, respectively.
6 Sigma
• A product has a 99.99966% chance of being free of defects
https://www.leansixsigmadefinition.com/glossary/six-sigma/
Union, Intersection, and Conditional Probability
• $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
• $P(A \cap B)$ is abbreviated as $P(AB)$
• Conditional probability $P(A|B)$: the probability of event A given that B has occurred
  – $P(A|B) = \frac{P(AB)}{P(B)}$
  – $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$
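A small concrete example of conditional probability (a fair die; the scenario is mine, not from the slide): let B be "the roll is even" and A be "the roll is 6". Then $P(AB) = \tfrac{1}{6}$ and $P(B) = \tfrac{1}{2}$, so $P(A|B) = \frac{P(AB)}{P(B)} = \frac{1/6}{1/2} = \tfrac{1}{3}$.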
Chain Rule of Probability
• The joint probability can be expanded with the chain rule:
$P(X_1, X_2, \ldots, X_n) = P(X_1)\,P(X_2|X_1)\,P(X_3|X_1, X_2)\cdots P(X_n|X_1, \ldots, X_{n-1})$
Mutually Exclusive
• $P(AB) = 0$
• $P(A \cup B) = P(A) + P(B)$
Independence of Events
• Two events A and B are said to be independent if the probability of their intersection is equal to the product of their individual probabilities:
  – $P(AB) = P(A)\,P(B)$
  – $P(A|B) = P(A)$
Bayes Rule
$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$
Proof:
Remember $P(A|B) = \frac{P(AB)}{P(B)}$.
So $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$.
Then Bayes: $P(A|B) = P(B|A)\,P(A) / P(B)$.
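A tiny numeric illustration of the rule in code (the prior and likelihood values are made up for the example):

```python
def bayes_posterior(prior_a, lik_b_given_a, lik_b_given_not_a):
    """P(A|B) from P(A), P(B|A), and P(B|not A) via Bayes rule."""
    # Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_b = lik_b_given_a * prior_a + lik_b_given_not_a * (1 - prior_a)
    return lik_b_given_a * prior_a / p_b

# Example: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.2
print(bayes_posterior(0.3, 0.8, 0.2))  # P(A|B) is about 0.632
```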
Naïve Bayes Classifier
Naïve = Assume All Features Independent
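With this assumption the classifier scores each class $y$ using the standard naive Bayes factorization (written here for reference; the feature notation $x_1, \ldots, x_n$ is mine):
$P(y \mid x_1, \ldots, x_n) \propto P(y)\,\prod_{i=1}^{n} P(x_i \mid y)$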
Graphical Model
$P(a, b, c, d, e) = P(a)\,P(b \mid a)\,P(c \mid a, b)\,P(d \mid b)\,P(e \mid c)$
Normal (Gaussian) Distribution
• A type of continuous probability distribution for a real-valued random variable.
• One of the most important distributions
Central Limit Theorem
• Averages of samples of observations of random variables independently drawn from independent distributions converge in distribution to the normal distribution
https://corporatefinanceinstitute.com/resources/knowledge/other/central-limit-theorem/
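A small NumPy sketch of this effect (illustrative only; the uniform source distribution, sample size, and number of samples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples from a decidedly non-normal distribution (uniform on [0, 1))
# and look at the distribution of their sample means.
n_samples, sample_size = 10_000, 50
means = rng.random((n_samples, sample_size)).mean(axis=1)

# CLT: the sample means are approximately normal with
# mean ~ 0.5 and standard deviation ~ sqrt(1/12) / sqrt(sample_size).
print(means.mean(), means.std())
print(0.5, np.sqrt(1 / 12 / sample_size))
```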
Bernoulli Distribution
• Definition: a single trial with two outcomes, success ($X = 1$) with probability $p$ and failure ($X = 0$) with probability $q = 1 - p$
• PMF: $P(X = 1) = p$, $P(X = 0) = 1 - p$
• $E[X] = p$
• $\mathrm{Var}(X) = pq = p(1 - p)$
https://en.wikipedia.org/wiki/Bernoulli_distribution
https://acegif.com/flipping-coin-gifs/
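A quick simulation checking the mean and variance formulas (the value of p is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3
flips = rng.random(100_000) < p      # Bernoulli(p) samples as True/False (1/0)

print(flips.mean(), p)               # sample mean is close to p
print(flips.var(), p * (1 - p))      # sample variance is close to p(1-p)
```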
Information Theory
• Self-information: $I(x) = -\log P(x)$
• Shannon entropy:
$H = -\sum_{i} p_i \log_2 p_i$
Entropy of Bernoulli Variable
• $H(x) = E[I(x)] = -E[\log P(x)]$
https://en.wikipedia.org/wiki/Information_theory
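A short sketch computing the entropy of a Bernoulli variable in bits (the values of p are arbitrary):

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable: H = -p*log2(p) - (1-p)*log2(1-p)."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in (0.1, 0.5, 0.9):
    print(p, bernoulli_entropy(p))  # maximal (1 bit) at p = 0.5
```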
Kullback-Leibler (KL) Divergence
• $D_{KL}(P \,\|\, Q) = E_{x \sim P}\left[\log P(x) - \log Q(x)\right] = E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$
KL Divergence is Asymmetric!
$D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$
https://www.deeplearningbook.org/slides/03_prob.pdf
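A short check of the asymmetry on two small made-up discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]
q = [0.3, 0.4, 0.3]
print(kl_divergence(p, q))  # D_KL(P || Q)
print(kl_divergence(q, p))  # D_KL(Q || P): a different number, so KL is asymmetric
```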
Key Takeaways
• Expected value (expectation) is the mean (probability-weighted average) of a random variable
• Events A and B are independent if $P(AB) = P(A)\,P(B)$
• Events A and B are mutually exclusive if $P(AB) = 0$
• The central limit theorem suggests the normal distribution as the default choice when the data's probability distribution is unknown
• Entropy is the expected value of self-information: $-E[\log P(x)]$
• KL divergence measures the difference between two probability distributions and is asymmetric