L2: Review of probability and statistics - Texas A&M...
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 1
L2: Review of probability and statistics
• Probability
– Definition of probability
– Axioms and properties
– Conditional probability
– Bayes theorem
• Random variables
– Definition of a random variable
– Cumulative distribution function
– Probability density function
– Statistical characterization of random variables
• Random vectors
– Mean vector
– Covariance matrix
• The Gaussian random variable
Review of probability theory
• Definitions (informal)
– Probabilities are numbers assigned to events that indicate "how likely" it is that the event will occur when a random experiment is performed
– A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment
– The sample space S of a random experiment is the set of all possible outcomes
• Axioms of probability
– Axiom I: $P[A_i] \ge 0$
– Axiom II: $P[S] = 1$
– Axiom III: $A_i \cap A_j = \emptyset \Rightarrow P[A_i \cup A_j] = P[A_i] + P[A_j]$
[Figure: sample space containing events A1…A4, mapped by the probability law to their event probabilities]
• Warm-up exercise
– I show you three colored cards (A, B, C)
• One BLUE on both sides
• One RED on both sides
• One BLUE on one side, RED on the other
– I shuffle the three cards, then pick one and show you one side only. The side visible to you is RED
• Obviously, the card has to be either A or C, right?
– I am willing to bet $1 that the other side of the card has the same color, and need someone in class to bet another $1 that it is the other color
• On average we will end up even, right?
• Let's try it!
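The warm-up can be checked by simulation. A minimal sketch (the card encoding, trial count, and seed are arbitrary choices):

```python
import random

def simulate_card_bet(trials=100_000, seed=0):
    """Pick one of the three cards, show a random side; among the trials
    where the visible side is RED, count how often the hidden side is RED too."""
    rng = random.Random(seed)
    cards = [("B", "B"), ("R", "R"), ("B", "R")]   # cards A, B, C
    same = total = 0
    for _ in range(trials):
        side1, side2 = rng.choice(cards)
        shown, hidden = (side1, side2) if rng.random() < 0.5 else (side2, side1)
        if shown == "R":
            total += 1
            same += (hidden == "R")
    return same / total

# The estimate comes out near 2/3, not 1/2: two of the three RED faces
# belong to the RED/RED card, so the bet favors "same color".
```

The intuition that "the card must be A or C, so it's 50/50" ignores that the RED/RED card is twice as likely to produce a RED face.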
• More properties of probability
– $P[A^c] = 1 - P[A]$
– $P[A] \le 1$
– $P[\emptyset] = 0$
– Given $A_1 \ldots A_N$ with $A_i \cap A_j = \emptyset \;\forall ij$, then $P[\cup_{k=1}^{N} A_k] = \sum_{k=1}^{N} P[A_k]$
– $P[A_1 \cup A_2] = P[A_1] + P[A_2] - P[A_1 \cap A_2]$
– $P[\cup_{k=1}^{N} A_k] = \sum_{j=1}^{N} P[A_j] - \sum_{j<k} P[A_j \cap A_k] + \cdots + (-1)^{N+1} P[A_1 \cap A_2 \cdots \cap A_N]$
– $A_1 \subset A_2 \Rightarrow P[A_1] \le P[A_2]$
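A quick sanity check of the union rule, sketched on a fair six-sided die (the two events below are arbitrary illustrative choices):

```python
from fractions import Fraction

# Equally likely outcomes of a fair six-sided die
S = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event (a subset of S) under equally likely outcomes."""
    return Fraction(len(event), len(S))

A1 = {2, 4, 6}   # "roll is even"
A2 = {4, 5, 6}   # "roll is at least 4"

union = P(A1 | A2)                                  # P[A1 U A2]
inclusion_exclusion = P(A1) + P(A2) - P(A1 & A2)    # P[A1] + P[A2] - P[A1 n A2]
assert union == inclusion_exclusion == Fraction(2, 3)

# Monotonicity: A1 n A2 is a subset of A1, so its probability cannot exceed P[A1]
assert P(A1 & A2) <= P(A1)
```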
• Conditional probability
– If A and B are two events, the probability of event A when we already know that event B has occurred is

$P[A|B] = \frac{P[A \cap B]}{P[B]} \quad \text{if } P[B] > 0$

• This conditional probability P[A|B] is read:
– the "conditional probability of A conditioned on B", or simply
– the "probability of A given B"
– Interpretation
• The new evidence "B has occurred" has the following effects
• The original sample space S (the square) becomes B (the rightmost circle)
• The event A becomes A∩B
• P[B] simply re-normalizes the probability of events that occur jointly with B
[Figure: Venn diagrams of S, A, and B before and after "B has occurred"]
• Theorem of total probability
– Let $B_1, B_2 \ldots B_N$ be a partition of $S$ (mutually exclusive events that add up to $S$)
– Any event $A$ can be represented as

$A = A \cap S = A \cap (B_1 \cup B_2 \cdots \cup B_N) = (A \cap B_1) \cup (A \cap B_2) \cdots \cup (A \cap B_N)$

– Since $B_1, B_2 \ldots B_N$ are mutually exclusive, then

$P[A] = P[A \cap B_1] + P[A \cap B_2] + \cdots + P[A \cap B_N]$

– and, therefore

$P[A] = P[A|B_1]P[B_1] + \cdots + P[A|B_N]P[B_N] = \sum_{k=1}^{N} P[A|B_k]P[B_k]$
[Figure: partition B1…BN of the sample space, with event A overlapping several of the Bk]
• Bayes theorem
– Assume $B_1, B_2 \ldots B_N$ is a partition of S
– Suppose that event $A$ occurs
– What is the probability of event $B_j$?
– Using the definition of conditional probability and the theorem of total probability we obtain

$P[B_j|A] = \frac{P[A \cap B_j]}{P[A]} = \frac{P[A|B_j]\,P[B_j]}{\sum_{k=1}^{N} P[A|B_k]\,P[B_k]}$

– This is known as Bayes theorem or Bayes rule, and is (one of) the most useful relations in probability and statistics
• Bayes theorem and statistical pattern recognition
– When used for pattern classification, Bayes theorem is generally expressed as

$P[\omega_i|x] = \frac{P[x|\omega_i]\,P[\omega_i]}{\sum_{k=1}^{N} P[x|\omega_k]\,P[\omega_k]} = \frac{P[x|\omega_i]\,P[\omega_i]}{P[x]}$

• where $\omega_i$ is the $i$-th class (e.g., phoneme) and $x$ is the feature/observation vector (e.g., vector of MFCCs)
– A typical decision rule is to choose the class $\omega_i$ with the highest $P[\omega_i|x]$
• Intuitively, we choose the class that is most "likely" given observation $x$
– Each term in Bayes theorem has a special name
• $P[\omega_i]$: prior probability (of class $\omega_i$)
• $P[\omega_i|x]$: posterior probability (of class $\omega_i$ given the observation $x$)
• $P[x|\omega_i]$: likelihood (probability of observation $x$ given class $\omega_i$)
• $P[x]$: normalization constant (does not affect the decision)
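The decision rule above fits in a few lines of code. All numbers below (two classes, their priors, and their likelihoods for a single observation) are made up purely for illustration:

```python
# Hypothetical priors P[w_i] and likelihoods P[x|w_i] for one observation x
priors = {"w1": 0.7, "w2": 0.3}
likelihoods = {"w1": 0.2, "w2": 0.6}

# Posterior P[w_i|x] = P[x|w_i] P[w_i] / P[x], where P[x] is the
# normalization constant given by the theorem of total probability
joint = {w: likelihoods[w] * priors[w] for w in priors}
evidence = sum(joint.values())                       # P[x]
posteriors = {w: joint[w] / evidence for w in joint}

# Decision rule: pick the class with the highest posterior
decision = max(posteriors, key=posteriors.get)
# Since P[x] is the same for every class, comparing the unnormalized
# joints P[x|w_i] P[w_i] yields the same decision.
```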
• Example
– Consider a clinical problem where we need to decide whether a patient has a particular medical condition on the basis of an imperfect test
• Someone with the condition may go undetected (false negative)
• Someone free of the condition may yield a positive result (false positive)
– Nomenclature
• The true-negative rate P[NEG|¬COND] of a test is called its SPECIFICITY
• The true-positive rate P[POS|COND] of a test is called its SENSITIVITY
– Problem
• Assume a population of 10,000 with a 1% prevalence of the condition
• Assume that we design a test with 98% specificity and 90% sensitivity
• Assume you take the test, and the result comes out POSITIVE
• What is the probability that you have the condition?
– Solution
• Fill in the joint frequency table below, or
• Apply Bayes rule
                     TEST IS POSITIVE                 TEST IS NEGATIVE                 ROW TOTAL
HAS CONDITION        True positive, P[POS|COND]       False negative, P[NEG|COND]
                     100 x 0.90 = 90                  100 x (1 - 0.90) = 10            100
FREE OF CONDITION    False positive, P[POS|¬COND]     True negative, P[NEG|¬COND]
                     9,900 x (1 - 0.98) = 198         9,900 x 0.98 = 9,702             9,900
COLUMN TOTAL         288                              9,712                            10,000
– Applying Bayes rule

$P[\text{COND}|+] = \frac{P[+|\text{COND}]\,P[\text{COND}]}{P[+]} = \frac{P[+|\text{COND}]\,P[\text{COND}]}{P[+|\text{COND}]\,P[\text{COND}] + P[+|\neg\text{COND}]\,P[\neg\text{COND}]} = \frac{0.90 \times 0.01}{0.90 \times 0.01 + (1 - 0.98) \times 0.99} = 0.3125$
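The same computation in code, using the numbers from the example:

```python
prevalence = 0.01        # P[COND]
sensitivity = 0.90       # P[+|COND]
specificity = 0.98       # P[-|¬COND], so P[+|¬COND] = 1 - specificity

# Theorem of total probability: P[+]
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes rule: P[COND|+]
p_cond_given_pos = sensitivity * prevalence / p_pos

# Out of 10,000 people, p_pos * 10,000 = 288 test positive, and only
# about 31% of them actually have the condition.
```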
• Random variables
– When we perform a random experiment we are usually interested in some measurement or numerical attribute of the outcome
• e.g., weights in a population of subjects, execution times when benchmarking CPUs, shape parameters when performing ATR
– These examples lead to the concept of a random variable
• A random variable $X$ is a function that assigns a real number $X(\xi)$ to each outcome $\xi$ in the sample space of a random experiment
• $X(\xi)$ maps from all possible outcomes in sample space onto the real line
– The function that assigns values to each outcome is fixed and deterministic, i.e., as in the rule "count the number of heads in three coin tosses"
• Randomness in $X$ is due to the underlying randomness of the outcome $\xi$ of the experiment
– Random variables can be
• Discrete, e.g., the resulting number after rolling a die
• Continuous, e.g., the weight of a sampled individual
[Figure: mapping from an outcome ξ in sample space S to a point x = X(ξ) on the real line]
• Cumulative distribution function (cdf)
– The cumulative distribution function $F_X(x)$ of a random variable $X$ is defined as the probability of the event $\{X \le x\}$

$F_X(x) = P[X \le x], \quad -\infty < x < \infty$

– Intuitively, $F_X(b)$ is the long-term proportion of times when $X(\xi) \le b$
– Properties of the cdf
• $0 \le F_X(x) \le 1$
• $\lim_{x \to \infty} F_X(x) = 1$
• $\lim_{x \to -\infty} F_X(x) = 0$
• $F_X(a) \le F_X(b)$ if $a \le b$
• $F_X(b) = \lim_{h \to 0} F_X(b + h) = F_X(b^+)$
[Figure: staircase cdf for rolling a die, with steps of 1/6 at x = 1…6, and a smooth cdf for a person's weight, x in lb from 100 to 500]
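The "long-term proportion" reading of the cdf can be checked by simulating die rolls (the sample size and seed below are arbitrary choices):

```python
import random

rng = random.Random(1)
rolls = [rng.randint(1, 6) for _ in range(60_000)]

def empirical_cdf(samples, b):
    """Long-term proportion of outcomes with X <= b."""
    return sum(x <= b for x in samples) / len(samples)

# For a fair die, F_X(b) = floor(b)/6 for 1 <= b < 6,
# so the estimate at b = 3 should be close to 3/6 = 0.5
estimate = empirical_cdf(rolls, 3)
```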
• Probability density function (pdf)
– The probability density function $f_X(x)$ of a continuous random variable $X$, if it exists, is defined as the derivative of $F_X(x)$

$f_X(x) = \frac{dF_X(x)}{dx}$

– For discrete random variables, the equivalent of the pdf is the probability mass function

$f_X(x) = \frac{\Delta F_X(x)}{\Delta x}$

– Properties
• $f_X(x) > 0$
• $P[a < X < b] = \int_a^b f_X(x)\,dx$
• $F_X(x) = \int_{-\infty}^{x} f_X(x')\,dx'$
• $1 = \int_{-\infty}^{\infty} f_X(x)\,dx$
• $f_X(x|A) = \frac{d}{dx}F_X(x|A)$, where $F_X(x|A) = \frac{P[\{X < x\} \cap A]}{P[A]}$ if $P[A] > 0$
[Figure: pdf for a person's weight, x in lb from 100 to 500, and pmf for rolling a (fair) die, with all six masses equal to 1/6]
• What is the probability of somebody weighing 200 lb?
• According to the pdf, this is about 0.62
• This number seems reasonable, right?
• Now, what is the probability of somebody weighing 124.876 lb?
• According to the pdf, this is about 0.43
• But, intuitively, we know that the probability should be zero (or very, very small)
• How do we explain this paradox?
• The pdf DOES NOT define a probability, but a probability DENSITY!
• To obtain the actual probability we must integrate the pdf over an interval
• So we should have asked the question: what is the probability of somebody weighing 124.876 lb plus or minus 2 lb?
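The point can be made concrete by integrating a density. The normal weight model below (mean 150 lb, standard deviation 30 lb) is an assumption for illustration, not a figure from the slide:

```python
import math

def normal_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def prob_interval(a, b, mu, sigma, steps=10_000):
    """P[a < X < b] by numerically integrating the pdf (midpoint rule)."""
    h = (b - a) / steps
    return sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(steps)) * h

mu, sigma = 150.0, 30.0                                 # hypothetical weight model (lb)
point = prob_interval(124.876, 124.876, mu, sigma)      # zero-width interval
band = prob_interval(122.876, 126.876, mu, sigma)       # 124.876 lb +/- 2 lb

# 'point' is exactly 0: a single value carries no probability mass,
# even though the density there is nonzero.
```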
• The probability mass function is a "true" probability (the reason why we call it a "mass" as opposed to a "density")
• The pmf indicates that the probability of any number when rolling a fair die is the same for all numbers, and equal to 1/6, a very legitimate answer
• The pmf DOES NOT need to be integrated to obtain the probability (it cannot be integrated in the first place)
• Statistical characterization of random variables
– The cdf or the pdf are SUFFICIENT to fully characterize a r.v.
– However, a r.v. can be PARTIALLY characterized with other measures
– Expectation (center of mass of a density)

$E[X] = \mu = \int_{-\infty}^{\infty} x\,f_X(x)\,dx$

– Variance (spread about the mean)

$\mathrm{var}(X) = \sigma^2 = E[(X - E[X])^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx$

– Standard deviation

$\mathrm{std}(X) = \sigma = \mathrm{var}(X)^{1/2}$

– N-th moment

$E[X^N] = \int_{-\infty}^{\infty} x^N f_X(x)\,dx$
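These integrals can be approximated numerically. A sketch using an exponential density, an arbitrary choice whose moments are known in closed form ($E[X] = 1/\lambda$, $\mathrm{var}(X) = 1/\lambda^2$):

```python
import math

def moment(pdf, n, lo, hi, steps=200_000):
    """E[X^n] = integral of x^n f_X(x) dx, via the midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += (x ** n) * pdf(x)
    return total * h

lam = 2.0
exp_pdf = lambda x: lam * math.exp(-lam * x)   # exponential density on [0, inf)

mean = moment(exp_pdf, 1, 0.0, 50.0)           # E[X]   = 1/lam   = 0.5
second = moment(exp_pdf, 2, 0.0, 50.0)         # E[X^2]
variance = second - mean ** 2                  # var(X) = 1/lam^2 = 0.25
```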
• Random vectors
– An extension of the concept of a random variable
• A random vector $X$ is a function that assigns a vector of real numbers to each outcome $\xi$ in sample space $S$
• We generally denote a random vector by a column vector
– The notions of cdf and pdf are replaced by "joint cdf" and "joint pdf"
• Given random vector $X = [x_1, x_2 \ldots x_N]^T$ we define the joint cdf as

$F_X(x) = P[\{X_1 \le x_1\} \cap \{X_2 \le x_2\} \cdots \cap \{X_N \le x_N\}]$

• and the joint pdf as

$f_X(x) = \frac{\partial^N F_X(x)}{\partial x_1 \partial x_2 \cdots \partial x_N}$

– The term marginal pdf is used to represent the pdf of a subset of all the random vector dimensions
• A marginal pdf is obtained by integrating out the variables that are of no interest
• e.g., for a 2D random vector $X = [x_1, x_2]^T$, the marginal pdf of $x_1$ is

$f_{X_1}(x_1) = \int_{x_2 = -\infty}^{x_2 = +\infty} f_{X_1 X_2}(x_1, x_2)\,dx_2$
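Marginalization can be checked numerically. The joint density below is an illustrative product of two independent unit Gaussians, chosen so that the marginal of $x_1$ is known in closed form:

```python
import math

def joint_pdf(x1, x2):
    """Illustrative 2-D density: two independent standard normals."""
    return math.exp(-(x1 ** 2 + x2 ** 2) / 2) / (2 * math.pi)

def marginal_x1(x1, lo=-10.0, hi=10.0, steps=20_000):
    """f_X1(x1) obtained by integrating x2 out of the joint pdf (midpoint rule)."""
    h = (hi - lo) / steps
    return sum(joint_pdf(x1, lo + (i + 0.5) * h) for i in range(steps)) * h

# The result should match the 1-D standard normal density at x1 = 0
target = 1 / math.sqrt(2 * math.pi)
approx = marginal_x1(0.0)
```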
• Statistical characterization of random vectors
– A random vector is also fully characterized by its joint cdf or joint pdf
– Alternatively, we can (partially) describe a random vector with measures similar to those defined for scalar random variables
– Mean vector

$E[X] = \mu = [E[X_1], E[X_2] \ldots E[X_N]]^T = [\mu_1, \mu_2 \ldots \mu_N]^T$

– Covariance matrix

$\mathrm{cov}(X) = \Sigma = E[(X - \mu)(X - \mu)^T] = \begin{bmatrix} E[(x_1 - \mu_1)^2] & \cdots & E[(x_1 - \mu_1)(x_N - \mu_N)] \\ \vdots & \ddots & \vdots \\ E[(x_1 - \mu_1)(x_N - \mu_N)] & \cdots & E[(x_N - \mu_N)^2] \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \cdots & c_{1N} \\ \vdots & \ddots & \vdots \\ c_{1N} & \cdots & \sigma_N^2 \end{bmatrix}$
– The covariance matrix indicates the tendency of each pair of features (dimensions in a random vector) to vary together, i.e., to co-vary
• The covariance has several important properties
– If $x_i$ and $x_k$ tend to increase together, then $c_{ik} > 0$
– If $x_i$ tends to decrease when $x_k$ increases, then $c_{ik} < 0$
– If $x_i$ and $x_k$ are uncorrelated, then $c_{ik} = 0$
– $|c_{ik}| \le \sigma_i \sigma_k$, where $\sigma_i$ is the standard deviation of $x_i$
– $c_{ii} = \sigma_i^2 = \mathrm{var}(x_i)$
• The covariance terms can be expressed as $c_{ii} = \sigma_i^2$ and $c_{ik} = \rho_{ik}\sigma_i\sigma_k$
– where $\rho_{ik}$ is called the correlation coefficient
[Figure: scatter plots of $x_i$ vs. $x_k$ for $c_{ik} = -\sigma_i\sigma_k$ ($\rho_{ik} = -1$), $c_{ik} = -\frac{1}{2}\sigma_i\sigma_k$ ($\rho_{ik} = -\frac{1}{2}$), $c_{ik} = 0$ ($\rho_{ik} = 0$), $c_{ik} = +\frac{1}{2}\sigma_i\sigma_k$ ($\rho_{ik} = +\frac{1}{2}$), and $c_{ik} = \sigma_i\sigma_k$ ($\rho_{ik} = +1$)]
A numerical example
• Given the following samples from a 3D distribution
– Compute the covariance matrix
– Generate scatter plots for every pair of variables
– Can you observe any relationships between the covariance and the scatter plots?
– You may work your solution in the templates below
[Worksheet template: one row per example (1-4) plus an Average row, with columns for $x_1$, $x_2$, $x_3$, the deviations $(x_i - \mu_i)$, the squared deviations $(x_i - \mu_i)^2$, and the cross products $(x_1 - \mu_1)(x_2 - \mu_2)$, $(x_1 - \mu_1)(x_3 - \mu_3)$, $(x_2 - \mu_2)(x_3 - \mu_3)$]
            Variables (or features)
Examples    x1    x2    x3
    1        2     2     4
    2        3     4     6
    3        5     4     2
    4        6     6     4
[Scatter-plot templates for the variable pairs (x1, x2), (x1, x3), and (x2, x3)]
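A sketch of the requested computation with NumPy, using the four examples from the table:

```python
import numpy as np

# Rows = examples, columns = variables x1, x2, x3
X = np.array([[2, 2, 4],
              [3, 4, 6],
              [5, 4, 2],
              [6, 6, 4]], dtype=float)

mu = X.mean(axis=0)              # mean vector
D = X - mu                       # deviations from the mean
Sigma = D.T @ D / len(X)         # population covariance matrix (divide by n)

# Cross-check against numpy's built-in (bias=True also divides by n)
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))

# x1 and x2 co-vary positively, x1 and x3 negatively, and x2 and x3 not at
# all, which is what the scatter plots should show.
```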
• The Normal or Gaussian distribution
– The multivariate Normal distribution $N(\mu, \Sigma)$ is defined as

$f_X(x) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}\,e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$

– For a single dimension, this expression reduces to

$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
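The density formula translates directly to code. A minimal sketch (not optimized for numerical stability; a production implementation would work with Cholesky factors and log densities instead of `det` and `inv`):

```python
import math
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the N-dimensional normal density N(mu, Sigma) at x."""
    x, mu, Sigma = np.asarray(x, float), np.asarray(mu, float), np.asarray(Sigma, float)
    N = mu.size
    diff = x - mu
    norm = (2 * math.pi) ** (N / 2) * np.linalg.det(Sigma) ** 0.5
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)

# 1-D sanity check: N(0, 1) at its mean equals 1/sqrt(2*pi)
p1 = gaussian_pdf([0.0], [0.0], [[1.0]])
# 2-D sanity check: N(0, I) at its mean equals 1/(2*pi)
p2 = gaussian_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2))
```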
– Gaussian distributions are very popular since
• The parameters $(\mu, \Sigma)$ uniquely characterize the normal distribution
• If all variables $x_i$ are uncorrelated ($E[x_i x_k] = E[x_i]E[x_k]$), then
– the variables are also independent ($P[x_i x_k] = P[x_i]P[x_k]$), and
– $\Sigma$ is diagonal, with the individual variances on the main diagonal
• Central Limit Theorem (next slide)
• The marginal and conditional densities are also Gaussian
• Any linear transformation of any $N$ jointly Gaussian r.v.'s results in $N$ r.v.'s that are also Gaussian
– For $X = [X_1, X_2 \ldots X_N]^T$ jointly Gaussian, and $A_{N \times N}$ invertible, then $Y = AX$ is also jointly Gaussian with

$f_Y(y) = \frac{f_X(A^{-1}y)}{|A|}$
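The linear-transformation property can be checked empirically: sample $X \sim N(\mu, \Sigma)$, apply an invertible $A$, and compare the sample statistics of $Y = AX$ with $A\mu$ and $A\Sigma A^T$. The particular $\mu$, $\Sigma$, and $A$ below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])         # invertible N x N matrix

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T                        # y = A x, applied to every sample

# Y should be Gaussian with mean A mu and covariance A Sigma A^T
assert np.allclose(Y.mean(axis=0), A @ mu, atol=0.03)
assert np.allclose(np.cov(Y, rowvar=False), A @ Sigma @ A.T, atol=0.08)
```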
• Central Limit Theorem
– Given any distribution with mean $\mu$ and variance $\sigma^2$, the sampling distribution of the mean approaches a normal distribution with mean $\mu$ and variance $\sigma^2/N$ as the sample size $N$ increases
• No matter what the shape of the original distribution is, the sampling distribution of the mean approaches a normal distribution
• $N$ is the sample size used to compute each mean, not the overall number of samples in the data
– Example: 500 experiments are performed using a uniform distribution
• $N = 1$
– One sample is drawn from the distribution and its mean is recorded (500 times)
– The histogram resembles a uniform distribution, as one would expect
• $N = 4$
– Four samples are drawn and the mean of the four samples is recorded (500 times)
– The histogram starts to look more Gaussian
• As $N$ grows, the shape of the histograms resembles a Normal distribution more closely
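The experiment described above is easy to reproduce. The sketch below checks the variance-shrinking part of the theorem (the seed is an arbitrary choice; the counts follow the example):

```python
import random
import statistics

def sample_means(n, experiments=500, seed=42):
    """Each experiment draws n uniform(0,1) samples and records their mean."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.random() for _ in range(n))
            for _ in range(experiments)]

# Uniform(0,1) has mu = 0.5 and sigma^2 = 1/12; the sampling distribution
# of the mean should have variance close to sigma^2 / n.
v1 = statistics.pvariance(sample_means(1))    # ~ 1/12 ~ 0.083
v4 = statistics.pvariance(sample_means(4))    # ~ 1/48 ~ 0.021
```

Plotting histograms of `sample_means(1)` and `sample_means(4)` reproduces the uniform-to-Gaussian transition the slide describes.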