Post on 19-Oct-2020
Carl Herrmann Health Data Science Unit - Medizinische Fakultät Heidelberg
4. Distributions
Principles of statistical inference
• So far, we have worked with datasets that represent samples of a more general population
  ๏ sample: 403 diabetes patients → population: all diabetic patients
  ๏ sample: 152 ALL patients → population: all ALL patients
• We assume that the sample represents the larger population well → representative sample
[Figure: overlapping populations "Women" and "Diabetic patients". The sample is chosen based on the population of interest, e.g. female diabetes samples for the population of diabetic women.]
[Figure: sample → population of diabetic patients?]
Inference: how can we estimate the parameters (mean, spread, …) of the population based on the sample?
Random variable
• Sample = vector of values ("realizations")
• Population = random variable, describing the (random) outcome of an experiment or measurement
• Notation
  ๏ X, Y, … = random variables (continuous or discrete)
    ‣ values obtained by rolling a die, electoral behavior in Germany, increase in survival after treatment
  ๏ x, y, … = realizations (numerical or categorical)
    ‣ values after three throws, votes of persons A, B, C, increase in survival of a specific patient cohort
• Random variables can be
  ๏ continuous (weight, height, time, …), e.g. body weight
  ๏ discrete (counts)
Random variables
• A random variable X can be described using
  ๏ expectation E(X): "theoretical" mean of the entire population
  ๏ variance Var(X): "theoretical" variance of the entire population
  ๏ density distribution p(X) (continuous RV)
  ๏ probability distribution p(X) (discrete RV)
Discrete probability distribution:
  $\sum_{i \in A} p(X_i) = 1$
  p(X = 2): probability that X takes the value 2
Continuous probability distribution:
  $\int_{x \in A} p(x)\,dx = 1$
  p(X = 2) = 0, but p(2 < x < 3) > 0
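The normalization, expectation and variance of a discrete RV can be checked directly from its probability distribution. The sketch below assumes a toy example that is not on the slide (one throw of a fair six-sided die) and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

# Toy discrete RV (assumption for illustration): one throw of a fair die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Normalization: the probabilities over all possible values sum to 1.
total = sum(pmf.values())

# Expectation and variance computed from the probability distribution.
expectation = sum(x * p for x, p in pmf.items())
variance = sum((x - expectation) ** 2 * p for x, p in pmf.items())

print(total, expectation, variance)
```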
Example of a discrete RV
• Random variable X: number of sixes obtained in 10 dice throws
• X ∈ {0, 1, …, 10}
[Figure: observed distribution of X over 10 games and over 1000 games]
E(X)? Var(X)?
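One way to explore this question is to simulate the games. The following is a minimal sketch assuming a fair die; the seed and the number of games are arbitrary choices. Since each throw is an independent trial with success probability 1/6, the theoretical values are E(X) = 10 · 1/6 and Var(X) = 10 · 1/6 · 5/6.

```python
import random
from statistics import mean, pvariance

def play_game(rng, n_throws=10):
    """Number of sixes in one game of n_throws fair dice throws."""
    return sum(1 for _ in range(n_throws) if rng.randint(1, 6) == 6)

rng = random.Random(0)  # fixed seed (arbitrary) for reproducibility

# The sample mean and variance stabilize as the number of games grows.
for n_games in (10, 1000, 20_000):
    counts = [play_game(rng) for _ in range(n_games)]
    print(n_games, round(mean(counts), 3), round(pvariance(counts), 3))

# Theoretical values for comparison.
expected_mean = 10 * (1 / 6)
expected_var = 10 * (1 / 6) * (5 / 6)
print(round(expected_mean, 3), round(expected_var, 3))
```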
Cumulative distributions
• Example: discrete RV
• Probability distribution p(k)
  ๏ cumulative distribution up: $p_+(k) = p(x > k)$
  ๏ cumulative distribution down: $p_-(k) = p(x \le k)$
  ๏ $p_+(k) + p_-(k) = 1$
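For a discrete RV, both cumulative distributions follow from a running sum over p(k). A minimal sketch, again assuming a fair die as a toy example that is not taken from the slide:

```python
from fractions import Fraction
from itertools import accumulate

# Toy example (assumption): X = value of one fair die throw, k = 1..6.
ks = list(range(1, 7))
pmf = [Fraction(1, 6)] * 6

p_down = list(accumulate(pmf))     # p-(k) = p(x <= k), running sum of p(k)
p_up = [1 - c for c in p_down]     # p+(k) = p(x > k) = 1 - p-(k)

for k, down, up in zip(ks, p_down, p_up):
    print(k, down, up)
```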
Cumulative distribution
• Continuous random variable X
• Density distribution p(x)
  ๏ cumulative distribution up: $p_+(x_0) = \int_{x_0}^{+\infty} p(x)\,dx = p(x > x_0)$
  ๏ cumulative distribution down: $p_-(x_0) = \int_{-\infty}^{x_0} p(x)\,dx = p(x \le x_0)$
Cumulative distribution of the uniform distribution?
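A possible worked answer, assuming X is uniform on an interval [a, b] (the symbols a and b are introduced here and do not appear on the slide). The density is constant on the interval, so the cumulative distribution down grows linearly from 0 to 1:

```latex
p(x) = \frac{1}{b-a} \quad \text{for } a \le x \le b, \qquad p(x) = 0 \quad \text{otherwise}

p_-(x_0) = \int_{-\infty}^{x_0} p(x)\,dx =
\begin{cases}
0 & x_0 < a \\
\dfrac{x_0 - a}{b - a} & a \le x_0 \le b \\
1 & x_0 > b
\end{cases}

p_+(x_0) = 1 - p_-(x_0)
```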
Binomial distribution
• Number of successes in L independent trials, each with probability of success p
• Example: DNA polymerase
  ๏ error rate p at each position (constant, independent of the previous position)
  ๏ sequence of length L
  → number of errors (= "successes")?
$p(k) = \binom{L}{k} p^k (1-p)^{L-k}$
$E(X) = Lp$
$Var(X) = E(X)(1-p) = Lp(1-p)$
(How to compute the cumulative distribution?)
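One straightforward way to obtain the cumulative distribution is to sum p(k) up to the value of interest. The sketch below assumes a hypothetical sequence length L = 1000 and borrows the error rate p = 1/900 from the Taq polymerase example later in the slides. It also checks E(X) = Lp and Var(X) = Lp(1 − p) numerically.

```python
from math import comb

def binom_pmf(k, L, p):
    """Probability of exactly k successes in L independent trials."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)

def binom_cdf(k, L, p):
    """Cumulative distribution down, p(X <= k), as a plain sum of p(i)."""
    return sum(binom_pmf(i, L, p) for i in range(k + 1))

# L is a hypothetical value chosen for illustration only.
L, p = 1000, 1 / 900

mean = sum(k * binom_pmf(k, L, p) for k in range(L + 1))
var = sum((k - mean) ** 2 * binom_pmf(k, L, p) for k in range(L + 1))

print(mean, L * p)                  # numerical vs. closed-form expectation
print(var, L * p * (1 - p))         # numerical vs. closed-form variance
print(binom_cdf(2, L, p))           # probability of at most 2 errors
print(1 - binom_cdf(2, L, p))       # probability of more than 2 errors
```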
Poisson distribution
• Measures the number of events in a time period for a given rate of events
• Example: number of drops per tile during a rainshower
• Rate needs to be constant and independent of previous period!
$p(k) = \frac{\lambda^k}{k!} e^{-\lambda}$
$E(X) = \lambda$, $Var(X) = \lambda$
λ = drops per tile per unit time
[Figure: Poisson distributions for λ = 0.56, λ = 0.94 and λ = 3.125]
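The equality of expectation and variance can be verified numerically from p(k). This is a minimal sketch using the three rates from the figure; truncating the sum at k = 59 is an assumption that is harmless for rates this small.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events at rate lam."""
    return lam**k / factorial(k) * exp(-lam)

for lam in (0.56, 0.94, 3.125):     # rates shown in the slide's figure
    ks = range(60)                  # truncated support; the tail is negligible here
    mean = sum(k * poisson_pmf(k, lam) for k in ks)
    var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
    # Last column: probability that a tile receives no drop at all.
    print(lam, round(mean, 6), round(var, 6), round(poisson_pmf(0, lam), 4))
```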
Negative binomial
• Independent trials, constant probability p
• Different questions can be asked
  ๏ Number of successes in L trials? → binomial distribution with 2 parameters (L, p)
  ๏ Waiting time until r successes are reached? → negative binomial distribution with 2 parameters (r, p)
  ๏ Example: Taq polymerase has accuracy (1 − p) with p ~ 1/900 (one error every 900 bases) → length distribution of sequences with r = 3 errors?
Negative binomial
• p = 1/900; r = 3 errors
$p(k) = \binom{k+r-1}{r-1} p^r (1-p)^k$, with k = number of error-free bases
$E(X) = \frac{(1-p)\,r}{p}$
$Var(X) = \frac{E(X)}{p} = \frac{(1-p)\,r}{p^2}$
E(X) = 2697
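The value E(X) = 2697 can be reproduced from the formulas above. The sketch below also sums p(k) numerically as a sanity check; the truncation point K = 40000 is an assumption chosen so that the neglected tail is negligible.

```python
from math import comb

p, r = 1 / 900, 3   # values from the slide

def nb_pmf(k, r, p):
    """Probability of k error-free bases before the r-th error occurs."""
    return comb(k + r - 1, r - 1) * p**r * (1 - p) ** k

# Closed-form expectation and variance.
mean = (1 - p) * r / p
var = mean / p
print(round(mean), round(var), round(var**0.5))

# Numerical check on a truncated support (K is an arbitrary cut-off).
K = 40_000
total = sum(nb_pmf(k, r, p) for k in range(K))
numeric_mean = sum(k * nb_pmf(k, r, p) for k in range(K))
print(total, numeric_mean)
```

The standard deviation is of the same order as the mean, so the length distribution is very broad.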
Negative binomial
• The variance of the NB distribution can be made arbitrarily large relative to its mean by making p smaller, since Var(X)/E(X) = 1/p
• This can be useful for modelling processes that show over-dispersion
$Var(X) = \frac{E(X)}{p} = \frac{(1-p)\,r}{p^2}$
[Figure: negative binomial, binomial and Poisson distributions compared]
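The three count distributions can be compared at a common expectation using only the variance formulas from the slides. All numerical values below are hypothetical and chosen for illustration.

```python
mean = 5.0                             # common expectation (arbitrary value)

poisson_var = mean                     # Poisson: Var(X) = lambda = E(X)

p_binom = 0.25                         # hypothetical success probability
binom_var = mean * (1 - p_binom)       # binomial: Var(X) = E(X)(1 - p) < E(X)

for p_nb in (0.5, 0.1, 0.01):          # hypothetical NB parameters
    nb_var = mean / p_nb               # negative binomial: Var(X) = E(X)/p > E(X)
    print(p_nb, binom_var, poisson_var, nb_var)
```

Only the negative binomial allows a variance above the mean, and the smaller p is, the larger the excess.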
Over-dispersion
• RNA-seq: mRNA molecules are fragmented and sequenced as short “reads” (e.g. 75-150 bp)
• reads are mapped onto the genome
• for each transcript, the number of reads is recorded
[Figure: matrix of read counts, with transcripts as rows and samples / replicates as columns]
RNA-seq data
• each dot is a transcript; several replicates available
• x-axis : mean number of reads per transcript (over all replicates)
• y-axis : variance of number of reads per transcript (over all replicates)
• The variance is larger than the mean: the read counts cannot be described by a Poisson or binomial process!
If the number of reads per transcript followed a Poisson distribution, the dots would lie on the y = x line.
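The difference between Poisson-like and over-dispersed counts can be illustrated with simulated data. This is only a sketch under stated assumptions: the counts are simulated and not real RNA-seq data, the mean and gamma shape are hypothetical, and over-dispersion is generated as a gamma-Poisson mixture, which is one standard way to obtain negative binomial counts. The Poisson sampler is a simple hand-written helper (Knuth's multiplication method), suitable for small rates only.

```python
import math
import random
from statistics import mean, variance

def poisson_sample(rng, lam):
    """Draw one Poisson count (Knuth's method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(1)       # fixed seed (arbitrary) for reproducibility
n_replicates = 2000          # far more than a real experiment, for stable estimates
mu, shape = 10.0, 2.0        # hypothetical mean read count and gamma shape

# Constant rate across replicates: Poisson counts, variance close to the mean.
poisson_counts = [poisson_sample(rng, mu) for _ in range(n_replicates)]

# Rate varies between replicates (gamma distributed with mean mu):
# the resulting counts are over-dispersed, variance well above the mean.
nb_counts = [poisson_sample(rng, rng.gammavariate(shape, mu / shape))
             for _ in range(n_replicates)]

print(round(mean(poisson_counts), 2), round(variance(poisson_counts), 2))
print(round(mean(nb_counts), 2), round(variance(nb_counts), 2))
```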
Normal distribution
• Continuous distribution
• plays a central role in statistics and data analysis due to Central Limit Theorem
• 2 parameters: expectation μ and standard deviation σ
$p(X) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$
$E(X) = \mu$
$Var(X) = \sigma^2$
[Figure: normal densities for σ = 1 and σ = 0.5]
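A quick numerical check of the density: with the prefactor 1/(√(2π) σ), the area under the curve is 1 and the second central moment equals σ². The sketch assumes μ = 0 and uses a simple midpoint rule on [-10, 10]; both choices are for illustration only.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with expectation mu and sd sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def integrate(f, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

for sigma in (1.0, 0.5):    # the two standard deviations shown in the figure
    area = integrate(lambda x: normal_pdf(x, 0.0, sigma), -10, 10)
    second_moment = integrate(lambda x: x**2 * normal_pdf(x, 0.0, sigma), -10, 10)
    # Columns: sigma, total area, variance, peak height at x = mu.
    print(sigma, round(area, 6), round(second_moment, 6), round(normal_pdf(0.0, 0.0, sigma), 4))
```

A smaller σ gives a narrower and taller curve, while the area stays equal to 1.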
Normal distribution
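• The normal distribution with μ = 0 and σ = 1 is called the standard normal distribution (SND)
• Every normal distribution can be transformed into the SND through a Z-transformation
• By construction, the new RV Z has expectation 0 and variance 1
• The Z-transformation can be applied to any distribution! (Z then has expectation 0 and variance 1, but it is normally distributed only if X is)
$X \longrightarrow Z = \frac{X - \mu}{\sigma}$
$\mathcal{N}(\mu, \sigma) \longrightarrow \mathcal{N}(0, 1)$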
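The Z-transformation applied to a sample can be sketched as follows. The data are hypothetical (exponentially distributed values with mean 5, chosen because they are clearly not normal), and μ and σ are estimated from the sample itself.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)   # fixed seed (arbitrary) for reproducibility

# Hypothetical non-normal sample: exponential values with mean 5.
x = [rng.expovariate(1 / 5) for _ in range(1000)]

# Z-transformation using the sample mean and standard deviation.
mu, sigma = mean(x), pstdev(x)
z = [(xi - mu) / sigma for xi in x]

print(round(mean(z), 6), round(pstdev(z), 6))
# The shape is unchanged: the transformed values are still skewed,
# e.g. the smallest z is much closer to 0 than the largest one.
print(round(min(z), 2), round(max(z), 2))
```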
Properties of SND
• The area under the curve over an interval represents the probability that Z falls within it
  ๏ [-1, +1] = 68%
  ๏ [-2, +2] = 95.4%
  ๏ [-1.96, +1.96] = 95%
  ๏ [-1.64, +1.64] = 90%
  ๏ Critical values
    ‣ t95 = 1.96
    ‣ t90 = 1.64
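These areas and critical values can be recomputed with `statistics.NormalDist` from the Python standard library. The helper name `central_area` is introduced here for illustration.

```python
from statistics import NormalDist

snd = NormalDist(mu=0.0, sigma=1.0)   # standard normal distribution

def central_area(z):
    """Area under the SND density between -z and +z."""
    return snd.cdf(z) - snd.cdf(-z)

for z in (1, 2, 1.96, 1.64):
    print(z, round(central_area(z), 4))

# Critical values: the z that leaves 2.5% (resp. 5%) in each tail.
t95 = snd.inv_cdf(0.975)
t90 = snd.inv_cdf(0.95)
print(round(t95, 3), round(t90, 3))
```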
t-distribution
• Student's t-distribution is a continuous distribution that describes the distribution of mean values over small samples
• Parameter: degrees of freedom (df)
• Tends to the SND for a large number of degrees of freedom
[Figure: t-distribution densities for df = 1 and df = 100; df = 100 ~ N(0,1)]
$Var(X) = \frac{df}{df - 2}$ for df > 2
$E(X) = 0$ (for df > 1)
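The convergence towards the SND can be checked by evaluating the densities directly. The sketch below writes out the standard t density by hand with `math.lgamma`; df = 5 is an extra value added here for illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Compare the peak (x = 0) and the tail (x = 3) for increasing df.
for df in (1, 5, 100):
    print(df, round(t_pdf(0, df), 4), round(t_pdf(3, df), 4))
print("N(0,1)", round(normal_pdf(0), 4), round(normal_pdf(3), 4))
```

For small df the peak is lower and the tails are heavier than for the SND; at df = 100 the two densities are nearly indistinguishable.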
Distributions are like an Italian village…
• all are somehow related…
Poisson vs. binomial
$\mathrm{Binom}(n, p) \longrightarrow \mathrm{Pois}(\lambda = np)$ for $n \to \infty$ with $np = \lambda$ held fixed
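This limit can be observed numerically by keeping λ = np fixed and increasing n. The sketch reuses λ = 3.125 from the Poisson slide; the values of n and the comparison range k = 0..10 are arbitrary choices.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam**k / factorial(k) * exp(-lam)

lam = 3.125                    # one of the rates used on the Poisson slide
diffs = {}
for n in (10, 100, 10_000):
    p = lam / n                # p shrinks as n grows, so that n * p stays equal to lam
    diffs[n] = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))
    print(n, diffs[n])
```

The largest difference between the two probability distributions shrinks as n grows.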
Poisson vs. normal
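$\mathrm{Pois}(\lambda) \longrightarrow \mathcal{N}(\lambda, \lambda)$ for $\lambda \gg 1$ (normal distribution with mean λ and variance λ, i.e. σ = √λ)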
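A numerical sketch of this approximation: the Poisson probabilities are compared with the normal density of mean λ and standard deviation √λ, within 5 standard deviations of the mean. The λ values and the comparison window are arbitrary choices, and the Poisson probabilities are evaluated in log space to avoid overflow for large λ.

```python
import math

def poisson_pmf(k, lam):
    # Log-space evaluation avoids overflow of lam**k and k! for large lam.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

rel_errors = []
for lam in (1, 10, 100, 1000):
    sigma = math.sqrt(lam)
    lo, hi = max(0, int(lam - 5 * sigma)), int(lam + 5 * sigma)
    max_diff = max(abs(poisson_pmf(k, lam) - normal_pdf(k, lam, sigma))
                   for k in range(lo, hi + 1))
    # Express the largest deviation relative to the peak height of the normal curve.
    rel_errors.append(max_diff / normal_pdf(lam, lam, sigma))
    print(lam, round(rel_errors[-1], 4))
```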
Distributions = Italian wedding party!