4. Distributions - Heidelberg...

Carl Herrmann Health Data Science Unit - Medizinische Fakultät Heidelberg

4. Distributions

Principles of statistical inference

• So far, we have worked with datasets that represent samples of a more general population
  ๏ sample: 403 diabetes patients → population: diabetic patients
  ๏ sample: 152 ALL patients → population: all ALL patients

• We assume that the sample represents the larger population well → representative sample

[Figure: the sample is chosen based on the population of interest. Labels: female diabetes samples, diabetic women, diabetic patients; sample → population?]

Inference: how can we estimate the parameters (mean, spread, …) of the population based on the sample?

Random variable

• Sample = vector of values (“realizations”)

• Population = random variable, describing the (random) outcome of an experiment or measurement

• Notation
  ๏ X, Y, … = random variables (continuous or discrete)
    ‣ values obtained by rolling a die, electoral behavior in Germany, increase in survival after treatment
  ๏ x, y, … = realizations (numerical or categorical)
    ‣ values after three throws, votes of persons A, B, C, increase in survival of a specific patient cohort

• Random variables can be
  ๏ continuous (weight, height, time, …)
  ๏ discrete (counts)

e.g. Body weight

Random variables

• A random variable X can be described using
  ๏ expectation E(X): “theoretical” mean of the entire population
  ๏ variance Var(X): “theoretical” variance of the entire population
  ๏ density distribution p(X) (continuous RV)
  ๏ probability distribution p(X) (discrete RV)

Discrete probability distribution: $\sum_{i \in A} p(X_i) = 1$; $p(X = 2)$ is the probability that X takes the value 2.

Continuous probability distribution: $\int_{x \in A} p(x)\,dx = 1$; $p(X = 2) = 0$, but $p(2 < x < 3) > 0$.

Example of a discrete RV

• Random variable X: number of sixes obtained in 10 throws of a die

• X ∈ {0, …, 10}

[Figure: observed distribution of X over 10 games and over 1000 games]

E(X) ? Var(X) ?
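The question above can be explored by simulation. The following is a minimal sketch, assuming a fair die and a fixed random seed (both assumptions, as are the function names); it plays 10 and 1000 games and reports the empirical mean and variance of X.

```python
import random
from statistics import mean, pvariance

def count_sixes(n_throws, rng):
    """One game: number of sixes in n_throws throws of a fair die."""
    return sum(1 for _ in range(n_throws) if rng.randint(1, 6) == 6)

def simulate(n_games, seed=0):
    """Play n_games games of 10 throws and return the realizations of X."""
    rng = random.Random(seed)  # fixed seed: an assumption, for reproducibility only
    return [count_sixes(10, rng) for _ in range(n_games)]

for n_games in (10, 1000):
    xs = simulate(n_games)
    print(n_games, "games: mean =", mean(xs), "variance =", pvariance(xs))
```

With more games, the empirical values should settle near the theoretical ones for a fair die, E(X) = 10/6 ≈ 1.67 and Var(X) = 10 · (1/6) · (5/6) ≈ 1.39.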

Cumulative distributions

• Example: discrete RV

• Probability distribution p(k)

  ๏ cumulative distribution up: $p_+(k) = p(x > k)$

  ๏ cumulative distribution down: $p_-(k) = p(x \le k)$

$p_+(k) + p_-(k) = 1$
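As a small illustration of the two cumulative distributions, here is a sketch for a fair six-sided die. The choice of distribution and the names `p_minus` and `p_plus` are assumptions made for this example.

```python
from fractions import Fraction

# Assumed example distribution: a fair six-sided die, p(k) = 1/6 for k = 1..6.
p = {k: Fraction(1, 6) for k in range(1, 7)}

def p_minus(k):
    """Cumulative distribution down: p(x <= k)."""
    return sum(prob for value, prob in p.items() if value <= k)

def p_plus(k):
    """Cumulative distribution up: p(x > k)."""
    return sum(prob for value, prob in p.items() if value > k)

for k in range(1, 7):
    # The last column is always 1, since the two cases are complementary.
    print(k, p_minus(k), p_plus(k), p_minus(k) + p_plus(k))
```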

Cumulative distribution

• Continuous random variable X

• Density distribution p(X)
  ๏ cumulative distribution up: $p_+(x_0) = \int_{x_0}^{+\infty} p(x)\,dx = p(x > x_0)$
  ๏ cumulative distribution down: $p_-(x_0) = \int_{-\infty}^{x_0} p(x)\,dx = p(x \le x_0)$

Cumulative distribution of the uniform distribution?
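One way to work this out, assuming a uniform distribution on an interval [a, b] (the bounds a and b are not given on the slide and are introduced here for illustration):

```latex
p(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}
\qquad
p_-(x_0) = \int_{-\infty}^{x_0} p(x)\,dx =
\begin{cases} 0 & x_0 < a \\ \dfrac{x_0 - a}{b - a} & a \le x_0 \le b \\ 1 & x_0 > b \end{cases}
\qquad
p_+(x_0) = 1 - p_-(x_0)
```

Under this assumption the cumulative distribution down rises linearly from 0 to 1 across the interval.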

Binomial distribution

• Number of successes in L independent trials, each with probability of success p

• Example: DNA polymerase
  ๏ error rate p at each position (constant, independent of the previous position)
  ๏ sequence of length L
  → number of errors (= “successes”)?

$p(k) = \binom{L}{k} p^k (1 - p)^{L - k}$

$E(X) = Lp$

$\mathrm{Var}(X) = E(X)(1 - p) = Lp(1 - p)$

(how to compute the cumulative distribution?)
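A minimal sketch of one answer: the cumulative distribution down can be obtained by summing the probabilities p(0), …, p(k). The values L = 100 and p = 0.01 are assumptions chosen for illustration, not taken from the slides.

```python
from math import comb

def binom_pmf(k, L, p):
    """p(k) = C(L, k) * p^k * (1 - p)^(L - k)."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)

def binom_cdf(k, L, p):
    """Cumulative distribution down, p(x <= k): sum of the pmf from 0 to k."""
    return sum(binom_pmf(i, L, p) for i in range(k + 1))

# Assumed illustration values: sequence of L = 100 positions, error rate p = 0.01.
L, p = 100, 0.01
mean = sum(k * binom_pmf(k, L, p) for k in range(L + 1))
var = sum((k - mean) ** 2 * binom_pmf(k, L, p) for k in range(L + 1))
print("E(X)      =", mean, " compare L*p       =", L * p)
print("Var(X)    =", var, " compare L*p*(1-p) =", L * p * (1 - p))
print("p(x <= 2) =", binom_cdf(2, L, p))
print("p(x > 2)  =", 1 - binom_cdf(2, L, p))
```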

Poisson distribution

• Measures the number of events in a time period for a given rate of events

• Example: number of drops per tile during a rainshower

• Rate needs to be constant and independent of previous period!

$p(k) = \frac{\lambda^k}{k!} e^{-\lambda}$

$E(X) = \lambda$, $\mathrm{Var}(X) = \lambda$

λ = drops per tile per unit time

[Figure: Poisson distributions for λ = 0.56, λ = 0.94 and λ = 3.125]
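The Poisson probabilities for the three rates shown can be computed directly from the formula. This is a sketch only; truncating the sum at k = 60 is an assumption that is harmless for such small rates.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """p(k) = lam^k / k! * exp(-lam)."""
    return lam**k / factorial(k) * exp(-lam)

for lam in (0.56, 0.94, 3.125):  # rates shown on the slide
    probs = [poisson_pmf(k, lam) for k in range(60)]  # tail beyond 60 is negligible here
    mean = sum(k * pk for k, pk in enumerate(probs))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
    print(f"lambda = {lam}: p(0) = {probs[0]:.4f}, E(X) = {mean:.4f}, Var(X) = {var:.4f}")
```

Both the computed mean and the computed variance should match λ, as stated above.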

Negative binomial

• Independent trials, constant probability p

• Different questions can be asked
  ๏ Number of successes in L trials? → Binomial distribution with 2 parameters (L, p)
  ๏ Waiting time until r successes are reached? → Negative binomial distribution with 2 parameters (r, p)
  ๏ Example: Taq-polymerase has accuracy (1 − p) with p ~ 1/900 (one error every 900 bases) → length distribution of sequences with r = 3 errors?

Negative binomial

• p = 1/900 ; r=3 errors

$p(k) = \binom{k + r - 1}{r - 1} p^r (1 - p)^k$, with k = number of error-free bases

$E(X) = \frac{(1 - p)\,r}{p}$

$\mathrm{Var}(X) = \frac{(1 - p)\,r}{p^2} = \frac{E(X)}{p}$

For p = 1/900 and r = 3: E(X) = 2697
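The value E(X) = 2697 can be checked numerically. The sketch below evaluates the closed-form expectation and variance, then compares the expectation with a direct sum over the probability distribution; the truncation at 40,000 bases is an assumption that leaves only a negligible tail.

```python
from math import comb

def nbinom_pmf(k, r, p):
    """p(k) = C(k + r - 1, r - 1) * p^r * (1 - p)^k, k = number of error-free bases."""
    return comb(k + r - 1, r - 1) * p**r * (1 - p) ** k

r, p = 3, 1 / 900  # values from the Taq-polymerase example
expectation = (1 - p) * r / p
variance = (1 - p) * r / p**2
print("E(X)   =", expectation)
print("Var(X) =", variance)

# Numerical check: sum k * p(k) over a long (assumed) truncation of the support.
approx_mean = sum(k * nbinom_pmf(k, r, p) for k in range(40_000))
print("truncated sum of k * p(k) =", approx_mean)
```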

Negative binomial

• The variance of the NB distribution can be made arbitrarily large by making p smaller

• This can be useful for modelling processes that show over-dispersion

$\mathrm{Var}(X) = \frac{(1 - p)\,r}{p^2} = \frac{E(X)}{p}$

[Figure: negative binomial, binomial and Poisson distributions]
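One way to compare the three distributions is the ratio Var(X) / E(X), which follows from the formulas given for each of them. The function names below are hypothetical and the values of p are chosen only for illustration.

```python
def dispersion_binomial(p):
    """Var(X) / E(X) for Binom(L, p): L p (1 - p) / (L p) = 1 - p."""
    return 1 - p

def dispersion_poisson():
    """Var(X) / E(X) for Pois(lambda): lambda / lambda = 1."""
    return 1.0

def dispersion_negative_binomial(p):
    """Var(X) / E(X) for NB(r, p): ((1 - p) r / p^2) / ((1 - p) r / p) = 1 / p."""
    return 1 / p

for p in (0.5, 0.1, 1 / 900):
    print(f"p = {p:.4g}: binomial {dispersion_binomial(p):.4g}, "
          f"Poisson {dispersion_poisson():.4g}, "
          f"negative binomial {dispersion_negative_binomial(p):.4g}")
```

The binomial ratio stays below 1, the Poisson ratio equals 1, and the negative binomial ratio grows without bound as p decreases.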

Over-dispersion

• RNA-seq: mRNA molecules are fragmented and sequenced as short “reads” (e.g. 75-150 bp)

• reads are mapped onto the genome

• for each transcript, the number of reads is recorded

[Figure: number of reads per transcript across samples / replicates]

RNA-seq data

• each dot is a transcript; several replicates available

• x-axis : mean number of reads per transcript (over all replicates)

• y-axis : variance of number of reads per transcript (over all replicates)

• variance is larger than mean: cannot be described by Poisson or binomial process!

If the number of reads per transcript followed a Poisson distribution, the dots would lie on the y = x line.
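The mean-versus-variance comparison can be sketched as follows. The read counts are hypothetical numbers made up for illustration, not real RNA-seq data, and the transcript names are placeholders.

```python
from statistics import mean, variance

def mean_and_variance(counts):
    """Mean and sample variance of one transcript's read counts over replicates."""
    return mean(counts), variance(counts)

# Hypothetical read counts (made up): one entry per transcript,
# one value per replicate.
toy_counts = {
    "transcript_A": [10, 12, 8],
    "transcript_B": [2, 30, 10, 6],
}

for name, counts in toy_counts.items():
    m, v = mean_and_variance(counts)
    # A variance-to-mean ratio well above 1 indicates over-dispersion.
    print(f"{name}: mean = {m}, variance = {v:.2f}, ratio = {v / m:.2f}")
```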

Normal distribution

• Continuous distribution

• Plays a central role in statistics and data analysis due to the Central Limit Theorem

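• 2 parameters: expectation μ and standard deviation σ

$p(X) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(X - \mu)^2}{2\sigma^2}}$

$E(X) = \mu$

$\mathrm{Var}(X) = \sigma^2$

[Figure: normal densities for σ = 1 and σ = 0.5]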
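The density formula can be written out and cross-checked against Python's standard library. This is a sketch; the function name `normal_pdf` and the choice μ = 0 are assumptions for the example.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """p(x) = 1 / (sqrt(2 pi) sigma) * exp(-(x - mu)^2 / (2 sigma^2))."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

for sigma in (1.0, 0.5):  # the two standard deviations shown on the slide
    own = normal_pdf(0.0, 0.0, sigma)
    reference = NormalDist(0.0, sigma).pdf(0.0)
    print(f"sigma = {sigma}: density at the mean = {own:.4f} (library: {reference:.4f})")
```

Halving σ doubles the height of the peak, since the area under the curve must remain 1.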

Normal distribution

• The normal distribution with μ = 0 and σ = 1 is called the Standard Normal Distribution (SND)

• Every normal distribution can be transformed into the SND through a Z-transformation

• By definition, the new RV Z has expectation 0 and variance 1

• The Z-transformation can be applied to any distribution!

$X \longrightarrow Z = \frac{X - \mu}{\sigma}$

$\mathcal{N}(\mu, \sigma) \longrightarrow \mathcal{N}(0, 1)$
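A minimal sketch of the Z-transformation applied to a sample. Here μ and σ are estimated from the data, which is an assumption, and the body weights are made-up numbers used only for illustration.

```python
from statistics import mean, pstdev

def z_transform(values):
    """Z = (X - mu) / sigma, with mu and sigma estimated from the values."""
    mu = mean(values)
    sigma = pstdev(values)  # standard deviation computed with divisor n
    return [(x - mu) / sigma for x in values]

# Hypothetical body weights in kg (made-up numbers).
weights = [62.0, 75.5, 81.0, 68.2, 90.3]
z = z_transform(weights)
print("Z values:", [round(v, 3) for v in z])
print("mean of Z:", mean(z))
print("sd of Z:  ", pstdev(z))
```

The transformed values have mean 0 and standard deviation 1 whatever the shape of the original distribution, but they follow the SND only if the original values are normally distributed.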

Properties of SND

• the area represents…

  ๏ [-1, +1] = 68%
  ๏ [-2, +2] = 95.4%
  ๏ [-1.96, +1.96] = 95%
  ๏ [-1.64, +1.64] = 90%
  ๏ Critical values
    ‣ t95 = 1.96
    ‣ t90 = 1.64
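These areas and critical values can be reproduced with `statistics.NormalDist` from the Python standard library. The helper names `central_area` and `critical_value` are hypothetical.

```python
from statistics import NormalDist

snd = NormalDist(mu=0.0, sigma=1.0)

def central_area(z):
    """Area under the SND density between -z and +z."""
    return snd.cdf(z) - snd.cdf(-z)

def critical_value(level):
    """z such that the central interval [-z, +z] has area `level`."""
    return snd.inv_cdf(0.5 + level / 2)

for z in (1, 2, 1.96, 1.64):
    print(f"[-{z}, +{z}] covers {100 * central_area(z):.1f}%")

for level in (0.95, 0.90):
    print(f"critical value for {100 * level:.0f}%: {critical_value(level):.3f}")
```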

t-distribution

• Student’s t-distribution is a continuous distribution that describes the distribution of mean values over small samples

• Parameter: degrees of freedom (df)

• Tends to the SND for large number of degrees of freedom

[Figure: t-distribution densities for df = 1 and df = 100 (≈ 𝒩(0,1))]

$E(X) = 0$

$\mathrm{Var}(X) = \frac{df}{df - 2}$ for df > 2
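The convergence towards the SND can be checked numerically. The sketch below uses the standard formula for the t density, which is not written out on the slide, so it is introduced here as an assumption; the function names are hypothetical.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom.

    Uses math.gamma, which overflows for df above roughly 340;
    math.lgamma would be needed for larger df.
    """
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_variance(df):
    """Var(X) = df / (df - 2), defined for df > 2."""
    if df <= 2:
        raise ValueError("variance is only defined for df > 2")
    return df / (df - 2)

snd = NormalDist(0.0, 1.0)
for df in (1, 5, 100):
    print(f"df = {df}: t density at 0 = {t_pdf(0.0, df):.4f}, "
          f"SND density at 0 = {snd.pdf(0.0):.4f}")
print("Var(X) for df = 5:  ", t_variance(5))
print("Var(X) for df = 100:", t_variance(100))
```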

Distributions are like an Italian village…

• all are somehow related…

Poisson vs. binomial

Binom(n, p) ⟶ Pois(λ = np) for n → ∞ with np = λ held fixed
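This limit can be illustrated numerically. The sketch below keeps λ fixed and compares the two probability distributions for growing n; the value λ = 3.125 and the restriction to k ≤ 40 are assumptions made for this example.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events for rate lam."""
    return lam**k / factorial(k) * exp(-lam)

def max_abs_diff(n, lam):
    """Largest pointwise gap between Binom(n, lam / n) and Pois(lam) for k <= 40."""
    p = lam / n
    ks = range(min(n, 40) + 1)  # probabilities beyond k = 40 are negligible for this lam
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in ks)

lam = 3.125  # assumed rate, held fixed while n grows
for n in (10, 100, 1000):
    print(f"n = {n}: max |Binom - Pois| = {max_abs_diff(n, lam):.5f}")
```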

Poisson vs. normal

Pois(λ) ⟶ 𝒩(μ = λ, σ = √λ) for λ ≫ 1
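A sketch of this approximation: the Poisson probability at each integer k is compared with the normal density with mean λ and standard deviation √λ, evaluated at the same k. The rates and the range of k are assumptions chosen for illustration.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def poisson_pmf(k, lam):
    """Poisson probability, computed via logarithms to avoid overflow for large lam."""
    return exp(k * log(lam) - lam - lgamma(k + 1))

def max_abs_diff(lam):
    """Largest gap between Pois(lam) and the normal density with sd sqrt(lam)."""
    normal = NormalDist(mu=lam, sigma=sqrt(lam))
    ks = range(int(lam + 10 * sqrt(lam)) + 1)  # mean plus 10 standard deviations
    return max(abs(poisson_pmf(k, lam) - normal.pdf(k)) for k in ks)

for lam in (1, 10, 100):
    print(f"lambda = {lam}: max |Pois - normal| = {max_abs_diff(lam):.5f}")
```

The gap shrinks as λ grows, in line with the condition λ ≫ 1.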

Distribution = Italian wedding party!
