Multivariate data. Regression and Correlation The Scatter Plot.

Multivariate data

Regression and Correlation

The Scatter Plot

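Pearson’s correlation coefficient

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

where

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$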

Spearman’s rank correlation coefficient

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where, for each case i, $d_i = r_i - s_i$ = the difference between the rank of $x_i$ and the rank of $y_i$.
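The rank-difference formula can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the helper names `ranks` and `spearman` and the five data points are made up for the example, and the ranking helper assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's coefficient from the rank differences d_i = r_i - s_i."""
    n = len(x)
    r, s = ranks(x), ranks(y)
    sum_d2 = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical data, chosen so that no two values are tied
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
rho = spearman(x, y)
print(rho)
```

Here the rank differences are 0, -1, 1, -1, 1, so the sum of squared differences is 4 and the coefficient is 1 - 24/120 = 0.8.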

Simple Linear Regression

Fitting straight lines to data

The Least Squares Line The Regression Line

• When data is correlated it falls roughly about a straight line.

In this situation one wants to:

• Find the equation of the straight line through the data that yields the best fit.

The equation of any straight line is of the form:

Y = a + bX

b = the slope of the line

a = the intercept of the line

For any equation of a straight line

Y = a + b X

The predicted value of Y when X = $x_i$ (ith case) can be computed:

$$\hat{y}_i = a + b x_i$$

The error in the prediction is given by:

$$r_i = y_i - \hat{y}_i = y_i - (a + b x_i)$$

This is called the residual for the ith case.

The residuals

$$r_1 = y_1 - \hat{y}_1,\; r_2 = y_2 - \hat{y}_2,\; \ldots,\; r_n = y_n - \hat{y}_n$$

can be computed for each case in the sample.

The residual sum of squares (RSS)

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

is a measure of the goodness of fit of the line

Y = a + bX to the data

The optimal choice of a and b will result in the residual sum of squares

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

attaining a minimum.

If this is the case then the line:

Y = a + bX

is called the Least Squares Line

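Then the slope of the least squares line can be shown to be:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and the intercept of the least squares line can be shown to be:

$$a = \bar{y} - b\,\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$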

Computing the residual sum of squares for the least squares line

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

Once a and b have been determined this can be computed using the far right hand side.

This can also be computed using the values of $S_{xx}$, $S_{yy}$ and $S_{xy}$.

For the Least Squares Line

$$RSS = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$

The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

Country (i)      Xi    Yi
Australia        48    18
Canada           50    15
Denmark          38    17
Finland         110    35
Great Britain   110    46
Holland          49    24
Iceland          23     6
Norway           25     9
Sweden           30    11
Switzerland      51    25
USA             130    20

[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with one labelled point per country]

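Fitting the Least Squares Line

$$\sum_{i=1}^{n} x_i = 664, \qquad \sum_{i=1}^{n} y_i = 226$$

$$\sum_{i=1}^{n} x_i^2 = 54{,}404, \qquad \sum_{i=1}^{n} x_i y_i = 16{,}914, \qquad \sum_{i=1}^{n} y_i^2 = 6{,}018$$

Fitting the Least Squares Line - continued

First compute the following three quantities:

$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$

$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$

$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$

Computing Estimate of Slope and Intercept

$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228$$

$$a = \bar{y} - b\,\bar{x} = \frac{226}{11} - 0.228\left(\frac{664}{11}\right) = 6.756$$

Computing the Residual Sum of Squares

$$RSS = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 1374.72727 - \frac{(3271.81818)^2}{14322.545} = 627.3196$$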
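The arithmetic of this worked example can be checked with a short Python sketch. It uses only the 11 data pairs from the table and the formulas for $S_{xx}$, $S_{yy}$, $S_{xy}$, b, a and RSS; the variable names are chosen here for illustration.

```python
# Cigarette consumption (X, 1930) and lung cancer death rate (Y, 1950)
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

# Corrected sums of squares and cross products
sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
syy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Slope and intercept of the least squares line
b = sxy / sxx
a = sum(y) / n - b * sum(x) / n

# RSS two ways: the shortcut formula and the sum of squared residuals
rss_shortcut = syy - sxy ** 2 / sxx
rss_direct = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

print(f"Sxx={sxx:.2f} Syy={syy:.2f} Sxy={sxy:.2f}")
print(f"b={b:.3f} a={a:.3f} RSS={rss_shortcut:.2f}")
```

The two RSS values agree up to floating-point rounding, which is a useful check that a and b were computed correctly.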

[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with the fitted least squares line]

Y = 6.756 + (0.228)X

Interpretation of the slope and intercept

1. Intercept – value of Y at X = 0.

– Predicted death rate from lung cancer (6.756) for men in 1950 in countries with no smoking in 1930 (X = 0).

2. Slope – rate of increase in Y per unit increase in X.

– Death rate from lung cancer for men in 1950 increases by 0.228 units for each increase of 1 cigarette in per capita consumption in 1930.

Relationship between correlation and Linear Regression

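1. Pearson’s correlation.

• Takes values between –1 and +1

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$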

2. Least squares line Y = a + bX

– Minimises the Residual Sum of Squares:

$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

– The Sum of Squares that measures the variability in Y that is unexplained by X.

– This can also be denoted by:

$SS_{unexplained}$

Some other Sums of Squares:

– The Sum of Squares that measures the total variability in Y (ignoring X).

$$SS_{Total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

– The Sum of Squares that measures the variability in Y that is explained by X.

$$SS_{Explained} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

It can be shown:

(Total variability in Y) = (variability in Y explained by X) + (variability in Y unexplained by X)

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$SS_{Total} = SS_{Explained} + SS_{Unexplained}$$

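It can also be shown:

$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

= proportion of variability in Y explained by X.

= the coefficient of determination

Further:

$$1 - r^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

= proportion of variability in Y that is unexplained by X.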

Example

Computing r and r²

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{3271.82}{\sqrt{(14322.55)(1374.73)}} = 0.737$$

$$r^2 = (0.737)^2 = 0.544$$

54.4% of the variability in Y (death rate due to lung cancer, 1950) is explained by X (per capita cigarette consumption in 1930).
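As a check on the relationship between r² and the sums of squares, the following Python sketch recomputes r for the cigarette data and verifies numerically that $SS_{Total} = SS_{Explained} + SS_{Unexplained}$ and that r² equals the explained proportion. The variable names are illustrative only.

```python
import math

x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Pearson's correlation coefficient
r = sxy / math.sqrt(sxx * syy)

# Fitted values from the least squares line
b = sxy / sxx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

ss_total = syy
ss_explained = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
print(f"explained / total = {ss_explained / ss_total:.3f}")
```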

Categorical Data

Techniques for summarizing, displaying and graphing

The frequency table

The bar graph

Suppose we have collected data on a categorical variable X having k categories – 1, 2, … , k.

To construct the frequency table we simply count for each category (i) of X, the number of cases falling in that category (fi)

To plot the bar graph we simply draw a bar of height fi above each category (i) of X.

Example

In this example data has been collected for n = 34,118 subjects.

• The purpose of the study was to determine the relationship between the use of Antidepressants, Mood medication, Anxiety medication, Stimulants and Sleeping pills.

• In addition the study was interested in examining the effects of the independent variables (gender, age, income, education and role) on both individual use of the medications and the multiple use of the medications.

The variables were:

1. Antidepressant use
2. Mood medication use
3. Anxiety medication use
4. Stimulant use
5. Sleeping pills use
6. gender
7. age
8. income
9. education
10. Role –
   i. Parent, worker, partner
   ii. Parent, partner
   iii. Parent, worker
   iv. worker, partner
   v. worker only
   vi. Parent only
   vii. Partner only
   viii. No roles

Frequency Table for Age

Age - (G)   Frequency   Percent   Valid Percent   Cumulative Percent
20-29          5349       15.7        15.7              15.7
30-39          6758       19.8        19.8              35.5
40-49          6420       18.8        18.8              54.3
50-59          5528       16.2        16.2              70.5
60-69          4400       12.9        12.9              83.4
70+            5663       16.6        16.6             100.0
Total         34118      100.0       100.0

Bar Graph for Age

[Figure: bar graph of Count against Age - (G) for the six age groups]
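The Percent and Cumulative Percent columns of the age table follow directly from the frequencies. The sketch below recomputes them in Python and prints a crude text bar graph; the counts are copied from the table, while the variable names and the scale of one `#` per 500 cases are arbitrary choices made for this illustration.

```python
age_groups = ["20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
frequency = [5349, 6758, 6420, 5528, 4400, 5663]

n = sum(frequency)
percent = [100 * f / n for f in frequency]

# Running total of the percentages
cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(running)

for group, f, p, c in zip(age_groups, frequency, percent, cumulative):
    bar = "#" * round(f / 500)  # one '#' per 500 cases (arbitrary scale)
    print(f"{group:6} {f:6d} {p:6.1f} {c:6.1f}  {bar}")
```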

Frequency Table for Role

role                               Frequency   Percent   Valid Percent   Cumulative Percent
Valid    parent, partner, worker      6614       19.4        24.5              24.5
         parent, partner              1068        3.1         4.0              28.5
         parent, worker               1351        4.0         5.0              33.5
         partner, worker              5427       15.9        20.1              53.6
         worker only                  5711       16.7        21.2              74.7
         parent only                   456        1.3         1.7              76.4
         partner only                 3262        9.6        12.1              88.5
         no roles                     3097        9.1        11.5             100.0
         Total                       26986       79.1       100.0
Missing  System                       7132       20.9
Total                                34118      100.0

Bar Graph for Role

[Figure: bar graph of Count against role for the eight role categories]

The two way frequency table

The χ² statistic

Techniques for examining dependence amongst two categorical variables

Situation

• We have two categorical variables R and C.

• The number of categories of R is r.

• The number of categories of C is c.

• We observe n subjects from the population and count $x_{ij}$ = the number of subjects for which R = i and C = j.

• R = rows, C = columns

Example

Both Systolic Blood pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects.

The categories for Blood Pressure are:

<127 127-146 147-166 167+

The categories for Cholesterol are:

<200 200-219 220-259 260+

Table: two-way frequency

                          Systolic Blood pressure
Serum Cholesterol   <127   127-146   147-166   167+   Total
<200                 117     121        47       22     307
200-219               85      98        43       20     246
220-259              119     209        68       43     439
260+                  67      99        46       33     245
Total                388     527       204      118    1237

Example

This comes from the drug use data.

The two variables are:

1. Age (C) and

2. Antidepressant Use (R)

measured for a sample of n = 33,957 subjects.

Two-way Frequency Table

Took anti-depressants - 12 mo * Age - (G) Crosstabulation (Count)

                                        Age - (G)
Took anti-depressants - 12 mo   20-29   30-39   40-49   50-59   60-69    70+    Total
YES                               322     523     570     522     265    249     2451
NO                               5007    6201    5822    4982    4114   5380    31506
Total                            5329    6724    6392    5504    4379   5629    33957

Percentage antidepressant use vs Age

Age - (G)    20-29   30-39   40-49   50-59   60-69   70+
Percentage   6.04%   7.78%   8.92%   9.48%   6.05%   4.42%

[Figure: bar chart "Antidepressant Use vs Age", percentage use for each of the six age groups]

The χ² statistic for measuring dependence amongst two categorical variables

Define

$$R_i = \sum_{j=1}^{c} x_{ij} = i^{\text{th}} \text{ row Total}$$

$$C_j = \sum_{i=1}^{r} x_{ij} = j^{\text{th}} \text{ column Total}$$

$$E_{ij} = \frac{R_i C_j}{n}$$

= Expected frequency in the (i,j)th cell in the case of independence.

Observed frequencies $x_{ij}$, with row totals $R_i$ and column totals $C_j$:

              Columns
        1     2     3     4     5     Total
1       x11   x12   x13   x14   x15   R1
2       x21   x22   x23   x24   x25   R2
3       x31   x32   x33   x34   x35   R3
4       x41   x42   x43   x44   x45   R4
Total   C1    C2    C3    C4    C5    n

Expected frequencies $E_{ij} = \dfrac{R_i C_j}{n}$:

              Columns
        1     2     3     4     5     Total
1       E11   E12   E13   E14   E15   R1
2       E21   E22   E23   E24   E25   R2
3       E31   E32   E33   E34   E35   R3
4       E41   E42   E43   E44   E45   R4
Total   C1    C2    C3    C4    C5    n

Justification

If $E_{ij} = \dfrac{R_i C_j}{n}$ then $\dfrac{E_{ij}}{R_i} = \dfrac{C_j}{n}$, that is:

Proportion in column j for row i = overall proportion in column j

and if $E_{ij} = \dfrac{R_i C_j}{n}$ then $\dfrac{E_{ij}}{C_j} = \dfrac{R_i}{n}$, that is:

Proportion in row i for column j = overall proportion in row i

The χ² statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}$$

$E_{ij}$ = Expected frequency in the (i,j)th cell in the case of independence.

$x_{ij}$ = observed frequency in the (i,j)th cell

Example: studying the relationship between Systolic Blood pressure and Serum Cholesterol

In this example we are interested in whether Systolic Blood pressure and Serum Cholesterol are related or whether they are independent.

Both were measured for a sample of n = 1237 cases

Observed frequencies

                          Systolic Blood pressure
Serum Cholesterol   <127   127-146   147-166   167+   Total
<200                 117     121        47       22     307
200-219               85      98        43       20     246
220-259              119     209        68       43     439
260+                  67      99        46       33     245
Total                388     527       204      118    1237

Expected frequencies

                          Systolic Blood pressure
Serum Cholesterol   <127     127-146   147-166   167+    Total
<200                 96.29   130.79     50.63    29.29     307
200-219              77.16   104.80     40.57    23.47     246
220-259             137.70   187.03     72.40    41.88     439
260+                 76.85   104.38     40.40    23.37     245
Total               388      527       204      118       1237

In the case of independence the distribution across a row is the same for each row.

The distribution down a column is the same for each column.

Standardized residuals

$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$

Table: Expected frequencies, (Observed frequencies), Standardized Residuals

Serum                              Systolic Blood pressure
Cholesterol                <127     127-146   147-166   167+    Total
<200      expected         96.29    130.79     50.63    29.29     307
          (observed)       (117)    (121)      (47)     (22)
          residual          2.11    -0.86     -0.51    -1.35
200-219   expected         77.16    104.80     40.57    23.47     246
          (observed)       (85)     (98)       (43)     (20)
          residual          0.89    -0.66      0.38    -0.72
220-259   expected        137.70    187.03     72.40    41.88     439
          (observed)       (119)    (209)      (68)     (43)
          residual         -1.59     1.61     -0.52     0.17
260+      expected         76.85    104.38     40.40    23.37     245
          (observed)       (67)     (99)       (46)     (33)
          residual         -1.12    -0.53      0.88     1.99
Total                      388      527       204      118       1237

The χ² statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2 = 20.85$$
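The whole calculation (row and column totals, expected frequencies, standardized residuals and χ²) can be sketched in plain Python for the blood pressure and cholesterol table. The function name `chi_square` is chosen here for illustration; the observed counts are those of the two-way table, with rows as cholesterol categories and columns as blood pressure categories.

```python
import math


def chi_square(observed):
    """Return expected frequencies, standardized residuals and chi-square."""
    row_totals = [sum(row) for row in observed]          # R_i
    col_totals = [sum(col) for col in zip(*observed)]    # C_j
    n = sum(row_totals)
    # E_ij = R_i * C_j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # r_ij = (x_ij - E_ij) / sqrt(E_ij)
    residuals = [
        [(x - e) / math.sqrt(e) for x, e in zip(x_row, e_row)]
        for x_row, e_row in zip(observed, expected)
    ]
    chi2 = sum(r ** 2 for row in residuals for r in row)
    return expected, residuals, chi2


observed = [
    [117, 121, 47, 22],   # cholesterol < 200
    [85, 98, 43, 20],     # 200-219
    [119, 209, 68, 43],   # 220-259
    [67, 99, 46, 33],     # 260+
]
expected, residuals, chi2 = chi_square(observed)
print(f"chi-square = {chi2:.2f}")
```

Large positive or negative residuals (here the corner cells) indicate which cells contribute most to the statistic.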

Example

This comes from the drug use data.

The two variables are:

1. Role (C) and

2. Antidepressant Use (R)

measured for a sample of n = 26,968 subjects (the total of the two-way table below).

Two-way Frequency Table

Took anti-depressants - 12 mo * role Crosstabulation (Count)

                            Took anti-depressants - 12 mo
role                         YES       NO      Total
parent, partner, worker      344     6268       6612
parent, partner              101      967       1068
parent, worker               201     1150       1351
partner, worker              275     5150       5425
worker only                  455     5249       5704
parent only                   63      392        455
partner only                 224     3036       3260
no roles                     414     2679       3093
Total                       2077    24891      26968

Percentage antidepressant use vs Role

Role                       Percentage
parent, partner, worker      5.20%
parent, partner              9.46%
parent, worker              14.88%
partner, worker              5.07%
worker only                  7.98%
parent only                 13.85%
partner only                 6.87%
no roles                    13.39%

[Figure: bar chart "Antidepressant Use vs Role", percentage use for each of the eight role categories]

χ² = 381.961

Calculation of χ²

The Raw data

          1      2      3      4      5      6      7      8     Total
YES      344    101    201    275    455     63    224    414     2077
NO      6268    967   1150   5150   5249    392   3036   2679    24891
Total   6612   1068   1351   5425   5704    455   3260   3093    26968

Expected frequencies, $E_{ij} = \dfrac{R_i C_j}{n}$

              1         2         3         4         5        6         7         8      Total (R i)
YES         509.24     82.25    104.05    417.82    439.31    35.04    251.08    238.21      2077
NO         6102.76    985.75   1246.95   5007.18   5264.69   419.96   3008.92   2854.79     24891
Total (C j)  6612      1068      1351      5425      5704      455      3260      3093      26968

The Residuals, $r_{ij} = \dfrac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$

         1       2       3       4       5       6       7       8
YES    -7.32    2.07    9.50   -6.99    0.75    4.72   -1.71   11.39
NO      2.12   -0.60   -2.75    2.02   -0.22   -1.36    0.49   -3.29

The calculation of χ²

$$\chi^2 = \sum_{i} \sum_{j} r_{ij}^2 = \sum_{i} \sum_{j} \frac{(x_{ij} - E_{ij})^2}{E_{ij}} = 381.961$$

Probability Theory

Modelling random phenomena

Some counting formulae

Permutations

the number of ways that you can order n objects is:

n! = n(n-1)(n-2)(n-3)…(3)(2)(1)

Example:

the number of ways you can order the three letters A, B, and C is 3! = 3(2)(1) = 6

ABC ACB BAC BCA CAB CBA

Permutations

the number of ways that you can choose k objects from n objects in a specific order is:

$${}_nP_k = \frac{n!}{(n-k)!} = n(n-1)\cdots(n-k+1)$$

Example:

the number of ways you can choose two letters from the four letters A, B, C, D in a specific order is

$${}_4P_2 = \frac{4!}{(4-2)!} = \frac{4!}{2!} = (4)(3) = 12$$

AB BA AC CA AD DA

BC CB BD DB CD DC

Combinations

the number of ways that you can choose k objects from n objects (order irrelevant) is:

$${}_nC_k = \binom{n}{k} = \frac{n!}{k!\,(n-k)!} = \frac{n(n-1)\cdots(n-k+1)}{k(k-1)\cdots(1)}$$

Example:

the number of ways you can choose two letters from the four letters A, B, C, D is

$${}_4C_2 = \binom{4}{2} = \frac{4!}{2!\,(4-2)!} = \frac{4!}{2!\,2!} = \frac{(4)(3)}{(2)(1)} = \frac{12}{2} = 6$$

{A,B} {A,C} {A,D} {B,C} {B,D} {C,D}

Example:

Suppose we have a committee of 10 people and we want to choose a sub-committee of 3 people. How many ways can this be done?

$${}_{10}C_3 = \binom{10}{3} = \frac{10!}{3!\,7!} = \frac{(10)(9)(8)}{(3)(2)(1)} = 120$$

Example: Random sampling

Suppose we have a club of N = 1000 persons and we want to choose a sample of k = 250 of these individuals to determine their opinion on a given issue. How many ways can this be performed?

$${}_{1000}C_{250} = \binom{1000}{250} = \frac{1000!}{250!\,750!} = 4.823 \times 10^{242}$$

The choice of the sample is called random sampling if all of the choices have the same probability of being selected.
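The counting formulae above can be evaluated with Python's standard library. This sketch assumes Python 3.8 or later, where `math.perm` and `math.comb` are available; it simply re-evaluates the examples in this section.

```python
import math

# Orderings of n objects: n!
orderings_abc = math.factorial(3)

# Ordered choices of k objects from n: n! / (n - k)!
ordered_pairs = math.perm(4, 2)

# Unordered choices of k objects from n: n! / (k! (n - k)!)
unordered_pairs = math.comb(4, 2)
sub_committees = math.comb(10, 3)

# Python integers are exact, so even the sampling example can be computed
samples = math.comb(1000, 250)

print(orderings_abc, ordered_pairs, unordered_pairs, sub_committees)
print(f"{samples:.3e}")  # formatted as a float for display only
```

The last value is an exact integer with 243 digits; the scientific-notation display matches the 4.823 × 10^242 quoted above.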

Important Note:

0! is always defined to be 1.

Also

$${}_nC_k = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

are called Binomial Coefficients

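Reason:

The Binomial Theorem

$$(x+y)^n = {}_nC_0\,x^0 y^n + {}_nC_1\,x^1 y^{n-1} + {}_nC_2\,x^2 y^{n-2} + \cdots + {}_nC_n\,x^n y^0$$

$$= \binom{n}{0} x^0 y^n + \binom{n}{1} x^1 y^{n-1} + \binom{n}{2} x^2 y^{n-2} + \cdots + \binom{n}{n} x^n y^0$$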

Binomial Coefficients can also be calculated using Pascal’s triangle

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
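Each entry of Pascal's triangle is the sum of the two entries above it. A minimal Python sketch of that rule follows; the function name `pascal_rows` is made up for this example, and `math.comb` (Python 3.8+) is used only to confirm that the entries are the binomial coefficients.

```python
from math import comb


def pascal_rows(n_rows):
    """Build the first n_rows rows of Pascal's triangle."""
    rows = [[1]]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        # Interior entries are sums of adjacent entries in the previous row
        inner = [prev[i] + prev[i + 1] for i in range(len(prev) - 1)]
        rows.append([1] + inner + [1])
    return rows


rows = pascal_rows(7)
for row in rows:
    print(" ".join(str(v) for v in row))

# Row n (counting from 0) holds the coefficients nC0, nC1, ..., nCn
matches = all(rows[n][k] == comb(n, k) for n in range(7) for k in range(n + 1))
```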

Random Variables

Probability distributions

Definition:

A random variable X is a number whose value is determined by the outcome of a random experiment (random phenomena)

Examples

1. A die is rolled and X = number of spots showing on the upper face.

2. Two dice are rolled and X = Total number of spots showing on the two upper faces.

3. A coin is tossed n = 100 times and X = number of times the coin toss resulted in a head.

4. A person is selected at random from a population and X = weight of that individual.

5. A sample of n = 100 individuals is selected at random from a population (i.e. all samples of n = 100 have the same probability of being selected). X = the average weight of the 100 individuals.

In all of these examples X fits the definition of a random variable, namely:

– a number whose value is determined by the outcome of a random experiment (random phenomena)

Probability distribution of a Random Variable

Random variables are either

• Discrete
– Integer valued – The set of possible values for X are integers

• Continuous
– The set of possible values for X are all real numbers – Range over a continuum.

Examples

• Discrete

– A die is rolled and X = number of spots showing on the upper face.

– Two dice are rolled and X = Total number of spots showing on the two upper faces.

– A coin is tossed n = 100 times and X = number of times the coin toss resulted in a head.

Examples

• Continuous

– A person is selected at random from a population and X = weight of that individual.

– A sample of n = 100 individuals is selected at random from a population (i.e. all samples of n = 100 have the same probability of being selected). X = the average weight of the 100 individuals.

The probability distribution of a discrete random variable is described by its:

probability function p(x).

p(x) = the probability that X takes on the value x.

Examples

• Discrete

– A die is rolled and X = number of spots showing on the upper face.

x      1     2     3     4     5     6
p(x)   1/6   1/6   1/6   1/6   1/6   1/6

– Two dice are rolled and X = Total number of spots showing on the two upper faces.

x      2      3      4      5      6      7      8      9      10     11     12
p(x)   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36

Graphs

To plot a graph of p(x), draw bars of height p(x) above each value of x.

[Figures: bar graphs of p(x) for "Rolling a die" (x = 1 to 6) and "Rolling two dice"]

Note:

1. $0 \le p(x) \le 1$

2. $\sum_{x} p(x) = 1$

3. $P[a \le X \le b] = \sum_{x=a}^{b} p(x)$
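The probability function for the total on two dice can be built by listing the 36 equally likely outcomes, which also gives a way to check the three properties in the note. This is an illustrative Python sketch; the helper name `prob_between` is hypothetical, and `Fraction` is used so that the probabilities stay exact.

```python
from fractions import Fraction
from itertools import product

# Each of the 36 ordered outcomes (first die, second die) has probability 1/36
p = {}
for first, second in product(range(1, 7), repeat=2):
    total = first + second
    p[total] = p.get(total, 0) + Fraction(1, 36)


def prob_between(p, a, b):
    """P[a <= X <= b] as the sum of p(x) over x = a, ..., b."""
    return sum(prob for x, prob in p.items() if a <= x <= b)


print({x: str(prob) for x, prob in sorted(p.items())})
print(prob_between(p, 2, 4))  # p(2) + p(3) + p(4)
```

Note that `Fraction` prints values in lowest terms, so 6/36 is displayed as 1/6.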

The probability distribution of a continuous random variable is described by its :

probability density curve f(x).

i.e. a curve which has the following properties:

• 1. f(x) is never negative.

• 2. The total area under the curve f(x) is one.

• 3. The area under the curve f(x) between a and b is the probability that X lies between the two values.

[Figure: an example probability density curve f(x), plotted for x from 0 to 120]

An Important discrete distribution

The Binomial distribution

Suppose we have an experiment with two outcomes – Success(S) and Failure(F).

Let p denote the probability of S (Success).

In this case q=1-p denotes the probability of Failure(F).

Now suppose this experiment is repeated n times independently.

Let X denote the number of successes occurring in the n repetitions.

Then X is a random variable.

Its possible values are

0, 1, 2, 3, 4, … , (n – 2), (n – 1), n

and p(x) for any of the above values of x is given by:

$$p(x) = \binom{n}{x} p^x (1-p)^{n-x} = \binom{n}{x} p^x q^{n-x}$$

X is said to have the Binomial distribution with parameters n and p.

Summary:


1. X is the number of successes occurring in the n repetitions of a Success-Failure Experiment.

2. The probability of success is p.

3. $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$

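Examples:

1. A coin is tossed n = 5 times. X is the number of heads occurring in the 5 tosses of the coin. In this case p = ½ and

$$p(x) = \binom{5}{x} \left(\tfrac{1}{2}\right)^x \left(\tfrac{1}{2}\right)^{5-x} = \binom{5}{x} \left(\tfrac{1}{2}\right)^5 = \frac{1}{32}\binom{5}{x}$$

x      0      1      2       3       4      5
p(x)   1/32   5/32   10/32   10/32   5/32   1/32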
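The binomial probability function is easy to evaluate directly from its formula. The sketch below defines an illustrative helper `binomial_pmf` (a name chosen here, not taken from the slides) and reproduces the coin-tossing table exactly by passing p as a `Fraction`.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(x, n, p):
    """p(x) = (n choose x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n = 5
p_heads = Fraction(1, 2)
table = [binomial_pmf(x, n, p_heads) for x in range(n + 1)]

for x, prob in enumerate(table):
    # Fractions print in lowest terms, e.g. 10/32 appears as 5/16
    print(x, prob)
```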

Random Variables

Numerical quantities whose values are determined by the outcome of a random experiment

Discrete Random Variables

Discrete Random Variable: A random variable usually assuming an integer value.

• a discrete random variable assumes values that are isolated points along the real line. That is, neighbouring values are not “possible values” for a discrete random variable

Note: Usually associated with counting

• The number of times a head occurs in 10 tosses of a coin

• The number of auto accidents occurring on a weekend

• The size of a family

Continuous Random Variables

Continuous Random Variable: A quantitative random variable that can vary over a continuum

• A continuous random variable can assume any value along a line interval, including every possible value between any two points on the line

Note: Usually associated with a measurement

• Blood Pressure

• Weight gain

• Height

Probability Distributions of a Discrete Random Variable

Probability Distribution & Function

Probability Distribution: A mathematical description of how probabilities are distributed over the possible values of a random variable.

Notes:
• The probability distribution allows one to determine probabilities of events related to the values of a random variable.
• The probability distribution may be presented in the form of a table, chart or formula.

Probability Function: A rule that assigns probabilities to the values of the random variable

Example

In baseball the number of individuals, X, on base when a home run is hit ranges in value from 0 to 3. The probability distribution is known and is given below:

x      0      1      2      3
p(x)   6/14   4/14   3/14   1/14

Note:
• This chart implies the only values x takes on are 0, 1, 2, and 3.
• If the random variable X is observed repeatedly, the probability p(x) represents the proportion of times the value x appears in that sequence.

P(the random variable X equals 2) = p(2) = 3/14

P(the random variable X is at least 2) = p(2) + p(3) = 3/14 + 1/14 = 4/14

A Bar Graph

[Figure: "No. of persons on base when a home run is hit", p(x) against # on base, with bar heights 0.429, 0.286, 0.214 and 0.071 for x = 0, 1, 2, 3]

Comments:

Every probability function must satisfy:

1. The probability assigned to each value of the random variable must be between 0 and 1, inclusive:

$$0 \le p(x) \le 1$$

2. The sum of the probabilities assigned to all the values of the random variable must equal 1:

$$\sum_{x} p(x) = 1$$

3. $$P[a \le X \le b] = \sum_{x=a}^{b} p(x) = p(a) + p(a+1) + \cdots + p(b)$$

Mean and Variance of a Discrete Probability Distribution

• Describe the center and spread of a probability distribution

• The mean (denoted by the Greek letter μ (mu)) measures the centre of the distribution.

• The variance (σ²) and the standard deviation (σ) measure the spread of the distribution.

σ is the Greek letter for s.

Mean of a Discrete Random Variable

• The mean, μ, of a discrete random variable x is found by multiplying each possible value of x by its own probability and then adding all the products together:

$$\mu = \sum_{x} x\,p(x) = x_1 p(x_1) + x_2 p(x_2) + \cdots + x_k p(x_k)$$

Notes:
• The mean is a weighted average of the values of X.
• The mean is the long-run average value of the random variable.
• The mean is the centre of gravity of the probability distribution of the random variable.

[Figure: bar graph of a probability distribution on x = 1 to 11 with the mean marked as its centre of gravity]

Variance and Standard Deviation

Variance of a Discrete Random Variable: The variance, σ², of a discrete random variable x is found by multiplying each possible value of the squared deviation from the mean, (x − μ)², by its own probability and then adding all the products together:

$$\sigma^2 = \sum_{x} (x - \mu)^2 p(x) = \sum_{x} x^2 p(x) - \left[\sum_{x} x\,p(x)\right]^2 = \sum_{x} x^2 p(x) - \mu^2$$

Standard Deviation of a Discrete Random Variable: The positive square root of the variance:

$$\sigma = \sqrt{\sigma^2}$$

Example

The number of individuals, X, on base when a home run is hit ranges in value from 0 to 3.

x       p(x)    xp(x)   x²   x²p(x)
0       0.429   0.000   0    0.000
1       0.286   0.286   1    0.286
2       0.214   0.429   4    0.857
3       0.071   0.214   9    0.643
Total   1.000   0.929        1.786

The three totals are $\sum p(x)$, $\sum x\,p(x)$ and $\sum x^2 p(x)$.

• Computing the mean:

$$\mu = \sum_{x} x\,p(x) = 0.929$$

Note:
• 0.929 is the long-run average value of the random variable
• 0.929 is the centre of gravity value of the probability distribution of the random variable

• Computing the variance:

$$\sigma^2 = \sum_{x} (x - \mu)^2 p(x) = \sum_{x} x^2 p(x) - \left[\sum_{x} x\,p(x)\right]^2 = 1.786 - (0.929)^2 = 0.923$$

• Computing the standard deviation:

$$\sigma = \sqrt{\sigma^2} = \sqrt{0.923} = 0.961$$
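The mean, variance and standard deviation in this example can be reproduced with a few lines of Python. The probabilities are entered as exact fractions (6/14, 4/14, 3/14, 1/14) rather than the rounded decimals of the table; the variable names are illustrative.

```python
import math
from fractions import Fraction

# Probability function for the number on base when a home run is hit
p = {0: Fraction(6, 14), 1: Fraction(4, 14), 2: Fraction(3, 14), 3: Fraction(1, 14)}

mean = sum(x * px for x, px in p.items())                # sum of x p(x)
second_moment = sum(x ** 2 * px for x, px in p.items())  # sum of x^2 p(x)
variance = second_moment - mean ** 2
sd = math.sqrt(variance)

print(f"mean = {float(mean):.3f}")
print(f"variance = {float(variance):.3f}")
print(f"standard deviation = {sd:.3f}")
```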

The Binomial distribution

1. We have an experiment with two outcomes – Success(S) and Failure(F).

2. Let p denote the probability of S (Success).

3. In this case q=1-p denotes the probability of Failure(F).

4. This experiment is repeated n times independently.

5. X denotes the number of successes occurring in the n repetitions.

The possible values of X are

0, 1, 2, 3, 4, … , (n – 2), (n – 1), n

and p(x) for any of the above values of x is given by:

$$p(x) = \binom{n}{x} p^x (1-p)^{n-x} = \binom{n}{x} p^x q^{n-x}$$

Summary:

1. X is the number of successes occurring in the n repetitions of a Success-Failure Experiment.

2. The probability of success is p.

3. The probability function

$$p(x) = \binom{n}{x} p^x (1-p)^{n-x}$$

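Example:

1. A coin is tossed n = 5 times. X is the number of heads occurring in the 5 tosses of the coin. In this case p = ½ and

$$p(x) = \binom{5}{x} \left(\tfrac{1}{2}\right)^x \left(\tfrac{1}{2}\right)^{5-x} = \binom{5}{x} \left(\tfrac{1}{2}\right)^5 = \frac{1}{32}\binom{5}{x}$$

x      0      1      2       3       4      5
p(x)   1/32   5/32   10/32   10/32   5/32   1/32

[Figure: bar graph of p(x) against number of heads]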

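Computing the summary parameters for the distribution – μ, σ², σ

x       p(x)      xp(x)   x²   x²p(x)
0       0.03125   0.000    0   0.000
1       0.15625   0.156    1   0.156
2       0.31250   0.625    4   1.250
3       0.31250   0.938    9   2.813
4       0.15625   0.625   16   2.500
5       0.03125   0.156   25   0.781
Total   1.000     2.500        7.500

The three totals are $\sum p(x)$, $\sum x\,p(x)$ and $\sum x^2 p(x)$.

• Computing the mean:

$$\mu = \sum_{x} x\,p(x) = 2.5$$

• Computing the variance:

$$\sigma^2 = \sum_{x} (x - \mu)^2 p(x) = \sum_{x} x^2 p(x) - \left[\sum_{x} x\,p(x)\right]^2 = 7.5 - (2.5)^2 = 1.25$$

• Computing the standard deviation:

$$\sigma = \sqrt{\sigma^2} = \sqrt{1.25} = 1.118$$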

Example:

• A surgeon performs a difficult operation n = 10 times.

• X is the number of times that the operation is a success.

• The success rate for the operation is 80%. In this case p = 0.80 and

• X has a Binomial distribution with n = 10 and p = 0.80.

$$p(x) = \binom{10}{x} (0.80)^x (0.20)^{10-x}$$

Computing p(x) for x = 0, 1, 2, … , 10

x      0        1        2        3        4        5
p(x)   0.0000   0.0000   0.0001   0.0008   0.0055   0.0264

x      6        7        8        9        10
p(x)   0.0881   0.2013   0.3020   0.2684   0.1074

The Graph

[Figure: bar graph of p(x) against the number of successes, x, for x = 0 to 10]

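Computing the summary parameters for the distribution – μ, σ², σ

x       p(x)     xp(x)   x²    x²p(x)
0       0.0000   0.000     0    0.000
1       0.0000   0.000     1    0.000
2       0.0001   0.000     4    0.000
3       0.0008   0.002     9    0.007
4       0.0055   0.022    16    0.088
5       0.0264   0.132    25    0.661
6       0.0881   0.528    36    3.171
7       0.2013   1.409    49    9.865
8       0.3020   2.416    64   19.327
9       0.2684   2.416    81   21.743
10      0.1074   1.074   100   10.737
Total   1.000    8.000         65.600

The three totals are $\sum p(x)$, $\sum x\,p(x)$ and $\sum x^2 p(x)$.

• Computing the mean:

$$\mu = \sum_{x} x\,p(x) = 8.0$$

• Computing the variance:

$$\sigma^2 = \sum_{x} (x - \mu)^2 p(x) = \sum_{x} x^2 p(x) - \left[\sum_{x} x\,p(x)\right]^2 = 65.6 - (8.0)^2 = 1.60$$

• Computing the standard deviation:

$$\sigma = \sqrt{\sigma^2} = \sqrt{1.60} = 1.265$$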
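The surgeon example can be recomputed from the binomial formula as a check on the table and on the summary parameters. This is a small Python sketch using ordinary floating-point arithmetic; the variable names are chosen for illustration.

```python
from math import comb, sqrt

n, p = 10, 0.80

# p(x) = (n choose x) p^x (1 - p)^(n - x) for x = 0, 1, ..., n
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

mean = sum(x * px for x, px in enumerate(pmf))
second_moment = sum(x ** 2 * px for x, px in enumerate(pmf))
variance = second_moment - mean ** 2
sd = sqrt(variance)

for x, px in enumerate(pmf):
    print(f"{x:2d} {px:.4f}")
print(f"mean = {mean:.3f}, variance = {variance:.3f}, sd = {sd:.3f}")
```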
