Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Factors influencing the Human Development Index (HDI) using Multiple linear regression ADITYA PANUGANTI 1202062944 Industrial Engineering Year of data: 2008 Source: UN Development Programme Database

description

Identified the most crucial factors that influence Human Development Index through regression analysis using Minitab software

Transcript of Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Page 1: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

ADITYA PANUGANTI

1202062944

Industrial Engineering

Year of data: 2008

Source: UN Development Programme Database

Page 2: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Objective and Dataset description

• To find which of the following variables have an effect on the Human Development Index (HDI)

DESCRIPTION OF THE REGRESSOR                                  VARIABLE        TYPE          LEVELS
Adult literacy rate                                           LIT             NUMERICAL     -
Combined gross enrollment ratio in education (both sexes)     GRO             NUMERICAL     -
Gender Inequality index                                       GEN             NUMERICAL     -
Mean years of schooling (people whose age is more than 25)    SCH             NUMERICAL     -
Expected years of schooling (for a child at entrance age)     EXP             NUMERICAL     -
Life expectancy at birth                                      LIF             NUMERICAL     -
Gross Domestic Product per capita (GDP)                       GDP             NUMERICAL     -
Gross National Income per capita (GNI)                        GNI             NUMERICAL     -
Maternal mortality ratio                                      MAT             NUMERICAL     -
Intensity of Deprivation                                      DEP             CATEGORICAL   2
Homicide rate                                                 HOM             NUMERICAL     -
Under-five mortality rate                                     MOR             NUMERICAL     -
Continent indicator variable                                  CON1 to CON6    CATEGORICAL   6

Continent indicators: CON1-Africa; CON2-Asia; CON3-Europe; CON4-N.America; CON5-Oceania; CON6-S.America

Page 3: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Fitting the full model without interaction terms

• The regression equation for the full model is:

y = 0.0596 + 0.00440 LIF + 0.000007 GDP - 0.000748 GRO + 0.0158 SCH + 0.0080 GEN + 0.0159 EXP - 0.000004 GNI + 0.000003 MAT - 0.000051 HOM - 0.000540 MOR + 0.000176 LIT - 0.0185 DEP + 0.0023 CON1 - 0.0117 CON2 - 0.0100 CON3 + 0.00431 CON4 - 0.0268 CON5

• The coefficients of this equation are difficult to interpret, since the regressors are measured on very different scales.

• Hence the regression coefficients were standardized using unit normal scaling.
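As a rough illustration of unit normal scaling, each numerical regressor can be centered on its sample mean and divided by its sample standard deviation. The sketch below uses Python with NumPy rather than the Minitab workflow behind these slides; the function name and the sample values are made up for illustration. It assumes that only the numerical columns are scaled, which would be consistent with the DEP and CON coefficients staying unchanged in the standardized equation.

```python
import numpy as np

def unit_normal_scale(X):
    """Center each column on its mean and divide by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Made-up values for two numerical regressors on very different scales
# (for example, life expectancy in years and GDP per capita in dollars).
raw = np.array([[52.0, 1200.0],
                [68.5, 9800.0],
                [79.1, 41000.0],
                [74.3, 15500.0]])
scaled = unit_normal_scale(raw)
print(scaled.round(3))
```

After scaling, every column has mean 0 and sample standard deviation 1, so coefficient magnitudes become comparable across regressors.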

Page 4: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Fitting the full model after Standardization

• The regression equation is

y = 0.684 + 0.0404 LIF + 0.100 GDP - 0.0117 GRO + 0.0408 SCH + 0.00136 GEN + 0.0443 EXP - 0.0627 GNI + 0.00089 MAT - 0.00068 HOM - 0.0196 MOR + 0.00259 LIT - 0.0185 DEP + 0.0023 CON1 - 0.0117 CON2 - 0.0100 CON3 + 0.00431 CON4 - 0.0268 CON5

• Model Statistics:

R-Sq = 98.5% R-Sq(adj) = 98.2%

Analysis of Variance (ANOVA)

Source            DF    SS        MS        F        P
Regression        17    2.21784   0.13046   325.49   0.000
Residual Error    84    0.03367   0.00040
Total             101   2.25150
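The summary statistics follow directly from the ANOVA table. A small Python check, using only the sums of squares and degrees of freedom reported above (the variable names are illustrative), reproduces the mean squares, the F statistic, and both R-Sq values up to rounding:

```python
# Values copied from the ANOVA table above.
ss_reg, df_reg = 2.21784, 17
ss_res, df_res = 0.03367, 84
ss_tot, df_tot = 2.25150, 101

ms_reg = ss_reg / df_reg                    # mean square for regression
ms_res = ss_res / df_res                    # mean square for residual error
f_stat = ms_reg / ms_res                    # overall F statistic
r_sq = ss_reg / ss_tot                      # R-Sq
r_sq_adj = 1 - ms_res / (ss_tot / df_tot)   # R-Sq(adj)

print(f"MS regression = {ms_reg:.5f}")
print(f"MS residual   = {ms_res:.5f}")
print(f"F             = {f_stat:.2f}")
print(f"R-Sq          = {100 * r_sq:.1f}%")
print(f"R-Sq(adj)     = {100 * r_sq_adj:.1f}%")
```

The F value computed from the rounded table entries differs from the reported 325.49 only in the second decimal place.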

Page 5: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Signs of Multicollinearity

• Inference from Variance Inflation Factors (VIFs):

VIF of GDP = 560.116 and VIF of GNI = 533.109 (Indicating Severe Multicollinearity)

VIF of EXP = 18.368 and VIF of GRO = 16.456 (just over 10; Indicating Multicollinearity)

• Inference from correlation matrix:

       LIF      GDP      GRO      SCH      GEN      EXP      GNI   MAT
GDP    0.595
GRO    0.719    0.630
SCH    0.603    0.553    0.776
GEN   -0.677   -0.705   -0.758   -0.743
EXP    0.692    0.636    0.956    0.774   -0.798
GNI    0.584    0.999    0.618    0.539   -0.688    0.620

Dropped GNI from the model. R-sq and R-sq(adj) were unchanged after dropping this variable.

R-Sq = 98.5% R-Sq(adj) = 98.2%

• To confirm Multicollinearity between EXP and GRO, did a further analysis using Principal Component Analysis.

• Found the condition number to be λmax/λmin = 7.8001/0.0327 = 238.53 (>100, indicating moderate multicollinearity).

Also dropped EXP from the model and checked the model summary statistics: a slight reduction in R-sq and R-sq(adj).
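Both diagnostics used on this slide can be computed from the correlation matrix of the regressors. The sketch below is a NumPy illustration under stated assumptions, not the Minitab procedure used for the slides: VIFs are taken as the diagonal of the inverse correlation matrix, and the condition number as λmax/λmin of that matrix. The function name and the synthetic data (a near-duplicate pair standing in for GDP and GNI) are hypothetical.

```python
import numpy as np

def vif_and_condition_number(X):
    """X: (n_samples, n_predictors) array of numerical regressors."""
    R = np.corrcoef(X, rowvar=False)     # correlation matrix of the regressors
    vifs = np.diag(np.linalg.inv(R))     # VIF_j is the j-th diagonal element of R^-1
    eigvals = np.linalg.eigvalsh(R)      # eigenvalues in ascending order
    kappa = eigvals[-1] / eigvals[0]     # lambda_max / lambda_min
    return vifs, kappa

# Synthetic stand-ins: "gni" is almost a copy of "gdp", "lif" is unrelated.
rng = np.random.default_rng(0)
gdp = rng.normal(size=200)
gni = gdp + 0.03 * rng.normal(size=200)
lif = rng.normal(size=200)

vifs, kappa = vif_and_condition_number(np.column_stack([gdp, gni, lif]))
print("VIFs:", vifs.round(1))
print("Condition number:", round(float(kappa), 1))
```

In this toy setup the two near-duplicate columns get very large VIFs while the unrelated column stays near 1, mirroring the GDP/GNI pattern reported above.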

Page 6: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Residual plots and Model Adequacy

Both normality and Residual vs fitted plots look good, satisfying the normality and constant variance conditions

Page 7: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Indicator Interactions

• Considered interaction terms of DEP with the other numerical variables.

• 24 variables in all, including all the interaction terms

• S = 0.0220704 R-Sq = 98.3% R-Sq(adj) = 97.8% R-Sq(pred) = 96.80%

• Residual plots:

[Figure: Residual Plots for y. Panels: Normal Probability Plot (Percent vs. Residual); Versus Fits (Residual vs. Fitted Value); Histogram (Frequency vs. Residual); Versus Order (Residual vs. Observation Order)]

Page 8: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Outliers and Influential points

Obs   Hii (0.49484)   Outlier or not   Student residuals (>2)   Outlier or not   DFFIT (0.97014)   Influential or not
94    0.322922        -                3.642607124              Outlier          2.747085588       Influential
95    0.174703        -                2.499454616              Outlier          1.191860945       Influential
70    0.514913        Outlier          2.612075845              Outlier          2.800604995       -
62    0.825517        Outlier          0.957486148              -                2.08152168        Influential
38    0.108995        -                3.082041891              Outlier          1.14891402        Influential

("-" marks a cell left blank on the slide.)
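The three columns in this table (leverage Hii, studentized residuals, DFFIT) can all be derived from the hat matrix of the fitted model. The following NumPy sketch shows one standard way to compute them. It assumes externally studentized (R-student) residuals, which may differ from the residual type Minitab reported for the slides, and the function name and planted test point are hypothetical.

```python
import numpy as np

def influence_diagnostics(X, y):
    """Return leverages h_ii, R-student residuals, and DFFITS for an OLS fit."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])         # design matrix with intercept
    p = A.shape[1]                               # number of model parameters
    H = A @ np.linalg.solve(A.T @ A, A.T)        # hat matrix
    h = np.diag(H)                               # leverages h_ii
    e = y - H @ y                                # ordinary residuals
    ms_res = e @ e / (n - p)
    # Residual variance estimated with observation i deleted.
    s2_del = ((n - p) * ms_res - e**2 / (1 - h)) / (n - p - 1)
    r_student = e / np.sqrt(s2_del * (1 - h))
    dffits = r_student * np.sqrt(h / (1 - h))
    return h, r_student, dffits

# Synthetic data with one planted high-leverage, outlying point at index 0.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
x[0] = 8.0
y = 1 + 2 * x + rng.normal(scale=0.5, size=30)
y[0] += 5.0

h, r_student, dffits = influence_diagnostics(x, y)
n, p = 30, 2
print("flagged by leverage:", np.where(h > 2 * p / n)[0])
print("flagged by residual:", np.where(np.abs(r_student) > 2)[0])
print("flagged by DFFITS:  ", np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0])
```

The planted point is flagged by all three rules here; in real data, as in the table above, a point is often flagged by only one or two of them.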

Page 9: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Other outliers in graph

• Refitted the model for each of the data points 45, 50 and 80 and checked for any changes in the summary statistics.

• These points are neither high-leverage nor influential; they are only outliers. R-sq also does not change much, so we are leaving them in the model.

Page 10: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Residual plots after taking off the outliers and influential points

[Figure: Residual Plots for y. Panels: Normal Probability Plot (Percent vs. Residual); Versus Fits (Residual vs. Fitted Value); Histogram (Frequency vs. Residual); Versus Order (Residual vs. Observation Order)]

• The normal probability plot looks good, but the residuals-versus-fits plot shows a double-bow shape.

• To confirm this, we ran a Box-Cox analysis, which showed that a transformation of y is needed.

Page 11: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Box-Cox Transformation

lambda    -1        -0.5      0         0.15      0.25      0.375     0.5       0.75      1         2
SS res    0.07918   0.05952   0.04749   0.04484   0.04326   0.04148   0.03989   0.03721   0.03511   0.03147

Suggests lambda = 2, which implies the transformation y → y^2 (written y2)

[Figure: Box-Cox Plot of y (StDev vs. Lambda), with lower and upper confidence limits marked]

Lambda (using 95.0% confidence): Estimate 2.09; Lower CL 1.15; Upper CL 3.06; Rounded Value 2.00
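A table of SS res against lambda like the one on this slide can be produced by refitting the model on a scaled power transform of y for each candidate lambda. The sketch below is a NumPy illustration of that idea, assuming the usual geometric-mean scaling so that residual sums of squares are comparable across lambdas. It is not the Minitab routine used for the slides, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def boxcox_ss_res(X, y, lambdas):
    """Residual sum of squares of an OLS fit for each Box-Cox lambda (y must be > 0)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])     # design matrix with intercept
    gm = np.exp(np.mean(np.log(y)))          # geometric mean of y
    results = {}
    for lam in lambdas:
        if lam == 0:
            z = gm * np.log(y)               # limiting case of the transform
        else:
            z = (y**lam - 1.0) / (lam * gm**(lam - 1.0))
        beta, *_ = np.linalg.lstsq(A, z, rcond=None)
        resid = z - A @ beta
        results[lam] = float(resid @ resid)
    return results

# Synthetic response whose square is linear in x, so lambda = 2 should fit best.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = np.sqrt(0.1 + 0.8 * x + rng.normal(scale=0.01, size=100))

ss = boxcox_ss_res(x, y, [-1, -0.5, 0, 0.5, 1, 2])
best = min(ss, key=ss.get)
for lam, value in ss.items():
    print(f"lambda = {lam:>4}: SS res = {value:.5f}")
print("best lambda in grid:", best)
```

The lambda with the smallest SS res is the suggested transformation, which is how the table above points to lambda = 2.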

Page 12: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Residual plots after transformation

[Figure: Residual Plots for y2. Panels: Normal Probability Plot (Percent vs. Residual); Versus Fits (Residual vs. Fitted Value); Histogram (Frequency vs. Residual); Versus Order (Residual vs. Observation Order)]

Some outliers are visible in the normal probability plot.

Page 13: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Outliers and Influential points

Obs   Hii (0.49485)   Outlier or not   Student residuals (>2)   Outlier or not   DFFIT (0.99483)   Influential
82    0.319989        -                1.567901961              -                1.07554448        Influential
76    0.329424        -                1.663755784              -                1.166118631       Influential
55    0.137185        -                3.046901559              Outlier          1.214935585       -
45    0.10912         -                2.940944269              Outlier          1.029270034       -
44    0.844545        Outlier          0.063698129              -                0.148468899       -
39    0.243545        -                2.155012184              Outlier          1.222778345       Influential

("-" marks a cell left blank on the slide.)
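The cutoffs quoted in this table's header look like the common rules of thumb 2p/n for leverage and 2·sqrt(p/n) for DFFITS. The slides do not state p or n, so the values p = 24 and n = 97 used in this short Python check are assumptions, chosen only because they reproduce the quoted numbers.

```python
import math

def leverage_cutoff(p, n):
    """Rule-of-thumb leverage cutoff: 2p/n."""
    return 2 * p / n

def dffits_cutoff(p, n):
    """Rule-of-thumb DFFITS cutoff: 2 * sqrt(p/n)."""
    return 2 * math.sqrt(p / n)

# Assumed values (not stated on the slides): 24 model terms, 97 observations.
p, n = 24, 97
print(f"Hii cutoff   = {leverage_cutoff(p, n):.5f}")
print(f"DFFIT cutoff = {dffits_cutoff(p, n):.5f}")
```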

Page 14: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Residual plots after taking off the outliers and influential points

[Figure: Residual Plots for y2. Panels: Normal Probability Plot (Percent vs. Residual); Versus Fits (Residual vs. Fitted Value); Histogram (Frequency vs. Residual); Versus Order (Residual vs. Observation Order)]

No need for any transformation, Box-Cox suggests λ = 1

Page 15: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Variable selection and Model building

Method: Best subsets
Suggested predictors (11 variables): LIF GDP GRO MOR SCH GEN CON2 CON5 GDP_D HOM_D MOR_D
Summary statistics: S = 0.01869; R-Sq = 99.1%; R-sq(adj) = 99%; Mallows Cp = 10.6

Method: Forward Selection (α-to-enter: 0.15)
Suggested predictors (14 variables): GEN GRO LIF SCH GDP CON2 GDP_D LIF_D MOR CON5 MOR_D HOM_D DEP HOM
Summary statistics: S = 0.0186; R-Sq = 99.18%; R-sq(adj) = 99.03%; Mallows Cp = 12.8

Method: Backward Elimination (α-to-remove: 0.15)
Suggested predictors (11 variables): LIF GDP GRO MOR SCH GEN CON2 CON5 GDP_D HOM_D MOR_D
Summary statistics: S = 0.0187; R-Sq = 99.14%; R-sq(adj) = 99.02%; Mallows Cp = 10.6

Method: Stepwise Regression (α-to-enter: 0.15, α-to-remove: 0.15)
Suggested predictors (11 variables): GEN GRO LIF GDP SCH CON2 MOR GDP_D CON5 MOR_D HOM_D
Summary statistics: S = 0.0187; R-Sq = 99.14%; R-sq(adj) = 99.02%; Mallows Cp = 10.6
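Mallows Cp, reported for each candidate model above, compares a subset model against the full model. A minimal NumPy sketch of the usual formula Cp = SS res(subset)/MS res(full) - n + 2p follows, where p counts the subset's parameters including the intercept. The helper names and the synthetic data are hypothetical, and this is not the Minitab best-subsets routine.

```python
import numpy as np

def ss_res(X, y):
    """Residual sum of squares and parameter count for an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid), A.shape[1]

def mallows_cp(X_full, y, cols):
    """Mallows Cp of the subset model using the columns listed in cols."""
    n = len(y)
    ss_full, p_full = ss_res(X_full, y)
    ms_full = ss_full / (n - p_full)         # MS res of the full model
    ss_sub, p_sub = ss_res(X_full[:, cols], y)
    return ss_sub / ms_full - n + 2 * p_sub

# Synthetic data: only the first two of five candidate regressors matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=60)

cp_full = mallows_cp(X, y, [0, 1, 2, 3, 4])
cp_good = mallows_cp(X, y, [0, 1])
cp_bad = mallows_cp(X, y, [2, 3, 4])
print(f"Cp full = {cp_full:.2f}, Cp good subset = {cp_good:.2f}, Cp bad subset = {cp_bad:.2f}")
```

By construction the full model always has Cp equal to its own parameter count; subsets with Cp close to or below their parameter count are the attractive ones.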

Page 16: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Fit the selected model

• Regression equation:

y2 = 0.476 - 0.0164 GEN + 0.0403 GRO + 0.0422 LIF + 0.0557 GDP + 0.0449 SCH - 0.0181 CON2 - 0.0388 MOR + 0.0523 GDP_D + 0.0289 CON5 + 0.0412 MOR_D - 0.0476 HOM_D

• Detected multicollinearity using principal component analysis: condition number = 134.837 (>100, moderate multicollinearity)

• Linear dependency equation: 0.107 GRO + 0.337 LIF + 0.798 MOR - 0.467 MOR_D (dependency between the variables in the equation)

• The correlation matrix showed that MOR is highly correlated with LIF and MOR_D.

• Dropping MOR removed the multicollinearity from the model: condition number = 39.04617 (<100, no multicollinearity)

Page 17: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Residual plots after dropping MOR

[Figure: Residual Plots for y2. Panels: Normal Probability Plot (Percent vs. Residual); Versus Fits (Residual vs. Fitted Value); Histogram (Frequency vs. Residual); Versus Order (Residual vs. Observation Order)]

• Presence of an outlier: data point 72

• No need for any transformation, Box-Cox suggests λ = 1

Page 18: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Fit the model after dropping off the outlier

• The regression equation is

y2 = 0.482 - 0.0221 GEN + 0.0436 GRO + 0.0576 LIF + 0.0528 GDP + 0.0483 SCH - 0.0115 CON2 + 0.0556 GDP_D + 0.0182 CON5 + 0.0169 MOR_D - 0.0538 HOM_D

• R-sq = 99.1% R-sq(adj) = 99% R-sq(pred) = 98.73%

[Figure: Normal Probability Plot (response is y2), Percent vs. Residual]

[Figure: Versus Fits (response is y2), Residual vs. Fitted Value]

Page 19: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Model validation

• Considered 118 countries for modelling: 102 estimation data and 16 prediction data

y[pred]   y[actual]
0.784     0.867
0.583     0.599
0.793     0.87
0.796     0.896
0.781     0.881
0.776     0.87
0.695     0.77
0.831     0.851
0.575     0.616
0.378     0.417
0.720     0.816
0.341     0.408
0.753     0.847
0.815     0.9
0.315     0.421
0.770     0.849

R2 [pred] by model = 83.65%

R2[pred] (L.S fit) = 98.73%

MS res by model = 0.006122809

Ms res (L.S fit) = 0.00036
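The prediction statistics can be recomputed from the 16 predicted/actual pairs listed on this slide. The Python sketch below uses the rounded values as printed and assumes two common definitions: mean squared prediction error as the average squared difference, and prediction R2 as 1 - SSE/SST about the mean of the hold-out actuals. The slide does not say which formulas were used, so the results need not match the quoted figures exactly.

```python
import numpy as np

# Predicted and actual values copied from the validation table above.
y_pred = np.array([0.784, 0.583, 0.793, 0.796, 0.781, 0.776, 0.695, 0.831,
                   0.575, 0.378, 0.720, 0.341, 0.753, 0.815, 0.315, 0.770])
y_actual = np.array([0.867, 0.599, 0.87, 0.896, 0.881, 0.87, 0.77, 0.851,
                     0.616, 0.417, 0.816, 0.408, 0.847, 0.9, 0.421, 0.849])

errors = y_actual - y_pred
sse = float(errors @ errors)                               # sum of squared prediction errors
mse = sse / len(errors)                                    # mean squared prediction error
sst = float(((y_actual - y_actual.mean()) ** 2).sum())     # variation of the hold-out actuals
r2_pred = 1 - sse / sst                                    # assumed definition of prediction R2

print(f"mean squared prediction error = {mse:.6f}")
print(f"prediction R2 (1 - SSE/SST)   = {100 * r2_pred:.2f}%")
print(f"mean prediction error         = {errors.mean():.4f}")
```

Every prediction in the table falls below its actual value, so the mean prediction error is positive; that systematic gap, rather than scatter, accounts for most of the drop from the least-squares fit statistics.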

Page 20: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Conclusion

• The reduced model has a better R-sq than the actual model and most of the variables are significant (low p-value) in the model.

• The following variables were found to be significant:
– Gender inequality index
– Combined gross enrolment
– Life expectancy at birth
– GDP
– Mean schooling years
– Countries in continent 2
– GDP & intensity of deprivation
– Under 5 mortality rate & intensity of deprivation
– Homicide rate & intensity of deprivation

Page 21: Factors influencing the Human Development Index (HDI) using Multiple Linear Regression

Possible improvements

• More datapoints

• Ridge regression to reduce the effects of multicollinearity

• Robust regression, to down-weight unusual data points instead of deleting them, so that they can be retained in the model.
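As a pointer for the ridge regression suggestion, the ridge estimate on standardized regressors has a closed form, (Z'Z + kI)^-1 Z'y. The NumPy sketch below is only an illustration of that formula on synthetic, nearly collinear data; the function name, the data, and the choice k = 1 are hypothetical, and in practice k would be chosen from a ridge trace or by cross-validation.

```python
import numpy as np

def ridge_coefficients(X, y, k):
    """Ridge estimate on standardized regressors: (Z'Z + kI)^-1 Z'(y - mean(y))."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    y_centered = y - y.mean()
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ y_centered)

# Synthetic, nearly collinear pair of regressors.
rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + 0.02 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

b_ols = ridge_coefficients(X, y, k=0.0)     # k = 0 reduces to ordinary least squares
b_ridge = ridge_coefficients(X, y, k=1.0)   # a small penalty stabilizes the estimates
print("OLS coefficients:  ", b_ols.round(3))
print("ridge coefficients:", b_ridge.round(3))
```

The penalty shrinks the coefficient vector, trading a small bias for lower variance when regressors are strongly correlated.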