
Q2008 - ROME, 09-11 JULY 2008

Implementation and evaluation of imputation strategies to improve the data accuracy

The case of Italian students data from the Programme for International Student Assessment (PISA 2003)

Claudio Quintano, Rosalia Castellano, Sergio Longobardi

University of Naples “Parthenope”


IMPROVING THE ACCURACY OF ITALIAN DATA FROM OECD’s “Programme for International Student Assessment” (PISA 2003)

BY DEVELOPING IMPUTATION STRATEGIES

TO REDUCE THE NON-SAMPLING ERROR OF PARTIAL NON RESPONSES

OUTLINES




PISA 2003

The OECD’s PISA “Programme for International Student Assessment” survey is an internationally standardised assessment administered to 15-year-old students

The survey involves

276,165 students (11,639 in Italy)

10,274 schools (406 in Italy)

41 countries (20 European Union members)



The survey assesses the students’ competencies in three areas

Reading literacy

Scientific literacy

Mathematical literacy



AVAILABLE DATA

The OECD collects data on

FAMILY ENVIRONMENT OF STUDENT (STUDENT DATASET)

SCHOOL CHARACTERISTICS (SCHOOL DATASET)



[Figure: bar chart by country, vertical axis from 0% to 100%]

Multilevel (school and student) model with 4 covariates

ITALY: EXCLUDED STUDENT UNITS (8%) AS ONE OR MORE STUDENT OR SCHOOL VARIABLES ARE MISSING



[Figure: bar chart by country, vertical axis from 0% to 100%]

Multilevel (school and student) model with 29 covariates

ITALY: EXCLUDED STUDENT UNITS (81%) AS ONE OR MORE STUDENT OR SCHOOL VARIABLES ARE MISSING
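The exclusion percentages quoted for Italy follow from complete-case analysis: a student is dropped as soon as one covariate of the model is missing, so the excluded share grows with the number of covariates. A minimal sketch of that count, assuming made-up records and purely illustrative covariate names (`escs`, `anxmat`, `schlsize` are placeholders, not the covariates of the models in the slides):

```python
# Hypothetical toy records: None marks a missing value.
students = [
    {"escs": 0.3, "anxmat": None, "schlsize": 650},
    {"escs": -1.1, "anxmat": 0.4, "schlsize": 420},
    {"escs": None, "anxmat": 0.9, "schlsize": 420},
    {"escs": 0.8, "anxmat": -0.2, "schlsize": 980},
]


def excluded_share(records, covariates):
    """Share of records with at least one missing value among the covariates."""
    excluded = sum(
        1 for rec in records if any(rec[c] is None for c in covariates)
    )
    return excluded / len(records)


# A model with one covariate versus a model with three covariates.
share_small = excluded_share(students, ["schlsize"])
share_large = excluded_share(students, ["escs", "anxmat", "schlsize"])
print(share_small, share_large)
```

With these toy records no student is lost under the one-covariate model, while half of them are lost under the three-covariate model.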



STEPS OF ANALYSIS

Missing data pattern

Imputation strategies

Evaluation of results




TWO SUBSETS OF VARIABLES

OECD’S PISA DATASET

COLLECTED VARIABLES

Data collected by student and school questionnaires

DERIVED VARIABLES

Computed from collected variables (by linear combination or factor analysis). This increases the analytical potential of the survey



EXAMPLE OF DERIVED VARIABLES

The PISA 2003 index of confidence in ICT internet tasks is derived from students’ responses to the five items. All items are inverted for IRT scaling and positive values on this index indicate high self-confidence in ICT internet tasks

The PISA 2003 index of school size (SCHLSIZE) is derived from summing school principals’ responses to the number of girls and boys at a school

The PISA 2003 index of availability of computers (RATCOMP) is derived from school principals’ responses to the items measuring the availability of computers. It is calculated by dividing the number of computers at school by the number of students at school
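The two school-level indices can be illustrated with a short sketch. The function names and the handling of missing inputs (a missing collected item makes the derived index missing) are assumptions made for illustration; the slides do not describe how the OECD treats missing inputs.

```python
def school_size(n_boys, n_girls):
    """SCHLSIZE-style index: enrolment as the sum of boys and girls."""
    if n_boys is None or n_girls is None:
        return None  # assumption: a missing input makes the index missing
    return n_boys + n_girls


def computer_ratio(n_computers, n_students):
    """RATCOMP-style index: number of computers divided by number of students."""
    if n_computers is None or not n_students:
        return None  # assumption: undefined when an input is missing or zero
    return n_computers / n_students


size = school_size(310, 290)
ratio = computer_ratio(60, size)
incomplete = computer_ratio(60, school_size(None, 290))
print(size, ratio, incomplete)
```

Under this assumption, one unanswered collected item propagates into every index derived from it.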




“COLLECTED” AND “DERIVED” VARIABLES

Distribution of variables classified as “collected” and “derived” in the school and student datasets of PISA 2003

                         SCHOOL DATASET   STUDENT DATASET
“COLLECTED” VARIABLES         154              215
“DERIVED” VARIABLES            30              109
TOTAL                         184              324



VARIABLE TYPOLOGY

STUDENTS’ DATASET

CATEGORICAL VARIABLES (n. 197) 91.7%

CONTINUOUS VARIABLES (n. 15) 6.9%

DISCRETE VARIABLES (n. 3) 1.4%

TOTAL OF COLLECTED VARIABLES AT STUDENT LEVEL (n. 215) 100%

VARIABLES WITH > 5% OF MISSING DATA (n. 39) 18%

VARIABLES WITHOUT MISSING DATA (n. 3) 1.5%



FIVE IMPUTATION PROCEDURES

PROCEDURE A: Iterative and sequential multiple regression applied to each section of the student questionnaire

PROCEDURE B: Iterative and sequential multiple regression applied to imputation classes computed by a regression tree

PROCEDURE C: Random selection of donors within imputation classes computed by a regression tree

PROCEDURE D: Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire

PROCEDURE E: Iterative and sequential multiple regression applied to the whole dataset

USUAL ASSOCIATIONS AND ANTINOMIES OF ADOPTED IMPUTATION PROCEDURES (A-E)

ALL PROCEDURES BELONG TO WELL-KNOWN CATEGORIES

TWO CATEGORIES ARE INVOLVED: REGRESSION METHODS (A, B, E) AND DONOR METHODS (C, D)

DIMENSION OF TREATED DATA MATRIX: THE IMPUTATION PROCEDURE IS (A, D) / IS NOT (B, C, E) APPLIED TO EACH SECTION OF THE QUESTIONNAIRE

TWO DATA MATRIX SIDES ARE INVOLVED: UNITS (Classification And Regression Tree: B, C, D) AND VARIABLES (A, D)

MISSING DATA MECHANISM IS (A, E) / IS NOT (B, C, D) CONSIDERED



PROCEDURE A

Iterative and sequential multiple regression (Raghunathan et al. 2001) on each section of the student questionnaire

The data matrix is partitioned into the seven sections of the student questionnaire

The features of each section, as a partition of the data matrix:

•Strong logical links between the questions

•Homogeneous structure of association and relationship

•Homogeneous presence of missing data



PROCEDURE A

Subset  Section of the questionnaire       Categorical  Continuous  Discrete  Variables  Average of missing data
                                           variables    variables   variables            for each section
1       Family context (Sect. B)               39           0          0         39            107
2       Educational level (Sect. C)            14           6          1         21            508
3       School context (Sect. D)               18           0          0         18            117
4       Learning mathematics (Sect. E)         40           6          0         46            237
5       Mathematics classes (Sect. F)          21           2          1         24            212
6       ICT confidence (Sect. ICT)             49           0          0         49            713
7       Educational career (Sect. EC)           6           0          1          7            242



PROCEDURE B

Iterative and sequential multiple regression applied to imputation classes computed by a regression tree

STEP I: UNITS CLASSIFICATION

Computation of regression tree (14 terminal nodes)

DEPENDENT VARIABLE: Missing data for each student

PREDICTORS: Selected from five categories of derived indicators θ:
•Family background
•Scholastic context
•Approach to study
•Attitudes toward ICT tools
•Performance scores

Each terminal node of the tree is considered as an imputation class

STEP II: IMPUTATION

The missing values within each class are imputed by the iterative and sequential regression model (Raghunathan et al. 2001)



PROCEDURE C

Random selection of donors within imputation classes computed by a regression tree

STEP I: UNITS CLASSIFICATION

Computation of regression tree (14 terminal nodes)

DEPENDENT VARIABLE: Missing data for each student

PREDICTORS: Selected from five categories of derived indicators θ:
•Family background
•Scholastic context
•Approach to study
•Attitudes toward ICT tools
•Performance scores

STEP II: IMPUTATION

A different donor is selected to impute each missing value of each student

The donor is selected randomly from the same node
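The donor step of Procedure C can be sketched as a random hot-deck within classes. This is an illustration under stated assumptions, not the authors' implementation: the class label of each student (the terminal node of the tree) is taken as given in a hypothetical `node` field, donors are restricted to observed values, and the names `donor_impute`, `q1` and `q2` are invented for the example.

```python
import random


def donor_impute(records, class_key, variables, seed=0):
    """Fill each missing value with the value of a random donor from the same class."""
    rng = random.Random(seed)
    imputed = [dict(rec) for rec in records]  # leave the input untouched
    for rec, out in zip(records, imputed):
        for var in variables:
            if rec[var] is not None:
                continue
            # Donors: observed values of this variable within the same node.
            donors = [
                d[var]
                for d in records
                if d[class_key] == rec[class_key] and d[var] is not None
            ]
            if donors:  # a separate random draw for every missing value
                out[var] = rng.choice(donors)
    return imputed


records = [
    {"node": 1, "q1": "a", "q2": 3},
    {"node": 1, "q1": None, "q2": 3},
    {"node": 1, "q1": "a", "q2": None},
    {"node": 2, "q1": "b", "q2": None},
    {"node": 2, "q1": None, "q2": 5},
]
completed = donor_impute(records, "node", ["q1", "q2"])
print(completed)
```

Because a donor is drawn separately for each missing value, two imputed items of the same student may come from different donors, which matches the wording "a different donor is selected to impute each missing value".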



PROCEDURE D

Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire

STEP I: MATRIX PARTITION

THE DATA MATRIX IS PARTITIONED INTO THE SEVEN SECTIONS OF THE STUDENT QUESTIONNAIRE

STEP II: UNITS CLASSIFICATION

A REGRESSION TREE IS PRODUCED WITHIN EACH PARTITION OF THE MATRIX (see the next slide)

STEP III: IMPUTATION

WITHIN ALL LEAVES, A DIFFERENT DONOR IS SELECTED TO IMPUTE EACH MISSING VALUE OF EACH STUDENT

THE DONOR IS SELECTED RANDOMLY FROM THE SAME NODE



PROCEDURE D

REGRESSION TREES FOR EACH MATRIX PARTITION

DEPENDENT VARIABLE (every section): Number of missing data for each record (student)

SECTION B
PREDICTORS: ISCO code Mother, Disciplinary climate in maths lessons, Mathematics self-efficacy, Computer facilities at home, Plausible value in problem solving
TERMINAL NODES: 12

SECTION C
PREDICTORS: Expected educational level of student (ISCED), Mathematics anxiety, Mathematics self-concept, ICT: Confidence in routine tasks, Plausible value in math
TERMINAL NODES: 11

SECTION E
PREDICTORS: Home educational resources, Mathematics anxiety, Mathematics self-concept, ICT: Confidence in routine tasks, Plausible value in problem solving
TERMINAL NODES: 7

SECTION IC
PREDICTORS: Index of Socio-Economic and Cultural Status, Mathematics anxiety, Mathematics self-efficacy, Computer facilities at home, Plausible value in problem solving
TERMINAL NODES: 16

SECTION D+F+EC
PREDICTORS: Expected educational level of student (ISCED), Disciplinary climate in maths lessons, Mathematics self-efficacy, Computer facilities at home, Plausible value in problem solving
TERMINAL NODES: 12



PROCEDURE E

ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (Raghunathan et al. 2001) ON THE WHOLE DATASET

(without any partition of units and variables)



METHODOLOGICAL DETAILS OF THE IMPUTATION PROCEDURES



Classification And Regression Tree

Classification and Regression Tree creates a tree-based classification model. It classifies cases into groups or predicts values of a dependent (target) variable (Y) based on values of independent (predictor) variables (X)

The classification is obtained through the recursive binary partition of the measurement space. The resulting subgroups (NODES), internally homogeneous with respect to the values of the target variable, correspond to imputation cells

[Diagram: PARENT NODE, CHILD NODE, TERMINAL NODE; the terminal nodes CREATE IMPUTATION CELLS]



STRUCTURE OF A REGRESSION TREE

Example: A tree T composed of five nodes t_i, i = 1, 2, 3, 4, 5

[Diagram: tree with nodes t1, t2, t3, t4, t5]

Impurity of a node t

$$R(t) = \frac{1}{N} \sum_{i \in t} \left( y_i - \bar{y}(t) \right)^2$$

For any split s of t into t_L and t_R, the best split s* is such that

$$\Delta R(s^*, t) = \max_{s \in S} \left[ R(t) - R(t_L) - R(t_R) \right]$$
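The impurity and the best-split rule can be made concrete with a small sketch for one numeric predictor. It is an illustration under assumptions: the sample passed in is treated as the node being split, N is its size, candidate splits are thresholds on the observed predictor values, and the function names are invented here.

```python
def impurity(values, n_total):
    """R(t): squared deviations from the node mean, summed and divided by N."""
    if not values:
        return 0.0
    node_mean = sum(values) / len(values)
    return sum((y - node_mean) ** 2 for y in values) / n_total


def best_split(x, y):
    """Return (threshold, decrease) maximising R(t) - R(t_L) - R(t_R)."""
    n = len(y)
    parent = impurity(y, n)
    best = (None, 0.0)
    # Every observed value except the largest is a candidate threshold.
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        decrease = parent - impurity(left, n) - impurity(right, n)
        if decrease > best[1]:
            best = (threshold, decrease)
    return best


# Toy target: number of missing items per student; toy predictor: an index score.
x = [1, 2, 3, 4]
y = [0, 0, 4, 4]
threshold, decrease = best_split(x, y)
print(threshold, decrease)
```

In this toy case the split at 2 separates the students with no missing items from those with four, so both child nodes are perfectly homogeneous and the whole parent impurity is removed.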



ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (1/2)

PARTITION OF THE VARIABLES

Variables without missing data -X-

Variables with missing data -Y-

STEP 1: The variable with the fewest missing values -Y1- is regressed on the subset of variables without missing data, U = X

STEP 2: Update U by appending Y1. Then the variable with the next fewest missing values, Y2, is regressed on U = (X, Y1), where Y1 has imputed values

STEP 3 ... STEP N: Each variable is imputed by using all available variables (complete or imputed)



ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (2/2)

ALL MISSING DATA ARE IMPUTED FOR EACH VARIABLE

NEXT ROUND

THE IMPUTATION PROCESS IS THEN REPEATED, MODIFYING THE PREDICTOR SET TO INCLUDE ALL X AND Y VARIABLES EXCEPT THE ONE USED AS THE DEPENDENT VARIABLE
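The order of operations in the two rounds can be sketched as follows. This is a deliberately simplified illustration, not the method of Raghunathan et al. (2001): that method chooses the regression model according to the type of each variable and draws imputations from a predictive distribution, whereas the sketch treats every variable as continuous and plugs in deterministic least-squares predictions. The names `ols_fit` and `sequential_regression_impute` are hypothetical.

```python
def ols_fit(rows, targets):
    """Least-squares coefficients (intercept first) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    # Augmented system [X'X | X'y].
    a = [[sum(d[i] * d[j] for d in design) for j in range(p)]
         + [sum(d[i] * t for d, t in zip(design, targets))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting (assumes X'X is non-singular).
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def sequential_regression_impute(data, rounds=3):
    """data maps a variable name to a list of numbers, with None for missing."""
    missing = {v: [i for i, val in enumerate(col) if val is None]
               for v, col in data.items()}
    filled = {v: list(col) for v, col in data.items()}
    complete = [v for v in data if not missing[v]]          # X
    incomplete = sorted((v for v in data if missing[v]),
                        key=lambda v: len(missing[v]))      # Y1, Y2, ...

    def impute(target, predictors):
        observed = [i for i in range(len(data[target]))
                    if i not in missing[target]]
        beta = ols_fit([[filled[p][i] for p in predictors] for i in observed],
                       [filled[target][i] for i in observed])
        for i in missing[target]:
            row = [1.0] + [filled[p][i] for p in predictors]
            filled[target][i] = sum(b * x for b, x in zip(beta, row))

    # First round: U starts as X and grows by one imputed variable per step.
    u = list(complete)
    for target in incomplete:
        impute(target, u)
        u.append(target)
    # Next rounds: each Y is regressed on all the other X and Y variables.
    for _ in range(rounds - 1):
        for target in incomplete:
            impute(target, [v for v in data if v != target])
    return filled


toy = {"x": [1, 2, 3, 4, 5], "y1": [3, 5, None, 9, 11]}
result = sequential_regression_impute(toy)
print(result["y1"])
```

In the toy data `y1` equals `2 * x + 1` on the observed rows, so the missing entry is imputed as approximately 7 and the observed values are left unchanged.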



EVALUATION OF IMPUTATION PROCEDURES

IMPACT ON:

UNIVARIATE DISTRIBUTIONS

RELATIONSHIP BETWEEN VARIABLES



IMPUTATION EFFECTS ON UNIVARIATE DISTRIBUTIONS

CATEGORICAL VARIABLES

ABSOLUTE RELATIVE SQUARE DISSIMILARITIES INDEX (LETI 1983)

[Formula garbled in extraction: index Im_j, with a factor 1/2, computed on the frequencies f_jk and f*_jk of the categories k of variable j]

N denotes the number of categorical variables

CONTINUOUS VARIABLES

ABSOLUTE RELATIVE VARIATION INDEX (AMONG MEANS)

$$I^{\mu}_{j} = \frac{\left| \bar{y}_j - \bar{y}^*_j \right|}{\bar{y}_j}$$

ABSOLUTE RELATIVE VARIATION INDEX (AMONG STANDARD DEVIATIONS)

$$I^{\sigma}_{j} = \frac{\left| \sigma_j - \sigma^*_j \right|}{\sigma_j}$$

The education survey data have been analysed with multilevel models.
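The two indices for continuous variables can be sketched in a few lines. Assumptions, labelled as such: the unstarred quantities are computed on the observed values before imputation and the starred ones on the completed variable, the population standard deviation is used, and the absolute value of the denominator is taken so that a negative mean does not flip the sign. The function names are invented for the example.

```python
from statistics import mean, pstdev


def mean_variation_index(observed, completed):
    """Absolute relative variation between the means before and after imputation."""
    before, after = mean(observed), mean(completed)
    return abs(before - after) / abs(before)


def sd_variation_index(observed, completed):
    """Absolute relative variation between the standard deviations."""
    before, after = pstdev(observed), pstdev(completed)
    return abs(before - after) / before


# Toy variable: three observed values, one missing value imputed with the mean.
observed = [2, 4, 6]
completed = [2, 4, 6, 4]
mean_idx = mean_variation_index(observed, completed)
sd_idx = sd_variation_index(observed, completed)
print(mean_idx, sd_idx)
```

The toy case shows why both indices are needed: imputing the mean leaves the mean index at zero while the standard deviation shrinks by about 13%.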



IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (1/2)

Variation Association Index (categorical variables)

Mean difference, for each imputed variable (Yj), between the association pre and post imputation of Yj vs the remaining N-1 categorical variables

$$VAI^{N}_{j} = \frac{\sum_{h \neq j} \left| Cramer_{jh} - Cramer^*_{jh} \right|}{N - 1}, \qquad 0 \le VAI^{N}_{j} \le 1$$

0 = zero impact, 1 = max impact

$$Cramer = \left[ \frac{\chi^2}{n \cdot \min(r - 1, c - 1)} \right]^{1/2}$$
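A sketch of the categorical Variation Association Index follows. Two points are assumptions rather than statements from the slides: the pre-imputation Cramér coefficient is computed on the cases where both variables are observed, and the data layout (a dict of equally long lists with None for missing) and function names are invented for illustration.

```python
from collections import Counter
from math import sqrt


def cramers_v(a, b):
    """Cramér's V for two equally long lists of category labels."""
    n = len(a)
    rows, cols, cells = Counter(a), Counter(b), Counter(zip(a, b))
    chi2 = sum(
        (cells[(r, c)] - rows[r] * cols[c] / n) ** 2 / (rows[r] * cols[c] / n)
        for r in rows
        for c in cols
    )
    k = min(len(rows) - 1, len(cols) - 1)
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def vai_categorical(j, before, after):
    """Mean absolute change in Cramér's V between variable j and the others."""
    others = [h for h in before if h != j]
    total = 0.0
    for h in others:
        # Pre-imputation association on pairwise complete cases (assumption).
        pairs = [(x, y) for x, y in zip(before[j], before[h])
                 if x is not None and y is not None]
        v_pre = cramers_v([x for x, _ in pairs], [y for _, y in pairs])
        v_post = cramers_v(after[j], after[h])
        total += abs(v_pre - v_post)
    return total / len(others)


before = {"q1": ["a", "a", "b", "b", None], "q2": ["x", "x", "y", "y", "x"]}
after = {"q1": ["a", "a", "b", "b", "b"], "q2": ["x", "x", "y", "y", "x"]}
vai = vai_categorical("q1", before, after)
print(vai)
```

In the toy data the two variables are perfectly associated on the observed cases; the imputed value breaks the pattern, and the index registers the loss of association.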



IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (2/2)

Variation Association Index (continuous variables)

Mean difference, for each imputed variable (Yj), between the correlation pre and post imputation of Yj vs the remaining N-1 continuous variables

$$VAI^{C}_{j} = \frac{\sum_{h \neq j} \left| r_{jh} - r^*_{jh} \right|}{2 (N - 1)}, \qquad 0 \le VAI^{C}_{j} \le 1$$

0 = zero impact, 1 = max impact



MATRIX V x P - “VARIABLES x PROCEDURES”

There are five matrices VxP, one for each evaluation index

Example of the structure of matrix VxP with a generic evaluation index Gj ≡ (Im_j or Iμ_j or Iσ_j or VAI^N_j or VAI^C_j)*

          Proc.A   Proc.B   Proc.C   Proc.D   Proc.E
Var.1     g1a      g1b      g1c      g1d      g1e
Var.2     g2a      g2b      g2c      g2d      g2e
...       ...      ...      ...      ...      ...
Var.N     gNa      gNb      gNc      gNd      gNe

(*) According to: a) the typology (categorical, etc.) of variables (the number of variables for each typology is denoted by N); b) the type of impact: on univariate distributions or on the relationship between variables



SCORES MATRICES

Each of the five matrices VP_G (N x 5), whose generic element is g_js, is transformed into a (0,1) score matrix S_G (N x 5) with generic element a_js

$$a_{js} = \begin{cases} 1 & \text{if } g_{js} \text{ is the minimum value in row } j \\ 0 & \text{otherwise} \end{cases}$$

Matrix VP_G (N x 5)

          Proc.A   Proc.B   Proc.C   Proc.D   Proc.E
Var.1     0.6      0.4      0.5      0.1      0.7
Var.2     0.3      0.6      0.9      0.5      0.8
...       ...      ...      ...      ...      ...
Var.N     0.4      0.2      0.8      0.6      0.5

Matrix S_G (N x 5)

          Proc.A   Proc.B   Proc.C   Proc.D   Proc.E
Var.1     0        0        0        1        0
Var.2     1        0        0        0        0
...       ...      ...      ...      ...      ...
Var.N     0        1        0        0        0
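The transformation from an index matrix to a score matrix is a row-wise minimum search. A minimal sketch using the values of the example matrix (with the Var.2 entry for Proc.E read as 0.8); the nested-dict layout and the function name are assumptions, as is the choice to give a score of 1 to every procedure tied at the minimum.

```python
def score_matrix(g):
    """g maps variable -> {procedure: index value}; 1 marks the row minimum."""
    scores = {}
    for var, row in g.items():
        lowest = min(row.values())
        # Assumption: procedures tied at the minimum all receive a score of 1.
        scores[var] = {proc: int(val == lowest) for proc, val in row.items()}
    return scores


vp = {
    "Var.1": {"A": 0.6, "B": 0.4, "C": 0.5, "D": 0.1, "E": 0.7},
    "Var.2": {"A": 0.3, "B": 0.6, "C": 0.9, "D": 0.5, "E": 0.8},
    "Var.N": {"A": 0.4, "B": 0.2, "C": 0.8, "D": 0.6, "E": 0.5},
}
s = score_matrix(vp)
for var, row in s.items():
    print(var, row)
```

The result reproduces the example score matrix: Proc.D scores on Var.1, Proc.A on Var.2 and Proc.B on Var.N.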



BUILDING A RANKING INDICATOR (1/3)

The ranking indicators measure the relative performance of each procedure according to each evaluation index

Scores column vector extraction from score matrix S for the sth procedure

Generic evaluation index Gj = (Im_j or Iμ_j or Iσ_j or VAI^N_j or VAI^C_j)*

          Proc.A   ...   Proc.S   ...   Proc.M          (a_js)
Var.1     a11=0    ...   a1s=1    ...   a1M=0           a1s=1
Var.2     a21=1    ...   a2s=0    ...   a2M=0           a2s=0
...       ...      ...   ...      ...   ...             ...
Var.N     aN1=0    ...   aNs=1    ...   aNM=0           aNs=1

(a_js), j = 1, 2, ..., N: (0,1) score vector of the sth procedure for each evaluation index

M is equal to 5 in this experimentation (s = 1, 2, 3, 4, 5)

(*) According to: a) the typology (categorical, etc.) of variables (the number of variables for each typology is denoted by N); b) the type of impact: on univariate distributions or on the relationship between variables



The vector of (0,1) scores extracted from the S matrix (for each procedure and for each evaluation indicator) is reduced to a scalar as the sum of its elements

This sum is divided by the number of vector elements to obtain a ranking index R whose range is [0, 1]

$$R^{G}_{s} = \frac{\sum_{j=1}^{N} a_{js}}{N} = \frac{a_{1s} + a_{2s} + \dots + a_{Ns}}{N}$$
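The ranking index is the mean of one column of the score matrix. A short sketch with a three-variable score matrix written out by hand; the dict layout and the function name are assumptions made for the example.

```python
def ranking_index(scores, procedure):
    """Share of variables on which the procedure attains the row minimum."""
    column = [row[procedure] for row in scores.values()]
    return sum(column) / len(column)


scores = {
    "Var.1": {"A": 0, "B": 0, "C": 0, "D": 1, "E": 0},
    "Var.2": {"A": 1, "B": 0, "C": 0, "D": 0, "E": 0},
    "Var.3": {"A": 0, "B": 1, "C": 0, "D": 0, "E": 0},
}
ranking = {proc: ranking_index(scores, proc) for proc in "ABCDE"}
print(ranking)
```

With these scores procedures A, B and D each obtain 1/3, while C and E obtain 0 because they never produce the smallest index value.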



The ranking indicators measure the relative performance of each procedure according to each evaluation index

$$0 \le R^{G}_{s} = \frac{\sum_{j=1}^{N} a_{js}}{N} \le 1$$

R = 0: lowest performance of the sth procedure compared to the other procedures for the generic evaluation index G

R = 1: highest performance of the sth procedure compared to the other procedures for the generic evaluation index G



FROM AN EVALUATION INDICATOR TO A RANKING INDICATOR

Evaluation index   Aim                                                                                   Ranking index
Im_j               Evaluating the imputation impact on marginal distributions (categorical variables)    R_s^Im
Iμ_j               Evaluating the imputation impact on marginal distributions (continuous variables)     R_s^Iμ
Iσ_j               Evaluating the imputation impact on marginal distributions (continuous variables)     R_s^Iσ
VAI^C_j            Evaluating the imputation impact on the association between continuous variables      R_s^VAI_C
VAI^N_j            Evaluating the imputation impact on the association between categorical variables     R_s^VAI_N



EVALUATING THE IMPACT ON MARGINAL DISTRIBUTIONS AND ON SOME DISTRIBUTIVE PARAMETERS

Ranking based on dissimilarities index (categorical variables)

Rank   Procedure   Ranking index
I      C           0.695
II     D           0.626
III    A           0.474
IV     E           0.442
V      B           0.405

Absolute relative variation index (among means)

Rank   Procedure   Ranking index
I      C           0.538
II     E           0.308
III    D           0.077
IV     B           0.077
V      A           0.000

Absolute relative variation index (among standard deviations)

Rank   Procedure   Ranking index
I      C           0.385
II     D           0.385
III    B           0.231
IV     E           0.000
V      A           0.000



EVALUATING THE IMPUTATION IMPACT ON THE VARIABLES ASSOCIATION

Ranking based on Variation Association Index for categorical variables

Rank   Procedure   Ranking indicator
I      B           0.538
II     A           0.308
III    E           0.077
IV     D           0.077
V      C           0.000

Ranking based on Variation Association Index for continuous variables

Rank   Procedure   Ranking indicator
I      B           0.542
II     A           0.126
III    E           0.121
IV     D           0.121
V      C           0.089



CONCLUDING REMARKS

MISSING DATA IMPUTATION IS AN EXTREMELY COMPLEX PROCESS

EACH METHOD SHOWS CRITICAL ASPECTS

IT IS IMPORTANT TO DEVELOP A RECONSTRUCTION STRATEGY CONSIDERING SOME BASIC ASPECTS:

•THE MISSING DATA PATTERN
•THE IMPACT ON THE STATISTICAL DISTRIBUTIONS
•THE IMPACT ON THE ASSOCIATIONS AMONG VARIABLES