Q2008 - ROME, 09-11 JULY 2008
Implementation and evaluation of imputation strategies to improve the data accuracy
The case of Italian students data from the Programme for International Student Assessment (PISA 2003)
Claudio Quintano, Rosalia Castellano, Sergio Longobardi
University of Naples “Parthenope”
OUTLINES
IMPROVING THE ACCURACY OF ITALIAN DATA FROM THE OECD’s “Programme for International Student Assessment” (PISA 2003) BY DEVELOPING IMPUTATION STRATEGIES TO REDUCE THE NON-SAMPLING ERROR OF PARTIAL NON-RESPONSES
PISA 2003
The OECD’s PISA (“Programme for International Student Assessment”) survey is an internationally standardised assessment administered to 15-year-old students
The survey involves:
• 276,165 students (11,639 in Italy)
• 10,274 schools (406 in Italy)
• 41 countries (20 European Union members)
PISA 2003
The survey assesses students’ competencies in three areas:
• Reading literacy
• Mathematical literacy
• Scientific literacy
AVAILABLE DATA
The OECD collects data on:
• FAMILY ENVIRONMENT OF STUDENT (STUDENT DATASET)
• SCHOOL CHARACTERISTICS (SCHOOL DATASET)
[Figure: bar chart of percentages by country, vertical axis from 0% to 100%]
Multilevel (school and student) model with 4 covariates
ITALY: EXCLUDED STUDENT UNITS (8%) AS ONE OR MORE STUDENT OR SCHOOL VARIABLES ARE MISSING
[Figure: bar chart of percentages by country (three-letter country codes), vertical axis from 0% to 100%]
Multilevel (school and student) model with 29 covariates
ITALY: EXCLUDED STUDENT UNITS (81%) AS ONE OR MORE STUDENT OR SCHOOL VARIABLES ARE MISSING
STEPS OF ANALYSIS
Missing data pattern
Imputation strategies
Evaluation of results
TWO SUBSETS OF VARIABLES
OECD’S PISA DATASET
• COLLECTED VARIABLES: data collected through the student and school questionnaires
• DERIVED VARIABLES: computed from the collected variables (by linear combination or factor analysis). They extend the analytical potential of the survey
EXAMPLE OF DERIVED VARIABLES
The PISA 2003 index of confidence in ICT internet tasks is derived from students’ responses to the five items. All items are inverted for IRT scaling and positive values on this index indicate high self-confidence in ICT internet tasks
The PISA 2003 index of school size (SCHLSIZE) is derived from summing school principals’ responses to the number of girls and boys at a school
The PISA 2003 index of availability of computers (RATCOMP) is derived from school principals’ responses to the items measuring the availability of computers. It is calculated by dividing the number of computers at school by the number of students at school
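The two school-level indices above are plain arithmetic on the principals' answers, so a derived variable inherits any gap in the collected variables it is built from. A minimal sketch, assuming hypothetical function and argument names (they are not PISA identifiers) and assuming that a missing input makes the derived index missing:

```python
from typing import Optional

def school_size(n_boys: Optional[int], n_girls: Optional[int]) -> Optional[int]:
    """SCHLSIZE-style index: boys plus girls; missing if either count is missing."""
    if n_boys is None or n_girls is None:
        return None
    return n_boys + n_girls

def computer_ratio(n_computers: Optional[int], size: Optional[int]) -> Optional[float]:
    """RATCOMP-style index: computers per student; missing if it cannot be computed."""
    if n_computers is None or size is None or size == 0:
        return None
    return n_computers / size

# Illustrative values only, not survey data.
size = school_size(310, 290)
ratio = computer_ratio(60, size)
# A missing enrolment count propagates to both derived indices.
missing_ratio = computer_ratio(60, school_size(None, 290))
```

Under this assumption, imputing the collected variables first and recomputing the derived ones afterwards keeps the two subsets consistent.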
“COLLECTED” AND “DERIVED” VARIABLES
Distribution of variables classified as “collected” and “derived” in the school and student datasets of PISA 2003
                         SCHOOL DATASET   STUDENT DATASET
“COLLECTED” VARIABLES    154              215
“DERIVED” VARIABLES      30               109
TOTAL                    184              324
VARIABLE TYPOLOGY
STUDENTS’ DATASET
TOTAL OF COLLECTED VARIABLES AT STUDENT LEVEL (n. 215) 100%
• CATEGORICAL VARIABLES (n. 197) 91.7%
• CONTINUOUS VARIABLES (n. 15) 6.9%
• DISCRETE VARIABLES (n. 3) 1.4%
VARIABLES WITH > 5% OF MISSING (n. 39) 18%
VARIABLES WITHOUT MISSING (n. 3) 1.5%
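Counts such as "variables with more than 5% of missing" come from a per-variable share of missing values. A minimal sketch of that tabulation, assuming records stored as dictionaries with `None` marking a missing answer; the variable names and the toy records are illustrative, not PISA data:

```python
def missing_share(records, variable):
    """Share of records in which `variable` is missing (None)."""
    return sum(1 for r in records if r.get(variable) is None) / len(records)

def variables_above(records, variables, threshold=0.05):
    """Names of the variables whose share of missing values exceeds `threshold`."""
    return [v for v in variables if missing_share(records, v) > threshold]

# Toy records: q1 is complete, q2 and q3 have gaps.
students = [
    {"q1": 1, "q2": None, "q3": 3},
    {"q1": 2, "q2": None, "q3": None},
    {"q1": 1, "q2": 4, "q3": 2},
    {"q1": 3, "q2": 1, "q3": 5},
]
flagged = variables_above(students, ["q1", "q2", "q3"])
```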
FIVE IMPUTATION PROCEDURES
PROCEDURE A: Iterative and sequential multiple regression applied to each section of the student questionnaire
PROCEDURE B: Iterative and sequential multiple regression applied to imputation classes computed by a regression tree
PROCEDURE C: Random selection of donors within imputation classes computed by a regression tree
PROCEDURE D: Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire
PROCEDURE E: Iterative and sequential multiple regression applied to the whole dataset
USUAL ASSOCIATIONS AND ANTINOMIES OF ADOPTED IMPUTATION PROCEDURES (A-E)
• ALL PROCEDURES BELONG TO WELL-KNOWN CATEGORIES
• TWO CATEGORIES ARE INVOLVED: REGRESSION METHODS (A,B,E) AND DONOR METHODS (C,D)
• DIMENSION OF TREATED DATA MATRIX. THE IMPUTATION PROCEDURE IS (A,D) / IS NOT (B,C,E) APPLIED TO EACH SECTION OF THE QUESTIONNAIRE
• TWO DATA MATRIX SIDES ARE INVOLVED: UNITS (Classification And Regression Tree B,C,D) AND VARIABLES (A,D)
• MISSING DATA MECHANISM IS (A,E) / IS NOT CONSIDERED (B,C,D)
PROCEDURE A
Iterative and sequential multiple regression (Raghunathan et al. 2001) on each section of the student questionnaire
The data matrix is partitioned into the seven sections of the student questionnaire
The features of each section, as a partition of the data matrix:
• Strong logical links between the questions
• Homogeneous structure of association and relationship
• Homogeneous presence of missing data
PROCEDURE A
Subset  Section of the questionnaire       Categorical  Continuous  Discrete   Variables  Average of missing data
                                           variables    variables   variables             for each section
1       Family context (Sect. B)           39           0           0          39         107
2       Educational level (Sect. C)        14           6           1          21         508
3       School context (Sect. D)           18           0           0          18         117
4       Learning mathematics (Sect. E)     40           6           0          46         237
5       Mathematics classes (Sect. F)      21           2           1          24         212
6       ICT confidence (Sect. ICT)         49           0           0          49         713
7       Educational career (Sect. EC)      6            0           1          7          242
PROCEDURE B
Iterative and sequential multiple regression applied to imputation classes computed by a regression tree
STEP I: UNITS CLASSIFICATION
Computation of regression tree (14 terminal nodes)
DEPENDENT VARIABLE: Number of missing data for each student
PREDICTORS: Selected from five categories of derived indicators θ:
• Family background
• Scholastic context
• Approach to study
• Attitudes toward ICT instruments
• Performance scores
STEP II: IMPUTATION
Each terminal node of the tree is considered as an imputation class
Within each class, the missing values are imputed by an iterative and sequential regression model (Raghunathan et al. 2001)
PROCEDURE C
Random selection of donors within imputation classes computed by a regression tree
STEP I: UNITS CLASSIFICATION
Computation of regression tree (14 terminal nodes)
DEPENDENT VARIABLE: Number of missing data for each student
PREDICTORS: Selected from five categories of derived indicators θ:
• Family background
• Scholastic context
• Approach to study
• Attitudes toward ICT instruments
• Performance scores
STEP II: IMPUTATION
A different donor is selected to impute each missing value of each student
The donor is selected randomly from the same node
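The imputation step of the donor-based procedure can be sketched as a random hot-deck within classes. This is a minimal illustration, not the authors' implementation: it assumes each record already carries the terminal node it fell into (the hypothetical key `node`), that `None` marks a missing value, and that the donor is redrawn independently for every missing value, so two draws may still pick the same donor by chance.

```python
import random

def hot_deck(records, variables, class_key, rng):
    """Impute each missing value (None) from a random donor in the same imputation class."""
    completed = [dict(r) for r in records]  # leave the input untouched
    for var in variables:
        # Donor pools: observed values of `var`, grouped by imputation class.
        pools = {}
        for r in records:
            if r[var] is not None:
                pools.setdefault(r[class_key], []).append(r[var])
        for r in completed:
            if r[var] is None and pools.get(r[class_key]):
                # A fresh draw per missing value, so one student can have several donors.
                r[var] = rng.choice(pools[r[class_key]])
    return completed

# Toy data: two terminal nodes, illustrative variable names.
students = [
    {"node": 1, "q1": "a", "q2": 3},
    {"node": 1, "q1": None, "q2": 4},
    {"node": 1, "q1": "a", "q2": None},
    {"node": 2, "q1": "b", "q2": 9},
    {"node": 2, "q1": None, "q2": None},
]
completed = hot_deck(students, ["q1", "q2"], "node", random.Random(0))
```

Only observed values act as donors here, so an imputed value is never passed on to another student; a class with no observed value for a variable is left unimputed.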
PROCEDURE D
Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire
STEP I: Matrix partition
THE DATA MATRIX IS PARTITIONED INTO THE SEVEN SECTIONS OF THE STUDENT QUESTIONNAIRE
STEP II: Units classification
A REGRESSION TREE IS PRODUCED WITHIN EACH PARTITION OF THE MATRIX (see the next slide)
STEP III: Imputation
WITHIN EACH LEAF, A DIFFERENT DONOR IS SELECTED TO IMPUTE EACH MISSING VALUE OF EACH STUDENT
THE DONOR IS SELECTED RANDOMLY FROM THE SAME NODE
PROCEDURE D
REGRESSION TREES FOR EACH MATRIX PARTITION
DEPENDENT VARIABLE (every section): Number of missing data for each record (student)
SECTION B (12 TERMINAL NODES)
PREDICTORS: ISCO code Mother, Disciplinary climate in maths lessons, Mathematics self-efficacy, Computer facilities at home, Plausible value in problem solving
SECTION C (11 TERMINAL NODES)
PREDICTORS: Expected educational level of student (ISCED), Mathematics anxiety, Mathematics self-concept, ICT: Confidence in routine tasks, Plausible value in math
SECTION E (7 TERMINAL NODES)
PREDICTORS: Home educational resources, Mathematics anxiety, Mathematics self-concept, ICT: Confidence in routine tasks, Plausible value in problem solving
SECTION IC (16 TERMINAL NODES)
PREDICTORS: Index of Socio-Economic and Cultural Status, Mathematics anxiety, Mathematics self-efficacy, Computer facilities at home, Plausible value in problem solving
SECTION D+F+EC (12 TERMINAL NODES)
PREDICTORS: Expected educational level of student (ISCED), Disciplinary climate in maths lessons, Mathematics self-efficacy, Computer facilities at home, Plausible value in problem solving
PROCEDURE E
ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (Raghunathan et al. 2001) ON THE WHOLE DATASET (without any partition of units and variables)
METHODOLOGICAL DETAILS OF THE IMPUTATION PROCEDURES
Classification And Regression Tree
Classification and Regression Tree creates a tree-based classification model. It classifies cases into groups or predicts values of a dependent (target) variable (Y) based on values of independent (predictor) variables (X)
The classification is obtained through the recursive binary partition of the measurement space. The resulting subgroups (NODES), within which the values of the target variable are internally homogeneous, correspond to imputation cells
[Diagram: PARENT NODE, CHILD NODE, TERMINAL NODE; the terminal nodes CREATE IMPUTATION CELLS]
STRUCTURE OF A REGRESSION TREE
Example: A tree T composed of five nodes t_i, i = 1,2,3,4,5
[Tree diagram: root node t1 with child nodes t2 and t3; nodes t4 and t5 on the level below]
Impurity of a node t
\[ R(t) = \frac{1}{N(t)} \sum_{i \in t} \left( y_i - \bar{y}(t) \right)^{2} \]
For any split s of t into t_L and t_R, the best split s* is the one that attains
\[ \max_{s \in S} \left[ R(t) - R(t_L) - R(t_R) \right] \]
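The split search can be sketched for a single numeric predictor. This is a minimal illustration, not the software used by the authors: it takes the impurity R(t) as the mean squared deviation from the node mean and scores each candidate split by R(t) minus the impurities of the two children, as read from the slide. The function names and the toy data are assumptions.

```python
def impurity(ys):
    """R(t): mean squared deviation of the target from the node mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Best threshold c for the split x <= c, and the impurity decrease it achieves."""
    parent = impurity(ys)
    best_threshold, best_decrease = None, float("-inf")
    for c in sorted(set(xs))[:-1]:  # the largest value would leave t_R empty
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        decrease = parent - impurity(left) - impurity(right)
        if decrease > best_decrease:
            best_threshold, best_decrease = c, decrease
    return best_threshold, best_decrease

# Toy node: x is a predictor, y plays the role of the number of missing items per student.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 5, 6, 6]
threshold, decrease = best_split(xs, ys)
```

Common CART implementations weight each child's impurity by its share of the node's cases before subtracting; the sketch keeps the unweighted form so that it mirrors the expression above.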
ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (1/2)
PARTITION OF THE VARIABLES
• Variables without missing data -X-
• Variables with missing data -Y-
STEP 1: The variable with the fewest missing values, Y1, is regressed on the subset of variables without missing data, U = X
STEP 2: Update U by appending Y1. Then the variable with the next fewest missing values, Y2, is regressed on U = (X, Y1), where Y1 has imputed values
STEP 3 … STEP N: Each variable is imputed by using all available variables (complete or imputed)
ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (2/2)
ALL MISSING DATA ARE IMPUTED FOR EACH VARIABLE
NEXT ROUND
THE IMPUTATION PROCESS IS THEN REPEATED, MODIFYING THE PREDICTOR SET TO INCLUDE ALL X AND Y VARIABLES EXCEPT THE ONE USED AS THE DEPENDENT VARIABLE
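The first round of the sequential scheme can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Raghunathan et al. procedure itself: every variable is treated as continuous, each regression is ordinary least squares, and missing values are replaced by the fitted prediction rather than by a random draw from a predictive distribution. The helper names (`solve`, `ols`, `sequential_round`) and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, ys):
    """Least-squares coefficients (intercept first) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

def sequential_round(data, complete_vars, incomplete_vars):
    """Round 1: impute the Y variables in order of increasing missingness."""
    data = {v: list(col) for v, col in data.items()}  # work on a copy
    n = len(next(iter(data.values())))
    u = list(complete_vars)  # predictor set U, initially the complete X variables
    order = sorted(incomplete_vars, key=lambda v: sum(x is None for x in data[v]))
    for target in order:
        observed = [i for i in range(n) if data[target][i] is not None]
        beta = ols([[data[p][i] for p in u] for i in observed],
                   [data[target][i] for i in observed])
        for i in range(n):
            if data[target][i] is None:
                data[target][i] = beta[0] + sum(b * data[p][i] for b, p in zip(beta[1:], u))
        u.append(target)  # the completed variable joins the predictors
    return data

# Toy data: x is complete, y1 has one gap, y2 (equal to x + y1 where observed) has two.
data = {
    "x": [1, 2, 3, 4, 5, 6],
    "y1": [1, 4, None, 16, 25, 36],
    "y2": [2, 6, None, None, 30, 42],
}
completed = sequential_round(data, ["x"], ["y1", "y2"])
```

Later rounds would re-impute each Y variable using all the other variables as predictors; they are left out to keep the sketch short.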
EVALUATION OF IMPUTATION PROCEDURES
IMPACT ON:
• UNIVARIATE DISTRIBUTIONS
• RELATIONSHIP BETWEEN VARIABLES
IMPUTATION EFFECTS ON UNIVARIATE DISTRIBUTIONS
CATEGORICAL VARIABLES
ABSOLUTE RELATIVE SQUARE DISSIMILARITIES INDEX (LETI 1983)
\[ I_{m_j} = \frac{1}{2} \sum_{k=1}^{K} \left( f_{jk} - f^{*}_{jk} \right)^{2} \]
f_{jk} and f*_{jk} are the relative frequencies of category k (k = 1, …, K) of variable j; the asterisk distinguishes the two versions of the data, before and after imputation
N denotes the number of categorical variables (j = 1, …, N)
CONTINUOUS VARIABLES
ABSOLUTE RELATIVE VARIATION INDEX (AMONG MEANS)
\[ I_{\mu_j} = \frac{\left| \bar{y}_j - \bar{y}^{*}_j \right|}{\bar{y}_j} \]
ABSOLUTE RELATIVE VARIATION INDEX (AMONG STANDARD DEVIATIONS)
\[ I_{\sigma_j} = \frac{\left| \sigma_j - \sigma^{*}_j \right|}{\sigma_j} \]
The education survey data have been analysed with multilevel models.
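The three univariate indices can be sketched as small functions. This is a minimal illustration under assumptions: `before` holds the observed values of one variable and `after` the completed variable, the dissimilarity index is taken as half the sum of squared differences between relative frequencies, the pre-imputation statistic is the denominator of the relative variation, and the population standard deviation (`pstdev`) stands in for σ.

```python
from collections import Counter
from statistics import mean, pstdev

def dissimilarity_index(before, after):
    """Half the sum of squared differences between category relative frequencies."""
    fb, fa = Counter(before), Counter(after)
    categories = set(fb) | set(fa)
    return 0.5 * sum((fb[c] / len(before) - fa[c] / len(after)) ** 2 for c in categories)

def relative_variation(before, after, stat):
    """Absolute relative variation of a summary statistic (mean or standard deviation)."""
    reference = stat(before)  # must be non-zero
    return abs(reference - stat(after)) / abs(reference)

# Toy data: one value has been imputed in each variable.
cat_before, cat_after = ["a", "a", "b", "b"], ["a", "a", "b", "b", "b"]
num_before, num_after = [10, 20, 30], [10, 20, 30, 40]

i_m = dissimilarity_index(cat_before, cat_after)
i_mu = relative_variation(num_before, num_after, mean)
i_sigma = relative_variation(num_before, num_after, pstdev)
```

All three indices equal 0 when imputation leaves the distribution untouched, so smaller values indicate a smaller impact.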
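IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (1/2)
Variation Association Index (categorical variables)
Mean difference, for each imputed variable (Yj), between the association pre and post imputation of Yj vs the remaining N-1 categorical variables
\[ VAI^{N}_{j} = \frac{\sum_{h \neq j} \left| Cramer_{jh} - Cramer^{*}_{jh} \right|}{N-1}, \qquad 0 \;(\text{zero impact}) \le VAI^{N}_{j} \le 1 \;(\text{max impact}) \]
\[ Cramer = \left[ \frac{\chi^{2}}{n \cdot \min(r-1,\, c-1)} \right]^{1/2} \]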
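The categorical Variation Association Index can be sketched from its two ingredients, Cramér's V and the mean absolute change across the other variables. This is a minimal illustration under assumptions: variables are lists of equal length stored in dictionaries keyed by hypothetical names, `None` marks a missing value, and the pre-imputation association is computed on the pairs where both variables are observed.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Cramér's V for two categorical variables (each needs at least two categories)."""
    n = len(xs)
    joint, row, col = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    chi2 = sum((joint[(a, b)] - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
               for a in row for b in col)
    return sqrt(chi2 / (n * min(len(row) - 1, len(col) - 1)))

def vai_categorical(before, after, j):
    """Mean absolute change in Cramér's V between variable j and every other variable."""
    others = [h for h in before if h != j]
    total = 0.0
    for h in others:
        # Pre-imputation association: only pairs observed on both variables.
        pairs = [(a, b) for a, b in zip(before[j], before[h])
                 if a is not None and b is not None]
        v_before = cramers_v([a for a, _ in pairs], [b for _, b in pairs])
        total += abs(v_before - cramers_v(after[j], after[h]))
    return total / len(others)

# Toy data: the last value of y is missing and is imputed as "a".
before = {"y": ["a", "a", "b", None], "z": ["x", "x", "y", "y"]}
after = {"y": ["a", "a", "b", "a"], "z": ["x", "x", "y", "y"]}
vai = vai_categorical(before, after, "y")
```

In the toy data the observed pairs are perfectly associated, and the imputed value weakens that link, so the index is well above zero.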
IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (2/2)
Variation Association Index (continuous variables)
Mean difference, for each imputed variable (Yj), between the correlation pre and post imputation of Yj vs the remaining N-1 continuous variables
\[ VAI^{C}_{j} = \frac{\sum_{h \neq j} \left| r_{jh} - r^{*}_{jh} \right|}{2\,(N-1)}, \qquad 0 \;(\text{zero impact}) \le VAI^{C}_{j} \le 1 \;(\text{max impact}) \]
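For continuous variables the same idea uses the linear correlation coefficient, and the divisor 2(N-1) reflects the fact that a correlation can change by at most 2. A minimal sketch under assumptions: variables are lists stored in dictionaries keyed by hypothetical names, `None` marks a missing value, and the pre-imputation correlation uses only the pairs observed on both variables.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def vai_continuous(before, after, j):
    """Sum of absolute changes in correlation with variable j, divided by 2(N-1)."""
    others = [h for h in before if h != j]
    total = 0.0
    for h in others:
        pairs = [(a, b) for a, b in zip(before[j], before[h])
                 if a is not None and b is not None]
        r_before = pearson([a for a, _ in pairs], [b for _, b in pairs])
        total += abs(r_before - pearson(after[j], after[h]))
    return total / (2 * len(others))

# Toy data: the missing value of y is imputed as 0, which breaks a perfect correlation.
before = {"y": [1, 2, 3, None], "x": [1, 2, 3, 4]}
after = {"y": [1, 2, 3, 0], "x": [1, 2, 3, 4]}
vai_c = vai_continuous(before, after, "y")
```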
MATRIX V x P - “VARIABLES x PROCEDURES”
There are five V x P matrices, one for each evaluation index
Example of the structure of a V x P matrix with a generic evaluation index Gj ≡ (Imj or Iμj or Iσj or VAINj or VAICj)*
         Proc.A  Proc.B  Proc.C  Proc.D  Proc.E
Var.1    g1a     g1b     g1c     g1d     g1e
Var.2    g2a     g2b     g2c     g2d     g2e
…        …       …       …       …       …
Var.N    gNa     gNb     gNc     gNd     gNe
(*) According to: a) the typology (categorical, etc.) of the variables (the number of variables of each typology is denoted by N); b) the type of impact, on univariate distributions or on the relationship between variables
SCORES MATRICES
Each of the five matrices VPG (N x 5), whose generic element is gjs, is transformed into a (0,1) score matrix SG (N x 5) with elements ajs:
\[ a_{js} = \begin{cases} 1 & \text{if } g_{js} = \min_{s}\{g_{js}\}, \text{ the minimum value in row } j \\ 0 & \text{otherwise} \end{cases} \qquad \forall j \]
Matrix VPG (N x 5)
         Proc.A  Proc.B  Proc.C  Proc.D  Proc.E
Var.1    0.6     0.4     0.5     0.1     0.7
Var.2    0.3     0.6     0.9     0.5     0.8
…        …       …       …       …       …
Var.N    0.4     0.2     0.8     0.6     0.5
Matrix SG (N x 5)
         Proc.A  Proc.B  Proc.C  Proc.D  Proc.E
Var.1    0       0       0       1       0
Var.2    1       0       0       0       0
…        …       …       …       …       …
Var.N    0       1       0       0       0
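The transformation from an index matrix to a score matrix can be sketched as a row-wise minimum search. This is a minimal illustration using the example values of the slide, with the last entry of the second row read as 0.8; giving a score of 1 to every procedure tied for the row minimum is an assumption, since the slide does not say how ties are handled.

```python
def score_matrix(g):
    """Row by row, mark with 1 the procedure(s) attaining the minimum index value."""
    scores = []
    for row in g:
        best = min(row)
        scores.append([1 if value == best else 0 for value in row])
    return scores

# Rows are variables, columns are procedures A to E.
vp = [
    [0.6, 0.4, 0.5, 0.1, 0.7],
    [0.3, 0.6, 0.9, 0.5, 0.8],
    [0.4, 0.2, 0.8, 0.6, 0.5],
]
s = score_matrix(vp)
```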
BUILDING A RANKING INDICATOR (1/3)
The ranking indicators measure the relative performance of each procedure according to each evaluation index
Generic evaluation index Gj = (Imj or Iμj or Iσj or VAINj or VAICj)*
Extraction of the score column vector from the score matrix S for the s-th procedure
         Proc.A   …   Proc.S   …   Proc.M      Score vector (ajs)
Var.1    a11=0    …   a1s=1    …   a1M=0       a1s=1
Var.2    a21=1    …   a2s=0    …   a2M=0       a2s=0
…        …        …   …        …   …           …
Var.N    aN1=0    …   aNs=1    …   aNM=0       aNs=1
(ajs), j = 1,2,…,N, is the (0,1) score vector of the s-th procedure for each evaluation index
M is equal to 5 in this experiment (s = 1,2,3,4,5)
(*) According to: a) the typology (categorical, etc.) of the variables (the number of variables of each typology is denoted by N); b) the type of impact, on univariate distributions or on the relationship between variables
BUILDING A RANKING INDICATOR (2/3)
The vector of (0,1) scores extracted from the S matrix (for each procedure and for each evaluation indicator) is reduced to a scalar by summing its elements
This sum is divided by the number of vector elements to obtain a ranking index R whose range is [0,1]
\[ R^{G}_{s} = \frac{\sum_{j=1}^{N} a_{js}}{N} = \frac{a_{1s} + a_{2s} + \dots + a_{Ns}}{N} \]
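The ranking index is the column mean of the score matrix, that is, the share of variables on which a procedure attains the minimum index value. A minimal sketch, assuming a three-variable score matrix with procedures A to E in the columns; the values are illustrative and ties in the final ranking keep the original procedure order.

```python
def ranking_index(scores):
    """Column means of a (0,1) score matrix: one ranking index per procedure."""
    n = len(scores)
    return [sum(row[s] for row in scores) / n for s in range(len(scores[0]))]

procedures = ["A", "B", "C", "D", "E"]
scores = [
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
r = dict(zip(procedures, ranking_index(scores)))
# Procedures sorted from the highest to the lowest ranking index.
ranking = sorted(procedures, key=lambda p: r[p], reverse=True)
```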
BUILDING A RANKING INDICATOR (3/3)
The ranking indicators measure the relative performance of each procedure according to each evaluation index
\[ 0 \le R^{G}_{s} = \frac{\sum_{j=1}^{N} a_{js}}{N} \le 1 \]
R = 0: lowest performance of the s-th procedure compared to the other procedures for the generic evaluation index G
R = 1: highest performance of the s-th procedure compared to the other procedures for the generic evaluation index G
FROM AN EVALUATION INDICATOR TO A RANKING INDICATOR
Evaluation index   Aim                                                                                                          Ranking index
Im_j               Evaluating the imputation impact on marginal distributions (categorical variables)                           R_s^Im
Iμ_j               Evaluating the imputation impact on marginal distributions (continuous variables, means)                     R_s^Iμ
Iσ_j               Evaluating the imputation impact on marginal distributions (continuous variables, standard deviations)       R_s^Iσ
VAI^C_j            Evaluating the imputation impact on the association between continuous variables                             R_s^VAI^C
VAI^N_j            Evaluating the imputation impact on the association between categorical variables                            R_s^VAI^N
EVALUATING THE IMPACT ON MARGINAL DISTRIBUTIONS AND ON SOME DISTRIBUTIVE PARAMETERS
Ranking based on dissimilarities index (categorical variables): index Im_j, ranking index R_s^Im
Rank   Proced.   Ranking index
I      C         0.695
II     D         0.626
III    A         0.474
IV     E         0.442
V      B         0.405
Absolute relative variation index (among means): index Iμ_j, ranking index R_s^Iμ
Rank   Proced.   Ranking index
I      C         0.538
II     E         0.308
III    D         0.077
IV     B         0.077
V      A         0.000
Absolute relative variation index (among standard deviations): index Iσ_j, ranking index R_s^Iσ
Rank   Proced.   Ranking index
I      C         0.385
II     D         0.385
III    B         0.231
IV     E         0.000
V      A         0.000
EVALUATING THE IMPUTATION IMPACT ON THE VARIABLES ASSOCIATION
Ranking based on Variation Association Index for categorical variables: index VAI^N_j, ranking index R_s^VAI^N
Rank   Procedure   Ranking indicator
I      B           0.538
II     A           0.308
III    E           0.077
IV     D           0.077
V      C           0.000
Ranking based on Variation Association Index for continuous variables: index VAI^C_j, ranking index R_s^VAI^C
Rank   Procedure   Ranking indicator
I      B           0.542
II     A           0.126
III    E           0.121
IV     D           0.121
V      C           0.089
CONCLUDING REMARKS
MISSING DATA IMPUTATION IS AN EXTREMELY COMPLEX PROCESS
EACH METHOD SHOWS CRITICAL ASPECTS
IT IS IMPORTANT TO DEVELOP A RECONSTRUCTION STRATEGY CONSIDERING SOME BASIC ASPECTS:
• THE MISSING DATA PATTERN
• THE IMPACT ON THE STATISTICAL DISTRIBUTIONS
• THE IMPACT ON THE ASSOCIATIONS AMONG VARIABLES