
    DATA REDUCTION TECHNIQUES

    RANJANA AGARWAL AND A.R. RAO

    Indian Agricultural Statistics Research Institute

    Library Avenue, New Delhi-110 012

    [email protected]

1. Introduction

Data reduction techniques are applied where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets. Data reduction techniques range from simple tabulation and aggregation (computing descriptive statistics) to more sophisticated techniques such as principal component analysis and factor analysis. Here, principal component analysis (PCA) and factor analysis are covered, along with examples and software demonstrations. The commands used in the SPSS and SAS packages for PCA and factor analysis are also given.

    2. Principal Component Analysis

Often, the variables under study are highly correlated and, as such, are effectively saying the same thing. It may be useful to transform the original set of variables into a new set of uncorrelated variables called principal components. These new variables are linear combinations of the original variables and are derived in decreasing order of importance, so that the first principal component accounts for as much of the variation in the original data as possible. In other words, PCA is a linear dimensionality reduction technique which identifies orthogonal directions of maximum variance in the original data and projects the data into a lower-dimensional space formed by a subset of the highest-variance components.

Let x1, x2, ..., xp be the variables under study. Then the first principal component may be defined as

z1 = a11 x1 + a12 x2 + ... + a1p xp

such that the variance of z1 is as large as possible, subject to the condition that

a11² + a12² + ... + a1p² = 1

This constraint is introduced because, without it, Var(z1) could be increased simply by multiplying the a1j's by a constant factor. The second principal component is defined as

z2 = a21 x1 + a22 x2 + ... + a2p xp

such that Var(z2) is as large as possible, next to Var(z1), subject to the constraints that

a21² + a22² + ... + a2p² = 1 and Cov(z1, z2) = 0,

and so on.
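As a minimal sketch of this construction (assuming numpy is available; the data matrix below is illustrative and not taken from the example that follows), the coefficient vectors ai are the eigenvectors of the variance-covariance matrix of the data, and the component variances are the corresponding eigenvalues:

import numpy as np

# Illustrative data matrix: n observations (rows) on p variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ np.array([[1.0, 0.5, 0.2, 0.0],
                                         [0.0, 1.0, 0.4, 0.1],
                                         [0.0, 0.0, 1.0, 0.3],
                                         [0.0, 0.0, 0.0, 1.0]])

S = np.cov(X, rowvar=False)              # sample variance-covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh: eigen decomposition for symmetric matrices
order = np.argsort(eigvals)[::-1]        # sort in decreasing order of variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = (X - X.mean(axis=0)) @ eigvecs       # principal component scores z1, ..., zp

# Each column of eigvecs has unit length (the a1j^2 + ... + apj^2 = 1 constraint),
# the scores are uncorrelated, and Var(zi) equals the i-th eigenvalue.
print(np.round(np.cov(Z, rowvar=False), 6))
print(np.round(eigvals, 6))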

It is quite likely that the first few principal components account for most of the variability in the original data. If so, these few principal components can then replace the initial p variables in subsequent analysis, thus reducing the effective dimensionality of the problem. An analysis of principal components often reveals relationships that were not


previously suspected and thereby allows interpretations that would not ordinarily result. However, principal component analysis is more a means to an end than an end in itself, because it frequently serves as an intermediate step in much larger investigations, reducing the dimensionality of the problem and providing easier interpretation. It is a mathematical technique which does not require the user to specify a statistical model or any assumption about the distribution of the original variates. It may also be mentioned that principal components are artificial variables and often it is not possible to assign a physical meaning to them. Further, since principal component analysis transforms the original set of variables into a new set of uncorrelated variables, it is worth stressing that if the original variables are already uncorrelated, then there is no point in carrying out a principal component analysis.

    Computation of Principal Components and Scores

Consider the following data on average minimum temperature (x1), average relative humidity at 8 hrs. (x2), average relative humidity at 14 hrs. (x3) and total rainfall in cm (x4), pertaining to Raipur district from 1970 to 1986 for the kharif season (21st May to 7th Oct).

S.No.    x1      x2      x3      x4
 1      25.0     86      66     186.49
 2      24.9     84      66     124.34
 3      25.4     77      55      98.79
 4      24.4     82      62     118.88
 5      22.9     79      53      71.88
 6       7.7     86      60     111.96
 7      25.1     82      58      99.74
 8      24.9     83      63     115.20
 9      24.9     82      63     100.16
10      24.9     78      56      62.38
11      24.3     85      67     154.40
12      24.6     79      61     112.71
13      24.3     81      58      79.63
14      24.6     81      61     125.59
15      24.1     85      64      99.87
16      24.5     84      63     143.56
17      24.0     81      61     114.97

Mean    23.56   82.06   61.00   112.97
S.D.     4.13    2.75    3.97    30.06

    The variance-covariance matrix is as follows:

17.02    -4.12     1.54      5.14
-4.12     7.56     8.50     54.82
 1.54     8.50    15.75     92.95
 5.14    54.82    92.95    903.87

The eigenvalues (λi) of the above matrix are obtained and given below in decreasing order, along with the corresponding eigenvectors (ai):


λ1 = 916.902    a1 = (0.006, 0.061, 0.103, 0.993)
λ2 =  18.375    a2 = (0.955, -0.296, 0.011, 0.012)
λ3 =   7.870    a3 = (0.141, 0.485, 0.855, -0.119)
λ4 =   1.056    a4 = (0.260, 0.820, -0.509, 0.001)

    The principal components for this data will be

z1 = 0.006 x1 + 0.061 x2 + 0.103 x3 + 0.993 x4
z2 = 0.955 x1 - 0.296 x2 + 0.011 x3 + 0.012 x4
z3 = 0.141 x1 + 0.485 x2 + 0.855 x3 - 0.119 x4
z4 = 0.260 x1 + 0.820 x2 - 0.509 x3 + 0.001 x4

The variances of the principal components are the eigenvalues, i.e.,

    Var(z1) = 916.902, Var(z2) =18.375, Var (z3) = 7.87, Var(z4) = 1.056

    The total variation explained by original variables is

    Var(x1) + Var(x2) + Var(x3) + Var(x4)

    = 17.02 + 7.56 + 15.75 + 903.87 = 944.20

    The total variation explained by principal components is

λ1 + λ2 + λ3 + λ4 = 916.902 + 18.375 + 7.87 + 1.056 = 944.20

As such, it can be seen that the total variation explained by the principal components is the same as that explained by the original variables. It can also be shown, mathematically as well as empirically, that the principal components are uncorrelated. The proportion of total variation accounted for by the first principal component is

λ1 / (λ1 + λ2 + λ3 + λ4) = 916.902 / 944.203 = 0.97

    Continuing, the first two components account for a proportion

(λ1 + λ2) / (λ1 + λ2 + λ3 + λ4) = 935.277 / 944.203 = 0.99

    of the total variance.
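A quick numerical check of these proportions (a small sketch assuming numpy; the eigenvalues are those reported above):

import numpy as np

eigvals = np.array([916.902, 18.375, 7.870, 1.056])   # eigenvalues of the covariance matrix
total = eigvals.sum()                                  # 944.203, equal to Var(x1)+...+Var(x4)
proportions = np.cumsum(eigvals) / total               # cumulative proportion of variance explained
print(np.round(total, 3))          # 944.203
print(np.round(proportions, 2))    # [0.97 0.99 1.   1.  ]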

Hence, in further analysis, the first or the first two principal components z1 and z2 could replace the four variables while sacrificing only negligible information about the total variation in the system. The scores of the principal components are obtained by substituting the values of the xi's into the equations for the zi's. For the above data, the first two principal component scores for the first observation, i.e. for the year 1970, can be worked out as


    z1= 0.006 x 25.0 + 0.061 x 86 + 0.103 x 66 + 0.993 x 186.49 = 197.380

    z2= 0.955 x 25.0 - 0.296 x 86 + 0.011 x 66 + 0.012 x 186.49 = 1.383

    Similarly for the year 1971

    z1 = 0.006 x 24.9 + 0.061 x 84 + 0.103 x 66 + 0.993 x 124.34 = 135.54

    z2 = 0.955 x 24.9 - 0.296 x 84 + 0.011 x 66 + 0.012 x 124.34 = 1.134

Thus the whole data set with four variables can be converted to a new data set with two principal components.
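The same score calculations can be reproduced programmatically. A minimal sketch (assuming numpy; the coefficients and observations are those given above, so small rounding differences from the full-precision SPSS output later in this note are expected):

import numpy as np

# Eigenvectors rounded to three decimals, as reported above; one column per component.
A = np.array([[0.006,  0.955],
              [0.061, -0.296],
              [0.103,  0.011],
              [0.993,  0.012]])

# Observations for 1970 and 1971: (x1, x2, x3, x4).
X = np.array([[25.0, 86.0, 66.0, 186.49],
              [24.9, 84.0, 66.0, 124.34]])

scores = X @ A          # first two principal component scores for each year
print(np.round(scores, 2))
# [[197.38   1.38]
#  [135.54   1.13]]
# With the full-precision eigenvectors the scores are 197.31, 1.35, ... (cf. the SPSS output below).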

Note: The principal components depend on the units of measurement. For example, if in the above example x1 is measured in °F instead of °C and x4 in mm in place of cm, the data give different principal components when transformed back to the original x's; only in very specific situations are the results the same. The conventional way of getting around this problem is to use standardized variables with unit variances, i.e., to work with the correlation matrix in place of the dispersion matrix. However, the principal components obtained from the original variables and those obtained from the correlation matrix will not be the same, and they may not explain the same proportion of variance in the system. Furthermore, one set of principal components is not a simple function of the other. When the variables are standardized, the resulting variables contribute almost equally to the principal components determined from the correlation matrix. Variables should probably be standardized if they are measured in units with widely differing ranges or if the measurement units are not commensurate. Often the population dispersion matrix or correlation matrix is not available; in such situations, the sample dispersion matrix or correlation matrix can be used.
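A minimal sketch of the two choices (assuming numpy; X stands for any n x p data matrix, for instance the Raipur data above loaded as an array):

import numpy as np

def pca_eig(X, standardize=False):
    """Eigenvalues/eigenvectors of the dispersion or correlation matrix of X."""
    Xc = X - X.mean(axis=0)
    if standardize:
        Xc = Xc / X.std(axis=0, ddof=1)   # unit variances -> correlation-matrix PCA
    M = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

# PCA on the covariance matrix is dominated by the variable with the largest
# variance (rainfall, in the example above), whereas PCA on the correlation
# matrix weights the standardized variables more evenly; the two sets of
# components differ and explain different proportions of variance.
# vals_cov, _ = pca_eig(X)
# vals_corr, _ = pca_eig(X, standardize=True)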

    Applications of Principal Components

The most important use of principal component analysis is the reduction of data. It provides the effective dimensionality of the data. If the first few components account for most of the variation in the original data, then the scores on the first few components can be utilized in subsequent analysis in place of the original variables.

Plotting of data becomes difficult with more than three variables. Through principal component analysis, it is often possible to account for most of the variability in the data by the first two components, and to plot the values of the first two component scores for each individual. Thus, principal component analysis enables us to plot the data in two dimensions. In particular, detection of outliers or clustering of individuals is easier through this technique. Often, the use of principal component analysis reveals groupings of variables which would not be found by other means.

Reduction in dimensionality can also help in analyses where the number of variables is more than the number of observations, for example, in discriminant analysis and regression analysis. In such cases, principal component analysis is helpful in reducing the dimensionality of the data.

Multiple regression can be dangerous if the independent variables are highly correlated. Principal component analysis offers a practical way around this problem: regression analysis can be carried out using principal components as regressors in place of the original variables. This is known as principal component regression.
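A minimal sketch of principal component regression under these assumptions (numpy only; X is an n x p matrix of correlated regressors and y a response vector, both hypothetical here):

import numpy as np

def pc_regression(X, y, k):
    """Regress y on the first k principal components of X (ordinary least squares)."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1][:k]            # keep the k highest-variance components
    Z = Xc @ vecs[:, order]                       # scores on the first k components
    Zd = np.column_stack([np.ones(len(y)), Z])    # add an intercept
    gamma, *_ = np.linalg.lstsq(Zd, y, rcond=None)
    beta = vecs[:, order] @ gamma[1:]             # implied coefficients on the centred X's
    return gamma, beta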


    SPSS Syntax for Principal Component Analysis

matrix.
read x /file='c:\princomp.dat' /field=1 to 80 /size={17,4}.
compute x1=sscp(x).          /* sums of squares and cross-products */
compute x2=csum(x).          /* column sums */
compute x3=nrow(x).          /* number of observations (17) */
compute x4=x2/x3.            /* row vector of means */
compute x5=t(x4)*x4.
compute x6=(x1-(x3*x5))/16.  /* variance-covariance matrix (divisor n-1 = 16) */
call eigen(x6,a,lambda).     /* a: eigenvectors, lambda: eigenvalues */
compute P=x*a.               /* principal component variables (scores) */
print x.
print x1.
print x2.
print x3.
print x4.
print x5.
print x6.
print a.
print lambda.
print P.
end matrix.

    OUTPUT from SPSS

X (data file)
25.00   86.00   66.00   186.49
24.90   84.00   66.00   124.34
25.40   77.00   55.00    98.79
24.40   82.00   62.00   118.88
22.90   79.00   53.00    71.88
 7.70   86.00   60.00   111.96
25.10   82.00   58.00    99.74
24.90   83.00   63.00   115.20
24.90   82.00   63.00   100.16
24.90   78.00   56.00    62.38
24.30   85.00   67.00   154.40
24.60   79.00   61.00   112.71
24.30   81.00   58.00    79.63
24.60   81.00   61.00   125.59
24.10   85.00   64.00    99.87
24.50   84.00   63.00   143.56
24.00   81.00   61.00   114.97


    X6 (Variance covariance matrix)

 17.0200735    -4.1161765     1.5375000      5.1359669
 -4.1161765     7.5588235     8.5000000     54.8254044
  1.5375000     8.5000000    15.7500000     92.9456250
  5.1359669    54.8254044    92.9456250    903.8760243

a (Eigenvectors)
 .0055641909    .9551098955   -.1414004053    .2602691929
 .0607949529   -.2958047790   -.4849262138    .8207618861
 .1029819138    .0114295322   -.8547837625   -.5085359482
 .9928080071    .0115752378    .1191520536    .0010310528

lambda (Eigenvalues)
916.9031188
 18.3755276
  7.8703625
  1.0559124

P (PC variables)
197.3130423     1.3515516   -79.4337264   43.7211605
135.4878783     1.1282492   -85.8550340   41.9895299
108.5660501     3.2549654   -76.1729643   41.9418833
129.5308469     1.1333848   -82.0459166   42.2463857
 81.7513022    -0.0587676   -78.2861300   43.9220603
122.6049095   -16.1031292   -80.7391990   42.1928748
110.1204690     1.5346936   -81.0063321   44.4429836
126.0438724     1.2839677   -83.8948063   42.6849520
111.0512450     1.4056809   -85.2019270   41.8486831
 72.5789053     2.0715808   -81.7803004   42.0864340
165.4921253     0.6187596   -83.5281930   42.1765877
123.1209676     2.1289724   -80.4998024   40.3383283
 90.0898536     0.8336324   -82.8044334   43.3932719
136.0299246     1.6864519   -79.9349764   41.9931320
111.0442461    -0.2377487   -87.4329232   43.5939184
154.2584768     0.9343927   -80.9440201   43.4308469
125.4829651     0.9904569   -81.1155310   41.8260207

SAS Commands for Principal Components Analysis

data test;
input x1 x2 x3 x4;
cards;

;
proc princomp cov;
var x1 x2 x3 x4;
run;


3. Factor Analysis

The essential purpose of factor analysis is to describe, if possible, the covariance relationships among many variables in terms of a few underlying, but unobservable, random quantities called factors. Under the factor model, assuming linearity, each response variate is represented as a linear function of a small number of unobservable common factors and a single latent specific factor. The common factors generate the covariances among the observable responses, while the specific terms contribute only to the variances of their particular responses. Basically, the factor model is motivated by the following argument: suppose the variables can be grouped by their correlations, i.e., all variables within a particular group are highly correlated among themselves but have relatively small correlations with variables in a different group. It is conceivable that each group of variables represents a single underlying construct, or factor, that is responsible for the observed correlations. For example, for an individual, marks in different subjects may be governed by aptitudes (common factors) and individual variations (specific factors), and interest may lie in obtaining scores on the unobservable aptitudes (common factors) from observable data on marks in different subjects.

The Factor Model

Suppose observations are made on p variables x1, x2, ..., xp. The factor analysis model assumes that there are m underlying common factors (m < p) F1, F2, ..., Fm, and that each observed variable is a linear function of these common factors together with a specific factor, i.e.,

xi = li1 F1 + li2 F2 + ... + lim Fm + ei,   i = 1, 2, ..., p,

where lij is the loading of the i-th variable on the j-th common factor and ei is the specific factor for xi. The part of the variance of xi explained by the common factors is called its communality, and the part due to ei is its specific (unique) variance. When the communalities are small,


the specific factors play the dominant role, whereas the major aim of factor analysis is to determine a few important common factors.
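As a small illustration of this structure (a sketch with hypothetical loadings, assuming numpy; under the orthogonal factor model the correlation matrix of unit-variance variables is reproduced by LL' plus the diagonal matrix of specific variances):

import numpy as np

# Hypothetical loadings of p = 4 variables on m = 2 common factors.
L = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.7]])

communality = (L**2).sum(axis=1)      # variance explained by the common factors
psi = 1.0 - communality               # specific (unique) variances for unit-variance variables
R_implied = L @ L.T + np.diag(psi)    # correlation matrix implied by the factor model

print(np.round(communality, 3))
print(np.round(R_implied, 3))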

The number of factors is theoretically given by the rank of the population variance-covariance matrix. In practice, however, the number of common factors retained in the model is increased until a suitable proportion of the total variance is explained. Another convention, frequently encountered in computer packages, is to set m equal to the number of eigenvalues greater than one.

As in principal component analysis, the principal factor method for factor analysis depends on the units of measurement: if the units are changed, the solution changes. However, in this approach the estimated factor loadings for a given factor do not change as the number of factors is increased. In contrast, in the maximum likelihood method the solution does not change if the units of measurement are changed, but it does change if the number of common factors is changed.

Example 2.1: In a consumer-preference study, a random sample of customers was asked to rate several attributes of a new product. The responses on a 5-point semantic differential scale were tabulated, and the attribute correlation matrix constructed from them is given below:

    Attribute Correlation matrix

                            1      2      3      4      5
Taste                1    1.00    .02    .96    .42    .01
Good buy for money   2     .02   1.00    .13    .71    .85
Flavor               3     .96    .13   1.00    .50    .11
Suitable for snack   4     .42    .71    .50   1.00    .79
Provides energy      5     .01    .85    .11    .79   1.00

It is clear from the correlation matrix that variables 1 and 3 and variables 2 and 5 form groups, and that variable 4 is closer to the (2, 5) group than to the (1, 3) group. Observing this, one can expect that the apparent linear relationships between the variables can be explained in terms of, at most, two or three common factors.
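Before looking at the package output, here is a minimal sketch of the principal-component extraction applied to this correlation matrix (assuming numpy; loadings are defined only up to sign, so columns may come out sign-flipped relative to the SAS output below):

import numpy as np

# Attribute correlation matrix from the consumer-preference example.
R = np.array([[1.00, 0.02, 0.96, 0.42, 0.01],
              [0.02, 1.00, 0.13, 0.71, 0.85],
              [0.96, 0.13, 1.00, 0.50, 0.11],
              [0.42, 0.71, 0.50, 1.00, 0.79],
              [0.01, 0.85, 0.11, 0.79, 1.00]])

vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]     # eigenvalues ~ 2.853, 1.806, 0.205, 0.102, 0.034

m = 2                                        # retain the factors with eigenvalues > 1
loadings = vecs[:, :m] * np.sqrt(vals[:m])   # unrotated factor loadings
communality = (loadings**2).sum(axis=1)      # e.g. TASTE: 0.560**2 + 0.816**2 ~ 0.979

print(np.round(loadings, 5))
print(np.round(communality, 6))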

    Solution from SAS package.

    Initial Factor Method: Principal Components

Prior Communality Estimates: ONE
Eigenvalues of the Correlation Matrix: Total = 5   Average = 1


                 1        2        3        4        5
Eigenvalue     2.8531   1.8063   0.2045   0.1024   0.0337
Difference     1.0468   1.6018   0.1021   0.0687
Proportion     0.5706   0.3613   0.0409   0.0205   0.0067
Cumulative     0.5706   0.9319   0.9728   0.9933   1.0000

            FACTOR1    FACTOR2
TASTE       0.55986    0.81610
MONEY       0.77726   -0.52420
FLAVOR      0.64534    0.74795
SNACK       0.93911   -0.10492
ENERGY      0.79821   -0.54323

Variance explained by each factor
FACTOR1     FACTOR2
2.853090    1.806332

    Final Communality Estimates: Total = 4.659423

TASTE       MONEY      FLAVOR     SNACK      ENERGY
0.979461    0.878920   0.975883   0.892928   0.932231

    Residual Correlations with Uniqueness on the Diagonal

          TASTE      MONEY      FLAVOR     SNACK      ENERGY
TASTE    0.02054    0.01264   -0.01170   -0.02015    0.00644
MONEY    0.01264    0.12108    0.02048   -0.07493   -0.05518
FLAVOR  -0.01170    0.02048    0.02412   -0.02757    0.00119
SNACK   -0.02015   -0.07493   -0.02757    0.10707   -0.01660
ENERGY   0.00644   -0.05518    0.00119   -0.01660    0.06777

    Rotation Method: Varimax

Rotated Factor Pattern
            FACTOR1    FACTOR2
TASTE       0.01970    0.98948
MONEY       0.93744   -0.01123
FLAVOR      0.12856    0.97947
SNACK       0.84244    0.42805
ENERGY      0.96539   -0.01563

    Variance explained by each factor

FACTOR1     FACTOR2
2.537396    2.122027

    Final Communality Estimates: Total = 4.659423

TASTE       MONEY      FLAVOR     SNACK      ENERGY
0.979461    0.878920   0.975883   0.892928   0.932231


Initial Factor Method: Maximum Likelihood

Eigenvalues of the Weighted Reduced Correlation Matrix:
Total = 84.5563187   Average = 16.9112637

                 1         2         3         4         5
Eigenvalue     59.7487   24.8076    0.1532   -0.0025   -0.1507
Difference     34.9411   24.6544    0.1557    0.1482
Proportion      0.7066    0.2934    0.0018   -0.0000   -0.0018
Cumulative      0.7066    1.0000    1.0018    1.0018    1.0000

Factor Pattern
            FACTOR1    FACTOR2
TASTE       0.97601   -0.13867
MONEY       0.14984    0.86043
FLAVOR      0.97908   -0.03180
SNACK       0.53501    0.73855
ENERGY      0.14567    0.96257

    Variance explained by each factor

              FACTOR1      FACTOR2
Weighted    59.748704    24.807616
Unweighted   2.241100     2.232582

              TASTE       MONEY      FLAVOR      SNACK      ENERGY
Communality  0.971832    0.762795    0.959603   0.831686    0.947767

Rotation Method: Varimax

Rotated Factor Pattern
            FACTOR1    FACTOR2
TASTE       0.02698    0.98545
MONEY       0.87337    0.00342
FLAVOR      0.13285    0.97054
SNACK       0.81781    0.40357
ENERGY      0.97337   -0.01782

Variance explained by each factor
              FACTOR1      FACTOR2
Weighted    25.790361    58.765959
Unweighted   2.397426     2.076256

              TASTE       MONEY      FLAVOR      SNACK      ENERGY
Communality  0.971832    0.762795    0.959603   0.831686    0.947767

It can be seen that the two-factor model with the factor loadings shown above provides a good fit to the data, as the first two factors explain 93.2% of the total standardized sample


variance, i.e., 100 (λ1 + λ2)/p = 100 (2.8531 + 1.8063)/5 ≈ 93.2, where p is the number of variables. It can also be seen from the results that there is no clear-cut distinction between the factor loadings of the two factors before rotation.

SAS Commands for Factor Analysis

title 'Factor Analysis';
data consumer (type=corr);
_type_='CORR';
input _name_ $ taste money flavour snack energy;
cards;
taste    1.00   .     .     .     .
money     .02  1.00   .     .     .
flavour   .96   .13  1.00   .     .
snack     .42   .71   .50  1.00   .
energy    .01   .85   .11   .79  1.00
;

proc factor data=consumer method=prin nfact=2 rotate=varimax res preplot plot;
var taste money flavour snack energy;
run;

SPSS Steps for Factor Analysis

File -> New -> Data; Define Variables; then Statistics -> Data Reduction -> Factor -> Variables (mark the variables and click the arrow button to enter them in the box) -> Descriptives -> Statistics (univariate descriptives, initial solution), Correlation Matrix (coefficients) -> Extraction -> Method (Principal Components or Maximum Likelihood) -> Extract (Number of factors as 2) -> Display (Unrotated factor solution) -> Continue -> Rotation -> Method (Varimax), Display (loading plots) -> Continue -> Scores -> Display factor score coefficient matrix -> Continue -> OK.

    References and Suggested Reading

Chatfield, C. and Collins, A.J. (1990). Introduction to Multivariate Analysis. Chapman and Hall.

Johnson, R.A. and Wichern, D.W. (1996). Applied Multivariate Statistical Analysis. Prentice-Hall of India Private Limited.