[3] PCA (Data Reduction)




Slide 1

Data Reduction
• Purpose
– Obtain a reduced representation of the dataset that is much smaller in volume, yet closely maintains the integrity of the original data
• Strategies
– Data cube aggregation: aggregation operations are applied to construct a data cube
– Attribute subset selection: irrelevant, weakly relevant, or redundant attributes are detected and removed
– Data compression: data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data
– Numerosity reduction: data are replaced or estimated by alternative, smaller data representations (e.g. models)
– Discretization and concept hierarchy generation: attribute values are replaced by ranges or higher conceptual levels

Slide 2

Stepwise Selection
• Stepwise Forward (Example)
– Start with an empty reduced set
– The best attribute is selected first and added to the reduced set
– At each subsequent step, the best of the remaining attributes is selected and added to the reduced set (conditioning on the attributes that are already in the set); a minimal code sketch follows Slide 3
• Stepwise Backward (Example)
– Start with the full set of attributes
– At each step, the worst of the attributes in the set is removed
• Combination of Forward and Backward
– At each step, the procedure selects the best attribute and adds it to the set, and removes the worst attribute from the set
– Some attributes were good in initial selections but may no longer be good after other attributes have been included in the set

Slide 3

Decision Tree Induction
• Decision Tree
– A model in the form of a tree structure
• Decision nodes
– Each denotes a test on the corresponding attribute, which is the best attribute for partitioning the data in terms of class distributions at that point
– Each branch corresponds to an outcome of the test
• Leaf nodes
– Each denotes a class prediction
• Can be used for attribute selection (see the second sketch below)
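To make the stepwise-forward procedure on Slide 2 concrete, here is a minimal Python sketch. The slides do not prescribe a scoring criterion; the Iris data, the use of cross-validated decision-tree accuracy as the "best attribute" score, and the stopping rule are all assumptions for illustration.

```python
# Minimal sketch of stepwise forward attribute selection (Slide 2).
# Assumptions (not from the slides): scikit-learn is available, the
# Iris data stands in for the dataset, and cross-validated accuracy
# of a decision tree scores how "good" an attribute is.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

selected = []                          # start with an empty reduced set
remaining = list(range(X.shape[1]))
best_score = -1.0                      # below any attainable accuracy

while remaining:
    # Score each candidate conditioned on the attributes already selected
    scores = {
        a: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [a]], y, cv=5).mean()
        for a in remaining
    }
    best = max(scores, key=scores.get)
    if scores[best] <= best_score:     # stop once the best addition no longer helps
        break
    selected.append(best)
    remaining.remove(best)
    best_score = scores[best]

print("reduced attribute set:", selected)
```

Stepwise backward would run the same loop in reverse, starting from the full set and removing the attribute whose deletion hurts the score least.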

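Slide 3 notes that an induced decision tree can itself drive attribute selection: attributes that never appear as a decision node can be dropped. A minimal sketch, assuming scikit-learn and its impurity-based feature_importances_ (a particular mechanism the slides do not prescribe):

```python
# Sketch: using an induced decision tree to select attributes (Slide 3).
# Attributes that never appear as a test node get zero importance and
# can be dropped. scikit-learn and the Iris data are assumptions here.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Keep only attributes actually used as decision nodes
kept = [i for i, imp in enumerate(tree.feature_importances_) if imp > 0]
print("attributes kept:", kept)
print("importances:", tree.feature_importances_)
```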
Slide 4

Stepwise and Decision Tree Methods for Attribute Selection

Slide 5

Data Compression
• Purpose
– Apply data encoding or transformations to obtain a reduced or "compressed" representation of the original data
• Lossless Compression
– The original data can be reconstructed from the compressed data without any loss of information
– e.g. some well-tuned algorithms for string compression
• Lossy Compression
– Only an approximation of the original data can be reconstructed from the compressed data
– e.g. wavelet transforms and principal component analysis (PCA); a small PCA sketch follows this slide
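Since Slide 5 cites PCA as an example of lossy compression, here is a small numpy-only sketch: project the data onto k principal components, then reconstruct an approximation. The synthetic data and the choice k = 2 are arbitrary assumptions.

```python
# Sketch: PCA as lossy compression (Slide 5). Project the data onto
# k principal components and reconstruct an approximation; the
# discarded variance is exactly what makes the compression lossy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 records, 5 attributes
k = 2                                  # keep 2 principal components

mean = X.mean(axis=0)
Xc = X - mean                          # center the data
# Rows of Vt are the principal directions (right singular vectors)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                    # k x 5

Z = Xc @ components.T                  # compressed representation (100 x 2)
X_approx = Z @ components + mean       # lossy reconstruction (100 x 5)

err = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.3f}")
```

Storing Z plus the k component vectors and the mean replaces the full 100 x 5 matrix; the reconstruction error shrinks as k grows and vanishes only at k = 5, i.e. with no compression.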

Numerosity Reduction
• Purpose
– Reduce data volume by choosing alternative, smaller data representations
• Parametric Methods
– A model is used to fit the data; store only the model parameters, not the original data (except possibly outliers)
– e.g. regression models and log-linear models
• Non-Parametric Methods
– Do not use models to fit the data
– Histograms: use binning to approximate data distributions; a histogram of attribute A partitions the data distribution of A into disjoint subsets, or buckets
– Clustering: use a cluster representation of the data to replace the actual data
– Sampling: represent the original data by a much smaller sample (subset) of the data

Slide 7

Sampling
• Simple Random Sampling without Replacement (SRSWOR)
– Draw s of the N records from dataset D (s < N); a minimal sketch follows this slide
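The transcript breaks off partway through the sampling slide; to make SRSWOR concrete, here is a minimal numpy sketch. The values N = 1000 and s = 10 are arbitrary assumptions for illustration.

```python
# Sketch: simple random sampling without replacement, SRSWOR (Slide 7).
# Each of the N records is equally likely to be drawn, and no record
# can be drawn twice.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
D = np.arange(N)                       # stand-in for a dataset of N records

s = 10                                 # sample size, s < N
sample = rng.choice(D, size=s, replace=False)   # SRSWOR
print("sample:", sample)

# For comparison, sampling WITH replacement (SRSWR) may draw duplicates
sample_wr = rng.choice(D, size=s, replace=True)
```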