Overview of DATA PREPROCESS..

11

Transcript of Overview of DATA PREPROCESS..

Page 1: Overview of DATA PREPROCESS..
Page 2: Overview of DATA PREPROCESS..
Page 3: Overview of DATA PREPROCESS..

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Page 4: Overview of DATA PREPROCESS..

Why Data Preprocessing?

Data in the real world is dirty

incomplete: lacking attribute values, lacking certain

attributes of interest, or containing only aggregate data

noisy: containing errors or outliers

inconsistent: containing discrepancies in codes or

names

No quality data, no quality mining results

Quality decisions must be based on quality data

Data warehouse needs consistent integration of quality

data

Page 5: Overview of DATA PREPROCESS..

O Data cleaning tasks

O Fill in missing values

O Identify outliers and smooth out noisy data

O Correct inconsistent data

duplicate records

incomplete data

inconsistent data

Page 6: Overview of DATA PREPROCESS..

O Data integration: O combines data from multiple sources into a coherent store

O Schema integrationO integrate metadata from different sources

O Entity identification problem: identify real world entities from multiple data sources,

O Detecting and resolving data value conflictsO for the same real world entity, attribute values from different

sources are different

O possible reasons: different representations, different scales,

e.g., metric vs. British units

Page 7: Overview of DATA PREPROCESS..

O Smoothing: remove noise from data

O Aggregation: summarization, data cube construction

O Generalization: concept hierarchy climbing

O Normalization: scaled to fall within a small, specified

range

O min-max normalization

O z-score normalization

O normalization by decimal scaling

Page 8: Overview of DATA PREPROCESS..

O Data reduction

O Obtains a reduced representation of the data set

that is much smaller in volume but yet produces

the same (or almost the same) analytical results

O Data reduction strategies

O Data cube aggregation

O Dimensionality reduction

O Numerosity reduction

O Discretization and concept hierarchy generation

Page 9: Overview of DATA PREPROCESS..

O Discretization: divide the range of a continuous attribute into intervals

O Some classification algorithms only accept categorical attributes.

O Reduce data size by discretization

O Prepare for further analysis

O reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

Page 10: Overview of DATA PREPROCESS..

OBinning

OHistogram analysis

OClustering analysis

Page 11: Overview of DATA PREPROCESS..