Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Ch2 Data Preprocessing part3
Dr. Bernard Chen, Ph.D., University of Central Arkansas
Fall 2009
Knowledge Discovery (KDD) Process
Data mining — the core of the knowledge discovery process:

Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Forms of Data Preprocessing
Data Transformation
Data transformation – the data are transformed or consolidated into forms appropriate for mining
Data Transformation
Data transformation can involve the following:
- Smoothing: remove noise from the data, using techniques such as binning, regression, and clustering
- Aggregation
- Generalization
- Normalization
- Attribute construction
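As a quick illustration of the smoothing technique named above, here is a minimal sketch of smoothing by bin means (equal-frequency binning, then replacing each value by its bin's mean); the function name and sample values are my own:

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-frequency bins of bin_size,
    and replace every value by the mean of its bin."""
    vals = sorted(values)
    out = []
    for i in range(0, len(vals), bin_size):
        bin_ = vals[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out

# Example prices, bins of 3:
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
```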
Normalization
- Min-max normalization
- Z-score normalization
- Decimal normalization
Min-max normalization
Min-max normalization: maps a value v of attribute A from the original range [min_A, max_A] to [new_min_A, new_max_A]:

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

    ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
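The min-max formula can be sketched as a small Python helper (function name and default range are my own):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Slide example: income $73,600 in [$12,000, $98,000] mapped to [0.0, 1.0]
print(min_max_normalize(73_600, 12_000, 98_000))  # ≈ 0.716
```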
Z-score normalization
Z-score normalization (μ: mean, σ: standard deviation):
    v' = (v - μ_A) / σ_A

Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to

    (73,600 - 54,000) / 16,000 = 1.225
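The z-score formula is equally short in code (a sketch; the function name is my own):

```python
def z_score_normalize(v, mean, std):
    """Standardize v using the attribute's mean and standard deviation."""
    return (v - mean) / std

# Slide example: income $73,600 with μ = 54,000, σ = 16,000
print(z_score_normalize(73_600, 54_000, 16_000))  # 1.225
```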
Decimal normalization
Normalization by decimal scaling
    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

Ex. Suppose the recorded values of A range from -986 to 917. The maximum absolute value is 986, so j = 3: each value is divided by 1,000 (e.g., -986 normalizes to -0.986).
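A minimal sketch of decimal scaling that finds j by the rule above (function name is my own):

```python
def decimal_scale(values):
    """Divide every value by 10**j, where j is the smallest integer
    such that all scaled absolute values fall below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Slide example: values ranging from -986 to 917
scaled, j = decimal_scale([-986, 917])
print(j, scaled)  # j = 3
```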
Data Reduction
Why data reduction?
- A database/data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long time to run on the complete data set
Data Reduction
Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
Data Reduction

Data reduction strategies:
- Data cube aggregation
- Attribute subset selection
- Dimensionality reduction — e.g., remove unimportant attributes
- Numerosity reduction — e.g., fit data into models
- Discretization and concept hierarchy generation
Data cube aggregation
Data cube aggregation
- Multiple levels of aggregation in data cubes further reduce the size of the data to deal with
- Reference appropriate levels: use the smallest representation that is enough to solve the task
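Rolling quarterly figures up to yearly totals is the classic example of climbing one level of the cube. A minimal sketch (the sales figures are hypothetical, purely for illustration):

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount)
quarterly_sales = [
    ("2008", "Q1", 224), ("2008", "Q2", 408),
    ("2008", "Q3", 350), ("2008", "Q4", 586),
    ("2009", "Q1", 300), ("2009", "Q2", 412),
]

# Aggregate up one level: quarter -> year
yearly_sales = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    yearly_sales[year] += amount

print(dict(yearly_sales))  # totals per year
```

The aggregated table has one row per year instead of four per year — a smaller representation that still answers yearly queries exactly.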
Attribute subset selection / Dimensionality reduction

Feature selection (i.e., attribute subset selection):
- Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
- Reduces the number of attributes in the discovered patterns, making them easier to understand
Attribute subset selection / Dimensionality reduction

Heuristic methods (due to the exponential # of choices):
- Step-wise forward selection
- Step-wise backward elimination
- Combining forward selection and backward elimination
- Decision-tree induction
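Step-wise forward selection, the first heuristic above, can be sketched in a few lines. This assumes some scoring function (e.g., cross-validated accuracy of a model trained on the subset) — the `score` callable and function name here are my own:

```python
def forward_select(attributes, score, k):
    """Greedy step-wise forward selection: start from the empty set and
    repeatedly add the attribute that most improves score(subset),
    stopping after k attributes."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.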
Attribute subset selection / Dimensionality reduction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

Decision-tree induction example: the induced tree tests A4? at the root, branching to A1? and A6?, with leaves labeled Class 1 or Class 2.

=> Reduced attribute set: {A1, A4, A6}
Numerosity reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Major families: histograms, clustering, sampling
Data Reduction Method: Histograms
[Histogram figure: bucket values from 10,000 to 100,000 on the x-axis; counts from 0 to 40 on the y-axis]
Data Reduction Method: Histograms
Divide data into buckets and store the average (or sum) for each bucket

Partitioning rules:
- Equal-width: equal bucket range
- Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
- V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
- MaxDiff: set a bucket boundary between each pair of adjacent values whose difference is among the β−1 largest
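The first two partitioning rules are simple enough to sketch directly (function names are my own; V-optimal and MaxDiff need more machinery and are omitted):

```python
def equal_width_buckets(values, n_buckets):
    """Partition values into n_buckets of equal range; return per-bucket counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # clamp the max value into the last bucket
        i = min(int((v - lo) / width), n_buckets - 1)
        counts[i] += 1
    return counts

def equal_frequency_buckets(values, n_buckets):
    """Partition sorted values into n_buckets holding (roughly) equal counts."""
    vals = sorted(values)
    size, extra = divmod(len(vals), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        end = start + size + (1 if i < extra else 0)
        buckets.append(vals[start:end])
        start = end
    return buckets
```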
Data Reduction Method: Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
- There are many choices of clustering definitions and clustering algorithms
- Cluster analysis will be studied in depth in Chapter 7
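For one-dimensional data, the stored representation named above reduces to two numbers per cluster. A minimal sketch (function name is my own; real clustering of multi-dimensional data would use distances between points):

```python
def cluster_summary(cluster):
    """Summarize a 1-D cluster of numbers by its centroid (mean)
    and diameter (spread between the extreme members)."""
    centroid = sum(cluster) / len(cluster)
    diameter = max(cluster) - min(cluster)
    return centroid, diameter

# A cluster of three nearby values collapses to one (centroid, diameter) pair
print(cluster_summary([2, 4, 6]))
```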
Data Reduction Method: Sampling

Sampling: obtaining a small sample s to represent the whole data set N
- Simple random sample without replacement
- Simple random sample with replacement
- Cluster sample: if the tuples in D are grouped into M mutually disjoint clusters, then a simple random sample of s clusters can be obtained, where s < M
- Stratified sample
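The sampling schemes above map directly onto Python's standard `random` module; a sketch (function names are my own, and the stratified variant assumes the data has already been grouped into strata):

```python
import random

def srswor(data, s):
    """Simple random sample without replacement: s distinct tuples."""
    return random.sample(data, s)

def srswr(data, s):
    """Simple random sample with replacement: tuples may repeat."""
    return [random.choice(data) for _ in range(s)]

def stratified_sample(strata, fraction):
    """Draw the same fraction from each stratum, so small groups
    (e.g., rare customer categories) are still represented."""
    out = []
    for group in strata:
        k = max(1, round(len(group) * fraction))
        out.extend(random.sample(group, k))
    return out
```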
Sampling: with or without Replacement
- SRSWOR (simple random sample without replacement)
- SRSWR (simple random sample with replacement)

[Figure: raw data points drawn with and without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data partitioned into a cluster/stratified sample]