Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team...
![Page 1: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/1.jpg)
Data Preprocessing
Week 2
![Page 2: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/2.jpg)
Topics
• Data Types
• Data Repositories
• Data Preprocessing
• Present homework assignment #1
![Page 3: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/3.jpg)
Team Homework Assignment #2
• Read pp. 227 – 240, pp. 250 – 250, and pp. 259 – 263 of the textbook.
• Do Examples 5.3, 5.4, 5.8, 5.9, and Exercise 5.5.
• Write an R program to verify your answer for Exercise 5.5. Refer to pp. 453 – 458 of the lab book.
• Explore frequent pattern mining tools and try them on Exercise 5.5.
• Prepare for the results of the homework assignment.
• Due date
– beginning of the lecture on Friday, February 11th.
![Page 4: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/4.jpg)
Team Homework Assignment #3
• Prepare for the one‐page description of your group project topic
• Prepare for presentation using slides
• Due date
– beginning of the lecture on Friday, February 11th.
![Page 5: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/5.jpg)
Figure 1.4 Data mining as a step in the process of knowledge discovery
![Page 6: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/6.jpg)
Why Is Data Preprocessing Important?
• Welcome to the Real World!
• No quality data, no quality mining results!
• Preprocessing is one of the most critical steps in a data mining process
![Page 7: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/7.jpg)
Major Tasks in Data Preprocessing
Figure 2.1 Forms of data preprocessing
![Page 8: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/8.jpg)
Why Is Data Preprocessing Beneficial to Data Mining?
• Less data
– data mining methods can learn faster
• Higher accuracy
– data mining methods can generalize better
• Simple results
– they are easier to understand
• Fewer attributes
– For the next round of data collection, savings can be made by removing redundant and irrelevant features
![Page 9: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/9.jpg)
Data Cleaning
![Page 10: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/10.jpg)
Remarks on Data Cleaning
• “Data cleaning is one of the biggest problems in data warehousing” ‐‐ Ralph Kimball
• “Data cleaning is the number one problem in data warehousing” ‐‐ DCI survey
![Page 11: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/11.jpg)
Why Is Data “Dirty”?
• Incomplete, noisy, and inconsistent data are commonplace properties of large real‐world databases …. (p. 48)
• There are many possible reasons for noisy data …. (p. 48)
![Page 12: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/12.jpg)
Types of Dirty Data and Cleaning Methods
• Missing values
– Fill in missing values
• Noisy data (incorrect values)
– Identify outliers and smooth out noisy data
![Page 13: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/13.jpg)
Methods for Missing Values (1)
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant to fill in the missing value
![Page 14: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/14.jpg)
Methods for Missing Values (2)
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class as the given tuple
• Use the most probable value to fill in the missing value
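The two mean-based strategies above can be sketched in a few lines of Python. This is only an illustration: the attribute names (`income`, `risk`) and all values are hypothetical, and missing entries are assumed to be encoded as `None`.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical data: income (in thousands) with two missing entries.
income = [30, None, 50, 70, None, 90]
risk = ["low", "low", "low", "high", "high", "high"]

overall = fill_with_mean(income)
per_class = fill_with_class_mean(income, risk)
```

The class-conditional version usually gives fill-in values closer to the tuple's neighbours, at the cost of requiring a class label for every tuple.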
![Page 15: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/15.jpg)
Methods for Noisy Data
• Binning
• Regression
• Clustering
![Page 16: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/16.jpg)
Binning
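The slide content for binning did not survive extraction, so the following is a minimal sketch of one common variant: sort the values, partition them into equal-frequency bins, then smooth each bin by its mean or by its nearest boundary. The price values and the choice of three bins are illustrative assumptions, not data taken from the slide.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size = len(data) // n_bins
    bins = [data[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(data[(n_bins - 1) * size:])  # last bin takes any remainder
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative sorted price data (assumed values).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```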
![Page 17: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/17.jpg)
Regression
![Page 18: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/18.jpg)
Clustering
Figure 2.12 A 2‐D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster centroid is marked with a “+”, representing the average point in space for that cluster. Outliers may be detected as values that fall outside of the sets of clusters.
![Page 19: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/19.jpg)
Data Integration
![Page 20: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/20.jpg)
Data Integration
• Schema integration and object matching
– Entity identification problem
• Redundant data (between attributes) occur often when integrating multiple databases
– Redundant attributes may be detected by correlation analysis and the chi‐square method
![Page 21: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/21.jpg)
Schema Integration and Object Matching
• custom_id and cust_number
– Schema conflict
• “H” and “S” for pay_type in one database, and 1 and 2 in another
– Value conflict
• Solutions
– meta data (data about data)
![Page 22: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/22.jpg)
Detecting Redundancy (1)
• If an attribute can be “derived” from another attribute or a set of attributes, it may be redundant
![Page 23: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/23.jpg)
Detecting Redundancy (2)
• Some redundancies can be detected by correlation analysis
– Correlation coefficient for numeric data
– Chi‐square test for categorical data
• These can also be used for data reduction
![Page 24: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/24.jpg)
Chi-square Test
• For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ2 test
• Given the degrees of freedom, the value of χ2 is used to decide correlation based on a significance level
![Page 25: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/25.jpg)
Chi-square Test for Categorical Data
$$\chi^2 = \sum \frac{(\mathit{Observed} - \mathit{Expected})^2}{\mathit{Expected}}$$
$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
$$e_{ij} = \frac{\mathit{count}(A = a_i) \times \mathit{count}(B = b_j)}{N}$$
p. 68
The larger the χ2 value, the more likely the variables are related.
![Page 26: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/26.jpg)
Chi-square Test

| | male | female | Total |
|---|---|---|---|
| fiction | 250 | 200 | 450 |
| non_fiction | 50 | 1000 | 1050 |
| Total | 300 | 1200 | 1500 |

Table 2.2 A 2 X 2 contingency table for the data of Example 2.1. Are gender and preferred_reading correlated?
The χ2 statistic tests the hypothesis that gender and preferred_reading are independent. The test is based on a significance level, with (r ‐ 1) x (c ‐ 1) degrees of freedom.
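As a worked check of the contingency table, the sketch below computes each expected count as (row total × column total) / N and sums (observed − expected)² / expected over all cells. The function name `chi_square` is introduced here only for illustration.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows of observed counts, without the Total row/column.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Observed counts: rows are fiction / non_fiction, columns are male / female.
observed = [[250, 200],
            [50, 1000]]
stat, dof = chi_square(observed)
```

For this table the expected counts are 90, 360, 210, and 840, and the statistic comes out at about 507.9 with 1 degree of freedom. That is far above 10.828, the critical value for 1 degree of freedom at the 0.001 significance level, so the independence hypothesis is rejected.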
![Page 27: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/27.jpg)
Table of Percentage Points of the χ2 Distribution
![Page 28: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/28.jpg)
Correlation Coefficient
$$r_{A,B} = \frac{\sum_{i=1}^{N}(a_i - \bar{A})(b_i - \bar{B})}{N\sigma_A\sigma_B} = \frac{\sum_{i=1}^{N}(a_i b_i) - N\bar{A}\bar{B}}{N\sigma_A\sigma_B}$$
$$-1 \le r_{A,B} \le +1$$
p. 68
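A direct translation of the first form of the formula is sketched below. It assumes σ is the population standard deviation (dividing by N), which is what makes the result fall in [−1, +1] with the N in the denominator; the sample data are made up.

```python
from statistics import fmean, pstdev

def correlation(a, b):
    """Correlation coefficient r_{A,B} using population standard deviations."""
    n = len(a)
    mean_a, mean_b = fmean(a), fmean(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / (n * pstdev(a) * pstdev(b))

# Hypothetical attributes: b rises with a, c falls with a.
a = [1, 2, 3, 4]
b = [2, 4, 6, 8]
c = [8, 6, 4, 2]
r_ab = correlation(a, b)
r_ac = correlation(a, c)
```

A value near +1 or −1 suggests that one of the two attributes may be redundant; a value near 0 indicates no linear relationship.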
![Page 29: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/29.jpg)
http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png
![Page 30: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/30.jpg)
Data Transformation
![Page 31: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/31.jpg)
Data Transformation/Consolidation
• Smoothing √
• Aggregation
• Generalization
• Normalization √
• Attribute construction √
![Page 32: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/32.jpg)
Smoothing
• Remove noise from the data
• Binning, regression, and clustering
![Page 33: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/33.jpg)
Data Normalization
• Min-max normalization
$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$
• z-score normalization
$$v' = \frac{v - \mu_A}{\sigma_A}$$
• Normalization by decimal scaling
$$v' = \frac{v}{10^j}$$
where j is the smallest integer such that Max(|v′|) < 1
![Page 34: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/34.jpg)
Data Normalization
• Suppose that the minimum and maximum values for attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. Do min‐max normalization, z‐score normalization, and decimal scaling for the attribute income
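One way to work this exercise is sketched below. The slide gives only the minimum and maximum of income, so the sample value (73,600), the mean (54,000), and the standard deviation (16,000) are assumptions chosen purely for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """z-score normalization of v."""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """Divide v by 10**j, with j the smallest integer such that max_abs / 10**j < 1."""
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return v / 10 ** j

income = 73_600                                   # assumed sample value
mm = min_max(income, 12_000, 98_000)              # min and max from the slide
z = z_score(income, mean_a=54_000, std_a=16_000)  # assumed mean and std
ds = decimal_scaling(income, max_abs=98_000)      # j works out to 5
```

Under these assumptions the three results are roughly 0.716, 1.225, and 0.736. Only min-max normalization is guaranteed to land in the requested range [0.0, 1.0].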
![Page 35: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/35.jpg)
Attribute Construction
• New attributes are constructed from given attributes and added in order to help improve accuracy and understanding of structure in high‐dimension data
• Example
– Add the attribute area based on the attributes height and width
![Page 36: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/36.jpg)
Data Reduction
![Page 37: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/37.jpg)
Data Reduction
• Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data
![Page 38: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/38.jpg)
Data Reduction
• (Data Cube) Aggregation
• Attribute (Subset) Selection
• Dimensionality Reduction
• Numerosity Reduction
• Data Discretization
• Concept Hierarchy Generation
![Page 39: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/39.jpg)
“The Curse of Dimensionality” (1)
• Size
– The size of a data set yielding the same density of data points in an n‐dimensional space increases exponentially with dimensions
• Radius
– A larger radius is needed to enclose a fraction of the data points in a high‐dimensional space
![Page 40: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/40.jpg)
“The Curse of Dimensionality” (2)
• Distance
– Almost every point is closer to an edge than to another sample point in a high‐dimensional space
• Outlier
– Almost every point is an outlier in a high‐dimensional space
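The "Radius" point can be made concrete with a small calculation. The sketch assumes points spread uniformly in a unit hypercube (an assumption of this example, not stated on the slide): a sub-cube that encloses a fraction f of the points must have edge length f^(1/d) in d dimensions.

```python
def edge_length(fraction, dims):
    """Edge length of the sub-cube holding `fraction` of uniformly spread points."""
    return fraction ** (1.0 / dims)

# Edge needed to capture 10% of the data as dimensionality grows.
edges = {d: edge_length(0.1, d) for d in (1, 2, 10, 100)}
```

In one dimension the edge is 0.1, but in ten dimensions it is already about 0.79 of the full range along every axis, so a "local" neighbourhood is no longer local.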
![Page 41: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/41.jpg)
Data Cube Aggregation
• Summarize (aggregate) data based on dimensions
• The resulting data set is smaller in volume, without loss of information necessary for the analysis task
• Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction
![Page 42: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/42.jpg)
Data Aggregation
Figure 2.13 Sales data for a given branch of AllElectronics for the years 2002 to 2004. On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales
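The quarter-to-year roll-up described in the caption amounts to a group-and-sum over the year dimension. The figures below are hypothetical placeholders, since the numbers in the figure itself did not survive extraction.

```python
from collections import defaultdict

def aggregate_by_year(rows):
    """Sum quarterly sales into annual totals."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

# Hypothetical (year, quarter, sales) tuples.
quarterly_sales = [
    (2002, "Q1", 100), (2002, "Q2", 200), (2002, "Q3", 300), (2002, "Q4", 400),
    (2003, "Q1", 150), (2003, "Q2", 250), (2003, "Q3", 350), (2003, "Q4", 450),
]
annual_sales = aggregate_by_year(quarterly_sales)
```

Eight rows shrink to two, and nothing needed for a per-year analysis is lost.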
![Page 43: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/43.jpg)
Data Cube
• Provide fast access to pre‐computed, summarized data, thereby benefiting on‐line analytical processing as well as data mining
![Page 44: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/44.jpg)
Data Cube - Example
Figure 2.14 A data cube for sales at AllElectronics
![Page 45: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/45.jpg)
Attribute Subset Selection (1)
• Attribute selection can help in the phases of the data mining (knowledge discovery) process
– By attribute selection,
• we can improve data mining performance (speed of learning, predictive accuracy, or simplicity of rules)
• we can visualize the data for model selection
• we reduce dimensionality and remove noise.
![Page 46: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/46.jpg)
Attribute Subset Selection (2)
• Attribute (Feature) selection is a search problem
– Search directions
• (Sequential) Forward selection
• (Sequential) Backward selection (elimination)
• Bidirectional selection
• Decision tree algorithm (induction)
![Page 47: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/47.jpg)
Attribute Subset Selection (3)
• Attribute (Feature) selection is a search problem
– Search strategies
• Exhaustive search
• Heuristic search
– Selection criteria
• Statistical significance
• Information gain
• etc.
![Page 48: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/48.jpg)
Attribute Subset Selection (4)
Figure 2.15. Greedy (heuristic) methods for attribute subset selection
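Greedy forward selection, one of the heuristic methods named above, can be sketched as follows. The scoring function is a stand-in: `toy_score`, the `RELEVANCE` values, and the per-attribute `PENALTY` are all hypothetical, and in practice the score would come from a criterion such as information gain or a statistical significance test.

```python
def forward_selection(attributes, score, max_size):
    """Greedily add the attribute that most improves score(subset).

    Stops when no candidate improves the score or max_size is reached.
    """
    selected = []
    best_score = score(selected)
    while len(selected) < max_size:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        best_attr = max(candidates, key=lambda a: score(selected + [a]))
        new_score = score(selected + [best_attr])
        if new_score <= best_score:
            break  # no remaining attribute helps
        selected.append(best_attr)
        best_score = new_score
    return selected

# Hypothetical per-attribute relevance and a fixed cost per attribute kept.
RELEVANCE = {"A1": 0.5, "A2": 0.05, "A3": 0.02, "A4": 0.4, "A5": 0.0, "A6": 0.3}
PENALTY = 0.08

def toy_score(subset):
    return sum(RELEVANCE[a] for a in subset) - PENALTY * len(subset)

chosen = forward_selection(list(RELEVANCE), toy_score, max_size=6)
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal improves the score most.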
![Page 49: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/49.jpg)
Data Discretization
• Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
• Interval labels can then be used to replace actual data values
• Split (top‐down) vs. merge (bottom‐up)
• Discretization can be performed recursively on an attribute
![Page 50: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/50.jpg)
Why Is Discretization Used?
• Reduce data size.
• Transform quantitative data to qualitative data.
![Page 51: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/51.jpg)
Interval Merge by χ2 Analysis
• Merging‐based (bottom‐up)
• Merge: Find the best neighboring intervals and merge them to form larger intervals recursively
• ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
![Page 52: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/52.jpg)
• Initially, each distinct value of a numerical attribute A is considered to be one interval
• χ2 tests are performed for every pair of adjacent intervals
• Adjacent intervals with the least χ2 values are merged together, since low χ2 values for a pair indicate similar class distributions
• This merge process proceeds recursively until a predefined stopping criterion is met
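The merge loop described above can be sketched compactly. This is a simplified illustration rather than Kerber's exact procedure: the stopping criterion is assumed to be a maximum number of intervals (a χ2 threshold is another common choice), cells whose expected count is zero are simply skipped, and the sample data are invented.

```python
def chi2_pair(counts_a, counts_b):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    n = sum(counts_a) + sum(counts_b)
    stat = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * (counts_a[j] + counts_b[j]) / n
            if expected > 0:  # simplification: ignore empty classes
                stat += (observed - expected) ** 2 / expected
    return stat

def chimerge(samples, classes, max_intervals):
    """Merge adjacent intervals bottom-up; return the interval lower bounds."""
    values = sorted({v for v, _ in samples})
    counts = {v: [0] * len(classes) for v in values}
    for v, label in samples:
        counts[v][classes.index(label)] += 1
    # Start with one interval per distinct value.
    intervals = [(v, counts[v]) for v in values]
    while len(intervals) > max_intervals:
        stats = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = stats.index(min(stats))  # most similar adjacent pair
        low, left = intervals[i]
        _, right = intervals[i + 1]
        intervals[i:i + 2] = [(low, [l + r for l, r in zip(left, right)])]
    return [low for low, _ in intervals]

# Hypothetical (value, class) samples with a clean class change near 10.
samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
bounds = chimerge(samples, classes=["a", "b"], max_intervals=2)
```

Pairs with identical class distributions have χ2 = 0 and are merged first, so the only boundary that survives is the one where the class actually changes.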
![Page 53: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/53.jpg)
Entropy-Based Discretization
• The goal of this algorithm is to find the split with the maximum information gain.
• The boundary that minimizes the entropy over all possible boundaries is selected
• The process is recursively applied to partitions obtained until some stopping criterion is met
• Such a boundary may reduce data size and improve classification accuracy
![Page 54: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/54.jpg)
What is Entropy?
• The entropy is a measure of the uncertainty associated with a random variable
• As uncertainty and/or randomness increases for a result set, so does the entropy
• Values range from 0 to 1 (for two classes) to represent the entropy of information
![Page 55: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/55.jpg)
Entropy Example
![Page 56: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/56.jpg)
Entropy Example
![Page 57: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/57.jpg)
Entropy Example (cont’d)
![Page 58: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/58.jpg)
Calculating Entropy
For m classes:
$$\mathit{Entropy}(S) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
For 2 classes:
$$\mathit{Entropy}(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$$
• Calculated based on the class distribution of the samples in set S.
• p_i is the probability of class i in S
• m is the number of classes (class values)
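The m-class formula translates almost directly into code. In this sketch the class probabilities p_i are estimated as relative frequencies of the labels in S; the label lists are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, with p_i taken as relative frequency."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

balanced = entropy(["yes", "no", "yes", "no"])  # two equally likely classes
pure = entropy(["yes", "yes", "yes"])           # a single class
four_way = entropy(["a", "b", "c", "d"])        # four equally likely classes
```

A pure set has entropy 0 and a balanced two-class set has entropy 1. With four equally likely classes the value is 2, which shows that the upper bound is log2(m) rather than 1 when there are more than two classes.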
![Page 59: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/59.jpg)
Calculating Entropy From Split
• Entropy of subsets S1 and S2 are calculated.
• The calculations are weighted by their probability of being in set S and summed.
• In formula below,
– S is the set
– T is the value used to split S into S1 and S2
$$E(S,T) = \frac{|S_1|}{|S|}\mathit{Entropy}(S_1) + \frac{|S_2|}{|S|}\mathit{Entropy}(S_2)$$
![Page 60: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/60.jpg)
Calculating Information Gain
• Information Gain = Difference in entropy between original set (S) and weighted split (S1 + S2)
$$\mathit{Gain}(S,T) = \mathit{Entropy}(S) - E(S,T)$$
$$\mathit{Gain}(S,56) = 0.991076 - 0.766289$$
$$\mathit{Gain}(S,56) = 0.224788$$
compare to
$$\mathit{Gain}(S,46) = 0.091091$$
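Putting the pieces together, the sketch below evaluates Gain(S, T) for every candidate boundary and keeps the best one. The data set behind the slide's 0.224788 figure is not in the transcript, so the samples here are hypothetical. Two further assumptions: candidate boundaries are midpoints between consecutive distinct values, and a value equal to T goes to S1.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(samples, threshold):
    """E(S, T): size-weighted entropy of S1 (value <= T) and S2 (value > T)."""
    s1 = [label for v, label in samples if v <= threshold]
    s2 = [label for v, label in samples if v > threshold]
    n = len(samples)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

def information_gain(samples, threshold):
    """Gain(S, T) = Entropy(S) - E(S, T)."""
    return entropy([label for _, label in samples]) - split_entropy(samples, threshold)

def best_split(samples):
    """Return the candidate boundary with the highest information gain."""
    values = sorted({v for v, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: information_gain(samples, t))

# Hypothetical (value, class) samples.
samples = [(20, "n"), (30, "n"), (40, "n"), (60, "y"), (70, "y"), (80, "y")]
boundary = best_split(samples)
gain = information_gain(samples, boundary)
```

Applying `best_split` again to each resulting partition, until a stopping criterion is met, gives the recursive discretization.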
![Page 61: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/61.jpg)
Numeric Concept Hierarchy
• A concept hierarchy for a given numerical attribute defines a discretization of the attribute
• Recursively reduce the data by collecting and replacing low‐level concepts by higher‐level concepts
![Page 62: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/62.jpg)
A Concept Hierarchy for the Attribute Price
Figure 2.22. A concept hierarchy for the attribute price.
![Page 63: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/63.jpg)
Segmentation by Natural Partitioning
• A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals
– If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi‐width intervals
– If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
– If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
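One level of the 3-4-5 rule can be sketched as follows. The function name and the example ranges are hypothetical, and the sketch assumes the range has already been rounded to multiples of the most-significant-digit unit. For 7 distinct values it uses the 2-3-2 grouping from the usual textbook statement of the rule, which the bullet above abbreviates.

```python
def partition_345(low, high, unit):
    """Split the range low..high into 3, 4, or 5 intervals by the 3-4-5 rule.

    Assumes low and high are already rounded to multiples of `unit`,
    the place value of the most significant digit.
    """
    distinct = (high - low) // unit  # distinct values at the top digit
    if distinct == 7:
        # 2-3-2 grouping: three intervals of 2, 3, and 2 units
        cuts = [low, low + 2 * unit, low + 5 * unit, high]
    else:
        if distinct in (3, 6, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        elif distinct in (1, 5, 10):
            parts = 5
        else:
            raise ValueError(f"rule does not cover {distinct} distinct values")
        width = (high - low) / parts
        cuts = [low + i * width for i in range(parts + 1)]
    # Pair consecutive cut points into (start, end) intervals.
    return list(zip(cuts, cuts[1:]))


# Hypothetical range: -1,000,000 to 2,000,000 with unit 1,000,000
# covers 3 distinct values, so it is split into 3 equal-width intervals.
print(partition_345(-1_000_000, 2_000_000, 1_000_000))
print(partition_345(0, 7000, 1000))  # 7 distinct values: 2-3-2 grouping
```

Applying the same function recursively to each resulting interval, with a smaller unit, yields the lower levels of the hierarchy.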
![Page 64: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/64.jpg)
Figure 2.23. Automatic generation of a concept hierarchy for profit based on 3-4-5 rule.
![Page 65: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/65.jpg)
Concept Hierarchy Generation for Categorical Data
• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes, but not of their partial ordering
![Page 66: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/66.jpg)
Automatic Concept Hierarchy Generation
country 15 distinct values
province or state 365 distinct values
city 3,567 distinct values
street 674,339 distinct values
Based on the number of distinct values per attribute, p. 95
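The distinct-value heuristic amounts to a sort: the attribute with the fewest distinct values goes at the top of the hierarchy and the one with the most at the bottom. The sketch below uses the counts listed above; the dictionary keys and the function name are illustrative choices.

```python
# Distinct-value counts as listed on the slide; key names are illustrative.
distinct_counts = {
    "street": 674_339,
    "country": 15,
    "city": 3_567,
    "province_or_state": 365,
}


def order_hierarchy(counts):
    """Order attributes from the top level (fewest distinct values) down."""
    return sorted(counts, key=counts.get)


levels = order_hierarchy(distinct_counts)
print(levels)
# Print the hierarchy from the lowest level up to the highest.
print(" < ".join(reversed(levels)))
```

The heuristic is not foolproof. A time dimension is the usual counterexample: there are fewer distinct days of the week than months, yet day of week does not belong above month, so the generated ordering may need manual adjustment.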
![Page 67: Data Preprocessing - California State University, Northridgetwang/595DM/Slides/Week2.pdf · Team Homework Assignment #2Team Homework Assignment #2 ... – data mining methods can](https://reader031.fdocuments.net/reader031/viewer/2022020302/5a72caa57f8b9aa7538df5a3/html5/thumbnails/67.jpg)
Data preprocessing
• Data cleaning
– Missing values: Use the most probable value to fill in the missing value (and five other methods)
– Noisy data: Binning; regression; clustering
• Data integration
– Entity ID problem: Metadata
– Redundancy: Correlation analysis (correlation coefficient, chi-square test)
• Data transformation
– Smoothing: Data cleaning
– Aggregation: Data reduction
– Generalization: Data reduction
– Normalization: Min-max; z-score; decimal scaling
– Attribute construction
• Data reduction
– Data cube aggregation: Data cubes store multidimensional aggregated information
– Attribute subset selection: Stepwise forward selection; stepwise backward selection; combination; decision tree induction
– Dimensionality reduction: Discrete wavelet transforms (DWT); principal components analysis (PCA)
– Numerosity reduction: Regression and log-linear models; histograms; clustering; sampling
• Data discretization: Binning; histogram analysis; entropy-based discretization; interval merging by chi-square analysis; cluster analysis; intuitive partitioning
• Concept hierarchy: Concept hierarchy generation