Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
-
Upload
lester-phillips -
Category
Documents
-
view
213 -
download
0
Transcript of Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
![Page 1: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/1.jpg)
Ch2 Data Preprocessing part2
Dr. Bernard Chen Ph.D.University of Central Arkansas
Fall 2009
![Page 2: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/2.jpg)
Knowledge Discovery (KDD) Process
Data mining—core of knowledge discovery process
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
![Page 3: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/3.jpg)
Forms of Data Preprocessing
![Page 4: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/4.jpg)
Outline
Data Cleaning Missing value Noise data
Data Integration Redundancy
![Page 5: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/5.jpg)
Data Cleaning
Importance “Data cleaning is one of the three
biggest problems in data warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data warehousing”—DCI survey
![Page 6: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/6.jpg)
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy
data
![Page 7: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/7.jpg)
Missing Data
Missing data may be due to equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time
of entry not register history or changes of the data
It is important to note that, a missing value may not always imply an error. (for example, Null-allow attri. )
![Page 8: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/8.jpg)
How to Handle Missing Data?
Ignore the tuple: usually done when class
label is missing (assuming the tasks in
classification—not effective when the
percentage of missing values per attribute
varies considerably.
Fill in the missing value manually: tedious
+ infeasible
![Page 9: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/9.jpg)
How to Handle Missing Data?
Fill in it automatically with a global constant : e.g., “unknown”, a new
class?!
the attribute mean
the attribute mean for all samples belonging to
the same class: smarter
the most probable value: inference-based such
as Bayesian formula or decision tree
![Page 10: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/10.jpg)
Outline
Data Cleaning Missing value Noise data
Data Integration Redundancy
![Page 11: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/11.jpg)
Noisy Data Noise: random error or variance in a
measured variable
Incorrect attribute values may due to faulty data collection instruments data entry problems data transmission problems technology limitation inconsistency in naming convention
![Page 12: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/12.jpg)
How to Handle Noisy Data? Binning
Regression
Clustering
Combined computer and human inspection
![Page 13: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/13.jpg)
Simple Discretization Methods: Binning Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34
* Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
![Page 14: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/14.jpg)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning Divides the range into N intervals of equal size:
uniform grid
if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B –A)/N.
The most straightforward, but outliers may dominate
presentation
Skewed data is not handled well
![Page 15: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/15.jpg)
Simple Discretization Methods: Binning
Smooth by bin means
Smooth by bin medians
Smooth by bin boundaries – Each bin value is replaced by the closest boundary value
![Page 16: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/16.jpg)
Regression
x
y
y = x + 1
X1
Y1
Y1’
![Page 17: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/17.jpg)
Cluster Analysis
![Page 18: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/18.jpg)
Outline
Data Cleaning Missing value Noise data
Data Integration Redundancy
![Page 19: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/19.jpg)
Data integration
Data integration: Combines data from multiple sources
into a coherent store
![Page 20: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/20.jpg)
Data integration problems Schema integration:
e.g., A.cust-id B.cust-# Integrate metadata from different sources
Detecting and resolving data value conflicts For the same real world entity, attribute values
from different sources are different Possible reasons: different representations,
different scales, e.g., metric vs. British units
![Page 21: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/21.jpg)
Redundant data
Redundant data occur often when integration of multiple databases Object identification: The same attribute or
object may have different names in different databases
Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
![Page 22: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/22.jpg)
Redundant data
Redundant attributes may be able to be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
![Page 23: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/23.jpg)
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson’s product moment coefficient)
where n is the number of tuples, and are the respective means of A and B, σA and σB are the
respective standard deviation of A and B, and Σ(AB) is the sum of the AB cross-product.
BABA n
BAnAB
n
BBAAr BA )1(
)(
)1(
))((,
![Page 24: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/24.jpg)
Correlation Analysis (Categorical Data) Χ2 (chi-square) test
The larger the Χ2 value, the more likely the variables are related
![Page 25: Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.](https://reader036.fdocuments.net/reader036/viewer/2022070407/56649e175503460f94b02f50/html5/thumbnails/25.jpg)
Chi-Square Calculation: An Example
e11 = count (male)*count(fiction)/N = 300 * 450 / 1500 =90
93.507840
)8401000(
360
)360200(
210
)21050(
90
)90250( 22222
Play chess
Not play chess
Sum (row)
Like science fiction
250(90)
200(360) 450
Not like science fiction
50(210)
1000(840) 1050
Sum(col.) 300 1200 1500