Outline
![Page 1: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/1.jpg)
Outline
- Introduction
- Descriptive Data Summarization
- Data Cleaning
  - Missing values
  - Noisy data
- Data Integration
  - Redundancy
- Data Transformation
![Page 2: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/2.jpg)
Data Cleaning
Importance:
- “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
- “Data cleaning is the number one problem in data warehousing” (DCI survey)
![Page 3: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/3.jpg)
Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
![Page 4: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/4.jpg)
Missing Data
Missing data may be due to:
- equipment malfunction
- inconsistency with other recorded data, leading to deletion
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- no record of the history or changes of the data
It is important to note that a missing value does not always imply an error (for example, an attribute that allows nulls).
![Page 5: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/5.jpg)
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assumes the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
![Page 6: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/6.jpg)
How to Handle Missing Data?
Fill it in automatically with:
- a global constant, e.g., “unknown” (which risks forming a new class!)
- the attribute mean
- the attribute mean for all samples belonging to the same class: smarter
- the most probable value: inference-based, e.g., using a Bayesian formula or a decision tree
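The automatic fill-in strategies above can be sketched in plain Python. This is a minimal sketch with hypothetical records; the class-mean strategy is what the slide calls “smarter”:

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value
rows = [("A", 30.0), ("A", None), ("B", 50.0), ("B", 70.0), ("B", None)]

# Strategy 1: fill with the overall attribute mean
known = [v for _, v in rows if v is not None]
overall = mean(known)                                        # 50.0
mean_fill = [v if v is not None else overall for _, v in rows]

# Strategy 2: fill with the attribute mean of the same class (smarter)
by_class = {}
for c, v in rows:
    if v is not None:
        by_class.setdefault(c, []).append(v)
class_means = {c: mean(vs) for c, vs in by_class.items()}    # A: 30, B: 60
class_fill = [v if v is not None else class_means[c] for c, v in rows]

print(mean_fill)   # [30.0, 50.0, 50.0, 70.0, 50.0]
print(class_fill)  # [30.0, 30.0, 50.0, 70.0, 60.0]
```

Note how the class-conditional fill preserves the difference between the two groups, while the global mean pulls every gap toward 50.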
![Page 7: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/7.jpg)
Outline
- Introduction
- Descriptive Data Summarization
- Data Cleaning
  - Missing values
  - Noisy data
- Data Integration
  - Redundancy
- Data Transformation
![Page 8: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/8.jpg)
Noisy Data
Noise: random error or variance in a measured variable
How to Handle Noisy Data?
- Binning
- Regression
- Clustering
![Page 9: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/9.jpg)
Binning
Binning methods smooth a sorted data value by consulting its “neighborhood”:
- First of all, all the values are sorted
- Then the sorted values are distributed into a number of “buckets”, or “bins”
- Then the values are smoothed by:
  - means (each bin value is replaced by the bin mean), or
  - medians (each bin value is replaced by the bin median), or
  - boundaries (each bin value is replaced by the closest boundary value)
![Page 10: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/10.jpg)
Simple Discretization Methods: Binning
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
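The worked example above can be reproduced with a short sketch (pure Python; bin depth 4 as on the slide):

```python
# Equal-frequency binning and smoothing for the price example above
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means (rounded, as on the slide)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer bin edge
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```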
![Page 11: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/11.jpg)
Regression
[Figure: scatter plot of data points fit by the regression line y = x + 1; a noisy observed value Y1 at X1 is smoothed to the predicted value Y1’ on the line.]
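Smoothing by regression can be sketched as follows: fit a least-squares line to hypothetical data scattered around y = x + 1, then replace each observed value with the line’s prediction (the slide’s Y1 → Y1’ step):

```python
# Least-squares fit of y = a*x + b, used to smooth noisy y values
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical data scattered around y = x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.1, 4.0, 4.8]
a, b = fit_line(xs, ys)

# Each noisy y is replaced by the value the fitted line predicts
smoothed = [a * x + b for x in xs]
print(round(a, 2), round(b, 2))  # 0.93 1.14 -- close to y = x + 1
```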
![Page 12: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/12.jpg)
Cluster Analysis
![Page 13: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/13.jpg)
Outline
- Introduction
- Descriptive Data Summarization
- Data Cleaning
  - Missing values
  - Noisy data
- Data Integration
  - Redundancy
- Data Transformation
![Page 14: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/14.jpg)
Data integration
Data integration: Combines data from multiple sources
into a coherent store
![Page 15: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/15.jpg)
Data integration problems:
- Schema integration: integrate metadata from different sources, e.g., A.cust-id vs. B.cust-#
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ. Possible reasons: different representations, different scales, e.g., metric vs. British units
![Page 16: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/16.jpg)
Redundant data
Redundant data often occur when multiple databases are integrated:
- Object identification: the same attribute or object may have different names in different databases
- Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
![Page 17: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/17.jpg)
Redundant data
Redundant attributes can often be detected by correlation analysis.
Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality.
![Page 18: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/18.jpg)
Pearson’s product moment coefficient
Correlation coefficient (also called Pearson’s product moment coefficient)
r_{A,B} = Σ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σ aᵢbᵢ − n Ā B̄) / ((n − 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ aᵢbᵢ is the sum of the AB cross-product.
![Page 19: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/19.jpg)
Pearson’s product moment coefficient
The correlation coefficient is always between −1 and +1. The closer the correlation is to ±1, the closer the relationship is to a perfect linear one. Here is how I tend to interpret correlations:
- −1.0 to −0.7: strong negative association
- −0.7 to −0.3: weak negative association
- −0.3 to +0.3: little or no association
- +0.3 to +0.7: weak positive association
- +0.7 to +1.0: strong positive association
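The formula above can be checked with a small sketch (plain Python; the sample attributes are hypothetical):

```python
import math

# Pearson correlation using sample (n - 1) standard deviations,
# as in the definition on the previous slide
def pearson(A, B):
    n = len(A)
    mA, mB = sum(A) / n, sum(B) / n
    sA = math.sqrt(sum((a - mA) ** 2 for a in A) / (n - 1))
    sB = math.sqrt(sum((b - mB) ** 2 for b in B) / (n - 1))
    cross = sum((a - mA) * (b - mB) for a, b in zip(A, B))
    return cross / ((n - 1) * sA * sB)

# Hypothetical attributes: B is exactly 2 * A, a perfect linear relationship
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
```

A redundant (derivable) attribute like this would show a correlation near ±1 and could be dropped.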
![Page 20: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/20.jpg)
Chi-Square
χ² (chi-square) test
The larger the χ² value, the more likely the variables are related.
![Page 21: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/21.jpg)
Chi-Square Calculation: An Example
Suppose a group of 1500 people was surveyed. The gender of each person was noted:
- Male: 300
- Female: 1200
We have two attributes: Gender and Prefer-reading.
Chi-Square Calculation: An Example
E11 = count(male) × count(fiction) / N = 300 × 450 / 1500 = 90
E12 = count(male) × count(not fiction) / N = 300 × 1050 / 1500 = 210

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

|                          | Male     | Female     | Sum (row) |
|--------------------------|----------|------------|-----------|
| Like science fiction     | 250 (90) | 200 (360)  | 450       |
| Not like science fiction | 50 (210) | 1000 (840) | 1050      |
| Sum (col.)               | 300      | 1200       | 1500      |

(Expected counts are shown in parentheses.)
![Page 23: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/23.jpg)
Chi-Square Calculation: An Example
For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1.

For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828.

Since our value (507.93) is well above this, we can conclude that Gender and Prefer-reading are (strongly) correlated for the given group of people.
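The full calculation can be verified with a short sketch of the χ² statistic for the table above:

```python
# Observed counts for the 2x2 contingency table above
observed = [[250, 200],    # like science fiction
            [50, 1000]]    # not like science fiction
row = [sum(r) for r in observed]         # [450, 1050]
col = [sum(c) for c in zip(*observed)]   # [300, 1200]
N = sum(row)                             # 1500

# chi^2 = sum over all cells of (observed - expected)^2 / expected,
# with expected = row total * column total / N
chi2 = sum((observed[i][j] - row[i] * col[j] / N) ** 2
           / (row[i] * col[j] / N)
           for i in range(2) for j in range(2))
print(chi2)  # ≈ 507.93, far above the 10.828 threshold
```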
![Page 24: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/24.jpg)
Outline
- Introduction
- Descriptive Data Summarization
- Data Cleaning
  - Missing values
  - Noisy data
- Data Integration
  - Redundancy
- Data Transformation
![Page 25: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/25.jpg)
Data Transformation
Data Transformation can involve the following:
- Smoothing: remove noise from the data, including binning, regression, and clustering
- Aggregation
- Generalization
- Normalization
- Attribute construction
![Page 26: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/26.jpg)
Normalization
![Page 27: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/27.jpg)
Normalization
![Page 28: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/28.jpg)
Normalization
- Min-max normalization
- Z-score normalization
- Normalization by decimal scaling
![Page 29: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/29.jpg)
Min-max normalization
Min-max normalization: maps the original range [min_A, max_A] to [new_min_A, new_max_A]:

v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
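The income example can be checked with a one-function sketch:

```python
# Min-max normalization to [new_lo, new_hi]
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

# Income example: $73,600 in the range [$12,000, $98,000] -> [0.0, 1.0]
print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
```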
![Page 30: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/30.jpg)
Z-score normalization
Z-score normalization (μ: mean, σ: standard deviation):

v′ = (v − μ_A) / σ_A

Ex. Let μ = 54,000 and σ = 16,000. Then

(73,600 − 54,000) / 16,000 = 1.225
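The same income value works out as follows:

```python
# Z-score normalization: distance from the mean in standard deviations
def z_score(v, mu, sigma):
    return (v - mu) / sigma

# Income example: mu = 54,000, sigma = 16,000
print(z_score(73_600, 54_000, 16_000))  # 1.225
```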
![Page 31: Outline](https://reader035.fdocuments.net/reader035/viewer/2022062719/5681306c550346895d964b54/html5/thumbnails/31.jpg)
Decimal normalization
Normalization by decimal scaling:

v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1

Suppose the recorded values of A range from −986 to 917. The maximum absolute value is 986, so j = 3 (each value is divided by 1,000).
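The search for j in the example above can be sketched directly:

```python
# Decimal scaling: divide by the smallest power of 10 that maps
# every value into (-1, 1)
def decimal_scale(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

# Values of A range from -986 to 917; max absolute value is 986
j, scaled = decimal_scale([-986, 917])
print(j)       # 3
print(scaled)  # [-0.986, 0.917]
```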