Chauhan_Cleansing.ppt
7/27/2019 Chauhan_Cleansing.ppt
Data Cleansing: Filling Missing Values in Data
Class Presentation
CIS 764
Instructor: Dr. William Hankley
Presented by: Gaurav Chauhan
-
CIS 764-Gaurav Chauhan
Overview
Problems Caused
Methods for retrieving missing values
Predicting values
The average way
The probabilistic way
By leveraging the relational network structure
Conclusions
-
Problems Caused
Missing values in the data cause problems in the following data analysis tasks:
Summarizing variables
Computing new variables
Comparing variables
Combining variables
Analyzing time series
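A minimal Python sketch of how a single missing value can disturb these operations. The rainfall readings are hypothetical, and using NaN to mark the gap is an assumption made for illustration, not something the slides specify.

```python
import math

# Hypothetical readings; NaN marks the missing value (an assumption for this sketch).
rainfall = [30.0, 32.0, float("nan"), 25.0]

# Summarizing: the NaN propagates into the sum and therefore into the mean.
total = sum(rainfall)
mean = total / len(rainfall)

# Computing a new variable: the derived value for the missing entry is also NaN.
doubled = [2 * r for r in rainfall]

# Comparing: any ordering comparison against NaN evaluates to False.
is_wet = rainfall[2] > 28.0

print(total, mean, doubled[2], is_wet)
```

Whether a gap propagates silently like this or raises an error depends on the tool and on how missing values are encoded.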
-
Methods for retrieving missing values
Using the average of the available values for prediction
Using a probabilistic approach for value prediction
Leveraging the relational network structure of the data to predict values
-
Predicting Values: the average way
Year   Rainfall (avg, cm)         Temperature (avg)
1936   30                         60F
1937   32                         66F
1938   N.A. (predicted: 28.5)     62F
1939   25                         64F
1940   23                         69F
1941   30                         59F
1942   N.A. (predicted: 29.0)     60F
1943   28                         59F
1944   22                         65F
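The slide does not state which values are averaged, but both predictions in the table match the mean of the two adjacent years: (32 + 25) / 2 = 28.5 for 1938 and (30 + 28) / 2 = 29.0 for 1942. The sketch below assumes that reading. The function name `fill_with_neighbor_mean` and the use of `None` for a missing entry are illustrative choices, not part of the presentation.

```python
def fill_with_neighbor_mean(values):
    """Replace each None with the mean of its known immediate neighbors."""
    filled = list(values)
    for i, value in enumerate(values):
        if value is None:
            # Collect the previous and next entries when they exist and are known.
            neighbors = [
                values[j]
                for j in (i - 1, i + 1)
                if 0 <= j < len(values) and values[j] is not None
            ]
            if neighbors:  # leave the gap in place if no neighbor is known
                filled[i] = sum(neighbors) / len(neighbors)
    return filled


# Rainfall column from the table, 1936 to 1944; None marks "N.A."
rainfall_cm = [30, 32, None, 25, 23, 30, None, 28, 22]
filled = fill_with_neighbor_mean(rainfall_cm)
print(filled[2], filled[6])
```

Averaging all available values instead would give 190 / 7, roughly 27.1, for both gaps, which does not match the table.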
-
Predicting Values: the probabilistic way
Assume that we have $n$ values and are required to predict the $(n+1)$th value, $v_{n+1}$.
For every $i$ from 1 to $n$, the probability that a data instance has the value $v_i$ is $p(v_i)$.
Each of these probabilities is calculated on the basis of the frequency with which $v_i$ occurs in the data.
$v_{n+1}$ is then picked at random such that $p(v_{n+1} = v_i) > p(v_{n+1} = v_j)$ if $p(v_i) > p(v_j)$.
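One way to satisfy this condition is to draw the missing value from the empirical frequency distribution of the observed values, so that more frequent values are proportionally more likely to be picked. The sketch below assumes that interpretation. The function name `predict_by_frequency` and the sample data are hypothetical.

```python
import random
from collections import Counter


def predict_by_frequency(observed, rng):
    """Draw one value at random, weighted by how often it occurs in `observed`."""
    counts = Counter(observed)
    values = list(counts)
    weights = [counts[v] for v in values]  # proportional to the frequency-based p(v)
    return rng.choices(values, weights=weights, k=1)[0]


# Hypothetical observed values: "A" occurs three times, "B" and "C" once each.
observed = ["A", "A", "A", "B", "C"]
rng = random.Random(0)  # seeded only to make the sketch reproducible

# Repeating the draw many times shows that frequent values are predicted more often.
draws = Counter(predict_by_frequency(observed, rng) for _ in range(10_000))
print(draws)
```

A single prediction can still land on a rare value. The ordering of probabilities only shows up across many draws.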
-
Predicting Values by leveraging the relational network
This technique applies only to relational data
A missing value is predicted as the mode of the peers that fit the relational network and have no missing values
-
Predicting Values
by leveraging the relational network
Example 1
Book A               Book C       Book B
Category A           Category C   Category B

Book A               Book C       Book B
? (Predicted = A)    Category C   Category B
-
Predicting Values
by leveraging the relational network
Example 2
Teacher
Student 1    Student 2           Student 3    Student 4
Age(19)      ? (Predicted 19)    Age(18)      Age(19)
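Example 2 can be sketched in Python as follows, assuming the peers are the students linked to the same teacher. The dictionary layout and the function name `fill_with_peer_mode` are illustrative assumptions, not taken from the slides.

```python
from statistics import mode


def fill_with_peer_mode(peers):
    """Fill each missing value (None) with the mode of the peers' known values."""
    known = [value for value in peers.values() if value is not None]
    if not known:  # nothing to learn from; return the data unchanged
        return dict(peers)
    fill = mode(known)
    return {name: (fill if value is None else value) for name, value in peers.items()}


# Students who share the same teacher; None marks the missing age.
ages = {"Student 1": 19, "Student 2": None, "Student 3": 18, "Student 4": 19}
filled_ages = fill_with_peer_mode(ages)
print(filled_ages["Student 2"])
```

The known ages are 19, 18, and 19, so the mode is 19, matching the slide. When two values tie, `statistics.mode` returns the first one encountered on Python 3.8 and later, and raises `StatisticsError` on earlier versions.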
-
Conclusion
Missing values are harmful when the data is used for analysis, learning, or mining
Various techniques aim to predict missing data, but none has reached 100% accuracy
An average prediction accuracy of 90% for these values is still acceptable
-
Questions, anyone?
"I am shivering not because of nervousness but because of the cold room temperature"
- one nervous student