Chauhan_Cleansing.ppt

Transcript of Chauhan_Cleansing.ppt

  • 7/27/2019 Chauhan_Cleansing.ppt

    Data Cleansing: Filling Missing Values in Data

    Class Presentation

    CIS 764

    Instructor: Dr. William Hankley

    Presented by: Gaurav Chauhan

    CIS 764-Gaurav Chauhan

    Overview

    Problems Caused

    Methods for retrieving missing values

    Predicting values

    The average way

    The probabilistic way

    By leveraging the relational network structure

    Conclusions

    Problems Caused

    Missing values in the data cause problems in the following data analysis tasks:

    Summarizing variables

    Computing new variables

    Comparing variables

    Combining variables

    In Time Series Analysis

    Methods for retrieving missing values

    Using the average of the available values for prediction

    Using a probabilistic approach for value prediction

    Leveraging the relational network structure of the data to predict values

    Predicting Values: the average way

    Year   Rainfall (avg, cm)            Temperature (avg)
    1936   30                            60F
    1937   32                            66F
    1938   N.A., Predicted = 28.5 cm     62F
    1939   25                            64F
    1940   23                            69F
    1941   30                            59F
    1942   N.A., Predicted = 29.0 cm     60F
    1943   28                            59F
    1944   22                            65F
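
    The two predicted rainfall values are consistent with averaging the nearest known values on either side of each gap: (32 + 25) / 2 = 28.5 for 1938 and (30 + 28) / 2 = 29.0 for 1942. The slide does not state the exact rule, so the Python sketch below is only one reading of it. The function name fill_by_neighbor_average is hypothetical.

```python
from typing import Optional


def fill_by_neighbor_average(values: list[Optional[float]]) -> list[Optional[float]]:
    """Fill each None with the mean of the nearest known values on either side.

    Assumption: this neighbor-averaging rule is inferred from the table's
    numbers; the slides do not spell it out.
    """
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Search the original data so one filled gap never feeds another.
        before = next((x for x in reversed(values[:i]) if x is not None), None)
        after = next((x for x in values[i + 1:] if x is not None), None)
        known = [x for x in (before, after) if x is not None]
        if known:
            filled[i] = sum(known) / len(known)
    return filled


# Rainfall (cm) for 1936 to 1944, with the two missing years as None.
rainfall = [30, 32, None, 25, 23, 30, None, 28, 22]
print(fill_by_neighbor_average(rainfall))
```

    A gap at the start or end of the series has only one known neighbor, so this sketch falls back to that single value.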

    Predicting Values: the probabilistic way

    Assume that we have n values and are required to predict the (n+1)th value.

    For every i from 1 to n, the probability that a data instance has the value v_i is p(v_i).

    Each of these probabilities is calculated on the basis of the frequency with which v_i occurs in the data.

    The value v_{n+1} is then picked at random such that

    p(v_{n+1} = v_i) > p(v_{n+1} = v_j) if p(v_i) > p(v_j)
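
    One way to satisfy this condition is to draw the missing value at random with weights equal to the observed frequencies. The Python sketch below assumes that reading. The function name predict_by_frequency and the sample data are hypothetical.

```python
import random
from collections import Counter


def predict_by_frequency(observed, rng=random):
    """Draw a value at random, weighted by how often it occurs in `observed`.

    Values that occur more often are proportionally more likely to be picked,
    so p(v_{n+1} = v_i) > p(v_{n+1} = v_j) whenever p(v_i) > p(v_j).
    """
    counts = Counter(observed)
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


# Hypothetical observed values: "A" occurs three times, "B" and "C" once each.
observed = ["A", "A", "A", "B", "C"]
print(predict_by_frequency(observed))
```

    Because the draw is random, repeated calls can return different values. Passing a seeded random.Random instance as rng makes the result reproducible.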

    Predicting Values by leveraging the relational network

    This technique applies only to relational data.

    The value of a missing instance is predicted as the mode of its peers that fit the relational network and have no missing values.

    Predicting Values by leveraging the relational network

    Example 1

    Book A              Book C       Book B
    Category A          Category C   Category B

    Book A              Book C       Book B
    ? (Predicted = A)   Category C   Category B

    Predicting Values by leveraging the relational network

    Example 2

    Teacher

    Student 1   Student 2            Student 3   Student 4
    Age(19)     ? (Predicted = 19)   Age(18)     Age(19)
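
    In Example 2 the peers are the students who share the same teacher, and the missing age is predicted as the mode of their known ages. A minimal Python sketch of that step follows. The function name predict_from_peers and the dictionary layout are hypothetical.

```python
from collections import Counter


def predict_from_peers(peer_values):
    """Return the mode (most frequent value) of the peers' known values.

    Peers whose own value is missing (None) are ignored. If no peer has a
    known value, None is returned.
    """
    known = [v for v in peer_values if v is not None]
    if not known:
        return None
    return Counter(known).most_common(1)[0][0]


# Students linked to the same teacher; Student 2's age is missing.
ages = {"Student 1": 19, "Student 2": None, "Student 3": 18, "Student 4": 19}
peers = [age for name, age in ages.items() if name != "Student 2"]
print(predict_from_peers(peers))  # 19
```

    When two values are equally frequent, Counter.most_common returns the one that appears first among the peers, so a real implementation would need an explicit tie-breaking rule.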

    Conclusion

    Missing values are harmful when the data is used for analysis, learning, or mining purposes.

    Various techniques aim at predicting missing data, but none has reached 100% accuracy.

    An average accuracy of 90% in predicting these values is still acceptable.

    Questions, anyone?

    "I am shivering not because of nervousness but because of the cold room temperature."

    - one nervous student