Data Preprocessing - baskent.edu.tr

Data Preprocessing BİL477-2021-2022-FALL- INTRODUCTION TO DATA MINING DR. GÖKHAN MEMIŞ



Data Preprocessing: An Overview

• Data Quality: Why Preprocess the Data?

• Major Tasks in Data Preprocessing


Data Quality: Why Preprocess the Data?

Data has quality if it satisfies the requirements of its intended use.

There are many factors comprising data quality.

These include: accuracy, completeness, consistency, timeliness, believability, and interpretability.


Major Tasks in Data Preprocessing


Data Cleaning


• Missing Values

• Noisy Data


How can you go about filling in the missing values?

• Regression analysis

• Mode, median, mean
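The mode, median, and mean options can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `fill_missing`, the use of `None` as the missing-value marker, and the `ages` list are all assumptions made for the example. Regression-based filling is not shown.

```python
from statistics import mean, median, mode

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 30, 25, None, 40]  # hypothetical attribute with two missing values

print(fill_missing(ages, "mean"))
print(fill_missing(ages, "median"))
print(fill_missing(ages, "mode"))
```

The mean is sensitive to extreme values, so the median is often the safer choice for skewed attributes, while the mode is the usual choice for nominal attributes.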


Noisy Data


• Binning Method

    • Equal Frequency Binning: each bin holds the same number of values.

    • Equal Width Binning: each bin has the same width w = (max − min) / (number of bins), so the bin boundaries are min + w, min + 2w, …, min + nw.

• Regression

• Outlier Analysis

• Statistical Methods
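The equal-width rule can be sketched as follows. The function name `equal_width_bins` and the sample prices are assumptions made for the example. The sketch also assumes max > min and places the maximum value in the last bin.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of width w = (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / n_bins                      # assumes hi > lo
    edges = [lo + i * w for i in range(n_bins + 1)]
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum would land on index n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / w), n_bins - 1)
        bins[idx].append(v)
    return edges, bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]     # sample prices in dollars
edges, bins = equal_width_bins(prices, 3)
print(edges)
print(bins)
```

Unlike equal-frequency binning, the bins here can hold different numbers of values: with these prices they hold 2, 3, and 4 values.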


Noisy Data - Binning

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:

Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:

Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:

Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
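The binning example can be reproduced with a short Python sketch. The function names are assumptions made for the example. The sketch assumes the number of values is divisible by the number of bins, and that a value exactly halfway between the two boundaries goes to the lower one (a tie-breaking choice the slides do not specify).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Cut a sorted list into n_bins consecutive bins of equal size."""
    size = len(sorted_values) // n_bins         # assumes an exact division
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```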


Data Integration


The Entity Identification Problem

Redundancy and Correlation Analysis

Tuple Duplication


The Entity Identification Problem

The sources being integrated may include multiple databases and data cubes.

How can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute?

A GUID (globally unique identifier) is created and stored as a 128-bit (16-byte) value. Depending on the variant, it may be generated from the MAC address and the date and time of the system that produces it, together with hardware information of that system. It is usually displayed as 32 hexadecimal digits arranged in fixed-size groups.
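The size and display format can be checked with Python's standard `uuid` module. The slides name no language or library, so this choice is an assumption. `uuid.uuid1()` builds the value from a timestamp and a node ID (often the MAC address), while `uuid.uuid4()` uses random bits.

```python
import uuid

time_based = uuid.uuid1()    # version 1: timestamp plus node ID (often the MAC address)
random_based = uuid.uuid4()  # version 4: random bits

for guid in (time_based, random_based):
    # bytes has 16 entries (128 bits); hex has 32 hexadecimal digits
    print(guid, len(guid.bytes), len(guid.hex))
```

The printed form groups the 32 digits as 8-4-4-4-12, separated by hyphens.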


Redundancy and Correlation Analysis

Redundancy is another important issue in data integration. An attribute (such as annual revenue, for instance) may be redundant if it can be derived from another attribute or set of attributes.

Some redundancies can be detected by correlation analysis.

Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.

For nominal data, we use the χ² (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute's values vary with those of another.
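For numeric attributes, the two measures can be sketched as below. The function names and the two price lists are hypothetical, chosen only for illustration. The sketch uses the population form (dividing by n); a sample-based version would divide by n − 1 instead.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance: the mean product of deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation coefficient: covariance scaled into [-1, 1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

stock_a = [6, 5, 4, 3, 2]        # hypothetical weekly prices
stock_b = [20, 10, 14, 5, 5]     # hypothetical weekly prices

print(covariance(stock_a, stock_b))
print(correlation(stock_a, stock_b))
```

A positive covariance means the two attributes tend to rise together. The correlation coefficient expresses the same relationship on a fixed scale, where values near +1 or −1 suggest that one attribute may be redundant.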


χ² Correlation Test for Nominal Data


For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828. These features are independent.
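The statistic itself is the sum of (observed − expected)² / expected over all cells, where each expected count is (row total × column total) / n. The sketch below applies this to a hypothetical 2 × 2 table; the counts are invented for illustration and are not the table from the slide.

```python
def chi_square(observed):
    """Return the chi-square statistic and degrees of freedom of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

table = [[30, 20],   # hypothetical counts
         [20, 30]]
chi2, dof = chi_square(table)
print(chi2, dof)

CRITICAL_0_001 = 10.828  # threshold for 1 degree of freedom at the 0.001 level
print("reject independence" if chi2 > CRITICAL_0_001 else "cannot reject independence")
```

For these counts every expected value is 25, so the statistic is 4.0, which is below 10.828, and independence cannot be rejected at the 0.001 level.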


χ² Degrees of Freedom Table


Remove duplicate tuples from list
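One way to do this in Python, assuming the tuples are hashable and the original order should be kept, is to pass them through a dictionary, which keeps only the first occurrence of each key. The sample rows are hypothetical.

```python
rows = [("Ann", 30), ("Bob", 25), ("Ann", 30), ("Cid", 41), ("Bob", 25)]  # hypothetical tuples

# dict keys are unique and remember insertion order, so this keeps
# the first occurrence of every tuple and drops later repeats.
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)
```

Using `set(rows)` would also remove duplicates but would not preserve the original order.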


Data Reduction

Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible.

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.


Wavelet Transforms

The most important feature of this method is that signals can be analyzed locally, so that a small region of a large signal can be examined on its own.

This analysis method enables signals to be analyzed in the time domain, so that both low-frequency information over long time intervals and high-frequency information over short time intervals can be identified.

Because of these advantages, the wavelet analysis method is used in the analysis of time series and in a wide variety of fields, from cylinder pressure data of internal combustion engines to data on Parkinson's disease.
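As a rough illustration of how a wavelet transform separates coarse and fine information, the sketch below implements an unnormalized Haar transform, the simplest wavelet. The slides do not say which wavelet family is meant, so the choice of Haar, the function name, and the sample signal are assumptions. Each pass replaces pairs of values with their average (low-frequency part) and half their difference (high-frequency detail).

```python
def haar_transform(signal):
    """Unnormalized Haar wavelet transform of a signal whose length is a power of two."""
    current = [float(v) for v in signal]
    n = len(current)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        differences = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = differences + details   # coarser details go in front
        current = averages
    return current + details              # overall average, then details

signal = [2, 2, 0, 2, 3, 5, 4, 4]          # hypothetical data vector
coefficients = haar_transform(signal)
print(coefficients)
```

Data reduction then comes from storing only the largest coefficients and treating the small ones as zero, which gives an approximation of the original signal.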


Feature selection

Using feature selection techniques has many advantages:

1. Reduced training time.

2. Less complex models, which are easier to interpret.

3. Improved accuracy if the right subset is chosen.

4. Reduced overfitting.


Feature selection using Relief

There are three algorithms in the Relief family:

1. Basic Relief algorithm: limited to classification problems with two classes.

2. ReliefF: an extension of Relief that can deal with multiclass problems.

3. RReliefF: an adaptation of ReliefF for continuous-class (regression) problems.


Basic Relief Algorithm

Pseudocode:

set all weights W[A] := 0.0;
for i := 1 to m do begin
    randomly select an instance Rᵢ;
    find nearest hit H and nearest miss M;
    for A := 1 to a do
        W[A] := W[A] - diff(A,Rᵢ,H)/m + diff(A,Rᵢ,M)/m;
end;
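The pseudocode can be turned into a small Python sketch. Several details are assumptions, since the slides leave them open: `diff` is taken as the absolute difference of a numeric feature divided by that feature's range, the distance between instances is the sum of `diff` over all features, and every class is assumed to have at least two instances so that a nearest hit always exists. The function name and the toy data are hypothetical.

```python
import random

def relief(X, y, m, rng=random):
    """Basic two-class Relief: return one weight per feature."""
    n_features = len(X[0])
    # Feature ranges for normalizing diff; constant features get range 1.
    ranges = [(max(col) - min(col)) or 1.0 for col in zip(*X)]

    def diff(a, r1, r2):
        return abs(r1[a] - r2[a]) / ranges[a]

    def distance(r1, r2):
        return sum(diff(a, r1, r2) for a in range(n_features))

    weights = [0.0] * n_features
    for _ in range(m):
        i = rng.randrange(len(X))                       # randomly select R_i
        others = [j for j in range(len(X)) if j != i]
        hit = min((j for j in others if y[j] == y[i]),  # nearest same-class instance
                  key=lambda j: distance(X[i], X[j]))
        miss = min((j for j in others if y[j] != y[i]), # nearest other-class instance
                   key=lambda j: distance(X[i], X[j]))
        for a in range(n_features):
            weights[a] += (diff(a, X[i], X[miss]) - diff(a, X[i], X[hit])) / m
    return weights

# Toy data: feature 0 determines the class, feature 1 is unrelated to it.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
weights = relief(X, y, m=4)
print(weights)
```

On this toy data every choice of Rᵢ gives the same update, so the result does not depend on the random draws: the relevant feature ends with weight 1.0 and the irrelevant one with −1.0.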


Basic Relief Algorithm


Here, rows 1, 2, ..., 5 are the instances. D is the target class (with two classes, 0 and 1).

A, B, and C are the features.

We will find the weights of the attributes and then select the 2 best features, i.e., the features with the highest weights.

Let m = 2 (i.e., we will perform 2 iterations).

Let all attribute weights be 0 initially: W[A] = W[B] = W[C] = 0.


Correlation based Feature Selection(CFS)


Information Gain


Histograms

The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted:

1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
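A histogram of these prices can be computed with a short Python sketch. The first count uses one bucket per distinct price. The second uses equal-width buckets of $10, a width chosen here as an assumption for illustration.

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# One bucket per distinct price.
singleton_buckets = Counter(prices)

# Equal-width buckets: $1-10, $11-20, $21-30.
width = 10
equal_width = Counter()
for p in prices:
    low = ((p - 1) // width) * width + 1
    equal_width[(low, low + width - 1)] += 1

print(singleton_buckets)
print(dict(equal_width))
```

The equal-width version reduces the 52 values to three counts, which is the data-reduction effect a histogram is used for.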


Data Transformation

In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.

1. Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.

2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts.

4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.

5. Discretization, where the raw values of a numeric attribute (such as age) are replaced by interval labels (e.g., 0-10, 11-20, and so on) or conceptual labels (e.g., youth, adult, and senior).
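Discretization, the last item in the list, can be sketched as follows. The interval scheme follows the labels quoted above (0-10, 11-20, and so on). The age thresholds for youth, adult, and senior are not given in the slides, so the cutoffs 21 and 65 are assumptions, as are the function names. Ages are assumed to be non-negative integers.

```python
def interval_label(age):
    """Map an age to an interval label such as '0-10' or '11-20'."""
    if age <= 10:
        return "0-10"
    low = ((age - 1) // 10) * 10 + 1
    return f"{low}-{low + 9}"

def concept_label(age, adult_from=21, senior_from=65):
    """Map an age to a conceptual label; the two cutoffs are assumed values."""
    if age >= senior_from:
        return "senior"
    if age >= adult_from:
        return "adult"
    return "youth"

for age in (7, 11, 20, 21, 70):
    print(age, interval_label(age), concept_label(age))
```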


Data Transformation by Normalization


Min-Max Normalization

Annual Salary    Normalized Value
89.986
40.849
42.061
17.175
4.229
85.926
56.223
92.268
21.742
1.765            0
3.268            0.02
98.048           1.00
97.382           0.99
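Min-max normalization maps each value v to (v − min) / (max − min), so the smallest value becomes 0 and the largest becomes 1. The sketch below applies this to the salary column. The function name and the optional target range parameters are assumptions made for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)           # assumes hi > lo
    scale = new_max - new_min
    return [(v - lo) / (hi - lo) * scale + new_min for v in values]

salaries = [89.986, 40.849, 42.061, 17.175, 4.229, 85.926, 56.223,
            92.268, 21.742, 1.765, 3.268, 98.048, 97.382]

normalized = min_max_normalize(salaries)
for salary, value in zip(salaries, normalized):
    print(f"{salary:>8}  {value:.2f}")
```

For example, 3.268 maps to (3.268 − 1.765) / (98.048 − 1.765) ≈ 0.02, matching the table.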


Simple Moving Average based Normalization

Year    Sales ($M)    MA
2003    4
2004    6
2005    5             6.4
2006    8
2007    9
2008    5
2009    4
2010    3
2011    7
2012    8

The moving average for 2005 (the center year) is the mean of the sales for the first five years (2003-2007): add the five sales totals and divide by 5, which gives (4 + 6 + 5 + 8 + 9) / 5 = $6.4M.
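The same calculation can be repeated for every year that has two neighbors on each side. The sketch below is a minimal illustration; the function name and the dictionary layout are assumptions made for the example.

```python
sales = {2003: 4, 2004: 6, 2005: 5, 2006: 8, 2007: 9,
         2008: 5, 2009: 4, 2010: 3, 2011: 7, 2012: 8}   # sales in $M

def centered_moving_average(series, window=5):
    """Average each year with its neighbors; years near the edges get no value."""
    years = sorted(series)
    half = window // 2
    result = {}
    for i in range(half, len(years) - half):
        span = years[i - half:i + half + 1]
        result[years[i]] = sum(series[y] for y in span) / window
    return result

moving_average = centered_moving_average(sales)
for year, value in moving_average.items():
    print(year, value)
```

With a window of five years, the first and last two years (2003, 2004, 2011, 2012) have no centered average, so the smoothed series covers 2005 through 2010.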