
Data Preparation for Data Mining

Prepared by: Yuenho Leung

What to Do before Data Preparation

Before the data preparation stage, you should already have:

Learned the domain of your problem

Planned the solutions and approaches you are going to apply

Gathered as much data as possible, including incomplete data

Data Representation Format

The first step of data preparation is to convert the raw data into a rows-and-columns format, from sources such as the following (a small conversion sketch follows the list):

XML

Access

SQL
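For example, here is a minimal sketch that flattens a hypothetical XML file of <customer> records into rows and columns using Python's standard library (the file name and element names are assumptions for illustration):

```python
import csv
import xml.etree.ElementTree as ET

# Hypothetical input: customers.xml containing <customer> elements,
# each with child elements such as <CusID>, <Name>, <City>, <Zip>, <Phone>.
tree = ET.parse("customers.xml")
rows = []
for record in tree.getroot().iter("customer"):
    # Flatten each record into one dict of column -> value.
    rows.append({child.tag: (child.text or "").strip() for child in record})

# Write the flattened records out as a rows-and-columns CSV file.
columns = sorted({col for row in rows for col in row})
with open("customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
```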

Validate Data

To validate data, you need to:

Check each value against its data type.

Check the range of each variable.

Compare the values with other instances (rows).

Check columns by their relationships.
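As a minimal sketch of the first two checks above, using a small illustrative pandas DataFrame (the column names and values are made up for illustration):

```python
import pandas as pd

# Illustrative data; in practice df comes from the rows-and-columns step above.
df = pd.DataFrame({"CusID": ["1", "2", "x"], "Zip": [95758, 95412, 9511200]})

# Check values against the data type: CusID should be an integer.
bad_type = df[pd.to_numeric(df["CusID"], errors="coerce").isna()]

# Check the range of the variable: ZIP codes should have five digits.
bad_range = df[~df["Zip"].between(10000, 99999)]

print(bad_type)
print(bad_range)
```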

Validate Data (cont)

To validate this table, you can check the relationship among the city name, zip code, and area code.

If you get the data from a normalized database, you can skip this step.

CusID  Name  Address        City        Zip    Phone
1      Alan  1800 Bon Ave.  Elk Grove   95758  916-333-4444
2      Tom   600 Bender Rd  Sacramento  95412  916-112-2345
3      Sam   300 Tent St    San Jose    95112  408-345-2134
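A minimal sketch of such a cross-column check, using a small hand-built lookup from city to plausible area codes (the lookup values are illustrative, not an authoritative directory):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alan", "Tom", "Sam"],
    "City": ["Elk Grove", "Sacramento", "San Jose"],
    "Zip": ["95758", "95412", "95112"],
    "Phone": ["916-333-4444", "916-112-2345", "408-345-2134"],
})

# Illustrative lookup: which area codes are plausible for each city.
area_codes = {"Elk Grove": {"916"}, "Sacramento": {"916"}, "San Jose": {"408"}}

# Flag rows whose phone area code disagrees with the city.
df["area_code"] = df["Phone"].str.split("-").str[0]
df["suspect"] = [
    code not in area_codes.get(city, set())
    for city, code in zip(df["City"], df["area_code"])
]
print(df[df["suspect"]])
```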

Validate Data (cont)

From this table, you can tell the third instance is wrong. Why? Because the other instances are all small earthquakes, and no earthquake as large as magnitude 5.10 was recorded on 1975/10/20.

Date        Time      Latitude  Longitude  Magnitude
1975/7/10   00:41:23  37.1811   -122.0521  1.32
1975/9/5    00:41:23  34.1653   -122.2348  1.54
1975/10/20  00:41:23  31.1873   -122.0512  5.10
1975/11/18  00:41:23  57.1845   -122.2148  2.02
1975/12/30  00:41:23  57.2373   -122.0328  0.50
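One simple way to automate "compare the values with other instances" is to flag readings that sit far from the rest of the column. A minimal sketch, where the 1.5-standard-deviation cutoff is an illustrative assumption rather than a rule from the slides:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1975/7/10", "1975/9/5", "1975/10/20", "1975/11/18", "1975/12/30"],
    "Magnitude": [1.32, 1.54, 5.10, 2.02, 0.50],
})

# Flag magnitudes that sit far from the rest of the column.
z = (df["Magnitude"] - df["Magnitude"].mean()) / df["Magnitude"].std()
print(df[z.abs() > 1.5])  # only the 1975/10/20 row (magnitude 5.10) is flagged
```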

Validate Data (cont)

Fixing individual errors from each instance is not the main purpose of data validation.

The main purpose is to find the cause of errors.

If you know the cause of the errors, you might be able to figure out the pattern of the errors and then fix them all globally.

For example,

We want to mine the pattern of wind speed from data generated by 5 sensors. We find that 20% of the speed measurements are obviously wrong, so we check whether the sensors are working normally. If we find that a broken sensor always displays readings 10% higher than the correct readings, we can fix all of its measurements globally by removing that 10% bias.
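A minimal sketch of that kind of global fix, assuming a hypothetical DataFrame with sensor_id and wind_speed columns and that sensor 3 is the broken one. Dividing by 1.10 removes a "+10%" bias exactly, whereas subtracting 10% of the reading would only approximate it:

```python
import pandas as pd

df = pd.DataFrame({
    "sensor_id": [1, 2, 3, 3, 4],
    "wind_speed": [10.2, 9.8, 11.0, 12.1, 10.0],
})

# Sensor 3 (hypothetical) reports readings 10% higher than the truth,
# so divide its readings by 1.10 to remove the bias globally.
broken = df["sensor_id"] == 3
df.loc[broken, "wind_speed"] = df.loc[broken, "wind_speed"] / 1.10
print(df)
```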

Dealing with Missing and Empty Values

There is no automated technique for differentiating between missing and empty values:

Example:

CusID  Name  Sandwich  Sauce
1      Alan  Turkey    Sweet Onion
2      Tom   Ham
3      Sam   Beef      Thousand Island

You cannot tell whether:

Tom didn’t want any sauce.

Or

The salesperson forgot to input the sauce’s name.

Dealing with Missing and Incorrect Values

If you know the value is incorrect or missing, you can:

Ignore the instance that contains the value (not recommended)

Or

Assign a value by a reasonable estimate

Or

Use the default value
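A minimal sketch of these three options, using the sandwich table above (the "Unknown" default and the most-common-value estimate are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alan", "Tom", "Sam"],
    "Sauce": ["Sweet Onion", None, "Thousand Island"],
})

# Option 1 (not recommended): drop instances with missing values.
dropped = df.dropna(subset=["Sauce"])

# Option 2: assign a reasonable estimate, e.g. the most common sauce.
estimated = df.fillna({"Sauce": df["Sauce"].mode()[0]})

# Option 3: use a default value.
defaulted = df.fillna({"Sauce": "Unknown"})
print(defaulted)
```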

Dealing with Missing and Incorrect Values (cont)

Example of a reasonable estimate:

CusID  Name  Address        City        Zip    Phone
1      Alan  1800 Bon Ave.  Elk Grove   95758  916-333-4444
2      Tom   600 Bender Rd  Sacramento  95412  916-112-2345
3      Sam                  ???         ???    408-345-2134

From the area code 408, you may guess the city is San Jose, because San Jose accounts for over 50% of the phone numbers with this area code.
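A minimal sketch of that estimate: compute the most common city per area code from the rows where the city is known, then fill the missing city from it (the data below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alan", "Tom", "Sam", "May"],
    "City": ["Elk Grove", "Sacramento", None, "San Jose"],
    "Phone": ["916-333-4444", "916-112-2345", "408-345-2134", "408-999-1111"],
})

# Most common city per area code, computed from rows where City is known.
df["area_code"] = df["Phone"].str.split("-").str[0]
most_common = (
    df.dropna(subset=["City"])
      .groupby("area_code")["City"]
      .agg(lambda s: s.mode()[0])
)

# Fill missing cities with the most common city for that area code.
missing = df["City"].isna()
df.loc[missing, "City"] = df.loc[missing, "area_code"].map(most_common)
print(df)
```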

Dealing with Missing and Incorrect Values (cont)

Example (cont)

You would guess the missing zip code is 95110, because 95110 is at the center of San Jose.

CusID  Name  Address        City        Zip    Phone
1      Alan  1800 Bon Ave.  Elk Grove   95758  916-333-4444
2      Tom   600 Bender Rd  Sacramento  95412  916-112-2345
3      Sam                  San Jose    ???    408-345-2134

Reduce No. of Variables

More variables generate more relationships, and more data points are required to cover them.

We are not only interested in the pattern of each variable; we are also interested in the pattern of relationships among the variables.

With 10 variables, the 1st variable has to be compared with 9 neighbors, the 2nd compares with 8, and so on. The result is 9 x 8 x 7 x 6… which is 362,880 relationships.

With 13 variables, the same count is nearly 500 million relationships.

With 15 variables, it is over 87 billion relationships.

Therefore, when preparing data sets, try to minimize the number of variables.
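A tiny sketch of that growth, following the slide's own counting scheme of (n - 1)! relationships for n variables (how "relationships" are counted is a modeling choice, so the figures are illustrative):

```python
from math import factorial

# Relationship count under the slide's scheme: (n - 1)! for n variables.
for n in (10, 13, 15):
    print(f"{n} variables -> {factorial(n - 1):,} relationships")
# 10 variables -> 362,880 relationships
# 13 variables -> 479,001,600 relationships
# 15 variables -> 87,178,291,200 relationships
```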

Reduce No. of Variables (cont)

There are no general strategies to reduce the number of variables.

Before selecting a variable set, you must fully understand the role of each variable in the model.

Define Variable Range

Correct range – the range that contains only valid values for the variable.

Example: The correct range of month is 1 – 12.

Any data not in this range must be either repaired or removed from the dataset.

Project required range – the range of values we want to analyze according to the project statement.

Example: For summer sales, the project required range for month is 7 – 9.

Our goal is to find the pattern of data in this range. However, data not in this range may be required by the model.
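A minimal sketch of the two ranges, using the month example (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"month": [1, 3, 7, 8, 9, 13], "sales": [5, 7, 9, 12, 8, 4]})

# Correct range: months must be 1-12; anything else is repaired or removed.
df = df[df["month"].between(1, 12)]

# Project required range: the pattern we want is summer (months 7-9),
# but rows outside 7-9 may still be needed to build the model.
summer = df[df["month"].between(7, 9)]
print(summer)
```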

Define Variable Range (cont)

In the following table, ‘B’ stands for business; Sam is a company’s name. ‘G’ is out of the correct range, but the data miner guesses it stands for “girl,” so he replaces ‘G’ with ‘F’. If he wants to mine people’s shopping behavior, the input will be ‘M’ and ‘F’ (see the sketch after the table).

CusID  Name   Address         City        Zip    Phone         Gender
1      April  1800 Bon Ave.   Elk Grove   95758  916-333-4444  G
2      Tom    600 Bender Rd   Sacramento  95412  916-112-2345  M
3      Sam    200 Tend St     San Jose    95112  408-345-2134  B
4      May    237 Hello Blvd  San Jose    9510   408-999-1111  F
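A minimal sketch of that repair; the ‘G’ → ‘F’ recoding is the data miner's guess from the slide, not a general rule:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["April", "Tom", "Sam", "May"],
    "Gender": ["G", "M", "B", "F"],
})

# Repair the out-of-range code: the miner guesses 'G' means "girl", i.e. 'F'.
df["Gender"] = df["Gender"].replace({"G": "F"})

# For mining people's shopping behavior, keep only 'M' and 'F' instances;
# 'B' (business) records such as Sam are left out of this model.
people = df[df["Gender"].isin(["M", "F"])]
print(people)
```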

Define Variable Range (cont)

Example of variable range:

You want to mine the shopping behavior of customers younger than 40 years old. In the age column, you find the customers are between 20 and 150 years old. Therefore, you select all records with ages between 20 and 40 as your input.

This selection is wrong. Nobody in the world is over 130 years old, so you can conclude that the records with ages above 130 are wrong. However, your input should also contain the records with ages between 40 and 130. Why? Because the density and distribution of these ages directly relate to the records with ages below 40.

Conclusion: Your input should cover ages 20 – 130.
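A minimal sketch of that selection, assuming a hypothetical age column:

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 48, 77, 129, 145], "spend": [50, 80, 60, 40, 30, 20]})

# Ages above 130 are outside the correct range, so drop them as errors.
df = df[df["age"] <= 130]

# Keep the full 20-130 range as input: the 40-130 records carry the density
# and distribution that the under-40 pattern is compared against.
model_input = df[df["age"].between(20, 130)]
print(model_input)
```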

Choose a Sample

Data miners do not always use the entire data collection. Instead they usually choose a sample set randomly to speed up the mining process.

The sample size we pick should depend on:

No. of records available

Distribution and density of the data

No. of variables

Project required range of variables

And more…

It sounds difficult, but there are strategies to make a sample dataset…

Choose a Sample (cont)

Strategies to make a sample dataset:

1. Select 10 instances randomly and put them into your sample set.

2. Create a distribution curve representing the sample set.

3. Add another 10 random instances to your sample set.

4. Create a distribution curve representing the new sample set.

5. Compare the new curve with the previous curve. Do they look almost the same? If no, go back to step 3. If yes, stop and that is your sample set.

***A sample of the distribution curve is shown on the next slide.

The solid line represents the current sample set.

The dotted line represents the previous sample set.

Do they look alike?
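A minimal sketch of this sampling loop; the two curves are compared here with a simple histogram distance, and the bin choice and the 0.1 stopping threshold are assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)  # illustrative data

bins = np.linspace(population.min(), population.max(), 30)

def curve(sample):
    # Proportion of the sample in each bin: a stand-in for the distribution curve.
    counts, _ = np.histogram(sample, bins=bins)
    return counts / len(sample)

sample = rng.choice(population, size=10, replace=False)      # step 1
prev = curve(sample)                                         # step 2
while True:
    extra = rng.choice(population, size=10, replace=False)   # step 3
    sample = np.concatenate([sample, extra])
    cur = curve(sample)                                      # step 4
    if np.abs(cur - prev).sum() < 0.1:  # step 5: do the curves look almost the same?
        break                           # yes: this is the sample set
    prev = cur                          # no: go back to step 3
print("final sample size:", len(sample))
```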

Choose a Sample (cont)

Reference

Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.

Thank you for your attention!