Big Data Workshop

Dealing with large datasets Avoiding the dangers Adrien Ickowicz, Ross Sparks MATHEMATICS, INFORMATICS AND STATISTICS www.csiro.au

Transcript of Big Data Workshop

Page 1: Big Data Workshop

Dealing with large datasets
Avoiding the dangers
Adrien Ickowicz, Ross Sparks

MATHEMATICS, INFORMATICS AND STATISTICS
www.csiro.au

Page 2: Big Data Workshop

Dealing with large datasets: Slide 2 of 17

Managing the data

Can the input be massaged to make it more amenable for learning methods? (And how can you do it safely?)

✓ Attribute Selection
– Scheme-independent selection
– Searching the attribute space
– Scheme-specific selection

✓ Attribute Discretization
– Unsupervised discretization
– Entropy-based discretization
– Other methods

✓ Data Transformation
– Linear and Non-linear PCA
– Random projections
– Time Series

✗ Data Cleansing
– Improving Decision Tree
– Robust Regression
– Detecting anomalies

Page 3: Big Data Workshop

Attribute Selection Justification

An irrelevant attribute will often degrade the performance of state-of-the-art decision tree and rule learners...

→ Example: Random binary attribute
– Deteriorates the classification performance 5% to 10% of the time

But a relevant attribute can be harmful as well...

→ Example: Binary attribute that takes the same value as the class 65% of the time
– Deteriorates the classification performance 1% to 5% of the time

Page 4: Big Data Workshop

Attribute Selection

1 - Scheme-independent selection

• No universal relevance measure

• Beware of overfitting and model redundancy

• Make sure that the attribute scales are the same

2 - Searching the attribute space

• Exhaustive search impractical

• Forward, backward, ...: need an expert to set the algorithm parameters

3 - Scheme-specific selection

• Time consuming

• "Burns" one classification method
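
As a minimal sketch of scheme-independent (filter) selection, the snippet below first puts every attribute on the same scale and then ranks attributes by absolute Pearson correlation with the target. Correlation is only one possible relevance measure (there is no universal one), and the function names and toy values are hypothetical, chosen purely for illustration.

```python
import math

def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        return [0.0] * n  # a constant attribute carries no information
    return [(v - mean) / std for v in column]

def rank_attributes(columns, target):
    """Return attribute names sorted by |Pearson correlation| with the target."""
    z_target = standardize(target)
    n = len(target)
    scores = {}
    for name, column in columns.items():
        z = standardize(column)
        # the mean product of z-scores is the Pearson correlation
        scores[name] = abs(sum(a * b for a, b in zip(z, z_target)) / n)
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (made up): one informative attribute, one noisy, one constant.
columns = {
    "informative": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "noisy": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    "constant": [3.0] * 6,
}
target = [5, 5, 6, 6, 7, 7]
ranking = rank_attributes(columns, target)
print(ranking[0])  # informative
```

A ranking like this is cheap to compute, but it scores attributes one at a time, so it cannot detect redundancy between two highly correlated attributes.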

Page 5: Big Data Workshop

Attribute Discretization Justification

Deal with both continuous and discretized data

Handle the extreme values

Some algorithms assume an unrealistic hypothesis on the attribute values...

→ Example: normal distribution assumption

... or slow down the process.

→ Example: need to sort the attribute values

Page 6: Big Data Workshop

Attribute Discretization

1 - Unsupervised discretization

• Avoid big differences in bin-frequencies

• Avoid small sized bins
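
To illustrate the bin-frequency concern, here is a small sketch comparing equal-width and equal-frequency binning on made-up values containing one extreme observation. The helper names are hypothetical.

```python
def equal_width_edges(values, n_bins):
    """Cut points that split the value range into bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def equal_frequency_edges(values, n_bins):
    """Cut points that put roughly the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def bin_counts(values, edges):
    """Count how many values fall into each bin defined by the cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(v >= e for e in edges)  # number of cut points at or below v
        counts[idx] += 1
    return counts

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # one extreme value
width_counts = bin_counts(values, equal_width_edges(values, 3))
freq_counts = bin_counts(values, equal_frequency_edges(values, 3))
print(width_counts)  # [8, 0, 1]
print(freq_counts)   # [3, 3, 3]
```

With equal-width bins the single extreme value leaves one bin empty and another holding almost everything; equal-frequency bins avoid that imbalance at the price of bins with very different widths.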

2 - Entropy-based discretization

• Recursive, so need a stopping criterion
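
A minimal sketch of the recursion, assuming a simple stopping rule: a split is accepted only if its information gain reaches a fixed `min_gain` threshold. That threshold is an illustrative choice, not necessarily the criterion used in the workshop (an MDL-based criterion is a common alternative).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """pairs: (value, label) sorted by value. Return (gain, cut) or None."""
    labels = [lab for _, lab in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if best is None or gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best

def entropy_cuts(pairs, min_gain=0.1):
    """Recursively split on the highest-gain cut until the gain is too small."""
    pairs = sorted(pairs, key=lambda p: p[0])
    split = best_split(pairs)
    if split is None or split[0] < min_gain:
        return []  # stopping criterion reached
    _, cut = split
    left = [p for p in pairs if p[0] < cut]
    right = [p for p in pairs if p[0] >= cut]
    return entropy_cuts(left, min_gain) + [cut] + entropy_cuts(right, min_gain)

data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
cuts = entropy_cuts(data)
print(cuts)  # [6.5]
```

Without the threshold the recursion would keep splitting until every interval is pure, which tends to produce many tiny bins.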

3 - Other methods

• In practice, they do not perform better than entropy-based discretization

• Some are time consuming

Page 7: Big Data Workshop

Data Transformation Justification

Data often calls for general mathematical transformations of a set of attributes...

→ Example: Two date attributes may lead to a third attribute representing age
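
As a small illustration of deriving an attribute from two dates, the sketch below computes age in whole years. The field names `date_of_birth` and `reference_date` are hypothetical.

```python
from datetime import date

def age_in_years(birth, reference):
    """Whole years elapsed between two dates."""
    years = reference.year - birth.year
    # subtract one if the anniversary has not yet occurred in the reference year
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years

records = [
    {"date_of_birth": date(1980, 6, 15), "reference_date": date(2012, 6, 14)},
    {"date_of_birth": date(1980, 6, 15), "reference_date": date(2012, 6, 15)},
]
for record in records:
    record["age"] = age_in_years(record["date_of_birth"], record["reference_date"])

ages = [record["age"] for record in records]
print(ages)  # [31, 32]
```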

Test the robustness of a learning algorithm...

→ Example: add noise or change a given percentage of a nominal attribute's values
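
One possible way to set up such a robustness test is sketched below: a chosen fraction of a nominal attribute's values is replaced by a different category. The function name, categories, and seed are illustrative assumptions.

```python
import random

def flip_nominal(values, fraction, categories, seed=0):
    """Replace `fraction` of the values with a different, randomly chosen category."""
    rng = random.Random(seed)  # fixed seed so the corruption is reproducible
    noisy = list(values)
    n_flip = round(fraction * len(values))
    for i in rng.sample(range(len(values)), n_flip):
        alternatives = [c for c in categories if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

categories = ["low", "medium", "high"]
original = categories * 10  # 30 made-up values
noisy = flip_nominal(original, 0.2, categories)
changed = sum(a != b for a, b in zip(original, noisy))
print(changed)  # 6
```

Training on `noisy` and comparing the result with a model trained on `original` gives a rough picture of how sensitive a learner is to label-style corruption in that attribute.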

Page 8: Big Data Workshop

Data Transformation

1 - Linear and Non-linear PCA

• Dimension reduction technique: there is a loss of information

• Very costly in high dimension
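
The information loss can be made concrete with a short linear PCA sketch based on NumPy's SVD. The synthetic data and the `pca` helper are assumptions for illustration; the explained-variance share shows how much variance the retained components keep.

```python
import numpy as np

def pca(X, n_components):
    """Project X (rows = observations) onto its leading principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # share of total variance per component
    scores = X_centered @ Vt[:n_components].T
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
# three synthetic attributes driven by one latent factor plus small noise
X = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.1 * rng.normal(size=(200, 3))

scores, explained = pca(X, n_components=1)
print(scores.shape, float(explained[0]))
```

Whatever variance is not covered by the retained components is discarded, and the decomposition itself becomes expensive as the number of attributes grows.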

2 - Random projections

• Perform worse than PCA

• Preserve distance relationships well on average
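
A minimal sketch of a Gaussian random projection, assuming synthetic data and a hypothetical `random_projection` helper, shows what "on average" means here: the ratio of projected to original pairwise distances is close to 1 in the mean, but individual pairs vary.

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project the columns of X onto k random Gaussian directions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # scaling by 1/sqrt(k) keeps squared distances unchanged in expectation
    R = rng.normal(size=(d, k)) / np.sqrt(k)
    return X @ R

def pairwise_distances(A):
    """Euclidean distance between every pair of rows."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 1000))   # 50 synthetic points in 1000 dimensions
Z = random_projection(X, k=200)   # reduced to 200 dimensions

upper = np.triu_indices(50, k=1)  # each pair counted once
ratio = pairwise_distances(Z)[upper] / pairwise_distances(X)[upper]
print(float(ratio.mean()), float(ratio.min()), float(ratio.max()))
```

Unlike PCA, the projection matrix does not depend on the data, which is why it is cheap to compute but makes no attempt to keep the directions of largest variance.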

3 - Time Series

• Pay attention to the sampling

Page 9: Big Data Workshop

Application Example

- What is the difference between theory and practice?
- There is no difference ... in theory. But in practice, there is.

→ Example 1: Attribute Selection (Backward vs Filter)

→ Example 2: Attribute Discretization (Chi-2 based vs Top-down)

→ Example 3: Data Transformation

Page 10: Big Data Workshop

Example 1

Data set: Wine quality data

Description of the data: 1599 observations of 12 variables

Question: What makes a good (red) wine?

Page 11: Big Data Workshop

Example 1

How many features do we keep?

Backward & RMSE

Number of features: 5
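
The idea behind choosing the number of features with backward elimination and RMSE can be sketched as follows. This is not the workshop's code and does not use the wine data: it assumes a plain least-squares model, a single hold-out split, and synthetic data in which only two of six attributes matter.

```python
import numpy as np

def holdout_rmse(X_tr, y_tr, X_va, y_va, cols):
    """Fit least squares with an intercept on `cols`; return validation RMSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.sqrt(np.mean((A_va @ coef - y_va) ** 2)))

def backward_elimination(X_tr, y_tr, X_va, y_va):
    """Greedily drop one attribute at a time; return [(kept_columns, rmse), ...]."""
    cols = list(range(X_tr.shape[1]))
    path = [(cols[:], holdout_rmse(X_tr, y_tr, X_va, y_va, cols))]
    while len(cols) > 1:
        # try removing each remaining attribute and keep the best reduced set
        candidates = [[c for c in cols if c != drop] for drop in cols]
        rmses = [holdout_rmse(X_tr, y_tr, X_va, y_va, c) for c in candidates]
        best = int(np.argmin(rmses))
        cols = candidates[best]
        path.append((cols[:], rmses[best]))
    return path

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
# synthetic target: only columns 0 and 1 carry signal
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=400)

path = backward_elimination(X[:300], y[:300], X[300:], y[300:])
for kept, rmse in path:
    print(len(kept), kept, round(rmse, 3))
```

Reading the printed path from top to bottom, the RMSE stays roughly flat while uninformative attributes are removed and jumps once an informative one is dropped; the number of features to keep is the size just before that jump.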

Page 12: Big Data Workshop

Example 1

How many features do we keep?

Filter & RMSE

Page 13: Big Data Workshop

Example 2

How do we discretize the features?

[Figure panels: Chi-2 discretization; MDL discretization]

Page 14: Big Data Workshop

Example 2

How do we discretize the features?

[Figure panels: Chi-2 Merge discretization; Top-down discretization]
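
Assuming "Chi-2 Merge" refers to a ChiMerge-style bottom-up procedure, the sketch below starts with one interval per distinct value and repeatedly merges the adjacent pair with the smallest chi-square statistic until every pair exceeds a threshold. The threshold 2.706 (the 90% critical value for one degree of freedom) and the toy data are illustrative choices.

```python
from collections import Counter

def chi2(a, b, classes):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:
                stat += (row[c] - expected) ** 2 / expected
    return stat

def chi_merge(pairs, threshold):
    """pairs: (value, label). Return the lower bounds of the merged intervals."""
    classes = sorted({lab for _, lab in pairs})
    by_value = {}
    for value, lab in pairs:
        by_value.setdefault(value, Counter())[lab] += 1
    intervals = [(v, by_value[v]) for v in sorted(by_value)]
    while len(intervals) > 1:
        stats = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] > threshold:
            break  # every adjacent pair differs significantly
        lower, counts = intervals[i]
        intervals[i:i + 2] = [(lower, counts + intervals[i + 1][1])]
    return [lower for lower, _ in intervals]

data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
bounds = chi_merge(data, threshold=2.706)
print(bounds)  # [1, 10]
```

The contrast with the top-down approach is the direction of travel: this procedure merges many small intervals upward, whereas entropy-based methods split one large interval downward.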

Page 15: Big Data Workshop

Example 3

How do we transform the data?

Principal Component Analysis

Page 16: Big Data Workshop

Example 3

How do we transform the data?

Projection Pursuit Regression

Page 17: Big Data Workshop

CSIRO Mathematics, Informatics and Statistics

Adrien Ickowicz
t +61 2 9325 3260
e [email protected]
CSIRO Mathematics, Informatics and Statistics web

CSIRO Mathematics, Informatics and Statistics

Ross Sparks
t +61 2 9325 3262
e [email protected]
CSIRO Mathematics, Informatics and Statistics web