Big Data Workshop

Dealing with large datasets
Avoiding the dangers
Adrien Ickowicz, Ross Sparks

MATHEMATICS, INFORMATICS AND STATISTICS
www.csiro.au


Managing the data

Can the input be massaged to make it more amenable to learning methods? (And how can you do it safely?)

✓ Attribute Selection
– Scheme-independent selection
– Searching the attribute space
– Scheme-specific selection

✓ Attribute Discretization
– Unsupervised discretization
– Entropy-based discretization
– Other methods

✓ Data Transformation
– Linear and non-linear PCA
– Random projections
– Time series

✗ Data Cleansing
– Improving decision trees
– Robust regression
– Detecting anomalies


Attribute Selection Justification

An irrelevant attribute will often degrade the performance of state-of-the-art decision tree and rule learners...

→ Example: Random binary attribute
– Deteriorates the classification performance 5% to 10% of the time

But a relevant attribute can be harmful as well...

→ Example: binary attribute that takes the same value as the class 65% of the time
– Deteriorates the classification performance 1% to 5% of the time
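The two attributes in these examples are easy to generate for an experiment. The following is a minimal sketch only: the function names, the 0/1 label encoding, the sample size and the fixed seed are illustrative assumptions, and the classifier that would be trained with and without the extra attribute is left out.

```python
import random

def random_binary_attribute(n, rng):
    """Irrelevant attribute: one unbiased coin toss per instance."""
    return [rng.randint(0, 1) for _ in range(n)]

def agreeing_binary_attribute(labels, agreement, rng):
    """Attribute equal to the 0/1 class label with probability `agreement`,
    and equal to the opposite value otherwise."""
    return [y if rng.random() < agreement else 1 - y for y in labels]

def agreement_rate(attribute, labels):
    """Fraction of instances where the attribute equals the class label."""
    return sum(a == y for a, y in zip(attribute, labels)) / len(labels)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
labels = [rng.randint(0, 1) for _ in range(10_000)]  # hypothetical two-class labels

noise = random_binary_attribute(len(labels), rng)
partial = agreeing_binary_attribute(labels, 0.65, rng)

print(round(agreement_rate(noise, labels), 2))    # close to 0.5
print(round(agreement_rate(partial, labels), 2))  # close to 0.65
```

Appending `noise` or `partial` as an extra column and comparing cross-validated accuracy before and after is one way to reproduce this kind of experiment on your own data.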


Attribute Selection

1 - Scheme-independent selection

• No universal relevance measure

• Beware of overfitting and model redundancy

• Make sure that the attribute scales are the same

2 - Searching the attribute space

• Exhaustive search impractical

• Forward, backward, ...: need an expert to set the algorithm parameters

3 - Scheme-specific selection

• Time consuming

• “Burns” one classification method
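To illustrate searching the attribute space, here is a sketch of greedy forward selection. It assumes a caller-supplied `evaluate(subset)` function that returns a score where higher is better (for instance a cross-validated accuracy). The `toy_evaluate` scorer, the attribute names `x1` to `x4` and the `min_gain` threshold are hypothetical stand-ins, not results from any dataset.

```python
def forward_selection(attributes, evaluate, min_gain=1e-9):
    """Greedy forward search: start from the empty set and repeatedly add the
    attribute that most improves evaluate(subset). Stop when no addition
    improves the score by more than min_gain (higher score = better)."""
    selected = []
    best_score = evaluate(selected)
    remaining = list(attributes)
    while remaining:
        scored = [(evaluate(selected + [a]), a) for a in remaining]
        score, best_attr = max(scored, key=lambda t: t[0])
        if score - best_score <= min_gain:
            break
        selected.append(best_attr)
        remaining.remove(best_attr)
        best_score = score
    return selected, best_score

def toy_evaluate(subset):
    """Hypothetical scorer standing in for a cross-validated learner."""
    useful = {"x1": 0.30, "x3": 0.15}        # made-up contributions
    gain = sum(useful.get(a, 0.0) for a in subset)
    return gain - 0.01 * len(subset)         # small penalty per attribute

selected, score = forward_selection(["x1", "x2", "x3", "x4"], toy_evaluate)
print(selected)  # ['x1', 'x3']
```

With n attributes the greedy search calls `evaluate` on the order of n² times, against 2ⁿ subsets for an exhaustive search. The stopping threshold is exactly the kind of parameter that has to be set by hand.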


Attribute Discretization Justification

Deal with both continuous and discretized data

Handle the extreme values

Some algorithms make an unrealistic assumption about the attribute values...

→ Example: normal distribution assumption

... or slow down the process.

→ Example: need to sort the attribute values


Attribute Discretization

1 - Unsupervised discretization

• Avoid big differences in bin-frequencies

• Avoid small sized bins

2 - Entropy-based discretization

• Recursive, so need a stopping criterion

3 - Other methods

• In practice, do not perform better than entropy-based discretization

• Some are time consuming
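The two warnings about unsupervised discretization can be seen on a toy example. The sketch below compares equal-width and equal-frequency binning on made-up values containing one extreme observation; the function names and data are illustrative assumptions.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum falls exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (almost) the same number
    of instances, based on rank. Tied values may land in different bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def bin_counts(bins, k):
    """Number of instances in each of the k bins."""
    return [bins.count(b) for b in range(k)]

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # one extreme value
print(bin_counts(equal_width_bins(values, 3), 3))      # [8, 0, 1]
print(bin_counts(equal_frequency_bins(values, 3), 3))  # [3, 3, 3]
```

Equal-width binning lets the extreme value dictate the interval width, producing one crowded bin, one empty bin and one near-empty bin. Equal-frequency binning avoids that, at the price of ignoring the actual spacing of the values.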


Data Transformation Justification

Data often calls for general mathematical transformations of a set of attributes...

→ Example: Two date attributes may lead to a third attribute representing age

Test the robustness of a learning algorithm...

→ Example: add noise to, or change a given percentage of, a nominal attribute's values
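Both examples can be sketched in a few lines. The code below is illustrative only: the function names, the dates and the category values are assumptions, and the perturbation expects at least two distinct categories.

```python
import random
from datetime import date

def age_in_years(birth, reference):
    """Completed years between two dates (a derived third attribute)."""
    before_birthday = (reference.month, reference.day) < (birth.month, birth.day)
    return reference.year - birth.year - before_birthday

def perturb_nominal(values, fraction, rng):
    """Return a copy in which `fraction` of the entries, chosen at random,
    are replaced by a different value drawn from the observed categories."""
    categories = sorted(set(values))
    noisy = list(values)
    n_change = round(fraction * len(values))
    for i in rng.sample(range(len(values)), n_change):
        noisy[i] = rng.choice([c for c in categories if c != values[i]])
    return noisy

print(age_in_years(date(1980, 6, 15), date(2013, 6, 14)))  # 32
print(age_in_years(date(1980, 6, 15), date(2013, 6, 15)))  # 33

rng = random.Random(1)                    # fixed seed for reproducibility
colours = ["red", "white", "rose"] * 10   # hypothetical nominal attribute
noisy = perturb_nominal(colours, 0.10, rng)
print(sum(a != b for a, b in zip(colours, noisy)))  # 3
```

Training on `noisy` instead of `colours` for increasing values of `fraction` gives a simple picture of how quickly a learner degrades.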


Data Transformation

1 - Linear and Non-linear PCA

• Dimension reduction technique: there is a loss of information

• Very costly in high dimension

2 - Random projections

• Perform worse than PCA

• Preserve distance relationships well on average

3 - Time Series

• Pay attention to the sampling
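The claim that random projections preserve distances on average can be checked numerically. The sketch below assumes a Gaussian projection matrix with variance 1/k per entry, which is one common construction among several. The dimensions, the synthetic points and the seed are arbitrary choices.

```python
import math
import random

def random_projection_matrix(d, k, rng):
    """k x d matrix with independent N(0, 1/k) entries, so that squared
    distances are preserved in expectation."""
    s = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, s) for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    """Map a d-dimensional point to k dimensions."""
    return [sum(r_j * x_j for r_j, x_j in zip(row, x)) for row in matrix]

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

rng = random.Random(0)
d, k = 200, 50  # hypothetical original and reduced dimensions
points = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(20)]
R = random_projection_matrix(d, k, rng)
projected = [project(R, p) for p in points]

# Ratio of projected to original distance for every pair of points.
ratios = [distance(projected[i], projected[j]) / distance(points[i], points[j])
          for i in range(20) for j in range(i + 1, 20)]
mean_ratio = sum(ratios) / len(ratios)
print(round(mean_ratio, 2))  # close to 1 on average
```

Individual ratios still scatter around 1, and the scatter grows as k shrinks. Unlike PCA, the projection matrix is drawn without looking at the data, which is why it is cheap to build.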


Application Example

- What is the difference between theory and practice?
- There is no difference ... in theory. But in practice, there is.

→ Example 1: Attribute Selection (Backward vs Filter)

→ Example 2: Attribute Discretization (Chi-2 based vs Top-down)

→ Example 3: Data Transformation


Example 1

Data set: Wine quality data

Description of the data: 1599 obs. of 12 variables

Question: What makes a good (red) wine?


Example 1

How many features do we keep?

Backward & RMSE

Number of features: 5


Example 1

How many features do we keep?

Filter & RMSE
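The slide does not say which filter criterion was used. As one possible illustration, the sketch below ranks attributes by the absolute Pearson correlation with the target and defines the RMSE used to compare models. The column names and values are made up and are not the wine data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_by_correlation(columns, target):
    """Filter step: order attribute names by |correlation| with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

def rmse(predicted, actual):
    """Root mean squared error between predictions and observed values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Made-up toy columns, not taken from the wine quality data.
columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 2, 1, 2, 1],
    "x3": [6, 5, 3, 4, 2, 1],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

ranking = rank_by_correlation(columns, target)
print(ranking)  # ['x1', 'x3', 'x2']
```

Keeping the top-ranked attributes and tracking the RMSE of a model as more are added gives a curve from which the number of features can be read off. Because the filter scores each attribute on its own, it cannot detect redundancy between attributes.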


Example 2

How do we discretize the features?

Panels: Chi-2 discretization; MDL discretization


Example 2

How do we discretize the features?

Panels: Chi-2 Merge discretization; Top-down discretization
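The core step of top-down, entropy-based discretization is finding the single cut point that minimises the class entropy of the two resulting intervals. The sketch below performs one such step on made-up data. A full implementation would recurse on each side and needs a stopping criterion (for example an MDL-based one), which is not shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted entropy) for the cut point that minimises
    the class entropy of the two sides, or None if no cut is possible."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or h < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, h)
    return best

# Made-up numeric attribute with two classes.
values = [64, 65, 68, 69, 70, 71, 72, 75]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

threshold, h = best_split(values, labels)
print(threshold)  # 69.5
```

This version recomputes the entropy from scratch at every candidate cut, which is quadratic in the number of instances. Running class counts would make the scan linear after sorting.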


Example 3

How do we transform the data?

Principal Component Analysis
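As a dependency-free illustration of linear PCA, the sketch below finds the first principal component by power iteration on the sample covariance matrix. Production code would normally use a linear algebra library instead. The data points and the iteration count are arbitrary assumptions. The sketch also assumes that the data has non-zero variance and that the starting vector is not orthogonal to the leading component.

```python
import math

def first_principal_component(rows, iterations=200):
    """Direction of largest variance of the mean-centred data, found by
    power iteration on the sample covariance matrix.
    Returns (unit vector, share of total variance explained)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # d x d covariance matrix: this is the costly part in high dimension.
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    return v, eigenvalue / total_variance

# Made-up points lying close to the line y = 2x.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, share = first_principal_component(rows)
print([round(x, 2) for x in direction], round(share, 3))
```

Here nearly all the variance lies along one direction, so keeping a single component loses very little. The `share` value quantifies the information lost when the remaining components are dropped.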


Example 3

How do we transform the data?

Projection Pursuit Regression


CSIRO Mathematics, Informatics and Statistics

Adrien Ickowicz
t +61 2 9325 3260
e [email protected]
w CSIRO Mathematics, Informatics and Statistics web

Ross Sparks
t +61 2 9325 3262
e [email protected]
w CSIRO Mathematics, Informatics and Statistics web

http://www.csiro.au/en/Organisation-Structure/Divisions/Mathematics-Informatics-and-Statistics.aspx
