Big Data Workshop
Dealing with large datasets
Avoiding the dangers
Adrien Ickowicz, Ross Sparks
MATHEMATICS, INFORMATICS AND STATISTICS
www.csiro.au
Dealing with large datasets: Slide 2 of 17
Managing the data
Can the input be massaged to make it more amenable to learning methods? (And how can you do it safely?)
✓ Attribute Selection
– Scheme-independent selection
– Searching the attribute space
– Scheme-specific selection
✓ Attribute Discretization
– Unsupervised discretization
– Entropy-based discretization
– Other methods
✓ Data Transformation
– Linear and Non-linear PCA
– Random projections
– Time Series
✗ Data Cleansing
– Improving Decision Tree
– Robust Regression
– Detecting anomalies
Dealing with large datasets: Slide 3 of 17
Attribute Selection Justification
An irrelevant attribute will often degrade the performance of state-of-the-art decision tree and rule learners...
→ Example: Random binary attribute
– Deteriorates the classification performance 5% to 10% of the time
But a relevant attribute can be harmful as well...
→ Example: Binary attribute that matches the class value 65% of the time
– Deteriorates the classification performance 1% to 5% of the time
Dealing with large datasets: Slide 4 of 17
Attribute Selection
1 - Scheme-independent selection
• No universal relevance measure
• Beware of overfitting and model redundancy
• Make sure that the attribute scales are the same
2 - Searching the attribute space
• Exhaustive search impractical
• Forward, backward, ...: need an expert to set the algorithm parameters
3 - Scheme-specific selection
• Time consuming
• "Burns" one classification method
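As an illustration of scheme-independent selection and of the remark on attribute scales, here is a minimal sketch that standardizes each attribute and ranks attributes by absolute Pearson correlation with the target. Correlation is only one possible relevance measure (the slide stresses that none is universal), and the attribute names and toy values below are hypothetical, not taken from the workshop data.

```python
import math


def zscore(column):
    """Rescale a column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        return [0.0] * n  # a constant attribute carries no information
    return [(v - mean) / std for v in column]


def rank_by_correlation(columns, target):
    """Rank attribute names by |Pearson correlation| with the target."""
    t = zscore(target)
    scores = {}
    for name, column in columns.items():
        c = zscore(column)
        scores[name] = abs(sum(a * b for a, b in zip(c, t)) / len(t))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: attributes recorded on very different scales.
columns = {
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "sulphates_x1000": [560, 680, 650, 580, 750, 820],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
quality = [5, 5, 6, 6, 7, 7]
ranking = rank_by_correlation(columns, quality)
print(ranking)
```

A filter like this is cheap, but it scores each attribute in isolation, so it cannot detect redundancy between two highly correlated attributes.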
Dealing with large datasets: Slide 5 of 17
Attribute Discretization Justification
Deal with both continuous and discretized data
Handle the extreme values
Some algorithms make an unrealistic assumption about the attribute values...
→ Example: normal distribution assumption
... or slow down the process.
→ Example: need to sort the attribute values
Dealing with large datasets: Slide 6 of 17
Attribute Discretization
1 - Unsupervised discretization
• Avoid big differences in bin-frequencies
• Avoid small sized bins
2 - Entropy-based discretization
• Recursive, so need a stopping criterion
3 - Other methods
• In practice, do not perform better than entropy-based discretization
• Some are time consuming
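The two warnings about unsupervised discretization (uneven bin frequencies, very small bins) can be seen by comparing equal-width and equal-frequency binning on skewed data. The sketch below is illustrative only; the function names and the sample values are assumptions introduced here.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly the same count.

    Note: tied values may end up in different bins in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def bin_counts(bins, k):
    """Number of values that fell into each of the k bins."""
    return [bins.count(b) for b in range(k)]


values = [1, 2, 2, 3, 3, 4, 5, 6, 40, 100]  # skewed, with two extreme values
width_bins = equal_width_bins(values, 4)
freq_bins = equal_frequency_bins(values, 4)
print(bin_counts(width_bins, 4))  # [8, 1, 0, 1]
print(bin_counts(freq_bins, 4))   # [3, 2, 3, 2]
```

With equal-width bins the two extreme values stretch the range, so most of the data collapses into the first bin and one bin stays empty. Equal-frequency bins avoid that at the cost of intervals with very different widths.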
Dealing with large datasets: Slide 7 of 17
Data Transformation Justification
Data often calls for general mathematical transformations of a set of attributes...
→ Example: Two date attributes may lead to a third attribute representing age
Test the robustness of a learning algorithm...
→ Example: add noise to, or change a given percentage of, a nominal attribute's values
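Both examples on this slide are simple to sketch. The first function derives an age attribute from two date attributes. The second changes a given fraction of a nominal attribute's values so that a learner can be retrained on the noisy copy and compared with the original. Function names, the seed, and the sample values are hypothetical.

```python
import random
from datetime import date


def age_in_years(birth, reference):
    """Whole years elapsed between two dates."""
    years = reference.year - birth.year
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1  # the anniversary has not been reached yet
    return years


def flip_nominal(values, categories, fraction, seed=0):
    """Return a copy in which `fraction` of the entries get a different category."""
    rng = random.Random(seed)
    noisy = list(values)
    n_flip = round(fraction * len(values))
    for i in rng.sample(range(len(values)), n_flip):
        alternatives = [c for c in categories if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


print(age_in_years(date(1980, 6, 15), date(2012, 3, 1)))  # 31

colours = ["red", "white", "red", "rose", "white",
           "red", "red", "white", "rose", "red"]
noisy = flip_nominal(colours, ["red", "white", "rose"], fraction=0.2)
n_changed = sum(a != b for a, b in zip(colours, noisy))
print(n_changed)  # 2
```

Fixing the seed keeps the perturbation reproducible, which matters when the same noisy dataset is fed to several learning algorithms.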
Dealing with large datasets: Slide 8 of 17
Data Transformation
1 - Linear and Non-linear PCA
• Dimension reduction technique: there is a loss of information
• Very costly in high dimension
2 - Random projections
• Perform worse than PCA
• Preserve distance relationship well on average
3 - Time Series
• Pay attention to the sampling
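The claim that random projections preserve distance relationships well on average can be checked empirically. The sketch below uses a Gaussian projection matrix scaled by 1/sqrt(d_out), which is one common choice and an assumption here; the dimensions, seeds, and synthetic points are also illustrative.

```python
import math
import random


def random_projection_matrix(d_in, d_out, seed=0):
    """Gaussian matrix scaled so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
            for _ in range(d_out)]


def project(matrix, x):
    """Multiply the projection matrix by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical data: 20 points in 200 dimensions, projected down to 50.
rng = random.Random(1)
points = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(20)]
R = random_projection_matrix(200, 50)
projected = [project(R, p) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(len(points))
    for j in range(i + 1, len(points))
]
mean_ratio = sum(ratios) / len(ratios)
print("mean ratio: %.3f, min: %.3f, max: %.3f"
      % (mean_ratio, min(ratios), max(ratios)))
```

The mean ratio should sit close to 1, while individual pairs scatter around it. Unlike PCA, the projection ignores the data entirely, which makes it cheap in high dimension but is also why it tends to perform worse.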
Dealing with large datasets: Slide 9 of 17
Application Example
- What is the difference between theory and practice?
- There is no difference ... in theory. But in practice, there is.
→ Example 1: Attribute Selection (Backward vs Filter)
→ Example 2: Attribute Discretization (Chi-2 based vs Top-down)
→ Example 3: Data Transformation
Dealing with large datasets: Slide 10 of 17
Example 1
Data set: Wine Quality data
Description of the data: 1599 observations of 12 variables
Question : What makes a good (red) wine?
Dealing with large datasets: Slide 11 of 17
Example 1
How many features do we keep?
Backward & RMSE
Number of features: 5
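Backward selection scored by RMSE can be sketched as follows. This is not the procedure used to produce the slide's result: the learner (a leave-one-out k-nearest-neighbour regressor), the synthetic data, and all names are assumptions chosen to keep the example self-contained.

```python
import math
import random


def loo_knn_rmse(rows, target, features, k=3):
    """Leave-one-out RMSE of a k-nearest-neighbour regressor on `features`."""
    errors = []
    for i, row in enumerate(rows):
        neighbours = sorted(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in features),
        )[:k]
        prediction = sum(target[j] for j in neighbours) / k
        errors.append((prediction - target[i]) ** 2)
    return math.sqrt(sum(errors) / len(errors))


def backward_elimination(rows, target, features, n_keep):
    """Drop, one at a time, the feature whose removal gives the lowest RMSE."""
    selected = list(features)
    history = [(list(selected), loo_knn_rmse(rows, target, selected))]
    while len(selected) > n_keep:
        candidates = [[f for f in selected if f != drop] for drop in selected]
        selected = min(candidates,
                       key=lambda fs: loo_knn_rmse(rows, target, fs))
        history.append((list(selected), loo_knn_rmse(rows, target, selected)))
    return selected, history


# Hypothetical synthetic data: the target depends on features 0 and 1 only.
rng = random.Random(0)
rows = [[rng.random() for _ in range(5)] for _ in range(60)]
target = [3 * r[0] - 2 * r[1] + rng.gauss(0, 0.05) for r in rows]

kept, history = backward_elimination(rows, target, [0, 1, 2, 3, 4], n_keep=2)
for subset, rmse in history:
    print(len(subset), "features:", subset, "RMSE = %.3f" % rmse)
```

Printing the RMSE at each subset size gives the kind of curve from which a number of features can be read off. Every step refits the learner once per remaining feature, which is why wrapper approaches of this kind are time consuming.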
Dealing with large datasets: Slide 12 of 17
Example 1
How many features do we keep?
Filter & RMSE
Dealing with large datasets: Slide 13 of 17
Example 2
How do we discretize the features?
[Figure panels: Chi-2 discretization; MDL discretization]
Dealing with large datasets: Slide 14 of 17
Example 2
How do we discretize the features?
[Figure panels: Chi-2 Merge discretization; Top-down discretization]
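A top-down, entropy-based discretization can be sketched as a recursive search for the cut point with the highest information gain. The stopping criterion used here (a fixed minimum gain plus a minimum bin size) is a deliberate simplification and an assumption of this sketch; MDL-based variants replace the fixed threshold with a description-length test. The data and names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(pairs):
    """Return (gain, cut_value) for the best binary split of sorted pairs."""
    labels = [lab for _, lab in pairs]
    base = entropy(labels)
    best = (0.0, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        left, right = labels[:i], labels[i:]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best


def entropy_cuts(values, labels, min_gain=0.1, min_size=2):
    """Recursively split; stop when the gain or the bin size is too small."""
    pairs = sorted(zip(values, labels))

    def recurse(segment):
        if len(segment) < 2 * min_size:
            return []
        gain, cut = best_cut(segment)
        if cut is None or gain < min_gain:
            return []
        left = [p for p in segment if p[0] <= cut]
        right = [p for p in segment if p[0] > cut]
        if len(left) < min_size or len(right) < min_size:
            return []
        return recurse(left) + [cut] + recurse(right)

    return recurse(pairs)


values = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5, 12.0, 12.5, 13.0, 13.5]
labels = ["low"] * 4 + ["high"] * 4 + ["low"] * 4
cuts = entropy_cuts(values, labels)
print(cuts)  # [4.25, 9.75]
```

Without the stopping test the recursion would keep splitting until every bin is pure, which produces many small bins and overfits the training labels.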
Dealing with large datasets: Slide 15 of 17
Example 3
How do we transform the data?
Principal Component Analysis
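The information loss incurred by keeping only the first k principal components can be written down explicitly. The notation below is an assumption introduced here, not taken from the slides: X is the centred n × p data matrix, S its sample covariance, and λ_1 ≥ ... ≥ λ_p the eigenvalues of S with eigenvectors v_j.

```latex
S = \frac{1}{n-1} X^\top X = \sum_{j=1}^{p} \lambda_j \, v_j v_j^\top,
\qquad
\text{retained variance}(k) = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{p} \lambda_j},
\qquad
\text{loss}(k) = 1 - \text{retained variance}(k).
```

Choosing k then amounts to trading the retained fraction against the cost of working with more dimensions.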
Dealing with large datasets: Slide 16 of 17
Example 3
How do we transform the data?
Projection Pursuit Regression
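For reference, the usual form of the projection pursuit regression model is given below. The symbols are assumptions of this note: x is the attribute vector, the w_m are unit-length projection directions, and the g_m are smooth one-dimensional functions, all estimated from the data.

```latex
\hat{y}(x) = \bar{y} + \sum_{m=1}^{M} g_m\!\left(w_m^\top x\right),
\qquad \lVert w_m \rVert = 1 .
```

Unlike PCA, the directions are chosen using the response, so each projection is picked for how well it helps predict y rather than for how much variance of x it captures.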
CSIRO Mathematics, Informatics and Statistics
Adrien Ickowicz
t +61 2 9325 3260
e [email protected]
CSIRO Mathematics, Informatics and Statistics
Ross Sparks
t +61 2 9325 3262
e [email protected]