Weka Tutorial
-
Upload
yamen-ajjour -
Category
Documents
-
view
67 -
download
3
Transcript of Weka Tutorial
Weka Tutorial
D. De Cao R. Basili
Corso di Web Mining e Retrievala.a. 2008-9
April 15, 2009
What is WEKA?
Collection of ML algorithms - open-source Java packageSite:http://www.cs.waikato.ac.nz/ml/weka/Documentation:http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Schemes for classification include:decision trees, rule learners, naive Bayes, decision tables, locallyweighted regression, SVMs, instance-based learners, logistic regression,voted perceptrons, multi-layer perceptron
Schemes for numeric prediction include:linear regression, model tree generators, locally weighted regression,instance-based learners, decision tables, multi-layer perceptron
Meta-schemes include:Bagging, boosting, stacking, regression via classification, classificationvia regression, cost sensitive classification
Schemes for clustering:EM and Cobweb
What is WEKA?
Collection of ML algorithms - open-source Java packageSite:http://www.cs.waikato.ac.nz/ml/weka/Documentation:http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Schemes for classification include:decision trees, rule learners, naive Bayes, decision tables, locallyweighted regression, SVMs, instance-based learners, logistic regression,voted perceptrons, multi-layer perceptron
Schemes for numeric prediction include:linear regression, model tree generators, locally weighted regression,instance-based learners, decision tables, multi-layer perceptron
Meta-schemes include:Bagging, boosting, stacking, regression via classification, classificationvia regression, cost sensitive classification
Schemes for clustering:EM and Cobweb
What is WEKA?
Collection of ML algorithms - open-source Java packageSite:http://www.cs.waikato.ac.nz/ml/weka/Documentation:http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Schemes for classification include:decision trees, rule learners, naive Bayes, decision tables, locallyweighted regression, SVMs, instance-based learners, logistic regression,voted perceptrons, multi-layer perceptron
Schemes for numeric prediction include:linear regression, model tree generators, locally weighted regression,instance-based learners, decision tables, multi-layer perceptron
Meta-schemes include:Bagging, boosting, stacking, regression via classification, classificationvia regression, cost sensitive classification
Schemes for clustering:EM and Cobweb
What is WEKA?
Collection of ML algorithms - open-source Java packageSite:http://www.cs.waikato.ac.nz/ml/weka/Documentation:http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Schemes for classification include:decision trees, rule learners, naive Bayes, decision tables, locallyweighted regression, SVMs, instance-based learners, logistic regression,voted perceptrons, multi-layer perceptron
Schemes for numeric prediction include:linear regression, model tree generators, locally weighted regression,instance-based learners, decision tables, multi-layer perceptron
Meta-schemes include:Bagging, boosting, stacking, regression via classification, classificationvia regression, cost sensitive classification
Schemes for clustering:EM and Cobweb
What is WEKA?
Collection of ML algorithms - open-source Java packageSite:http://www.cs.waikato.ac.nz/ml/weka/Documentation:http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Schemes for classification include:decision trees, rule learners, naive Bayes, decision tables, locallyweighted regression, SVMs, instance-based learners, logistic regression,voted perceptrons, multi-layer perceptron
Schemes for numeric prediction include:linear regression, model tree generators, locally weighted regression,instance-based learners, decision tables, multi-layer perceptron
Meta-schemes include:Bagging, boosting, stacking, regression via classification, classificationvia regression, cost sensitive classification
Schemes for clustering:EM and Cobweb
What is WEKA?
Collection of ML algorithms - open-source Java packageSite:http://www.cs.waikato.ac.nz/ml/weka/Documentation:http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Schemes for classification include:decision trees, rule learners, naive Bayes, decision tables, locallyweighted regression, SVMs, instance-based learners, logistic regression,voted perceptrons, multi-layer perceptron
Schemes for numeric prediction include:linear regression, model tree generators, locally weighted regression,instance-based learners, decision tables, multi-layer perceptron
Meta-schemes include:Bagging, boosting, stacking, regression via classification, classificationvia regression, cost sensitive classification
Schemes for clustering:EM and Cobweb
ARFF File Format
Require declarations of @RELATION, @ATTRIBUTE and @DATA
@RELATION declaration associates a name with the dataset@RELATION <relation-name>
@ATTRIBUTE declaration specifies the name and type of an attribute@ATTRIBUTE <attribute-name> <datatype>Datatype can be numeric, nominal, string or date
@ATTRIBUTE sepallength NUMERIC@ATTRIBUTE petalwidth NUMERIC@ATTRIBUTE class {Setosa,Versicolor,Virginica}
@DATA declaration is a single line denoting the start of the datasegment
Missing values are represented by [email protected], 0.2, Setosa1.4, ?, Versicolor
ARFF File Format
Require declarations of @RELATION, @ATTRIBUTE and @DATA@RELATION declaration associates a name with the dataset
@RELATION <relation-name>
@ATTRIBUTE declaration specifies the name and type of an attribute@ATTRIBUTE <attribute-name> <datatype>Datatype can be numeric, nominal, string or date
@ATTRIBUTE sepallength NUMERIC@ATTRIBUTE petalwidth NUMERIC@ATTRIBUTE class {Setosa,Versicolor,Virginica}
@DATA declaration is a single line denoting the start of the datasegment
Missing values are represented by [email protected], 0.2, Setosa1.4, ?, Versicolor
ARFF File Format
Require declarations of @RELATION, @ATTRIBUTE and @DATA@RELATION declaration associates a name with the dataset
@RELATION <relation-name>
@ATTRIBUTE declaration specifies the name and type of an attribute@ATTRIBUTE <attribute-name> <datatype>Datatype can be numeric, nominal, string or date
@ATTRIBUTE sepallength NUMERIC@ATTRIBUTE petalwidth NUMERIC@ATTRIBUTE class {Setosa,Versicolor,Virginica}
@DATA declaration is a single line denoting the start of the datasegment
Missing values are represented by [email protected], 0.2, Setosa1.4, ?, Versicolor
ARFF File Format
Require declarations of @RELATION, @ATTRIBUTE and @DATA@RELATION declaration associates a name with the dataset
@RELATION <relation-name>
@ATTRIBUTE declaration specifies the name and type of an attribute@ATTRIBUTE <attribute-name> <datatype>Datatype can be numeric, nominal, string or date
@ATTRIBUTE sepallength NUMERIC@ATTRIBUTE petalwidth NUMERIC@ATTRIBUTE class {Setosa,Versicolor,Virginica}
@DATA declaration is a single line denoting the start of the datasegment
Missing values are represented by [email protected], 0.2, Setosa1.4, ?, Versicolor
ARFF File Format
Require declarations of @RELATION, @ATTRIBUTE and @DATA@RELATION declaration associates a name with the dataset
@RELATION <relation-name>
@ATTRIBUTE declaration specifies the name and type of an attribute@ATTRIBUTE <attribute-name> <datatype>Datatype can be numeric, nominal, string or date
@ATTRIBUTE sepallength NUMERIC@ATTRIBUTE petalwidth NUMERIC@ATTRIBUTE class {Setosa,Versicolor,Virginica}
@DATA declaration is a single line denoting the start of the datasegment
Missing values are represented by ?
@DATA1.4, 0.2, Setosa1.4, ?, Versicolor
ARFF File Format
Require declarations of @RELATION, @ATTRIBUTE and @DATA@RELATION declaration associates a name with the dataset
@RELATION <relation-name>
@ATTRIBUTE declaration specifies the name and type of an attribute@ATTRIBUTE <attribute-name> <datatype>Datatype can be numeric, nominal, string or date
@ATTRIBUTE sepallength NUMERIC@ATTRIBUTE petalwidth NUMERIC@ATTRIBUTE class {Setosa,Versicolor,Virginica}
@DATA declaration is a single line denoting the start of the datasegment
Missing values are represented by [email protected], 0.2, Setosa1.4, ?, Versicolor
ARFF Sparse File Format
Similar to AARF files except that data value 0 are not represented
Non-zero attributes are specified by attribute number and valueFull:
@DATA0 , X , 0 , Y , ”class A"0 , 0 , W , 0 , ”class B"
Sparse:@DATA{1 X, 3 Y, 4 ”class A"}{2 W, 4 ”class B"}
Note that the omitted values in a sparse instance are 0, they are notmissing values! If a value is unknown, you must explicitly represent itwith a question mark (?)
ARFF Sparse File Format
Similar to AARF files except that data value 0 are not representedNon-zero attributes are specified by attribute number and value
Full:@DATA0 , X , 0 , Y , ”class A"0 , 0 , W , 0 , ”class B"
Sparse:@DATA{1 X, 3 Y, 4 ”class A"}{2 W, 4 ”class B"}
Note that the omitted values in a sparse instance are 0, they are notmissing values! If a value is unknown, you must explicitly represent itwith a question mark (?)
ARFF Sparse File Format
Similar to AARF files except that data value 0 are not representedNon-zero attributes are specified by attribute number and valueFull:
@DATA0 , X , 0 , Y , ”class A"0 , 0 , W , 0 , ”class B"
Sparse:@DATA{1 X, 3 Y, 4 ”class A"}{2 W, 4 ”class B"}
Note that the omitted values in a sparse instance are 0, they are notmissing values! If a value is unknown, you must explicitly represent itwith a question mark (?)
ARFF Sparse File Format
Similar to AARF files except that data value 0 are not representedNon-zero attributes are specified by attribute number and valueFull:
@DATA0 , X , 0 , Y , ”class A"0 , 0 , W , 0 , ”class B"
Sparse:@DATA{1 X, 3 Y, 4 ”class A"}{2 W, 4 ”class B"}
Note that the omitted values in a sparse instance are 0, they are notmissing values! If a value is unknown, you must explicitly represent itwith a question mark (?)
ARFF Sparse File Format
Similar to AARF files except that data value 0 are not representedNon-zero attributes are specified by attribute number and valueFull:
@DATA0 , X , 0 , Y , ”class A"0 , 0 , W , 0 , ”class B"
Sparse:@DATA{1 X, 3 Y, 4 ”class A"}{2 W, 4 ”class B"}
Note that the omitted values in a sparse instance are 0, they are notmissing values! If a value is unknown, you must explicitly represent itwith a question mark (?)
ARFF Sparse File Format: Problem
There is a known problem saving SparseInstance objects from datasetsthat have string attributes.
In Weka, string and nominal data values are stored as numbers;String at position 0 is mapped to value “0”If read back in, first String missing from Instances
SolutionPut dummy string in position 0 when writing a SparseInstance with strings;Dummy will be ignored while writing, actual instance will be writtenproperly
ARFF Sparse File Format: Problem
There is a known problem saving SparseInstance objects from datasetsthat have string attributes.In Weka, string and nominal data values are stored as numbers;
String at position 0 is mapped to value “0”If read back in, first String missing from Instances
SolutionPut dummy string in position 0 when writing a SparseInstance with strings;Dummy will be ignored while writing, actual instance will be writtenproperly
ARFF Sparse File Format: Problem
There is a known problem saving SparseInstance objects from datasetsthat have string attributes.In Weka, string and nominal data values are stored as numbers;String at position 0 is mapped to value “0”
If read back in, first String missing from Instances
SolutionPut dummy string in position 0 when writing a SparseInstance with strings;Dummy will be ignored while writing, actual instance will be writtenproperly
ARFF Sparse File Format: Problem
There is a known problem saving SparseInstance objects from datasetsthat have string attributes.In Weka, string and nominal data values are stored as numbers;String at position 0 is mapped to value “0”If read back in, first String missing from Instances
SolutionPut dummy string in position 0 when writing a SparseInstance with strings;Dummy will be ignored while writing, actual instance will be writtenproperly
ARFF Sparse File Format: Problem
There is a known problem saving SparseInstance objects from datasetsthat have string attributes.In Weka, string and nominal data values are stored as numbers;String at position 0 is mapped to value “0”If read back in, first String missing from Instances
SolutionPut dummy string in position 0 when writing a SparseInstance with strings;Dummy will be ignored while writing, actual instance will be writtenproperly
Running Learning Schemes
java -Xmx512m -cp weka.jar <learner class> [options]
Example learner classes:Decision Tree: weka.classifiers.trees.J48 (Quinlan 1993 [1])Naive Bayes: weka.classifiers.bayes.NaiveBayesk-NN: weka.classifiers.lazy.IBk
Important generic options:-t <training file> Specify training file-T <test files> Specify Test file. If none testing is performed ontraining data-x <number of folds> Number of folds for cross-validation-l <input file> Use saved model-d <output file> Output model to file-split-percentage <train size> Size of training set-c <class index> Index of attribute to use as class (NB: the indexstart from 1)-p <attribute index> Only output the predictions and oneattribute (0 for none) for all test instances.
Running Learning Schemes
java -Xmx512m -cp weka.jar <learner class> [options]
Example learner classes:Decision Tree: weka.classifiers.trees.J48 (Quinlan 1993 [1])Naive Bayes: weka.classifiers.bayes.NaiveBayesk-NN: weka.classifiers.lazy.IBk
Important generic options:-t <training file> Specify training file-T <test files> Specify Test file. If none testing is performed ontraining data-x <number of folds> Number of folds for cross-validation-l <input file> Use saved model-d <output file> Output model to file-split-percentage <train size> Size of training set-c <class index> Index of attribute to use as class (NB: the indexstart from 1)-p <attribute index> Only output the predictions and oneattribute (0 for none) for all test instances.
Running Learning Schemes
java -Xmx512m -cp weka.jar <learner class> [options]
Example learner classes:Decision Tree: weka.classifiers.trees.J48 (Quinlan 1993 [1])Naive Bayes: weka.classifiers.bayes.NaiveBayesk-NN: weka.classifiers.lazy.IBk
Important generic options:-t <training file> Specify training file-T <test files> Specify Test file. If none testing is performed ontraining data-x <number of folds> Number of folds for cross-validation-l <input file> Use saved model-d <output file> Output model to file-split-percentage <train size> Size of training set-c <class index> Index of attribute to use as class (NB: the indexstart from 1)-p <attribute index> Only output the predictions and oneattribute (0 for none) for all test instances.
Ross Quinlan.C4.5: Programs for Machine Learning.Morgan Kaufmann Publishers, San Mateo, CA, 1993.