SIMS 290-2: Applied Natural Language Processing. Preslav Nakov, October 6, 2004.
1
SIMS 290-2: Applied Natural Language Processing
Preslav Nakov
October 6, 2004
2
Today
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
3
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
4
20 Newsgroups Data Set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Source: originally collected by Ken Lang
Content and structure:
approximately 20,000 newsgroup documents
– 19,997 originally
– 18,828 without duplicates
partitioned evenly across 20 different newsgroups
Some categories are strongly related (and thus hard to discriminate):
computers: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
sci.crypt, sci.electronics, sci.med, sci.space
misc.forsale, talk.politics.misc, talk.politics.guns, talk.politics.mideast
talk.religion.misc, alt.atheism, soc.religion.christian
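For orientation, a minimal sketch that counts the documents per group (plain Python; it assumes the collection has been unpacked into a local 20_newsgroups/ directory with one subdirectory per newsgroup, as in the distribution above):

import os

# Assumed local path to the unpacked collection: one subdirectory per
# newsgroup, one file per posting (the layout of the standard distribution).
CORPUS_DIR = '20_newsgroups'

counts = {}
for group in sorted(os.listdir(CORPUS_DIR)):
    group_dir = os.path.join(CORPUS_DIR, group)
    if os.path.isdir(group_dir):
        counts[group] = len(os.listdir(group_dir))

for group, n in sorted(counts.items()):
    print(group, n)
print('total documents:', sum(counts.values()))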
5
Sample Posting: “talk.politics.guns”
From: [email protected] (C. D. Tavares)
Subject: Re: Congress to review ATF's status
In article <[email protected]>, [email protected] (Larry Cipriani) writes:
> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.
Of course. When the catbox begins to smell, simply transfer its contents into the potted plant in the foyer.
"Why Hillary! Your government smells so... FRESH!"
--
[email protected] -- If you believe that I speak for my company,
OR [email protected]    write today for my special Investors' Packet...
Callouts on the slide: the reply (the “… writes:” line and the quoted text), the From and Subject headers, and the signature need special handling during feature extraction…
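A minimal sketch of such handling (plain Python; the splitting heuristics are illustrative assumptions, not the preprocessing actually used in the course code):

def strip_posting(raw_text):
    """Separate a posting into header fields, body text, and signature,
    and drop quoted reply lines; a rough illustration only."""
    header_part, _, body = raw_text.partition('\n\n')  # headers end at the first blank line
    headers = dict(line.split(':', 1) for line in header_part.splitlines() if ':' in line)
    body = body.split('\n-- \n')[0]  # drop the signature after the conventional "-- " delimiter
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith('>')        # quoted reply lines
            and not line.rstrip().endswith('writes:')]  # attribution lines
    return headers.get('From', ''), headers.get('Subject', ''), '\n'.join(kept)

# Example: sender, subject, text = strip_posting(open('some_posting').read())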
6
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
7
Slide adapted from Eibe Frank's
WEKA: The Bird
Copyright: Martin Kramer ([email protected]), University of Waikato, New Zealand
8
WEKA: Terminology
Some synonyms/explanations for the terms used by WEKA, which may differ from the ones we adopted:
Attribute: feature
Relation: collection of examples (a dataset)
Instance: a single example
Class: category
9
Slide adapted from Eibe Frank's
WEKA: The Software Toolkit
Machine learning / data mining software in Java
GNU License
Used for research, education and applications
Complements “Data Mining” by Witten & Frank
Main features:
data pre-processing tools
learning algorithms
evaluation methods
graphical interface (incl. data visualization)
environment for comparing learning algorithms
http://www.cs.waikato.ac.nz/ml/weka
10
Slide adapted from Eibe Frank's
WEKA GUI Chooser
java -Xmx1000M -jar weka.jar
11
Slide adapted from Eibe Frank's
Our Toy Example
We demonstrate WEKA on a toy example:
3 categories from “20 Newsgroups”:
– misc.forsale
– rec.sport.hockey
– comp.graphics
20 documents per category
features:
– words converted to lowercase
– frequency 2 or more required
– stopwords removed
12
Slide adapted from Eibe Frank's
Explorer: Pre-Processing The Data
WEKA can import data from: files (ARFF, CSV, C4.5, binary), a URL, or an SQL database (using JDBC)
Pre-processing tools (filters) are used for: discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.
13
The Preprocessing Tab
Screenshot callouts: the Preprocessing and Classification tabs; filter selection; manual attribute selection; statistical attribute selection; list of attributes (last: the class variable); frequency and categories for the selected attribute; statistics about the values of the selected attribute.
14
Slide adapted from Eibe Frank's
Explorer: Building “Classifiers”
Classifiers in WEKA are models for:
classification (predict a nominal class)
regression (predict a numerical quantity)
Learning algorithms: Naïve Bayes, decision trees, kNN, support vector machines, multi-layer perceptron, logistic regression, etc.
Meta-classifiers:
cannot be used alone
always combined with a learning algorithm
examples: boosting, bagging, etc.
15
The Classification Tab
Screenshot callouts:
Choice of classifier
The attribute whose value is to be predicted from the values of the remaining ones. The default is the last attribute; here (in our toy example) it is named “class”.
Cross-validation: split the data into e.g. 10 folds, and 10 times train on 9 folds and test on the remaining one.
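A small sketch of what that cross-validation option does (plain Python, with a hypothetical train_and_test() callback standing in for whatever classifier is chosen; an illustration, not WEKA's implementation):

import random

def cross_validate(examples, train_and_test, k=10, seed=1):
    """k-fold cross-validation: shuffle, split into k folds, and k times
    train on k-1 folds and test on the held-out one; returns the mean accuracy."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        accuracies.append(train_and_test(train, test))  # hypothetical callback
    return sum(accuracies) / k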
16
Choosing a classifier
17
18
Screenshot callouts (Naïve Bayes options):
False: Gaussian; True: kernels (better)
displays synopsis and options
numerical-to-nominal conversion by discretization
outputs additional information
19
20
21
Screenshot callouts: the confusion matrix (all other numbers can be obtained from it); the different/easy class; the accuracy.
22
Confusion matrix
Contains information about the actual and the predicted classification.

            predicted
              –     +
true   –      a     b
       +      c     d

All measures can be derived from it:
accuracy: (a+d)/(a+b+c+d)
recall: d/(c+d) => R
precision: d/(b+d) => P
F-measure: 2PR/(P+R)
false positive (FP) rate: b/(a+b)
true negative (TN) rate: a/(a+b)
false negative (FN) rate: c/(c+d)
These extend to more than 2 classes: see the previous lecture slides for details.
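A small worked example of deriving these measures from the four cells (plain Python; the counts are made up for illustration):

def measures(a, b, c, d):
    """a, b, c, d as in the 2x2 matrix above: rows are the true class (-, +),
    columns are the predicted class (-, +)."""
    accuracy = (a + d) / (a + b + c + d)
    recall = d / (c + d)             # R
    precision = d / (b + d)          # P
    f_measure = 2 * precision * recall / (precision + recall)
    fp_rate = b / (a + b)
    tn_rate = a / (a + b)
    fn_rate = c / (c + d)
    return accuracy, precision, recall, f_measure, fp_rate, tn_rate, fn_rate

# Illustrative counts only:
print(measures(a=50, b=10, c=5, d=35))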
23
Predictions Output
Outputs the probability distribution for each example.
24
Predictions Output
Probability distribution for a wrong example: predicted class 1 instead of class 3.
Naïve Bayes makes incorrect conditional independence assumptions and is typically over-confident in its predictions, regardless of whether they are correct or not.
25
Error Visualization
26
Error Visualization
Little squares designate errors
Axes show example number
27
Slide adapted from Eibe Frank's
Explorer: Attribute Selection
Find which attributes are the most predictive ones.
Two parts:
search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
evaluation method: information gain, chi-squared, etc.
Very flexible: WEKA allows (almost) arbitrary combinations of these two.
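For instance, the information gain evaluation method scores an attribute by how much knowing its value reduces the entropy of the class. A minimal sketch for discrete attribute values (plain Python; not WEKA's implementation):

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(attribute_values, class_labels):
    """IG(class; attribute) = H(class) - H(class | attribute), estimated from counts."""
    total = len(class_labels)
    by_value = {}
    for v, y in zip(attribute_values, class_labels):
        by_value.setdefault(v, []).append(y)
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    return entropy(class_labels) - conditional

# Example (a binary "contains the word 'hockey'" attribute against three classes):
# information_gain([1, 1, 0, 0], ['hockey', 'hockey', 'forsale', 'graphics'])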
28
Individual Features Ranking
29
misc.forsale
comp.graphics
rec.sport.hockey
Individual Features Ranking
30
misc.forsale
comp.graphics
rec.sport.hockey
???
random number seed
Individual Features Ranking
31
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's
Feature Interactions
2-Way Interactions: feature correlation
(Figure: diagram of two features A and B and the category C, showing the importance of feature A, the importance of feature B, and their overlap: the feature correlation.)
32
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's
3-Way Interaction: what is common to A, B and C together, and cannot be inferred from pairs of features.
Feature Interactions
(Figure: the same diagram of features A, B and the category C, now highlighting the 3-way overlap among all three.)
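One common way to quantify such a 3-way interaction is the interaction information I(A;B;C) = I(A;B|C) - I(A;B) used by Jakulin and Bratko. A rough sketch estimating it from discrete samples (plain Python; an illustration, not the authors' code):

from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) estimated from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def interaction_information(a_values, b_values, class_labels):
    """I(A;B;C) = I(A;B|C) - I(A;B): what A, B and the class share jointly,
    beyond what any pair of them shares."""
    n = len(class_labels)
    by_class = {}
    for a, b, c in zip(a_values, b_values, class_labels):
        by_class.setdefault(c, ([], []))
        by_class[c][0].append(a)
        by_class[c][1].append(b)
    conditional = sum(len(sub_a) / n * mutual_information(sub_a, sub_b)
                      for sub_a, sub_b in by_class.values())
    return conditional - mutual_information(a_values, b_values)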
33
Slide adapted from Guozhu Dong's
Feature Subset Selection
Problem illustration: the search space ranges from the empty set to the full feature set (enumeration).
Search strategies:
exhaustive/complete (enumeration, branch & bound)
heuristic (sequential forward/backward)
stochastic (generate and evaluate)
Generation/evaluation of individual features or of feature subsets.
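As an illustration of the heuristic (sequential forward) strategy above, a greedy sketch with a hypothetical evaluate(subset) scoring function (not WEKA's code):

def forward_selection(all_features, evaluate, max_size=None):
    """Greedy sequential forward selection: start from the empty set and keep
    adding the single feature that improves the score the most."""
    limit = len(all_features) if max_size is None else min(max_size, len(all_features))
    selected, best_score = [], float('-inf')
    while len(selected) < limit:
        candidates = [f for f in all_features if f not in selected]
        score, best_f = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:  # stop when no candidate improves the score
            break
        best_score, selected = score, selected + [best_f]
    return selected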
34
Feature Subset Selection
35
misc.forsale
comp.graphics
rec.sport.hockey
17,309 subsets considered; 21 attributes selected
Feature Subset Selection
36
Saving the Selected Features
All we can do from this tab is save the output buffer to a text file. Not very useful...
But we can also perform feature selection during the pre-processing step (see the following slides).
37
Feature Selection in Preprocessing
38
Feature Selection in Preprocessing
39
Feature Selection in Preprocessing
679 attributes: 678 + 1 (for the class)
40
Feature Selection in Preprocessing
Just 22 attributes remain:
21 + 1 (for the class)
41
Run Naïve Bayes With the 21 Features
higher accuracy
21 Attributes
42
different/easy class
accuracy
(AGAIN) Naïve Bayes With All Features
ALL 679 Attributes (repeated slide)
43
Sometimes WEKA uses unusual names for some algorithms.
Here is how to find the algorithms Barbara introduced:
Naïve Bayes: weka.classifiers.bayes.NaiveBayes
Perceptron: weka.classifiers.functions.VotedPerceptron
Winnow: weka.classifiers.functions.Winnow
Decision tree: weka.classifiers.trees.J48
Support vector machines: weka.classifiers.functions.SMO
k nearest neighbor: weka.classifiers.lazy.IBk
Some of these are more sophisticated versions of the classic algorithms
e.g. I cannot find the classic Naïve Bayes in WEKA (although there are 5 available implementations).
Some Important Algorithms
44
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
45
Slide adapted from Eibe Frank's
Performing Experiments
The Experimenter makes it easy to compare the performance of different learning schemes.
Problems: classification, regression
Results: written into a file or a database
Evaluation options:
cross-validation
learning curve
hold-out
Can also iterate over different parameter settings.
Significance testing built in!
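The significance testing compares, per dataset, the scores of two schemes across the cross-validation runs (WEKA uses a corrected variant of the paired t-test). A plain paired t-test conveys the idea; the per-fold accuracies below are made up:

from math import sqrt

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-fold accuracies of two learning schemes;
    compare it against the t distribution with n-1 degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / sqrt(var / n)

# Made-up per-fold accuracies, e.g. SVM vs. Naive Bayes:
print(paired_t([0.92, 0.90, 0.94, 0.91, 0.93], [0.88, 0.87, 0.91, 0.86, 0.90]))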
46
Experiments Setup
47
Experiments Setup
48
Experiments Setup
CSV file: can be opened in Excel
datasets
algorithms
49
Experiments Setup
50
Experiments Setup
51
Experiments Setup
52
Experiments Setup
53
Experiments Setup
accuracy
SVM is the best
Decision tree is the worst
SVM is statistically better than Naïve Bayes
Decision tree is statistically worse than Naïve Bayes
54
Experiments: Excel
Results are output into a CSV file, which can be read in Excel!
55
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
56
Slide adapted from Eibe Frank's
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
WEKA File Format: ARFF
Other attribute types: String, Date
Callouts in the example: numerical attribute, nominal attribute, missing value (“?”)
57
WEKA File Format: Sparse ARFF
Value 0 is not represented explicitly.
Same header (i.e. @relation and @attribute tags); the @data section is different.
Instead of
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
we have
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
This is especially useful for textual data (why?)
But! Problems with feature selection: cannot save results.
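A minimal sketch of writing such a sparse ARFF file from per-document term counts (plain Python; the function and its arguments are illustrative, not part of the course code):

def write_sparse_arff(path, relation, vocabulary, classes, documents):
    """documents: a list of (term_counts_dict, class_label) pairs.
    Only non-zero values are written, as 'index value' pairs in curly braces."""
    index = {term: i for i, term in enumerate(vocabulary)}
    class_index = len(vocabulary)  # the class is the last attribute
    with open(path, 'w') as out:
        out.write('@relation %s\n\n' % relation)
        for term in vocabulary:
            out.write('@attribute %s numeric\n' % term)
        out.write('@attribute class {%s}\n\n@data\n' % ','.join(classes))
        for counts, label in documents:
            cells = ['%d %d' % (index[t], counts[t]) for t in sorted(counts, key=index.get)]
            cells.append('%d %s' % (class_index, label))
            out.write('{%s}\n' % ', '.join(cells))

# Example:
# write_sparse_arff('toy.arff', 'newsgroups',
#                   ['goalie', 'sale', 'jpeg'],
#                   ['rec.sport.hockey', 'misc.forsale', 'comp.graphics'],
#                   [({'goalie': 3}, 'rec.sport.hockey'),
#                    ({'sale': 2, 'jpeg': 1}, 'misc.forsale')])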
58
Python Interface to WEKA
Works on the 20 newsgroups collection
Extracts the features:
currently words
easy to modify, just change one or more of:
– extract_features_and_freqs()
– is_feature_good()
– build_stoplist()
Allows filtering out:
the stopwords
the infrequent features
Features are weighted by their frequency in the document
Produces an ARFF file to be used by WEKA
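For example, a hypothetical is_feature_good() along the lines described above might look like this (the real function in the course code may differ):

import re

WORD_PATTERN = re.compile(r'^[a-z]+$')  # assumed pattern: lowercase words only

def is_feature_good(word, freq, stoplist, min_freq=2):
    """Keep a (lowercased) word as a feature only if it matches the required
    pattern, is frequent enough, and is not a stopword."""
    return (WORD_PATTERN.match(word) is not None
            and freq >= min_freq
            and word not in stoplist)

# Example: is_feature_good('hockey', 5, {'the', 'and', 'of'})  -> True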
59
Python Interface to WEKA
Allows specifying:
which subset of classes to consider
the number of documents for each class
the minimum feature frequency
a regular expression pattern a feature should match
whether to remove the stopwords
whether to convert words to lowercase
the kind of output to produce:
sparse (i.e., feature = value)
full vector (list of values)
60
Python Interface to WEKA: How To
Needs the "20_newsgroups" and "stopwords" corpora installed.
To get things working under Windows: open "__init__.py" and, in the code below, substitute "/" with "\\".
#####################################################
# 20 Newsgroups
groups = [(ng, ng+'/.*') for ng in '''
    alt.atheism              rec.autos            sci.space
    comp.graphics            rec.motorcycles      soc.religion.christian
    comp.os.ms-windows.misc  rec.sport.baseball   talk.politics.guns
    comp.sys.ibm.pc.hardware rec.sport.hockey     talk.politics.mideast
    comp.sys.mac.hardware    sci.crypt            talk.politics.misc
    comp.windows.x           sci.electronics      talk.religion.misc
    misc.forsale             sci.med'''.split()]
twenty_newsgroups = SimpleCorpusReader(
    '20_newsgroups', '20_newsgroups/', '.*/.*', groups,
    description_file='../20_newsgroups.readme')
del groups  # delete temporary variable
61
Python Interface to WEKA
The Main Function
62
Python Interface to WEKA
Example Usage
Python dictionary
Estimated over the whole set! cross-validation: OK; test/train: not OK
Use 1
63
Python Interface to WEKA
Functions You Will Probably Want To Modify
convert to lowercase
Also: stemming!
Also: word+POS!
Also: compounds!
64
Python Interface to WEKA
You might want to add… Stemming
Porter stemmer
>>> cats = Token(TEXT='cats', POS='NN')
>>> from nltk.stemmer.porter import *
>>> porter = PorterStemmer()
>>> porter.stem(cats)
>>> print cats
<POS='NN', STEM='cat', TEXT='cats'>
WordNet stemmer: morphy – a morphological analyzer
you need the following packages installed:
– nltk.wordnet
– nltk-contrib.pywordnet
>>> from nltk_contrib.pywordnet.stemmer import *
>>> morphy('dogs')
'dog'
65
Python Interface to WEKA
You might want to add… TF.IDF
TF.IDF: tf_ij * log(N / n_i)
TF:
– tf_ij: frequency of term i in document j
– this is how features are currently weighted
IDF: log(N / n_i)
– n_i: number of documents containing term i
– N: total number of documents
Modify the function extract_features_and_freqs_forall()
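A sketch of the idea (plain Python; the real extract_features_and_freqs_forall() has a different interface, so this is only an illustration):

from math import log

def tfidf_weights(documents):
    """documents: a list of dicts mapping term -> raw frequency in that document.
    Returns a parallel list of dicts with TF.IDF weights: tf_ij * log(N / n_i)."""
    N = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in doc:
            doc_freq[term] = doc_freq.get(term, 0) + 1  # n_i: documents containing term i
    return [{term: tf * log(N / doc_freq[term]) for term, tf in doc.items()}
            for doc in documents]

# Example:
# tfidf_weights([{'puck': 3, 'sale': 1}, {'sale': 4}, {'jpeg': 2}])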
66
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
67
Summary
The 20 Newsgroups Text Collection
WEKA: The Toolkit
Explorer
– Classification
– Feature selection
Experimenter
ARFF file format
Python Interface to WEKA
feature extraction
stemming
Weighting: TF.IDF
WEKA: Real-time Demo