SIMS 290-2: Applied Natural Language Processing. Preslav Nakov, October 6, 2004.
1
SIMS 290-2: Applied Natural Language Processing
Preslav Nakov
October 6, 2004
2
Today
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
3
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
4
20 Newsgroups Data Set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Source: originally collected by Ken Lang
Content and structure:
approximately 20,000 newsgroup documents
– 19,997 originally
– 18,828 without duplicates
partitioned evenly across 20 different newsgroups
Some categories are strongly related (and thus hard to discriminate):
computers: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
sci.crypt, sci.electronics, sci.med, sci.space
misc.forsale, talk.politics.misc, talk.politics.guns, talk.politics.mideast
talk.religion.misc, alt.atheism, soc.religion.christian
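For orientation, a minimal sketch that counts the documents per group (plain Python; it assumes the collection has been unpacked into a local 20_newsgroups/ directory with one subdirectory per newsgroup, as in the distribution above):

import os

# Assumed local path to the unpacked collection: one subdirectory per
# newsgroup, one file per posting (the layout of the standard distribution).
CORPUS_DIR = '20_newsgroups'

counts = {}
for group in sorted(os.listdir(CORPUS_DIR)):
    group_dir = os.path.join(CORPUS_DIR, group)
    if os.path.isdir(group_dir):
        counts[group] = len(os.listdir(group_dir))

for group, n in sorted(counts.items()):
    print(group, n)
print('total documents:', sum(counts.values()))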
5
Sample Posting: “talk.politics.guns”
From: [email protected] (C. D. Tavares)
Subject: Re: Congress to review ATF's status
In article <[email protected]>, [email protected] (Larry Cipriani) writes:
> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.
Of course. When the catbox begins to smell, simply transfer its contents into the potted plant in the foyer.
"Why Hillary! Your government smells so... FRESH!"
--
[email protected] -- If you believe that I speak for my company,
OR [email protected]    write today for my special Investors' Packet...
Callouts on the slide: the reply (the “… writes:” line and the quoted text), the From and Subject headers, and the signature need special handling during feature extraction…
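A minimal sketch of such handling (plain Python; the splitting heuristics are illustrative assumptions, not the preprocessing actually used in the course code):

def strip_posting(raw_text):
    """Separate a posting into header fields, body text, and signature,
    and drop quoted reply lines; a rough illustration only."""
    header_part, _, body = raw_text.partition('\n\n')  # headers end at the first blank line
    headers = dict(line.split(':', 1) for line in header_part.splitlines() if ':' in line)
    body = body.split('\n-- \n')[0]  # drop the signature after the conventional "-- " delimiter
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith('>')        # quoted reply lines
            and not line.rstrip().endswith('writes:')]  # attribution lines
    return headers.get('From', ''), headers.get('Subject', ''), '\n'.join(kept)

# Example: sender, subject, text = strip_posting(open('some_posting').read())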
6
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
7
Slide adapted from Eibe Frank's
WEKA: The Bird
Copyright: Martin Kramer ([email protected]), University of Waikato, New Zealand
8
WEKA: Terminology
Some synonyms/explanations for the terms used by WEKA, which may differ from the ones we adopted:
Attribute: feature
Relation: collection of examples (a dataset)
Instance: a single example
Class: category
9
Slide adapted from Eibe Frank's
WEKA: The Software Toolkit
Machine learning / data mining software in Java
GNU License
Used for research, education and applications
Complements “Data Mining” by Witten & Frank
Main features:
data pre-processing tools
learning algorithms
evaluation methods
graphical interface (incl. data visualization)
environment for comparing learning algorithms
http://www.cs.waikato.ac.nz/ml/weka
10
Slide adapted from Eibe Frank's
WEKA GUI Chooser
java -Xmx1000M -jar weka.jar
11
Slide adapted from Eibe Frank's
Our Toy Example
We demonstrate WEKA on a toy example:
3 categories from “20 Newsgroups”:
– misc.forsale
– rec.sport.hockey
– comp.graphics
20 documents per category
features:
– words converted to lowercase
– frequency 2 or more required
– stopwords removed
12
Slide adapted from Eibe Frank's
Explorer: Pre-Processing The Data
WEKA can import data from: files (ARFF, CSV, C4.5, binary), a URL, or an SQL database (using JDBC)
Pre-processing tools (filters) are used for: discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.
13
The Preprocessing Tab
Screenshot callouts: the Preprocessing and Classification tabs; filter selection; manual attribute selection; statistical attribute selection; list of attributes (last: the class variable); frequency and categories for the selected attribute; statistics about the values of the selected attribute.
14
Slide adapted from Eibe Frank's
Explorer: Building “Classifiers”
Classifiers in WEKA are models for:
classification (predict a nominal class)
regression (predict a numerical quantity)
Learning algorithms: Naïve Bayes, decision trees, kNN, support vector machines, multi-layer perceptron, logistic regression, etc.
Meta-classifiers:
cannot be used alone
always combined with a learning algorithm
examples: boosting, bagging, etc.
15
The Classification Tab
Screenshot callouts:
Choice of classifier
The attribute whose value is to be predicted from the values of the remaining ones. The default is the last attribute; here (in our toy example) it is named “class”.
Cross-validation: split the data into e.g. 10 folds, and 10 times train on 9 folds and test on the remaining one.
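A small sketch of what that cross-validation option does (plain Python, with a hypothetical train_and_test() callback standing in for whatever classifier is chosen; an illustration, not WEKA's implementation):

import random

def cross_validate(examples, train_and_test, k=10, seed=1):
    """k-fold cross-validation: shuffle, split into k folds, and k times
    train on k-1 folds and test on the held-out one; returns the mean accuracy."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        accuracies.append(train_and_test(train, test))  # hypothetical callback
    return sum(accuracies) / k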
16
Choosing a classifier
17
18
Screenshot callouts (Naïve Bayes options):
False: Gaussian; True: kernels (better)
displays synopsis and options
numerical-to-nominal conversion by discretization
outputs additional information
19
20
21
Screenshot callouts: the confusion matrix (all other numbers can be obtained from it); the different/easy class; the accuracy.
22
Confusion matrix
Contains information about the actual and the predicted classification.

            predicted
              –     +
true   –      a     b
       +      c     d

All measures can be derived from it:
accuracy: (a+d)/(a+b+c+d)
recall: d/(c+d) => R
precision: d/(b+d) => P
F-measure: 2PR/(P+R)
false positive (FP) rate: b/(a+b)
true negative (TN) rate: a/(a+b)
false negative (FN) rate: c/(c+d)
These extend to more than 2 classes: see the previous lecture slides for details.
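A small worked example of deriving these measures from the four cells (plain Python; the counts are made up for illustration):

def measures(a, b, c, d):
    """a, b, c, d as in the 2x2 matrix above: rows are the true class (-, +),
    columns are the predicted class (-, +)."""
    accuracy = (a + d) / (a + b + c + d)
    recall = d / (c + d)             # R
    precision = d / (b + d)          # P
    f_measure = 2 * precision * recall / (precision + recall)
    fp_rate = b / (a + b)
    tn_rate = a / (a + b)
    fn_rate = c / (c + d)
    return accuracy, precision, recall, f_measure, fp_rate, tn_rate, fn_rate

# Illustrative counts only:
print(measures(a=50, b=10, c=5, d=35))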
23
Predictions Output
Outputs the probability distribution for each example.
24
Predictions Output
Probability distribution for a wrong example: predicted class 1 instead of class 3.
Naïve Bayes makes incorrect conditional independence assumptions and is typically over-confident in its predictions, regardless of whether they are correct or not.
25
Error Visualization
26
Error Visualization
Little squares designate errors
Axes show example number
27
Slide adapted from Eibe Frank's
Explorer: Attribute Selection
Find which attributes are the most predictive ones.
Two parts:
search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
evaluation method: information gain, chi-squared, etc.
Very flexible: WEKA allows (almost) arbitrary combinations of these two.
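For instance, the information gain evaluation method scores an attribute by how much knowing its value reduces the entropy of the class. A minimal sketch for discrete attribute values (plain Python; not WEKA's implementation):

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(attribute_values, class_labels):
    """IG(class; attribute) = H(class) - H(class | attribute), estimated from counts."""
    total = len(class_labels)
    by_value = {}
    for v, y in zip(attribute_values, class_labels):
        by_value.setdefault(v, []).append(y)
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    return entropy(class_labels) - conditional

# Example (a binary "contains the word 'hockey'" attribute against three classes):
# information_gain([1, 1, 0, 0], ['hockey', 'hockey', 'forsale', 'graphics'])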
28
Individual Features Ranking
29
misc.forsale
comp.graphics
rec.sport.hockey
Individual Features Ranking
30
misc.forsale
comp.graphics
rec.sport.hockey
???
random number seed
Individual Features Ranking
31
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's
Feature Interactions
2-Way Interactions: feature correlation
(Figure: diagram of two features A and B and the category C, showing the importance of feature A, the importance of feature B, and their overlap: the feature correlation.)
32
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's
3-Way Interaction: what is common to A, B and C together, and cannot be inferred from pairs of features.
Feature Interactions
(Figure: the same diagram of features A, B and the category C, now highlighting the 3-way overlap among all three.)
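One common way to quantify such a 3-way interaction is the interaction information I(A;B;C) = I(A;B|C) - I(A;B) used by Jakulin and Bratko. A rough sketch estimating it from discrete samples (plain Python; an illustration, not the authors' code):

from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) estimated from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def interaction_information(a_values, b_values, class_labels):
    """I(A;B;C) = I(A;B|C) - I(A;B): what A, B and the class share jointly,
    beyond what any pair of them shares."""
    n = len(class_labels)
    by_class = {}
    for a, b, c in zip(a_values, b_values, class_labels):
        by_class.setdefault(c, ([], []))
        by_class[c][0].append(a)
        by_class[c][1].append(b)
    conditional = sum(len(sub_a) / n * mutual_information(sub_a, sub_b)
                      for sub_a, sub_b in by_class.values())
    return conditional - mutual_information(a_values, b_values)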
33
Slide adapted from Guozhu Dong's
Feature Subset Selection
Problem illustration: the search space ranges from the empty set to the full feature set (enumeration).
Search strategies:
exhaustive/complete (enumeration, branch & bound)
heuristic (sequential forward/backward)
stochastic (generate and evaluate)
Generation/evaluation of individual features or of feature subsets.
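As an illustration of the heuristic (sequential forward) strategy above, a greedy sketch with a hypothetical evaluate(subset) scoring function (not WEKA's code):

def forward_selection(all_features, evaluate, max_size=None):
    """Greedy sequential forward selection: start from the empty set and keep
    adding the single feature that improves the score the most."""
    limit = len(all_features) if max_size is None else min(max_size, len(all_features))
    selected, best_score = [], float('-inf')
    while len(selected) < limit:
        candidates = [f for f in all_features if f not in selected]
        score, best_f = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:  # stop when no candidate improves the score
            break
        best_score, selected = score, selected + [best_f]
    return selected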
34
Feature Subset Selection
35
misc.forsale
comp.graphics
rec.sport.hockey
17,309 subsets considered; 21 attributes selected
Feature Subset Selection
36
Saving the Selected Features
All we can do from this tab is save the output buffer to a text file. Not very useful...
But we can also perform feature selection during the pre-processing step (see the following slides).
37
Feature Selection in Preprocessing
38
Feature Selection in Preprocessing
39
Feature Selection in Preprocessing
679 attributes: 678 + 1 (for the class)
40
Feature Selection in Preprocessing
Just 22 attributes remain:
21 + 1 (for the class)
41
Run Naïve Bayes With the 21 Features
higher accuracy
21 Attributes
42
different/easy class
accuracy
(AGAIN) Naïve Bayes With All Features
ALL 679 Attributes (repeated slide)
43
Sometimes WEKA uses unusual names for some algorithms.
Here is how to find the algorithms Barbara introduced:
Naïve Bayes: weka.classifiers.bayes.NaiveBayes
Perceptron: weka.classifiers.functions.VotedPerceptron
Winnow: weka.classifiers.functions.Winnow
Decision tree: weka.classifiers.trees.J48
Support vector machines: weka.classifiers.functions.SMO
k nearest neighbor: weka.classifiers.lazy.IBk
Some of these are more sophisticated versions of the classic algorithms
e.g. I cannot find the classic Naïve Bayes in WEKA (although there are 5 available implementations).
Some Important Algorithms
44
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
45
Slide adapted from Eibe Frank's
Performing Experiments
The Experimenter makes it easy to compare the performance of different learning schemes.
Problems: classification, regression
Results: written into a file or a database
Evaluation options:
cross-validation
learning curve
hold-out
Can also iterate over different parameter settings.
Significance testing built in!
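The significance testing compares, per dataset, the scores of two schemes across the cross-validation runs (WEKA uses a corrected variant of the paired t-test). A plain paired t-test conveys the idea; the per-fold accuracies below are made up:

from math import sqrt

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-fold accuracies of two learning schemes;
    compare it against the t distribution with n-1 degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / sqrt(var / n)

# Made-up per-fold accuracies, e.g. SVM vs. Naive Bayes:
print(paired_t([0.92, 0.90, 0.94, 0.91, 0.93], [0.88, 0.87, 0.91, 0.86, 0.90]))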
46
Experiments Setup
47
Experiments Setup
48
Experiments Setup
CSV file: can be opened in Excel
datasets
algorithms
49
Experiments Setup
50
Experiments Setup
51
Experiments Setup
52
Experiments Setup
53
Experiments Setup
accuracy
SVM is the best
Decision tree is the worst
SVM is statistically better than Naïve Bayes
Decision tree is statistically worse than Naïve Bayes
54
Experiments: Excel
Results are output into a CSV file, which can be read in Excel!
55
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
56
Slide adapted from Eibe Frank's
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
WEKA File Format: ARFF
Other attribute types: String, Date
Callouts in the example: numerical attribute, nominal attribute, missing value (“?”)
57
WEKA File Format: Sparse ARFF
Value 0 is not represented explicitly.
Same header (i.e. @relation and @attribute tags); the @data section is different.
Instead of
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
we have
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
This is especially useful for textual data (why?)
But! Problems with feature selection: cannot save results.
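A minimal sketch of writing such a sparse ARFF file from per-document term counts (plain Python; the function and its arguments are illustrative, not part of the course code):

def write_sparse_arff(path, relation, vocabulary, classes, documents):
    """documents: a list of (term_counts_dict, class_label) pairs.
    Only non-zero values are written, as 'index value' pairs in curly braces."""
    index = {term: i for i, term in enumerate(vocabulary)}
    class_index = len(vocabulary)  # the class is the last attribute
    with open(path, 'w') as out:
        out.write('@relation %s\n\n' % relation)
        for term in vocabulary:
            out.write('@attribute %s numeric\n' % term)
        out.write('@attribute class {%s}\n\n@data\n' % ','.join(classes))
        for counts, label in documents:
            cells = ['%d %d' % (index[t], counts[t]) for t in sorted(counts, key=index.get)]
            cells.append('%d %s' % (class_index, label))
            out.write('{%s}\n' % ', '.join(cells))

# Example:
# write_sparse_arff('toy.arff', 'newsgroups',
#                   ['goalie', 'sale', 'jpeg'],
#                   ['rec.sport.hockey', 'misc.forsale', 'comp.graphics'],
#                   [({'goalie': 3}, 'rec.sport.hockey'),
#                    ({'sale': 2, 'jpeg': 1}, 'misc.forsale')])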
58
Python Interface to WEKA
Works on the 20 newsgroups collection
Extracts the features:
currently words
easy to modify, just change one or more of:
– extract_features_and_freqs()
– is_feature_good()
– build_stoplist()
Allows filtering out:
the stopwords
the infrequent features
Features are weighted by their frequency in the document
Produces an ARFF file to be used by WEKA
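For example, a hypothetical is_feature_good() along the lines described above might look like this (the real function in the course code may differ):

import re

WORD_PATTERN = re.compile(r'^[a-z]+$')  # assumed pattern: lowercase words only

def is_feature_good(word, freq, stoplist, min_freq=2):
    """Keep a (lowercased) word as a feature only if it matches the required
    pattern, is frequent enough, and is not a stopword."""
    return (WORD_PATTERN.match(word) is not None
            and freq >= min_freq
            and word not in stoplist)

# Example: is_feature_good('hockey', 5, {'the', 'and', 'of'})  -> True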
59
Python Interface to WEKA
Allows specifying:
which subset of classes to consider
the number of documents for each class
the minimum feature frequency
a regular expression pattern a feature should match
whether to remove the stopwords
whether to convert words to lowercase
the kind of output to produce:
sparse (i.e., feature = value)
full vector (list of values)
60
Python Interface to WEKA: How To
Needs the "20_newsgroups" and "stopwords" corpora installed.
To get things working under Windows: open "__init__.py" and, in the code below, substitute "/" with "\\".
#####################################################
# 20 Newsgroups
groups = [(ng, ng+'/.*') for ng in '''
    alt.atheism              rec.autos            sci.space
    comp.graphics            rec.motorcycles      soc.religion.christian
    comp.os.ms-windows.misc  rec.sport.baseball   talk.politics.guns
    comp.sys.ibm.pc.hardware rec.sport.hockey     talk.politics.mideast
    comp.sys.mac.hardware    sci.crypt            talk.politics.misc
    comp.windows.x           sci.electronics      talk.religion.misc
    misc.forsale             sci.med'''.split()]
twenty_newsgroups = SimpleCorpusReader(
    '20_newsgroups', '20_newsgroups/', '.*/.*', groups,
    description_file='../20_newsgroups.readme')
del groups  # delete temporary variable
61
Python Interface to WEKA
The Main Function
62
Python Interface to WEKA
Example Usage
Python dictionary
Estimated over the whole set! cross-validation: OK; test/train: not OK
Use 1
63
Python Interface to WEKA
Functions You Will Probably Want To Modify
convert to lowercase
Also: stemming!
Also: word+POS!
Also: compounds!
64
Python Interface to WEKA
You might want to add… Stemming
Porter stemmer
>>> cats = Token(TEXT='cats', POS='NN')
>>> from nltk.stemmer.porter import *
>>> porter = PorterStemmer()
>>> porter.stem(cats)
>>> print cats
<POS='NN', STEM='cat', TEXT='cats'>
WordNet stemmer: morphy – a morphological analyzer
you need the following packages installed:
– nltk.wordnet
– nltk-contrib.pywordnet
>>> from nltk_contrib.pywordnet.stemmer import *
>>> morphy('dogs')
'dog'
65
Python Interface to WEKA
You might want to add… TF.IDF
TF.IDF: tf_ij * log(N / n_i)
TF:
– tf_ij: frequency of term i in document j
– this is how features are currently weighted
IDF: log(N / n_i)
– n_i: number of documents containing term i
– N: total number of documents
Modify the function extract_features_and_freqs_forall()
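A sketch of the idea (plain Python; the real extract_features_and_freqs_forall() has a different interface, so this is only an illustration):

from math import log

def tfidf_weights(documents):
    """documents: a list of dicts mapping term -> raw frequency in that document.
    Returns a parallel list of dicts with TF.IDF weights: tf_ij * log(N / n_i)."""
    N = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in doc:
            doc_freq[term] = doc_freq.get(term, 0) + 1  # n_i: documents containing term i
    return [{term: tf * log(N / doc_freq[term]) for term, tf in doc.items()}
            for doc in documents]

# Example:
# tfidf_weights([{'puck': 3, 'sale': 1}, {'sale': 4}, {'jpeg': 2}])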
66
The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
67
Summary
The 20 Newsgroups Text Collection
WEKA: The Toolkit
Explorer
– Classification
– Feature selection
Experimenter
ARFF file format
Python Interface to WEKA
feature extraction
stemming
Weighting: TF.IDF
WEKA: Real-time Demo