1
I256: Applied Natural Language Processing
Preslav Nakov and Marti Hearst
October 16, 2006
(Many slides originally by Barbara Rosario, modified here)
2
Today
Classification
  Text categorization (and other applications)
Various issues regarding classification
  Clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification…
Introduce the steps necessary for a classification task
  Define classes
  Label text
  Features
  Training and evaluation of a classifier
3
From: Foundations of Statistical Natural Language Processing, Manning and Schütze
Classification
Goal: Assign ‘objects’ from a universe to two or more classes or categories
Examples:
Problem                    Object      Categories
Tagging                    Word        POS
Sense Disambiguation       Word        The word’s senses
Information retrieval      Document    Relevant/not relevant
Sentiment classification   Document    Positive/negative
Author identification      Document    Authors
4
Slide adapted from Paul Bennett
Text Categorization Applications
Web pages organized into category hierarchies
Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.)
Responses to Census Bureau occupations
Patents archived using International Patent Classification
Patient records coded using international insurance categories
E-mail message filtering
News events tracked and filtered by topics
Spam vs. anti-palm
6
Slide adapted from Paul Bennett
Why not a semi-automatic text categorization tool?
Humans can encode knowledge of what constitutes membership in a category.
This encoding can then be automatically applied by a machine to categorize new examples.
For example...
8
Slide adapted from Paul Bennett
Rule-based Approach to Text Categorization
Text in a Web Page
“Saeco revolutionized espresso brewing a decade ago by introducing Saeco SuperAutomatic machines, which go from bean to coffee at the touch of a button. The all-new Saeco Vienna Super-Automatic home coffee and cappucino machine combines top quality with low price!”
Rules
Rule 1. (espresso or coffee or cappucino) and machine* → Coffee Maker
Rule 2. automat* and answering and machine* → Phone
Rule ...
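The two rules above can be sketched in Python as follows. This is only an illustration, not the tool the slide describes: the rule encoding (a list of OR-groups that must all match) and the helper names `term_matches` and `classify_by_rules` are assumptions made for this example, and a trailing `*` is read as "any word ending".

```python
import re

def term_matches(term, text):
    """Return True if `term` occurs in `text`; a trailing * matches any word ending."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical encoding: every group must match (AND); within a group any term may match (OR).
RULES = [
    ([["espresso", "coffee", "cappucino"], ["machine*"]], "Coffee Maker"),
    ([["automat*"], ["answering"], ["machine*"]], "Phone"),
]

def classify_by_rules(text):
    """Return the categories of all rules that fire on `text`."""
    return [
        category
        for groups, category in RULES
        if all(any(term_matches(term, text) for term in group) for group in groups)
    ]

saeco_text = (
    "Saeco revolutionized espresso brewing a decade ago by introducing "
    "Saeco SuperAutomatic machines, which go from bean to coffee at the "
    "touch of a button. The all-new Saeco Vienna Super-Automatic home coffee "
    "and cappucino machine combines top quality with low price!"
)
print(classify_by_rules(saeco_text))
```

Rule 2 does not fire on the Saeco text because "answering" never occurs, even though "automat*" and "machine*" both match.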
9
Slide adapted from Paul Bennett
Defining Rules By Hand
This is fine for low-stakes applications
  Google and Yahoo alerts allow users to automatically receive news articles containing certain keywords
  Called “filtering” or “routing”
  Works fine when it’s ok to miss some things
But when high accuracy is required, experience has shown that defining rules by hand is
  too time consuming
  too difficult
  prone to inconsistency (as the rule set gets large)
11
Slide adapted from Paul Bennett
Cost of Manual Text Categorization
Yahoo!
  200 (?) people for manual labeling of Web pages using a hierarchy of 500,000 categories
MEDLINE (National Library of Medicine)
  $2 million/year for manual indexing of journal articles using Medical Subject Headings (MeSH, 18,000 categories)
Mayo Clinic
  $1.4 million annually for coding patient-record events using the International Classification of Diseases (ICD) for billing insurance companies
US Census Bureau decennial census (1990: 22 million responses)
  232 industry categories and 504 occupation categories
  $15 million if fully done by hand
12
Slide adapted from Paul Bennett
Knowledge Engineering vs. Statistical Learning
For US Census Bureau Decennial Census 1990
  232 industry categories and 504 occupation categories
  $15 million if fully done by hand
Knowledge engineering: define classification rules manually
  Expert System AIOCS
  Development time: 192 person-months (2 people, 8 years)
  Accuracy = 47%
Statistical learning: learn classification function
  Nearest Neighbor classification (Creecy ’92: 1-NN)
  Development time: 4 person-months (Thinking Machines)
  Accuracy = 60%
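To make the 1-NN idea concrete, here is a minimal sketch of nearest-neighbor text classification. It is not the Creecy '92 system: the bag-of-words representation, the cosine similarity, and the tiny labeled examples (with made-up category names) are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_neighbor_classify(text, labeled_examples):
    """1-NN: return the label of the most similar labeled example."""
    query = bag_of_words(text)
    _, best_label = max(
        labeled_examples, key=lambda ex: cosine(query, bag_of_words(ex[0]))
    )
    return best_label

# Hypothetical labeled responses; the category names are invented for this sketch.
examples = [
    ("registered nurse at a hospital", "health care"),
    ("drives a delivery truck", "transportation"),
    ("teaches high school math", "education"),
]
print(nearest_neighbor_classify("nurse in a city hospital", examples))
```

There is no separate training step here: all the "knowledge" is in the labeled examples, which is part of why development time can be so much shorter than writing rules by hand.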
13
Text Topic categorization
Topic categorization: classify the document into semantic topics
The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.
One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.
14
The Reuters collection
A gold standard
Collection of 21,578 newswire documents
For research purposes: a standard text collection to compare systems and algorithms
135 valid topic categories
16
Reuters Document Example
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
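As a rough sketch of how the labels and the train/test marker could be pulled out of such a record, the snippet below uses regular expressions on an abbreviated copy of the document above. The function name `parse_reuters` and the returned dictionary keys are assumptions for this example; a real pipeline would more likely use a proper SGML parser or an existing corpus reader.

```python
import re

# Abbreviated copy of the example document (body omitted).
doc = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
</REUTERS>"""

def parse_reuters(record):
    """Extract the topic labels, title, and LEWISSPLIT value from one record."""
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", record, flags=re.DOTALL)
    topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
    title = re.search(r"<TITLE>(.*?)</TITLE>", record, flags=re.DOTALL)
    split = re.search(r'LEWISSPLIT="([^"]*)"', record)
    return {
        "topics": topics,
        "title": title.group(1).strip() if title else None,
        "split": split.group(1) if split else None,
    }

print(parse_reuters(doc))
```

Note that this one document carries two topic labels (livestock and hog), so the collection is a multi-category labeling task.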
17
Classification vs. Clustering
Classification assumes labeled data: we know how many classes there are and we have examples for each class
Classification is supervised
In clustering we don’t have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are
Clustering is unsupervised
27
Categories (Labels, Classes)
Labeling data
2 problems:
Decide the possible classes (which ones, how many)
  Domain and application dependent
Label text
  Difficult, time consuming, inconsistency between annotators
28
Reuters Example, revisited
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
Why not topic = policy?
29
Binary vs. multi-way classification
Binary classification: two classes
Multi-way classification: more than two classes
Sometimes it can be convenient to treat a multi-way problem like a binary one: one class versus all the others, for all classes
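The one-versus-all-the-others reduction can be sketched as a relabeling step: each class gets its own binary problem in which examples of that class are positive and everything else is negative. The function name `one_vs_rest_labels` and the sample labels are assumptions for this illustration.

```python
def one_vs_rest_labels(labels):
    """Turn one multi-way labeling into one binary labeling per class.

    Returns a dict mapping each class to a list of booleans:
    True where the example belongs to that class, False otherwise.
    """
    classes = sorted(set(labels))
    return {cls: [label == cls for label in labels] for cls in classes}

labels = ["sport", "weather", "sport", "politics"]
binary_problems = one_vs_rest_labels(labels)
for cls, binary_labels in binary_problems.items():
    print(cls, binary_labels)
```

A binary classifier would then be trained on each of these labelings; how the individual decisions are combined (for example, by taking the class whose classifier is most confident) is a separate design choice not covered here.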
30
Flat vs. Hierarchical classification
Flat classification: relations between the classes undetermined
Hierarchical classification: the classes form a hierarchy in which each node is a sub-class of its parent node
31
Single- vs. multi-category classification
In single-category text classification each text belongs to exactly one category
In multi-category text classification, each text can have zero or more categories
32
Features
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
Here the classification takes as input the whole string
What’s the problem with that?
What are the features that could be useful for this example?
33
Feature terminology
Feature: An aspect of the text that is relevant to the task
Some typical features
  Words present in text
  Frequency of words
  Capitalization
  Are there NE (named entities)?
  WordNet
  Others?
34
Feature terminology
Feature: An aspect of the text that is relevant to the task
Feature value: the realization of the feature in the text
  Words present in text: Kerry, Schumacher, China…
  Frequency of word: Kerry(10), Schumacher(1)…
  Are there dates? Yes/no
  Are there PERSONS? Yes/no
  Are there ORGANIZATIONS? Yes/no
  WordNet: Holonyms (China is part of Asia), Synonyms (China, People's Republic of China, mainland China)
35
Feature Types
Boolean (or Binary) Features
Features that generate boolean (binary) values. Boolean features are the simplest and the most common type of feature.
f1(text) = 1 if text contains “Kerry”
           0 otherwise
f2(text) = 1 if text contains PERSON
           0 otherwise
36
Feature Types
Integer Features
Features that generate integer values. Integer features can be used to give classifiers access to more precise information about the text.
f1(text) = Number of times text contains “Kerry”
f2(text) = Number of times text contains PERSON
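A small sketch of boolean and integer word features in Python follows. The function name `extract_features` and the feature-name strings such as `contains(Kerry)` are assumptions for this example; the PERSON features are left out because they would need a named-entity recognizer.

```python
import re

def extract_features(text, words=("Kerry", "Schumacher")):
    """Build a feature dictionary with one boolean and one integer feature per word."""
    features = {}
    for word in words:
        # Count whole-word occurrences of `word` in the text.
        count = len(re.findall(r"\b" + re.escape(word) + r"\b", text))
        features["contains(%s)" % word] = 1 if count > 0 else 0  # boolean feature
        features["count(%s)" % word] = count                     # integer feature
    return features

text = ("Seven-time Formula One champion Michael Schumacher took on the "
        "Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix.")
print(extract_features(text))
```

The classifier then sees only this dictionary of feature values rather than the whole string.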
38
Classification
Define classes
Label text
Extract features
Choose a classifier
  The Naive Bayes classifier
  NN (perceptron)
  SVM
  …
Train it (and test it)
Use it to classify new examples
>>> my_classifier.classify(token)
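To show the train-then-classify steps end to end, here is a minimal Naive Bayes sketch in plain Python. It is not the toolkit API used in the slides: the class name `NaiveBayes`, its `train` and `classify` methods, the whitespace tokenizer, the add-one smoothing, and the four toy training texts are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

class NaiveBayes:
    def train(self, labeled_texts):
        """Estimate class priors and per-class word counts from (text, label) pairs."""
        self.doc_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in labeled_texts:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        self.total_docs = sum(self.doc_counts.values())

    def classify(self, text):
        """Return the label with the highest log-probability score."""
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / self.total_docs)
            total = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    # Add-one (Laplace) smoothing avoids zero probabilities.
                    p = (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                    score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for this sketch.
training = [
    ("the race at the grand prix circuit", "sport"),
    ("champion wins the race", "sport"),
    ("hurricane winds and heavy rain", "weather"),
    ("storm warnings and rain", "weather"),
]
my_classifier = NaiveBayes()
my_classifier.train(training)
print(my_classifier.classify("grand prix race"))
print(my_classifier.classify("rain and winds"))
```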
39
Training
• Usually the classifier is defined by a set of parameters
• Training is the procedure for finding a “good” set of parameters
• Goodness is determined by an optimization criterion such as misclassification rate
• Some classifiers are guaranteed to find the optimal set of parameters
40
Testing, evaluation of the classifier
After choosing the parameters of the classifier (i.e. after training it) we need to test how well it’s doing on a test set (not included in the training set)
Calculate the misclassification rate on the test set
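A sketch of holding out a test set and computing the misclassification rate follows. The helper names `train_test_split` and `misclassification_rate`, the 20% test fraction, and the fixed random seed are assumptions for this illustration; `classify` can be any function that maps a text to a label.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the labeled examples and split them into disjoint train and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def misclassification_rate(classify, test_set):
    """Fraction of test examples whose predicted label differs from the true label."""
    errors = sum(1 for text, label in test_set if classify(text) != label)
    return errors / len(test_set)

# Hypothetical labeled data: ten (text, label) pairs.
examples = [("text %d" % i, "a" if i % 2 == 0 else "b") for i in range(10)]
train_set, test_set = train_test_split(examples)

# A trivial stand-in classifier that always answers "a".
always_a = lambda text: "a"
print(misclassification_rate(always_a, test_set))
```

Keeping the test examples out of training is what makes the measured error an estimate of performance on new text.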
41
Evaluating classifiers
Contingency table for the evaluation of a binary classifier

                      GREEN is correct   RED is correct
GREEN was assigned           a                  b
RED was assigned             c                  d

Accuracy = (a+d)/(a+b+c+d)
Precision: P_GREEN = a/(a+b), P_RED = d/(c+d)
Recall: R_GREEN = a/(a+c), R_RED = d/(b+d)
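The formulas above translate directly into code. In this sketch the function name `evaluate`, the dictionary keys, and the example counts are assumptions for illustration; a, b, c, d are the four cells of the contingency table.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def evaluate(a, b, c, d):
    """Accuracy, precision, and recall from the binary contingency table.

    a: GREEN assigned, GREEN correct    b: GREEN assigned, RED correct
    c: RED assigned, GREEN correct      d: RED assigned, RED correct
    """
    return {
        "accuracy": safe_div(a + d, a + b + c + d),
        "precision_green": safe_div(a, a + b),
        "precision_red": safe_div(d, c + d),
        "recall_green": safe_div(a, a + c),
        "recall_red": safe_div(d, b + d),
    }

# Made-up counts for a test set of 100 examples.
scores = evaluate(a=40, b=10, c=5, d=45)
for name, value in scores.items():
    print(name, round(value, 3))
```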
42
*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang
Training size
The more the better! (usually)
Results for text classification*
45
From: Authorship Attribution: A Comparison of Three Methods, Matthew Care
Training Size
Author identification