Whitney St.Charles
Research Alliance in Math and Science 2007
Mentors: Yu (Cathy) Jiao, Ph.D.; Robert Patton, Ph.D.
Computational Sciences and Engineering Division
Using TF-IDF Anomalies to Cluster Documents on Subject Matter
Natural Language Processing And Computational Linguistics
An Analysis using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies
OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY
Purposes of document clustering
• Data overabundance
– YouTube generates 200 terabytes of data per day
– How do we sift through those kinds of quantities?
• Searching
– Reduces the set tremendously
• Document clustering
– Is a knowledge discovery technique
– Categorizes results into meaningful groups
– Allows the user to browse quickly to the target
Document clustering users
• Financial analysts
– Identify certain trends to develop forecasts about a particular company
• Business intelligence
– Identify products that are associated with or dependent upon one another
• Military
– Identify terrorist cells from blog activity and movement of materials
• You!
– Narrow down hundreds of thousands of internet search results to find the kinds of sites you want
Current document clustering technique
• A word-by-word comparison of each document is made to determine similarity
• Unfortunately, this method…
– Does not handle context very well
– Compares several hundred to several thousand words for each document
– Is very computationally expensive
– Requires expensive SIMD machines
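As a rough illustration of the word-by-word comparison on this slide, the sketch below compares full bag-of-words vectors. Cosine similarity is an assumption here (the slide does not name the similarity measure), and the three sample sentences are invented for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count every lowercase whitespace-separated token in the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented sample documents; every word contributes to the comparison.
doc1 = bag_of_words("the airline cancelled the flight")
doc2 = bag_of_words("the airline delayed the flight")
doc3 = bag_of_words("the chef cooked a good meal")

sim_close = cosine_similarity(doc1, doc2)
sim_far = cosine_similarity(doc1, doc3)
```

Because every token is a vector dimension, the cost grows with the full vocabulary of each document, which is the expense the slide points to.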
Contributions to the field
• Identify only those words which are more indicative of the subject matter
– If “airline” occurs 20% more than is “normal,” it has something to do with the subject
• Examine both simple and complex noun phrases to address the context of the document
• Generate much smaller vectors, containing an average of 82% fewer terms!
• Cluster more accurately because only “important” words are chosen
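One way to read the "20% more than normal" criterion is as a threshold on a term's rate in the document relative to a baseline rate. The sketch below is a minimal interpretation under that assumption. The function name `anomaly_terms`, the baseline rates, and the document counts are all hypothetical values chosen for illustration, not figures from the project.

```python
def anomaly_terms(counts, total, baseline_rate, threshold=0.20):
    """Return terms whose in-document rate exceeds the baseline by more than `threshold`."""
    selected = {}
    for term, count in counts.items():
        normal = baseline_rate.get(term)
        if normal is None:
            continue  # no baseline for this term: skipped in this sketch
        rate = count / total
        if rate > normal * (1 + threshold):
            selected[term] = rate / normal  # how many times "normal" the term occurs
    return selected

# Hypothetical baseline rates (fraction of all corpus tokens).
baseline_rate = {"the": 0.06, "was": 0.01, "airline": 0.0001, "flight": 0.0002}

# Hypothetical counts from a 1,000-token document.
doc_counts = {"the": 62, "was": 10, "airline": 9, "flight": 4}
anomalies = anomaly_terms(doc_counts, 1000, baseline_rate)
```

Only the terms that pass the threshold enter the document's vector, which is how the vectors end up much smaller than a full bag of words.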
Our method
Establishing the baseline
• Train the program to recognize what is “normal” for a given term
– Need an entire English language corpus
• Corpus: a large, structured set of texts compiled to be representative of a language
– Uses hundreds of thousands of words in every allowable way
• Using a corpus, the program can
– Establish usage statistics
– Learn linguistic rules
• Example: The Brown Corpus http://www.edict.com.hk/concordance/WWWConcappE.htm
Extracting words and phrases
Part-of-speech tagging
• Tags every word in the sentence with the correct part-of-speech
• Achieves an accuracy of 97.24%
• Is necessary because token extraction methods are each dependent upon correct tagging
• Passes the tagged sentence to the token extractor
Token extractor
• Extracts
– Words
– Simple noun phrases
– Complex noun phrases
Word extraction
• Uses POS-tagged data to identify only adjectives, verbs, and nouns
• Uses the Porter stemmer to identify unique words
– Cuts common suffixes such as -ing, -tion, -e, -es, -s
– Example: “recreation” and “recreational” are both identified as “recreat”
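The two steps on this slide (filter by part of speech, then stem) can be sketched as follows. The input is assumed to be already tagged with Penn Treebank style tags, which the slides do not specify. `toy_stem` is a crude suffix stripper written only for this illustration and is not the Porter stemmer; it happens to reproduce the "recreat" example but will differ from Porter on many other words.

```python
# Tag prefixes for nouns, verbs, and adjectives (Penn Treebank style, an assumption).
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")

# Crude stand-in suffix list, NOT the Porter stemmer's rule set.
TOY_SUFFIXES = ("ional", "ion", "ing", "es", "s", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_words(tagged_tokens):
    """Keep nouns, verbs, and adjectives, then reduce each to a toy stem."""
    return [toy_stem(word.lower())
            for word, tag in tagged_tokens
            if tag.startswith(CONTENT_TAG_PREFIXES)]

# Hand-tagged example sentence (tags assigned by hand for the illustration).
tagged = [("The", "DT"), ("recreational", "JJ"), ("park", "NN"),
          ("offers", "VBZ"), ("recreation", "NN"), ("for", "IN"),
          ("families", "NNS")]
words = extract_words(tagged)
```

In this example the determiner and preposition are dropped, and both "recreational" and "recreation" collapse to the same stem, so they count as one term.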
Why nouns?
• Are named entities
• Answer the question “What”
• Are less ambiguous than verbs
– Example: “cook up a good meal” or “cook up a new solution”
Simple noun phrase extraction
• Accepts only consecutive nouns
– Example: summer intern, union representative
• Provides a set of short, highly descriptive phrases
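A minimal sketch of consecutive-noun extraction over tagged tokens is shown below. It assumes Penn Treebank style noun tags (prefix "NN") and assumes that a phrase needs at least two nouns; the slide does not say whether a single noun counts. The function name and the sample sentence are hypothetical.

```python
def simple_noun_phrases(tagged_tokens):
    """Return runs of two or more consecutive nouns as lowercase phrases."""
    phrases, run = [], []
    # A sentinel with a non-noun tag flushes a run that ends the sentence.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag.startswith("NN"):
            run.append(word.lower())
        else:
            if len(run) >= 2:
                phrases.append(" ".join(run))
            run = []
    return phrases

# Hand-tagged example sentence.
tagged = [("The", "DT"), ("summer", "NN"), ("intern", "NN"), ("met", "VBD"),
          ("a", "DT"), ("union", "NN"), ("representative", "NN"),
          ("quickly", "RB")]
phrases = simple_noun_phrases(tagged)
```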
Complex noun phrase extraction techniques
• Static rule-based / finite state automata
– Rely on the aptitude of the linguist formulating the rule set
• Machine learning
– Rely on the “completeness” of the training set
Static rule-based extraction
• Establishes a list of linguistic rules
– A determiner preceding a noun marks the beginning of a noun phrase
– A determiner may not precede a noun phrase
[Diagram: noun phrase structure, with slots labeled determiner/adjective, noun/pronoun, adjective, relative clause/prepositional phrase/noun, and noun/pronoun/determiner]
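A static rule set of the kind described on this slide can be written as a small finite-state scan over part-of-speech tags. The sketch below hard-codes a single pattern, an optional determiner, any number of adjectives, then one or more nouns. That pattern and the Penn Treebank style tags are assumptions for illustration; the actual rule list used in the project is not given in the slides.

```python
def chunk_noun_phrases(tagged_tokens):
    """Greedy left-to-right chunker for the pattern: DT? JJ* NN+."""
    chunks, i, n = [], 0, len(tagged_tokens)
    while i < n:
        j = i
        if tagged_tokens[j][1] == "DT":          # optional determiner opens the phrase
            j += 1
        while j < n and tagged_tokens[j][1].startswith("JJ"):  # any adjectives
            j += 1
        noun_start = j
        while j < n and tagged_tokens[j][1].startswith("NN"):  # one or more nouns
            j += 1
        if j > noun_start:                        # at least one noun closed the pattern
            chunks.append(" ".join(word for word, _ in tagged_tokens[i:j]))
            i = j
        else:
            i += 1                                # no phrase starts here; move on
    return chunks

# Hand-tagged example sentence.
tagged = [("The", "DT"), ("man", "NN"), ("borrowed", "VBD"), ("my", "PRP$"),
          ("red", "JJ"), ("hat", "NN"), ("in", "IN"), ("the", "DT"),
          ("street", "NN")]
chunks = chunk_noun_phrases(tagged)
```

Any construction the pattern does not anticipate (here, the possessive "my") is simply missed, which is the kind of gap a hand-written rule set leaves.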
Static extraction shortcomings
• Unanticipated rules
• The subjective nature of language
• Difficulty finding non-recursive, base NPs
– [The man [whose red hat [I borrowed yesterday]RC ]RC [in the street]PP [that is next to my house]RC ]NP lives [next door]NP.
– [The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP.
• Structural ambiguity
Structural ambiguity example
“I saw the man with the telescope.”
Machine learning extraction
• Is all about
• Uses a corpus
• Is based on statistics
– The more it sees a particular occurrence, the more likely it is to prefer it
• Makes better educated guesses about structural ambiguity
• Discovers thousands of unanticipated rules
Transformation-based complex noun phrase extraction
An ‘error-driven’ approach for learning an ordered set of rules
1. Generate all rules that correct at least one error.
2. For each rule:
(a) Apply to a copy of the most recent state of the training set.
(b) Score result.
3. Select rule with best score.
4. Update training set by applying selected rule.
5. Stop if score is smaller than some pre-set threshold T; otherwise repeat from step 1.
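The five steps above can be sketched as a greedy loop. This is a toy version under stated assumptions: the single rule template "retag `old` as `new` when the previous tag is `prev`" is hypothetical, the score is taken to be the net reduction in errors, and the example uses generic part-of-speech tags rather than the noun phrase chunk labels the project would have trained on.

```python
def apply_rule(rule, tags):
    """Rule (old, new, prev): retag `old` as `new` when the previous tag is `prev`."""
    old, new, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == old and tags[i - 1] == prev:
            out[i] = new
    return out

def count_errors(tags, gold):
    """Number of positions where the current tags disagree with the gold tags."""
    return sum(t != g for t, g in zip(tags, gold))

def learn_rules(current, gold, threshold=1):
    """Greedy transformation-based learning; returns the ordered list of rules."""
    learned = []
    while True:
        # Step 1: generate every rule that would correct at least one current error.
        candidates = {(current[i], gold[i], current[i - 1])
                      for i in range(1, len(gold)) if current[i] != gold[i]}
        best_rule, best_score = None, 0
        for rule in sorted(candidates):
            # Step 2: apply the rule to a copy of the current state and score it.
            score = count_errors(current, gold) - count_errors(apply_rule(rule, current), gold)
            # Step 3: keep track of the best-scoring rule.
            if score > best_score:
                best_rule, best_score = rule, score
        # Step 5: stop once the best score falls below the threshold T.
        if best_rule is None or best_score < threshold:
            return learned
        # Step 4: update the training set by applying the selected rule.
        current = apply_rule(best_rule, current)
        learned.append(best_rule)

# Toy training data: an initial guess and the gold-standard tags.
initial = ["DT", "VB", "DT", "VB", "MD", "VB"]
gold = ["DT", "NN", "DT", "NN", "MD", "VB"]
rules = learn_rules(initial, gold)
```

Because each selected rule must reduce the error count, the loop terminates, and the order in which rules were learned is the order in which they would be applied to new text.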
Determining anomaly sets
• TF-IDF: Term Frequency – Inverse Document Frequency
– Number of local occurrences of term multiplied by uniqueness measure of term in document set
• TF-ICF: Term Frequency – Inverse Corpus Frequency
– Average number of corpus occurrences of term multiplied by uniqueness measure of term in the corpus
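As a small worked example of the TF-IDF definition on this slide, the sketch below weights each term by its local count times log(N / df), where N is the number of documents and df is the number of documents containing the term. This is one common textbook form of the weighting and an assumption here, since the slides do not give the exact formula; the sample documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # local occurrences of each term
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

# Invented sample document set.
docs = ["the airline cancelled the flight",
        "the chef cooked the meal",
        "the airline hired a chef"]
vectors = tf_idf(docs)
```

A term that appears in every document, such as "the", gets weight zero, while a term unique to one document gets the highest weight. Following the slide's description, a TF-ICF variant would take the uniqueness measure from the reference corpus instead of from the document set being clustered.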
Each document has its own anomaly vector
Clustering the data
Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
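UPGMA is average-linkage agglomerative clustering: repeatedly merge the two clusters whose members have the smallest average pairwise distance. The sketch below is a minimal pure-Python version that stops at a requested number of clusters. The stopping rule, the function name, and the hand-made distance matrix are assumptions for illustration; in practice the distances would be derived from the documents' anomaly vectors.

```python
from itertools import combinations

def upgma(dist, num_clusters):
    """Average-linkage agglomerative clustering over a symmetric distance matrix."""
    clusters = [[i] for i in range(len(dist))]

    def average_distance(a, b):
        # Mean of all pairwise distances between members of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max(num_clusters, 1):
        # Find the pair of clusters with the smallest average pairwise distance.
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters

# Invented distances between four documents (0 = identical).
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
groups = upgma(dist, 2)
```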
Performance Metrics Used
$\text{Precision } (P) = \frac{\text{number of correct responses}}{\text{number of responses}}$
$\text{Recall } (R) = \frac{\text{number of correct responses}}{\text{number correct in key}}$
$\text{F-measure} = \frac{2RP}{R + P}$
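The three metrics can be computed directly from a set of responses and an answer key. In the sketch below the function name and the document identifiers are hypothetical, and returning 0.0 for an empty denominator is a convention chosen for the example.

```python
def precision_recall_f(responses, key):
    """Compute precision, recall, and F-measure for responses against a key."""
    responses, key = set(responses), set(key)
    correct = len(responses & key)                    # number of correct responses
    precision = correct / len(responses) if responses else 0.0
    recall = correct / len(key) if key else 0.0
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return precision, recall, f_measure

# Hypothetical example: four documents returned, three in the key, two in common.
p, r, f = precision_recall_f({"doc1", "doc2", "doc3", "doc4"},
                             {"doc1", "doc2", "doc5"})
```

Here precision is 2/4, recall is 2/3, and the F-measure works out to 4/7.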
Cluster Results using Vector Space Model
Cluster Results using modified Vector Space Model with anomaly sets
Future Work
Determine clustering results for both simple and complex noun phrases
Could be applied to other clustering techniques, such as swarming
Acknowledgements
The Research Alliance in Math and Science program
Computational Sciences and Engineering Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy.
Dr. Cathy Jiao
Dr. Robert Patton
Dr. Thomas Potok