
Whitney St.Charles
Research Alliance in Math and Science 2007

Mentors: Yu (Cathy) Jiao, Ph.D., and Robert Patton, Ph.D.

Computational Sciences and Engineering Division

Using TF-IDF Anomalies to Cluster Documents on Subject Matter

Natural Language Processing And Computational Linguistics

An Analysis using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies

OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY

Purposes of document clustering

• Data overabundance
– YouTube generates 200 terabytes of data per day
– How do we sift through those kinds of quantities?

• Searching
– Reduces the set tremendously

• Document clustering
– Is a knowledge discovery technique
– Categorizes results into meaningful groups
– Allows the user to browse quickly to the target


Document clustering users

• Financial analysts
– Identify certain trends to develop forecasts about a particular company

• Business intelligence
– Identify products that are associated with or dependent upon one another

• Military
– Identify terrorist cells from blog activity and movement of materials

• You!
– Narrow down hundreds of thousands of internet search results to find the kinds of sites you want


Current document clustering technique

• A word-by-word comparison of each document is made to determine similarity

• Unfortunately, this method…
– Does not handle context very well
– Compares several hundred to several thousand words for each document
– Is very computationally expensive
– Requires expensive SIMD machines
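The word-by-word comparison can be illustrated with a small sketch. The deck later refers to a Vector Space Model, so this example assumes the comparison is a cosine similarity over raw term counts; the function names and the sample sentences are hypothetical, and the slides do not say which similarity measure was actually used.

```python
import math
from collections import Counter


def term_vector(text):
    """Count how often each lowercase whitespace-separated word occurs."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-count vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up documents, for illustration only.
doc1 = term_vector("the airline cancelled the flight")
doc2 = term_vector("the airline delayed the flight")
doc3 = term_vector("stocks fell sharply today")
print(cosine_similarity(doc1, doc2))  # high: most words are shared
print(cosine_similarity(doc1, doc3))  # 0.0: no words are shared
```

Every distinct word becomes a dimension, which is why full-vocabulary vectors grow to hundreds or thousands of terms per document.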


Contributions to the field

• Identify only those words which are more indicative of the subject matter
– If airline occurs 20% more than is “normal,” it has something to do with the subject

• Examine both simple and complex noun phrases to address the context of the document

• Generate much smaller vectors, containing an average of 82% fewer terms!

• Cluster more accurately because only “important” words are chosen
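The "20% more than normal" idea can be sketched as a ratio test. This is a minimal illustration, assuming "normal" means a baseline relative frequency per term and that a term qualifies when its frequency in the document is at least 1.2 times that baseline; the function name, the threshold parameter, and the baseline values are all hypothetical.

```python
from collections import Counter


def anomalous_terms(doc_tokens, baseline_freq, threshold=1.2):
    """Return {term: ratio} for terms whose relative frequency in the
    document is at least `threshold` times the baseline frequency."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    selected = {}
    for term, count in counts.items():
        expected = baseline_freq.get(term)
        if expected is None or expected <= 0:
            continue  # no baseline for this term; skipped in this sketch
        ratio = (count / total) / expected
        if ratio >= threshold:
            selected[term] = ratio
    return selected


# Made-up baseline frequencies and tokens, for illustration only.
baseline = {"the": 0.4, "airline": 0.1, "flight": 0.2}
tokens = ["the", "airline", "the", "airline", "flight",
          "the", "airline", "the", "flight", "airline"]
print(anomalous_terms(tokens, baseline))  # only "airline" is over-represented
```

Only the selected terms would enter the document's vector, which is how the vectors end up much smaller than the full vocabulary.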


Our method



Establishing the baseline

• Train the program to recognize what is “normal” for a given term
– Need an entire English language corpus

• Corpus: a large, structured set of texts compiled to be representative of a language
– Uses hundreds of thousands of words in every allowable way

• Using a corpus, the program can
– Establish usage statistics
– Learn linguistic rules

Example: The Brown Corpus http://www.edict.com.hk/concordance/WWWConcappE.htm
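As a rough illustration of "establishing usage statistics," the sketch below computes a relative frequency for every word in a corpus. The three-sentence corpus is a stand-in invented for this example; a real baseline would come from a full corpus such as the Brown Corpus, and the function name is hypothetical.

```python
from collections import Counter


def baseline_frequencies(corpus_texts):
    """Relative frequency of each lowercase word across all corpus texts."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Toy stand-in for a real corpus.
toy_corpus = [
    "the airline sold the ticket",
    "the market rose",
    "a ticket for the game",
]
baseline = baseline_frequencies(toy_corpus)
print(baseline["the"])     # 4 of the 13 tokens
print(baseline["ticket"])  # 2 of the 13 tokens
```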


Extracting words and phrases



Part-of-speech tagging

• Tags every word in the sentence with the correct part of speech
• Achieves an accuracy of 97.24%
• Is necessary because token extraction methods are each dependent upon correct tagging
• Passes the tagged sentence to the token extractor
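The slides do not name the tagger behind the 97.24% figure. To show only the input and output shape of this step, here is a deliberately simple baseline that assigns each word the tag it carries most often in a hand-tagged training set. It is not the tagger used in this work and would not reach that accuracy; the Penn Treebank style tags, the tiny training set, and the function names are assumptions.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in tag_counts.items()}


def tag_sentence(words, model, default_tag="NN"):
    """Tag each word; unknown words fall back to `default_tag`."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]


# Tiny hand-tagged training set, invented for illustration.
training = [
    [("the", "DT"), ("summer", "NN"), ("intern", "NN"), ("works", "VBZ")],
    [("the", "DT"), ("union", "NN"), ("works", "VBZ")],
    [("good", "JJ"), ("works", "NNS")],
]
model = train_unigram_tagger(training)
print(tag_sentence(["The", "intern", "works", "hard"], model))
```

The list of (word, tag) pairs is the form the later extraction sketches assume as input.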


Token extractor

Extracts
• Words
• Simple noun phrases
• Complex noun phrases


Word extraction

• Uses POS-tagged data to identify only adjectives, verbs, and nouns
• Uses the Porter stemmer to identify unique words
– Cuts common suffixes such as -ing, -tion, -e, -es, -s
– Example: “recreation” and “recreational” are both identified as “recreat”
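A minimal sketch of this step, assuming Penn Treebank style tags where nouns start with NN, verbs with VB, and adjectives with JJ. The suffix stripper below is a toy stand-in and not the Porter stemmer, which applies a much larger, ordered rule set; the suffix list was chosen only so that the slide's "recreat" example works.

```python
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives
SUFFIXES = ("ional", "ion", "ing", "es", "e", "s")  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_words(tagged_tokens):
    """Keep nouns, verbs, and adjectives and return their unique stems."""
    stems = []
    for word, tag in tagged_tokens:
        if tag.startswith(CONTENT_TAG_PREFIXES):
            stem = toy_stem(word.lower())
            if stem not in stems:
                stems.append(stem)
    return stems


# Hand-tagged tokens, invented for illustration.
tagged = [("Recreation", "NN"), ("and", "CC"), ("recreational", "JJ"),
          ("parks", "NNS"), ("are", "VBP"), ("in", "IN"), ("parks", "NNS")]
print(extract_words(tagged))  # "recreation" and "recreational" collapse to one stem
```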


Why nouns?

• Are named entities
• Answer the question “What?”
• Are less ambiguous than verbs
– Example: “cook up a good meal” or “cook up a new solution”


Simple noun phrase extraction

• Accepts only consecutive nouns
– Example: summer intern, union representative
• Provides a set of short, highly descriptive phrases
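The "consecutive nouns" rule can be sketched as a single pass over tagged tokens. This assumes noun tags start with NN and that a phrase needs at least two nouns; the slides do not say whether a single noun counts, so that minimum is an assumption, as is the function name.

```python
def simple_noun_phrases(tagged_tokens):
    """Return runs of two or more consecutive noun-tagged words."""
    phrases, run = [], []
    # A sentinel token with a non-noun tag flushes the final run.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag.startswith("NN"):
            run.append(word)
        else:
            if len(run) >= 2:
                phrases.append(" ".join(run))
            run = []
    return phrases


# Hand-tagged tokens built around the slide's two examples.
tagged = [("The", "DT"), ("summer", "NN"), ("intern", "NN"), ("met", "VBD"),
          ("the", "DT"), ("union", "NN"), ("representative", "NN")]
print(simple_noun_phrases(tagged))
```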


Complex noun phrase extraction techniques

• Static rule-based / finite state automata
– Rely on the aptitude of the linguist formulating the rule set

• Machine learning
– Rely on the “completeness” of the training set


Static rule-based extraction

• Establishes a list of linguistic rules
– A determiner preceding a noun marks the beginning of a noun phrase
– A determiner may not precede a noun phrase

[Figure: finite state automaton for noun phrases, with transitions labeled determiner/adjective; noun/pronoun; adjective; relative clause/prepositional phrase/noun; noun/pronoun/determiner]
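A hand-written rule set of this kind can be sketched as a tiny state machine over part-of-speech tags. The single pattern used here (optional determiner, any number of adjectives, one or more nouns) is an assumption chosen for illustration and is far simpler than the automaton in the figure; treating possessive pronouns (PRP$) as determiners is also an assumption.

```python
DETERMINER_TAGS = ("DT", "PRP$")  # articles and possessive pronouns (assumed)


def chunk_noun_phrases(tagged_tokens):
    """Greedy left-to-right chunker for the pattern DET? ADJ* NOUN+."""
    phrases = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        j = i
        if tagged_tokens[j][1] in DETERMINER_TAGS:  # optional determiner
            j += 1
        while j < n and tagged_tokens[j][1].startswith("JJ"):  # adjectives
            j += 1
        k = j
        while k < n and tagged_tokens[k][1].startswith("NN"):  # head nouns
            k += 1
        if k > j:  # at least one noun was found: accept the phrase
            phrases.append(" ".join(w for w, _ in tagged_tokens[i:k]))
            i = k
        else:      # no noun follows: not a phrase, advance one token
            i += 1
    return phrases


# Hand-tagged sentence, invented for illustration.
tagged = [("The", "DT"), ("man", "NN"), ("borrowed", "VBD"),
          ("my", "PRP$"), ("red", "JJ"), ("hat", "NN")]
print(chunk_noun_phrases(tagged))
```

Any construction the pattern does not anticipate, such as a relative clause, is simply missed, which is the weakness of a fixed rule list.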


Static extraction shortcomings

• Unanticipated rules
• The subjective nature of language
• Difficulty finding non-recursive, base NPs
– Recursive bracketing: [The man [whose red hat [I borrowed yesterday]RC ]RC [in the street]PP [that is next to my house]RC ]NP lives [next door]NP.
– Base NPs: [The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP.
• Structural ambiguity


Structural ambiguity example

“I saw the man with the telescope.”


Machine learning extraction

• Is all about …
• Uses a corpus
• Is based on statistics
– The more it sees a particular occurrence, the more likely it is to prefer it
• Makes better educated guesses about structural ambiguity
• Discovers thousands of unanticipated rules


Transformation-based complex noun phrase extraction

An “error-driven” approach for learning an ordered set of rules

1. Generate all rules that correct at least one error.
2. For each rule:
(a) Apply to a copy of the most recent state of the training set.
(b) Score result.
3. Select rule with best score.
4. Update training set by applying selected rule.
5. Stop if score is smaller than some pre-set threshold T; otherwise repeat from step 1.
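The loop in steps 1 to 5 can be sketched on a toy label sequence. This is not the noun phrase learner from the slides: it assumes a single rule template ("relabel X as Y when the previous label is Z"), scores a rule by its net reduction in errors, and checks the threshold before applying the selected rule. The labels, the example, and all function names are hypothetical.

```python
def apply_rule(tags, rule):
    """Rule (old, new, prev): relabel `old` as `new` after label `prev`."""
    old, new, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == old and tags[i - 1] == prev:  # read the unmodified input
            out[i] = new
    return out


def errors(tags, gold):
    """Number of positions where the current labels disagree with the key."""
    return sum(t != g for t, g in zip(tags, gold))


def candidate_rules(tags, gold):
    """Step 1: every rule that would correct at least one current error."""
    return {(tags[i], gold[i], tags[i - 1])
            for i in range(1, len(tags)) if tags[i] != gold[i]}


def learn_rules(initial, gold, threshold=1):
    """Return the ordered rule list and the final state of the labels."""
    current, learned = list(initial), []
    while True:
        best_rule, best_gain = None, 0
        for rule in sorted(candidate_rules(current, gold)):  # step 2
            trial = apply_rule(current, rule)                # step 2(a), on a copy
            gain = errors(current, gold) - errors(trial, gold)  # step 2(b)
            if gain > best_gain:                             # step 3
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain < threshold:       # step 5
            return learned, current
        current = apply_rule(current, best_rule)             # step 4
        learned.append(best_rule)


# Toy data: "the plan to race the car to win", initially tagged naively.
initial = ["DT", "NN", "TO", "NN", "DT", "NN", "TO", "NN"]
gold = ["DT", "NN", "TO", "VB", "DT", "NN", "TO", "VB"]
rules, final = learn_rules(initial, gold)
print(rules)
```

Because each accepted rule is applied before the next one is learned, the output is an ordered list: later rules are learned against the corrections made by earlier ones.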


Determining anomaly sets

• TF-IDF: Term Frequency – Inverse Document Frequency
– Number of local occurrences of term multiplied by uniqueness measure of term in document set

• TF-ICF: Term Frequency – Inverse Corpus Frequency
– Average number of corpus occurrences of term multiplied by uniqueness measure of term in the corpus
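The slides describe TF-IDF in words only. A common textbook form is sketched below, assuming the "uniqueness measure" is log(N / df), where N is the number of documents and df is the number of documents containing the term; the exact weighting used in this work is not stated, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term in the document) * log(N / document frequency)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    return vectors


# Made-up tokenized documents, for illustration only.
docs = [
    ["airline", "flight", "airline", "delay"],
    ["flight", "market", "stock"],
    ["market", "stock", "stock", "flight"],
]
for vec in tf_idf(docs):
    print(vec)
```

A term that appears in every document, such as "flight" here, gets weight zero, while a term concentrated in one document is weighted up. A TF-ICF variant would presumably draw the uniqueness statistics from a reference corpus rather than from the document set itself, though the deck does not give that formula.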


Each document has its own anomaly vector



Clustering the data

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
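UPGMA is agglomerative clustering with average linkage: repeatedly merge the two closest clusters, where the distance between clusters is the mean of all pairwise item distances. The sketch below is a plain, unoptimized version that stops at a requested number of clusters; the distance matrix is made up, and how distances were derived from the anomaly vectors in this work is not stated in the slides.

```python
def upgma(distance, n_clusters):
    """Merge the closest pair of clusters until `n_clusters` remain.

    `distance[i][j]` is the distance between items i and j. The distance
    between two clusters is the mean over all cross-cluster item pairs.
    """
    n_clusters = max(n_clusters, 1)
    clusters = [[i] for i in range(len(distance))]

    def average_link(a, b):
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_link(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Made-up symmetric distances between four documents.
dist = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
print(upgma(dist, 2))  # documents 0 and 1 group together, as do 2 and 3
```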


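Performance Metrics Used

$$\text{Precision} = \frac{\text{number of correct responses}}{\text{number of responses}}$$

$$\text{Recall} = \frac{\text{number of correct responses}}{\text{number correct in key}}$$

$$\text{F-measure} = \frac{2RP}{R + P}$$

where R is recall and P is precision.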
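A small worked example of the three metrics, treating the responses and the key as sets of item identifiers. The identifiers are made up, and how responses were counted for clusters in this work is not specified in the slides.

```python
def precision_recall_f(responses, key):
    """Compare a set of system responses against a set of correct answers."""
    responses, key = set(responses), set(key)
    correct = len(responses & key)
    precision = correct / len(responses) if responses else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0  # avoid dividing by zero
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure


# Four responses, of which two appear in a key of three correct items.
p, r, f = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r, f)  # precision 2/4, recall 2/3, F-measure 4/7
```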



Cluster Results using Vector Space Model

Cluster Results using modified Vector Space Model with anomaly sets


Future Work

• Determine clustering results for both simple and complex noun phrases
• Could be applied to other clustering techniques, such as swarming


Acknowledgements

The Research Alliance in Math and Science program

Computational Sciences and Engineering Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy.

Dr. Cathy Jiao

Dr. Robert Patton

Dr. Thomas Potok
