TamingText_slides.pptx


Transcript of TamingText_slides.pptx


Why Text?

How much data? 1.8 zettabytes (1.8 trillion GB). Most of the world's data is unstructured:
2009 HP survey: 70%; Gartner: 80%; Jerry Hill (Teradata) and Anant Jhingran (IBM): 85%

Structured (stored) data often misses elements critical to predictive modeling:
Un-transcribed fields, notes, comments. Examples: examiner/adjuster notes, surveys with free-text fields, medical charts


    Text Mining - Perspective


    Taming Text

    Grant Ingersoll

CTO, LucidWorks. @tamingtext, @gsingers


    About the Book

Goal: An engineer's guide to search, Natural Language Processing (NLP), and Machine Learning

Target audience: you. All examples are in Java, but the concepts are easily ported.

Covers: search, fuzzy string matching, human language basics, clustering, classification, question answering, and an intro to advanced topics


Content

Question Answering in detail: building blocks, indexing, search/passage retrieval, classification, scoring

Other interesting topics: clustering, fuzzy-wuzzy strings

What's next? Resources


    A Grain of Salt

Text is a strange and magical world filled with evil villains, jesters, wizards, unicorns, and heroes!

    In other words, no system will be perfect

    http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg


The Ugly Truth

You will spend most of your time in NLP, search, etc. doing grunt work nicely labeled as: preprocessing, feature selection, sampling, validation/testing, content extraction, ETL

Corollary: start with simple, tried-and-true algorithms, then iterate


Term/document matrix

The most common form of representation in text mining is the term-document matrix.

Term: typically a single word, but could be a word phrase like "data mining"
Document: a generic term meaning a collection of text to be retrieved
Can be large: term vocabularies are often 50k or larger, and documents can number in the billions (the web)
Can be binary, or use counts (a sketch of building one follows below)
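As a quick illustration (not code from the book), here is a minimal Java sketch that builds a term-document count matrix from a couple of in-memory documents; the whitespace tokenization and the tiny corpus are assumptions made for brevity.

```java
import java.util.*;

public class TermDocMatrix {
    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "the database uses SQL and an index",
            "regression estimates a linear model by maximum likelihood");

        // Collect the vocabulary (the term list).
        Set<String> vocab = new TreeSet<>();
        List<List<String>> tokenized = new ArrayList<>();
        for (String doc : docs) {
            List<String> tokens = Arrays.asList(doc.toLowerCase().split("\\W+"));
            tokenized.add(tokens);
            vocab.addAll(tokens);
        }
        List<String> terms = new ArrayList<>(vocab);

        // One row per document, one column per term, cell = raw count.
        int[][] matrix = new int[docs.size()][terms.size()];
        for (int d = 0; d < tokenized.size(); d++) {
            for (String token : tokenized.get(d)) {
                matrix[d][terms.indexOf(token)]++;
            }
        }

        System.out.println(terms);
        for (int[] row : matrix) System.out.println(Arrays.toString(row));
    }
}
```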


Term-document matrix

Each document now is just a vector of terms, sometimes boolean.

Example: 10 documents, 6 terms

        Database   SQL   Index   Regression   Likelihood   linear
  D1          24    21       9            0            0        3
  D2          32    10       5            0            3        0
  D3          12    16       5            0            0        0
  D4           6     7       2            0            0        0
  D5          43    31      20            0            3        0
  D6           2     0       0           18            7        6
  D7           0     0       1           32           12        0
  D8           3     0       0           22            4        4
  D9           1     0       0           34           27       25
  D10          6     0       0           17            4       23


Term-document matrix

We have lost all semantic content. Be careful constructing your term list!

Not all words are created equal! Words that are the same should be treated the same!

Stop words, stemming


Stop words

Many of the most frequently used words in English are worthless in retrieval and text mining; these words are called stop words.
Examples: "the", "of", "and", "to". Typically about 400 to 500 such words.
For an application, an additional domain-specific stop word list may be constructed.

Why do we need to remove stop words?
Reduce indexing (or data) file size: stop words account for 20-30% of total word counts.
Improve efficiency: stop words are not useful for searching or text mining, and they always have a large number of hits.

A minimal filtering sketch follows below.
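A minimal sketch of stop word filtering, assuming a tiny hard-coded stop list; a real application would load a 400-500 word list, possibly extended with domain-specific terms.

```java
import java.util.*;

public class StopWordFilter {
    // A tiny illustrative stop list; production lists run to several hundred words.
    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("the", "of", "and", "to", "a", "in", "is"));

    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t.toLowerCase())) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "index", "of", "the", "database");
        System.out.println(removeStopWords(tokens));  // [index, database]
    }
}
```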


Stemming

Techniques used to find the root/stem of a word.
E.g., user, users, used, using -> stem "use"; engineering, engineered, engineer -> stem "engineer"

Usefulness:
Improving effectiveness of retrieval and text mining (matching similar words)
Reducing indexing size: combining words with the same roots may reduce indexing size by as much as 40-50%.


Basic stemming methods

Remove endings:
If a word ends with a consonant other than s, followed by an s, then delete the s.
If a word ends in "es", drop the s.
If a word ends in "ing", delete the "ing" unless the remaining word consists of only one letter or of "th".
If a word ends with "ed", preceded by a consonant, delete the "ed" unless this leaves only a single letter.
...

Transform words:
If a word ends with "ies" but not "eies" or "aies", then "ies" --> "y".

(A literal Java sketch of these rules follows below.)
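The following toy Java stemmer is a literal transcription of the rules above (not the Porter stemmer that most real systems would use), just to make the rule order concrete.

```java
public class SimpleStemmer {

    private static boolean isConsonant(char c) {
        return "aeiou".indexOf(c) < 0;
    }

    /** A toy stemmer implementing the rules on this slide, applied in a fixed order. */
    public static String stem(String w) {
        // ies -> y, unless the word ends in "eies" or "aies"
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // word ends in "es": drop the s
        if (w.endsWith("es")) {
            return w.substring(0, w.length() - 1);
        }
        // consonant (other than s) followed by s: delete the s
        if (w.length() >= 2 && w.endsWith("s")
                && isConsonant(w.charAt(w.length() - 2)) && w.charAt(w.length() - 2) != 's') {
            return w.substring(0, w.length() - 1);
        }
        // "ing": delete unless what remains is one letter or "th"
        if (w.endsWith("ing")) {
            String rest = w.substring(0, w.length() - 3);
            if (rest.length() > 1 && !rest.equals("th")) return rest;
        }
        // "ed" preceded by a consonant: delete unless only a single letter remains
        if (w.length() >= 3 && w.endsWith("ed") && isConsonant(w.charAt(w.length() - 3))) {
            String rest = w.substring(0, w.length() - 2);
            if (rest.length() > 1) return rest;
        }
        return w;
    }

    public static void main(String[] args) {
        // Note how crude the literal rules are: "using" stems to "us", not "use".
        for (String w : new String[] {"users", "using", "engineered", "queries"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```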


Feature Selection

Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms, even after stemming and stop word removal.

Greedy search: start from the full set and delete one term at a time, finding the least important variable at each step. Can use the Gini index for this if it is a classification problem (a scoring sketch follows below).

Often performance does not degrade even with order-of-magnitude reductions.

Chakrabarti, Chapter 5: patent data, 9,600 patents in communication, electricity, and electronics. Only 140 out of 20,000 terms were needed for classification!
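As a hedged sketch of how the Gini index can score a term's importance for classification (one ingredient of the greedy search described above; the slides do not prescribe this exact code), the method below measures how much splitting the corpus on a term's presence reduces class impurity.

```java
import java.util.*;

public class GiniTermScore {

    /** Gini impurity of a class-count vector. */
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        if (total == 0) return 0.0;
        double sumSq = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSq += p * p;
        }
        return 1.0 - sumSq;
    }

    /**
     * Impurity reduction obtained by splitting the corpus on presence/absence of one term.
     * presence[d] = true if document d contains the term; label[d] = class of document d.
     */
    static double impurityReduction(boolean[] presence, int[] label, int numClasses) {
        int n = label.length;
        int[] all = new int[numClasses], with = new int[numClasses], without = new int[numClasses];
        int nWith = 0;
        for (int d = 0; d < n; d++) {
            all[label[d]]++;
            if (presence[d]) { with[label[d]]++; nWith++; } else { without[label[d]]++; }
        }
        double split = (nWith * gini(with) + (n - nWith) * gini(without)) / n;
        return gini(all) - split;   // larger = more discriminative term
    }

    public static void main(String[] args) {
        // Toy corpus: 6 documents, 2 classes; the term appears in documents 0-2 only.
        boolean[] presence = {true, true, true, false, false, false};
        int[] label = {0, 0, 0, 1, 1, 1};
        System.out.println(impurityReduction(presence, label, 2)); // 0.5: a perfect separator
    }
}
```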


Distances in TD matrices

Given a term-document matrix representation, we can now define distances between documents (or terms!).

Elements of the matrix can be 0/1 or term frequencies (sometimes normalized).

Can use Euclidean or cosine distance. Cosine distance is based on the angle between the two vectors: d_c(x, y) = (x . y) / (||x|| ||y||). Not intuitive, but it has been proven to work well.

If two docs are the same, d_c = 1; if they have nothing in common, d_c = 0.


We can calculate cosine and Euclidean distance for this matrix. What would you want the distances to look like? (See the sketch below.)

        Database   SQL   Index   Regression   Likelihood   linear
  D1          24    21       9            0            0        3
  D2          32    10       5            0            3        0
  D3          12    16       5            0            0        0
  D4           6     7       2            0            0        0
  D5          43    31      20            0            3        0
  D6           2     0       0           18            7        6
  D7           0     0       1           32           12        0
  D8           3     0       0           22            4        4
  D9           1     0       0           34           27       25
  D10          6     0       0           17            4       23
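A minimal sketch computing the cosine similarity (the slide's d_c) and Euclidean distance between rows of the matrix above; D1 and D3 are both database documents while D6 is a statistics document, so cosine should be near 1 for the first pair and near 0 for the second.

```java
public class DocDistance {

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Rows D1, D3 and D6 from the example term-document matrix above.
        double[] d1 = {24, 21, 9, 0, 0, 3};
        double[] d3 = {12, 16, 5, 0, 0, 0};
        double[] d6 = {2, 0, 0, 18, 7, 6};

        System.out.println("cos(D1,D3) = " + cosineSimilarity(d1, d3)); // close to 1: same topic
        System.out.println("cos(D1,D6) = " + cosineSimilarity(d1, d6)); // close to 0: different topics
        System.out.println("euclidean(D1,D3) = " + euclidean(d1, d3));
    }
}
```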


Document distance

Pairwise distances between documents: image plots of cosine, Euclidean, and scaled Euclidean distance (R function: image).


Weighting in TD space

Not all phrases are of equal importance. E.g., "David" is less important than "Beckham": if a term occurs frequently in many documents, it has less discriminatory power.

One way to correct for this is inverse document frequency (IDF):
IDF_j = log(N / N_j), where N_j = # of docs containing term j and N = total # of docs.

Term importance = Term Frequency (TF) x IDF. A term is important if it has a high TF and/or a high IDF. TF x IDF is a common measure of term importance.


TF x IDF weighted matrix:

        Database    SQL   Index   Regression   Likelihood   linear
  D1        2.53   14.6     4.6            0            0      2.1
  D2         3.3    6.7     2.6            0          1.0        0
  D3         1.3   11.1     2.6            0            0        0
  D4         0.7    4.9     1.0            0            0        0
  D5         4.5   21.5    10.2            0          1.0        0
  D6         0.2      0       0         12.5          2.5     11.1
  D7           0      0     0.5         22.2          4.3        0
  D8         0.3      0       0         15.2          1.4      1.4
  D9         0.1      0       0        23.56          9.6     17.3
  D10        0.6      0       0         11.8          1.4     16.0

Raw TF matrix (for comparison):

        Database   SQL   Index   Regression   Likelihood   linear
  D1          24    21       9            0            0        3
  D2          32    10       5            0            3        0
  D3          12    16       5            0            0        0
  D4           6     7       2            0            0        0
  D5          43    31      20            0            3        0
  D6           2     0       0           18            7        6
  D7           0     0       1           32           12        0
  D8           3     0       0           22            4        4
  D9           1     0       0           34           27       25
  D10          6     0       0           17            4       23
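A small sketch of the TF x IDF weighting; the weighted table above is consistent with a natural-log IDF, so that is what this code uses (e.g. 24 x ln(10/9) is roughly 2.53 for D1/Database).

```java
public class TfIdf {

    /**
     * Weight a raw term-frequency matrix with TF x IDF, where IDF = ln(N / Nj),
     * N = number of documents and Nj = number of documents containing term j.
     */
    static double[][] tfidf(int[][] tf) {
        int numDocs = tf.length;
        int numTerms = tf[0].length;
        double[][] weighted = new double[numDocs][numTerms];
        for (int j = 0; j < numTerms; j++) {
            int docsWithTerm = 0;
            for (int[] row : tf) if (row[j] > 0) docsWithTerm++;
            double idf = Math.log((double) numDocs / docsWithTerm);
            for (int d = 0; d < numDocs; d++) {
                weighted[d][j] = tf[d][j] * idf;
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        int[][] tf = {
            {24, 21, 9, 0, 0, 3}, {32, 10, 5, 0, 3, 0}, {12, 16, 5, 0, 0, 0},
            {6, 7, 2, 0, 0, 0},   {43, 31, 20, 0, 3, 0}, {2, 0, 0, 18, 7, 6},
            {0, 0, 1, 32, 12, 0}, {3, 0, 0, 22, 4, 4},   {1, 0, 0, 34, 27, 25},
            {6, 0, 0, 17, 4, 23}
        };
        double[][] w = tfidf(tf);
        System.out.printf("D1/Database: %.2f%n", w[0][0]); // ~2.53
        System.out.printf("D1/SQL:      %.2f%n", w[0][1]); // ~14.56
    }
}
```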


Simple Question Answering Workflow


    Building Blocks

    Sentence Detection

    Part of Speech Tagging

    Parsing

    Ch. 2


    QA in Taming Text

Apache Solr for passage retrieval and integration

Apache OpenNLP for sentence detection, parsing, POS tagging, and answer type classification

Custom code for query parsing and scoring; see the com.tamingtext.qa package

    Wikipedia for truth
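As an illustration of the OpenNLP building blocks named above (sentence detection, tokenization, POS tagging), here is a hedged sketch; the pre-trained model file names (en-sent.bin, en-token.bin, en-pos-maxent.bin) are the standard downloads, and the local paths are assumptions.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class BuildingBlocksDemo {
    public static void main(String[] args) throws Exception {
        // Pre-trained model files; paths are assumptions, adjust to your local copies.
        try (InputStream sentIn = new FileInputStream("en-sent.bin");
             InputStream tokIn = new FileInputStream("en-token.bin");
             InputStream posIn = new FileInputStream("en-pos-maxent.bin")) {

            SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posIn));

            String text = "Who was the Sun King? He reigned in France.";
            for (String sentence : sentenceDetector.sentDetect(text)) {
                String[] tokens = tokenizer.tokenize(sentence);
                String[] tags = tagger.tag(tokens);
                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "/" + tags[i]);
                }
            }
        }
    }
}
```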


Indexing

Ingest raw data into the system and make it available for search. Garbage in, garbage out: you need to spend some time understanding and modeling your data, just like you would with a DB.

Lather, rinse, repeat.

See $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup and WikipediaWexIndexer.java for indexing code.
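A minimal SolrJ indexing sketch, not the book's WikipediaWexIndexer; the core name qa and the field names id/title/body are hypothetical stand-ins for whatever the schema.xml actually defines.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical core name and field names; the book's real schema lives in
        // $TT_HOME/apache-solr/solr-qa/conf/schema.xml.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/qa").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "wiki-1");
            doc.addField("title", "Louis XIV of France");
            doc.addField("body", "Louis XIV, known as the Sun King, was King of France...");
            solr.add(doc);
            solr.commit();   // make the document searchable
        }
    }
}
```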


    Aside: Named Entity Recognition

NER is the process of extracting proper names, etc. from text.

It plays a vital role in QA and many other NLP systems and is often solved using classification approaches.
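A short OpenNLP name-finder sketch for the NER step; the model file en-ner-person.bin is one of the standard pre-trained models, and its local path is an assumption.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class PersonFinder {
    public static void main(String[] args) throws Exception {
        // "en-ner-person.bin" is a pre-trained OpenNLP name-finder model.
        try (InputStream in = new FileInputStream("en-ner-person.bin")) {
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(in));
            String[] tokens = {"Grant", "Ingersoll", "wrote", "Taming", "Text", "."};
            Span[] spans = finder.find(tokens);
            for (String name : Span.spansToStrings(spans, tokens)) {
                System.out.println("PERSON: " + name);
            }
        }
    }
}
```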


Answer Type Classification

Answer type examples: Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M). See page 248 for more.

Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
P Which French monarch reinstated the divine right of the monarchy to France and was known as 'The Sun King' because of the splendour of his reign?
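The book ships its own answer type classifier in com.tamingtext.qa; as a generic, hedged sketch, OpenNLP's document categorizer can play the same role at inference time (answer-types.bin is a hypothetical model trained on label-prefixed questions like the example above, and the API shown is OpenNLP 1.8+).

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.tokenize.SimpleTokenizer;

public class AnswerTypeClassifier {
    public static void main(String[] args) throws Exception {
        // "answer-types.bin" is a hypothetical model trained on label-prefixed questions
        // like the "P Which French monarch..." example above.
        try (InputStream in = new FileInputStream("answer-types.bin")) {
            DocumentCategorizerME categorizer = new DocumentCategorizerME(new DoccatModel(in));
            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                "Which French monarch was known as the Sun King?");
            double[] outcomes = categorizer.categorize(tokens);
            System.out.println("Answer type: " + categorizer.getBestCategory(outcomes));
        }
    }
}
```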


    Search engines


Other Areas of NLP/Machine Learning


Clustering

Group together content based on some notion of similarity.

The book covers (ch. 6):
Search result clustering using Carrot2
Whole-collection clustering using Mahout
Topic modeling

Mahout comes with many different algorithms.


    Clustering Use Cases

    Google News

    Outlier detection in smart grids

Recommendations (products, people, etc.)


    In Focus: K-Means

    http://en.wikipedia.org/wiki/K-means_clustering
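Mahout runs k-means as a distributed job; to keep the algorithm itself in focus, here is a self-contained toy Java implementation (random initialization, fixed iteration count) rather than the Mahout API.

```java
import java.util.*;

public class KMeans {

    /** Plain k-means with random initialization; returns the cluster index of each point. */
    static int[] cluster(double[][] points, int k, int iterations, long seed) {
        Random rnd = new Random(seed);
        int dim = points[0].length;

        // 1. Pick k random points as the initial centroids.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // 2. Assign each point to its nearest centroid (squared Euclidean distance).
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int d = 0; d < dim; d++) {
                        double diff = points[p][d] - centroids[c][d];
                        dist += diff * diff;
                    }
                    if (dist < best) { best = dist; assignment[p] = c; }
                }
            }
            // 3. Move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) sums[assignment[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {1, 0.5}, {8, 8}, {9, 8.5}, {8.5, 9}};
        System.out.println(Arrays.toString(cluster(points, 2, 10, 42)));
    }
}
```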


    Fuzzy-Wuzzy Strings

Fuzzy string matching is a common, and difficult, problem.

Useful for solving problems like: "did you mean" spell checking, auto-suggest, record linkage


    Common Approaches

See the com.tamingtext.fuzzy package

Jaccard: measure character overlap
Levenshtein (edit distance): count the number of edits required to transform one word into the other (sketch below)
Jaro-Winkler: accounts for position
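As a quick illustration of one of these approaches (a plain Levenshtein edit distance, not the book's com.tamingtext.fuzzy code):

```java
public class EditDistance {

    /** Classic dynamic-programming Levenshtein distance: insertions, deletions, substitutions. */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("tamingtext", "taming text")); // 1 (one insertion)
    }
}
```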


Text Mining: Helpful Data

WordNet

Data Mining - Volinsky - 2011 - Columbia University. Courtesy: Luca Lanzi


Text Mining - Other Topics: Sentiment Analysis

Automatically determine the tone in text: positive, negative, or neutral. Typically uses collections of good and bad words (a naive word-list scorer is sketched below).

"While the traditional media is slowly starting to take John McCain's straight-talking image with increasingly large grains of salt, his base isn't quite ready to give up on their favorite son. Jonathan Alter's bizarre defense of McCain after he was caught telling an outright lie, perfectly captures that reluctance[.]"

Often fit using Naive Bayes.

There are sentiment word lists out there; see http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis


Sources: http://www.nybooks.com/articles/21470, http://www.dailykos.com/storyonly/2008/6/10/14401/4418/822/533105, http://www.newsweek.com/id/140470/output/print
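A deliberately naive word-list scorer of the kind described above (the word lists here are tiny, made-up examples rather than a real lexicon, and this is not the Naive Bayes approach):

```java
import java.util.*;

public class WordListSentiment {
    // Tiny illustrative word lists; real lists (see the link above) contain thousands of entries.
    private static final Set<String> POSITIVE = new HashSet<>(
        Arrays.asList("good", "great", "excellent", "love", "perfect"));
    private static final Set<String> NEGATIVE = new HashSet<>(
        Arrays.asList("bad", "bizarre", "lie", "awful", "hate"));

    /** Positive score > 0, negative < 0, neutral = 0. */
    static int score(String text) {
        int score = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) score++;
            if (NEGATIVE.contains(token)) score--;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("Jonathan Alter's bizarre defense of McCain after he was caught telling an outright lie"));
        // negative score: note this naive count misses negation, sarcasm, and context
    }
}
```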

    Text Mining - Other Topics

Summarizing text: word clouds. Take text as input, find the most interesting words, and display them graphically.

Blogs do this; see Wordle.net


    Much Harder Problems

Semantics, pragmatics, and beyond; relationship extraction; cross-language search; importance