Transcript of TamingText_slides.pptx
8/10/2019, 36 slides
Why Text?
How much data? 1.8 zettabytes (1.8 trillion GB). Most of the world's data is unstructured:
2009 HP survey: 70%; Gartner: 80%; Jerry Hill (Teradata) and Anant Jhingran (IBM): 85%
Structured (stored) data often misses elements critical to predictive modeling:
Un-transcribed fields, notes, comments. Ex: examiner/adjuster notes, surveys with free-text fields, medical charts
Text Mining - Perspective
Taming Text
Grant Ingersoll
CTO, LucidWorks
@tamingtext, @gsingers
About the Book
Goal: An engineer's guide to search, Natural Language Processing (NLP), and Machine Learning
Target Audience: You. All examples in Java, but concepts easily ported.
Covers: search, fuzzy string matching, human language basics, clustering, classification, Question Answering, intro to advanced topics
Content
Question Answering in Detail
Building Blocks, Indexing, Search/Passage Retrieval, Classification, Scoring
Other Interesting Topics: Clustering, Fuzzy-Wuzzy Strings
What's next? Resources
A Grain of Salt
Text is a strange and magical world filled with:
Evil villains, Jesters, Wizards, Unicorns, Heroes!
In other words, no system will be perfect
The Ugly Truth
You will spend most of your time in NLP, search, etc. doing grunt work nicely labeled as: Preprocessing, Feature Selection, Sampling, Validation/testing/etc., Content extraction, ETL
Corollary: Start with simple, tried and true algorithms, then iterate
Term / document matrix
The most common form of representation in text mining is the term-document matrix.
Term: typically a single word, but could be a word phrase like "data mining"
Document: a generic term meaning a collection of text to be retrieved
Can be large: term vocabularies are often 50k or larger, and documents can number in the billions (the web).
Can be binary, or use counts
Term document matrix
Each document now is just a vector of terms, sometimes boolean.
Example: 10 documents, 6 terms:

     Database  SQL  Index  Regression  Likelihood  linear
D1        24    21      9           0           0       3
D2        32    10      5           0           3       0
D3        12    16      5           0           0       0
D4         6     7      2           0           0       0
D5        43    31     20           0           3       0
D6         2     0      0          18           7       6
D7         0     0      1          32          12       0
D8         3     0      0          22           4       4
D9         1     0      0          34          27      25
D10        6     0      0          17           4      23
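A matrix like the one above can be built in a few lines. The sketch below is Python for illustration (the book's own examples are in Java); the `term_document_matrix` helper and the two toy documents are ours, not from the slides.

```python
from collections import Counter

def term_document_matrix(docs, vocab):
    """One row per document, one column per vocabulary term,
    holding raw term counts (use min(count, 1) for a binary matrix)."""
    matrix = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        matrix.append([counts[term] for term in vocab])
    return matrix

docs = ["the database index uses sql",
        "regression estimates a linear model by maximum likelihood"]
vocab = ["database", "sql", "index", "regression", "likelihood", "linear"]
m = term_document_matrix(docs, vocab)
# m[0] -> [1, 1, 1, 0, 0, 0]   (a "database" document)
# m[1] -> [0, 0, 0, 1, 1, 1]   (a "statistics" document)
```

Real pipelines tokenize more carefully (punctuation, stop words, stemming, below), but the data structure is just this.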
Term document matrix
We have lost all semantic content.
Be careful constructing your term list!
Not all words are created equal! Words that are the same should be treated the same!
Stop Words, Stemming
Stop words
Many of the most frequently used words in English are worthless in retrieval and text mining; these words are called stop words.
"the", "of", "and", "to", etc. Typically about 400 to 500 such words.
For an application, an additional domain-specific stop word list may be constructed.
Why do we need to remove stop words?
Reduce indexing (or data) file size: stop words account for 20-30% of total word counts.
Improve efficiency: stop words are not useful for searching or text mining, and they always have a large number of hits.
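The removal step itself is trivial once a stop list exists. A minimal sketch (the 400-500-word lists mentioned above are trimmed to a toy set here):

```python
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is"}  # toy subset of a real list

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the index of the database and the sql engine".split()
filtered = remove_stop_words(tokens)
# filtered -> ['index', 'database', 'sql', 'engine']
```

Even on this nine-token example, over half the tokens are stop words, which is why the index-size savings above are so large.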
Stemming
Techniques used to find the root/stem of a word.
E.g., user, users, used, using -> stem: use; engineer, engineering, engineered -> stem: engineer
Usefulness: improving the effectiveness of retrieval and text mining by matching similar words, and reducing indexing size:
combining words with the same roots may reduce indexing size by as much as 40-50%.
Basic stemming methods
Remove endings:
if a word ends with a consonant other than s, followed by an s, then delete the s.
if a word ends in es, drop the s.
if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th.
if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
...
Transform words:
if a word ends with ies but not eies or aies, then ies --> y.
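These rules translate directly to code. The toy stemmer below (a sketch of the rules as stated, not Porter's algorithm; real stemmers have many more rules) applies them most specific first:

```python
VOWELS = set("aeiou")

def simple_stem(word):
    """Apply the suffix rules from the slide, most specific first."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"                      # studies -> study
    if word.endswith("ing") and len(word) > 4 and word[:-3] != "th":
        return word[:-3]                            # engineering -> engineer
    if word.endswith("ed") and len(word) > 3 and word[-3] not in VOWELS:
        return word[:-2]                            # engineered -> engineer
    if word.endswith("es"):
        return word[:-1]                            # drop the trailing s
    if word.endswith("s") and len(word) > 1 and word[-2] not in VOWELS | {"s"}:
        return word[:-1]                            # users -> user
    return word

# simple_stem("engineering") -> 'engineer'; simple_stem("users") -> 'user'
```

Note how crude rules can miss (this stemmer maps "using" to "us", not "use"); production systems use the Porter or Snowball stemmers instead.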
Feature Selection
Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms, even after stemming and stopword removal.
Greedy search: start from the full set and delete one term at a time, removing the least important variable.
Can use the Gini index for this if it is a classification problem.
Often performance does not degrade even with order-of-magnitude reductions.
Chakrabarti, Chapter 5: Patent data: 9,600 patents in communication, electricity and electronics. Only 140 out of 20,000 terms needed for classification!
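As a concrete sketch of the Gini criterion (taken here in its usual impurity form, 1 - sum of p_c squared; the helper and the toy class labels are ours): a term whose containing documents all share one class is maximally discriminative.

```python
def gini_impurity(class_labels):
    """Gini impurity over the classes of the documents containing a term.
    0 means pure: the term appears in documents of only one class."""
    n = len(class_labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((class_labels.count(c) / n) ** 2 for c in set(class_labels))

# Classes of the documents containing each term, from the 10-document example:
# "SQL" appears only in database documents; "Index" also appears in D7.
gini_sql = gini_impurity(["db"] * 5)
gini_index = gini_impurity(["db"] * 5 + ["stat"])
# gini_sql -> 0.0; gini_index -> ~0.28, so "SQL" is the better feature to keep
```

A greedy backward search would repeatedly drop the term whose removal hurts such a purity score the least.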
Distances in TD matrices
Given a term-document matrix representation, we can now define distances between documents (or terms!).
Elements of the matrix can be 0/1 or term frequencies (sometimes normalized).
Can use Euclidean or cosine distance.
Cosine distance is the angle between the two vectors. Not intuitive, but it has been proven to work well.
If docs are the same, dc = 1; if they have nothing in common, dc = 0.
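A minimal cosine computation over rows of the earlier matrix (the function name is ours; note that the dc above, 1 for identical and 0 for disjoint documents, is the cosine similarity, with distance taken as 1 - similarity):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors:
    1.0 for identical directions, 0.0 for no terms in common."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = [24, 21, 9, 0, 0, 3]   # D1: a database document
d5 = [43, 31, 20, 0, 3, 0]  # D5: another database document
d6 = [2, 0, 0, 18, 7, 6]    # D6: a regression document
# cosine_similarity(d1, d5) is high (~0.99); cosine_similarity(d1, d6) is low (~0.10)
```

This is what you would want: the two database documents sit at a small angle to each other, far from the regression document.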
We can calculate cosine and Euclidean distance for this matrix.
What would you want the distances to look like?

     Database  SQL  Index  Regression  Likelihood  linear
D1        24    21      9           0           0       3
D2        32    10      5           0           3       0
D3        12    16      5           0           0       0
D4         6     7      2           0           0       0
D5        43    31     20           0           3       0
D6         2     0      0          18           7       6
D7         0     0      1          32          12       0
D8         3     0      0          22           4       4
D9         1     0      0          34          27      25
D10        6     0      0          17           4      23
Document distance
Pairwise distances between documents: image plots of cosine distance, Euclidean, and scaled Euclidean.
R function: image
Weighting in TD space
Not all phrases are of equal importance. E.g., "David" is less important than "Beckham": if a term occurs frequently in many documents, it has less discriminatory power.
One way to correct for this is inverse document frequency (IDF): IDF = log(N / Nj), where Nj = # of docs containing the term and N = total # of docs.
Term importance = Term Frequency (TF) x IDF
A term is important if it has a high TF and/or a high IDF. TF x IDF is a common measure of term importance.
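Taking IDF as the natural log of N/Nj reproduces the weighted table on the next slide exactly; for instance, SQL occurs in 5 of the 10 example documents and 21 times in D1, while Database occurs in 9 of the 10 and 24 times in D1:

```python
import math

def tf_idf(tf, nj, n):
    """Term importance = TF x IDF, with IDF = ln(N / Nj).
    tf: term count in this document; nj: # of docs containing the term;
    n: total # of docs."""
    return tf * math.log(n / nj)

# "SQL" in D1:      21 * ln(10/5) -> 14.6  (rare term, heavily up-weighted)
# "Database" in D1: 24 * ln(10/9) -> 2.53  (near-ubiquitous term, crushed)
sql_weight = tf_idf(21, 5, 10)
db_weight = tf_idf(24, 9, 10)
```

Even though Database has the higher raw count in D1, SQL ends up with the larger weight because it discriminates better between the two groups of documents.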
TF x IDF weights:

     Database   SQL  Index  Regression  Likelihood  linear
D1       2.53  14.6    4.6           0           0     2.1
D2       3.3    6.7    2.6           0         1.0       0
D3       1.3   11.1    2.6           0           0       0
D4       0.7    4.9    1.0           0           0       0
D5       4.5   21.5   10.2           0         1.0       0
D6       0.2     0      0         12.5         2.5    11.1
D7       0       0    0.5         22.2         4.3       0
D8       0.3     0      0         15.2         1.4     1.4
D9       0.1     0      0        23.56         9.6    17.3
D10      0.6     0      0         11.8         1.4    16.0

Raw term frequencies (TF):

     Database  SQL  Index  Regression  Likelihood  linear
D1        24    21      9           0           0       3
D2        32    10      5           0           3       0
D3        12    16      5           0           0       0
D4         6     7      2           0           0       0
D5        43    31     20           0           3       0
D6         2     0      0          18           7       6
D7         0     0      1          32          12       0
D8         3     0      0          22           4       4
D9         1     0      0          34          27      25
D10        6     0      0          17           4      23
Simple Question Answering Workflow
Building Blocks
Sentence Detection
Part of Speech Tagging
Parsing
Ch. 2
QA in Taming Text
Apache Solr for Passage Retrieval and integration
Apache OpenNLP for sentence detection, parsing, POS tagging and answer type classification
Custom code for Query Parsing and Scoring; see the com.tamingtext.qa package
Wikipedia for truth
Indexing
Ingest raw data into the system and make it available for search.
Garbage In, Garbage Out: spend some time understanding and modeling your data, just like you would with a DB. Lather, rinse, repeat.
See $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup and WikipediaWexIndexer.java for indexing code.
Aside: Named Entity Recognition
NER is the process of extracting proper names, etc. from text.
It plays a vital role in QA and many other NLP systems, and is often solved using classification approaches.
Answer Type Classification
Answer Type examples: Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M). See page 248 for more.
Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
P  Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
Search engines
Other Areas of NLP/Machine Learning
Clustering
Group together content based on some notion of similarity.
Book covers (ch. 6): search result clustering using Carrot2, whole-collection clustering using Mahout, and Topic Modeling.
Mahout comes with many different algorithms.
Clustering Use Cases
Google News
Outlier detection in smart grids
Recommendations
Products, People, etc.
In Focus: K-Means
http://en.wikipedia.org/wiki/K-means_clustering
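A bare-bones sketch of Lloyd's k-means algorithm on 2-D points (our own helper; text clustering would run the same loop over TF-IDF document vectors with cosine distance):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid recomputation for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                + (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster goes empty
                centroids[j] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two obvious blobs separate cleanly:
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

Mahout's implementation adds the things this sketch omits: convergence tests, smarter initialization, and distribution across a cluster.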
Fuzzy-Wuzzy Strings
Fuzzy string matching is a common, and difficult, problem.
Useful for solving problems like: "did you mean" spell checking, auto-suggest, record linkage
Common Approaches
See the com.tamingtext.fuzzy package.
Jaccard: measure character overlap
Levenshtein (Edit Distance): count the number of edits required to transform one word into the other
Jaro-Winkler: account for position
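Two of the three are easy to sketch (Jaro-Winkler is fiddlier and omitted here): `jaccard` below compares character sets, and `levenshtein` is the classic dynamic program; both function names are ours.

```python
def jaccard(a, b):
    """Character-overlap similarity: |A & B| / |A | B| of the character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn a into b, via the standard row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

# levenshtein("kitten", "sitting") -> 3; small distances drive "did you mean"
```

Jaccard is cheap but position-blind; Levenshtein respects order, which is why it (or a variant) usually backs spell-checking suggestions.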
Text Mining: Helpful Data
WordNet
Data Mining - Volinsky - 2011 - Columbia University. Courtesy: Luca Lanzi
Text Mining - Other Topics
Sentiment Analysis: automatically determine tone in text (positive, negative or neutral). Typically uses collections of good and bad words:
"While the traditional media is slowly starting to take John McCain's straight-talking image with increasingly large grains of salt, his base isn't quite ready to give up on their favorite son. Jonathan Alter's bizarre defense of McCain after he was caught telling an outright lie perfectly captures that reluctance[.]"
Often fit using Naive Bayes.
There are sentiment word lists out there: see http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis
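A minimal lexicon-based scorer in the spirit described above (the word lists are toy stand-ins; real lists like those linked above are far larger, and a trained Naive Bayes model would replace the raw count):

```python
POSITIVE = {"good", "great", "love", "excellent", "perfect"}  # toy lexicons
NEGATIVE = {"bad", "lie", "bizarre", "awful", "worst"}

def sentiment(text):
    """Positive-minus-negative lexicon hits; the sign gives the tone."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    score = (sum(t in POSITIVE for t in tokens)
             - sum(t in NEGATIVE for t in tokens))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# sentiment("caught telling an outright lie") -> 'negative'
```

Note the obvious failure mode: pure word counting has no notion of negation or sarcasm, which is exactly why the quoted passage above is hard for these systems.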
Sources: http://www.nybooks.com/articles/21470 , http://www.dailykos.com/storyonly/2008/6/10/14401/4418/822/533105 , http://www.newsweek.com/id/140470/output/print
Text Mining - Other Topics
Summarizing text: Word Clouds
Takes text as input, finds the most interesting words, and displays them graphically.
Blogs do this; see Wordle.net
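The data behind a word cloud is just a stop-word-filtered frequency count; a sketch (toy stop list, hypothetical helper):

```python
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "in", "is", "it"}

def top_words(text, k=3):
    """Most frequent non-stop-word tokens; a word cloud scales
    each word's display size by this count."""
    tokens = [t.lower().strip(".,!?") for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOP).most_common(k)

top = top_words("the text and the text mining of text", k=2)
# top -> [('text', 3), ('mining', 1)]
```

Tools like Wordle add layout and typography on top, but the "most interesting words" step is usually no more than this (sometimes with TF x IDF weights instead of raw counts).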
Much Harder Problems
Semantics, Pragmatics and beyond
Relationship Extraction
Cross-language Search
Importance