TamingText_slides.pptx


Transcript of TamingText_slides.pptx


Why Text?

How much data? 1.8 zettabytes (1.8 trillion GB). Most of the world's data is unstructured:
2009 HP survey: 70%; Gartner: 80%; Jerry Hill (Teradata) and Anant Jhingran (IBM): 85%

Structured (stored) data often misses elements critical to predictive modeling:
Un-transcribed fields, notes, comments. Examples: examiner/adjuster notes, surveys with free-text fields, medical charts


    Text Mining - Perspective


    Taming Text

    Grant Ingersoll

CTO, LucidWorks. @tamingtext, @gsingers


    About the Book

Goal: An engineer's guide to search, Natural Language Processing (NLP), and Machine Learning

Target audience: you. All examples are in Java, but the concepts are easily ported.

Covers: search, fuzzy string matching, human language basics, clustering, classification, question answering, and an intro to advanced topics


Content

Question Answering in detail: building blocks, indexing, search/passage retrieval, classification, scoring

Other interesting topics: clustering, fuzzy-wuzzy strings

What's next? Resources


    A Grain of Salt

Text is a strange and magical world filled with evil villains, jesters, wizards, unicorns, and heroes!

    In other words, no system will be perfect

    http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg


The Ugly Truth

You will spend most of your time in NLP, search, etc. doing grunt work nicely labeled as: preprocessing, feature selection, sampling, validation/testing, content extraction, ETL

Corollary: start with simple, tried-and-true algorithms, then iterate


Term/document matrix

The most common form of representation in text mining is the term-document matrix.

Term: typically a single word, but could be a word phrase like "data mining"
Document: a generic term meaning a collection of text to be retrieved
Can be large: term vocabularies are often 50k or larger, and documents can number in the billions (the web)
Can be binary, or use counts (a sketch of building one follows below)
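As a quick illustration (not code from the book), here is a minimal Java sketch that builds a term-document count matrix from a couple of in-memory documents; the whitespace tokenization and the tiny corpus are assumptions made for brevity.

```java
import java.util.*;

public class TermDocMatrix {
    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "the database uses SQL and an index",
            "regression estimates a linear model by maximum likelihood");

        // Collect the vocabulary (the term list).
        Set<String> vocab = new TreeSet<>();
        List<List<String>> tokenized = new ArrayList<>();
        for (String doc : docs) {
            List<String> tokens = Arrays.asList(doc.toLowerCase().split("\\W+"));
            tokenized.add(tokens);
            vocab.addAll(tokens);
        }
        List<String> terms = new ArrayList<>(vocab);

        // One row per document, one column per term, cell = raw count.
        int[][] matrix = new int[docs.size()][terms.size()];
        for (int d = 0; d < tokenized.size(); d++) {
            for (String token : tokenized.get(d)) {
                matrix[d][terms.indexOf(token)]++;
            }
        }

        System.out.println(terms);
        for (int[] row : matrix) System.out.println(Arrays.toString(row));
    }
}
```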


Term-document matrix

Each document now is just a vector of terms, sometimes boolean.

Example: 10 documents, 6 terms

        Database   SQL   Index   Regression   Likelihood   linear
  D1          24    21       9            0            0        3
  D2          32    10       5            0            3        0
  D3          12    16       5            0            0        0
  D4           6     7       2            0            0        0
  D5          43    31      20            0            3        0
  D6           2     0       0           18            7        6
  D7           0     0       1           32           12        0
  D8           3     0       0           22            4        4
  D9           1     0       0           34           27       25
  D10          6     0       0           17            4       23


Term-document matrix

We have lost all semantic content. Be careful constructing your term list!

Not all words are created equal! Words that are the same should be treated the same!

Stop words, stemming


Stop words

Many of the most frequently used words in English are worthless in retrieval and text mining; these words are called stop words.
Examples: "the", "of", "and", "to". Typically about 400 to 500 such words.
For an application, an additional domain-specific stop word list may be constructed.

Why do we need to remove stop words?
Reduce indexing (or data) file size: stop words account for 20-30% of total word counts.
Improve efficiency: stop words are not useful for searching or text mining, and they always have a large number of hits.

A minimal filtering sketch follows below.
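A minimal sketch of stop word filtering, assuming a tiny hard-coded stop list; a real application would load a 400-500 word list, possibly extended with domain-specific terms.

```java
import java.util.*;

public class StopWordFilter {
    // A tiny illustrative stop list; production lists run to several hundred words.
    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("the", "of", "and", "to", "a", "in", "is"));

    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t.toLowerCase())) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "index", "of", "the", "database");
        System.out.println(removeStopWords(tokens));  // [index, database]
    }
}
```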


Stemming

Techniques used to find the root/stem of a word.
E.g., user, users, used, using -> stem "use"; engineering, engineered, engineer -> stem "engineer"

Usefulness:
Improving effectiveness of retrieval and text mining (matching similar words)
Reducing indexing size: combining words with the same roots may reduce indexing size by as much as 40-50%.


Basic stemming methods

Remove endings:
If a word ends with a consonant other than s, followed by an s, then delete the s.
If a word ends in "es", drop the s.
If a word ends in "ing", delete the "ing" unless the remaining word consists of only one letter or of "th".
If a word ends with "ed", preceded by a consonant, delete the "ed" unless this leaves only a single letter.
...

Transform words:
If a word ends with "ies" but not "eies" or "aies", then "ies" --> "y".

(A literal Java sketch of these rules follows below.)
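The following toy Java stemmer is a literal transcription of the rules above (not the Porter stemmer that most real systems would use), just to make the rule order concrete.

```java
public class SimpleStemmer {

    private static boolean isConsonant(char c) {
        return "aeiou".indexOf(c) < 0;
    }

    /** A toy stemmer implementing the rules on this slide, applied in a fixed order. */
    public static String stem(String w) {
        // ies -> y, unless the word ends in "eies" or "aies"
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // word ends in "es": drop the s
        if (w.endsWith("es")) {
            return w.substring(0, w.length() - 1);
        }
        // consonant (other than s) followed by s: delete the s
        if (w.length() >= 2 && w.endsWith("s")
                && isConsonant(w.charAt(w.length() - 2)) && w.charAt(w.length() - 2) != 's') {
            return w.substring(0, w.length() - 1);
        }
        // "ing": delete unless what remains is one letter or "th"
        if (w.endsWith("ing")) {
            String rest = w.substring(0, w.length() - 3);
            if (rest.length() > 1 && !rest.equals("th")) return rest;
        }
        // "ed" preceded by a consonant: delete unless only a single letter remains
        if (w.length() >= 3 && w.endsWith("ed") && isConsonant(w.charAt(w.length() - 3))) {
            String rest = w.substring(0, w.length() - 2);
            if (rest.length() > 1) return rest;
        }
        return w;
    }

    public static void main(String[] args) {
        // Note how crude the literal rules are: "using" stems to "us", not "use".
        for (String w : new String[] {"users", "using", "engineered", "queries"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```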


Feature Selection

Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms, even after stemming and stop word removal.

Greedy search: start from the full set and delete one term at a time, finding the least important variable at each step. Can use the Gini index for this if it is a classification problem (a scoring sketch follows below).

Often performance does not degrade even with order-of-magnitude reductions.

Chakrabarti, Chapter 5: patent data, 9,600 patents in communication, electricity, and electronics. Only 140 out of 20,000 terms were needed for classification!
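As a hedged sketch of how the Gini index can score a term's importance for classification (one ingredient of the greedy search described above; the slides do not prescribe this exact code), the method below measures how much splitting the corpus on a term's presence reduces class impurity.

```java
import java.util.*;

public class GiniTermScore {

    /** Gini impurity of a class-count vector. */
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        if (total == 0) return 0.0;
        double sumSq = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSq += p * p;
        }
        return 1.0 - sumSq;
    }

    /**
     * Impurity reduction obtained by splitting the corpus on presence/absence of one term.
     * presence[d] = true if document d contains the term; label[d] = class of document d.
     */
    static double impurityReduction(boolean[] presence, int[] label, int numClasses) {
        int n = label.length;
        int[] all = new int[numClasses], with = new int[numClasses], without = new int[numClasses];
        int nWith = 0;
        for (int d = 0; d < n; d++) {
            all[label[d]]++;
            if (presence[d]) { with[label[d]]++; nWith++; } else { without[label[d]]++; }
        }
        double split = (nWith * gini(with) + (n - nWith) * gini(without)) / n;
        return gini(all) - split;   // larger = more discriminative term
    }

    public static void main(String[] args) {
        // Toy corpus: 6 documents, 2 classes; the term appears in documents 0-2 only.
        boolean[] presence = {true, true, true, false, false, false};
        int[] label = {0, 0, 0, 1, 1, 1};
        System.out.println(impurityReduction(presence, label, 2)); // 0.5: a perfect separator
    }
}
```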


Distances in TD matrices

Given a term-document matrix representation, we can now define distances between documents (or terms!).

Elements of the matrix can be 0/1 or term frequencies (sometimes normalized).

Can use Euclidean or cosine distance. Cosine distance is based on the angle between the two vectors: d_c(x, y) = (x . y) / (||x|| ||y||). Not intuitive, but it has been proven to work well.

If two docs are the same, d_c = 1; if they have nothing in common, d_c = 0.


We can calculate cosine and Euclidean distance for this matrix. What would you want the distances to look like? (See the sketch below.)

        Database   SQL   Index   Regression   Likelihood   linear
  D1          24    21       9            0            0        3
  D2          32    10       5            0            3        0
  D3          12    16       5            0            0        0
  D4           6     7       2            0            0        0
  D5          43    31      20            0            3        0
  D6           2     0       0           18            7        6
  D7           0     0       1           32           12        0
  D8           3     0       0           22            4        4
  D9           1     0       0           34           27       25
  D10          6     0       0           17            4       23
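A minimal sketch computing the cosine similarity (the slide's d_c) and Euclidean distance between rows of the matrix above; D1 and D3 are both database documents while D6 is a statistics document, so cosine should be near 1 for the first pair and near 0 for the second.

```java
public class DocDistance {

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Rows D1, D3 and D6 from the example term-document matrix above.
        double[] d1 = {24, 21, 9, 0, 0, 3};
        double[] d3 = {12, 16, 5, 0, 0, 0};
        double[] d6 = {2, 0, 0, 18, 7, 6};

        System.out.println("cos(D1,D3) = " + cosineSimilarity(d1, d3)); // close to 1: same topic
        System.out.println("cos(D1,D6) = " + cosineSimilarity(d1, d6)); // close to 0: different topics
        System.out.println("euclidean(D1,D3) = " + euclidean(d1, d3));
    }
}
```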


Document distance

Pairwise distances between documents: image plots of cosine, Euclidean, and scaled Euclidean distance (R function: image).


Weighting in TD space

Not all phrases are of equal importance. E.g., "David" is less important than "Beckham": if a term occurs frequently in many documents, it has less discriminatory power.

One way to correct for this is inverse document frequency (IDF):
IDF_j = log(N / N_j), where N_j = # of docs containing term j and N = total # of docs.

Term importance = Term Frequency (TF) x IDF. A term is important if it has a high TF and/or a high IDF. TF x IDF is a common measure of term importance.


TF x IDF weighted matrix:

        Database    SQL   Index   Regression   Likelihood   linear
  D1        2.53   14.6     4.6            0            0      2.1
  D2         3.3    6.7     2.6            0          1.0        0
  D3         1.3   11.1     2.6            0            0        0
  D4         0.7    4.9     1.0            0            0        0
  D5         4.5   21.5    10.2            0          1.0        0
  D6         0.2      0       0         12.5          2.5     11.1
  D7           0      0     0.5         22.2          4.3        0
  D8         0.3      0       0         15.2          1.4      1.4
  D9         0.1      0       0        23.56          9.6     17.3
  D10        0.6      0       0         11.8          1.4     16.0

Raw TF matrix (for comparison):

        Database   SQL   Index   Regression   Likelihood   linear
  D1          24    21       9            0            0        3
  D2          32    10       5            0            3        0
  D3          12    16       5            0            0        0
  D4           6     7       2            0            0        0
  D5          43    31      20            0            3        0
  D6           2     0       0           18            7        6
  D7           0     0       1           32           12        0
  D8           3     0       0           22            4        4
  D9           1     0       0           34           27       25
  D10          6     0       0           17            4       23
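A small sketch of the TF x IDF weighting; the weighted table above is consistent with a natural-log IDF, so that is what this code uses (e.g. 24 x ln(10/9) is roughly 2.53 for D1/Database).

```java
public class TfIdf {

    /**
     * Weight a raw term-frequency matrix with TF x IDF, where IDF = ln(N / Nj),
     * N = number of documents and Nj = number of documents containing term j.
     */
    static double[][] tfidf(int[][] tf) {
        int numDocs = tf.length;
        int numTerms = tf[0].length;
        double[][] weighted = new double[numDocs][numTerms];
        for (int j = 0; j < numTerms; j++) {
            int docsWithTerm = 0;
            for (int[] row : tf) if (row[j] > 0) docsWithTerm++;
            double idf = Math.log((double) numDocs / docsWithTerm);
            for (int d = 0; d < numDocs; d++) {
                weighted[d][j] = tf[d][j] * idf;
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        int[][] tf = {
            {24, 21, 9, 0, 0, 3}, {32, 10, 5, 0, 3, 0}, {12, 16, 5, 0, 0, 0},
            {6, 7, 2, 0, 0, 0},   {43, 31, 20, 0, 3, 0}, {2, 0, 0, 18, 7, 6},
            {0, 0, 1, 32, 12, 0}, {3, 0, 0, 22, 4, 4},   {1, 0, 0, 34, 27, 25},
            {6, 0, 0, 17, 4, 23}
        };
        double[][] w = tfidf(tf);
        System.out.printf("D1/Database: %.2f%n", w[0][0]); // ~2.53
        System.out.printf("D1/SQL:      %.2f%n", w[0][1]); // ~14.56
    }
}
```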


Simple Question Answering Workflow


    Building Blocks

    Sentence Detection

    Part of Speech Tagging

    Parsing

    Ch. 2


    QA in Taming Text

Apache Solr for passage retrieval and integration

Apache OpenNLP for sentence detection, parsing, POS tagging, and answer type classification

Custom code for query parsing and scoring; see the com.tamingtext.qa package

    Wikipedia for truth
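As an illustration of the OpenNLP building blocks named above (sentence detection, tokenization, POS tagging), here is a hedged sketch; the pre-trained model file names (en-sent.bin, en-token.bin, en-pos-maxent.bin) are the standard downloads, and the local paths are assumptions.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class BuildingBlocksDemo {
    public static void main(String[] args) throws Exception {
        // Pre-trained model files; paths are assumptions, adjust to your local copies.
        try (InputStream sentIn = new FileInputStream("en-sent.bin");
             InputStream tokIn = new FileInputStream("en-token.bin");
             InputStream posIn = new FileInputStream("en-pos-maxent.bin")) {

            SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posIn));

            String text = "Who was the Sun King? He reigned in France.";
            for (String sentence : sentenceDetector.sentDetect(text)) {
                String[] tokens = tokenizer.tokenize(sentence);
                String[] tags = tagger.tag(tokens);
                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "/" + tags[i]);
                }
            }
        }
    }
}
```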


Indexing

Ingest raw data into the system and make it available for search. Garbage in, garbage out: you need to spend some time understanding and modeling your data, just like you would with a DB.

Lather, rinse, repeat.

See $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup and WikipediaWexIndexer.java for indexing code.
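A minimal SolrJ indexing sketch, not the book's WikipediaWexIndexer; the core name qa and the field names id/title/body are hypothetical stand-ins for whatever the schema.xml actually defines.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical core name and field names; the book's real schema lives in
        // $TT_HOME/apache-solr/solr-qa/conf/schema.xml.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/qa").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "wiki-1");
            doc.addField("title", "Louis XIV of France");
            doc.addField("body", "Louis XIV, known as the Sun King, was King of France...");
            solr.add(doc);
            solr.commit();   // make the document searchable
        }
    }
}
```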


    Aside: Named Entity Recognition

NER is the process of extracting proper names, etc. from text.

It plays a vital role in QA and many other NLP systems and is often solved using classification approaches.
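A short OpenNLP name-finder sketch for the NER step; the model file en-ner-person.bin is one of the standard pre-trained models, and its local path is an assumption.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class PersonFinder {
    public static void main(String[] args) throws Exception {
        // "en-ner-person.bin" is a pre-trained OpenNLP name-finder model.
        try (InputStream in = new FileInputStream("en-ner-person.bin")) {
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(in));
            String[] tokens = {"Grant", "Ingersoll", "wrote", "Taming", "Text", "."};
            Span[] spans = finder.find(tokens);
            for (String name : Span.spansToStrings(spans, tokens)) {
                System.out.println("PERSON: " + name);
            }
        }
    }
}
```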


Answer Type Classification

Answer type examples: Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M). See page 248 for more.

Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
P Which French monarch reinstated the divine right of the monarchy to France and was known as 'The Sun King' because of the splendour of his reign?
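The book ships its own answer type classifier in com.tamingtext.qa; as a generic, hedged sketch, OpenNLP's document categorizer can play the same role at inference time (answer-types.bin is a hypothetical model trained on label-prefixed questions like the example above, and the API shown is OpenNLP 1.8+).

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.tokenize.SimpleTokenizer;

public class AnswerTypeClassifier {
    public static void main(String[] args) throws Exception {
        // "answer-types.bin" is a hypothetical model trained on label-prefixed questions
        // like the "P Which French monarch..." example above.
        try (InputStream in = new FileInputStream("answer-types.bin")) {
            DocumentCategorizerME categorizer = new DocumentCategorizerME(new DoccatModel(in));
            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                "Which French monarch was known as the Sun King?");
            double[] outcomes = categorizer.categorize(tokens);
            System.out.println("Answer type: " + categorizer.getBestCategory(outcomes));
        }
    }
}
```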


    Search engines


Other Areas of NLP/Machine Learning


Clustering

Group together content based on some notion of similarity.

The book covers (ch. 6):
Search result clustering using Carrot2
Whole-collection clustering using Mahout
Topic modeling

Mahout comes with many different algorithms.


    Clustering Use Cases

    Google News

    Outlier detection in smart grids

Recommendations (products, people, etc.)


    In Focus: K-Means

    http://en.wikipedia.org/wiki/K-means_clustering
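Mahout runs k-means as a distributed job; to keep the algorithm itself in focus, here is a self-contained toy Java implementation (random initialization, fixed iteration count) rather than the Mahout API.

```java
import java.util.*;

public class KMeans {

    /** Plain k-means with random initialization; returns the cluster index of each point. */
    static int[] cluster(double[][] points, int k, int iterations, long seed) {
        Random rnd = new Random(seed);
        int dim = points[0].length;

        // 1. Pick k random points as the initial centroids.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // 2. Assign each point to its nearest centroid (squared Euclidean distance).
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int d = 0; d < dim; d++) {
                        double diff = points[p][d] - centroids[c][d];
                        dist += diff * diff;
                    }
                    if (dist < best) { best = dist; assignment[p] = c; }
                }
            }
            // 3. Move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) sums[assignment[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {1, 0.5}, {8, 8}, {9, 8.5}, {8.5, 9}};
        System.out.println(Arrays.toString(cluster(points, 2, 10, 42)));
    }
}
```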


    Fuzzy-Wuzzy Strings

Fuzzy string matching is a common, and difficult, problem.

Useful for solving problems like: "did you mean" spell checking, auto-suggest, record linkage


    Common Approaches

See the com.tamingtext.fuzzy package

Jaccard: measure character overlap
Levenshtein (edit distance): count the number of edits required to transform one word into the other (sketch below)
Jaro-Winkler: accounts for position
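As a quick illustration of one of these approaches (a plain Levenshtein edit distance, not the book's com.tamingtext.fuzzy code):

```java
public class EditDistance {

    /** Classic dynamic-programming Levenshtein distance: insertions, deletions, substitutions. */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("tamingtext", "taming text")); // 1 (one insertion)
    }
}
```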


Text Mining: Helpful Data

WordNet

Data Mining - Volinsky - 2011 - Columbia University. Courtesy: Luca Lanzi


Text Mining - Other Topics: Sentiment Analysis

Automatically determine the tone in text: positive, negative, or neutral. Typically uses collections of good and bad words (a naive word-list scorer is sketched below).

"While the traditional media is slowly starting to take John McCain's straight-talking image with increasingly large grains of salt, his base isn't quite ready to give up on their favorite son. Jonathan Alter's bizarre defense of McCain after he was caught telling an outright lie, perfectly captures that reluctance[.]"

Often fit using Naive Bayes.

There are sentiment word lists out there; see http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis


Sources: http://www.nybooks.com/articles/21470, http://www.dailykos.com/storyonly/2008/6/10/14401/4418/822/533105, http://www.newsweek.com/id/140470/output/print
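A deliberately naive word-list scorer of the kind described above (the word lists here are tiny, made-up examples rather than a real lexicon, and this is not the Naive Bayes approach):

```java
import java.util.*;

public class WordListSentiment {
    // Tiny illustrative word lists; real lists (see the link above) contain thousands of entries.
    private static final Set<String> POSITIVE = new HashSet<>(
        Arrays.asList("good", "great", "excellent", "love", "perfect"));
    private static final Set<String> NEGATIVE = new HashSet<>(
        Arrays.asList("bad", "bizarre", "lie", "awful", "hate"));

    /** Positive score > 0, negative < 0, neutral = 0. */
    static int score(String text) {
        int score = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) score++;
            if (NEGATIVE.contains(token)) score--;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("Jonathan Alter's bizarre defense of McCain after he was caught telling an outright lie"));
        // negative score: note this naive count misses negation, sarcasm, and context
    }
}
```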

    Text Mining - Other Topics

Summarizing text: word clouds. Take text as input, find the most interesting words, and display them graphically.

Blogs do this; see Wordle.net


    Much Harder Problems

Semantics, pragmatics, and beyond; relationship extraction; cross-language search; importance