Statistical Language Processing


Transcript of Statistical Language Processing

  • Statistical language processing

    Concepts and Algorithms

    A. Georgakis, PhD

  • ToC

    Basic definitions

    Text mining

    Performance evaluation

    References

  • Definitions

    SLP is NLP on steroids

    A move away from rule-based methods

    Covers a wide area: automatic summarization, machine translation,
    named entity recognition, part-of-speech tagging, sentence boundary
    disambiguation, sentiment analysis, word sense disambiguation, etc.

  • Automatic summarization

    "...transformation of source text to summary text through content reduction
    by selection, generalization and transformation" (S. Jones, 1999)

    ...but there are many more definitions; the term remains ambiguous

    For additional info see
    http://www.slideshare.net/dinel/orasan-ranlp2009

  • Machine translation

    Substitution of source text into a target language

    Usage of parallel corpora; the Internet is a vast source of such data

    Pivot languages

  • Named entity recognition

    Identify proper names and their types: Peter → person; Paris → city or person

    Capitalization is not always a good cue: some languages do not use capitals,
    German capitalizes all nouns, and words at the beginning of sentences are
    capitalized regardless

  • Part-of-speech tagging

    Determine the part of speech for each word: "Well, she and young John walk
    to school slowly"

    English has 9 parts of speech: noun, verb, article, adjective, preposition,
    pronoun, adverb, conjunction, and interjection

    ...but as a linguist you will need to use somewhere between 50 and 150 tags

  • Sentence boundary disambiguation

    Where does a sentence start and stop? Punctuation marks are problematic

    Rule-based method: precompiled list of abbreviations (see the sketch below)

    90% of periods are sentence boundaries (Riley, 1999)

    ~47% of periods in the Wall Street Journal are abbreviations (Stammatos, 2009)

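    A minimal rule-based splitter in this spirit, assuming a toy abbreviation
    list and a simple end-of-token heuristic (both are illustrative choices, not
    the method cited above):

```python
# A minimal rule-based sentence boundary sketch: split on ., ! or ? unless
# the token is a known abbreviation. The abbreviation list is illustrative.
import re

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc.", "a.m.", "p.m."}

def split_sentences(text: str) -> list[str]:
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if re.search(r"[.!?]$", tok) and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived at 9 a.m. yesterday. He left early!"))
# ['Dr. Smith arrived at 9 a.m. yesterday.', 'He left early!']
```
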
  • Sentiment analysis

    Identify the polarity and emotional state of a given text:

    positive or negative; angry, sad, unhappy

    A rather tough problem to solve due to language ambiguity

  • Word sense disambiguation

    Identify the sense of different words

    ML on top of human knowledge

    Thesauri, ontologies, corpora, ...

    For more info see
    http://www.dsi.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf

  • Basic tools I

    Corpora: balanced and representative collections of documents

    Stopping: removal of common words
    "I will be at the park tomorrow evening" → "park tomorrow evening"

    Stemming: removal of word inflection (see the sketch below)
    "walking" → "walk"

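    A minimal stopping and stemming sketch; the stop-word list and suffix rules
    are tiny illustrative choices, whereas real systems use curated lists and a
    proper stemmer such as Porter's algorithm:

```python
# Stopping + crude suffix-stripping "stemming" on a whitespace-tokenized text.
STOP_WORDS = {"i", "will", "be", "at", "the", "a", "an", "of", "to"}
SUFFIXES = ("ing", "ed", "s")          # crude inflection stripping

def preprocess(text: str) -> list[str]:
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOP_WORDS]      # stopping
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]                            # stemming
                break
        stemmed.append(t)
    return stemmed

print(preprocess("I will be at the park tomorrow evening"))
# ['park', 'tomorrow', 'even'] -- the crude rule also clips "evening"
```
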
  • Basic tools II

    N-grams: sequences of n consecutive unigrams (see the sketch below)

    Dimensionality reduction: PCA, SVD, NMF, ...

    Language modelling: LSA, pLSA, LDA, ...

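    A minimal n-gram generator, assuming whitespace tokenization for
    illustration:

```python
# Sliding windows of n consecutive tokens.
def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the quick brown fox jumps"
print(ngrams(sentence, 2))   # bigrams
print(ngrams(sentence, 3))   # trigrams
```
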
  • Language analysis

    [Pipeline diagram: Source text → Pre-processing → Tokenization →
    Disambiguation → Dim. reduction → Clustering → Syntactic / Semantic
    analysis → Results]

  • Text mining I

    Keyword indexing: a big, REALLY big table; the term-to-document matrix

    Bag-of-words (see the sketch below)

    Used in IR, search engines, etc.

    Unigram → N-gram transition
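    A minimal sketch of the bag-of-words term-to-document matrix, assuming a toy
    whitespace-tokenized corpus (corpus and tokenization are illustrative):

```python
# Build a term-to-document count matrix: rows = terms, columns = documents.
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats are friends",
]

vocab = sorted({tok for doc in docs for tok in doc.split()})
counts = [Counter(doc.split()) for doc in docs]
td_matrix = [[c[term] for c in counts] for term in vocab]

for term, row in zip(vocab, td_matrix):
    print(f"{term:>8}: {row}")
```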

  • Text mining II

    1968, Salton: Vector Space Model (VSM)

    Scaling or normalization: term frequency–inverse document frequency (TF-IDF),
    log-entropy scaling

    Document similarity: cosine or Euclidean distance (see the sketch below)

    VSM shortcomings: no inter- or intra-document context; N-grams offer a
    partial solution
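    A minimal TF-IDF weighting and cosine-similarity sketch; the toy corpus and
    the raw-count TF / smoothed IDF variant are illustrative assumptions (many
    weighting variants exist):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})

def tfidf(tokens):
    tf = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for term in vocab:
        df = sum(term in toks for toks in tokenized)      # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0      # smoothed IDF
        vec.append(tf[term] * idf)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vecs = [tfidf(toks) for toks in tokenized]
print(f"cosine similarity: {cosine(vecs[0], vecs[1]):.3f}")
```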

  • Text mining III

    1990, Deerwester: Latent Semantic Analysis (LSA)

    SVD on the term-by-document matrix; keep a k-dimensional subspace (concepts)

    Each concept is a linear combination of terms, analogous to frequencies in
    Fourier analysis (see the sketch below)

    LSA shortcomings: computationally expensive, updating is equally expensive,
    and the concepts are not intuitive
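    A minimal LSA sketch via truncated SVD of a small count matrix; the matrix
    values and k = 2 are illustrative assumptions:

```python
import numpy as np

# rows = terms, columns = documents
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                          # number of latent "concepts"
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation
doc_coords = np.diag(s[:k]) @ Vt[:k, :]        # documents in concept space

print("singular values:", np.round(s, 3))
print("rank-2 approximation:\n", np.round(X_k, 2))
print("document coordinates:\n", np.round(doc_coords, 2))
```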

  • Text mining IV

    1999, Hofmann: Probabilistic LSA (pLSA), or aspect model

    Probabilistic topic models: a statistical foundation; the latent variable is
    analogous to the hidden states in an HMM (the standard form of the model is
    written out below)

    pLSA shortcomings: prone to overfitting

    pLSA. Source: Berry, 2010
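    For reference, the textbook factorization of the aspect model, in which the
    document–word joint distribution is decomposed over a latent topic z (this
    is the standard formulation, not reproduced from the slide):

```latex
% Aspect model (pLSA): joint distribution of documents d and words w,
% factorized over latent topics z.
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
        = P(d) \sum_{z} P(z \mid d)\, P(w \mid z)
```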

  • Text mining V

    [Figure. Source: Blei, 2011]

  • Text mining VI

    [Figure. Source: Blei, 2011]

  • Text mining VII

    Probabilistic topic models uncover the relationship between observed and
    hidden variables: pLSA, LDA (see Ando's presentation)

    LDA extensions: relax the statistical assumptions, use metadata
    (a small fitting example follows below)

    For an introduction see
    http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

    LDA. Source: Berry, 2010

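    A minimal LDA fit using scikit-learn (my choice of library; the slides do not
    prescribe one); the corpus and n_components = 2 are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the stock market fell on interest rate fears",
    "investors sold shares as rates rose",
    "the team won the football match last night",
    "the striker scored twice in the match",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)             # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)              # document-topic proportions

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):    # topic-word weights
    top = topic.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])
print("document-topic mixtures:\n", doc_topics.round(2))
```
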
  • Text mining VIII

    Assumptions

    Word order is irrelevant (bag-of-words): unrealistic but used extensively

    Words are generated conditioned on the previous words (Markov property)

    Order of documents in the corpus is irrelevant

    Word distribution is static over time

    Number of topics: known and fixed

  • Text mining IX

    Meta-data: author-topic model (Rosen-Zvi et al., 2004); author, title,
    location, etc.

    Hyperlink analysis

  • Matrix factorization techniques I

    SVD
    X = U S V^T, where U and V hold the singular vectors and S the singular values

    PCA
    Y = W^T X, where the columns of W are the eigenvectors of the covariance
    matrix, ordered by their eigenvalues

    ICA
    Independence for the components (neither orthogonal nor in rank order)

    NMF
    X ≈ W H, with W and H non-negative

  • Matrix factorization techniques II

    SVD, PCA and ICA
    Eigenvalue based; fast; converge under certain conditions; the resulting
    sub-space is not intuitive

    NMF
    Numerically unstable; converges to a local minimum; iterative process; the
    resulting sub-space is more natural

  • [Figure. Source: Lee, 1999]

  • Matrix factorization techniques III

    Problems with NMF (a sketch of the multiplicative updates follows below):

    Initialization

    Convergence speed

    Iterative; converges only to a local minimum
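    A minimal NMF sketch using the Lee & Seung multiplicative update rules; the
    data matrix, rank r = 2, and iteration count are illustrative, and a real
    implementation would also monitor the reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))          # non-negative data matrix (terms x documents)
r = 2                           # factorization rank

W = rng.random((X.shape[0], r))
H = rng.random((r, X.shape[1]))
eps = 1e-9                      # avoids division by zero

for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)   # update topic-document factors
    W *= (X @ H.T) / (W @ H @ H.T + eps)   # update term-topic factors

print("reconstruction error:", np.linalg.norm(X - W @ H))
```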

  • Text streams

    Detecting changes in sentiment: surprise, emerging topics

    Text-to-number conversion

    Time signatures

    Temporal histogram (see the sketch below); Teele's work

    Source: Berry, 2009
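    A minimal "temporal histogram" sketch that counts how often a term appears
    per time bucket in a text stream; the stream and the tracked term are
    illustrative assumptions:

```python
from collections import Counter

stream = [
    ("2019-01", "markets calm today"),
    ("2019-02", "sudden crash shakes markets"),
    ("2019-02", "crash aftermath and recovery talk"),
    ("2019-03", "markets recover slowly"),
]

term = "crash"
histogram = Counter(month for month, text in stream if term in text.split())

for month in sorted({m for m, _ in stream}):
    print(month, "#" * histogram[month])   # crude text-based histogram
```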

  • Performance evaluation I

    Contingency matrix:

                       System output
                       Positive   Negative
    True    Positive      TP         FN
    output  Negative      FP         TN

    Accuracy:   A = (TP + TN) / m,  where m is the total number of samples

    Recall:     R = TP / (TP + FN)

    Precision:  P = TP / (TP + FP)

    (a computation sketch follows below)
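    A minimal evaluation sketch computing accuracy, recall, precision and F1
    from contingency-matrix counts; the counts below are made-up numbers:

```python
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    m = tp + fp + fn + tn                      # total number of samples
    accuracy = (tp + tn) / m
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

print(evaluate(tp=40, fp=10, fn=20, tn=30))
# accuracy 0.70, recall ~0.67, precision 0.80, F1 ~0.73
```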

  • Performance evaluation II

    Precision-Recall curve

  • Performance evaluation III

    F-measure: the weighted harmonic mean of precision and recall

    F = 1 / ( a/P + (1 - a)/R ),  with weight a in [0, 1]

    For a = 1/2 this reduces to the balanced F1 = 2PR / (P + R)
    (worked example below)
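    A quick worked check with illustrative values P = 0.8 and R = 0.6:

```latex
% Balanced F-measure (a = 1/2) for P = 0.8, R = 0.6:
F_1 = \frac{1}{\tfrac{1}{2}\cdot\tfrac{1}{0.8} + \tfrac{1}{2}\cdot\tfrac{1}{0.6}}
    = \frac{2 \cdot 0.8 \cdot 0.6}{0.8 + 0.6}
    \approx 0.686
```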

  • References

    A. Clark, C. Fox and S. Lappin, eds., The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010.

    M. W. Berry and J. Kogan, Text Mining: Applications and Theory, Wiley, 2010.

    J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.

    N. Indurkhya and F. J. Damerau, eds., Handbook of Natural Language Processing, CRC, 2010.

    C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 2000.

    R. Nisbet, J. Elder and G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.

    M. T. Özsu, ed., Methods for Mining and Summarizing Text Conversations, Morgan & Claypool, 2011.

    M. Song and Y.-F. B. Wu, Handbook of Research on Text and Web Mining Technologies, IGI, 2009.

  • References

    D. M. Blei, A. Y. Ng and M. I. Jordan, Latent Dirichlet Allocation, J. Machine Learning Research, vol. 3, 2003.

    S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, J. American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.

    M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, Proc. of 20th Conf. on Uncertainty in Artificial Intelligence (UAI '04), 2004.

    C. Orăsan, Automatic Summarisation in the Information Age, Int. Conf. on Recent Advances in Natural Language Processing (RANLP '09), 2009.

    R. Navigli, Word Sense Disambiguation: A Survey, ACM Comput. Surv., vol. 41, no. 2, 2009.

    D. M. Blei, Introduction to Probabilistic Topic Models, ACM Press, pp. 1-16, 2010.