Probabilistic Language Processing
Transcript of Probabilistic Language Processing
Probabilistic Language Processing
Chapter 23
Probabilistic Language Models
• Goal -- define a probability distribution over a set of strings
• Unigram, bigram, n-gram
• Count using a corpus, but counts need smoothing:
– add-one
– linear interpolation
• Evaluate with the perplexity measure
• E.g., segment words without spaces with Viterbi
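A minimal sketch of a bigram model with add-one smoothing and perplexity, on a toy corpus (the corpus and test sentence are invented for illustration, not from the chapter):

```python
import math
from collections import Counter

# Toy corpus (illustrative only).
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)
V = len(vocab)

def p_bigram(w_prev, w):
    """Add-one (Laplace) smoothed bigram probability."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def perplexity(words):
    """Perplexity of a word sequence under the bigram model."""
    log_p = sum(math.log(p_bigram(a, b)) for a, b in zip(words, words[1:]))
    return math.exp(-log_p / (len(words) - 1))

test = "the cat sat on the rug".split()
print(perplexity(test))
```

Lower perplexity means the model finds the sequence less surprising; smoothing keeps unseen bigrams from getting probability zero.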
PCFGs
• Rewrite rules have probabilities.
• Prob of a string is sum of probs of its parse trees.
• Context-freedom means no lexical constraints.
• Prefers short sentences.
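A parse tree's probability under a PCFG is the product of the probabilities of the rules it uses; a string's probability then sums this over all its parse trees. A toy sketch with made-up rule probabilities:

```python
# Rule probabilities for a tiny PCFG (hypothetical numbers).
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("VP", ("eats", "NP")): 1.0,
}

def tree_prob(tree):
    """Probability of a parse tree = product of its rule probabilities.
    A tree is (label, children); a leaf child is a plain string."""
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, child_labels)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

t = ("S", [("NP", ["she"]), ("VP", ["eats", ("NP", ["fish"])])])
print(tree_prob(t))  # 1.0 * 0.6 * 1.0 * 0.4 = 0.24
```

For an ambiguous string one would enumerate its parses and sum tree_prob over them.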
Learning PCFGs
• Parsed corpus -- count trees.
• Unparsed corpus
– Rule structure known -- use EM (the inside-outside algorithm)
– Rules unknown -- Chomsky normal form… problems.
Information Retrieval
• Goal: what Google does -- find docs relevant to the user’s needs.
• An IR system has a document collection, a query in some language, a set of results, and a presentation of the results.
• Ideally, parse docs into a knowledge base… too hard.
IR 2
• Boolean Keyword Model -- in or out?
• Problem -- single bit of “relevance”
• Boolean combinations a bit mysterious
• How compute P(R=true | D,Q)?
• Estimate a language model for each doc, compute the prob of the query given the model.
• Can rank documents by P(r|D,Q)/P(~r|D,Q)
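A minimal sketch of the language-model approach: estimate an add-one-smoothed unigram model per document and rank by the probability of the query under each model (the documents and query are invented examples):

```python
from collections import Counter

# Two toy documents (hypothetical text).
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "stock prices rose on heavy trading".split(),
}
vocab = {w for d in docs.values() for w in d}
V = len(vocab)

def query_likelihood(doc, query):
    """P(query | doc's unigram language model), add-one smoothed."""
    counts = Counter(doc)
    n = len(doc)
    p = 1.0
    for w in query.split():
        p *= (counts[w] + 1) / (n + V)
    return p

ranked = sorted(docs, key=lambda d: query_likelihood(docs[d], "cat mat"),
                reverse=True)
print(ranked)  # d1 should outrank d2 for this query
```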
IR 3
• For this, we need a model of how queries are related to docs. Bag of words: frequency of words in the doc; naïve Bayes.
• Good example pp 842-843.
Evaluating IR
• Precision is the proportion of results that are relevant.
• Recall is the proportion of relevant docs that are in the results.
• ROC curve (there are several varieties): the standard is to plot false negatives vs. false positives.
• More “practical” for the web: reciprocal rank of the first relevant result, or just “time to answer”.
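These measures can be sketched directly; the doc ids and relevance judgments below are hypothetical:

```python
def precision(results, relevant):
    """Proportion of returned results that are relevant."""
    return sum(1 for r in results if r in relevant) / len(results)

def recall(results, relevant):
    """Proportion of relevant documents that appear in the results."""
    return sum(1 for r in relevant if r in results) / len(relevant)

def reciprocal_rank(results, relevant):
    """1/rank of the first relevant result (0 if none is returned)."""
    for i, r in enumerate(results, start=1):
        if r in relevant:
            return 1 / i
    return 0.0

results = ["d3", "d1", "d7", "d2"]   # system output, hypothetical doc ids
relevant = {"d1", "d2", "d5"}        # ground-truth relevant set
print(precision(results, relevant))        # 2/4 = 0.5
print(recall(results, relevant))           # 2/3
print(reciprocal_rank(results, relevant))  # first relevant at rank 2 -> 0.5
```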
IR Refinements
• Case
• Stems
• Synonyms
• Spelling correction
• Metadata -- keywords
IR Presentation
• Give list in order of relevance, deal with duplicates
• Cluster results into classes
– Agglomerative
– K-means
• How to describe automatically-generated clusters? A word list? The title of the centroid doc?
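A plain k-means sketch over word-frequency vectors (toy documents; a real system would use better initialization and tf-idf weighting):

```python
from collections import Counter

def tf_vector(doc, vocab):
    """Term-frequency vector for a document over a fixed vocabulary."""
    c = Counter(doc.split())
    return [c[w] for w in vocab]

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vecs, k, iters=10):
    """Plain k-means: assign each vector to its nearest centroid,
    recompute centroids, repeat."""
    centroids = [list(v) for v in vecs[:k]]  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vecs:
            j = min(range(k), key=lambda i: dist2(v, centroids[i]))
            clusters[j].append(v)
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return [min(range(k), key=lambda i: dist2(v, centroids[i])) for v in vecs]

docs = ["cat sat mat", "cat mat cat", "stock rose trading", "stock trading stock"]
vocab = sorted({w for d in docs for w in d.split()})
labels = kmeans([tf_vector(d, vocab) for d in docs], k=2)
print(labels)  # the two cat docs share one label, the two stock docs the other
```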
IR Implementation
• CSC172!
• Lexicon with a “stop list”
• “Inverted” index: where words occur
• Match with vectors: vector of word frequencies dotted with query terms.
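A toy sketch of these pieces together: a stop list, an inverted index mapping each word to the documents (and frequencies) where it occurs, and scoring by the dot product of frequency vectors (the documents and stop list are invented):

```python
from collections import Counter, defaultdict

stop_list = {"the", "a", "on", "of"}
docs = {
    "d1": "the cat sat on the mat",
    "d2": "a dog ate the cat food",
}

# Inverted index: word -> {doc id: frequency}, skipping stop words.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for word, freq in Counter(text.split()).items():
        if word not in stop_list:
            index[word][doc_id] = freq

def score(query):
    """Dot product of query term counts with document frequency vectors;
    only documents containing some query term are ever touched."""
    scores = Counter()
    for word, q_freq in Counter(query.split()).items():
        for doc_id, d_freq in index.get(word, {}).items():
            scores[doc_id] += q_freq * d_freq
    return scores

print(score("cat food"))  # d2 matches both terms, d1 only one
```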
Information Extraction
• Goal: create database entries from docs.
• Emphasis on massive data, speed, stylized expressions
• Regular expression grammars OK if stylized enough
• Cascaded finite-state transducers -- stages of grouping and structure-finding
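A regex-based extraction sketch in the cascaded spirit: one stage groups low-level pieces (money amounts, dates), a later stage finds the relational structure over them. The text, patterns, and field names are hypothetical illustrations:

```python
import re

# Hypothetical announcement text in a stylized form.
text = "Acme Corp. acquired Widget Inc. for $12 million on 2024-03-01."

# Stage 1: group low-level tokens (money, dates).
money = re.search(r"\$\d+(?:\.\d+)? (?:million|billion)", text)
date = re.search(r"\d{4}-\d{2}-\d{2}", text)

# Stage 2: find the acquisition structure over company names.
deal = re.search(r"(\w[\w ]*?)\. acquired (\w[\w ]*?)\. for", text)

entry = {
    "buyer": deal.group(1),
    "target": deal.group(2),
    "price": money.group(0),
    "date": date.group(0),
}
print(entry)
```

This works precisely because the text is stylized enough; free-form prose defeats such patterns quickly.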
Machine Translation Goals
• Rough Translation (E.g. p. 851)
• Restricted Domain (mergers, weather)
• Pre-edited (Caterpillar or Xerox English)
• Literary Translation -- not yet!
• Interlingua-- or canonical semantic representation like Conceptual Dependency
• Basic problem -- different languages, different categories
MT in Practice
• Transfer -- uses a database of rules for translating small units of language
• Memory-based -- memorize sentence pairs
• Good diagram p. 853
Statistical MT
• Bilingual corpus
• Find the most likely translation given the corpus.
• argmax_F P(F|E) = argmax_F P(E|F) P(F)
• P(F) is the language model
• P(E|F) is the translation model
• Lots of interesting problems: fertility (home vs. à la maison).
• Horribly drastic simplifications and hacks work pretty well!
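The argmax above can be sketched on a toy candidate set; all probabilities here are made-up illustration numbers, not real estimates:

```python
# Toy noisy-channel decoder: pick the French sentence F maximizing P(E|F) * P(F)
# for the English input E = "the blue house".
p_f = {  # language model P(F)
    "la maison bleue": 0.5,
    "la bleue maison": 0.1,
}
p_e_given_f = {  # translation model P(E|F)
    "la maison bleue": 0.4,
    "la bleue maison": 0.6,
}

best = max(p_f, key=lambda f: p_e_given_f[f] * p_f[f])
print(best)  # 0.4*0.5 = 0.20 beats 0.6*0.1 = 0.06
```

The language model rescues the decoder here: the translation model alone prefers the ungrammatical word order.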
Learning and MT
• Stat. MT needs: language model, fertility model, word choice model, offset model.
• Millions of parameters
• Counting, estimation, EM.