Introduction to Natural Language Processing
Vered Shwartz
Bar-Ilan University, Israel
June 13, 2018
TMI: Too Much Information
Let’s start with two facts:
- 90% of the data in the world today has been created in the last two years.[1]
- Our attention span is now less than that of a goldfish,[2] and we almost never read through an article.[3]
[1] IBM, March 2012 (!)
[2] The Telegraph, March 2016
[3] Slate, March 2016
Why is this a problem?
Some people may spend their entire vacation... trying to find the optimal hotel!
Natural Language Processing to the rescue
We are working on automatic methods to...
- Summarize multiple long texts
- Answer questions based on texts
- Identify the sentiment of texts (e.g. reviews)
- More...
What is Natural Language Processing (NLP)?
- Goal: for computers to “understand” and be able to communicate with people in natural languages (e.g. English)
NLP Applications are Everywhere
- Spell Check
- Grammar Correction
- Autocomplete
- Spam Detection
- Machine Translation
- Search Queries
- Question Answering
- Targeted Ads
- Personal Assistants
- Chatbots
Text Analysis Tasks
[Pipeline diagram: speech → speech to text → text → tokenization (split to words) → morphological analysis → syntactic analysis → semantic analysis]
Text Analysis Tasks: Tokenization
- Split text into a sequence of tokens (≈ words)
- Naive approach: split sentences by period, words by spaces
- How to tokenize this text?
  ‘Whose frisbee is this?’ John asked, rather self-consciously. ‘Oh, it’s one of the boys’ said the Sen.
- (Optional) answer:
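To make the weakness of the naive approach concrete, here is a minimal Python sketch that runs it on the example text and compares it with a small regex-based tokenizer. The function names and the regex are hypothetical illustrations, not the answer from the original slide, and the example uses plain ASCII quotes for simplicity.

```python
import re

text = ("'Whose frisbee is this?' John asked, rather self-consciously. "
        "'Oh, it's one of the boys' said the Sen.")

def naive_tokenize(text):
    """Naive approach: split sentences on '.', then words on whitespace."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s.split() for s in sentences]

# Illustrative pattern (an assumption, not a standard tokenizer):
#   1. words, keeping internal hyphens together (self-consciously)
#   2. clitics such as 's, split off from the word before them
#   3. any other non-space character (punctuation, quotes) as its own token
TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*|'[a-z]+|[^\sA-Za-z]")

def regex_tokenize(text):
    """Return a flat list of tokens; sentence splitting is not attempted."""
    return TOKEN_RE.findall(text)

naive = naive_tokenize(text)
tokens = regex_tokenize(text)
print(naive)   # punctuation stays glued to words, e.g. "this?'" and "asked,"
print(tokens)  # punctuation and the clitic 's become separate tokens
```

The naive version misses the sentence boundary after the question mark and leaves quotes and commas attached to words. The regex version still cannot tell whether the period after "Sen" ends the sentence or marks an abbreviation, and it would mis-handle an opening quote followed by a lowercase word, which is part of why tokenization needs more than simple rules.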
Text Analysis Tasks: Morphological Analysis
- Words are made from morphemes, smaller meaningful units
- Normally: base form + affixes
  - Nouns - plural form: dogs, suffixes, baby → babies
  - Verbs - tense: worked, working; person: works
- Many irregularities... “women and children began running away as the wolves showed their teeth”
- Morphological analysis:
  - input: “am”, output: “be” + 1 PERSON + PRESENT
- Lemmatizer: reduces inflectional forms of a word to a common base form, e.g. children → child, running → run
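The "base form + affixes, plus irregularities" idea can be sketched as a toy lemmatizer: an exception table for irregular forms, followed by a few suffix-stripping rules. Everything here (the `IRREGULAR` table, the rule order, the length thresholds) is an assumption made for illustration; real lemmatizers rely on much larger lexicons and on the word's part of speech.

```python
# Hypothetical exception table for the irregular forms in the example sentence.
IRREGULAR = {
    "women": "woman",
    "children": "child",
    "wolves": "wolf",
    "teeth": "tooth",
    "am": "be",
}

def lemmatize(word):
    """Toy lemmatizer: irregular lookup first, then naive suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:           # babies -> baby
        return w[:-3] + "y"
    if w.endswith("ing") and len(w) > 5:           # working -> work
        stem = w[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]                       # running -> runn -> run
        return stem
    if w.endswith("ed") and len(w) > 4:            # worked -> work
        return w[:-2]
    if w.endswith("es") and w[:-2].endswith(("x", "s", "sh", "ch")):
        return w[:-2]                              # suffixes -> suffix
    if w.endswith("s") and not w.endswith("ss"):   # dogs -> dog, works -> work
        return w[:-1]
    return w

for form in ["children", "running", "babies", "suffixes", "worked", "dogs"]:
    print(form, "->", lemmatize(form))
```

The rules over-apply easily (for instance, "falling" would lose an "l"), which illustrates why the irregularities on this slide make morphological analysis harder than it first looks.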
Text Analysis Tasks: Part of Speech Tagging
- Tags each word with its part of speech (POS): noun, verb, adjective, adverb, preposition, etc.
- Surrounding words help decide on the correct POS tag for ambiguous words:
  I’m reading an interesting book ⇒ book = NOUN
  I would like to book a flight ⇒ book = VERB
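As a rough illustration of how surrounding words can disambiguate a tag, the sketch below looks only at the word immediately before the ambiguous one. The cue lists and the `tag_ambiguous` function are hypothetical; actual taggers learn such patterns statistically from annotated text rather than from hand-written lists.

```python
# Tiny hand-written cue lists (assumptions for this example only).
DETERMINERS = {"a", "an", "the", "this", "that"}
ADJECTIVES = {"interesting", "good", "new"}
VERB_CUES = {"to", "will", "would", "can", "must"}

def tag_ambiguous(sentence, word):
    """Guess NOUN or VERB for `word` from the word that precedes it."""
    tokens = sentence.lower().split()
    i = tokens.index(word)
    prev = tokens[i - 1] if i > 0 else ""
    if prev in VERB_CUES:
        return "VERB"      # e.g. "to book", "would book"
    if prev in DETERMINERS or prev in ADJECTIVES:
        return "NOUN"      # e.g. "a book", "interesting book"
    return "UNKNOWN"       # one word of context is not always enough

print(tag_ambiguous("I'm reading an interesting book", "book"))
print(tag_ambiguous("I would like to book a flight", "book"))
```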
Text Analysis Tasks: Syntactic Parsing
- Analyzes the syntactic structure of a sentence
- Let’s look at some syntactic ambiguities!
- “They ate pizza with anchovies”
- (1) They ate pizza, the pizza had anchovies on it
- (2) They ate pizza using anchovies instead of utensils
- (3) The anchovies also ate pizza
- Each of the interpretations yields a different syntactic analysis
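One way to see how the same words can yield different analyses is to write two parse trees as nested tuples and check where the prepositional phrase "with anchovies" attaches. This is a simplified sketch covering readings (1) and (2) only; the tree shapes and labels (S, NP, VP, PP) follow a common textbook convention and are an assumption, not the trees from the original slides.

```python
# Reading (1): the PP modifies the noun "pizza" (pizza that has anchovies on it).
noun_attach = (
    "S", ("NP", "They"),
    ("VP", ("V", "ate"),
           ("NP", ("NP", "pizza"),
                  ("PP", ("P", "with"), ("NP", "anchovies")))))

# Reading (2): the PP modifies the verb phrase (eating by means of anchovies).
verb_attach = (
    "S", ("NP", "They"),
    ("VP", ("V", "ate"),
           ("NP", "pizza"),
           ("PP", ("P", "with"), ("NP", "anchovies"))))

def leaves(tree):
    """Return the words of a tree, left to right."""
    _, *children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

def pp_parent(tree):
    """Return the label of the node that directly contains the PP."""
    label, *children = tree
    for child in children:
        if isinstance(child, tuple):
            if child[0] == "PP":
                return label
            found = pp_parent(child)
            if found:
                return found
    return None

print(leaves(noun_attach) == leaves(verb_attach))    # same word sequence
print(pp_parent(noun_attach), pp_parent(verb_attach))  # different attachment
```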
Text Analysis Tasks: Coreference Resolution
- Identify mentions referring to the same entity
- Considered a difficult task!
- “I gave the monkeys the bananas because they were hungry” ⇒ they = the monkeys
- “I gave the monkeys the bananas because they were ripe” ⇒ they = the bananas
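The two sentences differ only in the final adjective, so resolving "they" takes knowledge about the world rather than syntax. The sketch below hard-codes that knowledge in a tiny table; the `CAN_BE` table and `resolve_they` function are hypothetical and only illustrate the kind of information a system would somehow have to acquire.

```python
# Hypothetical world knowledge: which candidate entities each property fits.
CAN_BE = {
    "hungry": {"monkeys"},
    "ripe": {"bananas"},
}

def resolve_they(candidates, predicate):
    """Pick the single candidate compatible with the predicate, if any."""
    matches = [c for c in candidates if c in CAN_BE.get(predicate, set())]
    return matches[0] if len(matches) == 1 else None

candidates = ["monkeys", "bananas"]
print(resolve_they(candidates, "hungry"))
print(resolve_they(candidates, "ripe"))
```

A predicate that is missing from the table, or one that fits both candidates, returns `None`, which hints at why the task is considered difficult.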
Text Analysis Tasks: Word Sense Disambiguation
- What’s the correct sense of a word in a given context?
from http://naviglinlp.blogspot.co.il/
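One classic baseline for this task is a simplified Lesk-style overlap: choose the sense whose dictionary gloss shares the most words with the context. The sketch below is a minimal version under stated assumptions; the word "bass", its two glosses, and the stop-word list are made up for illustration and are not taken from the slide or from a real dictionary.

```python
# Hypothetical sense inventory with hand-written glosses.
SENSES = {
    "bass (fish)": "a freshwater or sea fish caught for food",
    "bass (music)": "the lowest part in music or a low-pitched instrument such as a guitar",
}
STOP_WORDS = {"a", "the", "or", "for", "in", "of", "such", "as", "at"}

def content_words(text):
    """Lowercased words of `text` minus stop words."""
    return set(text.lower().split()) - STOP_WORDS

def simplified_lesk(context, senses):
    """Return the sense whose gloss overlaps most with the context words."""
    ctx = content_words(context)
    return max(senses, key=lambda s: len(ctx & content_words(senses[s])))

print(simplified_lesk("he caught a huge bass while fishing at sea", SENSES))
print(simplified_lesk("she plays bass guitar in a music band", SENSES))
```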
Text Analysis Tasks: Named Entities
- Named Entity Recognition: recognize entities and their type
- Entity Linking: linking entities to their Wikipedia pages
from http://www.ibm.com/blogs/research
NLP is hard!
- Tokenization and POS tagging are almost 100% accurate today, but semantic tasks are far from that
- Two major difficulties:
  - Ambiguity: one text can have multiple meanings
  - Lexical variability: the same meaning can be expressed with different words
Example Application: Spam Detection
(used to be much worse... > 90%!)
- Automatically determine whether an email is spam or not (and move spam messages to the “spam” folder)
- Special case of Text Classification: given a text, automatically determine its topic
- How does it work?
![Page 70: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/70.jpg)
34/45
Spam Detection
Let’s think of characteristics of spam emails:
- Unknown sender
- Spam triggering words:
  - Earn extra cash
  - Earn $
  - Free
  - Lose weight
  - Instant
  - Bonus
  - ...
- Naive idea: mark any email that contains these words as spam
- Problem: inaccurate (will mark non-spam as spam and vice versa)
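To make the problem with the naive idea concrete, here is a minimal sketch in Python. The trigger list comes from the slide; the function name `naive_is_spam` and the three sample emails are invented for illustration.

```python
# Trigger phrases from the slide, lowercased for matching.
TRIGGERS = ["earn extra cash", "earn $", "free", "lose weight", "instant", "bonus"]

def naive_is_spam(text: str) -> bool:
    """Mark an email as spam if it contains any trigger phrase."""
    lowered = text.lower()
    return any(trigger in lowered for trigger in TRIGGERS)

# Hypothetical emails, made up for this example.
spam = "Earn extra cash from home, instant bonus!"
ham = "Hi, feel free to call me after lunch."     # legitimate, yet contains "free"
sneaky = "You have won a prize, click here now."  # spam, yet contains no trigger

print(naive_is_spam(spam))    # True
print(naive_is_spam(ham))     # True  (non-spam marked as spam)
print(naive_is_spam(sneaky))  # False (spam that slips through)
```

The second and third calls show the two kinds of error mentioned above: a legitimate email is flagged, and a spam email is missed.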
![Page 74: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/74.jpg)
35/45
Spam Detection: Rule-based Approach
- Better idea: define rules, e.g. “mark as spam if unknown sender and contains at least 2 spam triggering words”
- More accurate: e.g. will not mark an email from your mother with the word “instant” as spam :)
- Problems:
  - Finding the optimal rules is difficult
  - Not all triggering words were created equal
- Solution: let the computer “learn” these rules on its own!
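The example rule quoted above can be sketched as follows. The address book `KNOWN_SENDERS`, the email addresses, and the helper names are assumptions made for this illustration; only the rule itself and the trigger phrases come from the slides.

```python
TRIGGERS = ["earn extra cash", "earn $", "free", "lose weight", "instant", "bonus"]
KNOWN_SENDERS = {"mom@example.com"}  # hypothetical address book

def count_triggers(text: str) -> int:
    """Number of distinct trigger phrases that occur in the text."""
    lowered = text.lower()
    return sum(1 for trigger in TRIGGERS if trigger in lowered)

def rule_is_spam(sender: str, text: str, min_triggers: int = 2) -> bool:
    """Spam if the sender is unknown AND at least `min_triggers` triggers occur."""
    return sender not in KNOWN_SENDERS and count_triggers(text) >= min_triggers

print(rule_is_spam("mom@example.com", "I bought instant coffee and a free bonus pack"))  # False
print(rule_is_spam("promo@unknown.example", "Earn extra cash, free bonus inside"))       # True
print(rule_is_spam("news@unknown.example", "Free webinar on Tuesday"))                   # False
```

Note that both the minimum of 2 and the equal treatment of every trigger are hand-picked, which is exactly the difficulty listed under "Problems".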
![Page 79: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/79.jpg)
36/45
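Spam Detection: Supervised Learning
- Let the computer learn a scoring function, where $c(w)$ is the number of times word $w$ appears in the email:
  $$\text{score} = \dots + \alpha_{\text{have}} \cdot c(\text{have}) + \alpha_{\text{sent}} \cdot c(\text{sent}) + \dots + \alpha_{\text{bernard}} \cdot c(\text{bernard})$$
- Different weight $\alpha_i$ for each word, e.g. $\alpha_{\text{cash}} > \alpha_{\text{document}}$
- Classify as spam if score > threshold (learn the threshold too!)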
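As a rough sketch, the scoring function is a weighted sum of word counts, reading c(w) as the number of times word w occurs in the email. The weights and the threshold below are invented for illustration; in the approach described here they would be learned from data rather than written by hand.

```python
from collections import Counter

# Hypothetical weights (the alphas); any word not listed gets weight 0.
weights = {"cash": 2.0, "free": 1.5, "bonus": 1.0, "document": -1.0, "sent": -0.5}
threshold = 1.0  # hypothetical; would also be learned

def score(text: str) -> float:
    """Sum over words of weight(word) * count(word)."""
    counts = Counter(text.lower().split())
    return sum(weights.get(word, 0.0) * count for word, count in counts.items())

def is_spam(text: str) -> bool:
    return score(text) > threshold

print(score("free cash bonus cash"))          # 6.5
print(is_spam("free cash bonus cash"))        # True
print(score("i have sent you the document"))  # -1.5
print(is_spam("i have sent you the document"))  # False
```

Unlike the hand-written rule, each word now contributes in proportion to its own weight, so "cash" can count for more than "bonus".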
![Page 82: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/82.jpg)
37/45
Spam Detection: Supervised Learning
- How does the computer learn the α weights?
- Supervised learning: estimate a function (learn weights) using labeled examples
- Take a lot of emails, manually mark them as spam/not spam
- The computer learns a function (weights) that best predicts spam/not spam for the known emails
- If we have enough examples, it should also work well on new emails
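The slides do not say which learning algorithm is used. As one simple possibility, the sketch below trains a perceptron: whenever a labeled email is misclassified, the weights of its words are nudged toward the correct label. The six training emails are made up, and the bias term stands in for the learned threshold.

```python
from collections import Counter

def featurize(text: str) -> Counter:
    """Bag-of-words counts for one email."""
    return Counter(text.lower().split())

def raw_score(weights: Counter, bias: float, counts: Counter) -> float:
    return bias + sum(weights[word] * count for word, count in counts.items())

def train_perceptron(examples, epochs: int = 10):
    """examples: list of (text, label) pairs, label 1 = spam, 0 = not spam."""
    weights = Counter()  # one weight per word, missing words count as 0
    bias = 0.0           # plays the role of the (negated) threshold
    for _ in range(epochs):
        for text, label in examples:
            counts = featurize(text)
            predicted = 1 if raw_score(weights, bias, counts) > 0 else 0
            error = label - predicted  # -1, 0, or +1
            if error:
                for word, count in counts.items():
                    weights[word] += error * count
                bias += error
    return weights, bias

def predict(weights: Counter, bias: float, text: str) -> int:
    return 1 if raw_score(weights, bias, featurize(text)) > 0 else 0

# Hypothetical labeled emails (1 = spam, 0 = not spam).
train = [
    ("earn extra cash now", 1),
    ("free bonus cash", 1),
    ("lose weight instant bonus", 1),
    ("i have sent you the document", 0),
    ("meeting notes attached", 0),
    ("see you at lunch", 0),
]

weights, bias = train_perceptron(train)
print(predict(weights, bias, "instant cash bonus"))  # 1 (unseen email, classified as spam)
print(predict(weights, bias, "i sent the notes"))    # 0 (unseen email, classified as not spam)
```

With only six examples this is a toy; the point is that the weights come out of the labeled data rather than being chosen by hand.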
![Page 87: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/87.jpg)
38/45
Spam Detection: Features
- We used bag-of-words as features for classification: { I, have, sent, you, ... }
- If we have enough spam examples that contain the word “urgent”, $\alpha_{\text{urgent}}$ will be high
- What about similar words like “immediate” or “instant”?
- We need to find a way to let the computer know about semantically-similar words
![Page 91: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/91.jpg)
39/45
Word Representation: One-hot Vectors
- How do we represent all the words in the computer?
- Simplest: we have a dictionary, and each word has an index, e.g. index(urgent) = 316, index(instant) = 12418
- You can think of the word with index i as a vector (array of numbers) that is all zeros except for a single 1 at index i, a “one-hot vector”:

urgent    0 0 ... 1 0 ... 0 0 ... 0    (the 1 is at index 316)

instant   0 0 ... 0 0 ... 1 0 ... 0    (the 1 is at index 12418)
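A small sketch of one-hot vectors, using a toy seven-word dictionary instead of a realistic one (so the indices differ from 316 and 12418). The vocabulary and the function names are chosen for this example only.

```python
# Toy dictionary; a real one would hold tens of thousands of words.
vocab = ["bernard", "have", "i", "instant", "sent", "urgent", "you"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> list[int]:
    """All zeros except a single 1 at the word's dictionary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(u: list[int], v: list[int]) -> int:
    return sum(a * b for a, b in zip(u, v))

print(one_hot("urgent"))   # [0, 0, 0, 0, 0, 1, 0]
print(one_hot("instant"))  # [0, 0, 0, 1, 0, 0, 0]
print(dot(one_hot("urgent"), one_hot("instant")))  # 0
```

The dot product of any two different one-hot vectors is 0, so this representation carries no notion of two words being similar.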
![Page 94: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/94.jpg)
40/45
Spam Detection: Bag-of-words with One-hot Vectors
- A vector representing the entire email: the sum of the one-hot vectors of the words in the email:

I                0 0 ... 1 0 ... 0 0 ... 0

have             0 1 ... 0 0 ... 0 0 ... 0

sent             0 0 ... 0 0 ... 1 0 ... 0

+ ...            ...

bernard          0 0 ... 0 1 ... 0 0 ... 0

=

feature vector   0 4 ... 2 1 ... 1 0 ... 0

- Problem: emails with similar words (e.g. deliver instead of send, urgent instead of instant) have very different feature vectors!
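Summing one-hot vectors is the same as counting how often each dictionary word occurs. The sketch below uses a toy five-word dictionary and two made-up emails that mean roughly the same thing but share only one word.

```python
# Toy dictionary for illustration.
vocab = ["deliver", "instant", "now", "send", "urgent"]
index = {word: i for i, word in enumerate(vocab)}

def bag_of_words(text: str) -> list[int]:
    """Sum of the one-hot vectors of the words in the text (i.e. word counts)."""
    vec = [0] * len(vocab)
    for word in text.lower().split():
        if word in index:  # words outside the dictionary are ignored
            vec[index[word]] += 1
    return vec

a = bag_of_words("send now urgent")
b = bag_of_words("deliver now instant")

print(a)  # [0, 0, 1, 1, 1]
print(b)  # [1, 1, 1, 0, 0]
print(sum(x * y for x, y in zip(a, b)))  # 1 (only "now" is shared)
print(bag_of_words("send send now"))     # [0, 0, 1, 2, 0]
```

Although the two emails are near-paraphrases, their feature vectors overlap in a single position, which is the problem described above.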
![Page 96: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/96.jpg)
41/45
Word Representation: Distributional Word Vectors
- Can we have similar vectors for semantically-similar words?
- “You shall know a word by the company it keeps” (John Rupert Firth, 1957)
- Each dimension corresponds to a context word; the two non-zero columns shown are “up” and “stairs”, in that order:

elevator   0 0 ... 0.16 0 ... 0.49 0 ... 0

lift       0 0 ... 0.15 0 ... 0.51 0 ... 0

- Now semantically-similar words have similar word vectors!
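One simple way to build such vectors is to count, for every word, which other words appear near it in a corpus. The sketch below uses raw co-occurrence counts within a window of 2 words; the slide's values (0.16, 0.49, ...) suggest some normalized weighting instead, so treat the counting scheme, the window size, and the five-sentence corpus as assumptions.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus.
corpus = [
    "take the elevator up to the roof",
    "take the lift up to the top",
    "the elevator is next to the stairs",
    "the lift is near the stairs",
    "she ate a banana for lunch",
]

def cooccurrence_vectors(sentences, window: int = 2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vecs = cooccurrence_vectors(corpus)
print(round(cosine(vecs["elevator"], vecs["lift"]), 2))    # 0.89
print(round(cosine(vecs["elevator"], vecs["banana"]), 2))  # 0.0
```

"elevator" and "lift" keep similar company, so their vectors are close, while "banana" shares no context words with either.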
![Page 100: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/100.jpg)
42/45
Spam Detection: Bag-of-words with Distributional Word Vectors
- Again, we sum up all the vectors:

I                0 0    ... 0.12 0.03 ... ... ... ... 0.04 0 ... 0

have             0 0.22 ... 0    0    ... ... ... ... 0    0 ... 0

sent             0 0.43 ... 0    0.1  ... ... ... ... 0.25 0 ... 0

+ ...            ...

bernard          0 0    ... 0    0.67 ... ... ... ... 0    0 ... 0

=

feature vector   0 0.65 ... 0.12 0.71 ... ... ... ... 0.29 0 ... 0

- We can now replace a word (e.g. sent) with a similar word (e.g. delivered) and get a similar feature vector ⇒ same classification for similar emails!
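The effect of swapping in a similar word can be checked with a small sketch. The 3-dimensional word vectors below are invented purely for illustration (real distributional vectors have far more dimensions); "sent" and "delivered" are deliberately given similar values.

```python
import math

# Hypothetical word vectors, made up for this example.
vectors = {
    "i":         [0.1, 0.0, 0.0],
    "have":      [0.0, 0.2, 0.0],
    "sent":      [0.0, 0.4, 0.3],
    "delivered": [0.0, 0.5, 0.2],
    "banana":    [0.9, 0.0, 0.0],
}
DIM = 3

def email_vector(text: str) -> list[float]:
    """Sum of the word vectors of the words in the email; unknown words add nothing."""
    total = [0.0] * DIM
    for word in text.lower().split():
        for k, value in enumerate(vectors.get(word, [0.0] * DIM)):
            total[k] += value
    return total

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

original = email_vector("i have sent")
paraphrase = email_vector("i have delivered")
unrelated = email_vector("i have banana")

print(cosine(original, paraphrase) > 0.95)  # True: similar word, similar feature vector
print(cosine(original, unrelated) < 0.5)    # True: dissimilar word, different vector
```

A classifier that works on these feature vectors would therefore tend to treat the original email and its paraphrase alike.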
![Page 102: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/102.jpg)
43/45
Word Embeddings
- A more recent type of distributional vector
- Find most similar words:
See more here: http://bionlp-www.utu.fi/wv_demo/
![Page 103: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/103.jpg)
44/45
![Page 104: Introduction to Natural Language ProcessingWhat is Natural Language Processing (NLP)? I Goal: for computers to \understand" and be able to communicate with people in natural languages](https://reader034.fdocuments.net/reader034/viewer/2022042711/5f733d57d137a6735f421620/html5/thumbnails/104.jpg)
45/45
Additional Resources
- Books:
  - Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999.
  - Dan Jurafsky and James H. Martin, Speech and Language Processing, Second Edition. Pearson Education, 2014.
- Resources from NACLO (North American Computational Linguistics Olympiad): http://nacloweb.org/resources.php
- My blog: http://veredshwartz.blogspot.co.il