TTIC 31190: Natural Language Processing (ttic.uchicago.edu/.../lectures/lect1-intro-words.pdf)
![Page 1: TTIC$31190:$ Natural$Language$Processing$ttic.uchicago.edu/.../lectures/lect1-intro-words.pdf · 2018. 3. 26. · TTIC$31190:$ Natural$Language$Processing$ Kevin$Gimpel$ Spring2018$](https://reader034.fdocuments.net/reader034/viewer/2022051903/5ff45526aae07d7a266e3ba8/html5/thumbnails/1.jpg)
TTIC 31190: Natural Language Processing
Kevin Gimpel Spring 2018
Lecture 1: Introduction; Words
Course Overview
• Second time being offered (first was Winter 2016)
• Designed for first-year TTIC PhD students
• My office hours: 3-4pm Mondays (TTIC 531), or by appointment
• TA: Lifu Tu, TTIC PhD student
• TA office hours: 3-4pm Wednesdays (TTIC 501)
• course had much more interest this year than expected
• if you are not yet registered, it is unlikely you will be able to get a spot
• I have been in touch with you if you’re within the first few spots on the waitlist
Prerequisites
• No course prerequisites, but I will assume:
– some programming experience (no specific language required)
– familiarity with basics of calculus, linear algebra, and probability
– will be helpful to have taken a machine learning course, but not strictly required
Grading
• 3 assignments (15% each)
• midterm exam (15%) (Wed., May 16)
• course project (30%):
– project proposal (5%)
– final report (25%)
• class participation, including quizzes (10%)
• no final
Assignments
• mixture of formal exercises, implementation, experimentation, analysis
• first assignment has been posted so that you can have a look at it, due 2 weeks from Wednesday
Project
• Replicate [part of] a published NLP paper, or define your own project
• The project must be done in a group of two
• Each group member will receive same grade
• More details to come
Collaboration Policy
• You are welcome to discuss assignments with others in the course, but solutions and code must be written individually
Lateness Policy
• If you turn in an assignment late, a penalty will be assessed (2% per hour late)
• You will have 4 late days to use as you wish during the quarter
• Late days must be used in whole increments
– e.g., if you turn in an assignment 6 hours late and want to use a late day to avoid penalty, it will cost an entire late day to do so
Optional Textbooks (1/2)
• Jurafsky & Martin. Speech and Language Processing, 2nd Ed. & 3rd Ed.
• Many chapters of 3rd edition are online
• Copies of 2nd edition available in TTIC library
Optional Textbooks (2/2)
• Goldberg. Neural Network Methods for Natural Language Processing.
• Earlier draft (from 2015) available online
• Two copies on reserve in TTIC library
What is natural language processing?
an experimental computer science research area that includes problems and solutions pertaining to the understanding of human language
Text Classification
• spam / not spam
• priority level
• category (primary / social / promotions / updates)
Sentiment Analysis
Machine Translation
Question Answering
Dialog Systems
figure credit: Phani Marupaka
Summarization
The Apple Watch has drawbacks. There are other smartwatches that offer more capabilities.
Part-of-Speech Tagging
Some [determiner] questioned [verb (past)] if [prep.] Tim [proper] Cook [proper] ’s [poss.] first [adj.] product [noun]
would [modal] be [verb] a [det.] breakaway [adjective] hit [noun] for [prep.] Apple [proper] . [punc.]
Some [determiner] questioned [verb (past)] if [prep.] Tim [noun] Cook [noun] ’s [poss.] first [adj.] product [noun]
would [modal] be [verb] a [det.] breakaway [adjective] hit [noun] for [prep.] Apple [noun] . [punc.]
Syntactic Parsing
Cook ’s first product may not be a breakaway hit
[figure: parse tree over this sentence, built up across three slides, with constituent labels NP, NP, VP, and S]
Named Entity Recognition
Some questioned if Tim Cook’s first product would be a breakaway hit for Apple.
Tim Cook: PERSON, Apple: ORGANIZATION
Entity Linking
Some questioned if Tim Cook’s first product would be a breakaway hit for Apple.
Coreference Resolution
Some questioned if Tim Cook’s first product would be a breakaway hit for Apple. It’s the company’s first new device since he became CEO.
“Winograd Schema” Coreference Resolution
The man couldn't lift his son because he was so weak. (he = man)
The man couldn't lift his son because he was so heavy. (he = son)
Reading Comprehension
Once there was a boy named Fritz who loved to draw. He drew everything. In the morning, he drew a picture of his cereal with milk. His papa said, “Don’t draw your cereal. Eat it!” After school, Fritz drew a picture of his bicycle. His uncle said, “Don't draw your bicycle. Ride it!” …
What did Fritz draw first?
A) the toothpaste
B) his mama
C) cereal and milk
D) his bicycle
Sentence Similarity
Input (sentence pair) → Output (similarity score)
• “Other ways are needed.” / “We must find other ways.” → 4.4
• “I absolutely do believe there was an iceberg in those waters.” / “I don't believe there was any iceberg at all anywhere near the Titanic.” → 1.2
• “Pakistan bomb victims’ families end protest” / “Pakistan bomb victims to be buried after protest ends” → 2.6
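A quick way to see why this task is hard is to try a word-overlap baseline on two of the pairs. The sketch below is an illustration only, not a method from the lecture: `jaccard` is a hypothetical helper that scores a pair by the fraction of lowercase words the two sentences share.

```python
import re

def jaccard(a, b):
    """Word-overlap similarity in [0, 1]: shared words / all distinct words."""
    wa = set(re.findall(r"[a-z']+", a.lower()))
    wb = set(re.findall(r"[a-z']+", b.lower()))
    return len(wa & wb) / len(wa | wb)   # assumes at least one word overall

pairs = [
    ("Other ways are needed.", "We must find other ways."),
    ("I absolutely do believe there was an iceberg in those waters.",
     "I don't believe there was any iceberg at all anywhere near the Titanic."),
]
for a, b in pairs:
    print(round(jaccard(a, b), 2))
```

On these two pairs the overlap scores come out close together (about 0.29 and 0.26), while the scores on the slide are far apart (4.4 and 1.2), so surface overlap alone is a weak signal of similarity.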
Word Prediction
he bent down and searched the large container, trying to find anything else hidden in it other than the _____
he turned to one of the cops beside him. “search the entire coffin.” the man nodded and bustled forward towards the coffin. he bent down and searched the large container, trying to find anything else hidden in it other than the _____
Other language technologies (not typically considered core NLP):
• speech processing (see TTIC 31110)
• information retrieval / web search
• knowledge representation / reasoning
Roadmap
• words, morphology, lexical semantics
• text classification
• simple neural methods for NLP
• language modeling and word embeddings
• recurrent/recursive/convolutional networks in NLP
• sequence labeling, HMMs, dynamic programming
• syntax and syntactic parsing
• semantics, compositionality, semantic parsing
• machine translation and other NLP tasks
Computational Linguistics vs. Natural Language Processing
• how do they differ?
• English is a “head-final” language: the head of a noun phrase comes at the end
• computational linguistics is about linguistics
– computational is a modifier
• natural language processing is about processing
– natural language is a modifier
• many people think of the two terms as synonyms
• computational linguistics is more inclusive; more likely to include sociolinguistics, cognitive linguistics, and computational social science
• NLP is more likely to use machine learning and involve engineering / system-building
Is NLP Science or Engineering?
• goal of NLP is to develop technology, which takes the form of engineering
• though we try to solve today’s problems, we seek principles that will be useful for the future
• if science, it’s not linguistics or cognitive science; it’s the science of computational processing of language
• I like to think of NLP as the science of engineering solutions to problems involving natural language
Why is NLP hard?
• ambiguity and variability of linguistic expression:
– variability: many forms can mean the same thing
– ambiguity: one form can mean many things
• many different kinds of variability and ambiguity
• each NLP task must address distinct kinds
Example: Hyperlinks in Wikipedia
[figure: hyperlink texts connected to the Wikipedia articles they link to, with the two directions labeled Ambiguity and Variability]
• Wikipedia articles: bar (law), bar (establishment), bar association, bar (unit), medal bar, bar (music), …
• link text (ambiguity): bar
• link texts (variability): bar, bars, saloon, saloons, lounge, pub, sports bar, …
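One way to make the two notions concrete is to index hyperlinks in both directions. This is a minimal sketch on made-up data: the `links` pairs are hypothetical, and which article each variant links to is assumed here purely for illustration.

```python
from collections import defaultdict

# Hypothetical (link text, article) pairs in the spirit of the slide.
# The targets of the variant link texts are an assumption, not from the source.
links = [
    ("bar", "bar (law)"), ("bar", "bar (establishment)"),
    ("bar", "bar (unit)"), ("bar", "bar (music)"),
    ("bars", "bar (establishment)"), ("saloon", "bar (establishment)"),
    ("pub", "bar (establishment)"), ("sports bar", "bar (establishment)"),
]

targets_of = defaultdict(set)   # link text -> articles it points to (ambiguity)
forms_of = defaultdict(set)     # article -> link texts pointing to it (variability)
for text, article in links:
    targets_of[text].add(article)
    forms_of[article].add(text)

print(len(targets_of["bar"]))                    # 4: one form, many meanings
print(sorted(forms_of["bar (establishment)"]))   # many forms, one meaning
```

Ambiguity shows up as a link text with several targets, and variability as an article reached from several link texts.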
Word Sense Ambiguity
credit: A. Zwicky
Attachment Ambiguity
Meaning Ambiguity
Words
• what is a word?
• tokenization
• morphology
• lexical semantics
What is a word?
Tokenization
• tokenization: convert a character stream into words by adding spaces
• for certain languages, highly nontrivial
• e.g., Chinese word segmentation is a widely-studied NLP task
• for other languages (English), tokenization is easier but is still not always obvious
• the data for your homework has been tokenized:
– punctuation has been split off from words
– contractions have been split
Intricacies of Tokenization
• separating punctuation characters from words?
– , ” ? ! → always separate
– . → when shouldn’t we separate it?
• Dr., Mr., Prof., U.S., etc.
• English contractions:
– isn’t, aren’t, wasn’t,… → is n’t, are n’t, was n’t,…
– but how about these: can’t, won’t → ca n’t, wo n’t
– ca and wo are then different forms from can and will
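The rules above can be sketched as a small rule-based tokenizer. This is a minimal illustration under stated assumptions, not the tokenizer used for the homework data: the `ABBREVIATIONS` set and the helper names are hypothetical, and a period is kept attached only for abbreviations in that set.

```python
import re

# Hypothetical abbreviation list; a practical tokenizer would need far more entries.
ABBREVIATIONS = {"Dr.", "Mr.", "Prof.", "U.S."}

def split_piece(piece):
    """Split one whitespace-delimited chunk into tokens."""
    trailing = []
    # Separate a final period unless the chunk is a known abbreviation.
    if len(piece) > 1 and piece.endswith(".") and piece not in ABBREVIATIONS:
        piece, trailing = piece[:-1], ["."]
    # Split negative contractions: isn't -> is n't, can't -> ca n't.
    if re.fullmatch(r"\w+n['’]t", piece):
        return [piece[:-3], piece[-3:]] + trailing
    return [piece] + trailing

def tokenize(text):
    # Always separate commas, double quotes, question marks, exclamation points.
    text = re.sub(r'([,"“”?!])', r" \1 ", text)
    tokens = []
    for piece in text.split():
        tokens.extend(split_piece(piece))
    return tokens

print(tokenize("Dr. Smith isn't here, and I can't wait."))
# ['Dr.', 'Smith', 'is', "n't", 'here', ',', 'and', 'I', 'ca', "n't", 'wait', '.']
```

Note that the sketch leaves open the hard case where an abbreviation such as U.S. also ends a sentence, and that it produces the forms ca and wo discussed above.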
• Chinese and Japanese: no spaces between words:
– 莎拉波娃现在居住在美国东南部的佛罗里达。
– 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
– Sharapova now lives in US southeastern Florida
• Further complicated in Japanese, with multiple alphabets intermingled
– Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
[figure labels on the example: Katakana, Hiragana, Kanji, Romaji]
credit: J&M/SLP3
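A commonly taught baseline for segmenting text without spaces is greedy maximum matching against a word list. The sketch below is an illustration only and is not claimed to be the method covered in this lecture; `max_match` and the toy `vocab` are hypothetical, with the vocabulary chosen to cover the example sentence.

```python
def max_match(text, vocab):
    """Greedy left-to-right segmentation: take the longest vocabulary word at each position."""
    max_len = max(len(w) for w in vocab)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                break
        else:
            j = i + 1   # no vocabulary word matches: emit a single character
        words.append(text[i:j])
        i = j
    return words

# Hypothetical toy vocabulary covering the example sentence.
vocab = {"莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"}
print(" ".join(max_match("莎拉波娃现在居住在美国东南部的佛罗里达。", vocab)))
```

With this vocabulary the output matches the segmented sentence above, plus the final punctuation mark as its own token. Greedy matching can go wrong when a longer vocabulary word overlaps the correct boundary, which is one reason segmentation is studied as a task of its own.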
Removing Spaces?
• tokenization is usually about adding spaces
• but might we also want to remove spaces?
• what are some English examples?
– names?
• New York → NewYork
– non-compositional compounds?
• hot dog → hotdog
– other artifacts of our spacing conventions?
• New York-Long Island Railway → ?
Types and Tokens • once text has been tokenized, let’s count the words • types: entries in the vocabulary • tokens: instances of types in a corpus • example sentence: If they want to go , they should go . – how many types? – how many tokens?
• type/token raBo: useful staBsBc of a corpus (here, 0.8) • as we add data, what happens to the type-‐token raBo? • indicates what? – high type/token raBo à rich morphology – low type/token raBo à poor morphology
67
Types and Tokens
• once text has been tokenized, let’s count the words
• types: entries in the vocabulary
• tokens: instances of types in a corpus
• example sentence: If they want to go , they should go .
– how many types? 8
– how many tokens? 10
• type/token ratio: useful statistic of a corpus (here, 0.8)
• as we add data, what happens to the type-token ratio?
• indicates what?
– high type/token ratio → rich morphology
– low type/token ratio → poor morphology
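The counts for the example sentence can be checked with a few lines of Python. This is a minimal sketch that assumes the text is already tokenized and whitespace-separated, as on the slide; `type_token_ratio` is a hypothetical helper name, and it assumes a non-empty token list.

```python
def type_token_ratio(tokens):
    """Return (#types, #tokens, types/tokens) for a non-empty token list."""
    types = set(tokens)  # the vocabulary: each distinct token counted once
    return len(types), len(tokens), len(types) / len(tokens)

sentence = "If they want to go , they should go ."
n_types, n_tokens, ratio = type_token_ratio(sentence.split())
print(n_types, n_tokens, ratio)  # 8 10 0.8
```

Note that punctuation marks count as tokens here, and that "they" and "go" each occur twice, which is why there are 8 types for 10 tokens.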
• How will the type/token ratio change when adding more data?
More data → Lower type/token ratio
[Figure: type/token ratio (y-axis, 0 to 0.4) vs. # tokens (x-axis, 10K to 100M) for English Wikipedia]
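The downward trend can be reproduced on any corpus by computing the type/token ratio over growing prefixes. The sketch below uses a tiny made-up token list rather than the Wikipedia data behind the plot; `ttr_curve` and the checkpoint sizes are hypothetical choices for illustration.

```python
def ttr_curve(tokens, checkpoints):
    """Type/token ratio of the first n tokens, for each n in checkpoints."""
    # Ignore checkpoints that fall outside the corpus.
    targets = sorted({n for n in checkpoints if 0 < n <= len(tokens)})
    seen = set()
    curve = []
    t = 0
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if t < len(targets) and i == targets[t]:
            curve.append((i, len(seen) / i))
            t += 1
    return curve

# Toy corpus (an assumption for the demo, not real data).
toy = "the cat sat on the mat and the dog sat on the cat".split()
curve = ttr_curve(toy, [4, 8, 13])
for n, ratio in curve:
    print(n, round(ratio, 3))
```

Even on this toy input the ratio falls as more tokens are read, because frequent words keep repeating while new types arrive more slowly.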
• What has a higher type/token ratio, Simple English Wikipedia or English Wikipedia?
[Figure: type/token ratio (y-axis, 0 to 0.4) vs. # tokens (x-axis, 10K to 100M) for English Wikipedia and Simple English Wikipedia]
• What has a higher type/token ratio, Simple English Wikipedia or English Wikipedia?
– English Wikipedia
– type/token ratio is one measure of complexity
• How about Wikipedia vs Newswire?
[Figure: type/token ratio (y-axis, 0 to 0.4) vs. # tokens (x-axis, 10K to 100M) for English Wikipedia, Simple English Wikipedia, and Newswire]
• Wikipedia vs Simple English Wikipedia?
– Wikipedia
• Wikipedia vs Newswire?
– Wikipedia
• Wikipedia vs Tweets?
[Figure: type/token ratio (y-axis, 0 to 0.4) vs. # tokens (x-axis, 10K to 100M) for English Wikipedia, Simple English Wikipedia, and Tweets]
• Wikipedia vs Simple English Wikipedia?
– Wikipedia
• Wikipedia vs Newswire?
– Wikipedia
• Wikipedia vs Tweets?
– Tweets (once you have 1 million or more tokens)
“really” on Twitter (each entry: count, then the spelling variant as it appeared)
224571 really 1189 rly 1119 realy 731 rlly 590 reallly 234 realllly 216 reallyy 156 relly 146 reallllly 132 rily 104 reallyyy 89 reeeally 89 realllllly 84 reaaally 82 reaally 72 reeeeally 65 reaaaally 57 reallyyyy 53 rilly
50 reallllllly 48 reeeeeally 41 reeally 38 really2 37 reaaaaally 35 reallyyyyy 31 reely 30 realllyyy 27 realllyy 27 reaaly 26 realllyyyy 25 realllllllly 22 reaaallly 21 really- 19 reeaally 18 reallllyyy 16 reaaaallly 15 realyy 15 reallyreally
15 reallllyy 15 reallllllllly 15 reaallly 14 reeeeeeally 14 reallllyyyy 13 reeeaaally 12 rreally 12 reaaaaaally 11 reeeeallly 11 reeeallly 11 realllllyyy 11 reaallyy 10 reallyreallyreally 10 reaaaly 9 reeeeeeeally 9 reallys 9 really-really 9 r)eally 8 reeeaally
8 reallyyyyyyy 8 reallyyyyyy 8 realky 7 relaly 7 reeeeeeeeeally 7 reeeealy 7 reeeeaaally 7 reallllllyyy 7 realllllllllllly 7 reaaaaaaally 7 raelly 7 r3ally 6 r-really 6 reeeaaalllyyy 6 reeeaaallly 6 reeeaaaally 6 realyl 6 r-e-a-l-l-y 6 realllyyyyy
6 realllllllllly 6 reaaaaaallly 5 rrrreally 5 rrly 5 rellly 5 reeeeeeeeally 5 reeeeaally 5 reeeeaaallly 5 reeallyyy 5 reallllllllllly 5 reallllllllllllly 5 reaalllyy 5 reaaaalllly 5 reaaaaallly 4 rllly 4 reeeeeeeeeeally 4 reeealy 4 reeaaaally 4 realllllyyyy
4 realllllllyyyy 4 reaalllyyy 4 reaalllly 4 reaaalllyy 4 reaaalllly 4 reaaaaly 3 reeeeealllly 3 reeeealllly 3 reeeeaaaaally 3 reeeaallly 3 reeeaaallllyyy 3 reealy 3 reeallly 3 reeaaly 3 reeaalllyyy 3 reeaalllly 3 reeaaallly 3 reallyyyyyyyyy 3 reallyl
3 really) 3 r]eally 3 realluy 3 reallllyyyyy 3 reallllllyyyyyyy 3 reallllllyyyy 3 reallllllyy 3 realllllllllllllllly 3 realiy 3 reaallyyyy 3 reaallllly 3 reaaallyy 3 reaaaallyy 3 reaaaallllly 3 reaaaaaly 3 reaaaaaaaally 3 r34lly 2 rrreally 2 rreeaallyy
2 rlyyyy 2 rlyyy 2 reqally 2 rellyy 2 rellys 2 reeely 2 reeeeeealy 2 reeeeeallly 2 reeeeeaally 2 reeeeeaaally 2 reeeeeaaallllly 2 reeeeallyyy 2 reeeeallllyyy 2 reeeeaaallllyyyy 2 reeeeaaalllly 2 reeeeaaaally 2 reeeeaaaalllyyy 2 reeeallyy 2 reeallyy
2 reeaallyy 2 reeaalllyy 2 reeaallly 2 reeaaally 2 reaqlly 2 realyyy 2 reallyyyyyyyyyyyy 2 reallyyyyyyyy 2 really* 2 really/ 2 realllyyyyyy 2 reallllyyyyyy 2 realllllyyyyyy 2 realllllyy 2 reallllllyyyyy 2 realllllllyyyyy 2 realllllllyy 2 reallllllllllllllly 2 reallllllllllllllllly
1 rrrrrrrrrrrrrrrreeeeeeeeeeeaaaaaaalllllllyyyyyy 1 rrrrrrrrrreally 1 rrrrrrreeeeeeaaaalllllyyyyyyy 1 rrrrrrealy 1 rrrrrreally … 1 re-he-he-heeeeally 1 re-he-he-he-ealy 1 reheheally 1 reelllyy 1 reellly 1 ree-hee-heally … 1 reeeeeeeeeaally 1 reeeeeeeeeaaally 1 reeeeeeeeeaaaaaalllyyy 1 reeeeeeeeeaaaaaaallllllllyyyyyyyy 1 reeeeeeeeeaaaaaaallllllllyyyyyyyy 1 reeeeeeeeeaaaaaaaaalllllllllyyyyyyyy 1 reeeeeeeeaaaaaaaalllllyyyyyy
1 reallyreallyreallyreallyreallyreallyreallyreallyreallyreally reallyreallyreallyreallyreallyreallyreally 1 reallyreallyreallyreallyreallyr33lly 1 really/really/really 1 really(really … 1 reallllllllyyyy 1 realllllllllyyyyyy 1 realllllllllyyyyy 1 realllllllllyyyy 1 realllllllllyyy 1 reallllllllllyyyyy 1 reallllllllllllyyyyyy 1 reallllllllllllllllllly 1 reallllllllllllllllllllly 1 reallllllllllllllllllllllyyyyy 1 reallllllllllllllllllllllllllly 1 realllllllllllllllllllllllllllly 1 reallllllllllllllllllllllllllllllllly 1 reallllllllllllllllllllllllllllllllllllllllllllly 1 reallllllllllllllllllllllllllllllllllllllllllllllllllllllly 1 reallllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllly
How many words are there?
• how many English words exist?
• when we increase the size of our corpus, what happens to the number of types?
– a bit surprising: vocabulary continues to grow in any actual dataset
– you’ll just never see all the words
• Zipf’s law: frequency of a word is inversely proportional to its rank
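If frequency is inversely proportional to rank, then rank × frequency should stay roughly constant down the frequency list. The sketch below builds a rank/frequency table to check this; `rank_frequency` is a hypothetical helper, and the toy counts are constructed to follow the law exactly, which real corpora only approximate.

```python
from collections import Counter

def rank_frequency(tokens):
    """List (rank, word, count, rank * count), most frequent word first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, count, rank * count))
    return rows

# Idealized toy data (an assumption): counts 6, 3, 2 for ranks 1, 2, 3.
toy = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(toy):
    print(row)
# (1, 'a', 6, 6)
# (2, 'b', 3, 6)
# (3, 'c', 2, 6)
```

On a real corpus the last column would drift rather than stay fixed, so the law is best read as a rough description of the long tail of rare words.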
How many words are there?
• how many English words exist?
• when we increase the size of our corpus, what happens to the number of types?
– a bit surprising: vocabulary continues to grow in any actual dataset
– you’ll just never see all the words
– in 1 million tweets, 15M tokens, 600k types
– in 56 million tweets, 847M tokens, 11M types
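The two tweet corpora above can be compared with a little arithmetic. This sketch only uses the rounded figures from the slide (15M, 600k, 847M, 11M), so the results are approximate; the variable names are illustrative.

```python
# Rounded figures as given on the slide.
corpora = {
    "1M tweets": {"tokens": 15_000_000, "types": 600_000},
    "56M tweets": {"tokens": 847_000_000, "types": 11_000_000},
}

# Type/token ratio of each corpus.
ratios = {name: c["types"] / c["tokens"] for name, c in corpora.items()}

# How much each quantity grew from the small corpus to the large one.
token_growth = corpora["56M tweets"]["tokens"] / corpora["1M tweets"]["tokens"]
type_growth = corpora["56M tweets"]["types"] / corpora["1M tweets"]["types"]

for name, r in ratios.items():
    print(f"{name}: type/token ratio = {r:.3f}")
print(f"tokens grew {token_growth:.1f}x, types grew {type_growth:.1f}x")
# 1M tweets: type/token ratio = 0.040
# 56M tweets: type/token ratio = 0.013
# tokens grew 56.5x, types grew 18.3x
```

Both observations from the slides show up in these numbers: the vocabulary keeps growing with more data, but more slowly than the token count, so the type/token ratio drops.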