A picture is worth 13.6 words
Source: users.umiacs.umd.edu/~hal/tmp/talk.pdf
1/101 A picture is worth 13.6 words, Hal Daumé III, [email protected]
A picture is worth 13.6 words (on average)
Alex Berg
Amit Goyal
Tamara Berg
Jesse Dodge
Yejin Choi
Yiannis Aloimonos
Kota Yamaguchi
Alyssa Mensch
Karl Stratos
Meg Mitchell
Xufeng Han
Ching Lik Teo
Yezhou Yang
An on-paper experiment
Write a caption for this image, one sentence in length.
(In English.)
People write weird captions
Another dream car to add to the list, this one spotted in Hanbury St.
Shot out my car window while stuck in traffic because people in Cincinatti can't drive in the rain
1. A distorted photo of a man cutting up a large cut of meat in a garage.
2. A man smiling at the camera while carving up meat.
3. A man smiling while he cuts up a piece of meat.
4. A smiling man is standing next to a table dressing a piece of venison.
5. The man is smiling into the camera as he cuts meat.
Two complementary questions...
Image ⇒ Text? Text ⇒ Image?
“two women sitting brunette blonde on bench reading magazine”
“looking for castles in the clouds out my car window”
Understanding and Predicting Importance in Images (BBDDGHMMSSY, CVPR 2012)
Detecting Visual Text (DGHMMSYCDBB, NAACL 2012)
Corpus-Guided Sentence Generation of Natural Images (YTDA, EMNLP 2011)
Midge: Generating Image Descriptions from Computer Vision Detections (MDGYSHMBBD, EACL 2012)
Why do this?
Caption Generation
the sheep meandered along a desolate road in the highlands of Scotland through frozen grass
Visual Scene Construction
the small white cat is -17 inches above the hat. the tiny white illuminator is in front of the cat. it is night. the ground is red.
the 200 foot tall dragon is facing the 100 foot tall car. The ground is a checkerboard. the sky is pink
Coyne & Sproat, SIGGRAPH 2001. WordsEye: An Automatic Text-to-Scene Conversion System
Training Object Detectors from Text
“elephant in the beach”
“a person riding a horse” ≠ Person + Horse
Farhadi + Sadeghi, CVPR 2011. Recognition Using Visual Phrases
What is “visual text”?
● Photographer/viewer distinctions
● Amount of inference
● Temporal events
Kevin’s mom, so punxrawkin Kev’s black flag hat
Another dream car to add to the list, this one spotted in Hanbury St.
Tuckered out from playing in Nannie’s yard.
A phrase is visual if there is a piece of the image you can cut out, place in another image, and still use the same description.
Okay, so can we detect it?
● SBU Flickr data
● 3 NPs per caption
● 800 images: ≥3 annotations
● 48k images: 1 annotation
● People largely agree (74% whatever that means...)
● 3 NPs per caption, 70% visual
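Okay, so can a computer detect it?
Features (Inside, Before and After): Word+stems, Bigrams, Spelling, Hypernyms
Example: Another dream car to add to the list...
Inside (“Another dream car”):
  Word+stems: another anoth, dream dream, car car
  Bigrams: another_dream, dream_car
  Spelling: Aa+ a+ a+
  Hypernyms: Vehicle … artifact … entity
After (“to add”):
  Word+stems: to to, add add
  Bigrams: to_add
  Spelling: a+ a+
≈67% AUC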
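A minimal sketch of how the word, bigram and spelling features on this slide could be computed for one span of tokens. The function names `spelling_shape` and `phrase_features`, the feature-string format, and the `prefix` argument (standing in for Inside/Before/After) are assumptions for illustration, not the authors' code. Stems and hypernyms are left out because they need a stemmer and a lexical resource such as WordNet.

```python
def spelling_shape(token):
    """Collapse a token into a coarse spelling pattern, e.g. "Another" -> "Aa+"."""
    classes = []
    for ch in token:
        if ch.isupper():
            classes.append("A")
        elif ch.islower():
            classes.append("a")
        elif ch.isdigit():
            classes.append("0")
        else:
            classes.append(ch)  # keep punctuation as-is
    # Collapse runs of the same character class into "x+".
    shape = []
    i = 0
    while i < len(classes):
        j = i
        while j < len(classes) and classes[j] == classes[i]:
            j += 1
        shape.append(classes[i] + ("+" if j - i > 1 else ""))
        i = j
    return "".join(shape)


def phrase_features(tokens, prefix):
    """Word, bigram and spelling features for one span of tokens.

    `prefix` is a hypothetical tag for where the span sits relative to the
    noun phrase: "inside", "before" or "after".
    """
    lowered = [t.lower() for t in tokens]
    feats = [f"{prefix}:word={w}" for w in lowered]
    feats += [f"{prefix}:bigram={a}_{b}" for a, b in zip(lowered, lowered[1:])]
    feats += [f"{prefix}:shape={spelling_shape(t)}" for t in tokens]
    return feats


# The slide's example: the noun phrase and the two tokens that follow it.
inside = phrase_features(["Another", "dream", "car"], "inside")
after = phrase_features(["to", "add"], "after")
```

The resulting feature strings could be fed to any linear classifier; the slide does not say which one was used.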
Bootstrapping visual terminology
● Start with some seeds
● Apply bootstrapping or label propagation
Noun seeds, visual (V): car house tree horse animal man table bottle woman computer
Noun seeds, non-visual (NV): idea bravery deceit trust dedication anger humour luck inflation honesty
Adjective seeds, visual (V): brown green wooden striped orange rectangular furry shiny rusty feathered
Adjective seeds, non-visual (NV): public original whole righteous political personal intrinsic individual
Color: purple blue maroon beige green
Material: plastic cotton wooden metallic silver
Shape: circular square round rectangular triangular
Size: small big tiny tall huge
Surface: coarse smooth furry fluffy rough
Direction: sideways north upward left down
Pattern: striped dotted checked plaid quilted
Quality: shiny rusty dirty burned glittery
Beauty: beautiful cute pretty gorgeous lovely
Age: young mature immature older senior
Ethnicity: french asian american greek hispanic
grayish, chestnut, emerald, rufous
oblong, hemispherical, quadrangular, convex
#A81C07
≈67% AUC
≈71% AUC
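A minimal sketch of label propagation from seed words, to make the "start with some seeds, then propagate" idea concrete. The word graph, its edge weights, the +1/-1 scoring convention and the function name `propagate_labels` are all assumptions for illustration; the talk does not specify how the graph was built or which propagation variant was used.

```python
def propagate_labels(graph, seeds, iterations=20):
    """Spread seed scores over a weighted word graph.

    graph: dict mapping word -> {neighbor: weight}; every neighbor must
           also be a key of `graph`.
    seeds: dict mapping word -> score (+1.0 visual, -1.0 non-visual here).
    Seed scores stay clamped; every other word repeatedly takes the
    weighted average of its neighbors' scores.
    """
    scores = {word: seeds.get(word, 0.0) for word in graph}
    for _ in range(iterations):
        updated = {}
        for word, neighbors in graph.items():
            if word in seeds:
                updated[word] = seeds[word]
                continue
            total = sum(neighbors.values())
            if total == 0:
                updated[word] = 0.0
            else:
                updated[word] = sum(
                    weight * scores[n] for n, weight in neighbors.items()
                ) / total
        scores = updated
    return scores


# Hypothetical toy graph: edges stand in for some word-similarity measure.
graph = {
    "brown": {"chestnut": 1.0},
    "chestnut": {"brown": 1.0, "rufous": 1.0},
    "rufous": {"chestnut": 1.0},
    "public": {"political": 1.0},
    "political": {"public": 1.0},
}
seeds = {"brown": 1.0, "public": -1.0}
scores = propagate_labels(graph, seeds)
```

In this toy graph, "chestnut" and "rufous" drift toward the visual seed and "political" toward the non-visual one; thresholding the scores would give an expanded word list.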
But this doesn't use the images!!!
[Bar chart, axis from 50 to 95: Random, Model, Model+Lists, Human]
What I used to think vision did...
Adding in image features
Ecuador, amazon basin, near coca, rain forest, passion fruit flower
● Does a detector corresponding to this head noun exist?
● Did it fire?
● How many times did it fire?
● How confident was the “best” firing?
● What %age of pixels in the image are in that bounding box?
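The five questions above map naturally onto five numeric features per noun phrase. The sketch below is one way to compute them, assuming a hypothetical `Detection` record and assuming that "that bounding box" means the box of the highest-confidence firing; neither the data layout nor the names come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    # Hypothetical detector output: label, confidence score, pixel box.
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max)


def vision_features(head_noun, detections, image_size, known_detectors):
    """Answer the five slide questions for one head noun."""
    width, height = image_size
    hits = [d for d in detections if d.label == head_noun]
    best = max(hits, key=lambda d: d.confidence, default=None)
    if best is None:
        area_fraction = 0.0
    else:
        x0, y0, x1, y1 = best.box
        area_fraction = max(0, x1 - x0) * max(0, y1 - y0) / (width * height)
    return {
        "detector_exists": head_noun in known_detectors,
        "fired": bool(hits),
        "num_firings": len(hits),
        "best_confidence": best.confidence if best else 0.0,
        "box_area_fraction": area_fraction,
    }


# Made-up detections on a 100x100 image.
detections = [
    Detection("flower", 0.9, (0, 0, 50, 100)),
    Detection("flower", 0.4, (10, 10, 20, 20)),
    Detection("bird", 0.7, (0, 0, 10, 10)),
]
known = {"flower", "bird"}
flower_feats = vision_features("flower", detections, (100, 100), known)
fruit_feats = vision_features("fruit", detections, (100, 100), known)
```

When no detector exists for the head noun, every feature falls back to zero or False, so the classifier has to rely on the text features alone.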
Results with vision features
[Bar chart, axis from 50 to 95: Random, Model, Model+Lists, +Vision, Human]
Features only available on about 11% of examples
8% improvement on phrases with recognizers
Detecting on a large scale...
bird
boat
bottle
bowl
What do people describe?
Given an image
1) Predict what people will describe
“two women sitting brunette blonde on bench reading magazine”
women ●
bench ●
magazine ●
grass
skirt
…
Predicting what will be described
What’s in this image?
man, baby, sling, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, glasses, shirt, …
What do people describe?
“A bearded man is holding a child in a sling.”
“A bearded man stands while holding a small child in a green sheet.”
“A bearded man with a baby in a sling poses.”
“Man standing in kitchen with little girl in green sack.”
“Man with beard and baby”
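One simple way to quantify "what do people describe" for this image is the fraction of captions that mention each labeled object. The sketch below assumes exact whole-word matching and a hypothetical function name `mention_rates`; it is an illustration of the idea, not the measure used in the talk. Synonyms such as "child" for "baby" are not handled.

```python
import re


def mention_rates(objects, captions):
    """Fraction of captions mentioning each object (whole-word, case-insensitive)."""
    rates = {}
    for obj in objects:
        pattern = re.compile(r"\b" + re.escape(obj.lower()) + r"\b")
        mentioned = sum(1 for c in captions if pattern.search(c.lower()))
        rates[obj] = mentioned / len(captions)
    return rates


# Object labels and captions taken from the slide.
objects = ["man", "baby", "sling", "ladder", "fridge", "water bottle", "beard"]
captions = [
    "A bearded man is holding a child in a sling.",
    "A bearded man stands while holding a small child in a green sheet.",
    "A bearded man with a baby in a sling poses.",
    "Man standing in kitchen with little girl in green sack.",
    "Man with beard and baby",
]
rates = mention_rates(objects, captions)
```

Objects such as the ladder and fridge are present in the image but never mentioned, which is the gap the prediction task targets.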
![Page 51: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/51.jpg)
A picture is worth 13.6 words51 Hal Daumé III ([email protected])
What do people describe?“A bearded man is holding a child in a sling.”
manbabysling
ladderfridgetable
watermelonchair
boxescups
water bottlewall
pacifierbeard
glassesshirt
…
What’s in this image?
Predicting what will be described
“A bearded man stands while holdinga small child in a green sheet.” “A bearded man with a baby in a sling poses.”“Man standing in kitchen with little girlin green sack.” “Man with beard and baby”
![Page 52: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/52.jpg)
Description factors
What factors influence what someone will describe about an image?
Two kinds of factors:
– Compositional
– Semantic
![Page 53: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/53.jpg)
“A sail boat on the ocean.”
Size/Saliency
Location
Compositional factors
![Page 54: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/54.jpg)
Compositional factors
“Two men standing on beach.”
Size/Saliency
Location
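One way to make the two compositional factors concrete is to turn an object's bounding box into numeric features. The sketch below is only an illustration under stated assumptions: the `Box` type, the function name, and the choice of "relative area" and "normalized distance from the image center" as stand-ins for size and location are hypothetical and are not taken from the slides.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixels (hypothetical helper type)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def compositional_features(box, img_w, img_h):
    """Return assumed proxies for the Size and Location factors."""
    # Size: fraction of the image area covered by the box.
    relative_size = (box.w * box.h) / (img_w * img_h)
    # Location: distance of the box center from the image center,
    # divided by the half-diagonal so the value lies in [0, 1].
    cx = box.x + box.w / 2
    cy = box.y + box.h / 2
    dist = math.hypot(cx - img_w / 2, cy - img_h / 2)
    half_diag = math.hypot(img_w / 2, img_h / 2)
    return {"relative_size": relative_size, "center_distance": dist / half_diag}


if __name__ == "__main__":
    # A large, centered box versus a small box in the corner.
    print(compositional_features(Box(25, 25, 50, 50), 100, 100))
    print(compositional_features(Box(0, 0, 10, 10), 100, 100))
```

Features like these could feed any classifier that predicts whether an object gets mentioned; which model to use is left open here.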
![Page 55: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/55.jpg)
“girl in the street”
Object Type
Nameable Scene
Unusualness
Semantic factors
![Page 56: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/56.jpg)
Semantic factors
“kitchen in house”
Object Type
Nameable Scene
Unusualness
![Page 57: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/57.jpg)
Semantic factors
“elephant in the beach”
Object Type
Nameable Scene
Unusualness
![Page 58: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/58.jpg)
Semantic factors
“A tree in water and a boy with a beard”
Object Type
Nameable Scene
Unusualness
![Page 59: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/59.jpg)
Generating captions
a) Detect objects and scenes from the input image;
b) Estimate the optimal sentence structure quadruplet T;
c) Generate a sentence from T.
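The three steps can be sketched as a tiny pipeline. The slide does not spell out what the four elements of T are or how "optimal" is scored, so everything specific below is an assumption made for illustration: T is taken to be (noun, verb, scene, preposition), the score is a plain product of detector confidences and corpus-style conditional probabilities, and all numbers are toy values.

```python
from itertools import product

# Step a) Toy detector confidences (hypothetical values).
NOUN_SCORES = {"dog": 0.8, "cat": 0.3}
SCENE_SCORES = {"park": 0.7, "kitchen": 0.2}

# Toy corpus statistics (hypothetical): P(verb | noun) and P(prep | scene).
VERB_GIVEN_NOUN = {("run", "dog"): 0.6, ("sit", "dog"): 0.4,
                   ("run", "cat"): 0.1, ("sit", "cat"): 0.9}
PREP_GIVEN_SCENE = {("in", "park"): 0.9, ("on", "park"): 0.1,
                    ("in", "kitchen"): 0.95, ("on", "kitchen"): 0.05}


def score(t):
    """Assumed scoring rule: product of vision and language evidence."""
    noun, verb, scene, prep = t
    return (NOUN_SCORES[noun] * SCENE_SCORES[scene]
            * VERB_GIVEN_NOUN[(verb, noun)] * PREP_GIVEN_SCENE[(prep, scene)])


def best_quadruplet():
    """Step b) Exhaustively search for the highest-scoring quadruplet T."""
    verbs = {v for v, _ in VERB_GIVEN_NOUN}
    preps = {p for p, _ in PREP_GIVEN_SCENE}
    return max(product(NOUN_SCORES, verbs, SCENE_SCORES, preps), key=score)


def realize(t):
    """Step c) Turn T into a sentence with a fixed template."""
    noun, verb, scene, prep = t
    return f"A {noun} {verb}s {prep} the {scene}."


if __name__ == "__main__":
    print(realize(best_quadruplet()))
```

Exhaustive search is only workable for a toy vocabulary; a real system would need a smarter search over T.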
![Page 63: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/63.jpg)
Using large corpora to compose natural captions
(why write your own material when you can just “steal” it?)
![Page 64: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/64.jpg)
Composing captions
a) monkey playing in the tree canopy, Monte Verde in the rain forest
b) capuchin monkey in front of my window
c) monkey spotted in Apenheul Netherlands under the tree
d) a white-faced or capuchin in the tree in the garden
e) the monkey sitting in a tree, posing for his picture
![Page 68: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/68.jpg)
Captioning with (some) evidence
Caption images where:
We assume some evidence for 1 object
– Tag: “mare” (evidence for horse)
&
Object detector is confident
– High detection score
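The selection rule on this slide (a tag gives evidence for an object, and the detector for that object is confident) can be sketched as a simple filter. The tag-to-category map, the threshold of 0.5, and the function name are hypothetical; the slides do not say how tags are mapped to categories or what counts as a high detection score.

```python
# Hypothetical mapping from user tags to detector categories.
TAG_TO_CATEGORY = {"mare": "horse", "pony": "horse", "kitten": "cat"}


def captionable_objects(tags, detections, threshold=0.5):
    """Return categories backed by both a tag and a confident detection.

    tags: list of user-supplied tag strings for the image.
    detections: dict mapping category name to detector score in [0, 1].
    threshold: assumed cut-off for "detector is confident".
    """
    # Evidence from tags: categories that at least one tag maps to.
    tagged = {TAG_TO_CATEGORY[t] for t in tags if t in TAG_TO_CATEGORY}
    # Keep only those categories whose detector score is high enough.
    return sorted(c for c in tagged if detections.get(c, 0.0) >= threshold)


if __name__ == "__main__":
    print(captionable_objects(["mare", "field"], {"horse": 0.9, "cat": 0.8}))
```

Note that a confident detection without a matching tag (the cat above) is ignored, which mirrors the "&" on the slide.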
![Page 69: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/69.jpg)
Generation: Grab 'N Mash
Grab phrases based on image similarity between the query and a captioned database
– Object detection similarity: NPs, VPs
– Stuff detection similarity: PPs
– Scene similarity: PPs
Mash phrases
– Compose descriptions using simple rule-based concatenation
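The "grab" half can be sketched as a nearest-neighbor lookup per similarity channel. This is a minimal illustration, assuming each database entry already has a feature vector and a pre-extracted phrase; the toy vectors, the Euclidean distance, and the `grab` function are assumptions, not the method used in the talk.

```python
import math

# Toy captioned database (hypothetical features): for each channel,
# a list of (feature vector, phrase extracted from that image's caption).
DATABASE = {
    "object": [((0.9, 0.1), "the sheep"), ((0.1, 0.9), "a duck")],
    "stuff": [((0.2, 0.8), "through frozen grass"), ((0.7, 0.3), "in the water")],
    "scene": [((0.5, 0.5), "in the highlands of Scotland"),
              ((0.0, 1.0), "in the market")],
}


def grab(query_features):
    """For each channel, return the phrase of the most similar database entry."""
    phrases = {}
    for channel, feat in query_features.items():
        # Smallest Euclidean distance stands in for "most similar".
        _, phrase = min(DATABASE[channel], key=lambda entry: math.dist(entry[0], feat))
        phrases[channel] = phrase
    return phrases


if __name__ == "__main__":
    query = {"object": (1.0, 0.0), "stuff": (0.1, 0.9), "scene": (0.4, 0.6)}
    print(grab(query))
```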
![Page 72: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/72.jpg)
Detect: fruit
Find matching fruit detections by color similarity
Tray of glace fruit in the market at Nice, France
Fresh fruit in the market
A box of oranges was just catching the sun, bringing out detail in the skin.
The street market in Santanyi, Mallorca is a must for the oranges and local crafts.
An orange tree in the backyard of the house.
mandarin oranges in glass bowl
Getting NPs – Objects
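"Color similarity" is not defined on the slide. A common, simple choice is to compare normalized color histograms with histogram intersection, and the sketch below assumes that choice. The bin count, function names, and toy pixel lists are all hypothetical, and the pixel lists are assumed to be non-empty.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram with `bins` levels per channel.

    pixels: non-empty list of (r, g, b) tuples with values in 0..255.
    """
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Quantize each channel to 0..bins-1 and flatten to a single index.
        idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins + (b * bins // 256)
        hist[idx] += 1
    total = len(pixels)
    return [count / total for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means the two histograms are identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def most_similar(query_pixels, candidates):
    """Return the name of the candidate whose colors best match the query."""
    q = color_histogram(query_pixels)
    return max(candidates,
               key=lambda name: histogram_intersection(q, color_histogram(candidates[name])))


if __name__ == "__main__":
    query = [(255, 165, 0)] * 4  # an orange-colored detection
    candidates = {"oranges": [(250, 160, 10)] * 4, "grass": [(0, 200, 0)] * 4}
    print(most_similar(query, candidates))
```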
![Page 73: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/73.jpg)
Getting NPs – Objects
The muddy elephant; An elephant; small elephant; A very large and seemingly old elephant; musk male elephant; African elephant; the temple elephant
Fushia flower; a flower; a pink zinna flower; This beautiful flower; a roman pink flower; a tiny pink flower; pink bursting flowers; a perfectly pink gerbera daisy
a lonesome duck; a native new zealand duck; The duck; male Mallard duck; several other ducks; a so-called navigation duck; this duck; a duck; duck; mandarin duck
![Page 74: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/74.jpg)
Getting VPs – objects
Detect: cow
Find matching cow detections by shape/pose similarity
theses cows live in the field behind my house
A cow eating flowers in the south of the Netherlands.
The cow was more interested in eating than looking at me with a camera!
While cycling north on Tremaine Road near Milton, this cow gazed across the road intently.
![Page 75: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/75.jpg)
Getting PPs – stuff
Detect: grass
Find matching grass detections by color similarity
green manure in the veg field - Plaw Hatch
Found on hawthorn in boggy grass field
Sheep in a field spotted during a coastal drive from Tramore to Dungervan
I am happy in a field of green Maryland grass
![Page 76: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/76.jpg)
View from our B&B in this photo
Extract scene descriptor
Find matching images by scene similarity
Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere
I'm about to blow the building across the street over with my massive lung power.
Only in Paris will you find a bottle of wine on a table outside a bookstore
Getting PPs – scenes
![Page 81: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/81.jpg)
Composing captions
NP: the sheep (object color)
VP: meandered along a desolate road (object pose)
PP: in the highlands of Scotland (scene)
PP: through frozen grass (stuff)
Various composition patterns:
NP VP
NP PP_stuff
NP PP_scene
…
NP VP PP_scene PP_stuff
the sheep meandered along a desolate road in the highlands of Scotland through frozen grass
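The "mash" step is simple enough to sketch directly: each pattern is a sequence of slots, and a caption is the retrieved phrases joined in slot order. The phrases and the four patterns are the ones shown on the slide; the patterns elided by "…" are not filled in, and the function names are hypothetical.

```python
# Phrases retrieved for the example image (taken from the slide).
PHRASES = {
    "NP": "the sheep",
    "VP": "meandered along a desolate road",
    "PP_scene": "in the highlands of Scotland",
    "PP_stuff": "through frozen grass",
}

# Only the patterns named on the slide; the elided ones are left out.
PATTERNS = ["NP VP", "NP PP_stuff", "NP PP_scene", "NP VP PP_scene PP_stuff"]


def mash(pattern, phrases):
    """Concatenate the phrases for each slot in the pattern, in order."""
    return " ".join(phrases[slot] for slot in pattern.split())


def all_captions(phrases, patterns=PATTERNS):
    """Generate one caption per pattern whose slots are all available."""
    return [mash(p, phrases) for p in patterns
            if all(slot in phrases for slot in p.split())]


if __name__ == "__main__":
    for caption in all_captions(PHRASES):
        print(caption)
```

Because the rule is pure concatenation, nothing checks that the phrases agree with each other, which is one plausible source of the language errors shown in the failure cases.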
![Page 82: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/82.jpg)
Good results
cat enjoys hiding under the tree
A female Monarch butterfly was visiting the plant in my front yard in Devon 17/10/10
Stained glass window depicting Christ and numerous saints in Washington National Cathedral in the Eglise
A double-decker bus under some spreading shade trees
her flower girl dress designed by Mainbocher in the house
A duck was having a bath in the harbor at whitehaven, cumbria, england in the water near Camley St
![Page 86: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/86.jpg)
Not so good results
Language issues
– A Moo cow tied up around the city eating grass in various places under the tree at the young tree
– male tiger sighting in twelve months of a street
Vision issues
– The silhouetted building and cross stands under water around Loon Mountain
– a girl walking by in a green field in the sun
Just plain silly
– dogs running pic, this time, racing through the sea at Fraisthorpe near Bridlington of Christmas tree in bed
– bike was left here by an ancient civilization not as sophisticated as our own in the grass of granite
![Page 87: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/87.jpg)
Open question...
➢ Can we do this without using pre-defined object/scene/etc. detectors?
➢ Build a representation of each image in the database
➢ Build a representation of the test image
➢ Find 10 most similar database images
➢ Merge their NL descriptions using text-to-text generation techniques
➢ Q: Where do these representations come from???
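The retrieval part of this proposal can be sketched without answering the open question, by simply assuming each image already has a vector representation. Cosine similarity, the function names, and the toy database are assumptions; the merge step (text-to-text generation) is not implemented, and the function just returns the captions that would be merged.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_captions(query_vec, database, k=10):
    """Return the captions of the k database images most similar to the query.

    database: list of (representation vector, caption) pairs.
    """
    ranked = sorted(database, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


if __name__ == "__main__":
    toy_db = [((1.0, 0.0), "caption a"), ((0.0, 1.0), "caption b"),
              ((1.0, 1.0), "caption c")]
    print(nearest_captions((1.0, 0.0), toy_db, k=2))
```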
![Page 90: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/90.jpg)
And why are we trying to do this...???
● Captioning the world for people with visual impairments
● But the captions we have are not really descriptive of the world
● Use vision to “ground out” language
● Is it turtles all the way down?
● That's how babies work!
● Sadly we don't have baby-esque robots yet
![Page 91: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/91.jpg)
Why work on a task at all?
● A solution is of benefit to society
● The process focuses attention on phenomena that are worthy of study
● What is worthy of study? (IMO)
– Low-level linguistic phenomena that hide in the tail
– Human-like abilities to generalize from small data
– Very basic learning of correlations between different modalities (operant conditioning)
René Descartes (1596-1650)
![Page 94: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/94.jpg)
What about 2nd language learning?
● Obvious problems
– Assumes knowledge of 1st language
– Assumes knowledge of the world
– Still don't have a robot...
● But we do have software with exercises for SLA
It's hard for people, too!
![Page 97: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/97.jpg)
Aspects of computational 2ndLL
● Very specific linguistic variants
– Number, case, agreement, etc.
– Not enough to get the majority case
● Focus on subtle visual differences
![Page 99: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/99.jpg)
Aspects of computational 2ndLL
● AI-style reasoning & one-shot learning
● “It's learnable” proof of concept:
![Page 100: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/100.jpg)
What is needed to solve this?
● Linguistic model over character sequences (words not okay!) w/o any L-specific background
● Pre-trained (?) visual detectors for objects, poses and physical relationships (e.g., gaze)
● Ability to reason and generalize from a few examples
![Page 101: Yiannis A picture is worth 13.6 words - UMIACSusers.umiacs.umd.edu/~hal/tmp/talk.pdf · Tuckered out from playing in Nannie’s yard. A phrase is visual if there is a piece of the](https://reader033.fdocuments.net/reader033/viewer/2022050300/5f699824e8004c0d79144ef7/html5/thumbnails/101.jpg)
Thanks! Questions?
Alex Berg
Amit Goyal
Tamara Berg
Jesse Dodge
Yejin Choi
Yiannis Aloimonos
Kota Yamaguchi
Alyssa Mensch
Karl Stratos
Meg Mitchell
Xufeng Han
Ching Lik Teo
Yezhou Yang