CSE 595 Words and Pictures

SBU Digital Media CSE 595 Words and Pictures Tamara L. Berg SUNY Stony Brook

Tamara L. Berg, SUNY Stony Brook. CSE 595: Words & Pictures. Instructor: Tamara Berg ([email protected]). Office: 1411 Computer Science. Lectures: Tues/Thurs 1:20-2:20pm, Rm 2129 CS. Office Hours: Tues/Thurs 2:20-3:20pm and by appt.

Transcript of CSE 595 Words and Pictures

Page 1: CSE 595 Words and Pictures

SBU

Digital

Media

CSE 595 Words and Pictures

Tamara L. Berg

SUNY Stony Brook

Page 2: CSE 595 Words and Pictures

Class Info

CSE 595: Words & Pictures

Instructor: Tamara Berg ([email protected])

Office: 1411 Computer Science

Lectures: Tues/Thurs 1:20-2:20pm, Rm 2129 CS

Office Hours: Tues/Thurs 2:20-3:20pm and by appt.

Course Webpage: http://tamaraberg.com/teaching/Fall_12/wordspics

Page 3: CSE 595 Words and Pictures

About Me

• Joined Stony Brook in 2008

- PhD from UC Berkeley, 2007

- Yahoo! Research, 2007-2008

• Research in computer vision and natural language processing - combining information from multiple forms of digital media for applications like image search and recognition.

Page 4: CSE 595 Words and Pictures

You?

MS/PhD?

Experience in Computer Vision, Natural Language Processing, AI, Machine Learning?

Familiar with Matlab?

Page 5: CSE 595 Words and Pictures

What’s in this picture?

Page 6: CSE 595 Words and Pictures

What does the picture tell us?

Green, textured region – maybe tree?

Fuzzy black thing with a face-like part -- maybe an animal?

Page 7: CSE 595 Words and Pictures

What do the words tell us?

Tags: leaves, endangered, green, i love nature, chennai, nilgiri langur, monkey, forest, wildlife, perch, black, wallpaper, ARK OF WILDLIFE, topv111, WeeklySurvivor, top20HallFame, topv333, 100v10f, captive, simian

Page 8: CSE 595 Words and Pictures

What do words+picture tell us?

Tags: leaves, endangered, green, i love nature, chennai, nilgiri langur, monkey, forest, wildlife, perch, black, wallpaper, ARK OF WILDLIFE, topv111, WeeklySurvivor, top20HallFame, topv333, 100v10f, captive, simian

Page 9: CSE 595 Words and Pictures

Consumer Photo Collections

Over the hills and far away

Road, Hills, Germany, Hoffenheim, Outstanding Shots, specland, Baden-Wuerttemberg

Heavenly

Peacock, AlbinoPeacock, WhiteBeauty, Birds, Wildlife, FeathredaleWildlifePark, PictureAustralia, ImpressedBeauty

End of the world - Verdens Ende - The lighthouse 1

Verdens ende, end of the world, norway, lighthouse, ABigFave, vippefyr, wood, coal

Flickr – 3+ billion photographs, 3-5 million uploaded per day

Page 10: CSE 595 Words and Pictures

Museum and Library Collections

Fine Arts Museum of San Francisco (82,000 images)

Woman of Head Howard H G Mrs Gift America North bust States United Sculpture marble

bowl stemmed small Irridescent glass

New York Public Library

Digital Collection

The new board walk, Rockaway, Long Island

Part of New England, New York, east New Iarsey and Long Iland.

Page 11: CSE 595 Words and Pictures

Web Collections

Billions of Web Pages

Page 12: CSE 595 Words and Pictures

Video

OUTSIDE IN THE RAIN THE SENATOR WEARING HIS UH BASEBALL CAP A BOSTON RED SOX CAP AS HE TALKED TO HIS SUPPORTERS HERE IN THE RAIN THE UH SENATOR THEY'RE DOING HIS BEST TO TRY TO MAKE HIS CASE THAT HE WILL BE THE MAN FOR THE MIDDLE CLASS AND UH TRY TO CONVINCE HIS SUPPORTERS TO EXPRESS THEIR SUPPORT THROUGH A VOTE ON TUESDAY IN THERE WE ARE TWENTY FOUR HOURS FROM THE GREAT MOMENT THAT THE WORLD IN AMERICA IS WAITING FOR IT I NEED TO YOU IN THESE HOURS TO GO OUT AND DO THE HARD WORK NOT ON THOSE DOORS MAKE THOSE PHONE CALLS TO TALK TO FRIENDS TAKE PEOPLE TO THE POLLS HELP US CHANGE THE DIRECTION OF THIS GREAT NATION FOR THE BETTER CAN YOU IMAGINE A UH SENATOR BEGINNING HIS DAY IN FLORIDA TODAY

TrecVid 2006 – video frames with speech processing output

Page 13: CSE 595 Words and Pictures

Consumer Products

Soft and glossy patent calfskin trimmed with natural vachetta cowhide, open top satchel for daytime and weekends, interior double slide pockets and zip pocket, seersucker stripe cotton twill lining, kate spade leather license plate logo, imported. 2.8" drop length 14"h x 14.2"w x 6.9"d Katespade.com

It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long. * Measures 38" from center back, hits at the knee. * Scoopneck, full skirt. * Hidden side zip, fully lined. * 100% Linen. Dry clean. bananarepublic.com

Internet retail transactions totaled $145 billion in 2006 and $175 billion in 2007 (Forrester Research).

Page 14: CSE 595 Words and Pictures

Lots of Data!

Page 15: CSE 595 Words and Pictures

What do we want to do?

Page 16: CSE 595 Words and Pictures

What do we want to do?

Organize

Search

Browse

Page 17: CSE 595 Words and Pictures

What do we want to do?

Organize

Search

Browse

Page 18: CSE 595 Words and Pictures

What do we want to do?

Organize

Search

Browse

Computing Iconic Summaries for General Visual Concepts. R. Raguram and S. Lazebnik, 2008.

Page 19: CSE 595 Words and Pictures

What do we want to do?

Image Search circa 2007

Organize

Search

Browse

Page 20: CSE 595 Words and Pictures

What do we want to do?

Image Search now

Organize

Search

Browse

Page 21: CSE 595 Words and Pictures

What do we want to do?

Image re-ranking for “monkey”

Tamara L. Berg, David A. Forsyth. Animals on the Web. CVPR 2006.

Organize

Search

Browse

Page 22: CSE 595 Words and Pictures

What do we want to do?

Visual shopping at like.com

Organize

Search

Browse

Page 23: CSE 595 Words and Pictures

What do we want to do?

Visual attribute discovery

Tamara L. Berg, Alexander C. Berg, Jonathan Shih. Automatic Attribute Discovery and Characterization from Noisy Web Data. ECCV 2010.

Organize

Search

Browse

Page 24: CSE 595 Words and Pictures

What do we want to do?

Visual attribute discovery

J. Wang, K. Markert, and M. Everingham. "Learning models for object recognition from natural language descriptions". BMVC 2009.

Organize

Search

Browse

Page 25: CSE 595 Words and Pictures

Types of Words & Pictures

Page 26: CSE 595 Words and Pictures

General web pages

Page 27: CSE 595 Words and Pictures

General web pages

Image re-ranking for “monkey”

Tamara L. Berg, David A. Forsyth. Animals on the Web. CVPR 2006.

Improving Search
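
One way to picture re-ranking is to mix a text-based relevance score with a score computed from the image itself. The following is a minimal sketch under stated assumptions, not the method from the Animals on the Web paper: the function name `rerank`, the equal weighting, and all example scores are hypothetical, and both scores are assumed to lie in [0, 1].

```python
from typing import List, Tuple


def rerank(results: List[Tuple[str, float, float]], text_weight: float = 0.5) -> List[str]:
    """Order image ids by a weighted mix of text and visual scores.

    Each result is (image_id, text_score, visual_score); both scores are
    assumed to be in [0, 1]. This is an illustration only.
    """
    def combined(item: Tuple[str, float, float]) -> float:
        _, text_score, visual_score = item
        return text_weight * text_score + (1.0 - text_weight) * visual_score

    return [image_id for image_id, _, _ in sorted(results, key=combined, reverse=True)]


# Hypothetical results for the query "monkey".
results = [
    ("monkey_wrench.jpg", 0.9, 0.1),   # strong text match, does not look like an animal
    ("langur_in_tree.jpg", 0.6, 0.9),  # weaker text match, looks like an animal
    ("monkey_bars.jpg", 0.7, 0.2),
]
ranked = rerank(results)
```

With these made-up numbers, the image whose pixels agree with the query rises above images that only match on text.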

Page 28: CSE 595 Words and Pictures

General web pages

Harvesting Image Databases from the Web. Schroff, F., Criminisi, A. and Zisserman, A. ICCV 2007.

Mining to build big computer vision data sets.

Page 29: CSE 595 Words and Pictures

General web pages

Pros?

Cons?

Page 30: CSE 595 Words and Pictures

Tags or keywords + images

Tags: canon, eos, macro, japan, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.

Page 31: CSE 595 Words and Pictures

Tags or keywords + images

Gang Wang, Derek Hoiem, and David Forsyth, Building text features for object image classification.  CVPR, 2009.

Using tags and similar images for novel image classification

Page 32: CSE 595 Words and Pictures

Tags or keywords + images

Tag Order as implicit cue to expected size

“Reading Between The Lines: Object Localization Using Implicit Cues from Image Tags”. Sung Ju Hwang and Kristen Grauman.
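
To make "tag order as a cue" concrete, one could turn each tag's position in the list into a number. The sketch below is only an illustration of that idea; the slide does not give the paper's actual features, and the function name `tag_rank_features` and the 0-to-1 scaling are assumptions.

```python
from typing import Dict, List


def tag_rank_features(tags: List[str]) -> Dict[str, float]:
    """Map each distinct tag to its relative position in the tag list.

    The first tag gets 0.0 and the last gets 1.0. Repeated tags keep the
    position of their first mention. Hypothetical feature, for illustration.
    """
    seen: List[str] = []
    for tag in tags:
        tag = tag.strip().lower()
        if tag and tag not in seen:
            seen.append(tag)
    if len(seen) == 1:
        return {seen[0]: 0.0}
    return {tag: index / (len(seen) - 1) for index, tag in enumerate(seen)}


tags = ["canon", "eos", "macro", "japan", "frog", "animal", "toad"]
features = tag_rank_features(tags)
```

A downstream model could then treat a low value (an early mention) as one weak hint about an object, alongside the visual evidence.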

Page 33: CSE 595 Words and Pictures

Tags or keywords + images

Tags: canon, eos, macro, japan, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.

Pros?

Cons?

Page 34: CSE 595 Words and Pictures

President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

Captioned images

Page 35: CSE 595 Words and Pictures

President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

Captioned images for face labeling

Captions provide direct information about depiction!

Page 36: CSE 595 Words and Pictures

Who's Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation. Jie Luo, Barbara Caputo, Vittorio Ferrari. NIPS 2009.

Captioned images for face and pose labeling

Page 37: CSE 595 Words and Pictures

Videos with transcripts

Page 38: CSE 595 Words and Pictures

M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" - Automatic naming of characters in TV video. BMVC 2006.

Videos with transcripts for face labeling

Page 39: CSE 595 Words and Pictures

Learning by Watching

Page 40: CSE 595 Words and Pictures

P. Buehler, M. Everingham, and A. Zisserman. "Learning sign language by watching TV (using weakly aligned subtitles)". CVPR 2009.

Learning Sign Language

Page 41: CSE 595 Words and Pictures

Learning to Sportscast: A Test of Grounded Language Acquisition (2008). David L. Chen and Raymond J. Mooney.

Learning to Sportscast

Page 42: CSE 595 Words and Pictures

Learning About Semantics

Page 43: CSE 595 Words and Pictures

Traditional Recognition

car

shoe

person

Page 44: CSE 595 Words and Pictures

Beyond traditional recognition

Page 45: CSE 595 Words and Pictures

Beyond traditional recognition

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Scarlett O’Hara, Gone with the Wind.

Page 46: CSE 595 Words and Pictures

Attributes

Visual attribute learning from text

Tamara L. Berg, Alexander C. Berg, Jonathan Shih. Automatic Attribute Discovery and Characterization from Noisy Web Data. ECCV 2010.

Page 47: CSE 595 Words and Pictures

Object relationships

Page 48: CSE 595 Words and Pictures

Object relationships

Object relationships – prepositions & adjectives

Beyond Nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. Abhinav Gupta and Larry S. Davis. ECCV 2008.

Car is on the street

Page 49: CSE 595 Words and Pictures

Cross-Language Learning

Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images. Shane Bergsma and Benjamin Van Durme, 2011.

Page 50: CSE 595 Words and Pictures

Descriptive Text

Visually descriptive language offers: 1) information about the world, especially the visual world. 2) training data for how people construct natural language to describe imagery.

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Scarlett O’Hara, Gone with the Wind.

Page 51: CSE 595 Words and Pictures

Generating descriptions for images

Page 52: CSE 595 Words and Pictures

Generating Captions for News Images with Articles

“How Many Words is a Picture Worth? Automatic Caption Generation for News Images”

Feng & Lapata 2010

Page 53: CSE 595 Words and Pictures

Generating Simple Descriptions for images

“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.”

Baby Talk: Understanding and Generating Simple Image Descriptions (2011). Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, Tamara L. Berg.
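
The first sentence of the example description reads like a filled-in template. The sketch below shows only that template-filling step, assuming the objects and their counts have already been detected; the function name `describe` and the input format are hypothetical, and the slide does not say this is how the Baby Talk system builds its sentences.

```python
from typing import List, Tuple


def describe(objects: List[Tuple[str, str]]) -> str:
    """Build a 'This picture shows ...' sentence from (count_word, noun) pairs."""
    phrases = [f"{count} {noun}" for count, noun in objects]
    if not phrases:
        return "This picture shows nothing recognizable."
    if len(phrases) == 1:
        listing = phrases[0]
    elif len(phrases) == 2:
        listing = " and ".join(phrases)
    else:
        # Three or more items: comma-separated with a final ", and".
        listing = ", ".join(phrases[:-1]) + ", and " + phrases[-1]
    return f"This picture shows {listing}."


# Hypothetical detections matching the slide's example.
detections = [("one", "person"), ("one", "grass"), ("one", "chair"), ("one", "potted plant")]
sentence = describe(detections)
```

The remaining sentences in the example ("The person is near the green grass...") would additionally need attributes and spatial relations, which this sketch leaves out.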

Page 54: CSE 595 Words and Pictures

Im2Text: Describing Images Using 1 Million Captioned Photographs

Vicente Ordonez, Girish Kulkarni, Tamara L. Berg. Stony Brook University.

NIPS 2011

One of the many stone bridges in town that carry the gravel carriage roads.

An old bridge over dirty green water.

A stone bridge over a peaceful river.

Generate Natural Sounding Descriptions
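
With a large collection of captioned photographs, one simple way to describe a new image is to borrow the caption of its most similar photo. The sketch below illustrates that nearest-neighbor idea only; the three-number feature vectors and the function name `nearest_caption` are hypothetical, and the slide does not spell out the Im2Text method.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_caption(query: Sequence[float], database: List[Tuple[Sequence[float], str]]) -> str:
    """Return the caption of the database photo closest to the query features."""
    _, caption = min(database, key=lambda entry: squared_distance(query, entry[0]))
    return caption


# Hypothetical features paired with captions taken from the slide.
database = [
    ([0.9, 0.1, 0.2], "An old bridge over dirty green water."),
    ([0.1, 0.8, 0.7], "A stone bridge over a peaceful river."),
]
caption = nearest_caption([0.2, 0.7, 0.6], database)
```

Because the captions were written by people, a transferred caption tends to sound natural even when the match is imperfect.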

Page 55: CSE 595 Words and Pictures

Summary

Enormous amounts of data.

Lots of commercial and academic applications.

We should combine information from words & pictures intelligently.

Page 56: CSE 595 Words and Pictures

Overall Class Goal

Gain exposure to interesting and current research on Words & Pictures.

No prior experience in Computer Vision or Natural Language Processing is required.

We will be reading a variety of research papers over the course of the semester

Please read the papers!

Page 57: CSE 595 Words and Pictures

General knowledge lectures

Computer Vision

Natural Language Processing

Features & Representations

Clustering

Discriminative Models & Classification

Generative & Topic Models

Page 58: CSE 595 Words and Pictures

Your responsibilities

Homework (30%) - 3 relatively simple assignments.

Project (30%) - final project including proposal, update, and final presentation & write-up.

Participation (30%) - read papers and participate in topic discussions.

Topic presentations (10%) - one in-class topic presentation in groups of 4-5.

Late assignments/projects will be accepted with a 10% reduction in value per day late.
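
As a worked example of the late policy, the sketch below assumes the 10% is taken from the original value for each day late (so the reduction is linear, not compounding) and that a score cannot go below zero. Both assumptions go beyond what the slide states, and the function name `late_score` is hypothetical.

```python
def late_score(raw_score: float, days_late: int) -> float:
    """Apply a 10%-per-day reduction to a score.

    Assumes a linear reduction of the original value, floored at zero.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    factor = max(0.0, 1.0 - 0.10 * days_late)
    return raw_score * factor


on_time = late_score(90.0, 0)
two_days = late_score(90.0, 2)
very_late = late_score(90.0, 12)
```

Under these assumptions, an assignment worth 90 points that arrives two days late keeps 80% of its value, and anything ten or more days late is worth nothing.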

Page 59: CSE 595 Words and Pictures

Homework & Projects

Assignments should be completed individually in Matlab.

Projects will be in groups of 3 and can be completed in the language of your choice on the topic of your choice (must involve text and images/video).

Page 60: CSE 595 Words and Pictures

Participation Experiment

Goal: interesting, lively discussions about research topics.

To encourage this, at the end of each class please submit a sheet of paper noting how many (if any) questions you posed, answers you provided, or significant comments you made.

If this does not work, we will revert to having short sporadic pop quizzes on papers.

Page 61: CSE 595 Words and Pictures

Note about papers

You won’t understand everything, especially at first.

Don’t sweat the small stuff.

Try to grasp the overall idea, what’s novel, what’s interesting, pros/cons of the method, how it relates to other things we’ve read.

Page 62: CSE 595 Words and Pictures

Topic Presentations

You will give one topic presentation during the semester in groups of 4-5.

Suggested papers for each topic presentation are listed on the course website.

You are welcome to swap papers (if relevant to your topic), but please ask me at least 1 week prior to the presentation.

Page 63: CSE 595 Words and Pictures

Reference Books

1) Forsyth, David A., and Ponce, J. Computer Vision: A Modern Approach, Prentice Hall, 2003.

2) Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision, Academic Press, 2002.

3) Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, McGraw Hill, 2008.

4) Christopher D. Manning and Hinrich Schuetze. Foundations of Statistical Natural Language Processing.

Page 64: CSE 595 Words and Pictures

For next class

Get access to Matlab. Student Matlab licenses can be purchased from MathWorks for $99.

Do a Matlab tutorial. One link is on the course website; many others are available online.

Page 65: CSE 595 Words and Pictures

Class Info

CSE 595: Words & Pictures

Instructor: Tamara Berg ([email protected])

Office: 1411 Computer Science

Lectures: Tues/Thurs 1:20-2:20pm, Rm 2129 CS

Office Hours: Tues/Thurs 2:20-3:20pm and by appt.

Course Webpage: http://tamaraberg.com/teaching/Fall_12/wordspics