Intelligent Photo Retrieval Using Commonsense Semantics
Hugo Liu, Software Agents Group, MIT Media Lab
[email protected]
PARC talk, 08 Apr 2002, Palo Alto, CA
Overview
- Introduction to ARIA: storytelling with photos (e.g. email, web page)
- Automated learning of photo annotations: by observation of the user’s behavior; parsing, IE heuristics, identifying semantic roles
- Proactive photo retrieval: fuzzy match of the user’s current textual context to annotated photos; use of commonsense reasoning to make retrieval more robust
Organizing Digital Photos: A Common Approach
Tasks
- Rename the JPG file
- Organize it into a directory
Drawbacks
- Manual annotations are time-consuming
- May be hard to find relevant photos later
Organizing Digital Photos: Using Software Tools
Kodak’s Picture Easy software
- Users create albums (hierarchical)
Drawbacks
- Upfront cost imposed on the user
- Photos must be put under only one category
- Category types: places, social grouping, events
- Hard to choose categories
- Hard to find photos, recall categories, etc.
Why not use annotations?
An annotation is:
- A keyword associated with an image
- Each image can have multiple annotations
Compared with directory structure:
- A directory represents one classification of a photo, e.g. event, geographical location, etc.
- Annotations allow for multiple “classifications” (see the sketch below)
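To make the contrast concrete, here is a minimal Python sketch; the mapping structure and photo names are hypothetical illustrations, not ARIA's storage format:

# Hypothetical sketch: an annotation store allows many "classifications"
# per photo, where a directory tree forces exactly one path.
annotations = {
    "wedding01.jpg": ["wedding", "Ken", "Mary", "gazebo", "San Francisco"],
    "wedding02.jpg": ["wedding", "reception", "cake"],
}

# Retrieve by any facet, not just the one a directory would have encoded:
print([photo for photo, tags in annotations.items() if "wedding" in tags])
# ['wedding01.jpg', 'wedding02.jpg']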
ARIA: Annotation and Retrieval Integration Agent
Semi-automated annotation and retrieval
- Allows the user to do image-related tasks, e.g. emailing photos, posting web pages, storytelling
- Annotation: uses descriptions from the text to annotate photos
- Retrieval: proactively suggests possibly relevant photos
The User Interface
- Email headers
- Text pane
- Photo pane (ordered by relevance)
- Double-click to edit annotations
A Scenario of Annotation
John has just returned from a friend’s wedding.
He wants to share his experience with his friends through pictures that he took.
He opens ARIA, and the photos from his trip are automatically entered into his ARIA shoebox.
He begins to write an email to his friend Jane:
To: [email protected]
Subject: My wonderful weekend
Hi Jane!
Last weekend, I went to Ken and Mary’s wedding in San Francisco. Here’s a picture of the two of them at the gazebo.
Regards, John
A Baseline Approach
Extract all keywords from the text surrounding the photo:
Jane, Last Weekend, Went, Ken, Mary’s, Wedding, San Francisco, Picture, Gazebo
Doesn’t make sense! Rather brittle!
Our approach
Use concepts, not keywords. Identify semantically important concepts:
- Who: Ken, Mary
- What: wedding
- When: June 12, 2001, 1pm
- Where: gazebo, San Francisco, CA
How did ARIA learn this annotation?
“Hi Jane! Last weekend, I went to Ken and Mary’s wedding in San Francisco. Here’s a picture of the two of them at the gazebo.”
- “Last weekend” → the date that this event occurred
- “Ken and Mary” → the two people pictured
- “wedding” → the event
- “San Francisco” → the place
- “the gazebo” → the place
Extracting Semantic Roles from Text
What Understanding Agents are Needed?
“Hi Jane! Last weekend, I went to Ken and Mary’s wedding in San Francisco. Here’s a picture of the two of them at the gazebo.”
- Temporal UA: needs to recognize date and time expressions
- People UA: needs a KB of names and roles (e.g. brother, tour guide)
- Events UA: needs an events KB
- Place UA: needs a places KB
- Anaphora resolution / context attacher (links to the previous paragraph)
ARIA’s internal representation of text
Last weekend, I went to Ken and Mary’s wedding in San Francisco. Here’s a picture of the two of them at the gazebo.
ARIA_DATE_INTERVAL{2001-6-10, 2001-6-12} ,
[PERSON $user] [ACTION go]
    [TRAJECTORY to [PERSON Ken] and [PERSON Mary] ’s [EVENT wedding]
        [LOCATION in [PLACE San Francisco]]] [SBR .]
[PLACE Here] [STATE is]
    [THING a picture [PROPERTY of [PERSON the two of them]
        [LOCATION at [PLACE the gazebo]]]] [SBR]
Mechanism for information extraction
1. Tokenize and normalize
2. Anaphora resolution / context attacher
3. Temporal UA runs
4. POS tagging: TBL tagger (Brill, 1994) on the Penn Treebank tagset
5. Syntactic parsing: Link Grammar (Sleator & Temperley, 1993); outputs a constituent parse
6. Semantic processing: combines POS info, runs semantic UAs; syntactic parse tree → semantic parse tree; ontology of tags for event structure (Jackendoff, 1983, 1990). Not as deep as full semantic understanding, e.g. UNITRAN (Dorr, 1992), but not as brittle
Frame Extraction
ARIA_DATE_INTERVAL{2001-6-10, 2001-6-12} ,
[PERSON $user] [ACTION go]
    [TRAJECTORY to [PERSON Ken] and [PERSON Mary] ’s [EVENT wedding]
        [LOCATION in [PLACE San Francisco]]] [SBR .]
[PLACE Here] [STATE is]
    [THING a picture [PROPERTY of [PERSON the two of them]
        [LOCATION at [PLACE the gazebo]]]] [SBR]
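As an illustration of how frames might be pulled from this bracketed representation, here is a toy Python sketch; the regex approach is my construction for this example, not ARIA's extractor:

import re

# Toy sketch (not ARIA's code): pull role fillers out of the bracketed
# semantic parse. Matches only flat [TAG filler] spans, which is enough
# to recover Who/What/When/Where for this example.
PARSE = ("ARIA_DATE_INTERVAL{2001-6-10,2001-6-12} , [PERSON $user] [ACTION go] "
         "[TRAJECTORY to [PERSON Ken] and [PERSON Mary] 's [EVENT wedding] "
         "[LOCATION in [PLACE San Francisco]]]")

def extract_frame(parse: str) -> dict:
    frame = {"when": re.findall(r"ARIA_DATE_INTERVAL\{([^}]+)\}", parse),
             "who": [], "what": [], "where": []}
    role_of = {"PERSON": "who", "EVENT": "what", "PLACE": "where"}
    for tag, filler in re.findall(r"\[(PERSON|EVENT|PLACE) ([^][]+)", parse):
        frame[role_of[tag]].append(filler.strip())
    return frame

print(extract_frame(PARSE))
# {'when': ['2001-6-10,2001-6-12'], 'who': ['$user', 'Ken', 'Mary'],
#  'what': ['wedding'], 'where': ['San Francisco']}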
Hold On a Second…
The pragmatist says: why go through all this effort to parse text? Why not just match against your dictionary of PLACES, EVENTS, NAMES?
Response: that would make a great baseline. But getting the constituent structure is helpful in:
- Learning bindings, e.g. “Here is a picture of George, who was our priest and also a great family friend.”
- Exploiting structure to induce semantic importance, e.g. “Here is my wife posing with foobar in front of the Eiffel Tower.”
Retrieval
As user types, the photos in the shoebox dynamically reorder (based on relevance)
A baseline approach: “exact matching”. Problem: it is not robust.
- Morphological variations (e.g. sisters ≠ sister)
- Semantic connections missed (e.g. wedding ≠ bride)
Our Approach
- Morphological variations are easy to solve: lemmatize words, e.g. married → marry, brides → bride (one possible implementation is sketched below)
- Semantic connections are a good candidate for commonsense reasoning:
  1. In consumer photography, events and situations tend to be somewhat stereotyped and predictable: brides are usually found at weddings; a birthday cake may be found at a birthday party; picnics may be located in parks (note the granularity)
  2. So, given the annotation “bride”, also add “wedding”
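ARIA's lemmatizer is not named on the slide; as one hedged illustration, NLTK's WordNet lemmatizer reproduces the same mappings:

# Illustrative only: NLTK's WordNet lemmatizer reproduces the slide's
# examples. ARIA's own lemmatizer is not specified here.
# Setup: pip install nltk, then nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("brides"))            # 'bride'
print(lemmatizer.lemmatize("married", pos="v"))  # 'marry'
print(lemmatizer.lemmatize("sisters"))           # 'sister'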
Expand Annotations using Commonsense
Each annotation undergoes common sense expansion, which produces other semantically related annotations. For example:
>>> expand('bride')
('wedding', '0.3662')  ('woman', '0.2023')
('ball', '0.1517')  ('tree', '0.1517')  ('snow covered mountain', '0.1517')
('flower', '0.1517')  ('lake', '0.1517')  ('cake decoration', '0.1517')
('grass', '0.1517')  ('groom', '0.1517')  ('tender moment', '0.1517')
('veil', '0.1517')  ('tuxedo', '0.1517')  ('wedding dress', '0.1517')
('sky', '0.1517')  ('hair', '0.1517')  ('wedding bouquet', '0.1517')
Something that’s not perfect or complete can still be useful!
Retrieval Steps
- As the user types, use the most recent n words to rerank photos by relevance
- Fuzzy match the n words to annotated photos (expanded with commonsense)
- Quality of match is determined by a simple scoring function (a hypothetical stand-in is sketched below)
- Run the temporal UA on the n words for temporal retrieval against photo timestamps
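The slides do not give the actual scoring function; the following stand-in, which sums annotation weights (1.0 for direct annotations, expansion scores for commonsense-expanded ones), is one plausible reading, not ARIA's code:

# Hypothetical scoring: sum the weights of annotations hit by the n most
# recent words (direct annotation = 1.0, expansion = its expansion score).
def score(recent_words, photo_annotations):
    return sum(photo_annotations.get(w, 0.0) for w in recent_words)

def rerank(recent_words, shoebox):
    # shoebox: {photo_id: {concept: weight}}; highest score first
    return sorted(shoebox, key=lambda p: score(recent_words, shoebox[p]),
                  reverse=True)

shoebox = {
    "gazebo.jpg": {"bride": 1.0, "wedding": 0.3662, "woman": 0.2023},
    "beach.jpg":  {"beach": 1.0, "sand": 0.5},
}
print(rerank(["wedding", "picture"], shoebox))  # ['gazebo.jpg', 'beach.jpg']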
More on Commonsense
Source of commonsense: Open Mind Common Sense (OMCS) (Singh, 2002)
OMCS is:
- Publicly acquired common knowledge about the world
- About 400,000 commonsense facts
- Represented as semi-structured English sentences
Examples of OMCS entries
Well-constrained (NP/AP slots):
- Something you may find in a restaurant is a waiter.
- Something that might come after a wedding is a wedding reception.
Poorly constrained:
- Something that might happen in a play is you forget your lines.
More on Open Mind
Comparison: CYC (Lenat, 1998)
- 1 million hand-crafted assertions
- Represented more formally, i.e. as logic
OMCS advantages:
- Publicly and freely available
- Less granular (i.e. more knowledge about social-level interactions, versus “to open a door, grasp the door knob and turn it”)
- Easy to add to (e.g. personal commonsense)
Open Mind Caveats
- More ambiguity than CYC: word senses are not disambiguated
- Coverage is uneven, and spotty at times
- Culturally specific (CYC is too), i.e. common knowledge for middle-class Americans; the wedding scenario is a perfect example
Open Mind Redeeming Qualities
- Good granularity for our problem domain, i.e. events, social composition, etc.
- Many simple, binary relations
- Our approach is tolerant to noise: each commonsense annotation doesn’t count much; it combines a lot of “partial evidence”
- Adaptive photo suggestion is “soft” AI: unlike “hard” uses like question answering and general problem solving, photo suggestion isn’t mission-critical
Expanding annotations: Under the hood
- Use OMCS to build a semantic resource; we choose a spreading activation network (SAN)
- Nodes are concepts, edges are semantic relations, e.g. bride and wedding are connected through a foundAt edge
- SAN applications to IR: Salton & Buckley (1988)
- Enables shallow commonsense reasoning (applying only the transitivity inference pattern)
- On-the-fly photo suggestion demands efficiency
Building the graph
Sample mapping rule from OMCS to pred-arg structure:
- Template: “somewhere THING1 can be is PLACE1” → somewhereCanBe(THING1, PLACE1), with forward/backward weights (0.5, 0.1)
- Differentiate forward/backward weights: small → big (e.g. bride → wedding) is preferred
- 20 mapping rules are applied to get 50,000 pred/arg structures (mostly binary); a hypothetical sketch of one rule follows
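As a sketch of how one such rule might run (the regex and the example sentence are assumptions; the slide gives only the template and the weights):

import re

# Hypothetical implementation of the sample rule
# "somewhere THING1 can be is PLACE1" -> somewhereCanBe(THING1, PLACE1)
# with forward/backward weights (0.5, 0.1).
RULE = re.compile(r"^somewhere (?P<thing>.+?) can be is (?P<place>.+?)\.?$",
                  re.IGNORECASE)

def apply_rule(sentence: str):
    m = RULE.match(sentence.strip())
    if m:
        return ("somewhereCanBe", m.group("thing"), m.group("place"), (0.5, 0.1))

print(apply_rule("Somewhere a bride can be is a wedding."))
# ('somewhereCanBe', 'a bride', 'a wedding', (0.5, 0.1))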
Mapping Rules
Mapping Rules target this subset of OMCS:
Classification: A dog is a pet
Spatial: San Francisco is part of California
Scene: Things often found together are: restaurant, food, waiters, tables, seats
Purpose: A vacation is for relaxation; Pets are for companionship
Causality: After the wedding ceremony comes the wedding reception.
Emotion: A pet makes you feel happy; Rollercoasters make you feel excited and scared.
Cleaning Args in Pred/Args
Apply a sieve grammar to normalize concepts:
NOUN PHRASE:
  (PREP) (DET|POSS-PRON) NOUN
  (PREP) (DET|POSS-PRON) NOUN NOUN
  (PREP) NOUN POSS-MARKER (ADJ) NOUN
  (PREP) (DET|POSS-PRON) NOUN NOUN NOUN
  (PREP) (DET|POSS-PRON) (ADJ) NOUN PREP NOUN
ACTIVITY PHRASE:
  (PREP) (ADV) VERB (ADV)
  (PREP) (ADV) VERB (ADV) (DET|POSS-PRON) (ADJ) NOUN
  (PREP) (ADV) VERB (ADV) (DET|POSS-PRON) (ADJ) NOUN NOUN
Examples: “to cook some dinner” → “cook dinner”; “in the playground” → “playground” (a toy stand-in is sketched below)
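A toy stand-in for the sieve (the real grammar pattern-matches the POS sequences above; this stopword strip only mimics the two examples):

# Toy stand-in for the sieve grammar: drop prepositions, determiners,
# and possessive pronouns. The real grammar matches POS-tag patterns.
DROP = {"to", "in", "on", "at", "of", "the", "a", "an", "some",
        "my", "your", "his", "her", "our", "their"}

def normalize(phrase: str) -> str:
    return " ".join(w for w in phrase.lower().split() if w not in DROP)

print(normalize("to cook some dinner"))  # 'cook dinner'
print(normalize("in the playground"))    # 'playground'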
Building concept node graph
- 30,000 concept nodes
- 80,000 pairs of directed edges
- Average branching factor is 5
[Figure: toy graph fragment with nodes bride, wedding, and groom, joined by FoundTogether and PartOf edges carrying forward/backward weights such as 0.5, 0.2, and 0.1]
Spreading activation algorithm
- The origin node is the annotation; it activates with energy = 1
- Next, nodes 1 hop away are activated, then 2 hops, etc.
- Nodes will activate if their energy meets a threshold (e.g. 0.2). Because edges are a measure of supposed semantic connectedness, lower energy means less semantic relevance
- Given edge (A, B), the activation score of B is: S(B) = S(A) * weight(edge(A, B)) (a minimal sketch follows)
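A minimal sketch over the toy graph from the earlier figure; the edge weights and the 0.2 threshold come from the slides, while the breadth-first traversal details are assumptions:

# Minimal spreading activation: S(B) = S(A) * weight(edge(A, B)), pruning
# nodes whose energy falls below the threshold. Toy weights only.
graph = {  # node -> [(neighbor, forward edge weight)]
    "bride":   [("wedding", 0.5), ("groom", 0.2)],
    "wedding": [("bride", 0.1), ("groom", 0.5)],
    "groom":   [("wedding", 0.5), ("bride", 0.1)],
}

def spread(origin: str, threshold: float = 0.2) -> dict:
    scores = {origin: 1.0}      # origin activates with energy = 1
    frontier = [origin]
    while frontier:             # expand one hop at a time
        nxt = []
        for a in frontier:
            for b, w in graph.get(a, []):
                s = scores[a] * w    # S(B) = S(A) * weight(edge(A, B))
                if s >= threshold and s > scores.get(b, 0.0):
                    scores[b] = s
                    nxt.append(b)
        frontier = nxt
    return scores

print(spread("bride"))  # {'bride': 1.0, 'wedding': 0.5, 'groom': 0.25}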
Reweighting edges
- Two opportunities to reweight edges to improve semantic relevance: reinforcement and inverse popularity
- First, reinforcement. Simple idea: nodes connected through multiple paths are more semantically related. Applies to k-clusters (mutually reinforced)
[Figure: nodes A and B mutually reinforced by multiple connecting paths through intermediate nodes P, Q, and C]
Inverse Popularity
- Punish a node for having too many children; this is often symptomatic of word-sense collision (since OMCS is not sense-disambiguated)
- Sense collisions lead to noise
- Discount by the equations below:
  discount = 1 / log(branchingFactor)
  newWeight = oldWeight * discount
[Figure: “bridesmaid”, with branching factor 2, links to “wedding” and few other concepts; “bride”, with branching factor 129, links to “wedding”, “beauty”, “groom”, and many other concepts]
More Example Expansions
>>> expand('london')
('england', '0.9618')  ('ontario', '0.6108')  ('europe', '0.4799')
('california', '0.3622')  ('united kingdom', '0.2644')  ('forest', '0.2644')
('earth', '0.1244')
>>> expand('symphony')
('concert', '0.5')  ('piece of music', '0.4')  ('theatre', '0.2469')
('screw', '0.2244')  ('concert hall', '0.2244')  ('xylophone', '0.1')
('harp', '0.1')  ('viola', '0.1')  ('cello', '0.1')
('wind instrument', '0.1')  ('bassoon', '0.1')  ('bass fiddle', '0.1')
Finally, the Demo
Evaluation
- Full evaluation completed on ARIA 1.0 at Kodak: users liked it better than Kodak PictureEasy
- Only an informal evaluation of ARIA 2.0 (with NL and CS). Interesting findings:
  - When CS expansion helped, users had positive comments (mutual adaptation)
  - When CS expansion suggested an irrelevant photo, users still thought the computer was being helpful, and attributed intentionality to ARIA
  - This validates photo suggestion as a “fail-soft” problem
Related Work
- State of the art: FotoFile (Kuchinsky et al., 1999). DOES some image analysis to propose annotations; DOESN’T do observational learning of annotations from text
- Watson (Budzik and Hammond, 2000). DOES observe user actions and automate retrieval; DOESN’T learn annotations
- NEITHER does adaptive retrieval of semantically connected photos
- OMCS vs. WordNet (Fellbaum, 1998) and thesauri: OMCS operates on a concept level (e.g. activity) vs. WordNet on a lexical level; WordNet gives a small set of nymic relations, while OMCS offers a greater variety of relations describing the real world
Future Work
- Learning personal commonsense to build a user model, e.g. “…Mary, my best friend, …” → “Mary is user’s best friend”
- Bulk annotations: as the photo shoebox grows, cluster annotations by time and event (can be done heuristically), then propagate annotations to the other photos in each cluster, e.g. propagate “european vacation” to all photos from the vacation
Pointers, etc.
Access to papers: google for “hugo liu”
Related projects:
- GOOSE: A Goal-Oriented Search Engine with Commonsense (AH-2002)
- MAKEBELIEVE: Using Commonsense to Generate Stories (AAAI-2002)
- Natural Language Agent for Automatic Meeting Scheduling through Emails (unpublished)