Visual attention reading - MIT (web.mit.edu/zoya/www/visual_attention_reading.pdf)
Visual attention [with and for] deep neural nets
June 16, 2016
Adobe CTL Vision and Learning Reading Group
Zoya Bylinskii
Visual attention for deep neural nets [captioning]
Paper discussion: “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio [ICML 2015]
Visual Attention for Image Captioning
[Figure: input image -> caption “A bird flying over a body of water”]
1. CNN for image feature extraction
2. RNN with attention over image features for caption generation
Step 1: CNN for image feature extraction
Last convolutional layer before max pooling, e.g. VGGnet: 14x14x512 feature map -> 196 image feature vectors of dimension 512
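As a minimal sketch of this reshaping step, the snippet below flattens a 14x14x512 map into 196 vectors of dimension 512. It uses plain Python lists and a zero-filled placeholder instead of real VGG activations, and the function name `flatten_feature_map` is made up for illustration.

```python
def flatten_feature_map(feature_map):
    """Turn an H x W x D nested list into a list of H*W vectors of length D."""
    return [vector for row in feature_map for vector in row]


# Placeholder activations (assumption): a real map would come from the
# last convolutional layer of a pretrained CNN.
H, W, D = 14, 14, 512
feature_map = [[[0.0] * D for _ in range(W)] for _ in range(H)]

annotations = flatten_feature_map(feature_map)
print(len(annotations), len(annotations[0]))  # 196 512
```

Each of the 196 vectors corresponds to one spatial location of the image, which is what lets the attention weights later be read as a map over the image.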
Image feature vectors: a1, a2, …, a196
Want to compute “weights” α1, α2, …, α196
Equations on this slide courtesy of: http://people.ee.duke.edu/~lcarin/Yunchen9.25.2015.pdf
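The weight αi can be read in two ways:
• Probability that location i is the right place to focus for producing the next word (hard attention)
• Relative importance of location i for producing the next word (soft attention)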
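The slide equations did not survive extraction. As a sketch of the formulation in the Show, Attend and Tell paper, the weights at time step t are a softmax over per-location scores. The symbols $e_{ti}$, $f_{\mathrm{att}}$ and $L = 196$ follow the paper's notation as best recalled, not the slide itself.

```latex
e_{ti} = f_{\mathrm{att}}(a_i, h_{t-1}), \qquad
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}, \qquad
z_t = \sum_{i=1}^{L} \alpha_{ti}\, a_i \quad \text{(soft attention)}
```

Here $h_{t-1}$ is the previous hidden state of the LSTM and $f_{\mathrm{att}}$ is a small learned network. In the hard variant, $z_t$ is instead a single $a_i$ sampled with probability $\alpha_{ti}$.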
• LSTM network generates one word y_t at every time step, conditioned on a context vector, the previous hidden state and the previously generated words
• The context vector z_t is a dynamic representation of the relevant part of the image input at time t
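The context vector can be sketched as a softmax-weighted average of the image feature vectors. This is a minimal soft-attention illustration in plain Python, not the paper's implementation: the scores are hypothetical constants, whereas in the paper they come from a learned function of each feature vector and the previous hidden state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(annotations, scores):
    """Return (weights, context) with context = sum_i weights[i] * annotations[i]."""
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [
        sum(w * a[d] for w, a in zip(weights, annotations)) for d in range(dim)
    ]
    return weights, context


# Toy example: 3 locations with 2-D features instead of 196 x 512.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # hypothetical scores; equal scores give uniform weights
weights, z_t = soft_attention(annotations, scores)
print(weights)
print(z_t)
```

Because the scores are recomputed at every time step, z_t changes from word to word, which is the "dynamic" part of the representation.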
Visual attention for deep neural nets [question answering]
Paper discussion: “Stacked Attention Networks for Image Question Answering”, Z. Yang, X. He, J. Gao, L. Deng, A. Smola [arXiv, Jan 2016]
Visual Attention for Question Answering
Question: What are sitting in the basket on a bicycle?
Answer: Dogs
1. CNN for image feature extraction
2. CNN or LSTM for question representation
3. Stacked attention model
Step 1: CNN for image feature extraction
Image feature vectors: f1, f2, …, f196
Step 2: question representation
2a. LSTM for question representation
2b. CNN for question representation
Questions for discussion
• When might we prefer one of these language representations over another?
• (2a) an LSTM that builds up a representation over multiple time steps
    • Do we keep enough information from early on in the sentence? Should we favor latter parts of a sentence?
• (2b) a CNN that statically combines word/sentence features at a few scales
    • To what extent does the N used for N-grams affect the resulting representation?
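To make the role of N in option (2b) concrete, here is a toy sketch of N-gram features with max pooling over positions. It assumes an untrained filter that simply sums each window of N word vectors (all weights 1, no nonlinearity). The function name `ngram_features` and the example vectors are made up for illustration, and the input must contain at least N words.

```python
def ngram_features(word_vectors, n):
    """Sum each window of n consecutive word vectors, then max-pool over positions."""
    dim = len(word_vectors[0])
    windows = [
        [sum(vec[d] for vec in word_vectors[i:i + n]) for d in range(dim)]
        for i in range(len(word_vectors) - n + 1)
    ]
    return [max(w[d] for w in windows) for d in range(dim)]


# Hypothetical 2-D embeddings for a three-word question.
words = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

# Concatenate unigram, bigram and trigram features into one question vector.
question_vector = []
for n in (1, 2, 3):
    question_vector += ngram_features(words, n)
print(question_vector)
```

Larger N lets one feature depend on longer phrases, but each extra scale adds parameters in a real model, and the max pooling discards where in the sentence the phrase occurred.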
1. Image -> CNN -> f1, f2, …, f196 -> v1, v2, …, v196
2. Question: What are sitting in the basket on a bicycle? -> CNN or LSTM -> vQ
3. Stacked attention model: inputs vQ and vI
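A stacked attention step over vQ and vI can be sketched as repeated attention hops, where each hop attends over the image regions and then adds the attended image vector to the query. This is a simplified illustration: it assumes a plain dot product for the region scores where the paper uses a learned scoring network, and all function names and values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(image_vectors, query):
    """One hop: weight regions by similarity to the query, then refine the query."""
    # Assumption: a dot product stands in for the learned scoring network.
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in image_vectors]
    weights = softmax(scores)
    attended = [
        sum(w * vec[d] for w, vec in zip(weights, image_vectors))
        for d in range(len(query))
    ]
    # Refined query = attended image vector + previous query.
    refined = [a + q for a, q in zip(attended, query)]
    return weights, refined


def stacked_attention(image_vectors, question_vector, num_layers=2):
    """Apply several attention hops, starting from the question vector."""
    query = question_vector
    all_weights = []
    for _ in range(num_layers):
        weights, query = attention_layer(image_vectors, query)
        all_weights.append(weights)
    return all_weights, query


# Toy example: 3 image regions with 2-D features.
v_I = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v_Q = [2.0, 0.0]
weights_per_layer, u = stacked_attention(v_I, v_Q)
print(weights_per_layer)
```

In this toy case the second hop puts more weight on the first region than the first hop does, illustrating how stacking can sharpen attention. The final query u would feed an answer classifier.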
Questions for discussion
• When might we want to modulate the question/language representation over time, and when might we prefer to modulate the visual feature representation?
    • Corresponds to recursing on vQ or vI
Visual attention with deep neural nets
Paper discussions:
“SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks”, X. Huang, C. Shen, X. Boix, Q. Zhao [ICCV 2015]
“DeepFix: A Fully Convolutional Neural Network for predicting Human Eye Fixations”, S. Kruthiventi, K. Ayush, R. Babu [arXiv Oct 2015]
“DeepGaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet”, M. Kümmerer, L. Theis, M. Bethge [ICLR 2015 workshop]
“Predicting Eye Fixations using Convolutional Neural Networks”, N. Liu, J. Han, D. Zhang, S. Wen, T. Liu [CVPR 2015]
Saliency Prediction with Neural Nets
• Bottom-up pop-out
• Semantic objects of interest
• Salient non-object regions (“abstract concepts”)
• Multi-scale, context-sensitive
• Challenge: very small datasets
“SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks”, X. Huang, C. Shen, X. Boix, Q. Zhao [ICCV 2015]
Capture multi-scale features
- Image down-sampled
- Same DNN applied
AlexNet, VGG, or GoogLeNet
- Remove all FC layers
- Add depth-1 convolutional layer for saliency prediction (after combining responses from both scales)
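The two-scale combination followed by a depth-1 layer can be sketched as below. Two details are assumptions for illustration rather than taken from the paper: the coarse responses are upsampled with nearest-neighbour interpolation, and the depth-1 layer is treated as a 1x1 convolution, i.e. one weighted sum over channels per location. All function names and values are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an H x W x C nested list."""
    return [
        [fmap[y // factor][x // factor] for x in range(len(fmap[0]) * factor)]
        for y in range(len(fmap) * factor)
    ]


def depth1_conv(fmap, weights, bias=0.0):
    """1x1 convolution with a single output channel."""
    return [
        [sum(w * c for w, c in zip(weights, pixel)) + bias for pixel in row]
        for row in fmap
    ]


def two_scale_saliency(fine, coarse, weights):
    """Concatenate fine and upsampled coarse channels, then apply the depth-1 layer."""
    factor = len(fine) // len(coarse)
    coarse_up = upsample_nearest(coarse, factor)
    combined = [
        [f + c for f, c in zip(fine_row, coarse_row)]  # list concatenation = channel concat
        for fine_row, coarse_row in zip(fine, coarse_up)
    ]
    return depth1_conv(combined, weights)


# Toy responses: a 2x2 fine map and a 1x1 coarse map, one channel each.
fine = [[[1.0], [2.0]], [[3.0], [4.0]]]
coarse = [[[10.0]]]
saliency = two_scale_saliency(fine, coarse, weights=[1.0, 0.5])
print(saliency)
```

The coarse scale contributes the same value to every fine location it covers, so it acts as context for the finer responses.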
- Fine-tuning pretrained networks
- Optimize saliency evaluation metrics directly
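As an example of a saliency evaluation metric that can serve directly as a training objective, the sketch below computes the KL divergence between a ground-truth fixation map and a predicted map. KL divergence is one commonly used saliency metric; choosing it here is an assumption for illustration, and the helper names and `eps` smoothing are made up.

```python
import math


def normalize(sal_map, eps=1e-12):
    """Flatten a 2-D map and scale it to sum to (approximately) 1."""
    flat = [v for row in sal_map for v in row]
    total = sum(flat) + eps
    return [v / total for v in flat]


def kl_divergence(ground_truth, predicted, eps=1e-12):
    """KL(ground_truth || predicted) after normalizing both maps."""
    p = normalize(ground_truth)
    q = normalize(predicted)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy 2x2 maps: a peaked ground truth versus a uniform prediction.
ground_truth = [[0.0, 1.0], [1.0, 2.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
print(kl_divergence(ground_truth, ground_truth))
print(kl_divergence(ground_truth, uniform))
```

A perfect prediction gives a divergence of about zero, and the value grows as the predicted distribution moves away from the fixation distribution, so it can be minimized by gradient descent during fine-tuning.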
“DeepFix: A Fully Convolutional Neural Network for predicting Human Eye Fixations”, S. Kruthiventi, K. Ayush, R. Babu [arXiv Oct 2015]
• First 5 layers initialized with VGG-16 weights
• Note: as channel depth doubles, spatial dimensions are halved with stride-2 pooling layers
• Holes of size 2 introduced in kernels of 5th layer to increase receptive field without increasing memory footprint
• Two inception-style convolutional modules to capture multi-scale semantic structure
• Convolutional layers in 7th layer with holes of size 6 operate on large receptive fields for more global context
• Final layer up-sampled to depth-1 saliency map
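The effect of holes in a kernel can be sketched in one dimension. The sketch assumes that "holes of size d" means a dilation factor of d, so the kernel taps sit d samples apart, and it assumes 3-tap kernels purely for illustration. The function names are made up.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a kernel whose taps are `dilation` samples apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Valid 1-D correlation with dilated taps (no padding)."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


# A 3-tap kernel covers 5 samples with dilation 2 and 13 samples with dilation 6,
# while still storing only 3 weights.
print(effective_kernel_size(3, 2), effective_kernel_size(3, 6))  # 5 13

signal = [1, 2, 3, 4, 5, 6, 7]
print(dilated_conv1d(signal, [1, 0, -1], dilation=2))  # [-4, -4, -4]
```

The number of weights stays fixed while the covered span grows with the dilation, which is how the receptive field increases without a larger memory footprint.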
Introducing location-biased behavior without drastically increasing number of network parameters
“DeepGaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet”, M. Kümmerer, L. Theis, M. Bethge [ICLR 2015 workshop]
• Remove final FC layers
• Rescale responses of all other layers to largest size (-> 3712 filter responses per image location)
• Each filter individually normalized across dataset, then Gaussian blurred with some sigma
• Saliency map is weighted combination of these post-processed filters and a center bias
• L1 regularization on weights to encourage sparsity
• Softmax produces final output map
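The last steps of this pipeline can be sketched as a linear readout followed by a softmax over all pixels. The sketch assumes the feature maps have already been rescaled, normalized and blurred. The function name, the toy maps and the weights are hypothetical, with one weight set to zero to mimic what the L1 penalty encourages.

```python
import math


def saliency_from_features(feature_maps, weights, center_bias, bias_weight):
    """Weighted sum of feature maps plus a center-bias term, then softmax over pixels."""
    height, width = len(center_bias), len(center_bias[0])
    logits = [
        [
            sum(w * fmap[y][x] for w, fmap in zip(weights, feature_maps))
            + bias_weight * center_bias[y][x]
            for x in range(width)
        ]
        for y in range(height)
    ]
    peak = max(v for row in logits for v in row)  # subtract max for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(v for row in exps for v in row)
    return [[v / total for v in row] for row in exps]


# Two toy 2x2 feature maps; the second gets weight 0 (a sparse readout).
feature_maps = [
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
center_bias = [[0.0, 0.0], [0.0, 0.0]]
saliency = saliency_from_features(feature_maps, [2.0, 0.0], center_bias, 1.0)
print(saliency)
```

The softmax turns the map into a probability distribution over image locations, so the output can be compared directly against fixation densities.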
Fixation Prediction with Neural Nets
“Predicting Eye Fixations using Convolutional Neural Networks”, N. Liu, J. Han, D. Zhang, S. Wen, T. Liu [CVPR 2015]
Questions for discussion
• Can directly optimizing for visual attention/saliency lead to benefits for other computer vision applications, or should visual attention naturally come out of the specific application?
• Can models of visual attention/saliency help bootstrap individual tasks or lead to generalization across tasks?