
DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Figure (task comparison): Classification ("Cat"), Captioning ("A cat riding a skateboard"), Detection ("Cat", "Skateboard"), and Dense Captioning ("Orange spotted cat", "Skateboard with red wheels", "Cat riding a skateboard", "Brown hardwood flooring"), arranged along two axes: label density (Whole Image vs. Image Regions) and label complexity (Single Label vs. Sequence).

Figure (Our Model): Image (3 x W x H) -> CNN -> Conv features (C x W' x H') -> Localization Layer -> Region features (B x C x X x Y) -> Recognition Network -> Region Codes (B x D) -> LSTM -> region captions ("Striped gray cat", "Cats watching TV"). Inside the Localization Layer: Conv -> Region Proposals and Region scores (k x W' x H') -> Best Proposals (B x 4) -> Sampling Grid Generator -> Sampling Grid (B x X x Y x 2) -> Bilinear Sampler over the Conv features (C x W' x H') -> Region features (B x C x X x Y).

Figure (results): Full Image RNN baseline captions shown for comparison with Our Model's dense captions: "A man and a woman sitting at a table with a cake." "A train is traveling down the tracks near a forest." "A large jetliner flying through a blue sky." "A teddy bear with a red bow on it."
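The Bilinear Sampler in the figure can be made concrete with a short sketch. The snippet below is a minimal NumPy illustration, not the released implementation: it assumes a proposal box given as (x0, y0, x1, y1) in feature-map coordinates, builds an out_h x out_w sampling grid inside the box, and bilinearly interpolates the C x W' x H' feature map at those points to produce fixed-size region features.

import numpy as np

def bilinear_sample(features, box, out_h=7, out_w=7):
    """features: (C, H', W') array; box: (x0, y0, x1, y1) in feature-map coords.
    Returns region features of shape (C, out_h, out_w)."""
    C, H, W = features.shape
    x0, y0, x1, y1 = box
    # Sampling grid: out_h x out_w points evenly spaced inside the box.
    xs = np.linspace(x0, x1, out_w)
    ys = np.linspace(y0, y1, out_h)
    gx, gy = np.meshgrid(xs, ys)                      # each (out_h, out_w)
    # Integer corner indices and fractional offsets for interpolation.
    x_lo = np.clip(np.floor(gx).astype(int), 0, W - 2)
    y_lo = np.clip(np.floor(gy).astype(int), 0, H - 2)
    wx = gx - x_lo
    wy = gy - y_lo
    # Gather the four neighbours and blend them (broadcasts over channels).
    f00 = features[:, y_lo,     x_lo]
    f01 = features[:, y_lo,     x_lo + 1]
    f10 = features[:, y_lo + 1, x_lo]
    f11 = features[:, y_lo + 1, x_lo + 1]
    top = f00 * (1 - wx) + f01 * wx
    bot = f10 * (1 - wx) + f11 * wx
    return top * (1 - wy) + bot * wy

# Example: one proposal box over a 512 x 32 x 32 feature map.
feats = np.random.randn(512, 32, 32).astype(np.float32)
region = bilinear_sample(feats, box=(4.0, 6.0, 20.0, 18.0))
print(region.shape)  # (512, 7, 7)

The interpolation weights depend smoothly on the box coordinates, which is what makes this attention step differentiable and allows gradients to reach the proposal boxes.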

Justin Johnson*, Andrej Karpathy*, Li Fei-Fei

Overview

Classic computer vision tasks include classification, where a system assigns a single label to a whole image, and detection, where a system labels all objects in an image. Recent captioning systems describe entire images in natural language. We propose the dense captioning task, in which a system jointly detects regions of interest and describes each of them in natural language, combining the label density of detection with the label complexity of captioning.

 

Model

Our model is a single differentiable function that takes an image as input and outputs a set of regions and their captions; it is learned end-to-end and fully supervised, using backpropagation and gradient descent. The image is first processed by a deep convolutional neural network [2]. These features feed a Localization Layer, which proposes regions of interest [3] using a region proposal network [4] and differentiably attends to each region with bilinear sampling [5]. The resulting region features are processed by a recurrent neural network with long short-term memory [6], which generates a caption for each region [7, 8, 9].
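As a reading aid, here is a shape-level sketch of that forward pass in plain Python/NumPy. The function bodies are placeholders and the constants (C = 512, D = 4096, B = 128) are illustrative assumptions; only the tensor shapes follow the figure above.

import numpy as np

C, D, B = 512, 4096, 128       # conv channels, region-code size, number of regions

def cnn(image):
    # (3, H, W) image -> (C, H', W') conv features; a real model would use VGG-16 [2].
    _, H, W = image.shape
    return np.zeros((C, H // 16, W // 16), dtype=np.float32)

def localization_layer(conv_feats):
    # Propose B boxes with a region proposal network and pool each one to a
    # fixed (C, 7, 7) grid with bilinear sampling (see the sketch above).
    boxes = np.zeros((B, 4), dtype=np.float32)               # best proposals, B x 4
    region_feats = np.zeros((B, C, 7, 7), dtype=np.float32)  # region features
    return boxes, region_feats

def recognition_network(region_feats):
    # Flatten each region and project it to a D-dimensional region code.
    return region_feats.reshape(B, -1) @ np.zeros((C * 7 * 7, D), dtype=np.float32)

def lstm_captioner(region_code):
    # Placeholder for the LSTM language model that emits one caption per region.
    return "a caption for this region"

image = np.zeros((3, 720, 600), dtype=np.float32)
boxes, region_feats = localization_layer(cnn(image))
codes = recognition_network(region_feats)
captions = [lstm_captioner(code) for code in codes]
print(boxes.shape, codes.shape, len(captions))               # (128, 4) (128, 4096) 128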

References

[1] Krishna et al., "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations", arXiv 2016.
[2] Simonyan and Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", ICLR 2015.
[3] Girshick, "Fast R-CNN", ICCV 2015.
[4] Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015.
[5] Jaderberg et al., "Spatial Transformer Networks", NIPS 2015.
[6] Hochreiter and Schmidhuber, "Long Short-Term Memory", Neural Computation 1997.
[7] Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015.
[8] Vinyals et al., "Show and Tell: A Neural Image Caption Generator", CVPR 2015.
[9] Donahue et al., "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", CVPR 2015.

Data

We train and test our model on the Visual Genome region captions dataset [1], consisting of around 100k images. Each image is annotated with around 50 regions, and each region has a natural-language caption. All regions and captions were drawn and written by human annotators on Amazon's Mechanical Turk. We use 77k images for training, and 5k each for validation and testing.
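For concreteness, one plausible in-memory representation of this annotation format is sketched below; the field names are hypothetical and do not reflect the released Visual Genome files.

from dataclasses import dataclass

@dataclass
class RegionAnnotation:
    box: tuple          # (x, y, w, h) in image pixels
    caption: str        # free-form phrase written by a Mechanical Turk worker

@dataclass
class ImageAnnotation:
    image_id: int
    regions: list       # roughly 50 RegionAnnotation entries per image

# 77k train / 5k val / 5k test images, as in the Data section.
splits = {"train": 77_000, "val": 5_000, "test": 5_000}
example = ImageAnnotation(
    image_id=1,
    regions=[RegionAnnotation(box=(32, 40, 180, 95), caption="striped gray cat")],
)
print(splits, example.regions[0].caption)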

Dense Captioning Results

Example results on test-set images. We also run a traditional whole-image captioning model on each image; our dense captions capture much more detailed information.

Phrase Retrieval

In addition to generating novel captions from images, our model can also be used to search for natural-language phrases in images. Given an input phrase, we compute region proposals for each image, use the RNN language model to compute the likelihood of the phrase conditioned on each region, and sort regions by this likelihood. We show results of retrieving several phrases from test-set images.
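A minimal sketch of that retrieval loop, assuming a language model exposed as a per-token log-probability function conditioned on a region code (the log_prob interface and the toy model below are illustrative assumptions, not the released API):

import numpy as np

def phrase_log_likelihood(log_prob, region_code, phrase_tokens):
    """Sum of log P(token_t | tokens_<t, region_code) over the phrase."""
    total = 0.0
    for t in range(len(phrase_tokens)):
        total += log_prob(region_code, phrase_tokens[:t], phrase_tokens[t])
    return total

def retrieve(log_prob, region_codes, boxes, phrase_tokens, top_k=5):
    """Score every proposed region for one image and return the best boxes."""
    scores = np.array([phrase_log_likelihood(log_prob, code, phrase_tokens)
                       for code in region_codes])
    order = np.argsort(-scores)                 # highest likelihood first
    return [(boxes[i], scores[i]) for i in order[:top_k]]

# Toy stand-in for the RNN language model: random scores for illustration only.
rng = np.random.default_rng(0)
def toy_log_prob(region_code, prefix, token):
    return float(-rng.random())                 # pretend log-probability

codes = rng.standard_normal((100, 4096))        # 100 proposals for one image
boxes = rng.integers(0, 500, size=(100, 4))
hits = retrieve(toy_log_prob, codes, boxes, "head of a giraffe".split())
print(hits[0])

Ranking by the conditional likelihood of the phrase means no separate retrieval model is needed: the same captioning LSTM that generates descriptions also scores how well a query phrase matches each region.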