Learning Spatiotemporal Features with 3D Convolutional...

Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri

Effective Video Descriptor

• Generic – can represent different types of video

• Compact – processing, storage

• Efficient – computation

• Simple – implementation

3D Convolution and Pooling

• 3D convolution models temporal information better than 2D convolution.
  – 2D CONV: performed only spatially, so temporal information is lost.
  – 3D CONV: performed spatiotemporally, so temporal information is preserved.

• The same holds for pooling.

2D Convolution on 1-ch Input

• Result: 2D image.

2D Convolution on n-ch Input

• Result: 2D image.

3D Convolution on n-ch Input

• Result: volume.
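
The three cases above differ only in whether the kernel also slides along the time axis. The following is a minimal pure-Python sketch of that difference, not the C3D implementation: the function names are illustrative, frames are treated as the channels of the n-ch input, and "valid" cross-correlation (no padding, stride 1) is an assumption made to keep the shapes easy to follow.

```python
def conv2d_multichannel(x, k):
    """Valid 2D cross-correlation of x[C][H][W] with k[C][kh][kw].

    The kernel spans every channel (here: every frame), so the channel
    axis is summed away and the result is a single 2D map.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(k[0]), len(k[0][0])
    return [
        [
            sum(
                x[c][i + di][j + dj] * k[c][di][dj]
                for c in range(C)
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(W - kw + 1)
        ]
        for i in range(H - kh + 1)
    ]


def conv3d(x, k):
    """Valid 3D cross-correlation of x[L][H][W] with k[d][kh][kw].

    The kernel is shorter than the clip in time (d < L) and slides along
    the time axis as well, so the result is a volume.
    """
    L, H, W = len(x), len(x[0]), len(x[0][0])
    d, kh, kw = len(k), len(k[0]), len(k[0][0])
    return [
        [
            [
                sum(
                    x[t + dt][i + di][j + dj] * k[dt][di][dj]
                    for dt in range(d)
                    for di in range(kh)
                    for dj in range(kw)
                )
                for j in range(W - kw + 1)
            ]
            for i in range(H - kh + 1)
        ]
        for t in range(L - d + 1)
    ]


def ones(*shape):
    """Nested list of 1.0 with the given shape (helper for the demo)."""
    if len(shape) == 1:
        return [1.0] * shape[0]
    return [ones(*shape[1:]) for _ in range(shape[0])]


clip = ones(4, 5, 5)                           # 4 frames of 5x5 pixels
img = conv2d_multichannel(clip, ones(4, 3, 3))
vol = conv3d(clip, ones(3, 3, 3))
print(len(img), len(img[0]))                   # 2D map: 3 x 3
print(len(vol), len(vol[0]), len(vol[0][0]))   # volume: 2 x 3 x 3
```

With a 3-frame kernel on a 4-frame clip, the 3D output keeps a time axis of length 2, whereas the 2D result has no time axis at all.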

Identify Best Architecture for 3D ConvNets (on UCF101)

• Common network settings
  – All video frames are resized to 128x171.
  – Videos are split into non-overlapping 16-frame clips.
  – Input: 3x16x128x171.
  – 5 convolution and pooling layers
  – 2 fully connected layers
  – Softmax loss layer to predict action labels
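
The clip-splitting step and the resulting input shape can be sketched as below. This is a hedged illustration only: the function names are hypothetical, and dropping trailing frames that do not fill a whole 16-frame clip is an assumption, since the slides do not say how leftovers are handled.

```python
def split_into_clips(num_frames, clip_len=16):
    """Return (start, end) frame index pairs for non-overlapping clips.

    `end` is exclusive. Trailing frames that do not fill a whole clip are
    dropped here; that is an assumption of this sketch.
    """
    return [
        (start, start + clip_len)
        for start in range(0, num_frames - clip_len + 1, clip_len)
    ]


def clip_shape(channels=3, clip_len=16, height=128, width=171):
    """Shape of one network input: channels x frames x height x width."""
    return (channels, clip_len, height, width)


clips = split_into_clips(num_frames=50)
print(clips)         # [(0, 16), (16, 32), (32, 48)]
print(clip_shape())  # (3, 16, 128, 171)
```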

Identify Best Architecture for 3D ConvNets (on UCF101)

• Varying network architecture
  – Homogeneous temporal depth.
    • Depth d, for d = 1, 3, 5, 7
  – Varying temporal depth.
    • Increasing: 3-3-5-5-7
    • Decreasing: 7-5-5-3-3
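
To get a feel for what changing the temporal depth d costs, the sketch below counts filter weights for the homogeneous settings. It is arithmetic under stated assumptions rather than a description of the authors' networks: the 3x3 spatial extent and the channel widths are hypothetical values chosen for illustration, and biases are ignored.

```python
def kernel_weights(in_channels, temporal_depth, spatial=3):
    """Weights in one 3D filter (bias excluded): C_in * d * k * k."""
    return in_channels * temporal_depth * spatial * spatial


def layer_weights(depths, channels):
    """Per-layer filter-weight counts for a stack of 3D conv layers.

    `depths[i]` is the temporal depth of layer i; `channels` lists the
    input channels followed by each layer's output channels.
    """
    return [
        kernel_weights(channels[i], d) * channels[i + 1]
        for i, d in enumerate(depths)
    ]


# Hypothetical channel widths, chosen only for illustration.
channels = [3, 64, 128, 256, 256, 256]

totals = {d: sum(layer_weights([d] * 5, channels)) for d in (1, 3, 5, 7)}
for d, total in totals.items():
    print(f"depth-{d}: {total} conv weights")
```

The count grows linearly with d. A depth-1 kernel covers a single frame at a time, so it behaves like a 2D convolution applied to each frame separately.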

3D Convolution Kernel Temporal Depth Search

Spatiotemporal Feature Learning

• Best network architecture
  – With 3x3x3 kernels

Spatiotemporal Feature Learning

• Dataset for training
  – Sports-1M dataset
    • Largest video classification benchmark
    • 1.1 million sports videos
    • 487 categories

Sports-1M Classification Results

C3D Video Descriptor

• The C3D model can be used as a feature extractor for various video analysis tasks.
  – Action recognition
  – Action similarity
  – Scene and object recognition

• Uses fc6 activations
  – 4096 dimensions
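
The slides do not say how per-clip fc6 activations are combined into a single descriptor for a whole video. One simple option, assumed here purely for illustration, is to average the clip vectors and L2-normalize the mean. The function name and the toy 4-dimensional vectors are hypothetical stand-ins for real 4096-dimensional activations.

```python
import math


def video_descriptor(clip_features):
    """Average per-clip feature vectors, then L2-normalize the mean."""
    if not clip_features:
        raise ValueError("need at least one clip feature vector")
    dim = len(clip_features[0])
    mean = [
        sum(f[i] for f in clip_features) / len(clip_features)
        for i in range(dim)
    ]
    norm = math.sqrt(sum(v * v for v in mean))
    # Leave an all-zero mean untouched to avoid dividing by zero.
    return [v / norm for v in mean] if norm > 0 else mean


# Toy 4-dimensional stand-ins for 4096-dimensional fc6 activations.
toy_clips = [[3.0, 0.0, 0.0, 4.0], [3.0, 0.0, 0.0, 4.0]]
descriptor = video_descriptor(toy_clips)
print(descriptor)  # [0.6, 0.0, 0.0, 0.8]
```

A fixed-length, unit-norm vector like this can then be fed to a simple classifier, which fits the "compact" and "simple" goals stated for the descriptor.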

Action Recognition

• Dataset: UCF101
  – 13,320 videos
  – 101 human actions

Action Similarity Labeling

• Dataset: ASLAN
  – 3,631 videos
  – 432 action classes

Scene and Object Recognition

• Dataset: YUPENN
  – 420 videos
  – 14 scenes

• Dataset: Maryland
  – 130 videos
  – 13 scenes

Why C3D Features?

• Generic
• Compact
• Efficient
• Simple

Visualisation using t-SNE method:

L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR

What Does C3D Learn?

Using deconvolution method in M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014

Useful Links

• http://vlg.cs.dartmouth.edu/c3d/
• https://github.com/facebook/C3D

Tools and software required:

- keras
- tensorflow
- ffmpeg (compiled from source)
- opencv (compiled from source)
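
As a rough illustration of where ffmpeg fits in, the sketch below builds a command line that decodes a video into frames resized to 128x171. The helper name and file paths are hypothetical, and reading 128x171 as height x width is an assumption. The sketch only constructs the argument list; it does not run ffmpeg.

```python
def ffmpeg_extract_frames_cmd(video_path, out_pattern, width=171, height=128):
    """Build an ffmpeg argv that decodes a video into resized frames.

    ffmpeg's scale filter takes width:height, so 128x171 (height x width,
    as assumed here) becomes scale=171:128.
    """
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", f"scale={width}:{height}",
        out_pattern,
    ]


# Hypothetical paths, for illustration only.
cmd = ffmpeg_extract_frames_cmd("video.avi", "frames/%06d.jpg")
print(" ".join(cmd))
```

If ffmpeg is installed, the list can be passed to `subprocess.run(cmd, check=True)`; the output directory must already exist.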

Thank you
