
Asynchronous Interaction Aggregation for Action Detection

Jiajun Tang1∗, Jin Xia1∗, Xinzhi Mu1, Bo Pang1, and Cewu Lu1†

Shanghai Jiao Tong University, China

Abstract. Understanding interaction is an essential part of video action detection. We propose the Asynchronous Interaction Aggregation network (AIA) that leverages different interactions to boost action detection. There are two key designs in it: one is the Interaction Aggregation structure (IA), which adopts a uniform paradigm to model and integrate multiple types of interaction; the other is the Asynchronous Memory Update algorithm (AMU), which enables us to achieve better performance by modeling very long-term interaction dynamically without huge computation cost. We provide empirical evidence to show that our network gains notable accuracy from the integrative interactions and is easy to train end-to-end. Our method reports the new state-of-the-art performance on the AVA dataset, with a 3.7 mAP gain (12.6% relative improvement) on the validation split compared to our strong baseline. Results on the UCF101-24 and EPIC-Kitchens datasets further illustrate the effectiveness of our approach. Source code will be made public at: https://github.com/MVIG-SJTU/AlphAction.

Keywords: Action Detection · Video Understanding · Interaction · Memory

1 Introduction

The task of action detection (spatio-temporal action localization) aims at detecting and recognizing actions in space and time. As an essential task of video understanding, it has a variety of applications such as abnormal behavior detection and autonomous driving. On top of spatial representations and temporal features [19,25,3,9], interaction relationships [12,37,45,27] are crucial for understanding actions. Take Figure 1 for example: the appearance of the man, the tea cup, as well as the previous movement of the woman help to predict the action of the woman. In this paper, we propose a new framework that emphasizes interactions for action detection.

Interactions can be briefly considered as the relationships between the target person and the context. Many existing works try to explore interactions in videos, but there are two problems with current methods:

∗ Both authors contributed equally to this work. † The corresponding author is Cewu Lu. Email: [email protected]



Fig. 1: Interaction Aggregation. In this target frame, we can tell that the woman is serving tea to the man from the following clues: (1) she is close to the man; (2) she puts down the tea cup before the man; (3) she prepared the tea a few seconds ago. These three clues correspond respectively to the person-person, person-object and temporal interactions.

(1) Previous methods such as [12,14] focus on a single type of interaction (e.g. person-object), so they can only boost one specific kind of action. Methods such as [44] intend to merge different interactions, but they model them separately, so information from one interaction cannot contribute to the modeling of another. How to find interactions correctly in videos and use them for action detection remains challenging. (2) Long-term temporal interaction is important but hard to track. Methods that use temporal convolution [19,25,9] have a very limited temporal receptive field due to resource constraints. Methods such as [39] require a duplicated feature-extraction pre-process, which is not practical in reality.

In this work, we propose a new framework, the Asynchronous Interaction Aggregation network (AIA), which explores three kinds of interactions (person-person, person-object, and temporal interaction) that cover nearly all kinds of person-context interactions in a video. As a first attempt, AIA makes them work cooperatively in a hierarchical structure to capture higher-level spatio-temporal features and more precise attention. There are two main designs in our network: the Interaction Aggregation (IA) structure and the Asynchronous Memory Update (AMU) algorithm.

The former design, the IA structure, explores and integrates all three types of interaction in a deep structure. More specifically, it consists of multiple elemental interaction blocks, each of which enhances the target features with one type of interaction. These three types of interaction blocks are nested along the depth of the IA structure, and one block may use the results of previous interaction blocks. Thus, the IA structure is able to model interactions precisely using information across different types.

Jointly training with long memory features is infeasible due to the large size of video data. The AMU algorithm is therefore proposed to estimate these intractable features during training. We adopt a memory-like structure to store the spatial features and propose a set of write-read operations to update the content in


memory: features extracted from target clips at each iteration are written to a memory pool and can be retrieved in subsequent iterations to model temporal interaction. This effective strategy enables us to train the whole network in an end-to-end manner, and the computational complexity does not increase linearly with the length of the temporal memory features. In comparison to the previous solution [39] that extracts features in advance, the AMU is much simpler and achieves better performance.

In summary, our key contributions are: (1) a deep IA structure that integrates a diversity of person-context interactions for robust action detection, and (2) an AMU algorithm to estimate the memory features dynamically. We perform an extensive ablation study on the AVA [16] dataset for the spatio-temporal action localization task. Our method shows a large performance boost, which yields the new state-of-the-art on both the validation and test sets. We also test our method on the UCF101-24 dataset and on the segment-level action recognition dataset EPIC-Kitchens. The results further validate its generality.

2 Related Works

Video Classification. Various 3D CNNs [19,32,31,34] have been developed to handle video input. To leverage huge image datasets, I3D [3] was proposed to benefit from ImageNet [6] pre-training. In [25,7,33,42,4], the 3D kernels in the above models are simulated by temporal filters and spatial filters, which can significantly decrease the model size. The SlowFast network [9] introduces a two-stream method [28,10].

Spatio-temporal Action Detection. Action detection is more difficult than action classification because the model needs to not only predict the action labels but also localize the action in time and space. Most recent approaches [16,11,9,17,40] follow object detection frameworks [13,26] by classifying the features generated by the detected bounding boxes. In contrast to our method, their results depend only on the cropped features, while all other information is discarded and contributes nothing to the final prediction.

Attention Mechanism for Videos. The transformer [35] consists of several stacked self-attention layers and fully connected layers. Non-Local [36] concludes that the previous self-attention model can be viewed as a form of the classical computer vision method of non-local means [2]; hence a generic non-local block [36] is introduced. This structure enables models to compute the response by relating features at different times or spatial locations, which makes the attention mechanism applicable to video-related tasks like action classification. The non-local block also plays an important role in [39], where the model references information from a long-term feature bank via a non-local feature bank operator.

3 Proposed Method

In this section, we describe our method that localizes actions in space and time. Our approach aims at modeling and aggregating various interactions to achieve better action detection performance.


Fig. 2: Pipeline of the proposed AIA. a. We crop features of persons and objects from the extracted video features. b. Person features, object features and memory features from the feature pool Ω in c are fed to IA in order to integrate multiple interactions. The output of IA is passed to the final classifier for predictions. c. Our AMU algorithm reads memory features from the feature pool and writes fresh person features to it.

In Section 3.1, we describe the two important types of instance-level features in short clips and the memory features in long videos. In Section 3.2, the Interaction Aggregation (IA) structure is explored to gather knowledge of interactions. In Section 3.3, we introduce the Asynchronous Memory Update (AMU) algorithm to alleviate the heavy computation and memory consumption of temporal interaction modeling. The overall pipeline of our method is shown in Figure 2.

3.1 Instance Level and Temporal Memory Features

To model interactions in video, we need to correctly find what the queried person is interacting with. Previous works such as [36] calculate interactions among all the pixels in the feature map. Being computationally expensive, these brute-force methods struggle to learn interactions among pixels due to the limited size of video datasets. We therefore consider how to obtain concentrated interaction features. We observe that persons always interact with concrete objects and other persons. Therefore, we extract object and person embeddings as the instance-level features. In addition, video frames are highly correlated, so we keep long-term person features as the memory features.

Instance-level features are cropped from the video features. Since processing the whole long video at once is impossible, we split it into consecutive short video clips [v1, v2, . . . , vT]. The d-dimensional features of the t-th clip vt are extracted using a video backbone model: ft = F(vt, φF), where φF denotes the parameters.
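To make the notation concrete, here is a minimal sketch of this step. The toy 3D convolution stands in for the actual video backbone (SlowFast in the paper), and the clip length, resolution and channel count are illustrative assumptions rather than the released configuration.

```python
# A minimal sketch of per-clip feature extraction (Section 3.1), assuming a
# toy 3D CNN in place of the actual SlowFast backbone; shapes, the clip
# length and the channel count are illustrative only.
import torch
import torch.nn as nn

class ToyVideoBackbone(nn.Module):
    """Stand-in for F(v_t, phi_F): maps a clip to a d-dimensional feature map."""
    def __init__(self, d=256):
        super().__init__()
        self.conv = nn.Conv3d(3, d, kernel_size=3, padding=1)

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        return self.conv(clip)               # f_t:  (B, d, T, H, W)

def split_into_clips(video, clip_len=32):
    """Split a long video tensor (3, T_total, H, W) into consecutive clips v_1..v_T."""
    t_total = video.shape[1]
    return [video[:, s:s + clip_len]
            for s in range(0, t_total - clip_len + 1, clip_len)]

backbone = ToyVideoBackbone()
video = torch.randn(3, 128, 64, 64)          # toy long video
clip_features = [backbone(v.unsqueeze(0)) for v in split_into_clips(video)]
```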


A detector is applied on the middle frame of vt to get person boxes and object boxes. Based on the detected bounding boxes, we apply RoIAlign to crop the person and object features out of the extracted features ft. The person and object features in vt are denoted respectively as Pt and Ot.

One clip is only a short session and misses the temporal global semantics. In order to model the temporal interaction, we keep track of memory features. The memory features consist of person features in consecutive clips: Mt = [Pt−L, . . . , Pt, . . . , Pt+L], where (2L + 1) is the size of the clip-wise receptive field. In practice, a fixed number of persons is sampled from each neighboring clip.
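A possible sketch of these two steps is given below, using torchvision's RoIAlign to crop instance features from a temporally pooled feature map and zero padding to keep a fixed number of persons per clip. The temporal pooling, the per-clip person cap and the padding policy are our own simplifying assumptions; the released code may handle them differently.

```python
# A sketch (not the authors' exact implementation) of cropping person/object
# features P_t, O_t with RoIAlign and assembling memory features M_t from
# neighbouring clips. Boxes are assumed to be in feature-map coordinates.
import torch
from torchvision.ops import roi_align

def crop_instance_features(f_t, boxes, out_size=7):
    """f_t: (1, d, T, H, W) clip features; boxes: (K, 4) boxes as (x1, y1, x2, y2)."""
    fmap = f_t.mean(dim=2)                                  # pool over time: (1, d, H, W)
    rois = roi_align(fmap, [boxes], output_size=out_size)   # (K, d, out_size, out_size)
    return rois.flatten(2).mean(-1)                         # (K, d) instance features

def build_memory(person_feats_per_clip, t, L=30, max_persons=5, d=256):
    """M_t = [P_{t-L}, ..., P_t, ..., P_{t+L}], zero-padding missing clips/persons."""
    memory = []
    for t2 in range(t - L, t + L + 1):
        if 0 <= t2 < len(person_feats_per_clip):
            p = person_feats_per_clip[t2][:max_persons]     # sample a fixed number
            pad = torch.zeros(max_persons - p.shape[0], d)
            memory.append(torch.cat([p, pad], dim=0))
        else:
            memory.append(torch.zeros(max_persons, d))
    return torch.stack(memory)                              # (2L + 1, max_persons, d)
```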

The three features above have semantic meaning and contain concentrated information for recognizing actions. With these three features, we are now able to model semantic interactions explicitly.

3.2 Interaction Modeling and Aggregation

How do we leverage these extracted features? For a target person, there are multiple detected objects and persons. The main challenge is how to correctly pay more attention to the objects or persons that the target person is interacting with. In this section, we first introduce our Interaction Block, which can adaptively model each type of interaction in a uniform structure. Then we describe our Interaction Aggregation (IA) structure that aggregates multiple interactions.

Overview. Given person features Pt, object features Ot and memory features Mt, the proposed IA structure outputs action features At = E(Pt, Ot, Mt, φE), where φE denotes the parameters in the IA structure. At is then passed to the final classifier for final predictions.

The hierarchical IA structure consists of multiple interaction blocks, each of which is tailored for a single type of interaction. The interaction blocks are deeply nested with other blocks to efficiently integrate different interactions for higher-level features and more precise attention.

Interaction Block. The structure of the interaction block is adapted from the Transformer block originally proposed in [35], and its specific design basically follows [36,39]. Briefly speaking, one of the two inputs is used as the query and the other is mapped to the key and value. Through the dot-product attention, which is the output of the softmax layer in Figure 3a, the block is able to select value features that are highly activated with respect to the query features and merge them to enhance the query features. There are three types of interaction blocks: the P-Block, the O-Block and the M-Block.

-P-Block: The P-Block models person-person interaction within the same clip. It is helpful for recognizing actions like listening and talking. Since the query input is already the person features or the enhanced person features, we take the key/value input to be the same as the query input.

-O-Block: In the O-Block, we aim to distill person-object interactions such as pushing and carrying an object. Our key/value input is the detected object features Ot. In the case where there are too many detected objects, we sample based on detection scores. Figure 3a is an illustration of the O-Block.


Fig. 3: Interaction Block and IA structure. a. The O-Block: the query input is the feature of the target person and the key/value input is the feature of objects. The P-Block and M-Block are similar. b. Serial IA. c. Dense Serial IA.

-M-Block: Some actions have strong logical connections along the temporal dimension, like opening and closing. We model this type of relationship as temporal interaction. To capture it, we take the memory features Mt as the key/value input of an M-Block.
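The three block types share the query/key/value structure described above and differ only in where the key/value input comes from (the target persons themselves, the objects Ot, or the memory Mt). Below is a minimal single-head sketch of such a block; the exact projection layout, normalization placement, head count and dropout in the released code may differ.

```python
# A minimal single-head sketch of an interaction block (cf. Figure 3a):
# linear projections for query/key/value, scaled dot-product attention,
# and a residual connection followed by LayerNorm.
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.out = nn.Linear(d, d)
        self.norm = nn.LayerNorm(d)
        self.scale = d ** -0.5

    def forward(self, query, key_value):
        # query: (N, d) target-person features
        # key_value: (N', d) context features (persons, objects or memory)
        q = self.q_proj(query)
        k = self.k_proj(key_value)
        v = self.v_proj(key_value)
        attn = torch.softmax(q @ k.t() * self.scale, dim=-1)   # (N, N') attention map
        enhanced = self.out(attn @ v)                          # merged value features
        return self.norm(query + enhanced)                     # residual + LayerNorm

# P-Block: block(P_t, P_t); O-Block: block(P_t, O_t); M-Block: block(P_t, M_t).
```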

Interaction Aggregation Structure. The interaction blocks extract three types of interaction. We now propose IA structures to integrate these different interactions: the naive parallel IA, the serial IA and the dense serial IA extension. For clarity, we use P, O and M to represent the P-Block, O-Block and M-Block respectively.

-Parallel IA: A naive approach is to model different interactions separately and merge them at the end. As displayed in Figure 4a, each branch follows a structure similar to [12] that treats one type of interaction without knowledge of the other interactions. We argue that the parallel structure struggles to find interactions precisely. We illustrate the attention of the last P-Block in Figure 4c by displaying the output of the softmax layer for different persons. As we can see, the target person is apparently watching and listening to the man in red; however, the P-Block pays similar attention to both men.

-Serial IA: Knowledge across different interactions is helpful for recognizing interactions. We propose the serial IA to aggregate different types of interactions. As shown in Figure 3b, different types of interaction blocks are stacked in sequence. The query features are enhanced in one interaction block and then passed to an interaction block of a different type. Figures 4f and 4g demonstrate the advantage of serial IA: the first P-Block cannot tell apart the importance of the man on the left and the man in the middle. After gaining knowledge from the O-Block and M-Block, the second P-Block is able to pay more attention to the man on the left, who is talking to the target person. Compared to the attention in parallel IA (Figure 4c), our serial IA is better at finding interactions.
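As a rough illustration of this serial arrangement, the sketch below chains blocks of the three types (reusing the InteractionBlock class sketched above) in the P → O → M order with two rounds, which matches the default configuration reported in the ablations; the block count, order and feature shapes are only indicative.

```python
# A sketch of serial IA: the person features are enhanced by P-, O- and
# M-blocks in turn; InteractionBlock refers to the class sketched above.
import torch.nn as nn

class SerialIA(nn.Module):
    def __init__(self, d=256, rounds=2):
        super().__init__()
        self.kinds = [kind for _ in range(rounds) for kind in ("P", "O", "M")]
        self.blocks = nn.ModuleList(InteractionBlock(d) for _ in self.kinds)

    def forward(self, P_t, O_t, M_t):
        # P_t: (N, d) persons, O_t: (N_o, d) objects, M_t: (N_m, d) memory persons
        query = P_t
        for block, kind in zip(self.blocks, self.kinds):
            key_value = {"P": query, "O": O_t, "M": M_t}[kind]
            query = block(query, key_value)
        return query                          # A_t, passed to the final classifier
```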


Fig. 4: We visualize attention by displaying the output of the softmax layer in the P-Block. The original output contains the attention paid to zero-padded persons; we remove that meaningless attention and normalize the remaining attention to sum to 1. a. Parallel IA. b. Target actions (watch a person, listen to a person). c. Attention in the 2nd P-Block in parallel IA. e. Serial IA. f. Attention in the 1st P-Block in serial IA. g. Attention in the 2nd P-Block in serial IA.

-Dense Serial IA: In the above structures, the connections between interaction blocks are entirely manually designed, and the input of an interaction block is simply the output of another one. We expect the model to further learn by itself which interaction features to take. With this in mind, we propose the Dense Serial IA extension. In Dense Serial IA, each interaction block takes all the outputs of previous blocks and aggregates them using learnable weights. Formally, the query of the i-th block can be represented as

Qt,i = Σj∈C Wj ⊙ Et,j ,  (1)

where ⊙ denotes element-wise multiplication, C is the set of indices of previous blocks, Wj is a learnable d-dimensional vector normalized with a softmax function over C, and Et,j is the enhanced output feature from the j-th block. Dense Serial IA is illustrated in Figure 3c.
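A minimal sketch of this aggregation is given below: a softmax-normalized, d-dimensional weight per previous block, applied element-wise and summed as in Eq. (1). How the weights are stored and initialized here is an assumption made for illustration.

```python
# A sketch of the dense aggregation in Eq. (1): the query of block i is an
# element-wise weighted sum of all previous block outputs E_{t,j}, with the
# d-dimensional weights normalised by a softmax over the previous blocks.
import torch
import torch.nn as nn

class DenseQueryAggregator(nn.Module):
    def __init__(self, num_prev_blocks, d=256):
        super().__init__()
        # one learnable d-dimensional weight vector W_j per previous block
        self.W = nn.Parameter(torch.zeros(num_prev_blocks, d))

    def forward(self, prev_outputs):
        # prev_outputs: list of (N, d) enhanced features E_{t,j}, j in C
        E = torch.stack(prev_outputs)               # (|C|, N, d)
        weights = torch.softmax(self.W, dim=0)      # normalise over previous blocks
        return (weights.unsqueeze(1) * E).sum(0)    # Q_{t,i}: (N, d)
```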

3.3 Asynchronous Memory Update Algorithm

Long-term memory features can provide useful temporal semantics that aid in recognizing actions. Imagine a scene where a person opens a bottle cap, drinks water, and finally closes the cap: it could be hard to detect the opening and closing from their subtle movements alone, but knowing the context of drinking water makes things much easier.

Resource Challenge. To capture more temporal information, we hope our Mt can gather features from enough clips; however, using more clips will increase the computation and memory consumption dramatically.


Fig. 5: Joint training with memory features is restricted by limited hardware resources. In this minor experiment, we take a 32-frame video clip with 256 × 340 resolution as input. The backbone is ResNet-50 and the GPU is a Tesla V100-PCIE-16GB. During joint training, rapidly growing GPU memory and computation time per iteration restrict the length of the memory features to a very small value (8 in this experiment). With larger inputs or a deeper backbone, this problem becomes more serious. Our method does not have this problem.

As depicted in Figure 5, when jointly training, the memory usage and computation cost increase rapidly as the temporal length of Mt grows. To train on one target person, we must forward and backward (2L + 1) video clips at once, which consumes much more time and, even worse, cannot make full use of sufficiently long-term information due to limited GPU memory.

Insight. In the previous work [39], another duplicated backbone is pre-trained to extract memory features, which avoids this problem. However, that method makes use of frozen memory features, whose representation power cannot be enhanced as model training goes on. We expect the memory features to be updated dynamically and to enjoy the improvement from parameter updates during training. Therefore, we propose the asynchronous memory update method, which can generate effective dynamic long-term memory features and makes the training process more lightweight. The details of training with this algorithm are presented in Algorithm 1.

Inspired by [38], our algorithm is composed of a memory component, the memory pool Ω, and two basic operations, READ and WRITE. The memory pool Ω records memory features. Each feature P̂t(i) in this pool is an estimated value and is tagged with a loss value δt(i), which logs the convergence state of the whole network. Two basic operations are invoked at each training iteration:

-READ: At the beginning of each iteration, given a video clip vt(i) from the i-th video, memory features around the target clip are read from the memory pool Ω, namely [P̂t−L(i), . . . , P̂t−1(i)] and [P̂t+1(i), . . . , P̂t+L(i)].


Algorithm 1 Training with asynchronous memory update

Input: Video dataset V = {v(1), v(2), . . . , v(|V|)} with v(i) = [v1(i), v2(i), . . . , vTi(i)];
the whole network N, with its parameters φN.
Output: Optimized network N with φN for inference.

// Initialization:
1: Ω = {(P̂t(i) ← zero vector, δt(i) ← 0) | ∀ t, i}.
2: err ← ∞.
// Training Process:
3: for iter = 1 to itermax do
4:   Sample a video clip vt(i) from dataset V.
5:   for t′ = t − L to t + L do
6:     if t′ ≠ t then
7:       READ P̂t′(i) and δt′(i) from memory pool Ω.
8:       wt′(i) = min{err/δt′(i), δt′(i)/err}.
9:       Impose penalty: P̂t′(i) ← wt′(i) P̂t′(i).
10:    end if
11:  end for
12:  Extract Pt(i) and Ot(i) with the backbone in N.
13:  Estimated memory features: Mt(i) ← [P̂t−L(i), . . . , P̂t−1(i), Pt(i), P̂t+1(i), . . . , P̂t+L(i)].
14:  Forward (Pt(i), Ot(i), Mt(i)) through the head in N and backward to optimize φN.
15:  Update err as the output of the current loss function.
16:  WRITE P̂t(i) ← Pt(i), δt(i) ← err back to Ω.
17: end for
18: return N, φN

-WRITE: At the end of each iteration, the person features of the target clip, Pt(i), are written back to the memory pool Ω as estimated memory features P̂t(i), tagged with the current loss value.

-Reweighting: The features we READ were written at different training steps, so some early-written features were extracted by a model whose parameters are quite different from the current ones. We therefore apply a penalty factor wt′(i) to discount badly estimated features. We design a simple yet effective way to compute this penalty factor using the loss tag. The difference between the loss tag δt′(i) and the current loss value err is expressed as

wt′(i) = min{err/δt′(i), δt′(i)/err},  (2)

which is very close to 1 when the difference is small. As the network converges, the estimated features in the memory pool are expected to become closer and closer to the precise features, and wt′(i) approaches 1.

As shown in Figure 5, the consumption of our algorithm shows no obvious increase in either GPU memory or computation as the length of the memory features grows, and thus we can use sufficiently long memory features on current common devices. With dynamic updating, the asynchronous memory features can be better exploited than frozen ones.
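The following is a minimal sketch of such a memory pool with the READ/WRITE operations and the loss-tag reweighting of Eq. (2). The dictionary storage, zero initialization and detaching of written features are simplifying assumptions; the released code may organize the pool differently.

```python
# A sketch of the AMU memory pool (cf. Algorithm 1): READ returns penalised
# estimates of neighbouring person features, WRITE stores fresh features
# tagged with the current loss value.
import torch

class MemoryPool:
    def __init__(self, max_persons=5, d=256):
        self.features = {}                  # (video_id, clip_idx) -> (max_persons, d)
        self.loss_tags = {}                 # (video_id, clip_idx) -> loss tag delta
        self.empty = torch.zeros(max_persons, d)

    def read(self, video_id, t, L, current_err):
        neighbours = []
        for t2 in range(t - L, t + L + 1):
            if t2 == t:
                continue                    # the centre slot is filled with fresh P_t
            feat = self.features.get((video_id, t2), self.empty)
            delta = self.loss_tags.get((video_id, t2))
            if delta is not None and delta > 0:
                # Eq. (2): penalise features written at a very different loss level
                w = min(current_err / delta, delta / current_err)
                feat = w * feat
            neighbours.append(feat)
        return neighbours                   # 2L estimated neighbouring clip features

    def write(self, video_id, t, person_feats, current_err):
        self.features[(video_id, t)] = person_feats.detach()
        self.loss_tags[(video_id, t)] = current_err
```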


4 Experiments on AVA

The Atomic Visual Actions (AVA) [16] dataset is built for spatio-temporal action localization. In this dataset, each person is annotated with a bounding box and multiple action labels at 1 FPS. There are 80 atomic action classes, which cover pose actions, person-person interactions and person-object interactions. This dataset contains 235 training movie videos and 64 validation movie videos.

Since our method is originally designed for spatio-temporal action detection, we use the AVA dataset as the main benchmark for our detailed ablation experiments. Performance is evaluated with the official metric, frame-level mean average precision (mAP) at spatial IoU ≥ 0.5, and only the top 60 most common action classes are used for evaluation, following [16].

4.1 Implementation Details

Instance Detector. We apply the Faster R-CNN [26] framework to detect persons and objects on the key frames of each clip. A model with a ResNeXt-101-FPN [41,21] backbone from maskrcnn-benchmark [23] is adopted for object detection. It is first pre-trained on ImageNet [6] and then fine-tuned on the MSCOCO [22] dataset. For human detection, we further fine-tune the model on AVA for higher detection precision.

Backbone. Our method can easily be applied to any kind of 3D CNN backbone. We select the state-of-the-art SlowFast [9] network with a ResNet-50 structure as our baseline model. Basically following the recipe in [9], our backbone is pre-trained on the Kinetics-700 [3] dataset for the action classification task. This pre-trained backbone achieves 66.34% top-1 and 86.66% top-5 accuracy on the Kinetics-700 validation set.

Training and Inference. Initialized from the Kinetics pre-trained weights, we fine-tune the whole model on the AVA dataset. The inputs of our network are 32 RGB frames, sampled from a 64-frame raw clip with a one-frame interval. Clips are scaled such that the shortest side becomes 256 pixels and are then fed into the fully convolutional backbone. We use only the ground-truth human boxes for training and randomly jitter them for data augmentation. For the object boxes, we set the detection threshold to 0.5 in order to obtain higher recall. During inference, detected human boxes with a confidence score larger than 0.8 are used. We set L = 30 for the memory features in our experiments. We train our network using the SGD algorithm with batch size 64 on 16 GPUs (4 clips per device). BatchNorm (BN) [18] statistics are frozen. We train for 27.5k iterations with a base learning rate of 0.004, which is reduced by a factor of 10 at the 17.5k-th and 22.5k-th iterations. A linear warm-up [15] schedule is applied for the first 2k iterations.
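For reference, the optimization recipe described above can be written as an iteration-level schedule roughly as follows; `net` is a placeholder for the full AIA model, and the momentum value is an assumption since it is not specified here.

```python
# A sketch of the described schedule: SGD with base LR 0.004, 10x decay at
# 17.5k and 22.5k iterations, and a 2k-iteration linear warm-up.
import torch

def lr_lambda(it, warmup=2000, steps=(17500, 22500)):
    if it < warmup:
        return (it + 1) / warmup                     # linear warm-up
    return 0.1 ** sum(it >= s for s in steps)        # step decay

net = torch.nn.Linear(8, 8)                          # placeholder for the AIA network
optimizer = torch.optim.SGD(net.parameters(), lr=0.004, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for it in range(27500):                              # total iterations from the text
    # ... forward pass, loss.backward() ...
    optimizer.step()
    scheduler.step()
```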

4.2 Ablation Experiments

Three Interactions. We first study the importance of the three kinds of interactions. For each interaction type, we use at most one block in this experiment.


Table 1: Ablation Experiments. We use a ResNet-50 SlowFast backbone to perform our ablation study. Models are trained on the AVA (v2.2) training set and evaluated on the validation set. The evaluation metric mAP is shown in %.

(a) Three interactions:
  interactions used      | mAP
  none (baseline)        | 26.54
  two of the three       | 28.04 / 28.86 / 28.92
  P + O + M              | 29.26

(b) Number of I-Blocks:
  blocks       | mAP
  1× {P,M,O}   | 29.26
  2× {P,M,O}   | 29.64
  3× {P,M,O}   | 29.61

(c) Interaction order:
  order      | mAP      order      | mAP
  M → O → P  | 29.48    M → P → O  | 29.46
  O → P → M  | 29.51    O → M → P  | 29.53
  P → M → O  | 29.44    P → O → M  | 29.64

(d) IA structure:
  structure     | mAP
  Parallel      | 28.85
  Serial        | 29.64
  Dense Serial  | 29.80

(e) Asynchronous Memory Update:
  model           | params | FLOPs | mAP
  Baseline        | 1.00×  | 1.00× | 26.54
  LFB (w/o AMU)   | 2.18×  | 2.12× | 27.02
  LFB (w/ AMU)    | 1.18×  | 1.12× | 28.57
  IA (w/o AMU)    | 2.35×  | 2.15× | 28.07
  IA (w/ AMU)     | 1.35×  | 1.15× | 29.64

(f) Comparison to NL:
  model         | mAP
  Baseline      | 26.54
  +NL           | 26.85
  +IA (w/o M)   | 28.23

These blocks are then stacked in serial. To evaluate the importance of person-object interaction, we remove the O-Block from the structure; the other interactions are evaluated in the same way. Table 1a compares the model performance for each combination of interaction types. A backbone baseline without any interaction is also listed in this table. Overall, we observe that removing any of the three types of interaction results in a significant performance decrease, which confirms that all three interactions are important for action detection.

Number of Interaction Blocks. We then experiment with different settings for the number of interaction blocks in our IA structure. The interaction blocks are nested in the serial structure in this experiment. In Table 1b, N× {P,M,O} denotes that N blocks are used for each interaction type, giving 3N blocks in total. We find that our method achieves the best performance with N = 2, so we use this as our default configuration.

Interaction Order. In our serial IA, different types of interaction are alternately integrated in sequence. We investigate the effect of different interaction orders in Table 1c. As shown in this experiment, the performance with different orders is quite similar; we thus choose the slightly better one, P → O → M, as our default setting.

Interaction Aggregation Structure. We analyze the different IA structures in this part. Parallel IA, serial IA and the dense serial IA extension are compared in Table 1d. As we expected, parallel IA performs much worse than the serial structure. With dense connections between blocks, our model is able to learn more knowledge of interactions, which further boosts the performance.

Asynchronous Memory Update. In the previous work LFB [39], the memory features are extracted with another backbone, which is frozen during training.


Fig. 6: Per-category results for the proposed network and the baseline model on the validation set of AVA v2.2.

In this experiment, we compare our asynchronous memory features with the frozen ones. For a fair comparison, we re-implement LFB with the SlowFast backbone and also apply our AMU algorithm to LFB. In Table 1e, we find that our asynchronous memory features achieve much better performance than the frozen ones, with nearly half of the parameters and computation cost. We argue that this is because our dynamic features provide a better representation.

Comparison to Non-local Attention. Finally, we compare our interaction aggregation method with the prior non-local block [36]. Following [9], we augment the backbone with a non-local branch, where attention is computed between the person instance features and globally pooled features. Since there are no long-term features in the non-local block, we use only P and O in this experiment. In Table 1f, we see that our serial IA works significantly better than the non-local block. This confirms that our method can better learn to find potential interactions.

4.3 Main Results

Finally, we compare our results on AVA with previous methods in Table 2. We show our results on both AVA v2.1 and v2.2. Our method surpasses all previous works on both versions.

The AVA v2.2 dataset is the newer benchmark used in the ActivityNet challenge 2019 [8]. On the validation set, our method reports a new state-of-the-art 33.11 mAP with a single model, which outperforms the strong SlowFast baseline by 3.7 mAP. For the test split, we train our model on both the training and validation splits and use a relatively longer schedule due to the additional data. With an ensemble of three models with different learning rates and aggregation structures, our method achieves better performance than the winning entry of the AVA action detection challenge 2019 (an ensemble of 7 SlowFast networks). The per-category results for our method and the SlowFast baseline are illustrated in Figure 6.


Table 2: Main results on the AVA dataset. In this table, we display our best results with both the ResNet-50 (R-50) and ResNet-101 (R-101) backbones. "*" indicates that the result is tested at multiple scales. The input sizes are shown as the frame number and corresponding sample rate. Two R-101 backbone baselines re-implemented in this work are also displayed as "ours" for comparison.

(a) Comparison on AVA v2.1.
  model            | input | pretrain | val
  SlowFast [9]     | 32×2  | K400     | 26.3
  LFB [39]         | 32×2  | K400     | 27.7
  I3D [11]         | 64×1  | K600     | 21.9
  SlowFast [9]     | 32×2  | K600     | 28.2
  SlowFast (ours)  | 32×2  | K700     | 28.1
  AIA R50          | 32×2  | K700     | 28.9
  AIA R101         | 32×2  | K700     | 31.2

(b) Comparison on AVA v2.2.
  model                  | input | pretrain | val   | test
  SlowFast [9]           | 32×2  | K600     | 29.1  | -
  SlowFast [9]           | 64×2  | K600     | 29.4  | -
  SlowFast*, 7 ens. [9]  | -     | K600     | -     | 34.25
  SlowFast (ours)        | 32×2  | K700     | 29.3  | -
  AIA R50                | 32×2  | K700     | 29.80 | -
  AIA R101               | 32×2  | K700     | 32.26 | -
  AIA R101*              | 32×2  | K700     | 33.11 | 32.25
  AIA R101*, 3 ens.      | -     | K700     | -     | 34.42

We can observe a performance gain for each category, especially for those that involve interactions with the video context.

We note that we use the new, larger Kinetics-700 dataset for pre-training. As shown in Table 2, the backbone we implement in this work has performance similar to that of the official model pre-trained on Kinetics-600. This comparison shows that the vast majority of the performance gain comes from our method.

5 Experiments on UCF101-24

UCF101-24 [30] is an action detection dataset with 24 action categories. In this dataset, the bounding boxes of each action instance are annotated frame by frame. We conduct experiments on the first split of this dataset, following previous works, and use the corrected annotations provided by Singh et al. [29].

5.1 Implementation Details

Since the bounding boxes are annotated at the frame level, we choose the ResNet50-C2D model proposed in [36] as our video backbone. To accept a single image as input, we remove all of its temporal poolings. We pre-train it on the Kinetics-400 dataset. Other settings are basically the same as in the AVA experiments. More implementation details are provided in the Supplementary Material.

5.2 Quantitative Evaluation

Table 3 shows the results on the UCF101-24 test split in terms of frame-mAP with a 0.5 IoU threshold. As we can see in the table, AIA achieves a 3.3% mAP gain compared to our C2D baseline. Moreover, with only a lightweight 2D backbone (whereas other works may use 3D backbones), our method still achieves very competitive results.


Table 3: Results on UCF101-24 Split 1

  method           | mAP    method            | mAP    method            | mAP
  T-CNN [17]       | 41.4   STEP [43]         | 75.0   C2D (ours)        | 75.5
  Peng et al. [24] | 65.7   Gu et al. [16]    | 76.3   AIA-Serial        | 78.5
  ACT [20]         | 69.5   Zhang et al. [44] | 77.9   AIA-Dense-Serial  | 78.8

Table 4: EPIC-Kitchens Validation Results

                    |     Verbs     |     Nouns     |    Actions
  model             | top-1 | top-5 | top-1 | top-5 | top-1 | top-5
  Baradel [1]       | 40.9  |   -   |   -   |   -   |   -   |   -
  LFB NL [39]       | 52.4  | 80.8  | 29.3  | 54.9  | 20.8  | 39.8
  SlowFast (ours)   | 56.8  | 82.8  | 32.3  | 56.7  | 24.1  | 42.0
  AIA-Parallel      | 57.6  | 83.9  | 36.3  | 63.0  | 26.4  | 47.4
  AIA-Serial        | 59.2  | 84.2  | 37.2  | 63.2  | 27.7  | 48.0
  AIA-Dense-Serial  | 60.0  | 84.6  | 37.2  | 62.1  | 27.1  | 47.8

6 Experiments on EPIC-Kitchens

To demonstrate the generalizability of AIA, we evaluate our method on the segment-level dataset EPIC-Kitchens [5]. In EPIC-Kitchens, each segment is annotated with one verb and one noun, and the action is defined by their combination. Following [1], we split the original training set into a new training set and a new validation set. Verb models and noun models are trained separately, and actions are obtained by combining their predictions.

6.1 Implementation Details

For both the verb model and the noun model, we use the extracted segment features (the global average pooling of ft) as the query input for the IA structure. Hand features and object features are cropped and then fed into IA to model person-person and person-object interactions. For the verb model, the memory features are the segment features. For the noun model, the memory features are the object features extracted from the object detector feature map; the AMU algorithm is therefore only applied to the verb model. Other experimental details that differ from the AVA setting are provided in the Supplementary Material.

6.2 Quantitative Evaluation

We observe from Table 4 a significant gain on all three tasks. All variants of AIA outperform the strong SlowFast baseline. Among them, dense serial IA achieves the best performance on the verb task, leading to a 3.2% improvement in top-1 accuracy, while serial IA brings 4.9% for the noun task and 3.6% for the action task.


7 Conclusion

In this paper, we present the Asynchronous Interaction Aggregation network and its performance on action detection. Our method reports the new state-of-the-art on the AVA dataset. Nevertheless, the performance of action detection and interaction recognition is still far from perfect; this is probably due to the limited size of video datasets. Transferring knowledge of actions and interactions from images could be a further improvement for the AIA network.

References

1. Baradel, F., Neverova, N., Wolf, C., Mille, J., Mori, G.: Object level visual reasoning in videos. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 105–121 (2018)
2. Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). vol. 2, pp. 60–65. IEEE (2005)
3. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308 (2017)
4. Christoph, R., Pinz, F.A.: Spatiotemporal residual networks for video action recognition. Advances in Neural Information Processing Systems. pp. 3468–3476 (2016)
5. Damen, D., Doughty, H., Maria Farinella, G., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Scaling egocentric vision: The EPIC-Kitchens dataset. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 720–736 (2018)
6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
7. Diba, A., Fayyaz, M., Sharma, V., Mahdi Arzani, M., Yousefzadeh, R., Gall, J., Van Gool, L.: Spatio-temporal channel correlation networks for action classification. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 284–299 (2018)
8. Fabian Caba Heilbron, Victor Escorcia, B.G., Niebles, J.C.: ActivityNet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 961–970 (2015)
9. Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6202–6211 (2019)
10. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1933–1941 (2016)
11. Girdhar, R., Carreira, J., Doersch, C., Zisserman, A.: A better baseline for AVA. CoRR abs/1807.10066 (2018), http://arxiv.org/abs/1807.10066
12. Girdhar, R., Carreira, J., Doersch, C., Zisserman, A.: Video action transformer network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 244–253 (2019)
13. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1440–1448 (2015)


14. Gkioxari, G., Girshick, R., Dollar, P., He, K.: Detecting and recognizing human-object interactions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8359–8367 (2018)
15. Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
16. Gu, C., Sun, C., Ross, D.A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., et al.: AVA: A video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6047–6056 (2018)
17. Hou, R., Chen, C., Shah, M.: Tube convolutional neural network (T-CNN) for action detection in videos. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5822–5831 (2017)
18. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
19. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), 221–231 (2012)
20. Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C.: Action tubelet detector for spatio-temporal action localization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4405–4413 (2017)
21. Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2117–2125 (2017)
22. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
23. Massa, F., Girshick, R.: maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark (2018), accessed: 2020-2-29
24. Peng, X., Schmid, C.: Multi-region two-stream R-CNN for action detection. In: European Conference on Computer Vision. pp. 744–759. Springer (2016)
25. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5533–5541 (2017)
26. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
27. Sigurdsson, G.A., Divvala, S., Farhadi, A., Gupta, A.: Asynchronous temporal fields for action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
28. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. pp. 568–576 (2014)
29. Singh, G., Saha, S., Sapienza, M., Torr, P.H., Cuzzolin, F.: Online real-time multiple spatiotemporal action localisation and prediction. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3637–3646 (2017)
30. Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)


31. Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C.: Convolutional learning of spatio-temporal features. In: European Conference on Computer Vision. pp. 140–153. Springer (2010)
32. Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: C3D: Generic features for video analysis. CoRR abs/1412.0767 2(7), 8 (2014)
33. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6450–6459 (2018)
34. Varol, G., Laptev, I., Schmid, C.: Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6), 1510–1517 (2017)
35. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30. pp. 5998–6008. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
36. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7794–7803 (2018)
37. Wang, X., Gupta, A.: Videos as space-time region graphs. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 399–417 (2018)
38. Weston, J., Chopra, S., Bordes, A.: Memory networks. arXiv preprint arXiv:1410.3916 (2014)
39. Wu, C.Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., Girshick, R.: Long-term feature banks for detailed video understanding. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
40. Xia, J., Tang, J., Lu, C.: Three branches: Detecting actions with richer features. arXiv preprint arXiv:1908.04519 (2019)
41. Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1492–1500 (2017)
42. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851 1(2), 5 (2017)
43. Yang, X., Yang, X., Liu, M.Y., Xiao, F., Davis, L.S., Kautz, J.: STEP: Spatio-temporal progressive learning for video action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 264–272 (2019)
44. Zhang, Y., Tokmakov, P., Hebert, M., Schmid, C.: A structured model for action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9975–9984 (2019)
45. Zhou, B., Andonian, A., Oliva, A., Torralba, A.: Temporal relational reasoning in videos. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 803–818 (2018)