# Trajectory Convolution for Action Recognition

Yue Zhao, Department of Information Engineering, The Chinese University of Hong Kong, zy317@ie.cuhk.edu.hk
Yuanjun Xiong, Amazon Rekognition, yuanjx@amazon.com
Dahua Lin, Department of Information Engineering, The Chinese University of Hong Kong, dhlin@ie.cuhk.edu.hk

Abstract

How to leverage the temporal dimension is one major question in video analysis. Recent works [47, 36] suggest an efficient approach to video feature learning, i.e., factorizing 3D convolutions into separate spatial and temporal convolutions. The temporal convolution, however, comes with an implicit assumption that the feature maps across time steps are well aligned, so that features at the same locations can be aggregated. This assumption can be overly strong in practical applications, especially in action recognition, where motion serves as a crucial cue. In this work, we propose a new CNN architecture, TrajectoryNet, which incorporates trajectory convolution, a new operation for integrating features along the temporal dimension, to replace the existing temporal convolution. This operation explicitly takes into account the changes in content caused by deformation or motion, allowing the visual features to be aggregated along the motion paths, i.e., trajectories. On two large-scale action recognition datasets, Something-Something V1 and Kinetics, the proposed network architecture achieves notable improvements over strong baselines.

1 Introduction

The past decade has witnessed significant progress in action recognition [37, 38, 29, 42, 1], especially due to the advances in deep learning. Deep learning based methods for action recognition mostly fall into two categories: two-stream architectures [29] with 2D convolutional networks, and 3D convolutional networks [34]. In particular, the latter has demonstrated great potential on large-scale video datasets [19, 25], with the use of new training strategies such as transferring weights from pretrained 2D CNNs [42, 1]. However, for 3D convolution, several key questions remain to be answered: (1) 3D convolution involves substantially increased computing cost. Is it really necessary? (2) 3D convolution treats the spatial and temporal dimensions uniformly. Is it the most effective way for video modeling?

We are not the first to raise such questions. In recent works, there have been attempts to move beyond 3D convolution and further improve the efficiency and effectiveness of joint spatio-temporal analysis. For instance, both Separable-3D (S3D) [47] and R(2+1)D [36] obtain superior performance by factorizing the 3D convolutional filter into separate spatial and temporal operations. However, both methods are based on an implicit assumption that the feature maps across frames are well aligned, so that the features at the same locations (across consecutive frames) can be aggregated via temporal convolution. This assumption ignores the motion of people or objects, a key aspect in video analysis.

Figure 1: Illustration of our trajectory convolution.
Given a sequence of video frames (left) and its corresponding input feature map of size C × T × H × W (bottom-middle; the channel dimension C is reduced to one for clarity), in order to calculate the response of a specific point at time step t, we leverage the forward and backward motion fields (top-middle; the blue arrows denote the motion velocity) to determine the sampling locations at the neighboring time steps t − 1 and t + 1, in the sense of tracking along the motion path. The response is marked on the output feature map (bottom-right). The operation of trajectory convolution (denoted by a red box) is illustrated on the top-right. This figure is best viewed in color.

A natural idea to address this issue is to track the objects of interest and extract the features along their motion paths, i.e., trajectories. This idea has been explored in previous works [33, 37, 38, 41]. The most recent work along this direction is the Trajectory-pooled Deep-convolutional Descriptor (TDD) [41], which aggregates off-the-shelf deep features along trajectories. However, in this method, the visual features are derived separately from an existing deep network, serving merely as a replacement of hand-crafted features. Hence, a question emerges: can we learn better video features in conjunction with feature tracking?

In pursuit of this question, we develop a new CNN architecture for learning video features, called TrajectoryNet. Inspired by the Separable-3D network [36, 47], our design involves a cascade of convolutional operations along the spatial and temporal dimensions, respectively. A distinguishing feature of this architecture is that it introduces a new operation, namely the trajectory convolution, to take the place of the standard temporal convolution. As shown in Figure 1, the trajectory convolution operates along the trajectories that trace the pixels corresponding to the same physical points, rather than at fixed pixel locations. The trajectories can be derived either from a precomputed optical flow field or from a dense flow prediction network trained jointly with the features. The standard temporal convolution can be seen as a special case of the trajectory convolution in which all pixels are considered stationary over time. Experimental results on the Something-Something V1 and Kinetics datasets show that by explicitly taking the motion dynamics into account in the temporal operation, the proposed network obtains considerable improvements over Separable-3D, a competitive baseline.

2 Related Work

Trajectory-based Methods for Action Recognition. Action recognition in videos has been greatly advanced thanks to the emergence of powerful features. It was first tackled by extracting spatio-temporal local descriptors [39] from space-time interest points [20, 46]. Successful local features of this kind include the Histogram of Oriented Gradients (HOG) [3], the Histogram of Optical Flow (HOF) [21], and the Motion Boundary Histogram (MBH) [4]. Over the years, it was recognized that the 2D spatial domain and the 1D temporal domain have different characteristics and should intuitively be handled in different manners. As for motion modeling in the temporal domain, trajectories have been a powerful intermediary for conveying such motion information. Messing et al. [24] used a KLT tracker [22] to extract feature trajectories and applied log-polar uniform quantization. Sun et al. [33] extracted trajectories by matching SIFT features between frames.
These trajectories are based on sparse interest points, which were later shown to be inferior to dense sampling. In [37], Wang et al. used dense trajectories to extract low-level features within aligned 3D volumes. An improved version [38] increased recognition accuracy by estimating and compensating for the effect of camera motion. [37, 38] also revealed that the trajectory itself can serve as a component of descriptors in the form of concatenated displacement vectors, which was consolidated by deep learning methods [29]. Wang et al. first proposed TDD [41] to introduce deep features to trajectory analysis. It conducts trajectory-constrained pooling to aggregate deep features into video descriptors. However, the backbone two-stream CNN [29], from which the deep features are extracted, is learned from very short frame snippets and is unaware of the temporal evolution. In addition, all of these trajectory-aligned methods rely on encoding methods such as Fisher vectors (FV) [45] and vectors of locally aggregated descriptors (VLAD) [14], and an extra SVM is needed for classification, which prohibits end-to-end training. To sum up the discussion above, we provide a comparison of our approach with previous works on action recognition in Table 1.

Table 1: A comparison of our approach with existing methods.

| Method | Use deep feature? | Feature tracking? | End-to-end? |
| --- | --- | --- | --- |
| STIP [20] | no | no | no |
| DT [37], iDT [38] | no | yes | no |
| TSN [42], I3D [1] | yes | no | yes |
| TDD [41] | yes | yes | no |
| TrajectoryNet (Ours) | yes | yes | yes |

Action Recognition in the Context of Deep Learning. Models based on deep convolutional neural networks have been widely applied to action recognition [18, 29, 34]; they can mostly be categorized into two families, i.e., two-stream networks [29] and 3D convolutional networks [15, 34]. Recently, 3D convolutional networks have drawn attention since Carreira et al. introduced Inflated-3D models [1] by inflating an existing 2D convolutional network to its 3D variant and training on a very large action recognition dataset [19]. Tran et al. argued in [36] that factorizing 3D convolutions into separable spatial and temporal convolutions obtains higher accuracy. A similar phenomenon is observed in the Separable-3D models of Xie et al. [47]. Wang et al. incorporated multiplicative interactions into 3D convolution for relation modeling in [40]. All of these modifications focus on a single modality, i.e., the appearance branch. Apart from network architectural designs, another direction is to exploit the interaction between the appearance and motion information of an action. Feichtenhofer et al. explored strategies for spatio-temporal fusion of two-stream networks at earlier stages in [7]. Such attempts are mostly simple feature manipulations such as stacking, addition [7], and multiplicative gating [6].

Motion Representation using Convolutional Networks. Optical flow has been used for decades as a generic representation of motion, and of trajectories in particular. As a competitive counterpart to the classical variational approaches [10, 31], many CNN-based parametric models have recently been proposed and achieve promising results in estimating optical flow. These include, but are not limited to, the FlowNet family [5, 11], SpyNet [26], and PWC-Net [32]. The aforementioned models are learned in a supervised manner on large-scale simulated flow datasets [5, 23], possibly leaving a large gap between simulated animations and real-world videos.
Also, these datasets are designed for accurate flow prediction, which is possibly not appropriate for motion estimation in human action, due to the inhomogeneity of displacement statistics between optical flow datasets and human action datasets, as revealed in [11]. As for the network architecture, most models require on the order of $10^7$–$10^8$ parameters, which both prevents them from being plugged into action recognition networks as a submodule and incurs excessive computational cost. Zhu et al. proposed MotionNet [51] to learn dense flow fields in an unsupervised manner and plugged it into a two-stream network [29] to be fine-tuned for the action recognition task. MotionNet is relatively lightweight and can accept a sequence of multiple images. However, it is only used as a substitute for the pre-computed optical flow while maintaining the conventional two-stream architecture. Zhao et al. proposed an alternative representation based on cost volume for efficiency, at the cost of a degraded motion field [49].

Transformation-Sensitive Convolutional Networks. Conventional CNNs operate on fixed locations in a regular grid, which limits their ability to model unknown geometric transformations. Spatial Transformer Networks (STN) [13] were the first to introduce spatial transformation learning into deep models. They estimate a global parametric transformation with which the feature map is warped. Such warping is computationally expensive, and the transformation is assumed to be shared across the whole image, which is usually not the case for action recognition, since different body parts have their own movement. In Dynamic Filter Networks [16], Jia et al. introduce dynamic filters that are conditioned on the input and can change across samples. This enables learning local spatial transformations. The Deformable Convolutional Network (DCN) [2] achieves similar local transformations in a different way. While keeping the filter weights invariant to the input, the proposed deformable convolution first learns a dense offset map from the input and then applies it to the regular feature map for re-sampling. The proposed trajectory convolution is inspired by the deformable sampling in DCN and utilizes it for feature tracking in the spatio-temporal convolution operations.

3 TrajectoryNet

The TrajectoryNet model is built with the trajectory convolution operation. In this section, we first introduce the concept of trajectory convolution. Then we illustrate the architecture of TrajectoryNet. Finally, we describe the approach to learning the trajectory together with the trajectory convolution.

3.1 Trajectory Convolution

In the context of separable spatio-temporal 3D convolution, the 1D temporal convolution is conducted pixel-wise on the 2D spatial feature map along the temporal dimension. Given an input feature map $x_t(\mathbf{p})$ at the $t$-th time step, the output feature $y_t(\mathbf{p})$ at position $\mathbf{p} = (h, w) \in [0, H) \times [0, W)$ is calculated as the inner product between the input feature sequence at the same spatial position across neighboring frames and the 1D convolution kernel. By revisiting the idea of trajectory modeling in the action recognition literature, we introduce the concept of trajectory convolution. In trajectory convolution, the convolution is carried out over an irregular grid, such that the sampled positions at different times correspond to the same physical point of a moving object. Formally, parameterized by the filter weights $\{w_\tau : \tau \in [-\Delta t, \Delta t]\}$ with kernel size $(2\Delta t + 1)$, the output feature $y_t(\mathbf{p})$ is calculated as

$$ y_t(\mathbf{p}) = \sum_{\tau=-\Delta t}^{\Delta t} w_\tau \, x_{t+\tau}(\tilde{\mathbf{p}}_{t+\tau}). \tag{1} $$
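For reference, the fixed-location temporal convolution that Eq. (1) generalizes can be written in a few lines of tensor code. The sketch below is ours and only illustrative, not the authors' implementation; it keeps a scalar tap per temporal offset to mirror the notation above, whereas an actual (2+1)D block would use a k × 1 × 1 convolution that also mixes channels.

```python
import torch
import torch.nn.functional as F

def fixed_location_temporal_conv(x, w):
    """Pixel-wise temporal convolution: y_t(p) = sum_tau w[tau] * x_{t+tau}(p).

    x: feature map of shape (N, C, T, H, W).
    w: 1D kernel of length K = 2*dt + 1, shared across channels for clarity.
    Every spatial position p is aggregated over time at the SAME location,
    i.e. no tracking along motion paths is performed.
    """
    dt = (w.numel() - 1) // 2
    # Zero-pad the temporal axis so the output keeps T steps.
    xp = F.pad(x, (0, 0, 0, 0, dt, dt))
    t_len = x.shape[2]
    return sum(w[tau + dt] * xp[:, :, tau + dt: tau + dt + t_len]
               for tau in range(-dt, dt + 1))
```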
Following the formulation of trajectories in [37], the point $\mathbf{p}_t$ at frame $t$ can be tracked to position $\tilde{\mathbf{p}}_{t+1}$ in the next frame $(t+1)$, given a forward dense optical flow field $\omega_t = (u_t, v_t) = \mathcal{F}(I_t, I_{t+1})$, using the following equation:

$$ \tilde{\mathbf{p}}_{t+1} = (h_{t+1}, w_{t+1}) = \mathbf{p}_t + \omega_t(\mathbf{p}_t) = (h_t, w_t) + \omega_t|_{(h_t, w_t)}. \tag{2} $$

For $\tau > 1$, the sampling position $\tilde{\mathbf{p}}_{t+\tau}$ can be calculated by applying Eq. (2) iteratively. To track into the previous frame $(t-1)$, a backward dense optical flow field $\bar{\omega}_t = (\bar{u}_t, \bar{v}_t) = \mathcal{F}(I_t, I_{t-1})$ is used likewise. Since the optical flow field is typically real-valued, the sampling position $\tilde{\mathbf{p}}_{t+\tau}$ becomes fractional. Therefore, the corresponding feature $x(\tilde{\mathbf{p}}_{t+\tau})$ is derived via interpolation with a sampling kernel $G$, written as

$$ x(\tilde{\mathbf{p}}_{t+\tau}) = \sum_{\mathbf{p}'} G(\mathbf{p}', \tilde{\mathbf{p}}_{t+\tau}) \, x(\mathbf{p}'). \tag{3} $$

In this paper, we do not go deeper into the choice of sampling kernel $G$ and use bilinear interpolation by default.

3.2 Relation with Deformable Convolution

The original deformable convolution was introduced for 2D convolution, but it is natural to extend it to the 3D scenario. A spatio-temporal grid $\mathcal{R} \subset \mathbb{R}^3$ can be defined by an ordinary 3D convolution specified by a certain receptive field size and dilation. For each location $\mathbf{q}_0 = (t, h, w)$ on the output feature map $y$, the response is calculated by sampling at irregular locations offset by $\Delta\mathbf{q}_n$:

$$ y(\mathbf{q}_0) = \sum_{\mathbf{q}_n \in \mathcal{R}} w(\mathbf{q}_n) \, x(\mathbf{q}_0 + \mathbf{q}_n + \Delta\mathbf{q}_n). \tag{4} $$

The trajectory convolution can then be viewed as a special case of 3D deformable convolution where the offset map comes from the trajectories. Here, the grid $\mathcal{R} = \{(-1, 0, 0), (0, 0, 0), (1, 0, 0)\}$ is defined by a $3 \times 1 \times 1$ kernel with dilation 1, and the temporal component of the offset is always 0, i.e., $\Delta\mathbf{q}_n = (0, \Delta\mathbf{p}_n)$. The discussion above reveals the relationship with deformable convolution. Therefore, the trajectory convolution can be implemented efficiently in a way similar to that discussed in [2].

3.3 Combining Motion and Appearance Features

The trajectory convolution helps the network aggregate appearance features along the motion path, alleviating motion artifacts through trajectory alignment. However, the motion information itself is also important for action recognition. Inspired by the trajectory descriptor proposed in [37], we describe local motion patterns at each position $\mathbf{p}$ using the sequence of trajectory information in the form of sampling-offset coordinates $\{\Delta\mathbf{p}_\tau : \tau \in [-\Delta t, \Delta t]\}$. This is equivalent to stacking the offset map used for trajectory convolution with the original appearance feature map. The offset map is normalized with Batch Normalization [12] before concatenation. As a result, we combine the appearance features and the trajectory-based motion information with a minimal increase in network parameters. Compared with the canonical two-stream approaches, which are based on the late fusion of two networks, our approach leads to a unified network architecture and is much more parameter- and computation-efficient.

3.4 The TrajectoryNet Architecture

Based on the concept of trajectory convolution, we design a unified architecture that can align appearance and motion features along the motion trajectories. We call it TrajectoryNet, obtained by integrating trajectory convolution into the Separable-3D ResNet-18 architecture [9, 36]. The 1D temporal convolution component of a (2+1)D convolutional block is replaced by a trajectory convolution with a down-sampled motion field, such as a pre-computed optical flow, at the middle level of the network.
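To make the sampling of Eqs. (1)–(3) concrete, below is a minimal PyTorch-style sketch of a trajectory convolution with a 3 × 1 × 1 kernel (Δt = 1), assuming forward and backward flow fields already down-sampled to the feature resolution. All names are illustrative and this is not the authors' implementation, which realizes the computation through the deformable-convolution machinery of Sec. 3.2.

```python
import torch
import torch.nn.functional as F

def _warp(feat, flow):
    """Bilinearly sample `feat` at positions p + flow(p) (Eqs. 2-3).

    feat: (N, C, H, W) feature map of a neighboring frame.
    flow: (N, 2, H, W) displacement in pixels; flow[:, 0] = u (x), flow[:, 1] = v (y).
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=feat.dtype, device=feat.device),
                            torch.arange(w, dtype=feat.dtype, device=feat.device),
                            indexing="ij")
    x_new = xs.unsqueeze(0) + flow[:, 0]          # tracked x coordinate
    y_new = ys.unsqueeze(0) + flow[:, 1]          # tracked y coordinate
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack((2 * x_new / (w - 1) - 1,
                        2 * y_new / (h - 1) - 1), dim=-1)   # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

def trajectory_conv(x, flow_fwd, flow_bwd, w):
    """Trajectory convolution with a 3x1x1 temporal kernel (Eq. 1).

    x:        (N, C, T, H, W) input feature map
    flow_fwd: (N, 2, T, H, W) forward flow  F(I_t, I_{t+1}) at feature resolution
    flow_bwd: (N, 2, T, H, W) backward flow F(I_t, I_{t-1}) at feature resolution
    w:        (3,) temporal taps (w_{-1}, w_0, w_{+1}); shared across channels
              here for clarity, whereas a real layer would also mix channels.
    """
    outputs = []
    t_len = x.shape[2]
    for t in range(t_len):
        prev = x[:, :, max(t - 1, 0)]             # replicate-pad at the boundary
        nxt = x[:, :, min(t + 1, t_len - 1)]
        prev = _warp(prev, flow_bwd[:, :, t])     # sample along the backward track
        nxt = _warp(nxt, flow_fwd[:, :, t])       # sample along the forward track
        outputs.append(w[0] * prev + w[1] * x[:, :, t] + w[2] * nxt)
    return torch.stack(outputs, dim=2)            # (N, C, T, H, W)
```

Setting both flow fields to zero recovers the ordinary fixed-location temporal convolution, which is the special case noted in the introduction.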
The appearance feature map for trajectory convolution is optionally concatenated with the down-sampled motion field to introduce extra motion information. Adding trajectory convolution at higher levels is likely to provide less motion information, since the spatial resolution is reduced and the down-sampled optical flow may be inaccurate. Adding trajectory convolution at lower levels increases the precision of motion estimation, but the receptive field for the sampling positions is limited.

3.5 Learning Trajectory

As discussed in the previous subsection, the trajectory convolution can be viewed as a deformable convolution with a special deformation map, namely the motion trajectory in the video. It is capable of accumulating gradients from higher layers via back-propagation. Therefore, if the trajectory can be estimated by a parametric model, we can learn the model parameters using back-propagation as well. The most straightforward approach is to apply a small 3D CNN to estimate trajectories, mimicking the 2D CNN used in deformable convolutional networks [2]. Preliminary experiments show that this is not very effective. We observe that the offsets obtained by simply applying a 3D convolutional layer over the same input feature map are highly correlated with the appearance. In contrast, the motion representation, for which we use trajectories as a medium, has long been considered, both intuitively and empirically, to be invariant to appearance [17, 28]. Therefore, we cannot naïvely adopt the way offsets are learned in [2]. This also reveals the difference between the original deformable convolution for object detection and our trajectory convolution for action recognition: the original deformable convolution attempts to learn the deformation of the spatial configuration within a single image, while our trajectory convolution models the appearance deformation across neighboring frames, despite sharing a similar mathematical formulation.

To tackle this issue, we instead train a separate network to predict the trajectory. In particular, we use MotionNet [51] as the basis due to its light weight. It accepts a stack of (M + 1) images as a 3(M + 1)-channel input and predicts a series of M motion field maps as a 2M-channel output. Following a downsample-upsample design like FlowNet-SD [11], motion fields at multiple spatial resolutions are predicted. The network is trained without external supervision such as ground-truth optical flow. An unsupervised loss $L_{\text{unsup}}$ [51] is designed to enforce pair-wise reconstruction and similarity, with motion smoothness as a regularization. Once pre-trained, MotionNet can be plugged into the TrajectoryNet architecture to replace the pre-computed optical flow input. We modify the original model in [51] to produce optical flow maps at the same resolution as the feature maps on which the trajectory convolution operates. MotionNet can also be fine-tuned together with the classification network. In this case, the training loss is a weighted sum of the unsupervised loss $L_{\text{unsup}}$ and the cross-entropy classification loss $L_{\text{cls}}$, written as $L = \gamma L_{\text{unsup}} + L_{\text{cls}}$.

4 Experiments

To evaluate the effectiveness of our TrajectoryNet, we conduct experiments on two benchmark datasets for action recognition: Something-Something V1 [8] and Kinetics [19]. Visualization of intermediate features for both appearance and trajectory is also provided.
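Whenever MotionNet is fine-tuned jointly in the experiments below, the training objective is the weighted sum $L = \gamma L_{\text{unsup}} + L_{\text{cls}}$ introduced in Sec. 3.5. The following is a minimal sketch of one such training step; the module interfaces (`motion_net`, `trajectory_net`, `unsupervised_loss`) are assumptions for illustration and are not the authors' code.

```python
import torch
import torch.nn.functional as F

def train_step(frames, labels, motion_net, trajectory_net, unsupervised_loss,
               optimizer, gamma=0.01):
    """One joint optimization step with L = gamma * L_unsup + L_cls (Sec. 3.5).

    frames: (N, 3, T, H, W) RGB clip; labels: (N,) class indices.
    motion_net maps the clip to per-step flow fields at feature resolution,
    which serve as the sampling offsets of the trajectory convolution.
    """
    flows = motion_net(frames)                    # (N, 2, T, H, W) motion fields
    logits = trajectory_net(frames, flows)        # classifier using trajectory conv.
    loss_cls = F.cross_entropy(logits, labels)
    loss_unsup = unsupervised_loss(frames, flows) # reconstruction + smoothness [51]
    loss = gamma * loss_unsup + loss_cls          # gamma = 0.01 in the experiments
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```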
4.1 Dataset Descriptions

Something-Something V1 [8] is a large-scale crowd-sourced video dataset on human-object interaction. It contains 108,499 video clips in 174 classes. The dataset is split into training, validation, and test subsets with a ratio of around 8:1:1. Top-1 and top-5 accuracies are reported.

Kinetics [19] is a large-scale video dataset on human-centric activities sourced from YouTube. We use the version released in 2017, covering 400 human action classes. Due to the inaccessibility of some videos on YouTube, our version contains 240,436, 19,796, and 38,685 clips in the training, validation, and test subsets, respectively. The recognition performance is measured by the average of top-1 and top-5 accuracy.

4.2 Experimental Setups

Network configuration. We use Separable-3D ResNet-18 [9] as the base model, unless otherwise specified. Starting from the base ResNet-18 model, a 1D temporal convolution module with temporal kernel size 3, followed by Batch Normalization [12] and a ReLU non-linearity, is inserted after every 2D spatial convolution module. A dropout of 0.2 is applied between the global pooling and the last C-dimensional fully-connected layer (C equals the total number of classes).

Generating trajectories. As stated above, we study two methods for generating trajectories: one based on variational methods and the other based on CNNs. For the former, we adopt the TV-L1 algorithm [48] as implemented in OpenCV with CUDA. To match the size of the input feature maps, two types of pooling are used to down-sample the optical flow field: average pooling and max pooling. For the latter, MotionNet is trained by randomly sampling image pairs from UCF-101 [30]. The training policy follows the practices in [51].

Training. The network is trained with stochastic gradient descent with momentum set to 0.9. The weights of the 2D spatial convolutions are initialized from the 2D ResNet pre-trained on ImageNet [27]. The length of each input clip is 16 and the sampling step varies from 1 to 2. For Something-Something V1 the batch size is set to 64, while for Kinetics the batch size is 128. On Kinetics, the network is trained from an initial learning rate of 0.01, which is reduced by a factor of 10 every 40 epochs; the whole training procedure takes 100 epochs. For Something-Something V1, the number of epochs is halved because its videos are shorter.

Testing. At test time, we follow the common practice of sampling a fixed number of N snippets (N = 7 for Something-Something V1 and N = 25 for Kinetics) at equal temporal intervals. By cropping and flipping the four corners and the center of each frame within a snippet, 10 inputs are obtained for each snippet. The final class scores are calculated by averaging the scores across all 10N inputs.

4.3 Ablation Studies

Trajectory convolution. We first evaluate the effect of using trajectory convolution in the Separable-3D ResNet architecture in Table 2. A consistent improvement in accuracy can be observed when trajectory convolution is used. Then, we study the effect of incorporating trajectory convolution at different locations. Using more trajectory convolutions increases the top-5 accuracy, but the top-1 accuracy saturates. In the remaining experiments, we use only one trajectory convolution, at the res3b1.conv1 block, unless otherwise specified. Since the gain is not remarkable, we conjecture that this is because the trajectory used is derived from optical flow down-sampled via average pooling.
The optical flow is already smoothed by TV-L1, and the extra average pooling degrades its quality further. To verify this, we perform an additional experiment that replaces average pooling with max pooling. This alternative down-sampling strategy preserves more details without degrading the trajectory. Furthermore, as will be shown in Table 4, using trajectories learned by MotionNet leads to higher accuracy. This indicates that the performance of TrajectoryNet depends highly on the quality of the trajectory.

Table 2: Results of using trajectory convolution in different convolutional layers of the Separable-3D ResNet-18 network. The accuracy is reported on the validation subset of Something-Something V1.

| Usage of Traj. Conv. | Down-sample Method | Top-1 Acc. | Top-5 Acc. |
| --- | --- | --- | --- |
| None | None | 34.30 | 65.66 |
| res2b1.conv1 | Avg. Pool | 34.49 | 66.23 |
| res3a.conv1 | Avg. Pool | 34.79 | 66.21 |
| res3b1.conv1 | Avg. Pool | 34.96 | 66.24 |
| res3b1.conv1,2 | Avg. Pool | 34.72 | 66.89 |
| res3b1.conv1 | Max Pool | 36.04 | 67.72 |

Combining motion and appearance features. We compare the results of incorporating motion information into the trajectory convolution in Table 3. We can clearly see an improvement of more than 1% after encoding a 4-dimensional feature map of trajectory coordinates. We also compare with several other methods, such as early spatial fusion by concatenation with the motion feature map [7] and the late fusion used in two-stream networks [29]. Although there is still an apparent gap between ours and the late-fusion strategy, our fusion strategy achieves a notable increase with a negligible increase in parameters, and it completely removes the computation required to run a motion-stream recognition network.

Table 3: Results of incorporating different sources of input into the trajectory convolution in the Separable-3D ResNet-18 network. Here "ft." denotes a feature map. The accuracy is reported on the validation subset of Something-Something V1.

| Source | Usage of Traj. Conv. | # param. | Top-1 Acc. | Top-5 Acc. |
| --- | --- | --- | --- | --- |
| appearance | res3b1.conv1 | 15.2M | 34.96 | 66.24 |
| appearance + motion (ft.) | res3b1.conv1 | 15.9M | 35.24 | 67.22 |
| appearance + trajectory (# dim=4) | res3b1.conv1 | 15.2M | 36.08 | 67.72 |
| two-stream S3D (late fusion) | None | 30.4M | 40.67 | 72.79 |

Learning trajectory. Here we compare learned trajectories against pre-computed optical flow from TV-L1 [48]. We choose two architectures of MotionNet: one accepts one image pair and outputs one motion field (denoted by MotionNet-(2)), and the other accepts 17 consecutive images and produces 16 motion fields (denoted by MotionNet-(17)). We study three training policies: (1) fixing MotionNet once it is pre-trained; (2) fine-tuning MotionNet with the classification cross-entropy loss; and (3) fine-tuning MotionNet with both the unsupervised loss and the classification loss. The loss weight γ is set to 0.01. The results are listed in Table 4. It turns out that the trajectories learned by both MotionNet-(2) and MotionNet-(17) outperform those derived from TV-L1 [48]. It is interesting to observe that jointly training MotionNet and TrajectoryNet yields lower accuracies than freezing MotionNet unless the unsupervised loss is introduced. We conjecture that $L_{\text{unsup}}$ helps maintain the quality of the trajectories by enforcing pair-wise consistency. The necessity of multi-task fine-tuning may also explain the difficulty of using shallow convolutional modules with random initialization to estimate the trajectory, as discussed in Sec. 3.5.

Table 4: Results of learning trajectory.
The settings are elaborated in the main text.

| Source of trajectory | Fine-tune weight | Unsup. loss | Top-1 Acc. | Top-5 Acc. |
| --- | --- | --- | --- | --- |
| TV-L1 | - | - | 34.96 | 66.24 |
| MotionNet-(2) | no | - | 36.37 | 67.74 |
| MotionNet-(2) | yes | no | 34.72 | 65.59 |
| MotionNet-(2) | yes | yes | 36.91 | 68.47 |
| MotionNet-(17) | no | - | 35.69 | 66.82 |
| MotionNet-(17) | yes | no | 35.25 | 66.65 |
| MotionNet-(17) | yes | yes | 36.69 | 68.52 |

Trajectories with step greater than one. Here we evaluate a model that accepts an input of 16 frames but with a sampling step of 2. To be more specific, we collect 32 consecutive frames and randomly sample one frame from every two neighboring frames. This enlarges the effective coverage of the architecture from 16 to 32 frames while keeping the computation the same. With the strategy of learning the trajectory mentioned above, TrajectoryNet can still improve over the baseline, as shown in Table 5. This also reflects the flexibility of learnable trajectories, since pre-computed optical flow would have to be re-run over the whole training set under such circumstances.

Table 5: Results of using trajectories with step greater than one.

| # of frames | Step | Effective coverage | Usage of Trajectories | Top-1 Acc. | Top-5 Acc. |
| --- | --- | --- | --- | --- | --- |
| 16 | 2 | 32 | None | 42.47 | 74.57 |
| 16 | 2 | 32 | MotionNet-(17)-ft.-unsup. | 43.32 | 74.85 |

Runtime cost. In Table 6, we report the runtime of the proposed TrajectoryNet under two settings: (1) trajectories from pre-computed TV-L1 (time not included) and (2) trajectories inferred by MotionNet-(17) (time included). Compared with its plain counterpart, TrajectoryNet with pre-computed TV-L1 incurs less than 10% additional computation for the trajectory convolution operation. TrajectoryNet with MotionNet-(17) takes an extra 0.137 seconds for the network forward pass compared to TrajectoryNet with TV-L1, which can be ascribed to the forward time of the plugged-in MotionNet.

Table 6: Runtime comparison of TrajectoryNet and its counterparts. The networks are tested on a workstation with an Intel(R) Xeon(R) CPU (E5-2640 v3 @ 2.60GHz) and an Nvidia Titan X GPU.

| Method | Net. forward (sec) | Δt (sec) |
| --- | --- | --- |
| S3D | 0.390 | - |
| TrajectoryNet (TV-L1) | 0.426 | +0.036 |
| TrajectoryNet (MotionNet-(17)) | 0.563 | +0.137 |

4.4 Comparison with the State of the Art

We compare the performance of our TrajectoryNet with other state-of-the-art methods. The results on Something-Something V1 [8] and Kinetics [19] are shown in Table 7 and Table 8, respectively. For Something-Something V1, we use 16 frames with a step of 2 as input and apply MotionNet-(17) to produce the trajectories; motion information encoded by the trajectory is used optionally. In Table 7, we can see that our TrajectoryNet achieves results competitive with state-of-the-art models, including those with deeper backbones or pre-trained on larger datasets. After pre-training on Kinetics, the accuracy is boosted to a new level. For Kinetics, MotionNet-(2) is used. In Table 8, TrajectoryNet improves over the Separable-3D baseline. With 16 input frames at a step of 2, it performs on par with models of similar complexity.

Table 7: Comparison with state-of-the-art methods on the validation set of Something-Something V1. The performance is measured by the Top-1 accuracy.
| Method | Backbone network | Pre-train | Val Top-1 |
| --- | --- | --- | --- |
| 3D-CNN [8] | C3D | Sports-1M | 11.5 |
| MultiScale TRN [50] | BN-Inception | ImageNet | 34.4 |
| ECO lite [52] | BN-Inception + 3D-ResNet18 | Kinetics | 46.4 |
| Non-local I3D + GCN [44] | ResNet-50 | Kinetics | 46.1 |
| TrajectoryNet-MotionNet-(17) w/o motion | ResNet-18 | ImageNet | 43.3 |
| TrajectoryNet-MotionNet-(17) w/ motion | ResNet-18 | ImageNet | 44.0 |
| TrajectoryNet-MotionNet-(17) w/o motion | ResNet-18 | Kinetics | 47.8 |

Table 8: Comparison with state-of-the-art methods on the validation subset of Kinetics. The performance is measured by the average of Top-1 and Top-5 accuracy.

| Method | Backbone network | Pre-train | Val. Avg. Acc. |
| --- | --- | --- | --- |
| TSN (RGB) [42] | BN-Inception-v2 | ImageNet | 77.8 |
| I3D (RGB) [1] | BN-Inception-v1 | ImageNet | 81.2 |
| Nonlocal-I3D (RGB) [43] | ResNet-101 | ImageNet | 85.5 |
| R(2+1)D (RGB) [36] | ResNet-34 | Sports-1M | 82.6 |
| C3D [35] | ResNet-18 | - | 75.7 |
| ARTNet w/ TSN [40] | ResNet-18 | - | 80.0 |
| Separable-3D (RGB, 16 × 1 frames) | ResNet-18 | ImageNet | 76.9 |
| TrajectoryNet-MotionNet-(2) (16 × 1 frames) | ResNet-18 | ImageNet | 77.8 |
| TrajectoryNet-MotionNet-(2) (16 × 2 frames) | ResNet-18 | ImageNet | 79.8 |

4.5 Visualization

We present a qualitative study by visualizing the intermediate features of our TrajectoryNet in Figure 2. Given a pair of consecutive images, shown at the top of the first column, we first compare the feature maps at the res3b1.conv1 layer, i.e., the layer on which the trajectory convolution is applied, shown at the bottom of the first column. We can observe a visible spatial shift between the high-response regions of the two images, which conforms to our assumption that feature maps are not well aligned across time due to object movement. We also show the different types of trajectories used in the experiments, namely the TV-L1 [48] optical flow and the predictions of MotionNet before and after fine-tuning, in the second, third, and fourth columns. We can see that the motion estimated by the original MotionNet is less smooth than TV-L1, especially in background regions. For foreground objects, however, MotionNet does well and can sometimes produce motion with more rigid shapes, e.g., the hand in the left example of Figure 2. Also, the joint training further improves the quality of the trajectories.

Figure 2: Visualization of the intermediate features of TrajectoryNet. The two image pairs depict the actions "moving something (a pen) down" and "trying but failing to attach something (a ball) to something (a cat) because it doesn't stick". For each block, the first column shows a pair of input images and their corresponding feature maps at the res3b1.conv1 layer; the second, third, and fourth columns show the optical flow fields generated by the TV-L1 algorithm and learned by MotionNet before and after fine-tuning (the motion field encoded as an HSV color map as well as its x- and y-direction components are shown from top to bottom). The figure is best viewed in color.

5 Conclusion

In this paper, we propose a unified end-to-end architecture called TrajectoryNet for action recognition. The approach incorporates the repeatedly proven idea of trajectory modeling into the Separable-3D network by introducing a new operation named trajectory convolution. TrajectoryNet further combines appearance and motion information in a unified model architecture. The proposed architecture achieves notable improvements over the Separable-3D baseline, providing a new perspective on explicitly considering motion dynamics in deep networks.
Acknowledgment

This work is partially supported by the Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626) and the Early Career Scheme (ECS) of Hong Kong (No. 24204215).

References

[1] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733. IEEE, 2017.
[2] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017.
[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.
[4] Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision (ECCV), pages 428–441. Springer, 2006.
[5] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, 2015.
[6] Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. Spatiotemporal multiplier networks for video action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7445–7454. IEEE, 2017.
[7] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
[8] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[10] Berthold K. P. Horn and Brian G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[11] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
[13] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems (NIPS), pages 2017–2025, 2015.
[14] Herve Jegou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Perez, and Cordelia Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, 2012.
[15] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
[16] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems (NIPS), pages 667–675, 2016.
[17] Gunnar Johansson. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14(2):201–211, 1973.
[18] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725–1732, 2014.
[19] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[20] Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
[21] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
[22] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence (IJCAI), pages 674–679. Morgan Kaufmann Publishers Inc., 1981.
[23] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4040–4048, 2016.
[24] Ross Messing, Chris Pal, and Henry Kautz. Activity recognition using the velocity histories of tracked keypoints. In The IEEE International Conference on Computer Vision (ICCV), pages 104–111. IEEE, 2009.
[25] Mathew Monfort, Bolei Zhou, Sarah Adel Bargal, Alex Andonian, Tom Yan, Kandan Ramakrishnan, Lisa Brown, Quanfu Fan, Dan Gutfruend, Carl Vondrick, et al. Moments in Time dataset: one million videos for event understanding. arXiv preprint arXiv:1801.03150, 2018.
[26] Anurag Ranjan and Michael J. Black. Optical flow estimation using a spatial pyramid network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
[27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[28] Laura Sevilla-Lara, Yiyi Liao, Fatma Guney, Varun Jampani, Andreas Geiger, and Michael J. Black. On the integration of optical flow and action recognition. arXiv preprint arXiv:1712.08416, 2017.
[29] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2014.
[30] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[31] Deqing Sun, Stefan Roth, and Michael J. Black. Secrets of optical flow estimation and their principles. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2432–2439. IEEE, 2010.
[32] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz.
PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[33] Ju Sun, Xiao Wu, Shuicheng Yan, Loong-Fah Cheong, Tat-Seng Chua, and Jintao Li. Hierarchical spatiotemporal context modeling for action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2004–2011. IEEE, 2009.
[34] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.
[35] Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. ConvNet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.
[36] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[37] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3169–3176. IEEE, 2011.
[38] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In The IEEE International Conference on Computer Vision (ICCV), pages 3551–3558, 2013.
[39] Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, and Cordelia Schmid. Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (BMVC), pages 124-1. BMVA Press, 2009.
[40] Limin Wang, Wei Li, Wen Li, and Luc Van Gool. Appearance-and-relation networks for video classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[41] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4305–4314, 2015.
[42] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), pages 20–36. Springer, 2016.
[43] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[44] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In European Conference on Computer Vision (ECCV), 2018.
[45] Xingxing Wang, Limin Wang, and Yu Qiao. A comparative study of encoding, pooling and normalization methods for action recognition. In Asian Conference on Computer Vision (ACCV), pages 572–585. Springer, 2012.
[46] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In European Conference on Computer Vision (ECCV), pages 650–663. Springer, 2008.
[47] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In European Conference on Computer Vision (ECCV), 2018.
[48] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223. Springer, 2007.
[49] Yue Zhao, Yuanjun Xiong, and Dahua Lin.
Recognize actions by disentangling components of dynamics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6566–6575, 2018.
[50] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In European Conference on Computer Vision (ECCV), 2018.
[51] Yi Zhu, Zhenzhong Lan, Shawn Newsam, and Alexander G. Hauptmann. Hidden two-stream convolutional networks for action recognition. arXiv preprint arXiv:1704.00389, 2017.
[52] Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. ECO: Efficient convolutional network for online video understanding. In European Conference on Computer Vision (ECCV), 2018.