# Optimization Planning for 3D ConvNets

Zhaofan Qiu¹, Ting Yao¹, Chong-Wah Ngo², Tao Mei¹

¹JD AI Research, Beijing, China. ²School of Computing and Information Systems, Singapore Management University, Singapore. Correspondence to: Ting Yao.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

It is not trivial to optimally learn a 3D Convolutional Neural Network (3D ConvNet) due to the high complexity and the various options of the training scheme. The most common hand-tuning process starts by learning a 3D ConvNet on short video clips and then learning long-term temporal dependency on lengthy clips, while gradually decaying the learning rate from high to low as training progresses. The fact that such a process comes along with several heuristic settings motivates the study of seeking an optimal path to automate the entire training. In this paper, we decompose the path into a series of training states and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state. The estimation of the knee point on the performance-epoch curve triggers the transition from one state to another. We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., the optimization path. Furthermore, we devise a new 3D ConvNet with a unique design of dual-head classifier to improve spatial and temporal discrimination. Extensive experiments on seven public video recognition benchmarks demonstrate the advantages of our proposal. With optimization planning, our 3D ConvNets achieve superior results when compared to state-of-the-art recognition methods. More remarkably, we obtain top-1 accuracies of 80.5% and 82.7% on the Kinetics-400 and Kinetics-600 datasets, respectively.

1. Introduction

The recent advances in 3D Convolutional Neural Networks (3D ConvNets) have successfully pushed the limits and improved the state of the art of video recognition. For instance, an ensemble of LGD-3D networks (Qiu et al., 2019) achieves 17.88% average error in the trimmed video classification task of the ActivityNet Challenge 2019, which is dramatically lower than the error (29.3%) attained by the earlier I3D networks (Carreira & Zisserman, 2017). The result basically indicates the advantage and great potential of 3D ConvNets for improving video recognition. Despite these impressive progresses, learning effective 3D ConvNets for video recognition remains challenging due to the large variations and complexities of video content. Existing works on 3D ConvNets (Tran et al., 2015; Carreira & Zisserman, 2017; Tran et al., 2018; Wang et al., 2018c; Qiu et al., 2017b; 2019; Feichtenhofer et al., 2019; Feichtenhofer, 2020; Li et al., 2020b; 2021; Qiu et al., 2021) predominantly focus on the design of network architectures but seldom explore how to train a 3D ConvNet in a principled way.

The difficulty in training 3D ConvNets originates from the high flexibility of the training scheme. Compared to the training of 2D ConvNets (Ge et al., 2019; Lang et al., 2019; Yaida, 2019), the involvement of the temporal dimension in 3D ConvNets brings two new problems: how many frames should be sampled from the video, and how these frames should be sampled. First, the length of the video clip is a tradeoff that controls the balance between training efficiency and long-range temporal modeling when learning 3D ConvNets.
On one hand, training with short clips (16 frames) (Tran et al., 2015; Qiu et al., 2017b) generally leads to fast convergence with large mini-batches, and also alleviates overfitting through the data augmentation brought by sampling short clips. On the other hand, recent works (Varol et al., 2018; Wang et al., 2018c; Qiu et al., 2019) have proven a better ability to capture long-range dependency when training with long clips (over 100 frames), at the expense of training time. The second issue is the sampling strategy. Uniform sampling (Fan et al., 2019; Jiang et al., 2019; Martínez et al., 2019) offers the network a fast-forward overview of the entire video, while consecutive sampling (Tran et al., 2015; Qiu et al., 2017b; Varol et al., 2018; Wang et al., 2018c) encodes the continuous changes across frames and captures the spatio-temporal relation better. Given these complex choices of training scheme, learning a powerful 3D ConvNet often requires significant engineering effort from human experts to determine the optimal strategy on each dataset. That motivates us to automate the training strategy for 3D ConvNets.

In this paper, we propose an optimization planning mechanism which adaptively seeks the optimal training strategy of 3D ConvNets. To this end, our optimization planning studies three problems: 1) whether to choose consecutive or uniform sampling; 2) when to increase the length of the input clip; 3) when to decrease the learning rate. Specifically, we decompose the training process into several training states. Each state is assigned fixed hyper-parameters, including the sampling strategy, the length of the input clip and the learning rate. The transition between states represents the change of hyper-parameters during training. Therefore, the training process is decided by the permutation of different states and the number of epochs for each state. Here, we build a candidate transition graph to define the valid transitions between states. The search for the best optimization strategy is then equivalent to seeking an optimal path from the initial state to the final state on the graph, which can be solved by dynamic programming. In order to determine the best number of epochs for each state in such a process, we propose a knee point estimation method via fitting the performance-epoch curve. In general, our optimization planning can be viewed as a training scheme controller and is readily applicable to training other neural networks in stages with multiple hyper-parameters.

To the best of our knowledge, our work is the first to address the issue of optimization planning for 3D ConvNet training. The issue also leads to the elegant view of how the order and epochs for different hyper-parameters should be planned adaptively. We uniquely formulate the problem as seeking an optimal training path and devise a new 3D ConvNet with a dual-head classifier. Extensive experiments on seven datasets demonstrate the effectiveness of our proposal, and with optimization planning, our 3D ConvNets achieve superior results to several state-of-the-art techniques.

2. Related Work

The early works using Convolutional Neural Networks for video recognition are mostly extended from 2D ConvNets for image classification (Karpathy et al., 2014; Simonyan & Zisserman, 2014; Feichtenhofer et al., 2016; Wang et al., 2016; Qiu et al., 2017a).
These approaches often treat a video as a sequence of frames or optical flow images, and the pixel-level temporal evolution across consecutive frames is seldom explored. To alleviate this issue, the 3D ConvNet in Ji et al. (2013) is devised to directly learn spatio-temporal representations from a short video clip via 3D convolution. Tran et al. design a widely-adopted 3D ConvNet in Tran et al. (2015), namely C3D, consisting of 3D convolutions and 3D poolings optimized on the large-scale Sports1M (Karpathy et al., 2014) dataset. Despite the encouraging performances, the training of 3D ConvNets is computationally expensive and the model size suffers from massive growth. Later, in Qiu et al. (2017b); Tran et al. (2018); Xie et al. (2018), decomposed 3D convolution is proposed to simulate one 3D convolution with one 2D spatial convolution plus one 1D temporal convolution. Recently, more advanced techniques have been presented for 3D ConvNets, including inflating 2D convolutions (Carreira & Zisserman, 2017), non-local pooling (Wang et al., 2018c) and local-and-global diffusion (Qiu et al., 2019). These newly designed 3D ConvNets further show good transferability to several downstream tasks (Qiu et al., 2017c; Li et al., 2018; 2019; Long et al., 2019a;b; 2020).

Our work expands the research horizons of 3D ConvNets and focuses on improving 3D ConvNet training by adaptively planning the optimization process. The related works on 2D ConvNet training (Chee & Toulis, 2018; Lang et al., 2019; Yaida, 2019) automate the training strategy by only changing the learning rate adaptively. Our problem is much more challenging, especially since the temporal dimension is additionally considered and involved in the training scheme of 3D ConvNets. For enhancing 3D ConvNet training, recent works (Wang et al., 2018c; Qiu et al., 2019) first train 3D ConvNets with short input clips and then fine-tune the network with lengthy clips, which balances training efficiency and long-range temporal modeling. The multigrid method (Wu et al., 2020) and ours both delve into the network training of 3D ConvNets, but along two distinct dimensions. Multigrid proposes to cyclically change the spatial resolution and temporal duration of the input clips for a more efficient optimization of 3D ConvNets, and the training strategy is still hand-designed. In contrast, ours studies the training of 3D ConvNets in multiple stages, but automatically schedules the change of hyper-parameters through optimization planning.

3. Optimization Planning

3.1. Problem Formulation

The goal of optimization planning is to automate the learning strategy of 3D ConvNets. Formally, the optimization process of a 3D ConvNet can be represented as an optimization path $\mathcal{P} = \{S_0, S_1, \dots, S_N\}$, which consists of one initial state $S_0$ and $N$ intermediate states. Each intermediate state is assigned fixed hyper-parameters, and the training is performed with these $N$ different settings one by one. The number of training epochs for each setting is decided by $\mathcal{T} = \{t_1, t_2, \dots, t_N\}$, in which $t_i$ denotes the number of epochs when moving from $S_{i-1}$ to $S_i$. The hyper-parameters include the sampling strategy $\{cs, us\}$, the length of the input clip $\{l_1, l_2, \dots, l_{N_l}\}$ and the learning rate $\{r_1, r_2, \dots, r_{N_r}\}$, where $cs$ and $us$ denote consecutive sampling and uniform sampling from the input videos, respectively. As a result, there are $2 N_l N_r$ valid types of training states.
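As a concrete illustration of this formulation, a candidate state is simply a record of (sampling strategy, clip length, learning rate). The minimal Python sketch below (the `TrainState` name and the helper are ours, not from the paper) enumerates the $2 N_l N_r$ candidates for a grid such as the one later used for Kinetics-400 in Table 2.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class TrainState:
    """One training state: fixed sampling strategy, clip length and learning rate."""
    sampling: str      # "cs" (consecutive) or "us" (uniform)
    clip_len: int      # number of frames per input clip
    lr: float          # learning rate used while in this state

def enumerate_states(clip_lens, lrs):
    """Build the 2 * Nl * Nr candidate states (both sampling strategies)."""
    return [TrainState(s, l, r)
            for s, l, r in product(("cs", "us"), clip_lens, lrs)]

# Illustrative grid matching the Kinetics-400 settings reported in Table 2.
states = enumerate_states(clip_lens=(16, 32, 128), lrs=(0.04, 0.004, 0.0004))
assert len(states) == 2 * 3 * 3  # 18 candidate states
```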
The objective function of optimization planning is to seek the optimal strategy $\{\mathcal{P}, \mathcal{T}\}$ by maximizing the performance of the final state $S_N$:

$$\underset{\mathcal{P}, \mathcal{T}}{\text{maximize}} \;\; \mathcal{V}(S_N), \qquad (1)$$

where $\mathcal{V}(\cdot)$ is the performance metric, i.e., the mean accuracy on the validation set in our case.

3.2. Optimization Path

To plan the optimal permutation of training states, we first choose a final state $S_N$, which usually has a low learning rate and lengthy input clips. Then, the problem of seeking an optimal optimization path to $S_N$ is naturally decomposed into the subproblem of finding the optimization path to an intermediate state $S_i$ plus the state transition from $S_i$ to $S_N$. As such, the problem can be solved by dynamic programming. Formally, the solution of the optimization path $\mathcal{P}(S_N)$ can be given in a recursive form:

$$\mathcal{P}(S_N) = \{\mathcal{P}(S_{i^*}), S_N\}, \quad i^* = \underset{i}{\arg\max} \; \{\mathcal{V}(S_i \rightarrow S_N)\}. \qquad (2)$$

When executing the transfer from the state $S_i$ to the state $S_N$, we fine-tune the 3D ConvNet at the state $S_i$ by using the hyper-parameters of the state $S_N$. We then evaluate the fine-tuned model on the validation set to measure the priority of this transition, i.e., $\mathcal{V}(S_i \rightarrow S_N)$. We choose the state $S_{i^*}$, which achieves the highest priority of transition to the state $S_N$, as the preceding state of $S_N$. In other words, the optimal path for $S_N$ derives from the best-performing preceding state $S_{i^*}$. Here, we propose to pre-define all the valid transitions in a directed acyclic graph and determine the best optimization path of each state one by one in topological order.
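The following is a hedged sketch of how the recursion in Eq. (2) could be carried out over such a pre-defined DAG. The graph nodes can be, e.g., the `TrainState` records above; `finetune_and_eval` is a hypothetical callback that performs one state transition (fine-tuning the checkpoint of the preceding state with the hyper-parameters of the target state) and reports validation accuracy. This is our illustrative reading of the procedure, not the authors' released implementation.

```python
from graphlib import TopologicalSorter

def topological_order(graph):
    """Return the states in topological order; graph maps state -> successor states."""
    preds = {}
    for s, nexts in graph.items():
        preds.setdefault(s, [])
        for n in nexts:
            preds.setdefault(n, []).append(s)
    return list(TopologicalSorter(preds).static_order())

def plan_optimization_path(graph, initial, finals, finetune_and_eval):
    """Dynamic programming over the candidate transition graph (Eq. 2).

    graph: dict mapping each state to the list of states it may transit to.
    initial: the initial state S0 (here its checkpoint is represented as None).
    finals: the final states, one per sampling strategy.
    finetune_and_eval(ckpt, state) -> (val_acc, new_ckpt): runs one transition.
    """
    best = {initial: (0.0, None, None)}       # state -> (val acc, best predecessor, ckpt)
    for s in topological_order(graph):
        if s not in best:                     # unreachable from the initial state
            continue
        _, _, ckpt = best[s]
        for nxt in graph.get(s, []):          # explore every candidate transition s -> nxt
            acc, new_ckpt = finetune_and_eval(ckpt, nxt)
            if nxt not in best or acc > best[nxt][0]:
                best[nxt] = (acc, s, new_ckpt)    # keep the best-performing predecessor
    # pick the better of the final states and trace the optimal path back
    target = max(finals, key=lambda f: best[f][0])
    path = [target]
    while best[path[-1]][1] is not None:
        path.append(best[path[-1]][1])
    return list(reversed(path)), best[target][0]
```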
Figure 1. Examples of two kinds of transition graphs: (a) basic transition graph, (b) extended transition graph. The circles denote the candidate states and the arrows represent the candidate transitions. The ultimate model is the one with the higher accuracy of the two final states.

Figure 1(a) shows one example of the pre-defined transition graph. In the example, we set the number of candidate input clip lengths $N_l = 3$ and the number of candidate learning rates $N_r = 3$. Hence, there are $2 \times 3 \times 3 = 18$ candidate states. Next, we capitalize on the following principles to determine the possible transitions, i.e., the connections between states: (1) Transitions between states with different sampling strategies are forbidden; $S_9$ and $S_{18}$ are the final states for consecutive and uniform sampling, respectively. (2) The training only starts from a high learning rate and short input video clips. (3) An intermediate state can only be transferred to a new state in which either the learning rate is decreased or the length of the input clip is increased.

Please note that some very specific learning rate strategies, e.g., schedules with restart or warmup, show that increasing the learning rate properly may benefit training. Nevertheless, there is still no clear picture of when to increase the learning rate, and thus it is very difficult to automate these schedules. In the works that adaptively change the learning rate for 2D ConvNet training (Ge et al., 2019; Lang et al., 2019; Yaida, 2019), such cyclic schedules are also not taken into account. Similarly, in this work we only consider the schedule of decreasing the learning rate in the transition graph. The aforementioned principles simplify the transition graph and reduce the time cost when solving Eq. (2). We take this graph as the basic transition graph. Furthermore, we also build an extended transition graph by additionally allowing a single transition to increase the input clip length and decrease the learning rate simultaneously, as illustrated in Figure 1(b). In such a graph, the training strategies are more flexible.

3.3. State Transition

One state transition from $S_i$ to $S_j$ is defined as a training step that starts to optimize the model at $S_i$ by using the hyper-parameters of $S_j$. The question is then when this training step completes. Here, we derive the spirit from SASA (Lang et al., 2019), which trains the network with constant hyper-parameters until it reaches a stationary condition. SASA adaptively evaluates the convergence of stochastic gradient descent by Yaida's condition (Yaida, 2019) during training. However, in practice, the thoroughly optimized network does not always perform well on the validation set due to overfitting. Therefore, we take both convergence and overfitting into account, and propose to estimate the knee point on the performance-epoch curve evaluated on the validation set. This performs more steadily across various datasets.

Specifically, we measure the accuracy $y_t$ by evaluating the intermediate model after the $t$-th training epoch on the validation set. To estimate the knee point given a limited number of observations $y_t$, we fit the curve by a continuous function $f_\alpha(t)$ as

$$y_t = f_\alpha(t) + z_t, \quad z_t \sim \mathcal{N}(0, \sigma^2), \qquad (3)$$

where $z_t$ is a stochastic factor following a normal distribution, and $\alpha$ denotes the parameters of the function $f$. Here, we choose $f_\alpha(t)$ to be a unimodal function to ensure that there is only one maximum value. The curve fitting can be formulated as the optimization of the parameters $\alpha$ by minimizing the distance between the observed performance and the estimated performance:

$$\min_{\alpha} \; \sum_{t} \left( y_t - f_\alpha(t) \right)^2, \quad \text{s.t.} \; f_\alpha(t) \; \text{is unimodal.} \qquad (4)$$

We exploit the Trust Region Reflective algorithm (Branch et al., 1999) to solve this problem; the algorithm is robust for an arbitrary form of the function $f_\alpha(t)$.
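Below is a small illustrative sketch of Eqs. (3)-(4) using SciPy's Trust Region Reflective solver (`scipy.optimize.least_squares` with `method="trf"`), taking the exponential candidate $f_\alpha(t) = \alpha_1 + \alpha_2 e^{\alpha_3 t} + \alpha_4 t + \alpha_5 t^2$ (one of the forms compared below) and imposing its sign constraints through bounds. The stopping rule uses the delay parameter described in the next paragraph; function names and the initial guess are ours.

```python
import numpy as np
from scipy.optimize import least_squares

def f_alpha(alpha, t):
    """Exponential fitting candidate: a1 + a2*exp(a3*t) + a4*t + a5*t**2."""
    a1, a2, a3, a4, a5 = alpha
    return a1 + a2 * np.exp(a3 * t) + a4 * t + a5 * t ** 2

def fit_knee_point(val_acc):
    """Fit the performance-epoch curve (Eq. 4) and return the knee-point epoch t*.

    val_acc: validation accuracy measured after each training epoch.
    The constraints a2, a3, a5 < 0 keep the convergence term increasing and
    bounded and the overfitting term concave, so f_alpha stays unimodal.
    """
    t = np.arange(len(val_acc), dtype=float)
    y = np.asarray(val_acc, dtype=float)
    res = least_squares(
        lambda a: f_alpha(a, t) - y,                  # residuals of Eq. (4)
        x0=[y.max(), -0.5, -0.1, 1e-3, -1e-5],        # rough, arbitrary initial guess
        bounds=([-np.inf, -np.inf, -np.inf, -np.inf, -np.inf],
                [np.inf, 0.0, 0.0, np.inf, 0.0]),     # a2, a3, a5 <= 0
        method="trf",                                 # Trust Region Reflective
    )
    # Knee point: epoch maximizing the fitted curve on a dense (extrapolated) grid.
    dense = np.linspace(0.0, 2.0 * len(val_acc), 1000)
    return dense[np.argmax(f_alpha(res.x, dense))]

def should_stop(current_epoch, val_acc, delay=10):
    """Stop once the current epoch exceeds the estimated knee point by `delay` epochs."""
    t_star = fit_knee_point(val_acc)
    return current_epoch > t_star + delay, t_star
```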
To adaptively stop the iteration, we estimate the knee-point epoch $t^*$ by solving Eq. (4) after each training epoch. If the current epoch $t$ is larger than $t^* + T$, we stop the iteration and choose $t^*$ as the best epoch number. $T$ is a delay parameter which allows the model a $T$-epoch attempt even if $t > t^*$. We simply fix the delay parameter $T$ to 10 in all the experiments.

Next, the essential issue is the form of the fitting function $f_\alpha(t)$. We separate the function into two parts, $f_\alpha(t) = g_\alpha(t) + h_\alpha(t)$, where $g_\alpha(t)$ is an increasing bounded function to simulate the convergence of the model, and $h_\alpha(t)$ is a concave function to model the influence of overfitting. Table 1 shows four examples of the fitting function $f_\alpha(t)$. In the four functions, we fix $h_\alpha(t)$ as a quadratic function and exploit a power, multi-power, exponential and multi-exponential function as $g_\alpha(t)$, respectively. Please note that, for each function, some constraints are given to guarantee the properties of $g_\alpha(t)$ and $h_\alpha(t)$. We empirically validate the functions by pre-collecting 162 performance-epoch curves (Figure 2(a)) from the training processes of different networks on various datasets and employing the four functions to fit the curves by solving Eq. (4). Table 1 compares the average Root Mean Square Error (RMSE) and R-square when using the different functions. Figure 2(b) and Figure 2(c) further depict a fitting example in the context of training a model from scratch and fine-tuning a model, respectively. The general observation is that all four functions can nicely fit the performance-epoch curve and do not make a major difference on the final performance. Hence, we simply choose the best-performing exponential function in the rest of the paper.

Table 1. The comparisons of four fitting functions in terms of RMSE and R-Square.

| Fitting function $f_\alpha(t)$ | Constraints | RMSE | R-Square |
|---|---|---|---|
| power: $\alpha_1 + \alpha_2(t+1)^{\alpha_3} + \alpha_4 t + \alpha_5 t^2$ | $\alpha_2, \alpha_3, \alpha_5 < 0$ | $1.010 \times 10^{-3}$ | 0.356 |
| multi-power: $\alpha_1 + \alpha_2(t+1)^{\alpha_3} + \alpha_4(t+1)^{\alpha_5} + \alpha_6 t + \alpha_7 t^2$ | $\alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_7 < 0$ | $1.030 \times 10^{-3}$ | 0.320 |
| exponential: $\alpha_1 + \alpha_2 e^{\alpha_3 t} + \alpha_4 t + \alpha_5 t^2$ | $\alpha_2, \alpha_3, \alpha_5 < 0$ | $1.007 \times 10^{-3}$ | 0.360 |
| multi-exponential: $\alpha_1 + \alpha_2 e^{\alpha_3 t} + \alpha_4 e^{\alpha_5 t} + \alpha_6 t + \alpha_7 t^2$ | $\alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_7 < 0$ | $1.063 \times 10^{-3}$ | 0.350 |

Figure 2. Examples of (a) the collected performance-epoch curves, (b) the fitting results for training a model from scratch, and (c) the fitting results for fine-tuning a model.

4. 3D ConvNet Architecture

In this section, we present the proposed Dual-head Global-contextual Pseudo-3D (DG-P3D) network; Figure 3 shows an overview. In particular, the network originates from the residual network (He et al., 2016) and is further extended to a 3D manner with three designs, i.e., pseudo-3D convolution, global context and dual-head classifier.

Figure 3. An overview of our proposed Dual-head Global-contextual Pseudo-3D (DG-P3D) network. Here, we take the 16-frame input as an example and the size of the output feature map is also given for each layer.

Pseudo-3D convolution. To achieve a good tradeoff between accuracy and computational cost, the pseudo-3D convolution proposed in Qiu et al. (2017b) decomposes 3D learning into 2D convolutions in the spatial space plus 1D operations in the temporal dimension. A similar idea of decomposing 3D convolution is also presented in R(2+1)D (Tran et al., 2018) and S3D (Xie et al., 2018). In this paper, we choose the mixed P3D architecture with the highest performance in Qiu et al. (2017b), which interleaves three types of P3D blocks.
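For reference, here is a minimal PyTorch sketch of one decomposed (pseudo-3D) bottleneck block in this spirit: a 1×3×3 spatial convolution followed by a 3×1×1 temporal convolution inside a residual bottleneck. It shows only the serial variant; the actual P3D design interleaves three block types and differs in details, and the module and argument names are ours.

```python
import torch
import torch.nn as nn

class P3DBlock(nn.Module):
    """Bottleneck residual block with a decomposed 3D convolution:
    1x1x1 reduce -> 1x3x3 spatial conv -> 3x1x1 temporal conv -> 1x1x1 expand."""
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.reduce = nn.Conv3d(channels, bottleneck, kernel_size=1, bias=False)
        self.spatial = nn.Conv3d(bottleneck, bottleneck, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)   # 2D convolution in space
        self.temporal = nn.Conv3d(bottleneck, bottleneck, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)  # 1D convolution in time
        self.expand = nn.Conv3d(bottleneck, channels, kernel_size=1, bias=False)
        self.bns = nn.ModuleList([nn.BatchNorm3d(c) for c in
                                  (bottleneck, bottleneck, bottleneck, channels)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                     # x: (N, C, T, H, W)
        out = self.relu(self.bns[0](self.reduce(x)))
        out = self.relu(self.bns[1](self.spatial(out)))
        out = self.relu(self.bns[2](self.temporal(out)))
        out = self.bns[3](self.expand(out))
        return self.relu(out + x)             # residual connection

# e.g. a 16-frame clip of 56x56 feature maps with 256 channels
x = torch.randn(2, 256, 16, 56, 56)
print(P3DBlock(256, 64)(x).shape)             # torch.Size([2, 256, 16, 56, 56])
```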
Global context. The recent works on non-local networks (Wang et al., 2018c; Cao et al., 2019; Qiu et al., 2019) highlight a drawback of performing convolutions: each operation processes only a local window of neighboring pixels and lacks a holistic view of the field. To alleviate this limitation, we delve into global context to learn a global residual from the globally-pooled representation, which is then broadcast to each position in the feature map.

Dual-head classifier. 3D ConvNets are expected to have both spatial and temporal discrimination. For example, the SlowFast network (Feichtenhofer et al., 2019) contains an individual pathway for visual appearance and for temporal dynamics, respectively. Here, we uniquely propose a simpler way that builds a dual-head classifier on top of the network instead of the two-path structure in the SlowFast network. In this design, the temporal head with a large temporal dimension focuses on modeling the temporal evolution, and the spatial head with a large spatial resolution emphasizes the spatial discrimination. The predictions from the two heads are linearly fused. As such, our design costs less computation and is easier to implement.
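The sketch below illustrates the dual-head idea in PyTorch under simplifying assumptions: from a shared feature map, a temporal head keeps the temporal axis (pooling space) while a spatial head keeps the spatial resolution (pooling time), and the two logits are linearly fused. In DG-P3D the two heads actually operate on feature maps of different temporal/spatial sizes (see Figure 3); the layer choices and names here are ours and purely illustrative.

```python
import torch
import torch.nn as nn

class DualHeadClassifier(nn.Module):
    """Two classification heads on a shared 5D feature map (N, C, T, H, W)."""
    def __init__(self, channels, num_classes, fuse_weight=0.5):
        super().__init__()
        # temporal head: model evolution along T on a spatially pooled feature
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.temporal_fc = nn.Linear(channels, num_classes)
        # spatial head: refine appearance on a temporally pooled feature
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.spatial_fc = nn.Linear(channels, num_classes)
        self.fuse_weight = fuse_weight

    def forward(self, feat):                                # feat: (N, C, T, H, W)
        t = feat.mean(dim=(3, 4))                           # (N, C, T): keep the temporal axis
        t = torch.relu(self.temporal_conv(t)).mean(dim=2)   # temporal modeling, then pool T
        s = feat.mean(dim=2)                                # (N, C, H, W): keep spatial resolution
        s = torch.relu(self.spatial_conv(s)).mean(dim=(2, 3))
        return (self.fuse_weight * self.temporal_fc(t)
                + (1 - self.fuse_weight) * self.spatial_fc(s))

head = DualHeadClassifier(channels=2048, num_classes=400)
print(head(torch.randn(2, 2048, 4, 7, 7)).shape)            # torch.Size([2, 400])
```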
5. Experiments

5.1. Datasets

The experiments are conducted on the HMDB51, UCF101, ActivityNet, SS-V1/V2, Kinetics-400 and Kinetics-600 datasets. Table 2 details the information and settings on these datasets. HMDB51 (Kuehne et al., 2011), UCF101 (Soomro et al., 2012), Kinetics-400 (Carreira & Zisserman, 2017) and Kinetics-600 (Carreira et al., 2018) are the most popular video benchmarks for action recognition on trimmed video clips. The Something-Something V1 (SS-V1) dataset was first constructed in Goyal et al. (2017) to learn fine-grained human-object interactions, and was recently extended to Something-Something V2 (SS-V2). The ActivityNet (Caba Heilbron et al., 2015) dataset is an untrimmed video benchmark for activity recognition. The latest released version of the dataset (v1.3) is exploited. In our experiments, we only use the video-level labels of ActivityNet and disable the temporal annotations. Note that the labels of the test sets are not publicly available, and thus the performances on ActivityNet, SS-V1, SS-V2, Kinetics-400 and Kinetics-600 are all reported on the validation set. For optimization planning, the original training set of each dataset is split into two parts for learning the network weights and validating the performance, respectively. We construct this internal validation set with the same size as the original validation/test set. Note that the original validation/test set is never exploited in the optimization planning.

Table 2. The number of videos, the number of video categories, and the detailed settings for optimization planning on the HMDB51, UCF101, ActivityNet, SS-V1, SS-V2, Kinetics-400, and Kinetics-600 datasets.

| Dataset | #videos | #classes | l1 | l2 | l3 | r1 | r2 | r3 | dropout |
|---|---|---|---|---|---|---|---|---|---|
| HMDB51 | 6K | 51 | 16 | 32 | 64 | 0.01 | 0.001 | 0.0001 | 0.9 |
| UCF101 | 13K | 101 | 16 | 32 | 64 | 0.01 | 0.001 | 0.0001 | 0.9 |
| ActivityNet | 20K | 200 | 16 | 32 | 128 | 0.01 | 0.001 | 0.0001 | 0.9 |
| SS-V1 | 108K | 174 | 16 | 32 | – | 0.04 | 0.004 | 0.0004 | 0.5 |
| SS-V2 | 220K | 174 | 16 | 32 | – | 0.04 | 0.004 | 0.0004 | 0.5 |
| Kinetics-400 | 300K | 400 | 16 | 32 | 128 | 0.04 | 0.004 | 0.0004 | 0.5 |
| Kinetics-600 | 480K | 600 | 16 | 32 | 128 | 0.04 | 0.004 | 0.0004 | 0.5 |

5.2. Implementation Details

For optimization planning, we set the number of choices for both the input clip length ($N_l$) and the learning rate ($N_r$) to 3, and utilize the extended transition graph introduced in Section 3.2. The candidate values of the input clip length $\{l_1, l_2, l_3\}$ and the learning rate $\{r_1, r_2, r_3\}$ for each dataset are summarized in Table 2. Specifically, on the SS-V1, SS-V2, Kinetics-400 and Kinetics-600 datasets, the base learning rate is set to 0.04 and the dropout ratio is fixed to 0.5. For HMDB51, UCF101 and ActivityNet, we set a lower base learning rate and a higher dropout ratio due to the limited training samples. The maximum clip length is 64 for HMDB51 and UCF101, and is increased to 128 for ActivityNet, Kinetics-400 and Kinetics-600. Considering that the video clips in SS-V1 and SS-V2 are usually shorter than 64 frames, we only use two settings, i.e., 16-frame and 32-frame.

The network training is implemented on the PyTorch framework and mini-batch stochastic gradient descent is employed to tune the network. The resolution of the input clip is fixed to 224×224, randomly cropped from the video clip resized with the short side in [256, 340]. The clip is randomly flipped along the horizontal direction for data augmentation, except for SS-V1 and SS-V2 in view of their direction-related categories. Following the settings in Wang et al. (2018c); Qiu et al. (2019), for network training with long clips (64-frame and 128-frame), we freeze the parameters of all Batch Normalization layers except for the first one, since the batch size is too small for batch normalization.

There are two inference strategies for the evaluations. The first one roughly predicts the video label on a 224×224 single center crop from the central clip resized with the short side 256. This strategy is only used when planning the optimization, for the purpose of efficiency. Once the optimization path is fixed, we train the 3D ConvNet with the path and evaluate the learnt 3D ConvNet using the second strategy, i.e., the three-crop strategy as in Feichtenhofer et al. (2019), which crops three 256×256 regions from each video clip. The video-level prediction score is achieved by averaging all scores from ten uniformly sampled clips.

5.3. Evaluation of Optimization Planning

We first verify the effectiveness of our proposed optimization planning for 3D ConvNets by comparing it with hand-tuned strategies. To find the most powerful hand-tuned strategy, we capitalize on the popular practices in the literature and grid-search the training settings along four dimensions, i.e., input clip length, learning rate decay, sampling strategy and training epochs. Specifically, for the input clip length, we follow the common training scheme that first learns the network with short clips and then fine-tunes the network on lengthy clips, and experiment with three strategies: $l_1 \rightarrow l_3$, $l_2 \rightarrow l_3$ and $l_1 \rightarrow l_2 \rightarrow l_3$.
For each input clip length, we train the network with the same number of epochs. For learning rate decay, we choose the two most utilized strategies, i.e., 3-step learning rate decay (Wang et al., 2018c; Qiu et al., 2019) and cosine decay (Feichtenhofer et al., 2019). The optimal number of training epochs for each strategy is determined by grid-searching over [128, 192, 256, 320, 384] epochs.

Table 3. The comparisons between optimization planning (OP) and hand-tuned strategies (HS) with DG-P3D on the Kinetics-400 dataset. The number in brackets denotes the best number of epochs, which is obtained by grid search for the hand-tuned strategies and adaptively determined by our optimization planning.

| Method | Sampling | Clip length | 3-step decay | Cosine decay |
|---|---|---|---|---|
| HS | consecutive | l1→l3 | 77.3 (256) | 77.6 (192) |
| HS | consecutive | l2→l3 | 77.8 (320) | 78.0 (256) |
| HS | consecutive | l1→l2→l3 | 77.5 (320) | 77.9 (256) |
| HS | uniform | l1→l3 | 76.5 (192) | 76.9 (128) |
| HS | uniform | l2→l3 | 76.8 (256) | 76.9 (256) |
| HS | uniform | l1→l2→l3 | 76.8 (192) | 77.1 (192) |
| OP | – | – | 78.9 (220) | |

Table 3 shows the comparisons between optimization planning and hand-tuned strategies with the DG-P3D architecture on the Kinetics-400 dataset. The backbone network is derived from ResNet-50 pre-trained on the ImageNet dataset. The results consistently indicate that optimization planning exhibits better performance than hand-tuned strategies. In particular, training DG-P3D with optimization planning leads to a performance boost of 0.9% against the network learnt with the best-performing hand-tuned strategy. That basically verifies the advantage of dynamically determining the training strategy.

Table 4. The comparisons between optimization planning (OP) and hand-tuned strategies (HS) with different 3D ConvNets on the Kinetics-400 dataset. All the backbone networks are ResNet-50. For SlowFast, the two columns correspond to training with cosine decay and with the multigrid method, respectively.

| Network | HS | OP |
|---|---|---|
| SlowFast (Feichtenhofer et al., 2019) | 77.0 (cosine decay) | 77.6 (multigrid) |
| I3D (Carreira & Zisserman, 2017) | 75.6 | 76.4 (+0.8) |
| P3D (Qiu et al., 2017b) | 75.2 | 76.0 (+0.8) |
| R(2+1)D (Tran et al., 2018) | 75.1 | 76.2 (+1.1) |
| LGD-3D (Qiu et al., 2019) | 76.3 | 77.3 (+1.0) |
| G-P3D | 77.1 | 77.9 (+0.8) |
| DG-P3D | 78.0 | 78.9 (+0.9) |

Moreover, Table 4 summarizes the two training strategies on six different 3D ConvNets, i.e., I3D, P3D, R(2+1)D, LGD-3D, G-P3D and our DG-P3D. Among them, G-P3D extends the P3D network by employing global context, and DG-P3D further adds the dual-head classifier. Here, we also include the SlowFast network (Feichtenhofer et al., 2019) trained with either cosine learning rate decay or the multigrid method (Wu et al., 2020). Overall, optimization planning yields an absolute improvement over hand-tuned strategies of 0.8%-1.1% across the six 3D ConvNets, demonstrating its generalizability to different 3D architectures. With the same optimization planning strategy, the DG-P3D network achieves a 1.0% improvement over G-P3D, which validates the design of the dual-head classifier. Though both the multigrid method and optimization planning involve network training in stages with multiple hyper-parameters, they differ in that multigrid pre-defines the change of hyper-parameters in each stage, while optimization planning adaptively schedules such change.

Table 5. The comparisons between optimization planning and the hand-tuned strategy with DG-P3D on HMDB51 (split 1), UCF101 (split 1), ActivityNet, SS-V1, SS-V2 and Kinetics-400 datasets. The backbone is ResNet-50 pre-trained on ImageNet. The time cost for grid search/optimization planning is reported with 8 NVIDIA Titan V GPUs in parallel.

| Strategy | Metric | HMDB51 | UCF101 | ActivityNet | SS-V1 | SS-V2 | Kinetics-400 |
|---|---|---|---|---|---|---|---|
| Hand-tuned strategy | top-1 | 57.7 | 88.2 | 75.2 | 51.4 | 63.9 | 78.0 |
| Hand-tuned strategy | time cost | 83h | 158h | 166h | 540h | 1072h | 4057h |
| Optimization planning | top-1 | 58.4 (+0.7) | 89.1 (+0.9) | 76.5 (+1.3) | 52.8 (+1.4) | 65.5 (+1.6) | 78.9 (+0.9) |
| Optimization planning | time cost | 6h | 13h | 38h | 67h | 142h | 288h |

Figure 4. An example of the optimization planning procedure with the consecutive sampling strategy on the Kinetics-400 dataset. The procedure includes (a) the extended transition graph, (b) exploring all the candidate transitions, (c) seeking the optimal path by maximizing the performance of the final state, and (d) forming the optimization path. The clip-level accuracy is also given for each explored transition.
As indicated by our results, DG-P3D with optimization planning leads the accuracy by 1.3% against the SlowFast network with the multigrid strategy.

Taking our DG-P3D as the 3D ConvNet, Table 5 further details the comparisons between optimization planning and the hand-tuned strategy across six different datasets. The accuracy of the hand-tuned strategy is reported for the best training scheme found by grid search on each dataset. Such a best hand-tuned strategy can be considered a well-tuned DG-P3D model without optimization planning. The time cost of optimization planning contains the training time for exploring all the possible transitions, and that of the hand-tuned strategy is measured by grid-searching the candidate training strategies. Compared to the hand-tuned strategy, optimization planning shows consistent improvements across the different datasets, and requires much less time than the exhaustive grid search, due to the adaptive determination of the training scheme.

5.4. Qualitative Analysis of Optimization Planning

Figure 4 illustrates the process of optimization planning with the consecutive sampling strategy on the Kinetics-400 dataset. The candidate transitions between states are explored in topological order, and the optimal path of each state derives from the best-performing preceding state. We also examine how the optimization path impacts the performance and experiment with some variant paths on Kinetics-400, which are built by either inserting an additional state or skipping an intermediate state in our adopted optimization path. For fair comparisons, the numbers of epochs in these variant paths are re-determined by the algorithm in Section 3.3. The results indicate that inserting and skipping one state result in an accuracy decrease of 0.2%-0.6% and 0.3%-1.0%, respectively, which verifies the impact of optimization planning.

Figure 5. The optimization paths produced by optimization planning on (a) ActivityNet, (b) SS-V1, (c) SS-V2, (d) Kinetics-400. The red edges represent the state transitions in the optimization path, and the black edges denote the transitions that have been explored but not selected in the final optimization path. The optimal number of training epochs is also given for each transition in the path.

Next, Figure 5 depicts the optimization paths on different datasets. An interesting observation is that SS-V1/V2 tend to select uniform sampling while Kinetics-400 prefers consecutive sampling. We speculate that this may be the result of different emphases of the two sampling strategies.
In general, the most distinctive property of uniform sampling is that it captures the completeness of a video with only a small number of sampled frames. In contrast, consecutive sampling emphasizes the continuity in a video but may only focus on a part of the video content. The SS-V1/V2 datasets consist of fine-grained interactions, and the differentiation between these interactions relies more on the completeness of an action. For example, it is almost impossible to distinguish videos of the category "Pushing something so that it falls off the table" from those of "Pushing something so that it almost falls off but doesn't" based on only part of the video content. In other words, uniform sampling offers the completeness of a video and benefits the recognition on SS-V1/V2. Instead, the videos in Kinetics-400 usually come with static scenes or slow motion. Hence, completeness may not be essential in this case, but consecutive sampling encodes the continuous changes across frames and thus captures the spatio-temporal relation better.

5.5. Comparisons with State-of-the-Art

We compare with several state-of-the-art techniques on the HMDB51, UCF101 and ActivityNet datasets. The performance comparisons are summarized in Table 6. The backbone of DG-P3D is either ResNet-50 or ResNet-101 pre-trained on ImageNet. Please note that most recent works employ Kinetics-400 pre-training to improve the accuracy.

Table 6. Comparisons with the state-of-the-art methods on (a) HMDB51 (3 splits) & UCF101 (3 splits) and (b) ActivityNet with RGB input.

(a) HMDB51 (H51) & UCF101 (U101)

| Method | Backbone | H51 | U101 |
|---|---|---|---|
| I3D (Carreira & Zisserman, 2017) | BN-Inception | 74.5 | 95.4 |
| ARTNet (Wang et al., 2018a) | BN-Inception | 70.9 | 94.3 |
| ResNeXt (Hara et al., 2018) | ResNeXt-101 | 70.2 | 94.5 |
| R(2+1)D (Tran et al., 2018) | ResNet-34 | 74.5 | 96.8 |
| S3D-G (Xie et al., 2018) | BN-Inception | 75.9 | 96.8 |
| STM (Jiang et al., 2019) | ResNet-50 | 72.2 | 96.2 |
| LGD-3D (Qiu et al., 2019) | ResNet-101 | 75.7 | 97.0 |
| DG-P3D | ResNet-50 | 79.2 | 97.6 |
| DG-P3D | ResNet-101 | 80.4 | 97.9 |

(b) ActivityNet

| Method | Backbone | +K400 | Top-1 |
|---|---|---|---|
| TSN (Wang et al., 2018b) | BN-Inception | | 72.9 |
| RRA (Zhu et al., 2018) | ResNet-152 | | 78.8 |
| MARL (Wu et al., 2019) | ResNet-152 | | 79.8 |
| TSN (Wang et al., 2018b) | BN-Inception | ✓ | 78.9 |
| MARL (Wu et al., 2019) | SE-ResNeXt-152 | ✓ | 85.7 |
| DG-P3D | ResNet-50 | | 76.5 |
| DG-P3D | ResNet-50 | ✓ | 85.9 |
| DG-P3D | ResNet-101 | | 77.8 |
| DG-P3D | ResNet-101 | ✓ | 86.8 |
Here, we also adopt the two-step strategy that first trains DG-P3D on Kinetics-400 (K400) and then fine-tunes the network on the target dataset. Both steps are trained with optimization planning. Overall, DG-P3D achieves the highest performance on all three datasets, i.e., 80.4% on HMDB51, 97.9% on UCF101 and 86.8% on ActivityNet. In particular, DG-P3D outperforms the other 3D ConvNets, I3D, R(2+1)D, S3D-G and LGD-3D, by 5.9%, 5.9%, 4.5% and 4.7% on HMDB51, respectively. The results again verify the merit of the learnt 3D ConvNets. For ActivityNet, most baselines utilize the temporal annotations to locate the foreground segments in the untrimmed videos. In our experiments, we only use the video-level annotations and our DG-P3D still surpasses the best competitor MARL by 1.1%.

Then, we turn to evaluate DG-P3D with optimization planning on four large-scale datasets, i.e., SS-V1, SS-V2, Kinetics-400 and Kinetics-600. To reduce the cost, the optimization path found on Kinetics-400 is directly utilized as the path for Kinetics-600. The top-1 and top-5 accuracies on the four datasets are reported in Table 7 and Table 8.

Table 7. Performance comparisons with the state-of-the-art methods on SS-V1 and SS-V2 with RGB input. All the backbone networks are ResNet-50.

| Method | Pre-train | SS-V1 Top-1 | SS-V1 Top-5 | SS-V2 Top-1 | SS-V2 Top-5 |
|---|---|---|---|---|---|
| NL I3D+GCN (Wang & Gupta, 2018) | Kinetics | 46.1 | 76.8 | – | – |
| TSM (Lin et al., 2019) | Kinetics | 47.2 | 77.1 | 63.4 | 88.5 |
| bLVNet-TAM (Fan et al., 2019) | ImageNet | 48.4 | 78.8 | 61.7 | 88.1 |
| ABM-C-in (Zhu et al., 2019) | ImageNet | 49.8 | – | 61.2 | – |
| I3D+RSTG (Nicolicioiu et al., 2019) | Kinetics | 49.2 | 78.8 | – | – |
| GST (Luo & Yuille, 2019) | ImageNet | 48.6 | 77.9 | 62.6 | 87.9 |
| STDFB (Martínez et al., 2019) | ImageNet | 50.1 | 79.5 | – | – |
| STM (Jiang et al., 2019) | ImageNet | 50.7 | 80.4 | 64.2 | 89.8 |
| TEA (Li et al., 2020b) | ImageNet | 51.9 | 80.3 | – | – |
| ASS (Li et al., 2020a) | ImageNet | 51.4 | – | 63.5 | – |
| DG-P3D | ImageNet | 52.8 | 81.8 | 65.5 | 90.3 |

Table 8. Comparisons with state-of-the-art methods on Kinetics-400 & Kinetics-600 with RGB input. The computational complexity is measured in GFLOPs × views, where the views represent the number of clips sampled from the full video during inference. *Since it is not fair to directly compare irCSN pre-trained on IG65M (65M web videos) with other methods, here we report the performance of irCSN pre-trained on Sports1M.

| Method | Backbone | GFLOPs × views | Kinetics-400 (top-1/5) | Kinetics-600 (top-1/5) |
|---|---|---|---|---|
| I3D | BN-Inception | 108 × N/A | 72.1/90.3 | – |
| R(2+1)D | custom | 152 × 115 | 74.3/91.4 | – |
| S3D-G | BN-Inception | 66.4 × N/A | 74.7/93.4 | – |
| NL I3D | ResNet-101 | 359 × 30 | 77.7/93.3 | – |
| LGD-3D | ResNet-101 | 195 × N/A | 79.4/94.4 | 81.5/95.6 |
| X3D-XL | custom | 48.4 × 30 | 79.1/93.9 | 81.9/95.5 |
| irCSN* | custom | 96.7 × 30 | 79.0/93.5 | – |
| SlowFast | ResNet-50 | 65.7 × 30 | 77.0/92.6 | 79.9/94.5 |
| SlowFast | ResNet-101 | 213 × 30 | 78.9/93.5 | 81.1/95.1 |
| SlowFast | ResNet-101+NL | 234 × 30 | 79.8/93.9 | 81.8/95.1 |
| DG-P3D | ResNet-50 | 123 × 30 | 78.9/93.9 | 81.6/95.6 |
| DG-P3D | ResNet-101 | 218 × 30 | 80.5/94.6 | 82.7/95.8 |

Specifically, DG-P3D achieves the highest top-1 accuracy of 52.8% on SS-V1 and 65.5% on SS-V2. DG-P3D is superior to TEA and STM, which report the best known results, by 0.9% and 1.3%, respectively. On Kinetics-400, the top-1 accuracy of DG-P3D reaches 80.5%, which improves over the recent 3D ConvNets irCSN (Tran et al., 2019), X3D-XL (Feichtenhofer, 2020), LGD-3D (Qiu et al., 2019) and SlowFast (Feichtenhofer et al., 2019) by 1.5%, 1.4%, 1.1% and 0.7%, respectively. Similar performance trends are also observed on Kinetics-600: DG-P3D achieves 82.7% top-1 accuracy, a performance boost of 0.8% over the best competitor X3D-XL.

For fair comparisons with the two-stream baselines on the Kinetics datasets, we additionally consider the flow modality, using two-direction optical flow images extracted by the TV-L1 algorithm (Zach et al., 2007), and directly use the optimal path found on the RGB modality as that on the flow modality. Table 9 summarizes the performance comparisons.

Table 9. Comparisons with state-of-the-art methods on Kinetics-400 & Kinetics-600 datasets with the input of the flow modality, and the fusion of frame and flow modalities.

| Method | Backbone | K400 Flow (top-1/5) | K400 Fusion (top-1/5) | K600 Flow (top-1/5) | K600 Fusion (top-1/5) |
|---|---|---|---|---|---|
| I3D | BN-Inception | 65.3/86.2 | 75.7/92.0 | – | – |
| R(2+1)D | custom | 68.5/88.1 | 75.4/91.9 | – | – |
| S3D-G | BN-Inception | 68.0/87.6 | 77.2/93.0 | – | – |
| LGD-3D | ResNet-101 | 72.3/90.9 | 81.3/95.2 | 75.0/92.4 | 83.1/96.2 |
| DG-P3D | ResNet-50 | 72.0/90.9 | 80.4/94.7 | 74.8/93.2 | 82.7/95.9 |
| DG-P3D | ResNet-101 | 73.2/91.2 | 82.6/96.1 | 76.7/93.4 | 84.3/96.6 |

Overall, the two-stream DG-P3D achieves 82.6%/84.3% top-1 accuracy, leading the performance by 1.3%/1.2% against two-stream LGD-3D on Kinetics-400 and Kinetics-600, respectively.
The results also validate the use of the optimization path learnt on the frame modality for the flow modality and two-stream structures.

6. Conclusion

We have presented optimization planning, which aims to automate the training scheme of 3D ConvNets. Particularly, a training process is decided by a sequence of training states, namely the optimization path, plus the number of training epochs for each state. We specify the hyper-parameters in each state, and the permutation of states determines the changes of hyper-parameters. Technically, we propose a dynamic programming method to seek the best optimization path in the candidate transition graph, and each state transition is stopped adaptively by estimating the knee point on the performance-epoch curve. Furthermore, we devise a new 3D ConvNet, i.e., DG-P3D, with a unique design of the dual-head classifier. The results on seven video benchmarks, which differ in terms of data scale, target categories and video duration, validate our proposal. Notably, DG-P3D with optimization planning obtains superior performance on all seven datasets.

Acknowledgments. This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0108600.

References

Branch, M. A., Coleman, T. F., and Li, Y. A subspace, interior, and conjugate gradient method for large-scale bound-constrained minimization problems. SIAM Journal on Scientific Computing, 21:1-23, 1999.

Caba Heilbron, F., Escorcia, V., Ghanem, B., and Carlos Niebles, J. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.

Cao, Y., Xu, J., Lin, S., Wei, F., and Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In ICCV Workshop, 2019.

Carreira, J. and Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.

Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., and Zisserman, A. A short note about Kinetics-600. arXiv:1808.01340, 2018.

Chee, J. and Toulis, P. Convergence diagnostics for stochastic gradient descent with constant learning rate. In AISTATS, 2018.

Fan, Q., Chen, C.-F., Kuehne, H., Pistoia, M., and Cox, D. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In NeurIPS, 2019.

Feichtenhofer, C. X3D: Expanding architectures for efficient video recognition. In CVPR, 2020.

Feichtenhofer, C., Pinz, A., and Zisserman, A. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.

Feichtenhofer, C., Fan, H., Malik, J., and He, K. SlowFast networks for video recognition. In ICCV, 2019.

Ge, R., Kakade, S. M., Kidambi, R., and Netrapalli, P. The step decay schedule: A near optimal, geometrically decaying learning rate procedure. arXiv:1904.12838, 2019.

Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fründ, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., and Memisevic, R. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.

Hara, K., Kataoka, H., and Satoh, Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Ji, S., Xu, W., Yang, M., and Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. on PAMI, 35(1):221-231, 2013.

Jiang, B., Wang, M., Gan, W., Wu, W., and Yan, J.
STM: Spatiotemporal and motion encoding for action recognition. In ICCV, 2019.

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. Large-scale video classification with convolutional neural networks. In CVPR, 2014.

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. HMDB: A large video database for human motion recognition. In ICCV, 2011.

Lang, H., Xiao, L., and Zhang, P. Using statistics to automate stochastic optimization. In NeurIPS, 2019.

Li, D., Qiu, Z., Dai, Q., Yao, T., and Mei, T. Recurrent tubelet proposal and recognition networks for action detection. In ECCV, 2018.

Li, D., Yao, T., Qiu, Z., Li, H., and Mei, T. Long short-term relation networks for video action detection. In ACM MM, 2019.

Li, D., Qiu, Z., Pan, Y., Yao, T., Li, H., and Mei, T. Representing videos as discriminative sub-graphs for action recognition. In CVPR, 2021.

Li, H., Zheng, W.-S., Tao, Y., Hu, H., and Lai, J.-H. Adaptive interaction modeling via graph operations search. In CVPR, 2020a.

Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., and Wang, L. TEA: Temporal excitation and aggregation for action recognition. In CVPR, 2020b.

Lin, J., Gan, C., and Han, S. TSM: Temporal shift module for efficient video understanding. In ICCV, 2019.

Long, F., Yao, T., Qiu, Z., Tian, X., Luo, J., and Mei, T. Gaussian temporal awareness networks for action localization. In CVPR, 2019a.

Long, F., Yao, T., Qiu, Z., Tian, X., Mei, T., and Luo, J. Coarse-to-fine localization of temporal action proposals. IEEE Trans. on MM, 22(6):1577-1590, 2019b.

Long, F., Yao, T., Qiu, Z., Tian, X., Luo, J., and Mei, T. Learning to localize actions from moments. In ECCV, 2020.

Luo, C. and Yuille, A. L. Grouped spatial-temporal aggregation for efficient action recognition. In ICCV, 2019.

Martínez, B., Modolo, D., Xiong, Y., and Tighe, J. Action recognition with spatial-temporal discriminative filter banks. In ICCV, 2019.

Nicolicioiu, A. L., Duta, I., and Leordeanu, M. Recurrent space-time graph neural networks. In NeurIPS, 2019.

Qiu, Z., Yao, T., and Mei, T. Deep quantization: Encoding convolutional activations with deep generative model. In CVPR, 2017a.

Qiu, Z., Yao, T., and Mei, T. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, 2017b.

Qiu, Z., Yao, T., and Mei, T. Learning deep spatio-temporal dependence for semantic video segmentation. IEEE Trans. on MM, 20(4):939-949, 2017c.

Qiu, Z., Yao, T., Ngo, C.-W., Tian, X., and Mei, T. Learning spatio-temporal representation with local and global diffusion. In CVPR, 2019.

Qiu, Z., Yao, T., Ngo, C.-W., Zhang, X.-P., Wu, D., and Mei, T. Boosting video representation learning with multi-faceted integration. In CVPR, 2021.

Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.

Soomro, K., Zamir, A. R., and Shah, M. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01, 2012.

Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.

Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.

Tran, D., Wang, H., Torresani, L., and Feiszli, M. Video classification with channel-separated convolutional networks. In ICCV, 2019.

Varol, G., Laptev, I., and Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. on PAMI, 40(6):1510-1517, 2018.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.

Wang, L., Li, W., Li, W., and Van Gool, L. Appearance-and-relation networks for video classification. In CVPR, 2018a.

Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. Temporal segment networks for action recognition in videos. IEEE Trans. on PAMI, 2018b.

Wang, X. and Gupta, A. Videos as space-time region graphs. In ECCV, 2018.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In CVPR, 2018c.

Wu, C.-Y., Girshick, R., He, K., Feichtenhofer, C., and Krähenbühl, P. A multigrid method for efficiently training video models. In CVPR, 2020.

Wu, W., He, D., Tan, X., Chen, S., and Wen, S. Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition. In ICCV, 2019.

Xie, S., Sun, C., Huang, J., Tu, Z., and Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.

Yaida, S. Fluctuation-dissipation relations for stochastic gradient descent. In ICLR, 2019.

Zach, C., Pock, T., and Bischof, H. A duality based approach for realtime TV-L1 optical flow. Pattern Recognition, 2007.

Zhu, C., Tan, X., Zhou, F., Liu, X., Yue, K., Ding, E., and Ma, Y. Fine-grained video categorization with redundancy reduction attention. In ECCV, 2018.

Zhu, X., Xu, C., Hui, L., Lu, C., and Tao, D. Approximated bilinear modules for temporal modeling. In ICCV, 2019.