# Video Prediction via Selective Sampling

Jingwei Xu, Bingbing Ni (Corresponding Author), Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute
SJTU-UCLA Joint Research Center on Machine Perception and Inference, Shanghai Jiao Tong University, Shanghai 200240, China
Shanghai Institute for Advanced Communication and Data Science
xjwxjw,nibingbing,xkyang@sjtu.edu.cn

Abstract

Most adversarial learning based video prediction methods suffer from image blur, since the commonly used adversarial and regression losses work in a competitive rather than collaborative way, yielding compromised, blurry results. Moreover, as they often rely on a single-pass architecture, the predictor is unable to explicitly capture the forthcoming uncertainty. Our work builds on two key insights: (1) Video prediction can be approached as a stochastic process: we sample a collection of proposals conforming to the possible frame distribution at the following time stamp, and select the final prediction from it. (2) De-coupling combined loss functions into dedicatedly designed sub-networks encourages them to work collaboratively. Combining these two insights, we propose a two-stage framework called VPSS (Video Prediction via Selective Sampling). Specifically, a Sampling module produces a collection of high quality proposals, facilitated by a multiple choice adversarial learning scheme that yields a diverse frame proposal set. Subsequently, a Selection module selects high-possibility candidates from the proposals and combines them to produce the final prediction. Extensive experiments on diverse challenging datasets demonstrate the effectiveness of the proposed video prediction approach, i.e., it yields more diverse proposals and more accurate predictions.

1 Introduction

Video prediction has been receiving increasing research attention in computer vision [3, 29, 10, 8, 19] and has great potential in applications such as future decision making, robot manipulation and autonomous driving [20]. Previous methods [10, 19] trained with a pixel-wise regression loss tend to produce blurry results, as they seek the average of possible futures [29]. To enhance generation quality, some works [8, 34] utilize adversarial learning [13] to facilitate the video prediction task, i.e., adding an adversarial loss [13] on the prediction module. However, pairing regression and adversarial losses still cannot solve the image blur and motion deformation problems in principle. A generator often struggles to balance the adversarial [13] and regression losses during training [18, 2, 5], and thus most likely yields an averaged result. In the worst case, either the adversarial [13] or the regression loss takes the dominant place, preventing the other term from playing its role. In other words, the two loss functions work in a competitive rather than collaborative way. This motivates two questions: (1) To address the blur issue, is it possible to sample a collection of high quality proposals conforming to the possible frame distribution at the following time stamp, and select the final prediction from it? (2) To encourage collaboration between loss functions, is it possible to design dedicated sub-networks for the adversarial and regression losses respectively?

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Table 1: Competition between adversarial and regression loss in terms of $\lambda L_{Adv} + L_{Reg}$.

| Model | MCNet [34] | DrNet [8] | SAVP [25] |
| --- | --- | --- | --- |
| λ | 0.02 / 0.2 / 0.5 | 1e-4 / 1e-3 / 4e-3 | 0.01 / 0.1 / 0.3 |
| $L_{Reg}$ after 50K iterations | 0.04 / 0.05 / 0.07 | 0.07 / 0.09 / 0.12 | 0.02 / 0.03 / 0.05 |
Motivated by these issues, we propose a two-stage framework called VPSS, which utilizes different modules to handle the two losses separately, thereby encouraging collaboration between them instead of competition. As shown in Figure 1, the Sampling module produces multiple high quality video frame proposals by making use of a multiple choice adversarial learning scheme, yielding a diverse video prediction set. This module is trained in an adversarial manner [5]. The Selection module selects high-possibility candidates from the proposals and combines them to produce the final prediction, according to the criterion of better position matching. By contrast, the Selection module is trained with a regression loss. We conduct both qualitative and quantitative experiments on diverse datasets, ranging from moving digits and human motion to robotic arm manipulation, including a visualization experiment. Our results clearly indicate higher visual quality and more precise motion prediction, even for complex motion patterns and long-term prediction, significantly outperforming prior art. Experiments show that the two modules cooperate well with each other under the VPSS framework.

2 Related work

Video Prediction. Much previous work has addressed the video prediction task. Srivastava et al. [33] first propose to use an LSTM [16] to predict future frames. Shi et al. [32] propose the ConvLSTM architecture, dedicated to preserving spatial information, which is a big step forward on this task. Different from ConvLSTM, DFN [19] generates convolution kernels according to the inputs, which offers more flexibility in modelling motion variation. CDNA [10] combines the advantages of ConvLSTM [32] and DFN [19] to further utilize spatial information. However, all of the above methods suffer from image blur and structure deformation because of the regression loss. To enhance image quality, DrNet [8] and MCNet [34] utilize adversarial training to generate sharper frames, yet find it hard to balance the adversarial and regression losses. Some works [35, 20, 31] further push prediction models toward long-term prediction. More recently, stochastic prediction methods [7, 3, 25] use a learned prior or random noise to predict in a stochastic way, which generates visually appealing predictions but with stochastic motion (i.e., location variation and shape transformation are random). Note that, different from these stochastic prediction methods, our model first generates multiple proposals and then precisely infers the final outcome from them, which yields high-precision motion prediction.

Competition between Adversarial and Regression Loss. This issue has been widely discussed in image generation and translation tasks [18, 36]. In the image domain it is commonly treated as the competition between generation diversity (i.e., related to the GAN [13] loss) and visual quality (i.e., related to the regression loss). VAE-GAN [24] first proposes to combine a VAE and a GAN [13] by replacing element-wise errors with feature-wise errors to better capture the data distribution. Some works approach this problem from an information-entropy view, such as InfoGAN [6], ALI [9] and ALICE [26], which mainly try to find a deterministic mapping between the image domain and the latent code domain.
Based on VAE-GAN [24], BEGAN [4] proposes a new equilibrium enforcing method to pursue fast and stable training and high visual quality. Different from the image generation task, video prediction further requires precise motion prediction along the time axis. Accordingly, the competition in the video domain lies between visual quality and motion prediction precision, yet few methods have been proposed to address this issue. Previous models [34, 8, 25] try to balance the adversarial [13] loss (denoted as $L_{Adv}$) and the regression loss (denoted as $L_{Reg}$) with a hyper-parameter λ, i.e., $\lambda L_{Adv} + L_{Reg}$. To demonstrate the competition intuitively, we run three open-source codes with their default training setups (MCNet: https://github.com/rubenvillegas/iclr2017mcnet; DrNet: https://github.com/edenton/drnet; SAVP: https://github.com/alexlee-gk/video_prediction), only altering λ with different values. As shown in Table 1, as λ is progressively enlarged, the convergence value of $L_{Reg}$ keeps increasing. This indicates that it is impractical to force the prediction module to satisfy both sides, which directly motivates us to explore this task from a different view.

Figure 1: The general framework of VPSS. At the Sampling stage, our framework contains a conditional sampling module which produces N high-quality proposals at time stamp t+1 based on inputs at time stamp t. At the Selection stage, we propose a selection module for final prediction. This module further chooses K candidates from the high-quality proposals and fuses them to produce the final prediction.

3 Method

We propose a novel framework called VPSS with two modules, sampling and selection. In this section we discuss this framework and the detailed architecture of both modules. Meanwhile, we demonstrate how the two modules cooperate effectively with each other.

3.1 Sampling Module

Previous methods [25, 3] aim to model future uncertainty with temporally correlated noise, which is sampled from a prior distribution and then passed through a recurrent neural network (e.g., GRU, LSTM). However, random-noise-based stochasticity, which lacks correlation with the inputs, is insufficient to model the upcoming uncertainty explicitly. Inspired by previous work [30, 5], we show that video prediction can be approached as a stochastic process: we gather a random collection of high quality proposals in one shot, with a multiple choice adversarial learning scheme that encourages diversity within the collection.

More formally, consider single-step forward prediction, where we use the current input (denoted as $x_t \in \mathbb{R}^{L \times H \times C}$) to predict the next time-stamp frame (denoted as $x_{t+1} \in \mathbb{R}^{L \times H \times C}$), where L, H, C stand for image width, height and number of channels respectively. Recall that $x_t$ is first fed into a sampler $\phi_{spl}$ to produce a collection of proposals (denoted as $\hat{X}_{t+1} = \{\hat{x}^{1}_{t+1}, \dots, \hat{x}^{N}_{t+1}\}$), where N is the desired number of proposals. For a single prediction the number of final output channels in $\phi_{spl}$ is C, so for N proposals we directly enlarge it to $C \cdot N$, where each consecutive C-tuple of channels forms a proposal, and we denote the corresponding kernels as $W = \{W^{i}\}_{i=1}^{N}$. To guarantee the visual quality of the proposals, we utilize adversarial learning to facilitate the sampling procedure:

$$L_{spl}(S, D) = \sum_{t=1}^{T} \Big( \mathbb{E}_{X}\big[\log D(x_{t+1})\big] + \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{X}\big[\log\big(1 - D(\hat{x}^{i}_{t+1})\big)\big] \Big), \tag{1}$$

where D is a discriminator capable of distinguishing sampled images from real images.
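To make the multi-proposal sampler concrete, the sketch below is our own illustration (not the released implementation; the paper uses TensorFlow, whereas this snippet uses PyTorch-style modules for brevity, and names such as `Sampler` and `gan_value` along with all layer sizes are hypothetical). It shows the key idea: the final convolution has C·N output channels, each consecutive C-tuple of channels is reshaped into one proposal, and the proposals enter a standard GAN value function in the spirit of Eq. (1).

```python
# Illustrative sketch only: a conv head with C*N output channels reshaped into N
# proposals, plus a GAN value function in the spirit of Eq. (1). The real sampler
# uses Res-Blocks and dynamic kernels; this stripped-down version keeps just the idea.
import torch
import torch.nn as nn

class Sampler(nn.Module):
    def __init__(self, channels=3, num_proposals=5, hidden=64):
        super().__init__()
        self.N, self.C = num_proposals, channels
        self.body = nn.Sequential(                      # stand-in for the Res-Block encoder/decoder
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # a single prediction would need C output channels; for N proposals enlarge to C*N
        self.head = nn.Conv2d(hidden, channels * num_proposals, 3, padding=1)

    def forward(self, x_t):                             # x_t: (B, C, H, W)
        out = torch.sigmoid(self.head(self.body(x_t)))  # (B, C*N, H, W)
        B, _, H, W = out.shape
        return out.view(B, self.N, self.C, H, W)        # each consecutive C-tuple is one proposal

def gan_value(disc, x_next, proposals, eps=1e-8):
    """Value function in the spirit of Eq. (1); D ascends it, the sampler plays against it.
    disc maps an image batch to probabilities in (0, 1)."""
    real = torch.log(disc(x_next) + eps).mean()
    fake = torch.stack([torch.log(1 - disc(proposals[:, i]) + eps).mean()
                        for i in range(proposals.shape[1])]).mean()  # 1/N average over proposals
    return real + fake
```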
Naturally, we hope these proposals are diverse enough for further selection. However, using only the same adversarial loss drives all proposals to be identical, which departs from this requirement of diversity. To this end, we propose a kernel generator $\phi_{KGen}$. At time stamp t, the kernel generator takes the previous frames $x_t$ and $x_{t-1}$ as input, and its output is $\tilde{W} = \{\tilde{W}^{i}\}_{i=1}^{N}$. As shown in Figure 1 (black dashed lines), there is a one-to-one correspondence between $\tilde{W}^{i}$ and $\hat{x}^{i}_{t+1}$. We denote the m-th input channel and n-th output channel of W as W(m, n). Similarly, x(n) represents the n-th channel of x, and W(·, n) represents the n-th output channel along with all input channels of W.

Figure 2: Detailed architecture of the Sampling Module. The left part is the proposed kernel generator, which produces multiple kernels; the right part is the sampler, which samples a collection of high quality proposals.

As shown in Figure 2, the kernel generation procedure first performs a channel-wise subtraction to obtain temporal variation information, and then encodes it into the current convolution kernel with a CNN $\phi_W$. Meanwhile, we perform a channel-wise softmax [28] along the input channel dimension:

$$\hat{W}(m, n) = \phi_{W}\big(x_t(m), x_{t-1}(n)\big), \quad m, n = 1, \dots, C, \tag{2}$$

$$\tilde{W}(\cdot, n) = \mathrm{Softmax}\big(\hat{W}(\cdot, n)\big), \quad n = 1, \dots, C, \tag{3}$$

where C is the number of channels. The sampling procedure at time stamp t is executed as follows:

$$\tilde{W} = \phi_{KGen}(x_t, x_{t-1}), \qquad \hat{x}^{i}_{t+1} = \phi_{spl}\big(x_t;\, W^{i} + \tilde{W}^{i}\big), \quad i = 1, \dots, N. \tag{4}$$

Through the kernel generator we transform the proposal diversity problem into one of kernel diversity, i.e., diversity of $\{\tilde{W}^{i}\}_{i=1}^{N}$. Inspired by Guzman-Rivera et al. [14], we develop a multiple choice adversarial learning scheme:

$$L_{KGen} = \sum_{t=2}^{T} \mathbb{E}_{\hat{X}_{t+1}}\big[\log D(\hat{x}^{k}_{t+1})\big], \qquad k = \arg\min_{i} \|x_{t+1} - \hat{x}^{i}_{t+1}\|_{1}. \tag{5}$$

This loss function is a critical component of the sampling module, and the diversity essentially results from the min operation in Equation 5. To be specific, at each time stamp we only update the kernel corresponding to the best sampled image, so the N kernels ($\{\tilde{W}^{i}\}_{i=1}^{N}$) are optimized independently toward different directions throughout the training procedure. This asynchronous updating scheme encourages the kernel generator to spread its bets and cover the exploration space of predictions that conform to all possible frame distributions. Intuitively, the kernel generator $\phi_{KGen}$ is encouraged to capture motion information from previous inputs and infer the motion direction at the next time stamp without the constraint of the regression loss, while image quality is guaranteed by the adversarial loss (Equation 1). More importantly, different from stochastic prediction methods [3, 7, 25] whose diversity comes from random noise, the diversity of our model is derived entirely from previous inputs, which is more rational in principle and inherently useful for further selection.
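A minimal sketch of the multiple choice ("winner-take-all") update of Eq. (5) may help. It is our own illustration under the same assumptions as the previous snippet (hypothetical names, PyTorch for brevity): only the proposal closest to the ground truth in L1 distance receives a generator-style gradient, which is what pushes the N generated kernels apart.

```python
# Sketch of the multiple choice adversarial update of Eq. (5) (illustrative, not the
# authors' code): only the winning proposal contributes to the kernel generator's loss.
import torch

def multiple_choice_loss(disc, proposals, x_next_gt, eps=1e-8):
    """
    proposals : (B, N, C, H, W) -- sampled frames \hat{x}^i_{t+1}
    x_next_gt : (B, C, H, W)    -- ground-truth frame x_{t+1}
    """
    B = proposals.shape[0]
    # per-proposal L1 distance to the ground truth
    l1 = (proposals - x_next_gt.unsqueeze(1)).abs().flatten(2).mean(-1)  # (B, N)
    k = l1.argmin(dim=1)                                                 # winner index per sample
    winners = proposals[torch.arange(B), k]                              # (B, C, H, W)
    # Eq. (5) maximizes log D(x_hat^k); as a loss we minimize its negative
    return -torch.log(disc(winners) + eps).mean()
```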
3.2 Selection Module

Given the sampled proposals $\hat{X}_{t+1}$, the selection module first selects the best K candidates from them under a measure of motion precision. For example, the motion direction usually changes smoothly between frames, which could be treated as a selection criterion; however, it is impractical to hand-craft a set of such criteria for candidate sifting. Meanwhile, under the video prediction setting the ground truth at time stamp t+1 is unknown at time stamp t, so directly comparing with the ground truth is not an option. To tackle this problem, we formulate it as ranking all proposals based on previous inputs. To be specific, for each proposal we utilize a recurrent neural network to regress a confidence score obtained in a self-supervised manner, and rank the proposals based on the regressed values. The top-K candidates are fed into a combiner for further processing. For clarity, we denote the ground-truth score for the i-th candidate as $\gamma^{i}$, with $\Gamma = \{\gamma^{i}\}_{i=1}^{N}$, and the corresponding regressed score as $\hat{\gamma}^{i}$, with $\hat{\Gamma} = \{\hat{\gamma}^{i}\}_{i=1}^{N}$.

Figure 3: Detailed architecture of the Selection Module. The left part is the proposed selector, which selects K candidates from the N proposals; the right part is the combiner, which combines the K candidates into the final prediction.

The most direct way to generate the ground-truth score is to calculate the pixel-wise error between each proposal and the ground truth: $\gamma^{i} = \|x_{t+1} - \hat{x}^{i}_{t+1}\|_{1}$. This works well when the motion pattern is simple (e.g., translation along some direction) and the objects are rigid bodies. On real-world video sequences (e.g., human or robot motion with occlusion), it severely suffers from motion uncertainty and complex structure deformation. To better capture motion and content information, we utilize a pre-trained discriminative model to extract multi-level features from the video sequences and compute the confidence score at the feature level, as widely done in image translation tasks [18]:

$$\gamma^{i} = \sum_{l} \|\varphi_{l}(x_{t+1}) - \varphi_{l}(\hat{x}^{i}_{t+1})\|_{1}, \quad i = 1, \dots, N, \tag{6}$$

where $\varphi$ is the discriminative model (e.g., a detection or pose estimation network) and $\varphi_{l}$ denotes its l-th layer. As mentioned above, $\Gamma$ is treated as the target value of the confidence score, which we predict from previous inputs as follows:

$$\hat{\gamma}^{i} = \psi_{slt}\big(\hat{x}^{i}_{t+1} \mid x_t, x_{t-1}\big), \quad i = 1, \dots, N, \tag{7}$$

where $\psi_{slt}$ is essentially a 3-step ConvLSTM network [32] (left-most part of Figure 3) that successively takes $x_{t-1}$, $x_t$ and $\hat{x}^{i}_{t+1}$ as input. We train $\psi_{slt}$ with a regression loss: $L_{slt} = \|\hat{\Gamma} - \Gamma\|_{1}$. At time stamp t, we select the top-K candidates $\tilde{X} = \{\hat{x}^{i_j}_{t+1} \mid i_j \in \mathcal{T}(\hat{\Gamma})\}_{j=1}^{K}$ for final prediction, where $\mathcal{T}(\hat{\Gamma})$ is the index set of the top-K candidates in $\hat{\Gamma}$.

For the final step, we utilize a combiner $\psi_{comb}$ to compose the candidates $\tilde{X}$ into the final prediction, which requires $\psi_{comb}$ to fully capture the information contained in $\tilde{X}$. Inspired by recent work on video analogy making [31], we propose a K-stream auto-encoder to capture this feature information. As shown in Figure 3, each candidate is fed into an encoder to extract multi-level information and then passed to a decoder to generate K masks $M = \{m^{j}\}_{j=1}^{K}$, which are used to compose the K candidates into the final prediction:

$$\tilde{x}_{t+1} = \psi_{comb}(\tilde{X}) = \mathrm{Conv}\Big(\sum_{j=1}^{K} m^{j} \odot \hat{x}^{i_j}_{t+1}\Big), \tag{8}$$

where $\odot$ represents the Hadamard product. As shown in Figure 3, we also use skip connections [18] between the encoders and the decoder to enhance feature sharing, and we keep the spatial resolution of the features the same as that of the images throughout $\psi_{comb}$, which proves more effective for preserving spatial information [21]. We train $\psi_{comb}$ with a regular regression loss: $L_{comb} = \sum_{t=2}^{T} \|\tilde{x}_t - x_t\|_{1}$.
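Before turning to the training considerations, a small sketch of the selection stage may be useful. It follows the same conventions as the earlier snippets (our illustration; `score_fn`, `mask_fn` and `fuse_conv` are hypothetical stand-ins for $\psi_{slt}$, the K-stream auto-encoder and the final convolution, and the sigmoid mask activation is an assumption the paper does not spell out): proposals are ranked by the regressed score, the K candidates with the smallest predicted error are kept, and they are fused with per-candidate masks as in Eq. (8).

```python
# Sketch of selection (top-K by regressed score) and combination (masked sum + conv).
import torch

def select_and_combine(score_fn, mask_fn, fuse_conv, proposals, K=2):
    """
    score_fn : proposal (B, C, H, W) -> predicted score gamma_hat (B,); lower = closer to GT
    mask_fn  : candidates (B, K, C, H, W) -> mask logits (B, K, 1, H, W)
    fuse_conv: final convolution applied to the masked sum, cf. Eq. (8)
    """
    B, N = proposals.shape[:2]
    scores = torch.stack([score_fn(proposals[:, i]) for i in range(N)], dim=1)  # (B, N)
    topk = scores.topk(K, dim=1, largest=False).indices                         # smallest predicted error
    cands = proposals[torch.arange(B).unsqueeze(1), topk]                       # (B, K, C, H, W)
    masks = torch.sigmoid(mask_fn(cands))                                       # mask activation: our assumption
    fused = (masks * cands).sum(dim=1)                                          # Hadamard product, summed over K
    return fuse_conv(fused)
```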
3.3 Design Considerations and Implementation Details

Design Considerations. Notably, our framework contains two modules, each involving two sub-networks. Without a carefully designed training procedure, it easily gets stuck in a bad local minimum with meaningless outputs. We therefore provide several practical considerations dedicated to training stability. Firstly, the sampler $\phi_{spl}$ and the selector $\psi_{slt}$ act as models with strong prior knowledge, and pre-training them stabilizes the subsequent training procedure by a large margin. Secondly, we utilize curriculum learning [22] along the temporal direction, which effectively flattens the training curve. Last but not least, we decouple the combined loss function into the corresponding sub-networks, i.e., we update $\phi_{KGen}$, $\phi_{spl}$, $\psi_{slt}$, $\psi_{comb}$ according to $L_{KGen}$, $L_{spl}$, $L_{slt}$, $L_{comb}$ respectively. On one hand, this design prevents competition between the adversarial and regression losses in principle, because the gradients back-propagated from them optimize different networks, which removes the need to balance the two loss functions. On the other hand, the sub-networks are dedicated to different objectives, e.g., the sampler is required to produce high quality proposals without a motion accuracy requirement, while the selector should fully capture the motion information of previous inputs and select proposals with high motion accuracy. Note that these objectives are complementary, which essentially encourages cooperation between the different sub-networks.

Table 2: Detailed evaluation setup with different datasets and models.

| Dataset | Comparison models | Inputs and outputs | Video resolution |
| --- | --- | --- | --- |
| Moving MNIST [33] | DFN [19] & SVG [7] | 10 inputs for 10 outputs | 64x64 |
| Robot Push [10] | CDNA [10] & SV2P [3] | 2 inputs for 8 outputs | 64x64 |
| Human3.6M [17] | DrNet [8] & MCNet [34] | 5 inputs for 10 outputs | 64x64 |

Table 3: Qualitative experiments in terms of reality and similarity assessment.

| Model setup | Moving MNIST [33] (DFN/SVG/Ours) | Robot Push [10] (CDNA/SV2P/Ours) | Human3.6M [17] (DrNet/MCNet/Ours) |
| --- | --- | --- | --- |
| Reality | 25.4% / 43.1% / 45.2% | 22.5% / 29.2% / 35.9% | 10.8% / 27.7% / 23.1% |
| Image Quality | 28.2% / 33.8% / 38.0% | 23.5% / 30.3% / 46.2% | 20.0% / 39.2% / 40.8% |
| Prediction Accuracy | 38.1% / 26.6% / 35.3% | 21.9% / 36.8% / 41.4% | 33.4% / 26.0% / 40.7% |

Implementation Details. We implement the proposed framework with the TensorFlow library [1]. The sampler consists of 3 down-sampling Res-Blocks [15] and 3 up-sampling Res-Blocks, where the sampling operation is bilinear interpolation with stride 2 and each block is followed by a ReLU [12] activation. The kernel generator consists of 5 convolution layers with leaky-ReLU [27] (leak rate 0.1) and a final sigmoid layer. The selector is a 3-layer ConvLSTM [32] with stride 4. Finally, the combiner consists of two parts, the encoder and the decoder: the encoder is a 4-layer convolution network with down-sampling of stride 2, and the decoder is a mirrored version of the encoder with 4 deconvolution layers and a sigmoid output layer. The model $\varphi$ is identical to that used in style transfer [11]. As mentioned above, we pre-train $\phi_{spl}$ and $\psi_{slt}$ for 2 epochs. The curriculum learning essentially increases the prediction length by 1 every epoch, starting from an initial length of 1. In all experiments we train our models with the ADAM optimizer [23] with learning rate η = 0.0001, β1 = 0.5, β2 = 0.999, and we set N = 5 and K = 2; this choice is further discussed in Section 4.4.
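The decoupled optimization described above can be summarized with a short sketch (our reading of Section 3.3, not the released code; module and function names are illustrative): each sub-network owns its optimizer and is driven only by its own loss, and the prediction horizon grows by one frame per epoch.

```python
# Sketch of the decoupled training scheme and the temporal curriculum (assumptions ours).
import torch

def build_optimizers(kernel_gen, sampler, selector, combiner, lr=1e-4):
    adam = lambda m: torch.optim.Adam(m.parameters(), lr=lr, betas=(0.5, 0.999))
    return {
        'kgen': adam(kernel_gen),   # driven by L_KGen, Eq. (5)
        'spl':  adam(sampler),      # driven by L_spl,  Eq. (1)
        'slt':  adam(selector),     # driven by L_slt
        'comb': adam(combiner),     # driven by L_comb
    }

def decoupled_step(optimizer, loss):
    # each loss only updates the parameters held by its own optimizer; detaching the
    # proposals before scoring/combining (our reading, not stated in the paper) keeps
    # regression gradients away from the adversarially trained sampler
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def curriculum_horizon(epoch, max_len=10):
    # prediction length starts at 1 and grows by 1 every epoch
    return min(1 + epoch, max_len)
```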
4 Experiments

4.1 Datasets and Evaluation Setup

Datasets. We evaluate our framework (VPSS) on three diverse datasets: Moving MNIST [33], Robot Push [10] and Human3.6M [17], which pose challenges in different aspects. Moving MNIST [33] contains one bouncing digit; it serves as a toy example suitable for demonstrating the merits of our framework. Robot Push [10] involves complex robotic motion and has been widely used for video prediction. Human3.6M [17] captures single human motion, whose challenge lies in motion stochasticity. Notably, in Human3.6M [17] the human subject occupies only a small portion of the frame, and its motion can easily be ignored when training with the regression loss alone.

Evaluation Setup. A long-standing issue in video prediction and generation is how to ensure fair comparison with other models [25]. We avoid unfair comparison by taking two measures: (1) we compare our model only with models whose source code is accessible, and (2) we exactly follow their experiment setups and do not change their training procedures. For the detailed evaluation setup, please refer to Table 2.

4.2 Quantitative Evaluation

To quantitatively evaluate the proposed framework, we compute the PSNR and SSIM values for DFN [19], SVG [7] and our model on the Moving MNIST dataset [33]. Due to the stochastic nature of SVG [7], we plot its two curves averaged over 20 samples. Note that, different from predicting 10 frames during training, we predict 20 frames to validate generalization ability.

Figure 4: Quantitative comparison with different prediction models in terms of PSNR (left), SSIM (middle) and Inception Score (right).

Table 4: Evaluation under PSNR and Inception Score with different combinations of N and K.

| (N, K) | (3, 1) | (4, 1) | (3, 2) | (4, 2) | (5, 2) |
| --- | --- | --- | --- | --- | --- |
| (PSNR, Inception Score) | (24.43, 1.85) | (24.96, 1.90) | (25.07, 1.95) | (25.21, 1.98) | (25.13, 2.01) |

As illustrated in Figure 4 (left and middle), our model outperforms the other two by a large margin, especially at later time stamps. DFN [19] is trained only with the regression loss, without modelling future uncertainty, and its predictions degenerate to blurry frames rapidly. SVG [7] models the future with random noise sampled from a learned prior, which effectively prevents the blur effect. However, the correlation between the random noise and the current inputs is quite weak, which is insufficient to predict the future precisely; in other words, it behaves more like a "random guess". The predictions of SVG [7] therefore commonly have high image quality but low prediction precision. By contrast, within the proposed framework both objectives degrade more gracefully than in SVG [7] and DFN [19]. Our framework first samples high quality proposals and then combines them into the final prediction, which effectively tackles the above two problems. This clearly shows that the two-stage framework with dedicatedly designed sub-networks successfully unifies the adversarial and regression loss functions into one prediction system.

Recently, several works [29, 8, 25] have argued that PSNR and SSIM are not convincing enough to guarantee the quality of video prediction. To this end, we compute the Inception Score for CDNA [10], SV2P [3] and our model on the Robot Push dataset [10]. As shown in Figure 4 (right), compared to CDNA [10] and SV2P [3], our model keeps relatively higher scores throughout the prediction procedure, which mainly benefits from the high quality proposals of the sampling stage. The results of CDNA [10] and SV2P [3] clearly demonstrate that balancing the two loss functions can hardly satisfy all requirements.
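For readers who want to reproduce curves like those in Figure 4, a small helper of the kind sketched below can be used (our own code, not from the paper; frames are assumed to be numpy arrays in [0, 1], and the per-sample averaging mirrors the 20-sample protocol used for SVG).

```python
# Per-time-step PSNR curves, averaged over drawn samples for stochastic baselines.
import numpy as np

def psnr(pred, gt, eps=1e-8):
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(1.0 / (mse + eps))      # peak value 1.0 for images in [0, 1]

def per_step_psnr(preds, gts):
    """preds, gts: (T, H, W, C) arrays -> list of T PSNR values."""
    return [psnr(p, g) for p, g in zip(preds, gts)]

def averaged_curve(sampled_preds, gts):
    """sampled_preds: (S, T, H, W, C); average the per-step curve over the S samples."""
    return np.mean([per_step_psnr(s, gts) for s in sampled_preds], axis=0)
```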
4.3 Qualitative Evaluation

In the video prediction task, the most convincing way to demonstrate the effectiveness of a model is to directly visualize the predicted results. We present several samples, predicting up to 20 time stamps, for all three datasets in Figure 5. As mentioned above, image blur and prediction accuracy are the two key issues we care about. We can clearly observe that DFN [19] and CDNA [10] produce blurry results because of the regression loss. For SVG [7] and SV2P [3], image quality is much better, but compared to the ground truth the prediction accuracy is not satisfying (e.g., location prediction on Moving MNIST [33]). For MCNet [34] and DrNet [8], although they try to enhance image quality with adversarial learning, the conflict between the adversarial and regression losses prevents them from meeting both requirements concurrently. By contrast, the proposed two-stage framework achieves both high image quality and precise motion prediction. Since still images cannot fully convey the information contained in video sequences, we strongly recommend readers to refer to the video results in the supplemental material.

Figure 5: Prediction results on the Moving MNIST (A) [33], Robot Push (C) [10] and Human3.6M (D) [17] datasets at time stamps 4, 8, 12, 16, 20. Sub-figure (B) shows proposals (second row) and candidates (third row) during a complete procedure of predicting a moving digit 0. We strongly recommend readers to refer to more examples in the supplemental material.

Meanwhile, we provide a subjective experiment for further validation. To be specific, we collect 1270 prediction results for each dataset and ask 40 people to assess them. The experiment covers three aspects: (1) Regarding real video samples as the baseline, which result is more realistic (not conditioned on previous inputs)? (2) Considering image quality and previous inputs, which result is more similar to the ground truth? (3) Considering motion accuracy and previous inputs, which result is more similar to the ground truth? The results are shown in Table 3. In the reality experiment, we observe that the results produced by stochastic prediction (SVG [7]) and adversarial learning (ours and MCNet [34]) related methods seem more realistic to humans, which demonstrates their effectiveness in enhancing image quality. A similar effect can be observed even when conditioning on previous inputs (image quality experiment). For the motion accuracy experiment, one can notice a considerable preference drop for SVG [7], MCNet [34] and DrNet [8] compared to the image quality experiment. Notably, our framework achieves most of the highest scores across the three experiments, which mainly benefits from the proposed selective sampling framework.

4.4 Discussion

The previous experiments demonstrate the superiority of the proposed framework. In this section we present further analysis. We first discuss the selection of N and K, then delve deeper into the execution procedure of our framework.

Selection of N and K. As shown in Table 4, we evaluate the choice of N and K in terms of PSNR and Inception Score [10] on the Human3.6M dataset [17]. One can notice that as K increases, the prediction accuracy keeps growing (i.e., higher PSNR), while as N increases, the image quality improves (i.e., higher Inception Score). However, keeping N and K too high drastically increases the model complexity.
We choose N = 5 and K = 2 because this combination achieves promising performance while keeping the model at a relatively low complexity.

Are these proposals rational enough? To examine the rationality of the proposals at the sampling stage, we plot all proposals in a single image for visualization on the Moving MNIST dataset [33]. As shown in the second row of Figure 5 (B), the sampler effectively samples N examples based on previous inputs, and the motion direction of all proposals is roughly toward the ground truth. Meanwhile, one can observe that the image quality of all proposals stays at a relatively high level and almost does not degrade along the prediction procedure.

Are these candidates accurate enough? To examine the accuracy of the candidates at the selection stage, we likewise plot all candidates for visualization. As shown in the third row of Figure 5 (B), the selector effectively selects K candidates from the N proposals. For Moving MNIST [33] prediction, the selector filters out the proposals that are likely far from the ground truth. With the help of the combiner, the high-quality candidates are composed into the final prediction, which demonstrates high motion accuracy.

Figure 6: Comparison experiments and ablation study (w/o: removing the corresponding module). Panels (A)-(C): PSNR, SSIM and Inception Score on Human3.6M; (D) confidence visualization; (E) mask visualization; (F) kernel visualization; (G) long-term prediction.

Explanation on W: (1) W is the normal learnable kernel, while $\tilde{W}$ is generated at each time stamp; element-wise addition means matrix addition between $W^{i}$ and $\tilde{W}^{i}$. (2) We update $\tilde{W}^{i}$ instead of $W^{i}$ because $W^{i}$ is treated as a basic kernel that estimates motion at a coarse level, while $\tilde{W}^{i}$ performs fine-grained prediction on top of $W^{i}$; by updating $\tilde{W}^{i}$ we narrow the possible distribution down to a more precise and smaller scale. (3) Inspired by DFN [19], we use a softmax to encourage sparsity of $\tilde{W}^{i}$, so that complex motion dynamics can be mimicked more precisely. (4) We present visual evidence (Fig. 6(F)) of the differences among $\{\tilde{W}^{i}\}_{i=0}^{2}$ (the 2nd input and 2nd output channel of each $\tilde{W}^{i}$, shown in the upper-right corner of each proposal); they also gradually change as time increases.

Comparison Experiments: As suggested, we compare our model with all 6 baselines on the Human3.6M dataset [17] in terms of PSNR, SSIM and Inception Score (Fig. 6(A,B,C)). Our model (green line) still performs better than these baselines in terms of both motion accuracy and image quality when the prediction length is extended to 30 (long-term prediction, trained for 10 steps). In particular, it is slightly better than the baseline SVG/best (best of 10 random samples, light purple line).

Ablation Study: (1) As shown in Fig. 6(A,B), when removing the selector (K=N) or the combiner (K=1), the prediction accuracy drops by a large margin (yellow and orange lines) compared to the full model (green line). (2) The predicted results without the selector (Fig. 6(D), 2nd row) tend to involve random motion.
By contrast, the confidence scores (3rd row) assigned by the selector estimate the distribution of future motion well, so the selector acts as a strong supervisor for motion prediction. (3) The masks from the combiner (Fig. 6(E)) combine different parts of the candidates for the final prediction, which helps to refine the predicted motion and improve accuracy. (4) Analysis on N and K: Further increasing K involves proposals with low confidence, which may decrease the prediction accuracy. Keeping N high improves performance a little but drastically increases the model complexity.

5 Conclusion

In this paper we propose a two-stage framework, called VPSS, to study the video prediction task from a novel view. At the sampling stage, our framework contains a conditional sampling module which produces multiple high-quality proposals at each time stamp. At the selection stage, we propose a selection module for the final prediction. Extensive experiments on diverse challenging datasets demonstrate the effectiveness of the proposed video prediction framework.

Acknowledgement

This work was supported by the National Key Research and Development Program of China (2016YFB1001003) and NSFC (61527804, U1611461, 61502301, 61521062). The work was partly supported by the State Key Research and Development Program 18DZ2270700 and by the SJTU-UCLA Joint Center for Machine Perception and Inference. The work was also partially supported by China's Thousand Youth Talents Plan, STCSM 17511105401, 18DZ2270700 and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China.

References

[1] M. Abadi, P. Barham, J. Chen, and Z. Chen. Tensorflow: A system for large-scale machine learning. CoRR, abs/1605.08695, 2016.
[2] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. CoRR, abs/1701.04862, 2017.
[3] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. CoRR, abs/1710.11252, 2017.
[4] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017.
[5] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1520-1529, 2017.
[6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2172-2180, 2016.
[7] E. Denton and R. Fergus. Stochastic video generation with a learned prior. CoRR, abs/1802.07687, 2018.
[8] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4417-4426, 2017.
[9] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. C. Courville. Adversarially learned inference. CoRR, abs/1606.00704, 2016.
[10] C. Finn, I. J. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction.
In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 64-72, 2016.
[11] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2414-2423, 2016.
[12] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 315-323, 2011.
[13] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, pages 2672-2680, 2014.
[14] A. Guzmán-Rivera, D. Batra, and P. Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 1808-1816, 2012.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1325-1339, 2014.
[18] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5967-5976, 2017.
[19] X. Jia, B. D. Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 667-675, 2016.
[20] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and S. Yan. Predicting scene parsing and motion dynamics in the future. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6918-6927, 2017.
[21] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1771-1779, 2017.
[22] F. Khan, X. J. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, 12-14 December 2011, Granada, Spain, pages 1449-1457, 2011.
[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[24] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric.
In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1558-1566, 2016.
[25] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. CoRR, abs/1804.01523, 2018.
[26] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5501-5509, 2017.
[27] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
[28] S. Marsland. Machine Learning - An Algorithmic Perspective. Chapman and Hall / CRC Machine Learning and Pattern Recognition Series. CRC Press, 2009.
[29] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[30] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3510-3520, 2017.
[31] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1252-1260, 2015.
[32] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 802-810, 2015.
[33] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 843-852, 2015.
[34] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. CoRR, abs/1706.08033, 2017.
[35] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3560-3569, 2017.
[36] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2242-2251, 2017.