# Video Face Super-Resolution with Motion-Adaptive Feedback Cell

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Jingwei Xin, Nannan Wang, Jie Li, Xinbo Gao, Zhifeng Li

State Key Laboratory of Integrated Services Networks, School of Electronic Engineering, Xidian University, Xi'an 710071, China
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi'an 710071, China
Tencent AI Lab, China

Corresponding author: Nannan Wang (nnwang@xidian.edu.cn)

## Abstract

Video super-resolution (VSR) methods have recently achieved remarkable success thanks to the development of deep convolutional neural networks (CNNs). Current state-of-the-art CNN methods usually treat the VSR problem as a large collection of separate multi-frame super-resolution tasks: a batch of low-resolution (LR) frames is used to generate a single high-resolution (HR) frame, and sliding a window over the entire video produces a series of HR frames. However, because of the complex temporal dependency between frames, the quality of the reconstructed HR frames degrades as the number of LR input frames grows. The reason is that these methods lack the ability to model complex temporal dependencies and cannot provide accurate motion estimation and compensation for the VSR process, so their performance drops drastically when the motion between frames is complex. In this paper, we propose the Motion-Adaptive Feedback Cell (MAFC), a simple but effective block that efficiently captures motion compensation information and feeds it back to the network in an adaptive way. Because our approach exploits inter-frame motion information directly, the network's dependence on an explicit motion estimation and compensation step can be avoided. In addition, benefiting from this property of MAFC, the network achieves better performance in extremely complex motion scenarios. Extensive evaluations and comparisons validate the strengths of our approach, and the experimental results demonstrate that the proposed framework outperforms the state-of-the-art methods.

## Introduction

Image and video super-resolution (SR) is an efficient technique with broad applications, ranging from medical and satellite imaging (Thornton, Atkinson, and Holland 2006; Shi et al. 2013) to security and surveillance (Zou and Yuen 2011), and it has attracted growing attention in recent years. As a domain-specific instance of cross-modal learning (Yu et al. 2019; 2018), SR aims to generate a high-resolution (HR) image from a low-resolution (LR) input: it must preserve the low-frequency information within the available frequency band and predict the high-frequency information above the cutoff frequency.

With the development of convolutional neural networks, deep learning based single image super-resolution (SISR) has received significant attention from the research community over the past few years (Kim, Kwon Lee, and Mu Lee 2016a; He et al. 2016), achieving state-of-the-art performance in terms of the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) (Wang et al. 2004).
However, SISR methods do not consider the temporal relationship between frames, which makes their results on video super-resolution (VSR) unsatisfactory. Recently, multi-frame SR techniques have been proposed for video super-resolution, in which the motion relationship between input frames is exploited to improve performance. They take multiple LR frames as input and produce HR frames by taking into account sub-pixel motions between neighboring LR frames. Most deep learning based VSR methods (Liao et al. 2015; Kappeler et al. 2016; Caballero et al. 2017; Liu et al. 2017; Tao et al. 2017) follow a similar two-step procedure: motion estimation and compensation first, followed by an up-sampling process. Motion estimation and compensation are the key to these methods and remain a hard task, especially when complex motion or parallax appears across neighboring frames. Caballero et al. (Caballero et al. 2017) proposed an efficient spatial transformer network to compensate the motion between the frames fed to the SR network, but its performance decreases when the number of input frames exceeds 5. Jo et al. (Jo et al. 2018) explored a method based on dynamic upsampling filter estimation that avoids explicit motion compensation, but it is also hard for it to make full use of the video's motion information. In addition, reducing the number of frames fed into the network at the same time can effectively improve the network's ability to model complex temporal dependencies (Sajjadi, Vemulapalli, and Brown 2018; Haris, Shakhnarovich, and Ukita 2019). However, fewer input frames also means the network receives less useful information, and its performance is therefore further limited. In the deep learning based VSR methods mentioned above, the input images are stacked in parallel and the network does not treat them discriminately, which limits its ability to learn useful information when the number of input frames is too large. This causes performance to drop sharply as the number of input frames increases (Caballero et al. 2017; Wang et al. 2018).

Figure 1: Our method aims to generate high-resolution video frames (2nd row) from low-resolution ones (1st row, visualized using pixel duplication). The low-resolution frames (1st row) are downsampled from the corresponding ground-truth frames (3rd row) with noise and blur.

In this paper, we focus on the problem of motion estimation and compensation for video super-resolution, and propose a Motion-Adaptive Feedback Cell (MAFC) together with a novel network, the Motion-Adaptive Feedback Network (MAFN), to improve VSR performance. MAFC sensitively captures the motion information between frames and feeds it back to the network in an adaptive way, and MAFN processes each input image independently through channel separation, which meets the input requirements of MAFC. Furthermore, we apply the proposed model to super-resolve face videos. The experimental results demonstrate that our approach can handle complex motion scenarios and achieves state-of-the-art performance. Moreover, it effectively alleviates the performance degradation caused by an excessive number of input frames. The advantages of our model can be summarized as follows.
1) It makes full use of the motion relationship between different frames and avoids explicit motion compensation operations.
2) It enhances the connection between the features of each frame, and features from different frames are treated discriminately.
3) The temporal dependency can be efficiently modelled by the network, and its performance improves as the number of input frames increases.

## Related Works

Early works addressed the VSR problem by putting the motion between HR frames, the blurring process and the subsampling together into one framework and solving for the sharp frames with an optimization (Ma et al. 2015). Among these traditional methods, Protter et al. (Protter et al. 2008) and Takeda et al. (Takeda et al. 2009) avoided motion estimation by employing non-local means and 3D steering kernel regression. Liu and Sun (Liu and Sun 2013) proposed a Bayesian approach to estimate HR video sequences, which can also compute the motion fields and blur kernels simultaneously.

Recently, with the rise of deep learning, various networks have been designed for video super-resolution. An early deep learning method, BRCN (Huang, Wang, and Wang 2015), uses recurrent neural networks to model long-term contextual information of temporal sequences. Specifically, it employs bidirectional connections between video frames with three types of convolutions: the feedforward convolution for spatial dependency, the recurrent convolution for long-term temporal dependency, and the conditional convolution for long-term contextual information. Liao et al. (Liao et al. 2015) proposed DESR, which reduces the computational load of motion estimation with a non-iterative framework: SR drafts are generated by several hand-designed optical flow algorithms, and a deep network produces the final result. Likewise, Kappeler et al. (Kappeler et al. 2016) proposed VSRnet, which compensates motion in the input LR frames by using a hand-designed optical flow algorithm as preprocessing before feeding them to a pretrained deep SR network. Caballero et al. (Caballero et al. 2017) proposed VESPCN, which learns the motion between input LR frames and improves HR frame reconstruction accuracy in real time. This end-to-end deep network estimates the optical flow between input LR frames with a learned CNN, warps frames with a spatial transformer (Jaderberg et al. 2015), and produces an HR frame through another deep network. Similar to the above methods, the work in (Liu et al. 2017) also learns and compensates the motion between input LR frames, but after motion compensation it adaptively exploits motion information over various temporal radii through a temporal adaptive neural network. The network is composed of several SR inference branches, one for each temporal radius, and the final output is generated by aggregating the outputs of all branches. Tao et al. (Tao et al. 2017) used the motion compensation transformer module from (Caballero et al. 2017) for motion estimation, and proposed a sub-pixel motion compensation layer for simultaneous motion compensation and upsampling. For the subsequent SR network, an encoder-decoder style network with skip connections is used to accelerate training, and a ConvLSTM module is used since video is sequential data.

Figure 2: The proposed Motion-Adaptive Feedback Cell (MAFC), composed of a motion screen unit and a compensation estimation unit, with sigmoid functions and an element-wise product.
Jo et al. (Jo et al. 2018) designed a network that reconstructs images by generating dynamic upsampling filters and a residual image. Sajjadi et al. (Sajjadi, Vemulapalli, and Brown 2018) proposed a frame-recurrent video super-resolution framework that uses the previously inferred HR estimate to super-resolve the subsequent frame. Li et al. (Li et al. 2019) optimized the structure of 3D convolution and proposed a fast VSR method. Inspired by the idea of back-projection, Haris et al. (Haris, Shakhnarovich, and Ukita 2019) integrated spatial and temporal contexts from consecutive video frames using a recurrent encoder-decoder module.

Our starting point is to improve the network's ability to model the motion information in a video. We restrict our analysis to the motion compensation between frames and do not further investigate potentially beneficial extensions such as recurrence (Kim, Kwon Lee, and Mu Lee 2016b) and residual learning (Kim, Kwon Lee, and Mu Lee 2016a), network width and depth (Zhang et al. 2018a), or different loss functions (Lai et al. 2017; Ledig et al. 2017). After introducing the MAFC in Sec. 3.1 and defining a novel network that incorporates it in Sec. 3.2, we justify our design choices in Sec. 3.3 and give details on the implementation and training procedure in Sec. 3.4 and 3.5, respectively.

## Motion-Adaptive Feedback Cell

The motion compensation operations of existing methods are performed directly on the input data, which can be seen as a form of preprocessing. Because some complex motions are difficult to model, low-quality motion compensation causes network performance to drop drastically (Caballero et al. 2017; Wang et al. 2018). What is more, the temporal dependencies among input frames may become too complex for networks to learn useful information from, and instead act as noise that degrades performance (Caballero et al. 2017). Considering these problems, we introduce a motion feedback mechanism between the features of each frame and propose a Motion-Adaptive Feedback Cell (MAFC), which updates the current frame features adaptively according to the difference between neighboring frames. An overview of the proposed MAFC is shown in Fig. 2.

As shown in Fig. 2, the two inputs $F^{t}_{n}$ and $F^{t-1}_{n}$ are feature maps from different frames but with the same convolutional receptive field, and the output $Mc^{t,t-1}_{n}$ is the motion compensation information of the input frames. The first step in MAFC is to discard redundant information from the cell state with a motion screen unit (MSU) and normalize the result between 0 and 1. Next, the screened information $D^{t,t-1}_{n}$ is used to infer candidate motion compensation values with a compensation estimation unit (CEU). Finally, the two are combined to form the final cell state, which serves as the motion compensation features $Mc^{t,t-1}_{n}$:

$$D^{t,t-1}_{n} = \sigma(W_{m1}[F^{t}_{n}, F^{t-1}_{n}] + b_{m1}), \qquad (1)$$
$$C^{t,t-1}_{n} = W_{m2}[D^{t,t-1}_{n}] + b_{m2}, \qquad (2)$$
$$Mc^{t,t-1}_{n} = \sigma(C^{t,t-1}_{n} \odot D^{t,t-1}_{n}), \qquad (3)$$

where $\sigma$ is the sigmoid function, $\odot$ denotes the element-wise product, and $W_{m1}$, $W_{m2}$ ($b_{m1}$, $b_{m2}$) are the convolution parameters (biases) of the MSU and CEU. The motion screen unit consists of a reduction operation followed by a convolution layer and a sigmoid layer, and the compensation estimation unit similarly consists of two convolution layers and a sigmoid layer. In short, the MSU performs the initial screening of motion information, while the CEU adaptively enhances or restrains the screened motion.
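To make the data flow of Eqs. (1)-(3) concrete, below is a minimal PyTorch sketch of a MAFC-style block (the model in this paper is implemented in PyTorch). It is an illustrative reconstruction rather than the authors' released code: the channel width, kernel sizes, the interpretation of the MSU "reduction" as a feature difference, and the exact placement of the sigmoids are assumptions.

```python
import torch
import torch.nn as nn

class MAFC(nn.Module):
    """Illustrative Motion-Adaptive Feedback Cell (cf. Eqs. 1-3).

    Inputs are two feature maps F_n^t and F_n^{t-1} with the same shape and
    receptive field; the output is the motion compensation feature Mc_n^{t,t-1}.
    """
    def __init__(self, channels=16):
        super().__init__()
        # Motion screen unit (MSU): reduction over the frame pair (here a feature
        # difference, an assumption), then convolution + sigmoid (Eq. 1).
        self.msu_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Compensation estimation unit (CEU): two convolution layers (Eq. 2);
        # the sigmoid of Eq. 3 is applied after the element-wise product.
        self.ceu = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, f_t, f_prev):
        d = torch.sigmoid(self.msu_conv(f_t - f_prev))  # D_n^{t,t-1}: screened motion in [0, 1]
        c = self.ceu(d)                                  # C_n^{t,t-1}: per-motion weights
        return torch.sigmoid(c * d)                      # Mc_n^{t,t-1} (Eq. 3)
```

In this reading, `d` filters out unimportant motion and `c` re-weights the remaining motion types before they are fed back to the network, matching the MSU/CEU roles described above.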
Generally, the motion present in video is very complex: motion of the image background, rigid motion of the whole face, non-rigid motion of facial expressions, and so on. These motions contribute differently to image reconstruction. The goal of the MSU screening step is to filter out unimportant movements such as background motion. The CEU then estimates the importance of the remaining motion types in $D^{t,t-1}_{n}$ after the MSU's screening and adaptively generates their corresponding weight coefficients $C^{t,t-1}_{n}$. At the end of the MAFC, the multiplication of $D^{t,t-1}_{n}$ and $C^{t,t-1}_{n}$ enhances or restrains each motion type, feeding clearer and more concise motion compensation features back to the network.

## Network Design

Given a low-resolution, noisy and blurry video $X_t$, the goal of VSR is to estimate a high-resolution, noise-free and blur-free version $Y_t$. The LR frames $X_t$ are downsampled from the corresponding ground-truth (GT) frames $Y_t$ with noise and blur, where $t$ denotes the time step. With the VSR network $G$ and network parameters $\theta$, the VSR problem is defined as

$$Y_t = G_{\theta}(X_{t-L:t+L}), \qquad (4)$$

where $L$ is the temporal radius and the number of input frames for the network is $T = 2L + 1$.

Figure 3: Network architecture of one mid-block of MAFN, named MAFB.

MAFC requires two inputs from different frames to work properly. Unfortunately, none of the existing deep learning based VSR methods operate in this way, so our MAFC cannot be plugged into existing network structures. To address this, we construct a lightweight network, shown in Fig. 4, which consists of three parts: an Input Block that maps the input frames $Fs_{lr}$ into deep features, Mid-Blocks that convert these features into more complete facial representation features, and an Output Block that produces the output image $Y$ from the facial representation features. The detail of a mid-block is shown in Fig. 3; every mid-block has the same structure. Given the previous layer's output $F_{n-1}$, we first update the representation features of each input with a simple SR block to obtain $Fr_{n}$. Then adjacent features are sent to the MAFC in pairs, yielding the corresponding motion compensation information $Mc_{n}$. Note that every MAFC in a layer shares the same model parameters; this weight sharing ensures that the motion compensation operation is applied fairly between any pair of frames. Finally, we combine the motion compensation information with the newly obtained representation features to produce the output $F_{n}$:

$$F_{n-1} = [F^{t-L}_{n-1}, \ldots, F^{t}_{n-1}, \ldots, F^{t+L}_{n-1}], \qquad (5)$$
$$Fr^{t}_{n} = \sigma(W_{f}[F^{t}_{n-1}] + b_{f}), \quad Mc^{t,t+1}_{n} = \mathrm{MAFC}(Fr^{t}_{n}, Fr^{t+1}_{n}), \qquad (6)$$
$$F^{t}_{n} = [Mc^{t-1,t}_{n}, Fr^{t}_{n}, Mc^{t,t+1}_{n}], \quad F_{n} = [F^{t-L}_{n}, \ldots, F^{t}_{n}, \ldots, F^{t+L}_{n}], \qquad (7)$$

where $W_f$ denotes the parameters of the SR block. For each updated representation feature $Fr^{t}_{n}$, we combine it with its adjacent motion compensation information to form the final output. In the following, k3 denotes a convolution kernel size of 3, s1 a stride of 1, and d16 16 feature channels. The Input Block is a k3s1d16 convolution followed by a ReLU, the SR block is two k3s1d16 convolutions and a ReLU, and the Output Block is: k3s1d128, PixelShuffle x2, k3s1d64, PixelShuffle x2, k3s1d1.
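As a reading aid, here is a hedged PyTorch sketch of one mid-block (MAFB) following Eqs. (5)-(7) and the layer specification above; it reuses the `MAFC` class sketched earlier. The fusion convolution that reduces the concatenation of Eq. (7) back to 16 channels and the zero-filling of the missing neighbor at the sequence boundaries are our assumptions, since the paper does not spell out these details.

```python
import torch
import torch.nn as nn

class MAFB(nn.Module):
    """Sketch of one mid-block: per-frame SR block plus weight-shared MAFCs (Eqs. 5-7)."""
    def __init__(self, channels=16):
        super().__init__()
        # SR block: two k3s1d16 convolutions and a ReLU, shared across frames.
        self.sr_block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, 1),
            nn.Conv2d(channels, channels, 3, 1, 1),
            nn.ReLU(inplace=True),
        )
        self.mafc = MAFC(channels)  # one MAFC instance = shared weights for all frame pairs
        # Assumed fusion of [Mc^{t-1,t}, Fr^t, Mc^{t,t+1}] back to `channels` feature maps.
        self.fuse = nn.Conv2d(3 * channels, channels, 3, 1, 1)

    def forward(self, frames):
        # frames: list of T tensors F_{n-1}^t, each of shape (B, C, H, W).
        fr = [self.sr_block(f) for f in frames]                          # Fr_n^t
        mc = [self.mafc(fr[i], fr[i + 1]) for i in range(len(fr) - 1)]   # Mc_n^{t,t+1}
        out = []
        for i, f in enumerate(fr):
            left = mc[i - 1] if i > 0 else torch.zeros_like(f)       # no left neighbor for t-L
            right = mc[i] if i < len(mc) else torch.zeros_like(f)    # no right neighbor for t+L
            out.append(self.fuse(torch.cat([left, f, right], dim=1)))   # F_n^t (Eq. 7)
        return out
```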
## Why Does MAFC Work Better?

A short answer is that it uses motion information more efficiently than common motion compensation operations. Most methods rely heavily on the accuracy of motion estimation and compensation; however, complex motions are difficult to model, and inaccurate compensation can introduce adverse effects. Specifically, while motion compensation operations such as the STN (Jaderberg et al. 2015) are essential pieces of almost all state-of-the-art VSR models (Liao et al. 2015; Kappeler et al. 2016; Caballero et al. 2017; Liu et al. 2017; Tao et al. 2017; Sajjadi, Vemulapalli, and Brown 2018), they tend to increase spatial homogeneity through affine transformation and interpolation, and therefore overlook the model complexity introduced by breaking spatial consistency.

Figure 4: Pipeline of our proposed MAFN model (Input Block, cascaded MAFBs, Output Block).

Another limitation is that the fusion of consecutive frames is carried out as a weighted sum. As a result, it is hard to make full use of the variation and correlation between frames to model video motion, and when too many frames are fed into the network, its performance declines significantly.

The advantages of our method are twofold. On the one hand, our network maps the separated images of each frame into the same feature space through convolution layers with shared weights. In this space, the difference between each group of features represents the motion of the corresponding frame. MAFC can extract the variations between frame features more efficiently and intuitively and feed them back to the network, which greatly reduces the complexity of the network's temporal dependency modeling. On the other hand, because a MAFC operates after each SR layer, motion information at different convolutional receptive fields can be exploited simultaneously, giving the network a richer source of information for modeling temporal dependencies. In the subsequent experiments, we further show that our method models the complex temporal dependencies of video better.

## Implementation Details

### Dataset

We conduct experiments on the VoxCeleb dataset. It contains over 1 million utterances from 6,112 celebrities, extracted from videos uploaded to YouTube, and provides the sequences of tracked faces in the form of bounding boxes. We select 3,884 video sequences of 100 people for training, 10 video sequences of 5 people for validation, and 697 sequences of 18 people for testing. For each sequence, we compute a box enclosing the faces from all frames and use it to crop face images from the original video. All face images are resized to 128x128. Table 1 presents the split of the training, validation and testing sets.

| Split (VoxCeleb) | Objects | Sequences | Frames |
|---|---|---|---|
| Training | 100 | 3,884 | 776,640 |
| Validation | 5 | 10 | 2,144 |
| Testing | 18 | 697 | 139,368 |

Table 1: Datasets used in facial video super-resolution.

### Degradation models

Considering the variety of adverse factors in the image acquisition process, an acquired image may suffer from noise, blur and low resolution at the same time. The image degradation model can be approximated as

$$X_t = D_t B_t Y_t + Z_t, \qquad (8)$$

where $D_t$ and $B_t$ are the downsampling and blurring matrices and $Z_t$ is additive noise. Our LR inputs are generated from HR frames according to this degradation model: we first blur each HR frame with a 7x7 Gaussian kernel with standard deviation 1.6, bicubically downsample it with scaling factor 4, and then add Gaussian noise with noise level 5 (Zhang et al. 2018b).
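For concreteness, the following is a small sketch of this degradation pipeline with the settings just listed (7x7 Gaussian blur with sigma 1.6, x4 bicubic downsampling, Gaussian noise level 5 on a 0-255 scale). It illustrates the protocol described above and is not the authors' preprocessing script; the use of torchvision's `gaussian_blur` and the clamping to [0, 255] are our choices.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def degrade(hr, scale=4, blur_sigma=1.6, noise_level=5.0):
    """Approximate LR generation per Eq. (8): blur, bicubic downsample, add noise.

    hr: (B, C, H, W) tensor with values in [0, 255]; H and W divisible by `scale`.
    """
    blurred = gaussian_blur(hr, kernel_size=[7, 7], sigma=[blur_sigma, blur_sigma])
    lr = F.interpolate(blurred, scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False)
    noisy = lr + noise_level * torch.randn_like(lr)  # additive Gaussian noise, level 5
    return noisy.clamp(0.0, 255.0)

# Example: a 128x128 face crop becomes a 32x32 degraded LR input frame.
lr_frame = degrade(torch.rand(1, 3, 128, 128) * 255.0)
```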
### Training Procedure

The pipeline of our network, named the Motion-Adaptive Feedback Network (MAFN), is shown in Fig. 4; it is a flexible architecture. For our experiments, the network has one input layer, one output layer and seven SR blocks, each of which consists of only two convolution layers and one sigmoid layer. The loss function of MAFN is

$$\mathcal{L}(\theta) = \frac{1}{M} \sum_{i=1}^{M} \left\lVert Y_t^{(i)} - \hat{Y}_t^{(i)} \right\rVert, \qquad (9)$$

where $M$ is the number of training images, $Y_t^{(i)}$ is a ground-truth frame and $\hat{Y}_t^{(i)}$ is the corresponding network output. We implement our model in the PyTorch environment and optimize the network with Adam and back-propagation. The momentum parameter is set to 0.1, the weight decay to $2\times10^{-4}$, and the initial learning rate to $1\times10^{-3}$, halved every 10 epochs. The batch size is set to 16. Training a MAFN on the VoxCeleb dataset generally takes 10 hours on one Titan X Pascal GPU. To assess the quality of the SR results, we employ two objective image quality metrics: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). All metrics are computed on the Y channel (of the YCbCr color space) of the super-resolved images.
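As an illustration of this evaluation protocol, a minimal PSNR computation on the Y channel might look as follows; the BT.601 RGB-to-Y conversion and the helper names are ours, and the SSIM computation is omitted for brevity.

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y of YCbCr, ITU-R BT.601) for an RGB image in [0, 255], shape (H, W, 3)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, gt, peak=255.0):
    """PSNR between the Y channels of a super-resolved frame and its ground truth."""
    diff = rgb_to_y(sr.astype(np.float64)) - rgb_to_y(gt.astype(np.float64))
    return 10.0 * np.log10(peak ** 2 / np.mean(diff ** 2))
```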
## Evaluation

To demonstrate the effect of the proposed MAFC, we first compare the proposed network with other state-of-the-art methods, and then investigate the impact of the model architecture and the number of input frames on performance.

### Comparisons with State-of-the-Arts

**Quantitative comparisons.** We compare our proposed MAFN with the state-of-the-art VSR methods DESR (Liao et al. 2015), VESPCN (Caballero et al. 2017), Liu et al. (Liu et al. 2017), SPMC (Tao et al. 2017), FRVSR (Sajjadi, Vemulapalli, and Brown 2018), VSR-DUF (Jo et al. 2018), FSTRN (Li et al. 2019) and RBPN (Haris, Shakhnarovich, and Ukita 2019). For a fair comparison, we train all models on the same training set. To examine each method's ability to model complex temporal dependencies, we train and test the networks with different numbers of input frames, T in {3, 5, 7}. The quantitative comparison is shown in Table 2.

| Method | PSNR (T=3) | SSIM (T=3) | PSNR (T=5) | SSIM (T=5) | PSNR (T=7) | SSIM (T=7) |
|---|---|---|---|---|---|---|
| Bicubic | 29.95 | 0.8416 | 29.95 | 0.8416 | 29.95 | 0.8416 |
| DESR | 32.19 | 0.8929 | 32.30 | 0.8953 | 32.09 | 0.8929 |
| VESPCN | 33.07 | 0.9097 | 33.14 | 0.9112 | 32.79 | 0.9055 |
| Liu et al. | 32.70 | 0.9033 | 32.82 | 0.9063 | 32.66 | 0.9033 |
| SPMC | 33.03 | 0.9066 | 33.23 | 0.9099 | 33.44 | 0.9132 |
| FRVSR | 33.26 | 0.9105 | 33.42 | 0.9129 | 33.53 | 0.9147 |
| VSR-DUF | 34.38 | 0.9290 | 34.14 | 0.9245 | 33.82 | 0.9214 |
| FSTRN | 32.96 | 0.9059 | 33.11 | 0.9089 | 33.07 | 0.9085 |
| RBPN | 33.16 | 0.9084 | 33.67 | 0.9158 | 33.91 | 0.9232 |
| MAFN (ours) | 34.15 | 0.9237 | 34.59 | 0.9279 | 34.81 | 0.9318 |

Table 2: Performance of facial video hallucination on the testing set.

In general, VSR-DUF (Jo et al. 2018) achieves nearly the best performance apart from our method, but its performance decreases significantly as the number of input frames increases. We attribute this to the absence of an explicit motion estimation and compensation operation, which makes it difficult for the network to model the complex dependencies between frames. In addition, owing to its lightweight structure, our method does not obtain the highest performance at T = 3. At T = 5, our method achieves the best performance, and the performance of all methods except VSR-DUF improves, which shows that motion compensation can enhance a network's ability to model complex motion. At T = 7, the performance of SPMC, FRVSR, RBPN and our method increases, while the other methods degrade. The common feature of FRVSR and RBPN is that the network takes only two frames as input at a time; for SPMC, we modified the way the network reads images to make it consistent with FRVSR. This suggests that feeding fewer images at a time effectively increases a network's ability to model temporal dependence. The advantage of such designs is that performance can grow with the number of input frames, but the limited input also makes further improvement difficult. Benefiting from MAFC's ability to exploit inter-frame motion information, our method achieves excellent performance at T = 7, with a significant improvement over T = 5.

**Qualitative results.** A qualitative comparison between our method and other SR methods is shown in Figure 5. It shows three face images, each reconstructed from seven consecutive frames. The super-resolved results of our method tend to be more appealing and clearer than those of the other methods, especially around the mouth and eyes. Figure 1 shows more example results from the VoxCeleb dataset; the method applies to many scenes and yields high image fidelity.

Figure 5: Visual evaluation at scale 4. (a) Original HR images. (b) Input LR images. (c) Results of Caballero et al.'s method (VESPCN). (d) Results of Sajjadi et al.'s method (FRVSR). (e) Results of Jo et al.'s method (VSR-DUF). (f) Results of Haris et al.'s method (RBPN). (g) Results of our MAFN.

### Ablation Study

We conduct an ablation study to demonstrate the effects of our method. Since our backbone is a really simple network, similar to the Fast Super-Resolution Convolutional Neural Network (Dong, Loy, and Tang 2016), it does not merit additional discussion here; instead we focus on how MAFC improves performance. We conduct four experiments evaluating the basic network, the signal screen unit, the compensation estimation unit, and the full MAFC, respectively. Specifically, removing the MAFC from our MAFN yields the first network, named BasicNet_v1. The second network, BasicNet_v2, has the same structure as MAFN except that the MAFC retains only the signal screen unit; in the same way, in the third network, BasicNet_v3, the MAFC retains only the compensation estimation unit. For a fair comparison, the four networks differ only in the MAFC part, and all models are trained with the same implementation details.

| T | BasicNet_v1 | BasicNet_v2 | BasicNet_v3 | MAFN |
|---|---|---|---|---|
| 3 | 32.48 | 33.89 | 33.82 | 34.15 |
| 5 | 32.64 | 34.20 | 33.11 | 34.59 |
| 7 | 32.55 | 34.37 | 33.29 | 34.81 |

Table 3: Ablation study on the effects of MAFC (PSNR).

Table 3 shows the results of the different network structures. It can be seen that: (1) Compared with the other networks, the basic network (BasicNet_v1) has lower performance, and increasing the number of input frames does not significantly improve its results. (2) Both BasicNet_v2 and BasicNet_v3 achieve good performance, showing that subtraction followed by convolution and parallel concatenation followed by convolution can each achieve motion compensation for the two input frame features. (3) The model using both units (signal screen and compensation estimation) achieves the best performance, which indicates that richer and more distinct motion compensation information brings further improvement.
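To make the two single-unit variants concrete, the following is a purely illustrative sketch of how the "subtraction followed by convolution" cell (BasicNet_v2-style) and the "parallel concatenation followed by convolution" cell (BasicNet_v3-style) could be written; the class names, channel widths and layer counts are our assumptions, not the authors' configurations.

```python
import torch
import torch.nn as nn

class ScreenOnlyCell(nn.Module):
    """BasicNet_v2-style cell: subtraction of the two frame features, then convolution."""
    def __init__(self, channels=16):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, 1, 1)

    def forward(self, f_t, f_prev):
        return torch.sigmoid(self.conv(f_t - f_prev))

class EstimationOnlyCell(nn.Module):
    """BasicNet_v3-style cell: parallel (concatenated) frame features, then convolutions."""
    def __init__(self, channels=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, 1, 1),
            nn.Conv2d(channels, channels, 3, 1, 1),
        )

    def forward(self, f_t, f_prev):
        return torch.sigmoid(self.conv(torch.cat([f_t, f_prev], dim=1)))
```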
## Conclusion

In this paper, we propose a Motion-Adaptive Feedback Cell (MAFC) and a novel network, the Motion-Adaptive Feedback Network (MAFN), for video super-resolution. The key contribution of this paper is that we identify the shortcomings of current VSR methods based on motion estimation and compensation, put forward an adaptive feedback method to deal with them, and obtain satisfactory results. The advantage of MAFC is that it can efficiently and intuitively extract the differences between the features of each frame and capture those differences at every level of the representation space, which allows the motion information between frames to be learned more thoroughly. Extensive experiments show that MAFC significantly outperforms the state of the art. We therefore believe this motion-adaptive feedback strategy could be more widely applicable in practice, and that it can readily be used for other machine vision problems such as video deblurring, compression artifact removal and even optical flow learning.

## Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant 61922066, Grant 61876142, Grant 61671339, Grant 61772402, Grant U1605252 and Grant 61432014, in part by the National Key Research and Development Program of China under Grant 2016QY01W0200 and Grant 2018AAA0103202, in part by the National High-Level Talents Special Support Program of China under Grant CS31117200001, in part by the Fundamental Research Funds for the Central Universities under Grant JB190117, in part by the Xidian University-Intellifusion Joint Innovation Laboratory of Artificial Intelligence, and in part by the Innovation Fund of Xidian University.

## References

Caballero, J.; Ledig, C.; Aitken, A.; Acosta, A.; Totz, J.; Wang, Z.; and Shi, W. 2017. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4778-4787.

Dong, C.; Loy, C. C.; and Tang, X. 2016. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, 391-407. Springer.

Haris, M.; Shakhnarovich, G.; and Ukita, N. 2019. Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3897-3906.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.

Huang, Y.; Wang, W.; and Wang, L. 2015. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in Neural Information Processing Systems, 235-243.

Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems, 2017-2025.

Jo, Y.; Wug Oh, S.; Kang, J.; and Joo Kim, S. 2018. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3224-3232.

Kappeler, A.; Yoo, S.; Dai, Q.; and Katsaggelos, A. K. 2016. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2(2):109-122.
Kim, J.; Kwon Lee, J.; and Mu Lee, K. 2016a. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1646-1654.

Kim, J.; Kwon Lee, J.; and Mu Lee, K. 2016b. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1637-1645.

Lai, W.-S.; Huang, J.-B.; Ahuja, N.; and Yang, M.-H. 2017. Deep Laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 624-632.

Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4681-4690.

Li, S.; He, F.; Du, B.; Zhang, L.; Xu, Y.; and Tao, D. 2019. Fast spatio-temporal residual network for video super-resolution. arXiv preprint arXiv:1904.02870.

Liao, R.; Tao, X.; Li, R.; Ma, Z.; and Jia, J. 2015. Video super-resolution via deep draft-ensemble learning. In Proceedings of the IEEE International Conference on Computer Vision, 531-539.

Liu, C., and Sun, D. 2013. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(2):346-360.

Liu, D.; Wang, Z.; Fan, Y.; Liu, X.; Wang, Z.; Chang, S.; and Huang, T. 2017. Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision, 2507-2515.

Ma, Z.; Liao, R.; Tao, X.; Xu, L.; Jia, J.; and Wu, E. 2015. Handling motion blur in multi-frame super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5224-5232.

Protter, M.; Elad, M.; Takeda, H.; and Milanfar, P. 2008. Generalizing the nonlocal-means to super-resolution reconstruction. IEEE Transactions on Image Processing 18(1):36-51.

Sajjadi, M. S.; Vemulapalli, R.; and Brown, M. 2018. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6626-6634.

Shi, W.; Caballero, J.; Ledig, C.; Zhuang, X.; Bai, W.; Bhatia, K.; de Marvao, A. M. S. M.; Dawes, T.; O'Regan, D.; and Rueckert, D. 2013. Cardiac image super-resolution with global correspondence using multi-atlas patchmatch. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 9-16. Springer.

Takeda, H.; Milanfar, P.; Protter, M.; and Elad, M. 2009. Super-resolution without explicit subpixel motion estimation. IEEE Transactions on Image Processing 18(9):1958-1975.

Tao, X.; Gao, H.; Liao, R.; Wang, J.; and Jia, J. 2017. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, 4472-4480.

Thornton, M. W.; Atkinson, P. M.; and Holland, D. 2006. Sub-pixel mapping of rural land cover objects from fine spatial resolution satellite sensor imagery using super-resolution pixel-swapping. International Journal of Remote Sensing 27(3):473-491.

Wang, Z.; Bovik, A. C.; Sheikh, H. R.; Simoncelli, E. P.; et al. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4):600-612.

Wang, Z.; Yi, P.; Jiang, K.; Jiang, J.; Han, Z.; Lu, T.; and Ma, J. 2018. Multi-memory convolutional neural network for video super-resolution. IEEE Transactions on Image Processing 28(5):2530-2544.
Yu, Y.; Tang, S.; Aizawa, K.; and Aizawa, A. 2018. Category-based deep CCA for fine-grained venue discovery from multimodal data. IEEE Transactions on Neural Networks and Learning Systems 30(4):1250-1258.

Yu, Y.; Tang, S.; Raposo, F.; and Chen, L. 2019. Deep cross-modal correlation learning for audio and lyrics in music retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15(1):20.

Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; and Fu, Y. 2018a. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), 286-301.

Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; and Fu, Y. 2018b. Residual dense network for image super-resolution. In CVPR.

Zou, W. W., and Yuen, P. C. 2011. Very low resolution face recognition problem. IEEE Transactions on Image Processing 21(1):327-340.