Published as a conference paper at ICLR 2021

LIFTPOOL: BIDIRECTIONAL CONVNET POOLING

Jiaojiao Zhao & Cees G. M. Snoek
Video & Image Sense Lab, University of Amsterdam
{jzhao3,cgmsnoek}@uva.nl

ABSTRACT

Pooling is a critical operation in convolutional neural networks for increasing receptive fields and improving robustness to input variations. Most existing pooling operations downsample the feature maps, which is a lossy process. Moreover, they are not invertible: upsampling a downscaled feature map cannot recover the information lost in the downsampling. By adopting the philosophy of the classical Lifting Scheme from signal processing, we propose LiftPool for bidirectional pooling layers, consisting of LiftDownPool and LiftUpPool. LiftDownPool decomposes a feature map into several downsized sub-bands, each of which contains information of a different frequency. As the pooling function in LiftDownPool is perfectly invertible, a corresponding up-pooling layer, LiftUpPool, is obtained by performing LiftDownPool backwards; it generates a refined upsampled feature map using the detail sub-bands, which is useful for image-to-image translation challenges. Experiments show the proposed methods achieve better results on image classification and semantic segmentation, using various backbones. Moreover, LiftDownPool offers better robustness to input corruptions and perturbations.

1 INTRODUCTION

Spatial pooling has been a critical ConvNet operation since its inception (Fukushima, 1979; LeCun et al., 1990; Krizhevsky et al., 2012; He et al., 2016; Chen et al., 2018). It is crucial that a pooling layer maintains the most important activations for the network's discriminability (Saeedan et al., 2018; Boureau et al., 2010). Several simple operations, such as average pooling or max pooling, have been explored for aggregating features in a local area. Springenberg et al. (2015) employ a convolutional layer with an increased stride to replace a pooling layer, which is equivalent to downsampling. While effective and efficient, simply using the average or maximum activation may ignore local structures. In addition, as these functions are not invertible, upsampling the downscaled feature maps cannot recover the lost information.

Different from existing pooling operations, we propose in this paper a bidirectional pooling called LiftPool, including LiftDownPool, which preserves details when downsizing the feature maps, and LiftUpPool, for generating finer upsampled feature maps. LiftPool is inspired by the classical Lifting Scheme (Sweldens, 1998) from signal processing, which is commonly used for information compression (Pesquet-Popescu & Bottreau, 2001), reconstruction (Dogiwal et al., 2014), and denoising (Wu et al., 2004). The perfect invertibility of the Lifting Scheme has also stimulated work on invertible networks (Dinh et al., 2017; Jacobsen et al., 2018; Atanov et al., 2019; Izmailov et al., 2020). The Lifting Scheme decomposes an input signal into various sub-bands of downscaled size, and this process is perfectly invertible. Applying the idea of the Lifting Scheme, LiftDownPool factorizes an input feature map into several downsized spatial sub-bands with different correlation structures. As shown in Figure 1, for an image feature map, the LL sub-band is an approximation that removes several details, while LH, HL and HH represent details along the horizontal, vertical and diagonal directions.
LiftDownPool allows preserving any sub-band(s) as the pooled result. Moreover, due to the invertibility of the pooling function, LiftUpPool is introduced for upsampling feature maps. Upsampling a feature map is more challenging, as seen for MaxUpPool (Badrinarayanan et al., 2017), which generates an output with many holes (shown in Figure 1). LiftUpPool utilizes the recorded details to recover a refined output by performing LiftDownPool backwards.

Figure 1: Illustration of the proposed LiftDownPool and LiftUpPool vs. MaxPool and MaxUpPool on an image from CIFAR-100. Where MaxPool takes the maximum activations from the input, LiftDownPool decomposes the input into four sub-bands: LL, LH, HL and HH. LL contains the low-pass coefficients; it better reduces aliasing compared to MaxPool. LH, HL and HH represent details along the horizontal, vertical and diagonal directions. For simplicity, we just upsample the down-pooled results for illustrating the up-pooling. MaxUpPool generates a sparse map with lost details. LiftUpPool produces a refined output from the recorded details by performing LiftDownPool backwards.

We analyze the proposed LiftPool from several viewpoints. LiftDownPool allows a flexible choice of any sub-band(s) as the pooled result. It outperforms baselines on image classification with various ConvNet backbones. LiftDownPool also presents better stability to corruptions and perturbations of the input. By performing LiftDownPool backwards, LiftUpPool generates a refined upsampled feature map for semantic segmentation.

2 LIFTPOOL

The down-pooling operator is formulated as minimizing the information loss caused by downsizing feature maps, as in image downscaling by Saeedan et al. (2018); Kobayashi (2019a). The Lifting Scheme (Sweldens, 1998) naturally matches this problem. The Lifting Scheme was originally designed to exploit the correlated structures present in signals to build a downsized approximation and several detail sub-bands in the spatial domain (Daubechies & Sweldens, 1998). The inverse transform is realizable and always provides a perfect reconstruction of the input. LiftPool is derived from the Lifting Scheme to obtain bidirectional pooling layers.

2.1 LIFTDOWNPOOL

Taking a one-dimensional (1D) signal as an example, LiftDownPool decomposes a given signal x = [x_1, x_2, x_3, ..., x_n], x_i ∈ R, into a downscaled approximation signal s and a difference signal d by

    s, d = F(x),                                                         (1)

where F(·) = f_update ∘ f_predict ∘ f_split(·) consists of three functions: split (downsample), predict and update. Here ∘ denotes function composition. LiftDownPool-1D is illustrated in Figure 2(a). Specifically:

Split, f_split: x ↦ (x_e, x_o). The given signal x is split into two disjoint sets: x_e = [x_2, x_4, ..., x_2k] with even indices and x_o = [x_1, x_3, ..., x_2k+1] with odd indices. The two sets are typically closely correlated.

Figure 2: LiftDownPool and LiftUpPool implementations. (a) LiftDownPool-1D: x is split into x_e and x_o; the predictor and updater generate the details d and an approximation s. (b) LiftUpPool-1D: by running LiftDownPool backwards, x_e and x_o are generated from s and d, and then merged into x.

Predict, f_predict: (x_e, x_o) ↦ d. Given one set, e.g. x_e, the other set x_o can be predicted by a predictor P(·).
The predictor is not required to be precise, so the difference d, containing the high-pass coefficients, is defined as

    d = x_o − P(x_e).                                                    (2)

Update, f_update: (x_e, d) ↦ s. Taking x_e as an approximation of x causes serious aliasing, because x_e is simply downsampled from x. In particular, the running average of x_e is not the same as that of x. To correct this, a smoothed version s is generated by adding U(d) to x_e:

    s = x_e + U(d).                                                      (3)

The update procedure is equivalent to applying a low-pass filter to x. Thus, s, containing the low-pass coefficients, is taken as an approximation of the original signal.

The classical Lifting Scheme applies pre-defined low-pass and high-pass filters to decompose an image into four sub-bands. However, pre-designing the filters in P(·) and U(·) is difficult (Zheng et al., 2010). Earlier, Zheng et al. (2010) proposed to optimize such filters with a back-propagation network. All functions in LiftDownPool are differentiable, and P(·) and U(·) can simply be implemented by convolution operators followed by non-linear activation functions (Glorot et al., 2011). Specifically, we design P(·) and U(·) as:

    P(·) = Tanh() ∘ Conv(k=1, s=1, g=G2) ∘ ReLU() ∘ Conv(k=K, s=1, g=G1),   (4)
    U(·) = Tanh() ∘ Conv(k=1, s=1, g=G2) ∘ ReLU() ∘ Conv(k=K, s=1, g=G1).   (5)

Here K is the kernel size and G1 and G2 are the numbers of groups. We prefer to learn the filters in P(·) and U(·) with deep neural networks in an end-to-end fashion. To that end, two constraint terms are added to the final loss function. Recall that s is the downsized approximation of x. As s is updated from x_e according to Eq. 3, s is essentially close to x_e, so it is naturally required to be close to x_o as well. Therefore, one constraint term c_u minimizes the L2-norm distance between s and x_o. With Eq. 3,

    c_u = ||s − x_o||_2 = ||U(d) + x_e − x_o||_2.                        (6)

The other constraint term c_p minimizes the detail d. With Eq. 2,

    c_p = ||x_o − P(x_e)||_2.                                            (7)

The total loss is

    L = L_task + λ_u c_u + λ_p c_p,                                      (8)

where L_task is the loss for a specific task, such as a classification or semantic segmentation loss. We set λ_u = 0.01 and λ_p = 0.1. Our experiments show the two terms provide a good regularization for the model.

LiftDownPool-2D is easily decomposed into several 1D LiftDownPool operators. Following the standard Lifting Scheme, we first perform a LiftDownPool-1D along the horizontal direction to obtain an approximation part s (low frequency in the horizontal direction) and a difference part d (high frequency in the horizontal direction). Then, for each of the two parts, we apply the same LiftDownPool-1D along the vertical direction. By doing so, s is further decomposed into LL (low frequency in the vertical and horizontal directions) and LH (low frequency in the vertical direction and high frequency in the horizontal direction), while d is further decomposed into HL (high frequency in the vertical direction and low frequency in the horizontal direction) and HH (high frequency in the vertical and horizontal directions).

Figure 3: LiftDownPool visualization. Selected feature maps of an image in CIFAR-100, from the first LiftDownPool layer in VGG13. LL represents smoothed feature maps with fewer details. LH, HL and HH represent detailed features along the horizontal, vertical and diagonal directions. Each sub-band contains different correlation structures.

We can flexibly choose sub-band(s) for down-pooling and keep the other sub-band(s) for reversing the operation.
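To make the lifting steps concrete, the following is a minimal PyTorch sketch of LiftDownPool-1D and its 2D extension under the formulation above, assuming even spatial sizes, a (N, C, L) / (N, C, H, W) tensor layout, group sizes of one by default, and mean-squared proxies for the L2-norm constraints. It is an illustration of the equations, not the authors' released implementation.

```python
import torch
import torch.nn as nn


def make_pu(channels, k=5, g1=1, g2=1):
    """Predictor/updater of Eqs. 4-5: Conv(k, groups=G1) -> ReLU -> Conv(1, groups=G2) -> Tanh."""
    return nn.Sequential(
        nn.Conv1d(channels, channels, k, stride=1, padding=k // 2, groups=g1),
        nn.ReLU(inplace=True),
        nn.Conv1d(channels, channels, 1, stride=1, groups=g2),
        nn.Tanh(),
    )


class LiftDownPool1d(nn.Module):
    """Split -> predict -> update (Eqs. 1-3) on tensors of shape (N, C, L), L even."""

    def __init__(self, channels, k=5):
        super().__init__()
        self.P = make_pu(channels, k)  # predictor
        self.U = make_pu(channels, k)  # updater

    def forward(self, x):
        xe, xo = x[..., 0::2], x[..., 1::2]  # split into even / odd samples
        d = xo - self.P(xe)                  # Eq. 2: high-pass detail
        s = xe + self.U(d)                   # Eq. 3: low-pass approximation
        # Constraint terms of Eqs. 6-7 (mean-squared proxies for the L2 norms),
        # to be added to the task loss with lambda_u = 0.01 and lambda_p = 0.1 (Eq. 8).
        c_u = (s - xo).pow(2).mean()
        c_p = d.pow(2).mean()
        return s, d, c_u, c_p


class LiftDownPool2d(nn.Module):
    """LiftDownPool-2D: one horizontal and two vertical 1D passes sharing one P/U pair."""

    def __init__(self, channels, k=5):
        super().__init__()
        self.lift1d = LiftDownPool1d(channels, k)

    def _lift_last_dim(self, x):
        # apply the shared 1D lift along the last axis of a (N, C, H, W) tensor
        n, c, h, w = x.shape
        s, d, c_u, c_p = self.lift1d(x.permute(0, 2, 1, 3).reshape(n * h, c, w))
        unflat = lambda t: t.reshape(n, h, c, -1).permute(0, 2, 1, 3)
        return unflat(s), unflat(d), c_u, c_p

    def forward(self, x):                                          # x: (N, C, H, W)
        s, d, cu1, cp1 = self._lift_last_dim(x)                    # horizontal pass
        ll, lh, cu2, cp2 = self._lift_last_dim(s.transpose(2, 3))  # vertical pass on s
        hl, hh, cu3, cp3 = self._lift_last_dim(d.transpose(2, 3))  # vertical pass on d
        ll, lh, hl, hh = (t.transpose(2, 3) for t in (ll, lh, hl, hh))
        return ll, lh, hl, hh, cu1 + cu2 + cu3, cp1 + cp2 + cp3
```

Summing the four returned sub-bands gives the default pooled map used in the experiments below; keeping only a subset corresponds to the flexible sub-band choice just discussed.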
Naturally, LiftDownPool-1D can be generalized further to any n-dimensional signal. In Figure 3, we show several feature maps from the first LiftDownPool layer of VGG13. LL has smoothed features with fewer details, while LH, HL and HH capture the details along the horizontal, vertical and diagonal directions.

Discussion. MaxPool is usually formulated as first taking the maximum and then downsampling: MaxPool_{k,s} = downsample_s ∘ Max_k (Zhang, 2019). By contrast, LiftDownPool is LiftDownPool_{k,s} = update_k ∘ predict_k ∘ downsample_s. First downsampling and then performing the two lifting steps (prediction and update) helps anti-aliasing; a simple analysis is provided in the Appendix. As shown in Figure 1, LiftDownPool keeps more structured information and better reduces aliasing than MaxPool.

2.2 LIFTUPPOOL

LiftPool inherits the invertibility of the Lifting Scheme. Taking the 1D signal as an example, LiftUpPool generates an upsampled signal x from s and d by

    x = G(s, d),                                                         (9)

where G(·) = f_merge ∘ f_predict ∘ f_update(·), consisting of the functions update, predict and merge. Specifically, (s, d) ↦ x_e, (x_e, d) ↦ x_o and (x_e, x_o) ↦ x are realized by:

    x_e = s − U(d),                                                      (10)
    x_o = d + P(x_e),                                                    (11)
    x = f_merge(x_e, x_o).                                               (12)

We simply recover the even part x_e and the odd part x_o from s and d, and then merge x_e and x_o into x. In this way, we generate upsampled feature maps with rich information.

Discussion. Up-pooling has been used in image-to-image translation tasks such as semantic segmentation (Chen et al., 2017), super-resolution (Shi et al., 2016), and image colorization (Zhao et al., 2020). It is generally used in encoder-decoder networks such as SegNet (Badrinarayanan et al., 2017) and U-Net (Ronneberger et al., 2015). However, most existing pooling functions are not invertible. Taking MaxPool as the baseline, the maximum indices have to be recorded during max pooling. For simplicity, we use the down-pooled results as inputs to the up-pooling in Figure 1. When performing MaxUpPool, the values of the input feature maps are directly placed at the corresponding maximum indices of the output, and all other indices are given zeros. As a result, the output looks sparse and loses much of the structured information, which is harmful for generating good-resolution outputs. LiftUpPool, performing an inverse transformation of LiftDownPool, is able to produce finer outputs by using the multiphase sub-bands.
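Building on the LiftDownPool1d sketch above, a hedged sketch of LiftUpPool-1D follows: it runs Eqs. 10-12 backwards, reusing the same learned P and U; the interleaving used for the merge step is our assumption.

```python
import torch


def lift_up_pool_1d(pool, s, d):
    """Invert a LiftDownPool1d module: recover x from the approximation s and detail d."""
    xe = s - pool.U(d)               # Eq. 10: undo the update step
    xo = d + pool.P(xe)              # Eq. 11: undo the prediction step
    n, c, l = s.shape
    x = s.new_zeros(n, c, 2 * l)     # Eq. 12: merge by re-interleaving even / odd samples
    x[..., 0::2] = xe
    x[..., 1::2] = xo
    return x


# Because Eqs. 10-11 exactly undo Eqs. 2-3, the round trip reconstructs the
# input up to numerical precision, which is the invertibility LiftUpPool relies on.
pool = LiftDownPool1d(channels=16)
x = torch.randn(2, 16, 32)
s, d, _, _ = pool(x)
assert torch.allclose(lift_up_pool_1d(pool, s, d), x, atol=1e-5)
```

In an encoder-decoder such as SegNet, the detail sub-bands stored by LiftDownPool in the encoder would play the role that the max-pooling indices play for MaxUpPool, but they carry actual detail coefficients rather than positions.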
3 RELATED WORK

Taking the average over a feature map region was the pooling mechanism of choice in the Neocognitron (Fukushima, 1979; 1980) and LeNet (LeCun et al., 1990). Average pooling is equivalent to blurred downsampling. Max pooling later proved even more effective (Scherer et al., 2010) and became popular in deep ConvNets. Yet, averaging activations or picking the maximum activation causes a loss of details. Zeiler & Fergus (2013) and Zhai et al. (2017) introduced a stochastic process into pooling and downsampling, respectively, for better regularization. Lee et al. (2016) mixed AveragePool and MaxPool by a gated mask to adapt to complex and variable input patterns. Saeedan et al. (2018) introduced detail-preserving pooling (DPP) for maintaining structured details. By contrast, Zhang (2019) proposed BlurPool, which applies a low-pass filter and thereby removes details. Interestingly, both methods improved image classification, indicating that (empirically) determining the best pooling strategy is beneficial (Saeedan et al., 2018). Williams & Li (2018) introduced the wavelet transform into pooling for reducing jagged edges and other artifacts. Rippel et al. (2015) suggested pooling in the frequency domain, which enables flexibility in the choice of the pooling output dimensionality. Pooling based on a probabilistic model was proposed by Kobayashi (2019a;b): Kobayashi (2019a) first uses a Gaussian distribution to model the local activations and then aggregates the activations into the two statistics of mean and standard deviation, while Kobayashi (2019b) estimates parameters from global statistics of the input feature map to flexibly represent various pooling types. Our proposed LiftDownPool decomposes the input feature map into a downsized approximation and several detail sub-bands, and any sub-band(s) can be flexibly chosen as the pooled result. While existing pooling functions are not invertible, our proposed LiftPool is able to perform both down-pooling and up-pooling.

Previously, MaxUpPool (Badrinarayanan et al., 2017) was introduced for semantic segmentation. As the max pooling function is not invertible, the lost details cannot be recovered during up-pooling. Hence, the output suffers from aliasing. Although adding a BlurPool to MaxUpPool may help to reduce the aliasing (Zhang, 2019), several details are still lost. LiftUpPool, performing the LiftDownPool functions backwards, is capable of producing a refined high-resolution output with the help of the detail sub-bands.

Earlier, Zheng et al. (2010) introduced back-propagation for the Lifting Scheme to perform nonlinear wavelet decomposition. They propose an update-first Lifting Scheme and use back-propagation to replace the updater and predictor; in this way, they realize a back-propagation neural network in lifting steps for signal processing, but no pooling layer is used. We develop down-pooling and up-pooling layers by leveraging the idea of the Lifting Scheme for image processing. We utilize convolution layers and ReLU layers to implement the updater and predictor, which are optimized end-to-end with the deep neural network, and our pooling layers are easily plugged into various backbones. Recently, Rodriguez et al. (2020) introduced the Lifting Scheme for multiresolution analysis in a network. Specifically, they develop an adaptive wavelet network by stacking several convolution layers and Lifting Scheme layers. They focus on an interpretable network by integrating multiresolution analysis, rather than pooling. Our paper aims at developing a pooling layer by utilizing the lifting steps. We develop a down-pooling that constructs various sub-bands with different information, and an up-pooling that generates refined upsampled feature maps.

4 EXPERIMENTS

4.1 CONVNET TESTBEDS

Image Classification. We first verify the proposed LiftDownPool for image classification on CIFAR-100 (Krizhevsky & Hinton, 2009) with 32×32 low-resolution images. CIFAR-100 has 100 classes with 600 images each; there are 500 training images and 100 testing images per class. A VGG13 (Simonyan & Zisserman, 2015) network is trained on this dataset. For experiments conducted on CIFAR-100, we repeat each experiment three times with different initial random seeds during training and report the averaged error rate with the standard deviation. We also report results on ImageNet (Deng et al., 2009) with 1.2M training and 50K validation images for 1000 classes.
We plug LiftDownPool into several popular ConvNet backbones to verify its generalizability for image classification. We replace the local pooling layers by LiftDownPool in all the networks. Error rate is used as the evaluation metric. All training settings are provided in the Appendix.

Table 1: Flexibility. Top-1 image classification error rate (%) with varying sub-bands on CIFAR-100. Mixing low-pass and high-pass sub-bands obtains the best result, and adding c_u and c_p helps improve it further.

    LL                               25.64 ± 0.04
    LH                               25.71 ± 0.04
    HL                               24.88 ± 0.05
    HH                               25.18 ± 0.08
    LL+LH+HL+HH (w/o c_u and c_p)    26.43 ± 0.07
    LL+LH+HL+HH (w/ c_u and c_p)     24.35 ± 0.11

Table 2: Effectiveness. Top-1 image classification error rate (%) with varying kernel size on CIFAR-100. Kernel size 5 achieves the best result.

    Kernel    Top-1
    2         25.53 ± 0.13
    3         25.06 ± 0.22
    4         24.89 ± 0.07
    5         24.35 ± 0.11
    7         24.40 ± 0.08

Table 3: Effectiveness. Top-1 image classification error rate (%) with various pooling methods on CIFAR-100. LiftDownPool outperforms the baselines.

    Skip            27.09 ± 0.11
    MaxPool         25.71 ± 0.13
    AveragePool     25.87 ± 0.03
    LiftDownPool    24.35 ± 0.11

Semantic Segmentation. We also test LiftDownPool and LiftUpPool for semantic segmentation on PASCAL-VOC12 (Everingham et al., 2010), which contains 20 foreground object classes and one background class. An augmented version with 10,582 training images and 1,449 validation images is used. We consider SegNet (Badrinarayanan et al., 2017) with VGG13 and DeeplabV3Plus (Chen et al., 2018) with ResNet50 as the ConvNets for semantic segmentation. Performance is measured in terms of pixel mean intersection-over-union (mIoU) across the 21 classes. Code is available at https://github.com/jiaozizhao/LiftPool/.

4.2 ABLATION STUDY

Flexibility. We first test VGG13 on CIFAR-100. Different from previous pooling methods, LiftDownPool generates four sub-bands, each of which contains a different type of information, and allows a flexible choice of which sub-band(s) to keep as the final pooled result. In Table 1, we show the Top-1 error rate for classification based on different sub-bands. Interestingly, we observe that vertical details contribute more to image classification, while the low-pass coefficients and the high-pass coefficients along the horizontal direction obtain similar error rates. Whether the two spatial dimensions should be treated equally we leave for future work. To realize a less lossy pooling, we combine all the sub-bands by summing them up, at almost no additional compute cost. Such a pooling significantly improves the results. In addition, the constraints c_u and c_p help to decrease the error rate. Moreover, as seen from Table 1 and Table 3, LiftDownPool outperforms the other baselines even when based on any single sub-band. We believe the learned LiftDownPool provides an effective regularization of the model.

Effectiveness. Table 2 ablates the performance for varying kernel sizes of the filters in P(·) and U(·). A larger kernel size, covering more local information, performs slightly better; a kernel size of 7 brings more computation but no performance gain. Unless specified otherwise, from now on we use a kernel size of 5 for all experiments and sum up all the sub-bands. We also compare our LiftDownPool with the commonly-used MaxPool and AveragePool, as well as the convolutional layer with stride 2 (Springenberg et al., 2015), which is called Skip by Kobayashi (2019a). Seen from Table 3, LiftDownPool outperforms the other pooling methods on CIFAR-100.
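As a usage illustration, the sketch below shows how the LiftDownPool2d module sketched in Section 2.1 could be dropped into torchvision backbones in place of their max-pooling layers. The wrapper that sums the sub-bands and exposes the constraint terms is hypothetical glue code, stride-2 convolutions (the Skip baseline) are left untouched, and this is not the authors' released code.

```python
import torch.nn as nn
from torchvision import models


class LiftDownPoolLayer(nn.Module):
    """Hypothetical drop-in replacement for a stride-2 pooling layer."""

    def __init__(self, channels, k=5):
        super().__init__()
        self.pool = LiftDownPool2d(channels, k)  # sketched in Section 2.1
        self.c_u = self.c_p = 0.0                # picked up later by the training loop

    def forward(self, x):
        ll, lh, hl, hh, self.c_u, self.c_p = self.pool(x)
        return ll + lh + hl + hh                 # default: sum all four sub-bands


# ResNet-style backbones expose a single max-pooling layer after the stem.
resnet = models.resnet18()
resnet.maxpool = LiftDownPoolLayer(channels=64)

# VGG-style backbones interleave MaxPool2d layers inside `features`.
vgg = models.vgg13()
features = vgg.features
for i in range(len(features)):
    if isinstance(features[i], nn.MaxPool2d):
        prev_conv = [m for m in features[:i] if isinstance(m, nn.Conv2d)][-1]
        features[i] = LiftDownPoolLayer(channels=prev_conv.out_channels)
```

The constraint terms stored by the replaced layers would then be gathered by the training loop and added to the classification loss with the weights of Eq. 8.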
Generalizability. We apply LiftDownPool to several backbones, including ResNet18, ResNet50 (He et al., 2016) and MobileNet-V2 (Sandler et al., 2018), on ImageNet. In Table 4, LiftDownPool has a 2% lower Top-1 error rate than MaxPool and AveragePool, while combining MaxPool and AveragePool in a Gated or Mixed fashion (Lee et al., 2016) still leaves a 1% gap with LiftDownPool. GaussPool (Kobayashi, 2019a) and GFGP (Kobayashi, 2019b) are comparable to LiftDownPool with ResNet50, but not with lighter networks. Compared to Spectral pooling (Rippel et al., 2015) and Wavelet pooling (Williams & Li, 2018), which are applied in the frequency or space-frequency domain, LiftDownPool offers an advantage by learning correlated structures and details in the spatial domain. Compared to DPP (Saeedan et al., 2018), which preserves details, and BlurPool (Zhang, 2019), which smooths feature maps with a low-pass filter, our LiftDownPool retains all sub-bands, which proves to be more powerful for image classification. Stochastic approaches like S3Pool obtain poor results on the large-scale dataset because the randomness in pooling hampers network training, as earlier observed by Kobayashi (2019b). To conclude, LiftDownPool performs better no matter which backbone is used.

Table 4: Generalizability of LiftDownPool on ImageNet (Top-1 / Top-5 error rate, %). LiftDownPool outperforms alternative pooling methods, no matter which ConvNet backbone is used. Some entries are based on running the code provided by the authors; the others are based on our re-implementation.

                                          ResNet18        ResNet50        MobileNet-V2
                                          Top-1  Top-5    Top-1  Top-5    Top-1  Top-5
    Skip (Springenberg et al., 2015)      30.22  10.23    24.31   7.34    28.66   9.70
    MaxPool                               28.60   9.77    24.26   7.22    28.65   9.82
    AveragePool                           28.03   9.55    24.40   7.35    28.32   9.72
    S3Pool (Zhai et al., 2017)            33.91  13.09    27.98   9.34    40.56  17.91
    WaveletPool (Williams & Li, 2018)     30.33  10.82    24.43   7.36    29.27  10.26
    BlurPool (Zhang, 2019)                29.88  10.58    24.60   7.73    30.58  11.26
    DPP (Saeedan et al., 2018)            29.12  10.21    24.62   7.49    29.85  10.53
    SpectralPool (Rippel et al., 2015)    28.69   9.87    24.81   7.57    33.38  12.56
    GatedPool (Lee et al., 2016)          27.78   9.44    23.79   7.06    28.94   9.90
    MixedPool (Lee et al., 2016)          27.76   9.50    24.08   7.32    29.00   9.97
    GFGP (Kobayashi, 2019b)               26.88   8.66    22.76   6.34    28.42   9.59
    GaussPool (Kobayashi, 2019a)          26.58   8.86    22.95   6.30    27.13   8.92
    LiftDownPool                          25.80   8.14    22.36   6.11    26.09   8.22

Table 5: Out-of-distribution robustness of LiftDownPool on ImageNet-C (mCE) and ImageNet-P (mFR), with normalized and unnormalized values. LiftDownPool is more robust to corruptions and perturbations compared to the baselines.

                                      Normalized          Unnormalized
                                      mCE      mFR        mCE      mFR
    Skip                              72.71    61.75      57.05    7.56
    MaxPool                           73.09    62.64      57.40    7.57
    AveragePool                       72.09    56.23      56.56    6.90
    BlurPool (Zhang, 2019)            72.14    56.54      56.58    6.90
    DPP (Saeedan et al., 2018)        72.12    62.30      56.67    7.62
    GatedPool (Lee et al., 2016)      72.58    58.05      57.00    7.23
    GaussPool (Kobayashi, 2019a)      69.27    54.83      54.40    6.76
    LiftDownPool                      68.45    52.91      53.80    6.55

Parameter Efficiency. For all pooling layers in one network, we use the same kernel size in LiftPool. For the trainable parameters, recall that P and U each contain a 1D convolution, so each has (C/G1)·C·K + G2 parameters, where C is the number of input channels and G2 equals the number of internal channels. A 2D LiftDownPool shares these parameters across its three 1D passes without extra parameters.
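For intuition, the snippet below counts the trainable parameters of one such pooling layer, using the LiftDownPool2d sketch from Section 2.1; the exact count therefore reflects our assumed layer widths and group sizes, not necessarily the 25.58M ResNet50 total reported next.

```python
# Count the trainable parameters of one pooling layer under the sketch above.
pool = LiftDownPool2d(channels=64, k=5)
n_params = sum(p.numel() for p in pool.parameters())
print(f"trainable parameters in one LiftDownPool layer: {n_params}")
# P and U each hold one k-tap grouped convolution and one 1x1 convolution, and
# the same pair is reused for the horizontal and both vertical passes.
```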
We compare our LiftDownPool (25.58M parameters) to two other parameterized pooling methods using ResNet50 on ImageNet: GFGP (31.08M) and GaussPool (33.85M). We achieve a lower error rate than GFGP and GaussPool with fewer parameters, so the performance boost is due to the LiftPool scheme, not to the added capacity.

Figure 4: Shift robustness comparison between various pooling methods, including MaxPool, Skip, AveragePool and the proposed LiftDownPool. LiftDownPool improves classification consistency and meanwhile boosts accuracy, independent of the backbone used.

Figure 5: LiftUpPool for semantic segmentation. Visualization of semantic segmentation maps on PASCAL-VOC12 based on SegNet with varying up-pooling methods. LiftUpPool presents more complete, precise segmentation maps with smooth edges.

4.3 STABILITY ANALYSIS

Out-of-distribution Robustness. A good down-pooling method is expected to be stable to perturbations and noise. Following Zhang (2019), we test the robustness of LiftDownPool to corruptions on ImageNet-C and its stability to perturbations on ImageNet-P using ResNet50. Both datasets come from Hendrycks & Dietterich (2019). We report the mean Corruption Error (mCE) and mean Flip Rate (mFR) for the two tasks, with both unnormalized raw values and values normalized by AlexNet's mCE and mFR, following Hendrycks & Dietterich (2019). From Table 5, LiftDownPool effectively reduces the raw mCE compared to the baselines. We show the CE for each corruption type for further analysis in Figure 9 in the Appendix. LiftDownPool enables robustness to both high-frequency corruptions, such as noise or spatter, and low-frequency corruptions, like blur and JPEG compression. We believe LiftDownPool benefits from the mechanism that all sub-bands are used. A similar conclusion is obtained for robustness to perturbations on ImageNet-P from Table 5 and Figure 9 in the Appendix. ImageNet-P contains short video clips of a single image with small perturbations added. Such perturbations are generated by several types of noise, blur, geometric changes, and simulated weather conditions. The metric FR measures how often the Top-1 classification changes in consecutive frames; it is designed for testing a model's stability under small deformations. Again, LiftDownPool achieves a lower FR for most perturbations.

Shift Robustness. We then test the shift-invariance of our models. Following Zhang (2019), we use classification consistency to measure shift-invariance. It represents how often the network outputs the same classification, given the same image with two different shifts. We test models with varying backbones trained on ImageNet. In Figure 4, LiftDownPool boosts classification accuracy as well as consistency, no matter which backbone is used. Besides, we have other interesting findings. The deeper ResNet50 network has more stable shift-invariance, and the various pooling methods, including MaxPool, Skip and AveragePool, do not make a significant difference to its consistency. However, a lighter ResNet18 network is influenced much more by the pooling method: LiftDownPool brings more than 10% improvement in consistency with ResNet18. We leave for future work how the depth of the network affects its shift-invariance.
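For reference, a minimal sketch of the classification-consistency metric of Zhang (2019) used in Figure 4 follows: it measures how often a model keeps its Top-1 prediction when the same image is shifted by two different offsets. Using circular shifts via torch.roll is our simplification; the original protocol crops shifted windows from a larger image, so the absolute numbers would not be directly comparable.

```python
import torch


@torch.no_grad()
def classification_consistency(model, images, max_shift=16):
    """images: (N, C, H, W) batch; returns the fraction of shift pairs with equal Top-1."""
    model.eval()
    agree = 0
    for x in images:
        dx1, dy1, dx2, dy2 = torch.randint(-max_shift, max_shift + 1, (4,))
        p1 = model(torch.roll(x, (int(dy1), int(dx1)), dims=(-2, -1))[None]).argmax(1)
        p2 = model(torch.roll(x, (int(dy2), int(dx2)), dims=(-2, -1))[None]).argmax(1)
        agree += int(p1 == p2)
    return agree / len(images)
```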
Table 6: LiftUpPool for semantic segmentation on PASCAL-VOC12, based on SegNet with varying up-pooling methods.

    Method                      mIoU
    MaxUpPool                   62.7
    MaxUpPool + BlurPool        64.0
    LiftUpPool                  68.9

Table 7: Semantic segmentation with DeepLabV3Plus on PASCAL-VOC12 with various pooling methods. LiftDownPool performs best.

    Method                      mIoU
    Skip                        76.1
    MaxPool                     76.2
    AveragePool                 76.4
    Gauss (Kobayashi, 2019a)    77.4
    LiftDownPool                78.7

4.4 RESULTS FOR SEMANTIC SEGMENTATION

LiftUpPool for Semantic Segmentation. The LiftDownPool functions are invertible, as described in Eq. 10 and Eq. 11. This naturally benefits a corresponding up-pooling operation, which is popularly used in encoder-decoder networks for image-to-image translation tasks. Usually, the encoder downsizes feature maps layer by layer to generate a high-level embedding for understanding the image. The decoder then needs to translate this embedding, which has a tiny spatial size, into a map with the same spatial size as the original input image. Interpreting details is pivotal for producing high-resolution outputs. We replace all down-pooling and up-pooling layers in SegNet with LiftDownPool and LiftUpPool for semantic segmentation on PASCAL-VOC12. For LiftDownPool we only keep the LL sub-band; for LiftUpPool, the detail-preserving sub-bands LH, HL and HH are used for generating the upsampled feature maps. MaxUpPool is taken as the baseline. We also test MaxUpPool followed by a BlurPool (Zhang, 2019), which is expected to help anti-aliasing. Table 6 reveals that LiftUpPool improves over the baselines by a considerable margin. As illustrated in Figure 1, MaxUpPool is unable to compensate for the lost details, and although BlurPool helps to smooth local areas, it only provides a small improvement. As LiftUpPool is capable of refining the feature map by fusing it with the details, it is beneficial for per-pixel prediction tasks like semantic segmentation. We show several examples for semantic segmentation in Figure 5; LiftUpPool is more precise on details and edges. We also show the feature maps per predicted class in the Appendix.

Semantic Segmentation with DeepLabV3Plus. As discussed, LiftDownPool helps to lift ConvNets in accuracy and stability for image classification. ImageNet-trained ConvNets often serve as the backbones for downstream tuning, so the benefits of LiftDownPool are expected to transfer to other tasks. We again consider semantic segmentation as our example and leverage the state-of-the-art DeeplabV3Plus-ResNet50 (Chen et al., 2018). The input image has size 512×512 and the output feature map of the encoder is 32×32. The decoder upsamples the feature map to 128×128 and concatenates it with the low-level feature map for the final pixel-level classification. As before, all local pooling layers are replaced by LiftDownPool. We use the weights pre-trained for image classification to initialize the corresponding model. As shown in Table 7, LiftDownPool outperforms all the baselines with considerable gaps.

5 CONCLUSION

Applying classical signal processing theory to modern deep neural networks, we propose a novel pooling method: LiftPool. LiftPool is able to perform both down-pooling and up-pooling. LiftDownPool improves both accuracy and robustness for image classification. LiftUpPool, generating refined upsampled feature maps, outperforms MaxUpPool by a considerable margin on semantic segmentation.
Future work may focus on applying LiftPool to fine-grained image classification, super-resolution challenges, or other tasks with high demands on detail preservation.

REFERENCES

Andrei Atanov, Alexandra Volokhova, Arsenii Ashukha, Ivan Sosnovik, and Dmitry Vetrov. Semi-conditional normalizing flows for semi-supervised learning. INNFw, 2019.
Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.
Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. In ICML, pp. 111-118, 2010.
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2017.
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
Ingrid Daubechies and Wim Sweldens. Factoring wavelet transforms into lifting steps. Journal of Fourier Analysis and Applications, 4(3), 1998.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In ICLR, 2017.
Sanwta Ram Dogiwal, YS Shishodia, and Abhay Upadhyaya. Efficient lifting scheme based super resolution image reconstruction using low resolution images. In Advanced Computing, Networking and Informatics, pp. 259-266. Springer, 2014.
Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge. IJCV, 88(2):303-338, 2010.
Kunihiko Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. IEICE Technical Report, A, 62(10):658-665, 1979.
Kunihiko Fukushima. A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36:193-202, 1980.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Dan Hendrycks and Thomas G Dietterich. Benchmarking neural network robustness to common corruptions and surface variations. In ICLR, 2019.
Pavel Izmailov, Polina Kirichenko, Marc Finzi, and Andrew Gordon Wilson. Semi-supervised learning with normalizing flows. In ICML, 2020.
Jörn-Henrik Jacobsen, Arnold Smeulders, and Edouard Oyallon. i-RevNet: Deep invertible networks. In ICLR, 2018.
Takumi Kobayashi. Gaussian-based pooling for convolutional neural networks. In NeurIPS, 2019a.
Takumi Kobayashi. Global feature guided local pooling. In ICCV, 2019b.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.
Chen-Yu Lee, Patrick W Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In AISTATS, pp. 464-472, 2016.
Adam Paszke et al. Automatic differentiation in PyTorch. 2017.
Béatrice Pesquet-Popescu and Vincent Bottreau. Three-dimensional lifting schemes for motion compensated video compression. In ICASSP, 2001.
Oren Rippel, Jasper Snoek, and Ryan P Adams. Spectral representations for convolutional neural networks. In NeurIPS, 2015.
Maria Ximena Bastidas Rodriguez, Adrien Gruson, Luisa Polania, Shin Fujieda, Flavio Prieto, Kohei Takayama, and Toshiya Hachisuka. Deep adaptive wavelet network. In WACV, 2020.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
Faraz Saeedan, Nicolas Weber, Michael Goesele, and Stefan Roth. Detail-preserving pooling in deep networks. In CVPR, 2018.
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In ICANN, pp. 92-101, 2010.
Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In ICLR, 2015.
Wim Sweldens. The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2):511-546, 1998.
Travis Williams and Robert Li. Wavelet pooling for convolutional neural networks. In ICLR, 2018.
Yonghong Wu, Quan Pan, Hongcai Zhang, and Shaowu Zhang. Adaptive denoising based on lifting scheme. In ICSP, 2004.
Matthew D Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
Shuangfei Zhai, Hui Wu, Abhishek Kumar, Yu Cheng, Yongxi Lu, Zhongfei Zhang, and Rogerio Feris. S3Pool: Pooling with stochastic spatial sampling. In CVPR, 2017.
Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019.
Jiaojiao Zhao, Jungong Han, Ling Shao, and Cees GM Snoek. Pixelated semantic colorization. IJCV, 128(4):814-834, 2020.
Yi Zheng, Ruijin Wang, and Jianping Li. Nonlinear wavelets and BP neural networks adaptive lifting scheme. In ICACIA, 2010.

APPENDIX

Figure 6: Comparison between MaxPool and LiftDownPool, and between MaxUpPool and LiftUpPool. MaxPool loses details; with the recorded maximum indices, MaxUpPool generates a very sparse output. LiftDownPool decomposes the input into an approximation and several detail sub-bands, and realizes a pooling by summing up all sub-bands. LiftUpPool produces a refined output by performing LiftDownPool backwards.
We show additional analysis and results for robustness and semantic segmentation in this appendix.

LiftDownPool vs. MaxPool. We provide a schematic diagram in Figure 6 to further illustrate the difference between MaxPool and LiftDownPool, and between MaxUpPool and LiftUpPool. Taking kernel size 2 and stride 2 as an example, MaxPool selects the maximum activation in a local neighbourhood and hence loses 75% of the information. The lost details could be important for image recognition. LiftDownPool decomposes a feature map into LL, LH, HL and HH. LL, containing the low-pass coefficients, is an approximation of the input, designed to capture its correlated structures. The other sub-bands contain detail coefficients along different directions. The pooling is implemented by summing up all the sub-bands. The final pooled result, containing both the approximation and the details, is expected to be more effective for image classification.

LiftUpPool vs. MaxUpPool. The pooling function in MaxPool is not invertible. MaxPool records the maximum indices for performing the corresponding MaxUpPool. MaxUpPool places the input activations at the recorded maximum indices of the output, while all other indices are set to zero, so the final upsampled output has many "holes". By contrast, the pooling functions in LiftDownPool are invertible. Leveraging this property by performing LiftDownPool backwards, LiftUpPool is able to generate a refined output from its input, including the recorded details.

Experiment Settings. The VGG13 (Simonyan & Zisserman, 2015) network trained on CIFAR-100 is optimized by SGD with a batch size of 100, a weight decay of 0.0005 and a momentum of 0.9. The learning rate starts from 0.1 and is multiplied by 0.1 after 80 and 120 epochs, for a total of 160 epochs. We train the ResNets for 100 epochs and MobileNet for 150 epochs on ImageNet, following the standard training recipe from the public PyTorch (Paszke et al., 2017) repository.

High-resolution Feature Maps Visualization. Using ResNet50 with input size 224×224, we extract the feature maps of an image from the first pooling layer. We show the high-resolution feature maps in Figure 7, where we only show the LL sub-band from LiftDownPool. Compared to MaxPool, LiftDownPool better maintains the local structure.

Figure 7: High-resolution feature maps visualization. LiftDownPool better maintains local structure.

Anti-aliasing. LiftDownPool effectively reduces aliasing compared to naive downsizing, following the Lifting Scheme (Sweldens, 1998). Figure 8(b) provides a simple illustration of LiftDownPool. The dashed line is an original signal x. According to Eq. 1, the predictor P(·) for the odd sample x_2k+1 could simply take the average of its two even neighbours:

    d_k = x_2k+1 − (x_2k + x_2k+2)/2.                                    (13)

Thus, if x is linear in a local area, the detail d_k is zero: the prediction step takes care of some of the spatial correlation. If an approximation s of the original signal x were simply taken from the even part x_e, it would really just downsize the signal, shaped by the red line, and cause serious aliasing: the running average of x_e is not the same as that of the original signal x. The updater U(·) in Eq. 3 corrects this by replacing x_e with smoothed values s. Specifically, U(·) restores the correct running average and thus reduces aliasing:

    s_k = x_2k + (d_{k−1} + d_k)/4.                                      (14)

As shown in Figure 8, d_k is the difference between the odd sample x_2k+1 and the average of its two even neighbours. This causes a loss of d_k/2 in the area with the red shade. To preserve the running average, this area is redistributed to the two neighbouring even samples x_2k and x_2k+2, which shapes a coarser piecewise-linear signal s, drawn as the solid line. The signal after LiftDownPool, drawn as the solid line in (b), covers the same area as the original dashed signal, so LiftDownPool reduces aliasing compared to the downsizing drawn as the solid line in (a).

Figure 8: Illustration of how LiftDownPool reduces aliasing compared to downsizing (Sweldens, 1998). The dashed line is the original signal. (a) The solid line is the signal after downsizing. (b) The solid line is the signal after LiftDownPool; the solid and dashed lines cover the same area in (b).
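A small numerical check of Eqs. 13-14 follows; it is a sketch assuming circular boundary handling, which is our simplification of the (unspecified) boundary treatment.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)                  # arbitrary 1D signal of even length
xe, xo = x[0::2], x[1::2]

d = xo - (xe + np.roll(xe, -1)) / 2      # Eq. 13: d_k = x_{2k+1} - (x_{2k} + x_{2k+2}) / 2
s = xe + (np.roll(d, 1) + d) / 4         # Eq. 14: s_k = x_{2k} + (d_{k-1} + d_k) / 4

print(np.isclose(s.mean(), x.mean()))    # True: the lifted coarse signal keeps the average
print(np.isclose(xe.mean(), x.mean()))   # generally False: plain downsizing aliases it

ramp = np.arange(16, dtype=float)        # for a locally linear signal the detail vanishes
d_ramp = ramp[1::2] - (ramp[0::2] + np.roll(ramp[0::2], -1)) / 2
print(d_ramp[:-1])                       # zeros away from the wrap-around boundary
```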
Out-of-distribution Robustness. We show the robustness of the pooling methods to each corruption and perturbation type in Figure 9. Corruption Error (CE) is the metric for robustness to corruptions on ImageNet-C, and Flip Rate (FR) is reported for robustness to perturbations on ImageNet-P. Following Hendrycks & Dietterich (2019), we report both unnormalized raw values and values normalized by AlexNet's CE and FR; lower values are better. As seen in Figure 9(a) and (c), LiftDownPool obtains the lowest CE for most of the high-frequency corruptions, including Gaussian noise and spatter, as well as for low-frequency corruptions such as motion blur and zoom blur. Figure 9(b) and (d) clearly show that LiftDownPool is less sensitive to most of the perturbations, such as speckle noise and Gaussian blur.

Figure 9: Comparison of the robustness of various pooling methods per kind of corruption on ImageNet-C and perturbation on ImageNet-P. LiftDownPool presents stronger robustness to almost all corruptions and perturbations.

Visualization of Up-pooling. In Figure 10, we show the feature map for each predicted category from the last layer of SegNet using varying up-pooling methods. Using MaxUpPool, the feature maps look noisy and less continuous, because MaxUpPool fills the output with many zeros where there is no information. By applying a BlurPool after the MaxUpPool, the feature maps become smoother, but still lack details. LiftUpPool, benefiting from the details recorded during LiftDownPool, produces finer feature maps for each category, with smooth edges, continuous segmentation maps and less aliasing.

Figure 10: Visualization of the feature maps per predicted category from the last layer of SegNet. LiftUpPool generates more precise predictions for each category.