# i-REVNET: DEEP INVERTIBLE NETWORKS

Published as a conference paper at ICLR 2018

Jörn-Henrik Jacobsen, Arnold Smeulders, Edouard Oyallon
University of Amsterdam
joern.jacobsen@bethgelab.org
(Author affiliations: now at Bethgelab, University of Tübingen; CVN, CentraleSupélec, Université Paris-Saclay; Galen team, INRIA Saclay; SequeL team, INRIA Lille; DI, ENS, Université PSL.)

ABSTRACT

It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations in most commonly used network architectures. In this paper we show, via a one-to-one mapping, that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult, for one because the local inversion is ill-conditioned; we overcome this by providing an explicit inverse. An analysis of the representations learned by i-RevNets suggests an alternative explanation for the success of deep networks: a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet, we reconstruct linear interpolations between natural image representations.

1 INTRODUCTION

A CNN may be very effective in classifying images of all sorts (He et al., 2016; Krizhevsky et al., 2012), but the cascade of linear and nonlinear operators reveals little about the contribution of the internal representation to the classification. The learning process is characterized by a steady reduction of large amounts of uninformative variability in the images while simultaneously revealing the essence of the visual class. It is widely believed that this process is based on progressively discarding uninformative variability about the input with respect to the problem at hand (Dosovitskiy & Brox, 2016; Mahendran & Vedaldi, 2016; Shwartz-Ziv & Tishby, 2017; Achille & Soatto, 2017). However, the extent to which information is discarded is lost somewhere in the intermediate nonlinear processing steps. In this paper, we aim to provide insight into the variability reduction process by proposing an invertible convolutional network that does not discard any information about the input.

The difficulty of recovering images from their hidden representations is found in many commonly used network architectures (Dosovitskiy & Brox, 2016; Mahendran & Vedaldi, 2016). This poses the question whether a substantial loss of information is necessary for successful classification. We show that information does not have to be discarded: by using homeomorphic layers, the invariance can be built only at the very last layer via a projection.

In Shwartz-Ziv & Tishby (2017), minimal sufficient statistics are proposed as a candidate to explain the reduction of variability. Tishby & Zaslavsky (2015) introduce the information bottleneck principle, which states that an optimal representation must reduce the mutual information between an input and its representation to remove as much uninformative variability as possible. At the same time, the network should maximize the mutual information between the desired output and its representation to effectively prevent classes from collapsing onto each other.
The effect of the information bottleneck was demonstrated on small datasets in Shwartz-Ziv & Tishby (2017); Achille & Soatto (2017). However, in this work, we show it is not a necessary condition: we build a cascade of homeomorphic layers, which preserves the mutual information between input and hidden representation, and show that the loss of information can only occur at the final layer. This way we demonstrate that a loss of information can be avoided while maintaining discriminability, even for large-scale problems like ImageNet.

One way to reduce variability is a progressive contraction with respect to a meaningful ℓ2 metric in the intermediate representations. Several works (Oyallon, 2017; Zeiler & Fergus, 2014) observed a phenomenon of progressive separation and contraction in non-invertible networks on limited datasets. Those progressive improvements can be interpreted as the creation of progressively stronger invariants for classification. Ideally, the contraction should not be too brutal, to avoid removing important information from the intermediate signal. This shows that a good trade-off between discriminability and invariance has to be progressively built. In this paper, we extend some findings of Zeiler & Fergus (2014); Oyallon (2017) to ImageNet (Russakovsky et al., 2015) and, most importantly, show that a loss of information is not necessary for observing a progressive contraction.

The duality between invariance and separation of the classes is discussed in Mallat (2016). Here, intra-class variabilities are modeled as Lie groups that are processed by performing a parallel transport along those symmetries. Filters are adapted through learning to the specific bias of the dataset and avoid contracting along discriminative directions. However, using groups beyond the Euclidean case for image classification is hard, mainly because groups associated with abstract variabilities are difficult to estimate due to their high-dimensional nature, as is the appropriate degree of invariance required. An illustration of this framework on the Euclidean group is given by the scattering transform (Mallat, 2012), which builds invariance to small translations while being recoverable to a certain extent.

In this work, we introduce a network that cannot discard any information except at the final classification stage, while we demonstrate numerically a progressive contraction and separation of the signal classes. We introduce the i-RevNet, an invertible deep network.¹ i-RevNets retain all information about the input signal in any of their intermediate representations up until the last layer. Our architecture builds upon the recently introduced RevNet (Gomez et al., 2017), where we replace the non-invertible components of the original RevNets by invertible ones. i-RevNets achieve the same performance on ImageNet compared to similar non-invertible RevNet and ResNet architectures (Gomez et al., 2017; He et al., 2016).

To shed light on the mechanism underlying the generalization ability of the learned representation, we show that i-RevNets progressively separate and contract signals with depth. Our results are evidence for an effective reduction of variability through a contraction with a recoverable input, obtained from a series of one-to-one mappings.

¹ Code is available at: https://github.com/jhjacobsen/pytorch-i-revnet
2 RELATED WORK

Several recent works show that significant information about the input images is lost with depth in successful ImageNet classification CNNs (Dosovitskiy & Brox, 2016; Mahendran & Vedaldi, 2016). To understand the loss of information, these works propose to invert the representations by means of learned or hand-engineered priors. The approximate inversions indicate increased geometric and photometric invariance with depth. Multiple other works report progressive properties of deep networks that may be linked to discarded information in the representations as well, such as linearization (Radford et al., 2015), linear separability (Zeiler & Fergus, 2014), contraction (Oyallon, 2017) and low-dimensional embeddings (Aubry & Russell, 2015). However, it is not clear from the above observations whether the loss of information is a necessity for the observed progressive phenomena. In this work, we show that progressive separation and contraction can be obtained while at the same time allowing an exact reconstruction of the signal.

Multiple frameworks have been introduced that permit learning invertible representations under certain conditions. Parseval networks (Cisse et al., 2017) have been introduced to increase the robustness of learned representations with respect to adversarial attacks. In this framework, the spectrum of the convolutional operators is constrained to norm 1 during learning; the linear operators are thus injective. As a consequence, the input of Parseval networks can be recovered only if the built-in nonlinearities are invertible as well, which is typically not the case. Bruna et al. (2013) derive conditions under which pooling representations are invertible, but our method directly overcomes this issue. The scattering transform (Mallat, 2012) is an example of a predefined deep representation, approximately invariant to translations, that can be reconstructed when the specified degree of invariance is small; yet this requires a gradient descent optimization and no guarantee of convergence is known. In summary, these works make clear that invertibility requires special care in designing either the architecture or the optimization procedure. In this paper, we introduce a network that overcomes these issues and has an exact inverse by construction.

Our main inspiration for this work is the recent reversible residual network (RevNet), introduced in Gomez et al. (2017). RevNets are in turn closely related to the NICE and Real NVP architectures (Dinh et al., 2016; 2014), which make use of constrained Jacobian determinants for generative modeling. All these architectures are similar to the lifting scheme (Sweldens, 1998) and Feistel cipher diagrams (Menezes et al., 1996), as we will show. RevNets illustrate how to build invertible ResNet-type blocks that avoid storing the intermediate activations necessary for the backward pass. However, RevNets still employ multiple non-invertible operators like max-pooling and downsampling operators as part of the network. As such, RevNets are not invertible by construction. In this paper, we show how to build an invertible type of RevNet architecture that performs competitively with RevNets on ImageNet, which we call i-RevNet, for invertible RevNet.
3 THE i-REVNET

This section introduces the general framework of the i-RevNet architecture and explains how to explicitly build an inverse or a left-inverse to an i-RevNet. Its practical implementation is discussed, and we demonstrate competitive numerical results.

3.1 AN INVERTIBLE ARCHITECTURE

Figure 1: The main component of the i-RevNet and its inverse. RevNet blocks are interleaved with convolutional bottlenecks $F_j$ and reshuffling operations $S_j$ to ensure invertibility of the architecture and computational efficiency. The input is processed through a splitting operator $S$, and the output is merged through $M$. Observe that the inverse network is obtained with minimal adaptations.

We describe i-RevNets in their general setting. Their foundations are largely grounded in the recent RevNet architecture (Gomez et al., 2017). In an i-RevNet, an initial input is split into two sublayers $(x_0, \tilde{x}_0)$ of equal size, thanks to a splitting operator $Sx \triangleq (x_0, \tilde{x}_0)$; in this paper we choose to split along the channel dimension, as is done in RevNets. The operator $S$ is linear, injective, reduces the spatial resolution of the coefficients and can potentially increase the layer size, as wider layers usually improve the classification performance (Zagoruyko & Komodakis, 2016). We can thus build a pseudo-inverse $S^+$ that will be used for the inversion. Recall that if $S$ is invertible, then $S^+ = S^{-1}$. The number of coefficients of the next block is maintained, and at each depth $j$, the representation $\Phi_j x$ is again decoupled into two variables $\Phi_j x \triangleq (x_j, \tilde{x}_j)$ that play interlaced roles.

The strategy implemented by an i-RevNet consists in an alternation between additions and nonlinear operators $F_j$, while progressively down-sampling the signal thanks to the operators $S_j$. Here, $F_j$ consists of convolutions and a non-linearity applied to $x_j$. The pair of the final layer is concatenated through a merging operator $M$. We will omit $M$, $M^{-1}$, $S^+$ and $S$ for the sake of simplicity, when not necessary. Figure 1 describes the blocks of an i-RevNet. The design is similar to the Feistel cipher diagrams (Menezes et al., 1996) or a lifting scheme (Sweldens, 1998), which are invertible and efficient implementations of complex transforms like second generation wavelets.

In this way, we avoid the non-invertible modules of a RevNet (e.g. max-pooling or strides), which are necessary to train them in a reasonable time and are designed to build invariance w.r.t. translation variability. Our method shows that we can replace them by linear and invertible modules $S_j$ that reduce the spatial resolution (we refer to this as spatial down-sampling for the sake of simplicity) while maintaining the layer size by increasing the number of channels. We keep the computational cost manageable by tightly coupling downsampling and increase in width of the network. Reducing the spatial resolution can be undesirable, so $S_j$ can potentially be the identity. We refer to such networks as i-RevNets. This leads to the following equations:

$$\tilde{x}_{j+1} = S_{j+1}\, x_j, \qquad x_{j+1} = \tilde{x}_j + F_{j+1}(x_j) \qquad \Longleftrightarrow \qquad x_j = S_{j+1}^{-1}\, \tilde{x}_{j+1}, \qquad \tilde{x}_j = x_{j+1} - F_{j+1}(x_j) \qquad (1)$$
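To make Equation (1) concrete, the following is a minimal PyTorch sketch of one i-RevNet block and its exact inverse. The module name `IRevNetBlock` is illustrative and not taken from the authors' released code; for brevity, $S_{j+1}$ is taken to be the identity and $F_{j+1}$ a single batch-normalized convolution rather than the bottleneck described in Subsection 3.2.

```python
import torch
import torch.nn as nn

class IRevNetBlock(nn.Module):
    """One block implementing Equation (1); sketch with S_{j+1} = identity."""
    def __init__(self, channels):
        super().__init__()
        # F_{j+1}: an arbitrary (possibly non-invertible) residual function.
        self.F = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x, x_tilde):
        # x_tilde_{j+1} = S_{j+1} x_j ;  x_{j+1} = x_tilde_j + F_{j+1}(x_j)
        x_tilde_next = x
        x_next = x_tilde + self.F(x)
        return x_next, x_tilde_next

    def inverse(self, x_next, x_tilde_next):
        # x_j = S_{j+1}^{-1} x_tilde_{j+1} ;  x_tilde_j = x_{j+1} - F_{j+1}(x_j)
        x = x_tilde_next
        x_tilde = x_next - self.F(x)
        return x, x_tilde

# Sanity check: the inverse recovers the input exactly (up to round-off).
block = IRevNetBlock(channels=8).eval()
x, x_tilde = torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16)
with torch.no_grad():
    y, y_tilde = block(x, x_tilde)
    x_rec, x_tilde_rec = block.inverse(y, y_tilde)
print(torch.allclose(x, x_rec), torch.allclose(x_tilde, x_tilde_rec))
```

Note that invertibility never constrains $F_{j+1}$ itself: the inverse only needs to re-evaluate it on $x_j$, which is recovered from $\tilde{x}_{j+1}$.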
Figure 2: Illustration of the invertible down-sampling.

Our downsampling layer can be written, for $u$ the spatial variable and $\lambda$ the channel index, as $S_j x(u, \lambda) = x(\Psi(u, \lambda))$, where $\Psi$ is some invertible mapping. In principle, any invertible downsampling operation, e.g. dilated convolutions (Yu & Koltun, 2015), can be considered here. We use the inverse of the operation described in Shi et al. (2016), as illustrated in Figure 2, since it roughly preserves the spatial ordering and thus avoids mixing different neighborhoods via the next convolution. $S$ is similar, but also linearly increases the channel dimensionality, for example by concatenating zeros. The final layer $\Phi x \triangleq \Phi_J x = (x_J, \tilde{x}_J)$ is then averaged along the spatial dimension, followed by a ReLU non-linearity and finally a linear projection onto the class probes, which are fed to a supervised training algorithm.

From a given i-RevNet, it is possible to define a left-inverse $\Phi^+$, i.e. $\Phi^+\Phi x = x$, or even an inverse $\Phi^{-1}$, i.e. $\Phi\Phi^{-1}x = \Phi^{-1}\Phi x = x$, if $S$ is invertible. In these cases, the convolutional sections are themselves i-RevNets. An i-RevNet is the dual of its inverse, in the sense that the inverse is obtained by replacing $(S_j, F_j)$ with $(S_j^{-1}, F_j)$ at each depth $j$ and applying $S^+$ to the output. In consequence, its implementation is simple and specified by Equation (1). In Subsection 4.2, we discuss that the inverse of $\Phi$ does not suffer from significant round-off errors, while however being very sensitive to small variations of an input on a large subspace, as shown in Subsection 4.1.

3.2 ARCHITECTURE, TRAINING AND PERFORMANCES

In this subsection, we describe two models that we trained: an injective i-RevNet (a) and a bijective i-RevNet (b) with fewer parameters. The hyper-parameters were selected to be close to the ResNet and RevNet baselines in terms of either the number of layers (a) or the number of parameters (b), while keeping performance competitive. For the same reasons as in Gomez et al. (2017), our scheme also allows avoiding storing any intermediate activations at training time, making memory consumption for very deep i-RevNets not an issue in practice. We compare our implementation with a RevNet with 56 layers corresponding to 28M parameters, as provided in the open source release of Gomez et al. (2017), and with a standard ResNet of 50 layers with 26M parameters (He et al., 2016).

Each block $F_j$ is a bottleneck block, which consists of a succession of 3 convolutional operators, each preceded by batch normalization (Ioffe & Szegedy, 2015) and a ReLU non-linearity. The second layer has four times fewer channels than the other two, while their corresponding kernel sizes are respectively 1x1, 3x3, 1x1.

| Architecture | Injective | Bijective | Top-1 error (%) | Parameters |
|---|---|---|---|---|
| ResNet | - | - | 24.7 | 26M |
| RevNet | - | - | 25.2 | 28M |
| i-RevNet (a) | yes | - | 24.7 | 181M |
| i-RevNet (b) | yes | yes | 26.7 | 29M |

Table 1: Comparison of different architectures trained on ILSVRC-2012, in terms of classification accuracy and number of parameters.

The final representation is spatially averaged and projected onto the 1000 classes after a ReLU non-linearity. We now discuss how we progressively decrease the spatial resolution while increasing the number of channels per layer by use of the operators $S_j$.

We first describe model (a), which consists of 56 layers that have been optimized to match the performance of a RevNet or a ResNet with approximately the same number of layers. The splitting operator $S$ consists in a linear and injective embedding that downsamples the spatial resolution by a factor $4^2$ while increasing the number of output channels from 48 to 96 by simply adding zeros.
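The invertible spatial downsampling (the inverse of the sub-pixel operation of Shi et al. (2016), often called space-to-depth) and the zero-padding that makes $S$ injective can be sketched as follows. This is a minimal illustration under the assumption that the $4^2$ reduction is realized as two successive space-to-depth steps; the function names are hypothetical and not those of the released implementation.

```python
import torch

def invertible_downsample(x, block=2):
    """Space-to-depth: trade a block x block neighborhood for channels.
    (N, C, H, W) -> (N, C*block^2, H/block, W/block); a bijective reshuffling."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // block, block, w // block, block)
    x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
    return x.view(n, c * block * block, h // block, w // block)

def invertible_upsample(x, block=2):
    """Exact inverse of invertible_downsample (depth-to-space)."""
    n, c, h, w = x.shape
    c_out = c // (block * block)
    x = x.view(n, block, block, c_out, h, w)
    x = x.permute(0, 3, 4, 1, 5, 2).contiguous()
    return x.view(n, c_out, h * block, w * block)

def injective_split(x, out_channels):
    """Injective embedding: pad with zero channels up to out_channels (e.g. 48 -> 96).
    A left-inverse simply drops the padded channels again."""
    n, c, h, w = x.shape
    pad = torch.zeros(n, out_channels - c, h, w, dtype=x.dtype, device=x.device)
    return torch.cat([x, pad], dim=1)

x = torch.randn(2, 3, 224, 224)
y = invertible_downsample(invertible_downsample(x))   # factor 4^2: (2, 48, 56, 56)
z = injective_split(y, 96)                            # zero-padded to 96 channels
assert torch.allclose(invertible_upsample(invertible_upsample(y)), x)
```

Because the reshuffling only permutes coefficients, no information is lost when the resolution is reduced, which is exactly what distinguishes $S_j$ from the strided or pooled downsampling of a standard RevNet.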
This zero-padding permits increasing the initial layer size and, consequently, the size of the subsequent layers, as performed in Gomez et al. (2017); it is thus not a bijective but an injective i-RevNet. At depth $j$, $S_j$ allows us to reduce the number of computations while maintaining good classification performance. It corresponds to a downsampling operator at depths $3j = 15, 27, 45$ ($3j$, as one block corresponds to three layers), similar to a normal RevNet. The spatial resolution of these layers is reduced by a factor $2^2$ while the number of channels is increased by a factor of 4, respectively to 48, 192, 768 and 3072. It follows that the corresponding spatial resolutions for an input of size $224^2$ are respectively $112^2$, $56^2$, $28^2$, $14^2$, $7^2$. The total number of coefficients at each layer is then about 0.3M. All the remaining operators $S_j$ are kept fixed to the identity, as explained in the section above.

Architecture (b) is bijective; it consists of 300 layers (100 blocks), whose total number of parameters has been optimized to match that of a RevNet with 56 layers. Initially, the input is split via $S$, which corresponds to an invertible spatial downsampling of $2^2$ that increases the number of channels from 3 to 12. It thus keeps the dimension constant and permits building a bijective i-RevNet. Then, at depths $3j = 3, 21, 69, 285$, the spatial resolution is reduced by $2^2$ via $S_j$. Contrary to architecture (a), the dimensionality of each layer is constantly equal to $3 \cdot 224^2$ until the final layer, with channel sizes of 24, 96, 384, 1536.

Figure 3: Training loss of the i-RevNet (b), compared to the ResNet, on ImageNet (cross-entropy loss against training iterations).

For both networks, the training on ImageNet follows the same setup as Gomez et al. (2017). We train with SGD and a momentum of 0.9. We regularize the model with an ℓ2 weight decay of $10^{-4}$ and batch normalization. The dataset is processed for 600k iterations with a batch size of 256, distributed over 4 GPUs. The initial learning rate is 0.1, dropped by a factor of ten every 160k iterations. The dataset was augmented according to Gomez et al. (2017): the image values are mapped to [0, 1] and the following geometric transformations are applied: random scaling, random horizontal flipping, random cropping of size $224^2$, and finally color distortions. No other regularization was incorporated into the classification pipeline. At test time, we rescale the images to $256^2$ and perform a center crop of size $224^2$.
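The stated optimizer and learning-rate schedule can be read as the following PyTorch sketch. The model here is a tiny placeholder (the real one would be an i-RevNet (b) from the released code), the batch is synthetic, and the loop is written per iteration as in the text; this is an illustrative reading of the reported hyper-parameters, not the authors' training script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model; see https://github.com/jhjacobsen/pytorch-i-revnet for the real one.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 1000))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# Drop the learning rate by a factor of ten every 160k iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=160_000, gamma=0.1)

for iteration in range(600_000):  # 600k iterations in the paper
    # Stand-in batch; the paper uses ImageNet crops of size 224^2, batch size 256 on 4 GPUs.
    images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
    loss = F.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # per-iteration schedule, as the drops are given in iterations
```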
We report the training loss (i.e. cross-entropy) curves of our i-RevNet (b) and the ResNet baseline in Figure 3, displayed as a moving average over 100 iterations. Observe that the decrease of both training losses is very similar, which indicates that the constraint of invertibility does not interfere negatively with the learning process. However, we observed roughly one third longer wall-clock times for i-RevNets compared to plain RevNets because the channel sizes become larger.

Table 1 reports the performances of our i-RevNets alongside comparable RevNet and ResNet baselines. First, we compare the i-RevNet (a) with the RevNet and ResNet. These CNNs have the same number of layers, and the i-RevNet (a) increases the channel width of the initial layer as done in Gomez et al. (2017). The drawback of this technique is that the kernel sizes will be larger for all subsequent layers. The i-RevNet (a) therefore has about 6 times more parameters than a RevNet and a ResNet, but leads to a similar accuracy on the validation set of ImageNet. On the contrary, the i-RevNet (b) is designed to have roughly the same number of parameters as the RevNet and ResNet, while being bijective. Its accuracy decreases by 1.5% (absolute) on ImageNet compared to the RevNet baseline, which is not surprising because the number of channels was not drastically increased in the earlier layers, as is done in the baselines (Gomez et al., 2017; Krizhevsky et al., 2012; He et al., 2016); we did not explore wide ranges of hyper-parameters, so the gap between (a) and (b) can likely be reduced with additional engineering.

4 ANALYSIS OF THE INVERSE

We now analyze the representation $\Phi$ built by our bijective neural network, the i-RevNet (b), and its inverse $\Phi^{-1}$, as trained on ILSVRC-2012. We first explain why obtaining $\Phi^{-1}$ is challenging, even locally. We then discuss the reconstruction, while displaying in the image space linear interpolations between representations.

4.1 AN ILL-CONDITIONED INVERSION

Figure 4: Normalized sorted singular values of the differential $\partial\Phi x$, plotted against the rank of the singular value.

In the previous section, we described the i-RevNet architecture, which permits defining a deep network with an explicit inverse. We now explain why this is normally difficult, by studying its local inversion. We study the local stability of a network $\Phi$ and its inverse $\Phi^{-1}$ w.r.t. its input, i.e. we quantify locally the variations of the network and its inverse under small variations of an input. As $\Phi$ is differentiable (and its inverse as well), an equivalent way to perform this study is to analyze the singular values of the differential $\partial\Phi$ at some point, since for $(a, b)$ close to each other the following holds: $\Phi a \approx \Phi b + \partial\Phi b\,(a - b)$. Ideally, a well-conditioned operator has all its singular values constant and equal to 1, as achieved for instance by the isometric operators of Cisse et al. (2017). In our numerical application to an image $x$, $\partial\Phi x$ corresponds to a very large matrix (at least the square of the number of coefficients of the image), whose computation is expensive. Figure 4 shows the singular values of the differential (i.e. the square roots of the eigenvalues of $\partial\Phi^{*}\partial\Phi$), in decreasing order, for a given natural image from ImageNet. The example we plot is typical of the behavior of $\partial\Phi$. Observe that there is a fast decay: numerically, the first $10^3$ and $10^4$ singular values are responsible respectively for 80% and 97% of the cumulated energy (i.e. the sum of squared singular values). This indicates that $\Phi$ linearizes the space locally into a considerably smaller space in comparison to the original input dimension. However, the dimensionality is still quite large, and thus we cannot infer that $\Phi x$ lies locally on a low-dimensional manifold. It also shows that inverting $\Phi$ is difficult and is an ill-conditioned problem. Obtaining this inverse implicitly would therefore be a challenging task, which we avoid thanks to the explicit reconstruction algorithm provided in Subsection 3.1.
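The singular-value analysis above can be reproduced in a few lines with automatic differentiation. The sketch below uses a small stand-in network and input size (the full $\partial\Phi x$ of a $224^2$ image is far too large to materialize), so the network and shapes are illustrative only; in the paper the map is the trained i-RevNet (b).

```python
import torch
import torch.nn as nn

# Stand-in differentiable map Phi.
phi = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(12, 12, kernel_size=3, padding=1),
)

x = torch.randn(1, 3, 16, 16)  # small input so the full Jacobian fits in memory

# Jacobian of Phi at x, flattened to a (dim(Phi(x)) x dim(x)) matrix.
jac = torch.autograd.functional.jacobian(phi, x)
jac = jac.reshape(phi(x).numel(), x.numel())

# Sorted singular values of the differential, normalized by the largest one.
svals = torch.linalg.svdvals(jac)
svals = svals / svals[0]

# Fraction of the cumulated energy (sum of squared singular values) in the top-k values.
energy = torch.cumsum(svals**2, dim=0) / torch.sum(svals**2)
print(svals[:5], energy[99].item())  # energy captured by the first 100 singular values
```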
Figure 5: Several reconstructed sequences {x_t}_t. The left image corresponds to $x_0$ and the right image to $x_1$.

4.2 LINEAR INTERPOLATION AND RECONSTRUCTION

Visualizing or understanding the important directions in the representation of the inner layers of a CNN, and in particular the final layer, is complex because the cascade is typically either not invertible or unstable. One approach to reconstruct from an output layer consists in finding the input image that matches the activation via gradient descent; however, this technique leads only to a partial or informal reconstruction (Mahendran & Vedaldi, 2015). Another method consists in embedding the representation in a lower-dimensional space and comparing the common attributes of nearest neighbors (Szegedy et al., 2013). It is also possible to train a CNN to reconstruct the input from the representation (Dosovitskiy & Brox, 2016). Yet these methods require a priori knowledge in order to find the appropriate embeddings or training sets.

We now discuss the improvements achieved by the i-RevNet. Our main claim is that, while the local inversion is ill-conditioned, the computation of the inverse $\Phi^{-1}$ does not involve significant round-off errors. The forward pass of the network does not seem to suffer from significant instabilities, so it is coherent to assume that this holds for $\Phi^{-1}$ as well. By contrast, adding constraints beyond vanishing moments in the case of a lifting scheme is difficult (Sweldens, 1998; Mallat, 1999), which is a weakness of that method. We validate our claim by computing the empirical relative reconstruction error on several subsets $X$ of data:

$$\epsilon(X) = \frac{1}{|X|} \sum_{x \in X} \frac{\|\Phi^{-1}\Phi x - x\|}{\|x\|}.$$

We evaluate this measure on a subset $X_1$ of $|X_1| = 10^4$ independent uniform noise images and on the validation set $X_2$ of ImageNet. We obtain $\epsilon(X_1) = 5 \times 10^{-6}$ and $\epsilon(X_2) = 3 \times 10^{-6}$ respectively, which is close to machine precision and indicates that the inversion does not suffer from significant round-off errors.

Given a pair of images $\{x_0, x_1\}$, we propose to study linear interpolations between the pair of representations $\{\Phi x_0, \Phi x_1\}$ in the feature domain. These interpolations correspond to existing images since $\Phi^{-1}$ is an exact inverse. We reconstruct a convex path between two input points: if $\varphi_t = t\,\Phi x_0 + (1-t)\,\Phi x_1$, then $x_t = \Phi^{-1}\varphi_t$ is a signal that corresponds to an image. We discretize $[0, 1]$ into $\{t_1, ..., t_k\}$, adapt the step size manually and reconstruct the sequence $\{x_{t_1}, ..., x_{t_k}\}$. Results are displayed in Figure 5. We selected images from the Basel face dataset (Paysan et al., 2009), the Describable Textures dataset (Cimpoi et al., 2014) and ImageNet.
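Both the relative reconstruction error $\epsilon$ and the feature-space interpolation only require a forward map together with its exact inverse, as in the sketch below. The `phi` / `phi_inv` pair here is a stand-in orthogonal linear map rather than the trained i-RevNet (b), and the helper name `relative_error` is hypothetical; the purpose is only to show the two computations.

```python
import torch

# Stand-in invertible representation: an orthogonal linear map (the paper uses the
# trained i-RevNet (b) and its explicit inverse instead).
d = 3 * 32 * 32
Q, _ = torch.linalg.qr(torch.randn(d, d))
phi = lambda x: x.reshape(-1, d) @ Q.T
phi_inv = lambda z: (z @ Q).reshape(-1, 3, 32, 32)

def relative_error(X):
    """epsilon(X): mean over x of ||Phi^{-1} Phi x - x|| / ||x||."""
    rec = phi_inv(phi(X))
    num = (rec - X).flatten(1).norm(dim=1)
    den = X.flatten(1).norm(dim=1)
    return (num / den).mean().item()

X = torch.rand(16, 3, 32, 32)  # e.g. independent uniform noise images
print("relative reconstruction error:", relative_error(X))

# Linear interpolation in feature space between two images x0, x1:
x0, x1 = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
ts = torch.linspace(0.0, 1.0, steps=8)
sequence = [phi_inv(t * phi(x0) + (1 - t) * phi(x1)) for t in ts]  # each x_t is an image
```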
We now interpret the results. First, observe that a linear interpolation in the feature space is not a linear interpolation in the image space, and that intermediary images are noisy, even for small deformations, yet they mostly remain recognizable. However, some geometric transformations such as a 3D rotation seem to have been linearized, as suggested in Aubry & Russell (2015). In the next section, we thus investigate how the linear separation progresses with depth.

5 A CONTRACTION

In this section, we study again the bijective i-RevNet. We first show that a localized or linear classifier progressively improves with depth. Then, we describe the linear subspace spanned by $\Phi$, namely the feature space, showing that the classification can be performed on a much smaller subspace, which can be built via a PCA.

5.1 PROGRESSIVE LINEAR SEPARATION AND CONTRACTION

We show that both a ResNet and an i-RevNet build a progressively more linearly separable and contracted representation, as measured in Oyallon (2017). Observe that this property holds for the i-RevNet despite the fact that it cannot discard any information. We investigate these properties in each block, with the following experimental protocol. To reduce the computational burden, we use a subset of 100 randomly selected ImageNet classes, consisting of N = 120k images, and keep the same subset during all of the following experiments. At each depth $j$, we extract the features $\{\Phi_j x_n\}_{n \leq N}$ of the training set, average them along the spatial variable and standardize them in order to avoid any ill-conditioning effects. We use both a nearest neighbor classifier and a linear SVM. The former is a localized classifier that indicates whether the ℓ2 metric becomes progressively more meaningful for classification, while a linear SVM measures the linear separation of the different classes. The parameters of the linear SVM are cross-validated on a small subset of the training set, prior to training on the 100 classes.

Figure 6: Accuracy at depth $j$ for a linear SVM and a 1-nearest-neighbor classifier applied to the spatially averaged $\Phi_j$, for (a) the ResNet and (b) the i-RevNet.

We evaluate both classifiers for each model on the validation set of ImageNet and report the Top-1 accuracy in Figure 6. We observe that both classifiers progressively improve with depth in a similar fashion for each model, the linear SVM performing slightly better than the nearest neighbor classifier because it is the more robust and discriminative classifier of the two. In the case of the i-RevNet, the classification performed by the CNN itself reaches 77%, and the linear SVM performs slightly better because we did not fine-tune the model to the 100 classes. Observe that there is a sharper jump in performance over the 3 last layers, which seems to indicate that the earlier layers have prepared the representation to be more contracted and linearly separated for the final layers.

The results suggest a low-dimensional embedding of the data, but this is difficult to validate, as estimating local dimensionality in high dimensions is an open problem. However, in the next section, we try to compute the dimension of the discriminative part of the representation built by an i-RevNet.

5.2 DIMENSIONALITY ANALYSIS OF THE FEATURE SPACE

In this section, we investigate whether we can refine the dimensionality of the informative variabilities in the final layer of an i-RevNet. Indeed, the cascade of convolutional operators has been trained on the training set to separate the 1000 different classes while being a homeomorphism on its feature space. Thus, the dimensionality of the feature space is potentially large. As shown in the previous subsection, the final layer is progressively prepared to be projected onto the final probes corresponding to the classes. This indicates that the non-informative variabilities for classification can be removed via a linear projection on the final layer $\Phi$, whose informative part then lies in a space of dimension 1000 at most. However, this projection has been built via supervision, which can still retain directions that have been contracted and thus will not be selected by an algorithm such as PCA. We show that, in fact, a PCA retains the necessary information for classification in a small subspace.

Figure 7: Accuracy of a linear SVM and a nearest neighbor classifier against the number of principal components (PCA axes) retained.

To do so, we build the linear projectors $\pi_d$ onto the subspace spanned by the $d$ first principal components, and we measure the classification power of the projected representation with a supervised classifier, e.g. a nearest neighbor classifier or a linear SVM, on the previous 100-class task. Again, the feature representations $\{\Phi x_n\}_{n \leq N}$ are spatially averaged to remove the translation variability and standardized on the training set. We apply both classifiers and report the classification accuracy of $\{\pi_d \Phi x_n\}_{n \leq N}$ as a function of $d$ in Figure 7.
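The evaluation protocol of Subsections 5.1 and 5.2 amounts to spatially averaging the features, standardizing them, optionally projecting onto the first $d$ principal components, and fitting simple classifiers. Below is a minimal scikit-learn sketch with randomly generated stand-in features in place of the extracted $\Phi_j x_n$; the sizes, the classifier settings and the value of $d$ are illustrative, not the cross-validated choices of the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in features: in the paper these are Phi_j x_n for 100 ImageNet classes.
rng = np.random.default_rng(0)
n_train, n_test, channels, classes = 2000, 500, 64, 10
feats_train = rng.normal(size=(n_train, channels, 7, 7))  # (N, C, H, W) feature maps
feats_test = rng.normal(size=(n_test, channels, 7, 7))
y_train = rng.integers(0, classes, n_train)
y_test = rng.integers(0, classes, n_test)

# Spatial averaging removes translation variability; then standardize on the training set.
X_train = feats_train.mean(axis=(2, 3))
X_test = feats_test.mean(axis=(2, 3))
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Optional projection pi_d onto the d first principal components (Subsection 5.2).
d = 16
pca = PCA(n_components=d).fit(X_train)
Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

# Linear separability (linear SVM) and local l2 structure (1-nearest neighbor).
svm_acc = LinearSVC(C=1.0).fit(Z_train, y_train).score(Z_test, y_test)
knn_acc = KNeighborsClassifier(n_neighbors=1).fit(Z_train, y_train).score(Z_test, y_test)
print(f"linear SVM: {svm_acc:.3f}, 1-NN: {knn_acc:.3f}")
```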
A linear projection removes some information that cannot be recovered by a linear classifier; accordingly, we observe that the classification accuracy only decreases significantly for d below 200. This shows that the signal indeed lies in a subspace of much lower dimension than the original feature dimension, and that this subspace can be extracted simply with a PCA that only considers the directions of largest variance, illustrating a successful contraction of the representation.

6 CONCLUSION

Invertible representations and their relationship to the loss of information have been on the agenda of deep learning for some time. Understanding how transformations in feature space are related to the corresponding input is an important step towards interpretable deep networks, and invertible deep networks may play an important role in such an analysis since, for example, one could potentially back-track a property from the feature space to the input space. To the best of our knowledge, this work provides the first empirical evidence that learning invertible representations that do not discard any information about their input is possible on large-scale supervised problems. To achieve this, we introduce the i-RevNet, a class of CNN which is fully invertible and permits exact recovery of the input from its last convolutional layer. i-RevNets achieve the same classification accuracy on complex datasets, as illustrated on ILSVRC-2012, when compared to the RevNet (Gomez et al., 2017) and ResNet (He et al., 2016) architectures with a similar number of layers. Furthermore, the inverse network is obtained for free when training an i-RevNet, requiring only minimal adaptation to recover inputs from the hidden representations. The absence of a loss of information is surprising, given the widespread belief that discarding information is essential for learning representations that generalize well to unseen data. We show that this is not the case and propose to explain the generalization property with empirical evidence of progressive separation and contraction with depth, on ImageNet.

ACKNOWLEDGEMENTS

Jörn-Henrik Jacobsen was partially funded by the STW perspective program ImaGene. Edouard Oyallon was partially funded by the ERC grant InvariantClass 320959, via a grant for PhD students of the Conseil régional d'Île-de-France (RDM-IdF), and a postdoctoral grant from the DPEI of Inria (AAR 2017POD057) for the collaboration with CWI. We thank Berkay Kicanaoglu for the Basel Face data, and Mathieu Andreux, Eugene Belilovsky, Amal Rannen, Patrick Putzky and Kyriacos Shiarlis for feedback on drafts of the paper.

REFERENCES

Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350, 2017.

Mathieu Aubry and Bryan C Russell. Understanding deep features with computer-generated imagery. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2875–2883, 2015.

Joan Bruna, Arthur Szlam, and Yann LeCun. Signal recovery from pooling representations. arXiv preprint arXiv:1311.4025, 2013.

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613, 2014.
Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pp. 854–863, 2017.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4829–4837, 2016.

Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. arXiv preprint arXiv:1707.04585, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5188–5196, 2015.

Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3):233–255, 2016.

Stéphane Mallat. A wavelet tour of signal processing. Academic Press, 1999.

Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

Stéphane Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A, 374(2065):20150203, 2016.

Alfred J Menezes, Paul C Van Oorschot, and Scott A Vanstone. Handbook of applied cryptography. CRC Press, 1996.

Edouard Oyallon. Building a regular decision boundary with deep networks. arXiv preprint arXiv:1703.01775, 2017.

Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. In Advanced Video and Signal Based Surveillance (AVSS), Sixth IEEE International Conference on, pp. 296–301. IEEE, 2009.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883, 2016.
Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

Wim Sweldens. The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2):511–546, 1998.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pp. 1–5. IEEE, 2015.

Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.