Signal recovery from Pooling Representations

Joan Bruna, Courant Institute, New York University (BRUNA@CIMS.NYU.EDU)
Arthur Szlam, The City College of New York, CUNY (ASZLAM@CCNY.CUNY.EDU)
Yann LeCun, Courant Institute, New York University (YANN@CS.NYU.EDU)

In this work we compute lower Lipschitz bounds of ℓp pooling operators for p = 1, 2, ∞, as well as of ℓp pooling operators preceded by half-rectification layers. These bounds give sufficient conditions for the design of invertible neural network layers. Numerical experiments on MNIST and image patches confirm that pooling layers can be inverted with phase recovery algorithms. Moreover, the regularity of the inverse pooling, controlled by the lower Lipschitz constant, is empirically verified with a nearest neighbor regression.

1. Introduction

A standard architecture for deep feedforward networks consists of a number of stacked modules, each of which consists of a linear mapping, followed by an elementwise nonlinearity, followed by a pooling operation. Critical to the success of this architecture in recognition problems is its capacity for preserving discriminative signal information while being invariant to nuisance deformations. The recent works (Mallat, 2012; Bruna and Mallat, 2012) study the role of the pooling operator in building invariance. In this work, we study a network's capacity for preserving information. Specifically, we study the invertibility of modules consisting of a linear mapping, the half-rectification nonlinearity, and ℓp pooling, for p ∈ {1, 2, ∞}. We discuss recent work in the case p = 2, and connections with the phase recovery problem of (Candes et al., 2013; Gerchberg and Saxton, 1972; Waldspurger et al., 2012).

1.1. ℓp pooling

The purpose of the pooling layer in each module is to give invariance to the system, perhaps at the expense of resolution. This is done via a summary statistic over the outputs of groups of nodes. In the trained system, the columns of the weight matrix corresponding to nodes grouped together often exhibit similar characteristics, and code for perturbations of a template (Kavukcuoglu et al., 2009; Hyvärinen and Hoyer, 2001). The summary statistic in ℓp pooling is the ℓp norm of the inputs into the pool. That is, if nodes x_{i_1}, ..., x_{i_l} are in a pool, the output of the pool is

(|x_{i_1}|^p + ... + |x_{i_l}|^p)^{1/p} ,

where as usual, if p = ∞, this is max(|x_{i_1}|, ..., |x_{i_l}|). If the outputs of the nonlinearity are nonnegative (as for the half-rectification function), then p = 1 corresponds to average pooling, and the case p = ∞ is max pooling.

1.2. Phase reconstruction

Given x ∈ R^n, a classical problem in signal processing is to recover x from the absolute values of its (1- or 2-dimensional) Fourier coefficients, perhaps subject to some additional constraints on x; this problem arises in speech generation and X-ray imaging (Ohlsson, 2013). Unfortunately, the problem is not well posed: the absolute values of the Fourier coefficients come nowhere near specifying x. For example, the absolute value of the Fourier transform is translation invariant. It can be shown (and we discuss this below) that the absolute values of the inner products between x and any basis of R^n are not enough to uniquely specify an arbitrary x; the situation is worse for C^n.
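As a toy illustration of this ambiguity, the following minimal NumPy sketch (the standard basis and the particular vectors are illustrative choices, not taken from the paper) shows two signals, differing by coordinate-wise sign flips rather than by a global sign, that produce identical modulus measurements:

```python
import numpy as np

# Measurements: modulus of inner products with a (non-redundant) basis.
F = np.eye(3)                      # standard basis of R^3, one vector per row
x1 = np.array([1.0, -2.0,  3.0])
x2 = np.array([1.0,  2.0, -3.0])   # sign-flipped coordinates, not a global sign change

m1 = np.abs(F @ x1)
m2 = np.abs(F @ x2)
print(np.allclose(m1, m2))         # True: the moduli cannot distinguish x1 from x2
```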
On the other hand, recent works have shown that by taking a redundant enough dictionary, the situation is different, and x can be recovered from the modulus of its inner products with the dictionary (Balan et al., 2006; Candes et al., 2013; Waldspurger et al., 2012). Suppose for a moment that there is no elementwise nonlinearity in our feedforward module, only a linear mapping followed by a pooling. Then, with a slightly generalized notion of phase, where the modulus is the ℓp norm of the pool and the phase is the ℓp unit vector specifying the direction of the inner products in the pool, the phase recovery problem above asks whether the module loses any information. The ℓ2 case has been recently studied in (Cahill et al., 2013).

1.3. What vs. Where

If the columns of the weight matrix in a pool correspond to related features, it can be reasonable to see the entire pool as a "what". That is, the modulus of the pool indicates the relative presence of a grouping of (sub)features into a template, and the phase of the pool describes the relative arrangement of the subfeatures, describing "where" the template is, or more generally, describing the pose of the template. From this viewpoint, phase reconstruction results make rigorous the notion that, given enough redundant versions of the "what" and throwing away the "where", we can still recover the "where".

1.4. Contributions of this work

In this work we give conditions under which a module consisting of a linear mapping, perhaps followed by a half-rectification, followed by an ℓp pooling preserves the information in its input. We extend the ℓ2 results of (Cahill et al., 2013; Balan and Wang, 2013) in several ways: we consider the ℓp case, take into account the half-rectification nonlinearity, and make the results quantitative in the sense that we give bounds on the lower Lipschitz constants of the modules. This gives a measure of the stability of the inversion, which is especially important in a multi-layer system. Using our bounds, we prove that redundant enough random modules with ℓ1 or ℓ2 pooling are invertible. We also show the results of numerical experiments designed to explore the gaps in our results and in the results in the literature. We note that the alternating minimization method of (Gerchberg and Saxton, 1972) can be used essentially unchanged for the ℓp case, with or without rectification, and show experiments giving evidence that recovery is roughly equally possible for ℓ1, ℓ2, and ℓ∞ using this algorithm, and that half-rectification before pooling can make recovery easier. Furthermore, we show that with a trained initialization, the alternating method compares favorably with the state of the art recovery methods (for ℓ2 with no rectification) of (Waldspurger et al., 2012; Candes et al., 2013), which suggests that the above observations are not an artifact of the alternating method.

2. Injectivity and Lipschitz stability of Pooling Operators

This section studies necessary and sufficient conditions guaranteeing that pooling representations are invertible. It also computes upper and lower Lipschitz bounds, which are tight under certain configurations.

Let us first introduce the notation used throughout the paper. Let F = {f_1, ..., f_M} be a real frame of R^N, with M > N. The frame F is organized into K disjoint blocks F_k = {f_j}_{j ∈ I_k}, k = 1...K, such that I_k ∩ I_{k′} = ∅ for k ≠ k′ and ∪_k I_k = {1, ..., M}. For simplicity, we shall assume that all the pools have equal size |I_k| = L.
The ℓp pooling operator P_p(x) is defined as the mapping

x ↦ P_p(x) = { ‖F_k^T x‖_p , k = 1...K } .   (1)

A related representation, which has gained popularity in recent deep learning architectures, introduces a point-wise thresholding before computing the ℓp norm. If α ∈ R^M is a fixed threshold vector, and (ρ_α(x))_i = max(0, x_i − α_i), then the ℓp rectified pooling operator R_p(x) is defined as

x ↦ R_p(x) = { ‖ρ_{α_k}(F_k^T x)‖_p , k = 1...K } ,   (2)

where α_k contains the coordinates I_k of α.

We shall measure the stability of the inverse pooling with the Euclidean distance in the representation space. Given a distance d(x, x′) in the input space, the Lipschitz bounds of a given operator Φ(x) are defined as the constants 0 ≤ A ≤ B such that

∀ x, x′ ,  A d(x, x′) ≤ ‖Φ(x) − Φ(x′)‖_2 ≤ B d(x, x′) .

In the remainder of the paper, given a frame F, we denote respectively by λ−(F) and λ+(F) its lower and upper frame bounds. If F has M vectors and Ω ⊂ {1, ..., M}, we denote by F_Ω the frame obtained by keeping the vectors indexed in Ω. Finally, we denote by Ω^c the complement of Ω.

2.1. Absolute value and Thresholding nonlinearities

In order to study the injectivity of pooling representations, we first focus on the properties of the operators defined by the point-wise nonlinearities. The properties of the phaseless mapping

x ↦ M(x) = { |⟨x, f_i⟩| , i = 1...m } ,  x ∈ R^n ,   (3)

have been extensively studied in the literature (Balan et al., 2006; Balan and Wang, 2013), in part motivated by applications to speech processing (Achan et al., 2003) or X-ray crystallography (Ohlsson, 2013). It is shown in (Balan et al., 2006) that if m ≥ 2n − 1 then it is possible to recover x from M(x), up to a global sign change. In particular, (Balan and Wang, 2013) recently characterized the stability of the phaseless operator, which is summarized in the following proposition:

Proposition 2.1 ((Balan and Wang, 2013), Theorem 4.3). Let F = (f_1, ..., f_M) with f_i ∈ R^N and d(x, x′) = min(‖x − x′‖, ‖x + x′‖). The mapping M(x) = {|⟨x, f_i⟩|}_{i ≤ M} satisfies

∀ x, x′ ∈ R^N ,  A_F d(x, x′) ≤ ‖M(x) − M(x′)‖ ≤ B_F d(x, x′) ,

with

A_F = min_{Ω ⊂ {1...M}} ( λ−(F_Ω)² + λ−(F_{Ω^c})² )^{1/2} ,   (5)
B_F = λ+(F) .   (6)

In particular, M(x) is injective if and only if for any subset Ω ⊂ {1, ..., M}, either F_Ω or F_{Ω^c} is an invertible frame. A frame F satisfying this condition is said to be phase retrievable.

We now turn our attention to the half-rectification operator, defined as

M_α(x) = ρ_α(F^T x) .   (7)

For that purpose, let us introduce some extra notation. Given a frame F = {f_1, ..., f_M}, a subset Ω ⊂ {1...M} is admissible if

{ x ; ⟨x, f_i⟩ > α_i , i ∈ Ω } ∩ { x ; ⟨x, f_i⟩ < α_i , i ∈ Ω^c } ≠ ∅ .   (8)

We denote by Ω̃ the collection of all admissible sets, and by V_Ω the vector space generated by the vectors indexed in Ω. The following proposition, proved in Section B, gives a necessary and sufficient condition for the injectivity of the half-rectification.

Proposition 2.2. Let A_0 = min_{Ω ∈ Ω̃} λ−(F_Ω). Then the half-rectification operator M_α(x) = ρ_α(F^T x) is injective if and only if A_0 > 0. Moreover, it satisfies

∀ x, x′ ,  A_{00} d(x, x′) ≤ ‖M_α(x) − M_α(x′)‖ ≤ B_0 ‖x − x′‖ ,   (9)

with d(x, x′) = min(‖x − x′‖, ‖x + x′‖), B_0 = max_{Ω ∈ Ω̃} λ+(F_Ω) ≤ λ+(F), and

A_{00} = min_{Ω, Ω′ ∈ Ω̃} [ λ−(F_{Ω ∩ Ω′})² + min( λ−(F_{Ω\(Ω ∩ Ω′)}), λ−(F_{Ω′\(Ω ∩ Ω′)}) )² ]^{1/2} .

The half-rectification has the ability to recover the input signal without the global sign ambiguity. The ability to reconstruct from M_α is thus controlled by the rank of any matrix F_Ω whose columns are taken from a subset belonging to Ω̃. In particular, if α ≡ 0, then Ω ∈ Ω̃ implies Ω^c ∈ Ω̃ (replace x by −x in (8)), so both F_Ω and F_{Ω^c} must have full rank; it results that m ≥ 2n is necessary in order to have A_0 > 0.
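For concreteness, the operators (1) and (2) can be transcribed directly. The sketch below is an illustrative NumPy implementation under the assumption that the frame is stored as an M×N matrix whose rows are the f_j, grouped into consecutive pools of size L; this layout and the names are ours, not part of the paper.

```python
import numpy as np

def lp_norm(v, p):
    """ℓp norm along the last axis, with p = np.inf giving the max."""
    return np.max(np.abs(v), axis=-1) if np.isinf(p) else np.sum(np.abs(v) ** p, axis=-1) ** (1.0 / p)

def pool(F, x, p, L, alpha=None):
    """ℓp pooling (1) if alpha is None, rectified ℓp pooling (2) otherwise.
    F: (M, N) matrix whose rows are the frame vectors f_j, grouped into K = M // L consecutive pools."""
    z = F @ x                                   # linear measurements <x, f_j>
    if alpha is not None:
        z = np.maximum(0.0, z - alpha)          # half rectification rho_alpha
    return lp_norm(z.reshape(-1, L), p)         # one ℓp norm per pool, shape (K,)

# Example: random Gaussian frame, N = 10, K = 12 pools of size L = 2.
rng = np.random.default_rng(0)
N, K, L = 10, 12, 2
F = rng.standard_normal((K * L, N))
x = rng.standard_normal(N)
P2 = pool(F, x, p=2, L=L)                         # ℓ2 pooling
R1 = pool(F, x, p=1, L=L, alpha=np.zeros(K * L))  # rectified ℓ1 pooling (average pooling up to scale)
Pinf = pool(F, x, p=np.inf, L=L)                  # max pooling
```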
Equation (9) shows that the half-rectified operator is globally bi-Lipschitz with respect to d(x, x′). However, it is immediate from the proof that it also satisfies a much simpler local Lipschitz condition, with lower bound A_0 ‖x − x′‖ for x, x′ sufficiently close. The rectified linear operator creates a partition of the input space into polytopes, defined by (8), and computes a linear operator on each of these regions. A given input x activates a set Ω_x ∈ Ω̃, encoded by the signs of the linear measurements ⟨x, f_i⟩ − α_i. As opposed to the absolute value operator, the inverse of M_α, whenever it exists, can be computed directly by locally inverting a linear operator. Indeed, the coordinates of M_α(x) satisfying M_α(x)_j > 0 form a set s(x), which defines a linear model F_{s(x)} that is invertible if A_0 > 0.

In order to compare the stability of the half-rectification with that of the full rectification, one can modify M_α so that it maps x and −x to the same point. If one considers

M̃_α(x) = { M_α(x)  if λ−(F_{s(x)}) > λ−(F_{s(x)^c}) ;  M_{−α}(−x)  otherwise } ,

then M̃_α satisfies the following:

Corollary 2.3. For each x there exists ε > 0 such that

∀ x′ s.t. d(x, x′) ≤ ε ,  Ã d(x, x′) ≤ ‖M̃_α(x) − M̃_α(x′)‖ ≤ B̃ d(x, x′) ,   (10)

with

Ã = 2^{−1/2} min_{Ω ∈ Ω̃} ( λ−(F_Ω)² + λ−(F_{Ω^c})² )^{1/2} ,   (11)
B̃ = max_{Ω ∈ Ω̃} λ+(F_Ω) ≤ λ+(F) ,   (12)

and d(x, x′) = min(‖x − x′‖, ‖x + x′‖), so that Ã ≥ 2^{−1/2} A_F and B̃ ≤ B_F. In particular, if M is invertible, so is M̃_α.

It results that the bi-Lipschitz bounds of the half-rectification operator, when considered under the equivalence x ∼ −x, are controlled by the bounds of the absolute value operator, up to a factor 2^{−1/2}. However, the lower Lipschitz bound (11) consists in a minimum taken over a much smaller family of elements than in (5).

2.2. ℓp Pooling

We give bi-Lipschitz constants of the ℓp pooling and ℓp rectified pooling operators for p = 1, 2, ∞. From their definition, it follows that the pooling operators P_p and R_p can be expressed respectively as functions of the phaseless and half-rectified operators, which implies that for the pooling to be invertible, it is necessary that the absolute value and rectified operators be invertible too. Naturally, the converse is not true.

2.2.1. ℓ2 POOLING

The invertibility conditions of the ℓ2 pooling representation have been recently studied in (Cahill et al., 2013), where the authors obtain necessary and sufficient conditions on the frame F. We shall now generalize those results and derive bi-Lipschitz bounds. Let us define

Q_2 = { (U_k F_k)_{k ≤ K} ;  ∀ k ≤ K , U_k ∈ R^{d×d} , U_k^T U_k = I } .   (13)

Q_2 thus contains all the orthogonal bases of each subspace F_k. The following proposition, proved in Section B, computes upper and lower bounds of the Lipschitz constants of P_2.

Proposition 2.4. The ℓ2 pooling operator P_2 satisfies

∀ x, x′ ,  A_2 d(x, x′) ≤ ‖P_2(x) − P_2(x′)‖ ≤ B_2 d(x, x′) ,

with

A_2 = min_{F̃ ∈ Q_2} min_{Ω ⊂ {1...M}} ( λ−(F̃_Ω)² + λ−(F̃_{Ω^c})² )^{1/2} ,   (14)
B_2 = λ+(F) .   (15)

This proposition thus generalizes the results of (Cahill et al., 2013), since it shows that A_2 > 0 not only controls when P_2 is invertible, but also controls the stability of the inverse mapping.

We also consider the rectified ℓ2 pooling case. For simplicity, we shall concentrate on the case where the pools have dimension d = 2. For that purpose, for each x, x′, we consider a modification of the family Q_2, obtained by replacing each sub-frame F_k by F_{I_k ∩ (s(x) ∪ s(x′))}, which we denote Q̃_{2,x,x′}.

Corollary 2.5. Let d = 2, and set p(x, x′) = (s(x) ∪ s(x′)) \ (s(x) ∩ s(x′)). Then the rectified ℓ2 pooling operator R_2 satisfies

∀ x, x′ ,  Ã_2 d(x, x′) ≤ ‖R_2(x) − R_2(x′)‖ ≤ B_2 d(x, x′) ,

where Ã_2 is obtained from (14) by taking an infimum over pairs x, x′, replacing Q_2 with the modified families Q̃_{2,x,x′}, restricting Ω to subsets of s(x) ∪ s(x′), and adding a term in λ−(F_{p(x,x′)}).
Proposition 2.4 and Corollary 2.5 give lower Lipschitz bounds which yield sufficient guarantees for the inversion of pooling representations. Corollary 2.5 indicates that, in the case d = 2, the lower Lipschitz bounds are sharper than in the non-rectified case, consistently with the results of Section 2.1. The general case d > 2 remains an open issue.

2.2.2. ℓ∞ POOLING

We give in this section sufficient and necessary conditions such that the max-pooling operator P_∞ is injective, and we compute a lower bound of its lower Lipschitz constant. Given x ∈ R^N, we define the switches s(x) of x as the K-vector of coordinates in each pool where the maximum is attained; that is, for each k ∈ {1, ..., K}:

s(x)_k = arg max_{j ∈ I_k} |⟨x, f_j⟩| ,

and we denote by S the set of all attained switches: S = { s(x) ; x ∈ R^N }. This is a discrete subset of ∏_k {1, ..., |I_k|}. Given s ∈ S, the set of input signals having s(x) = s defines a linear cone C_s ⊂ R^N:

C_s = { x ;  ∀ k, ∀ j ∈ I_k ,  |⟨x, f_{s_k}⟩| ≥ |⟨x, f_j⟩| } ,

and as a result the input space is divided into a collection of Voronoi cells defined by linear equations. Restricted to each cone C_s, the max-pooling operator computes the phaseless mapping M(x) from equation (3) corresponding to F_s = (f_{s_1}, ..., f_{s_K}).

Given vectors u and v, set as usual the angle θ(u, v) = arccos( ⟨u, v⟩ / (‖u‖ ‖v‖) ). For each s, s′ ∈ S such that C_s ∩ C_{s′} = ∅ and for each Ω ⊂ {1...K}, we define

β(s, s′, Ω) = min { θ(u, v) ;  u ∈ span(F_{s_Ω}) ∩ C_s ,  v ∈ span(F_{s′_Ω}) ∩ C_{s′} } .

This is a modified first principal angle between the subspaces generated by F_{s_Ω} and F_{s′_Ω}, where the infimum is taken only over the directions included in the respective cones. Set Λ_{s,s′,Ω} = λ−(F_{s_Ω}) sin(β(s, s′, Ω)). Given s, s′, we also define J(s, s′) = { k ; s_k = s′_k }. Recall that L is the size of each pool. Set

A(s, s′)² = min(  (4L)^{−1} min_{Ω ⊂ J(s,s′)} ( λ−(F_Ω)² + λ−(F_{Ω^c})² ) ,  min_{Ω ⊂ J(s,s′)^c} Λ²_{s,s′,Ω}  ) .

The following proposition, proved in Section B, gives a lower Lipschitz bound of the max-pooling operator.

Proposition 2.6. For all x and x′, the max-pooling operator P_∞ satisfies

min_{s,s′} A(s, s′) d(x, x′) ≤ ‖P_∞(x) − P_∞(x′)‖ ,   (18)

where d(x, x′) = min(‖x − x′‖, ‖x + x′‖).

Proposition 2.6 shows that the lower Lipschitz bound is controlled by two different phenomena. The first one depends upon how the cones corresponding to disjoint switches are aligned, whereas the second one depends on the internal incoherence of each frame F_{J(s,s′)}. One may ask how these constants evolve in different asymptotic regimes. For example, suppose the number of pools K is fixed but the size of each pool increases by increasing M. In that case, the set of possible switches S increases, and each cone C_s gets smaller. The bound corresponding to C_s ∩ C_{s′} = ∅ decreases, since the infimum is taken over a larger family. However, as the cones C_s become smaller, the likelihood that any pair x ≠ x′ shares the same switches decreases, thus giving more importance to the case C_s ∩ C_{s′} = ∅. Although the ratio 1/L decreases, the lower frame bounds λ−(F_Ω)², λ−(F_{Ω^c})² will in general increase linearly with L. The lower bound will thus mainly be driven by the principal angles β(s, s′, Ω). Although the minimum in (18) is taken over a larger family, each angle is computed over a smaller region of the space, suggesting that one can indeed increase the size of each pool without compromising the injectivity of the max-pooling. Another asymptotic regime considers pools of fixed size L and an increasing number of pools K. In that case, the bound increases as long as the principal angles remain lower bounded.
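The switch and cone description above can be made concrete with a short sketch (illustrative NumPy, again assuming consecutive pools of size L): on the cone C_s of signals sharing the switches s, max pooling coincides with the phaseless map of the sub-frame F_s.

```python
import numpy as np

def switches(F, x, L):
    """Indices (one per pool) where |<x, f_j>| attains the within-pool maximum."""
    z = np.abs(F @ x).reshape(-1, L)                           # (K, L) moduli
    return np.argmax(z, axis=1) + L * np.arange(z.shape[0])    # global row indices

rng = np.random.default_rng(1)
N, K, L = 8, 10, 3
F = rng.standard_normal((K * L, N))
x = rng.standard_normal(N)

s = switches(F, x, L)
Fs = F[s]                                                      # sub-frame of switch vectors, one per pool
P_inf = np.max(np.abs(F @ x).reshape(-1, L), axis=1)
print(np.allclose(P_inf, np.abs(Fs @ x)))                      # True: on C_s, P_inf(x) = M(x) for F_s
```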
We also consider the stability of max-pooling with a half-rectification, by redefining the switches s(x) accordingly:

s(x) = { j ;  ⟨x, f_j⟩ + α_j > max( 0 , ⟨x, f_{j′}⟩ + α_{j′} ; j′ ∈ pool(j), j′ ≠ j ) } .   (19)

The following corollary, proved in Section B, computes a lower bound of the Lipschitz constant of R_∞.

Corollary 2.7. The rectified max-pooling operator R_∞ satisfies

∀ x, x′ ,  ‖x − x′‖ min_{s,s′} Ã(s, s′) ≤ ‖R_∞(x) − R_∞(x′)‖ ,

where Ã(s, s′) is defined as A(s, s′), with λ−(F_{J(s,s′)}) and Λ_{s,s′,J(s,s′)^c} computed using the cones C̃_s obtained from (19).

2.2.3. ℓ1 POOLING AND MAX-OUT

Proposition 2.6 can be used to obtain a bound on the lower Lipschitz constant of the ℓ1 pooling operator, as well as of the Maxout operator (Goodfellow et al., 2013); see Section B.4.2 in the supplementary material.

2.3. Random Pooling Operators

What is the minimum amount of redundancy needed to invert a pooling operator? As in previous works on compressed sensing (Candes and Tao, 2004) and phase recovery (Balan et al., 2006), one may address this question by studying random pooling operators. In this case, the lower Lipschitz bounds derived in the previous sections can be shown to be positive with probability 1, given appropriate parameters K and L. The following proposition, proved in Appendix B, analyzes the invertibility of a generic pooling operator constructed from random measurements. We consider a frame F whose M columns are i.i.d. Gaussian vectors of R^N.

Proposition 2.8. Let F = (f_1, ..., f_M) be a random frame of R^N, organized into K disjoint pools of dimension L. With probability 1, P_p is injective (modulo x ↦ −x) if K ≥ 4N for p = 1, and if K ≥ 2N − 1 for p = 2.

The size of the pools L does not influence the injectivity of random pooling, but it affects the stability of the inverse, as shown in Proposition 2.6. The half-rectified case requires extra care, since the set of admissible switches Ω̃ might contain sub-frames with fewer than N columns with non-zero probability, and is not considered in the present work.
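The injectivity condition of Proposition 2.1 (and the flavor of Proposition 2.8) can be checked by brute force for very small frames. The following sketch, purely illustrative and feasible only for tiny M, enumerates all subsets Ω and tests whether F_Ω or F_{Ω^c} spans R^N:

```python
import numpy as np
from itertools import combinations

def rank(A):
    return 0 if A.shape[0] == 0 else np.linalg.matrix_rank(A)

def has_complement_property(F):
    """True iff, for every subset Omega of the rows of F, F_Omega or F_{Omega^c} has full column rank N."""
    M, N = F.shape
    rows = range(M)
    for r in range(M + 1):
        for omega in combinations(rows, r):
            omega_c = [i for i in rows if i not in omega]
            if rank(F[list(omega)]) < N and rank(F[omega_c]) < N:
                return False
    return True

rng = np.random.default_rng(2)
N = 3
print(has_complement_property(rng.standard_normal((2 * N - 1, N))))  # M = 2N - 1: holds almost surely
print(has_complement_property(rng.standard_normal((N, N))))         # M = N: fails (split a basis in two)
```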
3. Numerical Experiments

Our main goal in this section is to experimentally compare the invertibility of ℓp pooling for p ∈ {1, 2, ∞}, with and without rectification. Unlike in the previous sections, we will not consider the Lipschitz bounds, as we do not know a good way to measure them experimentally. Our experiments suggest that recovery is roughly equally difficult for p = 1, 2, ∞, and that rectification makes recovery easier. In the ℓ2 case without rectification, and with d = 2, a growing body of work (Candes et al., 2013; Waldspurger et al., 2012) describes how to invert the pooling operator; this is often called phase recovery. A problem for us is the lack of a standard algorithm when p ≠ 2 or with rectification. We will see that the simple alternating minimization algorithm of (Gerchberg and Saxton, 1972) can be adapted to these situations. However, alternating minimization with random initialization is known to be an inferior recovery algorithm for p = 2, and so any conclusions we draw about ease of recovery would be tainted, as we would be testing whether the algorithm is equally bad in the various situations rather than whether the problems are equally hard. We will show that in certain cases a training set allows us to find a good initialization for the alternating minimization, leading to excellent recovery performance, and that in this setting, or the random setting, recovery via alternating minimization is roughly as successful for each of the p, suggesting invertibility is equally hard for each p. In the same way, we will see evidence that half-rectification before pooling makes recovery easier.

3.1. Recovery Algorithms

3.1.1. ALTERNATING MINIMIZATION

A greedy method for recovering the phase from the modulus of complex measurements is given in (Gerchberg and Saxton, 1972); this method naturally extends to the case at hand. As above, denote the frame {f_1, ..., f_M} = F, let F_k be the frame vectors in the k-th block, and let I_k be the indices of the k-th block. Let F^{(−1)} be the pseudoinverse of F, and set (P_p(x))_k = ‖F_k x‖_p. Starting with an initial signal x_0, update:

1. y^{(n)}_{I_k} = (P_p(x))_k F_k x^{(n)} / ‖F_k x^{(n)}‖_p ,  k = 1...K ,
2. x^{(n+1)} = F^{(−1)} y^{(n)} .

This approach is not, as far as we know, guaranteed to converge to the correct solution, even when P_p is invertible. However, in practice, if the inversion is easy enough, or if x_0 is close to the true solution, the method can work well. Moreover, this algorithm can be run essentially unchanged for each ℓp; for half-rectification, we only use the nonnegative entries in y for reconstruction.

In the experiments below, we will use random Gaussian i.i.d. F, but also the outputs of dictionary learning with block sparsity. The F generated this way is not really a frame, as the condition number of a trained dictionary on real data is often very high. In this case, we will renormalize each data point to have norm 1, and modify the update x^{(n+1)} = F^{(−1)} y^{(n)} to

2'. x^{(n+1)} = arg min_{‖x‖_2 = 1} ‖F x − y^{(n)}‖_2 .

In practice, this modification might not always be possible, since the norm ‖x‖ is not explicitly present in P_p(x). However, in the classical setting of Fourier measurements and positive x, this information is available. Moreover, our empirical experience has been that using this regularization on well conditioned analysis dictionaries offers no benefit; in particular, it gives no benefit with random analysis matrices.

3.1.2. PHASELIFT AND PHASECUT

Two recent algorithms, (Candes et al., 2013) and (Waldspurger et al., 2012), are guaranteed with high probability to solve the (classical) problem of recovering the phase of a complex signal from its modulus, given enough random measurements. In practice, both perform better than the greedy alternating minimization. However, it is not obvious to us how to adapt these algorithms to the ℓp setting.

3.1.3. NEAREST NEIGHBORS REGRESSION

We would like to use the same basic algorithm for all settings to get an idea of the relative difficulty of the recovery problem for different p, but we would also like an algorithm with good recovery performance: if our algorithm simply returns poor results in each case, differences between the cases might be masked. The alternating minimization can be very effective when well initialized. When given a training set of the data to recover, we use a simple regression to find x_0. Fix a number of neighbors q (in the experiments below we use q = 10), and suppose X is the training set. Set G = P_p(X); for a new point x to be recovered from P_p(x), find the q nearest neighbors in G of P_p(x), and take their principal component to serve as x_0 in the alternating minimization algorithm. We use the fast neighbor searcher from (Vedaldi and Fulkerson, 2008) to accelerate the search.
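The alternating scheme above can be summarized in a short sketch (illustrative NumPy, under the same consecutive-pool convention as before; the fixed iteration count and the zero-threshold handling of the rectified case are our simplifications, and any initialization, such as the regressed x_0 of Section 3.1.3, can simply be passed in):

```python
import numpy as np

def lp_norm(v, p):
    return np.max(np.abs(v), axis=-1) if np.isinf(p) else np.sum(np.abs(v) ** p, axis=-1) ** (1.0 / p)

def alternating_minimization(F, y, L, p=2, x0=None, rectified=False, n_iter=200):
    """Recover x from y = P_p(x) (or from R_p(x) with alpha = 0 if rectified=True).
    F: (M, N) frame, rows grouped into K = M // L consecutive pools; y: (K,) pooled moduli."""
    M, N = F.shape
    F_pinv = np.linalg.pinv(F)
    rng = np.random.default_rng(0)
    x = rng.standard_normal(N) if x0 is None else x0.copy()
    for _ in range(n_iter):
        z = (F @ x).reshape(-1, L)                   # current measurements, per pool
        if rectified:
            z = np.maximum(z, 0.0)                   # keep only the nonnegative entries
        norms = lp_norm(z, p)
        norms[norms == 0] = 1.0                      # avoid division by zero on dead pools
        # Step 1: impose the known moduli y, keep the current lp "phase" of each pool.
        z = z * (y / norms)[:, None]
        # Step 2: project back with the pseudoinverse of F.
        x = F_pinv @ z.reshape(-1)
    return x

# Usage on synthetic data (x_true is used only to generate y):
rng = np.random.default_rng(3)
N, K, L, p = 20, 60, 2, 2
F = rng.standard_normal((K * L, N))
x_true = rng.standard_normal(N)
y = lp_norm((F @ x_true).reshape(-1, L), p)
r = alternating_minimization(F, y, L, p=p)
angle = (r @ x_true) ** 2 / ((r @ r) * (x_true @ x_true))  # squared cosine, the measure used in Section 3.2
print(angle)
```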
3.2. Experiments

We discuss results on the MNIST dataset and on 16×16 patches drawn from the VOC dataset (available at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/). For each of these data sets, we run experiments with random dictionaries and adapted dictionaries. We also run experiments where the data and the dictionary are both Gaussian i.i.d.; in this case, we do not use adapted dictionaries.

The basic setup of the experiments in each case is the same: we vary the number of measurements (that is, the number of pools) over some range, and attempt to recover the original signal from the ℓp pooled measurements, using various methods. We record the average angle between the recovered signal r and the original x; that is, we use |r^T x|² / (‖r‖² ‖x‖²) as the measure of success in recovery.

In each case, the random analysis dictionary F is built by fixing a size parameter m and generating a Gaussian i.i.d. matrix F_0 of size 2m × n, where n = 100 for MNIST and n = 256 for VOC. Each pair of rows of F_0 is then orthogonalized to obtain F; that is, we use groups of size 2, where the pair of elements in each group are orthogonal. This allows us to use standard phase recovery software in the ℓ2 case to get a baseline. We used the ADMM version of PhaseLift from (Ohlsson et al., 2012) and the PhaseCut algorithm of (Waldspurger et al., 2012). For all of our data sets, the latter gave better results (note that PhaseCut can explicitly use the fact that the solution to the problem is real, whereas that version of PhaseLift cannot), so we report only the PhaseCut results.

In the experiments with adapted dictionaries, the dictionary is built using block OMP and batch updates with a K-SVD type update (Aharon et al., 2006); in this case, F is the transpose of the learned dictionary. We again use groups of size 2 in the adapted dictionary experiments.

We run two sets of experiments with Gaussian i.i.d. data and dictionaries, with n = 20 and n = 40. We consider m in the range from n/2 to 8n. On this data set, PhaseLift outperforms alternating minimization; see the supplementary material.

For MNIST, we use the standard training set projected to R^100 via PCA, and we let the number of dictionary elements range from 60 to 600 (that is, 30 to 300 measurements). On this data set, alternating minimization with nearest neighbor initialization gives exact reconstruction by 130 measurements; for comparison, PhaseLift at m = 130 has a mean squared angle of .48; see the supplementary material.

We draw approximately 5 million 16×16 grayscale image patches from the PASCAL VOC data set; these are sorted by variance, and the 1 million patches with largest variance are kept. The mean is removed from each patch. These are split into a training set of 900000 patches and a test set of 100000 patches. In this experiment, we let m range from 30 to 830. On this data set, by m = 330 measurements, alternating minimization with nearest neighbor initialization recovers a mean angle of .97; for comparison, PhaseLift at m = 330 has a mean angle of .39; see the supplementary material.
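The random grouped dictionaries and the success measure used in these experiments are straightforward to reproduce in outline. The sketch below is illustrative only: the per-pair Gram-Schmidt step follows the description above, the sizes match the MNIST setting, and everything else (names, seed) is our choice.

```python
import numpy as np

def random_paired_frame(m, n, rng):
    """2m x n Gaussian i.i.d. analysis matrix whose rows are orthogonalized within each
    consecutive pair, giving m pools of size 2 with mutually orthogonal elements."""
    F = rng.standard_normal((2 * m, n))
    for k in range(m):
        a = F[2 * k]
        F[2 * k + 1] -= (F[2 * k + 1] @ a) / (a @ a) * a   # Gram-Schmidt within the pair
    return F

def recovery_angle(r, x):
    """|r^T x|^2 / (||r||^2 ||x||^2), the success measure described above."""
    return float((r @ x) ** 2 / ((r @ r) * (x @ x)))

rng = np.random.default_rng(4)
F = random_paired_frame(m=130, n=100, rng=rng)   # e.g. n = 100 as for PCA-projected MNIST
```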
3.3. Analysis

The experiments from Figure 1 show that:

- For every data set, with random initializations and dictionaries, recovery is easier with half-rectification before pooling than without (green vs. dark blue in the figures).
- ℓ∞, ℓ1, and ℓ2 pooling appear roughly equally difficult to invert, regardless of algorithm (each column of figures, corresponding to one ℓp, is essentially the same).
- Good initialization improves performance; indeed, alternating minimization with nearest neighbor regression outperforms PhaseLift and PhaseCut (which of course do not have the luxury of samples from the prior, as the regressor does). We believe this is of independent interest.
- Adapted analysis frames (with regularization) are easier to invert than random analysis frames, with or without regularization (the bottom row of each pair of graphs vs. the top row of each pair in Figure 1).

Each of these conclusions is unfortunately only true up to the optimization method: it may be that a different optimizer would lead to different results. With learned initializations and alternating minimization, recovery results can be better without half-rectification. Note that this is only true up until the point where the alternating minimization gets better than the learned initialization without any refinement, and it is especially true for random dictionaries. The simple interpretation is that the reconstruction step 2 of the alternating minimization just does not have a large enough span with roughly half the entries removed; that is, this is an effect of the optimization, not of a difference in the difficulty of the problems.

4. Conclusion

We have studied conditions under which neural network layers of the form (1) and (2) preserve signal information. As one could expect, recovery from pooling measurements is only guaranteed under large enough redundancy and suitable configurations of the subspaces, which depend upon which ℓp is considered. We have proved conditions which bound the lower Lipschitz constants of these layers, giving quantitative descriptions of how much information they preserve. Furthermore, we have given conditions under which modules with random filters are invertible. We have also given experimental evidence that for both random and adapted modules, it is roughly as easy to invert ℓp pooling with p = 1, 2, and ∞; and we have shown that, when given training data, alternating minimization with a regressed initialization gives state of the art phase recovery.

However, there are still many open questions relating to these networks, or even to the invertibility of their layers. This work gives little direct help to a practitioner asking the question "how should I design my network?". In particular, our results barely touch on the distribution of the data; but the experiments make it clear (see also (Ohlsson et al., 2012)) that knowing more about the data changes the invertibility of the mappings. Moreover, preservation of information needs to be balanced against invariance, and the tension between these is not discussed in this work. Even in the setting of this work, without consideration of the data distribution or of the tension with invariance, Proposition 2.4, although tight, is not easy to use, and even though we are able to use Proposition 2.6 to get an invertibility result, it is probably not tight.

This work also shows there is much research to do in the field of algorithmic phase recovery. What are the correct algorithms for ℓp inversion, perhaps with half-rectification? How can we best use knowledge of the data distribution for phase recovery, even in the well studied ℓ2 case? Is it possible to guarantee that a well initialized alternating minimization converges to the correct solution?
[Figure 1 panels: (a) Random vectors, random filters; (b) MNIST, random filters; (c) MNIST, adapted filters; (d) Image patches, random filters; (e) Image patches, adapted filters. Each panel plots the number of samples against reconstruction accuracy, mean |r^T x|² / (‖r‖² ‖x‖²).]

Figure 1. Average recovery angle using alternating projections.
The vertical axis measures the average value of |r^T x|² / (‖r‖² ‖x‖²), where r is the recovered vector, over 50 random test points. The horizontal axis is the number of pooling measurements. The leftmost figure is ℓ1 pooling, the middle ℓ2, and the right ℓ∞, all with pools of size 2. Random filters are Gaussian i.i.d.; adapted filters are generated by block OMP/K-SVD with 5 nonzero blocks of size 2. First row: each x is Gaussian i.i.d. in R^40. Dark blue curve is alternating minimization, and green curve is alternating minimization with half rectification, both with random initialization. Remaining rows, magenta and yellow: nearest neighbor regressor described in Section 3.1.3, without and with rectification; red and aqua: alternating minimization initialized via neighbor regression, without and with rectification. See Section 3.3 for a discussion.

References

Kannan Achan, Sam T. Roweis, and Brendan J. Frey. Probabilistic inference of speech signals from phaseless spectrograms. In Neural Information Processing Systems 16, pages 1393-1400. MIT Press, 2003.

M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, November 2006.

Radu Balan and Yang Wang. Invertibility and robustness of phaseless reconstruction, 2013.

Radu Balan, Pete Casazza, and Dan Edidin. On signal reconstruction without phase. Applied and Computational Harmonic Analysis, 20(3):345-356, May 2006.

J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on PAMI, 2012.

Jameson Cahill, Peter G. Casazza, Jesse Peterson, and Lindsey Woodland. Phase retrieval by projections, 2013.

E. Candes and T. Tao. Near optimal signal recovery from random projections: Universal encoding strategies? arXiv Mathematics e-prints, October 2004.

Emmanuel J. Candes, Thomas Strohmer, and Vladislav Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241-1274, 2013.

R. W. Gerchberg and W. Owen Saxton. A practical algorithm for the determination of the phase from image and diffraction plane pictures. Optik, 35:237-246, 1972.

I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv e-prints, February 2013.

A. Hyvärinen and P. Hoyer. A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18):2413-2423, August 2001.

Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus, and Yann LeCun. Learning invariant features through topographic filter maps. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'09). IEEE, 2009.

S. Mallat. Group invariant scattering. Communications in Pure and Applied Mathematics, 2012.

Henrik Ohlsson, Allen Y. Yang, Roy Dong, and S. Shankar Sastry. CPRL: An extension of compressive sensing to the phase retrieval problem. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 1376-1384, 2012.

H. Ohlsson and Y. C. Eldar. On conditions for uniqueness in sparse phase retrieval, 2013.

A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.

Irène Waldspurger, Alexandre d'Aspremont, and Stéphane Mallat.
Phase recovery, MaxCut and complex semidefinite programming, 2012.