Published as a conference paper at ICLR 2016

VISUAL REPRESENTATIONS: DEFINING PROPERTIES AND DEEP APPROXIMATIONS

Stefano Soatto
Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
soatto@ucla.edu

Alessandro Chiuso
Dipartimento di Ingegneria dell'Informazione, Università di Padova, Via Gradenigo 6/b, 35131 Padova, Italy
alessandro.chiuso@unipd.it

Visual representations are defined in terms of minimal sufficient statistics of visual data, for a class of tasks, that are also invariant to nuisance variability. Minimal sufficiency guarantees that we can store a representation in lieu of raw data with smallest complexity and no performance loss on the task at hand. Invariance guarantees that the statistic is constant with respect to uninformative transformations of the data. We derive analytical expressions for such representations and show they are related to feature descriptors commonly used in computer vision, as well as to convolutional neural networks. This link highlights the assumptions and approximations tacitly assumed by these methods and explains empirical practices such as clamping, pooling and joint normalization.

1 INTRODUCTION

A visual representation is a function of visual data (images) that is useful to accomplish visual tasks. Visual tasks are decision or control actions concerning the surrounding environment, or scene, and its properties. Such properties can be geometric (shape, pose), photometric (reflectance), dynamic (motion) and semantic (identities, relations of objects within). In addition to such properties, the data also depends on a variety of nuisance variables that are irrelevant to the task. Depending on the task, they may include unknown characteristics of the sensor (intrinsic calibration), its interplay with the scene (viewpoint, partial occlusion), and properties of the scene that are not directly of interest (e.g., illumination).

We are interested in modeling and analyzing visual representations: How can we measure how useful one is? What guidelines or principles should inform its design? Is there such a thing as an optimal representation? If so, can it be computed? Approximated? Learned?

We abstract (classes of) visual tasks to questions about the scene. They could be countable semantic queries (e.g., concerning objects in the scene and their relations) or continuous control actions (e.g., in which direction to move next). In Sect. 2.1 we formalize these questions using well-known concepts, and in Sect. 2.3 we derive an equivalent characterization that will be the starting point for designing, analyzing and learning representations.

1.1 RELATED WORK AND CONTRIBUTIONS

Much of Computer Vision is about computing functions of images that are useful for visual tasks. When restricted to subsets of images, local descriptors typically involve statistics of image gradients, computed at various scales and locations, pooled over spatial regions, variously normalized and quantized. This process is repeated hierarchically in a convolutional neural network (CNN), with weights inferred from data, leading to representation learning Ranzato et al. (2007); LeCun (2012); Simonyan et al. (2014); Serre et al. (2007); Bouvrie et al. (2009); Susskind et al. (2011); Bengio (2009).
There are more methods than we can review here, and empirical comparisons (e.g., Mikolajczyk et al. (2004)) have recently expanded to include CNNs. Unfortunately, many implementation details and parameters make it hard to draw general conclusions Chatfield et al. (2011). We take a different approach, and derive a formal expression for optimal representations from established principles of sufficiency, minimality, invariance. We show how existing descriptors are related to such representations, highlighting the (often tacit) underlying assumptions. Our work relates most closely to Bruna & Mallat (2011); Anselmi et al. (2015) in its aim to construct and analyze representations for classification tasks. However, we wish to represent the scene, rather than the image, so occlusion and locality play a key role, as we discuss in Sect. 5. Also Morel & Yu (2011) characterize the invariants to certain nuisance transformations: Our models are more general, involving both the range and the domain of the data, although more restrictive than Tishby et al. (2000) and specific to visual data. We present an alternate interpretation of pooling, in the context of classical sampling theory, that differs from other analyses Gong et al. (2014); Boureau et al. (2010). Our contributions are to (i) define minimal sufficient invariant representations and characterize them explicitly (Claim 1); (ii) show that local descriptors currently in use in Computer Vision can approximate such representations under very restrictive conditions (Claim 2 and Sec. 3.3); (iii) compute in closed form the minimal sufficient contrast invariant (14) and show how local descriptors relate to it (Rem. 2); show that such local descriptors can be implemented via linear convolutions and rectified linear units (Sect. 4.3); (iv) explain the practice of joint normalization (Rem. 5) and clamping (Sect. 3.3.1) as procedures to approximate the sufficient invariant; these practices are seldom explained and yet they have a significant impact on performance in empirical tests Kirchner (2015); (v) explain spatial pooling in terms of anti-aliasing, or local marginalization with respect to a small-dimensional nuisance group, in convolutional architectures (Sec. 4.1-4.2). In the Appendix we show that an ideal representation, if generatively trained, maximizes the information content of the representation (App. A). 2 CHARACTERIZATION AND PROPERTIES OF REPRESENTATIONS Because of uncertainty in the mechanisms that generate images, we treat them as realizations of random vectors x (past/training set) and y (future/test set), of high but finite dimension. The scene they portray is infinitely more complex.1 Even if we restrict it to be a static model or parameter θ, it is in general infinite-dimensional. Nevertheless, we can ask questions about it. The number of questions ( classes ) K is large but finite, corresponding to a partition of the space of scenes, represented by samples {θ1, . . . , θK}. A simple model of image formation including scaling and occlusion Dong & Soatto (2014) is known as the Lambert-Ambient, or LA, model. 2.1 DESIDERATA Among (measurable) functions of past data (statistics), we seek those useful for a class of tasks. Abstracting the task to questions about the scene, useful can be measured by uncertainty reduction on the answers, captured by the mutual information between the data x and the object of interest θ. While a representation can be no more informative than the data,2 ideally it should be no less, i.e., a sufficient statistic. 
It should also be simpler than the data itself, ideally minimal. It should also discount the effects of nuisance variables g G, and ideally be invariant to them. We use a superscript to denote a collection of t data points (the history up to t, if ordered), xt .= {x1, . . . , xt}. Thus, a representation is any function φ constructed using past data xt that is useful to answer questions about the scene θ given future data y it generates, regardless of nuisance factors g. 1Scenes are made of surfaces supporting reflectance functions that interact with illumination, etc. No matter how many images we already have, even a single static scene can generate infinitely many different ones. 2Data Processing Inequality, page 88 of Shao (1998). Published as a conference paper at ICLR 2016 An optimal representation is a minimal sufficient statistic for a task that is invariant to nuisance factors. In Sec. 2.3 we introduce the SA Likelihood, an optimal representation, and in subsequent sections show how it can be approximated. 2.2 BACKGROUND The data X is a random variable with samples x, y; the model θ is unknown in the experiment E = {x, θ, pθ(x)} where pθ(x) is the probability density function of X, that depends on the parameter θ, evaluated at the sample x; a statistic T is a function of the sample; it is sufficient (of x for θ) if X | T = τ does not depend on θ;3 it is minimal if it is a function of all other sufficient statistics4. If θ is treated as a random variable and a prior is available, φ is Bayesian sufficient5 if p(θ|φ(xt)) = p(θ|xt). If T is minimal, any smaller6 U entails information loss. If θ was a discrete random variable, the information content of T could be measured by uncertainty reduction: H(θ) H(θ|T(X)), which is the mutual information7 between θ and T and H denotes entropy Cover & Thomas (1991); furthermore, T(X) arg infφ H(θ|φ(X)), where the infimum is with respect to measurable functions and is in general not unique. Consider a set G of transformations g of the data x, g(x), which we denote simply as gx. A function φG(x) is G-invariant if φG(gx) = φG(x) for all g G. The sensitivity of φ to G is S = φ(gx) g where φ is assumed to be differentiable. By definition, an invariant has zero sensitivity to G. The Likelihood function is L(θ; x) .= pθ(x), understood as a function of θ for a fixed sample x, sometimes written as p(x|θ) even though θ is not a random variable. Theorem 3.2 of Pawitan (2001) can be extended to an infinite-dimensional parameter θ (Theorem 6.1 of Bahadur (1954)): Theorem 1 (The likelihood function as a statistic). The likelihood function L(θ; x) is a minimal sufficient statistic of x for θ. 2.3 NUISANCE MANAGEMENT: PROFILE, MARGINAL, AND SAL LIKELIHOODS A nuisance g G is an unknown factor (a random variable, parameter, or transformation) that is not of interest and yet it affects the data. Given pθ( ), when g is treated as a parameter that transforms the data via g(y) .= gy, then pθ,g(y) .= pθ(gy); when it is treated as a random variable, pθ(y|g) .= pθ(gy). The profile likelihood pθ,G(y) .= sup g G pθ,g(y) (1) where the nuisance has been maxed-out is G-invariant. The marginal likelihood pθ(y|G) .= Z G pθ(y|g)d P(g) (2) is invariant only if d P(g) = dµ(g) is the constant8 measure on G. Both are sufficient invariants, in the sense that they are invariant to G and are minimal sufficient. This counters the common belief that invariance trades off selectivity. In Rem. 1 we argue that both can be achieved, at the price of complexity. 
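As a concrete illustration of (1) and (2), the following is a small self-contained sketch (our own toy example, not from the paper; all function names are ours) that evaluates the profile and marginal likelihoods for a one-dimensional Gaussian template under the cyclic translation group, using the uniform (Haar) measure on the finite group, and checks numerically that both statistics are constant along the orbit of the data.

```python
import numpy as np

def shift(y, g):
    """Action of the cyclic translation group: (g y)(u) = y(u + g)."""
    return np.roll(y, -g)

def log_lik(y, theta, eps=0.1):
    """Gaussian log-likelihood log p_theta(y) with IID noise of std eps."""
    return -0.5 * np.sum((y - theta) ** 2) / eps**2

def profile_loglik(y, theta, G):
    """Eq. (1): sup over the nuisance group of log p_theta(g y)."""
    return max(log_lik(shift(y, g), theta) for g in G)

def marginal_loglik(y, theta, G):
    """Eq. (2): log of the average of p_theta(g y) under the uniform (Haar) measure on G."""
    vals = np.array([log_lik(shift(y, g), theta) for g in G])
    return np.log(np.mean(np.exp(vals - vals.max()))) + vals.max()

rng = np.random.default_rng(0)
n = 32
theta = np.exp(-0.5 * ((np.arange(n) - n / 2) / 3.0) ** 2)   # the "scene" (template)
y = shift(theta, 5) + 0.05 * rng.standard_normal(n)          # datum under an unknown shift
G = range(n)                                                  # the (compact) nuisance group

for g in [0, 3, 11]:   # both statistics are (numerically) constant along the orbit {g y}
    print(profile_loglik(shift(y, g), theta, G), marginal_loglik(shift(y, g), theta, G))
```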
Computing the profile likelihood in practice requires reducing G to a countable set {g1, . . . , g N} of samples,9 usually at a loss. The tradeoff is a subject of sampling theory, where samples can 3Definition 3.1 of Pawitan (2001) or Sec. 6.7 of De Groot (1989), page 356 4If U is sufficient, then the sigma algebra σ(T) σ(U), De Groot (1989), page 368. 5The two are equivalent for discrete random variables, but pathological cases can arise in infinite dimensions Blackwell & Ramamoorthi (1982). 6In the sense of inclusion of sigma algebras, σ(U) σ(T). 7See Cover & Thomas (1991) eq. 2.28, page 18. 8Base, or Haar measure if G is a group. It can be improper if G is not compact. 9Note that N can be infinite if G is not compact. Published as a conference paper at ICLR 2016 be generated regularly, independent of the signal being sampled, or adaptively.10 In either case, the occurrence of spurious extrema ( aliases ) can be mitigated by retaining not the value of the function at the samples, pθ,gi(y), but an anti-aliased version consisting of a weighted average around the samples: ˆpθ,gi(y) .= Z pθ,gi(gy)w(g)dµ(g) (3) for suitable weights w.11 When the prior d P(g) = w(g)dµ(g) is positive and normalized, the previous equation (anti-aliasing) can be interpreted as local marginalization, and is often referred to as mean-pooling. The approximation of the profile likelihood obtained by sampling and antialiasing, is called the SA (sampled anti-aliased) likelihood, or SAL: ˆpθ,G(y) = max i ˆpθ,gi(y) = max i Z pθ,gi(gy)d P(g). (4) The maximization over the samples in the above equation is often referred to as max-pooling. Claim 1 (The SAL is an optimal representation). Let the joint likelihood pθ,g be smooth with respect to the base measure on G. For any approximation error ϵ, there exists an integer N = N(ϵ) number of samples {gi}N i=1 and a suitable (regular or adaptive) sampling mechanism so that the SAL maxi ˆpθ,gi approximates to within ϵ the profile likelihood supg G pθ,g, after normalization, in the sense of distributions. For the case of (conditionally) Gaussian models under the action of the translation group, the claim follows from classical sampling arguments. More generally, an optimal representation is difficult to compute. In the next section, we show a first example when this can be done. Remark 1 (Invariance, sensitivity and selectivity ). It is commonly believed that invariance comes at the cost of discriminative power. This is partly due to the use of the term invariance (or approximate invariance or stability ) to denote sensitivity, and of the term selectivity to denote maximal invariance. A function is insensitive to g if small changes in g produce small changes in its value. It is invariant to g if it is constant as a function of g. It is a maximal invariant if equal value implies equivalence up to G (Sect. 4.2 of Shao (1998), page 213). It is common to decrease sensitivity with respect to a transformation g by averaging the function with respect to g, a lossy operation in general. For instance, if the function is the image itself, it can be made insensitive to rotation about its center by averaging rotated versions of it. The result is an image consisting of concentric circles, with much of the informative content of the original image gone, in the sense that it is not possible to reconstruct the original image. 
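As a concrete illustration of (3)-(4), the sketch below extends the toy translation example above, assuming regular sampling of the group and a Gaussian anti-aliasing window w(g); the function `sal` and all parameter choices are ours. The weighted average around each sample implements the mean-pooling of (3), and the maximization over samples implements the max-pooling of (4).

```python
import numpy as np

def shift(y, g):
    """Cyclic translation acting on the signal."""
    return np.roll(y, -g)

def lik(y, theta, eps=0.1):
    """Unnormalized Gaussian likelihood p_theta(y)."""
    return np.exp(-0.5 * np.sum((y - theta) ** 2) / eps**2)

def sal(y, theta, samples, offsets, sigma=1.0):
    """SA likelihood (4): for each nuisance sample g_i, anti-alias (mean-pool) the
    likelihood over nearby group elements g with Gaussian weights w(g) as in (3),
    then max-pool over the samples."""
    w = np.exp(-0.5 * (np.asarray(offsets) / sigma) ** 2)
    w /= w.sum()                                   # positive, normalized prior dP(g)
    pooled = [np.dot(w, [lik(shift(y, gi + g), theta) for g in offsets])
              for gi in samples]
    return max(pooled)

n = 32
theta = np.exp(-0.5 * ((np.arange(n) - n / 2) / 3.0) ** 2)
y = shift(theta, 7)
coarse = np.arange(0, n, 4)                # regular sampling of the translation group
# Anti-aliased approximation of the profile likelihood sup_g p_theta(g y):
print(sal(y, theta, coarse, np.arange(-2, 3)))
```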
Nevertheless, while invertibility is relevant for reconstruction tasks, it is not necessarily relevant for classification, so it is possible for the averaging operation to yield a sufficient invariant, albeit not a maximal one. Thus, one can have invariance while retaining sufficiency, albeit generally not maximality: The profile likelihood, or the marginal likelihood with respect to the uniform measure, are sufficient statistics and are (strictly) invariant. The price to have both is complexity, as both are infinitedimensional in general. However, they can be approximated, which is done by sampling in the SA likelihood. 2.4 A FIRST EXAMPLE: LOCAL REPRESENTATIONS/DESCRIPTORS Let the task be to decide whether a (future) image y is of a scene θ given a single training image x of it, which we can then assume to be the scene itself: x = θ. Nuisances affecting y are limited 10{gi}N i=1 are generated by a (deterministic or stochastic) mechanism ψ that depends on the data and respects the structure of G. If G is a group, this is known as a co-variant detector: It is a function ψ that (is Morse in g, i.e., it) has isolated extrema {gi(y)}N i=1 = {g | Gψ(y, g) = 0} that equivary: Gψ( gy, gi) = 0 Gψ(y, ggi) = 0 for all i and g G. The samples {gi}N i=1 define a reference frame in which an invariant can be easily constructed in a process known as canonization Soatto (2009): φ(y) .= {g 1 i (y)y | Gψ(y, gi) = 0}. 11For regular sampling of stationary signals on the real line, optimal weights for reconstruction can be derived explicitly in terms of spectral characteristics of the signal, as done in classical (Shannon) sampling theory. More in general, the computation of optimal weights for function-valued signals defined on a general group G for tasks other than reconstruction, is an open problem. Recent results Chen & Edelsbrunner (2011) show that diffusion on the location-scale group typically reduce the incidence of topological features such as aliases in the filtered signal. Thus, low-pass filtering such as (generalized) Gaussian smoothing as done here, can have anti-aliasing effects. Published as a conference paper at ICLR 2016 to translation parallel to the image plane, scaling, and changes in pixel values that do not alter relative order.12 Under these (admittedly restrictive) conditions, the SAL can be easily computed and corresponds to known descriptors. Note that, by definition, x and y must be generated by the same scene θ for them to correspond. SIFT Lowe (2004) performs canonization10 of local similarity transformations via adaptive sampling of the planar translation-scale group (extrema of the difference-of-Gaussians operator in space and scale), and planar rotation (extrema of the magnitude of the oriented gradient). Alternatively, locations, scales and rotation can be sampled regularly, as in dense SIFT. Regardless, on the domain determined by each sample, gi, SIFT computes a weighted histogram of gradient orientations. Spatial regularization anti-aliases translation; histogram regularization anti-aliases orientation; scale anti-aliasing, however, is not performed, an omission corrected in DSP-SIFT Dong & Soatto (2015). In formulas, if α(y) = y S1 is a direction, θ = xi is the image restricted to a region determined by the reference frame gi, centered at (ui, vi) R2 axis-aligned and with size si > 0, we have13 φxi(y) = Z κσ(ui u, vi v)κϵ( x( u, v), α(y)) x( u, v) Esi(σ)d ud vdσ (5) and φsift(α) = [φx11(y), . . . 
, φx44(y)] is a sampling on a 4 4 grid, with each sample assumed independent and α = y quantized into 8 levels. Variants of SIFT such as HOG differ in the number and location of the samples, the regions where the histograms are accumulated and normalized. Here κσ and κϵ are Parzen kernels (bilinear in SIFT; Gaussian in DSP-SIFT) with parameter σ, ϵ > 0 and Es is an exponential prior on scales. Additional properties of SIFT and its variants are discussed in Sect. 3.3, as a consequence of which the above approximates the SAL for translation-scale and contrast transformation groups. Claim 2 (DSP-SIFT). The continuous extension of DSP-SIFT Dong & Soatto (2015) (5) is an antialiased sample of the profile likelihood (4) for G = SE(2) R+ H the group of planar similarities transformations and local contrast transformations, when the underlying scene θ = xi has locally stationary and ergodic radiance, and the noise is assumed Gaussian IID with variance proportional to the norm of the gradient. The proof follows from a characterization of the maximal invariant to contrast transformations described in Sect. 3.3. Out-of-plane rotations induce a scene-shape-dependent deformation of the domain of the image that cannot be determined from a single training image, as we discuss in Sect. 3. When interpreting local descriptors as samples of the SAL, they are usually assumed independent, an assumption lifted in Sect. 4 in the context of convolutional architectures. 3 A MORE REALISTIC INSTANTIATION Relative motion between a non-planar scene and the viewer generally triggers occlusion phenomena. These call for the representation to be local. Intrinsic variability of a non-static scene must also be taken into account in the representation. In this section we describe the approximation of the SAL under more realistic assumptions than those implied by local, single-view descriptors such as SIFT. 3.1 OCCLUSION, CLUTTER AND RECEPTIVE FIELDS The data y has many components, y = {y1, . . . , y My }, only an unknown subset of which from the scene or object of interest θ (the visible set V D = {1, . . . , My}). The rest come from clutter, 12The planar translation-scale group can be taken as a very crude approximation of the transformation induced on the image plane by spatial translation. Contrast transformations (monotonic continuous transformations of the range of the image) can be interpreted as crude approximations of changes of illumination. 13Here gy(ui, vi) = y(ui + u, vi + v) for the translation group and gy(ui, vi) = y (σui + u, σvi + v)) for translation-scale. In general, gy x = y g 1x, unless gy x = 0. If px(y(u)) .= q(y(u) x(u)) for some q is a density function for the random variable y(u), in general q(gy(u) x(u)) = q(y(u) g 1x(u)), unless the process y is G-stationary independent and identically distributed (IID), in which case px(gy) = pg 1x(y). Note that the marginal density of the gradient of natural images is to a reasonable approximation invariant to the translation-scale group Huang & Mumford (1999). Published as a conference paper at ICLR 2016 occlusion and other phenomena unrelated to θ, although they may be informative as context. We indicate the restriction of y to V as y|V = {yj, j V }. Since the visible set is not known, profiling pθ,G(y) = maxi,V P(D) pθ,gi(y|V ) requires searching over the power set P(D). 
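Referring back to (5) and Claim 2, here is a minimal single-cell sketch of a DSP-SIFT-style histogram: a Gaussian spatial window anti-aliases translation, a Gaussian Parzen kernel on orientation anti-aliases the gradient direction, and averaging over a few domain sizes stands in for the exponential scale prior. This is our own simplification for illustration, not the reference implementation.

```python
import numpy as np

def grads(img):
    """Image gradient components along u (columns) and v (rows)."""
    gv, gu = np.gradient(img.astype(float))
    return gu, gv

def dsp_cell(img, center, sizes, n_bins=8, eps=0.5):
    """Single-cell descriptor in the spirit of (5): a weighted histogram of gradient
    orientations, spatially weighted by a Gaussian around `center` = (u0, v0),
    pooled (averaged) over several domain sizes `sizes` (the DSP part)."""
    gu, gv = grads(img)
    mag = np.hypot(gu, gv)
    ang = np.arctan2(gv, gu)
    v, u = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    bins = np.linspace(-np.pi, np.pi, n_bins, endpoint=False)
    hist = np.zeros(n_bins)
    for s in sizes:                                      # pooling over domain size
        spatial = np.exp(-0.5 * ((u - center[0])**2 + (v - center[1])**2) / s**2)
        for b, alpha in enumerate(bins):
            d = np.angle(np.exp(1j * (ang - alpha)))     # wrapped orientation difference
            kappa = np.exp(-0.5 * (d / eps)**2)          # Parzen kernel on orientation
            hist[b] += np.sum(spatial * kappa * mag)
    return hist / (hist.sum() + 1e-12)

rng = np.random.default_rng(0)
patch = rng.random((32, 32))
print(dsp_cell(patch, center=(16, 16), sizes=[4, 6, 8]))
```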
To make computation tractable, we can restrict the visible set V to be the union of receptive fields Vj, that can be obtained by transforming14 a base region B0 ( unit ball, for instance a square patch of pixels with an arbitrary base size, say 10 10) with group elements gj G, Vj .= gj B0: V = SM j=1 gj B0 where the number of receptive fields M My. Thus V is determined by the reference frames (group elements) {gj}M j=1 of receptive fields that are active, which are unknown a-priori. Alternatively, we can marginalize V by computing, for every class (represented by a hidden variable θk as discussed next) and every receptive field (determined by gj as above), conditional densities pθ(y|θk, gj) that can be averaged with respect to weights wjk, trained to select (if sparse along j) and combine (by averaging along k) local templates via X j,k pθ(y|θk, gj)wjk (6) where the weights or filters wjk, if positive and normalized,15 are interpreted as probabilities wjk = pθ(θk, gj). To make this marginalization tractable, we write the first term as pθ(y|Vj , y|V c j |θk, gj) = p(y|Vj |θk, gj)pθ(y|V c j ) p(y|Vj |θk, gj) (7) where we have assumed that the second factor is constant, thus ignoring photometric context beyond the receptive fields, e.g., due to mutual illumination. Under these assumptions, we have pθ,gi(y) = X j,k p(y|Vj |θk, gigj)pθ,gi(gjθk) (8) where the first term in the sum is known as a feature map and we have assumed that both gi, gj G for simplicity, with gij = gigj. The order of operations (deformation by gi and selection by gj) is arbitrary, so we assume that the selection by gj is applied first, and then the nuisance-induced deformation gi, so pθ,gi(gjy) pθ,e(gijy), where e is the identity of the group G. 3.2 INTRA-CLASS VARIABILITY AND DEFORMABLE TEMPLATES For category-level recognition, the parameter space can be divided into K classes, allowing variability of θ within each class. Endowing the parameter space with a distribution p(θ|k) requires defining a probability density in the space of shapes, reflectance functions etc. Alternatively one can capture the variability θ induces on the data. For any scene θk from class k, one can consider a single image generated by it xk pθk(x) as a template from which any other datum from the same class can be obtained by the (transitive) action of a group gk G. Thus if y pθk(y) with θk p(θ|k), which we indicate with y p(y|k), then we assume that there exists a gk such that y = g 1 k xk, so p(y|k) = Z p(y|xk, gk)d P(xk, gk|θk)d P(θk|k) = Z p(gky|θk)d P(gk|θk) (9) where we have used the fact that p(y|xk, gk, k) = δ(y g 1 k xk) and that only one sample of θk is given, so all variability is represented by gk and xk conditionally on θk.16 For this approach to work, gk has to be sufficiently complex to allow xk to reach every datum y generated by an element17 of 14The action of a group g on a set B D is defined as g B D such that g(y|B) = y|g B. 15Note that current convolutional architectures rectify and normalize the feature maps, not the filters. However, learned biases, as well as rectification and normalization at each layer, may partly compensate for it. 16In the last expression we assumed that xk and gk are conditionally independent given θk, i.e., that the image formation process (noise) and deformation are independent once the scene θk is given. 17The group has to act transitively on xk. For instance, in Grenander (1993) gk was chosen to belong to the entire (infinite-dimensional) group of domain diffeomorphisms. 
Published as a conference paper at ICLR 2016 the class. Fortunately, the density on a complex group can be reduced to a joint density on GM, the mutual configuration of the receptive fields, as we will show. The restriction of gk to the domain of the receptive field Vj = gj B0 is indicated by {gkj}M j=1, defined by gkjx = gkx x Vj. Then, we can consider the global group nuisance gi, the selector of receptive fields gj and the local restriction of the intra-class group gk, assumed (d(gk, e)) small, as all belonging to the same group G, for instance affine or similarity transformations of the domain and range space of the data. Starting from (8), neglecting gi for the moment, we have18 j,k p(y|Vj |θk, gj)pθ(θk, gj) (10) GM p(y|Vj |θk, gkj)d P({gkj}|θk)pθ(θk, gj) (11) and bringing back the global nuisance gi, pθ,G(y) = max i GM p(gikjy|Vj |θk)d PG({gkj}|θk)pθ(θk, gj) | {z } pθ(giy) where the measure in the last equation is made invariant to gi G. The feature maps p(gigkjy|Vj |θk) represent the photometric component of the likelihood. The geometric component is the relative configuration of receptive fields {gkj}, which is class-dependent but G-invariant. The inner integral corresponds to mean pooling and the maximization to max pooling. The sum over j, k marginalizes the local classes θk, or parts and selects them to compose the hypothesis θ. To summarize, gi are the samples of the nuisance group in (4); gj are the local reference frames that define each receptive field in (8); gk are the global deformations that define the variability induced by a class k on a template in (9). The latter are in general far more complex than the former, but their restriction to each receptive field, gkj, can be approximated by an affine or similarity transformation and hence composed with gi and gj. Note that (11) can be interpreted as a model of a three-layer neural network: The visible layer, where y lives, a hidden layer, where the feature maps p(y|Vj |θk, gkj) live, and an output layer that, after rectification and normalization, yields an approximation of the likelihood pθ(y). Invariance to G can be obtained via a fourth layer outputting pθ,G(y) by max-pooling third-layer outputs pθ(giy) for different gi in (12). 3.3 CONTRAST INVARIANCE Contrast is a monotonic continuous transformation of the (range space of the) data, which can be used to model changes due to illumination. It is well-known that the curvature of the level sets of the image is a maximal invariant Alvarez et al. (1993). Since it is everywhere orthogonal to the level sets, the gradient orientation is also a maximal contrast invariant. Here we compute a contrast invariant by marginalizing the norm of the gradient of the test image (thus retaining its orientation) in the likelihood function of a training image. Since the action of contrast transformations is spatially independent, in the absence of other nuisances we assume that the gradient of the test image y can be thought of as a noisy version of the gradient of the training image x, i.e., y N( x, ϵ2) (13) and compute the density of y given x marginalized with respect to contrast transformations H of y. Theorem 2 (Contrast-invariant sufficient statistic). The likelihood of a training image x at a given pixel, given a test image y, marginalized with respect to contrast transformations of the latter, is 18Here we condition on the restrictions gkj of gk on the receptive fields Vj so that, by definition, p(y|Vj |θk, gj, gk) = p(y|Vj |θk, gkj). 
px(y|H) := p(∇y | ∇x) = (1/√(2πϵ²)) exp( −(1/(2ϵ²)) sin²(∠∇y − ∠∇x) ‖∇x‖² ) M    (14)

where, if we call Ψ(a) := ∫_{−∞}^{a} (1/√(2π)) e^{−τ²/2} dτ for any a ∈ R, and m := cos(∠∇y − ∠∇x) ‖∇x‖, then

M = ϵ e^{−m²/(2ϵ²)}/√(2π) + m − m Ψ(−m/ϵ).    (15)

The expression in (14) is, therefore, a minimal sufficient statistic of y that is invariant to contrast transformations.

Figure 1: SIFT integrand (18) (red) vs. marginalized likelihood (14) (blue), computed for a random patch on α ∈ [−π, π] (left), and on a regular sub-sampling of 8 orientations (right). Several random tests are shown as mean and error bars corresponding to three standard deviations across trials.

Remark 2 (Relation to SIFT). Compared to (14), SIFT (i) neglects the normalization factor M/(√(2π)ϵ), (ii) replaces the kernel

κϵ(α) := exp( −sin²(α)/(2ϵ²) ) ≈ exp( −α²/(2ϵ²) )    (16)

with a bilinear one κ̄ϵ defined by

κ̄ϵ(α) := (α + ϵ)/ϵ²  for α ∈ [−ϵ, 0],   (ϵ − α)/ϵ²  for α ∈ [0, ϵ]    (17)

and, finally, (iii) multiplies the result by the norm of the gradient, obtaining the SIFT integrand

φsift(∇y | ∇x) = κ̄ϵ(∠∇y − ∠∇x) ‖∇x‖.    (18)

To see this, calling α = ∠∇y − ∠∇x and β = ‖∇x‖ > 0, notice that

κ̄ϵ(α) β = κ̄ϵβ(αβ) ≈ exp( −α²β²/(2ϵ²β²) ) ≈ κϵ(α)    (19)

where the left-hand side is (18) and the right-hand side is (14) or (23). We make no claim that this approximation is good; it just happens to be the choice made by SIFT, and the above just highlights the relation to the contrast invariant (14).

Remark 3 (Uninformative training images). It is possible that m/ϵ ≈ 0 holds uniformly (in α) provided γ/ϵ ≪ 1, i.e., if the modulus of the gradient ‖∇x‖ is very small compared to the standard deviation ϵ. Under such circumstances, the training image x is essentially constant (flat), and the conditional density p(α|∇x) becomes uniform:

p(α|∇x) ≈ (1/(√(2π)ϵ)) e^{−γ²(1 − ⟨ȳ, x̄⟩²)/(2ϵ²)} · ϵ/√(2π) ≈ 1/(2π)    (20)

where the approximation holds given that e^{−γ²(1 − ⟨ȳ, x̄⟩²)/(2ϵ²)} ≈ 1 when γ/ϵ ≪ 1. This is unlike SIFT (18), which becomes zero when the norm of the gradient goes to zero.

Note that, other than for the gradient, the computations in (14) can be performed point-wise, so for an image or patch with pixel values yi, if αi(y) := ∠∇yi, we can write

p(α|∇x) = ∏_i p(αi | ∇xi).    (21)

We often omit reference to contrast transformations H in px(y|H) when the argument α makes it clear we are referring to a contrast invariant. The width of the kernel ϵ is a design (regularization) parameter.

Remark 4 (Invariance for x). Note that (21) is invariant to contrast transformations of y, but not of x. This should not be surprising, since high-contrast training patches should yield tests with high confidence, unlike low-contrast patches. However, when the training set contains instances that are subject to contrast changes, such variability must be managed. To eliminate the dependency on ‖∇x‖, consider a model where the noise is proportional to the norm of the gradient:

∇y ∼ N(∇x, ϵ²(‖∇x‖))    (22)

where ϵ(‖∇x‖) = ϵ ‖∇x‖. Under this noise model, the sufficient contrast invariant (14) becomes

px(y|H) := p(∇y | ∇x) = (1/√(2πϵ²)) exp( −(1/(2ϵ²)) sin²(∠∇y − ∠∇x) ) M    (23)

and M has the same expression (15) but with m = cos(∠∇y − ∠∇x). Thus a simple approach to managing contrast variability of x, in addition to y, is to use the above expression in lieu of (14).

Remark 5 (Joint normalization). If we consider only affine contrast transformations ax + b, where a, b are assumed to be constant on a patch V which contains all the cells Ci where the descriptors are computed,19 it is clear that to recapture invariance w.r.t. the scale factor a it is necessary and sufficient that p(∇y | ∇x(vi)) = p(∇y | a∇x(vi)), ∀ vi ∈ V.
We shall now illustrate how this invariance can be achieved. Assume that the data-generating model (13) is replaced by the distribution-dependent model

∇y ∼ N(∇x, ϵ²(p∇x)),   ϵ²(p∇x) = σ² Ex‖∇x‖² = σ² ∫ ‖∇x‖² p∇x(∇x) d∇x    (24)

where the noise variance ϵ² depends linearly on the average squared gradient norm (w.r.t. the distribution p∇x(∇x)); σ² is a fixed constant. Defining the normalized gradient magnitude ρ̄ := ‖∇x‖/√(Ex‖∇x‖²), the resulting marginal distribution for the gradient orientation becomes

px(y|H) := p(∇y | ∇x) = (1/√(2πσ²)) exp( −(1/(2σ²)) sin²(∠∇y − ∠∇x) ρ̄² ) M̄    (25)

where, defining m̄ := cos(∠∇y − ∠∇x) ρ̄,

M̄ = σ e^{−m̄²/(2σ²)}/√(2π) + m̄ − m̄ Ψ(−m̄/σ).    (26)

Equation (25) is clearly invariant to affine transformations of the image values x(v) → a x(v) + b, v ∈ V.20 It is a trivial calculation to show that, using p(∇y | ρ̄) in lieu of p(∇y | ρ), the result is invariant w.r.t. affine transformations. To obtain a sampled version of this normalization, the expected squared gradient norm can be replaced with the sample average on the training patch,

ρ̂² := (1/Npix) Σ_{i=1}^{Npix} ‖∇x(vi)‖²,

19 SIFT divides each patch into a 4×4 grid of cells Ci, i = 1, ..., 16.
20 Note that an affine transformation on the image values x(v) → a x(v) + b, v ∈ V, induces a scale transformation on the distribution p∇x(∇x), so that p∇(ax+b)(ρ) = (1/a) p∇x(ρ/a), and therefore the average squared gradient is scaled by a², i.e., E_{ax+b}‖ρ‖² = a² Ex‖ρ‖².

so that (24) becomes

∇y ∼ N(∇x, ϵ²(V)),   ϵ²(V) = σ² ρ̂² = σ² (1/Npix) Σ_{i=1}^{Npix} ‖∇x(vi)‖²    (27)

where vi ∈ V, i = 1, ..., Npix are the pixel locations in the training patch V. This procedure is known as joint normalization, and is simply equivalent to normalizing the patch in pre-processing by dividing by the average gradient norm.

3.3.1 CLAMPING GRADIENT ORIENTATION HISTOGRAMS

Local descriptors such as SIFT commonly apply a clamping procedure to modify a (discretized, spatially-pooled, un-normalized) density φsift of the form (18) by clipping it at a certain threshold τ and re-normalizing:

φclamp(α|x) = min{φsift(α|x), τ} / ∫_{S¹} min{φsift(α|x), τ} dα    (28)

where α = ∠∇y ∈ S¹ is typically discretized into 8 bins and τ is chosen as a percentage of the maximum, for instance τ = 0.2 maxα φsift(α|x). Although clamping has a dramatic effect on performance, with high sensitivity to the choice of threshold, it is seldom explained, other than as a procedure to reduce the influence of large gradient magnitudes Lowe (2004). Here, we show empirically that (18) becomes closer to (14) after clamping, for certain choices of threshold τ and sufficiently coarse binning. Fig. 2 shows that, without clamping, (18) is considerably more peaked than (14) and has thinner tails. After clamping, however, the approximation improves, and for coarse binning and a threshold around 20% to 30% the two are very similar.

Figure 2: Clamping effects. For α ∈ [−π, π] (abscissa), the top row shows the marginalized likelihood p(α|∇x) (14) (blue), the SIFT integrand (18) (solid red), and its clamped version (28) (dashed red) for thresholds ranging from 50% to 10% of its maximum. The bottom row shows the same discretized to 8 orientation bins.
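In the spirit of Fig. 2, the sketch below evaluates the marginalized likelihood (14), the SIFT integrand (18) and its clamped version (28) on an 8-bin orientation grid, pooled additively over a synthetic patch of random gradients; the discretization, the pooling by summation, and all parameter values are our own choices for illustration.

```python
import numpy as np
from math import erf

def Psi(a):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(a / np.sqrt(2.0)))

def p_exact(alpha, beta, eps=0.25):
    """Marginalized likelihood (14)-(15); alpha = orientation difference, beta = ||grad x||."""
    m = np.cos(alpha) * beta
    M = eps * np.exp(-m**2 / (2 * eps**2)) / np.sqrt(2 * np.pi) + m * (1 - Psi(-m / eps))
    return np.exp(-np.sin(alpha)**2 * beta**2 / (2 * eps**2)) * M / (np.sqrt(2 * np.pi) * eps)

def sift_integrand(alpha, beta, eps=np.pi / 4):
    """SIFT integrand (18): bilinear (triangular) orientation kernel (17) times ||grad x||."""
    return np.clip(1.0 - np.abs(alpha) / eps, 0.0, None) / eps * beta

def clamp(h, frac=0.2):
    """Clamping (28): clip the discrete histogram at frac * max and renormalize."""
    c = np.minimum(h, frac * h.max())
    return c / c.sum()

def wrap(a):
    """Wrap an angle difference to [-pi, pi)."""
    return np.angle(np.exp(1j * a))

rng = np.random.default_rng(0)
angles = rng.uniform(-np.pi, np.pi, 50)      # gradient orientations in a synthetic patch
mags = rng.gamma(2.0, 0.5, 50)               # gradient magnitudes
bins = np.linspace(-np.pi, np.pi, 8, endpoint=False)

exact = np.array([sum(p_exact(wrap(b - a), m) for a, m in zip(angles, mags)) for b in bins])
sift = np.array([sum(sift_integrand(wrap(b - a), m) for a, m in zip(angles, mags)) for b in bins])
print(np.round(exact / exact.sum(), 3))   # pooled contrast invariant (14)
print(np.round(sift / sift.sum(), 3))     # pooled SIFT integrand (18), un-clamped
print(np.round(clamp(sift), 3))           # clamped at 20% of the maximum, as in (28)
```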
The clamping approximation is sensible only for coarse binning, and heavily influenced by the choice of threshold. For an 8-bin histogram, the best approximation is typically achieved with clamping threshold between 10% and 30% of the maximum; note that Lowe (2004) empirically chose 20%. 3.4 ROTATION INVARIANCE Canonization Soatto (2009) is particularly well suited to deal with planar rotation, since it is possible to design co-variant detectors with few isolated extrema. An example is the local maximum of the norm of the gradient along the direction α = ˆαl(x). Invariance to G = SO(2) can be achieved by retaining the samples {pθ(α|ˆαl)}L l=1. When no consistent (sometimes referred to as stable ) reference ˆαl can be found, it means that there is no co-variant functional with isolated extrema with respect to the rotation group, which means that the data is already invariant to rotation. Note that, again, planar rotations can affect both the training image x and the test image y. In some cases, a consistent reference (canonical element) is available. For instance, for geo-referenced scenes L = 1, and the projection of the gravity vector onto the image plane , ˆα, provides a canonical reference unless the two are orthogonal: pθ(α|G) = pθ(α|ˆα). (29) 4 DEEP CONVOLUTIONAL ARCHITECTURES In this section we study the approximation of (12) implemented by convolutional architectures. Starting from (12) for a particular class and a finite number of receptive fields we notice that, since Published as a conference paper at ICLR 2016 the true scene θ and the nuisances g are unknown, we cannot factor the likelihood pθ,g(y) into a product of pθ,gj, which would correspond to a bag-of-words discarding the dependencies in d PG({gj}|θ). Convolutional architectures (CNNs) promise to capture such dependencies by hierarchical decomposition into progressively larger receptive fields. Each layer is a collection of separator (hidden) variables (nodes) that make lower layers (approximately) conditionally independent. 4.1 STACKING SIMPLIFIES NUISANCE MARGINALIZATION We show this in several steps. We first argue that managing the group of diffeomorphisms can be accomplished by independently managing small diffeomorphisms in each layer. We use marginalization, but a similar argument can be constructed for max-out or the SA likelihood. Then, we leverage on the local restrictions induced by receptive fields, to deal with occlusion, and argue that such small diffeomorphisms can be reduced locally to a simpler group (affine, similarity, locationscale or even translations, the most common choice in convolutional architectures). Then global marginalization of diffeomorphisms can be accomplished by local marginalization of the reduced group in each layer. The following lemma establishes that global diffeomorphisms can be approximated by the composition of small diffeomorphisms. Lemma 1. Let g G, g : D D be an orientation-preserving diffeomorphism of a compact subset D of the real plane, e G the identity, and d(e, g) the size of g. Then for any ϵ > 0 there exists an N < and g1, . . . , g N such that g = g1 g2 g N and d(e, gi) < ϵ i = 1, . . . , N. Now for two layers, let g = g1 g2, with g1, g2 p(g) drawn independently from a prior on G. Then p(g|g1, g2) = δ(g g1 g2) (or a non-degenerate distribution if gi are approximated by elements of the reduced group). Then let θ1 = g2θ, or more generally let θ1 be defined such that θ1 g1 | θ, g2. 
We then have p G(y|θ) = Z p(y|θ, g)d P(g) = Z p(y|θ1, g1, g)d P(θ1, g1, g2|θ, g)d P(g) = (30) = Z p(y|θ1, g1)d P(g1|θ)p(θ1|θ, g2)d P(g2|θ)dθ1 (31) where we have also used the fact that y θ | θ1. Once the separator variable θ1 is reduced to a number K1 of filters, we have Z p(y|θ1 k, g1)d P(g1|θ) Z p(θ1 k|θ, g2)d P(g2|θ) k=1 p G(y|θ1 k)p G(θ1 k|θ) (32) in either case, by extending this construction to L = N layers, we can see that marginalization of each layer can be performed with respect to (conditionally) independent diffeomorphisms that can be chosen to be small per the Lemma above. Claim 3. Marginalization of the likelihood with respect to an arbitrary diffeomorphism g G can be achieved by introducing layers of hidden variables θl l = 1, . . . , L and independently marginalizing small diffeomorphisms gl G at each layer. The next step is to restrict the marginalization to each receptive field, at which point it can be approximated by a reduced subgroup, or the (linear) generators. 4.2 HIERARCHICAL DECOMPOSITION OF THE LIKELIHOOD Let p G(y|θ) .= Z G p(y|θ, g)d P(g) (33) be the marginal likelihood with respect to some prior on G and introduce a layer of separator variables θ1 and group actions g, defined such that y θ | (θ1, g1). This can always be done by choosing θ1 = y; we will address non-trivial choices later. In either case, forgoing the subscript G, p(y|θ) = Z p(y|θ1, g1)d P(θ1, g1|θ). (34) Published as a conference paper at ICLR 2016 If θ1 and g1 take a finite number K1 and L1 of values {θ1 1, . . . , θ1 K1} (filters) and {g1 1, . . . , g1 L1}, then the above reduces to a sum over k = 1, . . . , K1 and ℓ1 = 1, . . . L1; the conditional likelihoods {p(y|θ1 1, g1 j ), . . . , p(y|θ1 K1, g1 j )} are the feature maps. If y has dimensions N M and the group actions g1 j are taken to be pixel wise translations across the image plane, so that L1 = N M, the feature maps p(y|θ1, g1) can be represented as a tensor with dimensions N M K1. One can repeat the procedure for new separator variables that take K2 possible values, and group actions g2 that take L2 = N1 M1 values; the filters θ2 must be supported on the entire feature maps p(y|θ1, g1) (i.e., take values in N1 M1 K1) for the sum over k = 1, . . . , K1 to implement the marginalization above k=1 p(y|θ1 k, g1 ℓ1)p(θ1 k, g1 ℓ1|θ2 j, g2 ℓ2) p(θ2 j, g2 ℓ2|θ). (35) The sum is implemented in convolutional networks by the use of translation invariant filters: 1. At the first layer the support of θ1 is a small fraction of N M and g1 acts on y so that21 p(y|θ1 k, g1 ℓ1) = p(g1 ℓ1y|θ1 k) = p(y|Vj|θ1 k). 2. At the second layer the filter p(θ1 k, g1 ℓ1|θ2 j, g2 ℓ2) is nonzero for a finite (and small) number of group actions g1 ℓ1 and also satisfies the shift invariant (convolutional) property p(θ1 k, g1 ℓ1|θ2 j, g2 ℓ2) = p(θ1 k, g2 ℓ2g1 ℓ1|θ2 j) The third dimension of the filters is the number of feature maps in the previous layer. 4.3 APPROXIMATION OF THE FIRST LAYER Each node in the first layer computes a local representation (5) using parent node(s) as a scene. This relates to existing convolutional architectures where nodes compute the response of a rectified linear unit (Re Lu) to a filter bank. For simplicity we restrict G to the translation group, thus reducing (5) to SIFT, but the arguments apply to similarities. A Re Lu response at (u, v) to an oriented filter bank G with scale σ and orientation α is given by R+(α, u, v, σ) = max(0, G(u, v; σ, α) x(u, v)). 
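As a preview of the construction that follows, and of the approximation stated in (39) below, the sketch compares the ReLU of an oriented Gaussian-derivative response with the Gaussian-smoothed, rectified-cosine-weighted gradient magnitude on an image with sparse gradients; the filter construction, the test image, the parameters and all names are our own, and scipy is used only for the Gaussian smoothing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def relu_response(x, alpha, sigma):
    """ReLU of the response to the directional filter <grad N(.;sigma), r(alpha)>,
    i.e., a Gaussian-smoothed directional derivative along r(alpha) = (cos a, sin a)."""
    xs = gaussian_filter(x, sigma)                 # N * x
    gu = np.gradient(xs, axis=1)                   # d/du (columns)
    gv = np.gradient(xs, axis=0)                   # d/dv (rows)
    return np.maximum(0.0, np.cos(alpha) * gu + np.sin(alpha) * gv)

def sift_like(x, alpha, sigma):
    """Rectified cosine between the image gradient and r(alpha), weighted by the
    gradient magnitude, then spatially smoothed (the right-hand side of (39))."""
    gu = np.gradient(x, axis=1)
    gv = np.gradient(x, axis=0)
    mag = np.hypot(gu, gv) + 1e-12
    cosang = (np.cos(alpha) * gu + np.sin(alpha) * gv) / mag
    return gaussian_filter(np.maximum(0.0, cosang) * mag, sigma)

# A sparse-gradient test image: a smoothed step edge.
x = np.zeros((64, 64)); x[:, 32:] = 1.0
x = gaussian_filter(x, 1.0)
for alpha in [0.0, np.pi / 4, np.pi / 2]:
    a, b = relu_response(x, alpha, 2.0), sift_like(x, alpha, 2.0)
    print(alpha, np.abs(a - b).max())   # small when the gradient projection has constant sign
```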
Let N(u, v; σ) be a Gaussian centered at (u, v) with isotropic variance σI, let ∇N(u, v; σ) = [∂N/∂u (u, v; σ), ∂N/∂v (u, v; σ)] be its gradient, and let r(α) = [cos α, sin α]ᵀ. Then G(u, v; σ, α) := ∇N(u, v; σ) · r(α) is a directional filter with principal orientation α and scale σ. Omitting rectification for now, the response of an image to a filter bank obtained by varying α ∈ [−π, π], at each location (u, v) and for all scales σ, is obtained as

R(α, u, v, σ) = G(u, v; σ, α) ∗ x(u, v) = (∇N(u, v; σ) ∗ x(u, v)) · r(α)    (36)
= N(u, v; σ) ∗ ( ⟨ ∇x(u, v)/‖∇x(u, v)‖, r(α) ⟩ ‖∇x(u, v)‖ )    (37)
= ∫ N(u − ũ, v − ṽ; σ) κ(∠∇x(ũ, ṽ), α) ‖∇x(ũ, ṽ)‖ dũ dṽ    (38)

where κ, the cosine of the angle between its arguments, has to be rectified for the above to approximate a histogram, κ+(α) = max(0, cos α), which yields SIFT. Unfortunately, in general the latter does not equal max(0, G ∗ x), for one cannot simply move the maximum inside the integral. However, under conditions on x which are typically satisfied by natural images, this is the case.

Claim 4. Let G be positive, smooth and have a small effective support σ < ∞, i.e., for any ϵ1, ϵ2 there is a σ such that vol({(ũ, ṽ) | G(ũ, ṽ; σ, α) ≥ ϵ1}) < ϵ2. Let x have a sparse and continuous gradient field, so that for every α the domain of x can be partitioned into three (multiply-connected) regions D+(α), D−(α) and the remainder (the complement of their union), where the projection of the gradient in the direction α is, respectively, positive, negative, and negligible, and let d(α) > 0 be the minimum distance between the regions D+ and D−. Then, provided that σ ≪ minα d(α), we have that

max(0, G(u, v; σ, α) ∗ x(u, v))  [ReLU]  ≈  ∫ N(u − ũ, v − ṽ; σ) κ+(∠∇x(ũ, ṽ), α) ‖∇x(ũ, ṽ)‖ dũ dṽ  [SIFT]    (39)

21 There is a non-trivial approximation here, namely that context is neglected when assuming that the likelihood p(g1ℓ1 y | θ1k) depends only on the restriction of y to the receptive field Vj; see also Section 3.1 and equation (7).

4.4 STACKING INFORMATION

A local hierarchical architecture allows approximating the SA likelihood pθ,G(·) by reducing nuisance management to local marginalization and max-out of simple group transformations. The SA likelihood pθ,G(y) is an optimal representation for any query on θ given data y. For instance, for classification, the representation pθ,G(y) is itself the classifier (discriminant). Thus, if we could compute an optimal classifier, the representation would be the identity; vice-versa, if we could compute the optimal representation, the classifier would be a threshold. In practice, one restricts the family of classifiers, for instance to soft-max, SVM, or linear classifiers, leaving the job of feeding the most informative statistic to the classifier. In a hierarchical architecture, this is the feature maps in the last layer. This is equivalent to neglecting the global dependency pθ,G(y|θL) on θ at the last layer. The information loss inherent in this choice is the loss incurred by assuming that the θL are independent (whereas they are only independent conditioned on θ). An optimal representation with restricted complexity L < ∞, therefore, maximizes the independence of the components of θL, or equivalently the independence of the components of y given θL. Using those results, one can show that the information content of a representation ((47) in App. A) grows with the number of layers L.

5 DISCUSSION

For the likelihood interpretation of a CNN put forward here to make sense, training should be performed generatively, so that, fixing the class θk, one could sample (hallucinate) future images y from the class.
However neither the architecture not the training of current CNN incorporate mechanisms to enforce the statistics of natural images. In this paper we emphasize the role of the task in the representation: If nothing is known about the task, nothing interesting can be said about the representation, and the only optimal one is the trivial one that stores all the data. This is because the task could end up being a query about the value of a particular pixel in a particular image. Nevertheless, there may be many different tasks that share the same representation by virtue of the fact that they share the same nuisances. In fact, the task affects what are nuisance factors, and the nuisance factors affect the design and learning of the representation. For some complex tasks, writing the likelihood function may be prohibitively complex, but some classes of nuisances such as changes of illumination or occlusions, are common to many tasks. Note that, by definition, a nuisance is not informative. Certain transformations that are nuisances for some tasks may be informative for others (and therefore would not be called nuisances). For instance, viewpoint is a nuisance in object detection tasks, as we want to detect objects regardless of where they are. It is, of course, not a nuisance for navigation, as we must control our pose relative to the surrounding environment. In some cases, nuisance and intrinsic variability can be entangled, as for the case of intra-class deformations and viewpoint-induced deformations in object categorization. Nevertheless, the deformation would be informative if it was known or measured, but since it is not, it must be marginalized. Our framework does not require nuisances to have the structure of a group. For instance, occlusions do not. Invariance and sensitivity are still defined, as a statistic is invariant if it is constant with respect to variations of the nuisance. What is not defined is the notion of maximal invariance, that requires the orbit structure. However, in our theory maximal invariance is not the focus. Instead, sufficient invariance is. The literature on the topic of representation is vast and growing. We focus on visual representations, where several have been active. Anselmi et al. (2015) have developed a theory of representation aiming at approximating maximal invariants, which restricts nuisances to (locally) compact groups and Published as a conference paper at ICLR 2016 therefore do not explicitly handle occlusions. Both frameworks achieve invariance at the expense of discriminative power, whereas in our framework both can be attained at the cost of complexity. Patel et al. (2015), that appeared after earlier drafts of this manuscript were made public, instead of of starting from principles and showing that they lead to a particular kind of computational architecture, instead assume a particular architecture and interpret it probabilistically, similarly to Ver Steeg & Galstyan (2015) that uses total correlation as a proxy of information, which is related to our App. A. However, there the representation is defined absent a task, so the analysis does not account for the role of nuisance factors. In particular, Anselmi et al. (2015) define a G-invariant representation µ of a (single) image I as being selective if µ(I) = µ(I ) I I for all I, I , i.e., if it is a maximal invariant. But while equivalence to the data up to the action of the nuisance group is critical for reconstruction, it is not necessary for other tasks. 
Furthermore, for non-group nuisances, such as occlusions, a maximal invariant cannot be constructed. Instead, given a task, we replace maximality with sufficiency for the task, and define at the outset an optimal representation to be a minimal sufficient invariant statistic, or sufficient invariant, approximated by the SAL Likelihood. The construction in Anselmi et al. (2015) guarantees maximality for compact groups. Similarly, Sundaramoorthi et al. (2009) have shown that maximal invariants can be constructed even for diffeomorphisms, which are infinitedimensional and non-compact. In practice, however, occlusions and scaling/quantization break the group structure, and therefore a different approach is needed that relies on sufficiency, not maximality, as we proposed here. To relate our approach to Anselmi et al. (2015), we note that the orbit probability of Def. 4.2 is given by ρI(A) = P({g | g I A}) (40) and is used to induce a probability on I, via P(I)[A] = P({g | g I A}). On the other hand, we define the minimal sufficient invariant statistic as the marginalized likelihood pθ,G(y) .= Z pθ,g(y)d P(g) (41) where y is a (future) image, and θ is the scene. If we consider the scene to be comprised of a set of images A = θ, and the future image y = I, then we see that the OP is a marginalized likelihood where d P(g) = dµ(g) is the Haar measure, and pθ,g(y) = δ(gy θ). Thus, substitutions G G, θ A, y I yield P(I)[A] = pθ,G(y) (42) for the particular choice of Haar measure and impulsive density pθ,g. The TP representation can also be understood as a marginalized likelihood, as Ψ(I)[A] is the G-marginalized likelihood of I given A when using the uniform prior and an impulsive conditional p A,g(I): Ψ(I)[A] = Z G p(g I|A)dµ(g). (43) Finally, our treatment of representations is not biologically motivated, in the sense that we simply define optimal representations from first principles, without regards for whether they are implementable with biological hardware. However, we have established connections with both local descriptors and deep neural networks, that were derived using biological inspiration. ACKNOWLEDGMENTS Work supported by FA9550-15-1-0229, ARO W911NF-15-1-0564, ONR N00014-15-1-2261, and MIUR RBFR12M3AC Learning meets time. We are appreciative of discussions with Tomaso Poggio, Lorenzo Rosasco, Stephane Mallat, and Andrea Censi. Alvarez, L., Guichard, F., Lions, P. L., and Morel, J. M. Axioms and fundamental equations of image processing. Arch. Rational Mechanics, 123, 1993. 7 Anselmi, F., Rosasco, L., and Poggio, T. On invariance and selectivity in representation learning. ar Xiv preprint ar Xiv:1503.05938, 2015. 2, 13, 14 Published as a conference paper at ICLR 2016 Bahadur, R. R. Sufficiency and statistical decision functions. Annals of Mathematical Statistics, 25 (3):423 462, 1954. 3 Bengio, Y. Learning deep architectures for ai. Foundations and trends R in Machine Learning, 2 (1):1 127, 2009. 2 Blackwell, D. and Ramamoorthi, R.V. A bayes but not classically sufficient statistic. The Annals of Statistics, 10(3):1025 1026, 1982. 3 Boureau, Y.-L., Ponce, J., and Le Cun, Y. A theoretical analysis of feature pooling in visual recognition. In Proc. of Intl Conf. on Mach. Learning, pp. 111 118, 2010. 2 Bouvrie, J. V., Rosasco, L., and Poggio, T. On invariance in hierarchical models. In NIPS, pp. 162 170, 2009. 2 Bruna, J. and Mallat, S. Classification with scattering operators. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2011. 
2 Chatfield, K., Lempitsky, V., Vedaldi, A., and Zisserman, A. The devil is in the details: an evaluation of recent feature encoding methods. 2011. 2 Chen, C. and Edelsbrunner, H. Diffusion runs low on persistence fast. In ICCV, pp. 423 430, 2011. Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley, 1991. 3 Creutzig, F., Globerson, A., and Tishby, N. Past-future information bottleneck in dynamical systems. Phys. Rev. Lett. E, 79(4):19251 19255, 2009. 17 De Groot, M. H. Probability and statistics. Addison-Wesley, 1989. 3 Dong, J. and Soatto, S. Machine Learning for Computer Vision, chapter Visual Correspondence, the Lambert-Ambient Shape Space and the Systematic Design of Feature Descriptors. R. Cipolla, S. Battiato, G.-M. Farinella (Eds), Springer Verlag, 2014. 2 Dong, J. and Soatto, S. Domain size pooling in local descriptors: Dsp-sift. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2015. 5 Fraser, D. and Naderi, A. Minimal sufficient statistics emerge from the observed likelihood function. International Journal of Statistical Science, 6:55 61, 2007. 17 Gong, Y., Wang, L., Guo, R., and Lazebnik, S. Multi-scale orderless pooling of deep convolutional activation features. ar Xiv preprint ar Xiv:1403.1840, 2014. 2 Grenander, U. General Pattern Theory. Oxford University Press, 1993. 6 Hinkley, D. Predictive likelihood. The Annals of Statistics, pp. 718 728, 1979. 19 Huang, J. and Mumford, D. Statistics of natural images and models. In Proc. CVPR, pp. 541 547, 1999. 5 Kirchner, M. R. Automatic thresholding of SIFT descriptors. Technical Report, 2015. 2 Le Cun, Y. Learning invariant feature hierarchies. In ECCV, pp. 496 505, 2012. 2 Lowe, D. G. Distinctive image features from scale-invariant keypoints. IJCV, 2(60):91 110, 2004. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., and Gool, L. Van. A comparison of affine region detectors. IJCV, 1(60):63 86, 2004. 2 Morel, J. M. and Yu, G. Is sift scale invariant? Inverse Problems and Imaging, 5(1):115 136, 2011. Patel, A., Nguyen, T., and Baraniuk, R. A probabilistic theory of deep learning. ar Xiv preprint ar Xiv:1504.00641, 2015. 14 Published as a conference paper at ICLR 2016 Pawitan, Y. In all likelihood: Statistical modeling and inference using likelihood. Oxford, 2001. 3, Ranzato, M., Huang, F. J., Boureau, Y.-L., and Le Cun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, pp. 1 8, 2007. 2 Serre, T., Oliva, A., and Poggio, T. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, 104(15):6424 6429, 2007. 2 Shao, J. Mathematical Statistics. Springer Verlag, 1998. 2, 4 Simonyan, K., Vedaldi, A., and Zisserman, A. Learning local feature descriptors using convex optimisation. IEEE Trans. Pattern Anal. Mach. Intell., 2(4), 2014. 2 Soatto, S. Actionable information in vision. In Proc. of the Intl. Conf. on Comp. Vision, October 2009. 4, 10, 17, 18 Sundaramoorthi, G., Petersen, P., Varadarajan, V. S., and Soatto, S. On the set of images modulo viewpoint and contrast changes. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., June 2009. 14 Susskind, J., Memisevic, R., Hinton, G. E., and Pollefeys, M. Modeling the joint density of two images under a variety of transformations. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., pp. 2793 2800, 2011. 2 Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. In Proc. 
of the Allerton Conf., 2000. 2 Ver Steeg, G. and Galstyan, A. Maximally informative hierarchical representations of highdimensional data. in depth, 13:14, 2015. 14 Published as a conference paper at ICLR 2016 A QUANTIFYING THE INFORMATION CONTENT OF A REPRESENTATION In general we do not know the likelihood, but we have a collection of samples xt pθ,gt(x), each generated with some nuisance gt, which we can use to infer, or learn an approximation of the SAL likelihood Fraser & Naderi (2007). φθ( ) .= pθ( ) ˆp Xt( ), Xt pθ( ) (empirical likelihood) (44) φθ,G( ) .= pθ,G( ) ˆp Xt,G( ) (profile likelihood) (45) φθ,G(y, xt) .= φθ,G(y)φθ(xt) ˆp Xt,G(y)ˆp Xt(xt) ˆpxt,G(y) (learned representation)(46) Depending on modeling choices made, including the number of samples N, the sampling mechanism, and the priors for local marginalization, the resulting representation will be lossy compared to the data. Next we quantify such a loss. A.1 INFORMATIVE CONTENT OF A REPRESENTATION The scene θ is in general infinite-dimensional. Thus, the informative content of the data cannot be directly quantified by mutual information I(θ; xt). In fact, H(θ) = and therefore, no matter how many (finite) data xt we have, I(θ; xt) = H(θ) H(θ|xt) = is not defined. Similarly, I(θ; φ(xt)) is undefined and therefore mutual information cannot be used directly to measure the informative content of a representation, or to infer the most informative statistic. The notion of Actionable Information Soatto (2009) as H(xt) .= H(φθ,G(xt)) where φθ,G is a maximal G-invariant, can be used to bypass the computation of H(θ): Writing formally I(θ; φθ,G(y)) = H(φθ,G(y)) H(φθ,G(y)|θ) = H(xt) H(φθ,G(y)|θ) (47) we see that, if we were given the scene θ, we could generate the data y, under the action of some nuisance g, up to the residual modeling uncertainty, which is assumed white and zero-mean (lest the mean and correlations of the residual can be included in the model). Similarly, we can generate a maximal invariant φθ,G(y) up to a residual with constant entropy; therefore, the statistic which maximizes I(θ; φθ,G(y)) is the one that maximizes the first term in (47), H(φθ,G(y)). This formal argument allows defining the most informative statistic as the one that maximizes Actionable Information, ˆφθ,G = arg maxφθ,G H(φθ,G(xt)) .= H(xt), bypassing the computation of the entropy of the scene θ. Note that the task still influences the information content of a representation, by defining what are the nuisances G, which in turn affect the computation of actionable information. An alternative approach is to measure the informative content of a representation bypassing consideration of the scene is described next. Definition 1 (Informative Content of a Representation). The information a statistic φ of xt conveys on θ is the information it conveys on a task T (e.g., a question on the scene θ), regardless of nuisances g G: I(g T; φ(xt)) = H(g T) H(g T|φ(xt)) g G (48) If the task is reconstruction (or prediction) T = y, where past data xt and future data y are generated by the same scene θ, then the definition above relates to the past-future mutual information Creutzig et al. (2009) except for the role of the nuisance g. The following claim shows that an ideal representation, as previously defined in terms of minimal sufficient invariant statistic, maximizes information. Claim 5. Let past data xt and future data y, used to accomplish a task T, be generated by the same scene θ. 
The following claim shows that an ideal representation, as previously defined in terms of a minimal sufficient invariant statistic, maximizes information.

Claim 5. Let past data $x^t$ and future data y, used to accomplish a task T, be generated by the same scene θ. Then the representation $\phi_{\theta,G}$ maximizes the informative content of a representation.

The next claim relates a representation to Actionable Information.

Claim 6. If φ maximizes Actionable Information, it also maximizes the informative content of a representation.

Note that maximizing Actionable Information is a stronger condition than maximizing the information content of a representation. Since Actionable Information concerns maximal invariance, sufficiency is automatically implied, and the only role the task plays is the definition of the nuisance group G. This is limiting, since we want to handle nuisances that do not have the structure of a group; therefore, Definition 1 affords more flexibility than Soatto (2009).

The next two claims characterize the maximality properties of the profile likelihood. We first recall that the marginalized likelihood is invariant only if marginalization is done with respect to the base (Haar) measure, and in general it is not a maximal invariant, as one can show easily with a counter-example (e.g., a uniform density). On the other hand, the profile likelihood is by construction invariant regardless of the distribution on G, but it too is in general not maximal. However, it is maximal under generic conditions on the likelihood. To see this, consider $p_\theta(\cdot)$ to be given; for a free variable x, it can be written as a map q: $p_\theta(x) \doteq q(\theta, x)$. For a fixed x, q is a function of θ. If q is constant along x (the level curves are straight lines), then in general $q(\theta, y) = q(\theta, x)$ for all θ does not imply y = x. Indeed, y can be arbitrarily different from x. Even if q is non-degenerate (non-constant along certain directions) but presents symmetries, it is possible for some $y \neq x$ to yield the same $q(\theta, y) = q(\theta, x)$ for all θ. However, under generic conditions $q(\theta, y) = q(\theta, x)$ for all θ implies y = x. Now consider $p_{\theta,G}(\cdot)$ to be given; for a free variable x the map can be written using a function q such that $p_{\theta,G}(x) = \min_g q(\theta, g x)$. Note that $p_{\theta,G}$ is, by construction, invariant to G. Also note that, following the same argument as above, the invariant is not maximal, for it is possible that $p_{\theta,G}(x) = p_{\theta,G}(y)$ for all θ and yet x and y are not equivalent (equal up to a constant g: y = g x). However, if the function q is generic, then the invariant is maximal. In fact, let $\hat g(\theta, x) = \arg\min_g q(\theta, g x)$, so that $p_{\theta,G}(x) = q(\theta, \hat g(\theta, x) x)$. If we now have $q(\theta, \hat g(\theta, y) y) = q(\theta, \hat g(\theta, x) x)$ for all θ, then we can conclude, based on the argument above, that $\hat g(\theta, x) x = \hat g(\theta, y) y$. Since x and y are fixed, and the equality holds for all θ, we can conclude that $\hat g(\theta, y)^{-1} \hat g(\theta, x)$ is independent of θ. That allows us to conclude the following.

Claim 7 (Maximality of the profile likelihood). If the density $p_\theta(x)$ is generic with respect to x, then $p_{\theta,G}(\cdot)$ is a maximal G-invariant.

Since we do not have control over the function $p_{\theta,G}(\cdot)$, which is in general constructed from data, it is legitimate to ask what happens when the genericity condition above is not satisfied. Fortunately, distributions that yield non-maximal invariants can be ruled out as uninformative at the outset:

Claim 8 (Non-maximality and non-informativeness). If q is such that, for any $x \neq y$, we have $q(\theta, \hat g(\theta, x) x) = q(\theta, \hat g(\theta, y) y)$ for all θ, then $q_{\theta,G}(\cdot)$ is uninformative.

This follows from the definition of information, for any statistic T. As we have pointed out before, what matters is not that the invariant be maximal, but that it be sufficient. As anticipated in Rem. 1, we can achieve invariance with no sacrifice of discriminative power, albeit at the cost of complexity.
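A quick numerical illustration of the G-invariance used in the argument for Claims 7-8 (a sketch under assumed toy choices: q is a quadratic cost and G is the group of cyclic shifts; neither choice is from the paper):

```python
import numpy as np

# Numerical illustration of the invariance of the profile construction min_g q(theta, g x).
rng = np.random.default_rng(1)

def q(theta, x):
    # stand-in for -log p_theta(x) up to constants (a quadratic toy cost)
    return float(np.sum((x - theta) ** 2))

def profile(theta, x):
    # profile over the orbit of x under G = cyclic shifts: min_g q(theta, g x)
    return min(q(theta, np.roll(x, s)) for s in range(len(x)))

theta = rng.standard_normal(8)
x = rng.standard_normal(8)
gx = np.roll(x, 3)                 # x transformed by a group element g

# G-invariance: the profile value agrees on x and g x, for any theta
print(np.isclose(profile(theta, x), profile(theta, gx)))   # True
```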
Proof. Pick any $\theta_0$, define $T(x) \doteq \frac{L(\cdot\,; x)}{L(\theta_0; x)}$ and $f(T(x), \theta) \doteq \frac{L(\theta; x)}{L(\theta_0; x)}$, and apply the factorization lemma with $h(x) = L(\theta_0; x)$.

Proof. We denote with $\bar y \doteq \nabla y / \|\nabla y\|$ the normalized gradient of y, and similarly $\bar x$ for x; Φ maps the gradient to polar coordinates, $(\alpha, \rho) = \Phi(\nabla y)$ and $(\beta, \gamma) = \Phi(\nabla x)$, where $\alpha \doteq \angle \nabla y$, $\rho \doteq \|\nabla y\|$, $\beta \doteq \angle \nabla x$, $\gamma \doteq \|\nabla x\|$. The conditional density of $\nabla y$ given $\nabla x$ takes the polar form

$p(\rho, \alpha \,|\, \nabla x) = p(\nabla y \,|\, \nabla x)\big|_{\nabla y = \Phi^{-1}(\rho,\alpha)}\,\rho, \qquad p(\nabla y \,|\, \nabla x) = \frac{1}{2\pi\epsilon^2}\, e^{-\frac{1}{2\epsilon^2}\|\nabla y - \nabla x\|^2}.$ (49)

Defining $(\nabla x)_i$ to be the i-th component of $\nabla x$, (49) can be expanded as

$p(\rho, \alpha \,|\, \nabla x) = \rho\,\frac{1}{2\pi\epsilon^2}\, e^{-\frac{1}{2\epsilon^2}\left[(\rho\cos\alpha - (\nabla x)_1)^2 + (\rho\sin\alpha - (\nabla x)_2)^2\right]}$ (50)

and the exponent is

$(\rho\cos\alpha - (\nabla x)_1)^2 + (\rho\sin\alpha - (\nabla x)_2)^2 = \rho^2 - 2\rho\left((\nabla x)_1\cos\alpha + (\nabla x)_2\sin\alpha\right) + \|\nabla x\|^2 = \left(\rho - \gamma\langle\bar y, \bar x\rangle\right)^2 + \gamma^2\left(1 - \langle\bar y, \bar x\rangle^2\right).$ (51)

We are now interested in the marginal of (50) with respect to ρ, i.e.,

$p(\alpha \,|\, \nabla x) = \int_0^\infty p(\rho, \alpha \,|\, \nabla x)\, d\rho,$ (52)

where we can isolate the factor that does not depend on ρ:

$p(\alpha \,|\, \nabla x) = \frac{1}{\sqrt{2\pi\epsilon^2}}\, e^{-\frac{1}{2\epsilon^2}\gamma^2\left(1 - \langle\bar y, \bar x\rangle^2\right)} \underbrace{\int_0^\infty \frac{1}{\sqrt{2\pi\epsilon^2}}\, e^{-\frac{1}{2\epsilon^2}\left(\rho - \gamma\langle\bar y, \bar x\rangle\right)^2} \rho\, d\rho}_{M}.$

The bracketed term M is the integral on the interval $[0, \infty)$ of ρ weighted by a Gaussian density with mean $m \doteq \gamma\langle\bar y, \bar x\rangle = \cos(\angle\nabla y - \angle\nabla x)\,\|\nabla x\|$ and variance $\epsilon^2$; it can be rewritten, using the change of variable $\xi \doteq (\rho - m)/\epsilon$, as $\int_{-m/\epsilon}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\xi^2}(\epsilon\xi + m)\, d\xi$, which can be integrated by parts to yield $\frac{\epsilon}{\sqrt{2\pi}}\, e^{-\frac{m^2}{2\epsilon^2}} + m\left(1 - \Psi\!\left(-\frac{m}{\epsilon}\right)\right)$, and therefore

$p(\alpha \,|\, \nabla x) = \frac{1}{\sqrt{2\pi\epsilon^2}}\, \exp\!\left(-\frac{\|\nabla x\|^2 - m^2}{2\epsilon^2}\right) M$

which, once written explicitly in terms of x and y, yields (14)-(15).

Proof. Since $p_{\theta,G}(y, x^t)$ is sufficient for θ, and it factorizes into $\phi_{\theta,G}(y)\,\phi_\theta(x^t)$, then $\phi_\theta(x^t)$ is sufficient of $x^t$ for θ. By the factorization theorem (Theorem 3.1 of Pawitan (2001)), there exist functions $f_\theta$ and ψ such that $\phi_\theta(x^t) \propto f_\theta(\psi(x^t))$, i.e., the likelihood depends on the data only through the function ψ(·). This latter function is what is more commonly known as the sufficient statistic, which in particular has the property that $p(\theta \,|\, x^t) = p(\theta \,|\, \psi(x^t))$. However, if $\phi_\theta$ is sufficient for θ, it is also sufficient for future data generated from θ. Formally,

$p_G(y \,|\, x^t) = \int p_G(y \,|\, x^t, \theta)\, p(\theta \,|\, x^t)\, dP(\theta) = \int p_G(y \,|\, \theta)\, p(\theta \,|\, \psi(x^t))\, dP(\theta) = p_G(y \,|\, \psi(x^t))$ (55)

which shows that ψ minimizes the uncertainty on y for any $g \in G$, since the right-hand side is G-invariant. The right-hand side above is the predictive likelihood Hinkley (1979), which must therefore be proportional to $f_y(\psi(x^t))$ for some f and the same ψ, also by the factorization theorem.

Proof. (Sketch) The integral in (38) can be split into three components, one of which is omitted, leaving the positive component integrated on $D^+$ and the negative component on $D^-$. If the distance between these two regions is greater than σ, the components are disjoint, so for each (u, v) and α only the positive or the negative component is non-zero; since N and x are both positive and the sign is constant, rectification inside or outside the integral is equivalent. When σ > d there is an error in the approximation, which can be bounded as a function of σ, the minimum distance, and the maximum gradient component.

Figure 3: $D^+(0)$ (green) and $D^-(0)$, the positive and negative responses to a gradient filter in the horizontal direction. The black region is their complement, which separates them.

Figure 4: Rectified cosine (blue) and its powers, compared to a Gaussian kernel (red). While the two are distinctly different for ϵ = 1, the latter approximates the former as the power/dispersion decreases. The plot shows ϵ = 1, 1/5, 1/9 for the cosine, and 1/5, 1/9, 1/13 for the Gaussian.
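The ρ-marginalization carried out in the proof of $p(\alpha \,|\, \nabla x)$ above can be checked numerically. In the sketch below, the closed form is my own reconstruction from the fragments of the proof (not a verbatim transcription of (14)-(15)); the test compares it against direct quadrature of the polar joint density (50).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Numerical check of the rho-marginalization step (reconstructed constants, toy values).
eps = 0.5
grad_x = np.array([0.8, 0.3])                # a fixed gradient of x
gamma = np.linalg.norm(grad_x)
beta = np.arctan2(grad_x[1], grad_x[0])      # its orientation

def joint(rho, alpha):
    # p(rho, alpha | grad_x): isotropic Gaussian in gradient space, written in polar form
    grad_y = rho * np.array([np.cos(alpha), np.sin(alpha)])
    return rho / (2 * np.pi * eps**2) * np.exp(-np.sum((grad_y - grad_x)**2) / (2 * eps**2))

def marginal(alpha):
    # reconstructed closed form for p(alpha | grad_x), with m = gamma * cos(alpha - beta)
    m = gamma * np.cos(alpha - beta)
    return (np.exp(-gamma**2 / (2 * eps**2)) / (2 * np.pi)
            + m / (np.sqrt(2 * np.pi) * eps)
              * np.exp(-(gamma**2 - m**2) / (2 * eps**2)) * norm.cdf(m / eps))

for alpha in (0.0, 0.7, 2.0):
    numeric, _ = quad(joint, 0.0, np.inf, args=(alpha,))
    print(abs(numeric - marginal(alpha)) < 1e-6)             # True for each alpha
```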
A more general kernel could be considered, with a parameter ϵ that controls the decay, or width, of the kernel, $\kappa_\epsilon(\alpha)$. For instance, $\kappa_\epsilon(\alpha) = \kappa(\alpha)^{1/\epsilon}$, with the default value being ϵ = 1. An alternative is to define κ to be an angular Gaussian with dispersion parameter ϵ, which is constrained to be positive and therefore does not need rectification. Although the angular Gaussian is quite different from the cosine kernel for ϵ = 1, it approximates it as ϵ decreases (Fig. 4). A corollary of the above is that the visible layer of a CNN computes the SAL likelihood of the first hidden layer.

The interpretation of SIFT as a likelihood function given the test image y can be confusing, as it is ordinarily interpreted as a feature vector associated with the training image x and compared with other feature vectors using the Euclidean distance. In the likelihood interpretation, x is used to compute the likelihood function and y is used to evaluate it, so no descriptor is built for y. The same interpretational difference applies to convolutional architectures. If interpreted as a likelihood, which would require generative learning, one would compute the likelihood of different hypotheses given the test data. Instead, the test data is currently fed to the network just as the training data were, generating feature maps that are then compared (discriminatively) by a classifier.
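The interpretational difference in the last paragraph can be made concrete with a toy stand-in for a descriptor (a minimal sketch, not SIFT and not the paper's construction): a gradient-orientation histogram built from x is either compared to y's histogram by Euclidean distance, or used as a likelihood over orientations evaluated directly on y's raw gradient data, with no descriptor built for y.

```python
import numpy as np

# Toy contrast between the feature-vector reading and the likelihood reading.
rng = np.random.default_rng(0)

def orientations(img):
    gy, gx = np.gradient(img.astype(float))
    return np.arctan2(gy, gx).ravel()

def histogram_descriptor(img, bins=8):
    h, _ = np.histogram(orientations(img), bins=bins, range=(-np.pi, np.pi), density=True)
    return h + 1e-6      # avoid zeros so the log-likelihood below stays finite

x = rng.random((16, 16))
y = x + 0.05 * rng.standard_normal((16, 16))   # a perturbed copy of x

# (a) usual reading: descriptors for both images, compared with the Euclidean distance
dist = np.linalg.norm(histogram_descriptor(x) - histogram_descriptor(y))

# (b) likelihood reading: x defines a density over orientations, evaluated at y's data
hx = histogram_descriptor(x)
edges = np.linspace(-np.pi, np.pi, len(hx) + 1)
idx = np.clip(np.digitize(orientations(y), edges) - 1, 0, len(hx) - 1)
loglik = np.sum(np.log(hx[idx]))

print(round(float(dist), 4), round(float(loglik), 1))
```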