# Deep Scale-spaces: Equivariance Over Scale

Daniel E. Worrall, AMLAB, Philips Lab, University of Amsterdam, d.e.worrall@uva.nl
Max Welling, AMLAB, Philips Lab, University of Amsterdam, m.welling@uva.nl
deworrall92.github.io

Abstract. We introduce deep scale-spaces (DSS), a generalization of convolutional neural networks, exploiting the scale symmetry structure of conventional image recognition tasks. Put plainly, the class of an image is invariant to the scale at which it is viewed. We construct scale equivariant cross-correlations based on a principled extension of convolutions, grounded in the theory of scale-spaces and semigroups. As a very basic operation, these cross-correlations can be used in almost any modern deep learning architecture in a plug-and-play manner. We demonstrate our networks on the Patch Camelyon and Cityscapes datasets, to prove their utility, and perform introspective studies to further understand their properties.

1 Introduction

Scale is inherent in the structure of the physical world around us and the measurements we make of it. Ideally, the machine learning models we run on this perceptual data should have a notion of scale, which is either learnt or built directly into them. However, the state-of-the-art models of our time, convolutional neural networks (CNNs) [Lecun et al., 1998], are predominantly local in nature due to small filter sizes. It is not thoroughly understood how they account for and reason about multiscale interactions in their deeper layers, and empirical evidence [Chen et al., 2018, Yu and Koltun, 2015, Yu et al., 2017] using dilated convolutions suggests that there is still work to be done in this arena.

In computer vision, typical methods to circumvent scale are: scale averaging, where multiple scaled versions of an image are fed through a network and then averaged [Kokkinos, 2015]; scale selection, where an object's scale is found and local computations are adapted accordingly [Girshick et al., 2014, Shelhamer et al., 2019]; and scale augmentation, where multiple scaled versions of an image are added to the training set [Barnard and Casasent, 1991]. While these methods help, they lack explicit mechanisms to fuse information from different scales into the same representation. Many works do indeed follow this approach [Ke et al., 2017, Saxena and Verbeek, 2016, Lin et al., 2017, Kanazawa et al., 2014, Huang et al., 2018], and in this work we follow this line of thinking and construct a generalized convolution taking, as input, information from different scales.

The utility of convolutions arises in scenarios where there is a translational symmetry (translation invariance) inherent in the task of interest [Cohen and Welling, 2016a]. Examples of such tasks are object classification [Krizhevsky et al., 2012], object detection [Girshick et al., 2014], or dense image labelling [Long et al., 2015]. By using translational weight-sharing [Lecun et al., 1998] for these tasks, we reduce the parameter count while preserving symmetry in the deeper layers. The overall effect is to improve sample complexity and thus reduce generalization error [Sokolic et al., 2017]. Furthermore, it has been shown that convolutions (and various reparameterizations of them) are the only linear operators that preserve symmetry [Kondor and Trivedi, 2018].
Attempts have been made to extend convolutions to scale, but they either break translation symmetry [Henriques and Vedaldi, 2017, Esteves et al., 2017], assume that scalings can be modelled in the same way as rotations [Marcos et al., 2018], or ignore symmetry constraints [Hilbert et al., 2018].

Figure 1: How to correctly downsample an image. Left to right: the original high-resolution image; a 1/8-subsampled image, where a lot of the image structure has been destroyed; a high-resolution, bandlimited (blurred) image; a bandlimited and 1/8-subsampled image. Compare the bandlimited and subsampled image with the naïvely subsampled image: much of the low-frequency image structure is preserved in the bandlimited and subsampled image. Image source: ImageNet.

The problem with the aforementioned approaches is that they fail to account for the unidirectional nature of scalings. In data there exist many one-way transformations, which cannot be inverted. Examples are occlusions, causal translations, downscalings of discretized images, and pixel lighting normalization. In each example the transformation deletes information from the original signal, which cannot be regained, and thus it is non-invertible. We extend convolutions to these classes of symmetry under non-invertible transformations via the theory of semigroups. Our contributions are the introduction of a semigroup equivariant correlation and a scale-equivariant CNN.

2 Background

This section introduces some key concepts such as groups, semigroups, actions, equivariance, group convolution, and scale-spaces. These concepts are presented for the benefit of the reader, who is not expected to have a deep knowledge of any of these topics a priori.

Downsizing images. We consider sampled images $f \in L^2(\mathbb{Z}^2)$, such as in Figure 1. For an image $f$, $x$ is the pixel position and $f(x)$ is the pixel intensity. If we wish to downsize by a factor of 8, a naïve approach would be to subsample every 8th pixel: $f_{\text{down}}(x) = f(8x)$. This leads to an artifact, aliasing [Mallat, 2009, p.43], where the subsampled image contains information at a higher frequency than can be represented by its resolution. The fix is to bandlimit pre-subsampling, suppressing high frequencies with a blur. Thus a better model for downsampling is $f_{\text{down}}(x) = [G \ast_{\mathbb{Z}^d} f](8x)$, where $\ast_{\mathbb{Z}^d}$ denotes convolution over $\mathbb{Z}^d$ and $G$ is an appropriate blur kernel (discussed later). Downsizing involves necessary information loss and cannot be inverted [Lindeberg, 1997]. Thus upsampling of images is not well-defined, since it involves imputing high-frequency information not present in the low-resolution image. As such, in this paper we only consider image downscaling.
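To make the bandlimit-then-subsample recipe above concrete, here is a minimal sketch in Python. It uses scipy's Gaussian blur as a stand-in for the blur kernel $G$ discussed later; the function names and the rule of thumb of a blur width of roughly half the subsampling factor are our own choices, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def downsample(f, factor=8, sigma=None):
    """Bandlimit-then-subsample: f_down(x) = [G * f](factor * x).

    f:      2D array (grayscale image).
    factor: integer subsampling factor.
    sigma:  std. dev. of the blur; a common anti-aliasing heuristic is
            roughly factor / 2, so frequencies above the new Nyquist
            limit are suppressed before subsampling.
    """
    if sigma is None:
        sigma = factor / 2.0
    f_blur = gaussian_filter(f, sigma=sigma)  # bandlimit
    return f_blur[::factor, ::factor]         # subsample

def downsample_naive(f, factor=8):
    """Naive (aliased) subsampling, for comparison with Figure 1."""
    return f[::factor, ::factor]
```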
Scale-spaces. Scale-spaces have a long history, dating back to the late fifties with the work of Iijima [1959]. They consist of an image $f_0 \in L^2(\mathbb{R}^d)$ and multiple blurred versions of it. Although sampled images live on $\mathbb{Z}^d$, scale-space analysis tends to be over $\mathbb{R}^d$, but many of the results we present are valid on both domains. Among all variants, the Gaussian scale-space (GSS) is the most common [Witkin, 1983]. Given an initial image $f_0$, we construct a GSS by convolving $f_0$ with an isotropic (rotation invariant) Gauss-Weierstrass kernel $G(x, t) = (4\pi t)^{-d/2}\exp\left(-\|x\|^2/4t\right)$ of variable width $t$ and spatial positions $x$. The GSS is the complete set of responses $f(t, x)$:

$$f(t, x) = [G(\cdot, t) \ast_{\mathbb{R}^d} f_0](x), \quad t > 0, \qquad (1)$$
$$f(0, x) = f_0(x), \qquad (2)$$

where $\ast_{\mathbb{R}^d}$ denotes convolution over $\mathbb{R}^d$. The higher the level $t$ (the larger the blur), the more high-frequency detail is removed. An example of a scale-space $f(t, x)$ can be seen in Figure 2. An interesting property of scale-spaces is the semigroup property [Florack et al., 1992], sometimes referred to as the recursivity principle [Pauwels et al., 1995], which is

$$f(s + t, \cdot) = G(\cdot, s) \ast_{\mathbb{R}^d} f(t, \cdot) \qquad (3)$$

for $s, t > 0$. It says that we can generate a scale-space from other levels of the scale-space, not just from $f_0$. Furthermore, since $s, t > 0$, it also says that we cannot generate sharper images from blurry ones using just a Gaussian convolution. Thus moving to blurrier levels encodes a degree of information loss. This property emanates from the closure of Gaussians under convolution, namely for multidimensional Gaussians with covariance matrices $\Sigma$ and $T$:

$$G(\cdot, \Sigma + T) = G(\cdot, \Sigma) \ast_{\mathbb{R}^d} G(\cdot, T). \qquad (4)$$

Figure 2: A scale-space. For implementations we logarithmically discretize the scale-axis.

We assume the initial image $f_0$ has a maximum spatial frequency content, dictated by pixel pitch in discretized images, which we model by assuming the image has already been convolved with a width-$s_0$ Gaussian, which we call the zero-scale. Thus an image of bandlimit $s$ in the scale-space is found at GSS slice $f(s - s_0, \cdot)$, which we see from Equation 3. There are many varieties of scale-spaces: the $\alpha$-scale-spaces [Pauwels et al., 1995], the discrete Gaussian scale-spaces [Lindeberg, 1990], the binomial scale-spaces [Burt, 1981], etc. These all have specific kernels, analogous to the Gaussian, which are closed under convolution (details in supplement).

Slices at level $t$ in the GSS correspond to images downsized by a dilation factor $0 < a \leq 1$ (1 = no downsizing, 0 = shrinkage to a point). An $a$-dilated and appropriately bandlimited image $p(a, x)$ is found as (details in supplement)

$$p(a, x) = f(t(a, s_0), a^{-1}x), \qquad t(a, s_0) := s_0(a^{-2} - 1). \qquad (5)$$

For clarity, we refer to decreases in the spatial dimensions of an image as dilation and increases in the blurriness of an image as scaling. For a generalization to anisotropic scaling we replace the scalar scale $t$ with a matrix $T$, the zero-scale $s_0$ with a covariance matrix $\Sigma_0$, and the dilation parameter $a$ with a matrix $A$, so

$$T(A, \Sigma_0) = A^{-1}\Sigma_0 A^{-\top} - \Sigma_0. \qquad (6)$$

Semigroups. The semigroup property of Equation 3 is the gateway between classical scale-spaces and the group convolution [Cohen and Welling, 2016a] (see end of section). Semigroups $(S, \cdot)$ consist of a non-empty set $S$ and a binary composition operator $\cdot : S \times S \to S$. Typically, the composition $s \cdot t$ of elements $s, t \in S$ is abbreviated $st$. For our purposes, these individual elements will represent dilation parameters. For $S$ to be a semigroup, it must satisfy the following two properties:
Closure: $st \in S$ for all $s, t \in S$;
Associativity: $(st)r = s(tr) = str$ for all $s, t, r \in S$.
Note that commutativity is not a given, so in general $st \neq ts$. The family of Gaussian densities under spatial convolution is a semigroup². Semigroups are a generalization of groups, which are used in Cohen and Welling [2016a] to model invertible transformations. For a semigroup to be a group, it must also satisfy the following conditions:
Identity element: there exists an $e \in S$ such that $es = se = s$ for all $s \in S$;
Inverses: for each $s \in S$ there exists an $s^{-1}$ such that $s^{-1}s = ss^{-1} = e$.

²For the Gaussians we find the identity element as the limit $\lim_{t \to 0} G(\cdot, t)$, which is a Dirac delta. Note that this element is not strictly in the set of Gaussians, thus the Gaussian family has no identity element.
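The semigroup property of Equations 3 and 4 is easy to check numerically. The sketch below uses scipy's gaussian_filter, which is parameterized by standard deviation rather than by the scale $t$ above, so closure appears as variances adding under convolution; the numbers and the interior crop are illustrative choices of ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
f0 = rng.standard_normal((128, 128))

# Blur in two steps versus blurring once: with sigma = sqrt(variance),
# variances add under convolution, sigma_total^2 = sigma_s^2 + sigma_t^2.
sigma_s, sigma_t = 2.0, 3.0
two_step = gaussian_filter(gaussian_filter(f0, sigma_s), sigma_t)
one_step = gaussian_filter(f0, np.sqrt(sigma_s**2 + sigma_t**2))

# Agreement is only approximate because the kernels are truncated and the
# image has boundaries, so we compare an interior crop.
err = np.abs(two_step - one_step)[16:-16, 16:-16].max()
print(f"max interior deviation: {err:.2e}")  # small, limited by truncation
```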
Actions. Semigroups are useful because we can use them to model transformations, also known as (semigroup) actions. Given a semigroup $S$, with elements $s \in S$, and a domain $X$, the action $L^X_s : X \to X$ is a map, also written $x \mapsto L^X_s[x]$ for $x \in X$. The defining property of the semigroup action is that it is associative and closed under composition, inheriting its compositional structure from the semigroup $S$. There are, in fact, two versions of the action, a left and a right action (Equation 7). For the left action $L^X_{st}$ we first apply $L^X_t$ and then $L^X_s$, but for the right action $R^X_{st}$ this is reversed:

$$\text{Left action: } L^X_{st}[x] = L^X_s[L^X_t[x]], \qquad \text{Right action: } R^X_{st}[x] = R^X_t[R^X_s[x]]. \qquad (7)$$

Actions can also be applied to functions $f : X \to Y$ by viewing $f$ as a point in a function space $\mathcal{F}(X)$. Note, when the function domain is obvious, we just write $\mathcal{F}$ for brevity. The result of the action is a new function, denoted $L^{\mathcal{F}}_s[f]$. Since the domain of $f$ is $X$, we commonly write $L^{\mathcal{F}}_s[f](x)$. Say the domain is a semigroup $S$; then an example left action is $L^{\mathcal{F}}_s[f](x) = f(R^X_s[x]) = f(xs)$. This highlights a connection between left actions on functions and right actions on their domains, which we return to later in our discussion on nonlinearities. Another example, which we shall use later, is the action $S^{\Sigma_0}_{A,z}$ used to form scale-spaces, namely

$$S^{\Sigma_0}_{A,z}[f_0](x) = [G^{\Sigma_0}_A \ast_{\mathbb{Z}^d} f_0](A^{-1}x + z), \qquad G^{\Sigma_0}_A := G(\cdot, A^{-1}\Sigma_0 A^{-\top} - \Sigma_0). \qquad (8)$$

The elements of the semigroup are the tuples $(A, z)$ (dilation $A$, shift $z$) and $G^{\Sigma_0}_A$ is an anisotropic discrete Gaussian. The action first bandlimits by $G^{\Sigma_0}_A \ast_{\mathbb{Z}^d} f_0$ and then dilates and shifts the domain by $A^{-1}x + z$. Note that for fixed $(A, z)$ this maps functions on $\mathbb{Z}^d$ to functions on $\mathbb{Z}^d$.

Lifting. We can also view actions as maps from functions on $X$ to functions on the semigroup $S$, so $\mathcal{F}(X) \to \mathcal{F}(S)$. We call this a lift [Kondor and Trivedi, 2018], denoting lifted functions as $f^\uparrow$. One example is the scale-space action (Equation 8). If we set $x = 0$, then

$$f^\uparrow(A, z) = S^{\Sigma_0}_{A,z}[f_0](0) = [G^{\Sigma_0}_A \ast_{\mathbb{Z}^d} f_0](z), \qquad (9)$$

which is the expression for an anisotropic scale-space (parameterized by the dilation $A$ rather than the scale $T(A, \Sigma_0) = A^{-1}\Sigma_0 A^{-\top} - \Sigma_0$). To lift a function onto a semigroup, we do not necessarily have to set $x$ equal to a constant; we could also integrate it out. An important property of lifted functions is that actions become simpler. For instance, if we define $f^\uparrow(s) = L^{\mathcal{F}}_s[f](0)$, then

$$(L^{\mathcal{F}}_t[f])^\uparrow(s) = L^{\mathcal{F}}_s[L_t[f]](0) = L^{\mathcal{F}}_{st}[f](0) = f^\uparrow(st). \qquad (10)$$

The action on $f$ could be complicated, like a Gaussian blur, but the action on $f^\uparrow$ is simply a shift, $f^\uparrow(s) \mapsto f^\uparrow(st)$. We can then define the action $L^{\mathcal{F}(S)}_t$ on lifted functions as $L^{\mathcal{F}(S)}_t[f^\uparrow](s) = f^\uparrow(st)$. This action is another example of where the left action on the function is a right action on the domain.
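As a concrete illustration of the lift of Equation 9, the sketch below builds the slices of a scale-space for the dilation discretization $A_k = 2^{-k}I$ used later in the paper. Reading $\Sigma_0 = s_0 I$ as a variance and standing in an ordinary Gaussian blur for the anisotropic discrete Gaussian $G^{\Sigma_0}_A$ are both simplifying assumptions on our part.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lift(f0, n_scales=4, s0=0.25):
    """Lift an image onto the scale-space, Eq. (9): one slice per dilation
    A_k = 2^-k I.  Slice k is f0 bandlimited to dilation level k; all slices
    stay at full resolution, and later scalings become shifts along k
    (Eq. (10))."""
    slices = []
    for k in range(n_scales):
        var_k = s0 * (4.0 ** k - 1.0)   # T(A_k, s0 I) = s0 (2^{2k} - 1) I, Eq. (6)
        sigma_k = np.sqrt(var_k)
        slices.append(f0 if sigma_k == 0 else gaussian_filter(f0, sigma_k))
    return np.stack(slices)             # shape (n_scales, H, W)
```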
Equivariance and Group Correlations. CNNs rely heavily on the (cross-)correlation³. Correlations $\star_{\mathbb{Z}^d}$ are a special class of linear map of the form

$$[f \star_{\mathbb{Z}^d} \psi](s) = \sum_{x \in \mathbb{Z}^d} f(x)\,\psi(x - s). \qquad (11)$$

Given a signal $f$ and filter $\psi$, we interpret the correlation as the collection of inner products of $f$ with all $s$-translated versions of $\psi$. This basic correlation has been extended to transformations other than translation via the group correlation, which as presented in Cohen and Welling [2016a] is

$$[f \star_H \psi](s) = \sum_{x \in X} f(x)\,\psi(L^X_{s^{-1}}[x]), \qquad (12)$$

where $H$ is the relevant group and $L^X_s$ is a group action, e.g. for 2D rotation $L^X_s[x] = R_s x$, where $R_s$ is a rotation matrix. Most importantly, the domain of the output is $H$. It highlights how this is an inner product of $f$ and $\psi$ under all $s$-transformations of the filter. If we denote⁴ $L^{\mathcal{F}}_s[f](x) = f(L^X_{s^{-1}}[x])$, the correlation exhibits a special property: it is equivariant under actions of the group $H$. In math,

$$L^{\mathcal{F}(H)}_s[f \star_H \psi] = L^{\mathcal{F}(X)}_s[f] \star_H \psi. \qquad (13)$$

Group correlation followed by the action is equivalent to the action followed by the group correlation, albeit the action is over a different domain. Note, the group action may look different depending on whether it was applied before or after the group correlation, but it represents the exact same transformation.

Notation. As a shorthand, we just write $L_s$ from now on; the domain of the action should be obvious from context.

³In the deep learning literature these are inconveniently referred to as convolutions, but we stick to "correlation".
⁴Recall how earlier we gave an example of left actions on functions as $L^{\mathcal{F}}_s[f](x) = f(R^X_s[x]) = f(xs)$; the group action we present here is an example of this, because $L^X_{s^{-1}}[x]$ is a right action.

3 Method

We aim to construct a scale-equivariant convolution. We shall achieve this by introducing an extension of the correlation to semigroups, which we then tailor to scalings.

Semigroup Correlation. There are multiple candidates for a semigroup correlation $\star_S$. The basic ingredients of such a correlation will be the inner product, the semigroup action $L_s$, and the functions $\psi \in \mathcal{F}$ and $f \in \mathcal{F}$. Furthermore, it must be equivariant to (left) actions on $f$. For a semigroup $S$, domain $X$, and action $L_s : \mathcal{F} \to \mathcal{F}$, we define:

$$[\psi \star_S f](s) = \sum_{x \in X} \psi(x)\, L_s[f](x). \qquad (14)$$

It is the set of responses formed from taking the inner product between a filter $\psi$ and a signal $f$ under all transformations of the signal. Notice that we transform the signal and not the filter, and that we write $\psi \star_S f$, not $f \star_S \psi$: it turns out that a similar expression where we apply the action to the filter is not equivariant to actions on the signal. Furthermore, this expression lifts a function from $X$ to $S$, so we expect actions on $f$ to look like a shift on the semigroup. A proof of equivariance to (left) actions on $f$ is as follows:

$$[\psi \star_S L_t[f]](s) = \sum_{x \in X} \psi(x)\, L_s[L_t[f]](x) = \sum_{x \in X} \psi(x)\, L_{st}[f](x) = [\psi \star_S f](st) = L_t[\psi \star_S f](s). \qquad (15)$$

We have used the definition of the left action, $L_s \circ L_t = L_{st}$, the semigroup correlation, and our definition of the action for lifted functions. We can recover the standard correlation of Equation 11 by substituting $S = \mathbb{Z}^d$, $X = \mathbb{Z}^d$, and the translation action $L_s[f](x) = f(x + s)$:

$$[\psi \star_{\mathbb{Z}^d} f](s) = \sum_{x \in \mathbb{Z}^d} \psi(x)\, f(x + s) = \sum_{x' \in \mathbb{Z}^d} \psi(x' - s)\, f(x'), \qquad (16)$$

where $x' = x + s$. We can also recover the group correlation by setting $S = H$, $X = H$, where $H$ is a discrete group, and $L_s[f](x) = f(R_s[x])$, where $R_s$ is a right action acting on the domain $X = H$:

$$[\psi \star_H f](s) = \sum_{x \in H} \psi(x)\, f(R_s[x]) = \sum_{x' \in H} \psi(R_{s^{-1}}[x'])\, f(x'), \qquad (17)$$

where $x' = R_s[x]$ and, since $R_s$ is a group action, inverses exist, so $x = R_{s^{-1}}[x']$.
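The following toy sketch instantiates the semigroup correlation of Equation 14 on a finite 1D domain and checks the equivariance property of Equation 15 numerically. We use cyclic translations (a group, and hence also a semigroup) purely to keep the example finite; all names in it are our own.

```python
import numpy as np

def semigroup_corr(psi, f, actions):
    """[psi * f](s) = sum_x psi(x) L_s[f](x), Eq. (14).

    psi, f:  signals on a finite domain (1D arrays of equal length).
    actions: dict mapping each semigroup element s to a callable L_s that
             acts on the whole signal and returns a transformed signal.
    """
    return {s: float(np.sum(psi * L_s(f))) for s, L_s in actions.items()}

# Toy semigroup: cyclic translations L_s[f](x) = f(x + s mod N); the cyclic
# wrap just keeps the example finite.
N, t = 8, 3
rng = np.random.default_rng(0)
f, psi = rng.standard_normal(N), rng.standard_normal(N)
actions = {s: (lambda g, s=s: np.roll(g, -s)) for s in range(N)}

lifted = semigroup_corr(psi, f, actions)               # [psi * f](s)
shifted = semigroup_corr(psi, actions[t](f), actions)  # [psi * L_t[f]](s)

# Equivariance, Eq. (15): correlating the transformed signal is a shift of
# the correlation of the original signal, [psi * L_t[f]](s) = [psi * f](st).
assert all(np.isclose(shifted[s], lifted[(s + t) % N]) for s in range(N))
```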
The semigroup correlation has two notable differences from the group correlation: i) In the semigroup correlation we transform the signal and not the filter. When we restrict to the group and standard correlation, transforming the signal or the filter are equivalent operations, since we can apply a change of variables. This is not possible in the semigroup case, since this change of variables requires an inverse, which we do not necessarily have. ii) In the semigroup correlation we apply an action to the whole signal as $L_s[f]$, as opposed to just the domain ($f(R_s[x])$). This allows for more general transformations than allowed by the group correlation of Cohen and Welling [2016a], since transformations of the form $f(R_s[x])$ can only move pixel locations, but transformations of the form $L_s[f]$ can alter the values of the pixels as well, and can incorporate neighbourhood information into the transformation.

The Scale-space Correlation. We now have the tools to create a scale-equivariant correlation. All we have to choose is an appropriate action for $L_s$. We choose the scale-space action of Equation 8. The scale-space action $S^{\Sigma_0}_{A,z}$ for functions on $\mathbb{Z}^d$ is given by

$$S^{\Sigma_0}_{A,z}[f](x) = [G^{\Sigma_0}_A \ast_{\mathbb{Z}^d} f](A^{-1}x + z), \qquad S^{\Sigma_0}_{A,z}\, S^{\Sigma_0}_{B,y} = S^{\Sigma_0}_{AB,\, Ay+z}. \qquad (18)$$

Since our scale-space correlation only works for discrete semigroups, we have to find a suitable discretization of the dilation parameter $A$. Later on we will choose a discretization of the form $A_k = 2^{-k}I$ for $k \geq 0$, but for now we will just assume that there exists some countable set $\mathcal{A}$ such that $S = \{(A, z) \mid A \in \mathcal{A},\, z \in \mathbb{Z}^d\}$ is a valid discrete semigroup. We begin by assuming we have lifted an input image $f$ onto the scale-space via Equation 9. The lifted signal is indexed by coordinates $(A, z)$, and so the filters share this domain and are of the form $\psi(A, z)$. The scale-space correlation is then

$$[\psi \star_S f](A, z) = \sum_{(B,y) \in S} \psi(B, y)\, S^{\Sigma_0}_{A,z}[f](B, y) = \sum_{(B,y) \in S} \psi(B, y)\, f(BA, A^{-1}y + z). \qquad (19)$$

For the second equality we recall that the action on a lifted signal is governed by Equation 10. The appealing aspect of this correlation is that we do not need to convolve with a bandlimiting filter (a potentially expensive operation to perform at every layer of a CNN), since we use signals that have already been lifted onto the semigroup. Instead, the action of scaling by $A$ is accomplished by fetching a slice $f(BA, \cdot)$ from a blurrier level of the scale-space. Let us restrict the scale correlation to the scale-space where $A_k = 2^{-k}I$ for $k \geq 0$, with zero-scale $\Sigma_0 = \frac{1}{4}I$. Denoting $f(2^{-k}I, \cdot)$ as $f_k(\cdot)$, this can be seen as a dilated convolution [Yu and Koltun, 2015] between $\psi_\ell$ and slice $f_{\ell+k}$. This form of the scale-space correlation (shown below) is the one we use in our experiments; a diagram of it can be seen in Figure 3:

$$[\psi \star_S f]_k(z) = \sum_{y \in \mathbb{Z}^d} \psi_\ell(y)\, f_{\ell+k}(2^k y + z). \qquad (20)$$

Figure 3: Scale correlation schematic. The left 3 stacks are the same input $f$, with levels $f_\ell(\cdot) = f(2^{-\ell}I, \cdot)$. Each stack shows the inner product between the filter $\psi$ (in green) at translation $z$ for dilation $2^{-k}I$, corresponding to the output level $[\psi \star_S f]_k$ on the right with matching color. Notice that as we dilate the filter, we also shift it one level up in the scale-space, according to Equation 20.

Equivariant Nonlinearities. Not all nonlinearities commute with semigroup actions, but it turns out that pointwise nonlinearities $\nu$ commute with a special subset of actions of the form

$$L_s[f](x) = f(R_s[x]), \qquad (21)$$

where $R_s$ is a right action on the domain of $f$. These sorts of actions cannot alter the values of $f$, just the locations of the values. If we write function composition as $[\nu \circ f](x) = \nu(f(x))$, then a proof of equivariance is as follows:

$$[\nu \circ L_s[f]](x) = \nu(f(R_s[x])) = [\nu \circ f](R_s[x]) = L_s[\nu \circ f](x). \qquad (22)$$

Equation 21 may at first glance seem overly restrictive, but it turns out that this is not the case. Recall that for functions lifted onto the semigroup, the action is $L_s[f](x) = f(xs)$. This satisfies Equation 21, and so we are free to use pointwise nonlinearities.
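A sketch of how the restricted scale correlation of Equations 20 and 24 could be implemented with off-the-shelf dilated convolutions is shown below, in PyTorch. The tensor layout, the function name, and the choices to sum contributions over the filter's scale extent and to pad so that spatial size is preserved are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def scale_corr(f, psi, n_out_scales):
    """Scale-space cross-correlation, Eqs. (20)/(24), via dilated convolutions.

    f:   lifted input, shape (batch, in_ch, n_scales, H, W); slice k is the
         image bandlimited to dilation 2^-k (blurrier as k grows).
    psi: filters, shape (out_ch, in_ch, scale_extent, kH, kW), kH and kW odd.
    """
    scale_extent = psi.shape[2]
    assert n_out_scales + scale_extent - 1 <= f.shape[2], "not enough input scales"
    out = []
    for k in range(n_out_scales):
        acc = 0.0
        for l in range(scale_extent):
            # Output level k reads slice l + k of the scale-space with a
            # 2^k-dilated copy of filter level l (Figure 3); the padding keeps
            # the spatial size fixed.  F.conv2d computes a cross-correlation,
            # matching the text.
            acc = acc + F.conv2d(f[:, :, l + k], psi[:, :, l],
                                 dilation=2 ** k,
                                 padding=(2 ** k) * (psi.shape[-1] // 2))
        out.append(acc)
    return torch.stack(out, dim=2)  # (batch, out_ch, n_out_scales, H, W)

# Example: a 4-level lift, filters with scale extent 2, three output scales.
f = torch.randn(1, 3, 4, 32, 32)
psi = torch.randn(8, 3, 2, 3, 3)
y = scale_corr(f, psi, n_out_scales=3)   # shape (1, 8, 3, 32, 32)
```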
Batch normalization. For batch normalization, we compute batch statistics over all dimensions of an activation tensor except its channels, as in Cohen and Welling [2016a].

Initialization. Since our correlations are based on dilated convolutions, we use the initialization scheme presented in Yu and Koltun [2015]. For pairs of input and output channels, the center pixel of each filter is set to one and the rest are filled with random noise of standard deviation $10^{-2}$.

Boundary conditions. In our semigroup correlation the scale dimension is infinite, which is a problem for practical implementations. Our solution is to truncate the scale-space to finite scale. This breaks global equivariance to actions in $S$, but is locally correct. Boundary effects occur at activations with receptive fields covering the truncation boundary. To mitigate these effects we: i) use filters with a scale-dimension no larger than two, and ii) interleave filters with scale-dimension 2 with filters of scale-dimension 1. The scale-dimension 2 filters enable multiscale interactions but propagate boundary effects, whereas the scale-dimension 1 kernels have no boundary overlap, but also no multiscale behavior. Interleaving trades off network expressiveness against boundary effects.

Table 1: Results on the Patch Camelyon and Cityscapes datasets. Higher is better. Our scale-equivariant models outperform the matched baselines. We must caution that better competing results can be found in the literature when the computational constraint is relaxed. For instance, Shelhamer et al. [2019] report an mAP of 71.4 with a ResNet-34, which is deeper than our model by 15 layers.

| PCam model | Accuracy |
| --- | --- |
| DenseNet baseline | 87.0 |
| S-DenseNet (Ours) | 88.1 |
| [Veeling et al., 2018] | 89.8 |

| Cityscapes model | mAP |
| --- | --- |
| ResNet, matched parameters | 45.66 |
| ResNet, matched channels | 49.99 |
| S-ResNet, multiscale (Ours) | 63.53 |
| S-ResNet, no interaction (Ours) | 64.78 |

Scale-space implementation. We use a 4-level scale-space with zero-scale 1/4 and dilations at integer powers of 2; the maximum dilation is 8 and the kernel width is 33 (4 standard deviations of a discrete Gaussian). We use the discrete Gaussian of Lindeberg [1990]. In 1D, for scale parameter $t$, this is

$$G(x, t) = e^{-t} I_{|x|}(t), \qquad (23)$$

where $I_x(t)$ is the modified Bessel function of integer order. For speed, we make use of the separability of isotropic kernels. For instance, convolution with a 2D Gaussian can be written as convolution with two identical 1D Gaussians, applied sequentially along the x- and then the y-axis. For an $N \times N$ input image and an $M \times M$ blurring kernel, this reduces the computational complexity of the convolution from $O(M^2N^2)$ to $O(2MN^2)$. With GPU parallelization, this saving is $O(M^2) \to O(M)$, which is especially significant for us since the largest blur kernel we use has $M = 33$.

Multi-channel features. Typically CNNs use multiple channels per activation tensor, which we have not included in our treatment above. In our experiments we include input channels $i$ and output channels $o$, so a correlation layer is

$$[\psi \star_S f]^o_k(z) = \sum_{x \in \mathbb{Z}^d} \psi^{i,o}_\ell(x)\, f^i_{\ell+k}(2^k x + z). \qquad (24)$$
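Below is a sketch of the discrete Gaussian of Equation 23 and the separable blur described above, using scipy's exponentially scaled Bessel function. The truncation radius of 16 (kernel width 33) follows the text, while the renormalization of the truncated kernel and the boundary mode are our own choices.

```python
import numpy as np
from scipy.ndimage import convolve1d
from scipy.special import ive  # exponentially scaled modified Bessel I_n

def discrete_gaussian_1d(t, radius=16):
    """Discrete Gaussian of Lindeberg [1990], Eq. (23): G(x, t) = exp(-t) I_x(t).

    For t >= 0, ive(n, t) = exp(-t) * iv(n, t), so it evaluates G(x, t)
    directly and stays numerically stable for large t.  radius=16 gives the
    width-33 kernel mentioned in the text."""
    x = np.arange(-radius, radius + 1)
    g = ive(np.abs(x), t)
    return g / g.sum()          # renormalize the truncated kernel

def separable_blur(image, t, radius=16):
    """Blur with the 2D discrete Gaussian using separability: two 1D passes,
    O(2 M N^2), instead of one 2D convolution, O(M^2 N^2)."""
    g = discrete_gaussian_1d(t, radius)
    out = convolve1d(image, g, axis=0, mode='nearest')
    return convolve1d(out, g, axis=1, mode='nearest')
```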
4 Experiments and Results

Here we present results for some preliminary, simple experiments on the Patch Camelyon [Veeling et al., 2018] and Cityscapes [Cordts et al., 2016] datasets. Due to the extra computational overhead of computing the scale-equivariant cross-correlation, the experiments are more an indicator of what our method is capable of; please note that the baselines are non-state-of-the-art reimplementations, restricted in size for a fair comparison, rather than attempts to beat published benchmarks. We also visualize the quality of scale-equivariance achieved.

Patch Camelyon. The Patch Camelyon or PCam dataset [Veeling et al., 2018] contains 327,680 tiles from two classes, metastatic (tumorous) and non-metastatic tissue. Each tile is a 96×96 px RGB crop, labelled as metastatic if there is at least one pixel of metastatic tissue in the central 32×32 px region of the tile. We test a 4-scale DenseNet model [Huang et al., 2017], S-DenseNet, on this task (architecture in supplement). We also train a scale non-equivariant DenseNet baseline and the rotation equivariant model of Veeling et al. [2018]. Our training procedure is: 100 epochs of SGD, learning rate 0.1 divided by 10 every 40 epochs, momentum 0.9, batch size 512, split over 4 GPUs. For data augmentation, we follow the procedure of Veeling et al. [2018] and Liu et al. [2017], using random flips, rotation, and 8 px jitter. For color perturbations we use: brightness delta 64/255, saturation delta 0.25, hue delta 0.04, contrast delta 0.75. The evaluation metric is accuracy. The results in Table 1 show that both the scale and rotation equivariant models outperform the computation-matched baseline.

Cityscapes. The Cityscapes dataset [Cordts et al., 2016] contains 2975 training images, 500 validation images, and 1525 test images of resolution 2048×1024 px. The task is semantic segmentation into 19 classes. We train a 4-scale ResNet [He et al., 2016], S-ResNet, and baselines. We train an equivariant network with and without multiscale interaction layers. We also train two scale non-equivariant models, one with the same number of channels and one with the same number of parameters. Our training procedure is: 100 epochs of Adam, learning rate $10^{-3}$ divided by 10 every 40 epochs, batch size 8, split over 4 GPUs. The results are in Table 1. The evaluation metric is mean average precision. We see that our scale-equivariant model outperforms the baselines. We must caution, however, that better competing results can be found in the literature. For instance, Shelhamer et al. [2019] report an mAP of 71.4 with a ResNet-34, which is deeper than our model by 15 layers; that said, they train with continuous scale augmentation, which we do not. The reason our baseline underperforms compared to the literature is the parameter/channel-matching, which has shrunk its size somewhat due to our own resource constraints. On a like-for-like comparison, scale-equivariance appears to help.

Quality of Equivariance. We validate the quality of equivariance empirically by comparing activations of a dilated image against the theoretical action on the activations. Using $\Phi$ to denote the deep scale-space (DSS) mapping, we compute the normalized L2-distance at each level $k$ of a DSS. Mathematically this is

$$\mathcal{L}(2^{-\ell}, k) = \frac{\left\| \Phi[f](k + \ell,\, 2^\ell \cdot) - \Phi\big[S^{\frac{1}{4}I}_{2^{-\ell},\, 0}[f]\big](k, \cdot) \right\|_2}{\left\| \Phi[f](k + \ell,\, 2^\ell \cdot) \right\|_2}. \qquad (25)$$
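As an illustration, here is a sketch of how the error of Equation 25 could be computed for one output level. The network interface, the (batch, channel, scale, height, width) activation layout, and the helper functions `lift` and `scale_action` are all assumptions on our part.

```python
import torch

def equivariance_error(model, lift, scale_action, f, k, l):
    """Normalized L2 distance of Eq. (25) at output level k for dilation 2^-l.

    model:        DSS network Phi; inputs and outputs are laid out as
                  (batch, channel, scale, H, W).
    lift:         maps an image to its scale-space lift (Eq. (9)).
    scale_action: applies the dilation action S_{2^-l, 0} to an image
                  (bandlimit, then subsample by 2^l).
    """
    phi_f = model(lift(f))                     # Phi[f]
    phi_sf = model(lift(scale_action(f, l)))   # Phi[S_{2^-l, 0}[f]]
    # Compare Phi[f] at level k + l, spatially subsampled by 2^l, with
    # Phi[S[f]] at level k.
    a = phi_f[:, :, k + l, ::2 ** l, ::2 ** l]
    b = phi_sf[:, :, k]
    return ((a - b).norm() / a.norm()).item()
```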
The equivariance errors are shown in Figure 4 for three DSSs (with 1, 2, and 3 layers) with random weights and a scale-space truncated to 8 scales. We see that the average error is below 0.01, indicating that the network is mostly equivariant, with errors due to truncation of the discrete Gaussian kernels used to lift the input to scale-space. We also see that the equivariance errors blow up for constant $\ell + k$ in each graph. This is the point where the receptive field of an activation overlaps with the scale-space truncation boundary.

Figure 4: Equivariance quality. Left to right: 1, 2, and 3 layer DSSs. Each line represents the error as in Equation 25. We see the residual error is typically < 0.01 until boundary effects are present.

5 Related Work

In recent years, there have been a number of works on group convolutions, namely continuous roto-translation in 2D [Worrall et al., 2017] and 3D [Weiler et al., 2018a, Kondor et al., 2018, Thomas et al., 2018], discrete roto-translations in 2D [Cohen and Welling, 2016a, Weiler et al., 2018b, Bekkers et al., 2018, Hoogeboom et al., 2018] and 3D [Worrall et al., 2017], continuous rotations on the sphere [Esteves et al., 2018, Cohen et al., 2018b], in-plane reflections [Cohen and Welling, 2016a], and even reverse-conjugate symmetry in DNA sequencing [Lunter and Brown, 2018]. Theory for convolutions on compact groups, used to model invertible transformations, also exists [Cohen and Welling, 2016a,b, Kondor and Trivedi, 2018, Cohen et al., 2018a], but to date the vast majority of work has focused on rotations.

For scale, there are far fewer works with explicit scale equivariance. Henriques and Vedaldi [2017] and Esteves et al. [2017] both perform a log-polar transform of the signal before passing it to a standard CNN. Log-polar transforms reparameterize the input plane into angle and log-distance from a predefined origin. The transform is sensitive to origin positioning, which, if done poorly, breaks translational equivariance. Marcos et al. [2018] use a group CNN architecture designed for roto-translation, but instead of rotating filters in the group correlation, they scale them. This seems to work on small tasks, but ignores large scale variations. Kanazawa et al. [2014] convolve the same filter over rescaled versions of the same feature map and then max-pool over feature location, for local scale-invariance. Hilbert et al. [2018] instead use filters of different sizes, but without any attention to equivariance. Ke et al. [2017] introduce a multigrid convolution, where the convolution outputs multiple feature maps at different resolutions. The input to each resolution's convolution is a concatenation of rescaled feature maps from the previous layer. This is similar to our work, but differs in two ways: 1) there is no across-scale weight-tying (so no explicit equivariance), and 2) they maintain feature maps at different resolutions rather than at different scalings (bandlimits). Two other works with similar approaches are Huang et al. [2018] and Saxena and Verbeek [2016], but their goal is architecture search rather than scale equivariance. Feature Pyramid Networks [Lin et al., 2017] also use multiscale features, with a scheme similar to a U-Net [Ronneberger et al., 2015], but where predictions are made at every resolution of the decoded feature maps. An interesting work with a very different approach is Shelhamer et al. [2019], where filters are formed as the convolution of a base filter and an anisotropic Gaussian filter, whose covariance is predicted at test time. This produces scale-adaptive filters.
6 Discussion, Limitations, and Future Works

We found our best performing architectures were composed mainly of correlations where the filters' scale dimension is one, interleaved with correlations where the scale dimension is higher. This is similar to a network-in-network [Lin et al., 2013] architecture, where 3×3 convolutional layers are interleaved with 1×1 convolutions. We posit this is the case because of boundary effects, as were observed in Figure 4. Further to boundary effects, we also suspect that using non-integer dilations with finer increments in scale would improve performance greatly, as often witnessed in the scale-space literature. This would, however, involve the development of non-integer dilations and hence interpolation. We see working on mitigating boundary effects, and using a semigroup correlation for non-integer scalings, as important future work, not just for scale-equivariance, but for CNNs as a whole. Another limitation of the current model is the increase in computational overhead, since we have added an extra dimension to the activations. This may not be a problem long term, as GPUs grow in speed and memory, but the computational complexity of a correlation grows exponentially in the number of symmetries of the model, and so we need more efficient methods to perform correlation, either exactly or approximately. In terms of experimentation, this extra computation limited our ability to compare against large state-of-the-art models, perhaps giving our model an unfair advantage in this limited scheme.

In terms of the experiments, a much more in-depth exploration of the empirical properties of the scale-equivariant correlation is needed. Our light proof-of-concept experiments provide some evidence that built-in scale equivariance can help, but before this is going to be useful in practice, we need to solve the issues of efficiency, non-integer dilations, and the boundary effects due to scale-space truncation. We see semigroup correlations as an exciting new family of operators to use in deep learning. We have demonstrated a proof-of-concept on scale, but there are many semigroup-structured transformations left to be explored, such as causal shifts, occlusions, and affine transformations. Concerning scale, we are also keen to explore how multi-scale interactions can be applied to other domains such as meshes and graphs, where symmetry is less well-defined.

7 Conclusion

We have presented deep scale-spaces, a generalization of convolutional neural networks, exploiting the scale symmetry structure of conventional image recognition tasks. We outlined new theory for a generalization of the convolution operator on semigroups, the semigroup correlation. Then we showed how to derive the standard convolution and the group convolution [Cohen and Welling, 2016a] from the semigroup correlation. We then tailored the semigroup correlation to the scale-translation action used in classical scale-space theory and demonstrated how to use this in modern neural architectures.

Acknowledgements

We thank Koninklijke Philips N.V. for in-cash and in-kind support of this research. We also thank Rianne van den Berg, Patrick Forré, and the anonymous reviewers, who all made important contributions to this paper.

References

E. Barnard and D. Casasent. Invariance and neural nets. IEEE Transactions on Neural Networks, 2(5):498-508, Sep. 1991. ISSN 1045-9227. doi: 10.1109/72.134287.
Erik J. Bekkers, Maxime W. Lafarge, Mitko Veta, Koen A. J. Eppenhof, Josien P. W. Pluim, and Remco Duits. Roto-translation covariant convolutional networks for medical image analysis. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2018 - 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part I, pages 440-448, 2018. doi: 10.1007/978-3-030-00928-1_50.

Peter J. Burt. Fast filter transform for image processing. Computer Graphics and Image Processing, 16(1):20-51, 1981.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834-848, 2018. doi: 10.1109/TPAMI.2017.2699184.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2990-2999, 2016a.

Taco Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant cnns on homogeneous spaces. CoRR, abs/1811.02017, 2018a.

Taco S. Cohen and Max Welling. Steerable cnns. CoRR, abs/1612.08498, 2016b.

Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. CoRR, abs/1801.10130, 2018b.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 3213-3223, 2016. doi: 10.1109/CVPR.2016.350.

Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar transformer networks. CoRR, abs/1709.01889, 2017.

Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical cnns. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pages 54-70, 2018. doi: 10.1007/978-3-030-01261-8_4.

Luc Florack, Bart M. ter Haar Romeny, Jan J. Koenderink, and Max A. Viergever. Scale and the differential structure of images. Image Vision Comput., 10(6):376-388, 1992. doi: 10.1016/0262-8856(92)90024-W.

Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 580-587, 2014. doi: 10.1109/CVPR.2014.81.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770-778, 2016. doi: 10.1109/CVPR.2016.90.

João F. Henriques and Andrea Vedaldi. Warped convolutions: Efficient invariance to spatial transformations. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1461-1469, 2017.

Adam Hilbert, Bastiaan Veeling, and Henk Marquering. Data-efficient convolutional neural networks for treatment decision support in acute ischemic stroke. International Conference on Medical Imaging with Deep Learning, 2018.

Emiel Hoogeboom, Jorn W. T. Peters, Taco S. Cohen, and Max Welling. HexaConv. CoRR, abs/1803.02108, 2018.
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261-2269, 2017. doi: 10.1109/CVPR.2017.243.

Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q. Weinberger. Multi-scale dense networks for resource efficient image classification. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=Hk2aImxAb.

Taizo Iijima. Basic theory of pattern observation. Technical Group on Automata and Automatic Control, pages 3-32, 1959.

Angjoo Kanazawa, Abhishek Sharma, and David W. Jacobs. Locally scale-invariant convolutional neural networks. CoRR, abs/1412.5104, 2014. URL http://arxiv.org/abs/1412.5104.

Tsung-Wei Ke, Michael Maire, and Stella X. Yu. Multigrid neural architectures. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4067-4075, 2017. doi: 10.1109/CVPR.2017.433. URL http://doi.ieeecomputersociety.org/10.1109/CVPR.2017.433.

Iasonas Kokkinos. Pushing the boundaries of boundary detection using deep learning. International Conference on Learning Representations (ICLR), 2015.

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2752-2760, 2018.

Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch-Gordan nets: a fully Fourier space spherical convolutional neural network. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 10138-10147, 2018.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 1106-1114, 2012.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.

Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 936-944, 2017. doi: 10.1109/CVPR.2017.106. URL https://doi.org/10.1109/CVPR.2017.106.

Tony Lindeberg. Scale-space for discrete signals. IEEE Trans. Pattern Anal. Mach. Intell., 12(3):234-254, 1990. doi: 10.1109/34.49051.

Tony Lindeberg. On the axiomatic foundations of linear scale-space. In Gaussian Scale-Space Theory, pages 75-97, 1997. doi: 10.1007/978-94-015-8802-7_6.
Yun Liu, Krishna Gadepalli, Mohammad Norouzi, George E. Dahl, Timo Kohlberger, Aleksey Boyko, Subhashini Venugopalan, Aleksei Timofeev, Philip Q. Nelson, Gregory S. Corrado, Jason D. Hipp, Lily Peng, and Martin C. Stumpe. Detecting cancer metastases on gigapixel pathology images. CoRR, abs/1703.02442, 2017.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3431-3440, 2015. doi: 10.1109/CVPR.2015.7298965.

Gerton Lunter and Richard Brown. An equivariant bayesian convolutional network predicts recombination hotspots and accurately resolves binding motifs. bioRxiv, 2018. doi: 10.1101/351254.

Stéphane Mallat. A Wavelet Tour of Signal Processing - The Sparse Way, 3rd Edition. Academic Press, 2009. ISBN 978-0-12-374370-1.

Diego Marcos, Benjamin Kellenberger, Sylvain Lobry, and Devis Tuia. Scale equivariance in cnns with vector fields. CoRR, abs/1807.11783, 2018.

Eric J. Pauwels, Luc J. Van Gool, Peter Fiddelaers, and Theo Moons. An extended class of scale-invariant and recursive scale space filters. IEEE Trans. Pattern Anal. Mach. Intell., 17(7):691-701, 1995. doi: 10.1109/34.391411.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pages 234-241, 2015. doi: 10.1007/978-3-319-24574-4_28. URL https://doi.org/10.1007/978-3-319-24574-4_28.

Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4053-4061, 2016. URL http://papers.nips.cc/paper/6304-convolutional-neural-fabrics.

Evan Shelhamer, Dequan Wang, and Trevor Darrell. Blurring the line between structure and learning to optimize and adapt receptive fields. CoRR, abs/1904.11487, 2019. URL http://arxiv.org/abs/1904.11487.

Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel R. D. Rodrigues. Generalization error of invariant classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pages 1094-1103, 2017.

Nathaniel Thomas, Tess Smidt, Steven M. Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds. CoRR, abs/1802.08219, 2018.

Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant cnns for digital pathology. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2018 - 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II, pages 210-218, 2018. doi: 10.1007/978-3-030-00934-2_24.

Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco Cohen. 3d steerable cnns: Learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 10402-10413, 2018a.

Maurice Weiler, Fred A. Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant cnns. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 849-858, 2018b.
Andrew P. Witkin. Scale-space filtering. In Proceedings of the 8th International Joint Conference on Artificial Intelligence, Karlsruhe, FRG, August 1983, pages 1019-1022, 1983.

Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 7168-7177, 2017. doi: 10.1109/CVPR.2017.758.

Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.

Fisher Yu, Vladlen Koltun, and Thomas A. Funkhouser. Dilated residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 636-644, 2017. doi: 10.1109/CVPR.2017.75.