# Equivariant Transformer Networks

Kai Sheng Tai¹, Peter Bailis¹, Gregory Valiant¹

¹Stanford University, Stanford, CA, USA. Correspondence to: Kai Sheng Tai.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

How can prior knowledge on the transformation invariances of a domain be incorporated into the architecture of a neural network? We propose Equivariant Transformers (ETs), a family of differentiable image-to-image mappings that improve the robustness of models towards pre-defined continuous transformation groups. Through the use of specially-derived canonical coordinate systems, ETs incorporate functions that are equivariant by construction with respect to these transformations. We show empirically that ETs can be flexibly composed to improve model robustness towards more complicated transformation groups in several parameters. On a real-world image classification task, ETs improve the sample efficiency of ResNet classifiers, achieving relative improvements in error rate of up to 15% in the limited data regime while increasing model parameter count by less than 1%.

## 1. Introduction

In computer vision, we are often equipped with prior knowledge on the transformation invariances of a domain. Consider, for example, the problem of classifying street signs in real-world images. In this domain, we know that the appearance of a sign in an image is subject to various deformations: the sign may be rotated, its scale will depend on its distance, and it may appear distorted due to perspective in 3D space. Regardless, the identity of the street sign should remain invariant to these transformations.

With the exception of translation invariance, convolutional neural network (CNN) architectures typically do not take advantage of such prior knowledge on the transformation invariances of the domain. Instead, current standard practice heuristically incorporates these priors during training via data augmentation (e.g., by applying a random rotation or scaling to each training image). While data augmentation typically helps reduce the test error of CNN-based models, there is no guarantee that transformation invariance will be enforced for data not seen during training.

In contrast to training-time approaches like data augmentation, recent work on group equivariant CNNs (Cohen & Welling, 2016; Dieleman et al., 2016; Marcos et al., 2017; Worrall et al., 2017; Henriques & Vedaldi, 2017; Cohen et al., 2018) has explored new CNN architectures that are guaranteed to respond predictably to particular transformations of the input. For example, the CNN model family may be constrained such that a rotation of the input results in a corresponding rotation of its subsequent representation, a property known as equivariance. However, these techniques, most commonly designed for rotations and translations of the input (e.g., Dieleman et al. (2016); Marcos et al. (2017); Worrall et al. (2017)), fail to generalize to deeper compositions of continuous transformations. This limits the applicability of these techniques in more complicated real-world scenarios involving continuous transformations in several dimensions, such as the above example of street sign classification.
To address these shortcomings of group equivariant CNNs, we propose Equivariant Transformer (ET) layers, a flexible class of functions that improves robustness towards arbitrary pre-defined groups of continuous transformations. An ET layer for a transformation group G is an image-to-image mapping that satisfies the following local invariance property: for any input image φ and transformation T ∈ G, the images φ and Tφ are both mapped to the same output image. ET layers are differentiable with respect to both their parameters and input, and thus can be easily incorporated into existing CNN architectures. Additionally, ET layers can be flexibly combined to achieve improved invariance towards more complicated compositions of transformations (e.g., simultaneous rotation, scale, shear, and perspective transformations).

Importantly, the invariance property of ETs holds by construction, without any dependence on additional heuristics during training. We achieve this by using the method of canonical coordinates for Lie groups (Rubinstein et al., 1991). The key property of canonical coordinates that we utilize is their ability to reduce arbitrary continuous transformations to translation. For example, polar coordinates are canonical coordinates for the rotation group, since a rotation reduces to a translation in the angular coordinate. These specialized coordinates can be analytically derived for a given transformation and efficiently implemented within a neural network.

We evaluate the performance of ETs using both synthetic and real-world image classification tasks. Empirically, ET layers improve the sample efficiency of image classifiers relative to standard Spatial Transformer layers (Jaderberg et al., 2015). In particular, we demonstrate that ET layers improve the sample efficiency of modern ResNet classifiers on the Street View House Numbers dataset, with relative improvements in error rate of up to 15% in the limited data regime. Moreover, we show that a ResNet-10 classifier augmented with ET layers is able to exceed the accuracy achieved by a more complicated ResNet-34 classifier without ETs, thus reducing both memory usage and computational cost.

## 2. Related Work

Equivariant CNNs. There has been substantial recent interest in CNN architectures that are equivariant with respect to transformation groups other than translation. Equivariance with respect to discrete transformation groups (e.g., reflections and 90° rotations) can be achieved by transforming CNN filters or feature maps using the group action (Cohen & Welling, 2016; Dieleman et al., 2016; Laptev et al., 2016; Marcos et al., 2017; Zhou et al., 2017). Invariance can then be achieved by pooling over this additional dimension in the output of each layer. In practice, this technique supports only relatively small discrete groups since its computational cost scales linearly with the cardinality of the group.

Methods for achieving equivariance with respect to continuous transformation groups fall into one of two classes: those that expand the input in a steerable basis (Amari, 1978; Freeman & Adelson, 1991; Teo, 1998; Worrall et al., 2017; Jacobsen et al., 2017; Weiler et al., 2018; Cohen et al., 2018), and those that compute convolutions under a specialized coordinate system (Rubinstein et al., 1991; Segman et al., 1992; Henriques & Vedaldi, 2017; Esteves et al., 2018).
The relationship between these two categories of methods is analogous to the duality between frequency domain and time domain methods of signal analysis. Our work falls under the latter category that uses coordinate systems specialized to the transformation groups of interest.

Equivariance via Canonical Coordinates. Henriques & Vedaldi (2017) apply CNNs to images represented using coordinate grids computed using a given pair of continuous, commutative transformations. Closely related to this technique are Polar Transformer Networks (Esteves et al., 2018), a method that handles images deformed by translation, rotation, and dilation by first predicting an origin for each image before applying a CNN over log-polar coordinates. Unlike these methods, we handle higher-dimensional transformation groups by passing an input image through a sequence of ET layers in series. In contrast to Henriques & Vedaldi (2017), where a pair of commutative transformations is assumed to be given as input, we show how canonical coordinate systems can be analytically derived given only a single one-parameter transformation group using technical tools described by Rubinstein et al. (1991).

Spatial Transformer Networks. As with Spatial Transformer (ST) layers (Jaderberg et al., 2015), our ET layers aim to factor out nuisance modes of variation in images due to various geometric transformations. Unlike STs, ETs incorporate additional structure in the functions used to predict transformations. We expand on the relationship between ETs and STs in the following sections.

Locally-Linear Approximations. Gens & Domingos (2014) use local search to approximately align filters to image patches, in contrast to our use of a global change of coordinates. The sequential pose prediction process in a stack of ET layers is also reminiscent of the iterative nature of the Lucas-Kanade (LK) algorithm and its descendants (Lucas & Kanade, 1981; Lin & Lucey, 2017).

Image Registration and Canonicalization. ETs are related to classic phase correlation techniques for image registration that compare the Fourier or Fourier-Mellin transforms of an image pair (De Castro & Morandi, 1987; Reddy & Chatterji, 1996); these methods can be interpreted as Fourier basis expansions under canonical coordinate systems for the relevant transformations. Additionally, the notion of image canonicalization relates to work on deformable templates, where object instances are generated via deformations of a prototypical object (Amit et al., 1991; Yuille, 1991; Shu et al., 2018).

## 3. Problem Statement

In this section, we begin by reviewing influential prior work on image canonicalization with Spatial Transformers (Jaderberg et al., 2015). We then argue that the lack of self-consistency in pose prediction is a key weakness with the standard ST that results in poor sample efficiency.

### 3.1. Image Canonicalization with Spatial Transformers

Suppose that we observe a collection of images φ(x), each of which is a mapping from image coordinates x ∈ R² to pixel intensities in each channel. Each image is a transformed version of some latent canonical image φ₀: φ = T_θ φ₀ := φ₀(T_θ x), where the transformation T_θ : R² → R² is modulated by pose parameters θ ∈ R^k.
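To make this generative model concrete, the following is a minimal sketch (not the authors' code; the helper name `rotate_image` and the use of PyTorch's `affine_grid`/`grid_sample` are our own choices) of how a transformed image φ = φ₀(T_θ x) can be realized for a pure rotation by resampling the canonical image on a transformed coordinate grid:

```python
# Sketch: realize phi(x) = phi0(T_theta x) for a rotation T_theta by bilinear
# resampling of the canonical image phi0 on a rotated sampling grid.
import math
import torch
import torch.nn.functional as F

def rotate_image(phi0: torch.Tensor, theta: float) -> torch.Tensor:
    """phi0: (N, C, H, W) canonical image; returns phi with phi(x) = phi0(T_theta x)."""
    n = phi0.shape[0]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Affine matrix whose action on the normalized sampling grid is T_theta.
    mat = torch.tensor([[cos_t, -sin_t, 0.0],
                        [sin_t,  cos_t, 0.0]], dtype=phi0.dtype)
    grid = F.affine_grid(mat.expand(n, 2, 3), size=list(phi0.shape),
                         align_corners=False)
    return F.grid_sample(phi0, grid, mode='bilinear', align_corners=False)

phi0 = torch.rand(1, 1, 64, 64)          # stand-in canonical image
phi = rotate_image(phi0, theta=math.pi / 6)  # the same scene, rotated
```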
Figure 1. Sample complexity for predicting rotations. Predicted rotation angles vs. true angles for a rotated MNIST digit (left). The predictions of a self-consistent pose predictor will be parallel to the diagonal (dotted line). (a) After training with 10k rotated examples, a pose prediction CNN is not self-consistent; (b) with 50k rotated examples, it is only self-consistent over a limited range of angles. In contrast, (c) a rotationally-equivariant CNN outputs self-consistent predictions after 10k examples (with small error due to interpolation and boundary effects). There is a nonzero bias in θ̂ since the pose labels are latent and there is no preferred image orientation.

If the transformation family and the pose parameters θ for each image φ are known, then the learning problem may be greatly simplified. If T_θ is invertible, then access to θ implies that we can recover φ₀ from φ via T_θ⁻¹ φ = T_θ⁻¹ T_θ φ₀ = φ₀. This is advantageous for learning when φ₀ is drawn from a small or even finite set (e.g., φ₀ could be sampled from a finite set of digits, while φ belongs to an infinite set of transformed images).

When the pose parameters are latent, as is typical in practice, we can attempt to predict an appropriate inverse transformation from the observed input.¹ Based on this intuition, a Spatial Transformer (ST) layer L : Φ → Φ (Jaderberg et al., 2015) transforms an input image φ using pose parameters θ̂ = f(φ) that are predicted as a function of the input: L(φ) = T⁻¹_{f(φ)} φ, where the pose predictor f : Φ → R^k is typically parameterized as a CNN or fully-connected network.

¹ For example, the apparent convergence of parallel lines in the background of an image can provide information on the correct inverse projective transformation to be applied.

### 3.2. Self-Consistent Pose Prediction

A key weakness of standard STs is the pose predictor's lack of robustness to transformations of its input. As a motivating example, consider images in a domain that is known to be rotationally invariant (e.g., classification of astronomical objects), and suppose that we train an ST-augmented CNN that aims to canonicalize the rotation angle of input images. For some input φ, let the output of the pose predictor be f(φ) = θ̂ for some θ̂ ∈ [0, 2π). Then given T_θ φ (i.e., the same image rotated by an additional angle θ), we should expect the output of an ideal pose predictor to be f(T_θ φ) = θ̂ + θ + 2πm for some integer m. In other words, the pose prediction for an input φ should constrain those for T_θ φ over the entire orbit of the transformation.

We refer to this desired property of the pose prediction function as self-consistency (Figure 2). In general, we say that a pose prediction function f : Φ → R^k is self-consistent with respect to a transformation group G parameterized by θ ∈ R^k if f(T_θ φ) = f(φ) + θ, for any image φ and transformation T_θ ∈ G. We note that self-consistency is a special case of group equivariance.²

² A function f is equivariant with respect to the group G if there exist transformations T_g and T′_g such that f(T_g φ) = T′_g f(φ) for all g ∈ G and φ ∈ Φ.

Figure 2. Self-consistent pose prediction. We call a function f : Φ → R^k self-consistent if the action of a transformation T_θ on its input results in a corresponding increment of θ in its output. Self-consistency is desirable for functions that predict the pose (e.g., rotation angle) of an object in an image.

However, there is no guarantee that self-consistency should hold when pose prediction is performed using a standard CNN or fully-connected network: while standard CNNs are equivariant with respect to translation, they are not equivariant with respect to other transformation groups (Cohen & Welling, 2016).
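Self-consistency can be probed empirically. Below is a minimal sketch (our own construction, not from the paper) that measures how far a trained rotation pose predictor `f` deviates from f(T_θ φ) = f(φ) + θ, using `scipy.ndimage.rotate` to apply T_θ; the toy predictor shown is deliberately not self-consistent:

```python
# Sketch: quantify the self-consistency error of a rotation pose predictor f
# by rotating the input a known amount and checking the shift in the output.
import numpy as np
from scipy.ndimage import rotate

def self_consistency_error(f, image, theta_deg):
    """Return |f(T_theta(image)) - (f(image) + theta)|, wrapped to [0, 180] degrees."""
    rotated = rotate(image, angle=theta_deg, reshape=False, order=1)
    delta = f(rotated) - (f(image) + theta_deg)
    return abs((delta + 180.0) % 360.0 - 180.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 64))
    f = lambda img: float(img.mean()) * 360.0  # toy predictor, clearly not self-consistent
    print(self_consistency_error(f, image, theta_deg=30.0))
```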
In Figure 1, we illustrate a simple example of this limitation of standard CNNs. Using MNIST digits rotated by angles uniformly sampled in θ ∈ [0, 2π), we train a CNN classifier with an ST layer that predicts the rotation angle of the input image. During training, the model receives a rotated image as input along with the class label y ∈ {0, ..., 9}; the true rotation angle θ is unobserved. In this example task, we find that the poses predicted by the CNN are only approximately self-consistent within a small range of angles, even when the network is trained with 50,000 examples. In contrast, a rotation-equivariant CNN can achieve approximate self-consistency given only 10,000 training examples.

## 4. Equivariant Transformers

Due to this weakness of standard CNN pose predictors, we will instead use functions that are guaranteed by construction to satisfy self-consistency. We achieve this by leveraging the translation equivariance of standard CNN architectures in combination with specialized canonical coordinate systems designed for the particular transformation groups of interest. Canonical coordinates allow us to reduce the problem of self-consistent prediction with respect to an arbitrary continuous transformation group to that of self-consistent prediction with respect to the translation group.

We begin with preliminaries on canonical coordinate systems (§4.1). We then describe our proposed Equivariant Transformer architecture (§4.2). Next, we describe how canonical coordinates can be derived for a given transformation (§4.3). Finally, we describe how ET layers can be applied sequentially to handle compositions of several transformations (§4.4) and cover implementation details (§4.5).

### 4.1. Canonical Coordinate Systems for Lie Groups

The method of canonical coordinates was first described by Rubinstein et al. (1991) and later developed in more generality by Segman et al. (1992) for the purpose of computing image descriptors that are invariant under the action of continuous transformation groups.

A Lie group with parameters θ ∈ R^k is a group of transformations of the form T_θ : R^d → R^d that are differentiable with respect to θ. We let the parameter θ = 0 correspond to the identity element, T_0 x = x. A canonical coordinate system for G is defined by an injective map ρ from Cartesian coordinates to the new coordinate system that satisfies

ρ(T_θ x) = ρ(x) + Σ_{i=1}^{k} θ_i e_i,    (1)

for all T_θ ∈ G, where e_i denotes the ith standard basis vector. Thus, a transformation by T_θ appears as a translation by θ under the canonical coordinate system. To help build intuition, we give two examples of canonical coordinates:

Example 1 (Rotation). For T_θ x = (x₁ cos θ - x₂ sin θ, x₁ sin θ + x₂ cos θ), a canonical coordinate system is the polar coordinate system, ρ(x) = (tan⁻¹(x₂/x₁), √(x₁² + x₂²)).

Example 2 (Horizontal Dilation). For T_θ x = (x₁ e^θ, x₂), a canonical coordinate system is ρ(x) = (log x₁, x₂).
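As a quick numerical illustration of Example 1 and condition (1) (our own check, not from the paper), the sketch below verifies that polar coordinates turn a rotation into a translation of the angular coordinate, ρ(T_θ x) = ρ(x) + θ e₁:

```python
# Sketch: verify numerically that polar coordinates are canonical for rotation.
import numpy as np

def rho(x):
    """Polar canonical coordinates for the rotation group (Example 1)."""
    x1, x2 = x
    return np.array([np.arctan2(x2, x1), np.hypot(x1, x2)])

def T(theta, x):
    """Rotation by angle theta."""
    x1, x2 = x
    return np.array([x1 * np.cos(theta) - x2 * np.sin(theta),
                     x1 * np.sin(theta) + x2 * np.cos(theta)])

x = np.array([0.3, 0.5])
theta = 0.7
lhs = rho(T(theta, x))
rhs = rho(x) + np.array([theta, 0.0])
print(np.allclose(lhs, rhs))  # True (up to 2*pi wrap-around of the angle for larger theta)
```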
Reduction to Translation. The key property of canonical coordinates is their ability to adapt translation self-consistency to other transformation groups. Formally, this is captured in the following result (we defer the straightforward proof to the Appendix):

Proposition 1. Let f : Φ → R^k be self-consistent with respect to translation and let ρ be a canonical coordinate system with respect to a transformation group G parameterized by θ ∈ R^k. Then f_ρ(φ) := f(φ ∘ ρ⁻¹) is self-consistent with respect to G.

Given a canonical coordinate system ρ for a group G, we can thus immediately achieve self-consistency with respect to G by first performing a change of coordinates into ρ, and then applying a function that is self-consistent with respect to translation.

### 4.2. Equivariant Transformer Layers

Our proposed Equivariant Transformer layer leverages canonical coordinates to incorporate prior knowledge on the invariances of a domain into the network architecture. An Equivariant Transformer (ET) layer L_{G,ρ} : Φ → Φ for the group G with canonical coordinates ρ is defined as:

L_{G,ρ}(φ) := T⁻¹_{f_ρ(φ)} φ,    (2)

where the self-consistent pose predictor f_ρ is a CNN whose input is represented using the coordinates ρ.

The ET layer is an image-to-image mapping that applies the inverse transformation of the predicted input pose, where the pose prediction is performed using a network that satisfies self-consistency with respect to a pre-defined group G. A standard Spatial Transformer layer can be viewed as an ET where ρ is simply the identity map. Like the ST, the ET layer is differentiable with respect to both its parameters and its input; thus, it is easily incorporated as a layer in existing CNN architectures. We summarize the computation encapsulated in the ET layer in Figure 3.

Figure 3. Spatial and Equivariant Transformer architectures: (a) Spatial Transformer (ST); (b) Equivariant Transformer (ET). In both cases, pose parameters θ̂ estimated as a function f of the input image are used to apply an inverse transformation to the image. The ET predicts θ̂ in a self-consistent manner using a canonical coordinate system ρ.

Local Invariance. Unlike ST layers, ET layers are endowed with a form of local transformation invariance: for any input image φ, we have that L_{G,ρ}(φ) = L_{G,ρ}(T_θ φ) for all T_θ ∈ G. In other words, an ET layer collapses the orbit generated by the group action on an image to a single, canonical point. This property follows directly from the self-consistency of the pose predictor with respect to the group G. Importantly, local invariance holds for any setting of the parameters of the ET layer; thus, ETs are equipped with a strong inductive bias towards invariance with respect to the transformation group G.

Implementing Self-Consistency. We implement translation self-consistency in f by first predicting a spatial distribution by passing a 2D CNN feature map through a softmax function, and then outputting the coordinates of the centroid of this distribution. By the translation equivariance of CNNs, a shift in the CNN input results in a corresponding shift in the predicted spatial distribution, and hence the location of the centroid. We rescale the centroid coordinates to match the scale of the input coordinate grid.
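The following is a minimal sketch of such a centroid-based pose head (the module name, channel widths, and grid range are our assumptions, and details of the paper's released implementation are omitted). A small fully convolutional network produces a score map over the canonically resampled image, a softmax over all spatial positions turns it into a distribution, and the centroid of that distribution, rescaled to the coordinate grid, gives the prediction; translation equivariance of the convolutions makes the prediction shift with its input, up to boundary effects:

```python
# Sketch: spatial-softmax centroid pose head over an image in canonical coordinates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentroidPosePredictor(nn.Module):
    def __init__(self, in_channels=1, hidden=32, grid_range=(-1.0, 1.0)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )
        self.grid_range = grid_range

    def forward(self, x):
        # x: (N, C, H, W) image resampled under canonical coordinates rho.
        scores = self.net(x)                                  # (N, 1, H, W)
        n, _, h, w = scores.shape
        probs = F.softmax(scores.view(n, -1), dim=1).view(n, 1, h, w)
        lo, hi = self.grid_range
        ys = torch.linspace(lo, hi, h, device=x.device).view(1, 1, h, 1)
        xs = torch.linspace(lo, hi, w, device=x.device).view(1, 1, 1, w)
        # Centroid of the spatial distribution, rescaled to the grid range.
        cy = (probs * ys).sum(dim=(1, 2, 3))
        cx = (probs * xs).sum(dim=(1, 2, 3))
        return torch.stack([cx, cy], dim=1)                   # (N, 2)
```

In an ET layer, the centroid coordinate along which the group acts as a translation under ρ would be read off as θ̂ and used to apply the inverse transformation T⁻¹_{θ̂}.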
### 4.3. Constructing Canonical Coordinates (Algorithm 1)

In order to construct an ET layer, we derive a canonical coordinate system for the target transformation. Canonical coordinate systems exist for all one-parameter Lie groups (Segman et al., 1992, Theorem 1). For Lie groups with more than one parameter, canonical coordinates exist for Abelian groups of dimension k ≤ d: that is, groups whose transformations are commutative. Here, we summarize the procedure described in Segman et al. (1992).

For clarity of exposition, we will focus on Lie groups representing transformations on R² with one parameter θ ∈ R. This corresponds to the practically useful case of one-parameter deformations of 2D images. In this setting, condition (1) reduces to: ρ(T_θ x) = ρ(x) + θ e₁.

Taking the derivative with respect to θ, we can see that it suffices for ρ to satisfy the following first-order PDEs:

v₁(x) ∂ρ₁(x)/∂x₁ + v₂(x) ∂ρ₁(x)/∂x₂ = 1,    (3)
v₁(x) ∂ρ₂(x)/∂x₁ + v₂(x) ∂ρ₂(x)/∂x₂ = 0,    (4)

where v_i(x) := (∂(T_θ x)_i / ∂θ)|_{θ=0}. We can solve these first-order PDEs using the method of characteristics (e.g., Strauss, 2007). Observe that the homogeneous equation (4) admits an infinite set of solutions ρ₂; each solution is a different coordinate function that is invariant to the transformation T_θ. Thus, there exists a degree of freedom in choosing invariant coordinate functions; due to the finite resolution of images in practice, we recommend choosing coordinates that minimally distort the input image to mitigate the introduction of resampling artifacts.

Algorithm 1: Constructing a canonical coordinate system
Input: Transformation group {T_θ}
Output: Canonical coordinates ρ(x)
  v_i(x) ← (∂(T_θ x)_i / ∂θ)|_{θ=0}, for i = 1, 2
  D_x ← v₁(x) ∂/∂x₁ + v₂(x) ∂/∂x₂
  ρ₁(x) ← a solution of D_x ρ₁(x) = 1
  ρ₂(x) ← a solution of D_x ρ₂(x) = 0
  Return ρ(x) = (ρ₁(x), ρ₂(x))

Example 3 (Hyperbolic Rotation). As a concrete example, we will derive a set of canonical coordinates for hyperbolic rotation, T_θ x = (x₁ e^θ, x₂ e^{-θ}). This is a squeeze distortion that dilates an image along one axis and compresses it along the other. We obtain the following PDEs:

(x₁ ∂/∂x₁ - x₂ ∂/∂x₂) ρ₁(x) = 1,
(x₁ ∂/∂x₁ - x₂ ∂/∂x₂) ρ₂(x) = 0.

In the first quadrant, the solution to the inhomogeneous equation is ρ₁(x) = log √(x₁/x₂) + c₁, where c₁ is an arbitrary constant, and the solution to the homogeneous equation is ρ₂(x) = h(x₁x₂), where h is an arbitrary differentiable function in one variable (the choice h(z) = z is known as the hyperbolic coordinate system). These coordinates can be defined analogously for the remaining quadrants to yield a representation of the entire image plane, excluding the lines x₁ = 0 and x₂ = 0.
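As a sanity check on Algorithm 1 (again our own sketch, not from the paper), one can verify symbolically that the coordinates derived in Example 3 satisfy D_x ρ₁ = 1 and D_x ρ₂ = 0 for hyperbolic rotation:

```python
# Sketch: symbolic check of Algorithm 1's conditions for Example 3 using SymPy.
import sympy as sp

x1, x2, theta = sp.symbols('x1 x2 theta', positive=True)
Tx = (x1 * sp.exp(theta), x2 * sp.exp(-theta))                 # hyperbolic rotation
v = [sp.diff(Tx[i], theta).subs(theta, 0) for i in range(2)]   # v(x) = (x1, -x2)

rho1 = sp.log(sp.sqrt(x1 / x2))   # solution of the inhomogeneous equation
rho2 = x1 * x2                    # solution of the homogeneous equation (h(z) = z)

D = lambda f: v[0] * sp.diff(f, x1) + v[1] * sp.diff(f, x2)    # the operator D_x
print(sp.simplify(D(rho1)), sp.simplify(D(rho2)))              # prints: 1 0
```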
### 4.4. Compositions of Transformations

A single transformation group with one parameter is typically insufficient to capture the full range of variation in object pose in natural images. For example, an important transformation group in practice is the 8-parameter projective linear group PGL(3, R) that represents perspective transformations in 3D space.

In the special case of two-parameter Abelian Lie groups, we can construct canonical coordinates that yield self-consistency simultaneously for both parameters (Segman et al., 1992, Theorem 1). For example, log-polar coordinates are canonical for both rotation and dilation. However, for transformations on R^d, a canonical coordinate system can only satisfy condition (1) for up to d parameters. Thus, a single canonical coordinate system is insufficient for higher-dimensional transformation groups on R² such as PGL(3, R).

Stacked ETs. Since we cannot always achieve simultaneous self-consistency with respect to all the parameters of the transformation group, we instead adopt the heuristic approach of using a sequence of ET layers, each of which implements self-consistency with respect to a subgroup of the full transformation group. Intuitively, each ET layer aims to remove the effect of its corresponding subgroup. Specifically, let T_θ be a k-parameter transformation that admits a decomposition into one-parameter transformations: T_θ = T^{(1)}_{θ₁} ∘ T^{(2)}_{θ₂} ∘ ⋯ ∘ T^{(k)}_{θ_k}, where each θ_i ∈ R. For example, in the case of PGL(3, R), we can decompose an arbitrary transformation into a composition of one-parameter translation, dilation, rotation, shear, and perspective transformations. We then apply a sequence of ET layers in the reverse order of the transformations:

L(φ) = L_{G^{(k)}, ρ^{(k)}} ∘ L_{G^{(k-1)}, ρ^{(k-1)}} ∘ ⋯ ∘ L_{G^{(1)}, ρ^{(1)}}(φ),

where ρ^{(i)} are canonical coordinates for each one-parameter subgroup G^{(i)}. While we can no longer guarantee self-consistency for a composition of ET layers, we show empirically (§5) that this stacking heuristic works well in practice for transformation groups in several parameters.

### 4.5. Implementation

Here we highlight particularly salient details of our implementation of ETs. Our PyTorch implementation is available at github.com/stanford-futuredata/equivariant-transformers.

Change of Coordinates. We implement coordinate transformations by resampling the input image over a rectangular grid in the new coordinate system. This grid consists of rows and columns that are equally spaced in the intervals [u₁^min, u₁^max] and [u₂^min, u₂^max], where the limits of these intervals are chosen to achieve good coverage of the input image. These points u in the canonical coordinate system define a set of sampling points ρ⁻¹(u) in Cartesian coordinates. We use bilinear interpolation for points that do not coincide with pixel locations in the original image, as is typical with ST layers (Jaderberg et al., 2015).

Avoiding Resampling. When using multiple ET layers, iterated resampling of the input image will degrade image quality and amplify the effect of interpolation artifacts. In our implementation, we circumvent this issue by resampling the image lazily. More specifically, let φ^{(i)} denote the image obtained after i transformations, where φ^{(0)} is the original input image. At each iteration i, we represent φ^{(i)} implicitly using the sampling grid G_i := T^{(1)}_{θ̂₁} ∘ ⋯ ∘ T^{(i)}_{θ̂_i} G₀, where G₀ represents the Cartesian grid over the original input. We materialize φ^{(i)} (under the appropriate canonical coordinates) in order to predict θ̂_{i+1}. By appending the next predicted transformation T^{(i+1)}_{θ̂_{i+1}} to the transformation stack, we thus obtain the subsequent sampling grid, G_{i+1}.

## 5. Experiments

We evaluate ETs on two image classification datasets: an MNIST variant where the digits are distorted under random projective transformations (§5.1), and the real-world Street View House Numbers (SVHN) dataset (§5.2). Using projectively-transformed MNIST data, we evaluate the performance of ETs relative to STs in a setting where images are deformed by a known transformation group in several parameters. The SVHN task evaluates the utility of ET layers when used in combination with modern CNN architectures in a realistic image classification task. In both cases, we validate the sample efficiency benefits conferred by ETs relative to standard STs and baseline CNN architectures.³

³ In the Appendix, we report additional experimental results on robustness to transformations not seen at training time.

### 5.1. Projective MNIST

We introduce the Projective MNIST dataset, a variant of the MNIST dataset where the digits are distorted using randomly sampled projective transformations: namely rotation, shear, x- and y-dilation, and x- and y-perspective transformations (i.e., 6 pose parameters in total). The Projective MNIST training set contains 10,000 base images sampled without replacement from the MNIST training set. Each image is resized to 64 × 64 and transformed using an independently-sampled set of pose parameters.

Figure 4. Projective MNIST. Examples of transformed digits from each class (first row: 0-4, second row: 5-9). Each base MNIST image is transformed using a transformation sampled from a 6-parameter group (i.e., PGL(3, R) without translation).
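For illustration, a Projective-MNIST-style distortion can be obtained by composing one-parameter subgroups of PGL(3, R) into a single homography. The sketch below is a rough approximation only: the sampling ranges, the centering convention, and the use of OpenCV's `warpPerspective` are our assumptions, and the paper's exact sampling procedure is described in its Appendix.

```python
# Sketch: sample a 6-parameter projective distortion (no translation) and warp an image.
import numpy as np
import cv2  # used only for the final warp

def sample_homography(rng):
    """Compose rotation, x-shear, x-/y-dilation, and x-/y-perspective (6 parameters)."""
    r  = rng.uniform(-np.pi, np.pi)     # rotation angle
    sh = rng.uniform(-0.3, 0.3)         # x-shear
    dx = rng.uniform(-0.3, 0.3)         # log x-dilation
    dy = rng.uniform(-0.3, 0.3)         # log y-dilation
    px = rng.uniform(-0.3, 0.3)         # x-perspective
    py = rng.uniform(-0.3, 0.3)         # y-perspective
    R = np.array([[np.cos(r), -np.sin(r), 0], [np.sin(r), np.cos(r), 0], [0, 0, 1]])
    S = np.array([[1, sh, 0], [0, 1, 0], [0, 0, 1]])
    D = np.diag([np.exp(dx), np.exp(dy), 1.0])
    P = np.array([[1, 0, 0], [0, 1, 0], [px, py, 1]])
    return P @ D @ S @ R

rng = np.random.default_rng(0)
digit = np.zeros((64, 64), dtype=np.float32)   # stand-in for a resized MNIST digit
H = sample_homography(rng)
# Apply H in coordinates normalized to [-1, 1] so the distortion acts about the image center.
N = np.array([[2 / 63, 0, -1], [0, 2 / 63, -1], [0, 0, 1]])
warped = cv2.warpPerspective(digit, np.linalg.inv(N) @ H @ N, (64, 64))
```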
We also generated three larger versions of the dataset for the purpose of controlled evaluation of the effect of (idealized) data augmentation: these additional datasets respectively contain 2, 4, and 8 copies of the base MNIST images, each transformed under different sets of parameters. Unlike other MNIST variants such as Rotated MNIST (Larochelle et al., 2007), MNIST-RTS (Jaderberg et al., 2015), and SIM2MNIST (Esteves et al., 2018), our Projective MNIST dataset incorporates higher-dimensional combinations of transformations, including projective transformations not considered in prior work (e.g., perspective transforms). We provide further details on the construction of the dataset in the Appendix.

Network Architectures. We used a CNN architecture based on the Z2CNN from Cohen & Welling (2016), with 7 layers of 3 × 3 convolutions with 32 channels, batch normalization after convolutional layers, and dropout after the 3rd and 6th layers. In addition to this baseline Cartesian CNN, we also evaluated a more rotation- and dilation-robust network where the inputs are first transformed to log-polar coordinates (Henriques & Vedaldi, 2017; Esteves et al., 2018). We introduce a sequence of transformer layers before the log-polar coordinate transformation to handle the remaining geometric transformations applied to the input. For both the baseline STs and ETs, we apply a sequence of transformer layers, with each layer predicting a single pose parameter. The pose predictor networks in both cases are 3-layer CNNs with 32 channels in each layer. We selected the transformation order, dropout rate, and learning rate schedule based on validation accuracy (see the Appendix for details).

Table 1. Classification error rates on Projective MNIST (§5.1). All methods use the same CNN architecture for classification and differ in the transformations applied to the input images. We train on up to 8 sampled transformations for each base MNIST image; numeric columns give the number of sampled transformations per base image. LP: log-polar coordinates; shx: x-shear; hr: hyperbolic rotation; px: x-perspective; py: y-perspective.

| Method | Transformations | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Cartesian | - | 11.91 | 9.67 | 7.64 | 6.93 |
| Log-polar | - | 6.55 | 5.05 | 4.48 | 3.83 |
| ST-LP | shx | 5.77 | 4.27 | 3.97 | 3.47 |
| ST-LP | shx, hr | 4.92 | 3.87 | 3.22 | 3.03 |
| ST-LP* | shx, hr, px, py | - | - | - | - |
| ET-LP | shx | 5.48 | 4.67 | 3.63 | 3.21 |
| ET-LP | shx, hr | 4.18 | 3.17 | 2.96 | 2.62 |
| ET-LP | shx, hr, px, py | 3.76 | 3.11 | 2.80 | 2.60 |

*We omit this configuration due to training instability.

Classification Accuracy (Table 1). We find that the ET layers consistently improve on test error rate over both the log-polar and ST baselines. By accounting for additional transformations, the ET improves on the error rate of the baseline log-polar CNN by 2.79%, a relative improvement of 43%, when trained on a single pose per prototype. Note that we omit the ST baseline with the full transformation sequence due to training instability, despite more extensive hyperparameter tuning than the ET. We find that all methods improve from augmentation with additional poses, with the ET retaining its advantage but at a reduced margin.

Figure 5. Sensitivity to initial learning rate. For each learning rate setting, we plot the minimum, mean, and maximum validation error rates over 10 runs for networks trained with ETs and STs. The predicted transformations are x-shear and hyperbolic rotation. We find that ETs are significantly more robust than STs to the learning rate hyperparameter.
Hyperparameter Sensitivity (Figure 5). We compared the sensitivity of ET and ST networks to the initial learning rate by comparing validation error when training with learning rate values ranging from 1 × 10⁻⁴ to 4 × 10⁻³. For each setting, we trained 10 networks with independent random initializations on Projective MNIST with 10,000 examples, computing the validation error after each epoch and recording the minimum observed error in each run. We find that STs were significantly more sensitive to learning rate than ETs, with far higher variance in error rate between runs. This suggests that the self-consistency constraint imposed on ETs helps improve the training-time stability of networks augmented with transformer layers.

### 5.2. Street View House Numbers (SVHN)

The goal of the single-digit classification task of the SVHN dataset (Netzer et al., 2011) is to classify the digit in the center of 32 × 32 RGB images of house numbers. SVHN is well-suited to evaluating the effect of transformer layers since there is a natural range of geometric variation in the data due to differences in camera position; unlike Projective MNIST, we do not artificially apply further transformations to the data. The training set consists of 73,257 examples; we use a randomly-chosen subset of 5,000 examples for validation and use the remaining 68,257 examples for training. In order to evaluate the data efficiency of each method, we also trained models using smaller subsets of 10,000 and 20,000 examples. The dataset also includes 531,131 additional images that can be used as extra training data; we thus additionally evaluate our methods on the concatenation of this set and the training set.

Network Architectures. We use 10-, 18-, and 34-layer ResNet architectures (He et al., 2016) as baseline networks. Each transformer layer uses a 3-layer CNN with 32 channels per layer for pose prediction. We applied x- and y-translation, rotation, and x- and y-scaling to the input images: these were selected from among the subgroups of the projective group using the validation set.

Table 2. Classification error rates on SVHN (§5.2). For both STs and ETs, we used the following transformations: x- and y-translation, rotation, and x-scaling. Error rates are each averaged over 3 runs; columns give the number of training examples. ETs achieve the largest accuracy gains relative to STs and the baseline CNNs in the limited data regime.

| Network | Transformer | 10k | 20k | 68k | 600k |
|---|---|---|---|---|---|
| ResNet-10 | None | 9.83 | 7.90 | 5.35 | 2.96 |
| ResNet-10 | Spatial | 9.80 | 7.66 | 4.96 | 2.92 |
| ResNet-10 | Equivariant | 8.24 | 6.71 | 4.84 | 2.70 |
| ResNet-18 | None | 9.23 | 7.31 | 4.81 | 2.76 |
| ResNet-18 | Spatial | 9.10 | 7.17 | 4.51 | 2.70 |
| ResNet-18 | Equivariant | 7.81 | 6.37 | 4.50 | 2.57 |
| ResNet-34 | None | 8.73 | 7.05 | 4.67 | 2.53 |
| ResNet-34 | Spatial | 8.60 | 6.91 | 4.37 | 2.66 |
| ResNet-34 | Equivariant | 7.72 | 5.98 | 4.23 | 2.47 |

Results (Table 2). We find that ETs improve on the error rate achieved by both STs and the baseline ResNets, with the largest gains seen in the limited data regime: with 10,000 examples, ETs improve on the error rates of the baseline CNNs and ST-augmented CNNs by 0.9-1.6%, or a relative improvement of 10-16%. We see smaller gains when more training data is available: the relative improvement between ETs and the baseline CNNs is 11-13% with 20,000 examples, and 6.4-9.5% with 68,257 examples. When data is limited, we find that a simpler classifier where prior knowledge on geometric invariances has been encoded using ETs can outperform more complex classifiers that are not equipped with this additional structure.
In particular, when trained on 10,000 examples, a ResNet-10 classifier with ET layers achieves lower error than the baseline ResNet-34 classifier. The baseline ResNet-34 has over 5.3M parameters; in contrast, the ResNet-10 has 1.2M parameters, with the ET layers adding only 31k parameters in total. The ET-augmented ResNet-10 therefore achieves improved error rate with an architecture that incurs less memory and computational cost than a ResNet-34.

## 6. Discussion and Conclusion

Limitations of ETs. The self-consistency guarantee of ETs can fail due to boundary effects that occur when image content is cropped after a transformation. This issue can be mitigated by padding the input such that the transformed image does not fall "out of frame". Even without a strict self-consistency guarantee, we still observe gains when ET layers are used in practice (e.g., in our SVHN experiments).

Figure 6. Predicted transformations. On Projective MNIST (top panels: input, x-shear, hyp. rot., x-persp., y-persp.), ETs reverse the effect of distortions such as shear and perspective, despite being provided no direct supervision on pose parameters (the final images remain rotated and scaled since the classification CNN operates over their log-polar representation). On SVHN (bottom panels: input, translation, rot./scale, x-scale), the final x-scale transformation has a cropping effect that removes distractor digits.

As discussed in §4.4, the method of stacking ET layers is ultimately a heuristic approach as it does not guarantee self-consistency with respect to the full transformation group. Moreover, higher-dimensional groups require the use of long sequences of ET layers, resulting in high computational cost. In such cases, we could employ a hybrid approach where difficult subgroups are handled by ET layers, while the remaining degrees of freedom are handled by a standard ST layer. In general, enforcing equivariance guarantees for higher-dimensional transformation groups in a computationally scalable fashion remains an open problem.

In contrast to the use of prior knowledge on transformation invariances in this work, there is a separate line of research that concerns learning various classes of transformations from data (Hashimoto et al., 2017; Thomas et al., 2018). Extending ETs to these more flexible notions of invariance may prove to be an interesting direction for future work.

Conclusion. We proposed a neural network layer that builds in prior knowledge on the continuous transformation invariances of its input domain. By encapsulating equivariant functions within an image-to-image mapping, ETs expose a convenient interface for flexible composition of layers tailored to different transformation groups. Empirically, we demonstrated that ETs improve the sample efficiency of CNNs on image classification tasks with latent transformation parameters. Using libraries of ET layers, practitioners are able to quickly experiment with multiple combinations of transformations to realize gains in predictive accuracy, particularly in domains where labeled data is scarce.

## Acknowledgements

We thank Pratiksha Thaker, Kexin Rong, and our anonymous reviewers for their valuable feedback on earlier versions of this manuscript.
This research was supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, Intel, Microsoft, NEC, SAP, Teradata, and VMware), as well as Toyota Research Institute, Keysight Technologies, Northrop Grumman, Hitachi, NSF awards AF-1813049 and CCF-1704417, an ONR Young Investigator Award N00014-18-12295, and Department of Energy award DE-SC0019205.

## References

Amari, S. Feature Spaces which Admit and Detect Invariant Signal Transformations. In International Joint Conference on Pattern Recognition, 1978.

Amit, Y., Grenander, U., and Piccioni, M. Structural image restoration through deformable templates. Journal of the American Statistical Association, 1991.

Cohen, T. S. and Welling, M. Group Equivariant Convolutional Networks. In International Conference on Machine Learning, 2016.

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. In International Conference on Learning Representations, 2018.

De Castro, E. and Morandi, C. Registration of translated and rotated images using finite Fourier transforms. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 700-703, 1987.

Dieleman, S., De Fauw, J., and Kavukcuoglu, K. Exploiting cyclic symmetry in convolutional neural networks. In International Conference on Machine Learning, 2016.

Esteves, C., Allen-Blanchette, C., Zhou, X., and Daniilidis, K. Polar Transformer Networks. In International Conference on Learning Representations, 2018.

Freeman, W. T. and Adelson, E. H. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 891-906, 1991.

Gens, R. and Domingos, P. M. Deep Symmetry Networks. In Advances in Neural Information Processing Systems, 2014.

Hashimoto, T. B., Liang, P. S., and Duchi, J. C. Unsupervised transformation learning via convex relaxations. In Advances in Neural Information Processing Systems, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.

Henriques, J. F. and Vedaldi, A. Warped Convolutions: Efficient Invariance to Spatial Transformations. In International Conference on Machine Learning, 2017.

Jacobsen, J.-H., De Brabandere, B., and Smeulders, A. W. Dynamic Steerable Blocks in Deep Residual Networks. arXiv preprint arXiv:1706.00598, 2017.

Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K. Spatial Transformer Networks. In Advances in Neural Information Processing Systems, 2015.

Laptev, D., Savinov, N., Buhmann, J. M., and Pollefeys, M. TI-POOLING: Transformation-invariant pooling for feature learning in convolutional neural networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. In International Conference on Machine Learning, 2007.

Lin, C.-H. and Lucey, S. Inverse Compositional Spatial Transformer Networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.

Lucas, B. D. and Kanade, T. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, 1981.

Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. Rotation Equivariant Vector Field Networks. In International Conference on Computer Vision, 2017.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Rawlinson, D., Ahmed, A., and Kowadlo, G. Sparse unsupervised capsules generalize better. arXiv preprint arXiv:1804.06094, 2018.

Reddi, S. J., Kale, S., and Kumar, S. On the Convergence of Adam and Beyond. In International Conference on Learning Representations, 2018.

Reddy, B. S. and Chatterji, B. N. An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing, 5(8):1266-1271, 1996.

Rubinstein, J., Segman, J., and Zeevi, Y. Recognition of distorted patterns by invariance kernels. Pattern Recognition, 24(10):959-967, 1991.

Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, 2017.

Segman, J., Rubinstein, J., and Zeevi, Y. Y. The canonical coordinates method for pattern deformation: Theoretical and computational considerations. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1171-1183, 1992.

Shu, Z., Sahasrabudhe, M., Alp Guler, R., Samaras, D., Paragios, N., and Kokkinos, I. Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance. In European Conference on Computer Vision (ECCV), 2018.

Strauss, W. A. Partial Differential Equations: An Introduction. Wiley, 2007.

Teo, P. C. Theory and Applications of Steerable Functions. PhD thesis, Stanford University, 1998.

Thomas, A., Gu, A., Dao, T., Rudra, A., and Ré, C. Learning compressed transforms with low displacement rank. In Advances in Neural Information Processing Systems, 2018.

Weiler, M., Hamprecht, F. A., and Storath, M. Learning Steerable Filters for Rotation Equivariant CNNs. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.

Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Harmonic Networks: Deep Translation and Rotation Equivariance. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.

Yuille, A. L. Deformable templates for face recognition. Journal of Cognitive Neuroscience, 1991.

Zhou, Y., Ye, Q., Qiu, Q., and Jiao, J. Oriented response networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.