# Rethinking Neural Operations for Diverse Tasks

Nicholas Roberts (University of Wisconsin-Madison, nick11roberts@cs.wisc.edu), Mikhail Khodak (Carnegie Mellon University, khodak@cmu.edu), Tri Dao (Stanford University, trid@stanford.edu), Liam Li (Hewlett Packard Enterprise, me@liamcli.com), Christopher Ré (Stanford University, chrismre@cs.stanford.edu), Ameet Talwalkar (Carnegie Mellon University & Hewlett Packard Enterprise, talwalkar@cmu.edu)

\* denotes equal contribution. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

An important goal of AutoML is to automate away the design of neural networks on new tasks in under-explored domains. Motivated by this goal, we study the problem of enabling users to discover the right neural operations given data from their specific domain. We introduce a search space of operations called XD-Operations that mimic the inductive bias of standard multi-channel convolutions while being much more expressive: we prove that it includes many named operations across multiple application areas. Starting with any standard backbone such as ResNet, we show how to transform it into a search space over XD-operations and how to traverse the space using a simple weight-sharing scheme. On a diverse set of tasks, including solving PDEs, distance prediction for protein folding, and music modeling, our approach consistently yields models with lower error than baseline networks and often even lower error than expert-designed domain-specific approaches.

1 Introduction

Automated machine learning (AutoML) and neural architecture search (NAS) are often motivated by a vision of democratizing ML by reducing the need for expert design on a variety of tasks. While NAS has grown rapidly with developments such as weight-sharing [36] and NAS-benches [47, 49], most efforts focus on search spaces that glue together established primitives for well-studied tasks like vision and text [32, 26, 45, 25] or on issues such as latency [8, 13]. In this work, we revisit the broader vision of NAS and propose to move towards much more general search spaces while still exploiting successful network topologies. To do so we focus on expanding the set of operations, which is usually fairly small; for example, that of the well-studied DARTS space has eight elements: a few types of convolution and pooling layers [32]. The baseline approach for expanding this set, adding operations one-by-one, scales poorly and will not result in new operations when faced with new types of data.

Our core contribution is a re-imagining of NAS operation spaces that drastically expands this set in a principled fashion to include both standard operations as well as a wide range of new ones. To do so we exploit the fact that most standard operations used in modern NAS return linear transforms diagonalized by the discrete Fourier transform (DFT). Replacing the DFT matrices in the diagonal decomposition by a more expressive family of efficient linear transforms known as Kaleidoscope or
K-matrices [10] yields the set of Expressive Diagonalization (XD) operations, which comprise a large search space containing various types of grid-based convolutions and pooling, permutations, transposed convolutions, certain kinds of graph convolutions, the Fourier Neural Operator (FNO) [30], and infinitely many more. This broad expressivity reflects the key insight of our work: that many of the most important neural operations in ML consist of multiple channels that apply weights $w$ to inputs $x$ by computing

$$K \operatorname{diag}(Lw) M x \qquad (1)$$

where the matrices $K$, $L$, and $M$ are efficient (to represent and apply) and shared across channels.

Figure 1: Diagram of our search space depicting a NAS method picking an operation for an edge in a backbone network (left). Instead of choosing from a discrete search space, we use a relaxation based on the convolution's diagonalization by the discrete Fourier transform, in which the DFTs are replaced by K-matrices [10] $K$, $L$, and $M$ (middle); these are the main architecture parameters of our new search space over Expressive Diagonalization (XD) operations. This space contains most operations considered in standard NAS and many other important operations in a variety of domains (right).

We leverage XD-operations to take critical steps towards a broader NAS that enables the discovery of good design patterns with limited human specification from data in under-explored domains. To do so we develop a simple procedure which transforms any backbone convolutional neural network (CNN) into an architecture search space by replacing its operations with XD-operations. This space is then searched using a simple weight-sharing algorithm that needs only a small amount of tuning to find effective operations. As a simple first demonstration, we show that XD-operations yield models that are 15% more accurate than standard discrete search spaces on permuted CIFAR-10, highlighting the fragility of standard NAS operation spaces on new datasets, and thus the need for XD-operations.

As our main evaluation, we demonstrate the effectiveness of XD-operations in a series of applications showing that, starting from vanilla CNNs, they consistently outperform custom-designed operations.

- Learning to solve partial differential equations (PDEs): when substituted into a simple CNN backbone, XD-operations outperform convolutions and the dense-prediction NAS method Auto-DeepLab [31], and even achieve lower error than custom-designed, state-of-the-art operations (FNOs [30]) across three problems with different dimensionalities (Burgers' equation, Darcy Flow, and Navier-Stokes). Our method also maintains consistent performance across different resolutions, a major stated advantage of FNOs over previous methods.
- Protein folding: on the task of predicting residue distances in a polypeptide chain, a key component of the protein folding problem, we substitute XD-operations into vanilla ResNets and achieve lower error than cyclically-dilated ResNets adapted specifically for this setting [1]. Furthermore, our ResNet-34 XD outperforms the reported error of the much deeper Dilated ResNet-258.
- Music modeling: on two next-note prediction tasks, we show that substituting XD-operations into an undilated CNN outperforms temporal convolutional networks (TCNs), exponentially-dilated 1d CNNs that themselves outperform standard convolutional and recurrent networks [5].

Code to reproduce these results is available here: https://github.com/nick11roberts/XD. Software to apply XD-operations can be found here: https://github.com/mkhodak/relax.
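As a concrete illustration of the template in Equation (1), the following minimal NumPy sketch applies a single-channel XD-operation with dense stand-ins for the K-matrices; choosing $K = F^{-1}$ and $L = M = F$ recovers a (circular) 1d convolution, which is how search is warm-started. The dense matrices here are for clarity only; the actual implementation uses efficient K-matrix factorizations.

```python
import numpy as np

def xd_single_channel(K, L, M, w, x):
    """Single-channel XD-operation of Eq. (1): Real(K @ diag(L @ w) @ M @ x)."""
    return np.real(K @ ((L @ w) * (M @ x)))

n = 8
F = np.fft.fft(np.eye(n))      # dense 1d DFT matrix (a depth-1 K-matrix)
F_inv = np.conj(F) / n         # inverse DFT (F is symmetric)

w = np.zeros(n)
w[:3] = np.random.randn(3)     # size-3 filter, zero-padded to length n
x = np.random.randn(n)

# Setting K = F^{-1}, L = M = F recovers the circular convolution of w and x.
y_xd = xd_single_channel(F_inv, F, F, w, x)
y_conv = np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)))
assert np.allclose(y_xd, y_conv)
```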
Related Work

AutoML is a well-studied area, with most work focusing on fairly small hyperparameter spaces [7, 24] or on NAS [12]. Most NAS operation spaces only contain a few operations such as convolutions [32, 33, 49, 11], which may not be useful for domains where CNNs are ineffective. Applications of NAS outside vision largely follow the same pattern of combining human-designed operations [35, 43]. On the other extreme, AutoML-Zero [37] demonstrates the possibility of evolving all aspects of ML from scratch. We seek to establish a middle ground with large and domain-agnostic search spaces that still allow the use of well-tested methods, e.g. stochastic gradient descent (SGD). Several papers have generalized the DFT to replace layers in deep nets [9, 3, 2, 10] in order to speed up or add structure to models while reducing expressivity. In contrast, we can replace convolutions and other layers while increasing expressivity by extending their diagonalization via K-matrices. As discussed in Section 2, using K-matrices for this directly is inefficient for input dimension > 1.

2 The Expressive Diagonalization Relaxation

In this section we overview our main contribution: a large, general search space of neural operations. Formally, we view an architecture as a parameterizable object, a mapping from model weights to functions, described by a labeled directed acyclic graph (DAG) $G(V, E)$. Each edge in $E$ has the form $(u, v, \mathrm{Op})$, where $u, v \in V$ are nodes and $\mathrm{Op}$ is an operation that can be parameterized to define some transformation of the representation at node $u$; node $v$ aggregates the outputs of its incoming edges into a new representation. For example, the popular ResNet architecture [15] has many nodes with two incoming edges, one labeled by the convolution operation Conv and one by the identity (skip-connect) Id, whose outputs it sums and passes to outgoing edges with the same labels. Each architecture has a source node taking in input data and an output node returning a prediction.

Neural architecture search is the problem of automatically selecting an operation for each edge of $G$ to optimize an objective.[^1] For each edge $e \in E$ a NAS algorithm must pick one element of a search space $S = \{\mathrm{Op}_a \mid a \in A\}$ of operations specified by architecture parameters $a \in A$ to assign to $e$; in past work, $A$ usually indexes a small set of operations. As an example, we will refer to a variant[^2] $S_{\mathrm{discrete}}$ of the DARTS search space with parameters $A_{\mathrm{discrete}} = \{1, \dots, 8\}$, where each operation is one of Zero, Id, MaxPool 3x3, AvgPool 3x3, Conv 3x3 or 5x5, or DilatedConv 3x3 or 5x5 with dilation 2 [32].

Our main contribution is a novel family of operations that comprise a search space containing almost all of these operations, in addition to many others that have been found useful on different types of data. The starting point of our construction of these XD-operations is the simple observation that all the operations $\mathrm{Op} \in S_{\mathrm{discrete}}$ listed above except MaxPool 3x3 are linear, i.e. for any model weights $w$ there exists a matrix $A_w$ such that for all inputs $x$ we have $\mathrm{Op}(w)(x) = A_w x$. More specifically, all seven of them return convolutions: to see this, note that Zero, Id, and AvgPool 3x3 each apply a convolution with filter $0_{1\times 1}$, $1_{1\times 1}$, and $1_{3\times 3}/9$, respectively. This means that most of the operations in the DARTS search space, which is representative of NAS operation spaces in computer vision, share the convolution's diagonalization by the discrete Fourier transform (DFT).
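As a quick sanity check of the claim that average pooling is just a convolution with a constant filter, the following PyTorch snippet compares stride-1 average pooling against a convolution with the $1_{3\times 3}/9$ kernel (a minimal sketch, not taken from the released code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)
# Stride-1 average pooling with zero padding (padded zeros count towards the mean).
pooled = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
# The same map expressed as a convolution with the constant 1/9 filter.
convolved = F.conv2d(x, torch.full((1, 1, 3, 3), 1.0 / 9), padding=1)
assert torch.allclose(pooled, convolved, atol=1e-6)
```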
Formally, if $A_w \in \mathbb{R}^{n^2 \times n^2}$ is the matrix representing a 2d convolution with filter $w \in \mathbb{R}^{k}$ of kernel size $k \in [n]^2$, then for any 2d input $x \in \mathbb{R}^{n^2}$ we have

$$\mathrm{Conv}(w)(x) = A_w x = F^{-1} \operatorname{diag}(Fw) F x \qquad (2)$$

Here $[n] = \{1, \dots, n\}$, $\operatorname{diag}(z)$ denotes the diagonal matrix with entries $z$, $w \in \mathbb{R}^{n^2}$ is (by slight abuse of notation) an appropriate zero-padding of the filter $w \in \mathbb{R}^k$, and $F \in \mathbb{C}^{n^2 \times n^2}$ is the 2d DFT (a Kronecker product of two 1d DFTs). This diagonalization explicates both the computational and representational efficiency of the DARTS operations, as the DFT and its inverse can be applied in time $O(n \log n)$ and stored with $O(n \log n)$ bits. It also suggests a natural way to dramatically expand the operation space while preserving these efficiencies: just replace the matrices $F$ and $F^{-1}$ in (2) by any members of a general family of efficient matrices. Doing so yields the single-channel version of our expressive diagonalization (XD) operations:

$$\mathrm{XD}^1_{\alpha}(w)(x) = \mathrm{Real}\left(K \operatorname{diag}(Lw) M x\right) \qquad (3)$$

Here the architecture parameter $\alpha = (K, L, M)$ sets the matrices replacing $F$ and $F^{-1}$ in Equation 2.

[^1]: It is often defined as selecting both operations and a graph topology [50], but if the set of operations contains the zero-operation Zero then the former subsumes the latter.
[^2]: For memory-efficiency, all convolutions in the original DARTS search space are separable [32].

The main remaining question is which family of efficient matrices to use, i.e. the domain of the architecture parameters $K$, $L$, and $M$. For this we turn to the Kaleidoscope matrices, or K-matrices [10], which generalize $F$ and $F^{-1}$ to include all computationally efficient linear transforms with short description length, including important examples such as sparse matrices and permutations. To obtain this general family, K-matrices allow the DFT's butterfly factors (matrices whose products yield its efficient implementation) to take on different values. While a detailed construction of K-matrices can be found in the original paper, we need only the following useful properties: they are as (asymptotically) efficient to apply as DFTs, they are differentiable and can thus be updated using gradient-based methods, and they can be composed (made "deeper") to make more expressive K-matrices.

Specifying that $K$, $L$, and $M$ in Equation 3 are K-matrices largely completes our core contribution: a new search space $S_{\mathrm{XD}}$ of XD-operations with K-matrix architecture parameters. We give a full multi-channel formalization in $N$ dimensions, as well as an overview of its expressivity, in Section 3. First, we note some key aspects of this new search space:

- Complexity: $\mathrm{XD}^1_\alpha(w)$ requires three K-matrices and $O(1)$ filter weights to represent, i.e. description length $O(n \log n)$; this is larger than a regular convolution (which has no architecture parameters) but is not quadratic in the input size like a linear layer. Applying $\mathrm{XD}^1_\alpha$ requires multiplication by three K-matrices, yielding a theoretical per-channel time complexity of $O(n \log n)$, matching the efficiency of convolutions. However, as XD-operations strictly generalize convolutions, they are more expensive to apply in practice; we detail these costs both in the application sections and in an appendix table, and we view improving upon them as an important future direction.
- Initialization: a crucial advantage of XD-operations is that we can initialize, or "warm-start", search using operations with known constructions. In particular, since we can recover convolutions (2) by setting architecture parameters $K = F^{-1}$, $L = F$, and $M = F$ in Equation 3, we can always start search with any CNN backbone. We use this extensively in experiments.
- K-matrices: as they contain all efficient linear transforms, K-matrices can represent all functions returned by XD-operations, including convolutions. However, for input dimension and filter size > 1 the only known way is to apply K-matrices directly to flattened inputs $x \in \mathbb{R}^{n^N}$, yielding much worse description length $O(n^N \log n)$. In contrast, as detailed in Section 3, our diagonalization approach uses Kronecker products to apply DFTs to each dimension separately, yielding description length $O(n \log n)$. It is thus the first (and in some sense, the "right") method to use such matrices to replace convolutions. Furthermore, diagonalization allows us to separate model weights $w$ from architecture parameters $\alpha$, letting the former vary across channels while fixing the latter (see the sketch after this list).
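To illustrate the description-length point in the last bullet, the following sketch checks that the 2d DFT acting on a flattened input is exactly the Kronecker product of two 1d DFTs, so transforms can be applied dimension-by-dimension rather than as one large matrix over the flattened input (dense matrices are used here only for verification):

```python
import numpy as np

n = 4
F1 = np.fft.fft(np.eye(n))                 # dense 1d DFT matrix
x = np.random.randn(n, n)                  # 2d input

# Kronecker product of two 1d DFTs applied to the flattened (row-major) input...
lhs = np.kron(F1, F1) @ x.flatten()
# ...equals the 2d DFT applied separately along each dimension.
rhs = np.fft.fft2(x).flatten()
assert np.allclose(lhs, rhs)
```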
Finally, we address the fact that the architecture parameters of $S_{\mathrm{XD}}$ are continuous, not discrete, contrasting with much of the NAS literature. This can be viewed as a natural extension of the weight-sharing paradigm [36], in which continuous relaxation enables updating architecture parameters with gradient methods. For example, many algorithms traverse a relaxed version of the DARTS search space,

$$\left\{ \textstyle\sum_{i=1}^{8} \lambda_i \, \mathrm{Op}_i \;\middle|\; \lambda_i \ge 0, \; \sum_{i=1}^{8} \lambda_i = 1 \right\},$$

defined via DARTS operations $\mathrm{Op}_i \in S_{\mathrm{discrete}}$ and architecture parameters $\lambda_i$ in the 8-simplex; most search spaces then require discretizing after search via a rounding procedure that maps from the simplex back to $A_{\mathrm{discrete}}$. Note that the fully continuous nature of XD-operations means that we will only evaluate the final network returned by search. In particular, while some weight-sharing papers also report the correlation between true architecture performance and that indicated by the shared weights [46], there is no obvious way to define a ranking or sampling distribution over XD-operations in order to do so. This also means that our final architecture will not be more efficient than the supernet, unlike other weight-sharing methods that do discretize.

3 XD-Operations and Their Expressivity

Here we formalize XD-operations and show which operations they include. We first define operations:

Definition 3.1. A parameterizable operation is a mapping $\mathrm{Op} : \mathcal{W} \to \mathcal{F}$ from parameter space $\mathcal{W}$ to a space $\mathcal{F} = \{\mathrm{Op}(w) : \mathcal{X} \to \mathcal{Y} \mid w \in \mathcal{W}\}$ of parameterized functions from input space $\mathcal{X}$ to output space $\mathcal{Y}$. A search space is a set of operations with the same $\mathcal{W}$, $\mathcal{X}$, and $\mathcal{Y}$.

For example, if $\mathcal{X} = \mathcal{Y} = \mathbb{R}^n$ and $\mathcal{W} = \mathbb{R}^{n \times n}$, then each $W \in \mathcal{W}$ defines a parameterized linear layer that for each $x \in \mathcal{X}$ returns $\mathrm{Lin}(W)(x) = Wx$. Here Lin is the parameterizable operation and for each $W$ the linear map $\mathrm{Lin}(W)$ is the parameterized function. From Definition 3.1, we say a search space can express a specific operation if it contains it. Crucially, the ability of a parameterizable operation $\mathrm{Op}_1$ to express a parameterized function $\mathrm{Op}_2(w)$ output by another operation $\mathrm{Op}_2$, given the right set of weights $w$, does not imply that a search space containing $\mathrm{Op}_1$ can express $\mathrm{Op}_2$. For example, $\mathrm{Lin}(I_n) = \mathrm{Id}(W)$ for all $W \in \mathbb{R}^{n \times n}$, but $\mathrm{Lin}(W) \ne \mathrm{Id}(W)$ for all $W \ne I_n$, so a search space containing the linear operation Lin cannot express the skip-connection Id, despite the fact that Lin can be parameterized to compute the identity.
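This distinction can be made concrete in a couple of lines (a toy sketch): a parameterizable operation maps weights to functions, and Lin with $W = I_n$ computes the same function as Id, yet Lin and Id remain different operations.

```python
import numpy as np

# Definition 3.1 as code: an operation maps weights to a parameterized function.
Lin = lambda W: (lambda x: W @ x)   # linear layer
Id = lambda W: (lambda x: x)        # skip-connection; ignores its weights

n = 4
x = np.random.randn(n)
W = np.random.randn(n, n)

assert np.allclose(Lin(np.eye(n))(x), Id(W)(x))   # Lin can compute the identity...
assert not np.allclose(Lin(W)(x), Id(W)(x))       # ...but Lin(W) != Id(W) in general
```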
Formalizing Multi-Channel XD-Operations

Recall the single-channel XD-operation $\mathrm{XD}^1_\alpha$ in Equation 3, specified by the three-matrix architecture parameter $\alpha = (K, L, M)$. For input dimension $N \ge 1$, every matrix $B \in \alpha$ is a Kronecker product of $N$ K-matrices of depth $d \in \mathbb{Z}_+^3$, i.e. $B = \bigotimes_{i=1}^{N} B_i$ for K-matrices $B_i \in \mathbb{C}^{n \times n}$ of depth $d_{[1]}$, $d_{[2]}$, or $d_{[3]}$ for $B = K$, $L$, or $M$, respectively.[^3] Roughly speaking, $\mathrm{XD}^1_\alpha$ can return any linear operation that is diagonalized by K-matrices and is thus efficient to compute and represent, e.g. any convolution (recall we recover the diagonalization of $\mathrm{Conv}(w)$ in Equation 2 by setting $K$, $L$, and $M$ appropriately in Equation 3). However, $\mathrm{XD}^1_\alpha$ cannot represent efficient parameter-free operations such as skip-connections and average-pooling, both common in NAS. In particular, the only way to always ignore the model weights $w$ is to set one of the K-matrices to zero, producing the zero-operation. We avoid this by adding a bias $b \in \mathbb{C}^{n^N}$ as an architecture parameter, yielding the biased single-channel XD-operation:[^4]

$$\mathrm{XD}^1_{\alpha, b}(w)(x) = \mathrm{Real}\left(K \operatorname{diag}(Lw + b) M x\right) \qquad (4)$$

This lets us define skip-connections (set $K = M = I_{n^N}$, $L = 0_{n^N \times n^N}$, and $b = 1_{n^N}$) and average-pooling (set $K = F^{-1}$, $L = 0_{n^N \times n^N}$, $M = F$, and $b$ to be $F$ multiplied by a pooling filter). Lastly, we use $\mathrm{XD}^1_{\alpha, b}$ to construct multi-channel layers that pass multiple input features through multiple channels and re-combine them as multiple output features. This follows the primary way of using convolutions in deep nets. The key insight here is that we share the same parameterizable operation (specified by $\alpha$ and $b$) across all channels, just as in convolutional layers.

Definition 3.2. Let $a = (\alpha, b, C)$ be an architecture parameter containing a triple $\alpha = (K, L, M)$ of Kronecker products of $N$ K-matrices with depths $d \in \mathbb{Z}_+^3$, a bias $b \in \mathbb{C}^{n^N}$, and channel gates $C \in \mathbb{C}^{c \times c}$.[^5] Using $\bigoplus$ to denote concatenation over output channels, the XD-operation $\mathrm{XD}_a$ of depth $d$ specified by $a$ is a parameterizable operation on parameter space $\mathcal{W} = \mathbb{R}^{c \times c \times k}$, consisting of $c^2$ filters of size $k \in [n]^N$, that outputs parameterized functions on $\mathcal{X} = \mathbb{R}^{c \times m^N}$ for $m \le n$ mapping every $x \in \mathcal{X}$ to

$$\mathrm{XD}_a(w)(x) = \bigoplus_{i=1}^{c} \sum_{j=1}^{c} C_{[i,j]} \, \mathrm{XD}^1_{\alpha, b}(w_{[i,j]})(x_{[j]}) \qquad (5)$$

The last architecture parameter $C$ allows interpolation between all-to-all layers ($C = 1_{c \times c}$), e.g. multi-channel convolutions, and layers where each channel is connected to one other channel ($C = I_c$), e.g. skip-connections and average-pooling. We use $S_{\mathrm{XD}}$ to denote the set of operations covered by Definition 3.2, and we conclude our construction by discussing two properties:

- Kernel size: the weight-space available to an XD-operation is $\mathbb{R}^{c \times c \times n^N}$; however, since we will initialize search with existing CNNs, we zero-pad to obtain the same weight-space $\mathbb{R}^{c \times c \times k^N}$ as the convolutions with filter size $k \le n$ that they replace. This preserves the weight count but also means that if the backbone has 3x3 filters, our search space will not contain 5x5 convolutions. Experimentally, we find that relaxing this constraint does not significantly affect results on image tasks, so we do not do so in subsequent applications, to avoid increasing the weight count.
- Depth: an XD-operation's depth is a triple describing the depths of its K-matrices $K$, $L$, and $M$. Increasing it trades off efficiency for expressivity; for example, in the next section we describe operations that we can show are contained in $S_{\mathrm{XD}}$ if $L$ or $M$ have depth > 1. By default we set the depth to be the minimum needed to initialize search with the backbone operation.

[^3]: A depth-$d$ K-matrix is a product of $d$ depth-1 K-matrices.
[^4]: Zero-padding $x$ as well lets the input be smaller than the output if needed, e.g. for transposed convolutions.
[^5]: For simplicity we formalize the case where all $N$ dimensions have the same input size and there is an identical number $c$ of input and output channels; both are straightforward to extend.
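Before turning to expressivity, the multi-channel construction in Definition 3.2 can be sketched as follows (NumPy, dense matrices, real-valued channel gates for simplicity); note how the architecture parameters $(K, L, M, b, C)$ are shared across channels while the weights $w_{[i,j]}$ vary:

```python
import numpy as np

def xd1(K, L, M, b, w, x):
    # Biased single-channel XD-operation, Eq. (4).
    return np.real(K @ ((L @ w + b) * (M @ x)))

def xd_multichannel(K, L, M, b, C, w, x):
    # Eq. (5): output channel i sums C[i, j] * XD1(w[i, j]) applied to input channel j.
    c, n = x.shape
    out = np.zeros((c, n))
    for i in range(c):
        for j in range(c):
            out[i] += C[i, j] * xd1(K, L, M, b, w[i, j], x[j])
    return out

c, n = 3, 8
w = np.random.randn(c, c, n)
x = np.random.randn(c, n)

# Skip-connection as in the text: K = M = I, L = 0, b = 1, C = I gives out = x.
I, Z, ones = np.eye(n), np.zeros((n, n)), np.ones(n)
assert np.allclose(xd_multichannel(I, Z, I, ones, np.eye(c), w, x), x)
```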
Expressivity of XD-Operations

For many papers that replace deep net layers with efficient linear transforms [34, 10], the question of expressivity comes down to the transform capacity. For example, layers with a K-matrix in every channel can represent a different transform in each, thus allowing the output to be any combination of efficient linear operations. Our case is less straightforward, since we care about expressivity of the search space, not of parameterized functions, and our approach is less expressive by design, as all channels share the K-matrices $K$, $L$, and $M$. The latter can be thought of as a useful inductive bias on NAS: the set of XD-operations is still much broader than the set of convolutions, but the way in which model weights are applied is the same across all channels. Expressivity results are a way to see whether this bias is useful or constraining. Here we summarize some important operations that are 1d XD-operations; proofs can be found in the appendix and are straightforward to extend to multi-dimensional inputs. Formally, there exists $d \in \mathbb{Z}_+^3$ such that the set of XD-operations of depth $d$ over weights $\mathcal{W} = \mathbb{R}^{c \times c \times k}$ and inputs $\mathcal{X} = \mathbb{R}^m$ for $m \le n$ contains

1. convolutions with filter size $\le k$, dilation $\le \frac{n-1}{k-1}$, stride $\le n-1$, and arbitrary channel groups;
2. the parameter-free operations Id, Zero, and AvgPool$_s$ for any kernel size $s \le n$;
3. compositions of 1 or 2 with multiplication of all input or output channels by a bounded-depth K-matrix.

Note this does not account for all important XD-operations; e.g. we show in the appendix that they also express Fourier Neural Operators [30] with $\le k/2$ modes and any transposed convolutions whose stride equals the dilated kernel size.[^6] Still, the first two items account for non-separable variants of most operations considered in past NAS work in computer vision, excluding the nonlinear MaxPool [47, 11]. Note that depthwise-separable convolutions are contained in the set of compositions of XD-operations. The third item implies that XD-operations can express the basic and diffusion graph convolutions over fixed graphs [21, 27]: both are point-wise convolutions composed with sparse multiplication by a modified adjacency matrix, which K-matrices can represent efficiently.

As a concrete example, consider dilated convolutions, which for $k > 1$ and dilation factor $d \ge 1$ apply filters of effective size $(k-1)d + 1$ with nonzero entries separated by $d - 1$ zeros. One could hope to express the application of $\mathrm{DilatedConv}_{k,d}$ to an input $x \in \mathbb{R}^n$ in the single-channel setting as $F^{-1} \operatorname{diag}(F \operatorname{diag}(p_{k,d}) w) F x$, where $p_{k,d} \in \{0,1\}^n$ zeroes out appropriate entries of $w$, but this requires filter size $(k-1)d + 1 > k$, increasing the number of weights. Instead, we can use a permutation $P_{k,d} \in \{0,1\}^{n \times n}$ before the DFT to place the $k$ entries of $w$ into dilated positions:

$$\mathrm{DilatedConv}_{k,d}(w)(x) = F^{-1} \operatorname{diag}(F P_{k,d} w) F x \qquad (6)$$

As permutations are depth-2 K-matrices [10], we can express $\mathrm{DilatedConv}_{k,d}$ with an XD-operation of depth $(1, 3, 1)$, with $K = F^{-1}$, $L = F P_{k,d}$, and $M = F$.
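The permutation trick in Equation (6) can be checked numerically as follows (a minimal sketch for circular convolutions; dense matrices are used for clarity):

```python
import numpy as np

n, k, d = 16, 3, 2
w = np.zeros(n)
w[:k] = np.random.randn(k)          # k filter weights, zero-padded to length n
x = np.random.randn(n)

# Permutation P_{k,d} placing the first k entries of w at positions 0, d, 2d, ...
P = np.zeros((n, n))
dilated_rows = [i * d for i in range(k)]
for i, row in enumerate(dilated_rows):
    P[row, i] = 1
remaining_rows = [r for r in range(n) if r not in dilated_rows]
for col, row in zip(range(k, n), remaining_rows):
    P[row, col] = 1                 # the zero entries of w can go anywhere else

# Eq. (6): dilated convolution via diagonalization with a permuted filter.
y_eq6 = np.real(np.fft.ifft(np.fft.fft(P @ w) * np.fft.fft(x)))

# Direct circular dilated convolution for comparison.
y_direct = np.array([sum(w[a] * x[(t - a * d) % n] for a in range(k))
                     for t in range(n)])
assert np.allclose(y_eq6, y_direct)
```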
4 Finding and Evaluating XD-Operations

This section outlines a simple procedure that we use to evaluate XD-operations. Recall that NAS methods specify architectures by assigning operations to each edge $(u, v, \mathrm{Op})$ of a computational graph. We aim to simultaneously find good operations and model weights, a goal distinct from the classic two-stage NAS formulation, which finds assignments in an initial search phase before training the resulting architecture from scratch [47]. However, the use of weight-sharing [36] extends NAS to one-shot objectives where weights and architectures are jointly optimized. Under weight-sharing, architecture parameters become weights in a larger "supernet", extending the hypothesis class [25]. To assess XD-operations directly, we assume the user provides a starter network with existing edge labels $\mathrm{Op}_{u,v}$ as a backbone. We transform this into a weight-sharing supernet by reparameterizing each operation $\mathrm{Op}_{u,v}$ as an XD-operation $\mathrm{XD}_{a_{u,v}}$ with architecture parameter $a_{u,v}$. Then we simultaneously train both $a_{u,v}$ and the model weights $w_{u,v}$ associated with each edge as follows:

- Architecture parameters $a_{u,v}$ are initialized from the original operation used by the CNN backbone by setting $\mathrm{Op}_{u,v} = \mathrm{XD}_{a_{u,v}}$; $a_{u,v}$ is then updated via SGD or Adam [20]. We tune the step-size, the momentum, and the number of warmup epochs: initial epochs during which only the model weights $w_{u,v}$ are updated. This can be viewed as a specialized step-size schedule.
- Model weights $w_{u,v}$ are initialized and updated using the routine provided with the backbone.

[^6]: This restriction still includes transposed convolutions used in well-known architectures such as U-Net [38].

Table 1: Search space comparison on CIFAR-10. Validation accuracies are averages of three trials. While we use small CNNs for exploration, XD-operations can also be used with high-performance backbones to obtain > 95% accuracy (cf. the appendix). No data augmentation is used in the permuted case.

| Backbone / search space | CIFAR-10 | Permuted CIFAR-10 | Cost (hours) |
|---|---|---|---|
| LeNet | 75.5 ± 0.1 | 43.7 ± 0.5 | 0.3 |
| LeNet + S_discrete | 75.6 ± 3.4 | 47.7 ± 1.0 | 1.0 |
| LeNet + S_XD | 77.7 ± 0.7 | 63.0 ± 1.0 | 0.9 |
| ResNet-20 | 91.7 ± 0.2 | 58.6 ± 0.7 | 0.6 |
| ResNet-20 + S_discrete | 92.7 ± 0.2 | 58.0 ± 1.0 | 5.3 |
| ResNet-20 + S_XD | 92.4 ± 0.2 | 73.5 ± 1.6 | 5.6 |

Figure 2: On permuted images, where convolutions are not the right operation, we find XD-operations that are farther away from the operations of the initial CNN backbone.

This approach allows us to use established topologies and optimizers while searching for new operations, thus aligning with the goal of Sections 5, 6, and 7: to improve upon the CNN backbones that practitioners often use as a first attempt. As a simple example, we start by applying the procedure to image classification. Since this is not the main objective of our work, we treat it as a warmup and consider two datasets: CIFAR-10 and a variant in which the images' rows and columns are permuted. On CIFAR-10 we do not expect to see much improvement from XD-operations over the CNN backbone used to initialize search, as convolutions are already the right operation for images. On the other hand, the right operation on permuted data, at least in layer one, is an inverse permutation followed by a convolution; as this is an XD-operation,[^7] here we do hope to see improvement.

Using LeNet [23] and ResNet-20 [15] as backbones, we compare applying our algorithm to XD-operations with two baselines: (1) using just the backbone CNN and (2) applying a similar method to the relaxed set of DARTS operations from Section 2. To optimize over the relaxed set we take an approach similar to DARTS: parameterize the simplex using a softmax and apply Adam. We experiment with both a uniform initialization and one biased towards the backbone's operation.
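For concreteness, the search procedure described at the start of this section (initialize from the backbone, then jointly update weights and architecture parameters after a warmup) can be sketched as below; the `model_weights()` and `arch_params()` helpers are hypothetical stand-ins for splitting the parameters, not part of the released code.

```python
import torch

def train_xd_supernet(model, loader, loss_fn, epochs=100, warmup_epochs=5):
    # Model weights: use the backbone's own training routine (SGD as a stand-in).
    weight_opt = torch.optim.SGD(model.model_weights(), lr=0.1, momentum=0.9)
    # Architecture parameters (K, L, M, b, C): a separate, tuned optimizer.
    arch_opt = torch.optim.Adam(model.arch_params(), lr=1e-3)
    for epoch in range(epochs):
        for x, y in loader:
            weight_opt.zero_grad()
            arch_opt.zero_grad()
            loss_fn(model(x), y).backward()
            weight_opt.step()
            if epoch >= warmup_epochs:   # warmup: only model weights are updated
                arch_opt.step()
    return model
```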
While both $S_{\mathrm{XD}}$ and $S_{\mathrm{discrete}}$ contain LeNet's Conv 5x5 and ResNet's Conv 3x3 and Id, for LeNet's MaxPool 3x3 layer we initialize with the closest operation. For direct comparison, both search spaces employ weights with maximum filter size 5x5, and for both we evaluate the shared weights rather than retraining, which we find hurts $S_{\mathrm{discrete}}$. We set the XD-operations' depth to $d = 3_3$ so as to express the dilated convolutions in $S_{\mathrm{discrete}}$ and convolutions composed with permutations.

In Table 1, we see that while both the relaxed discrete NAS operations and XD-operations perform comparably on regular images, XD-operations achieve around 15% better accuracy with both backbones when the images are permuted.[^8] Note that even networks obtained by running state-of-the-art NAS procedures such as GAEA PC-DARTS [25] and DenseNAS [13] on permuted CIFAR-10 achieve only 66.3% and 61.6% accuracy, respectively, despite using millions more parameters than ResNet-20. While it is not straightforward to understand the recovered XD-operations that perform so well, we can use the relative Euclidean distance of their architecture parameters from initialization as a proxy for novelty; in Figure 2 we see that on regular images our procedure finds operations that are quite similar to convolutions, but on permuted data they are much further away. These results show that to enable NAS on diverse data, we will need a search space that contains truly novel operations, not just combinations of existing ones. In the remainder of the paper, we study more diverse and realistic tasks that provide further evidence that $S_{\mathrm{XD}}$ is a strong candidate for this.

[^7]: Recall that $S_{\mathrm{XD}}$ includes compositions of convolutions with multiplication by a K-matrix, e.g. a permutation.
[^8]: Full accuracy can be recovered via an auxiliary loss encouraging permutation-like K-matrices [10].
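For reference, the permuted variant of CIFAR-10 used above can be constructed by applying one fixed permutation to the rows and columns of every image; the torchvision sketch below is illustrative (the seed and exact pipeline are not the settings used in our experiments).

```python
import torch
import torchvision
import torchvision.transforms as T

g = torch.Generator().manual_seed(0)
row_perm = torch.randperm(32, generator=g)   # fixed row permutation
col_perm = torch.randperm(32, generator=g)   # fixed column permutation

def permute_image(img):                      # img: (C, 32, 32) tensor
    return img[:, row_perm][:, :, col_perm]

transform = T.Compose([T.ToTensor(), permute_image])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
```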
Figure 3: Relative error on Burgers' equation (left; resolutions 256 to 8192; CNN-1d, FNO-1d, CNN-1d XD) and Darcy Flow (right; resolutions 85 to 421; CNN-2d, FNO-2d, CNN-2d with DARTS operations, Auto-DeepLab, CNN-2d XD) across different resolutions.

5 Application: Learning to Solve Partial Differential Equations

As our first non-vision application, we consider the task of solving PDEs, an important application area of ML in the natural sciences [28, 29, 41]. In our setup, data generated by classical PDE solvers is used to learn functions from some initial condition or setting to the corresponding PDE solution, with the goal of replacing the solver by a deep net forward pass; the latter can be orders of magnitude faster. A recent state-of-the-art approach for this introduces Fourier Neural Operators [30], operations that significantly improve upon previous neural approaches across three different PDE settings. To evaluate the ability of XD-operations to compete with such custom-designed operations starting from simple CNN backbones, we investigate the same three PDEs that they study: Burgers' equation, Darcy Flow, and the 2d Navier-Stokes equations, which involve 1d, 2d, and 3d data, respectively. The first two are studied across multiple resolutions, while the last is studied at different viscosities. As before, we start with a simple CNN backbone, the type a scientist might use in a first attempt at a solution, and replace all convolutions by XD-operations. We initially hope to do better than this backbone, but ambitiously also hope to compete with the custom-designed FNO.

The specific CNN we use is simply the FNO architecture of the appropriate dimension $N$, but with all $N$-dimensional FNOs replaced by $N$-dimensional convolutions; this performs similarly to their CNN baselines [30]. In all cases we compare mainly to the CNN backbone and our reproduction of the FNO results, as the latter exceeds all other neural methods; a complete results table is provided in the appendix. Our reproduction of FNO is slightly worse than their reported numbers for Burgers' equation and slightly better in the other two settings. Note that on the Navier-Stokes equations we only compare to the 3d FNO on the two settings in which we were able to reproduce their approach; moreover, we do not compare to their use of a 2d FNO plus a recurrent net in time, but in principle XD-operations can also be substituted there. In the 2d Darcy Flow case we also include comparisons to DARTS operations in the simple CNN backbone, as in Section 4, and to Auto-DeepLab (AutoDL) [31], a well-known NAS method for dense prediction.

For evaluating XD-operations we again follow the procedure in Section 4, in which we tune only the architecture optimizer; notably, we do this only at the lowest resolutions. At all dimensions we use XD-operations of depth $d = 1_3$; in addition, in dimensions $N > 1$ we fix the architecture biases $b$ and channel gates $C$ to 0 and 1, respectively, to conserve memory at higher resolutions. At lower resolutions we find that the performance difference is negligible.

We report our results for Burgers' equation and Darcy Flow in Figure 3; for 2d Navier-Stokes the results are in Table 2. In all cases we dramatically outperform the CNN backbone used to initialize XD-operations; furthermore, we also achieve better error than FNO, despite it being custom-made for this problem. In particular, we find that XD-operations have higher training error but generalize better (cf. the appendix). Figure 3 also shows that XD-operations perform consistently well across resolutions, a major advantage of FNOs over previous methods, whose performance was tightly coupled to the discretization [30]. Notably, CNN performance worsens with higher resolution, unlike that of XD and FNO. Finally, we also substantially outperform DARTS operations and AutoDL in 2d, although the latter is at least consistent across resolutions. These results provide strong evidence that XD-operations are a useful search space for discovering neural operations, even in domains where the convolutions used to initialize them perform much worse than the state of the art. Note that these results do come at the cost of slower training and inference: XD-operations are roughly an order of magnitude slower than FNOs, despite having fewer parameters in 2d and 3d. This still yields solvers one to two orders of magnitude faster than classical solvers, maintaining usefulness for the problem.

Table 2: Relative test error on the 2d Navier-Stokes equations at different settings of the viscosity $\nu$ and number of time steps $T$. The best result in each setting is bolded.

| Method | $\nu = 10^{-4}$, $T = 30$ | $\nu = 10^{-5}$, $T = 20$ |
|---|---|---|
| CNN-3d (our baseline) | 0.325 | 0.278 |
| FNO-3d (reproduced) | 0.182 | 0.177 |
| CNN-3d XD (ours) | **0.172** | **0.168** |
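For reference, the relative $L_2$ error reported in Figure 3 and Table 2 can be computed as in the sketch below (assuming per-example normalization averaged over the batch; the exact normalization in the benchmark code may differ):

```python
import torch

def l2_relative_error(pred, target):
    # ||pred - target||_2 / ||target||_2 per example, averaged over the batch.
    pred, target = pred.flatten(1), target.flatten(1)
    return (torch.norm(pred - target, dim=1) / torch.norm(target, dim=1)).mean()
```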
Figure 4: PSICOV real-valued distance prediction across ResNet depths 4, 6, 10, 18, and 34, comparing ResNet, Dilated ResNet, ResNet with DARTS operations, Auto-DeepLab, the reported Dilated ResNet-258, and ResNet XD. ResNet XD outperforms both baseline and dilated ResNets; at the highest depth we test we also outperform the reported MAE8 of the much deeper Dilated ResNet-258 [1].

6 Application: Real-Valued Distance Prediction for Protein Folding

As a second scientific application, we consider the task of inferring the 3d folded structure of a polypeptide chain, which yields important insights into the function of the resulting protein [18]. This problem is a high-priority challenge in biology and has recently seen significant ML-driven advances from deep learning methods such as AlphaFold [40, 19] and PDNET [1]. These typically involve training a network to predict pairwise physical distances between residues in the chain. We work with the PDNET benchmark, which consists of a training set of 3,356 proteins, a validation set of 100 proteins, and the PSICOV [18] test set of 150 proteins. PDNET is designed to be more accessible than the datasets used by large-scale methods such as AlphaFold, which are not always publicly available and/or require massive compute [40, 19]. We follow the PDNET training procedure [1] and evaluate test-set performance using their MAE8 metric for assessing long-range distances.

As before, we start with simple CNN backbones, in this case ResNets. We choose this to compare most directly to the custom-designed architecture used by PDNET, consisting of a Dilated ResNet characterized by its use of a cyclically increasing dilation rate across ResNet blocks [1]. At a sufficient depth, the Dilated ResNet is shown to outperform a standard pre-activation ResNet adapted to this task [1]. Our goal will be to see whether we can start with the vanilla ResNet and use XD to outperform both it and the specialized Dilated ResNet. We also aim to outperform the DARTS-operations baseline from the previous two sections, as well as the AutoDL NAS approach for dense prediction. We use XD-operations of depth $d = 1_3$ and fix the architecture biases and channel gates as before to conserve memory. We evaluate architectures of different depths (4, 6, 10, 18, and 34) by varying the number of ResNet blocks used in the backbone architecture and baselines.

We report the results as averages across three trials for each depth in Figure 4. Notably, while the Dilated ResNet slightly outperforms the ResNet, ResNet XD outperforms both dilated and standard ResNets at all depths. This provides further evidence that XD-operations can outperform specialized operations in diverse domains, even when initialized naively as standard convolutions. XD also outperforms AutoDL, which does poorly, and DARTS operations, except at the two smallest depths where performance is similar. Moreover, our ResNet-34 XD's MAE8 of 4.0 also improves upon PDNET's reported MAE8 of 4.1 attained by the much deeper Dilated ResNet-258 [1]; however, in our reproduction, Dilated ResNet-258 achieved an MAE8 of 3.5. Given the trend in Figure 4, where XD-operations consistently improve upon the backbone architecture of the same depth, we conjecture that ResNet-258 XD could further improve upon this result. We leave scaling XD-operations to such deeper networks to future work.
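For reference, MAE8 is, roughly, a mean absolute error restricted to long-range residue pairs at short true distances; the sketch below assumes a sequence-separation threshold of 24 residues and an 8 Angstrom distance cap, and the authoritative definition is the one in the PDNET code [1].

```python
import numpy as np

def mae8(pred, true, min_separation=24, max_distance=8.0):
    """Assumed MAE8-style metric: MAE over long-range pairs (|i - j| >= min_separation)
    whose true distance is at most max_distance Angstroms."""
    L = true.shape[0]
    i, j = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
    mask = (np.abs(i - j) >= min_separation) & (true <= max_distance)
    return np.abs(pred[mask] - true[mask]).mean()
```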
Table 3: XD-operations compared to recent results in music modeling. We report average loss across three trials. The best result on each task is bolded.

| Method (source) | JSB Chorales | Nottingham |
|---|---|---|
| Best recurrent [5] | 8.43 | 3.29 |
| TCN [5] | 8.10 | 3.07 |
| Transformer [44] | - | 3.34 |
| R-Transformer [44] | - | **2.37** |
| Undilated TCN (our baseline) | 8.16 ± 0.04 | 3.23 ± 0.02 |
| TCN (reproduced) | 8.17 ± 0.01 | 2.97 ± 0.01 |
| Undilated TCN XD (ours) | **8.07 ± 0.01** | 2.84 ± 0.02 |

7 Application: Music Modeling

Our final application is to music modeling, i.e. learning to predict the next note from sheet music [4]. The dominant approaches for such tasks are recurrent nets [16] and Transformers [42], but recent work has shown that specially-designed convolutional models can also be made competitive at similar model sizes [5, 6]. We consider the temporal convolutional network (TCN) [5], which improves upon a regular CNN by having the dilation factor grow exponentially across layers. The tasks we study are on the JSB Chorales and Nottingham corpora, used in the original evaluation of TCNs [5]. As the baseline we take the TCN and set all dilation factors to one (undilated); our goal will be to start with this undilated network and match or outperform the custom dilation design of the TCN.

The results presented in Table 3 show that we achieve this goal, as we outperform both the undilated baseline and the TCN on both tasks. While the simple undilated backbone that we initialize with turns out to already match the TCN on JSB Chorales, on Nottingham our approach demonstrates that XD-operations can be used to outperform hand-designed architectures starting from vanilla CNNs.[^9] Where possible we also compare to other known results; XD-operations outperform all of these except the R-Transformer [44], a model combining recurrent nets and self-attention, on Nottingham. Together with our results on PDEs and proteins, our study of music modeling provides further evidence that XD-operations can effectively find good operations using standard backbones on diverse tasks.

One notable difficulty here is causality enforcement: making sure the input data does not contain the target when predicting the next entry. While TCNs can efficiently do so via temporal shifts, we do it in a brute-force manner by treating sequences of length $n$ as $n - 1$ data-points with masked targets. This is expensive and thus limits our evaluation to small music tasks. A fruitful direction for future work is thus to examine whether it is possible to directly enforce causality in XD-operations, e.g. by forcing the architecture parameters $K$ and $M$ to be lower triangular; since a product of lower-triangular matrices is again lower triangular, the entire operation is then a multiplication of the input sequence by a lower-triangular matrix, which suffices to prevent causality violations.
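This lower-triangular argument can be checked numerically (a toy sketch with dense matrices standing in for K-matrices): if $K$ and $M$ are lower triangular, then so is $K \operatorname{diag}(v) M$, so each output position depends only on current and earlier input positions.

```python
import numpy as np

n = 6
K = np.tril(np.random.randn(n, n))
M = np.tril(np.random.randn(n, n))
v = np.random.randn(n)                      # stands in for diag(Lw + b)

A = K @ np.diag(v) @ M
assert np.allclose(A, np.tril(A))           # the composed operator is lower triangular

x = np.random.randn(n)
x_future = x.copy()
x_future[3:] += np.random.randn(n - 3)      # perturb only "future" positions
assert np.allclose((A @ x)[:3], (A @ x_future)[:3])   # earlier outputs are unchanged
```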
8 Conclusion

This work aims to transition NAS from combining existing operations designed for vision and text to finding novel and effective operations in many domains. To do so we introduced a new search space of XD-operations and demonstrated its effectiveness on diverse tasks. Combining XD-operations with standard topology-search NAS, warm-starting search from non-standard operations such as graph convolutions and FNOs,[^10] improving the computational limitations described earlier, and constructing spaces containing missing operations such as BatchNorm [17] and self-attention [42] are all promising future directions. Finally, note that our goal of lowering the barrier for applying ML necessarily comes with the possibility of misuse. Mitigating this involves developing tools for application-specific concerns, e.g. privacy and fairness, that go beyond the error metrics we target.

[^9]: In the appendix we report similar improvements on two other tasks on which TCNs were evaluated (permuted MNIST and Penn TreeBank) that we do not discuss in detail, as our focus is on under-explored tasks.
[^10]: In this direction, we found that initializing XD with FNO did worse than initializing with convolutions on Burgers' equation and Darcy Flow, a surprising result given how much better FNO is than the baseline CNN. Similarly, initializing XD with convolutions dilated as in the original TCN did not lead to significant improvement, except in one setting, over undilated initialization. See the appendix for more details and results.

Acknowledgments

We thank Maria-Florina Balcan, Jeremy Cohen, and Tian Li for helpful advice on early versions of this paper, and anonymous reviewers for suggested improvements. This work was supported in part by DARPA under cooperative agreements FA875017C0141 and HR0011202000, NSF grants CCF-1535967, CCF-1910321, IIS-1618714, IIS-1705121, IIS-1838017, IIS-1901403, and IIS-2046613, a Microsoft Research Faculty Fellowship, a Bloomberg Data Science research grant, an Amazon Research Award, an AWS Machine Learning Research Award, a Facebook Faculty Research Award, funding from Booz Allen Hamilton Inc., a Block Center Grant, a Carnegie Bosch Institute Research Award, and a Two Sigma Fellowship Award. We also gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF-1763315 (Beyond Sparsity), CCF-1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Swiss Re, Total, the HAI-AWS Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and members of the Stanford DAWN project: Facebook, Google, and VMWare. The Mobilize Center is a Biomedical Technology Resource Center, funded by the NIH National Institute of Biomedical Imaging and Bioengineering through Grant P41EB027060. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, NSF, NIH, ONR, or any other funding agency.

References

[1] Badri Adhikari. A fully open-source framework for deep learning protein real-valued distances. Scientific Reports, 10(1):13374, 2020.
[2] Nir Ailon, Omer Leibovich, and Vineet Nair. Sparse linear networks with a fixed butterfly structure: Theory and practice. arXiv, 2020.
[3] Keivan Alizadeh vahid, Anish Prabhu, Ali Farhadi, and Mohammad Rastegari. Butterfly transform: An efficient FFT based neural architecture design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[4] Moray Allan and Christopher Williams. Harmonising chorales by probabilistic inference. In Advances in Neural Information Processing Systems, 2005.
[5] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv, 2018.
[6] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. In Proceedings of the 7th International Conference on Learning Representations, 2019.
[7] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, 2012.
[8] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han.
Once-for-all: Train one network and specialize it for efficient deployment. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[9] Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, and Christopher Ré. Learning fast algorithms for linear transforms using butterfly factorizations. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[10] Tri Dao, Nimit Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, and Christopher Ré. Kaleidoscope: An efficient, learnable representation for all structured linear maps. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[11] Xuanyi Dong and Yi Yang. NAS-Bench-201: Extending the scope of reproducible neural architecture search. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[12] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1-21, 2019.
[13] Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. Densely connected search space for more flexible neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[14] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. HiPPO: Recurrent memory with optimal polynomial projections. In Advances in Neural Information Processing Systems, 2020.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[16] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735-1780, 1997.
[17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[18] David T. Jones, Daniel W. A. Buchan, Domenico Cozzetto, and Massimiliano Pontil. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics, 28(2):184-190, 2011.
[19] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583-589, 2021.
[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
[21] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.
[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[23] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision. 1999.
[24] Liam Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1-52, 2018.
[25] Liam Li, Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Geometry-aware gradient algorithms for neural architecture search. In Proceedings of the 9th International Conference on Learning Representations, 2021. To appear.
[26] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2019.
[27] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In Proceedings of the 6th International Conference on Learning Representations, 2018.
[28] Yingzhou Li, Haizhao Yang, Eileen R. Martin, Kenneth L. Ho, and Lexing Ying. Butterfly factorization. Multiscale Modeling & Simulation, 13(2):714-732, 2015.
[29] Yingzhou Li, Haizhao Yang, and Lexing Ying. Multidimensional butterfly factorization. Applied and Computational Harmonic Analysis, 44(3):737-758, 2018.
[30] Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In Proceedings of the 9th International Conference on Learning Representations, 2021. To appear.
[31] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[32] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In Proceedings of the 7th International Conference on Learning Representations, 2019.
[33] Jieru Mei, Yingwei Li, Xiaochen Lian, Xiaojie Jin, Linjie Yang, Alan Yuille, and Jianchao Yang. AtomNAS: Fine-grained end-to-end neural architecture search. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[34] Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas. ACDC: A structured efficient linear layer. arXiv, 2015.
[35] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[36] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[37] Esteban Real, Chen Liang, David R. So, and Quoc V. Le. AutoML-Zero: Evolving machine learning algorithms from scratch. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015.
[39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[40] Andrew W.
Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, and Demis Hassabis. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706-710, 2020.
[41] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339-1364, 2018.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
[43] Yujing Wang, Yaming Yang, Yiren Chen, Jing Bai, Ce Zhang, Guinan Su, Xiaoyu Kou, Yunhai Tong, Mao Yang, and Lidong Zhou. TextNAS: A neural architecture search space tailored for text representation. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020.
[44] Zhiwei Wang, Yao Ma, Zitao Liu, and Jiliang Tang. R-Transformer: Recurrent neural network enhanced transformer. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[45] Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. PC-DARTS: Partial channel connections for memory-efficient architecture search. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[46] Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. NAS evaluation is frustratingly hard. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[47] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. NAS-Bench-101: Towards reproducible neural architecture search. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[48] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference, 2016.
[49] Arber Zela, Julien Siems, and Frank Hutter. NAS-Bench-1Shot1: Benchmarking and dissecting one-shot neural architecture search. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[50] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See the first bullet-point of Section 2 and the last paragraphs of Sections 5, 6, and 7.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See the end of Section 8.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See the appendix.
   (b) Did you include complete proofs of all theoretical results? [Yes] See the appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See the supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See the appendix.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Tables 1 and 3 and Figure 4.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Table 1 and the appendix.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] See code in the supplementary material.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]