Published as a conference paper at ICLR 2024

ON THE FOUNDATIONS OF SHORTCUT LEARNING

Katherine L. Hermann1, Hossein Mobahi2, Thomas Fel1, and Michael C. Mozer1
Google {1DeepMind, 2Research}, Mountain View, CA, USA
{hermannk, hmobahi, thomasfel, mcmozer}@google.com

ABSTRACT

Deep-learning models can extract a rich assortment of features from data. Which features a model uses depends not only on predictivity (how reliably a feature indicates training-set labels) but also on availability (how easily the feature can be extracted from inputs). The literature on shortcut learning has noted examples in which models privilege one feature over another, for example texture over shape and image backgrounds over foreground objects. Here, we test hypotheses about which input properties are more available to a model, and systematically study how predictivity and availability interact to shape models' feature use. We construct a minimal, explicit generative framework for synthesizing classification datasets with two latent features that vary in predictivity and in factors we hypothesize to relate to availability, and we quantify a model's shortcut bias: its over-reliance on the shortcut (more available, less predictive) feature at the expense of the core (less available, more predictive) feature. We find that linear models are relatively unbiased, but introducing a single hidden layer with ReLU or Tanh units yields a bias. Our empirical findings are consistent with a theoretical account based on Neural Tangent Kernels. Finally, we study how models used in practice trade off predictivity and availability in naturalistic datasets, discovering availability manipulations which increase models' degree of shortcut bias. Taken together, these findings suggest that the propensity to learn shortcut features is a fundamental characteristic of deep nonlinear architectures warranting systematic study, given its role in shaping how models solve tasks.

1 INTRODUCTION

Natural data domains provide a rich, high-dimensional input from which deep-learning models can extract a variety of candidate features. During training, models determine which features to rely on. Following training, the chosen features determine how models generalize. A challenge for machine learning arises when models come to rely on spurious or shortcut features instead of the core or defining features of a domain (Arjovsky et al., 2019; McCoy et al., 2019; Geirhos et al., 2020; Singla & Feizi, 2022). Shortcut ("cheat") features, which are correlated with core ("true") features in the training set, obtain good performance on the training set as well as on an iid test set, but poor generalization on out-of-distribution inputs. For instance, ImageNet-trained CNNs classify primarily according to an object's texture (Baker et al., 2018; Geirhos et al., 2018a; Hermann et al., 2020), whereas people define and classify solid objects by shape (e.g., Landau et al., 1988). Focusing on texture leads to reliable classification on many images but might result in misclassification of, say, a hairless cat, which has wrinkly skin more like that of an elephant. The terms spurious and shortcut are largely synonymous in the literature, although the former often refers to features that arise unintentionally in a poorly constructed dataset, and the latter to features easily latched onto by a model.
In addition to a preference for texture over shape, other common shortcut features include a propensity to classify based on image backgrounds rather than foreground objects (Beery et al., 2018; Sagawa et al., 2020a; Xiao et al., 2020; Moayeri et al., 2022), or based on individual diagnostic pixels rather than higher-order image content (Malhotra et al., 2020).

The literature examining feature use has often focused on predictivity (how well a feature indicates the target output). Anomalies have been identified in which networks come to rely systematically on one feature over another when the features are equally predictive, or even when the preferred feature has lower predictivity than the non-preferred feature (Beery et al., 2018; Pérez et al., 2018; Tachet et al., 2018; Arjovsky et al., 2019; McCoy et al., 2019; Hermann & Lampinen, 2020; Shah et al., 2020; Nagarajan et al., 2020; Pezeshki et al., 2021; Fel et al., 2023). Although we lack a general understanding of the cause of such preference anomalies, several specific cases have been identified. For example, features that are linearly related to classification labels are preferred by models over features that require nonlinear transforms (Hermann & Lampinen, 2020; Shah et al., 2020). Another factor leading to anomalous feature preferences is redundancy of representation, e.g., the size of the pixel footprint in an image (Sagawa et al., 2020a; Wolff & Wolff, 2022; Tartaglini et al., 2022).

Because predictivity alone is insufficient to explain feature reliance, here we explicitly introduce the notion of availability to refer to the factors that influence the likelihood that a model will use a feature more so than a purely statistical account would predict. A more-available feature is easier for the model to extract and leverage. Past research has systematically manipulated predictivity; in the present work, we systematically manipulate both predictivity and availability to better understand their interaction and to characterize conditions giving rise to shortcut learning. Our contributions are:

- We define quantitative measures of predictivity and availability using a generative framework that allows us to synthesize classification datasets with latent features having specified predictivity and availability. We introduce two notions of availability relating to singular values and nonlinearity of the data-generating process, and we quantify shortcut bias in terms of how a learned classifier deviates from an optimal classifier in its feature reliance.
- We perform parametric studies of latent-feature predictivity and availability, and examine the sensitivity of different model architectures to shortcut bias, finding that it is greater for nonlinear models than for linear models, and that model depth amplifies bias.
- We present a theoretical account based on Neural Tangent Kernels (Jacot et al., 2018) which indicates that shortcut bias is an inevitable consequence of nonlinear architectures.
- We show that vision architectures used in practice can be sensitive to non-core features beyond their predictive value, and present a set of availability manipulations of naturalistic images that shift models' feature reliance.
2 RELATED WORK

The propensity of models to learn spurious (Arjovsky et al., 2019) or shortcut (Geirhos et al., 2020) features arises in a variety of domains (Heuer et al., 2016; Gururangan et al., 2018; McCoy et al., 2019; Sagawa et al., 2020a) and is of interest from both a scientific and a practical perspective. Existing work has sought to understand the extent to which this tendency derives from the statistics of the training data versus from model inductive bias (Neyshabur et al., 2014; Tachet et al., 2018; Pérez et al., 2018; Rahaman et al., 2019; Arora et al., 2019; Geirhos et al., 2020; Sagawa et al., 2020b;a; Pezeshki et al., 2021; Nagarajan et al., 2021), for example a bias to learn simple functions (Tachet et al., 2018; Pérez et al., 2018; Rahaman et al., 2019; Arora et al., 2019). Hermann & Lampinen (2020) found that models preferentially represent one of a pair of equally predictive image features, typically whichever feature had been most linearly decodable from the model at initialization. They also identified cases where models relied on a less-predictive feature that had a linear relationship to task labels over a more-predictive feature that had a nonlinear relationship to labels. Together, these findings suggest that predictivity is not the only factor that determines model representations and behavior. A theoretical account by Pezeshki et al. (2021) studied a situation in supervised learning in which minimizing cross-entropy loss captures only a subset of predictive features, while other relevant features go unnoticed. They introduced a formal notion of strength that determines which features are likely to dominate a solution. Their notion of strength confounds predictivity and availability, the two concepts we aim to disentangle in the present work.

Work in the vision domain has studied which features vision models rely on when trained on natural image datasets. For example, ImageNet models prefer to classify based on texture rather than shape (Baker et al., 2018; Geirhos et al., 2018a; Hermann et al., 2020) and local rather than global image content (Baker et al., 2020; Malhotra et al., 2020), marking a difference from how people classify images (Landau et al., 1988; Baker et al., 2018). Other studies have found that image backgrounds play an outsize role in driving model predictions (Beery et al., 2018; Sagawa et al., 2020a; Xiao et al., 2020). Two studies manipulated the quantity of a feature in an image to test how this changed model behavior. Tartaglini et al. (2022) probed pretrained models with images containing a shape consistent with one class label and a texture consistent with another, where the texture was present throughout the image, including in the background. They varied the opacity of the background, and as the texture became less salient, models increasingly classified by shape. Wolff & Wolff (2022) found that when images contained objects with opposing labels, models preferred to classify by the object with the larger pixel footprint, an idea we return to in Section 5. The development of methods for reducing bias for particular features over others is an active area (e.g., Arjovsky et al., 2019; Geirhos et al., 2018a; Hermann et al., 2020; Robinson et al., 2021; Minderer et al., 2020; Sagawa et al., 2020b; Ryali et al., 2021; Kirichenko et al., 2022; Teney et al., 2022; Tiwari & Shenoy, 2023; Ahmadian & Lindsten, 2023; Pezeshki et al., 2021; Puli et al., 2023; LaBonte et al., 2023), important for improving generalization and addressing fairness concerns (Zhao et al., 2017; Buolamwini & Gebru, 2018).

Figure 1: Synthetic data. A: Two datasets differing in the predictivity of zs. B: Schematic of the embedding procedure manipulating availability via the mapping from z to x. Dashed boxes are optional.
3 GENERATIVE PROCEDURE FOR SYNTHETIC DATASETS

To systematically explore the roles of predictivity and availability, we construct synthetic datasets from a generative procedure that maps a pair of latent features, z = (zs, zc), to an input vector x ∈ R^d and a class label y ∈ {−1, +1}. The subscripts s and c denote the latent dimensions that will be treated as the potential shortcut and core feature, respectively. The procedure draws z from a multivariate Gaussian conditioned on class,

    z | y ∼ N( y [µs, µc]ᵀ , [[1, σsc], [σsc, 1]] ),   with |σsc| < 1.

Through symmetry, the optimal decision boundary for latent feature i is zi = 0, allowing us to define the feature predictivity ρi ≜ Pr(y = sign(zi)). For Gaussian likelihoods, this predictivity is achieved by setting

    µi = √2 · erf⁻¹(2ρi − 1).

Figure 1A shows sample latent-feature distributions for ρc = 0.9 and two levels of ρs.

Availability manipulations. Given a latent vector z, we manipulate the hypothesized availability (hereafter, simply availability) of each feature independently through an embedding procedure, sketched in Figure 1B, that yields an input x ∈ R^d. We posit two factors that influence the availability of feature i: its amplification αi and its nesting ηi. Amplification αi is a scaling factor on an embedding, ei = αi wi zi, where wi ∈ R^d is a feature-specific random unit vector. Amplification includes manipulations of redundancy (replicating a feature) and magnitude (increasing a feature's dynamic range). Nesting ηi is a factor that determines the ease of recovering a latent feature from an embedding. We assume a nonlinear, rank-preserving transform, e′i = f_ηi(ei), where f_ηi is a fully connected, random net with ηi ∈ ℕ tanh layers in cascade. For ηi = 0, feature i remains in explicit form, e′i = ei; for ηi > 0, the feature is recoverable only through an inversion whose complexity increases with ηi. To complete the data-generative process, we combine embeddings by summation: x = e′s + e′c.
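To make the generative procedure concrete, here is a minimal NumPy sketch of it under the default settings used later in Section 4 (this is our own illustrative code, not the authors' released implementation; the function and parameter names are ours):

```python
import numpy as np
from scipy.special import erfinv

def sample_dataset(n, rho_s=0.85, rho_c=0.9, sigma_sc=0.6,
                   alpha_s=64.0, alpha_c=1.0, d=100, seed=0):
    """Sample (x, y, z) from the two-latent-feature generative process (eta_s = eta_c = 0)."""
    rng = np.random.default_rng(seed)
    # Class-conditional means chosen so that Pr(y = sign(z_i)) = rho_i.
    mu = np.array([np.sqrt(2.0) * erfinv(2 * rho_s - 1),
                   np.sqrt(2.0) * erfinv(2 * rho_c - 1)])
    cov = np.array([[1.0, sigma_sc], [sigma_sc, 1.0]])
    y = rng.choice([-1, 1], size=n)                               # balanced class labels
    z = y[:, None] * mu + rng.multivariate_normal(np.zeros(2), cov, size=n)
    # Amplification: embed each latent along a feature-specific random unit vector in R^d.
    w = rng.normal(size=(2, d))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    e_s = alpha_s * np.outer(z[:, 0], w[0])
    e_c = alpha_c * np.outer(z[:, 1], w[1])
    return e_s + e_c, y, z                                        # x = e_s + e_c

x, y, z = sample_dataset(3200)
print(np.mean(np.sign(z[:, 1]) == y))    # empirical core predictivity, close to rho_c = 0.9
```

A nesting manipulation (ηi > 0) would additionally pass ei through a cascade of random tanh layers before summation, as described above; the sketch omits that step for brevity.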
The shortcut bias is the reliance of a model on the shortcut feature over that of the optimal, LDA: bias = reliancemodel relianceoptimal. Appendix C.1 describes and justifies our reliance measure in detail. For a given model M, whether trained net or LDA, we probe over the latent space and determine for each probe z the binary classification decision, ˆy M(z) (see Figure 2). Shortcut reliance is the difference between the model s alignment (covariance) with decision boundaries based only on the shortcut and core features: reliance M = 2 i ˆy Misign(zsi) X i ˆy Misign(zci) where n is the number of probe items. Both the reliance score and shortcut bias are in [ 1, +1]. 4 EXPERIMENTS MANIPULATING FEATURE PREDICTIVITY AND AVAILABILITY Methodology. Using the procedure described in Section 3, we conduct controlled experiments examining how feature availability and predictivity and model architecture affect the shortcut bias. We sample class-balanced datasets with 3200 train instances, 1000 validation instances, and 900 probe (evaluation) instances that uniformly cover the (zs, zc) space by taking a Cartesian product of 30 zs evenly spaced in [ 3µs, +3µs] and 30 zc evenly spaced in [ 3µc, +3µc]. In all simulations, we set d = 100, ηc = ηs = 0, ρc = 0.9, σsc = 0.6. We manipulate shortcut-feature predictivity with the constraint that it is lower than core-feature predictivity but still predictive, i.e., 0.5 < ρs < ρc = 0.9. Because only the relative amplification of features matters, we vary the ratio αs/αc, with the shortcut feature more available, i.e., αs/αc 1. We report the means across 10 runs (Appendix C.3). Models prefer the more available feature, even when it is less predictive. We first test whether models prefer zs when it is more available than zc, including when it is less predictive, while holding model architecture fixed (see Appendix C.3). Figure 2 shows that when predictivity of the two features is matched (ρc = ρs = 0.9), models prefer the more-available feature. And given sufficiently high availability, models can prefer the less-predictive but more-available shortcut feature to the more-predictive but less-available core feature. Availability can override predictivity. Model depth increases shortcut bias. In the previous experiment, we used a fixed model architecture. Here, we investigate how model depth and width influence shortcut bias when trained with αs/αc = 64 and ρs = 0.85, previously shown to induce a bias (see Figure 2, gray square). As shown in Figure 3A, we find that bias increases as a function of model depth when dataset is held fixed. Model nonlinearity increases shortcut bias. To understand the role of the hidden-layer activation function, we compare models with linear, Re LU, and Tanh activations while holding weight initialization and data fixed. As indicated in Figure 3B, nonlinear activation functions induce a larger shortcut bias than their linear counterpart. Published as a conference paper at ICLR 2024 Figure 3: A: Model depth increases shortcut bias. The color of each cell indicates the mean bias of an MLP with Re LU hidden activation-functions, for various model widths and depths, trained on data with a shortcut feature that is more available (αs/αc = 64) but less predictive (ρs = 0.85) than the core feature. Model nonlinearity increases shortcut bias. 
4 EXPERIMENTS MANIPULATING FEATURE PREDICTIVITY AND AVAILABILITY

Methodology. Using the procedure described in Section 3, we conduct controlled experiments examining how feature availability, feature predictivity, and model architecture affect the shortcut bias. We sample class-balanced datasets with 3200 train instances, 1000 validation instances, and 900 probe (evaluation) instances that uniformly cover the (zs, zc) space by taking a Cartesian product of 30 zs values evenly spaced in [−3µs, +3µs] and 30 zc values evenly spaced in [−3µc, +3µc]. In all simulations, we set d = 100, ηc = ηs = 0, ρc = 0.9, and σsc = 0.6. We manipulate shortcut-feature predictivity with the constraint that it is lower than core-feature predictivity but still predictive, i.e., 0.5 < ρs < ρc = 0.9. Because only the relative amplification of features matters, we vary the ratio αs/αc, with the shortcut feature more available, i.e., αs/αc ≥ 1. We report means across 10 runs (Appendix C.3).

Models prefer the more-available feature, even when it is less predictive. We first test whether models prefer zs when it is more available than zc, including when it is less predictive, while holding model architecture fixed (see Appendix C.3). Figure 2 shows that when the predictivity of the two features is matched (ρc = ρs = 0.9), models prefer the more-available feature. And given sufficiently high availability, models can prefer the less-predictive but more-available shortcut feature to the more-predictive but less-available core feature. Availability can override predictivity.

Model depth increases shortcut bias. In the previous experiment, we used a fixed model architecture. Here, we investigate how model depth and width influence shortcut bias when trained with αs/αc = 64 and ρs = 0.85, a setting previously shown to induce a bias (see Figure 2, gray square). As shown in Figure 3A, we find that bias increases as a function of model depth when the dataset is held fixed.

Model nonlinearity increases shortcut bias. To understand the role of the hidden-layer activation function, we compare models with linear, ReLU, and Tanh activations while holding weight initialization and data fixed. As indicated in Figure 3B, nonlinear activation functions induce a larger shortcut bias than their linear counterpart.

Figure 3: A: Model depth increases shortcut bias. The color of each cell indicates the mean bias of an MLP with ReLU hidden activation functions, for various model widths and depths, trained on data with a shortcut feature that is more available (αs/αc = 64) but less predictive (ρs = 0.85) than the core feature. B: Model nonlinearity increases shortcut bias. Shortcut bias for three hidden activation functions for a deep MLP with width 128 and depth 2, trained on datasets where predictivity is matched (ρs = ρc = 0.9) but shortcut availability is higher (αs/αc = 32). The shortcut bias is more pronounced when the model contains a nonlinear activation function. C: Shortcut bias for MLPs with a single hidden layer and a hidden activation function that is either linear (left) or ReLU (right), for various shortcut-feature availabilities (αs/αc) and predictivities (ρs). See B.1 for Tanh.

Figure 4: ResNet-18 prefers a shortcut feature when availability is instantiated as the pixel footprint of an object (feature), even when that feature is less predictive. A: Sample images. B: Shortcut bias increases as a function of relative availability of the shortcut feature when features are equally predictive (ρs = ρc = 0.9), consistent with Wolff & Wolff (2022). C: Even when the shortcut feature is less predictive, models have a shortcut bias due to availability when αs/αc = 4.

Feature nesting increases shortcut bias. The synthetic experiments reported above all manipulate availability with the amplitude ratio αs/αc. We also conducted experiments manipulating a second factor we expected would affect availability (Hermann & Lampinen, 2020; Shah et al., 2020): the relative nesting of representations, i.e., ηc − ηs ≥ 1. We report these experiments in Appendix C.3.

5 AVAILABILITY MANIPULATIONS WITH SYNTHETIC IMAGES

What if we instantiate the same latent feature space studied in the previous section in images? We form shortcut features analogous to the texture or image-background features previously noted to preferentially drive the behavior of vision models (e.g., Geirhos et al., 2018a; Baker et al., 2018; Hermann et al., 2020; Beery et al., 2018; Sagawa et al., 2020a; Xiao et al., 2020). Building on the work of Wolff & Wolff (2022), we hypothesize that these features are more available because they have a large footprint in an image and hence, by our notion of availability, a large αs.

Methods. We instantiate a latent vector z from our data-generation procedure as an image. Each feature becomes an object (zs a circle, zc a square) whose color is determined by the respective feature value. Following Wolff & Wolff (2022), we manipulate the availability of each feature in terms of its size, or pixel footprint. We randomly position the circle and square entirely within the image, avoiding overlap, yielding a 224 × 224 pixel image (Figure 4A, Appendix C.4; see also the sketch at the end of this section).

Results. Figure 4B presents the shortcut bias of ResNet-18 as a function of shortcut-feature availability (footprint) when the two features are equally predictive (ρs = ρc = 0.9). In Figure 4C, the availability ratio is fixed at αs/αc = 4, and the shortcut bias is assessed as a function of ρs. ResNet-18 is biased toward the more-available shortcut feature even when it is less predictive than the core feature. Together, these results suggest that a simple characteristic of image contents, the pixel footprint of an object, can bias models' output behavior, and may therefore explain why models can fail to leverage typically-smaller foreground objects in favor of image backgrounds (Section 7).
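As a concrete illustration of the Section 5 stimuli, the sketch below renders a latent pair (zs, zc) as a 224 × 224 image with a circle (shortcut) and a square (core); the color mapping, footprint values, and placement logic are our own illustrative assumptions rather than the authors' exact rendering code:

```python
import numpy as np

def render_probe(z_s, z_c, footprint_s=1600, footprint_c=400, size=224, seed=0):
    """Render z as an RGB image: z_s colors a circle, z_c colors a square.

    Availability is the pixel footprint (area) of each object; a larger footprint_s
    makes the shortcut feature more available. Color encodes the latent value by
    interpolating between blue (negative) and red (positive), an assumption of ours.
    """
    rng = np.random.default_rng(seed)
    img = np.ones((size, size, 3), dtype=np.float32)            # white background

    def color(z, scale=3.0):
        t = np.clip((z / scale + 1) / 2, 0, 1)                  # map roughly [-3, 3] -> [0, 1]
        return np.array([t, 0.2, 1 - t], dtype=np.float32)      # blue -> red

    # Square for the core feature.
    side = max(2, int(np.sqrt(footprint_c)))
    x0, y0 = rng.integers(0, size - side, size=2)
    img[y0:y0 + side, x0:x0 + side] = color(z_c)

    # Circle for the shortcut feature; re-sample its position until it misses the square
    # (bounding-box test; falls back to the last sample after many tries).
    r = max(2, int(np.sqrt(footprint_s / np.pi)))
    for _ in range(1000):
        cx, cy = rng.integers(r, size - r, size=2)
        if cx + r < x0 or cx - r > x0 + side or cy + r < y0 or cy - r > y0 + side:
            break
    yy, xx = np.ogrid[:size, :size]
    mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
    img[mask] = color(z_s)
    return img
```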
6 THEORETICAL ACCOUNT

In our empirical investigations, we quantified the extent to which a trained model deviates from a statistically optimal classifier in its reliance on the more-available feature, using a measure based on the features underlying probe-instance classifications. Here, we take an alternative approach and study the sensitivity of a Neural Tangent Kernel (NTK) (Jacot et al., 2018) to the availability of a feature. The resulting form presents a crisp picture of how predictivity and availability interact. In particular, we prove that availability bias is absent in linear networks but present in ReLU networks. Proofs of the theorems in this section are given in the Supplementary Materials.

For tractability of the analysis, we make a few simplifying assumptions. We focus on 2-layer fully connected architectures, for which the kernel of ReLU networks admits a simple closed form. In addition, to compute the integrals that arise in the analysis, we resort to an asymptotic approximation that assumes the covariance matrix is small. Specifically, we represent the covariance matrix as s [[1, σ12], [σ12, 1]], where the scale parameter s > 0 is taken to be small. Finally, to handle the analysis for the ReLU kernel, we use a polynomial approximation.

Kernel spectrum. Consider a two-layer network whose first layer has a possibly nonlinear activation function. When the width of this model gets large, learning can be approximated by a kernel regression problem with a kernel function k(·,·) that depends on the architecture and the activation function. Given a distribution over the input data p(x), we define a (linear) kernel operator as one that acts on a function f to produce another function g as in g(x) = ∫_{R^n} k(x, z) f(z) p(z) dz. This allows us to define an eigenfunction φ of the kernel operator as one that satisfies

    λ φ(x) = ∫_{R^n} k(x, z) φ(z) p(z) dz .   (1)

The value λ is the eigenvalue of that eigenfunction when φ is normalized so that ∫_{R^n} φ²(x) p(x) dx = 1.

Form of p(z). Recall that in our generative dataset framework, a pair of latent features zc and zs is embedded into a high-dimensional space via x_{d×1} = αs zs ws + αc zc wc = U_{d×2} A_{2×2} z_{2×1}. With this expression, we switch notation so that the wi are collected into U and the αi into A; thus A is a diagonal matrix with positive diagonal entries, and the columns of U are (approximately) orthonormal. Henceforth, we also refer to features with indices 1 and 2 instead of s and c. An implication of orthonormal columns of U is that the dot product of any two input vectors x and x′ is independent of U, i.e., ⟨x, x′⟩ = ⟨Az, Az′⟩. Consequently, we can compute dot products in the original 2-dimensional space instead of in the d-dimensional embedding space. Moreover, we will see that the kernel function k(x1, x2) in the two cases we study (ReLU and linear) depends on its inputs only through the dot product ⟨x1, x2⟩ and the norms ‖x1‖ and ‖x2‖ (self dot products). Thus the kernel is entirely invariant to U, and without loss of generality we can take the input to the model to be x = Az. Therefore,

    x₊ ∼ N( [a1 µ1, a2 µ2]ᵀ , s [[a1², a1 a2 σ12], [a1 a2 σ12, a2²]] ),
    x₋ ∼ N( −[a1 µ1, a2 µ2]ᵀ , s [[a1², a1 a2 σ12], [a1 a2 σ12, a2²]] ),

where x₊ and x₋ denote inputs drawn from the positive and negative class, respectively.

Linear kernel function. If the activation function is linear, then the kernel function is simply the standard dot product,

    k(x1, x2) ≜ ⟨x1, x2⟩ .   (2)

The following theorem provides details about the spectrum of this kernel.

Theorem 1. Consider the kernel function k(x1, x2) ≜ ⟨x1, x2⟩. The kernel operator associated with k under the data distribution p specified above has only one non-zero eigenvalue, λ = ‖Aµ‖², and its eigenfunction has the form φ(x) = ⟨Aµ, x⟩ / ‖Aµ‖².
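Theorem 1 can be checked numerically: the non-zero spectrum of the empirical kernel operator (the Gram matrix scaled by 1/n) should concentrate on the single value ‖Aµ‖² as the latent covariance scale s shrinks. A small sketch with illustrative parameter values of our own choosing:

```python
import numpy as np

a = np.array([4.0, 1.0])            # availabilities (a1, a2) = diag(A)
mu = np.array([1.0, 1.5])           # class-conditional latent means
s, sigma12, n = 1e-3, 0.6, 20000

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=n)
cov = s * np.array([[1.0, sigma12], [sigma12, 1.0]])
z = y[:, None] * mu + rng.multivariate_normal(np.zeros(2), cov, size=n)
x = z * a                           # x = A z

# For the linear kernel k(x, z) = <x, z>, the non-zero eigenvalues of the Gram
# matrix (x @ x.T) / n coincide with those of the 2 x 2 second-moment matrix.
second_moment = x.T @ x / n
print(np.sort(np.linalg.eigvalsh(second_moment))[::-1])   # ~ [||A mu||^2, O(s)]
print(np.sum((a * mu) ** 2))                               # ||A mu||^2 = 18.25
```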
ReLU kernel function. If the activation function is a ReLU, then the kernel function is known to have the following form (Cho & Saul, 2009; Bietti & Bach, 2020):

    k(x1, x2) ≜ ‖x1‖ ‖x2‖ h( ⟨ x1/‖x1‖ , x2/‖x2‖ ⟩ ),   with   h(u) = (1/π) [ u (π − arccos(u)) + √(1 − u²) ] .   (3)

In order to obtain an analytical form for the eigenfunctions of the kernel under the considered data distribution, we resort to a quadratic approximation of h by ĥ, with ĥ(u) ≜ (815/3072)(1 + u)². This approximation enjoys certain optimality criteria. Derivation details of this quadratic form are provided in the Supplementary Materials, as is a plot showing the quality of the approximation. We now focus on spectral characteristics of the approximate ReLU kernel. Replacing h in the kernel function k of Equation 3 with ĥ, we obtain an alternative kernel function that approximates the ReLU kernel:

    k(x, z) ≈ ‖x‖ ‖z‖ ĥ( ⟨ x/‖x‖ , z/‖z‖ ⟩ ) = a ‖x‖ ‖z‖ ( 1 + ⟨ x/‖x‖ , z/‖z‖ ⟩ )² ,   (4)

where a = 815/3072. The following theorem characterizes the spectrum of this kernel.

Theorem 2. Consider the kernel function k(x, z) ≜ a ‖x‖ ‖z‖ (1 + ⟨x/‖x‖, z/‖z‖⟩)². The kernel operator associated with k under the data distribution p specified above has only two non-zero eigenvalues, λ1 = λ2 = 2a ‖Aµ‖², with associated eigenfunctions

    φ1(x) = (‖x‖ / ‖Aµ‖) · ( 1 + ⟨ x/‖x‖ , Aµ/‖Aµ‖ ⟩² ) / 2   and   φ2(x) = ⟨ x/‖Aµ‖ , Aµ/‖Aµ‖ ⟩ .

6.1 SENSITIVITY ANALYSIS

We now assign a target value y(x) to each input point x. By expressing the kernel operator using its eigenfunctions, it is easy to show that the learned function has the form f(x) = Σ_i φ_i(x) ∫_{R^n} φ_i(z) y(z) p(z) dz, where i runs over the non-zero eigenvalues λ_i of the kernel operator (see also Appendix G). We now restrict our focus to a binary classification scenario, y : X → {0, 1}. More precisely, we replace p with our Gaussian mixture and set y(x) to 1 or 0 depending on whether x is drawn from the positive or negative component of the mixture. Then

    f(x) = Σ_i φ_i(x) [ ½ ∫_{R²} φ_i(z) · (1) · p₊(z) dz + ½ ∫_{R²} φ_i(z) · (0) · p₋(z) dz ] = ½ Σ_i φ_i(x) φ_i(Aµ) ,

where p₊ and p₋ denote the class-conditional normal distributions. Now let us tweak the availability of each feature by a diagonal scaling matrix B ≜ diag(b), where b ≜ [b1, b2]. Denote the modified prediction function by g_B; that is, g_B(x) ≜ ½ Σ_i φ_i(x) φ_i(BAµ). The alignment between f and g_B is their normalized dot product,

    γ ≜ ∫_{R²} [ f(x) / √( ∫_{R²} f²(t) p(t) dt ) ] · [ g_B(x) / √( ∫_{R²} g_B²(t) p(t) dt ) ] p(x) dx .   (5)

We define the sensitivity of the alignment to feature i, for i = 1, 2, as the m-th order derivative of γ with respect to bi, evaluated at b = 1 (which corresponds to the identity scale factor B = I):

    ζi ≜ ∂^m γ / ∂bi^m |_{b = 1} .

ζi indicates how much the model relies on feature i to make a prediction. In particular, if whenever feature i is more available than feature j (i.e., ai > aj) we also see that the model is more sensitive to feature i than to feature j (i.e., |ζi| > |ζj|), the model is biased toward the more-available feature. Conversely, when ai < aj but |ζi| > |ζj|, the bias is toward the less-available feature. One can express the presence of these biases more concisely as sign( (|ζ1| − |ζ2|)(a1 − a2) ), where values of +1 and −1 indicate bias toward the more- and less-available feature, respectively. To verify either of these two cases via sensitivity, we must choose the lowest order m that yields a non-zero |ζ1| − |ζ2|. On the other hand, if |ζ1| − |ζ2| = 0 for any choice of a1 and a2, the model is unbiased with respect to feature availability. The following theorems show that linear networks are unbiased with respect to feature availability, while ReLU networks are biased toward more-available features.

Theorem 3. In a linear network, |ζ1| − |ζ2| is always zero for any choice of m ≥ 1 and regardless of the values of a1 and a2.
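Before turning to the ReLU result, the quadratic approximation ĥ introduced above can be checked numerically: the least-squares coefficient of (1 + u)² fit to h over u ∈ [−1, 1] should recover 815/3072 (illustrative code):

```python
import numpy as np
from scipy.integrate import quad

def h(u):
    """Arc-cosine kernel profile for a ReLU layer (Cho & Saul, 2009)."""
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u**2)) / np.pi

# Minimizing the L2 error of a * (1 + u)^2 against h over [-1, 1] gives
# a* = <h, (1+u)^2> / <(1+u)^2, (1+u)^2>.
num, _ = quad(lambda u: h(u) * (1 + u) ** 2, -1, 1)
den, _ = quad(lambda u: (1 + u) ** 4, -1, 1)
a_star = num / den
print(a_star, 815 / 3072)           # both approximately 0.2653

u = np.linspace(-1, 1, 2001)
print(np.max(np.abs(h(u) - a_star * (1 + u) ** 2)))   # worst-case approximation gap
```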
Figure 5: Plot of sign( (|ζ1| − |ζ2|)(a1 − a2) ) for a ReLU network as a function of a1 and a2. We fix µ1 = 1 and vary µ2 ∈ {0.1, 0.5, 1, 2, 10}. Yellow and blue correspond to values +1 and −1, respectively.

Theorem 4. In a ReLU network, |ζ1| − |ζ2| = 0 for any 1 ≤ m ≤ 8. The first non-zero |ζ1| − |ζ2| occurs at m = 9 and has the following form:

    |ζ1| − |ζ2| = ( 5670 / ‖Aµ‖¹⁸ ) (a1 a2 µ1 µ2)⁸ (a1² µ1² − a2² µ2²) .   (6)

A straightforward consequence of this theorem is that

    sign( (|ζ1| − |ζ2|)(a1 − a2) ) = sign( (5670 / ‖Aµ‖¹⁸)(a1 a2 µ1 µ2)⁸ (a1² µ1² − a2² µ2²)(a1 − a2) ) = sign( (a1² µ1² − a2² µ2²)(a1 − a2) ) .   (7)

Recall from Section 3 that feature predictivity ρi is related to µi via ρi = ½ (1 + erf(µi/√2)). Observe that ρi is an increasing function of µi; thus, a bigger µi implies a larger ρi. Together with (7), this provides a crisp interpretation of the trade-off between predictivity and availability. For example, when the latent features are equally predictive (µ1 = µ2), the sign is +1 for any (non-negative) choice of availability parameters a1 and a2. Thus, for equally predictive features, ReLU networks are always biased toward the more available feature. Figure 5 shows further examples at various levels of predictivity. The coloring indicates the direction of the availability bias (only on the boundaries between the blue and yellow regions is there no availability bias).
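The sign test in Equation 7 is easy to evaluate directly; the following sketch reproduces the logic behind Figure 5 over a grid of availabilities (the grid range and resolution are our own choices):

```python
import numpy as np

def bias_direction(a1, a2, mu1, mu2):
    """sign((|zeta1| - |zeta2|)(a1 - a2)) from Eq. 7: +1 means bias toward the
    more-available feature, -1 toward the less-available one, 0 on the boundary."""
    return np.sign((a1**2 * mu1**2 - a2**2 * mu2**2) * (a1 - a2))

mu1 = 1.0
grid = np.linspace(0.05, 4.0, 200)
A1, A2 = np.meshgrid(grid, grid)
for mu2 in [0.1, 0.5, 1.0, 2.0, 10.0]:
    S = bias_direction(A1, A2, mu1, mu2)
    print(f"mu2 = {mu2:>4}: {100 * np.mean(S > 0):.0f}% of the (a1, a2) grid is "
          f"biased toward the more-available feature")
```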
7 FEATURE AVAILABILITY IN NATURALISTIC DATASETS

We have seen that models trained on controlled, synthetic data are influenced by availability to learn shortcuts when a nonlinear activation function is present. How do feature predictivity and availability in naturalistic datasets interact to shape the behavior of models used in practice? To test this, we train ResNet-18 models (He et al., 2016) to classify naturalistic images by a binary core feature irrespective of the value of a non-core feature. We construct two datasets by sampling images from Waterbirds (Sagawa et al., 2020a) (core: Bird; non-core: Background) and CelebA (Liu et al., 2015) (core: Attractive; non-core: Smiling). See C.5 for additional details.

Sensitivity to the non-core feature beyond a statistical account. Figures 6A and B show that, for both datasets, as the training-set predictivity of the non-core feature increases, model accuracy increases dramatically for congruent probes and decreases for incongruent ones. In contrast, a Bayes optimal classifier is far less sensitive to the predictivity of the non-core feature. Thus, models are more influenced by the non-core feature than we would expect based solely on predictivity. This heightened sensitivity implies that models prioritize the non-core feature more than they should, given its predictive value. Thus, in the absence of predictivity as an explanatory factor, we conclude that the non-core feature is more available than the core feature.

Availability manipulations. Motivated by the result that Background is more available to models than the core Bird feature (Figure 6A), we test whether specific background manipulations (hypothesized types of availability) shift model feature reliance. As shown in Figure 6C, we find that Bird accuracy increases as we reduce the availability of the image background by manipulating its spatial extent (Bird size, Background patch removal) or removing background color (Color), implicating these as among the properties that models latch onto in preferring image backgrounds (validated with explainability analyses in C.5). Experiments in Figure B.9 show that this phenomenon also occurs in ImageNet-pretrained models; background noise and spatial-frequency manipulations also drive feature reliance.

Figure 6: Availability as well as predictivity determines which features image classifiers rely on. A: Models (ResNet-18) were trained to classify Birds from images sampled from Waterbirds. We varied Background (non-core) predictivity while keeping Bird (core) predictivity fixed (= 0.99), and show Bird classification accuracy for two types of probes: congruent (blue; core and non-core features support the same label) and incongruent (orange; core and non-core features support opposing labels). As Background predictivity increases, the gap in accuracy between incongruent and congruent probes also increases. The model is more sensitive to the non-core feature than expected of a Bayes-optimal classifier (optimal): predictivity alone does not explain the model's behavior. B: Similar to the Waterbirds dataset, models trained to classify images from CelebA as Attractive exhibit an effect of Smiling availability. C: Bird accuracy for incongruent Waterbirds probes is influenced by both Background predictivity (ρ) and availability when we manipulate the latter explicitly (see also B.9).

8 CONCLUSION

Shortcut learning is of both scientific and practical interest given its implications for how models generalize. Why do some features become shortcut features? Here, we introduced the notion of availability and conducted studies systematically varying it. We proposed a generative framework that allows for independent manipulation of the predictivity and availability of latent features. Testing hypotheses about the contributions of each to model behavior, we find that for both vector and image classification tasks, deep nonlinear models exhibit a shortcut bias, deviating from the statistically optimal classifier in their feature reliance. We provided a theoretical account which indicates the inevitability of a shortcut bias for a single-hidden-layer nonlinear (ReLU) MLP but not for a linear one. The theory specifies the exact interaction between predictivity and availability and, consistent with our empirical studies, predicts that availability can trump predictivity. In naturalistic datasets, vision architectures used in practice rely on non-core features more than they should on statistical grounds alone. Connecting with prior work identifying availability biases for texture and image background, we explicitly manipulated background properties such as spatial extent, color, and spatial frequency and found that they influence a model's propensity to learn shortcuts.

Taken together, our empirical and theoretical findings highlight that models used in practice are prone to shortcut learning, and that to understand model behavior, one must consider the contributions of both feature predictivity and availability. Future work will study shortcut features in additional domains and develop methods for automatically discovering further shortcut features that drive model behavior. The generative framework we have laid out will support a systematic investigation of architectural manipulations that may influence shortcut learning.

ACKNOWLEDGMENTS

We thank Pieter-Jan Kindermans for feedback on the manuscript, and Jaehoon Lee, Lechao Xiao, Robert Geirhos, Olivia Wiles, Isabelle Guyon, and Roman Novak for interesting discussions.
Amirhossein Ahmadian and Fredrik Lindsten. Enhancing representation learning with deep classifiers in presence of shortcut. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1 5. IEEE, 2023. Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019. Nicholas Baker, Hongjing Lu, Gennady Erlikhman, and Philip J Kellman. Deep convolutional networks do not classify based on global object shape. PLo S computational biology, 14(12): e1006613, 2018. Nicholas Baker, Hongjing Lu, Gennady Erlikhman, and Philip J Kellman. Local features and global shape information in object classification by deep convolutional neural networks. Vision research, 172:46 61, 2020. Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European conference on computer vision (ECCV), pp. 456 473, 2018. Alberto Bietti and Francis R. Bach. Deep equals shallow for relu networks in kernel regimes. Ar Xiv, abs/2009.14397, 2020. Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp. 77 91. PMLR, 2018. Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta (eds.), Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009. Thomas Fel, Remi Cadene, Mathieu Chalvidal, Matthieu Cord, David Vigouroux, and Thomas Serre. Look at the variance! efficient black-box explanations with sobol-based sensitivity analysis. In Advances in Neural Information Processing Systems (Neur IPS), 2021. Thomas Fel, Victor Boutin, Mazda Moayeri, Rémi Cadène, Louis Bethune, Mathieu Chalvidal, Thomas Serre, et al. A holistic approach to unifying automatic concept extraction and concept importance estimation. In Advances in Neural Information Processing Systems (Neur IPS), 2023. Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. ar Xiv preprint ar Xiv:1811.12231, 2018a. Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in neural information processing systems, 31, 2018b. Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665 673, 2020. Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. Annotation artifacts in natural language inference data. ar Xiv preprint ar Xiv:1803.02324, 2018. Published as a conference paper at ICLR 2024 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Katherine Hermann and Andrew Lampinen. What shapes feature representations? exploring datasets, architectures, and training. 
Advances in Neural Information Processing Systems, 33:9995 10006, 2020. Katherine Hermann, Ting Chen, and Simon Kornblith. The origins and prevalence of texture bias in convolutional neural networks. Advances in Neural Information Processing Systems, 33: 19000 19015, 2020. Hendrik Heuer, Christof Monz, and Arnold WM Smeulders. Generating captions without looking beyond objects. ar Xiv preprint ar Xiv:1610.03708, 2016. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018. Jason Jo and Yoshua Bengio. Measuring the tendency of cnns to learn surface statistical regularities. ar Xiv preprint ar Xiv:1711.11561, 2017. Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. ar Xiv preprint ar Xiv:2204.02937, 2022. Tyler La Bonte, Vidya Muthukumar, and Abhishek Kumar. Towards last-layer retraining for group robustness with fewer annotations. ar Xiv preprint ar Xiv:2309.08534, 2023. Barbara Landau, Linda B Smith, and Susan S Jones. The importance of shape in early lexical learning. Cognitive development, 3(3):299 321, 1988. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. Gaurav Malhotra, Benjamin D Evans, and Jeffrey S Bowers. Hiding a plane with a pixel: examining shape-bias in cnns and the benefit of building in biological constraints. Vision Research, 174: 57 68, 2020. R Thomas Mc Coy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. ar Xiv preprint ar Xiv:1902.01007, 2019. Matthias Minderer, Olivier Bachem, Neil Houlsby, and Michael Tschannen. Automatic shortcut removal for self-supervised representation learning. In International Conference on Machine Learning, pp. 6927 6937. PMLR, 2020. Mazda Moayeri, Phillip Pope, Yogesh Balaji, and Soheil Feizi. A comprehensive study of image classification model sensitivity to foregrounds, backgrounds, and visual attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. Vaishnavh Nagarajan, Anders Andreassen, and Behnam Neyshabur. Understanding the failure modes of out-of-distribution generalization. ar Xiv preprint ar Xiv:2010.15775, 2020. Vaishnavh Nagarajan, Anders Andreassen, and Behnam Neyshabur. Understanding the failure modes of out-of-distribution generalization. In International Conference on Learning Representations, 2021. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. ar Xiv preprint ar Xiv:1412.6614, 2014. Guillermo Valle Pérez, Chico Q Camargo, and Ard A Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. stat, 1050:23, 2018. Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In Proceedings of the British Machine Vision Conference (BMVC), 2018. Published as a conference paper at ICLR 2024 Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems, 34:1256 1272, 2021. 
Aahlad Puli, Lily Zhang, Yoav Wald, and Rajesh Ranganath. Don t blame dataset shift! shortcut learning due to gradients and cross entropy. ar Xiv preprint ar Xiv:2308.12553, 2023. Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301 5310. PMLR, 2019. Joshua Robinson, Li Sun, Ke Yu, Kayhan Batmanghelich, Stefanie Jegelka, and Suvrit Sra. Can contrastive learning avoid shortcut solutions? Advances in neural information processing systems, 34:4974 4986, 2021. Chaitanya K Ryali, David J Schwab, and Ari S Morcos. Characterizing and improving the robustness of self-supervised learning through background augmentations. ar Xiv preprint ar Xiv:2103.12719, 2021. Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2020a. Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning, pp. 8346 8356. PMLR, 2020b. Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33:9573 9585, 2020. Sahil Singla and Soheil Feizi. Salient imagenet: How to discover spurious features in deep learning? In Proceedings of the International Conference on Learning Representations (ICLR), 2022. Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. In Workshop on Visualization for Deep Learning, Proceedings of the International Conference on Machine Learning (ICML), 2017. Ajay Subramanian, Elena Sizikova, Najib J Majaj, and Denis G Pelli. Spatial-frequency channels, shape bias, and adversarial robustness. ar Xiv preprint ar Xiv:2309.13190, 2023. Remi Tachet, Mohammad Pezeshki, Samira Shabanian, Aaron Courville, and Yoshua Bengio. On the learning dynamics of deep neural networks. ar Xiv preprint ar Xiv:1809.06848, 2018. Alexa R Tartaglini, Wai Keen Vong, and Brenden M Lake. A developmentally-inspired examination of shape versus texture bias in machines. ar Xiv preprint ar Xiv:2202.08340, 2022. Damien Teney, Ehsan Abbasnejad, Simon Lucey, and Anton Van den Hengel. Evading the simplicity bias: Training a diverse set of models discovers solutions with superior ood generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16761 16772, 2022. Rishabh Tiwari and Pradeep Shenoy. Sifer: Overcoming simplicity bias in deep networks using a feature sieve. ar Xiv preprint ar Xiv:2301.13293, 2023. Yusuke Tsuzuku and Issei Sato. On the structural sensitivity of deep convolutional networks to the directions of fourier basis functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 51 60, 2019. Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. Technical report, California Institute of Technology, 2011. Max Wolff and Stuart Wolff. Signal strength and noise drive feature preference in cnn image classifiers. ar Xiv preprint ar Xiv:2201.08893, 2022. Published as a conference paper at ICLR 2024 Kai Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. 
Noise or signal: The role of image backgrounds in object recognition. ar Xiv preprint ar Xiv:2006.09994, 2020. Dong Yin, Raphael Gontijo Lopes, Jon Shlens, Ekin Dogus Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. Advances in Neural Information Processing Systems, 32, 2019. Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2014. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. ar Xiv preprint ar Xiv:1707.09457, 2017. Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452 1464, 2017. Published as a conference paper at ICLR 2024 Supplementary Material for On the Foundations of Shortcut Learning A BROADER IMPACTS Our work aims to achieve a principled and fundamental understanding of the mechanisms of deep learning when it succeeds, when it fails, and why it fails. As the goal is to better understand existing algorithms, there is no downside risk of the research. The upside benefit concerns cases of shortcut learning which have been identified as socially problematic and where fairness issues are at play, for example, the use of shortcut features such as gender or ethnicity that might be spuriously correlated with a core feature (e.g., likelihood of default). B SUPPLEMENTARY FIGURES Methods details accompanying these figures appear in Section C. Figure B.1: Bias as a function of zs availability and predictivity for a single-hidden-layer MLP with a tanh hidden-layer activation function. Settings match those described in Figure 3C. Figure B.2: The zs-reliance (Sections 3 and C.1) of hypothetical classifiers which rely exclusively on zs (top left) or zc (top right), or rely equally on both features (bottom row). Color indicates the predicted label (red: pos, blue: neg) for probe items plotted by base probe (zs, zc) values Published as a conference paper at ICLR 2024 Figure B.3: Supplementary data for the results presented in Figure 2. TOP ROW: Model performance on the train (left) and validation (right) sets. BOTTOM ROW: zs-reliance for the optimal classifier (left) and models (right). Figure B.4: The results shown in Figure 2 are not specific to choice of σsc. Results for the same experiment using datasets with σsc = 0.4 (A) and σsc = 0.8 (B). Figure B.5: Supplementary data for the results presented in Figure 3. Model performance on the train (left) and validation (right) sets. Published as a conference paper at ICLR 2024 Figure B.6: Supplementary data for image experiments. Model performance underlying A: Experiments with the Waterbirds datasets presented in Figure 6 and described in C.5, and B: Experiments with image instantiations of the base dataset manipulating pixel footprint presented in Figure 4 B (left) and C (right). Figure B.7: Models prefer a shallowly embedded nonlinear feature (zs) to a deeply embedded nonlinear one (zc). A: The color of each cell of the heatmap indicates the mean bias of models as a function of the degree of nesting of zc (ηc), and the predictivity of zs (ρs). zs is always shallowly embedded (ηs = 0), and the predictivity of zc is held fixed (ρc = 0.9). See C.3 for additional details. B: Model performance. 
C:zs-reliance of the optimal classifier (left) versus the model (right). Published as a conference paper at ICLR 2024 Figure B.8: Illustration of studied Background availability manipulations. For each of the six perturbations influencing the availability of both the background and the bird (in the case of bird scale), we provide visual representations of the potential perturbations applied to the training dataset. For instance, the first row demonstrates variations in the dataset resulting from changes in bird size, ranging from 5% to 95%. Published as a conference paper at ICLR 2024 Figure B.9: Manipulations of image backgrounds show that availability as well as predictivity determines feature reliance in vision models, including in models pretrained on Image Net. Expanding on the results shown in Figure 6, we plot Bird accuracy for incongruent probes (opposing Bird and Background labels) as a function of Background predictivity (ρ) and hypothesized types of availability. TOP ROW: Res Net18 models. BOTTOM ROW: IN Res Net50 models (see Section C.5 for additional details). Published as a conference paper at ICLR 2024 Figure B.10: Explainability sanity check: Attribution maps corroborate Background availability study. In Figures 6 and B.9, we saw that a Bird size manipulation affects models tendency to classify incongruent probes consistent with their Bird labels. Here, we see that attribution maps for probe images (top row) reflect a difference in focal point for models trained on images with Bird size = 40% (preference for image background) versus Bird size = 80% (preference for foreground bird). See C.5 for additional details. Published as a conference paper at ICLR 2024 C SUPPLEMENTARY METHODS AND RESULTS C.1 ASSESSING BIAS In Section 3, we described how we determine reliance M, a score which quantifies the extent to which a model (optimal classifier or trained model) uses feature zs as the basis for its classification decisions. In Figure B.2, for intuition, we show the zs reliance of hypothetical classifiers. C.2 SYNTHETIC DATA Optimal classifier. To obtain optimal classifications, we use Linear Discriminant Analysis (LDA, as implemented in Sklearn with the least-squares solver). In all but the experiments on the effect of nesting as part of the generative process (C.3 and Figure B.7), the optimal classifier was fit to and probed with the same embedded base inputs used to train the corresponding model in a given experiment. In the nesting experiments, LDA was fit to and probed with base inputs directly. C.3 VECTOR EXPERIMENTS Default model architecture and training procedure. Unless otherwise described, we train a multilayer perceptron (MLP, depth = 8, width = 128, hidden activation function = Re LU) with MSE loss for 100 epochs using an SGD optimizer with batch size = 64 and learning rate = 1e 02. We use Glorot Normal weight initialization. Models prefer the more-available feature, even when it is less predictive. In these experiments, given a base dataset with (ρs, ρc), for each availability setting (αs, αc), we manipulate the availability of the dataset, sample a random intialization seed, train the default model architecture, and evaluate the bias of the model. We repeat this process for each of 10 runs, computing the biases for individual (model, optimal classifier) pairs, and then averaging the results. Effect of nesting on availability. 
Figure 1B illustrates how our generative procedure for synthetic datasets supports nested embedding of latent features, as described in Section 3. In this experiment, we test whether a model prefers a shallowly embedded nonlinear feature (zs) to a deeply embedded nonlinear one (zc) when the amplitude of both features is matched (αs = αc = 0.1) and the nonlinearity is a scaled tanh (tanh(λx), with λ = 100). After embedding each feature as ei = αiwizi, we apply the nonlinearity. For a nesting factor ηi > 0, we pass ei through a fully connected network (depth = ηi, width = d, no bias weights) with scaled tanh activations on the hidden and output layers. We initialize weight matrices to be random special orthogonal matrices (implemented as scipy.stats.special_ortho_group). In Figure B.7, we show bias as a function of ηc and ρs; zs is fixed to be shallowly embedded with ηs = 0 and has ρc = 0.9. We report the results as the mean across 3 runs. C.4 PIXEL FOOTPRINT EXPERIMENTS Model architecture and training procedure. In these experiments, we train a randomly initialized Res Net18 architecture with MSE loss using an Adam optimizer with batch size = 64 and learning rate = 1e 02, taking the best (defined on validation accuracy) across 100 epochs of training. For each (ρs, ρc) predictivity setting, we repeat this process for 10 (sampled dataset, random weight initialization) runs. Analyses. In Figure 4, we report results for experiments in which we use a base zs pixel footprint of 400 px, and manipulate zs pixel footprint to be 1 5 larger (αc = 1, αs [1, 5]). To determine the bias of a model, we compute the zs-reliance of the trained model given image instantiations of the base probe items, and compare to the zs-reliance of the optimal classifier (LDA) fit to and probed using vector instantiations of the same base dataset with the same αs and αc. We repeat this process for 5 runs, computing bias on a per-run basis as in Section C.3. C.5 FEATURE AVAILABILITY IN NATURALISTIC DATASETS Datasets. Waterbirds. Waterbirds images (Sagawa et al., 2020a) combine birds taken from the Caltech-UCSD Birds-200-2011 dataset (Wah et al., 2011) and backgrounds from the Places dataset Published as a conference paper at ICLR 2024 (Zhou et al., 2017). To construct the datasets used in Figure 6A, we sample images from a base Waterbirds dataset generated using code by (Sagawa et al., 2020a) (github.com/kohpangwei/ group_DRO) with val frac = 0.2 and confounder strength = 0.5, yielding sets of 5694 train images (224 224), 300 validation images, and 5794 test images. We then subsample these sets to 1200 train, 90 validation ( 2 sets), and 1000 probe images ( 2 sets), respectively, such that the train and validation sets instantiate target feature predictivities, and the probe sets contain congruent or incongruent feature-values, as described in Section 7. To construct the datasets used in Figures 6C and B.9, for a given predictivity setting, we subsample 8 bird types per category (land, water) and cross them with land and water backgrounds randomly sampled from the standard Waterbirds background classes, to generate class-balanced train sets ranging in size from 800 1200, and incongruent probe sets of size 400. Celeb A. The Celeb A (Liu et al., 2015) dataset contains images of celebrity faces paired with a variety of binary attribute labels. In our experiments, we cast the Attractive label as the core feature, and the Smiling label as the non-core feature. 
To construct the datasets used in Figure 6B, we sample images consistent with the target predictivities. Waterbirds availability manipulations. In the datasets depicted in Figures 6C and B.9, we aim to manipulate various aspects of background availability. To do so, we use CUB-200 masks to exclusively modify the background, and apply five distinct types of perturbations: altering bird size, removing background patches, changing background colors, applying low-pass filtering, adding Perlin noise, and adding white noise. These perturbations test hypotheses about specific image properties than may make an image background more available to a model than the foreground object. We vary Background predicitivity while holding bird predictivity fixed (= 0.99). We report results for at least 500 runs (model training) for each perturbation type. Figure B.8 shows sample image to which the perturbations have been applied. In detail, the perturbations are as follow: Bird size: We manipulate the pixel footprint of the bird by scaling the square mask. Note that an increase in bird size corresponds to a decrease in background pixels in the image. Background patch removal: This manipulation removes patches of the image background while preserving the foreground bird. To apply the manipulation, we partition the image into a 3 3 checkerboard, position the bird in the center, and then remove x in [1, 8] background cells. Color: To assess the influence of color information, we use the original image with its colored background ( Color ) or convert the background to grayscale ( Grayscale ). Low-pass filtering: We apply a low-pass filter to the frequency representation of the background, with the cutoff frequency varying from 224 (original image) to 2 (retaining only components at 2 cycles per image). Uniform noise: Uniform noise is introduced to the image with an amplitude ranging from 0 (original image) to 255. It is essential to note that image values are clipped within the (0, 255) range before preprocessing. Perlin noise: We apply Perlin noise, which possesses spatial coherence, at various intensity levels along with uniform noise. We employ 6-octave Perlin noise to match the frequency distribution of natural backgrounds (wavelengths 2, 4, 8, 16, 32, 64, and 128). Model architecture and training procedure. For the experiments shown in Figure 6A and B, we normalize all images by the Image Net train-set statistics (RGB mean = [0.485, 0.456, 0.406]), std = [0.229, 0.224, 0.225]), and then use the same settings as described in 5. In the experiments shown in Figures 6C and B.9, we preprocess images by normalizing to be in [ 1, 1], apply random crops of size 200 | 200, resize to 224 | 224, and randomly flip over the horizontal axis with p = 0.5 We train models with MSE loss using an Adam optimizer with batch size = 32, cosine decay learning rate schedule (linear warmup, initial learning rate = 1e 03), and weight decay = 1e 05. For the Res Net18 experiments in Figures 6C and B.9, we train a randomly initialized Res Net18 trained for 30 epochs. For the Res Net50 experiments in Figure B.9, we train the randomly initialized readout layer of a frozen, Image Net-pretrained Res Net50 (IN Res Net50) for 15 epochs. Published as a conference paper at ICLR 2024 Results: Vision models, including Image Net-pretrained ones, are sensitive to Background availability manipulations. Figure B.9 displays the results of the six manipulations which we hypothesized affect Background availability. 
Results: Vision models, including ImageNet-pretrained ones, are sensitive to Background availability manipulations. Figure B.9 displays the results of the six manipulations that we hypothesized affect Background availability. The top row shows accuracy for ResNet18 models trained from scratch, while the bottom row shows the results for IN-ResNet50. Our findings reveal several key observations. First, as the size of the bird increases (Col. 1), and when the background becomes noisier (Cols. 4 and 6), is filtered of its high frequencies (Col. 5), or is simply masked (Col. 2), the model tends to rely more heavily on the bird itself for classification. Conversely, when the background is clear and informative (available), the model uses background features to predict the class. Second, the spuriousness of the background, as quantified by ρ, also has a discernible effect on the model's behavior; however, it only partially accounts for the observed variation, demonstrating that availability factors are at play. Last, these results indicate that pretraining has a beneficial impact and can modulate feature availability. This effect may be attributed to invariances learned during pretraining, which enable the model to better adapt to different availability conditions. Our spatial-frequency experiments complement existing and concurrent work studying what models learn as a function of spatial-frequency manipulations at training time (Jo & Bengio, 2017; Yin et al., 2019; Tsuzuku & Sato, 2019; Hermann et al., 2020; Subramanian et al., 2023) or test time (Geirhos et al., 2018b).

Results: Explainability analyses corroborate the Background availability study. The experiments presented in Figures 6C and B.9 showed that the degree of Background availability influences model classifications of incongruent probes. Here, we use attribution methods to verify that when the experimental results indicate that a model predominantly relies on the image background, or on the foreground bird, its focal point reflects this. Figure B.10 shows the attribution maps of two ResNet18 models trained from scratch, one using a bird size of 40% and the other a bird size of 80%. For the same set of probe images, the model trained on larger birds exhibits a shift in classification strategy, seemingly placing its focal point on the birds, whereas the model trained on smaller birds relies mainly on the background. The methods employed encompass a combination of black-box techniques (RISE (Petsiuk et al., 2018) and Sobol indices (Fel et al., 2021)) and gradient-based approaches (Saliency (Zeiler & Fergus, 2014) and SmoothGrad (Smilkov et al., 2017)).

D QUADRATIC APPROXIMATION OF h

Since u is the dot product of two normalized vectors, its range is limited to −1 ≤ u ≤ 1. Observe that the derivative of h has the form
h'(u) = 1 − (1/π) arccos(u) .  (8)
The only place within u ∈ [−1, 1] at which h' vanishes is u = −1, suggesting that the critical point of our quadratic must lie at this point. Furthermore, observe that h(−1) = 0. Forcing these two constraints onto our quadratic leaves us with the form
ĥ(u) = a (1 + u)² ,  (9)
where a is a parameter to be determined. Our aim is to find a so that ĥ and h look similar over the entire domain u ∈ [−1, 1]. We choose the ℓ2 error between the two functions, defined as
e_a = ∫_{−1}^{1} ( h(u) − ĥ(u) )² du .  (10)

Figure D.1: Plots of h (blue) and ĥ (orange) within u ∈ [−1, 1].

Replacing the definitions of h and ĥ into (h(u) − ĥ(u))², one can¹ obtain
e_a = (32/5) a² − (163/48) a + 1/3 + 32/(27π²) .  (18)
Since e_a is a convex quadratic in a, its minimizer can be determined by zero-crossing the derivative of e_a with respect to a. This yields
a* = arg min_a e_a = 815/3072 .  (19)
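This value is easy to check numerically; the snippet below is our own addition, using the same definition of h as the simulation script in Appendix I. Because e_a is quadratic in a, its minimizer is simply the ratio ∫ h(u)(1+u)² du / ∫ (1+u)⁴ du.

import numpy as np

u = np.linspace(-1.0, 1.0, 200001)
du = u[1] - u[0]
h = (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u * u)) / np.pi
num = np.sum(h * (1 + u) ** 2) * du   # ~ integral of h(u)(1+u)^2 over [-1, 1]
den = np.sum((1 + u) ** 4) * du       # ~ integral of (1+u)^4 over [-1, 1], i.e. 32/5
print(num / den, 815 / 3072)          # both are ~0.2653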
The minimizer leads to the claimed quadratic form,
ĥ(u) ≈ (815/3072) (1 + u)² .  (20)
Figure D.1 shows a plot of the original h and the quadratic approximation ĥ, to give a visual sense of the approximation quality.

E PROOF OF RELU KERNEL THEOREM

Recall that the kernel function k for ReLU is expressed using h as in (3). Replacing h with ĥ thus gives an alternate kernel function that approximates that of the ReLU:
k(x, z) ≈ ||x|| ||z|| ĥ( ⟨x, z⟩ / (||x|| ||z||) ) = a ||x|| ||z|| ( 1 + ⟨x, z⟩ / (||x|| ||z||) )² .  (21)

¹ We first claim that the indefinite integral E_a(u) = ∫ (h(u) − ĥ(u))² du admits a lengthy closed form involving polynomial, √(1 − u²), arccos(u), and arcsin(u) terms (equations (11)–(16)). This can be easily shown by taking the derivative of both sides and arriving at equality. The definite integral can then be computed by evaluating the indefinite integral at the boundary points:
e_a = E_a(1) − E_a(−1) = (32/5) a² − (163/48) a + 1/3 + 32/(27π²) .  (17)

E.1 EIGENFUNCTIONS

An eigenfunction φ must satisfy
∫_{Rn} a ||x|| ||z|| ( 1 + ⟨x, z⟩/(||x|| ||z||) )² φ(z) p(z) dz = λ φ(x) ,  (22)
or equivalently,
a ||x|| ∫ ||z|| ( 1 + ⟨x, z⟩/(||x|| ||z||) )² φ(z) p(z) dz = λ φ(x) .  (23)
We first focus on the above integral. Expanding the square, and writing x = (x1, x2), z = (z1, z2) in our two-dimensional setting,
∫ ||z|| ( 1 + ⟨x, z⟩/(||x|| ||z||) )² φ(z) p(z) dz
  = ∫ ( ||z|| + 2 ⟨x, z⟩ / ||x|| + ⟨x, z⟩² / (||x||² ||z||) ) φ(z) p(z) dz  (24–26)
  = ∫ ( √(z1² + z2²) + (2x1z1 + 2x2z2)/√(x1² + x2²) + (x1²z1² + x2²z2² + 2x1x2z1z2)/((x1² + x2²) √(z1² + z2²)) ) φ(z) p(z) dz  (27–29)
  = a0 + a1 x1/√(x1² + x2²) + a2 x2/√(x1² + x2²) + a3 x1²/(x1² + x2²) + a4 x2²/(x1² + x2²) + a5 2x1x2/(x1² + x2²) ,  (30)
where a0 to a5 are the results of the definite integrals for the corresponding terms. Observe that each ai is thus constant in x and z. Plugging the above expression into (23) yields
a ||x|| ( a0 + a1 x1/||x|| + a2 x2/||x|| + a3 x1²/||x||² + a4 x2²/||x||² + a5 2x1x2/||x||² ) = λ φ(x) .  (31)
Thus the eigenfunction has the form
φ(x) = (a ||x|| / λ) ( a0 + a1 x1/||x|| + a2 x2/||x|| + a3 x1²/||x||² + a4 x2²/||x||² + a5 2x1x2/||x||² ) .  (32)
In order to determine the values of a0 to a5, we plug the obtained eigenfunction form into (23) again, for both φ(x) and φ(z) (equations (33)–(37)). Dividing both sides by a ||x|| / λ (equations (38)–(42)) and requiring the resulting identity to hold at every point x, we must equate the coefficients of each term involving x. This gives a linear system for the coefficient vector a = (a0, …, a5)ᵀ,
( ∫ v_z v_zᵀ p(z) dz ) a = λ a ,  (43–44)
where v_z collects the corresponding functions of z (equation (45)). We now focus on the form of p(z).
Recall that we assume p(z) is a mixture of I Gaussian sources. Denote the selection probability of each Gaussian component by πi and the corresponding component-conditional normal density by g(z; µ(i), C(i)). Thus,
p(z) = Σ_i πi g(z; µ(i), C(i)) .  (46)
Since we assume that the covariance matrix of each component is small element-wise, we can resort to a Laplace-type approximation for the integrals:
∫_{R²} f(z) p(z) dz ≈ Σ_i πi f(µ(i)) .  (47)
Under this approximation, the linear system can be expressed as
( Σ_i πi v_i v_iᵀ ) a = λ a , where v_i denotes v_z evaluated at z = µ(i) .  (48)
Thus, any solution a must be an eigenvector of C ≡ Σ_i πi v_i v_iᵀ. In the sequel we discuss how to obtain the eigenvectors of C. In particular, in our 2-d problem with two Gaussian components, where π1 = π2 = 1/2 and µ(1) = Aµ and µ(2) = −Aµ, the matrix becomes C = ½ ( v+ v+ᵀ + v− v−ᵀ ), where v+ and v− denote v_z evaluated at Aµ and −Aµ, respectively (equations (49)–(52)).
We now seek the eigenvectors of C in this specific setting. A key observation is that ||v+|| = ||v−||, because the two vectors differ only in the signs of some of their components. We now claim that, for any two vectors v+ and v− with ||v+|| = ||v−||, the eigenvectors of the matrix C ∝ v+ v+ᵀ + v− v−ᵀ have the form
ψ1 = c1 (v+ + v−) ,  ψ2 = c2 (v+ − v−) ,  (53)
where c1 and c2 are normalization constants. We prove that ψ1 is an eigenvector of C; a similar argument applies to ψ2:
C ψ1 = ( v+ v+ᵀ + v− v−ᵀ ) c1 (v+ + v−)  (54)
     = c1 ( v+ ||v+||² + v+ ⟨v+, v−⟩ + v− ⟨v+, v−⟩ + v− ||v−||² )  (55–56)
     = c1 ( ||v+||² + ⟨v+, v−⟩ ) v+ + c1 ( ||v−||² + ⟨v+, v−⟩ ) v−  (57)
     = c1 ( ||v+||² + ⟨v+, v−⟩ ) ( v+ + v− )  (58–59)
     = ( ||v+||² + ⟨v+, v−⟩ ) ψ1 .  (60)
The above shows that C ψ1 ∝ ψ1, and thus ψ1 must be an eigenvector of C. Plugging in the definitions of v+ and v− gives us the eigenvectors of our C matrix, and thus the solutions a of equation (48). Since we are not constraining the length of the eigenvectors here, we can absorb all the constants in equation (48) into one and express the solution pair as follows: the first solution, a⁽¹⁾, places weight ||Aµ||² on the constant slot and weights (a1µ1)², (a2µ2)², and 2(a1µ1)(a2µ2) on the three quadratic slots, with zeros on the linear slots; the second, a⁽²⁾, places weights a1µ1 and a2µ2 on the two linear slots, with zeros elsewhere (equations (61)–(62)). Recall from (32) that
φ(x) ∝ ||x|| ( a0 + a1 x1/||x|| + a2 x2/||x|| + a3 x1²/||x||² + a4 x2²/||x||² + a5 2x1x2/||x||² ) .  (63)
Plugging the pair of solutions for a into the above yields the two eigenfunctions (equations (64)–(66)), or more simply,
φ1(x) ∝ ||x|| ( 1 + ⟨ x/||x|| , Aµ/||Aµ|| ⟩² ) ,  φ2(x) ∝ ⟨ x, Aµ ⟩ .  (67)
We convert the ∝ to equality by applying proper scaling factors d1 and d2:
φ1(x) = d1 ||x|| ||Aµ|| ( 1 + ⟨ x/||x|| , Aµ/||Aµ|| ⟩² ) ,  (68)
φ2(x) = d2 ⟨ x, Aµ ⟩ .  (69)
In order to find the factors d1 and d2, we require φ1 and φ2 to have unit norm in the sense of (6). Starting with φ1,
||φ1||² = d1² ||Aµ||² ∫_{R²} ||x||² ( 1 + ⟨ x/||x|| , Aµ/||Aµ|| ⟩² )² p(x) dx  (70)
        ≈ d1² ||Aµ||² · ||Aµ||² ( 1 + 1² )²  (71–73)
        = 4 d1² ||Aµ||⁴ .  (74)
Setting the above to 1 and solving for d1 yields
d1 = 1 / (2 ||Aµ||²) ,  (75)
and consequently
φ1(x) = ||x|| ( 1 + ⟨ x/||x|| , Aµ/||Aµ|| ⟩² ) / (2 ||Aµ||) .  (76)
Similarly, for φ2,
||φ2||² = d2² ∫_{R²} ⟨x, Aµ⟩² p(x) dx  (77)
        ≈ d2² ( ½ ⟨Aµ, Aµ⟩² + ½ ⟨−Aµ, Aµ⟩² )  (78–79)
        = d2² ||Aµ||⁴ .  (80)
Setting the above to 1 and solving for d2 yields
d2 = 1 / ||Aµ||² ,  (81)
and consequently
φ2(x) = ⟨ x, Aµ ⟩ / ||Aµ||² .  (82)
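As a quick numerical sanity check (our own addition; the particular Aµ and test point are arbitrary), the snippet below confirms that φ1 and φ2 have unit norm under the two-point approximation of p and are eigenfunctions of the quadratic-approximation kernel, with the common eigenvalue 2a||Aµ||² derived in Section E.2 below.

import numpy as np

a = 815 / 3072
Amu = np.array([1.3, 0.4])            # an arbitrary choice of Aµ for the check
nA = np.linalg.norm(Amu)

def k_hat(x, z):                      # quadratic approximation of the ReLU kernel, eq. (21)
    nx, nz = np.linalg.norm(x), np.linalg.norm(z)
    return a * nx * nz * (1 + x @ z / (nx * nz)) ** 2

def phi1(x):                          # eq. (76)
    nx = np.linalg.norm(x)
    return nx * (1 + (x @ Amu / (nx * nA)) ** 2) / (2 * nA)

def phi2(x):                          # eq. (82)
    return x @ Amu / nA ** 2

support = [Amu, -Amu]                 # p(z) ~ (1/2) delta_{Amu} + (1/2) delta_{-Amu}
x = np.array([0.7, -2.0])             # arbitrary test point
for phi in (phi1, phi2):
    lhs = 0.5 * sum(k_hat(x, z) * phi(z) for z in support)
    print(lhs / phi(x), 2 * a * nA ** 2)             # eigenvalue check
    print(0.5 * sum(phi(z) ** 2 for z in support))   # unit norm under p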
E.2 EIGENVALUES

To find the eigenvalues, we simply put the eigenfunctions into their defining equation (1). Starting with φ1,
∫_{R²} k(x, z) φ1(z) p(z) dz  (83)
  ≈ ∫_{R²} a ||x|| ||z|| ( 1 + ⟨x, z⟩/(||x|| ||z||) )² φ1(z) p(z) dz  (84–85)
  ≈ ½ a ||x|| ||Aµ|| ( 1 + ⟨x, Aµ⟩/(||x|| ||Aµ||) )² φ1(Aµ) + ½ a ||x|| ||Aµ|| ( 1 − ⟨x, Aµ⟩/(||x|| ||Aµ||) )² φ1(−Aµ)  (86–89)
  = a ||x|| ||Aµ|| ( 1 + ⟨ x/||x|| , Aµ/||Aµ|| ⟩² )  (90–91)
  = 2a ||Aµ||² · ||x|| ( 1 + ⟨ x/||x|| , Aµ/||Aµ|| ⟩² ) / (2 ||Aµ||)  (92)
  = 2a ||Aµ||² φ1(x) ,  (93)
where we use φ1(Aµ) = φ1(−Aµ) = 1. Therefore,
λ1 = 2a ||Aµ||² .  (94)
Similarly for φ2,
∫_{R²} k(x, z) φ2(z) p(z) dz  (95)
  ≈ ∫_{R²} a ||x|| ||z|| ( 1 + ⟨x, z⟩/(||x|| ||z||) )² ⟨z, Aµ⟩/||Aµ||² p(z) dz  (96–97)
  ≈ ½ a ||x|| ||Aµ|| ( 1 + ⟨x, Aµ⟩/(||x|| ||Aµ||) )² − ½ a ||x|| ||Aµ|| ( 1 − ⟨x, Aµ⟩/(||x|| ||Aµ||) )²  (98–103)
  = 2a ||x|| ||Aµ|| ⟨x, Aµ⟩ / (||x|| ||Aµ||)  (104–105)
  = 2a ||Aµ||² ⟨x, Aµ⟩ / ||Aµ||²  (106)
  = 2a ||Aµ||² φ2(x) ,  (107)
using φ2(Aµ) = 1 and φ2(−Aµ) = −1. Therefore,
λ2 = 2a ||Aµ||² .  (108)

F PROOF OF LINEAR KERNEL THEOREM

An eigenfunction φ of the kernel operator associated with the kernel k(x, z) = ⟨x, z⟩ must satisfy
∫_{Rn} ⟨x, z⟩ φ(z) p(z) dz = λ φ(x) .  (109)
Following our earlier setup with a 2-dimensional input, the left-hand side becomes
∫_{R²} ⟨x, z⟩ φ(z) p(z) dz = ∫_{R²} (x1 z1 + x2 z2) φ(z) p(z) dz = a1 x1 + a2 x2 ,  (110–112)
where a1 and a2 are the results of definite integrals of the form ∫_{R²} z_i φ(z) p(z) dz, for i = 1, 2. Observe that each ai is thus constant in x and z. Plugging the above expression into (109) yields
a1 x1 + a2 x2 = λ φ(x) .  (113)
Thus the eigenfunction has the form
φ(x) = (1/λ)(a1 x1 + a2 x2) .  (114)
In order to determine the values of a1 and a2, we plug the obtained eigenfunction form into (109) again, for both φ(x) and φ(z):
∫_{R²} ⟨x, z⟩ φ(z) p(z) dz = λ φ(x)  (115)
∫_{R²} (x1 z1 + x2 z2) (1/λ)(a1 z1 + a2 z2) p(z) dz = λ (1/λ)(a1 x1 + a2 x2)  (116)
∫_{R²} (x1 z1 + x2 z2)(a1 z1 + a2 z2) p(z) dz = λ (a1 x1 + a2 x2)  (117)
Σ_i πi ( x1 µ(i)_1 + x2 µ(i)_2 ) ( a1 µ(i)_1 + a2 µ(i)_2 ) = λ (a1 x1 + a2 x2) ,  (118)
where in the last line we use the assumption that the covariance matrix of each Gaussian source is small, and thus we can resort to a Laplace-like approximation of the integral. For the above identity to hold at any x, we need
Σ_i πi µ(i)_1 ( a1 µ(i)_1 + a2 µ(i)_2 ) = λ a1  (119)
Σ_i πi µ(i)_2 ( a1 µ(i)_1 + a2 µ(i)_2 ) = λ a2 ,  (120)
or in matrix form,
( Σ_i πi µ(i) µ(i)ᵀ ) a = λ a .  (121–122)
The equation can be compactly expressed as
C a = λ a .  (123)
Thus, any solution a must be an eigenvector of C. In the sequel we discuss how to obtain the eigenvectors of C. Observe that each matrix Ci is a rank-one symmetric matrix that can be written as µ(i) µ(i)ᵀ. In particular, in our 2-d problem with two Gaussian components, where π1 = π2 = 1/2 and µ(1) = Aµ and µ(2) = −Aµ, the above equation can be expressed as
( ½ (Aµ)(Aµ)ᵀ + ½ (−Aµ)(−Aµ)ᵀ ) a = λ a  (124)
(Aµ)(Aµ)ᵀ a = λ a ,  (125)
and thus
a ∝ Aµ .  (126)
Recall from (114) that
φ(x) ∝ ⟨a, x⟩ .  (127)
Plugging the solution for a into the above yields the eigenfunction
φ(x) ∝ ⟨Aµ, x⟩ .  (128)
We convert the ∝ to an equality by applying a proper scaling factor c:
φ(x) = c ⟨Aµ, x⟩ .  (129)
In order to find the scale factor c, we require φ to have unit norm in the sense of (6):
||φ||² = c² ∫_{R²} ⟨Aµ, x⟩² p(x) dx  (130)
       ≈ ½ c² ⟨Aµ, Aµ⟩² + ½ c² ⟨Aµ, −Aµ⟩²  (131)
       = c² ⟨Aµ, Aµ⟩²  (132)
       = c² ( ||Aµ||² )² .  (133)
Setting the above to 1 and solving for c yields
c = 1 / ||Aµ||² ,  (134)
and consequently
φ(x) = ⟨Aµ, x⟩ / ||Aµ||² .  (135)
To find the eigenvalue associated with φ, we plug this form into (109) for both φ(x) and φ(z) and then solve for λ:
∫_{Rn} ⟨x, z⟩ φ(z) p(z) dz = λ φ(x)  (136)
∫_{Rn} ⟨x, z⟩ ⟨Aµ, z⟩ / ||Aµ||² p(z) dz = λ ⟨Aµ, x⟩ / ||Aµ||²  (137)
½ ⟨x, Aµ⟩ ⟨Aµ, Aµ⟩ / ||Aµ||² + ½ ⟨x, −Aµ⟩ ⟨Aµ, −Aµ⟩ / ||Aµ||² = λ ⟨Aµ, x⟩ / ||Aµ||²  (138)
⟨x, Aµ⟩ = λ ⟨Aµ, x⟩ / ||Aµ||²  (139)
||Aµ||² = λ .  (140)
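This eigenpair, too, can be checked numerically under the two-point approximation (our own addition; Aµ and the test point are arbitrary):

import numpy as np

Amu = np.array([1.3, 0.4])                 # arbitrary Aµ for the check
def phi(x):                                # eq. (135)
    return x @ Amu / (Amu @ Amu)

x = np.array([0.7, -2.0])                  # arbitrary test point
lhs = 0.5 * sum((x @ z) * phi(z) for z in (Amu, -Amu))   # left-hand side of (109) under the two-point p
print(lhs / phi(x), Amu @ Amu)             # eigenvalue: both equal ||Aµ||²
print(0.5 * sum(phi(z) ** 2 for z in (Amu, -Amu)))       # unit norm under p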
G OPTIMAL PREDICTION

Consider the task of finding a function f(x) that best approximates a given function y(x) via the following regression setup:
L = ∫_{Rn} ( f(x) − y(x) )² p(x) dx .  (141)
Suppose we are interested in expressing f using the basis induced by the kernel k, that is, the eigenfunctions of the linear operator associated with k whose eigenvalues are non-zero:
f(x) = Σ_k a_k φ_k(x) ,  (142)
where φ_k is an eigenfunction and k runs over eigenfunctions with non-zero eigenvalue. We now show that the function f which minimizes the regression loss L has the form
f(x) = Σ_k a_k φ_k(x) , with a_k = ∫_{Rn} φ_k(z) y(z) p(z) dz .  (143–144)
We start from
L = ∫_{Rn} ( f(x) − y(x) )² p(x) dx = ∫_{Rn} ( Σ_k a_k φ_k(x) − y(x) )² p(x) dx .  (145–146)
Expanding the square,
L = ∫_{Rn} ( Σ_k a_k φ_k(x) − y(x) ) ( Σ_j a_j φ_j(x) − y(x) ) p(x) dx  (147)
  = Σ_{k,j} a_k a_j ∫_{Rn} φ_k(x) φ_j(x) p(x) dx − 2 Σ_j a_j ∫_{Rn} φ_j(x) y(x) p(x) dx + ∫_{Rn} y(x)² p(x) dx  (148)
  = Σ_j a_j² − 2 Σ_j a_j ∫_{Rn} φ_j(x) y(x) p(x) dx + ∫_{Rn} y(x)² p(x) dx ,  (149–150)
where the last step uses the orthonormality of the eigenfunctions. Zero-crossing the derivative with respect to a_j yields
a_j = ∫_{Rn} φ_j(x) y(x) p(x) dx .  (151)
Therefore,
f(x) = Σ_k a_k φ_k(x) = Σ_k ( ∫_{Rn} φ_k(z) y(z) p(z) dz ) φ_k(x) .  (152–153)
In particular, if we replace p(z) by our pair of Gaussian densities with small covariance, and set y(x) to 1 and 0 depending on whether x is drawn from the positive or negative component of the mixture, we arrive at
f(x) = Σ_i ( ½ ∫_{R²} φ_i(z) (1) p+(z) dz + ½ ∫_{R²} φ_i(z) (0) p−(z) dz ) φ_i(x) = ½ Σ_i φ_i(x) φ_i(Aµ) ,  (154)
where p+ and p− denote the class-conditional normal distributions.

H SENSITIVITY ANALYSIS

H.1 LINEAR KERNEL

Theorem 5  In a linear network, |ζ1| − |ζ2| is always zero for any choice of m ≥ 1 and regardless of the values of a1 and a2.

Recall the definitions
f(x) = ½ Σ_i φ_i(x) φ_i(Aµ) ,  (155)
g_B(x) = ½ Σ_i φ_i(x) φ_i(BAµ) ,  (156)
γ = ∫_{R²} f(x) g_B(x) p(x) dx / √( ∫_{R²} f²(t) p(t) dt · ∫_{R²} g_B²(t) p(t) dt ) .  (157)
For the linear kernel, we had only one eigenfunction, of the form
φ1(x) = ⟨Aµ, x⟩ / ||Aµ||² .  (158)
Replacing it in the above definitions yields
f(x) = ½ ⟨Aµ, x⟩/||Aµ||² · ⟨Aµ, Aµ⟩/||Aµ||² = ½ ⟨Aµ, x⟩ / ||Aµ||² ,  (159)
g_B(x) = ½ ⟨Aµ, x⟩/||Aµ||² · ⟨Aµ, BAµ⟩/||Aµ||² .  (160)
It is easy to obtain
∫_{Rn} f²(x) p(x) dx = ( 1 / (4 ||Aµ||⁴) ) ∫_{Rn} ⟨Aµ, x⟩² p(x) dx ,  (161–162)
∫_{Rn} g_B²(x) p(x) dx = ( ⟨Aµ, BAµ⟩² / (4 ||Aµ||⁸) ) ∫_{Rn} ⟨Aµ, x⟩² p(x) dx .  (163–165)
Plugging these results into the definition of γ, and noting that g_B is proportional to f with proportionality constant ⟨Aµ, BAµ⟩/||Aµ||², the common factors cancel (equations (166)–(169)) and we are left with
γ = sign( ⟨Aµ, BAµ⟩ )  (170)
  = sign( b1 a1² µ1² + b2 a2² µ2² ) ,  (171)
writing A = diag(a1, a2) and B = diag(b1, b2). Recall that ζ_i denotes the m-th order derivative of γ with respect to b_i, evaluated at b1 = b2 = 1 (equation (172)). It is easy to see that, for any m ≥ 1, this derivative can be non-zero only at points where
b1 a1² µ1² + b2 a2² µ2² = 0 .  (173)
To find ζ1 and ζ2 we evaluate at the point b1 = b2 = 1. Since by definition µ_i ≠ 0 and a_i ≠ 0, the sum b1 a1² µ1² + b2 a2² µ2² is strictly positive there, so it easily follows that ζ1 = ζ2 = 0 for any m ≥ 1, and thus |ζ1| = |ζ2| for any m ≥ 0.

H.2 RELU KERNEL

Theorem 6  In a ReLU network, |ζ1| − |ζ2| = 0 for any 1 ≤ m ≤ 8. The first non-zero |ζ1| − |ζ2| occurs at m = 9 and has the following form:
|ζ1| − |ζ2| = ( 5670 / ||Aµ||¹⁸ ) (a1 a2 µ1 µ2)⁸ ( a1² µ1² − a2² µ2² ) .  (174)

Recall the definitions
f(x) = ½ Σ_i φ_i(x) φ_i(Aµ) ,  (175)
g_B(x) = ½ Σ_i φ_i(x) φ_i(BAµ) ,  (176)
γ = ∫_{R²} f(x) g_B(x) p(x) dx / √( ∫_{R²} f²(t) p(t) dt · ∫_{R²} g_B²(t) p(t) dt ) .  (177)
For the ReLU kernel, we had two eigenfunctions, of the form
φ1(x) = ||x|| ( 1 + ⟨ x/||x|| , Aµ/||Aµ|| ⟩² ) / (2 ||Aµ||) ,  φ2(x) = ⟨x, Aµ⟩ / ||Aµ||² .  (178)
Replacing them in the above definitions yields
f(x) = ½ ( φ1(x) φ1(Aµ) + φ2(x) φ2(Aµ) ) = ½ ( ||x|| ( 1 + ⟨ x/||x|| , Aµ/||Aµ|| ⟩² ) / (2 ||Aµ||) + ⟨x, Aµ⟩ / ||Aµ||² ) ,  (179–182)
g_B(x) = ½ ( φ1(x) φ1(BAµ) + φ2(x) φ2(BAµ) )  (183)
       = (1/8) ||x|| ||BAµ|| ( 1 + ⟨ x/||x|| , Aµ/||Aµ|| ⟩² ) ( 1 + ⟨ BAµ/||BAµ|| , Aµ/||Aµ|| ⟩² ) / ||Aµ||² + ½ ⟨x, Aµ⟩ ⟨BAµ, Aµ⟩ / ||Aµ||⁴ .  (184)
It is easy to obtain closed forms for f² and g_B² (equations (185)–(186)), and therefore, evaluating under the two-point approximation,
∫_{Rn} f²(x) p(x) dx = 1 .  (187–189)
In a similar way, one can compute ∫_{Rn} g_B²(x) p(x) dx. Plugging these into the definition of γ yields a closed-form, if unwieldy, expression in a1, a2, b1, b2, µ1, and µ2 (equation (190)). It is messy but straightforward to write down the derivatives of γ with respect to b1 and b2 evaluated at b1 = b2 = 1 (which give ζ1 and ζ2) and to verify that every derivative of order m ≤ 8 yields ζ1 = ζ2, while at m = 9 one obtains
|ζ1| − |ζ2| = ( 5670 / ||Aµ||¹⁸ ) (a1 a2 µ1 µ2)⁸ ( a1² µ1² − a2² µ2² ) .  (191)

I QUADRATIC APPROXIMATION VS RELU KERNEL

The paper presents an analytical form that characterizes the trade-off between predictivity and availability in a single-layer ReLU model for classifying two Gaussian sources. In order to keep the theoretical analysis tractable, we made the following approximations. First, we used an asymptotic approximation to the covariance matrix, letting the size of the covariance go to zero. Second, we replaced the ReLU kernel function by a quadratic approximation. A natural question is whether the predictions made by our theory are sensitive to these approximations; more precisely, if we work with the actual ReLU kernel and a real covariance matrix, do we observe similar predictions? We verify this here by learning a model from finite training data in the NTK regime, which amounts to performing a kernel regression with the kernel specified by the NTK.

Figure I.1: Plot of sign((|ζ1| − |ζ2|)(a1 − a2)) for Quadratic (left) and ReLU (right) NTK models trained on 50 training samples from the positive N(µ, C) and negative N(−µ, C) classes. Here, µ = (µ1, µ2) and the covariance is C = σ²I, with µ1 = 1 and σ = 0.01. The top and bottom panels correspond to µ2 = 10 and µ2 = 0.1, respectively.

The result of the simulation is provided in Figure I.1 and confirms that the approximations used in the theory do not alter the predictions, in the sense that the simulation matches the trend of the predictions made by the theory in Figure 5. The left panel of Figure I.1 still uses the quadratic kernel; however, it is generated by actual predictions from a model trained on finite training data whose distribution has a real covariance matrix (as opposed to an asymptotically small one). This confirms that the closed-form expression derived from our theory (the predictions of Figure 5) matches the simulation result. The right panel of Figure I.1 goes further and replaces the quadratic kernel by the actual ReLU kernel. Observe that this does not affect the trend of the predictions, showing that little is lost by switching between the ReLU kernel and its quadratic approximation.
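Concretely, the NTK-regime predictor computed by the script below is the kernel ridge-regression solution (this summary is our own, using the listing's variable names): with training inputs x_1, …, x_{2n} drawn from the two Gaussian sources, labels y ∈ {0, 1}^{2n}, availability matrix A = diag(a1, a2), and kernel matrix K_{ij} = k(A x_i, A x_j), the function kernel_predict returns
  f̂(x⋆) = Σ_j c_j k(A x⋆, A x_j) ,  c = (K + ε I)⁻¹ y ,
where ε is the small jitter EPS added before solving the linear system. The sensitivity map then compares the normalized prediction vector under A with the predictions obtained after perturbing a1 or a2 by a small relative amount DELTA, and records the sign of the difference in sensitivities times (a1 − a2).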
NUM_PLOT_POINTS = 20   # number of grid points per axis (larger is better but slower)
N_TRAIN_SAMPLES = 50   # number of training points per Gaussian source
USE_QUAD = True        # use the quadratic approximation of the ReLU kernel (False: exact ReLU kernel)
MU_1 = 1.0
MU_2_List = [10, 0.1]
SCALE_COV = 0.0001     # scale parameter s from the paper
COR = 0.0              # |COR| must be less than 1
EPS = SCALE_COV / 100  # epsilon * identity added before inverting the kernel matrix
DELTA = 0.05           # delta used to approximate derivatives with finite differences

import numpy as np
from matplotlib import pyplot as plt

def kernel(x1, x2):
    norm_x1 = np.sqrt(np.dot(x1, x1))
    norm_x2 = np.sqrt(np.dot(x2, x2))
    u = np.dot(x1 / norm_x1, x2 / norm_x2)
    if np.abs(u) > 1.0:
        u = np.sign(u)
    if USE_QUAD:
        h = 815.0 / 3072.0 * (1 + u) ** 2
    else:
        h = (u * (np.pi - np.arccos(u)) + np.sqrt(1 - u ** 2)) / np.pi
    return norm_x1 * norm_x2 * h

def kernel_predict(A, x, y, X, kernel_func):
    # kernel ridge regression to obtain coefficients c_1 to c_2n
    K = np.empty((2 * N_TRAIN_SAMPLES, 2 * N_TRAIN_SAMPLES))
    for i in range(2 * N_TRAIN_SAMPLES):
        for j in range(2 * N_TRAIN_SAMPLES):
            K[i, j] = kernel_func(A @ x[i], A @ x[j])
    c = np.linalg.solve(K + EPS * np.identity(2 * N_TRAIN_SAMPLES), y)
    # compute prediction vector
    K_pred = np.empty((2 * N_TRAIN_SAMPLES, 2 * N_TRAIN_SAMPLES))
    for i in range(2 * N_TRAIN_SAMPLES):
        for j in range(2 * N_TRAIN_SAMPLES):
            K_pred[i, j] = kernel_func(A @ X[i], A @ x[j])
    return K_pred @ c

### MAIN ###
sensitivity = np.empty((len(MU_2_List), NUM_PLOT_POINTS, NUM_PLOT_POINTS))
for MU_2_indx in range(len(MU_2_List)):
    MU_2 = MU_2_List[MU_2_indx]
    # generate 2d-Gaussian data (n samples per class) for two classes: x0 and x1
    mean_class_0 = np.array([MU_1, MU_2])
    mean_class_1 = -mean_class_0
    cov = SCALE_COV * np.array([[1.0, COR], [COR, 1.0]])
    rng = np.random.default_rng()
    x0 = rng.multivariate_normal(mean_class_0, cov, size=N_TRAIN_SAMPLES)
    y0 = np.zeros(N_TRAIN_SAMPLES)
    x1 = rng.multivariate_normal(mean_class_1, cov, size=N_TRAIN_SAMPLES)
    y1 = np.ones(N_TRAIN_SAMPLES)
    x = np.concatenate((x0, x1), axis=0)
    y = np.concatenate((y0, y1), axis=0)
    X = x
    for i_a1 in range(NUM_PLOT_POINTS):
        print('completion:', i_a1 + 1, 'of', NUM_PLOT_POINTS)
        a1 = (4.0 * (i_a1 + 1)) / NUM_PLOT_POINTS
        for i_a2 in range(NUM_PLOT_POINTS):
            a2 = (4.0 * (i_a2 + 1)) / NUM_PLOT_POINTS
            A = np.diag([a1, a2])
            pred = kernel_predict(A, x, y, X, kernel)
            A_Perturbed = A + [[DELTA * np.abs(a1), 0], [0, 0]]
            pred_inc_a1 = kernel_predict(A_Perturbed, x, y, X, kernel)
            A_Perturbed = A + [[0, 0], [0, DELTA * np.abs(a2)]]
            pred_inc_a2 = kernel_predict(A_Perturbed, x, y, X, kernel)
            # normalize prediction vectors
            pred = pred / np.sqrt(np.dot(pred, pred))
            pred_inc_a1 = pred_inc_a1 / np.sqrt(np.dot(pred_inc_a1, pred_inc_a1))
            pred_inc_a2 = pred_inc_a2 / np.sqrt(np.dot(pred_inc_a2, pred_inc_a2))
            sensitivity[MU_2_indx, i_a1, i_a2] = np.sign(
                (np.abs(np.dot(pred, pred_inc_a1) - 1.0) / (DELTA * np.abs(a1))
                 - np.abs(np.dot(pred, pred_inc_a2) - 1.0) / (DELTA * np.abs(a2)))
                * (a1 - a2))

# Plot sensitivity
xv, yv = np.meshgrid(np.linspace(0, 4, NUM_PLOT_POINTS), np.linspace(0, 4, NUM_PLOT_POINTS))
fig, (ax1, ax2) = plt.subplots(2)
ax1.pcolormesh(xv, yv, sensitivity[0], vmin=-1, vmax=1)
ax2.pcolormesh(xv, yv, sensitivity[1], vmin=-1, vmax=1)
ax1.set_box_aspect(1)
ax2.set_box_aspect(1)
plt.show()