# can_implicit_bias_imply_adversarial_robustness__eff1577b.pdf Can Implicit Bias Imply Adversarial Robustness? Hancheng Min 1 René Vidal 1 The implicit bias of gradient-based training algorithms has been considered mostly beneficial as it leads to trained networks that often generalize well. However, Frei et al. (2023) show that such implicit bias can harm adversarial robustness. Specifically, they show that if the data consists of clusters with small inter-cluster correlation, a shallow (two-layer) Re LU network trained by gradient flow generalizes well, but it is not robust to adversarial attacks of small radius. Moreover, this phenomenon occurs despite the existence of a much more robust classifier that can be explicitly constructed from a shallow network. In this paper, we extend recent analyses of neuron alignment to show that a shallow network with a polynomial Re LU activation (p Re LU) trained by gradient flow not only generalizes well but is also robust to adversarial attacks. Our results highlight the importance of the interplay between data structure and architecture design in the implicit bias and robustness of trained networks. 1. Introduction Behind the success of deep neural networks in many application domains lies their vulnerability to adversarial attacks, i.e., small and human-imperceptible perturbations to the input data. Such a phenomenon was observed in the seminal paper of Szegedy et al. (2014) and has motivated a large body of work on building defenses against such attacks (Shafahi et al., 2019; Papernot et al., 2016; Wong et al., 2019; Guo et al., 2018; Cohen et al., 2019; Levine & Feizi, 2020; Yang et al., 2020; Sulam et al., 2020; Kinfu & Vidal, 2022). However, many defense strategies have been shown to fail against new adaptive attacks (Athalye et al., 2018; Carlini et al., 2019), and understanding these failures seems 1University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Hancheng Min . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). to be a fundamental challenge. For example, Fawzi et al. (2018); Dohmatob (2019); Shafahi et al. (2018) show the non-existence of robust classifiers for certain data distributions. Recently, Pal et al. (2023) show that having a data distribution that concentrates within a small-volume subset of the ambient space is necessary for the existence of a robust classifier. These results highlight the importance of understanding and exploiting data structure in the process of finding classifiers with certified robustness, yet almost none of the existing defense strategies do so. Besides data distribution, additional issues arise from the training algorithms. For example, Vardi et al. (2022) and Frei et al. (2023) show that when the data consists of clusters with small inter-cluster correlation, a shallow (twolayer) Re LU network trained by gradient flow generalizes well but fails to be robust against adversarial attacks of small radius, despite the existence of a much more robust classifier that can be explicitly constructed from a shallow network. This study unveils new challenges in the search for robust classifiers: Even if we know robust classifiers exist for certain data distribution, the implicit bias from our training algorithm (the choice of network architecture, optimization algorithm, etc.) may prevent us from finding it. Paper contributions. In this paper, we show that under the same setting studied in Frei et al. 
(2023), the implicit bias of gradient flow that leads to non-robust networks can be altered to favor robust networks by modifying the Re LU activation. Specifically, we consider a data distribution consisting of a mixture of K Gaussians, referred to as subclasses, which have small inter-subclass correlation and are grouped into two superclasses/classes. When training a two-layer binary classification network, we show, with formal theorems, and with conjectures validated experimentally, that (also illustrated in Figure 1): If the activation is a Re LU, neurons (rows of the first layer weight matrix) tend to learn only the average direction of each class, leading to a classifier that generalizes well on clean data, but is vulnerable to an adversarial attack with l2 radius O 1 , i.e. the trained network is non-robust with many subclasses. This leads to a new neural alignment perspective on the nonrobustness of Re LU networks identified by Frei et al. (2023). If the activation is replaced by a novel polynomial Re LU Can Implicit Bias Imply Adversarial Robustness? Figure 1. Visualizing the training of a p Re LU network under small initialization (Details explained in later sections). The dataset has its positive class sampled from two subclasses. (a) At initialization, all neurons have small norms and point toward random directions; When p = 1 (vanilla Re LU network), (b) During the alignment phase, the neuron directions are aligned with either of the average class centers µ+ and µ ; (c) During the second phase, neurons keep the alignment with µ+ and µ while growing their norms; When p = 3, (d) neurons learn subclass centers during alignment phase and (e) keep the alignment in the second phase. Note: the neurons pointing toward directions other than class/subclass centers are not activated by any data point and have small norms throughout training. activation, proposed based on recent advances in understanding the neuron alignment in shallow networks, neurons tend to learn the direction of each subclass center, leading to a classifier that generalizes well on clean data and can sustain any adversarial attack with O (1) radius. Our analysis (1) highlights the importance of the interplay between data structure and network architecture in determining the robustness of the trained network, (2) explains how the implicit bias (regularization) of training a Re LU network fails to exploit the data structure and leads to nonrobust networks, and (3) shows how the issue is resolved by using a polynomial Re LU activation function. Moreover, numerical experiments on real datasets show that shallow networks with our generalized Re LU activation functions are much more robust than those with a Re LU activation. Relation to existing analysis on implicit bias of neural networks. Our discussion is theory-centric. It builds upon the analysis of the implicit bias of training algorithms, but it is significantly different from prior theoretical analyses. 
The implicit bias of training algorithms has been studied extensively over past years for various architectures, including both linear networks (Saxe et al., 2014; Gunasekar et al., 2017; Ji & Telgarsky, 2019; Woodworth et al., 2020; Min et al., 2021; Stöger & Soltanolkotabi, 2021; Jacot et al., 2021; Wang & Jacot, 2023), and nonlinear networks (Lyu & Li, 2019; Ji & Telgarsky, 2020; Maennel et al., 2018; Chizat & Bach, 2020; Boursier et al., 2022; Min et al., 2024; Wang & Ma, 2023; Frei et al., 2022; 2023; Kumar & Haupt, 2024; Abbe et al., 2023; Tarzanagh et al., 2023). Moreover, this phenomenon has been studied from different perspectives, including maxmargin (Lyu & Li, 2019; Chizat & Bach, 2020; Tarzanagh et al., 2023), min-norm (Gunasekar et al., 2017; Min et al., 2021), sparsity/low-rankness (Saxe et al., 2014; Woodworth et al., 2020; Wang & Jacot, 2023; Abbe et al., 2023), and alignment (Maennel et al., 2018; Min et al., 2024; Ku- mar & Haupt, 2024). While most works consider these implicit biases beneficial for the success of neural networks in practice, and some existing work (Faghri et al., 2021) has even shown that such biases help robustness in the cases of linear regression; few works (Frei et al., 2023; Boursier & Flammarion, 2024) discuss the potential harm caused by such biases. Our work takes one step further by proposing fixes to the harms identified by prior work (Frei et al., 2023), which sheds light on the potential of using deep learning theory to not only understand but also improve neural networks in practice. Paper organization. Our main formal results regarding adversarial robustness will be shown (in Section 3) for explicitly constructed classifiers that can be realized by shallow networks with different activation functions. Then we connect these robustness results to those of actual trained shallow networks by a conjecture that shallow networks, when initialized properly, can learn such constructions via gradient flow training. The rationale behind this conjecture is carefully explained in Section 4, together with a preliminary theoretical analysis that supports this conjecture. Lastly, we empirically verify our conjecture in Section 5 via experiments on synthetic data, and then we conduct numerical experiments on real datasets, showing that our proposed new activation improves the robustness of shallow networks. Notation: We denote the the inner product between vectors x and y by x, y = x y, and the cosine of the angle between them as cos(x, y) = x x , y y . For an n m matrix A, we let A and A F denote the spectral and Frobenius norm of A, respectively. We also define 1A as the indicator for a statement A: 1A = 1 if A is true and 1A = 0 otherwise. We also let N(µ, Σ2) denote the normal distribution with mean µ and covariance matrix Σ2, and Unif(S) denote the uniform distribution over a set S. Lastly, we let [N] denote the integer set {1, , N}. Can Implicit Bias Imply Adversarial Robustness? 2. Problem Setting We consider a binary classification problem where within each class, the data consists of subclasses with small intersubclass correlations, formally defined as follows. Data distribution. We assume (X, Y, Z) D, where a sample (x, y, z) RD { 1, 1} N+ drawn from D consists of the observed data and class label (x, y), and a latent (unobserved) variable z denoting the subclass membership. We denote the marginal distribution of the observed part (X, Y ) as DX,Y . 
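For concreteness, the following numpy sketch draws samples from the data distribution formalized in the next paragraph; taking the subclass centers to be the first K standard basis vectors of R^D, as well as the default values of D, K, K1, and α (which mirror the synthetic experiment in Section 5.1), are illustrative choices made only for this snippet.

```python
import numpy as np

def sample_dataset(n, D=1000, K=10, K1=6, alpha=0.1, rng=None):
    """Draw n samples (x, y, z): z ~ Unif([K]), y = +1 if z <= K1 else -1,
    and x | z ~ N(mu_z, (alpha^2 / D) I) with orthonormal subclass centers mu_k."""
    rng = np.random.default_rng(rng)
    mu = np.eye(D)[:K]                                   # (K, D) subclass centers (basis vectors)
    z = rng.integers(1, K + 1, size=n)                   # latent subclass label in [K]
    y = np.where(z <= K1, 1.0, -1.0)                     # observed superclass label
    x = mu[z - 1] + rng.normal(scale=alpha / np.sqrt(D), size=(n, D))
    return x, y, z, mu
```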
In particular, given 1 K1 < K, we let Z DZ = Unif([K]); Y = 1{Z K1} 1{Z>K1}; (X | Z = k) N µk, α2 D I , 1 k K, where α > 0 is the intra-subclass variance, and {µ1, , µK} are subclass centers that forms an orthonormal basis of a K-dimensional subspace in RD, where K < D. This data distribution has K subclasses with small intersubclass correlation if α is small (since the subclass centers are orthonormal to each other), and the observed labels only reveal the superclass/class membership: the first K1 subclasses belong to the positive class (y = +1) and the remaining K2 := K K1 belong to the negative class (y = 1). One such a dataset is depicted in Figure 1. Classification task and robustness of a classifier. The learning task we consider is to find a binary classifier f(x) : RD R such that given a randomly drawn (x, y), sign(f(x)) predicts the correct class label y, i.e., f(x)y > 0, with high probability. Moreover, we are interested in the robustness of f(x) against additive adversarial attacks/perturbations on x with l2-norm bounded by some constant r > 0 (called the attack radius). Specifically, given a randomly drawn (x, y) from DX,Y , we would like inf d =1 f(x+rd)y > 0 to hold with high probability (formal statement in Section 3) for r as large as possible. Gradient flow training. A candidate classifier can be realized from a neural network f(x; θ) parameterized by its network weights θ. Given a size-n training dataset {(xi, yi), i = 1, , n}, where each training sample is randomly drawn from DX,Y , we define a loss function L(θ; {xi, yi}n i=1) = Xn i=1 ℓ(yi, f(xi; θ)) , (1) where ℓ(y, ˆy) : R R R is either the exponential loss ℓ(y, ˆy) = exp ( yˆy) or the logistic loss ℓ(y, ˆy) = log (1 + exp ( yˆy)). We train the network using gradient flow (GF), θ(t) L(θ(t)), where L denotes the Clarke subdifferential (Clarke, 1990) w.r.t. θ. With proper initialization θ(0), one expects the trained network weights θ(T) (for some large T > 0) to be close to some minimizer of L(θ) and f(x; θ(T)) to be a good classifier for DX,Y . Shallow networks with p Re LU activation. We specifically study the following shallow (two-layer) network, parametrized by θ := {(wj, vj) RD R, j = 1, , h}, as a candidate classifier: fp(x; {wj, vj}h j=1) = Xh j=1 vj [σ( x, wj )]p (Shallow p Re LU Network) where σ( ) = Re LU( ) = max{ , 0} and p 1. When p = 1, this is exactly a shallow Re LU network. When p 1, fp can be loosely viewed as a shallow network with a polynomial Re LU activation and a special form of weight normalization (Salimans & Kingma, 2016) on each wj. One of the most important reasons for this particular generalization of the Re LU activation is the extra penalty" on angle separation between the input x and neurons wjs. To see this, assume x = 1, then [σ( x, wj )]p wj p 1 = σ( x, wj )cosp 1 (x, wj) wj p 1 = σ( x, wj ) cosp 1 (x, wj) , that is, for each neuron wj, when compared to Re LU activation (p = 1), the post-activation value is much smaller (penalized) if the angle separation between x and wj is large. When p > 1, such penalties, as we will see later, promote the alignment between training data samples and neurons, and result in trained networks that capture well the intrinsic structure of the data. In addition, such generalized Re LU activation makes the function fp positively homogeneous of degree two w.r.t. its parameters θ, i.e., fp(x; γθ) = γ2fp(x; θ) for any θ and γ > 0. 
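For reference, here is a minimal numpy sketch of the forward pass of the shallow pReLU network, written in the weight-normalized form f_p(x) = Σ_j v_j [σ(⟨x, w_j⟩)]^p / ‖w_j‖^(p−1) discussed above; the eps guard against zero-norm neurons is an implementation detail added here, and p = 1 recovers an ordinary two-layer ReLU network.

```python
import numpy as np

def prelu_net(x, W, v, p=3, eps=1e-12):
    """f_p(x) = sum_j v_j * relu(<x, w_j>)^p / ||w_j||^(p-1).

    x: (n, D) inputs; W: (h, D) neurons w_j as rows; v: (h,) outer-layer weights.
    For unit-norm inputs, the p-th power acts as the cos^(p-1)(x, w_j) penalty on
    the angle between the input and each neuron."""
    pre = x @ W.T                                        # (n, h) inner products <x_i, w_j>
    norms = np.maximum(np.linalg.norm(W, axis=1), eps)   # ||w_j||, guarded away from zero
    post = np.maximum(pre, 0.0) ** p / norms ** (p - 1)  # [relu(<x, w_j>)]^p / ||w_j||^(p-1)
    return post @ v                                      # (n,) network outputs
```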
Many existing analyses (Du et al., 2018; Lyu et al., 2021; Kumar & Haupt, 2024) on positively homogeneous networks apply to the generalized p Re LU networks. Lastly, it is worth noting that the concept of promoting alignment between data points and neurons via cosine penalty has been explored for improving interpretability (Böhle et al., 2022) of the trained network. 3. Main Results on Adversarial Robustness This section considers two distinct classifiers for DX,Y : k=1 σp( µk, x ) k=K1+1 σp( µk, x ) (2) K1σ( µ+, x ) p K2σ( µ , x ) , (3) where µ+ = 1 K1 PK1 k=1 µk and µ = 1 K2 PK k=K1+1 µk are, respectively, the average direction of the positive and negative subclass centers. Here the average direction is Can Implicit Bias Imply Adversarial Robustness? computed by taking the sum of all µks, then normalize it, thus we have µ+ SD 1 and µ SD 1. Based on existing analyses for the implicit bias of gradientbased training, we conjecture that both classifiers can be learned by gradient flow on shallow networks with p Re LU activations and proper initialization. Before studying this conjecture, we analyze these two classifiers in more detail. Realizability of F(x) and F (p)(x) by p Re LU networks. We first notice that both F(x) and F (p)(x) can be realized (non-uniquely) by a p Re LU network fp(x; {wj, vj}h j=1) for suitable choices of the weights {wj, vj}h j=1, as stated in the following claim. (Verifying this claim is straightforward and we leave it to Appendix C). Claim. The following two statements are true: When p = 1, let I+ and I be any two non-empty and disjoint subsets of [h], let λ := (λj) Rh 0 be any vector such that P j I+ λ2 j = K1 and P j I λ2 j = K2, and let wj = λj µ+, vj = λj, j I+ , wj = λj µ , vj = λj, j I , wj = 0, vj = 0, otherwise. Then fp(x; {wj, vj}h j=1) F(x) . When p 1, let I1, , IK be any K non-empty and disjoint subsets of [h], let λ := (λj) Rh 0 be any vector such that P j Ik λ2 j = 1, k [K], and let wj = λjµk, vj = λj, j Ik, k K1 , wj = λjµk, vj = λj, j Ik, k > K1 , wj = 0, vj = 0 otherwise. Then fp(x; {wj, vj}h j=1) F (p)(x) . Remark 1. As shown in the claim, there are infinitely many parameterizations of fp (determined by the choice of subsets of [h] and λ) that lead to the same function F(x) (or F (p)(x)), due to the symmetry in the network weights. That is, the resulting network represents the same function if one re-indexes (permutes) the neurons, or if one does the following rescaling on some weights (wj, vj) (γwj, vj/γ) for some j [j] and some γ > 0. Notice that the classifier F corresponds to a vanilla shallow Re LU network whose non-zero neurons are either aligned with µ+, the average direction of the positive subclass centers, or µ , the negative counterpart. On the other hand, F (p) corresponds to a p Re LU network whose non-zero neurons are aligned with one of the subclass centers. As we will see soon, both F(x) and F (p)(x) achieve high prediction accuracy on clean samples (x, y) from DX,Y , i.e., learning subclass centers is not necessary for good generalization. However, we will also show that F (p) is much more robust against adversarial perturbation than F. This is why we argue that learning subclass centers significantly improve robustness. Generalization and Robustness of F(x) and F (p)(x). Having established that both F and F (p) are realizable by a p Re LU network fp, we now study the generalization performance and robustness of F and F (p). 
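As a concrete reference, the numpy sketch below implements both classifiers, with mu a (K, D) array of orthonormal subclass centers whose first K1 rows belong to the positive class; the √K1 and √K2 weights on the two branches of F follow the realizability claim above, and the function names are ours.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def F_avg(x, mu, K1):
    """F(x) = sqrt(K1) * relu(<mu_plus, x>) - sqrt(K2) * relu(<mu_minus, x>), Eq. (3),
    with mu_plus, mu_minus the normalized sums of positive / negative subclass centers."""
    K2 = mu.shape[0] - K1
    mu_plus = mu[:K1].sum(axis=0) / np.sqrt(K1)          # unit norm since the centers are orthonormal
    mu_minus = mu[K1:].sum(axis=0) / np.sqrt(K2)
    return np.sqrt(K1) * relu(x @ mu_plus) - np.sqrt(K2) * relu(x @ mu_minus)

def F_sub(x, mu, K1, p=3):
    """F^(p)(x) = sum_{k<=K1} relu(<mu_k, x>)^p - sum_{k>K1} relu(<mu_k, x>)^p, Eq. (2)."""
    acts = relu(x @ mu.T) ** p                           # (n, K) per-subclass activations
    return acts[:, :K1].sum(axis=1) - acts[:, K1:].sum(axis=1)
```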
Under the conjecture that both F and F (p) can be learned by gradient flow training on a p Re LU network fp, we expect that such generalization and robustness properties extend to fp. The following result states that both F and F (p) achieve good generalization performance on DX,Y when the dimension D of the input data X is sufficiently large and the number of subclasses K is not too large. Proposition 1 (Generalization on clean data). Given a test sample (x, y) DX,Y , and classifiers F and F (p), p 1, defined in (3) and (2), resp., there is a constant C such that P (F(x)y > 0) 1 4 exp CD P F (p)(x)y > 0 1 2(K + 1) exp CD However, the following theorem shows a significant difference between F and F (p) regarding their adversarial robustness, when the number of subclasses K is large. Theorem 1 (l2-adversarial robustness). Given a test sample (x, y) DX,Y , and classifiers F and F (p), defined in (3) and (2), resp., the following statement is true for the same constant C in Proposition 1: Non-robustness of F against O 1 -attack Let K SD 1. Then for any ρ 0, P F x y(1 + ρ) y>0 2 exp CDρ2 Robustness of F (p) against O (1) -attack Let p 2. Then for any 0 δ min d 1 F (p) 1 2(K + 1) exp CDδ2 We refer the reader to Appendix C for the proofs for Proposition 1 and Theorem 1. Our theoretical results should be understood in the high-dimensional or low-intra-classvariance regime so that D α2 K2 log K. On the one hand, Can Implicit Bias Imply Adversarial Robustness? Theorem 1 shows that F, a classifier that corresponds to shallow Re LU networks whose neurons are aligned with either µ+ or µ , is vulnerable to an adversarial attack of , which diminishes as the number of subclasses grows. On the other hand, Theorem 1 shows that F (p), p 2, a classifier that corresponds to shallow p Re LU networks whose neurons are aligned with one of the subclass centers, is O (1)-robust against adversarial attacks. Remark 2. Complementary results to Theorem 1 in Appendix C.4 show that F is robust against any attack with l2-norm slightly smaller than 1 K , and F (p) is not robust to some attack with l2-norm slightly larger than 2 2 . Therefore, 1 2 2 are the critical" attack radius for F and F (p), resp. We conjecture this is also the case for the associated trained networks, which we verify in Section 5.1. Conjecture on the outcome of gradient flow training. We now study the conjecture that both classifiers can be learned by gradient flow on a shallow network with p Re LU activations and proper initialization. Specifically, we pose the following: Conjecture 1. Suppose that the intra-subclass variance α > 0 is sufficiently small, that one has a training dataset of sufficiently large size, and that we run gradient flow training on fp(x; θ), θ = {wj, vj}h j=1 of sufficiently large width h for sufficiently long time T, starting from random initialization of the weights with a sufficiently small initialization scale. If p = 1, then we have infc>0 supx SD 1 |cfp(x; θ(T)) F(x)| 1; If p [3, p) for some p > 3, then we we have infc>0 supx SD 1 |cfp(x; θ(T)) F (p)(x)| 1. Our conjecture states that with proper initialization and sufficiently long training time, gradient flow finds a network that is, up to a scaling factor c > 01, close to F in l - distance over SD 1, if p = 1. That is, when training a vanilla Re LU network, the neurons learn average directions µ+, and µ of subclass centers. However, when p 3, gradient flow finds a network close to F (p), i.e., the neurons learn individual subclass centers. 
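To make the closeness measure in Conjecture 1 concrete, the sketch below estimates dist(f_p, F) = inf_{c>0} sup_{x ∈ S^{D−1}} |c f_p(x) − F(x)| by a least-squares fit of the scale c over random unit vectors followed by a max over a fresh sample; this is a simplified stand-in for the projected-gradient-ascent procedure of Algorithm 1 in Appendix A, and the sample sizes are illustrative.

```python
import numpy as np

def estimate_dist(f_p, F, D, n_fit=5000, n_eval=20000, rng=None):
    """Crude Monte-Carlo estimate of dist(f_p, F) = inf_c sup_{||x||=1} |c f_p(x) - F(x)|.
    f_p and F map an (n, D) array of unit vectors to an (n,) array of outputs."""
    rng = np.random.default_rng(rng)

    def unit_sphere(n):
        u = rng.normal(size=(n, D))
        return u / np.linalg.norm(u, axis=1, keepdims=True)

    xs = unit_sphere(n_fit)
    fp_vals, F_vals = f_p(xs), F(xs)
    c_hat = max(float(fp_vals @ F_vals) / (float(fp_vals @ fp_vals) + 1e-30), 0.0)  # best c > 0
    xe = unit_sphere(n_eval)                             # fresh sample approximating the sup
    return float(np.max(np.abs(c_hat * f_p(xe) - F(xe))))
```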
Note that p can not be too large, as the post activation [σ( x,wj )]p wj p 1 converges to 1cos(x,wj)=1 x, wj when p grows ( x = 1), effectively zeroing out post activation values almost everywhere and also the gradient, staggering gradient flow training. Given Conjecture 1, Theorem 1 can be used to infer the robustness of the networks fp(x; θ(T)) obtained from gra- 1Since fp(x; θ) is positively homogeneous of degree two w.r.t. its parameter θ, θ(t) as training time t (Lyu & Li, 2019; Ji & Telgarsky, 2020), the output of fp(x; θ(T)) can never match that of F(x) or F (p)(x) without a proper choice of scaling factor c > 0. However, we note, that such a scaling factor will not change the prediction sign(fp(x; θ(T))). dient flow training with small initialization. When training a vanilla Re LU network in this regime, we expect the trained network to be close to F, thus non-robust against O 1 -attacks. When training a p Re LU network with p 3, we expect the trained network to be close to F (p), which is more robust than its Re LU counterpart. Our conjecture is based on existing work on the implicit bias of gradient flow on shallow networks with small initialization, and we carefully explain the rationale behind it in Section 4. However, formally showing such convergence results is challenging as the data distribution DX,Y considered here is more complicated than those for which convergence results can be successfully shown (Boursier et al., 2022; Min et al., 2024; Chistikov et al., 2023; Wang & Ma, 2023). Instead, we provide a preliminary analysis of this conjecture. Moreover, in Section 5, we provide numerical evidence that supports our conjecture. Comparison with prior work. Frei et al. (2023) consider a similar setting as ours with three minor differences. Firstly, their data distribution differs from ours by a scaling factor of D on the input data. Second, they allow for tiny correlations between the two subclass centers. Third, they consider shallow Re LU networks with bias. They show that any network trained by gradient flow is non-robust against O 1 attacks, which covers any initialization, but their analysis does not characterize what classifier the trained network corresponds to. While we consider specifically small initialization, we can at least conjecture, and numerically verify, what the corresponding classifier is at the end of the training. In addition, Frei et al. (2023) show the existence of O(1)-robust Re LU network if neurons are aligned with subclasses and there is some carefully chosen bias. While it already sheds light on the need for learning subclasses, their proposed network can not be found by gradient flow. On the contrary, our proposed robust F (p) can be obtained by gradient flow. 4. Discussion on Gradient Flow Training In Section 3, we conjectured that under gradient flow starting from a small initialization, a vanilla Re LU network favors learning average directions µ+ and µ of subclass centers, while a p Re LU network favors learning every subclass centers µk, k [K], resulting in a significant difference between these two networks in terms of adversarial robustness. In this section, we provide a theoretical explanation of such alignment preferences in the scope of implicit bias of gradient flow training with small initialization (Maennel et al., 2018; Boursier & Flammarion, 2024). We remind the reader that gradient flow training is described in Section 2. Can Implicit Bias Imply Adversarial Robustness? 4.1. 
Gradient flow with small initialization We first briefly describe the training trajectory of gradient flow on shallow networks with a small initialization. With a sufficiently small initialization scale (to be defined later), the gradient flow training is split into two phases with distinct dynamic behaviors of the neurons. The first phase is often referred to as the initial/early alignment phase (Boursier & Flammarion, 2024; Min et al., 2024), during which the neurons keep small norms while changing their directions towards one of the extremal vectors (Maennel et al., 2018), which are jointly determined by the training dataset and the activation function. The second phase is often referred to as the fitting/convergence phase. Within the fitting phase, neurons keep a good alignment with the extremal vectors and start to grow their norms, and the loss keeps decreasing until convergence, upon which one obtains a trained network whose dominant neurons (those with large norms) are all aligned with one of the extremal vectors. Notably, the neurons favor different extremal vectors depending on the activation function, precisely depicted by Figure 1. With the same dataset and with the same initialization of the weights, p Re LU activation makes a significant difference in neuron alignment: When p = 1 (vanilla Re LU network), the average class centers µ+ and µ are those extremal vectors attracting" neurons during the alignment phase, leading to a trained Re LU network that has effectively two neurons (one aligned with µ+ and another with µ ) at the end of the training. However, when p = 3, the subclass centers µ1, , µk become extremal vectors that are attracting" neurons , the resulting p Re LU networks successfully learn every subclass center, which, we have argued in Section 3, substantially improves the robustness (over vanilla Re LU net) against adversarial attack. While the case visualized in Figure 1 is a simple one in 3dimension with 3 subclasses, we will provide experiments in higher dimension and with more subclasses in Section 5. Despite the seemingly simple dynamic behavior of the neurons, the rigorous theoretical analysis is much more challenging due to the complicated dependence (Maennel et al., 2018; Boursier et al., 2022; Wang & Ma, 2023; Kumar & Haupt, 2024) of the extremal vectors on the dataset and the activation function. Hence prior works (Boursier et al., 2022; Min et al., 2024; Chistikov et al., 2023; Wang & Ma, 2023) assume simple training datasets and their analyses are for Re LU activation. Here, we are faced with a relatively more complex data distribution that generates our dataset, and at the same time deals with a p Re LU activation, thus a full theoretical analysis of the convergence, that would prove our Conjecture 1, still has many challenges and deserves a careful treatment in a separate future work. For now, we provide some preliminary theoretical analysis of the neural alignment in p Re LU networks that supports our conjecture. 4.2. Preliminary theoretical analysis on neuron alignment in p Re LU networks We first make the following two simplifying assumptions: Small and balanced initialization. First, we obtain i.i.d. samples wj0, j = 1, , h drawn from some random distribution such that almost surely wj0 M, j for some M > 02, and then and then initialize the weights as wj(0) = ϵwj0, vj(0) { wj(0) , wj(0) }, j [h] . (4) That is, wj0 determines the initial direction of the neurons and we use ϵ to control the initialization scale. 
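For concreteness, the numpy sketch below implements the balanced small initialization (4) and an Euler discretization of the alignment-phase direction dynamics described above, restricted to positive neurons; the direction field used in the update is the weighted data combination made precise in Lemma 1 below, while the sphere-uniform draw of w_j0, the step size, and the iteration count are illustrative choices rather than part of the analysis. Run on the simplified dataset of orthonormal subclass centers, it illustrates the behavior predicted by Figure 1 and Theorem 2: for p = 1, positive neurons drift toward the average center µ+, whereas for p ≥ 3 they lock onto individual subclass centers.

```python
import numpy as np

def balanced_small_init(h, D, eps=1e-6, M=1.0, rng=None):
    """Balanced small initialization of Eq. (4): w_j(0) = eps * w_j0 with ||w_j0|| <= M,
    and v_j(0) = +/- ||w_j(0)|| with a random sign."""
    rng = np.random.default_rng(rng)
    w0 = rng.normal(size=(h, D))
    w0 = M * w0 / np.linalg.norm(w0, axis=1, keepdims=True)   # directions on the radius-M sphere
    W = eps * w0
    v = rng.choice([-1.0, 1.0], size=h) * np.linalg.norm(W, axis=1)
    return W, v

def simulate_alignment(W0, X, y, p=3, steps=2000, lr=5e-3):
    """Euler steps for the alignment-phase dynamics of positive neurons,
        d/dt (w/||w||) ~ P_w^perp x^(p)(w),
        x^(p)(w) = sum_k 1{<x_k, w> > 0} y_k x_k p cos^(p-1)(x_k, w),
    with X the simplified training data (the subclass centers) and y its labels."""
    W = W0 / np.linalg.norm(W0, axis=1, keepdims=True)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    for _ in range(steps):
        cos = W @ Xn.T                                        # (h, K) cosines between neurons and data
        weights = (cos > 0) * y * p * np.maximum(cos, 0.0) ** (p - 1)
        xp = weights @ X                                      # extremal-vector field x^(p)(w_j)
        xp -= np.sum(xp * W, axis=1, keepdims=True) * W       # project onto the tangent space at w_j
        W = W + lr * xp
        W /= np.linalg.norm(W, axis=1, keepdims=True)         # track directions only (unit norm)
    return W
```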
This balanced assumption is standard in the analysis of shallow networks with small initialization (Soltanolkotabi et al., 2023; Boursier et al., 2022; Min et al., 2024). Simplified training dataset. The training dataset {(xk, yk), k = 1, , K} is the collection of exact subclass centers, together with their class label. That is, xk = µk, yk = 1{k K1} 1{k>K1}, k [K]. Similar datasets with orthogonality among data points xks are studied in Boursier et al. (2022) for Re LU networks. Neuron alignment in p Re LU networks. The key to the theoretical understanding of neuron alignment is the following lemma. Lemma 1. Given some initialization from (4), if ϵ = O( 1 h), then there exists T = θ( 1 hϵ) such that the trajectory under gradient flow training with the simplified training dataset almost surely satisfies that t T, d dt wj(t) wj(t) sign(vj(0))P wj(t)x(p)(wj(t)) where P w = I ww x(p)(w) = XK k=1 γk(w)ykxk p[cos(xk, w)]p 1, (5) with γk(w) being a subgradient of σ(z) at z = xk, w when p = 1 and γk(w) = 1cos(xk,w) 0 when p > 1. Lemma 1 suggests that during the alignment phase t T, one have the following approximation d dt wj(t) wj(t) sign(vj(0))P wj(t)x(p)(wj(t)) , (6) 2For example, if we sample every entries of wj0 by a uniform distribution over 1 Can Implicit Bias Imply Adversarial Robustness? Figure 2. Numerical experiment (K = 10, K1 = 6) validates Conjecture 1. (a) We train p Re LU networks using SGD with small initialization, then estimate the distance dist(fp, F) between the trained network fp and the classifier F, when p = 1 (top plot); When p = 3, we estimate dist(fp, F (p)) instead (bottom plot). The training is done under different choices of intra-subclass variance α and repeated 10 times per α; the Solid line shows the average and the shade denotes the region between max and min values. (b) Given a trained network obtained from an instance of this training (α = 0.1), we reorder the neurons w.r.t. their contributions |vj| wj and then plot the contributions in a bar plot; (c)(d) For neurons with large contributions, we plot a colormap, with each pixel represents some cos(wj, µ), where µ could be either average class centers µ+ and µ or subclass centers µk, k [K]. Note: For visibility, the neurons are reordered again so that neurons aligned with the same µ are grouped together. (e) Lastly, we carry out l2 PGD attack on a test dataset and plot the robust accuracy of the trained network under different choices of attack radius. which essentially shows that when wj is a positive neuron (sign(vj(0)) > 0), then gradient flow dynamics during alignment phase pushes wj/ wj toward the direction of x(p)(wj). Notably, x(p)(wj) is a weighted combination of xks and the mixing weights critically depend on p. Roughly speaking, when p = 1, x(p)(wj), weighing equally on xks that activate wj, are more aligned with µ+ and µ , while when p 3, x(p)(wj), weighing more on xks that are close to wj in angle, are more aligned with one of the subclass centers, thus by moving toward x(p)(wj) in direction, the neuron wj is likely to align with average class centers in the former case, and with subclass centers in the latter case (We illustrate this in an example in Appendix D.1). This trend leads to the following alignment bias. Theorem 2 (Alignment bias of positive neurons). 
Given some sufficiently small δ > 0 and a fixed choice of p 1, then ϵ0 := ϵ0(δ, p) > 0 such that for any solution of the gradient flow on fp(x; θ) with the simplified training dataset, starting from some initialization from (4) with initialization scale ϵ < ϵ0, almost surely we have that at any time t T = θ( 1 hϵ) and j with sign(vj(0)) > 0, d dtcos wj(t), µ+ cos(wj(t), µ+)=1 δ ( > 0, when p = 1 < 0, when p 3 , d dtcos(wj(t), µk) cos(wj(t),µk)=1 δ > 0, k K1 . (8) This theorem shows how different choice of p alters the neurons preferences on which directions to align. If we define the neighborhood of some µ direction as B(µ, δ) := {z SD 1 : cos (µ, z) 1 δ}, then specifically for a positive neuron: When p = 1, then any neuron with wj(t0) B( µ+, δ) necessarily has wj(t) B( µ+, δ), t [t0, T], i.e. it keeps at least 1 δ alignment with µ+ until the end of the alignment phase (at time T). Therefore, there is a preference of staying close to µ+ for positive neurons. Interestingly, such preference no longer exists when p 3. In particular, any positive neuron with wj(t0) / B( µ+, δ) necessarily has wj(t) / B( µ+, δ), t [t0, T], i.e., any neuron initialized with some angular distance to µ+ will not get any closer to µ+. Instead, the neurons are now more likely to stay close to some subclass centers, as shown in (8). Similar results can be said for negative neurons (whose index j has sign(vj(0)) < 0): they favor average negative class center µ when p = 1 and favor subclass directions when p 3. We refer the interested readers to Appendix D for a complete Theorem 2 (and its proof) that also includes the statement for negative neurons. 5. Numerical Experiments Our numerical experiments3 have two parts: First, we conduct experiments to validate Conjecture 1. Then, we provide preliminary experiments on real datasets to highlight the potential of using p Re LU activation for obtaining more robust classifiers. 5.1. Numerical evidence supports our conjecture We run SGD (batch size 100 and step size 0.2 for 2 105 epochs) with small initialization (all weights initialized as mean-zero Gaussian with standard deviation 10 7) to train a p Re LU network with h = 2000 neurons on a dataset drawn from DX,Y with D = 1000, K = 10, and 3Code available at https://github.com/hanchmin/ p Re LU_ICML24. Can Implicit Bias Imply Adversarial Robustness? Figure 3. Parity prediction on MNIST dataset with p Re LU networks. (a) We plot the data correlation as a colormap, where each pixel represents some cos (xi, xj) between two centered data xi, xj from MNIST training dataset; (b) We run Adam with batch size 1000 to train a p Re LU network under Kaiming initialization (repeated 10 times), then plot the training/testing accuracy during training for different choice of p (The shade region indicates the range between the minimum and maximum values over 10 randomized runs); (c) We stack the hidden post-activation representation of each training sample into a matrix and compute its stable rank, and plot the evolution of this stable rank during training; (d) After training for 50 epoch, we carry out APGD l -attack on MNIST test dataset (in pixel space) and plot the robust accuracy of the trained p Re LU network under different choice of attack radius. Figure 4. Classification on Caltech256 dataset (relabeled into 10 superclasses) with a pre-trained Res Net152 as a fixed feature extractor. K1 = 6. 
With different choices of intra-subclass variance α, we take the network fp at the end of the training and estimate (we refer the readers to Appendix A for the estimation algorithm) its distance dist(fp, F) = infc>0 supx SD 1 |cfp(x) F(x)| to the classifier F(x), or the distance dist(fp, F (p)) to F (p)(x), depending on the value of p. As one sees from Figure 2, when p = 1, fp is close to F, and the estimated distance slightly increases as α gets larger. Similarly, when p = 3, fp is close to F (p). Furthermore, we investigate the alignment between the dominant neurons in fp and class/subclass centers. Figure 2 shows that indeed neurons in fp learn only average class centers µ+ and µ when p = 1 while learning every subclass center µk, k [K] when p = 3. Lastly, as our Theorem 1 predicts, fp, p = 3 is much more robust than the one with p = 1. This series of numerical evidence strongly support Conjecture 1 (with additional experiments in Appendix A). 5.2. Experiments on real datasets Although we have shown good alignment between our theory and our numerical experiments in 5.1. The settings largely remain unrealistic. To show the potential value of p Re LU networks in practical settings in finding a robust classifier, we now study classification (albeit slightly mod- ified) problems on real datasets. 5.2.1. PARITY CLASSIFICATION ON MNIST We first consider training a p Re LU network of width h = 500 to predict whether an MNIST digit is even or odd. This poses a problem similar to the one studied in our theoretical analyses: each digit is naturally a subclass and they form classes of even digits and odd digits. Moreover, once the data is centered, as shown in Figure 3, two data points of the same digit are likely to have a large positive correlation, and two points of different digits to have a small, or negative correlation, which resembles the our orthogonality assumption among subclass centers. Therefore, with appropriate data preprocessing, we expect the p Re LU network to find a more robust classifier when p is large. Data preprocessing. We relabel MNIST data by parity, which leads to a binary classification. Then we center both the training and test set by subtracting off the mean image of all training data and then normalize the residue, resulting in a centered, normalized training/test set 4. 4When training p Re LU networks, some normalization of the data is required to improve training stability. To see this, notice that the post-activation value for ith data scales as xi p; When p is large, this term diminishes or explodes depending on where the Can Implicit Bias Imply Adversarial Robustness? Training p Re LU. We use Kaiming initialization (He et al., 2015) with non-small variance for all the weights and run Adam with cross-entropy loss and with batch size 1000 for 50 epochs and summarize the training results in Figure 3. First of all, as p increases, the trained network becomes more robust against the adversarial l -attack computed from an adaptive projected gradient ascent (APGD) algorithm (Croce & Hein, 2020). Interestingly, p Re LU with p > 1 even has a slight edge over vanilla Re LU net in terms of test accuracy on clean data. Given that the MNIST dataset does not satisfy our data distribution, there is no reason to expect that neurons in p Re LU can learn each subclass (in this case, the individual digit) and indeed they cannot. 
However, we highlight that p Re LU empirically is more capable of learning distinct patterns within each superclass/class: We stack the hidden post-activation representation of each training sample into a matrix and compute its stable rank (defined as A 2 F A 2 for a matrix A, as an approximation of rank(A)). From Figure 3, we see that the hidden feature matrix of MNIST obtained from p Re LU network has a much larger stable rank than the one from vanilla Re LU net, i.e. the hidden features collapse less when p is large, and we conjecture it to be the reason why p Re LU still gains much more adversarial robustness than vanilla Re LU. Theoretically investigating such a phenomenon is an interesting future direction. Additional experiments. To further illustrate the effectiveness of p Re LU activation in improving the adversarial robustness of the trained network, we conduct additional experiments, in Appendix B, of training p Re LU networks for the original digit classification task on MNIST and test the robustness of the trained network with different types of attacks. Despite that our theoretical results only suggest robustness gain can be achieved under datasets with subclasses, the additional experimental results show that the p Re LU networks are still much more robust than their Re LU counterpart even for the original digit classification task, which clearly does not follow our data assumption. 5.2.2. IMAGE CLASSIFICATION WITH PRE-TRAINED FEATURE EXTRACTOR Our next experiment considers a transfer learning setting where we use some intermediate layer (more precisely, the feature layer before the fully-connected layers) of a pre-trained Res Net152 on Image Net as a feature extractor and classify the extracted feature from Caltech256 (Griffin et al., 2007) object classification dataset with p Re LU networks. The main reason behind this setting is that we expect the extracted feature representation of each class to be clustered, and this is verified in Figure 4. value x is smaller or larger than one. We intend to have a classification task with subclasses, thus we group the original 256 classes in the dataset into 10 superclasses (each superclass has no semantic meaning in this case) and train p Re LU networks of width h = 2000 that predict the superclass label. Then we obtain the feature representation of the dataset from our feature extractor, center the feature, and normalize, same as we did for the MNIST dataset. The training process is exactly the same as for the MNIST dataset, and we summarize the results in Figure 4. Even for this multi-class classification task, still p Re LU achieves higher test accuracy and is more robust when p gets larger. 6. Conclusion By introducing a generalized p Re LU activation, we resolve the non-robustness issue, caused by the implicit bias of gradient flow, of Re LU networks when trained on a classification task in the presence of latent subclasses with small inter-subclass correlations. Future work includes formal analyses on neural alignment in p Re LU networks as well as more empirical evaluation of p Re LU activation for its ability to improve the robustness of neural networks. Acknowledgement The authors thank the support of the NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning (NSF grant 2031985), and the ONR MURI Program (ONR grant 503405-78051). The authors thank the feedback from anonymous reviewers, which has greatly improved our experimental results. 
The authors thank Enrique Mallada, Jeremias Sulam, and Ambar Pal for their suggestions and comments at various stages of this research project, and also thank Kyle Poe for carefully reading the manuscript. Impact Statement While this work is mostly theoretical, it tackles the issue of nonrobustness of neural networks trained by gradientbased algorithms, which potentially leads to more robust, reliable, and trustworthy neural networks in many application domains. Abbe, E., Bengio, S., Boix-Adsera, E., Littwin, E., and Susskind, J. Transformers learn through gradual rank increase. Advances in Neural Information Processing Systems, 36, 2023. Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing de- Can Implicit Bias Imply Adversarial Robustness? fenses to adversarial examples. In International conference on machine learning, pp. 274 283. PMLR, 2018. Böhle, M., Fritz, M., and Schiele, B. B-cos networks: Alignment is all we need for interpretability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10329 10338, 2022. Boursier, E. and Flammarion, N. Early alignment in twolayer networks training is a two-edged sword. ar Xiv preprint ar Xiv:2401.10791, 2024. Boursier, E., Pullaud-Vivien, L., and Flammarion, N. Gradient flow dynamics of shallow relu networks for square loss and orthogonal inputs. In Advances in Neural Information Processing Systems, volume 35, pp. 20105 20118, 2022. Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber, J., Tsipras, D., Goodfellow, I., Madry, A., and Kurakin, A. On evaluating adversarial robustness. ar Xiv preprint ar Xiv:1902.06705, 2019. Chistikov, D., Englert, M., and Lazic, R. Learning a neuron by a shallow re LU network: Dynamics and implicit bias for correlated inputs. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Chizat, L. and Bach, F. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Proceedings of Thirty Third Conference on Learning Theory, volume 125, pp. 1305 1338. PMLR, 09 12 Jul 2020. Clarke, F. H. Optimization and nonsmooth analysis. SIAM, 1990. Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, pp. 1310 1320. PMLR, 2019. Croce, F. and Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, 2020. Dohmatob, E. Generalized no free lunch theorem for adversarial robustness. In International Conference on Machine Learning, pp. 1646 1654. PMLR, 2019. Du, S. S., Hu, W., and Lee, J. D. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In Advances in Neural Information Processing Systems (Neur IPS), 2018. Faghri, F., Gowal, S., Vasconcelos, C., Fleet, D. J., Pedregosa, F., and Roux, N. L. Bridging the gap between adversarial robustness and optimization bias. ar Xiv preprint ar Xiv:2102.08868, 2021. Fawzi, A., Fawzi, H., and Fawzi, O. Adversarial vulnerability for any classifier. Advances in neural information processing systems, 31, 2018. Frei, S., Vardi, G., Bartlett, P., Srebro, N., and Hu, W. Implicit bias in leaky relu networks trained on highdimensional data. In The Eleventh International Conference on Learning Representations, 2022. Frei, S., Vardi, G., Bartlett, P., and Srebro, N. The doubleedged sword of implicit bias: Generalization vs. 
robustness in re LU networks. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007. Gunasekar, S., Woodworth, B., Bhojanapalli, S., Neyshabur, B., and Srebro, N. Implicit regularization in matrix factorization. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6152 6160, 2017. Guo, C., Rana, M., Cisse, M., and van der Maaten, L. Countering adversarial images using input transformations. In International Conference on Learning Representations, 2018. He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026 1034, 2015. Jacot, A., Ged, F., Sim sek, B., Hongler, C., and Gabriel, F. Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. ar Xiv preprint ar Xiv:2106.15933, 2021. Ji, Z. and Telgarsky, M. Gradient descent aligns the layers of deep linear networks. In 7th International Conference on Learning Representations, ICLR 2019, 2019. Ji, Z. and Telgarsky, M. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 33:17176 17186, 2020. Kinfu, K. A. and Vidal, R. Analysis and extensions of adversarial training for video classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3416 3425, 2022. Kumar, A. and Haupt, J. Directional convergence near small initializations and saddles in two-homogeneous neural networks. ar Xiv preprint ar Xiv:2402.09226, 2024. Can Implicit Bias Imply Adversarial Robustness? Levine, A. and Feizi, S. (de) randomized smoothing for certifiable defense against patch attacks. Advances in Neural Information Processing Systems, 33:6465 6475, 2020. Lyu, K. and Li, J. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019. Lyu, K., Li, Z., Wang, R., and Arora, S. Gradient descent on two-layer nets: Margin maximization and simplicity bias. Advances in Neural Information Processing Systems, 34:12978 12991, 2021. Maennel, H., Bousquet, O., and Gelly, S. Gradient descent quantizes relu network features. ar Xiv preprint ar Xiv:1803.08367, 2018. Min, H., Tarmoun, S., Vidal, R., and Mallada, E. On the explicit role of initialization on the convergence and implicit bias of overparametrized linear networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7760 7768. PMLR, 18 24 Jul 2021. Min, H., Mallada, E., and Vidal, R. Early neuron alignment in two-layer relu networks with small initialization. In International Conference on Learning Representations, pp. 1 8, 5 2024. Pal, A., Sulam, J., and Vidal, R. Adversarial examples might be avoidable: The role of data concentration in adversarial robustness. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Papernot, N., Mc Daniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP), pp. 582 597. IEEE, 2016. Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. 
Advances in neural information processing systems, 29, 2016. Saxe, A. M., Mcclelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural network. In International Conference on Learning Representations, 2014. Shafahi, A., Huang, W. R., Studer, C., Feizi, S., and Goldstein, T. Are adversarial examples inevitable? In International Conference on Learning Representations, 2018. Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019. Soltanolkotabi, M., Stöger, D., and Xie, C. Implicit balancing and regularization: Generalization and convergence guarantees for overparameterized asymmetric matrix sensing. In Proceedings of Thirty Sixth Conference on Learning Theory, volume 195, pp. 5140 5142. PMLR, 12 15 Jul 2023. Stöger, D. and Soltanolkotabi, M. Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Advances in Neural Information Processing Systems, 34, 2021. Sulam, J., Muthukumar, R., and Arora, R. Adversarial robustness of supervised sparse coding. Advances in neural information processing systems, 33:2110 2121, 2020. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, 2014. Tarzanagh, D. A., Li, Y., Zhang, X., and Oymak, S. Maxmargin token selection in attention mechanism. Advances in Neural Information Processing Systems, 36, 2023. Vardi, G., Shamir, O., and Srebro, N. On margin maximization in linear and relu networks. Advances in Neural Information Processing Systems, 35:37024 37036, 2022. Wang, M. and Ma, C. Understanding multi-phase optimization dynamics and rich nonlinear behaviors of re LU networks. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Wang, Z. and Jacot, A. Implicit bias of sgd in l_{2}- regularized linear dnns: One-way jumps from high to low rank. ar Xiv preprint ar Xiv:2305.16038, 2023. Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. In International Conference on Learning Representations, 2019. Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., and Srebro, N. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pp. 3635 3673. PMLR, 2020. Yang, G., Duan, T., Hu, J. E., Salman, H., Razenshteyn, I., and Li, J. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pp. 10693 10705. PMLR, 2020. Can Implicit Bias Imply Adversarial Robustness? A. Additional Discussion on Validating Conjecture 1 A.1. Estimating dist(fp, F) To estimate the distance dist(fp, F) = infc>0 sup SD 1 |cfp(x) F(x)| between a trained network fp and a classifier F (or F (p)), we first pick an ˆc > 0 that yields the least-square match between ˆcfp(x) and F(x) over a set of points sampled from Unif(SD 1). Then given this choice of ˆc, we run normalized projected gradient ascent on |ˆcfp(x) F(x)|2 starting from an initial x sampled from Unif(SD 1), and repeat a large number of times, the maximum value of |ˆcfp(x) F(x)| at the end of Normalized PGA over all runs give an estimate of dist(fp, F). 
Algorithm 1 Estimating dist(fp, F) Input: Network fp; Classifier F (or F (p)); step size α; sample numbers N1, N2; Do: 1. Sample x, i = 1, , N1 from Unif(SD 1); ˆc arg minc>0 PN1 i=1 |cfp(x) F(x)|2 2. Sample new x, i = 1, , N2 from Unif(SD 1); ℓ( ) |ˆcfp( ) F( )|2; For i = 1, , N2: For t = 1, , max_iter: x(t) x(t 1) + α xℓ(x(t 1)) xℓ(x(t 1)) ; % normalized gradient ascent x(t) ; % projection onto unit sphere Return maxi |ˆcfp(x(max_iter) i ) F(x(max_iter) i )| A.2. Additional experiment We conduct the same experiment as in Section 5.1, with more subclasses K = 20, K1 = 8, the results are still well aligned with Conjecture 1. Notably, as K the total number of subclasses increases, the trained Re LU network is more vulnerable to l2 attacks, compared to the case in Section 5.1, while generalized p Re LU still is robust to these attacks. Figure 5. Additional Numerical experiment (K = 20, K1 = 8) validates Conjecture 1. (a) We train p Re LU networks using SGD with small initialization, then estimate the distance dist(fp, F) between the trained network fp and the classifier F, when p = 1 (top plot); When p = 3, we estimate dist(fp, F (p)) instead (bottom plot). The training is done under different choices of intra-subclass variance α and repeated 10 times per α; the Solid line shows the average and the shade denotes the region between max and min values. (b) Given a trained network obtained from an instance of this training (α = 0.1), we reorder the neurons w.r.t. their contributions |vj| wj and then plot the contributions in a bar plot; (c)(d) Given neurons with large contributions, we plot a colormap, with each pixel represents some cos(wj, µ), where µ could be either average class centers µ+ and µ or subclass centers µk, k [K]. Note: For visibility, the neurons are reordered again so that neurons aligned with the same µ are grouped together. (e) Lastly, we carry out l2 PGD attack on a test dataset and plot the robust accuracy of the trained network under different choices of attack radius. Can Implicit Bias Imply Adversarial Robustness? B. Additional experiments on MNIST dataset To further illustrate the effectiveness of p Re LU activation in improving the adversarial robustness of the trained network, we conduct the following additional experiments on MNIST dataset: MNIST digits classification. We follow the same experiment setting in Section 5.2.1 and train the network to classify the digits instead of their parity. Figure 6 reports the training results. Despite that our theoretical results only suggest robustness gain can be achieved under datasets with subclasses, these additional experimental results show that the p Re LU networks are still much more robust than their Re LU counterpart even for the original digit classification task, which clearly does not follow our data assumption. Figure 6. Digits classification on MNIST dataset with p Re LU networks. 
(a) We plot the data correlation as a colormap, where each pixel represents some cos (xi, xj) between two centered data xi, xj from MNIST training dataset; (b) We run Adam with batch size 1000 to train a p Re LU network under Kaiming initialization (repeated 10 times), then plot the training/testing accuracy during training for different choice of p; (c) We stack the hidden post-activation representation of each training sample into a matrix and compute its stable rank, and plot the evolution of this stable rank during training; (d) After training for 50 epoch, we carry out APGD l -attack on MNIST test dataset (in pixel space) and plot the robust accuracy of the trained p Re LU network under different choice of attack radius. Evaluating robustness under different attacks. Lastly, we check the adversarial robustness of the trained networks above under attacks of different norms. The results are summarized in the table below, and they show that the robustness gained from p Re LU activation is agnostic to the choice of attacks. Table 1. Robust accuracy of p Re LU networks under different attacks. The reported accuracy is the mean value over 5 runs with random initialization. Bold text indicates the best accuracy within the same row. Re LU (p = 1) Re LU (p = 2) Re LU (p = 3) Re LU (p = 4) Clean Acc. 0.981( 0.000) 0.982 ( 0.000) 0.982 ( 0.000) 0.980 ( 0.000) Robust Acc. (ℓ , radius= 0.05) 0.512 ( 0.006) 0.844 ( 0.002) 0.892 ( 0.002) 0.913 ( 0.001) Robust Acc. (ℓ , radius= 0.1) 0.040 ( 0.002) 0.346 ( 0.005) 0.522 ( 0.004) 0.637 ( 0.004) Robust Acc. (ℓ2, radius= 1) 0.301 ( 0.004) 0.640 ( 0.006) 0.726 ( 0.004) 0.775 ( 0.002) Robust Acc. (ℓ2, radius= 2) 0.007 ( 0.001) 0.101 ( 0.002) 0.165 ( 0.003) 0.239 ( 0.003) Robust Acc. (ℓ1, radius= 5) 0.500 ( 0.007) 0.744 ( 0.007) 0.780 ( 0.004) 0.807 ( 0.002) Robust Acc. (ℓ1, radius= 10) 0.098 ( 0.004) 0.294 ( 0.002) 0.354 ( 0.002) 0.402 ( 0.003) Can Implicit Bias Imply Adversarial Robustness? C. Proofs for Proposition 1 and Theorem 1 C.0. Verifying the claim regarding realizability of F(x) and F (p)(x) Before we present our poofs for the main results, we start with verifying our claim regarding realizability of F(x) and F (p)(x): Claim (restated). The following two statements are true: When p = 1, let I+ and I be any two non-empty and disjoint subsets of [h], let λ := (λj) Rh 0 be any vector such that P j I+ λ2 j = K1 and P j I λ2 j = K2, and let wj = λj µ+, vj = λj, j I+ , wj = λj µ , vj = λj, j I , wj = 0, vj = 0, otherwise. Then fp(x; {wj, vj}h j=1) F(x) . When p 1, let I1, , IK be any K non-empty and disjoint subsets of [h], let λ := (λj) Rh 0 be any vector such that P j Ik λ2 j = 1, k [K], and let wj = λjµk, vj = λj, j Ik, k K1 , wj = λjµk, vj = λj, j Ik, k > K1 , wj = 0, vj = 0 otherwise. Then fp(x; {wj, vj}h j=1) F (p)(x) . Proof. Statement 1: When p = 1.. Let I+, I , λ be define as in the statement, then we have fp(x; {wj, vj}h j=1) = j=1 vjσ( x, wj ) j I+ vjσ( x, wj ) + X j I vjσ( x, wj ) j I+ λjσ( x, λj µ+ ) X j I λjσ( x, λj µ ) j I+ λ2 jσ( x, µ+ ) X j I λ2 jσ( x, µ ) K1σ( x, µ+ ) p K2σ( x, µ ) = F(x) . Statement 2: When p 1.. Let {Ik}K k=1, λ be define as in the statement, then we have fp(x; {wj, vj}h j=1) = j=1 vj σp( x, wj ) j Ik vj σp( x, wj ) j Ik vj σp( x, wj ) j Ik λj σp( x, λjµk ) j Ik λj σp( x, λjµk ) j Ik λ2 jσp( x, µk ) X j Ik λ2 jσp( x, µk ) k K1 σp( x, µk ) X k>K1 σp( x, µk ) = F (p)(x) . Can Implicit Bias Imply Adversarial Robustness? C.1. 
Auxiliary lemmas The proof will use some basic facts in probability theory and we list them below. Lemma 2. Let E1, E2 be two events defined on some probability space, then P(E1) P(E1 E2) + P(Ec 2) (9) Proof. P(E1) = P (E1 (E2 Ec 2)) = P ((E1 E2) (E1 Ec 2)) P(E1 E2) + P(E1 Ec 2) P(E1 E2) + P(Ec 2). Lemma 3 (Hoeffding inequality). For any unit vector µ SD 1, we have D ID (| µ, ε | > t) 2 exp CDt2 for some constant C > 0. C.2. Proof for Proposition 1 Proposition 1 (Test error on clean data, restated). Given classifiers F(x), F (p)(x) defined in (3),(2), and a test sample (x, y) DX,Y , we have, for some constant C > 0, P (F(x)y > 0) 1 4 exp CD P F (p)(x)y > 0 1 2(K + 1) exp CD Proof. The proof is split into parts: We first prove the bound for F(x), then show the one for F (p)(x). But in both cases we have P(x,y) DX,Y (G(x)y > 0) = k=1 P(x,y) DX,Y (G(x)y > 0 | z = k) P (z = k) , (11) where G can be either F or F (p). Thus, it suffices to show P(x,y) DX,Y (G(x)y > 0 | z = k) 1 4 exp CD , k = 1, , K . (12) Bound for F(x) we start with the case of G being F. When k K1, we have P(x,y) DX,Y (F(x)y > 0 | z = k) = Pε N 0, α2 D ID (F(µk + ε) > 0) , (13) D ID (F(µk + ε) > 0) = 1 Pε N 0, α2 D ID (F(µk + ε) < 0) = 1 Pε N 0, α2 K1σ 1 K1 + ε, µ+ p K2σ ε, µ < 0, E1 1 Pε N 0, α2 1 K1 ε, µ+ p K2 ε, µ < 0 (14) = 1 Pε N 0, α2 K2 ε, µ < 0 1 Pε N 0, α2 D ID ε, µ+ > 1 2 K1 D ID ε, µ > 1 2 K2 where (14) uses the fact that σ(x) is non-decreasing w.r.t. x., and that σ(x) |x|. The last line uses Lemma 3. The proof of the case k > K1 is identical to the one above by the symmetry of the problem. Can Implicit Bias Imply Adversarial Robustness? Bound for F (p)(x) When k K1, we have, again, P(x,y) DX,Y F (p)(x)y > 0 | z = k = Pε N 0, α2 D ID F (p)(µk + ε) > 0 , (16) we define the event E1 := {| µk, ε | < 1}. Then by Lemma 2, D ID F (p)(µk + ε) > 0 = 1 Pε N 0, α2 D ID F (p)(µk + ε) < 0 1 Pε N 0, α2 D ID F (p)(µk + ε) < 0, E1 P (Ec 1) 1 Pε N 0, α2 σp (1 + µk, ε ) + X l K1,l =k σp ( µl, ε ) X K1 1 1 2K exp CD 1 2(K + 1) exp CD where (17) uses the fact that under the event E1, one has σp (1 + µk, ε ) (1 | µk, ε |)p, and (18) uses the fact that x p x 1 for any vector x and p 1. The last line uses Lemma 3. The proof of the case k > K1 is identical to the one above by the symmetry of the problem. C.3. Proof for Theorem 1 Theorem 1 (l2-Adversarial Robustness, restated). Given classifiers F(x), F (p)(x) defined in (3),(2), and a test sample (x, y) DX,Y , the following statement is true for some constant C > 0: Non-robustness of F(x) against O 1 -attack We let d0 := K SD 1, then for some ρ > 0, P F x y(1 + ρ) y>0 2 exp CDρ2 Robustness of F (p)(x) against O (1) -attack We let p 2, then for some 0 δ < min d 1 F (p) 1 2(K + 1) exp CDδ2 Proof. This proof is split into two parts. One for F(x) and another for F (p)(x). Can Implicit Bias Imply Adversarial Robustness? 
Non-robustness of F(x) Since P(x,y) DX,Y F x y(1 + ρ) k=1 P(x,y) DX,Y F x y(1 + ρ) y > 0 | z = k P (z = k) , (20) It suffices to show P(x,y) DX,Y F x y(1 + ρ) y > 0 | z = k 2 exp CD2K (ρ 1)2α2 When k K1, we have P(x,y) DX,Y F x y(1 + ρ) y > 0 | z = k = Pε N 0, α2 D ID F µk + ε 1 + ρ = Pε N 0, α2 K1σ 1 K1 + ε, µ+ (1 + ρ) K1 K2σ ε, µ + (1 + ρ) K2 To proceed, we define the event E2 := { ε, µ+ (1+ρ) K1 K + 1 K1 0}, then by Lemma 2 K1σ 1 K1 + ε, µ+ (1 + ρ) K1 K2σ ε, µ + (1 + ρ) K2 K1σ 1 K1 + ε, µ+ (1 + ρ) K1 K2σ ε, µ + (1 + ρ) K2 = Pε N 0, α2 K1σ 1 K1 + ε, µ+ (1 + ρ) K1 K2σ ε, µ + (1 + ρ) K2 K1 ε, µ+ (1 + ρ)K1 K2 ε, µ (1 + ρ)K2 K | ε, d0 | (1 + ρ) > 0 D ID | ε, d0 | > ρ where (23) is because the second probability is 0, and (24) uses again the monotonicity of Re LU σ(x). The last line uses Lemma 3. The proof of the case k > K1 is identical to the one above by the symmetry of the problem. Robustness of F (p)(x) It suffices to show P(x,y) DX,Y min d 1 F (p) y < 0 | z = k 2(K2 + 2) exp CDδ2 2(K2 + 1)2α2 Can Implicit Bias Imply Adversarial Robustness? When k K1, we have P(x,y) DX,Y min d 1 F (p) y < 0 | z = k = Pε N 0, α2 min d 1 F (p) = Pε N 0, α2 1 + µk, ε + l =k,1 l K1 σp K1+1 l K2 σp To proceed, we define the event 1 + µk, ε + 2 d, µk 0, d SD 1 ) and for ease of presentation, we write G(ε, d) := σp 1 + µk, ε + l =k,1 l K1 σp Then, by Lemma 2, (27) = Pε N 0, α2 D ID min d 1 G(ε, d) < 0 D ID min d 1 G(ε, d) < 0, E3 + P (Ec 3) . (29) Now under the event E3, we can lower bound G(ε, d) by 1 + µk, ε + l =k,1 l K1 σp 1 + µk, ε + | µl, ε | + 2 | d, µl | Can Implicit Bias Imply Adversarial Robustness? Thus, we have 1 + µk, ε + | µl, ε | + 2 | d, µl | = Pε N 0, α2 min d 1 1 + µk, ε + | µl, ε | + 2 | d, µl | min d 1 1 + l=K1+1 (| d, µl |)p ! 1 | {z } M (δ) l=K1+1 (| µl, ε |)p ! 1 p | µk, ε | < 0, E3 + P (Ec 3) (32) | µk, ε | + l=K1+1 (| µl, ε |)p ! 1 | µk, ε | + l=K1+1 | µl, ε | > M (δ) + P (Ec 3) (33) D ID | µk, ε | > M (δ) l=K1+1 Pε N 0, α2 D ID | µl, ε | > M (δ) D ID | µk, ε | > M (δ) l=K1+1 Pε N 0, α2 D ID | µl, ε | > M (δ) + Pε N 0, α2 | µk, ε | > 1 2(K2 + 1) exp CD(M (δ))2 (K2 + 1)2α2 2(K2 + 2) exp CD(M (δ))2 (K2 + 1)2α2 where (32) uses the sub-additivity of ℓp norm, and (33) uses again the inequality x p x 1 for any x and p 1. The last line uses Lemma 3. The remaining thing is to show that M (δ) = δ 2. First, by the property of lp norm (when p 2), we have l=K1+1 (| d, µl |)p ! 1 l=K1+1 | d, µl |2 , (35) Can Implicit Bias Imply Adversarial Robustness? M (δ) = min d 1 1 l=K1+1 (| d, µl |)p ! 1 2 | d, µk | l=K1+1 | d, µl |2 v u u t| d, µk |2 + l=K1+1 | d, µl |2 (36) where (36) uses the fact that a + 2(a + b) for any a, b 0, and (37) uses the fact that µ1, , µK are an orthonormal basis, thus q PK l=1 | d, µl |2 d 1. We have proved M (δ) δ 2, and this lower bound can be attained by d = µk+µl 2 for any K1+1 l K2. Therefore 2. Finally, we have P(x,y) DX,Y min d 1 F (p) y < 0 | z = k 2(K2 + 2) exp CD(M (δ))2 (K2 + 1)2α2 = 2(K2 + 2) exp CDδ2 2(K2 + 1)2α2 2(K + 1) exp CDδ2 The proof of the case k > K1 is identical to the one above by the symmetry of the problem. C.4. Complementary results to Theorem 1 Theorem 3 (l2-Adversarial Robustness, complementary to Theorem 1). Given classifiers F(x), F (p)(x) defined in (3),(2), and a test sample (x, y) DX,Y , the following statement is true for some constant C > 0: For some 0 ρ 1, P min d 1 F x + (1 ρ) K d y > 0 1 2 exp CDρ2 For some 0 < δ, min d 1 F (p) The proof has the same spirit as those for Theorem 1 so we state it briefly. Proof. 
Robustness of F(x), complementary result It suffices to show that P(x,y) DX,Y min d 1 F x + (1 ρ) K x y < 0 | z = k 4 exp CDρ2 , k K . (39) Can Implicit Bias Imply Adversarial Robustness? When k K1, we have P(x,y) DX,Y min d 1 F x + (1 ρ) K x y < 0 | z = k = Pε N 0, α2 D ID min d 1 F µk + ε + (1 ρ) = Pε N 0, α2 D ID min d 1 K1σ 1 K1 + ε, µ+ + 1 ρ K2σ ε, µ + 1 ρ d, µ < 0 (40) If we still let d0 := K SD 1, then by the fact that |x| σ(x) x for any x, we have D ID min d 1 D ID min d 1 K1 ε, µ+ < 0 D ID min d 1 K1 + K2 q d, µ+ 2 + d, µ 2 K1 ε, µ+ < 0 (41) D ID min d 1 [1 (1 ρ)] K1 ε, µ+ < 0 (42) K1 ε, µ+ > ρ D ID ε, µ + ε, µ+ > ρ 2Pε N 0, α2 D ID ε, µ > ρ 2 where (41) uses the fact that ab+cd b2 + d2 for any a, b, c, d R, a simple application of Cauchy-Schwarz inequality, and (42) uses the fact that µ+, µ are orthonormal, thus q d, µ+ 2 + d, µ 2 d 1. The last line uses Lemma 3. The proof of the case k > K1 is identical to the one above by the symmetry of the problem. Non-robustness of F (p)(x), complementary result It suffices to show that P(x,y) DX,Y min d 1 F (p) y > 0 | z = k , k K . (44) Can Implicit Bias Imply Adversarial Robustness? When k K1, we define dk = µk+µK P(x,y) DX,Y min d 1 F (p) y > 0 | z = k P(x,y) DX,Y y > 0 | z = k = Pε N 0, α2 2 2 + ε, µk 2 2 + ε, µK l =k,l =K | ε, µl |p > 0 2 2 + ε, µk l =k,l =K | ε, µl | > σp 1 + δ/ 2 2 + ε, µK 2 2 + | ε, µk | l =k,l =K | ε, µl | > σp 1 + δ/ 2 2 | ε, µK | 2 2 + | ε, µk | l =k,l =K | ε, µl | > σp 1 + δ/ 2 2 | ε, µK | 2 2 + | ε, µk | l =k,l =K | ε, µl | > σ 2 2 | ε, µK | 2 2 + | ε, µk | l =k,l =K | ε, µl | > 1 + δ/ 2 2 | ε, µK | where (45) uses the fact that σp(x) is non-decreasing w.r.t. x, and (46) uses the fact that (a + b)p ap + bp for any a, b > 0. We define the event 2 2 + ε, µk > 0 Can Implicit Bias Imply Adversarial Robustness? Then by Lemma 2, (47) Pε N 0, α2 2 2 + | ε, µk | + X l =k,l =K | ε, µl | > 1 + δ/ 2 2 | ε, µK | , E5 + Pε N 0, α2 | ε, µk | + X l =k,l =K | ε, µl | > 1 + δ/ 2 2 | ε, µK | , Ec 5 | ε, µK | + | ε, µk | + X l =k,l =K | ε, µl | > δ + Pε N 0, α2 | ε, µK | + | ε, µk | + X l =k,l =K | ε, µl | > 1 + δ/ 2Pε N 0, α2 1 l K | ε, µl | > δ 2 2Pε N 0, α2 D ID | ε, µ1 | > δ 2 The last line uses Lemma 3. When k > K1, the proof is similar with dk := µk+µ1 Can Implicit Bias Imply Adversarial Robustness? D. Proofs for Lemma 1 and Theorem 2 D.1. Alignment bias illustrated As we showed in Lemma 1, during the alignment phase t T, one have the following approximation d dt wj(t) wj(t) sign(vj(0))P wj(t)x(p)(wj(t)) , (50) k=1 γk(w)ykxk p[cos(xk, w)]p 1 , (51) which essentially shows that when wj is a positive neuron (sign(vj(0)) > 0), then gradient flow dynamics during alignment phase pushes wj/ wj toward the direction of x(p)(w). Notably, x(p)(wj) critically depends on p. Roughly speaking, when p = 1, x(p)(wj) are more aligned with µ+ and µ , while when p > 3, x(p)(wj) are more aligned with one of the subclass centers, thus by moving toward x(p)(wj) in direction, the neurons are likely to align with average class centers in the former case, and with subclass centers in the latter case. We elaborate this statement here with a toy example. Figure 7. Alignment bias visualized. During alignment phase wj is moving toward x(p)(wj) in direction. When p = 1, x(1)(wj) is aligned with average class center µ+; When p = 3, x(p)(wj) is more aligned with one of the subclass centers µ1 and µ2, depending on which one is closer to wj in cosine distance. 
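As a numerical companion to the toy example worked out below, the following NumPy sketch simply evaluates the alignment target x^(p)(w) from (51) for two orthogonal positive-class centers; the helper name alignment_target and the specific neuron direction are our own illustrative choices, not part of the paper's experiments.

```python
import numpy as np

def alignment_target(w, xs, ys, p):
    """x^(p)(w) = sum_k gamma_k(w) * y_k * x_k * p * cos(x_k, w)^(p-1),
    where gamma_k(w) = 1 if <x_k, w> > 0 and 0 otherwise (ReLU subgradient)."""
    w_unit = w / np.linalg.norm(w)
    out = np.zeros_like(w)
    for x, y in zip(xs, ys):
        if x @ w > 0:  # gamma_k(w) = 1 only for samples that activate the neuron
            cos = float(x @ w_unit) / np.linalg.norm(x)
            out += y * x * p * cos ** (p - 1)
    return out

# Two orthogonal positive-class centers in the plane, and a neuron activated by both.
mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
xs, ys = [mu1, mu2], [1.0, 1.0]
w = np.array([np.cos(0.2), np.sin(0.2)])  # closer to mu1 than to mu2

for p in (1, 2, 3):
    t = alignment_target(w, xs, ys, p)
    print(p, t / np.linalg.norm(t))
# p = 1: direction (1, 1)/sqrt(2), i.e. the average class center mu_+;
# p = 2: direction w itself;
# p = 3: a direction strictly closer to mu1, the nearer subclass center.
```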
Suppose the dataset (K = 2, K1 = 2) contains only two orthogonal centers µ1 and µ2 in the 2-D plane, both with positive labels. Given a positive neuron wj that is activated by both µ1 and µ2, as shown in Figure 7, during the alignment phase wj/||wj|| moves toward the direction of x^(p)(wj), which is:
when p = 1,
x^(1)(wj) = Σ_{k=1}^2 γk(wj) yk xk = µ1 + µ2 = √2 µ+ , (52)
exactly aligned with the average class center µ+;
when p = 2,
x^(2)(wj) = Σ_{k=1}^2 γk(wj) yk xk 2[cos(xk, wj)] = 2 (µ1 cos(µ1, wj) + µ2 cos(µ2, wj)) = 2 wj/||wj|| , (53)
exactly aligned with wj itself;
when p = 3,
x^(3)(wj) = Σ_{k=1}^2 γk(wj) yk xk 3[cos(xk, wj)]² = 3 (µ1 cos²(µ1, wj) + µ2 cos²(µ2, wj)) , (54)
getting closer to either µ1 or µ2, depending on which one is closer to wj in cosine distance.
Although this example is even more simplified than the one in Section 4, it is easy to visualize and keeps the core relationship between the alignment bias of the neurons and the pReLU activation. From it, we see how the alignment bias is altered under different choices of p.

D.2. Auxiliary Lemmas
We first prove the following lemma; most analyses of gradient flow under small initialization (Boursier et al., 2022; Boursier & Flammarion, 2024; Min et al., 2024) contain similar results, stating that the norms of the neurons stay close to zero during the alignment phase.
Lemma 4. Given some initialization in (4), for any ϵ ≤ 1/(4√h M²), any solution to the gradient flow dynamics under the simplified training dataset satisfies
max_j ||wj(t)||² ≤ 2ϵM²/√h ,  max_k |fp(xk; θ(t))| ≤ 2ϵ√h M² , (55)
for all t ≤ (1/(4K)) log(1/(√h ϵ)).
We also need the following lemma.
Lemma 5. Given nonnegative z1, ..., zn, consider the function gp(q; {zi}_{i=1}^n) = (Σ_{i=1}^n zi^q)(Σ_{i=1}^n zi^{p+1-q}). Then gp(q; {zi}_{i=1}^n) is convex on R. Moreover, as long as zi ≠ zj for some i, j, gp(q; {zi}_{i=1}^n) is strictly convex with its minimum at q = (p+1)/2.
We leave their proofs to the end of this section. Lastly, we use the following lemma from Min et al. (2024).
Lemma 6. For ℓ being either the exponential or the logistic loss, we have
|∂_ŷ ℓ(y, ŷ) + y| ≤ 2|ŷ| ,  ∀ y ∈ {+1, -1}, |ŷ| ≤ 1 . (57)

D.3. Proof for Lemma 1
Lemma 1 (restated). Given some initialization from (4), if ϵ = O(1/√h), then there exists T = Θ((1/K) log(1/(√h ϵ))) such that the trajectory under gradient flow training with the simplified training dataset almost surely satisfies, for all t ≤ T,
|| d/dt (wj(t)/||wj(t)||) - sign(vj(0)) P_{wj(t)} x^(p)(wj(t)) || = O(ϵ√h M² Kp) ,
where P_w := I - ww^T/||w||² and
x^(p)(w) := Σ_{k=1}^K γk(w) yk xk p [cos(xk, w)]^{p-1} , (58)
with γk(w) being a subgradient of σ(z) at z = ⟨xk, w⟩.
Proof. For simplicity, we write wj(t) as wj. As we will show in the proof for Lemma 4, under balanced initialization one has vj = sign(vj(0)) ||wj||, and
d/dt wj = - Σ_{k=1}^K γk(wj) ∂_ŷℓ(yk, fp(xk; θ)) vj ( p [σ(⟨xk, wj⟩)]^{p-1}/||wj||^{p-1} xk - (p-1) [σ(⟨xk, wj⟩)]^p/||wj||^{p+1} wj ) .
Then for any j ∈ [h],
d/dt (wj/||wj||) = P_{wj} (1/||wj||) d/dt wj
= - sign(vj(0)) Σ_{k=1}^K γk(wj) ∂_ŷℓ(yk, fp(xk; θ)) P_{wj} ( p [cos(xk, wj)]^{p-1} xk - (p-1) [σ(⟨xk, wj⟩)]^p/||wj||^p · wj/||wj|| )
= - sign(vj(0)) Σ_{k=1}^K γk(wj) ∂_ŷℓ(yk, fp(xk; θ)) P_{wj} p [cos(xk, wj)]^{p-1} xk ,
since P_{wj} wj = 0. Therefore,
|| d/dt (wj/||wj||) - sign(vj(0)) P_{wj} x^(p)(wj) ||
= || Σ_{k=1}^K γk(wj) ( ∂_ŷℓ(yk, fp(xk; θ)) + yk ) P_{wj} p [cos(xk, wj)]^{p-1} xk ||
≤ Σ_{k=1}^K γk(wj) | ∂_ŷℓ(yk, fp(xk; θ)) + yk | ||P_{wj} p [cos(xk, wj)]^{p-1} xk||
≤ 2Kp max_k |fp(xk; θ(t))| ,
by Lemma 6. Finally, by Lemma 4, for any ϵ ≤ 1/(4√h M²) and any t ≤ T = (1/(4K)) log(1/(√h ϵ)), we have
|| d/dt (wj(t)/||wj(t)||) - sign(vj(0)) P_{wj(t)} x^(p)(wj(t)) || ≤ 2Kp max_k |fp(xk; θ(t))| ≤ 4ϵ√h M² Kp , (60)
which finishes the proof.

D.4. Proof for Theorem 2
Theorem 2 (Alignment bias of neurons, complete statement).
Given some 0 < δ < δ < 1 and a fixed choice of p 1, then ϵ0 := ϵ0(δ, p) > 0 such that for any solution of the gradient flow on fp(x; θ) with the simplified training dataset, starting from some initialization from (4) with initialization scale ϵ < ϵ0, almost surely we have that at any time t T = θ( 1 j with sign(vj(0)) > 0, d dtcos wj(t), µ+ cos(wj(t), µ+)=1 δ ( > 0, when p = 1 < 0, when p 3 , (61) and d dtcos(wj(t), µk) cos(wj(t),µk)=1 δ > 0, k K1 . (62) j with sign(vj(0)) < 0, d dtcos wj(t), µ cos(wj(t), µ )=1 δ ( > 0, when p = 1 < 0, when p 3 , (63) and d dtcos(wj(t), µk) cos(wj(t),µk)=1 δ > 0, k > K1 . (64) Can Implicit Bias Imply Adversarial Robustness? Proof. The proofs for positive neurons and for negative neurons are almost identical, we will prove it for positive neurons sign(vj(0)) > 0, i.e. sign(vj(0)) = 1. The first part concerns about d dt cos wj(t), µ+ cos(wj(t), µ+)=1 δ. d dt cos wj(t), µ+ (65) dt wj(t) wj(t) , µ+ D P wj(t)x(p)(wj(t)), µ+ E + 1 µ+ dt wj(t) wj(t) sign(vj(0))P wj(t)x(p)(wj(t)), µ+ When p = 1, if we can show that for some choice of δ > 0, inf wj SD 1,cos(wj, µ+)=1 δ D P wjx(1)(wj), µ+ E := 1(δ) > 0 , (68) then we pick ϵ ϵ0 = 1(δ) 8 h M 2Kp, and by Lemma 1, we have that for t T = 1 4K log 1 d dt cos wj(t), µ+ cos(wj(t), µ+)=1 δ 1(δ) max j d dt wj(t) wj(t) sign(vj(0))P wj(t)x(p)(wj(t)) (69) h M 2Kp 1(δ) 2 > 0 , (70) which is what we stated in the theorem. Similarly, When p > 3, if we can show that for some choice of δ > 0, inf wj SD 1,cos(wj, µ+)=1 δ D P wjx(p)(wj), µ+ E := p(δ) < 0 , (71) then we pick ϵ ϵ0 = p(δ) 8 h M 2Kp, and by Lemma 1, we have that for t T = 1 4K log 1 d dt cos wj(t), µ+ cos(wj(t), µ+)=1 δ p(δ) + max j d dt wj(t) wj(t) sign(vj(0))P wj(t)x(p)(wj(t)) (72) h M 2Kp p(δ) 2 < 0 . (73) Therefore, for the first part, it suffices to show inf wj SD 1,cos(wj, µ+)=1 δ D P wjx(p)(wj), µ+ E := p(δ) ( > 0, for p = 1 > 0, for p 3 (74) Now there exists 1 > δ1 > 0 such that when δ > δ1, and cos wj, µ+ = 1 δ, we have γk(wj) = 1, k K1 and γk(wj) = 0, k > K1, i.e., wj is activated by all xk, k K1 with positive label and is not activated by any of the xk, k > K1 with negative label. Moreover, there exists zk, k K1, such that wj = wj P k K1 zkxk and zk = D xk, wj wj E , i.e., wj lies completely within the span of xk, k K1. Can Implicit Bias Imply Adversarial Robustness? With this, we have D P wjx(p)(wj), µ+ E I wjw T j wj 2 k γk(wj)ykxk p[cos(xk, wj)]p 1, µ+ I wjw T j wj 2 k K1 xk p[cos(xk, wj)]p 1, X k K1 xk p[cos(xk, wj)]p 1 + k K1 xk, wj p[cos(xk, wj)]p 1 k K1 xk p xk, wj k K1 xk, wj k K1 p xk, wj k K1 zp 1 k Since wj lies completely within the span of xk, k K1, we have P k K1 z2 k = 1, then k K1 zp 1 k k K1 zp 1 k = p [gp(2; {zk}k K1) gp(1; {zk}k K1)] , where gp( ; {zk}k K1) is defined in Lemma 5. By Lemma 5, when p = 1, g1( ; {zk}k K1) is strictly convex and takes minimum at q = 1+p 2 = 1, thus g1(2; {zk}k K1) g1(1; {zk}k K1) > 0 , (75) then we know that inf wj SD 1,cos(wj, µ+)=1 δ D P wjx(1)(wj), µ+ E := 1(δ) 0 . (76) However, 1(δ) can not be zero: If this is the case, since the set {wj : wj SD 1, cos wj, µ+ = 1 δ} is compact and 1 µ+ D P wjx(1)(wj), µ+ E is continuous on this set. It attains minimum 0 at some wj, which implies the non-strong convexity of g1( ; {zk}k K1) that, by Lemma 5, requires all zk, k K1 to be equal to each other (This can only happen if cos wj, µ+ = 1 ). Contradiction. Then one must have inf wj SD 1,cos(wj, µ+)=1 δ D P wjx(1)(wj), µ+ E := 1(δ) > 0 . 
(77) Similarly, by Lemma 5, when p 3, g1( ; {zk}k K1) is strictly convex and takes minimum at q = 1+p g1(2; {zk}k K1) g1(1; {zk}k K1) < 0 , (78) Can Implicit Bias Imply Adversarial Robustness? then we know that inf wj SD 1,cos(wj, µ+)=1 δ D P wjx(1)(wj), µ+ E := p(δ) 0 . (79) Using the same argument, we eliminate the case of this infimum being zero. Then one must have inf wj SD 1,cos(wj, µ+)=1 δ D P wjx(1)(wj), µ+ E := p(δ) < 0 . (80) The second part concerns about d dt cos (wj(t), µk) cos(wj(t),µk)=1 δ, for some k K1. Without loss of generality, we let k = 1. Thus we intend to show d dt cos (wj(t), µ1) cos(wj(t),µ1)=1 δ is positive. We also let ζ := 1 (1 δ)2, so that the condition cos(wj(t), µ1) = 1 δ becomes cos(wj(t), µ1) = 1 ζ. Since PK k=1 D wj wj , µk E 2 1, cos(wj(t), µ1) = 1 ζ implies PK l=2 D wj wj , µl E 2 ζ. Similar to the first part of the proof, we have d dt cos (wj(t), µ1) (81) dt wj(t) wj(t) , µ1 = D P wj(t)x(p)(wj(t)), µ1 E + d dt wj(t) wj(t) sign(vj(0))P wj(t)x(p)(wj(t)), µ1 if we can show that for some choice of δ > 0 (or equivalently some ζ > 0), inf wj SD 1,cos(wj,µ1)=1 δ D P wjx(1)(wj), µ1 E := Λp(δ) > 0 , (84) then we pick ϵ ϵ0 = Λp(δ) 8 h M 2Kp, and by Lemma 1, we have that for t T = 1 4K log 1 d dt cos (wj(t), µ1) cos(wj(t),µ1)=1 δ Λp(δ) max j d dt wj(t) wj(t) sign(vj(0))P wj(t)x(p)(wj(t)) (85) h M 2Kp Λp(δ) 2 > 0 , (86) Can Implicit Bias Imply Adversarial Robustness? which is what we stated in the theorem. The remaining proof is to find a lower bound on Λp(δ), which is given by D P wjx(p)(wj), µ1 E I wjw T j wj 2 k γk(wj)ykxk p[cos(xk, wj)]p 1, µ1 I wjw T j wj 2 k K1 γk(wj)ykxk p[cos(xk, wj)]p 1, µ1 = p cosp 1(µ1, wj) 1 cos2(µ1, wj) + l=2 γl(wj)ylp cosp(µl, wj) cos(wj, µ1) l=2 γl(wj)ylp cosp(µl, wj) 1 ζp (1 ζ) p 2 2 ζ ζ p 2 p 2 ζ ζ ζ p 2 = pζ + o(ζ) . Therefore, as long as ζ > 0 is small enough, which can be achieved by picking some δ < δ2 < 1, then infwj SD 1,cos(wj,µ1)=1 δ D P wjx(1)(wj), µ1 E = Λp(δ) is positive. D.5. Proof for Auxiliary Lemmas Balancedness: Under GF, balancedness (Du et al., 2018) is preserved: v2 j (t) wj(t) 2 = 0, t 0, j [h], from the fact that: d dt wj 2 = wj, wj k=1 γk(wj) ˆyℓ(yk, fp(xk; W, v))vj p[σ( xk, wj )]p 1 wj p 1 wj, xk (p 1)[σ( xk, wj )]p wj p+1 wj 2 k=1 γk(wj) ˆyℓ(yk, fp(xk; W, v))vj p[σ( xk, wj )]p 1 wj p (p 1)[σ( xk, wj )]p k=1 γk(wj) ˆyℓ(yk, fp(xk; W, v))vj [σ( xk, wj )]p In addition, sign(vj(t)) = sign(vj(0)), t 0, j [h], and the dynamical behaviors of neurons will be divided into two types, depending on sign(vj(0)). Therefore, throughout the gradient flow trajectory, we have vj = sign(vj(0)) wj . This fact will be used in the subsequent proof. Proof for Lemma 4. Under gradient flow, we have k=1 γk(wj) ˆyℓ(yk, fp(xk; W, v))vj p[σ( xk, wj )]p 1 wj p 1 xk (p 1)[σ( xk, wj )]p Can Implicit Bias Imply Adversarial Robustness? and for wj , d dt wj 2 = wj, wj k=1 γk(wj) ˆyℓ(yk, fp(xk; W, v))vj p[σ( xk, wj )]p 1 wj p 1 wj, xk (p 1)[σ( xk, wj )]p wj p+1 wj 2 k=1 γk(wj) ˆyℓ(yk, fp(xk; W, v))vj p[σ( xk, wj )]p 1 wj p (p 1)[σ( xk, wj )]p k=1 γk(wj) ˆyℓ(yk, fp(xk; W, v))vj [σ( xk, wj )]p Balanced initialization enforces vj = sign(vj(0)) wj , hence d dt wj 2 = 2 k=1 γk(wj) ˆyℓ(yk, fp(xk; W, v))sign(vj(0)) wj 2 [σ( xk, wj )]p wj p . (88) Let T := inf{t : maxi |f(xk; W(t), v(t))| > 2ϵ h M 2}, then t T, j [h], we have d dt wj 2 = 2 k=1 γk(wj) ˆyℓ(yk, fp(xk; W, v))sign(vj(0)) wj 2 [σ( xk, wj )]p k=1 γk(wj) ˆyℓ(yk, f(xk; W, v))sign(vj(0)) wj 2 ( xk, wj )p k=1 | ˆyℓ(yk, f(xk; W, v))| wj 2 k=1 (|yk| + 2|f(xk; W, v)|) wj 2 k=1 (1 + 4ϵ h M 2) wj 2 . 
(89) Let τj := inf{t : wj(t) 2 > 2ϵM 2 h }, and let j := arg minj τj, then τj = minj τj T due to the fact that |f(xi; W, v)| = j [h] 1 wj,xk >0vj ( wj, xk )p j [h] wj 2 h max j [h] wj 2 , which implies "|f(xk; W(t), v(t))| > 2ϵ h M 2 j, s.t. wj(t) 2 > 2ϵM 2 Then for t τj , we have d dt wj 2 2n(+4ϵ h M 2) wj 2 . (90) By Grönwall s inequality, we have t τj wj (t) 2 exp 2n(+4ϵ h M 2)t wj (0) 2 , = exp 2n(+4ϵ h M 2)t ϵ2 [W0]:,j 2 h M 2)t ϵ2M 2 . Can Implicit Bias Imply Adversarial Robustness? Suppose τj < 1 4n log 1 , then by the continuity of wj (t) 2, we have h wj (τj ) 2 exp 2n(+4ϵ h M 2)τj ϵ2M 2 ϵ2M 2 = ϵM 2 which leads to a contradiction 2ϵ ϵ. Therefore, one must have T τj 1 4n log 1 . This finishes the proof. Proof of Lemma 5. Since g p(q; {zi}n i=1) = k=1 zq i log zi k=1 zp+1 q i k=1 zp+1 q i log zi we immediately find g p(q ; {zi}n i=1) = 0. Now we compute the second-order derivative g p(q; {zi}n i=1) = 2 k=1 zq i log zi k=1 zp+1 q i log zi k=1 zp+1 q i log2 zi k=1 zq i log2 zi k=1 zp+1 q i 1 i,j n zq i zp+1 q j ( 2 log zi log zj) 1 i,j n zq i zp+1 q j log2 zj + X 1 i,j n zq i zp+1 q j log2 zi 1 i,j n zq i zp+1 q j (log zi log zj)2 0 , and the equality holds only when z1 = = zn. The desired results follow.
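As a quick sanity check of Lemma 5, and under the reconstructed form gp(q; {zi}) = (Σ_i zi^q)(Σ_i zi^{p+1-q}) used above, the short script below evaluates gp on a grid of q and verifies numerically that it is convex with its minimum near q = (p+1)/2. It is only an illustration of the lemma, not part of the proof; the variable names and the chosen values of z and p are ours.

```python
import numpy as np

def g_p(q, z, p):
    """g_p(q; {z_i}) = (sum_i z_i^q) * (sum_i z_i^(p+1-q)); assumed reconstructed form."""
    z = np.asarray(z, dtype=float)
    return (z ** q).sum() * (z ** (p + 1 - q)).sum()

p = 3
z = np.array([0.3, 0.7, 1.5])            # nonnegative, not all equal
qs = np.linspace(-2.0, p + 3.0, 2001)
vals = np.array([g_p(q, z, p) for q in qs])

q_min = qs[vals.argmin()]
second_diff = np.diff(vals, 2)           # discrete second derivative along the grid
print("argmin q ~", q_min, "(expected (p+1)/2 =", (p + 1) / 2, ")")
print("convex on the grid:", bool((second_diff >= -1e-9).all()))
```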