# Adversarially Robust Hypothesis Transfer Learning

Yunjuan Wang¹, Raman Arora¹

**Abstract.** In this work, we explore Hypothesis Transfer Learning (HTL) under adversarial attacks. In this setting, a learner has access to a training dataset of size $n$ from an underlying distribution $\mathcal{D}$ and a set of auxiliary hypotheses. These auxiliary hypotheses, which can be viewed as prior information originating either from expert knowledge or as pre-trained foundation models, are employed as an initialization for the learning process. Our goal is to develop an adversarially robust model for $\mathcal{D}$. We begin by examining an adversarial variant of the regularized empirical risk minimization learning rule that we term A-RERM. Assuming a non-negative smooth loss function with a strongly convex regularizer, we establish a bound on the robust generalization error of the hypothesis returned by A-RERM in terms of the robust empirical loss and the quality of the initialization. If the initialization is good, i.e., there exists a weighted combination of auxiliary hypotheses with a small robust population loss, the bound exhibits a fast rate of $O(1/n)$. Otherwise, we get the standard rate of $O(1/\sqrt{n})$. Additionally, we provide a bound on the robust excess risk which is similar in nature, albeit with a slightly worse rate. We also consider solving the problem using a practical variant, namely proximal stochastic adversarial training, and present a bound that depends on the initialization. This bound has the same dependence on the sample size as the A-RERM bound, except for an additional term that depends on the size of the adversarial perturbation.

¹Department of Computer Science, Johns Hopkins University, Baltimore, USA. Correspondence to: Yunjuan Wang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## 1. Introduction

Despite the incredible success of machine learning on real-world problems and its widespread adoption, several studies over the years have shown that models trained using machine learning can be highly susceptible to adversarial attacks (Goodfellow et al., 2014; Kurakin et al., 2018). These attacks involve intentionally designing imperceptible perturbations of the input data that cause the deployed (trained) model to predict unreliably. A popular defense against such inference-time attacks is adversarial training, wherein the learner is presented with simulated adversarial corruptions of clean training data. Empirical studies have consistently demonstrated that the use of adversarial training (Madry et al., 2018) and its variants (Cai et al., 2018; Zhang et al., 2019; Wang et al., 2020) results in models that exhibit greater resilience to perturbations in the input space.

However, we rarely train models from scratch in real-world scenarios, irrespective of whether we use adversarial training or standard training. One of the primary reasons is the substantial increase in the size of available training data: training from scratch demands not only the storage of copious amounts of data but also significant computational expense (e.g., parameter tuning). It may also be the case that the underlying data distribution is not aligned with the distribution of training data. Finally, in many scenarios, training data might not be accessible due to privacy concerns.
A compelling solution in such large-scale, real-world settings is to consider transferring knowledge from a source domain to a target domain using an auxiliary set of hypotheses; in prior work, this is referred to as hypothesis transfer learning (HTL) (Kuzborskij & Orabona, 2013; 2017; Du et al., 2017; Aghbalou & Staerman, 2023). These auxiliary hypotheses can be viewed as prior information, originating either from expert knowledge or as pre-trained foundation models (trained on various related source tasks), and are employed as an initialization for the learning process. Given a hypothesis class $\mathcal{H}$, we linearly combine any candidate predictor $h_w \in \mathcal{H}$ with a weighted combination of the auxiliary hypotheses, $f^{\mathrm{aux}}_1, \ldots, f^{\mathrm{aux}}_k$, to construct the following model for the target task:

$$h_{w,\beta}(\cdot) := h_w(\cdot) + \sum_{j=1}^{k} \beta_j f^{\mathrm{aux}}_j(\cdot),$$

where $\beta = [\beta_1, \ldots, \beta_k]^\top \in \mathbb{R}^k$ and $\beta_j$ can be interpreted as the effectiveness of the $j$-th auxiliary hypothesis towards solving the target task. Typically, we consider $\mathcal{H}$ to be a simple hypothesis class (e.g., linear predictors) or a reproducing kernel Hilbert space (RKHS), whereas the auxiliary hypotheses $f^{\mathrm{aux}}_j$ are fairly complex (e.g., deep neural networks). We train the weight parameters while keeping the auxiliary hypotheses fixed during the training process.
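To make this construction concrete, here is a minimal sketch of the combined predictor; the linear form of $h_w$, the toy auxiliary models, and all names are illustrative assumptions rather than the paper's prescribed implementation.

```python
import numpy as np

# Sketch of h_{w,beta}(x) = h_w(x) + sum_j beta_j * f_j^aux(x): a simple trainable
# hypothesis (linear here, as an assumption) plus a frozen weighted combination of
# auxiliary hypotheses; in the paper the f_j^aux may be complex pre-trained models.
def combined_predictor(x, w, beta, aux_models):
    h_w = x @ w                                              # trainable part, h_w in H
    f_aux = sum(b * f(x) for b, f in zip(beta, aux_models))  # fixed auxiliary part
    return h_w + f_aux

# Usage with two hypothetical auxiliary hypotheses on R^3.
aux_models = [lambda x: np.tanh(x @ np.ones(3)),
              lambda x: np.sign(x @ np.array([1.0, -1.0, 2.0]))]
x = np.ones((4, 3))
print(combined_predictor(x, w=np.zeros(3), beta=[0.3, 0.7], aux_models=aux_models))
```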
The HTL setup is related to, yet distinct from, the popular frameworks of transfer learning and domain adaptation, where the learner has access to data from one or more source domains. Instead, in HTL, we assume that all the knowledge from the source domains has been distilled into a set of auxiliary hypotheses presented to the learner as side information. The HTL framework is therefore also an excellent model for studying phenomena such as fine-tuning. Indeed, with the advent of foundation models such as the Vision Transformer (ViT) (Dosovitskiy et al., 2020) and large language models (LLMs) (Floridi & Chiriatti, 2020), we can view such pre-trained models as auxiliary hypotheses. This paradigm shift not only offers efficiency but also provides flexibility in adapting models to diverse tasks by building upon the foundational knowledge embedded in pre-trained models.

In this work, we study the theoretical aspects of hypothesis transfer learning while ensuring adversarial robustness. We make the following contributions. In Section 3, we begin by exploring the Adversarial Regularized Empirical Risk Minimization framework (A-RERM) for non-negative smooth loss functions with a strongly convex regularizer. We establish a data-dependent bound on the robust generalization error of A-RERM that depends on the utility of the auxiliary hypotheses. In particular, assuming that there exists a hypothesis in the RKHS $\mathcal{H}$ which, in conjunction with a linear combination of auxiliary hypotheses, can achieve a small robust loss on the target task, the generalization error of A-RERM converges at a fast rate of $O(1/n)$. Otherwise, the error decays at the standard rate of $O(1/\sqrt{n})$. We also give an optimistic bound on the excess robust risk with an $O(1/n)$ rate, but a worse rate of $O(1/n^{1/4})$ when near robust-realizability does not hold. In Section 4, we explore a practical algorithm based on proximal stochastic gradient descent. We show a bound of $O(1/\sqrt{n} + \alpha)$ on the generalization gap (w.r.t. the robust loss). Unlike prior work, nowhere in our analysis do we assume convexity.

### 1.1. Related Work

**Hypothesis Transfer Learning (HTL).** In an early work, Kuzborskij & Orabona (2013) analyze the generalization ability of hypothesis transfer learning by leveraging the stability of regularized least squares regression. Their framework was subsequently extended to a metric learning setting by Perrot & Habrard (2015), wherein the auxiliary hypothesis is a PSD matrix defining a (Mahalanobis) distance on input features. For smooth non-negative loss functions, Kuzborskij & Orabona (2017) give an optimistic guarantee that exhibits a fast rate if the auxiliary hypotheses prove beneficial for the (target) task. Du et al. (2017) establish a similar fast rate for kernel smoothing and kernel ridge regression for the setting where the source and target tasks are related by a transformation. More recently, Aghbalou & Staerman (2023) study HTL with surrogate losses for the binary classification problem.

**Robust Generalization Guarantees.** Several works give generalization guarantees for adversarially robust empirical risk minimization using uniform convergence, i.e., by bounding the difference between the expected and the empirical errors on an i.i.d. sample simultaneously for all hypotheses in the hypothesis class. These yield guarantees based on various complexity measures of the hypothesis class, including Rademacher complexity (Yin et al., 2019; Khim & Loh, 2018; Awasthi et al., 2020), VC dimension (Cullina et al., 2018; Montasser et al., 2020), and the covering number (Balda et al., 2019; Mustafa et al., 2022; Li & Telgarsky, 2023), or utilizing PAC-Bayesian analysis (Viallard et al., 2021; Xiao et al., 2023) or margin theory (Farnia et al., 2018). Another line of work focuses on analyzing robust generalization guarantees of adversarial training (Madry et al., 2018) for linear predictors (Zou et al., 2021) or shallow neural networks (Allen-Zhu & Li, 2022; Mianjy & Arora, 2023; Li & Telgarsky, 2023; Wang et al., 2024), albeit under somewhat restrictive distributional assumptions.

**Algorithmic Stability Analysis.** The stability-based analysis, introduced by Bousquet & Elisseeff (2002), offers an alternative approach for obtaining generalization bounds in scenarios where uniform convergence-based guarantees prove inadequate. Significant recent breakthroughs by Feldman & Vondrak (2018; 2019), Bousquet et al. (2020), and Klochkov & Zhivotovskiy (2021) strengthen the nature of these guarantees by improving high-probability bounds for uniformly stable learning algorithms. Relatedly, Hardt et al. (2016) provide a stability-based analysis of stochastic gradient descent (SGD) for stochastic convex optimization with smooth loss functions. Kuzborskij & Lampert (2018) introduce a data-dependent notion of algorithmic stability to give novel generalization bounds. More recently, Zhang et al. (2022) show that the stability analysis of Hardt et al. (2016) is tight for convex and strongly convex functions, while improving upon the results of Hardt et al. (2016) for non-convex loss functions and upon Kuzborskij & Lampert (2018) by giving a tighter bound on the data-dependent average stability of SGD for non-convex smooth loss functions. Complementing these advances, the smoothness assumption in the stability analysis of SGD is relaxed by Lei & Ying (2020) to loss functions with Hölder continuous subgradients, and Bassily et al. (2020) extend the analysis to nonsmooth loss functions.
Zhou et al. (2022) characterize the stability of SGD for non-convex smooth functions in terms of the on-average variance of the stochastic gradients, and Lei (2023) extends the analysis to weakly convex problems with non-convex and nonsmooth objective functions. Compared to the standard setting, there has been limited work applying stability-based analysis to study the generalization gap in adversarial learning. While Xing et al. (2021) examine the algorithmic stability of a generic adversarial training algorithm by leveraging the non-smooth nature of the adversarial loss, Xiao et al. (2022b) introduce a notion of approximate smoothness to characterize the adversarial loss. However, all prior works assume that the adversarial loss function is convex, which is unrealistic. In this work, we forgo such unrealistic assumptions and focus on the general setting where the adversarial loss is non-convex.

## 2. Problem Setup

**Notation.** Let $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y}$ denote the input feature space and the output label space, respectively; for regression, $\mathcal{Y} \subseteq [-1, 1]$, and for binary classification, $\mathcal{Y} = \{\pm 1\}$. Write $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Let $\mathcal{H} \subseteq \{h : \mathcal{X} \to \mathcal{Y}\}$ denote a hypothesis class parameterized by some vector space $\mathcal{W}$. The uncertain relationship between inputs and outputs is modeled using an (unknown) distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. We are given training data $S = \{z_i := (x_i, y_i)\}_{i=1}^n$, a sample of size $n$ drawn i.i.d. from $\mathcal{D}$. Let $\ell : \mathcal{H} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}_+$ denote the loss function. Given a hypothesis $h_w$ parameterized by $w \in \mathcal{W}$ and a random example $z = (x, y) \sim \mathcal{D}$, the loss function can be written as $\ell(h_w, (x, y)) = \phi(y h_w(x))$, where $\phi : \mathbb{R} \to \mathbb{R}_+$. We also write $\ell(h_w, (x, y))$ as $\ell(w, z)$. Define the population and empirical loss, respectively, as

$$L(h) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(h, (x, y))\big] \quad \text{and} \quad \hat{L}(h) := \frac{1}{n}\sum_{i=1}^{n} \ell(h, (x_i, y_i)).$$

We make the following assumptions on the loss function.

**Assumption 1.** For all $z \sim \mathcal{D}$, we assume that the loss function satisfies the following: (1) $\ell(\cdot, z)$ is continuously differentiable; (2) $\ell(\cdot, z)$ is non-negative, monotonically decreasing, and uniformly bounded by $M$; (3) $\phi(\cdot)$ is $H$-smooth.

**Adversarial Attacks.** We consider $\ell_p$-norm-bounded adversarial attacks with a perturbation budget of $\alpha > 0$. For an input example $x \in \mathcal{X}$, the set of all such perturbations is an $\ell_p$-ball of size $\alpha$ centered at $x$, i.e., $B_p(x, \alpha) \subseteq \mathcal{X}$. Given the threat model, it is natural to consider the following robust (or, adversarial) loss function:

$$\ell^\alpha(h, (x, y)) := \sup_{\tilde{x} \in B_p(x, \alpha)} \ell(h, (\tilde{x}, y)).$$

We refer to the population and empirical loss w.r.t. the robust loss as the robust population loss and the robust empirical loss, respectively:

$$L^\alpha_{\mathrm{adv}}(h) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sup_{\tilde{x} \in B_p(x,\alpha)} \ell(h, (\tilde{x}, y))\right], \qquad \hat{L}^\alpha_{\mathrm{adv}}(h) := \frac{1}{n}\sum_{i=1}^{n} \sup_{\tilde{x}_i \in B_p(x_i,\alpha)} \ell(h, (\tilde{x}_i, y_i)).$$

To simplify notation, we often suppress the superscript $\alpha$ in $L^\alpha_{\mathrm{adv}}$, $\hat{L}^\alpha_{\mathrm{adv}}$, and $\ell^\alpha$. We emphasize that, unlike some prior works, we do not assume that the robust loss function $\ell^\alpha$ is convex.
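The inner supremum in the robust loss is typically intractable; in practice it is approximated. The sketch below shows one common heuristic for the $\ell_\infty$ case, projected gradient ascent (PGD); the step count, step-size schedule, and PyTorch usage are assumptions for illustration, and PGD only lower-bounds the true supremum.

```python
import torch

# Sketch of evaluating l^alpha(h,(x,y)) = sup_{x' in B_p(x,alpha)} l(h,(x',y)) for
# p = infinity, approximating the inner sup by projected gradient ascent (PGD).
def robust_loss(model, loss_fn, x, y, alpha, steps=10):
    step = 2.5 * alpha / steps                 # heuristic step size (an assumption)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step * grad.sign()        # ascent step in l_inf geometry
            delta.clamp_(-alpha, alpha)        # project back onto B_inf(x, alpha)
    return loss_fn(model(x + delta.detach()), y)
```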
### 2.1. Robust Transfer from Auxiliary Hypotheses

In this setup, the learner has access to a set of models or hypotheses $\mathcal{F}_{\mathrm{aux}} = \{f^{\mathrm{aux}}_j : \mathcal{X} \to \mathcal{Y}\}_{j=1}^k$. These hypotheses, serving as prior information, are provided by experts with domain knowledge or as a result of pre-training on source distributions possibly distinct from the underlying (target) distribution $\mathcal{D}$. The learner's ultimate goal is to identify a classifier $h^\star$ that has a small robust population loss, i.e., to find $h^\star = \operatorname{argmin}_{h \in \bar{\mathcal{H}}} L_{\mathrm{adv}}(h)$.

The learner incorporates the auxiliary prior information $\mathcal{F}_{\mathrm{aux}}$ by augmenting its hypothesis class and considering a combination classifier, denoted $h_{w,\beta}$, of the following form:

$$h_{w,\beta}(\cdot) := h_w(\cdot) + f^{\mathrm{aux}}_\beta(\cdot), \quad \text{with} \quad f^{\mathrm{aux}}_\beta(\cdot) = \sum_{j=1}^{k} \beta_j f^{\mathrm{aux}}_j(\cdot).$$

Here, the weight $\beta_j$ is a parameter that encodes the relevance of the $j$-th auxiliary hypothesis for the target task. While in practice $\beta$ would be learned on training data for the target task, for simplicity, we assume that $\beta$ is fixed throughout the training process. Nonetheless, it offers a form of capacity control as we combine different, potentially very complex, hypotheses, e.g., deep neural networks. Naturally, we find that a bound on the size of $\beta$ offers a useful tradeoff between the ability of the auxiliary hypotheses to fit the training data versus generalizing to unseen data. We use $\Psi : \mathbb{R}^k \to \mathbb{R}_+$ to measure the size of the weights on the auxiliary hypotheses. Since the $f^{\mathrm{aux}}_j$ are kept fixed throughout the training process, we can also interpret $f^{\mathrm{aux}}_\beta$ as a (warm) initialization for the learning algorithm, which then refines the initial model akin to fine-tuning, albeit by additively incorporating a simple hypothesis $h_w$ from $\mathcal{H}$ into the model. The expanded hypothesis set, which we denote as $\bar{\mathcal{H}}$, is essentially the direct sum $\bar{\mathcal{H}} = \mathcal{H} \oplus \operatorname{span}\{f^{\mathrm{aux}}_1, \ldots, f^{\mathrm{aux}}_k\}$. Formally,

$$\bar{\mathcal{H}} = \Big\{\bar{h} = h + \sum_{j=1}^{k} \beta_j f^{\mathrm{aux}}_j \ \Big|\ h \in \mathcal{H},\ \beta_j \in \mathbb{R},\ j = 1, \ldots, k\Big\}.$$

We formulate learning as solving an Adversarial Regularized Empirical Risk Minimization (A-RERM) problem. Given a regularizer $\Omega : \mathcal{H} \to \mathbb{R}_+$ and a parameter $\lambda \in \mathbb{R}_+$, let $\mathcal{A}_{\text{A-RERM}} : \mathcal{Z}^n \times \mathcal{F}_{\mathrm{aux}} \to \bar{\mathcal{H}}$ denote the learning algorithm that, given an i.i.d. sample $S \sim \mathcal{D}^n$, returns $h_{\hat{w},\beta} = h_{\hat{w}} + f^{\mathrm{aux}}_\beta$, where

$$h_{\hat{w}} = \operatorname*{argmin}_{h \in \mathcal{H}} \Big\{ \hat{L}_{\mathrm{adv}}(h + f^{\mathrm{aux}}_\beta) + \lambda\,\Omega(h) \Big\}. \tag{1}$$

Note that the weights $\beta \in \mathbb{R}^k$ are fixed and we only optimize over $h \in \mathcal{H}$. For suitable choices of the regularizer, we can argue that the larger the regularization parameter $\lambda$, the closer the final model $h_{\hat{w},\beta}$ is to $f^{\mathrm{aux}}_\beta$. More concretely, we can make the following connection between a special case of Problem (1) and ERM with a biased regularizer. To that end, consider the setting where the hypothesis class is that of linear predictors, i.e., $h_w(x) = w^\top x$, the auxiliary model $f^{\mathrm{aux}}_\beta(x) = u^\top x$ is linear, and the regularizer is $\Omega(\cdot) = \|\cdot\|^2$. Then, for the squared loss $\phi(z) = (1-z)^2$, write the optimization Problem (1) as

$$\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n} \max_{\tilde{x}_i \in B_p(x_i,\alpha)} \big(w^\top\tilde{x}_i + u^\top\tilde{x}_i - y_i\big)^2 + \lambda\|w\|^2 = \operatorname*{argmin}_{w \in \mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n} \big((w+u)^\top(x_i+\varepsilon_i) - y_i\big)^2 + \lambda\|w\|^2, \tag{2}$$

where $\varepsilon_{ij} = \alpha y_i\,\mathrm{sign}(w_j+u_j)\,|w_j+u_j|^{q-1}/\|w+u\|_q^{q-1}$ is the $j$-th component of the optimal adversarial perturbation $\varepsilon_i$ of the training example $x_i$ given the combined model $w+u$; note that $q$ is the Hölder conjugate of $p$, i.e., $1/p + 1/q = 1$ (see the numerical sanity check below). Substituting $w^\star := \hat{w} + u$, and letting $\varepsilon^\star_i$ denote the optimal adversarial perturbation of $x_i$ given the model $w^\star$, we have $\varepsilon^\star_{ij} = \alpha y_i\,\mathrm{sign}(w^\star_j)\,|w^\star_j|^{q-1}/\|w^\star\|_q^{q-1}$, and Problem (2) is equivalent to

$$\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n} \big(w^\top(x_i+\varepsilon^\star_i) - y_i\big)^2 + \lambda\|w-u\|^2 = \operatorname*{argmin}_{w \in \mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n} \max_{\tilde{x}_i \in B_p(x_i,\alpha)} \big(w^\top\tilde{x}_i - y_i\big)^2 + \lambda\|w-u\|^2,$$

i.e., adversarial ERM with a regularizer biased toward the auxiliary model $u$. In the standard (non-robust) setting, several works study biased regularization for both transfer learning (Tommasi & Caputo, 2009; Tommasi et al., 2012; Balcan et al., 2019; Denevi et al., 2019; Takada & Fujisawa, 2020; Denevi et al., 2020) as well as hypothesis transfer learning (Kuzborskij & Orabona, 2013; 2017).
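As a quick numerical sanity check of the closed-form perturbation in Eq. (2), the sketch below (with assumed toy values) verifies the Hölder-equality structure of $\varepsilon_i$: the candidate perturbation lies exactly on the $\ell_p$ sphere of radius $\alpha$ and attains the Hölder bound $\sup_{\|e\|_p\le\alpha}\langle v, e\rangle = \alpha\|v\|_q$ for a linear model $v = w + u$.

```python
import numpy as np

# Check that eps_j ~ sign(v_j)|v_j|^{q-1} / ||v||_q^{q-1} meets the l_p budget with
# equality and attains Holder's bound; dimension, p, and random v are assumptions.
rng = np.random.default_rng(0)
v, alpha, p = rng.normal(size=5), 0.1, 3.0
q = p / (p - 1.0)                                   # Holder conjugate: 1/p + 1/q = 1

eps = alpha * np.sign(v) * np.abs(v) ** (q - 1) / np.linalg.norm(v, q) ** (q - 1)
assert np.isclose(np.linalg.norm(eps, p), alpha)            # budget met with equality
assert np.isclose(v @ eps, alpha * np.linalg.norm(v, q))    # Holder bound attained
```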
### 2.2. Algorithmic Stability

Following Kuzborskij (2018), we consider the following two notions of algorithmic stability.

**Definition (On-Average Stability).** Given training data $S = \{z_i\}_{i=1}^n \sim \mathcal{D}^n$, let $S^{(i)}$ denote a copy of $S$ with the $i$-th example replaced by $z \sim \mathcal{D}$, where $i \sim \mathrm{Uniform}[n]$ is sampled according to the uniform distribution over $\{1, \ldots, n\}$. We say that:

1. An algorithm $\mathcal{A}$ is $\mu_1$-on-average stable with respect to a loss $\ell(\cdot)$ if $\sup_{z'} \mathbb{E}_{S,z,i}\big[\ell(\mathcal{A}(S), z') - \ell(\mathcal{A}(S^{(i)}), z')\big] \le \mu_1$.
2. An algorithm $\mathcal{A}$ is $\mu_2$-second-order-on-average stable with respect to a loss $\ell(\cdot)$ if $\sup_{z'} \mathbb{E}_{S,z,i}\big[\big(\ell(\mathcal{A}(S), z') - \ell(\mathcal{A}(S^{(i)}), z')\big)^2\big] \le \mu_2$.

While the notion of on-average stability (the first notion above) is milder than uniform stability (Bousquet & Elisseeff, 2002), it is slightly stronger than on-average-replace-one stability (Shalev-Shwartz et al., 2010). The following result bounds the generalization gap of an algorithm in terms of its on-average-stability parameters.

**Theorem 2.1 (Theorem 8 of Kuzborskij (2018)).** Let algorithm $\mathcal{A}$ be $\mu_1$-on-average stable and $\mu_2$-second-order-on-average stable. Let $\delta > 0$. Then, given a training set $S \sim \mathcal{D}^n$ of size $n$, we have the following for the hypothesis $\mathcal{A}(S)$, with probability at least $1-\delta$:

$$L(\mathcal{A}(S)) - \hat{L}(\mathcal{A}(S)) \le \mu_1 + \sqrt{\frac{\mu_2}{\delta}} + \frac{3M\log(1/\delta)}{n}.$$

## 3. Robust Generalization Guarantees

In this section, we first consider $\mathcal{H}$ to be a reproducing kernel Hilbert space (RKHS) endowed with a symmetric positive semidefinite kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, an inner product $\langle\cdot,\cdot\rangle$, and a norm $\|\cdot\|_k$. For any $x \in \mathbb{R}^d$, the function $x \mapsto k(x, \cdot)$ is contained in $\mathcal{H}$. We study Problem (1) with $\Omega(h) = \|h\|_k^2$. We assume that $\Omega(\cdot)$ is strongly convex w.r.t. $\|\cdot\|_k$ and that the kernel and the auxiliary hypotheses are all bounded.

**Assumption 2.** We make the following boundedness assumptions on the hypotheses. (1) Auxiliary hypotheses in $\mathcal{F}_{\mathrm{aux}}$ are bounded pointwise by $C$, i.e., $\sup_{j\in[k],\,x\in\mathcal{X}} |f^{\mathrm{aux}}_j(x)| = C < \infty$. (2) Hypotheses in the RKHS $\mathcal{H}$ are bounded, i.e., the kernel $k$ is bounded by $\kappa \in \mathbb{R}$: $\sup_{x_1,x_2\in\mathcal{X}} k(x_1, x_2) = \kappa < \infty$. Further, we assume that the regularizers $\Omega, \Psi$ are strongly convex w.r.t. the corresponding norms: (3) $\Omega(\cdot) = \|\cdot\|_k^2$ is $\sigma$-strongly convex w.r.t. the RKHS norm; (4) $\Psi(\cdot)$ is 1-strongly convex w.r.t. the Euclidean norm.

### 3.1. Bounding the Robust Generalization Gap

First, we provide an upper bound on the robust generalization gap for A-RERM.

**Theorem 3.1.** Assume that the learner is given a weighted linear combination $f^{\mathrm{aux}}_\beta(\cdot) = \sum_{j=1}^k \beta_j f^{\mathrm{aux}}_j(\cdot)$ of auxiliary hypotheses with weights $\beta \in \mathbb{R}^k$ such that $\Psi(\beta) \le \rho$. Fix any $\delta > 0$, and say Assumptions 1 and 2 hold. Then, given an i.i.d. sample $S \sim \mathcal{D}^n$ of size $n$, for any $\lambda > 0$, the A-RERM rule returns $h_{\hat{w},\beta}$ such that, with probability at least $1-\delta$,

$$L_{\mathrm{adv}}(h_{\hat{w},\beta}) - \hat{L}_{\mathrm{adv}}(h_{\hat{w},\beta}) \le O\!\left(\frac{\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}\,\rho} + \sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}}}{\sqrt{n}} + \frac{L^{\mathrm{aux}}_{\mathrm{adv}}/\lambda + \rho}{n}\right),$$

where $L^{\mathrm{aux}}_{\mathrm{adv}} = L^\alpha_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$.

Some remarks are in order. In the bound above, we use the $O(\cdot)$ notation to hide $\log(1/\delta)$ terms as well as the dependence on the constants $H, \kappa, C, \sigma, M$; please see Appendix A for a detailed statement and proof. Not surprisingly, the bound in Theorem 3.1 depends on the robust error of the auxiliary model $f^{\mathrm{aux}}_\beta$. Indeed, our bound is optimistic in nature. For settings where $L^{\mathrm{aux}}_{\mathrm{adv}} \approx 0$ is small, or in small-sample regimes where $n = O(1/L^{\mathrm{aux}}_{\mathrm{adv}})$, the bound above decays at a fast rate of $O(1/n)$. If we view the auxiliary model as a warm initialization and A-RERM as performing fine-tuning, then our result is an affirmation of the empirical finding of Hua et al. (2023) that initialization is important for adversarial transfer learning.
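The two regimes of Theorem 3.1 are easy to visualize numerically. The sketch below evaluates the dominant terms of the bound with all hidden constants set to one (an assumption), contrasting a good initialization ($L^{\mathrm{aux}}_{\mathrm{adv}} = 0$) with a poor one.

```python
import numpy as np

# Dominant terms of Theorem 3.1 with hidden constants set to 1 (an assumption):
# with L_aux = 0 only the O(1/n) term survives; with L_aux = Theta(1) the
# O(1/sqrt(n)) terms dominate.
rho, lam = 1.0, 1.0

def gap_bound(n, L_aux):
    return (np.sqrt(L_aux * rho) + np.sqrt(L_aux)) / np.sqrt(n) + (L_aux / lam + rho) / n

for n in [10**2, 10**4, 10**6]:
    print(f"n={n:>8}: good init -> {gap_bound(n, 0.0):.2e}, poor init -> {gap_bound(n, 0.5):.2e}")
```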
We can also view $L^{\mathrm{aux}}_{\mathrm{adv}}$ as characterizing transferability to the new domain, playing a role similar to domain divergence in transfer learning (Ben-David et al., 2010) and robust domain adaptation (Deng et al., 2023). We remark that our proof of Theorem 3.1 allows for general adversarial attacks, wherein the set of perturbations for any input example can be arbitrary, including a discrete set of large-norm perturbations or (image) transformations.

The robust loss of the auxiliary model $f^{\mathrm{aux}}_\beta$ depends on the weights $\beta$. While Theorem 3.1 holds for any $\beta \in \mathbb{R}^k$, it is fixed prior to learning. In practice, we can treat $\beta$ as a hyperparameter and use cross-validation to pick a good model. An alternative approach would be to optimize over $h$ and $\beta$ simultaneously and consider the following learning rule instead (a minimal sketch of this rule, in a linear special case, follows below):

$$(h_{\hat{w}}, \hat{\beta}) = \operatorname*{argmin}_{h\in\mathcal{H},\,\beta\in\mathbb{R}^k} \Big\{\hat{L}_{\mathrm{adv}}(h + f^{\mathrm{aux}}_\beta) + \lambda\,\Omega(h) + \nu\,\Psi(\beta)\Big\}.$$
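Here is a sketch of this joint rule in the all-linear special case of Section 2.1: $h_w(x) = w^\top x$, each $f^{\mathrm{aux}}_j(x) = u_j^\top x$, and for $p = \infty$ the robust margin has the closed form $y\,v^\top x - \alpha\|v\|_1$ with $v = w + U\beta$ (by Hölder). The softplus (logistic) surrogate, optimizer, and all hyperparameters are assumptions for illustration, not the paper's prescribed algorithm.

```python
import torch

# Sketch of argmin_{w,beta} { L_adv_hat(h_w + f_beta^aux) + lam*Omega(h) + nu*Psi(beta) }
# in the all-linear case: combined predictor v = w + U @ beta, and for p = inf the
# worst-case margin is y * <v, x> - alpha * ||v||_1 (Holder's inequality).
def joint_adversarial_fit(X, y, U, alpha, lam=0.1, nu=0.1, steps=200, lr=0.1):
    w = torch.zeros(X.shape[1], requires_grad=True)
    beta = torch.zeros(U.shape[1], requires_grad=True)
    opt = torch.optim.SGD([w, beta], lr=lr)
    for _ in range(steps):
        v = w + U @ beta                                  # combined linear predictor
        robust_margin = y * (X @ v) - alpha * v.abs().sum()
        obj = (torch.nn.functional.softplus(-robust_margin).mean()  # smooth surrogate
               + lam * w.dot(w) + nu * beta.dot(beta))    # lam*Omega(h) + nu*Psi(beta)
        opt.zero_grad(); obj.backward(); opt.step()
    return w.detach(), beta.detach()
```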
**Two-layer ReLU Networks.** Next, we extend our result to two-layer ReLU networks of width $m$. A hypothesis $h_W \in \mathcal{H}$ is parametrized by top-layer weights $a = (a_1, \ldots, a_m) \in \mathbb{R}^m$ and bottom-layer weights $w_s \in \mathcal{W}$, one into each of the hidden neurons $s \in \{1, \ldots, m\}$, with $\|w_s\| \le B$. For any input $x \in \mathcal{X} := \{x \in \mathbb{R}^d \mid \|x\| \le \kappa\}$, the output of the model is given as $h_W(x) = \sum_{s=1}^m a_s \psi(\langle w_s, x\rangle)$, where $\psi(x) = \max(0, x)$ is the ReLU activation function. The top-layer weights are set by sampling them uniformly from $\mathrm{Unif}\{\pm\frac{1}{\sqrt{m}}\}$ and are kept fixed while training $\{w_s\}_{s=1}^m$. Slightly abusing the notation, we write $\Omega(h_W) = \sum_{s=1}^m \Omega(w_s)$. We assume that $\Omega$ is $\sigma$-strongly convex w.r.t. $\|\cdot\|$ and that $\Omega(0) = 0$. As before, we assume $\Psi(\cdot)$ is 1-strongly convex w.r.t. the Euclidean norm. With the setup above, we obtain that Theorem 3.1 holds for two-layer ReLU networks as well.

### 3.2. Excess Robust Risk

The bound in the previous section is post hoc in nature, stating that if we find a model with a small training loss, then it will generalize well. Here, we give a bound on the excess robust risk that holds a priori, i.e., before the learner even sees any training data.

**Theorem 3.2.** Assume that the learner is given a weighted linear combination $f^{\mathrm{aux}}_\beta(\cdot) = \sum_{j=1}^k \beta_j f^{\mathrm{aux}}_j(\cdot)$ of auxiliary hypotheses with weights $\beta \in \mathbb{R}^k$ such that $\Psi(\beta) \le \rho$. Fix any $\delta > 0$, and say Assumptions 1 and 2 hold. Let $\tau > 0$ be such that $\sup_{h\in\mathcal{H}} \Omega(h) \le \tau$. Then, given an i.i.d. sample $S \sim \mathcal{D}^n$ of size $n$, and setting $\lambda$ as an explicit function of $L^{\mathrm{aux}}_{\mathrm{adv}}$, $\rho$, $\tau$, and $n$ (given in the proof in Appendix A), the A-RERM rule returns $h_{\hat{w},\beta}$ such that, with probability at least $1-\delta$,

$$L_{\mathrm{adv}}(h_{\hat{w},\beta}) - \min_{h_w : \Omega(h_w)\le\tau} L_{\mathrm{adv}}(h_{w,\beta}) \le O\!\left(\frac{\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}}}{n^{1/2}} + \frac{\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}} + \sqrt[4]{L^{\mathrm{aux}}_{\mathrm{adv}}\rho}}{n^{1/4}}\sqrt{\tau} + \frac{\sqrt[4]{L^{\mathrm{aux}}_{\mathrm{adv}}} + \sqrt[8]{L^{\mathrm{aux}}_{\mathrm{adv}}\rho}}{n^{3/8}}\sqrt{\tau}\right).$$

Akin to the bound in the previous section, the bound above is optimistic in nature. If $L^{\mathrm{aux}}_{\mathrm{adv}} \approx 0$, the excess robust risk decays as $O(1/n)$. However, owing to the non-convexity of the adversarial loss function, we obtain a worse rate of $O(1/n^{1/4})$ in general. We note that both of our results (Theorems 3.1 and 3.2) recover the results in the standard (non-robust) setting (Kuzborskij & Orabona, 2017) for $\alpha = 0$.

## 4. Robust Generalization Bounds via Proximal Stochastic Adversarial Training

Thus far, we focused on the A-RERM learning rule for adversarial hypothesis transfer. While the A-RERM rule exhibits an optimistic statistical learning rate, it is often computationally hard to implement, even in a standard setting without robustness constraints. This motivates a more practical approach to learning based on proximal SGD, which we refer to as Proximal Stochastic Adversarial Training (PSAT).

Proximal algorithms are a standard tool for solving nonsmooth convex optimization problems; see Schmidt et al. (2011), Xiao & Zhang (2014), Ghadimi et al. (2016), and Asi & Duchi (2019) for a good introduction. The setup is the same as in the previous section. We are given training data $S = \{z_i = (x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$, an adversarial loss function $\tilde{\ell}(\cdot) : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$, and a hypothesis class parameterized by $w \in \mathcal{W}$. We consider a possibly nonsmooth regularizer $\Omega(\cdot)$. Then, at each round $t$ of the PSAT algorithm, we sample an example uniformly at random from the given dataset $S$, i.e., sample $\xi_t$ uniformly over $[n]$, and perform the following update (a minimal sketch of this update appears below):

$$w_{t+1} = \mathrm{prox}_{\gamma_t,\lambda\Omega}\big(w_t - \gamma_t \nabla\tilde{\ell}(w_t, z_{\xi_t})\big),$$

where the proximal map with parameter $\gamma > 0$ is defined as

$$\mathrm{prox}_{\gamma,\Omega}(w) := \operatorname*{argmin}_{u} \Big\{\Omega(u) + \frac{1}{2\gamma}\|u - w\|^2\Big\}.$$

We initialize PSAT with the auxiliary model, $w_0 = f^{\mathrm{aux}}_\beta$. For any $\lambda > 0$, we define the regularized adversarial population and empirical loss, respectively, as

$$\Phi_{\mathrm{adv}}(w) := L_{\mathrm{adv}}(w) + \lambda\Omega(w) \quad \text{and} \quad \hat{\Phi}_{\mathrm{adv}}(w) := \hat{L}_{\mathrm{adv}}(w) + \lambda\Omega(w).$$

For simplicity, we assume $\Omega(\cdot)$ to be a 1-strongly convex function, not necessarily differentiable.
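Below is a minimal sketch of one PSAT run for the concrete choice $\Omega(w) = \frac{1}{2}\|w - w_0\|^2$, a biased, 1-strongly convex regularizer whose proximal map has the closed form $\mathrm{prox}_{\gamma,\lambda\Omega}(v) = (v + \gamma\lambda w_0)/(1 + \gamma\lambda)$; this specific $\Omega$ and the stand-in `grad_robust_loss` oracle are assumptions for illustration.

```python
import numpy as np

# One run of PSAT with Omega(w) = 0.5*||w - w0||^2 (1-strongly convex, biased toward
# the auxiliary initialization w0 = f_beta^aux). grad_robust_loss(w, z) is a stand-in
# oracle for a (sub)gradient of the adversarial loss at one example.
def psat(w0, data, grad_robust_loss, lam, c, T, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for t in range(T):
        gamma = c / (t + 1)                               # step size gamma_t = c/(t+1)
        z = data[rng.integers(len(data))]                 # xi_t ~ Uniform[n] (with replacement here)
        v = w - gamma * grad_robust_loss(w, z)            # stochastic gradient step
        w = (v + gamma * lam * w0) / (1.0 + gamma * lam)  # closed-form prox of lam*Omega
    return w
```

For this choice of $\Omega$, the prox step shrinks the iterate toward the initialization, matching the biased-regularization view of Section 2.1.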
Since we work in a stochastic setting, we also assume that the variance of the stochastic gradients is bounded, an assumption that is rather standard in the analysis of stochastic gradient-based algorithms for optimization.

**Assumption 3.** Given any sample $S = \{z_1, \ldots, z_n\} \sim \mathcal{D}^n$, there exists a constant $\nu_S > 0$ such that, for all $w \in \mathcal{W}$,

$$\mathbb{E}_{\xi\sim\mathrm{Uniform}[n]}\Big\|\nabla\tilde{\ell}(w; z_\xi) - \frac{1}{n}\sum_{i=1}^{n}\nabla\tilde{\ell}(w; z_i)\Big\|^2 \le \nu_S^2.$$

**Assumption 4.** We assume that the loss function is Lipschitz and satisfies certain smoothness conditions: (1) $|\ell(w_1, z) - \ell(w_2, z)| \le L\|w_1 - w_2\|$; (2) $\|\nabla_w\ell(w_1, z) - \nabla_w\ell(w_2, z)\| \le H\|w_1 - w_2\|$; (3) $\|\nabla_w\ell(w, (x_1, y)) - \nabla_w\ell(w, (x_2, y))\| \le H_z\|x_1 - x_2\|$.

The assumption above is mild and rather standard in several works studying adversarial training (Sinha et al., 2017; Wang et al., 2021; Farnia & Ozdaglar, 2021; Xing et al., 2021; Xiao et al., 2022a). Note that we do not assume that the loss function is convex. Next, we establish generalization guarantees for PSAT by showing that it is a stable rule. First, we show that if PSAT is fed two datasets that are similar, then it produces models that are close to each other. Formally, given training data $S \sim \mathcal{D}^n$, let $S^{(i)}$ denote the training data obtained by replacing the $i$-th example $z_i \in S$ by another example $z \sim \mathcal{D}$ drawn independently; we refer to $S, S^{(i)}$ as neighboring datasets.

**Lemma 4.1.** Say Assumptions 1, 3 and 4 hold. Let $w_T$ and $w'_T$ denote the outputs after running PSAT for $T$ iterations on two neighboring datasets $S, S^{(i)}$, respectively, using $\gamma_t = \frac{c}{t+1}$ with $0 < c < \frac{1}{H}$. Then, for $\lambda > H$, we have

$$\mathbb{E}_{\xi\sim\mathrm{Uniform}[n],\,S,\,S^{(i)}}\|w_T - w'_T\| \le \frac{4(1+\lambda)H_z\alpha}{\lambda - H} + \frac{2(1+\lambda)}{n(\lambda - H)}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log T}.$$

Further, for $\lambda > 2H + 1$, we have

$$\mathbb{E}_{\xi\sim\mathrm{Uniform}[n],\,S,\,S^{(i)}}\|w_T - w'_T\|^2 \le \frac{8(H+2)^2(1+\lambda)^2 H_z^2\alpha^2}{(2\lambda - 3H - 2)HT} + \frac{(1+\lambda)^2\big(32H\Phi_{\mathrm{adv}}(w_0) + \big(64\mathbb{E}_S[\nu_S^2] + 16H_z^2\alpha^2\big)\log T\big)}{(2\lambda - 3H - 2)HTn}.$$

Using Lemma 4.1 in Theorem 2.1 gives the following bound on the robust generalization error of PSAT.

**Theorem 4.2.** Say Assumptions 1, 3 and 4 hold. Let $w_T$ denote the output on a sample $S \sim \mathcal{D}^n$ of size $n$ after running PSAT for $T$ iterations with $\gamma_t = \frac{c}{t+1}$ for $0 < c < \frac{1}{H}$, and let $\lambda > 2H+1$. Then, for all $\delta > 0$, with probability at least $1-\delta$, we have

$$L_{\mathrm{adv}}(w_T) - \hat{L}_{\mathrm{adv}}(w_T) \le \frac{1.5M\log(1/\delta)}{n} + \sqrt{\frac{2L^2\big(H\Phi_{\mathrm{adv}}(w_0) + (4\mathbb{E}_S[\nu_S^2] + H_z^2\alpha^2)\log T\big)\log(1/\delta)}{nTH(2\lambda - 3H - 2)}} + \frac{2L}{n(\lambda-H)}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log T} + \sqrt{\frac{32(H+2)^2 L^2 H_z^2\alpha^2 \log(1/\delta)}{(2\lambda - 3H - 2)HT}}.$$

To better understand the result above, ignoring constants and higher-order terms, the bound scales as $O\big(\sqrt{(\Phi_{\mathrm{adv}}(w_0) + \mathbb{E}_S[\nu_S^2])\log T}\,\big(\frac{1}{n} + \frac{1}{\sqrt{nT}}\big) + \frac{\alpha}{\sqrt{T}}\big)$.

The results above hold for a sufficiently large regularization parameter. Next, we present a generalization bound for the setting when $\lambda$ is relatively small.

**Theorem 4.3.** Say Assumptions 1, 3 and 4 hold. Let $w_T$ denote the output on a sample $S \sim \mathcal{D}^n$ of size $n$ after running PSAT for $T = O(n)$ iterations with $\gamma_t = \frac{c}{t+1}$ for $0 < c < \frac{1}{H}$ and $0 < \lambda < H$. Then we have

$$\mathbb{E}_{\xi,S}\big[L_{\mathrm{adv}}(w_T) - \hat{L}_{\mathrm{adv}}(w_T)\big] \le O\!\left(\frac{T^{\frac{q}{q+1}}}{n}\Big(L\big(Q + \sqrt{n}\,H_z\alpha\big)\Big)^{\frac{1}{q+1}}\Big(\mathbb{E}_{\xi,S}\,\hat{L}_{\mathrm{adv}}(w_T)\Big)^{\frac{q}{q+1}}\right),$$

where $Q = \sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log T}$ and $q = \frac{1}{\lambda}$.

The bound above depends on the initialization as well as the expected adversarial empirical loss, $\mathbb{E}_{\xi,S}\,\hat{L}_{\mathrm{adv}}(w_T)$. For settings where $\mathbb{E}_{\xi,S}\,\hat{L}_{\mathrm{adv}}(w_{T,S}) = O\big(\frac{T^q}{n^{1+q}}\big)$, the bound scales as $O\big(\frac{1}{1-q}\big(\frac{\sqrt{T}}{n} + \alpha\big)\big)$. Again, setting $T = n$, it simplifies to $O\big(\frac{1}{1-q}\big(\frac{Q}{\sqrt{n}} + \alpha\big)\big)$.

## 5. Conclusion and Discussion

In this paper, we studied the problem of learning adversarially robust models using auxiliary hypotheses. Given a smooth loss and a strongly convex regularizer, we establish robust generalization guarantees for two learning algorithms: adversarial regularized empirical risk minimization (A-RERM) and proximal stochastic adversarial training (PSAT). Our results highlight the importance of a good initialization for achieving fast generalization.

There are several promising directions for future research. Our theoretical analysis highlights the importance of the regularization parameter in achieving fast generalization guarantees. It would be interesting to explore principled approaches, such as recursive regularization, for controlling the regularizer strength. Further, developing a practical algorithm that mitigates the dependence of the robust generalization gap on the perturbation size would help advance the state of the art.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## Acknowledgements

This research was supported, in part, by DARPA GARD award HR00112020004 and NSF CAREER award IIS-1943251.

## References

Aghbalou, A. and Staerman, G. Hypothesis transfer learning with surrogate classification losses: Generalization bounds through algorithmic stability. In International Conference on Machine Learning, pp. 280–303. PMLR, 2023.

Allen-Zhu, Z. and Li, Y. Feature purification: How adversarial training performs robust deep learning. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 977–988. IEEE, 2022.

Asi, H. and Duchi, J. C. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019.

Awasthi, P., Frank, N., and Mohri, M. Adversarial learning guarantees for linear hypotheses and neural networks. In International Conference on Machine Learning, pp. 431–441. PMLR, 2020.

Balcan, M.-F., Khodak, M., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pp. 424–433. PMLR, 2019.

Balda, E. R., Behboodi, A., Koep, N., and Mathar, R.
Adversarial risk bounds for neural networks through sparsity based compression. arXiv preprint arXiv:1906.00698, 2019.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Bassily, R., Feldman, V., Guzmán, C., and Talwar, K. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33:4381–4391, 2020.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79:151–175, 2010.

Bousquet, O. and Elisseeff, A. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.

Bousquet, O., Klochkov, Y., and Zhivotovskiy, N. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pp. 610–626. PMLR, 2020.

Cai, Q.-Z., Du, M., Liu, C., and Song, D. Curriculum adversarial training. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), 2018.

Cullina, D., Bhagoji, A. N., and Mittal, P. PAC-learning in the presence of adversaries. Advances in Neural Information Processing Systems, 31, 2018.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pp. 1566–1575. PMLR, 2019.

Denevi, G., Pontil, M., and Ciliberto, C. The advantage of conditional meta-learning for biased regularization and fine tuning. Advances in Neural Information Processing Systems, 33:964–974, 2020.

Deng, Y., Gazagnadou, N., Hong, J., Mahdavi, M., and Lyu, L. On the hardness of robustness transfer: A perspective from Rademacher complexity over symmetric difference hypothesis space. arXiv preprint arXiv:2302.12351, 2023.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Du, S. S., Koushik, J., Singh, A., and Póczos, B. Hypothesis transfer learning via transformation functions. Advances in Neural Information Processing Systems, 30, 2017.

Farnia, F. and Ozdaglar, A. Train simultaneously, generalize better: Stability of gradient-based minimax learners. In International Conference on Machine Learning, pp. 3174–3185. PMLR, 2021.

Farnia, F., Zhang, J. M., and Tse, D. Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457, 2018.

Feldman, V. and Vondrak, J. Generalization bounds for uniformly stable algorithms. Advances in Neural Information Processing Systems, 31, 2018.

Feldman, V. and Vondrak, J. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pp. 1270–1279. PMLR, 2019.

Floridi, L. and Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.

Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Hardt, M., Recht, B., and Singer, Y.
Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234. PMLR, 2016.

Hua, A., Gu, J., Xue, Z., Carlini, N., Wong, E., and Qin, Y. Initialization matters for adversarial transfer learning. arXiv preprint arXiv:2312.05716, 2023.

Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Regularization techniques for learning with matrices. The Journal of Machine Learning Research, 13(1):1865–1890, 2012.

Khim, J. and Loh, P.-L. Adversarial risk bounds via function transformation. arXiv preprint arXiv:1810.09519, 2018.

Klochkov, Y. and Zhivotovskiy, N. Stability and deviation optimal risk bounds with convergence rate O(1/n). Advances in Neural Information Processing Systems, 34:5065–5076, 2021.

Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security, pp. 99–112. Chapman and Hall/CRC, 2018.

Kuzborskij, I. Theory and algorithms for hypothesis transfer learning. Technical report, EPFL, 2018.

Kuzborskij, I. and Lampert, C. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 2815–2824. PMLR, 2018.

Kuzborskij, I. and Orabona, F. Stability and hypothesis transfer learning. In International Conference on Machine Learning, pp. 942–950. PMLR, 2013.

Kuzborskij, I. and Orabona, F. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106:171–195, 2017.

Lei, Y. Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems. In The Thirty Sixth Annual Conference on Learning Theory, pp. 191–227. PMLR, 2023.

Lei, Y. and Ying, Y. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pp. 5809–5819. PMLR, 2020.

Li, J. D. and Telgarsky, M. On achieving optimal adversarial test error. In International Conference on Learning Representations, 2023.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

Mianjy, P. and Arora, R. Robustness guarantees for adversarially trained neural networks. Advances in Neural Information Processing Systems, 2023.

Montasser, O., Hanneke, S., and Srebro, N. Reducing adversarially robust learning to non-robust PAC learning. Advances in Neural Information Processing Systems, 33:14626–14637, 2020.

Mustafa, W., Lei, Y., and Kloft, M. On the generalization analysis of adversarial learning. In International Conference on Machine Learning, pp. 16174–16196. PMLR, 2022.

Perrot, M. and Habrard, A. A theoretical analysis of metric hypothesis transfer learning. In International Conference on Machine Learning, pp. 1708–1717. PMLR, 2015.

Schmidt, M., Roux, N., and Bach, F. Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems, 24, 2011.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.

Sinha, A., Namkoong, H., Volpi, R., and Duchi, J. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.

Srebro, N., Sridharan, K., and Tewari, A. Optimistic rates for learning with a smooth loss.
arXiv preprint arXiv:1009.3896, 2010.

Takada, M. and Fujisawa, H. Transfer learning via ℓ1 regularization. Advances in Neural Information Processing Systems, 33:14266–14277, 2020.

Tommasi, T. and Caputo, B. The more you know, the less you learn: from knowledge transfer to one-shot learning of object categories. In Proceedings of the British Machine Vision Conference, 2009.

Tommasi, T., Orabona, F., Castellini, C., and Caputo, B. Improving control of dexterous hand prostheses using adaptive learning. IEEE Transactions on Robotics, 29(1):207–219, 2012.

Viallard, P., Vidot, E. G., Habrard, A., and Morvant, E. A PAC-Bayes analysis of adversarial robustness. Advances in Neural Information Processing Systems, 34:14421–14433, 2021.

Wang, Y., Zou, D., Yi, J., Bailey, J., Ma, X., and Gu, Q. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2020.

Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., and Gu, Q. On the convergence and robustness of adversarial training. arXiv preprint arXiv:2112.08304, 2021.

Wang, Y., Zhang, K., and Arora, R. Benign overfitting in adversarially trained neural networks. In International Conference on Machine Learning. PMLR, 2024.

Xiao, J., Fan, Y., Sun, R., and Luo, Z.-Q. Adversarial Rademacher complexity of deep neural networks. arXiv preprint arXiv:2211.14966, 2022a.

Xiao, J., Fan, Y., Sun, R., Wang, J., and Luo, Z.-Q. Stability analysis and generalization bounds of adversarial training. Advances in Neural Information Processing Systems, 35:15446–15459, 2022b.

Xiao, J., Sun, R., and Luo, Z.-Q. PAC-Bayesian adversarially robust generalization bounds for deep neural networks. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Xing, Y., Song, Q., and Cheng, G. On the algorithmic stability of adversarial training. Advances in Neural Information Processing Systems, 34:26523–26535, 2021.

Yin, D., Kannan, R., and Bartlett, P. Rademacher complexity for adversarially robust generalization. In International Conference on Machine Learning, pp. 7085–7094. PMLR, 2019.

Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482. PMLR, 2019.

Zhang, Y., Zhang, W., Bald, S., Pingali, V., Chen, C., and Goswami, M. Stability of SGD: Tightness analysis and improved bounds. In Uncertainty in Artificial Intelligence, pp. 2364–2373. PMLR, 2022.

Zhou, Y., Liang, Y., and Zhang, H. Understanding generalization error of SGD in nonconvex optimization. Machine Learning, pp. 1–31, 2022.

Zou, D., Frei, S., and Gu, Q. Provable robustness of adversarial training for learning halfspaces with noise. In International Conference on Machine Learning, pp. 13002–13011. PMLR, 2021.

## A. Missing Proofs in Section 3

Before presenting the proofs, we first give the definitions of a smooth function and of Rademacher complexity that will be used later.

**Definition (Smoothness).** A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $H$-smooth if its gradient is $H$-Lipschitz; i.e., for all $w_1, w_2$, $\|\nabla f(w_1) - \nabla f(w_2)\| \le H\|w_1 - w_2\|$.

**Definition (Rademacher complexity (Bartlett & Mendelson, 2002)).** Given a distribution $\mathcal{D}$, let $S = \{(x_i, y_i)\}_{i=1}^n$ be drawn i.i.d. from $\mathcal{D}$. Let $\mathcal{H}$ be a class of functions $h : \mathcal{Z} \to \mathbb{R}$.
We define the empirical Rademacher complexity of $\mathcal{H}$ measured on $S$ as

$$\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma_i,\,i\in[n]}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}\sigma_i h(x_i)\right],$$

where each $\sigma_i$ is a Rademacher random variable, i.e., $\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = 0.5$. The Rademacher complexity of $\mathcal{H}$ is defined as $\mathfrak{R}(\mathcal{H}) = \mathbb{E}_{\mathcal{D}}\big[\hat{\mathfrak{R}}_S(\mathcal{H})\big]$.

Theorem A.1 presents a Rademacher-complexity-based generalization bound, which is the basis of the proof of Theorem 3.1. The proof of Theorem A.1 follows from Kuzborskij & Orabona (2017, Proof of Theorem 4) by replacing the standard loss with its adversarial counterpart.

**Theorem A.1.** Under Assumption 1, let the training set $S$ of size $n$ be sampled i.i.d. from $\mathcal{D}$. For any $r \ge 0$, define the adversarial loss class w.r.t. the hypothesis class $\mathcal{H}$ as

$$\bar{\mathcal{L}} := \Big\{(x, y) \mapsto \sup_{\tilde{x}\in B_p(x,\alpha)} \ell(h; (\tilde{x}, y)) : h \in \mathcal{H},\ L_{\mathrm{adv}}(h) \le r\Big\}.$$

Fix any $\delta > 0$. For any $h \in \mathcal{H}$ and any training set $S$ of size $n$, with probability at least $1-\delta$,

$$L_{\mathrm{adv}}(h) - \hat{L}_{\mathrm{adv}}(h) \le 2\mathfrak{R}(\bar{\mathcal{L}}) + \sqrt{\frac{\big(4\mathfrak{R}(\bar{\mathcal{L}}) + r\big)M\log(1/\delta)}{2n}} + \frac{1.5M\log(1/\delta)}{n}.$$

We will use the following result from Kakade et al. (2012) on strongly convex regularizers in a general setting.

**Lemma A.2 (Corollary 4 in Kakade et al. (2012)).** If $\Omega$ is $\sigma$-strongly convex w.r.t. $\|\cdot\|$ and $\Omega^\star(0) = 0$ ($\Omega^\star$ is the Fenchel conjugate of $\Omega$), then, denoting the partial sum $\sum_{j\le i} v_j$ by $v_{1:i}$, we have for any sequence $v_1, \ldots, v_m$ and for any $u$,

$$\sum_{i=1}^{m}\langle v_i, u\rangle - \Omega(u) \le \Omega^\star(v_{1:m}) \le \sum_{i=1}^{m}\langle\nabla\Omega^\star(v_{1:i-1}), v_i\rangle + \frac{1}{2\sigma}\sum_{i=1}^{m}\|v_i\|_\star^2.$$

We also leverage the following lemma, which shows that the solution to optimization problem (1) has a bounded radius determined by the given auxiliary hypotheses.

**Lemma A.3.** Under Assumption 2, the solution of Equation (1) lies in the set $\Big\{h \in \mathcal{H} : \|h\|_\infty \le \sqrt{\frac{\kappa}{\lambda}\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}\Big\}$.

*Proof of Lemma A.3.* By the definition of $h_{\hat{w}}$ (comparing its objective value with that of $h = 0$),

$$\hat{L}_{\mathrm{adv}}(h_{\hat{w}} + f^{\mathrm{aux}}_\beta) + \lambda\|h_{\hat{w}}\|_k^2 \le \hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta),$$

which gives us that $\|h_{\hat{w}}\|_k \le \sqrt{\frac{1}{\lambda}\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}$. Therefore,

$$\|h_{\hat{w}}\|_\infty = \sup_x |\langle h_{\hat{w}}, k(x,\cdot)\rangle| \le \|h_{\hat{w}}\|_k \sup_x \sqrt{k(x,x)} \le \sqrt{\kappa}\,\|h_{\hat{w}}\|_k \le \sqrt{\frac{\kappa}{\lambda}\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}. \qquad\square$$

We now study the Rademacher complexity of the adversarial loss function class.

**Lemma A.4.** Under Assumption 1, define the adversarial loss class w.r.t. the expanded hypothesis class $\bar{\mathcal{H}}$ as $\bar{\mathcal{L}} := \big\{(x, y) \mapsto \tilde{\ell}(\bar{h}, (x, y)) : \bar{h} \in \bar{\mathcal{H}}\big\}$. Then, given an i.i.d. sample $S \sim \mathcal{D}^n$ of size $n$ and any set $\big\{\tau_i : \tau_i \ge \tilde{\ell}(\bar{h}, (x_i, y_i)),\ (x_i, y_i) \in S,\ \bar{h} \in \bar{\mathcal{H}}\big\}$, we have

$$\hat{\mathfrak{R}}_S(\bar{\mathcal{L}}) \le \mathbb{E}_{\sigma_i,\,i\in[n]}\left[\sup_{\bar{h}\in\bar{\mathcal{H}}} \frac{\sqrt{12H}}{n}\sum_{i=1}^{n}\sigma_i\sqrt{\tau_i}\min_{\tilde{x}_i\in B(x_i,\alpha)} y_i\bar{h}(\tilde{x}_i)\right].$$

*Proof of Lemma A.4.* Recall from Srebro et al. (2010) that for any $H$-smooth non-negative function $\phi : \mathbb{R} \to \mathbb{R}_+$ and any $z_1, z_2 \in \mathbb{R}$,

$$|\phi(z_1) - \phi(z_2)| \le \sqrt{6H\big(\phi(z_1) + \phi(z_2)\big)}\,|z_1 - z_2|.$$

Here we define $\tilde{x}_1 = \operatorname{argmax}_{\tilde{x}\in B(x_1,\alpha)} \ell(h_1, (\tilde{x}, y))$ and $\tilde{x}_2 = \operatorname{argmax}_{\tilde{x}\in B(x_2,\alpha)} \ell(h_2, (\tilde{x}, y))$. Then, choosing $z_1 = yh_1(\tilde{x}_1)$ and $z_2 = yh_2(\tilde{x}_2)$, we have

$$|\ell(h_1, (\tilde{x}_1, y)) - \ell(h_2, (\tilde{x}_2, y))| \le \sqrt{6H\big(\ell(h_1, (\tilde{x}_1, y)) + \ell(h_2, (\tilde{x}_2, y))\big)}\,|yh_1(\tilde{x}_1) - yh_2(\tilde{x}_2)|.$$

In terms of the adversarial loss, this gives

$$\big|\tilde{\ell}(h_1, (x, y)) - \tilde{\ell}(h_2, (x, y))\big| \le \sqrt{6H\big(\tilde{\ell}(h_1, (x, y)) + \tilde{\ell}(h_2, (x, y))\big)}\,\Big|\min_{\tilde{x}_1\in B(x,\alpha)} yh_1(\tilde{x}_1) - \min_{\tilde{x}_2\in B(x,\alpha)} yh_2(\tilde{x}_2)\Big|.$$

Fixing the training set $S$, by the definition of empirical Rademacher complexity, we have

$$\hat{\mathfrak{R}}_S(\bar{\mathcal{L}}) = \frac{1}{n}\,\mathbb{E}_{\sigma_i,\,i\in[n]}\left[\sup_{h\in\bar{\mathcal{H}}}\sum_{i=1}^{n}\sigma_i\tilde{\ell}(h, (x_i, y_i))\right] = \frac{1}{n}\,\mathbb{E}_{\sigma_1,\ldots,\sigma_n}\left[\sup_{h\in\bar{\mathcal{H}}}\Big\{u_{n-1}(h) + \sigma_n\tilde{\ell}(h, (x_n, y_n))\Big\}\right],$$

where $u_{n-1}(h) = \sum_{i=1}^{n-1}\sigma_i\tilde{\ell}(h, (x_i, y_i))$. By the definition of the supremum, for any $\gamma > 0$, there exist $h_1, h_2 \in \bar{\mathcal{H}}$ such that

$$u_{n-1}(h_1) + \tilde{\ell}(h_1, (x_n, y_n)) \ge (1-\gamma)\sup_{h\in\bar{\mathcal{H}}}\Big\{u_{n-1}(h) + \tilde{\ell}(h, (x_n, y_n))\Big\},$$

$$u_{n-1}(h_2) - \tilde{\ell}(h_2, (x_n, y_n)) \ge (1-\gamma)\sup_{h\in\bar{\mathcal{H}}}\Big\{u_{n-1}(h) - \tilde{\ell}(h, (x_n, y_n))\Big\}.$$

Thus, for any $\gamma > 0$, we have

$$2\,\mathbb{E}_{\sigma_n}\left[\sup_{h\in\bar{\mathcal{H}}}\Big\{u_{n-1}(h) + \sigma_n\tilde{\ell}(h, (x_n, y_n))\Big\}\right] = \sup_{h\in\bar{\mathcal{H}}}\Big\{u_{n-1}(h) + \tilde{\ell}(h, (x_n, y_n))\Big\} + \sup_{h\in\bar{\mathcal{H}}}\Big\{u_{n-1}(h) - \tilde{\ell}(h, (x_n, y_n))\Big\}$$
$$\le \frac{1}{1-\gamma}\Big(u_{n-1}(h_1) + \tilde{\ell}(h_1, (x_n, y_n)) + u_{n-1}(h_2) - \tilde{\ell}(h_2, (x_n, y_n))\Big)$$

$$\le \frac{1}{1-\gamma}\Big(u_{n-1}(h_1) + u_{n-1}(h_2) + \sqrt{6H\big(\tilde{\ell}(h_1;(x_n,y_n)) + \tilde{\ell}(h_2;(x_n,y_n))\big)}\,\Big|\min_{\tilde{x}^1_n\in B(x_n,\alpha)} y_nh_1(\tilde{x}^1_n) - \min_{\tilde{x}^2_n\in B(x_n,\alpha)} y_nh_2(\tilde{x}^2_n)\Big|\Big)$$

$$\le \frac{1}{1-\gamma}\Big(u_{n-1}(h_1) + u_{n-1}(h_2) + \sqrt{12H\tau_n}\,\Big|\min_{\tilde{x}^1_n\in B(x_n,\alpha)} y_nh_1(\tilde{x}^1_n) - \min_{\tilde{x}^2_n\in B(x_n,\alpha)} y_nh_2(\tilde{x}^2_n)\Big|\Big).$$

Define $s_n = \mathrm{sign}\big(\min_{\tilde{x}^1_n\in B(x_n,\alpha)} y_nh_1(\tilde{x}^1_n) - \min_{\tilde{x}^2_n\in B(x_n,\alpha)} y_nh_2(\tilde{x}^2_n)\big)$. Then the expression above is at most

$$\frac{1}{1-\gamma}\left(\sup_{h\in\bar{\mathcal{H}}}\Big\{u_{n-1}(h) + s_n\sqrt{12H\tau_n}\min_{\tilde{x}_n\in B(x_n,\alpha)} y_nh(\tilde{x}_n)\Big\} + \sup_{h\in\bar{\mathcal{H}}}\Big\{u_{n-1}(h) - s_n\sqrt{12H\tau_n}\min_{\tilde{x}_n\in B(x_n,\alpha)} y_nh(\tilde{x}_n)\Big\}\right)$$

$$= \frac{2}{1-\gamma}\,\mathbb{E}_{\sigma_n}\left[\sup_{h\in\bar{\mathcal{H}}}\Big\{u_{n-1}(h) + \sigma_n\sqrt{12H\tau_n}\min_{\tilde{x}_n\in B(x_n,\alpha)} y_nh(\tilde{x}_n)\Big\}\right].$$

Letting $\gamma \to 0$ and applying the same induction for each $\sigma_i$ with $i \ne n$ proves the result. $\square$

**Theorem A.5.** Assume that the learner is given a weighted linear combination $f^{\mathrm{aux}}_\beta(\cdot) = \sum_{j=1}^k\beta_jf^{\mathrm{aux}}_j(\cdot)$ of auxiliary hypotheses with weights $\beta\in\mathbb{R}^k$. Given a scalar $\lambda>0$, for any i.i.d. sample $S\sim\mathcal{D}^n$ of size $n$, define the classes

$$\mathcal{H}'=\Big\{h\in\mathcal{H}: \|h\|_\infty\le\sqrt{\tfrac{\kappa}{\lambda}\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}\Big\},\qquad \mathcal{V}=\{\beta:\Psi(\beta)\le\rho\},$$

and the adversarial loss class $\bar{\mathcal{L}} = \{(x,y)\mapsto\tilde{\ell}(h(x)+f^{\mathrm{aux}}_\beta(x),y): h\in\mathcal{H}',\ \beta\in\mathcal{V}\}$. Under Assumptions 1 and 2, for the adversarial loss class $\bar{\mathcal{L}}$ we have

$$\mathfrak{R}(\bar{\mathcal{L}}) \le O\!\left(\sqrt{\frac{H(\kappa+\sigma C)\,L_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)\big(L_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)/\lambda+\rho\big)}{n\sigma}}\right).$$

*Proof of Theorem A.5.* Define $\tilde{x}_i(h_\beta) = \operatorname{argmin}_{\tilde{x}_i\in B(x_i,\alpha)} y_i\big(h(\tilde{x}_i)+f^{\mathrm{aux}}_\beta(\tilde{x}_i)\big)$ for $i\in[n]$. Applying Lemma A.4 gives

$$\hat{\mathfrak{R}}_S(\bar{\mathcal{L}}) \le \frac{\sqrt{12H}}{n}\left(\mathbb{E}_\sigma\sup_{h\in\mathcal{H}'}\sum_{i=1}^n\sigma_i\sqrt{\tau_i}\,y_i\langle h, k(\tilde{x}_i(h_\beta),\cdot)\rangle + \mathbb{E}_\sigma\sup_{\beta\in\mathcal{V}}\sum_{i=1}^n\sigma_i\sqrt{\tau_i}\,y_i\langle\beta,\mathbf{f}^{\mathrm{aux}}(\tilde{x}_i(h_\beta))\rangle\right),$$

where $\mathbf{f}^{\mathrm{aux}} = [f^{\mathrm{aux}}_1,\ldots,f^{\mathrm{aux}}_k]^\top$. Let $t>0$. For the first term, consider $\Omega(h)=\|h\|_k^2$; setting $v_i=t\sigma_i\sqrt{\tau_i}\,k(\tilde{x}_i(h_\beta),\cdot)$ and applying Lemma A.2 (together with Assumption 2, the definition of $\mathcal{H}'$, and $\mathbb{E}_\sigma[\sigma_i]=0$) gives

$$\mathbb{E}_\sigma\sup_{h\in\mathcal{H}'}\sum_{i=1}^n\big\langle h,\, t\sigma_i\sqrt{\tau_i}\,k(\tilde{x}_i(h_\beta),\cdot)\big\rangle \le \frac{\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}{\lambda} + \frac{t^2\kappa}{2\sigma}\sum_{i=1}^n|\tau_i|.$$

Similarly, for the second term,

$$\mathbb{E}_\sigma\sup_{\beta\in\mathcal{V}}\sum_{i=1}^n\big\langle\beta,\, t\sigma_i\sqrt{\tau_i}\,\mathbf{f}^{\mathrm{aux}}(\tilde{x}_i(h_\beta))\big\rangle \le \rho + \frac{t^2C^2}{2}\sum_{i=1}^n|\tau_i|.$$

Combining the two terms (after dividing by $t$) and optimizing over $t$ gives

$$\hat{\mathfrak{R}}_S(\bar{\mathcal{L}}) \le O\!\left(\sqrt{\frac{H\big(\kappa/\sigma + C^2\big)\Big(\frac{1}{n}\sum_{i=1}^n|\tau_i|\Big)\big(\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)/\lambda+\rho\big)}{n}}\right). \tag{3}$$

Since $\phi(\cdot)$ is an $H$-smooth, monotonically decreasing function, define $\tilde{x}_1 = \operatorname{argmin}_{\tilde{x}\in B(x,\alpha)} yf^{\mathrm{aux}}_\beta(\tilde{x})$ and $\tilde{x}_2 = \operatorname{argmin}_{\tilde{x}\in B(x,\alpha)} y\big(h(\tilde{x})+f^{\mathrm{aux}}_\beta(\tilde{x})\big)$. By smoothness,

$$\tilde{\ell}(h+f^{\mathrm{aux}}_\beta,(x,y)) = \phi\big(y(h(\tilde{x}_2)+f^{\mathrm{aux}}_\beta(\tilde{x}_2))\big) \le \phi(yf^{\mathrm{aux}}_\beta(\tilde{x}_1)) + \phi'(yf^{\mathrm{aux}}_\beta(\tilde{x}_1))\,y\big(h(\tilde{x}_2)+f^{\mathrm{aux}}_\beta(\tilde{x}_2)-f^{\mathrm{aux}}_\beta(\tilde{x}_1)\big) + \frac{H}{2}\big(h(\tilde{x}_2)+f^{\mathrm{aux}}_\beta(\tilde{x}_2)-f^{\mathrm{aux}}_\beta(\tilde{x}_1)\big)^2.$$

By the definition of $\tilde{x}_1,\tilde{x}_2$, we have $yh(\tilde{x}_2) \le y\big(h(\tilde{x}_2)+f^{\mathrm{aux}}_\beta(\tilde{x}_2)-f^{\mathrm{aux}}_\beta(\tilde{x}_1)\big) \le yh(\tilde{x}_1)$, and hence, using $\|h\|_\infty\le\sqrt{\kappa\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)/\lambda}$ on $\mathcal{H}'$, the choice

$$\tau_i = \tilde{\ell}(f^{\mathrm{aux}}_\beta,(x_i,y_i)) + 2\sqrt{\frac{H\kappa}{\lambda}\,\tilde{\ell}(f^{\mathrm{aux}}_\beta,(x_i,y_i))\,\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)} + \frac{H\kappa}{\lambda}\,\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$$

is valid. As a result, applying Jensen's inequality gives

$$\frac{1}{n}\sum_{i=1}^n|\tau_i| \le \left(1+\sqrt{\frac{H\kappa}{\lambda}}\right)^2\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta).$$

Plugging back into Equation (3) and taking expectation over $S$ (using Jensen's inequality once more, together with $\mathbb{E}_S[\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)] = L_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$) gives

$$\mathfrak{R}(\bar{\mathcal{L}}) = \mathbb{E}_S\big[\hat{\mathfrak{R}}_S(\bar{\mathcal{L}})\big] \le O\!\left(\sqrt{\frac{H(\kappa+\sigma C)\,L_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)\big(L_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)/\lambda+\rho\big)}{n\sigma}}\right). \qquad\square$$

**Theorem 3.1.** Assume that the learner is given a weighted linear combination $f^{\mathrm{aux}}_\beta(\cdot) = \sum_{j=1}^k\beta_jf^{\mathrm{aux}}_j(\cdot)$ of auxiliary hypotheses with weights $\beta\in\mathbb{R}^k$ such that $\Psi(\beta)\le\rho$. Fix any $\delta>0$, and say Assumptions 1 and 2 hold. Then, given an i.i.d.
sample $S\sim\mathcal{D}^n$ of size $n$, for any $\lambda>0$, the A-RERM rule returns $h_{\hat{w},\beta}$ such that, with probability at least $1-\delta$,

$$L_{\mathrm{adv}}(h_{\hat{w},\beta}) - \hat{L}_{\mathrm{adv}}(h_{\hat{w},\beta}) \le O\!\left(\frac{\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}\rho}+\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}}}{\sqrt{n}} + \frac{L^{\mathrm{aux}}_{\mathrm{adv}}/\lambda+\rho}{n}\right),$$

where $L^{\mathrm{aux}}_{\mathrm{adv}} = L^\alpha_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$.

*Proof of Theorem 3.1.* Define the adversarial loss class $\bar{\mathcal{L}} := \{(x,y)\mapsto\tilde{\ell}(\bar{h},(x,y)): \bar{h}\in\bar{\mathcal{H}}'\}$ over the expanded hypothesis class

$$\bar{\mathcal{H}}' = \Big\{x\mapsto h_{w,\beta}(x) : h_{w,\beta}=h_w+f^{\mathrm{aux}}_\beta,\ h_w\in\mathcal{H},\ \Omega(h_w)\le\frac{\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}{\lambda},\ \Psi(\beta)\le\rho,\ \hat{L}_{\mathrm{adv}}(h_{w,\beta})\le\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)\Big\}.$$

Recall the optimization problem $h_{\hat{w}} = \operatorname{argmin}_{h\in\mathcal{H}}\{\hat{L}_{\mathrm{adv}}(h+f^{\mathrm{aux}}_\beta)+\lambda\Omega(h)\}$, with $h_{\hat{w},\beta}=h_{\hat{w}}+f^{\mathrm{aux}}_\beta\in\bar{\mathcal{H}}$. By optimality (comparing with $h=0$), $\hat{L}_{\mathrm{adv}}(h_{\hat{w}}+f^{\mathrm{aux}}_\beta)+\lambda\Omega(h_{\hat{w}})\le\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$, which gives $\Omega(h_{\hat{w}})\le\frac{1}{\lambda}\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$ and $\hat{L}_{\mathrm{adv}}(h_{\hat{w},\beta})\le\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$. Leveraging Lemma A.3, we have $h_{\hat{w},\beta}\in\bar{\mathcal{H}}'$. From Theorem A.5 we have

$$\mathfrak{R}(\bar{\mathcal{L}}) \le O\!\left(\sqrt{\frac{H(\kappa+\sigma C)\,L^{\mathrm{aux}}_{\mathrm{adv}}\big(L^{\mathrm{aux}}_{\mathrm{adv}}/\lambda+\rho\big)}{n\sigma}}\right),$$

and

$$r = \sup_{\bar{h}\in\bar{\mathcal{H}}'}L_{\mathrm{adv}}(\bar{h}) = \sup_{\bar{h}\in\bar{\mathcal{H}}'}\mathbb{E}_S\big[\hat{L}_{\mathrm{adv}}(\bar{h})\big] \le \mathbb{E}_S\Big[\sup_{\bar{h}\in\bar{\mathcal{H}}'}\hat{L}_{\mathrm{adv}}(\bar{h})\Big] \le \mathbb{E}_S\big[\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)\big] = L^{\mathrm{aux}}_{\mathrm{adv}}.$$

Plugging into Theorem A.1 gives, with probability at least $1-\delta$,

$$L_{\mathrm{adv}}(h_{\hat{w},\beta}) - \hat{L}_{\mathrm{adv}}(h_{\hat{w},\beta}) \le 2\mathfrak{R}(\bar{\mathcal{L}}) + \sqrt{\frac{(4\mathfrak{R}(\bar{\mathcal{L}})+L^{\mathrm{aux}}_{\mathrm{adv}})M\log(1/\delta)}{2n}} + \frac{1.5M\log(1/\delta)}{n}$$

$$\le 2\mathfrak{R}(\bar{\mathcal{L}}) + \sqrt{\frac{2\mathfrak{R}(\bar{\mathcal{L}})M\log(1/\delta)}{n}} + \sqrt{\frac{L^{\mathrm{aux}}_{\mathrm{adv}}M\log(1/\delta)}{2n}} + \frac{1.5M\log(1/\delta)}{n} \qquad \big(\sqrt{a+b}\le\sqrt{a}+\sqrt{b}\big)$$

$$\le O\!\left(\frac{\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}\rho}+\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}}}{\sqrt{n}} + \frac{L^{\mathrm{aux}}_{\mathrm{adv}}/\lambda+\rho}{n}\right). \qquad\square$$

**Theorem 3.2.** Assume that the learner is given a weighted linear combination $f^{\mathrm{aux}}_\beta(\cdot)=\sum_{j=1}^k\beta_jf^{\mathrm{aux}}_j(\cdot)$ of auxiliary hypotheses with weights $\beta\in\mathbb{R}^k$ such that $\Psi(\beta)\le\rho$. Fix any $\delta>0$, and say Assumptions 1 and 2 hold. Let $\tau>0$ be such that $\sup_{h\in\mathcal{H}}\Omega(h)\le\tau$. Then, given an i.i.d. sample $S\sim\mathcal{D}^n$ of size $n$, and setting $\lambda$ as in the proof below, the A-RERM rule returns $h_{\hat{w},\beta}$ such that, with probability at least $1-\delta$,

$$L_{\mathrm{adv}}(h_{\hat{w},\beta}) - \min_{h_w:\Omega(h_w)\le\tau}L_{\mathrm{adv}}(h_{w,\beta}) \le O\!\left(\frac{\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}}}{n^{1/2}} + \frac{\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}}+\sqrt[4]{L^{\mathrm{aux}}_{\mathrm{adv}}\rho}}{n^{1/4}}\sqrt{\tau} + \frac{\sqrt[4]{L^{\mathrm{aux}}_{\mathrm{adv}}}+\sqrt[8]{L^{\mathrm{aux}}_{\mathrm{adv}}\rho}}{n^{3/8}}\sqrt{\tau}\right).$$

*Proof of Theorem 3.2.* For any choice of $\beta$ with $\Psi(\beta)\le\rho$, denote the optimal hypothesis in the class by $h_{w^\star} = \operatorname{argmin}_{h_w:\Omega(h_w)\le\tau}L_{\mathrm{adv}}(h_{w,\beta})$. By the definition of $h_{\hat{w}}$, we have

$$\hat{L}_{\mathrm{adv}}(h_{\hat{w},\beta}) + \lambda\Omega(h_{\hat{w}}) \le \hat{L}_{\mathrm{adv}}(h_{w^\star,\beta}) + \lambda\Omega(h_{w^\star}).$$

Now denote $Z = (\kappa+\sigma C)\sqrt{\frac{H^2\kappa\,L^{\mathrm{aux}}_{\mathrm{adv}}}{n}\big(\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}}+\rho\big)}$. Then, following the proof of Theorem 3.1,

$$L_{\mathrm{adv}}(h_{\hat{w},\beta}) \le \hat{L}_{\mathrm{adv}}(h_{w^\star,\beta}) + \lambda\tau + Z + \sqrt{\frac{L^{\mathrm{aux}}_{\mathrm{adv}}M\log(1/\delta)}{n}} + \sqrt{\frac{ZM\log(1/\delta)}{n}} + \frac{M\log(1/\delta)}{n}.$$

Optimizing over $\lambda$ (balancing $\lambda\tau$ against the $\lambda$-dependence hidden in $Z$) and plugging the resulting choice back gives

$$L_{\mathrm{adv}}(h_{\hat{w},\beta}) \le \hat{L}_{\mathrm{adv}}(h_{w^\star,\beta}) + O\!\left(\frac{\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}}+\sqrt[4]{L^{\mathrm{aux}}_{\mathrm{adv}}\rho}}{n^{1/4}}\sqrt{\tau} + \frac{\sqrt[4]{L^{\mathrm{aux}}_{\mathrm{adv}}}+\sqrt[8]{L^{\mathrm{aux}}_{\mathrm{adv}}\rho}}{n^{3/8}}\sqrt{\tau} + \sqrt{\frac{L^{\mathrm{aux}}_{\mathrm{adv}}M\log(1/\delta)}{n}} + \frac{M\log(1/\delta)}{n}\right). \tag{4}$$

We finally use Bernstein's inequality to concentrate $\hat{L}_{\mathrm{adv}}(h_{w^\star,\beta})$ around $L_{\mathrm{adv}}(h_{w^\star,\beta})$. Formally, with probability at least $1-\delta$,

$$\hat{L}_{\mathrm{adv}}(h_{w^\star,\beta}) \le L_{\mathrm{adv}}(h_{w^\star,\beta}) + \sqrt{\frac{2L_{\mathrm{adv}}(h_{w^\star,\beta})M\log(1/\delta)}{n}} + \frac{2M\log(1/\delta)}{3n} \le L_{\mathrm{adv}}(h_{w^\star,\beta}) + \sqrt{\frac{2L^{\mathrm{aux}}_{\mathrm{adv}}M\log(1/\delta)}{n}} + \frac{2M\log(1/\delta)}{3n},$$

where the last step uses $L_{\mathrm{adv}}(h_{w^\star,\beta})\le L^{\mathrm{aux}}_{\mathrm{adv}}$ (since $h=0$ is in the class). Plugging this back into Equation (4) gives the claimed bound. $\square$

We now provide theoretical results that generalize Section 3 from the RKHS additive hypothesis class to two-layer neural networks with ReLU activation functions.

**Theorem A.6.**
Assume that the learner is given a weighted linear combination $f^{\mathrm{aux}}_\beta(\cdot)=\sum_{j=1}^k\beta_jf^{\mathrm{aux}}_j(\cdot)$ of auxiliary hypotheses with weights $\beta\in\mathbb{R}^k$. Given a scalar $\lambda>0$, for any i.i.d. sample $S\sim\mathcal{D}^n$ of size $n$, define the classes

$$\mathcal{H}'=\Big\{h\in\mathcal{H}:\Omega(h)\le\frac{\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}{\lambda}\Big\},\qquad \mathcal{V}=\{\beta:\Psi(\beta)\le\rho\},$$

and the adversarial loss class $\bar{\mathcal{L}}=\{(x,y)\mapsto\tilde{\ell}(h(x)+f^{\mathrm{aux}}_\beta(x),y):h\in\mathcal{H}',\ \beta\in\mathcal{V}\}$. Under Assumption 1, for the adversarial loss class $\bar{\mathcal{L}}$ we have

$$\mathfrak{R}(\bar{\mathcal{L}}) \le \sqrt{\frac{6H(\kappa+\sigma C)\,L_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)\big(L_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)/\lambda+\rho/2\big)}{n}}.$$

*Proof of Theorem A.6.* We follow the proof of Theorem A.5. Define $\tilde{x}_i(h_\beta)=\operatorname{argmin}_{\tilde{x}_i\in B(x_i,\alpha)} y_i\big(h(\tilde{x}_i)+f^{\mathrm{aux}}_\beta(\tilde{x}_i)\big)$, $i\in[n]$. Applying Lemma A.4 gives

$$\hat{\mathfrak{R}}_S(\bar{\mathcal{L}}) \le \frac{\sqrt{12H}}{n}\left(\mathbb{E}_\sigma\sup_{h\in\mathcal{H}'}\sum_{i=1}^n\sigma_i\sqrt{\tau_i}\,y_ih(\tilde{x}_i(h_\beta)) + \mathbb{E}_\sigma\sup_{\beta\in\mathcal{V}}\sum_{i=1}^n\sigma_i\sqrt{\tau_i}\,y_i\langle\beta,\mathbf{f}^{\mathrm{aux}}(\tilde{x}_i(h_\beta))\rangle\right).$$

For the first term, recall that $h(x)=\sum_{s=1}^ma_s\psi(\langle w_s,x\rangle)$ with $|a_s|=\frac{1}{\sqrt{m}}$. We have

$$\mathbb{E}_\sigma\sup_{h\in\mathcal{H}'}\sum_{i=1}^n\sigma_i\sqrt{\tau_i}\,y_i\sum_{s=1}^ma_s\psi(\langle w_s,\tilde{x}_i(h_\beta)\rangle) \le \sqrt{m}\,\mathbb{E}_\sigma\sup_{w_s}\Big|\sum_{i=1}^n\sigma_i\sqrt{\tau_i}\,y_i\psi(\langle w_s,\tilde{x}_i(h_\beta)\rangle)\Big| \qquad \Big(\textstyle\sum_{s=1}^m|a_s|\le\sqrt{m}\Big)$$

$$\le 2\sqrt{m}\,\mathbb{E}_\sigma\sup_{w_s}\sum_{i=1}^n\sigma_i\sqrt{\tau_i}\,y_i\psi(\langle w_s,\tilde{x}_i(h_\beta)\rangle) \qquad \Big(\mathbb{E}_\sigma\sup_h\big|\tfrac{1}{n}\textstyle\sum_i\sigma_ih(z_i)\big| \le 2\,\mathbb{E}_\sigma\sup_h\tfrac{1}{n}\textstyle\sum_i\sigma_ih(z_i)\Big)$$

$$\le 2\sqrt{m}\,\mathbb{E}_\sigma\sup_{w_s}\sum_{i=1}^n\sigma_i\sqrt{\tau_i}\,y_i\langle w_s,\tilde{x}_i(h_\beta)\rangle \qquad \text{(Talagrand's contraction lemma)}.$$

Letting $t>0$, setting $v_i=t\sigma_i\sqrt{\tau_i}\,\tilde{x}_i(h_\beta)$ and applying Lemma A.2 (with $\mathbb{E}_\sigma[\sigma_i]=0$ and $\sup_{w_s}\Omega(w_s)\le\Omega(h)\le\frac{\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}{\lambda}$ by the definition of $\mathcal{H}'$) gives

$$\mathbb{E}_\sigma\sup_{w_s}\sum_{i=1}^n\big\langle w_s,\, t\sigma_i\sqrt{\tau_i}\,\tilde{x}_i(h_\beta)\big\rangle \le \frac{\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}{\lambda} + \frac{t^2\kappa^2}{2\sigma}\sum_{i=1}^n|\tau_i|.$$

The second term is handled in the same way as in the proof of Theorem A.5:

$$\mathbb{E}_\sigma\sup_{\beta\in\mathcal{V}}\sum_{i=1}^n\big\langle\beta,\, t\sigma_i\sqrt{\tau_i}\,\mathbf{f}^{\mathrm{aux}}(\tilde{x}_i(h_\beta))\big\rangle \le \rho + \frac{t^2C^2}{2}\sum_{i=1}^n|\tau_i|.$$

Combining the two terms and optimizing over $t$ gives

$$\hat{\mathfrak{R}}_S(\bar{\mathcal{L}}) \le O\!\left(\sqrt{\frac{H(\kappa+\sigma C)\Big(\frac{1}{n}\sum_{i=1}^n|\tau_i|\Big)\big(\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)/\lambda+\rho/2\big)}{n}}\right). \tag{5}$$

Since $\phi(\cdot)$ is an $H$-smooth, monotonically decreasing function, define $\tilde{x}_1,\tilde{x}_2$ as in the proof of Theorem A.5; the same smoothness argument applies, with $\|h\|_\infty$ controlled as follows. Since $\Omega$ is $\sigma$-strongly convex with minimizer $v$, we have $\frac{\sigma}{2}\|w_s-v\|^2\le\Omega(w_s)-\Omega(v)$; choosing $v=0$ gives $\|w_s\|^2\le\frac{2}{\sigma}\Omega(w_s)$, and hence, by Cauchy–Schwarz,

$$|h(x)| = \Big|\sum_{s=1}^ma_s\psi(\langle w_s,x\rangle)\Big| \le \frac{\kappa}{\sqrt{m}}\sum_{s=1}^m\|w_s\| \le \kappa\sqrt{\sum_{s=1}^m\|w_s\|^2} \le \kappa\sqrt{\frac{2}{\sigma}\Omega(h)} \le \kappa\sqrt{\frac{2\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}{\sigma\lambda}}.$$

By the definition of $\tau_i$, we can take

$$\tau_i = \tilde{\ell}(f^{\mathrm{aux}}_\beta,(x_i,y_i)) + 2\sqrt{\frac{2H\kappa^2}{\sigma\lambda}\,\tilde{\ell}(f^{\mathrm{aux}}_\beta,(x_i,y_i))\,\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)} + \frac{2H\kappa^2}{\sigma\lambda}\,\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta),$$

and applying Jensen's inequality gives $\frac{1}{n}\sum_{i=1}^n|\tau_i| \le O\big(\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)\big)$. Plugging back into Equation (5) and taking expectation over $S$ gives

$$\mathfrak{R}(\bar{\mathcal{L}}) = \mathbb{E}_S\big[\hat{\mathfrak{R}}_S(\bar{\mathcal{L}})\big] \le \sqrt{\frac{6H(\kappa+\sigma C)\,L_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)\big(L_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)/\lambda+\rho/2\big)}{n}}. \qquad\square$$

**Theorem A.7.** Assume that the learner is given a weighted linear combination $f^{\mathrm{aux}}_\beta(\cdot)=\sum_{j=1}^k\beta_jf^{\mathrm{aux}}_j(\cdot)$ of auxiliary hypotheses with weights $\beta\in\mathbb{R}^k$ such that $\Psi(\beta)\le\rho$. Fix any $\delta>0$. Then, given an i.i.d. sample $S\sim\mathcal{D}^n$ of size $n$, for any $\lambda>0$, the A-RERM rule returns $h_{\hat{w},\beta}$ such that, with probability at least $1-\delta$,

$$L_{\mathrm{adv}}(h_{\hat{w},\beta}) \le \hat{L}_{\mathrm{adv}}(h_{\hat{w},\beta}) + O\!\left(\sqrt{\frac{HL^{\mathrm{aux}}_{\mathrm{adv}} + B\kappa H(\kappa+\sigma C)\big(L^{\mathrm{aux}}_{\mathrm{adv}}/\lambda+\rho\big)}{n}}\right) + \frac{1.5M\log(1/\delta)}{n} + \sqrt{\frac{\Big(L^{\mathrm{aux}}_{\mathrm{adv}} + \sqrt{HL^{\mathrm{aux}}_{\mathrm{adv}} + B\kappa H(\kappa+\sigma C)\big(L^{\mathrm{aux}}_{\mathrm{adv}}/\lambda+\rho\big)}\Big)M\log(1/\delta)}{n}},$$

where $L^{\mathrm{aux}}_{\mathrm{adv}} = L^\alpha_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$.

*Proof of Theorem A.7.* The procedure is similar to the proof of Theorem 3.1.
Define the adversarial loss class $\bar{\mathcal{L}}:=\{(x,y)\mapsto\tilde{\ell}(\bar{h};(x,y)):\bar{h}\in\bar{\mathcal{H}}'\}$ over the expanded hypothesis class

$$\bar{\mathcal{H}}' = \Big\{x\mapsto h_{w,\beta}(x): h_{w,\beta}=h_w+f^{\mathrm{aux}}_\beta,\ h_w\in\mathcal{H},\ \Omega(h_w)\le\frac{\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)}{\lambda},\ \|h_w\|_\infty\le\sqrt{m}B\kappa,\ \Psi(\beta)\le\rho,\ \hat{L}_{\mathrm{adv}}(h_{w,\beta})\le\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)\Big\}.$$

Recall the optimization problem $h_{\hat{w}}=\operatorname{argmin}_{h\in\mathcal{H}}\{\hat{L}_{\mathrm{adv}}(h+f^{\mathrm{aux}}_\beta)+\lambda\Omega(h)\}$, with $h_{\hat{w},\beta}=h_{\hat{w}}+f^{\mathrm{aux}}_\beta$. By optimality, $\hat{L}_{\mathrm{adv}}(h_{\hat{w}}+f^{\mathrm{aux}}_\beta)+\lambda\Omega(h_{\hat{w}})\le\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$, which gives $\Omega(h_{\hat{w}})\le\frac{1}{\lambda}\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$ and $\hat{L}_{\mathrm{adv}}(h_{\hat{w},\beta})\le\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)$, so $h_{\hat{w},\beta}\in\bar{\mathcal{H}}'$. From Theorem A.6 we have

$$\mathfrak{R}(\bar{\mathcal{L}}) \le O\!\left(\sqrt{\frac{HL^{\mathrm{aux}}_{\mathrm{adv}} + B\kappa H(\kappa+\sigma C)\big(L^{\mathrm{aux}}_{\mathrm{adv}}/\lambda+\rho\big)}{n}}\right), \qquad r=\sup_{\bar{h}\in\bar{\mathcal{H}}'}L_{\mathrm{adv}}(\bar{h}) \le \mathbb{E}_S\big[\hat{L}_{\mathrm{adv}}(f^{\mathrm{aux}}_\beta)\big]=L^{\mathrm{aux}}_{\mathrm{adv}}.$$

Plugging into Theorem A.1 (and using $\sqrt{a+b}\le\sqrt{a}+\sqrt{b}$) gives the claimed bound. If we further ignore the dependence on $H$, $B$, $\kappa$, $C$, and $\log(1/\delta)$, the generalization gap can be rewritten as

$$L_{\mathrm{adv}}(h_{\hat{w},\beta}) - \hat{L}_{\mathrm{adv}}(h_{\hat{w},\beta}) \le O\!\left(\frac{\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}\rho}+\sqrt{L^{\mathrm{aux}}_{\mathrm{adv}}}}{\sqrt{n}} + \frac{L^{\mathrm{aux}}_{\mathrm{adv}}/\lambda+\rho}{n}\right).$$

As with Theorem 3.1, the generalization gap exhibits a fast rate of $O(\frac{1}{n})$ when $L^{\mathrm{aux}}_{\mathrm{adv}} = O(1/n)$. $\square$

## B. Missing Proofs in Section 4

Although the adversarial loss is in general non-smooth, it can be characterized via a notion of approximate smoothness, which we introduce below.

**Definition (Xiao et al. (2022b)).** Let $H > 0$ and $\eta > 0$. We say a differentiable function $g(w)$ is $\eta$-approximately $H$-gradient-Lipschitz if, for all $w_1$ and $w_2$, we have

$$\|\nabla g(w_1) - \nabla g(w_2)\| \le H\|w_1 - w_2\| + \eta.$$

With the above definition, Lemma B.1 states the properties that the adversarial loss satisfies.

**Lemma B.1 (Xiao et al. (2022b)).** Let $\tilde{\ell}$ be the adversarial loss defined as $\tilde{\ell}(w; z) = \max_{\tilde{z}\in B_p(z,\alpha)} \ell(w; \tilde{z})$ (here we slightly abuse notation: $\tilde{z}\in B_p(z,\alpha)$ is equivalent to $\tilde{x}\in B_p(x,\alpha)$), with $\ell$ satisfying Assumption 4. Then, for all $w_1, w_2$ and $z\in\mathcal{Z}$, the adversarial loss $\tilde{\ell}$ satisfies:

1. ($L$-Lipschitz) $|\tilde{\ell}(w_1; z) - \tilde{\ell}(w_2; z)| \le L\|w_1 - w_2\|$.
2. ($2H_z\alpha$-approximately $H$-smooth) for all subgradients $d(w, z) \in \partial_w\tilde{\ell}(w; z)$, we have $\|d(w_1; z) - d(w_2; z)\| \le H\|w_1 - w_2\| + 2H_z\alpha$.

For any vector $g \in \mathbb{R}^d$, we define the quantity

$$G_\gamma(w, g) := \frac{1}{\gamma}\big(w - \mathrm{prox}_{\gamma,\lambda\Omega}(w - \gamma g)\big).$$

Then the Proximal Stochastic Adversarial Training update can be rewritten as $w_{t+1,S} = w_{t,S} - \gamma_t G_{\gamma_t}\big(w_{t,S}, \nabla\tilde{\ell}(w_{t,S}, z_{\xi_t})\big)$. We now present several properties of $G_\gamma(w, g)$ that will be used later in our proof.

**Lemma B.2 (Lemma 5 in Zhou et al. (2022)).** Let $\Omega$ be a convex and possibly non-smooth function. Then the following statements hold.

1. For any $w, g_1, g_2$, it holds that $\|G_\gamma(w, g_1) - G_\gamma(w, g_2)\| \le \|g_1 - g_2\|$.
2. If $\Omega$ is $\lambda$-strongly convex, then for all $w, v \in \mathcal{W}$ and $\gamma > 0$, it holds that $\|\mathrm{prox}_{\gamma,\Omega}(w) - \mathrm{prox}_{\gamma,\Omega}(v)\| \le \frac{1}{1+\gamma\lambda}\|w - v\|$.

**Lemma B.3 (Lemma 1 in Ghadimi et al. (2016)).** For any $w \in \mathcal{W}$, $g \in \mathbb{R}^d$, and $\gamma > 0$, it holds that

$$\langle g, G_\gamma(w, g)\rangle \ge \|G_\gamma(w, g)\|^2 + \frac{\lambda}{\gamma}\Big(\Omega\big(\mathrm{prox}_{\gamma,\lambda\Omega}(w - \gamma g)\big) - \Omega(w)\Big).$$

We first provide a result that connects the initialization with the norm of the gradient of the adversarial loss.

**Lemma B.4.** Under Assumptions 1, 3 and 4, consider applying Proximal Stochastic Adversarial Training with training data $S$ and $\gamma_t \le \frac{c}{t+1}$ with $0 < c < \frac{1}{H}$. Then, for all $i \in [n]$, it holds that

$$\mathbb{E}_{S,\xi}\big\|\nabla\tilde{\ell}(w_{t,S}; z_i)\big\| \le \sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log T} + 4H_z\alpha.$$

*Proof.* The proof follows an idea similar to Zhou et al. (2022, Lemma 6). Denote by $g_{t,S} = \nabla\tilde{\ell}(w_{t,S}; z_{\xi_t})$ the stochastic gradient of the adversarial loss sampled at iteration $t$.
We first provide the result that connects the initialization with the norm of the gradient of the adversarial loss.

Lemma B.4. Under Assumptions 1, 3, and 4, consider applying Proximal Stochastic Adversarial Training with training data $S$, and choose $\gamma_t \le \frac{c}{t+1}$ with $0 < c < \frac{1}{H}$. Then, for every $i \in [n]$, it holds that
$$\mathbb{E}_{S,\xi}\big\| \nabla\widetilde{\ell}(w_{t,S}; z_i) \big\| \le \sqrt{ 2H\Phi_{\mathrm{adv}}(w_0) + \big( 4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2 \big)\log(T) } + 4H_z\alpha.$$

Proof. The proof follows an idea similar to Zhou et al. (2022, Lemma 6). Denote by $g_{t,S} = \nabla\widetilde{\ell}(w_{t,S}; z_{\xi_t})$ the stochastic gradient of the adversarial loss sampled at iteration $t$. Setting $w = w_{t,S}$ and $g = g_{t,S}$ in Lemma B.3, we have
$$\langle g_{t,S}, G_\gamma(w_{t,S}, g_{t,S})\rangle \ge \|G_\gamma(w_{t,S}, g_{t,S})\|^2 + \frac{\lambda}{\gamma}\big( \Omega(w_{t+1,S}) - \Omega(w_{t,S}) \big). \tag{6}$$
Since $\widetilde{\ell}$ is non-negative and $\eta$-approximately $H$-smooth (with $\eta = 2H_z\alpha$), applying Xiao et al. (2022a, Lemma 4.2) gives us
$$\widetilde{\ell}(w_1, z) - \widetilde{\ell}(w_2, z) \le \big\langle \nabla\widetilde{\ell}(w_2, z),\ w_1 - w_2 \big\rangle + \frac{H}{2}\|w_1 - w_2\|^2 + \eta\|w_1 - w_2\|. \tag{7}$$
Choosing $w_1 = w_2 - \frac{1}{H}\nabla\widetilde{\ell}(w_2, z)$ gives us
$$0 \le \widetilde{\ell}(w_1, z) \le \widetilde{\ell}(w_2, z) - \frac{1}{2H}\big\|\nabla\widetilde{\ell}(w_2, z)\big\|^2 + \frac{\eta}{H}\big\|\nabla\widetilde{\ell}(w_2, z)\big\|.$$
Rearranging gives us
$$\big\|\nabla\widetilde{\ell}(w_2, z)\big\| \le 2\eta + \sqrt{2H\,\widetilde{\ell}(w_2, z)}. \tag{8}$$
Choosing $w_2 = w_{t,S}$ and taking expectations w.r.t. the training data and the algorithmic randomness gives us
$$\mathbb{E}_{\xi,S}\big\|\nabla\widetilde{\ell}(w_{t,S}; z_i)\big\| \le \mathbb{E}_{\xi,S}\sqrt{2H\,\widetilde{\ell}(w_{t,S}; z_i)} + 2\eta \le \sqrt{2H\,\mathbb{E}_{\xi,S}\big[\widetilde{\ell}(w_{t,S}; z_i)\big]} + 2\eta \qquad (\text{Jensen's inequality})$$
$$\le \sqrt{2H\,\mathbb{E}_{\xi,S}\Big[\tfrac{1}{n}\textstyle\sum_{i=1}^{n} \widetilde{\ell}(w_{t,S}; z_i)\Big]} + 2\eta \qquad (\text{all samples in } S \text{ are generated i.i.d. from } \mathcal{D})$$
$$\le \sqrt{2H\,\mathbb{E}_{\xi,S}\big[\widehat{\Phi}_{\mathrm{adv}}(w_{t,S})\big]} + 2\eta. \tag{9}$$
Moreover, for a fixed $S$, we have
$$\widehat{L}_{\mathrm{adv}}(w_{t+1,S}) - \widehat{L}_{\mathrm{adv}}(w_{t,S}) = \frac{1}{n}\sum_{i=1}^{n}\big[ \widetilde{\ell}(w_{t+1,S}, z_i) - \widetilde{\ell}(w_{t,S}, z_i) \big]$$
$$\le \big\langle w_{t+1,S} - w_{t,S},\ \nabla\widehat{L}_{\mathrm{adv}}(w_{t,S}) \big\rangle + \frac{H}{2}\|w_{t+1,S} - w_{t,S}\|^2 + \eta\|w_{t+1,S} - w_{t,S}\| \qquad (\text{Equation (7)})$$
$$= -\gamma_t\big\langle G_{\gamma_t}(w_{t,S}, g_{t,S}),\ g_{t,S} \big\rangle - \gamma_t\big\langle G_{\gamma_t}(w_{t,S}, \nabla\widehat{L}_{\mathrm{adv}}(w_{t,S})),\ \nabla\widehat{L}_{\mathrm{adv}}(w_{t,S}) - g_{t,S} \big\rangle + \frac{H\gamma_t^2}{2}\|G_{\gamma_t}(w_{t,S}, g_{t,S})\|^2$$
$$\qquad + \gamma_t\big\langle G_{\gamma_t}(w_{t,S}, \nabla\widehat{L}_{\mathrm{adv}}(w_{t,S})) - G_{\gamma_t}(w_{t,S}, g_{t,S}),\ \nabla\widehat{L}_{\mathrm{adv}}(w_{t,S}) - g_{t,S} \big\rangle + \eta\gamma_t\|G_{\gamma_t}(w_{t,S}, g_{t,S})\|.$$
Combined with Equation (6), this gives us
$$\widehat{\Phi}_{\mathrm{adv}}(w_{t+1,S}) - \widehat{\Phi}_{\mathrm{adv}}(w_{t,S}) = \widehat{L}_{\mathrm{adv}}(w_{t+1,S}) - \widehat{L}_{\mathrm{adv}}(w_{t,S}) + \lambda\big( \Omega(w_{t+1,S}) - \Omega(w_{t,S}) \big)$$
$$\le -\gamma_t\Big(1 - \frac{H\gamma_t}{2}\Big)\Big( \|G_{\gamma_t}(w_{t,S}, g_{t,S})\| - \frac{\eta}{2 - H\gamma_t} \Big)^2 - \gamma_t\big\langle G_{\gamma_t}(w_{t,S}, \nabla\widehat{L}_{\mathrm{adv}}(w_{t,S})),\ \nabla\widehat{L}_{\mathrm{adv}}(w_{t,S}) - g_{t,S} \big\rangle$$
$$\qquad + \gamma_t\big\|\nabla\widehat{L}_{\mathrm{adv}}(w_{t,S}) - g_{t,S}\big\|^2 + \frac{\eta^2\gamma_t}{4 - 2H\gamma_t},$$
where the last line uses Lemma B.2. Conditioning on $w_{t,S}$ and taking the expectation w.r.t. $\xi$ (the cross term vanishes since $\mathbb{E}_\xi[g_{t,S} \mid w_{t,S}] = \nabla\widehat{L}_{\mathrm{adv}}(w_{t,S})$), we further have
$$\mathbb{E}_\xi\big[ \widehat{\Phi}_{\mathrm{adv}}(w_{t+1,S}) - \widehat{\Phi}_{\mathrm{adv}}(w_{t,S}) \,\big|\, w_{t,S} \big] \le \gamma_t\,\mathbb{E}_\xi\Big[ \big\|\nabla\widehat{L}_{\mathrm{adv}}(w_{t,S}) - g_{t,S}\big\|^2 \,\Big|\, w_{t,S} \Big] + \frac{\eta^2\gamma_t}{4 - 2H\gamma_t}.$$
Further taking expectation w.r.t. the randomness of $w_{t,S}$ and $S$, and telescoping the above inequality over $t = 0, \ldots, T-1$, gives us
$$\mathbb{E}_{\xi,S}\big[ \widehat{\Phi}_{\mathrm{adv}}(w_{T,S}) \big] \le \mathbb{E}_{\xi,S}\big[ \widehat{\Phi}_{\mathrm{adv}}(w_0) \big] + \mathbb{E}_S[\nu_S^2]\sum_{t=0}^{T-1}\gamma_t + \sum_{t=0}^{T-1}\frac{\eta^2\gamma_t}{4 - 2H\gamma_t} \le \widehat{\Phi}_{\mathrm{adv}}(w_0) + 2c\,\mathbb{E}_S[\nu_S^2]\log(T) + \frac{c\eta^2}{2}\log(T) \tag{10}$$
(using $\gamma_t \le \frac{c}{t+1}$ and $cH \le 1$). As a result, for $cH \le 1$ and every $i \in [n]$, we have
$$\mathbb{E}_{\xi,S}\big\|\nabla\widetilde{\ell}(w_{T,S}; z_i)\big\| \le \sqrt{2H\,\mathbb{E}_{\xi,S}\big[\widehat{\Phi}_{\mathrm{adv}}(w_{t,S})\big]} + 2\eta \qquad (\text{Equation (9)})$$
$$\le \sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(T)} + 2\eta = \sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)} + 4H_z\alpha \qquad (\eta = 2H_z\alpha). \qquad\square$$
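The self-bounding gradient inequality (8) can be sanity-checked numerically on a toy $H$-smooth non-negative function (here $f(w) = \frac12 H w^2$, for which $\eta = 0$ and (8) holds with equality); the function names and constants below are ours, chosen purely for illustration:

```python
import numpy as np

H = 4.0                               # smoothness constant of the toy loss
f  = lambda w: 0.5 * H * w**2         # non-negative, H-smooth toy loss (eta = 0)
df = lambda w: H * w                  # its gradient

for w in np.linspace(-3, 3, 13):
    lhs = abs(df(w))
    rhs = np.sqrt(2 * H * f(w))       # bound (8) with eta = 0: |f'(w)| <= sqrt(2H f(w))
    assert lhs <= rhs + 1e-9
print("self-bounding inequality (8) holds on the toy example")
```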
We first consider the case where $\lambda$ is relatively large.

Lemma 4.1. Suppose Assumptions 1, 3, and 4 hold. Let $w_T$ and $w_T'$ denote the outputs on two neighboring datasets $S, S^{(i)}$, respectively, after running PSAT for $T$ iterations on each of the datasets using $\gamma_t = \frac{c}{t+1}$ with $0 < c < \frac{1}{H}$. Then, for $\lambda > H$, we have
$$\mathbb{E}_{\xi,\, i \sim \mathrm{Uniform}[n],\, S, S^{(i)}}\big\| w_T - w_T' \big\| \le \frac{2(1+\lambda)}{n(\lambda - H)}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)} + \frac{4(1+\lambda)H_z\alpha}{\lambda - H}.$$
Further, for $\lambda > 2H + 1$, we have
$$\mathbb{E}_{\xi,\, i \sim \mathrm{Uniform}[n],\, S, S^{(i)}}\big\| w_T - w_T' \big\|^2 \le \frac{8(H+2)^2(1+\lambda)^2 H_z^2\alpha^2}{(2\lambda - 3H - 2)HT} + \frac{(1+\lambda)^2\big( 32H\Phi_{\mathrm{adv}}(w_0) + (64\mathbb{E}_S[\nu_S^2] + 16H_z^2\alpha^2)\log(T) \big)}{(2\lambda - 3H - 2)HTn}.$$

Proof of Lemma 4.1. Given the training set $S \sim \mathcal{D}^n$ and an additional example $z \sim \mathcal{D}$, let $S^{(i)}$ be the training set obtained by replacing the $i$-th example of $S$ with $z$; namely, $S^{(i)} = (z_1, \ldots, z_{i-1}, z, z_{i+1}, \ldots, z_n)$. We define $\delta_{t,S,S^{(i)}} = \|w_{t,S} - w_{t,S^{(i)}}\|$. Recall that $\widetilde{\ell}$ is $\eta$-approximately $H$-smooth ($\eta = 2H_z\alpha$).

At the $t$-th iteration, if $i \ne \xi_t$, which happens w.p. $\frac{n-1}{n}$, we have
$$\delta_{t+1,S,S^{(i)}} = \big\| \mathrm{prox}_{\gamma_t,\lambda\Omega}\big(w_{t,S} - \gamma_t\nabla\widetilde{\ell}(w_{t,S}; z_{\xi_t})\big) - \mathrm{prox}_{\gamma_t,\lambda\Omega}\big(w_{t,S^{(i)}} - \gamma_t\nabla\widetilde{\ell}(w_{t,S^{(i)}}; z_{\xi_t})\big) \big\|$$
$$\le \frac{1}{1+\gamma_t\lambda}\big\| w_{t,S} - \gamma_t\nabla\widetilde{\ell}(w_{t,S}; z_{\xi_t}) - w_{t,S^{(i)}} + \gamma_t\nabla\widetilde{\ell}(w_{t,S^{(i)}}; z_{\xi_t}) \big\| \qquad (\text{Lemma B.2})$$
$$\le \frac{1 + \gamma_t H}{1 + \gamma_t\lambda}\,\delta_{t,S,S^{(i)}} + \frac{\gamma_t\eta}{1 + \gamma_t\lambda} \qquad (\widetilde{\ell} \text{ is } \eta\text{-approximately } H\text{-smooth}).$$
On the other hand, if $i = \xi_t$, which happens w.p. $\frac{1}{n}$,
$$\delta_{t+1,S,S^{(i)}} = \big\| \mathrm{prox}_{\gamma_t,\lambda\Omega}\big(w_{t,S} - \gamma_t\nabla\widetilde{\ell}(w_{t,S}; z_i)\big) - \mathrm{prox}_{\gamma_t,\lambda\Omega}\big(w_{t,S^{(i)}} - \gamma_t\nabla\widetilde{\ell}(w_{t,S^{(i)}}; z)\big) \big\|$$
$$\le \frac{1}{1+\gamma_t\lambda}\,\delta_{t,S,S^{(i)}} + \frac{\gamma_t}{1+\gamma_t\lambda}\Big( \big\|\nabla\widetilde{\ell}(w_{t,S}; z_i)\big\| + \big\|\nabla\widetilde{\ell}(w_{t,S^{(i)}}; z)\big\| \Big) \qquad (\text{Lemma B.2}).$$
Combining the above two cases and taking expectation w.r.t. the randomness of $\xi, S, S^{(i)}$, we have
$$\mathbb{E}\big[\delta_{t+1,S,S^{(i)}}\big] \le \Big( \frac{n-1}{n}\cdot\frac{1+\gamma_t H}{1+\gamma_t\lambda} + \frac{1}{n}\cdot\frac{1}{1+\gamma_t\lambda} \Big)\mathbb{E}\big[\delta_{t,S,S^{(i)}}\big] + \frac{n-1}{n}\cdot\frac{\gamma_t\eta}{1+\gamma_t\lambda} + \frac{2}{n}\cdot\frac{\gamma_t}{1+\gamma_t\lambda}\,\mathbb{E}_{\xi,S}\big\|\nabla\widetilde{\ell}(w_{t,S}; z_i)\big\|$$
$$\le \frac{1+\gamma_t H}{1+\gamma_t\lambda}\,\mathbb{E}\big[\delta_{t,S,S^{(i)}}\big] + \frac{2\gamma_t}{n(1+\gamma_t\lambda)}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(T)} + \frac{(n+3)\gamma_t\eta}{n(1+\gamma_t\lambda)} \qquad (\text{Lemma B.4})$$
$$\le \exp\Big( \frac{\gamma_t(H-\lambda)}{1+\lambda} \Big)\,\mathbb{E}\big[\delta_{t,S,S^{(i)}}\big] + \frac{2\gamma_t}{n}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(t)} + 4\gamma_t\eta \qquad (1 + x \le \exp(x)).$$
Note that the relation $x_{t+1} \le a_t x_t + b_t$ with $x_0 = 0$ unwinds from $T$ to $0$ as $x_T \le \sum_{t=1}^{T} b_t \prod_{k=t+1}^{T} a_k$. Recursively applying the above inequality over $t = 0, 1, \ldots, T-1$, with $\delta_{0,S,S^{(i)}} = 0$ and $\gamma_t = \frac{c}{t+1}$, gives us
$$\mathbb{E}\big[\delta_{T,S,S^{(i)}}\big] \le \sum_{t=0}^{T-1}\Bigg[ \prod_{k=t+1}^{T-1}\exp\Big( \frac{\gamma_k(H-\lambda)}{1+\lambda} \Big) \Bigg]\Bigg( \frac{2\gamma_t}{n}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(t)} + 4\gamma_t\eta \Bigg)$$
$$\le \sum_{t=0}^{T-1}\Big( \frac{t+1}{T} \Big)^{\frac{c(\lambda - H)}{1+\lambda}}\Bigg( \frac{2c\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(t)}}{n(t+1)} + \frac{4c\eta}{t+1} \Bigg) \tag{11}$$
$$\le \frac{2(1+\lambda)}{n(\lambda - H)}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(T)} + \frac{2(1+\lambda)\eta}{\lambda - H}$$
$$\Big( \lambda > H;\ \textstyle\sum_{t=0}^{T-1}(t+1)^{c(\lambda-H)/(1+\lambda) - 1} \le 1 + \int_{1}^{T} t^{c(\lambda-H)/(1+\lambda) - 1}\,dt \Big)$$
$$= \frac{2(1+\lambda)}{n(\lambda - H)}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)} + \frac{4(1+\lambda)H_z\alpha}{\lambda - H} \qquad (\eta = 2H_z\alpha).$$
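The unrolling relation invoked above can be checked numerically; a minimal sketch (the random $a_t, b_t$ here are placeholders standing in for the contraction factors $\exp(\gamma_t(H-\lambda)/(1+\lambda))$ and the per-step perturbation terms, and the indexing is 0-based):

```python
import numpy as np

# Check the unrolling x_T = sum_t b_t * prod_{k=t+1}^{T-1} a_k for the
# recursion x_{t+1} = a_t * x_t + b_t with x_0 = 0 (equality case).
rng = np.random.default_rng(0)
T = 50
a = rng.uniform(0.5, 1.0, size=T)   # contraction factors a_t < 1
b = rng.uniform(0.0, 0.1, size=T)   # perturbation terms b_t

x = 0.0
for t in range(T):
    x = a[t] * x + b[t]

closed = sum(b[t] * np.prod(a[t + 1:]) for t in range(T))
assert np.isclose(x, closed)
print(f"recursion unrolled: x_T = {x:.6f} matches closed form {closed:.6f}")
```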
We can bound $\mathbb{E}_{\xi,S,S^{(i)}}\big[\delta^2_{T,S,S^{(i)}}\big]$ in a similar fashion. Since $\widetilde{\ell}$ is $\eta$-approximately $H$-smooth, we have
$$\mathbb{E}_{\xi,S}\big\|\nabla\widetilde{\ell}(w_{t,S}; z_i)\big\|^2 \le 4H\,\mathbb{E}_{\xi,S}\big[\widetilde{\ell}(w_{t,S}, z_i)\big] + 8\eta^2 \qquad (\text{Equation (8)})$$
$$\le 4H\,\mathbb{E}_{\xi,S}\big[\widehat{\Phi}_{\mathrm{adv}}(w_{t,S})\big] + 8\eta^2 \le 4H\Phi_{\mathrm{adv}}(w_0) + \big( 8\mathbb{E}_S[\nu_S^2] + 2\eta^2 \big)\log(t) + 8\eta^2. \tag{12}$$
At the $t$-th iteration, if $i \ne \xi_t$ (w.p. $\frac{n-1}{n}$), we have, by Lemma B.2 and approximate smoothness,
$$\delta^2_{t+1,S,S^{(i)}} \le \frac{1}{(1+\gamma_t\lambda)^2}\Big( (1 + H\gamma_t)^2\,\delta^2_{t,S,S^{(i)}} + 2\gamma_t\eta(1 + H\gamma_t)\,\delta_{t,S,S^{(i)}} + \gamma_t^2\eta^2 \Big)$$
$$\le \frac{(1+\gamma_t H)(1+\gamma_t H+\gamma_t)}{(1+\gamma_t\lambda)^2}\,\delta^2_{t,S,S^{(i)}} + \frac{\gamma_t(1+\gamma_t H+\gamma_t)}{(1+\gamma_t\lambda)^2}\,\eta^2,$$
where the second step uses $2\gamma_t\eta(1+H\gamma_t)\delta \le \gamma_t(1+H\gamma_t)(\delta^2 + \eta^2)$. On the other hand, if $i = \xi_t$ (w.p. $\frac{1}{n}$),
$$\delta^2_{t+1,S,S^{(i)}} \le \frac{2}{(1+\gamma_t\lambda)^2}\,\delta^2_{t,S,S^{(i)}} + \frac{4\gamma_t^2}{(1+\gamma_t\lambda)^2}\Big( \big\|\nabla\widetilde{\ell}(w_{t,S}; z_i)\big\|^2 + \big\|\nabla\widetilde{\ell}(w_{t,S^{(i)}}; z)\big\|^2 \Big) \qquad (\text{Lemma B.2}).$$
Combining the above two cases and taking expectation w.r.t. the randomness of $\xi, S, S^{(i)}$, we have
$$\mathbb{E}\big[\delta^2_{t+1,S,S^{(i)}}\big] \le \Big( \frac{n-1}{n}\cdot\frac{(1+\gamma_t H)(1+\gamma_t H+\gamma_t)}{(1+\gamma_t\lambda)^2} + \frac{1}{n}\cdot\frac{2}{(1+\gamma_t\lambda)^2} \Big)\mathbb{E}\big[\delta^2_{t,S,S^{(i)}}\big] + \frac{n-1}{n}\cdot\frac{\gamma_t(1+\gamma_t H+\gamma_t)\eta^2}{(1+\gamma_t\lambda)^2} + \frac{8\gamma_t^2}{n(1+\gamma_t\lambda)^2}\,\mathbb{E}_{\xi,S}\big\|\nabla\widetilde{\ell}(w_{t,S}; z_i)\big\|^2$$
$$\le \exp\Big( \frac{\gamma_t(2H - 2\lambda + 2) + \gamma_t^2(H^2 - \lambda^2)}{(1+\lambda)^2} \Big)\,\mathbb{E}\big[\delta^2_{t,S,S^{(i)}}\big] + \frac{8\gamma_t^2}{n}\Big( 4H\Phi_{\mathrm{adv}}(w_0) + \big(8\mathbb{E}_S[\nu_S^2] + 2\eta^2\big)\log(t) \Big) + 2(H+2)^2\gamma_t^2\eta^2 \qquad (\text{Equation (12)}, n \ge 3).$$
Again using the unrolling $x_{t+1} \le a_t x_t + b_t$, $x_0 = 0$, $x_T \le \sum_{t=1}^{T} b_t \prod_{k=t+1}^{T} a_k$, and recursively applying the above inequality over $t = 0, 1, \ldots, T-1$ with $\delta_{0,S,S^{(i)}} = 0$, $\gamma_t = \frac{c}{t+1}$, gives us
$$\mathbb{E}\big[\delta^2_{T,S,S^{(i)}}\big] \le \sum_{t=0}^{T-1}\Big( \frac{t+1}{T} \Big)^{\frac{2c(\lambda - H - 1)}{(1+\lambda)^2}}\exp\Bigg( \sum_{k=t+1}^{T-1}\frac{c^2(H^2 - \lambda^2)}{(k+1)^2(1+\lambda)^2} \Bigg)\Bigg( \frac{8c^2\big( 4H\Phi_{\mathrm{adv}}(w_0) + (8\mathbb{E}_S[\nu_S^2] + 2\eta^2)\log(T) \big)}{n(t+1)^2} + \frac{2c^2(H+2)^2\eta^2}{(t+1)^2} \Bigg)$$
(the exponential factor is at most $1$ since $\lambda > H$)
$$\le \frac{2c^2(1+\lambda)^2}{\big( 2c(\lambda - H - 1) - 1 \big)T}\Bigg( \frac{4\big( 4H\Phi_{\mathrm{adv}}(w_0) + (8\mathbb{E}_S[\nu_S^2] + 2\eta^2)\log(T) \big)}{n} + (H+2)^2\eta^2 \Bigg)$$
(choosing $c = \frac{1}{H}$ to maximize $\frac{c^2}{(2c(\lambda - H - 1) - 1)T}$)
$$\le \frac{2(1+\lambda)^2}{(2\lambda - 3H - 2)HT}\Bigg( \frac{16H\Phi_{\mathrm{adv}}(w_0) + \big( 32\mathbb{E}_S[\nu_S^2] + 8H_z^2\alpha^2 \big)\log(T)}{n} + 4(H+2)^2H_z^2\alpha^2 \Bigg) \qquad (\eta = 2H_z\alpha). \qquad\square$$

Leveraging Lemma 4.1 gives us the robust generalization guarantee.

Theorem 4.2. Suppose Assumptions 1, 3, and 4 hold. Let $w_T$ denote the output on a sample $S \sim \mathcal{D}^n$ of size $n$ after running PSAT for $T$ iterations with $\gamma_t = \frac{c}{t+1}$ for $0 < c < \frac{1}{H}$, and let $\lambda > 2H + 1$. Then, for every $\delta > 0$, w.p. at least $1 - \delta$, we have
$$L_{\mathrm{adv}}(w_T) - \widehat{L}_{\mathrm{adv}}(w_T) \le \frac{2L(1+\lambda)}{n(\lambda - H)}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)} + \frac{4L(1+\lambda)H_z\alpha}{\lambda - H} + \frac{1.5\,M\log(1/\delta)}{n}$$
$$\qquad + \sqrt{ \frac{4L^2(1+\lambda)^2\big( 32H\Phi_{\mathrm{adv}}(w_0) + (64\mathbb{E}_S[\nu_S^2] + 16H_z^2\alpha^2)\log(T) \big)\log(1/\delta)}{(2\lambda - 3H - 2)HTn} + \frac{32(H+2)^2L^2(1+\lambda)^2H_z^2\alpha^2\log(1/\delta)}{(2\lambda - 3H - 2)HT} }.$$

Proof of Theorem 4.2. Applying Lemma 4.1 gives us $\mu_1$ and $\mu_2$ as follows:
$$\sup_{z'}\,\mathbb{E}_{\xi,S,S^{(i)}}\big[ \widetilde{\ell}(w_{T,S}, z') - \widetilde{\ell}(w_{T,S^{(i)}}, z') \big] \le L\,\mathbb{E}_{\xi,S,S^{(i)}}\big\| w_{T,S} - w_{T,S^{(i)}} \big\|$$
$$\le \frac{2L(1+\lambda)}{n(\lambda - H)}\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)} + \frac{4L(1+\lambda)H_z\alpha}{\lambda - H} =: \mu_1,$$
$$\sup_{z'}\,\mathbb{E}_{\xi,S,S^{(i)}}\big[ \big( \widetilde{\ell}(w_{T,S}, z') - \widetilde{\ell}(w_{T,S^{(i)}}, z') \big)^2 \big] \le L^2\,\mathbb{E}_{\xi,S,S^{(i)}}\big\| w_{T,S} - w_{T,S^{(i)}} \big\|^2$$
$$\le \frac{L^2(1+\lambda)^2\big( 32H\Phi_{\mathrm{adv}}(w_0) + (64\mathbb{E}_S[\nu_S^2] + 16H_z^2\alpha^2)\log(T) \big)}{(2\lambda - 3H - 2)HTn} + \frac{8(H+2)^2L^2(1+\lambda)^2H_z^2\alpha^2}{(2\lambda - 3H - 2)HT} =: \mu_2.$$
Applying Theorem 2.1 to the adversarial loss gives us
$$L_{\mathrm{adv}}(w_{T,S}) - \widehat{L}_{\mathrm{adv}}(w_{T,S}) \le \mu_1 + \frac{1.5\,M\log(1/\delta)}{n} + \sqrt{4\mu_2\log(1/\delta)},$$
which is the claimed bound. $\square$

We next discuss the case where $\lambda$ is relatively small. We will use the following lemmas.

Lemma B.5 (Lemma 5 in Kuzborskij & Lampert (2018)). Assume that the loss $\widetilde{\ell}(\cdot, z) \in [0, M]$ is $L$-Lipschitz for all $z$. Then for every $t_0 \in \{1, \ldots, n\}$, we have
$$\mathbb{E}_{S,z,\xi}\big[ \widetilde{\ell}(w_{S,T}, z) - \widetilde{\ell}(w_{S^{(i)},T}, z) \big] \le L\,\mathbb{E}_{S,z}\big[ \mathbb{E}_\xi[\delta_T(S, z)] \,\big|\, \delta_{t_0}(S, z) = 0 \big] + \mathbb{E}_{S,\xi}\big[ L_{\mathrm{adv}}(w_{T,S}) \big]\,\frac{t_0}{n}.$$

Lemma B.6. Let $a, y > 0$ and $0 < b < 1$. Then $x - ax^b - y \le 0$ implies
$$x \le \max\Big\{ 2^{\frac{b}{1-b}}\, a^{\frac{1}{1-b}},\ a(2y)^b \Big\} + y.$$
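Lemma B.6 can be verified numerically by solving the tight case $x = ax^b + y$ and comparing against the stated bound; this is our own sanity check on a small grid of $(a, b, y)$, not part of the paper:

```python
import numpy as np
from itertools import product

def solve_x(a, b, y, iters=200):
    """Fixed point of x = a*x**b + y, found by fixed-point iteration
    (the map is monotone and contracting toward the root for 0 < b < 1)."""
    x = y + a
    for _ in range(iters):
        x = a * x**b + y
    return x

for a, b, y in product([0.5, 2.0, 10.0], [0.25, 0.5, 0.9], [0.1, 1.0, 5.0]):
    x = solve_x(a, b, y)
    bound = max(2**(b / (1 - b)) * a**(1 / (1 - b)), a * (2 * y)**b) + y
    assert x <= bound + 1e-6, (a, b, y, x, bound)
print("Lemma B.6 bound verified on the grid")
```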
Theorem 4.3. Suppose Assumptions 1, 3, and 4 hold. Let $w_T$ denote the output on a sample $S \sim \mathcal{D}^n$ of size $n$ after running PSAT for $T = O(n)$ iterations with $\gamma_t = \frac{c}{t+1}$ for $0 < c < \frac{1}{H}$ and $0 < \lambda < H$. Then we have
$$\mathbb{E}_{\xi,S}\big[ L_{\mathrm{adv}}(w_T) - \widehat{L}_{\mathrm{adv}}(w_T) \big] \le O\Bigg( \frac{L(Q + nH_z\alpha)}{n} + \Big( \mathbb{E}_{\xi,S}\big[\widehat{L}_{\mathrm{adv}}(w_T)\big] \Big)^{\frac{q}{q+1}}\Big( \frac{L(Q + nH_z\alpha)}{n} \Big)^{\frac{1}{q+1}} \Bigg),$$
where $Q = \sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)}$ and $q = 1 - \frac{\lambda}{H}$.

Proof of Theorem 4.3. Given the training set $S \sim \mathcal{D}^n$ and an additional example $z \sim \mathcal{D}$, let $S^{(i)}$ be the training set obtained by replacing the $i$-th example of $S$ with $z$; namely, $S^{(i)} = (z_1, \ldots, z_{i-1}, z, z_{i+1}, \ldots, z_n)$. We define $\delta_{t,S,S^{(i)}} = \|w_{t,S} - w_{t,S^{(i)}}\|$. Let $t_0 \in \{1, \ldots, n\}$ be the iteration such that $\delta_{t_0,S,S^{(i)}} = 0$ and PSAT picks two different samples from $S$ and $S^{(i)}$ at iteration $t_0 + 1$. We follow the proof of Lemma 4.1 up to Equation (11), which gives us
$$\mathbb{E}_{\xi,S,S^{(i)}}\big[\delta_{T,S,S^{(i)}}\big] \le \sum_{t=t_0}^{T-1}\Big( \frac{t+1}{T} \Big)^{\frac{c(\lambda - H)}{1+\lambda}}\Bigg( \frac{2c\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(t)}}{n(t+1)} + \frac{4c\eta}{t+1} \Bigg).$$
As we consider $\lambda < H$, we have
$$\mathbb{E}_{\xi,S,S^{(i)}}\big[ \delta_{T,S,S^{(i)}} \,\big|\, \delta_{t_0,S,S^{(i)}} = 0 \big] \le \frac{4}{H - \lambda}\Big( \frac{T}{t_0} \Big)^{q}\Bigg( \frac{\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(T)}}{n} + 2\eta \Bigg).$$
We know from Lemma B.1 that the adversarial loss $\widetilde{\ell}$ is $L$-Lipschitz. Recall that PSAT samples from the uniform distribution over $[n]$; applying Lemma B.5 therefore gives us
$$\mathbb{E}_{S,z,\xi}\big[ \widetilde{\ell}(w_{S,T}, z) - \widetilde{\ell}(w_{S^{(i)},T}, z) \big] \le \frac{4L}{H - \lambda}\Big( \frac{T}{t_0} \Big)^{q}\Bigg( \frac{\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(T)}}{n} + 2\eta \Bigg) + \mathbb{E}_{S,\xi}\big[ L_{\mathrm{adv}}(w_{T,S}) \big]\,\frac{t_0}{n}.$$
Define
$$t_0 = \Bigg( \frac{4L}{H\,\mathbb{E}_{S,\xi}\big[ L_{\mathrm{adv}}(w_{T,S}) \big]}\Big( \sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(T)} + 2\eta n \Big) \Bigg)^{\frac{1}{q+1}}\, T^{\frac{q}{q+1}},$$
where $q = 1 - \frac{\lambda}{H} \in (0, 1)$. As we consider $T \le n$, we have $t_0 \le n$. Plugging $t_0$ back gives us
$$\mathbb{E}_{\xi,S}\big[ L_{\mathrm{adv}}(w_{T,S}) - \widehat{L}_{\mathrm{adv}}(w_{T,S}) \big] \le \Big( \mathbb{E}_{S,\xi}\big[ L_{\mathrm{adv}}(w_{T,S}) \big] \Big)^{\frac{q}{q+1}}\,\Big( \frac{T}{n} \Big)^{\frac{q}{q+1}}\Bigg[ \frac{4L}{H - \lambda}\Bigg( \frac{\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + \eta^2\big)\log(T)}}{n} + 2\eta \Bigg) \Bigg]^{\frac{1}{q+1}}$$
$$= \Big( \mathbb{E}_{S,\xi}\big[ L_{\mathrm{adv}}(w_{T,S}) \big] \Big)^{\frac{q}{q+1}}\,\Big( \frac{T}{n} \Big)^{\frac{q}{q+1}}\Bigg[ \frac{4L}{H - \lambda}\Bigg( \frac{\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)}}{n} + 4H_z\alpha \Bigg) \Bigg]^{\frac{1}{q+1}} \qquad (\eta = 2H_z\alpha).$$
Applying Lemma B.6 with
$$x = \mathbb{E}_{\xi,S}\big[ L_{\mathrm{adv}}(w_{T,S}) \big], \qquad y = \mathbb{E}_{\xi,S}\big[ \widehat{L}_{\mathrm{adv}}(w_{T,S}) \big], \qquad b = \frac{q}{q+1},$$
$$a = \Big( \frac{T}{n} \Big)^{\frac{q}{q+1}}\Bigg[ \frac{4L}{H - \lambda}\Bigg( \frac{\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)}}{n} + 4H_z\alpha \Bigg) \Bigg]^{\frac{1}{q+1}}$$
gives us
$$\mathbb{E}_{\xi,S}\big[ L_{\mathrm{adv}}(w_{T,S}) - \widehat{L}_{\mathrm{adv}}(w_{T,S}) \big] \le \max\big\{ 2^q a^{q+1},\ a(2y)^{\frac{q}{q+1}} \big\}$$
$$\le 2^q\Big( \frac{T}{n} \Big)^q\,\frac{4L}{H - \lambda}\Bigg( \frac{\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)}}{n} + 4H_z\alpha \Bigg)$$
$$\qquad + \Bigg( \frac{2T\,\mathbb{E}_{\xi,S}\big[\widehat{L}_{\mathrm{adv}}(w_{T,S})\big]}{n} \Bigg)^{\frac{q}{q+1}}\Bigg[ \frac{4L}{H - \lambda}\Bigg( \frac{\sqrt{2H\Phi_{\mathrm{adv}}(w_0) + \big(4\mathbb{E}_S[\nu_S^2] + 4H_z^2\alpha^2\big)\log(T)}}{n} + 4H_z\alpha \Bigg) \Bigg]^{\frac{1}{q+1}},$$
where the last line holds because $\big(1 + \tfrac{1}{q}\big)^{\frac{q}{q+1}}$ and $\big(1 + \tfrac{1}{q}\big)^{\frac{q}{q+1}}\, q^{\frac{1}{q+1}}$ remain bounded for $0 < q < 1$. With $T = O(n)$, this yields the stated bound. $\square$
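The choice of $t_0$ in the proof above is the usual Hardt–Recht–Singer-style balancing of the burn-in term $t_0/n$ against the stability term $(T/t_0)^q$; a schematic numerical illustration (the constants $M$, $C$ and the values of $q$, $T$, $n$ below are hypothetical stand-ins, not quantities from the paper):

```python
import numpy as np

# Schematic reading of the trade-off: g(t0) = M*t0/n + C*(T/t0)**q,
# whose closed-form minimizer scales as t0* ~ T^{q/(q+1)}.
M, C, q, T, n = 1.0, 1.0, 0.5, 10_000, 1_000
g = lambda t0: M * t0 / n + C * (T / t0) ** q

t0_grid = np.arange(1, T)
t0_best = t0_grid[np.argmin(g(t0_grid))]
t0_star = (q * C * n * T**q / M) ** (1 / (q + 1))   # exact balance point
print(f"grid argmin t0 = {t0_best}, closed-form t0* = {t0_star:.1f}")
```

Running this shows the grid minimizer agreeing with the closed-form balance point, whose $T^{q/(q+1)}$ scaling is exactly what produces the $T^{q/(q+1)}$ factor in the definition of $t_0$ in the proof.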