Generalization Bounds for Adversarial Metric Learning

Wen Wen1, Han Li1,*, Hong Chen1,2,3, Rui Wu4, Lingjuan Wu1, Liangxuan Zhu1
1College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
2Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan 430070, China
3Key Laboratory of Smart Farming for Agricultural Animals, Wuhan 430070, China
4Horizon Robotics, Haidian District, Beijing 100190, China
lihan125@mail.hzau.edu.cn
*Corresponding author.

Abstract. Recently, adversarial metric learning has been proposed to enhance the robustness of the learned distance metric against adversarial perturbations. Despite rapid progress in validating its effectiveness empirically, theoretical guarantees on adversarial robustness and generalization are far less understood. To fill this gap, this paper unveils the generalization properties of adversarial metric learning by developing uniform convergence analysis techniques. Based on capacity estimation via covering numbers, we establish the first high-probability generalization bounds of order $O(n^{-1/2})$ for adversarial metric learning with pairwise perturbations and general losses, where $n$ is the number of training samples. Moreover, we obtain refined generalization bounds of order $O(n^{-1})$ for smooth losses by using local Rademacher complexity, which is faster than previous results for adversarial pairwise learning, e.g., adversarial bipartite ranking. Experimental evaluation on real-world datasets validates our theoretical findings.

1 Introduction

The robustness of metric learning against adversarial perturbations has attracted increasing attention in the machine learning literature, where abundant adversarial algorithms have been proposed from various application motivations, e.g., [Huang et al., 2019; Bouniot et al., 2020; Liu et al., 2022]. Although previous adversarial metric learning methods enjoy adversarial robustness [Madry et al., 2018; Kurakin et al., 2018; Carlini and Wagner, 2017] empirically, their generalization guarantees are scarcely touched in theory. In this paper, our goal is to fill this theoretical gap and provide sharper high-probability generalization bounds for adversarial metric learning through the lens of statistical learning theory [Vapnik, 1999; Mohri et al., 2018].

Although the theoretical foundations of metric learning are well understood [Huai et al., 2019; Lei et al., 2020; Ye et al., 2019], there are two challenges in establishing a generalization analysis for the adversarial counterpart. The first is caused by the joint perturbations on sample pairs [Huai et al., 2022], which is more complicated than the single-sample perturbation case [Yin et al., 2019; Xing et al., 2021; Mustafa et al., 2022]. The other arises from the non-smooth and non-differentiable optimization objective associated with the adversarial loss function [Xing et al., 2021; Xiao et al., 2022], which renders standard analysis techniques (e.g., [Cao et al., 2016]) inapplicable. To surmount these challenges, we introduce the $\ell_\infty$-covering number [Reeve and Kaban, 2020; Mustafa et al., 2022] to measure the complexity of the function space with pairwise perturbations, and employ a general loss class to approximate the adversarial loss class on the training samples to tackle the non-smoothness problem.
In addition to providing generalization guarantees for adversarial metric learning, we also validate our theoretical findings through experimental analysis on real-world datasets. In summary, the main contributions of this paper are as follows:

- We establish high-probability generalization bounds of order $O(n^{-1/2})$ for adversarial metric learning with pairwise perturbations, where $n$ is the sample size. Indeed, our high-probability bounds are beneficial for understanding the robustness of optimization algorithms [Bousquet et al., 2020; Klochkov and Zhivotovskiy, 2021; Li and Liu, 2021] and differ from the existing bounds in expectation [Xing et al., 2021; Farnia and Ozdaglar, 2021; Xiao et al., 2022]. The developed learning bounds are valid for general adversarial perturbations measured in the $\ell_r$-norm ($r \ge 1$), and adapt to linear metric learning models and deep metric learning models simultaneously. To the best of our knowledge, these are the first known generalization bounds for metric learning with pairwise perturbations.
- Under the self-bounding Lipschitz assumption [Reeve and Kaban, 2020] on the loss function, we provide a sharper generalization bound of order $O(n^{-1})$ by developing a concentration estimation technique based on local Rademacher complexity [Bartlett et al., 2005]. As a by-product, the resulting generalization bounds for non-adversarial metric learning assure faster rates than the previous analyses in [Huai et al., 2019; Lei et al., 2020].

| Perturbation Object | Task | Reference | Perturbation Way | Analysis Tool | Learning Bound |
|---|---|---|---|---|---|
| Single sample | Classification | Yin et al. (2019) | $\ell_\infty$-norm additive | Rademacher complexity | $O(1/\sqrt{n})$ |
| | | Khim and Loh (2018); Tu et al. (2019) | $\ell_r$-norm additive | Rademacher complexity | $O(1/\sqrt{n})$ |
| | | Mustafa et al. (2022) | $\ell_r$-norm additive | Covering number | $O(1/\sqrt{n})$ |
| | | | Spatial transformation | Local Rademacher complexity | $O(1/n)$ |
| | Optimization | Xing et al. (2021); Xiao et al. (2022) | $\ell_r$-norm additive | Algorithmic stability | $O(1/n)$ |
| | Bipartite ranking | Mo et al. (2022) | $\ell_r$-norm additive | Rademacher complexity | $O(1/\sqrt{n})$ |
| Sample pair | Metric learning | Ours | $\ell_r$-norm additive | Covering number | $O(1/\sqrt{n})$ |
| | | | | Local Rademacher complexity | $O(1/n)$ |

Table 1: Summary of generalization analysis for adversarial learning (the original footnote markers distinguishing optimization bounds from generalization bounds in expectation are omitted here).

2 Related Work

Adversarial Metric Learning. Adversarial metric learning plays a vital role in applications ranging from person re-identification [Dai et al., 2018; Bouniot et al., 2020; Liu et al., 2022] to zero-shot learning [Chen and Deng, 2019; Huang et al., 2019] and cross-modal retrieval [Xu et al., 2019]. Since metric learning methods trained on the original samples are limited in their capacity to distinguish ambiguous samples, adversarial learning methods have been proposed to facilitate robust metric learning (see Appendix A). Although adversarial metric learning has shown empirical effectiveness, the theoretical aspects of its adversarial robustness have not been exhaustively studied.

Adversarial Generalization. The non-smoothness and non-differentiability of the adversarial loss are the central difficulties in the generalization analysis of adversarial learning. To overcome these obstacles, Xing et al. (2021) propose a noise injection method to avoid non-differentiability and establish stability upper and lower bounds for a generic adversarial training algorithm.
Xiao et al. (2022) tackle the non-smoothness problem by considering $\eta$-approximate smoothness of the adversarial loss. Based on this, they derive stability-based generalization bounds for stochastic gradient descent on a general class that covers the adversarial loss. Tu et al. (2019) fit the adversarial learning problem into a minimax framework by introducing a transport map between distributions, and derive a new risk bound for the minimax problem through the lens of covering numbers under a Lipschitz assumption. Mo et al. (2022) extend the prior work of Tu et al. (2019) to pairwise learning and establish generalization bounds for adversarial bipartite ranking. Yin et al. (2019) and Khim and Loh (2018) derive a surrogate upper bound on the adversarial loss, and then prove upper bounds on the adversarial Rademacher complexity of the surrogate. All of the above generalization analyses are limited to adversarial learning with pointwise perturbations. To the best of our knowledge, there is no related generalization analysis for adversarial learning with pairwise perturbations; this paper tries to fill this gap. Table 1 summarizes the related work on adversarial learning.

3 Preliminaries

This section introduces the main notations used in this paper, the necessary background on adversarial metric learning [Wang et al., 2020; Liu et al., 2022; Yang et al., 2021], and the theoretical techniques and structural results used in the generalization analysis.

3.1 Notations

We denote vectors by lowercase letters (e.g., $x$) and matrices by uppercase letters (e.g., $X$). We write $\|w\|_p$ for the $\ell_p$-norm of a vector $w \in \mathbb{R}^n$. The dual norm is denoted by a star (i.e., $\|w\|_{p^*}$). For a matrix $W \in \mathbb{R}^{n \times n}$ with columns $W_i$, $i \in [n]$, the matrix $(p, q)$-norm is defined by $\|W\|_{p,q} = \big\|(\|W_1\|_p, \ldots, \|W_n\|_p)\big\|_q$.

3.2 Adversarial Metric Learning

Let $S = \{(x_i, y_i)\}_{i=1}^n$ be a set of training samples drawn according to an unknown distribution $P$, where $x_i \in \mathbb{R}^d$ is the $d$-dimensional feature vector and $y_i \in \mathbb{R}$ is the class label. The $d \times n$ input feature matrix is denoted by $X = (x_i : i \in [n])$. Given a mapping $f : \mathbb{R}^d \to \mathbb{R}^{d'}$, which maps the $d$-dimensional input into a $d'$-dimensional embedding space, the distance between samples $x_i$ and $x_j$ is measured by

$$D_f(x_i, x_j) := (f(x_i) - f(x_j))^T (f(x_i) - f(x_j)). \quad (1)$$

The target of metric learning is to learn an adequate $f$ that reflects the similarity between sample pairs [Wang et al., 2020; Huai et al., 2022]. The widely adopted method for seeking such an $f$ is to minimize the following empirical risk over the given training samples:

$$E_n(f) = \frac{1}{n(n-1)} \sum_{i \neq j} \ell\big(\tau(y_i, y_j)(1 - D_f(x_i, x_j))\big), \quad (2)$$

where $\ell : \mathbb{R} \to \mathbb{R}_+$ is a given loss function, such as the hinge loss, and $\tau(y_i, y_j) \in \{-1, 1\}$ indicates whether two samples belong to the same class, i.e., $\tau(y_i, y_j) = 1$ if $y_i = y_j$ and $\tau(y_i, y_j) = -1$ otherwise.

However, in the presence of adversaries, imperceptible perturbations of the input samples can maximize the empirical risk (2). Throughout this paper, we assume that the perturbation $\theta$ is adversarially chosen in the $\ell_r$-ball $B(\varepsilon) \subset \mathbb{R}^d$ of radius $\varepsilon$, for an arbitrary $r \ge 1$.
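Before turning to the adversarial setting, the following minimal Python sketch makes the standard objective (2) concrete, using the hinge loss as one admissible choice of $\ell$ and a linear embedding; the toy data and function names are illustrative assumptions, not specifications from the paper.

```python
import numpy as np

def tau(y_i, y_j):
    """Pair label from eq. (2): +1 for the same class, -1 otherwise."""
    return 1.0 if y_i == y_j else -1.0

def hinge(t):
    """Hinge loss, one admissible choice for the generic loss l."""
    return max(0.0, 1.0 - t)

def empirical_risk(X, y, f):
    """Empirical metric-learning risk E_n(f) of eq. (2).

    X: (n, d) feature matrix, y: (n,) labels,
    f: embedding map R^d -> R^{d'} (here a linear map x -> W @ x).
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = f(X[i]) - f(X[j])
            D_f = float(diff @ diff)  # squared distance D_f, eq. (1)
            total += hinge(tau(y[i], y[j]) * (1.0 - D_f))
    return total / (n * (n - 1))

# usage: a random linear embedding on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.integers(0, 2, size=20)
W = 0.1 * rng.normal(size=(3, 5))
print(empirical_risk(X, y, lambda x: W @ x))
```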
Given a sample pair $(x_i, y_i), (x_j, y_j)$ and a learned mapping $f$, the adversary selects valid perturbations $\theta_i^*, \theta_j^*$ by [Huai et al., 2022]

$$\theta_i^*, \theta_j^* = \arg\max_{\theta_i, \theta_j \in B(\varepsilon)} \ell\big(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j))\big),$$

and the adversarial loss $\ell_{adv}((x_i, y_i), (x_j, y_j); f)$ of $f$ at $(x_i, y_i), (x_j, y_j)$ can be written as

$$\max_{\theta_i, \theta_j \in B(\varepsilon)} \ell\big(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j))\big). \quad (3)$$

We then have the adversarial empirical risk

$$\tilde{E}_n(f) = \frac{1}{n(n-1)} \sum_{i \neq j} \max_{\theta_i, \theta_j \in B(\varepsilon)} \ell\big(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j))\big),$$

and the adversarial expected risk

$$\tilde{E}(f) = \mathbb{E}_P\Big[\max_{\theta_i, \theta_j \in B(\varepsilon)} \ell\big(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j))\big)\Big].$$

The adversarial empirical risk $\tilde{E}_n(f)$ measures the ability of $f$ to place similar samples nearby and separate dissimilar samples on the training samples with adversarial perturbations. The adversarial expected risk $\tilde{E}(f)$ measures how well $f$ generalizes to unseen adversarial samples. In this paper, we are interested in the difference between $\tilde{E}_n(f)$ and $\tilde{E}(f)$. Our main tool for bounding the generalization error of adversarial metric learning (i.e., $\tilde{E}(f) - \tilde{E}_n(f)$) is the $\ell_\infty$-covering number defined below.

Definition 1 ($\ell_\infty$-covering number). Let $\upsilon > 0$ and let $(A, \|\cdot\|)$ be a metric space. We say that $C \subseteq A$ is a $(\upsilon, \|\cdot\|)$-covering of $A$ if $\sup_{a \in A} \inf_{c \in C} \|a - c\| \le \upsilon$. The $\ell_\infty$-covering number of $A$, denoted $\mathcal{N}_\infty(\upsilon, A)$, is the minimum cardinality of any subset that covers $A$ at scale $\upsilon$.

As a special case of Zhang (2002), Definition 1 characterizes the complexity of a function space measured in the infinity norm [Reeve and Kaban, 2020; Mustafa et al., 2022]. Let the mapping $f$ be selected from the hypothesis class $\mathcal{F} := \{x \mapsto f_W(x) : x \in \mathbb{R}^d, W \in \mathbb{R}^{d' \times d}\}$. The class of adversarial loss functions (3) is written as

$$\mathcal{L}_{adv} := \{(x_i, y_i), (x_j, y_j) \mapsto \ell_{adv}((x_i, y_i), (x_j, y_j); f) : f \in \mathcal{F}\}. \quad (4)$$

We have the following relationship between the generalization error $\tilde{E}(f) - \tilde{E}_n(f)$ and the $\ell_\infty$-covering number of the adversarial loss class $\mathcal{L}_{adv}$ on the training sample $S$ (i.e., $\mathcal{N}_\infty(\mathcal{L}_{adv}, \upsilon, S)$), which extends the previous results of Bartlett et al. (2017) to adversarial learning.

Lemma 1. Let $\mathcal{L}_{adv}$ be the adversarial loss class defined in (4), bounded by 1. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a sample $S$ of size $n$, the following holds for all $f \in \mathcal{F}$:

$$\tilde{E}(f) - \tilde{E}_n(f) \le 3\sqrt{\frac{\log(2/\delta)}{2n}} + \frac{8}{n^{3/2}} + \frac{24}{\sqrt{n}} \int_{4/n}^{1} \sqrt{\log \mathcal{N}_\infty(\mathcal{L}_{adv}, \upsilon, S)}\, d\upsilon.$$

Lemma 1 allows us to control the generalization error by bounding the $\ell_\infty$-covering number of the adversarial loss class on the training samples. However, deriving an upper bound on $\mathcal{N}_\infty(\mathcal{L}_{adv}, \upsilon, S)$ is intractable due to the outer maximization of the loss functions in the class $\mathcal{L}_{adv}$ and the joint action of the pairwise perturbations $\theta_i, \theta_j$. Our approach is to approximate the loss class $\mathcal{L}_{adv}$ on the sample $S$ by the following class $\tilde{\mathcal{L}}_{adv}$,

$$\tilde{\mathcal{L}}_{adv} := \{((x_i, \theta_i), y_i), ((x_j, \theta_j), y_j) \mapsto \ell\big(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j))\big) : f \in \mathcal{F}\}, \quad (5)$$

which incorporates the perturbations $\theta_i, \theta_j$ into the argument. Based on this, we reduce the problem of measuring the complexity of the adversarial loss class on the training samples to measuring the complexity of a general loss class. Some necessary Lipschitz conditions are introduced for our theoretical analysis.

Definition 2. Let $\|\cdot\|$ denote a norm, and let $\xi, \zeta \ge 0$. For a loss function $\ell : \mathcal{F} \to \mathbb{R}$ and a distance function $D_f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ parametrized by $f \in \mathcal{F}$:
1) the loss function $\ell$ is $\xi$-Lipschitz if, for all $f, f' \in \mathcal{F}$, $|\ell(f) - \ell(f')| \le \xi \|f - f'\|$;
2) the distance function $D_f$ is multi-variate Lipschitz continuous if, for all $\theta_i, \theta_j, \theta_i', \theta_j' \in \mathbb{R}^d$, $|D_f(\theta_i, \theta_j) - D_f(\theta_i', \theta_j')| \le \zeta \|(\theta_i, \theta_j) - (\theta_i', \theta_j')\|$.
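The inner maximization in (3) has no closed form in general; in practice it is approximated by projected gradient ascent. The PyTorch sketch below is a hedged illustration of this approximation for the $\ell_\infty$-ball, using the hinge loss as a stand-in for the generic $\ell$ and the PGD step size $\varepsilon/5$ adopted later in Section 5; the function names and defaults are our own choices, not prescribed by the paper.

```python
import torch

def pairwise_pgd_loss(f, xi, xj, tau_ij, eps, steps=10):
    """Approximate the adversarial loss (3) by projected gradient ascent over
    the pair perturbations (theta_i, theta_j) in the l_inf-ball B(eps)."""
    ti = torch.zeros_like(xi, requires_grad=True)
    tj = torch.zeros_like(xj, requires_grad=True)
    for _ in range(steps):
        diff = f(xi + ti) - f(xj + tj)
        d_f = (diff * diff).sum()                              # D_f, eq. (1)
        loss = torch.clamp(1.0 - tau_ij * (1.0 - d_f), min=0.0)  # hinge
        loss.backward()
        with torch.no_grad():                 # ascent step + box projection
            for t in (ti, tj):
                t += (eps / 5) * t.grad.sign()
                t.clamp_(-eps, eps)
                t.grad.zero_()
    with torch.no_grad():
        diff = f(xi + ti) - f(xj + tj)
        d_f = (diff * diff).sum()
        return torch.clamp(1.0 - tau_ij * (1.0 - d_f), min=0.0)
```

For a general $\ell_r$-ball, the sign step and box projection would be replaced by the corresponding $\ell_r$ steepest-ascent direction and $\ell_r$-ball projection.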
The Lipschitzness of the loss function in Definition 2 is a mild condition satisfied by common losses, e.g., the hinge loss and the logistic loss [Yin et al., 2019; Tu et al., 2019; Lei et al., 2020]. By combining this Lipschitzness with the notion of multi-variate Lipschitz continuity [Zantedeschi et al., 2016], we obtain the Lipschitzness of the functions $\theta_i, \theta_j \mapsto \ell(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j)))$ for $f \in \mathcal{F}$, which is necessary and fulfilled by most attacks [Madry et al., 2018; Awasthi et al., 2021]. We now present our first main result.

Theorem 1. Let $\theta_i, \theta_j \mapsto \ell(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j)))$ be Lipschitz with constant $L$. Let $C_B(3\upsilon/(4L))$ be a $3\upsilon/(4L)$-cover of $B(\varepsilon)$, and define the adversarial sample set $\tilde{S} = \{((x_i, \theta_i), y_i) : i \in [n], \theta_i \in C_B(3\upsilon/(4L))\}$. Then, we have

$$\mathcal{N}_\infty(\mathcal{L}_{adv}, \upsilon, S) \le \mathcal{N}_\infty(\tilde{\mathcal{L}}_{adv}, \upsilon/4, \tilde{S}).$$

Detailed proofs are contained in Appendix C.2. Theorem 1 shows that the $\ell_\infty$-covering number of the adversarial loss class $\mathcal{L}_{adv}$ on the set $S$ can be bounded by the $\ell_\infty$-covering number of the class $\tilde{\mathcal{L}}_{adv}$ on the adversarial set $\tilde{S}$, which extends Lemma 4.4 of Mustafa, Lei and Kloft (2022) from adversarial pointwise learning to adversarial pairwise learning. It will serve to derive the high-probability generalization bounds for adversarial metric learning.

4 The Generalization Bounds for Adversarial Metric Learning

In this section, we provide a sharp characterization of generalization for two commonly used adversarial metric learning models: linear and deep metric learning models. The adversarial perturbation is measured in the $\ell_r$-norm. Moreover, we establish fast generalization bounds for adversarial metric learning through local Rademacher complexity under a smoothness assumption on the loss function.

4.1 Linear Metric Learning Model

We consider the following linear hypothesis class: $\mathcal{F} := \{x_i \mapsto Wx_i : W \in \mathbb{R}^{d' \times d}, \|W\|_{p,1} \le \Lambda\}$. For any linear mapping $f \in \mathcal{F}$, the distance metric function is defined by

$$D_f(x_i, x_j) = (x_i - x_j)^T W^T W (x_i - x_j). \quad (6)$$

The Lipschitz constant of the function $\theta_i, \theta_j \mapsto \ell(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j)))$ required in Theorem 1 is provided in the following lemma.

Lemma 2. Let $D_f$ be the linear distance function defined in (6). Then, for any sample pair $(x_i, y_i)$ and $(x_j, y_j)$, the function $\theta_i, \theta_j \mapsto \ell(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j)))$ is $4\xi\Lambda^2\Psi$-Lipschitz, where $4\Lambda^2\Psi$ is the multi-variate Lipschitz constant of the function $\theta_i, \theta_j \mapsto D_f(x_i + \theta_i, x_j + \theta_j)$ and $\Psi = \max(1, d^{1-\frac{1}{r}})(\|X\|_{r,\infty} + \varepsilon)$.

The proof is provided in Appendix D.1. Based on the Lipschitzness of the adversarial loss function in Lemma 2, we obtain an upper bound on the covering number of the loss class $\tilde{\mathcal{L}}_{adv}$ on the sample $\tilde{S}$ in the theorem below.

Theorem 2. With the notation of Lemma 2, let $\tilde{\mathcal{L}}_{adv}$ be defined in (5) and $\tilde{S}$ be defined in Theorem 1. Then, for $\upsilon > 0$, we have

$$\log \mathcal{N}_\infty(\tilde{\mathcal{L}}_{adv}, \upsilon/4, \tilde{S}) \le \frac{C\,\xi^2\Lambda^4\hat{\Psi}^2}{\upsilon^2}\, L_{\log}, \qquad L_{\log} = \log\Big(4\Big\lceil \frac{32\xi\Lambda^2\hat{\Psi}}{\upsilon} + 1 \Big\rceil\, n\, \Big\lceil \frac{16\xi\Lambda^2\varepsilon\Psi}{\upsilon} \Big\rceil^{d}\Big),$$

where $\Psi = \max(1, d^{1-\frac{1}{r}})(\|X\|_{r,\infty} + \varepsilon)$, $\hat{\Psi} = \max(1, d^{1-\frac{1}{r}})(\|X\|_{r,\infty} + \varepsilon)^2$, and $C$ is a constant.

The detailed proof is contained in Appendix D.1. Based on Theorem 2 and Lemma 1, we establish the following high-probability generalization bound.

Theorem 3. With the notation above, for any fixed $\xi > 0$ and all $f \in \mathcal{F}$, with probability at least $1 - \delta$, we have

$$\tilde{E}(f) - \tilde{E}_n(f) \le 3\sqrt{\frac{\log(2/\delta)}{2n}} + \frac{8}{n^{3/2}} + \frac{C\,\xi\Lambda^2\hat{\Psi}\log(n)}{\sqrt{n}}\sqrt{\log\Big(4\big\lceil 32n\xi\Lambda^2\hat{\Psi} + 1 \big\rceil\, n\, \big\lceil 16n\xi\Lambda^2\varepsilon\Psi \big\rceil^{d} + 1\Big)},$$

where $\hat{\Psi} = \max(1, d^{1-\frac{1}{r}})(\|X\|_{r,\infty} + \varepsilon)^2$ and $C$ is a constant.

The proof of this theorem is provided in Appendix D.2.
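Before the remarks, a quick numerical illustration of the quantities in this subsection: the hedged sketch below evaluates the linear metric (6), the $(p,1)$-norm constraint defining $\mathcal{F}$, and the norm-mismatch factor $\max(1, d^{1-1/r})$ entering $\Psi$; all names here are chosen for illustration.

```python
import numpy as np

def linear_metric(W, xi, xj):
    """D_f(x_i, x_j) of eq. (6): ||W (x_i - x_j)||_2^2."""
    v = W @ (xi - xj)
    return float(v @ v)

def matrix_p1_norm(W, p):
    """||W||_{p,1} (Section 3.1): l_1 norm of the column-wise l_p norms,
    used in the class F = {x -> W x : ||W||_{p,1} <= Lambda}."""
    return float(np.sum(np.linalg.norm(W, ord=p, axis=0)))

def norm_mismatch_factor(d, r):
    """max(1, d^{1 - 1/r}): the dimension factor entering Psi in Lemma 2."""
    return max(1.0, d ** (1.0 - 1.0 / r))

# e.g., l_2 perturbations on MNIST-sized inputs (d = 784) pay sqrt(784) = 28
print(norm_mismatch_factor(784, 2))
```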
Remark 1. The generalization bound in Theorem 3 suffers from additional dimension-dependent terms compared to its non-adversarial counterpart. The first, the $d^{1-\frac{1}{r}}$ dependence in $\hat{\Psi}$, is due to the mismatch between the norm on the input $x$ and the norm of the ball $B(\varepsilon)$. Indeed, we have used the inequality $\|x_i + \theta_i\|_{p^*} \le \max(1, d^{1-\frac{1}{r}})\|x_i + \theta_i\|_r$ from Awasthi et al. (2021), where $1 - 1/p = 1/p^*$. If $\frac{1}{r} \ge \frac{1}{p^*}$, $\hat{\Psi}$ is dimension independent, which implies that one should choose a $p$-norm regularizer on $W$ (i.e., the weight matrix) with $p \in [1, r^*]$. The second, the $\sqrt{d}$ dependence in the square root of the third term on the right-hand side of Theorem 3, is attributed to the complexity of the perturbation ball $B(\varepsilon)$. For example, if $B(\varepsilon)$ is contained in a low-dimensional space of dimension $\bar{d} < d$, the dependence is reduced to $O(\sqrt{\bar{d}})$. This motivates the mapping $f$ to project the input $x \in \mathbb{R}^d$ into a low-dimensional subspace to reduce the effective dimensionality of adversarial perturbations.

Remark 2. Theorem 3 is a high-probability generalization bound for adversarial metric learning in the linear case, motivated by recent analyses in adversarial pointwise learning (Awasthi et al., 2021; Mustafa, Lei and Kloft, 2022). In contrast with the prior work of Mustafa et al. (2022), which studies $\ell_\infty$-norm perturbations, we consider the general case where perturbations are measured in the $\ell_r$-norm. Moreover, our theoretical analysis is novel in that it is the first to address adversarial pairwise learning with pairwise perturbations, unlike existing work [Yin et al., 2019; Mo et al., 2022; Mustafa et al., 2022].

Remark 3. Setting $\varepsilon = 0$, we obtain a standard risk bound for linear metric learning in the non-adversarial case; see (7). Although the bound (7), of order $O(1/\sqrt{n})$, is similar to the generalization bounds in Cao et al. (2016), Ye et al. (2019), and Lei et al. (2020), our result applies to a wider range of loss functions, such as the hinge loss and the logistic loss:

$$E(f) - E_n(f) \le 3\sqrt{\frac{\log(2/\delta)}{2n}} + \frac{8}{n^{3/2}} + \frac{C\,\xi\Lambda^2\|X\|^2_{p^*,\infty}}{\sqrt{n}}\sqrt{\log\Big(4\big\lceil 32n\xi\Lambda^2\|X\|^2_{p^*,\infty} + 1 \big\rceil\, n + 1\Big)}\,\log(n). \quad (7)$$

4.2 Deep Metric Learning Model

Let the mapping $f$ be an $L$-layer neural network parametrized by the weights $\mathcal{W} = \{W^l \in \mathbb{R}^{h_l \times h_{l-1}}\}_{l=1}^L$, where $h_l$ is the number of neurons in the $l$-th layer of the network and $h_0 = d$. Given an input sample $x_i \in \mathbb{R}^d$, the output of the final layer of the network can be written as

$$f(x_i) = W^L \rho(W^{L-1} \rho(\cdots \rho(W^1 x_i))), \quad (8)$$

where $\rho(\cdot)$ denotes a non-linear 1-Lipschitz activation function. We consider norm-bounded networks with the following hypothesis class: $\mathcal{F} := \{x_i \mapsto f(x_i) : \|W^l\|_F \le b_l, \|W^l\|_\sigma \le s_l\}$, where $\|\cdot\|_\sigma$ denotes the spectral norm and $\|\cdot\|_F$ the Frobenius norm.

As in the linear case, we first establish the Lipschitzness of the function $\theta_i, \theta_j \mapsto \ell(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j)))$. The result is summarized in the following lemma.

Lemma 3. Let $f \in \mathcal{F}$ be the neural network defined in (8) and $D_f$ be the distance metric function defined in (1). For all $(x_i, y_i), (x_j, y_j)$ and $f \in \mathcal{F}$, the function $\ell(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j)))$ is $4\xi \prod_{l=1}^L s_l^2\, \Psi$-Lipschitz in $\theta_i, \theta_j$, where $4\prod_{l=1}^L s_l^2\, \Psi$ is the multi-variate Lipschitz constant of the function $\theta_i, \theta_j \mapsto D_f(x_i + \theta_i, x_j + \theta_j)$ and $\Psi = \max(1, d^{1-\frac{1}{r}})(\|X\|_{r,\infty} + \varepsilon)$.

The proof is given in Appendix E. With this lemma, we establish an upper bound on the $\ell_\infty$-covering number of the class $\tilde{\mathcal{L}}_{adv}$ in the nonlinear case (Theorem 4 below).
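Before stating Theorem 4, a hedged sketch of the quantities appearing in Lemma 3: the code below implements the network (8) with ReLU as $\rho$ and computes the spectral-norm product that controls its Lipschitz constant; the random weights and layer sizes are purely illustrative.

```python
import numpy as np

def forward(weights, x):
    """The L-layer network of eq. (8): f(x) = W^L rho(... rho(W^1 x)),
    with the 1-Lipschitz ReLU as rho."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)  # rho = ReLU
    return weights[-1] @ h

def spectral_product(weights):
    """prod_l ||W^l||_sigma: an upper bound on the Lipschitz constant of f;
    its square, prod_l s_l^2, enters the Lipschitz constant in Lemma 3."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

rng = np.random.default_rng(0)
dims = [784, 512, 256, 128, 64, 32]  # h_0 = d, ..., h_L = d'
weights = [rng.normal(scale=0.05, size=(dims[l + 1], dims[l]))
           for l in range(len(dims) - 1)]
print(forward(weights, rng.normal(size=784)).shape, spectral_product(weights))
```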
Theorem 4. With the notation of Lemma 3, let $\tilde{\mathcal{L}}_{adv}$ be defined in (5) and $\tilde{S}$ be defined in Theorem 1. Then, for $\upsilon > 0$, we have

$$\log \mathcal{N}_\infty(\tilde{\mathcal{L}}_{adv}, \upsilon/4, \tilde{S}) \le \frac{C\,\xi^2\hat{\Psi}^2 L^4 \sum_{l=1}^L b_l^2/s_l^2}{\upsilon^2}\, L_{\log}, \qquad L_{\log} = \log\Big(\Big\lceil \frac{C_1\xi\hat{\Psi}\Gamma^2}{\upsilon} + C_2 \Big\rceil\, n\hat{h}\, \Big\lceil \frac{16\xi\prod_{l=1}^L s_l^2\,\varepsilon\Psi}{\upsilon} \Big\rceil^{d}\Big),$$

where $\Psi = \max(1, d^{1-\frac{1}{r}})(\|X\|_{r,\infty} + \varepsilon)$, $\hat{\Psi} = \max(1, d^{1-\frac{1}{r}})(\|X\|_{r,\infty} + \varepsilon)^2$, $\Gamma = \max_{l \in [L]} (\prod_{i=1}^L s_i)\, b_l/s_l$, $\hat{h} = \max_{l \in [L]} h_l$, and $C, C_1, C_2$ are universal constants.

The proof is given in Appendix E. By combining Theorem 4 with Lemma 1, we obtain the following sharp bound with high probability.

Theorem 5. With the notation of Theorem 4, for any fixed $\xi > 0$ and all $f \in \mathcal{F}$, with probability at least $1 - \delta$, we have

$$\tilde{E}(f) - \tilde{E}_n(f) \le 3\sqrt{\frac{\log(2/\delta)}{2n}} + \frac{8}{n^{3/2}} + \frac{C\,\xi\hat{\Psi}L^2\sqrt{\sum_{l=1}^L b_l^2/s_l^2}\,\log(n)}{\sqrt{n}}\,\tilde{L}_{\log},$$

where $\tilde{L}_{\log}$ is defined by

$$\tilde{L}_{\log} = \sqrt{\log\Big(\big\lceil C_1 n\xi\hat{\Psi}\Gamma^2 + C_2 \big\rceil\, n\hat{h}\, \big\lceil 16n\xi\textstyle\prod_{l=1}^L s_l^2\,\varepsilon\Psi \big\rceil^{d} + 1\Big)}.$$

Remark 4. Similar to the linear case, the bound in Theorem 5 has $\max(1, d^{1-\frac{1}{r}})$ and $\sqrt{d}$ dependencies. The first is in $\hat{\Psi}$; it arises from the mismatch of norms and can be avoided by simply picking the appropriate norm regularization ($\ell_p$) on the weight matrices $\mathcal{W}$, as discussed above. The second, the $\sqrt{d}$ dependence, is in $\tilde{L}_{\log}$. As discussed in the linear case, a projection onto a low-dimensional representation space can help alleviate this dependence, which is incurred by the complexity of the adversarial perturbation ball $B(\varepsilon)$.

Remark 5. Theorem 5 provides generalization guarantees for adversarial metric learning in the nonlinear case. The bounds in Yin et al. (2019) and Awasthi et al. (2021) apply only to one-hidden-layer neural networks; in contrast, our bound applies to multi-layer networks. While the bounds in Khim and Loh (2018) and Mustafa, Lei and Kloft (2022) apply to multi-layer networks, they only cover pointwise learning and the single-sample perturbation case.

Remark 6. Similar to the linear case, Theorem 5 recovers the non-adversarial generalization bound (9) by setting $\varepsilon = 0$. The bound (9) is of order $O(\sqrt{d}\,\log(\hat{h})/\sqrt{n})$, where $\hat{h}$ is the width of the hidden layers. The generalization bound in Huai et al. (2019) grows as $O(\sqrt{\hat{h}})$, while ours grows as $O(\log(\hat{h}))$:

$$E(f) - E_n(f) \le 3\sqrt{\frac{\log(2/\delta)}{2n}} + \frac{8}{n^{3/2}} + \frac{C\,\xi\|X\|^2_{p^*,\infty}L^2}{\sqrt{n}}\sqrt{\log\Big(\big\lceil C_1 n\xi\Gamma^2\|X\|^2_{p^*,\infty} + C_2 \big\rceil\, n\hat{h} + 1\Big)}\,\log(n). \quad (9)$$

4.3 Optimistic Bounds

Optimistic bounds have been studied in [Srebro et al., 2010; Reeve and Kaban, 2020], where they yield fast-rate generalization bounds for smooth losses under low-noise conditions. We aim to extend these approaches to adversarial metric learning. Our results are based on the local Rademacher complexity [Bartlett et al., 2005] with respect to sample pairs [Cao et al., 2016].

Definition 3 (Local Rademacher complexity). Let $H : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a hypothesis class. Given a sample $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of size $n$, the local Rademacher complexity is the worst-case Rademacher complexity of $H$ over samples $S$ of cardinality at most $n$, that is, $R_n(H) := \sup_{|S| \le n} R_S(H)$, where $R_S(H)$ is the empirical Rademacher complexity with respect to sample pairs, defined by

$$R_S(H) = \frac{1}{\lfloor n/2 \rfloor}\, \mathbb{E}_\sigma\Big[\sup_{h \in H} \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i\, h(x_i, x_{\lfloor n/2 \rfloor + i})\Big],$$

where $\sigma_1, \ldots, \sigma_{\lfloor n/2 \rfloor}$ are i.i.d. Rademacher random variables with $P\{\sigma_i = 1\} = P\{\sigma_i = -1\} = \frac{1}{2}$.

Let $\tilde{\mathcal{D}} := \{((x_i, \theta_i), y_i), ((x_j, \theta_j), y_j) \mapsto D_f(x_i + \theta_i, x_j + \theta_j) : f \in \mathcal{F}\}$. The local class $\tilde{\mathcal{D}}|_\gamma = \{((x_i, \theta_i), y_i), ((x_j, \theta_j), y_j) \mapsto D_f(x_i + \theta_i, x_j + \theta_j) : f \in \mathcal{F}, \tilde{E}_n(f) \le \gamma\} \subseteq \tilde{\mathcal{D}}$ is defined as the set of functions $D_f \in \tilde{\mathcal{D}}$ with adversarial empirical error at most $\gamma$. Similarly, the local adversarial loss class is defined as $\mathcal{L}_{adv}|_\gamma := \{(x_i, y_i), (x_j, y_j) \mapsto \max_{\theta_i, \theta_j \in B(\varepsilon)} \ell(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j))) : f \in \mathcal{F}, \tilde{E}_n(f) \le \gamma\}$. We introduce the self-bounding Lipschitz condition [Reeve and Kaban, 2020] for the loss function, which assumes that the loss function is smooth.
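For intuition, Definition 3 can be made tangible for a finite hypothesis class by replacing the expectation over $\sigma$ with a Monte Carlo average; the sketch below is only an illustrative estimator under that finiteness assumption, with all names chosen here for illustration.

```python
import numpy as np

def empirical_rademacher(pair_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_S(H) from Definition 3.

    pair_values: (m, k) array; each row is one hypothesis h in a finite
    class H, and the k = floor(n/2) columns hold h(x_i, x_{n/2 + i}).
    """
    rng = np.random.default_rng(seed)
    m, k = pair_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=k)  # i.i.d. Rademacher signs
        total += np.max(pair_values @ sigma)     # sup over the finite class
    return total / (n_draws * k)

# usage: 50 hypotheses evaluated on 100 sample pairs
rng = np.random.default_rng(1)
print(empirical_rademacher(rng.uniform(0, 1, size=(50, 100))))
```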
Srebro et al. (2010) show that such a smoothness condition gives rise to an optimistic bound with a fast rate $O(n^{-1})$ in the realizable case. Under the smoothness assumption, we derive an upper bound on the local Rademacher complexity of the adversarial loss class, which serves as a key step in developing fast-rate bounds.

Lemma 4. Let $\mathcal{L}_{adv}|_\gamma$ and $\tilde{\mathcal{D}}$ be defined as above. Suppose that for any $f \in \mathcal{F}$, $\|f\| \le B$, and that the loss $\ell$ is $(\lambda, \eta)$-self-bounding Lipschitz and bounded by $b$. Further, let $\theta_i, \theta_j \mapsto \ell(\tau(y_i, y_j)(1 - D_f(x_i + \theta_i, x_j + \theta_j)))$ be Lipschitz with constant $L$, and let $\tilde{S} = \{((x_i, \theta_i), y_i) : i \in [n], \theta_i \in C_B\big(\frac{3\upsilon}{8\lambda(2\gamma)^\eta L}\big)\}$. Suppose further that $n \mapsto n R_{|\tilde{S}|}(\tilde{\mathcal{D}})$ is non-decreasing. Then, we have

$$R_n(\mathcal{L}_{adv}|_\gamma) \le \lambda(2\gamma)^\eta\, R_{|\tilde{S}|}(\tilde{\mathcal{D}})\,\Omega\,\sqrt{|\tilde{S}|/n},$$

where $\Omega$ grows at the order of $O\Big(\log^{3/2}\Big(\frac{|\tilde{S}|}{R_{|\tilde{S}|}(\tilde{\mathcal{D}})}\Big)\log^{3/2}\Big(\frac{B^2|\tilde{S}|\lambda}{b^{1-\eta}}\Big)\Big)$.

The detailed proof is provided in Appendix F.1. Based on Lemma 4 and the sub-root property in [Bartlett et al., 2005], fast generalization bounds for adversarial metric learning with smooth losses are given in the following theorem.

Theorem 6. With the notation and assumptions of Lemma 4, for all $f \in \mathcal{F}$, with probability at least $1 - \delta$, we have

$$\tilde{E}(f) \le \tilde{E}_n(f) + 106\lambda^2 R^2_{|\tilde{S}|}(\tilde{\mathcal{D}})\,\Omega^2\,|\tilde{S}|/n + \frac{48b}{n}\big(\log(1/\delta) + \log(\log(n))\big) + \sqrt{\tilde{E}_n(f)\Big(8\lambda^2 R^2_{|\tilde{S}|}(\tilde{\mathcal{D}})\,\Omega^2\,|\tilde{S}|/n + K\Big)},$$

where $K = \frac{4b}{n}\big(\log(1/\delta) + \log(\log(n))\big)$.

Remark 7. The convergence rate of the generalization bound in Theorem 6 is governed by $R^2_{|\tilde{S}|}(\tilde{\mathcal{D}})$. For the majority of function classes (e.g., linear models [Yin et al., 2019]), the Rademacher complexity is at least of order $O(n^{-1/2})$. The second term would then grow as $O(n^{-1})$, while the fourth term would grow at the usual $O(n^{-1/2})$ rate. However, if $\tilde{E}_n(f) = 0$, the fourth term vanishes, thus achieving a fast rate of convergence of at least $O(n^{-1})$.

Remark 8. We also establish a fast-rate high-probability generalization bound for metric learning in the non-adversarial case:

$$E(f) \le E_n(f) + 106\lambda^2 R_n^2(\mathcal{D})\hat{\Omega}^2 + \frac{48b}{n}\big(\log(1/\delta) + \log(\log(n))\big) + \sqrt{E_n(f)\big(8\lambda^2 R_n^2(\mathcal{D})\hat{\Omega}^2 + K\big)},$$

where $\mathcal{D}$ is the class of distance functions and $\hat{\Omega}$ grows at the order $O\big(\log^{3/2}\big(\frac{n}{R_n(\mathcal{D})}\big)\log^{3/2}\big(\frac{B^2 n\lambda}{b^{1-\eta}}\big)\big)$. This extends the previous optimistic results [Srebro et al., 2010] for pointwise learning to pairwise learning. In contrast with the bounds of order $O(n^{-1/2})$ [Cao et al., 2016; Huai et al., 2019; Lei et al., 2020], this is an improved result.

| Dataset | Size (n) | Dimension (d) |
|---|---|---|
| Wine | 178 | 13 |
| Spambase | 4601 | 58 |
| MNIST | 70000 | 784 |
| CIFAR-10 | 60000 | 3072 |

Table 2: Details of the adopted datasets.

5 Experiments

5.1 Experimental Setup

Datasets. We adopt the following real-world datasets for experiments: Wine1, Spambase2, MNIST3, and CIFAR-104. Table 2 provides their dimension (d) and size (n). Note that the input to adversarial metric learning models is a set of sample pairs rather than single samples. From each original dataset of single samples, we form the $n(n-1)$ sample pairs by one-by-one matching and then randomly select $n$ pairs to construct the new dataset. A pair composed of samples from the same category is assigned label 1, and the others are assigned label 0. We randomly split the new dataset into training, validation, and test sets with a ratio of 6:2:2, where the validation set is used for early stopping to prevent overfitting of the model.
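The following is a hedged sketch of the pair-construction protocol just described (one-by-one matching, binary pair labels, random subsampling, and a 6:2:2 split); the helper names and the ordered-pair convention are our own choices.

```python
import numpy as np

def build_pair_dataset(X, y, n_pairs, seed=0):
    """Pair construction of Section 5.1: enumerate ordered pairs (i, j) with
    i != j, label a pair 1 if its two samples share a class and 0 otherwise,
    then keep n_pairs of them at random. (For MNIST-sized n, one would sample
    index pairs directly instead of enumerating all n(n-1) of them.)"""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = [(i, j) for i in range(n) for j in range(n) if i != j]
    picks = rng.choice(len(idx), size=n_pairs, replace=False)
    pairs = np.array([idx[k] for k in picks])
    labels = (y[pairs[:, 0]] == y[pairs[:, 1]]).astype(int)
    return X[pairs[:, 0]], X[pairs[:, 1]], labels

def split_622(n_pairs, seed=0):
    """Random 6:2:2 train/validation/test split of pair indices; the
    validation indices support the early stopping described above."""
    perm = np.random.default_rng(seed).permutation(n_pairs)
    a, b = int(0.6 * n_pairs), int(0.8 * n_pairs)
    return perm[:a], perm[a:b], perm[b:]
```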
Model and Attack Settings. We use one-layer neural networks without nonlinear activation as the linear model, and denote the number of units in the output layer by $d'$. We utilize five-layer feed-forward neural networks with ReLU activation [Hahnloser et al., 2000] as the nonlinear model, where the numbers of units in the layers are $(512, 256, 128, 64, d')$. All models are trained with the Adam optimizer. The learning rates of the linear model and the nonlinear model are set to $10^{-2}$ and $10^{-3}$, respectively. We apply $\ell_\infty$ PGD attack [Madry et al., 2018] adversarial training to minimize the following objective function:

$$\frac{1}{n(n-1)}\sum_{i \neq j} \max_{\theta_i, \theta_j \in B(\varepsilon)} \ell\big(\tau(y_i, y_j)(1 - D_{f_W}(x_i + \theta_i, x_j + \theta_j))\big) + \lambda\,\|W^1\|_1,$$

where $\ell(\cdot)$ is the cross-entropy loss, $f_W$ is the mapping function parameterized by $W = \{W^l \in \mathbb{R}^{h_l \times h_{l-1}}\}_{l=1}^L$ with $h_l$ the number of neurons in the $l$-th layer of the network (in particular, $h_0 = d$ and $h_L = d'$), and $\lambda \ge 0$ is the regularization parameter. We then run the PGD attack to check the generalization error. Similar to Yin et al. (2019), the generalization error is approximated by

$$|\text{adversarial train accuracy} - \text{adversarial test accuracy}|. \quad (10)$$

During the training and test phases, the adversarial samples are generated by the PGD algorithm with step size $\varepsilon/5$, where $\varepsilon$ is the maximum magnitude of the allowed perturbations and varies in $\{0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4\}$. Each experiment is independently repeated 10 times, and the average generalization error with standard deviation of the adversarial metric learning models is reported.

1https://archive.ics.uci.edu/ml/datasets/wine/
2https://archive.ics.uci.edu/ml/datasets/spambase/
3http://yann.lecun.com/exdb/mnist/
4https://www.kaggle.com/competitions/cifar-10/data

Figure 1: The generalization error (10) (mean and standard deviation over 10 runs) for different output dimensions ($d'$) of adversarial metric learning models, plotted against the perturbation bound $\varepsilon$. Subfigures (a) and (b) present results for linear models; subfigures (c) and (d) present results for nonlinear models on the adopted datasets (panel (b): Spambase; panel (d): CIFAR-10).

5.2 Experiment Results

Theorems 3 and 5 suggest that projecting the input features to a low-dimensional output and applying appropriate regularization to the model weights can reduce the generalization error of adversarial metric learning models. Here, we conduct linear and nonlinear experiments to validate these theoretical findings.

The Effect of the Output Dimension. To investigate the effect of the output dimension on generalization performance, we train models with different output dimensions. In the linear case, we consider three settings where the output dimension $d'$ is set to $d$, $d/2$, and $d/3$, respectively. In the nonlinear case, the output dimension of the final layer of the neural networks is set to 512, 128, and 32, respectively. Figure 1 plots the generalization errors (10) of linear and nonlinear models on the adopted datasets. As we can see, the fewer the output features, the smaller the generalization error, which suggests that projecting the input into a low-dimensional feature space can reduce the generalization gap of adversarial metric learning models.
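The generalization errors reported in Figures 1 and 2 are computed via (10); a hedged sketch follows. The thresholding rule $D_f < 1$ for declaring a pair "same class" is our assumption (chosen to match the margin in the loss $\tau(1 - D_f)$), and `attack` stands for any routine returning pair perturbations, such as an adaptation of the PGD sketch in Section 3.

```python
import torch

def adv_accuracy(model, pairs, labels, attack, eps):
    """Accuracy on adversarially perturbed pairs. A pair is predicted
    'same class' when D_f < 1; this threshold is our assumption."""
    correct = 0
    for (xi, xj), lab in zip(pairs, labels):
        ti, tj = attack(model, xi, xj, lab, eps)  # pair perturbations
        with torch.no_grad():
            diff = model(xi + ti) - model(xj + tj)
            pred = int((diff * diff).sum().item() < 1.0)
        correct += int(pred == lab)
    return correct / len(labels)

def generalization_gap(model, train, test, attack, eps):
    """Eq. (10): |adversarial train accuracy - adversarial test accuracy|."""
    return abs(adv_accuracy(model, *train, attack, eps)
               - adv_accuracy(model, *test, attack, eps))
```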
The Effect of Regularization. We evaluate the effect of weight regularization on the generalization of adversarial metric learning models by comparing the performance of models with and without regularization. We employ linear and nonlinear models with output dimensions $d'$ and 32, respectively, and apply L1 regularization to $W^1$ (i.e., the weights of the first layer of the models). The regularization parameter $\lambda$ is set to 0, 0.01, and 0.02, where $\lambda = 0$ indicates a model trained without regularization. The generalization errors (10) of the linear and nonlinear models on the adopted datasets are presented in Figure 2. We can see that the generalization gap of a model with regularization is smaller than that of a model without regularization; we therefore conclude that applying L1-norm regularization to adversarial metric learning models helps reduce the generalization error.

Figure 2: The generalization error (10) (mean and standard deviation over 10 runs) for adversarial metric learning models with ($\lambda \neq 0$) and without ($\lambda = 0$) L1 regularization, where $\lambda$ denotes the regularization parameter and $\varepsilon$ the perturbation bound. Subfigures (a) and (b) present results for linear models; subfigures (c) and (d) present results for nonlinear models on the adopted datasets (panel (b): Spambase; panel (d): CIFAR-10).

6 Conclusions

This paper presents a detailed study of the generalization properties of adversarial metric learning under $\ell_r$ adversarial perturbations. We derive high-probability generalization bounds for adversarial metric learning with pairwise perturbations by developing uniform convergence analysis techniques. Our results apply to both linear and deep metric learning models, as well as to various loss functions. To our knowledge, this is the first generalization analysis for adversarial pairwise learning with pairwise perturbations. We further extend our analysis to the case of smooth losses and establish a fast generalization bound at rate $O(n^{-1})$ via local Rademacher complexity. In future work, we will investigate the generalization properties of models under non-additive attacks.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 12071166, the Fundamental Research Funds for the Central Universities of China under Grant Nos. 2662021JC008 and 2662022XXYJ005, and the HZAU-AGIS Cooperation Fund No. SZYJY2023010.

References

[Awasthi et al., 2021] Pranjal Awasthi, George Yu, Chun-Sung Ferng, Andrew Tomkins, and Da-Cheng Juan. Adversarial robustness across representation spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7608-7616, 2021.

[Bartlett et al., 2005] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497-1537, 2005.

[Bartlett et al., 2017] Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240-6249, 2017.

[Bouniot et al., 2020] Quentin Bouniot, Romaric Audigier, and Angelique Loesch. Vulnerability of person re-identification models to metric adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 794-795, 2020.
[Bousquet et al., 2020] Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pages 610-626, 2020.

[Cao et al., 2016] Qiong Cao, Zheng-Chu Guo, and Yiming Ying. Generalization bounds for metric and similarity learning. Machine Learning, 102(1):115-132, 2016.

[Carlini and Wagner, 2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), pages 39-57, 2017.

[Chen and Deng, 2019] Binghui Chen and Weihong Deng. Energy confused adversarial metric learning for zero-shot image retrieval and clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8134-8141, 2019.

[Dai et al., 2018] Pingyang Dai, Rongrong Ji, Haibin Wang, Qiong Wu, and Yuyu Huang. Cross-modality person re-identification with generative adversarial training. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 1, page 6, 2018.

[Farnia and Ozdaglar, 2021] Farzan Farnia and Asuman Ozdaglar. Train simultaneously, generalize better: Stability of gradient-based minimax learners. In International Conference on Machine Learning, pages 3174-3185. PMLR, 2021.

[Hahnloser et al., 2000] Richard H. R. Hahnloser, Rahul Sarpeshkar, Misha A. Mahowald, Rodney J. Douglas, and H. Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947-951, 2000.

[Huai et al., 2019] Mengdi Huai, Hongfei Xue, Chenglin Miao, Liuyi Yao, Lu Su, Changyou Chen, and Aidong Zhang. Deep metric learning: The generalization analysis and an adaptive algorithm. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 2535-2541, 2019.

[Huai et al., 2022] Mengdi Huai, Tianhang Zheng, Chenglin Miao, Liuyi Yao, and Aidong Zhang. On the robustness of metric learning: An adversarial perspective. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(5):1-25, 2022.

[Huang et al., 2019] He Huang, Changhu Wang, Philip S. Yu, and Chang-Dong Wang. Generative dual adversarial network for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 801-810, 2019.

[Khim and Loh, 2018] Justin Khim and Po-Ling Loh. Adversarial risk bounds via function transformation. arXiv preprint arXiv:1810.09519, 2018.

[Klochkov and Zhivotovskiy, 2021] Yegor Klochkov and Nikita Zhivotovskiy. Stability and deviation optimal risk bounds with convergence rate O(1/n). Advances in Neural Information Processing Systems, 34:5065-5076, 2021.

[Kurakin et al., 2018] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security, pages 99-112. 2018.

[Lei et al., 2020] Yunwen Lei, Antoine Ledent, and Marius Kloft. Sharper generalization bounds for pairwise learning. Advances in Neural Information Processing Systems, 33:21236-21246, 2020.

[Li and Liu, 2021] Shaojie Li and Yong Liu. High probability generalization bounds with fast rates for minimax problems. In International Conference on Learning Representations, 2021.

[Liu et al., 2022] Deyin Liu, Lin Wu, Richang Hong, Zongyuan Ge, Jialie Shen, Farid Boussaid, and Mohammed Bennamoun. Generative metric learning for adversarially robust open-world person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2022.
[Madry et al., 2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In Proceedings of the International Conference on Learning Representations, 2018.

[Mo et al., 2022] Yingxiang Mo, Hong Chen, Yuxiang Han, and Hao Deng. Error bounds of adversarial bipartite ranking. Neurocomputing, 478:81-88, 2022.

[Mohri et al., 2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

[Mustafa et al., 2022] Waleed Mustafa, Yunwen Lei, and Marius Kloft. On the generalization analysis of adversarial learning. In International Conference on Machine Learning, pages 16174-16196, 2022.

[Reeve and Kaban, 2020] Henry Reeve and Ata Kaban. Optimistic bounds for multi-output learning. In International Conference on Machine Learning, pages 8030-8040, 2020.

[Srebro et al., 2010] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. Advances in Neural Information Processing Systems, 23, 2010.

[Tu et al., 2019] Zhuozhuo Tu, Jingwei Zhang, and Dacheng Tao. Theoretical analysis of adversarial learning: A minimax approach. Advances in Neural Information Processing Systems, 32, 2019.

[Vapnik, 1999] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.

[Wang et al., 2020] Zhuoyi Wang, Yigong Wang, Bo Dong, Sahoo Pracheta, Kevin Hamlen, and Latifur Khan. Adaptive margin based deep adversarial metric learning. In 2020 IEEE 6th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC), and IEEE Intl Conference on Intelligent Data and Security (IDS), pages 100-108, 2020.

[Xiao et al., 2022] Jiancong Xiao, Yanbo Fan, Ruoyu Sun, Jue Wang, and Zhi-Quan Luo. Stability analysis and generalization bounds of adversarial training. In Advances in Neural Information Processing Systems, 2022.

[Xing et al., 2021] Yue Xing, Qifan Song, and Guang Cheng. On the algorithmic stability of adversarial training. Advances in Neural Information Processing Systems, 34:26523-26535, 2021.

[Xu et al., 2019] Xing Xu, Li He, Huimin Lu, Lianli Gao, and Yanli Ji. Deep adversarial metric learning for cross-modal retrieval. World Wide Web, 22(2):657-672, 2019.

[Yang et al., 2021] Xiaochen Yang, Mingzhi Dong, Yiwen Guo, and Jing-Hao Xue. Metric learning for categorical and ambiguous features: An adversarial method. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 223-238, 2021.

[Ye et al., 2019] Han-Jia Ye, De-Chuan Zhan, and Yuan Jiang. Fast generalization rates for distance metric learning. Machine Learning, 108(2):267-295, 2019.

[Yin et al., 2019] Dong Yin, Ramchandran Kannan, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. In International Conference on Machine Learning, pages 7085-7094, 2019.

[Zantedeschi et al., 2016] Valentina Zantedeschi, Rémi Emonet, and Marc Sebban. Metric learning as convex combinations of local models with generalization guarantees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1478-1486, 2016.
[Zhang, 2002] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527-550, 2002.