# Adversarial Regression with Doubly Non-negative Weighting Matrices

Tam Le (RIKEN AIP, tam.le@riken.jp) · Truyen Nguyen (University of Akron, tnguyen@uakron.edu) · Makoto Yamada (Kyoto University and RIKEN AIP, makoto.yamada@riken.jp) · Jose Blanchet (Stanford University, jose.blanchet@stanford.edu) · Viet Anh Nguyen (Stanford University and VinAI Research, v.anhnv81@vinai.io)

*: equal contribution. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Many machine learning tasks that involve predicting an output response can be solved by training a weighted regression model. Unfortunately, the predictive power of this type of model may severely deteriorate under low sample sizes or under covariate perturbations. Reweighting the training samples has emerged as an effective mitigation strategy for these problems. In this paper, we propose a novel and coherent scheme for kernel-reweighted regression by reparametrizing the sample weights using a doubly non-negative matrix. When the weighting matrix is confined to an uncertainty set defined using either the log-determinant divergence or the Bures-Wasserstein distance, we show that the adversarially reweighted estimate can be computed efficiently using first-order methods. Numerical experiments show that our reweighting strategy delivers promising results on numerous datasets.

1 Introduction

We are interested in learning a parameter $\beta$ that has competitive predictive performance on a response variable $Y$. Given $N$ training samples $(\hat z_i, \hat x_i, \hat y_i)_{i=1}^N$ in which $(\hat z_i, \hat x_i)$ are the contexts that possess explanatory power on $\hat y_i$, learning the parameter $\beta$ can be posed as a weighted regression problem of the form

$$\min_\beta\ \sum_{i=1}^N \omega(\hat z_i)\,\ell(\beta, \hat x_i, \hat y_i). \qquad (1)$$

In problem (1), $\omega$ is a weighting function that indicates the contribution of each sample-specific loss to the objective. By aligning the covariates $(\hat z_i, \hat x_i)$ appropriately with the weighting term $\omega(\hat z_i)$ and the loss term $\ell(\beta, \hat x_i, \hat y_i)$, the generic formulation of problem (1) can be adapted to many popular learning and estimation tasks in machine learning. For example, problem (1) encapsulates the family of kernel smoothers, including the Nadaraya-Watson estimator [16, 22, 39].

Example 1.1 (Nadaraya-Watson (NW) estimator for conditional expectation). Given the samples $(\hat z_i, \hat y_i)_{i=1}^N$, we are interested in estimating the conditional expectation of $Y$ given $Z = z_0$ for some covariate $z_0 \in \mathcal{Z}$. The NW estimator is the optimizer of problem (1) with $\ell(\beta, y) = \|\beta - y\|_2^2$ and the weighting function $\omega$ given through a kernel $K$ via $\omega(\hat z_i) = K(z_0, \hat z_i)$. The NW estimate of $\mathbb{E}[Y \mid Z = z_0]$ admits the closed-form expression

$$\beta_{NW} = \frac{\sum_{i=1}^N K(z_0, \hat z_i)\,\hat y_i}{\sum_{i=1}^N K(z_0, \hat z_i)}.$$

The NW estimator utilizes a locally constant function to estimate the conditional expectation $\mathbb{E}[Y \mid Z = z_0]$. Locally linear regression [3, 32] extends the NW estimator to reduce the noise produced by the linear component of a target function [27, Section 3.2].

Example 1.2 (Locally linear regression (LLR)). For univariate output and $z \equiv x$, the LLR minimizes the kernel-weighted loss with $\ell([\beta_1, \beta_2], z, y) = (\beta_1 + \beta_2^\top z - y)^2$. The LLR estimate of $\mathbb{E}[Y \mid Z = z_0]$ admits the closed-form expression

$$\beta_{LLR} = \begin{bmatrix} 1 \\ z_0 \end{bmatrix}^\top \big(\hat Z^\top W \hat Z\big)^{-1} \hat Z^\top W \hat Y$$

with $\hat Y = [\hat y_1, \ldots, \hat y_n]^\top \in \mathbb{R}^n$, $W = \operatorname{diag}\big(K(z_0, \hat z_1), \ldots, K(z_0, \hat z_n)\big) \in \mathbb{R}^{n \times n}$, and

$$\hat Z = \begin{bmatrix} 1 & (\hat z_1 - z_0)^\top \\ \vdots & \vdots \\ 1 & (\hat z_n - z_0)^\top \end{bmatrix}.$$

Intuitively, the NW and LLR estimators are special instances of the larger family of local polynomial estimators with order zero and one, respectively.
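To make Examples 1.1 and 1.2 concrete, the following minimal sketch implements both estimators with a Gaussian kernel. The toy data, the bandwidth, and all function names are illustrative assumptions, not the paper's released code; the LLR sketch uses the centered design matrix of Example 1.2, so the fitted intercept is the prediction at $z_0$.

```python
import numpy as np

def gaussian_kernel(z0, Z, h):
    """K(z0, z) = exp(-||z0 - z||_2^2 / h^2)."""
    return np.exp(-np.sum((Z - z0) ** 2, axis=-1) / h ** 2)

def nw_estimate(z0, Z, Y, h):
    """Nadaraya-Watson estimate of E[Y | Z = z0] (Example 1.1)."""
    w = gaussian_kernel(z0, Z, h)                 # weights omega(z_i) = K(z0, z_i)
    return w @ Y / w.sum()

def llr_estimate(z0, Z, Y, h):
    """Locally linear regression estimate of E[Y | Z = z0] (Example 1.2)."""
    w = gaussian_kernel(z0, Z, h)
    D = np.hstack([np.ones((len(Z), 1)), Z - z0]) # design rows [1, (z_i - z0)^T]
    W = np.diag(w)
    beta = np.linalg.solve(D.T @ W @ D, D.T @ W @ Y)
    return beta[0]                                # intercept of the centered fit = prediction at z0

# Usage on toy data
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
Y = Z[:, 0] ** 2 + 0.1 * rng.normal(size=200)
z0 = np.array([0.5, 0.0])
print(nw_estimate(z0, Z, Y, h=0.5), llr_estimate(z0, Z, Y, h=0.5))
```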
Problem (1) is also the building block for local learning algorithms [6], density ratio estimation [5, pp. 152], risk minimization under covariate shift [17, 4], domain adaptation [36], geographically weighted regression [7], and local interpretable explanations [31], to name a few. In all of the aforementioned applications, a prevailing trait is that the weight $\omega$ is given through a kernel. To avoid any confusion in terminology, it is instructive to revisit and distinguish the relevant definitions of kernels. The first family is the non-negative kernels, which are popularly employed in nonparametric statistics [37].

Definition 1.3 (Non-negative kernel). A function $K : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ is non-negative if $K(z, z') \ge 0$ for any $z, z' \in \mathcal{Z}$.

In addition, there also exists a family of positive definite kernels, which forms the backbone of kernel machine learning [4, 33].

Definition 1.4 (Positive definite kernel). A symmetric function $K : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ is positive definite if for any $n \in \mathbb{N}$ and any choices of $(z_i)_{i=1}^n \subset \mathcal{Z}$ and $(\alpha_i)_{i=1}^n \subset \mathbb{R}$, we have

$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K(z_i, z_j) \ge 0. \qquad (2)$$

Moreover, $K$ is strictly positive definite if, in addition, for mutually distinct $(z_i)_{i=1}^n \subset \mathcal{Z}$, equality in (2) implies $\alpha_1 = \ldots = \alpha_n = 0$.

Positive definite kernels are a powerful tool to model geographical interactions [7], to characterize the covariance structure in Gaussian processes [29, 4], and to construct non-linear kernel methods [33]. Interestingly, the two above-mentioned families of kernels have a significant overlap. Examples of kernels that are both non-negative and strictly positive definite include the Gaussian kernel with bandwidth $h > 0$, defined for any $z, z' \in \mathcal{Z}$ as $K(z, z') = \exp(-\|z - z'\|_2^2 / h^2)$, the Laplacian kernel, the Cauchy kernel, the Matérn kernel, the rational quadratic kernel, etc.

It is well known that the non-parametric statistical estimator obtained by solving (1) is sensitive to corruptions of the training data [9, 21, 28]. A similar phenomenon is also observed in machine learning, where the solution of the risk minimization problem (1) is not guaranteed to be robust or generalizable [1, 2, 12, 14, 19, 23, 41, 42, 43]. The quality of the solution to (1) also deteriorates if the training sample size $N$ is small. Reweighting, obtained by modifying $\omega(\hat z_i)$, is emerging as an attractive resolution to improve robustness and enhance the out-of-sample performance on the test data [30, 34, 40]. At the same time, reweighting schemes have been shown to produce many favorable effects: reweighting can increase fairness [15, 20, 38], and can also effectively handle covariate shift [10, 17, 44].

While reweighting has been successfully applied to the empirical risk minimization regime in which the weights are uniformly $1/N$, reweighting the samples when the weighting function $\omega$ is tied to a kernel is not a trivial task. In fact, the kernel inherently captures the relative positions of the relevant covariates $\hat z$, and any reweighting scheme should also reflect these relationships from a global viewpoint. Another difficulty arises from the lack of convexity or concavity, which prohibits modifying the kernel parameters directly. For example, the mapping $h \mapsto \exp(-\|z - z'\|_2^2 / h^2)$ for the Gaussian kernel is neither convex nor concave if $z \neq z'$. Thus, it is highly challenging to optimize over $h$ in the bandwidth parameter space. Alternatively, modifying the covariates $(\hat z_i)_{i=1}^N$ would also produce reweighting effects. Nevertheless, optimizing over the covariates is intractable for sophisticated kernels such as the Matérn kernel.

Contributions.
This paper relies fundamentally on the observation that the Gram matrix of a non-negative, (strictly) positive definite kernel is a non-negative, positive (semi)definite matrix, also known as a doubly non-negative matrix. It is thus natural to modify the weights by modifying the corresponding matrix parametrization in an appropriate manner. Our contributions in this paper are two-fold:

- We propose a novel scheme for reweighting using a reparametrization of the sample weights as a doubly non-negative matrix. The estimate is characterized as the solution to a min-max optimization problem, in which the admissible values of the weights are obtained through a projection of an uncertainty set from the matrix space.
- We report an in-depth analysis of two reweighting approaches based on the construction of the matrix uncertainty set with the log-determinant divergence and the Bures-Wasserstein distance. Exploiting strong duality, we show that the worst-case loss function and its gradient can be efficiently evaluated by solving univariate dual problems. Consequently, the adversarially reweighted estimate can be found efficiently using first-order methods.

Organization of the paper. Section 2 introduces our generic framework of reweighting using doubly non-negative matrices. Sections 3 and 4 study two distinct ways to customize our reweighting framework using the log-determinant divergence and the Bures-Wasserstein distance. Section 5 empirically illustrates that our reweighting strategy delivers promising results on the conditional expectation estimation task on numerous real-life datasets. We have released code for the proposed tools at https://github.com/lttam/Adversarial-Regression.

Notations. The identity matrix is denoted by $I$. For any $A \in \mathbb{R}^{p \times p}$, $\operatorname{Tr} A$ denotes the trace of $A$, and $A \ge 0$ means that all entries of $A$ are non-negative. Let $\mathbb{S}^p$ denote the vector space of $p$-by-$p$ real symmetric matrices. The set of positive (semi-)definite matrices is denoted by $\mathbb{S}^p_{++}$ (respectively, $\mathbb{S}^p_+$). For any $A, B \in \mathbb{R}^{p \times p}$, we use $\langle A, B \rangle = \operatorname{Tr}(A^\top B)$ to denote the Frobenius inner product between $A$ and $B$, and $\|v\|_2$ to denote the Euclidean norm of $v \in \mathbb{R}^p$.

2 A Reweighting Framework with Doubly Non-negative Matrices

We delineate in this section our reweighting framework using doubly non-negative matrices. This framework relies on the following observation: we can reparametrize the weights in (1) into a matrix $\hat\Omega$ and the loss terms in (1) into a matrix $V(\beta)$, and the solution to the estimation problem (1) can be equivalently characterized as the minimizer of the problem

$$\min_\beta\ \langle \hat\Omega, V(\beta) \rangle. \qquad (3)$$

Notice that there may exist multiple equivalent reparametrizations of the form (3). However, in this paper, we focus on one specific parametrization where $\hat\Omega$ is the nominal matrix of weights

$$\hat\Omega = \begin{bmatrix} \hat\Omega_{00} & \hat\Omega_{01} & \cdots & \hat\Omega_{0N} \\ \hat\Omega_{10} & \hat\Omega_{11} & \cdots & \hat\Omega_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ \hat\Omega_{N0} & \hat\Omega_{N1} & \cdots & \hat\Omega_{NN} \end{bmatrix}$$

with the elements given by the weighting function $\omega$ as $\hat\Omega_{0i} = \hat\Omega_{i0} = \omega(\hat z_i)$ for $i = 1, \ldots, N$, and the matrix-valued mapping $V : \beta \mapsto V(\beta) \in \mathbb{S}^{N+1}$ satisfies

$$V(\beta) = \begin{bmatrix} 0 & \ell(\beta, \hat x_1, \hat y_1) & \cdots & \ell(\beta, \hat x_N, \hat y_N) \\ \ell(\beta, \hat x_1, \hat y_1) & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \ell(\beta, \hat x_N, \hat y_N) & 0 & \cdots & 0 \end{bmatrix}.$$

A simple calculation reveals that the objective function of (3) equals that of (1) up to a positive constant factor of 2. As a consequence, their solutions coincide. Problem (3) is an over-parametrized reformulation of the weighted risk minimization problem (1). Indeed, the objective function of problem (3) involves an inner product of two symmetric matrices, while problem (1) could be reformulated using an inner product of two vectors.
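The reparametrization (3) is easy to verify numerically. The sketch below builds a Gaussian-kernel nominal matrix $\hat\Omega$ and the matrix $V(\beta)$ for a squared-error loss (an illustrative choice), and checks that $\langle \hat\Omega, V(\beta)\rangle$ equals twice the weighted loss in (1); all helper names are placeholders rather than the released implementation.

```python
import numpy as np

def nominal_weight_matrix(z0, Z, h):
    """Gram-style nominal matrix: Omega_hat[0, i] = Omega_hat[i, 0] = K(z0, z_i)."""
    pts = np.vstack([z0[None, :], Z])                       # stack z0 on top of z_1, ..., z_N
    sq = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / h ** 2)                             # Gaussian Gram matrix, doubly non-negative

def loss_matrix(beta, X, Y):
    """V(beta): zero everywhere except the first row/column, which hold the per-sample losses."""
    losses = (X @ beta - Y) ** 2                            # squared-error loss, for illustration
    N = len(losses)
    V = np.zeros((N + 1, N + 1))
    V[0, 1:] = V[1:, 0] = losses
    return V

rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))
X = Z                                                        # context equals feature, as in Example 1.2
Y = rng.normal(size=20)
z0, beta, h = np.zeros(3), rng.normal(size=3), 1.0
Omega_hat, V = nominal_weight_matrix(z0, Z, h), loss_matrix(beta, X, Y)
w = np.exp(-np.sum((Z - z0) ** 2, axis=1) / h ** 2)          # omega(z_i) = K(z0, z_i)
assert np.isclose(np.sum(Omega_hat * V), 2 * np.sum(w * (X @ beta - Y) ** 2))
```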
While lifting the problem to the matrix space is not necessarily the most efficient approach, it endows us with more flexibility to perturb the weights in a coherent manner. This flexibility comes from the following two observations: (i) there may exist multiple matrices that can be used as the nominal matrix $\hat\Omega$, and one can potentially choose $\hat\Omega$ to improve the quality of the estimator; (ii) the geometry of the space of positive (semi)definite matrices is richer than that of the space of vectors. To proceed, we need the following assumption.

Assumption 2.1 (Regularity conditions). The following conditions hold throughout the paper. (i) The function $\ell$ is non-negative, and $\ell(\cdot, x, y)$ is convex and continuously differentiable for any $(x, y)$. (ii) The nominal weighting matrix $\hat\Omega$ is symmetric, positive definite, and non-negative.

In this paper, we propose to find an estimate $\beta^\star$ that solves the following adversarially reweighted estimation problem

$$\min_\beta\ \max_{\Omega \in \mathcal{U}_{\phi,\rho}(\hat\Omega)}\ \langle \Omega, V(\beta) \rangle \qquad (4)$$

for some set $\mathcal{U}_{\phi,\rho}(\hat\Omega)$ of feasible weighting matrices. The estimate $\beta^\star$ thus minimizes the worst-case loss uniformly over all possible perturbations of the weights $\Omega \in \mathcal{U}_{\phi,\rho}(\hat\Omega)$. In particular, we explore constructions of the uncertainty set $\mathcal{U}_{\phi,\rho}(\hat\Omega)$ that are motivated by the Gram matrix obtained from a non-negative and positive definite kernel. In this way, the weighting matrix can capture more information on the pairwise relations among the training data. Hence, it is reasonable to consider the set $\mathcal{U}_{\phi,\rho}(\hat\Omega)$ of the form

$$\mathcal{U}_{\phi,\rho}(\hat\Omega) \triangleq \big\{ \Omega \in \mathbb{S}^{N+1}_+ : \Omega \ge 0,\ \phi(\Omega, \hat\Omega) \le \rho \big\}. \qquad (5)$$

By definition, any $\Omega \in \mathcal{U}_{\phi,\rho}(\hat\Omega)$ is a symmetric, positive semidefinite matrix and all elements of $\Omega$ are non-negative. A matrix with these properties is called doubly non-negative. From a high-level perspective, the set $\mathcal{U}_{\phi,\rho}(\hat\Omega)$ is a ball of radius $\rho$ centered at the nominal matrix $\hat\Omega$, where the ball is prescribed by a pre-determined measure of dissimilarity $\phi$. Throughout this paper, we prescribe the uncertainty set $\mathcal{U}_{\phi,\rho}(\hat\Omega)$ using a divergence $\phi$ on the space of symmetric positive semidefinite matrices $\mathbb{S}^{N+1}_+$.

Definition 2.2 (Divergence). For any $N \in \mathbb{N}$, $\phi$ is a divergence on the symmetric positive semidefinite matrix space $\mathbb{S}^{N+1}_+$ if it is: (i) non-negative: $\phi(\Omega_1, \Omega_2) \ge 0$ for all $\Omega_1, \Omega_2 \in \mathbb{S}^{N+1}_+$, and (ii) indiscernible: $\phi(\Omega_1, \Omega_2) = 0$ implies $\Omega_1 = \Omega_2$.

If we denote the adversarially reweighted loss function associated with $\mathcal{U}_{\phi,\rho}(\hat\Omega)$ by

$$F_{\phi,\rho}(\beta) \triangleq \max_{\Omega \in \mathcal{U}_{\phi,\rho}(\hat\Omega)} \langle \Omega, V(\beta) \rangle,$$

then $\beta^\star$ can be equivalently written as

$$\beta^\star = \arg\min_\beta\ F_{\phi,\rho}(\beta). \qquad (6)$$

A direct consequence is that the function $F_{\phi,\rho}$ is convex in $\beta$ as long as the loss function $\ell$ satisfies the convexity property of Assumption 2.1(i). Hence, the estimate $\beta^\star$ can be found efficiently using convex optimization, provided that the function $F_{\phi,\rho}$ and its gradient can be efficiently evaluated. Moreover, because $\phi$ is a divergence, $\mathcal{U}_{\phi,0}(\hat\Omega) = \{\hat\Omega\}$; hence, by setting $\rho = 0$, we recover the nominal estimate that solves (1). In Sections 3 and 4, we subsequently specify two possible choices of $\phi$ that lead to the desired efficiency in computing $F_{\phi,\rho}$ and its gradient. Further discussion on Assumption 2.1 is relegated to the appendix.
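Before specializing $\phi$, the following sketch records the outer first-order loop suggested by (6). It treats the inner maximization as a black-box oracle that returns a worst-case matrix $\Omega^\star$, and it uses the worst-case gradient formula that Lemmas 3.3 and 4.3 establish below; the function names, step size, and iteration budget are illustrative placeholders, not the released implementation.

```python
import numpy as np

def adversarial_estimate(beta0, X, Y, loss, loss_grad, worst_case_omega,
                         step=1e-2, iters=200):
    """First-order search for beta* in (6): gradient descent on F(beta) = <Omega*, V(beta)>.

    loss(beta, X, Y)         -> per-sample losses ell(beta, x_i, y_i), shape (N,)
    loss_grad(beta, X, Y)    -> per-sample gradients, shape (N, dim(beta))
    worst_case_omega(losses) -> a worst-case weighting matrix Omega* in U_{phi,rho}(Omega_hat),
                                e.g. computed from the duals of Sections 3 and 4.
    """
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(iters):
        omega_star = worst_case_omega(loss(beta, X, Y))   # inner maximization oracle
        grads = loss_grad(beta, X, Y)
        # Danskin-type gradient of the worst-case loss: 2 * sum_i Omega*_{0i} * grad ell_i
        # (anticipating Lemmas 3.3 and 4.3)
        beta -= step * 2.0 * (omega_star[0, 1:] @ grads)
    return beta
```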
We close this section by discussing the robustness effects of our reweighting scheme (4) on the conditional expectation estimation problem.

Remark 2.3 (Connection to distributionally robust optimization). Consider the conditional expectation estimation setting, in which $\mathbb{E}[Y \mid Z = z_0]$ is the solution of the minimum mean square error estimation problem

$$\mathbb{E}[Y \mid Z = z_0] = \arg\min_\beta\ \mathbb{E}[(\beta - Y)^2 \mid Z = z_0].$$

In this setting, our reweighting scheme (4) coincides with the following distributionally robust optimization problem

$$\min_\beta\ \max_{\mathbb{Q}_{Y|Z=z_0} \in \mathcal{B}(\hat{\mathbb{P}}_{Y|Z=z_0})}\ \mathbb{E}_{\mathbb{Q}_{Y|Z=z_0}}\big[(\beta - Y)^2\big],$$

with the nominal conditional distribution defined as $\hat{\mathbb{P}}_{Y|Z=z_0}(\mathrm{d}y) \propto \sum_{i=1}^N K(z_0, \hat z_i)\,\delta_{\hat y_i}(\mathrm{d}y)$. The ambiguity set $\mathcal{B}(\hat{\mathbb{P}}_{Y|Z=z_0})$ is a set of conditional probability measures of $Y \mid Z = z_0$ constructed specifically as

$$\mathcal{B}(\hat{\mathbb{P}}_{Y|Z=z_0}) = \Big\{ \mathbb{Q}_{Y|Z=z_0} : \exists\, \Omega \in \mathcal{U}_{\phi,\rho}(\hat\Omega) \text{ with } \Omega_{0i} = \Omega_{i0} = \omega(\hat z_i)\ \forall i \text{ such that } \mathbb{Q}_{Y|Z=z_0}(\mathrm{d}y) \propto \sum_{i=1}^N \omega(\hat z_i)\,\delta_{\hat y_i}(\mathrm{d}y) \Big\}.$$

Remark 2.3 reveals that our reweighting scheme recovers a specific robustification with distributional ambiguity. This robustification relies on a kernel density estimate to construct the nominal conditional distribution, and the weights of the samples are induced by $\mathcal{U}_{\phi,\rho}(\hat\Omega)$. Hence, our scheme is applicable to the emerging stream of work on robustifying conditional decisions; see [11, 18, 25, 26].

Remark 2.4 (Choice of the nominal matrix). The performance of the estimate may depend on the specific choice of the nominal matrix $\hat\Omega$. However, in this paper, we do not study this dependence in detail. When the weights $\omega(\hat z_i)$ are given by a kernel, it is advisable to choose $\hat\Omega$ as the Gram matrix.

3 Adversarial Reweighting Scheme using the Log-Determinant Divergence

We study here the adversarial reweighting scheme when $\phi$ is the log-determinant divergence.

Definition 3.1 (Log-determinant divergence). For any positive integer $p \in \mathbb{N}$, the log-determinant divergence from $\Omega_1 \in \mathbb{S}^p_{++}$ to $\Omega_2 \in \mathbb{S}^p_{++}$ amounts to

$$D(\Omega_1, \Omega_2) \triangleq \operatorname{Tr}\big(\Omega_1 \Omega_2^{-1}\big) - \log\det\big(\Omega_1 \Omega_2^{-1}\big) - p.$$

The divergence $D$ is a special instance of the log-determinant $\alpha$-divergence with $\alpha = 1$ [8]. Being a divergence, $D$ is non-negative and vanishes if and only if $\Omega_1 = \Omega_2$. It is important to notice that the divergence $D$ is well-defined only when both $\Omega_1$ and $\Omega_2$ are positive definite. Moreover, $D$ is non-symmetric, and $D(\Omega_1, \Omega_2) \neq D(\Omega_2, \Omega_1)$ in general. The divergence $D$ is also tightly connected to the Kullback-Leibler divergence between two Gaussian distributions, with $D(\Omega_1, \Omega_2) = 2\,\mathrm{KL}(\mathcal{N}(0, \Omega_1) \,\|\, \mathcal{N}(0, \Omega_2))$, where $\mathcal{N}(0, \Omega)$ is the normal distribution with mean $0$ and covariance matrix $\Omega$.

Suppose that $\hat\Omega$ is invertible. Define the uncertainty set

$$\mathcal{U}_{D,\rho}(\hat\Omega) = \big\{\Omega \in \mathbb{S}^{N+1}_+ : \Omega \ge 0,\ D(\Omega, \hat\Omega) \le \rho\big\}.$$

For any positive definite matrix $\hat\Omega$, the function $D(\cdot, \hat\Omega)$ is convex; thus the set $\mathcal{U}_{D,\rho}(\hat\Omega)$ is also convex. In this section, we examine the following optimal value function

$$F_{D,\rho}(\beta) = \max_{\Omega \in \mathcal{U}_{D,\rho}(\hat\Omega)}\ \langle \Omega, V(\beta) \rangle, \qquad (7)$$

which corresponds to the worst-case reweighted loss under the divergence $D$. The maximization problem (7) constitutes a nonlinear, convex semidefinite program. Leveraging a strong duality argument, the next theorem asserts that the complexity of evaluating $F_{D,\rho}(\beta)$ is equivalent to the complexity of solving a univariate convex optimization problem.

Theorem 3.2 (Primal representation). For any $\hat\Omega \in \mathbb{S}^{N+1}_{++}$ and $\rho \in (0, +\infty)$, the function $F_{D,\rho}$ is convex. Moreover, for any $\beta$ such that $V(\beta) \neq 0$, let $\gamma^\star$ be the unique solution of the convex univariate optimization problem

$$\inf_{\gamma I \succ \hat\Omega^{\frac{1}{2}} V(\beta) \hat\Omega^{\frac{1}{2}}}\ \gamma\rho - \gamma \log\det\big(I - \gamma^{-1} \hat\Omega^{\frac{1}{2}} V(\beta) \hat\Omega^{\frac{1}{2}}\big), \qquad (8)$$

then $F_{D,\rho}(\beta) = \langle \Omega^\star, V(\beta) \rangle$, where $\Omega^\star = \hat\Omega^{\frac{1}{2}} \big[I - (\gamma^\star)^{-1} \hat\Omega^{\frac{1}{2}} V(\beta) \hat\Omega^{\frac{1}{2}}\big]^{-1} \hat\Omega^{\frac{1}{2}}$. Moreover, the symmetric matrix $\Omega^\star$ is unique and doubly non-negative.

Notice that the condition $V(\beta) \neq 0$ is not restrictive: if $V(\beta) = 0$, then Assumption 2.1(i) implies that the incumbent solution $\beta$ incurs zero loss, with $\ell(\beta, \hat x_i, \hat y_i) = 0$ for all $i$. In this case, $\beta$ is optimal and reweighting will produce no effect whatsoever.
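To illustrate how Theorem 3.2 reduces the semidefinite program (7) to a one-dimensional problem, the sketch below evaluates $F_{D,\rho}(\beta)$ and recovers $\Omega^\star$ by minimizing the dual objective (8) over $\gamma$ with an off-the-shelf scalar solver. This is a minimal sketch under the theorem's assumptions ($\hat\Omega \succ 0$, $V(\beta) \neq 0$); the bracket-expansion heuristic for $\gamma$ and all names are our own assumptions, not the released implementation.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import minimize_scalar

def logdet_div(Omega1, Omega2):
    """D(Omega1, Omega2) = Tr(Omega1 Omega2^{-1}) - log det(Omega1 Omega2^{-1}) - p (Definition 3.1)."""
    p = Omega1.shape[0]
    M = Omega1 @ np.linalg.inv(Omega2)
    return np.trace(M) - np.linalg.slogdet(M)[1] - p

def logdet_worst_case(Omega_hat, V, rho):
    """Evaluate F_{D,rho}(beta) = <Omega*, V(beta)> through the univariate dual (8).

    Theorem 3.2: Omega* = Omega_hat^{1/2} [I - M/gamma*]^{-1} Omega_hat^{1/2},
    with M = Omega_hat^{1/2} V Omega_hat^{1/2}.
    """
    S = np.real(sqrtm(Omega_hat))
    M = S @ V @ S
    eye = np.eye(M.shape[0])
    lam_max = np.max(np.linalg.eigvalsh(M))       # dual feasibility requires gamma > lambda_max(M)

    def dual_obj(gamma):
        return gamma * rho - gamma * np.linalg.slogdet(eye - M / gamma)[1]

    lo, hi = lam_max * (1 + 1e-6) + 1e-12, lam_max + 1.0
    while dual_obj(2 * hi) < dual_obj(hi):        # expand the bracket until the convex objective turns upward
        hi *= 2
    gamma = minimize_scalar(dual_obj, bounds=(lo, 2 * hi), method="bounded").x
    Omega_star = S @ np.linalg.inv(eye - M / gamma) @ S
    return float(np.sum(Omega_star * V)), Omega_star
```

After assembling $V(\beta)$ from the per-sample losses, the returned $\Omega^\star$ can serve as the `worst_case_omega` oracle in the outer-loop sketch of Section 2.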
Intuitively, the infimum problem (8) is the dual counterpart of the supremum problem (7). The objective function of (8) is convex in the dual variable $\gamma$, and thus problem (8) can be efficiently solved using a gradient descent algorithm. The gradient of $F_{D,\rho}$ is also easy to compute, as asserted in the following lemma.

Lemma 3.3 (Gradient of $F_{D,\rho}$). The function $F_{D,\rho}$ is continuously differentiable at $\beta$ with

$$\nabla_\beta F_{D,\rho}(\beta) = 2 \sum_{i=1}^N \Omega^\star_{0i}\, \nabla_\beta \ell(\beta, \hat x_i, \hat y_i),$$

where $\Omega^\star$ is defined as in Theorem 3.2 using the parametrization

$$\Omega^\star = \begin{bmatrix} \Omega^\star_{00} & \Omega^\star_{01} & \cdots & \Omega^\star_{0N} \\ \Omega^\star_{10} & \Omega^\star_{11} & \cdots & \Omega^\star_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ \Omega^\star_{N0} & \Omega^\star_{N1} & \cdots & \Omega^\star_{NN} \end{bmatrix}. \qquad (9)$$

The proof of Lemma 3.3 exploits Danskin's theorem and the fact that $\Omega^\star$ in Theorem 3.2 is unique. Minimizing $F_{D,\rho}$ is now achievable by applying state-of-the-art first-order methods.

Sketch of proof of Theorem 3.2. The difficulty in deriving the dual formulation (8) lies in the non-negativity constraint $\Omega \ge 0$. In fact, this constraint imposes $(N+1)(N+2)/2$ individual component-wise constraints, and as such, simply dualizing problem (7) using Lagrangian multipliers would entail a large number of auxiliary variables. To overcome this difficulty, we consider the relaxed set

$$\mathcal{V}_{D,\rho}(\hat\Omega) \triangleq \big\{\Omega \in \mathbb{S}^{N+1}_+ : D(\Omega, \hat\Omega) \le \rho\big\}.$$

By definition, we have $\mathcal{U}_{D,\rho}(\hat\Omega) \subseteq \mathcal{V}_{D,\rho}(\hat\Omega)$, and $\mathcal{V}_{D,\rho}(\hat\Omega)$ omits the non-negativity requirement $\Omega \ge 0$. The set $\mathcal{V}_{D,\rho}(\hat\Omega)$ is also more amenable to optimization thanks to the following proposition.

Proposition 3.4 (Properties of $\mathcal{V}_{D,\rho}(\hat\Omega)$). For any $\hat\Omega \in \mathbb{S}^{N+1}_{++}$ and $\rho \ge 0$, the set $\mathcal{V}_{D,\rho}(\hat\Omega)$ is convex and compact. Moreover, the support function of $\mathcal{V}_{D,\rho}(\hat\Omega)$ satisfies

$$\delta^*_{\mathcal{V}_{D,\rho}(\hat\Omega)}(T) \triangleq \sup_{\Omega \in \mathcal{V}_{D,\rho}(\hat\Omega)} \operatorname{Tr}(\Omega T) = \inf_{\gamma > 0,\ \gamma\hat\Omega^{-1} \succ T}\ \gamma\rho - \gamma \log\det\big(I - \hat\Omega^{\frac{1}{2}} T \hat\Omega^{\frac{1}{2}} / \gamma\big)$$

for any symmetric matrix $T \in \mathbb{S}^{N+1}$.

Moreover, we need the following lemma, which asserts some useful properties of the matrix $V(\beta)$.

Lemma 3.5 (Properties of $V(\beta)$). For any $\beta$, the matrix $V(\beta)$ is symmetric, non-negative, and it has only two non-zero eigenvalues, of values $\pm\sqrt{\sum_{i=1}^N \ell(\beta, \hat x_i, \hat y_i)^2}$.

The proof of Theorem 3.2 proceeds by first constructing a tight upper bound for $F_{D,\rho}(\beta)$ as

$$F_{D,\rho}(\beta) \le \max_{\Omega \in \mathcal{V}_{D,\rho}(\hat\Omega)} \langle \Omega, V(\beta) \rangle = \inf_{\gamma\hat\Omega^{-1} \succ V(\beta)}\ \gamma\rho - \gamma \log\det\big(I - \tfrac{1}{\gamma}\hat\Omega^{\frac{1}{2}} V(\beta) \hat\Omega^{\frac{1}{2}}\big), \qquad (10)$$

where the inequality in (10) follows from the fact that $\mathcal{U}_{D,\rho}(\hat\Omega) \subseteq \mathcal{V}_{D,\rho}(\hat\Omega)$, and the equality follows from Proposition 3.4. Notice that $V(\beta)$ has a positive eigenvalue by virtue of Lemma 3.5, and thus the constraint $\gamma\hat\Omega^{-1} \succ V(\beta)$ already implies the condition $\gamma > 0$. Next, we argue that the optimizer $\Omega^\star$ of problem (10) can be constructed from the optimizer $\gamma^\star$ of the infimum problem via $\Omega^\star = \hat\Omega^{\frac{1}{2}} [I - (\gamma^\star)^{-1} \hat\Omega^{\frac{1}{2}} V(\beta) \hat\Omega^{\frac{1}{2}}]^{-1} \hat\Omega^{\frac{1}{2}}$. The last step involves proving that $\Omega^\star$ is a non-negative matrix, and hence $\Omega^\star \in \mathcal{U}_{D,\rho}(\hat\Omega)$. As a consequence, the inequality in (10) holds as an equality, which leads to the postulated result. The full proof is relegated to the appendix.

4 Adversarial Reweighting Scheme using the Bures-Wasserstein Type Divergence

In this section, we explore the construction of the set of possible weighting matrices using the Bures-Wasserstein distance on the space of positive semidefinite matrices.

Definition 4.1 (Bures-Wasserstein divergence). For any positive integer $p \in \mathbb{N}$, the Bures-Wasserstein divergence between $\Omega_1 \in \mathbb{S}^p_+$ and $\Omega_2 \in \mathbb{S}^p_+$ amounts to

$$W(\Omega_1, \Omega_2) \triangleq \operatorname{Tr}\Big(\Omega_1 + \Omega_2 - 2\big(\Omega_2^{\frac{1}{2}} \Omega_1 \Omega_2^{\frac{1}{2}}\big)^{\frac{1}{2}}\Big).$$

For any positive semidefinite matrices $\Omega_1$ and $\Omega_2$, the value $W(\Omega_1, \Omega_2)$ equals the square of the type-2 Wasserstein distance between the Gaussian distributions $\mathcal{N}(0, \Omega_1)$ and $\mathcal{N}(0, \Omega_2)$ [13]. As a consequence, $W$ is a divergence: it is non-negative and indiscernible. However, $W$ is not a proper distance because it may violate the triangle inequality.
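The divergence of Definition 4.1 can be computed directly from matrix square roots. The sketch below does so and runs two quick sanity checks (that $W$ vanishes at identical arguments and is symmetric); the random test matrices are illustrative only.

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein(Omega1, Omega2):
    """W(Omega1, Omega2) = Tr(Omega1 + Omega2 - 2 (Omega2^{1/2} Omega1 Omega2^{1/2})^{1/2})."""
    S2 = np.real(sqrtm(Omega2))
    cross = np.real(sqrtm(S2 @ Omega1 @ S2))
    return np.trace(Omega1) + np.trace(Omega2) - 2.0 * np.trace(cross)

# Sanity checks: W vanishes at identical arguments and is symmetric in its arguments.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)); Omega1 = A @ A.T
B = rng.normal(size=(4, 4)); Omega2 = B @ B.T
assert abs(bures_wasserstein(Omega1, Omega1)) < 1e-6
assert abs(bures_wasserstein(Omega1, Omega2) - bures_wasserstein(Omega2, Omega1)) < 1e-6
```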
Compared to the divergence $D$ studied in Section 3, the divergence $W$ has several advantages: it is symmetric and it is well-defined for all positive semidefinite matrices. This divergence has also been of interest in quantum information, statistics, and the theory of optimal transport. Given the nominal weighting matrix $\hat\Omega$, we define the set of possible weighting matrices using the Bures-Wasserstein divergence $W$ as

$$\mathcal{U}_{W,\rho}(\hat\Omega) \triangleq \big\{\Omega \in \mathbb{S}^{N+1}_+ : \Omega \ge 0,\ W(\Omega, \hat\Omega) \le \rho\big\}.$$

Correspondingly, the worst-case loss function is

$$F_{W,\rho}(\beta) \triangleq \max_{\Omega \in \mathcal{U}_{W,\rho}(\hat\Omega)}\ \langle \Omega, V(\beta) \rangle. \qquad (11)$$

Theorem 4.2 (Primal representation). For any $\hat\Omega \in \mathbb{S}^{N+1}_+$ and $\rho \in (0, +\infty)$, the function $F_{W,\rho}$ is convex. Moreover, for any $\beta$ such that $V(\beta) \neq 0$, let $\gamma^\star$ be the unique solution of the convex univariate optimization problem

$$\inf_{\gamma I \succ V(\beta)}\ \gamma\big(\rho - \operatorname{Tr}\hat\Omega\big) + \gamma^2 \big\langle (\gamma I - V(\beta))^{-1}, \hat\Omega \big\rangle, \qquad (12)$$

then $F_{W,\rho}(\beta) = \langle \Omega^\star, V(\beta) \rangle$, where $\Omega^\star = (\gamma^\star)^2 [\gamma^\star I - V(\beta)]^{-1} \hat\Omega [\gamma^\star I - V(\beta)]^{-1}$. Moreover, the symmetric matrix $\Omega^\star$ is unique and doubly non-negative.

Thanks to the uniqueness of $\Omega^\star$ and Danskin's theorem, the gradient of $F_{W,\rho}$ is a by-product of Theorem 4.2.

Lemma 4.3 (Gradient of $F_{W,\rho}$). The function $F_{W,\rho}$ is continuously differentiable at $\beta$ with

$$\nabla_\beta F_{W,\rho}(\beta) = 2 \sum_{i=1}^N \Omega^\star_{0i}\, \nabla_\beta \ell(\beta, \hat x_i, \hat y_i),$$

where $\Omega^\star$ is defined as in Theorem 4.2 using a parametrization similar to (9).

A first-order minimization algorithm can be used to find the robust estimate with respect to the loss function $F_{W,\rho}$. Notice that problem (12) is one-dimensional, and either a bisection search or a gradient descent subroutine can be employed to solve (12) efficiently.

Sketch of proof of Theorem 4.2. The proof of Theorem 4.2 follows a similar line of argument as the proof of Theorem 3.2. Consider the relaxed set

$$\mathcal{V}_{W,\rho}(\hat\Omega) \triangleq \big\{\Omega \in \mathbb{S}^{N+1}_+ : W(\Omega, \hat\Omega) \le \rho\big\}.$$

By definition, $\mathcal{V}_{W,\rho}(\hat\Omega)$ omits the non-negativity requirement $\Omega \ge 0$, and thus $\mathcal{U}_{W,\rho}(\hat\Omega) \subseteq \mathcal{V}_{W,\rho}(\hat\Omega)$. The advantage of considering $\mathcal{V}_{W,\rho}(\hat\Omega)$ arises from the fact that the support function of the set $\mathcal{V}_{W,\rho}(\hat\Omega)$ admits a simple form [24, Proposition A.4].

Proposition 4.4 (Properties of $\mathcal{V}_{W,\rho}(\hat\Omega)$). For any $\hat\Omega \in \mathbb{S}^{N+1}_+$ and $\rho \ge 0$, the set $\mathcal{V}_{W,\rho}(\hat\Omega)$ is convex and compact. Moreover, the support function of $\mathcal{V}_{W,\rho}(\hat\Omega)$ satisfies

$$\delta^*_{\mathcal{V}_{W,\rho}(\hat\Omega)}(T) \triangleq \sup_{\Omega \in \mathcal{V}_{W,\rho}(\hat\Omega)} \operatorname{Tr}(\Omega T) = \inf_{\gamma > 0,\ \gamma I \succ T}\ \gamma\big(\rho - \operatorname{Tr}\hat\Omega\big) + \gamma^2 \big\langle (\gamma I - T)^{-1}, \hat\Omega \big\rangle.$$

The upper bound for $F_{W,\rho}(\beta)$ can be constructed as

$$F_{W,\rho}(\beta) \le \max_{\Omega \in \mathcal{V}_{W,\rho}(\hat\Omega)} \langle \Omega, V(\beta) \rangle = \inf_{\gamma I \succ V(\beta)}\ \gamma\big(\rho - \operatorname{Tr}\hat\Omega\big) + \gamma^2 \big\langle (\gamma I - V(\beta))^{-1}, \hat\Omega \big\rangle, \qquad (13)$$

where the inequality in (13) follows from the fact that $\mathcal{U}_{W,\rho}(\hat\Omega) \subseteq \mathcal{V}_{W,\rho}(\hat\Omega)$, and the equality follows from Proposition 4.4. In the second step, we argue that the optimizer $\Omega^\star$ of problem (13) can be constructed from the optimizer $\gamma^\star$ of the infimum problem via $\Omega^\star = (\gamma^\star)^2 [\gamma^\star I - V(\beta)]^{-1} \hat\Omega [\gamma^\star I - V(\beta)]^{-1}$. The last step involves proving that $\Omega^\star$ is a non-negative matrix by exploiting Lemma 3.5, and hence $\Omega^\star \in \mathcal{U}_{W,\rho}(\hat\Omega)$. Thus, inequality (13) is tight, leading to the desired result.
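Mirroring the log-determinant case, the sketch below evaluates $F_{W,\rho}(\beta)$ and recovers $\Omega^\star$ by minimizing the one-dimensional dual (12). It is a minimal sketch under the assumptions of Theorem 4.2 ($V(\beta) \neq 0$); the bracket-expansion heuristic and the names are our own, not the released implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def bures_worst_case(Omega_hat, V, rho):
    """Evaluate F_{W,rho}(beta) = <Omega*, V(beta)> through the univariate dual (12).

    Theorem 4.2: Omega* = gamma*^2 (gamma* I - V)^{-1} Omega_hat (gamma* I - V)^{-1}.
    """
    eye = np.eye(V.shape[0])
    lam_max = np.max(np.linalg.eigvalsh(V))       # feasibility requires gamma > lambda_max(V) > 0
    tr_hat = np.trace(Omega_hat)

    def dual_obj(gamma):
        R = np.linalg.inv(gamma * eye - V)
        return gamma * (rho - tr_hat) + gamma ** 2 * np.sum(R * Omega_hat)

    lo, hi = lam_max * (1 + 1e-6) + 1e-12, lam_max + 1.0
    while dual_obj(2 * hi) < dual_obj(hi):        # expand the bracket until the convex objective turns upward
        hi *= 2
    gamma = minimize_scalar(dual_obj, bounds=(lo, 2 * hi), method="bounded").x
    R = np.linalg.inv(gamma * eye - V)
    Omega_star = gamma ** 2 * R @ Omega_hat @ R
    return float(np.sum(Omega_star * V)), Omega_star
```

As with the log-determinant oracle, the resulting $\Omega^\star$ can be plugged into the generic first-order loop of Section 2.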
5 Numerical Experiments on Real Data

We evaluate our adversarial reweighting schemes on the conditional expectation estimation task. To this end, we apply the proposed reweighting scheme to the NW estimator of Example 1.1. The robustifications using the log-determinant divergence and the Bures-Wasserstein divergence are denoted by NW-LogDet and NW-BuresW, respectively.

We compare our robust NW estimates against four popular baselines for estimating the conditional expectation: (i) the standard NW estimate of Example 1.1 with a Gaussian kernel, (ii) the LLR estimate of Example 1.2 with a Gaussian kernel (we omit results of LLR in Figures 1 and 2 for better visualization; see the appendix for detailed results), (iii) the intercept $\beta_1$ of the LLR estimate (i.e., only the first dimension of $\beta_{LLR}$), denoted LLR-I, and (iv) NW-Metric [27], which utilizes the Mahalanobis distance in the Gaussian kernel.

Datasets. We use 8 real-world datasets: (i) abalone (Abalone), (ii) bank-32fh (Bank), (iii) cpu (CPU), (iv) kin40k (KIN), (v) elevators (Elevators), (vi) pol (POL), (vii) pumadyn32nm (PUMA), and (viii) slice (Slice), taken from the Delve, UCI, and KEEL repositories and the datasets of Noh et al. [27]. Due to space limitations, we report results on the first 4 datasets and relegate the remaining results to the appendix. Dataset characteristics can also be found in the supplementary material.

Setup. For each dataset, we randomly split 1200 samples for training, 50 samples for validation to choose the bandwidth $h$ of the Gaussian kernel, and 800 samples for testing. More specifically, we choose the squared bandwidth $h^2$ for the Gaussian kernel from the predefined set $\{10^{-2:1:4},\ 2\times 10^{-2:1:4},\ 5\times 10^{-2:1:4}\}$. For a tractable estimation, we follow the approach of Brunsdon et al. [7] and Silverman [35] and restrict the relevant samples to the $N$ nearest neighbors of each test sample $z_i$, with $N \in \{10, 20, 30, 50\}$. The radius $\rho$ ranges over 4 values, $\rho \in \{0.01, 0.1, 1, 10\}$. Finally, the prediction error is measured by the root mean square error (RMSE), i.e., $\mathrm{RMSE} = \sqrt{n_t^{-1} \sum_{i=1}^{n_t} (\hat y_i - \hat\beta_i)^2}$, where $n_t$ is the test sample size (i.e., $n_t = 800$) and $\hat\beta_i$ is the conditional expectation estimate at the test sample $z_i$. We repeat the above procedure 10 times to obtain the average RMSE. All our experiments are run on commodity hardware.

Ideal case: no sample perturbation. We first study how the different estimators perform when there is no perturbation in the training data. In this experiment, we set the nearest neighbor size to $N = 50$, and our reweighted estimators are obtained with the uncertainty size $\rho = 0.1$.

[Figure 1: Average RMSE for the ideal case with no perturbation.]

Figure 1 shows the average RMSE across the datasets. The NW-Metric estimator outperforms the standard NW, which agrees with the empirical observations of Noh et al. [27]. More importantly, we observe that our adversarial reweighting schemes perform competitively against the baselines on several datasets.

When training samples are perturbed. We next evaluate the estimation performance when the $\tau \in \{0.2N, 0.4N, 0.6N, 0.8N, N\}$ training neighbors nearest to each test sample are perturbed. We generate perturbations only in the response dimension by shifting $y \mapsto \kappa y$, where $\kappa$ is sampled uniformly from $[1.8, 2.2]$. We set $N = 50$ and $\rho = 0.1$, as in the experiment for the ideal case (no sample perturbation).
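For concreteness, here is a minimal sketch of the response perturbation $y \mapsto \kappa y$ on the $\tau$ nearest training samples and of the RMSE metric defined above. The assumption that the responses are pre-sorted by distance to the test point, and all function names, are ours.

```python
import numpy as np

def perturb_responses(Y, n_perturb, rng, kappa_range=(1.8, 2.2)):
    """Shift the responses of the n_perturb nearest training samples: y -> kappa * y."""
    kappa = rng.uniform(*kappa_range, size=n_perturb)
    Y_pert = Y.astype(float).copy()
    Y_pert[:n_perturb] *= kappa      # assumes Y is already sorted by distance to the test point
    return Y_pert

def rmse(Y_test, preds):
    """Root mean square error over the n_t test samples."""
    return np.sqrt(np.mean((np.asarray(Y_test) - np.asarray(preds)) ** 2))
```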
[Figure 2: RMSE for varying perturbation levels τ. Results are averaged over 10 independent replications, each containing 800 test samples.]

Figure 2 shows the average RMSE under varying perturbation levels $\tau$. The performance of all baseline approaches deteriorates severely, while both NW-LogDet and NW-BuresW alleviate the effect of the data perturbation. Our adversarial reweighting schemes consistently outperform all baselines on all datasets with perturbed training data, across all 5 perturbation levels $\tau$. We next evaluate the effects of the uncertainty size $\rho$ and the nearest neighbor size $N$ on NW-LogDet and NW-BuresW.

Effects of the uncertainty size $\rho$. In this experiment, we set the nearest neighbor size to $N = 50$ and the perturbation to $\tau = N$.

[Figure 3: Average RMSE as a function of the ambiguity size ρ. Errors at ρ = 0 indicate the performance of the vanilla NW estimator.]

Figure 3 illustrates the effects of the uncertainty size $\rho$ on the adversarial reweighting schemes. Errors at $\rho = 0$ indicate the performance of the vanilla NW estimator. We observe that the adversarial reweighting schemes perform well at certain values of $\rho$, and when $\rho$ is increased further, the performance degrades. The uncertainty size $\rho$ thus plays an important role for the adversarial reweighting schemes in applications, and tuning $\rho$ may consequently improve their performance.

Effects of the nearest neighbor size $N$. In this experiment, we set the uncertainty size to $\rho = 0.1$.

[Figure 4: Effects of the nearest neighbor size N (KIN dataset, NW-BuresW shown for N ∈ {10, 20, 30, 50}).]

Figure 4 shows the effects of the nearest neighbor size $N$ on the adversarial reweighting schemes under varying perturbation $\tau$ in the KIN dataset. We observe that the adversarial reweighting schemes with $N \in \{20, 30\}$ perform better than those with $N \in \{10, 50\}$. Note that when $N$ is increased, the computational cost also increases (see Equation (1) and Figure 6). Similar to the case of the uncertainty size $\rho$, tuning $N$ may also help to improve the performance of the adversarial reweighting schemes.

[Figure 5: Effects of the perturbation intensity κ for the NW-LogDet estimate, with κ ranges [1.3, 1.7], [1.8, 2.2], [2.3, 2.7], [2.8, 3.2]. Left plot: different perturbations τ; right plot: different uncertainty sizes ρ.]

Under varying shifts with respect to $\kappa$. In this experiment, we set the nearest neighbor size to $N = 50$. Figure 5 illustrates the performance of NW-LogDet under varying shifts with respect to $\kappa$ in the KIN dataset. For the left plot of Figure 5, we set the uncertainty size to $\rho = 0.1$ while varying the perturbation $\tau$. We observe that the adversarial reweighting schemes provide different degrees of mitigation of the perturbation under varying shifts with respect to $\kappa$. For the right plot of Figure 5, we set the perturbation to $\tau = 0.2N$ while varying the uncertainty size $\rho$. We observe that the reweighting schemes under varying shifts with respect to $\kappa$ behave as in Figure 3 when we consider the effects of the uncertainty size $\rho$ (i.e., when $\kappa \in [1.8, 2.2]$). Similar results for NW-BuresW are reported in the supplementary material.

[Figure 6: Effects of N on the computational time (time consumption in seconds vs. N, for ρ ∈ {0.01, 0.1, 1, 10}).]

Time consumption. Figure 6 illustrates the time consumption of the adversarial reweighting schemes under varying neighbor size $N$ and uncertainty size $\rho$ in the KIN dataset. The adversarial reweighting schemes take about 10 seconds on average for their estimation.
When $N$ and/or $\rho$ increases, the computation of the adversarial reweighting schemes takes longer. This is intuitive because the number of entries of the weighting matrix $\Omega$ scales quadratically in the neighbor size $N$, and a bigger uncertainty size $\rho$ implies a larger feasible set $\mathcal{U}_{\phi,\rho}(\hat\Omega)$, which leads to a longer computing time to evaluate $F_{\phi,\rho}$ and its gradient.

Concluding remarks. We introduce two novel schemes for sample reweighting using a matrix reparametrization. These two variants are particularly attractive when the original weights are given through kernel functions. Note that the adversarial reweighting with the Bures-Wasserstein distance $W$ can be generalized to cases where the nominal weighting matrix $\hat\Omega$ is singular, unlike the reweighting with the log-determinant divergence $D$.

Remark 5.1 (Invariance under permutation). Our results hold under any simultaneous row and column permutation of the nominal weighting matrix $\hat\Omega$ and the mapping $V(\beta)$. To see this, let $P$ be any $(N+1)$-dimensional permutation matrix, and let $\hat\Omega_P \triangleq P \hat\Omega P^\top$ and $V_P(\beta) \triangleq P V(\beta) P^\top$. Then

$$\max_{\Omega \in \mathcal{U}_{\phi,\rho}(\hat\Omega_P)} \langle \Omega, V_P(\beta) \rangle = \langle P \Omega^\star P^\top, V_P(\beta) \rangle = \langle \Omega^\star, V(\beta) \rangle,$$

where $\Omega^\star$ is calculated as in Theorem 3.2 for $\phi = D$, and as in Theorem 4.2 for $\phi = W$. The proof relies on the facts that $P^\top P = P P^\top = I$, that both $D$ and $W$ are permutation invariant (in the sense that $\phi(\Omega_1, \Omega_2) = \phi(P \Omega_1 P^\top, P \Omega_2 P^\top)$), and that the inner product is also permutation invariant. Similar results hold for the gradient information, and hence the optimal solution $\beta^\star$ is preserved under row and column permutations of $\hat\Omega$ and $V(\beta)$.

Acknowledgements. Material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397. Support is gratefully acknowledged from NSF grants 1915967, 1820942, 1838676, Simons Foundation grant #318995, JSPS KAKENHI grant 20K19873, and MEXT KAKENHI grants 20H04243 and 21H04874. We thank the anonymous referees for their constructive feedback that helped improve and clarify this paper.

References

[1] David Alvarez-Melis and Tommi S Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018.
[2] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, and Guoliang Chen. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 173–182, 2016.
[3] Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. Locally weighted learning. Lazy Learning, pages 11–73, 1997.
[4] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer, 1984.
[5] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
[6] Léon Bottou and Vladimir Vapnik. Local learning algorithms. Neural Computation, 4(6):888–900, 1992.
[7] Chris Brunsdon, Stewart Fotheringham, and Martin Charlton. Geographically weighted regression. Journal of the Royal Statistical Society: Series D (The Statistician), 47(3):431–443, 1998.
[8] Zeineb Chebbi and Maher Moakher. Means of Hermitian positive-definite matrices based on the log-determinant α-divergence function. Linear Algebra and its Applications, 436(7):1872–1889, 2012.
[9] Simon Du, Yining Wang, Sivaraman Balakrishnan, Pradeep Ravikumar, and Aarti Singh. Robust nonparametric regression under Huber's ε-contamination model. arXiv preprint arXiv:1805.10406, 2018.
[10] John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts. Under review, 2019.
[11] Adrián Esteban-Pérez and Juan M Morales. Distributionally robust stochastic programs with side information based on trimmings. arXiv preprint arXiv:2009.10592, 2020.
[12] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3681–3688, 2019.
[13] C. R. Givens and R. M. Shortt. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
[14] Moritz Hardt, Eric Price, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, volume 29, pages 3315–3323, 2016.
[15] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938, 2018.
[16] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.
[17] Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning, 2018.
[18] Rohit Kannan, Güzin Bayraksan, and James R Luedtke. Residuals-based distributionally robust optimization with covariate information. arXiv preprint arXiv:2012.01088, 2020.
[19] JooSeuk Kim and Clayton D. Scott. Robust kernel density estimation. The Journal of Machine Learning Research, 13:2529–2565, 2012.
[20] Preethi Lahoti, Alex Beutel, Jilin Chen, Flavien Prost, Kang Lee, Nithum Thain, Xuezhi Wang, and Ed Chi. Fairness without demographics through adversarially reweighted learning. In Advances in Neural Information Processing Systems, 2020.
[21] Haoyang Liu and Chao Gao. Density estimation with contaminated data: Minimax rates and theory of adaptation. arXiv preprint arXiv:1712.07801v3, 2017.
[22] Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
[23] Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems 30, pages 2971–2980, 2017.
[24] Viet Anh Nguyen, S. Shafieezadeh-Abadeh, D. Kuhn, and P. Mohajerin Esfahani. Bridging Bayesian and minimax mean square error estimation via Wasserstein distributionally robust optimization. arXiv preprint arXiv:1910.10583, 2019.
[25] Viet Anh Nguyen, Fan Zhang, Jose Blanchet, Erick Delage, and Yinyu Ye. Distributionally robust local non-parametric conditional estimation. In Advances in Neural Information Processing Systems 33, 2020.
[26] Viet Anh Nguyen, Fan Zhang, Jose Blanchet, Erick Delage, and Yinyu Ye. Robustifying conditional portfolio decisions via optimal transport. arXiv preprint arXiv:2103.16451, 2021.
[27] Yung-Kyun Noh, Masashi Sugiyama, Kee-Eung Kim, Frank Park, and Daniel D Lee. Generative local metric learning for kernel regression. In Advances in Neural Information Processing Systems, pages 2452–2462, 2017.
[28] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 784–789, 2018.
[29] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[30] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pages 4334–4343. PMLR, 2018.
[31] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[32] David Ruppert and Matthew P Wand. Multivariate locally weighted least squares regression. The Annals of Statistics, pages 1346–1370, 1994.
[33] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning series, 2018.
[34] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-Weight-Net: Learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, volume 32, pages 1917–1928, 2019.
[35] Bernard W Silverman. Density Estimation for Statistics and Data Analysis. CRC Press, 1986.
[36] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, volume 7, pages 1433–1440, 2007.
[37] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
[38] Serena Wang, Wenshuo Guo, Harikrishna Narasimhan, Andrew Cotter, Maya Gupta, and Michael I Jordan. Robust optimization for fairness with noisy protected groups. In Advances in Neural Information Processing Systems, 2020.
[39] Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.
[40] Yifan Wu, Tianshu Ren, and Lidan Mu. Importance reweighting using adversarial-collaborative training. In NIPS 2016 Workshop, 2016.
[41] Muhammad Bilal Zafar, Isabel Valera, Manuel Rodriguez, Krishna Gummadi, and Adrian Weller. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems, volume 30, pages 229–239, 2017.
[42] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 325–333, 2013.
[43] Chongjie Zhang and Julie A Shah. Fairness in multi-agent sequential decision-making. In Advances in Neural Information Processing Systems, volume 27, pages 2636–2644, 2014.
[44] Jingzhao Zhang, Aditya Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, and Suvrit Sra. Coping with label shift via distributionally robust optimisation. arXiv preprint arXiv:2010.12230, 2020.