# Optimal Learning from Verified Training Data

Nicholas Bishop, University of Southampton, UK (nb8g13@soton.ac.uk)
Enrico Gerding, University of Southampton, UK (eg@ecs.soton.ac.uk)
Long Tran-Thanh, University of Warwick, UK (long.tran-thanh@warwick.ac.uk)

## Abstract

Standard machine learning algorithms typically assume that data is sampled independently from the distribution of interest. In attempts to relax this assumption, fields such as adversarial learning typically assume that data is provided by an adversary whose sole objective is to fool a learning algorithm. However, in reality, data often comes from self-interested agents whose goals are less malicious and whose intentions lie somewhere between these two settings. To tackle this problem, we present a Stackelberg competition model for least squares regression, in which data is provided by agents who wish to achieve specific predictions for their data. Although the resulting optimisation problem is nonconvex, we derive an algorithm which converges globally, outperforming current approaches which only guarantee convergence to local optima. We also provide empirical results on two real-world datasets, the medical personal costs dataset and the red wine dataset, showcasing the performance of our algorithm relative to algorithms which are optimal under adversarial assumptions, outperforming the state of the art.

## 1 Introduction

Many machine learning algorithms operate under the assumption that data is sampled independently and identically from the distribution of interest. In reality, this assumption does not hold in many applications. A canonical example is spam email classification: spammers constantly adapt the formatting of their emails in an attempt to bypass spam filters. In order to apply machine learning in such settings, approaches such as adversarial machine learning and robust optimisation instead typically assume that data is provided by an adversary who is attempting to fool the learning algorithm. This is a worst-case assumption that is well suited to problems such as spam classification. However, in many scenarios this assumption is overly pessimistic and thus prevents the learning algorithm from achieving high performance, as it always assumes worst-case manipulation. Instead of making such extreme assumptions, we should consider the individuals or organisations that provide data samples and how their motives might influence the data distribution. In many cases, the motivation of data providers is not to hinder generalisation but to manipulate the learning algorithm into selecting an outcome which is beneficial for them. In other words, data providers should be viewed as strategic agents seeking to achieve a high payoff. By including the motivations of agents directly in our optimisation process, we can consider a smaller set of possible adversarial manipulations and set aside manipulations which are irrational.

Less pessimistic models of data provider manipulation were first considered by Brückner and Scheffer [4], who formulated prediction from manipulated data as a Stackelberg competition. In this framework, the learner acts as the "leader" by first selecting a learning model with the goal of minimising their own loss.
Following this, with full information regarding the model chosen by the learner, data providers modify and submit their data with the aim of minimising their own loss function. The goal of the learner is to select a good model under the assumption that data providers will manipulate their data optimally. Formally, the learner is tasked with solving a bilevel optimisation problem, known as a Stackelberg prediction game (SPG). Whilst SPGs are nonconvex optimisation problems which are generally NP-hard, it is possible to find local optima via sequential quadratic programming when the loss functions of both the learner and the data provider are convex. However, there is no guarantee that these local optima perform well compared to globally optimal solutions, and thus existing methods still fail to provide good performance in many scenarios.

To overcome this issue, we present a subclass of SPGs, covering a number of important problems, for which we can find global solutions. In particular, this class consists of least squares linear regression problems in which the learner is interested in predicting one set of outputs whilst data providers are interested in manipulating the learner into predicting another set of outputs. Moreover, we make no assumptions with respect to the output labels of either the learner or the data provider and, as a result, this subclass can be used to model a wide variety of data providers with different motivations. In more detail, our contributions are as follows. For the first time, we show, using elementary linear algebra, that solving an SPG of this type is equivalent to solving a quadratically constrained quadratic fractional program with a single constraint. Next, by using a novel combination of Dinkelbach's lemmas for fractional programming [13] and the S-lemma for quadratic programs [3, 19, 25], we show that a combination of bisection search and semidefinite programming can be used to converge to global optima. Finally, we provide numerical results comparing our algorithm with current approaches. We show empirically that our algorithm outperforms existing methods on multiple real-world datasets, halving mean squared error (MSE) in some of our experiments.

## 2 Related Work

Stackelberg competition was originally introduced to model market behaviour and has since been applied in many other contexts. Perhaps the most notable of these are Stackelberg security games (SSGs) [1, 21], in which a defender must commit to a randomised allocation of resources to defend a number of targets from a best-responding attacker. Finding the optimal defense strategy is equivalent to solving multiple linear programs [10, 16], and it has been shown that a near-optimal defense strategy can be found in a polynomial number of queries to the attacker [2]. SPGs are similar to SSGs in the sense that they formulate learning an optimal strategy as a bilevel optimisation problem. However, SPGs are tailored specifically to address manipulation of machine learning algorithms. SPGs themselves have been studied widely since their introduction by Brückner and Scheffer [4]. However, most of the literature focuses on modelling data providers who are either only partly adversarial or are adversarial with limited capabilities. For example, Dong et al. [14] examine an online learning problem akin to spam filtering in which data providers are honest or adversarial at each time step.
More recently, Chen et al. [8] provided an online learning algorithm with sublinear Stackelberg regret for binary classification in which data providers may be neither entirely adversarial nor entirely honest. In contrast, we consider offline SPGs for linear regression, as opposed to online binary classification. To the best of our knowledge, our paper is the first to consider SPGs for linear least squares regression.

For linear regression, there has been much focus on the development of algorithms which are strategyproof, i.e. which ensure that data providers can gain no benefit by manipulating their data. In this vein, Dekel et al. [12] studied a learning scenario in which multiple selfish data providers wish to secure the best-fitting model for the subset of the training set they hold, and gave bounds on the accuracy of any algorithm which is strategyproof in this setting. Many algorithms for strategyproof linear regression are inspired by the strategyproofness of the median for single-peaked preferences [7, 17, 18]. We are not concerned with strategyproofness in our scenario, but only with finding the best performing model regardless of the data manipulation which may take place.

Learning in the presence of noise is closely related to our problem setting. Specifically, worst-case optimisation in the presence of bounded noise is equivalent to minimising loss in the presence of an adversary with limited capability. In particular, El Ghaoui and Lebret [15] proved that minimising least squares error in the presence of noise bounded in $\ell_2$-norm is equivalent to solving a ridge regression problem with a specific choice of hyperparameter. Similar results also hold for noise bounded in $\ell_1$-norm [26]. In fact, our proof techniques share many similarities with those of El Ghaoui and Lebret [15], including the application of an S-lemma. In contrast, we look beyond the worst case: the subclass of SPGs we consider allows us to model data providers with more nuanced motivations, which we believe are more likely to arise in practice. More specifically, we consider a setting in which the adversarial assumption made in the worst-case noise setting is replaced by the assumption that data providers will manipulate their data with the aim of minimising their own, predetermined, objective function.

## 3 Problem Formulation

We consider a scenario in which a learner is tasked with choosing a linear predictor, $w \in \mathbb{R}^n$. Data providers sample tuples of the form $(x, y, z)$ from a distribution $\mathcal{D}$. The first output, $y \in \mathbb{R}$, is the output label of interest to the learner, the second output, $z \in \mathbb{R}$, is the output label of interest to the data provider, and $x \in \mathbb{R}^n$ is the input example. We assume the learner has access to a sample, $S = \{(x_i, y_i, z_i)\}_{i=1}^m$, of unmodified data points which it can use as a training set. For example, such a training sample may be obtained from a reliable, but costly, verification process. While it may be too expensive to verify every single data point, the learner may be able to verify a small sample for the purpose of training. After training is completed, the learner must commit to a single linear predictor. At test time, data providers continue to sample input examples from the distribution $\mathcal{D}$. With knowledge of the linear predictor chosen by the learner at training time, the data provider is allowed to modify the input example of each test point. However, the data provider pays a cost of $c(x, \hat{x})$ for modifying an input example $x$ into a different input example $\hat{x}$.
In what follows, we assume that $c(x, \hat{x}) = \gamma\|x - \hat{x}\|_2^2$, where $\gamma > 0$ is a parameter which controls the trade-off the data provider makes between making low-cost data manipulations and bringing the learner's prediction as close to $z$ as possible. The goal of the data provider is to manipulate the learner into predicting $z$ whilst ensuring that the cost of manipulation is low. Formally speaking, the data provider is tasked with solving the following optimisation problem for each test example:

$$x^* = \operatorname*{argmin}_{\hat{x}} \; \left(w^T\hat{x} - z\right)^2 + \gamma\|x - \hat{x}\|_2^2$$

Meanwhile, the goal of the learner is to choose a linear predictor which, in expectation, accurately predicts the correct label $y$ for the manipulated data point submitted by the data provider:

$$w^* = \operatorname*{argmin}_w \; \mathbb{E}\left[\left(w^Tx^* - y\right)^2\right]$$

As the learner does not have access to the distribution $\mathcal{D}$, but only to a sample $S$, the learner can only hope to minimise an empirical version of their loss:

$$\operatorname*{argmin}_w \; \frac{1}{m}\sum_{i=1}^m \left(w^Tx_i^* - y_i\right)^2 \quad \text{s.t.} \quad x_i^* = \operatorname*{argmin}_{\hat{x}_i} \; \left(w^T\hat{x}_i - z_i\right)^2 + \gamma\|x_i - \hat{x}_i\|_2^2 \quad \forall i \in [m]$$

By constructing a matrix $X \in \mathbb{R}^{m \times n}$, whose $i$th row corresponds to the input example $x_i$ in the sample $S$, and vectors $y, z$, whose $i$th elements correspond to the output labels $y_i$ and $z_i$ respectively, we can rewrite this optimisation problem as follows:

$$\operatorname*{argmin}_w \; \|X^*w - y\|^2 \quad \text{s.t.} \quad X^* = \operatorname*{argmin}_{\hat{X}} \; \|\hat{X}w - z\|^2 + \gamma\|\hat{X} - X\|_F^2 \tag{1}$$

In the remaining sections, we develop a practical algorithm which finds global solutions to this optimisation problem.

## 4 Preliminaries

Before we proceed with the technical details, we give a brief overview of how our algorithm is derived and of the optimisation tools used. We begin by reformulating the SPG (1) as a fractional programming problem. We then show that such problems can be solved by bisection search by leveraging the classical results of Dinkelbach [13]. Each step of this bisection search requires solving a nonconvex optimisation problem. We show that this nonconvex optimisation problem can be solved via semidefinite programming through the application of an S-lemma, an equivalence theorem which states when one quadratic inequality is a consequence of a set of other quadratic inequalities.

### 4.1 Dinkelbach's Lemmas for Fractional Programming

Consider the following fractional program:

$$\min_w \; \frac{N(w)}{D(w)} \quad \text{s.t.} \quad w \in W \tag{2}$$

where both $N : \mathbb{R}^n \to \mathbb{R}$ and $D : \mathbb{R}^n \to \mathbb{R}$ are continuous functions and $W$ is a compact subset of $\mathbb{R}^n$. We can consider the following parameterised family of mathematical programs:

$$\min_w \; N(w) - qD(w) \quad \text{s.t.} \quad w \in W \tag{3}$$

where $q \in \mathbb{R}$ is a parameter. In what follows, we refer to these programs as Dinkelbach programs. Furthermore, we can define a function $F : \mathbb{R} \to \mathbb{R}$ mapping any $q$ to the optimal value of its corresponding Dinkelbach program. Dinkelbach [13] uncovered a number of relationships between $F$ and the fractional program (2), which we use in our own analysis:

**Lemma 1** (Dinkelbach [13]). The function $F(q)$ is continuous in $q$ and is strictly monotonically decreasing. Moreover, if $w \in W$ and $q = N(w)/D(w)$, then $F(q) \le 0$.

**Theorem 1** (Theorem 1 from Dinkelbach [13]). The equation $F(q^*) = 0$ has a unique solution $q^*$. Furthermore, $q^*$ is the optimal value of the fractional program (2) if and only if $F(q^*) = 0$.

In particular, in Section 6, we exploit Theorem 1 to circumvent the nonconvexities present in the fractional formulation of our SPG.

### 4.2 S-lemma with Equality

Consider a pair of quadratic functions:

$$f(w) = w^TAw + 2a^Tw + c, \qquad h(w) = w^TBw + 2b^Tw + d$$

where $A, B \in \mathbb{R}^{n \times n}$ are symmetric matrices, $a, b \in \mathbb{R}^n$ and $c, d \in \mathbb{R}$.
An S-lemma with equality is a theorem which specifies the conditions under which the following two statements are equivalent:

(i) $h(w) = 0 \implies f(w) \ge 0$ for all $w \in \mathbb{R}^n$;

(ii) there exists a number $\lambda \in \mathbb{R}$ such that $f(w) + \lambda h(w) \ge 0$ for all $w \in \mathbb{R}^n$.

Nearly all S-lemmas with equality assume that the dual Slater condition holds:

**Assumption 1** (Dual Slater Condition). There exist $w_1 \in \mathbb{R}^n$ and $w_2 \in \mathbb{R}^n$ such that $h(w_1) < 0$ and $h(w_2) > 0$.

The main application of S-lemmas is their use in transforming quadratically constrained quadratic programs with a single constraint (QC1QPs) into convex semidefinite programs (SDPs), which can be solved efficiently via interior point methods [3]. After we derive our bisection search algorithm, we will see that each step of bisection search requires solving a QC1QP. We make use of the following S-lemma with equality, proved by Xia et al. [25], to reformulate the QC1QP at each iteration into an SDP.

**Theorem 2** (Theorem 3 from Xia et al. [25]). Assume that Assumption 1 holds. In addition, assume that $B \neq 0$. Then statements (i) and (ii) are equivalent.

## 5 Problem Reformulation

Before we can apply the theorems described in the previous section, we first need to reformulate our optimisation problem into a more amenable form. We start by noting that the lower-level optimisation problem in the SPG (1) is convex, and therefore its solution can be found by locating the point at which the gradient with respect to $\hat{X}$ is equal to zero. This yields a closed-form solution for $X^*$ which can be used to rewrite the constraint in problem (1):

$$\operatorname*{argmin}_w \; \|X^*w - y\|^2 \quad \text{s.t.} \quad X^* = (zw^T + \gamma X)(ww^T + \gamma I)^{-1}$$

Substituting the right-hand side of the constraint directly into the objective allows us to get rid of the constraints entirely. Moreover, applying the Sherman-Morrison formula [3] to the matrix $ww^T + \gamma I$ allows us to derive an equivalent nonlinear least squares problem:

$$\min_w \; \left\|\frac{\frac{1}{\gamma}(w^Tw)\,z + Xw}{1 + \frac{w^Tw}{\gamma}} - y\right\|^2 \tag{4}$$

Intuitively, the learner is minimising the error of a weighted average between the target labels $z$ of the adversary and the predictions its chosen model would make on the true data matrix $X$. As the learner chooses a model larger in Euclidean norm, predictions are pushed towards the desired labels of the adversary. This makes sense: the adversary's manipulation is penalised in Frobenius norm, so by making the weight vector larger in $\ell_2$-norm, the learner makes it cheaper for the adversary to manipulate predictions. We also see that the larger $\gamma$ is, the more slowly the weight on the adversary's target labels grows as the $\ell_2$-norm of the weight vector increases.

By moving $y$ into the fraction, we can express problem (4) as the following fractional programming problem:

$$\min_w \; \frac{\left\|\frac{1}{\gamma}(w^Tw)\,z + Xw - \left(1 + \frac{w^Tw}{\gamma}\right)y\right\|^2}{\left(1 + \frac{w^Tw}{\gamma}\right)^2}$$

To get rid of high-degree terms, we introduce a new variable $\alpha \in \mathbb{R}$ and set it equal to $w^Tw$:

$$\min_{w,\alpha} \; \frac{\left\|\frac{\alpha}{\gamma}z + Xw - \left(1 + \frac{\alpha}{\gamma}\right)y\right\|^2}{\left(1 + \frac{\alpha}{\gamma}\right)^2} \quad \text{s.t.} \quad \alpha = w^Tw \tag{5}$$

We are now left with a quadratic fractional programming problem with a nonconvex feasible set. At this point, we are ready to apply the techniques discussed in Section 4 in order to develop a practical algorithm.

## 6 Solving the Fractional Program

We now apply the theorems and lemmas of Dinkelbach to construct an algorithm for solving problem (5). To begin, we consider the Dinkelbach programs associated with problem (5):

$$F(q) = \min_{w,\alpha} \; \left\|\frac{\alpha}{\gamma}z + Xw - \left(1 + \frac{\alpha}{\gamma}\right)y\right\|^2 - q\left(1 + \frac{\alpha}{\gamma}\right)^2 \quad \text{s.t.} \quad \alpha = w^Tw$$

By Theorem 1, we know that finding a $q^* \in \mathbb{R}$ such that $F(q^*) = 0$ implies that the linear predictor $w$ attaining $F(q^*)$ is a global solution to problem (5), and therefore to the original SPG.
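As a quick sanity check on the reformulation underlying problem (5), the following minimal numpy sketch (our own illustration, not part of the derivation; the sizes, the value of $\gamma$, and the choice $z = y - 1$ are arbitrary) verifies numerically that the closed-form best response from Section 5 and the Sherman-Morrison simplification agree at an arbitrary predictor $w$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, gamma = 50, 4, 0.5                 # illustrative sizes and trade-off parameter
X = rng.normal(size=(m, n))              # verified input examples, one per row
y = rng.normal(size=m)                   # labels of interest to the learner
z = y - 1.0                              # labels the data providers want predicted
w = rng.normal(size=n)                   # an arbitrary linear predictor

# Closed-form best response of the data providers (Section 5):
# X* = (z w^T + gamma X)(w w^T + gamma I)^{-1}.
X_star = (np.outer(z, w) + gamma * X) @ np.linalg.inv(np.outer(w, w) + gamma * np.eye(n))

# Predictions after the Sherman-Morrison simplification:
# X* w = ((w^T w / gamma) z + X w) / (1 + w^T w / gamma).
pred = ((w @ w / gamma) * z + X @ w) / (1 + w @ w / gamma)

assert np.allclose(X_star @ w, pred)     # the two expressions for X* w agree
print(np.sum((X_star @ w - y) ** 2))     # the learner's loss at w, as in problem (4)
```

Since the two expressions for $X^*w$ coincide, the objective of problem (5) evaluates the same loss as the original bilevel formulation (1) at any $w$.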
We also know from Lemma 1 that $F$ is a concave, monotonically decreasing, continuous function. As a result, given $q_1, q_2 \in \mathbb{R}$ for which $F(q_1) \ge 0$ and $F(q_2) \le 0$, we can employ bisection search to find $q^*$. This procedure is described formally by Algorithm 1. Given the description of our algorithm, we now show that it indeed converges linearly to a globally optimal solution. In particular, we have the following:

**Algorithm 1** Bisection search for $q^*$

- **Input:** data matrix $X$, learner's labels $y$, data provider's labels $z$, tolerance $\epsilon$
- Initialize $q_1 = 0$ and $q_2 = y^Ty$
- **repeat**
  - $q = (q_1 + q_2)/2$
  - **if** $F(q) \ge 0$ **then** $q_1 = q$ **else** $q_2 = q$
- **until** $q_2 - q_1 \le \epsilon$
- **return** $q_2$

**Theorem 3.** Algorithm 1 takes at most $\log_2\left(\frac{2y^Ty}{\epsilon}\right)$ iterations to return a $q \in \mathbb{R}$ such that $q - q^* \le \epsilon$.

*Proof.* For bisection search to converge to the $q^*$ such that $F(q^*) = 0$, we require two points: one point, $q_1 \in \mathbb{R}$, for which $F(q_1) \ge 0$, and another point, $q_2 \in \mathbb{R}$, for which $F(q_2) \le 0$. We begin by claiming that setting $q_1 = 0$ and $q_2 = y^Ty$ satisfies these conditions. Finding a $q_2$ such that $F(q_2) \le 0$ is simple: Lemma 1 tells us that we can choose any feasible point for problem (5) and set $q_2$ to the value of the objective at that point. We choose the zero vector, which yields an objective value of $y^Ty$. To find a $q_1$ such that $F(q_1) \ge 0$, we employ both Lemma 1 and Theorem 1. Since $F(q^*) = 0$ and $F$ is strictly monotonically decreasing, any value which lower-bounds $q^*$ is mapped to a nonnegative value by $F$. Since the objective of problem (5) is always nonnegative, we can select $q_1 = 0$ as a lower bound. Since $F(q_1)$ is nonnegative, $F(q_2)$ is nonpositive and $F$ is continuous, $q^*$ must lie in the interval $[q_1, q_2]$ by the intermediate value theorem. Thus $q = \frac{q_1 + q_2}{2}$, the midpoint of $[q_1, q_2]$, is at most $\left|\frac{q_1 - q_2}{2}\right|$ in distance from $q^*$. Note that, in Algorithm 1, $q_1$ is initialised at $0$ and $q_2$ is initialised at $y^Ty$, so initially $|q_1 - q_2| = y^Ty$. At each iteration, the interval $[q_1, q_2]$ is halved in length. Therefore, after $\log_2\left(\frac{2y^Ty}{\epsilon}\right)$ iterations, $|q - q^*| \le \frac{\epsilon}{2}$ and, after the same number of iterations, $q_2 - q \le \frac{\epsilon}{2}$. Therefore, $q_2 - q^* \le \epsilon$. As $q_2$ is returned by Algorithm 1, we have proved the result. ∎

Theorem 3 essentially states that, if it is possible to evaluate $F(q)$, then it is possible to converge linearly to a global solution. In contrast, existing methods can only guarantee convergence to stationary points which satisfy the Karush-Kuhn-Tucker (KKT) conditions. In addition, Theorem 3 indicates that this subclass of bilevel optimisation problems is easy in comparison to general bilevel optimisation problems, so long as $F(q)$ can be evaluated cheaply. A disadvantage of Algorithm 1 is that each iteration requires an evaluation of $F$, where $F$ itself is in general a nonconvex optimisation problem. Fortunately, $F(q)$ is a QC1QP. We can see this by rewriting $F(q)$ as follows:

$$\min_{\hat{w}} \; f(\hat{w}) = \hat{w}^TA\hat{w} + 2a^T\hat{w} + c \quad \text{s.t.} \quad h(\hat{w}) = \hat{w}^TB\hat{w} + 2b^T\hat{w} = 0 \tag{6}$$

where $\hat{w} = (w, \alpha) \in \mathbb{R}^{n+1}$ and

$$A = \begin{pmatrix} X^TX & \frac{1}{\gamma}X^T(z-y) \\ \frac{1}{\gamma}(z-y)^TX & \frac{1}{\gamma^2}\left((z-y)^T(z-y) - q\right) \end{pmatrix}, \qquad a = \begin{pmatrix} -X^Ty \\ -\frac{1}{\gamma}\left((z-y)^Ty + q\right) \end{pmatrix},$$

$$B = \begin{pmatrix} I & 0 \\ 0^T & 0 \end{pmatrix}, \qquad b = \begin{pmatrix} 0 \\ -\frac{1}{2} \end{pmatrix}, \qquad c = y^Ty - q.$$

We can now proceed with an S-procedure (the application of an S-lemma) in order to reformulate problem (6) as an SDP, which is convex and can be solved via interior point methods. First, we rewrite problem (6) as the following equivalent problem:

$$\max_\tau \; \tau \quad \text{s.t.} \quad h(\hat{w}) = 0 \implies f(\hat{w}) - \tau \ge 0 \quad \forall \hat{w} \in \mathbb{R}^{n+1}$$

Since $B \neq 0$, we can apply Theorem 2 to rewrite the constraint by introducing a new variable $\lambda \in \mathbb{R}$:
$$\max_{\tau,\lambda} \; \tau \quad \text{s.t.} \quad f(\hat{w}) - \tau + \lambda h(\hat{w}) \ge 0 \quad \forall \hat{w} \in \mathbb{R}^{n+1} \tag{7}$$

Furthermore, note that the constraint is equivalent to the following inequality:

$$\begin{pmatrix} \hat{w} \\ 1 \end{pmatrix}^T \begin{pmatrix} A + \lambda B & a + \lambda b \\ (a + \lambda b)^T & c - \tau \end{pmatrix} \begin{pmatrix} \hat{w} \\ 1 \end{pmatrix} \ge 0 \quad \forall \hat{w} \in \mathbb{R}^{n+1}$$

In what follows, we refer to the matrix in the above constraint as $M$. It is easy to see that the constraint above is equivalent to a positive semidefinite constraint on $M$. For the sake of contradiction, suppose that there exists $x \in \mathbb{R}^{n+2}$ such that $x^TMx < 0$ but the constraint above holds. If the final coordinate of $x$ is nonzero, we can simply rescale $x$ by this coordinate to achieve a contradiction. If the final coordinate of $x$ is zero, then we can use the continuity of quadratic forms to argue that there must exist an $x'$ whose final coordinate is nonzero such that $x'^TMx' < 0$, bringing us back to the previous case. Thus, we can replace the constraint in (7) as follows:

$$\max_{\tau,\lambda} \; \tau \quad \text{s.t.} \quad \begin{pmatrix} A + \lambda B & a + \lambda b \\ (a + \lambda b)^T & c - \tau \end{pmatrix} \succeq 0 \tag{8}$$

Thus, evaluating $F(q)$ corresponds to solving the SDP (8), which can be done efficiently using interior point methods. Assuming that strong conic duality holds, a value of $\hat{w}$ which attains $F(q)$ can be found by taking a rank-1 decomposition of the dual variables. A description of the dual program can be found in the supplementary material. Moreover, as each SDP has a rank-1 solution, we can apply standard low-rank approximation schemes for SDPs in order to cope with data of high dimension [5, 6]. In addition, a variety of first-order methods tailored to SDPs can be applied in scenarios where evaluating the Hessian is prohibitively expensive [22].

## 7 Empirical Evaluation

Given the theoretical performance analysis of our proposed algorithm, we now demonstrate that it is significantly more accurate in practice than state-of-the-art approaches. To this end, we evaluate Algorithm 1 on two real-world datasets and compare it against both ridge regression, which is the optimal approach under the assumption that data providers are completely adversarial, and the single-level nonconvex reformulation of the SPG originally proposed by Brückner and Scheffer [4]. In what follows, we discuss the results with respect to one of these datasets, the medical personal costs dataset [9], and defer results regarding the second dataset, the red wine dataset [11], to the supplementary material.

### 7.1 Medical Personal Costs Dataset

The medical personal costs dataset consists of 1338 instances, each with 7 features [9]. Each feature details information regarding an individual. Some are continuous, such as the individual's age and body mass index, while others are categorical, such as region and smoking status. So that we can perform linear regression on all features, categorical features are transformed into one-hot vectors, increasing the total number of features from 7 to 13. The response variable is the individual's medical costs billed by their health insurance. We consider a scenario in which insurers would like to predict the medical costs of a new customer in order to provide a reasonable insurance quote. We assume that individuals may provide fake data in the hope of receiving a lower quote. Similarly to the experimental design of Tong et al. [24], we define each data provider's desired outcome as $z_i = y_i + \delta_i$, where $\delta_i$ denotes the change in medical billing that each individual is striving for. In particular, we consider two types of provider: $A_{\text{modest}}$, who wishes to reduce their predicted medical billing by \$100 ($\delta_{\text{modest}} = -100.0$), and $A_{\text{severe}}$, who wishes to reduce their predicted medical billing by \$300 ($\delta_{\text{severe}} = -300.0$).
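Before turning to the results, we sketch how Algorithm 1 can be implemented with each evaluation of $F(q)$ cast as the SDP (8). The sketch below is illustrative only: it uses cvxpy as a generic SDP interface (our experiments instead use the SDPT3 solver via MATLAB), the function names are our own, and the recovery of the optimal predictor from the dual solution is omitted.

```python
import cvxpy as cp
import numpy as np

def F(q, X, y, z, gamma):
    """Evaluate F(q) by solving the SDP (8) associated with the QC1QP (6)."""
    n = X.shape[1]
    d = z - y
    # Blocks of f and h in the variable w_hat = (w, alpha), as defined in Section 6.
    A = np.block([[X.T @ X, (X.T @ d / gamma).reshape(-1, 1)],
                  [(X.T @ d / gamma).reshape(1, -1), [[(d @ d - q) / gamma**2]]]])
    A = 0.5 * (A + A.T)                       # symmetrise against round-off error
    a = np.append(-X.T @ y, -(d @ y + q) / gamma)
    B = np.diag(np.append(np.ones(n), 0.0))
    b = np.append(np.zeros(n), -0.5)
    c = y @ y - q

    tau, lam = cp.Variable(), cp.Variable()
    v = cp.reshape(a + lam * b, (n + 1, 1))   # column vector a + lambda * b
    M = cp.bmat([[A + lam * B, v],
                 [v.T, cp.reshape(c - tau, (1, 1))]])
    cp.Problem(cp.Maximize(tau), [M >> 0]).solve()
    return tau.value

def bisection_sdp(X, y, z, gamma, eps=1e-6):
    """Algorithm 1: bisection search for q*, evaluating F via the SDP (8)."""
    q1, q2 = 0.0, float(y @ y)
    while q2 - q1 > eps:
        q = 0.5 * (q1 + q2)
        if F(q, X, y, z, gamma) >= 0:
            q1 = q
        else:
            q2 = q
    return q2
```

A full implementation would additionally extract $\hat{w}$ via a rank-1 decomposition of the dual variable attached to the PSD constraint, as described in Section 6.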
Since medical charges range from \$1,000 to \$63,000, we numerically scale the data labels by dividing them by 100 before passing them to each algorithm. In order to evaluate Algorithm 1, we perform 10-fold cross validation and compare its performance to ridge regression for $\gamma \in [1 \times 10^{-5}, 1]$. For each value of $\gamma$, we compute the regularisation parameter for ridge regression via grid search over 8 logarithmically spaced points in the interval $[1 \times 10^{-5}, 1000]$ during cross validation. For ridge regression, and for the SDPs in Algorithm 1, we use the SDPT3 solver [23] to find global solutions. We also compare Algorithm 1 to the nonconvex single-level reformulation of the SPG originally proposed by Brückner and Scheffer [4]. We employ the interior point method from the MATLAB optimisation toolbox to find a stationary point for this problem reformulation. The same error tolerances are used for both Algorithm 1 and the interior point method we use to solve the nonconvex problem reformulation proposed by Brückner and Scheffer [4]. The results of this comparison are given in Figure 1.

*Figure 1: A performance comparison between the algorithms (ridge regression, Brückner and Scheffer (2011), and our bisection SDP) run on the medical personal costs dataset. The left plot compares the average MSE of each algorithm during 10-fold cross validation where data was generated by $A_{\text{modest}}$, whilst the right plot shows the average MSE where data was generated by $A_{\text{severe}}$.*

### 7.2 Results Analysis

Figure 1 shows the average mean squared error (MSE) achieved by each algorithm on the medical personal costs dataset. Firstly, observe that Algorithm 1 outperforms both ridge regression and the nonconvex problem reformulation for every value of $\gamma$. Also note that the MSE of Algorithm 1 is very stable, whilst the interior point solution behaves erratically for higher values of $\gamma$. For values of $\gamma > 0.4$, we observe that, for modest data providers, our algorithm is at least \$45 more accurate than ridge regression on average. For severe data providers, we observe an even greater difference. For example, when $\gamma > 0.5$, the predictions made by our algorithm are at least \$120 more accurate than ridge regression on average. As one would expect, explicitly modelling the goals of data providers becomes more beneficial as $\gamma$ increases, since the data provider's capability for manipulation becomes more limited. In the supplementary material we include numerical experiments conducted on another real-world dataset (the red wine dataset [11]), which yield similar results.

## 8 Conclusions and Future Work

In this paper, we studied a particular subclass of SPGs, in which data providers have their own goal of manipulating the learner into selecting a given outcome. We showed that solving this bilevel optimisation problem is equivalent to solving a series of QC1QPs. Using a combination of fractional programming and the S-lemma, we derived an algorithm which converges linearly to a global solution. Lastly, we evaluated our algorithm on real-world datasets and showed that it outperforms existing methods in terms of mean squared error. One disadvantage of our approach is that we require a verified training dataset which details the motivations of data providers. In many cases, it may not be possible to know the ambitions of data providers a priori.
In future work, we will consider an online version of this problem, in which the learner must choose a learning model based on the past data it has observed and its current beliefs regarding the motivations of the data provider. We also assumed that data providers are penalised for manipulations in the $\ell_2$-norm. It would be interesting to explore what happens when data providers are constrained differently, and whether algorithms of similar structure exist in such cases.

## 9 Broader Impact

The manipulation and fairness of algorithms form a significant barrier to the practical application of theoretically effective machine learning algorithms in many real-world use cases. With this work, we have attempted to address the important problem of data manipulation, which has many societal consequences. Data manipulation is one of many ways in which an individual can "game the system" in order to secure beneficial outcomes for themselves to the detriment of others. Thus, reducing the potential benefits of data manipulation is a worthwhile focus. Whilst this paper is primarily theoretical, we hope that our work will form a contributing step towards the safe, fair, and effective application of machine learning algorithms in more practical settings.

## Acknowledgments and Disclosure of Funding

This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Partnership grant.

## References

[1] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Commitment without regrets: Online learning in Stackelberg security games. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC '15, pages 61–78, New York, NY, USA, 2015. Association for Computing Machinery.

[2] Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Learning optimal commitment to overcome insecurity. In Advances in Neural Information Processing Systems 27, pages 1826–1834. Curran Associates, Inc., 2014.

[3] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] Michael Brückner and Tobias Scheffer. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 547–555, New York, NY, USA, 2011. ACM.

[5] Samuel Burer and Renato D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

[6] Samuel Burer and Renato D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.

[7] Yiling Chen, Chara Podimata, Ariel D. Procaccia, and Nisarg Shah. Strategyproof linear regression in high dimensions. In Proceedings of the 2018 ACM Conference on Economics and Computation, EC '18, pages 9–26, New York, NY, USA, 2018. Association for Computing Machinery.

[8] Yiling Chen, Yang Liu, and Chara Podimata. Grinding the space: Learning to classify against strategic agents. arXiv preprint arXiv:1911.04004, 2019.

[9] Miri Choi. Medical Cost Personal Datasets | Kaggle, 2018. URL https://www.kaggle.com/mirichoi0218/insurance/metadata.

[10] Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, EC '06, pages 82–90, New York, NY, USA, 2006. Association for Computing Machinery.

[11] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis.
Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.

[12] Ofer Dekel, Felix Fischer, and Ariel D. Procaccia. Incentive compatible regression learning. Journal of Computer and System Sciences, 76(8):759–777, December 2010.

[13] Werner Dinkelbach. On nonlinear fractional programming. Management Science, 13(7):492–498, 1967.

[14] Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, EC '18, pages 55–70, New York, NY, USA, 2018. ACM.

[15] Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.

[16] Dmytro Korzhyk, Vincent Conitzer, and Ronald Parr. Complexity of computing optimal Stackelberg strategies in security resource allocation games. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI '10, pages 805–810. AAAI Press, 2010.

[17] H. Moulin. On strategy-proofness and single peakedness. Public Choice, 35(4):437–455, 1980.

[18] Javier Perote and Juan Perote-Peña. Strategy-proof estimators for simple regression. Mathematical Social Sciences, 47(2):153–176, 2004.

[19] I. Pólik and T. Terlaky. A survey of the S-lemma. SIAM Review, 49(3):371–418, 2007.

[20] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In David Helmbold and Bob Williamson, editors, Computational Learning Theory, pages 416–426, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg. ISBN 978-3-540-44581-4.

[21] Arunesh Sinha, Fei Fang, Bo An, Christopher Kiekintveld, and Milind Tambe. Stackelberg security games: Looking beyond a decade of success. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 5494–5501. International Joint Conferences on Artificial Intelligence Organization, July 2018.

[22] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. The MIT Press, 2011. ISBN 026201646X.

[23] K. C. Toh, M. J. Todd, and R. H. Tütüncü. SDPT3: a MATLAB software package for semidefinite programming, version 1.3. Optimization Methods and Software, 11(1-4):545–581, 1999.

[24] Liang Tong, Sixie Yu, Scott Alfeld, and Yevgeniy Vorobeychik. Adversarial regression with multiple learners. In Proceedings of the 35th International Conference on Machine Learning, ICML, pages 4953–4961, 2018.

[25] Yong Xia, Shu Wang, and Ruey-Lin Sheu. S-lemma with equality and its applications. Mathematical Programming, 156(1-2):513–547, March 2016.

[26] Huan Xu, Constantine Caramanis, and Shie Mannor. Robust regression and lasso. In Advances in Neural Information Processing Systems 21, pages 1801–1808. Curran Associates, Inc., 2009.