# generalization_bounds_in_the_predictthenoptimize_framework__5b275049.pdf

Generalization Bounds in the Predict-then-Optimize Framework

Othman El Balghiti Rayens Capital Chicago, IL 60606 oe2161@columbia.edu

Adam N. Elmachtoub Columbia University New York, NY 10027 adam@ieor.columbia.edu

Paul Grigas University of California, Berkeley Berkeley, CA 94720 pgrigas@berkeley.edu

Ambuj Tewari University of Michigan Ann Arbor, MI 48109 tewaria@umich.edu

The predict-then-optimize framework is fundamental in many practical settings: predict the unknown parameters of an optimization problem, and then solve the problem using the predicted values of the parameters. A natural loss function in this environment is to consider the cost of the decisions induced by the predicted parameters, in contrast to the prediction error of the parameters. This loss function was recently introduced [7] and christened Smart Predict-then-Optimize (SPO) loss. Since the SPO loss is nonconvex and noncontinuous, standard results for deriving generalization bounds do not apply. In this work, we provide an assortment of generalization bounds for the SPO loss function. In particular, we derive bounds based on the Natarajan dimension that, in the case of a polyhedral feasible region, scale at most logarithmically in the number of extreme points, but, in the case of a general convex set, have poor dependence on the dimension. By exploiting the structure of the SPO loss function and an additional strong convexity assumption on the feasible region, we can dramatically improve the dependence on the dimension via an analysis and corresponding bounds that are akin to the margin guarantees in classiﬁcation problems.

1 Introduction

A common application of machine learning is to predict-then-optimize, i.e., predict unknown parameters of an optimization problem and then solve the optimization problem using the predictions. For instance, consider a navigation task that requires solving a shortest path problem. The key input into this problem are the travel times on each edge, typically called edge costs. Although the exact costs are not known at the time the problem is solved, the edge costs are predicted using a machine learning model trained on historical data consisting of features (time of day, weather, etc.) and edge costs (collected from app data). Fundamentally, a good model induces the optimization problem to ﬁnd good shortest paths, as measured by the true edge costs. In fact, recent work has been developed to consider how to solve problems in similar environments [3, 13, 6]. In particular, recent work [7] developed the Smart Predict-then-Optimize (SPO) loss function which exactly measures the quality of a prediction by the decision error, in contrast to the prediction error as measured by standard loss functions such as squared error. In this work, we seek to provide an assortment of generalization bounds for the SPO loss function.

33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada.

Speciﬁcally, we shall assume that our optimization task is to minimize a linear objective over a convex feasible region. In the shortest path example, the feasible region is a polyhedron. We assume the objective cost vector is not known at the time the optimization problem is solved, but rather predicted from data. A decision is made with respect to the predicted cost vector, and the SPO loss is computed by evaluating the decision on the true cost vector and then subtracting the optimal cost assuming knowledge of the true cost vector. Unfortunately, the SPO loss is nonconvex and non-Lipschitz, and therefore proving generalization bounds is not immediate.

Our results consider two cases, depending on whether the feasible region is a polyhedron or a strongly convex body. In all cases, we achieve a dependency of 1 n up to logarithmic terms, where n is the number of samples. In the polyhedral case, our generalization bound is formed by considering the Rademacher complexity of the class obtained by compositing the SPO loss with our predict-thenoptimize models. This in turn can be bounded by a term on the order of square root of the Natarajan dimension times the logarithm of the number of extreme points in the feasible region. Since the number of extreme points is typically exponential in the dimension, this logarithm is essential so that the bound is at most linear in the dimension. When our cost vector prediction models are restricted to linear, we show that the Natarajan dimension of the predict-then-optimize hypothesis class is simply bounded by the product of the two relevant dimensions, the feature dimension and the cost vector dimension, of the linear hypothesis class. Using this polyhedral approach, we show that a generalization bound is possible for any convex set by looking at a covering of the feasible region, although the dependency on the dimension is at least linear.

Fortunately, we show that when the feasible region is strongly convex, tighter generalization bounds can be obtained using margin-based methods. The proof relies on constructing an upper bound on the SPO-loss function and showing it is Lipschitz. Our margin based bounds have no explicit dependence on dimensions of input features and of cost vectors. It is expressed as a function of the multivariate Rademacher complexity of the vector-valued hypothesis class being used. We show that for suitably constrained linear hypothesis classes, we get a much improved dependence on problem dimensions. Since the SPO loss generalizes the 0-1 multiclass loss from multiclass classiﬁcation (see Example 3), our work can be seen as extending classic Natarajan-dimension based [20, Ch. 29] and margin-based generalization bounds [14] to the predict-then-optimize framework.

We note that one can generally construct an instance of a multiclass classiﬁcation problem from an instance of an SPO problem by considering the label of each observed cost vector to be the corresponding optimal solution which is without loss of generality an extreme point of the feasible set of solutions. The number of classes in the resulting multiclass problem is the number of extreme points of the feasible set. It is therefore important to use those generalization bounds from the multiclass classiﬁcation literature that are not too large in the number of classes. For data-independent worst-case bounds, the dependency is at best square root in the number of classes [9, 4]. In contrast, we provide data-independent bounds that grow only logarithmically in the number of extreme points. Using data-dependent (margin-based) approaches, [15, 16] successfully decreased this complexity to logarithm in the number of classes.

However, the reduction of an SPO problem to multiclass classiﬁcation throws away potentially important information, namely the numerical values of the cost vectors. Our margin-based approach removes any explicit dependency on the number of classes by exploiting the structure of the SPO loss. In Section 4, we make an important assumption that the feasible set is strongly convex (which necessarily implies that the number of extreme points is inﬁnite) and also heavily uses the structure of the SPO loss via the construction of the γ-margin SPO loss. This reﬁned analysis allows us to circumvent a naive bound that depends on the inﬁnite number of classes, which would be vacuous.

Even though we construct a Lipschitz upper bound on SPO loss in a general norm setting (Theorem 3), our margin bounds (Theorem 4) are stated in the ℓ2 norm setting. This is because the most general contraction type lemma for vector valued Lipschitz functions we know of only works for the ℓ2norm [17]. Similar results are available in the inﬁnity norm setting [3] but our understanding of general norms appears limited at present. Our work will hopefully provide the motivation to develop contraction inequalities for vector valued Lipschitz functions in a general norm setting.

2 Predict-then-optimize framework and preliminaries

We now describe the predict-then-optimize framework which is central to many applications of optimization in practice. Speciﬁcally, we assume that there is a nominal optimization problem of interest which models our downstream decision-making task. Furthermore, we assume that the nominal problem has a linear objective and that the decision variable w Rd and feasible region S Rd are well-deﬁned and known with certainty. However, the cost vector of the objective, c Rd, is not observed directly, and rather an associated feature vector x Rp is observed. Let D be the underlying joint distribution of (x, c) and let Dx be the conditional distribution of c given x. Then the goal for the decision maker is to solve

min w S Ec Dx[c T w|x] = min w S Ec Dx[c|x]T w (1)

The predict-then-optimize framework relies on using a prediction/estimate for Ec Dx[c|x], which we denote by ˆc, and solving the deterministic version of the optimization problem based on ˆc. We deﬁne P(ˆc) to be the optimization task with objective cost vector ˆc, namely

P(ˆc) : min w ˆc T w s.t. w S. (2)

We assume S Rd is a nonempty, compact, and convex set representing the feasible region. We let w ( ) : Rd S denote any oracle for solving P( ). That is, w ( ) is a ﬁxed deterministic mapping such that w (c) arg minw S c T w for all c Rd. For instance, if (2) corresponds to a linear, conic, or even a particular combinatorial or mixed-integer optimization problem (in which case S can be implicitly described as a convex set), then a commercial optimization solver or a specialized algorithm sufﬁces for w (c).

In this framework, we assume that predictions are made from a model that is learned on a training data set. Speciﬁcally, the sample training data (x1, c1), . . . , (xn, cn) is drawn i.i.d. from the joint distribution D, where xi X is a feature vector representing auxiliary information associated with the cost vector ci. We denote by H our hypothesis class of cost vector prediction models, thus for a function f H, we have that f : X Rd. Most approaches for learning a model f H from the training data are based on specifying a loss function that quantiﬁes the error in making prediction ˆc when the realized (true) cost vector is actually c. Following prior work [7], our primary loss function of interest is the smart predict-then-optimize loss function that directly takes the nominal optimization problem P( ) into account when measuring errors in predictions. Namely, we consider the SPO loss function (relative to the optimization oracle w ( )) deﬁned by:

ℓSPO(ˆc, c) := c T (w (ˆc) w (c)) ,

where ˆc Rd is the predicted cost vector and c C Rd is the true realized cost vector. Notice that ℓSPO(ˆc, c) exactly measures the excess cost incurred when making a suboptimal decision due to an imprecise cost vector prediction. Also, note that the SPO loss is non-negative and bounded above by ωS(C) for all ˆc Rd and c C where ωS(C) is a diameter-like quantity that we will deﬁne shortly. Let us now present several examples to illustrate the applicability and generality of the SPO loss function and framework. Example 1. In the shortest path problem, the feature vector x may include features such as weather and time information that may be used to predict the cost vector c representing the travel times along each edge of the network. In this case, the network is assumed to be given (e.g., the road network of a city) and the feasible region S is a network ﬂow polytope that represents ﬂow conservation and capacity constraints on the underlying network. Example 2. In portfolio optimization, the returns of potential investments can depend on many features which typically include historical returns, news, economic factors, social media, and others. We presume that these auxiliary features may be used to predict the vector of returns r of d different assets, but that the covariance matrix of the asset returns does not depend on the auxiliary features. Here we are interested in maximizing returns, so we let the cost vector c be deﬁned by c = r where r = r r RFe, r represents the vector of asset returns, r RF is the risk-free rate, and e is the vector of all ones. If Σ Rd d denotes the (positive semideﬁnite) covariance matrix of the asset returns and γ 0 is a desired bound on the overall variance (risk level) of the portfolio, then we may deﬁne the feasible region by S := {w : w T Σw γ, e T w 1, w 0}.

Example 3. Our setting also captures multi-class (and binary) classiﬁcation by the following characterization: S is the d-dimensional simplex where d is the number of classes, and C = { ei|i = 1, . . . , d} where ei is the ith unit vector in Rd. It is easy to see that each vertex of the simplex corresponds to a label, and correct/incorrect prediction has a loss of 0/1.

As pointed out before [7], the SPO loss function is generally non-convex, may even be discontinuous, and is in fact a strict generalization of the 0-1 loss function in binary classiﬁcation. Thus, optimizing the SPO loss via empirical risk minimization may be intractable even when H is a linear hypothesis class. To circumvent these difﬁculties, one approach is to optimize a convex surrogate loss [7]. Our focus is on deriving generalization bounds that hold uniformly over the class H, and thus are valid for any training approach, including using a surrogate or other loss function within the framework of empirical risk minimization. Notice that a generalization bound for the SPO loss directly translates to an upper bound guarantee for problem (1) that holds on average over the distribution.

Useful notation. We will make use of a generic given norm on w Rd, as well as the ℓq-norm denoted by q for q [1, ]. For the given norm on Rd, denotes the dual norm deﬁned by c := maxw: w 1 c T w. Let B( w, r) := {w : w w r} denote the ball of radius r centered at w, and we analogously deﬁne Bq( w, r) for the ℓq-norm and B (c, r) for the dual norm. For a set S Rd, we deﬁne the size of S in the norm by ρ(S) := supw S w . We analogously deﬁne ρq( ) for the ℓq-norm and ρ ( ) for the dual norm. We deﬁne the linear optimization gap of S with respect to c by ωS(c) := maxw S c T w minw S c T w , and for a set C Rd we slightly abuse notation by deﬁning ωS(C) := supc C ωS(c). Deﬁne w (H) := {x 7 w (f(x)) : f H}.

Rademacher complexity. Let us now brieﬂy review the notion of Rademacher complexity and its application in our framework. Recall that H is a hypothesis class of functions mapping from the feature space X to Rd. Given a ﬁxed sample (x1, c1)...(xn, cn), we deﬁne the empirical risk with respect to the SPO loss of a function f H as

ˆRSPO(f) = 1

i=1 ℓSPO(f(xi), ci) ,

and the expected risk as RSPO(f) = E(x,c) D[ℓSPO(f(x), c)]. We also deﬁne the empirical Rademacher complexity of H with respect to the SPO loss, i.e., the empirical Rademacher complexity of the function class obtained by composing ℓSPO with H by

ˆRn SPO(H) := Eσ

i=1 σiℓSPO(f(xi), ci)

where σi are i.i.d. Rademacher random variables for i = 1, . . . , n. The expected version of the Rademacher complexity is deﬁned as Rn SPO(H) := E h ˆRn SPO(H) i where the expectation is w.r.t. an i.i.d. sample drawn from the underlying distribution D. The following theorem is an application of the classical generalization bounds based on Rademacher complexity due to [1] to our setting. Theorem 1 (Bartlett and Mendelson [1]). Let H be a family of functions mapping from X to Rd. Then, for any δ > 0, with probability at least 1 δ over an i.i.d. sample drawn from the distribution D, each of the following holds for all f H

RSPO(f) ˆRSPO(f) + 2Rn SPO(H) + ωS(C)

RSPO(f) ˆRSPO(f) + 2 ˆRn SPO(H) + 3ωS(C)

3 Combinatorial dimension based generalization bounds

In this section, we consider the case where S is a polyhedron and derive generalization bounds based on bounding the Rademacher complexity of the SPO loss and applying Theorem 1. Since S is polyhedral, the optimal solution of (2) can be found by considering only the ﬁnite set of extreme points of S, which we denote by the set S. Since the number of extreme points may be exponential

in d, our goal is to provide bounds that are logarithmic in |S|. At the end of the section, we extend our analysis to any compact and convex feasible region S by extending the polyhedral analysis with a covering number argument.

In order to derive a bound on the Rademacher complexity, we will critically rely on the notion of the Natarajan dimension [19], which is an extension of the VC-dimension to the multiclass classiﬁcation setting and is deﬁned in our setting as follows. Deﬁnition 1 (Natarajan dimension). Suppose that S is a polyhedron and S is the set of its extreme points. Let F SX be a hypothesis space of functions mapping from X to S, and let X X be given. We say that F N-shatters X if there exists g1, g2 F such that

g1(x) = g2(x) for all x X

For all T X, there exists g F such that (i) for all x T, g(x) = g1(x) and (ii) for all x X\T, g(x) = g2(x).

The Natarajan dimension of F, denoted d N(F), is the maximal cardinality of a set N-shattered by F.

The Natarajan dimension is a measure for the richness of a hypothesis class. In Theorem 2, we show that the Rademacher complexity for the SPO loss can be bounded as a function of the Natarajan dimension of w (H) := {x 7 w (f(x)) : f H}. The proof follows a classical argument and makes strong use of Massart s lemma and the Natarajan lemma. Theorem 2. Suppose that S is a polyhedron and S is the set of its extreme points. Let H be a family of functions mapping from X to Rd. Then we have that

Rn SPO(H) ωS(C)

2d N(w (H)) log(n|S|2)

Furthermore, for any δ > 0, with probability at least 1 δ over an i.i.d. sample (x1, c1), . . . , (xn, cn) drawn from the distribution D, for all f H we have

RSPO(f) ˆRSPO(f) + 2ωS(C)

2d N(w (H)) log(n|S|2)

Next, we show that when H is restricted to the linear hypothesis class Hlin = {x 7 Bx : B Rd p}, then the Natarajan dimension of w (Hlin) can be bounded by dp. The proof relies on translating our problem to an instance of linear multiclass prediction problem and using a result of [5]. Corollary 1. Suppose that S is a polyhedron and S is the set of its extreme points. Let Hlin be the hypothesis class of all linear functions, i.e., Hlin = {x 7 Bx : B Rd p}. Then we have

d N(w (Hlin)) dp.

Furthermore, for any δ > 0, with probability at least 1 δ over an i.i.d. sample (x1, c1), . . . , (xn, cn) drawn from the distribution D, for all f Hlin we have

RSPO(f) ˆRSPO(f) + 2ωS(C)

2dp log(n|S|2)

Next, we will build off the previous results to prove generalization bounds in the case where S is a general compact convex set. The arguments we made earlier made extensive use of the extreme points of the polyhedron. Nevertheless, this combinatorial argument can be modiﬁed in order to derive similar results for general S. The approach is to approximate S by a grid of points corresponding to the smallest cardinality ϵ-covering of S. To optimize over these grid of points, we ﬁrst ﬁnd the optimal solution in S and then round to the nearest point in the grid. Both the grid representation and the rounding procedure can fortunately both be handled by similar arguments made in Theorems 2 and Corollary 1, yielding a generalization bound below. Corollary 2. Let S be any compact and convex set, and let Hlin be the hypothesis class of all linear functions. Then, for any δ > 0, with probability at least 1 δ over an i.i.d. sample (x1, c1), . . . , (xn, cn) drawn from the distribution D, for all f Hlin we have

RSPO(f) ˆRSPO(f) + 4dωS(C)

2p log(2nρ2(S)d)

Although the dependence on the sample size n in the above bound is favorable, the dependence on the number of features p and the dimension of the feasible region d is relatively weak. Given that the proofs of Corollary 2 and Theorem 2 are purely combinatorial and hold for worst-case distributions, this is not surprising. In the next section, we demonstrate how to exploit the structure of the SPO loss function and additional convexity properties of S in order to develop improved bounds.

4 Margin-based generalization bounds under strong convexity

In this section, we develop improved generalization bounds for the SPO loss function under the additional assumption that the feasible region S is strongly convex. Our developments are akin to and in fact are a strict generalization of the margin guarantees for binary classiﬁcation based on Rademacher complexity developed in [14]. We adopt the deﬁnition of strongly convex sets presented in [10, 8], which is reviewed in Deﬁnition 2 below. Recall that is a generic given norm on Rd and B( w, r) := {w : w w r} denotes the ball of radius r centered at w.

Deﬁnition 2. We say that a convex set S Rd is µ-strongly convex with respect to the norm if, for any w1, w2 S and for any λ [0, 1], it holds that:

B λw1 + (1 λ)w2, µ

2 λ(1 λ) w1 w2 2 S .

Informally, Deﬁnition 2 says that, for every convex combination of points in S, a ball of appropriate radius also lies in S. Several examples of strongly convex sets are presented in [10, 8], including ℓq and Schatten ℓq balls for q (1, 2], certain group norm balls, and generally any level set of a smooth and strongly convex function.

Our analysis herein relies on the following Proposition, which strengthens the ﬁrst-order general optimality condition for differentiable convex optimization problems under the additional assumption of strong convexity. Proposition 1 may be of independent interest and, to the best of our knowledge, has not appeared previously in the literature.

Proposition 1. Let S Rd be a non-empty µ-strongly convex set and let F( ) : Rd R be a convex and differentiable function. Consider the convex optimization problem:

min w F(w) s.t. w S . (3)

Then, w S is an optimal solution of (3) if and only if:

F( w)T (w w) µ

2 F( w) w w 2 for all w S . (4)

In fact, we prove a slightly more general version of the proposition where the function F need only be deﬁned on an open set containing S. In the case of linear optimization with F(w) = ˆc T w, the inequality (4) implies that w (ˆc) is the unique optimal solution of P(ˆc) whenever ˆc = 0 and µ > 0. Hence, in the context of the SPO loss function with a strongly convex feasible region, ˆc provides a degree of conﬁdence regarding the decision w (ˆc) implied by the cost vector prediction ˆc. This intuition motivates us to deﬁne the γ-margin SPO loss , which places a greater penalty on cost vector predictions near 0.

Deﬁnition 3. For a ﬁxed parameter γ > 0, given a cost vector prediction ˆc and a realized cost vector c, the γ-margin SPO loss ℓγ SPO(ˆc, c) is deﬁned as:

ℓγ SPO(ˆc, c) :=

( ℓSPO(ˆc, c) if ˆc > γ ˆc

γ ℓSPO(ˆc, c) + 1 ˆc

γ ωS(c) if ˆc γ

Recall that, for any ˆc, c Rd, it holds that ℓSPO(ˆc, c) ωS(c). Hence, we also have that ℓSPO(ˆc, c) ℓγ SPO(ˆc, c), that is the γ-margin SPO loss provides an upper bound on the SPO loss. Notice that the γ-margin SPO loss interpolates between the SPO loss and the upper bound ωS(c) whenever ˆc γ. The γ-margin SPO loss also satisﬁes a simple monotonicity property whereby ℓγ SPO(ˆc, c) ℓ γ SPO(ˆc, c) for any ˆc, c Rd and γ γ > 0. We can also deﬁne a hard γ-margin SPO loss that simply returns the upper bound ωS(c) whenever ˆc γ.

Deﬁnition 4. For a ﬁxed parameter γ 0, given a cost vector prediction ˆc and a realized cost vector c, the hard γ-margin SPO loss ℓγ SPO(ˆc, c) is deﬁned as:

ℓγ SPO(ˆc, c) := ℓSPO(ˆc, c) if ˆc > γ ωS(c) if ˆc γ

It is simple to see that ℓSPO(ˆc, c) ℓγ SPO(ˆc, c) ℓγ SPO(ˆc, c) ωS(c) for all ˆc, c Rd and γ > 0. Due to this additional upper bound, in all of the subsequent generalization bound results, the empirical γ-margin SPO loss can be replaced by its hard margin counterpart.

We are now ready to state a theorem concerning the Lipschitz properties of the optimization oracle w ( ) and the γ-margin SPO loss, which will then be used to derive margin-based generalization bounds. Theorem 3 below ﬁrst demonstrates that the optimization oracle w ( ) satisﬁes a Lipschitzlike property away from zero. Subsequently, this Lipschitz-like property is a key ingredient in demonstrating that the γ-margin SPO loss is Lipschitz. Theorem 3. Suppose that feasible region S is µ-strongly convex with µ > 0. Then, the optimization oracle w ( ) satisﬁes the following Lipschitz-like property: for any ˆc1, ˆc2 Rd, it holds that:

w (ˆc1) w (ˆc2) 1 µ min { ˆc1 , ˆc2 } ˆc1 ˆc2 . (5)

Moreover, for any ﬁxed c Rd and γ > 0, the γ-margin SPO loss is (5 c /γµ)-Lipschitz with respect to the dual norm , i.e., it holds that:

|ℓγ SPO(ˆc1, c) ℓγ SPO(ˆc2, c)| 5 c

γµ ˆc1 ˆc2 for all ˆc1, ˆc2 Rd . (6)

Proof. We present here only the proof of (5) and defer the proof of (6), which relies crucially on (5), to the supplementary materials. Let τ := min { ˆc1 , ˆc2 }. We assume without loss of generality that τ > 0 (otherwise the right-hand side of (5) is equal to + by convention). Applying Proposition 1 twice yields: ˆc T 1 (w (ˆc2) w (ˆc1)) µ

2 τ w (ˆc1) w (ˆc2) 2 , and ˆc T 2 (w (ˆc1) w (ˆc2)) µ

2 τ w (ˆc1) w (ˆc2) 2 . Adding the above two inequalities together yields:

µτ w (ˆc1) w (ˆc2) 2 (ˆc2 ˆc1)T (w (ˆc1) w (ˆc2)) ˆc1 ˆc2 w (ˆc1) w (ˆc2) ,

where the second inequality is Hölder s inequality. Dividing both sides of the above by µτ w (ˆc1) w (ˆc2) yields the desired result.

Margin-based generalization bounds. We are now ready to present our main generalization bounds of interest in the strongly convex case. Our results are based on combining Theorem 3 with the Lipschitz vector-contraction inequality for Rademacher complexities developed in [17], as well as the results of [1]. Following [3, 17], given a ﬁxed sample ((x1, c1)...(xn, cn)), we deﬁne the multivariate empirical Rademacher complexity of H as

ˆRn(H) := Eσ

j=1 σijfj(xi)

i=1 σT i f(xi)

where σij are i.i.d. Rademacher random variables for i = 1, . . . , n and j = 1, . . . , d, and σi = (σi1, . . . , σid)T . The expected version of the multivariate Rademacher complexity is deﬁned as

Rn(H) := E h ˆRn(H) i where the expectation is w.r.t. the i.i.d. sample drawn from the underlying distribution D.

Let us also deﬁne the empirical γ-margin SPO loss and the empirical Rademacher complexity of H with respect to the γ-margin SPO loss as follows:

ˆRγ SPO(f) := 1

i=1 ℓγ SPO(f(xi), ci) , and ˆRn γSPO(H) := Eσ

i=1 σiℓγ SPO(f(xi), ci)

where f H on the left side above and σi are i.i.d. Rademacher random variables for i = 1, . . . , n.

In the following two theorems, we focus only on the case of the ℓ2-norm set-up, i.e., the norm on the space of w variables as well as the norm on the space of cost vectors c are both the ℓ2-norm. To the best of our knowledge, extending the vector-contraction inequality of [17] to an arbitrary norm setting (or even the case of general ℓq-norms) remains an open question that would have interesting applications to our framework. Theorem 4 below presents our margin based generalization bounds for a ﬁxed γ > 0. Recall that C denotes the domain of the true cost vectors c, ρ2(C) = supc C c 2, and ωS(C) := supc C ωS(c). Theorem 4. Suppose that feasible region S is µ-strongly convex with respect to the ℓ2-norm with µ > 0, and let γ > 0 be ﬁxed. Let H be a family of functions mapping from X to Rd. Then, for any ﬁxed sample ((x1, c1)...(xn, cn)) we have that

ˆRn γSPO(H) 5

2ρ2(C) ˆRn(H)

Furthermore, for any δ > 0, with probability at least 1 δ over an i.i.d. sample Sn drawn from the distribution D, each of the following holds for all f H

RSPO(f) ˆRγ SPO(f) + 10

2ρ2(C)Rn(H)

RSPO(f) ˆRγ SPO(f) + 10

2ρ2(C) ˆRn(H)

γµ + 3ωS(C)

Proof. The bound on ˆRn γSPO(H) follows simply by combining Theorem 3, particularly (6), with equation (1) of [17]. The subsequent generalization bounds then simply follow since RSPO(f) Rγ SPO(f) for all f H and by applying the version of Theorem 1 for the γ-margin SPO loss.

It is often the case that the structure of the hypothesis class H naturally leads to a bound on Rn(H) that can have mild, even logarithmic, dependence on dimensions p and d. For example, let us consider the general setting of a constrained linear function class, namely H = HB := {f : f(x) = Bx for some B Rd p, B B}, where B Rd p. In Section A.2.4 of the supplementary materials, we derive a result that extends Theorem 3 of [12] to multivariate Rademacher complexity and provides a convenient way to bound Rn(HB) in the case when B corresponds to the level set of a strongly convex function. When B = {B : B F β} (where B F denotes the Frobenius norm of B) this

result implies that Rn(HB) ρ2(X)β

2d n , and when B = {B : B 1 β} (where B 1 denotes the

ℓ1-norm of the vectorized matrix B) this result implies that Rn(HB) ρ (X)β

6 log(pd) n . Note the absence of any explicit dependence on p in the ﬁrst bound and only logarithmic dependence on p, d in the second. We discuss the details of these and additional examples, including the group-lasso" norm, in Section A.2.4.

Theorem 4 may also be extended to bounds that hold uniformly over all values of γ (0, γ], where γ > 0 is a ﬁxed parameter. This extension is presented below in Theorem 5. Theorem 5. Suppose that feasible region S is µ-strongly convex with respect to the ℓ2-norm with µ > 0, and let γ > 0 be ﬁxed. Let H be a family of functions mapping from X to Rd. Then, for any δ > 0, with probability at least 1 δ over an i.i.d. sample drawn from the distribution D, each of the following holds for all f H and for all γ (0, γ]

RSPO(f) ˆRγ SPO(f) + 20

2ρ2(C)Rn(H)

log(log2(2 γ/γ))

RSPO(f) ˆRγ SPO(f) + 20

2ρ2(C) ˆRn(H)

log(log2(2 γ/γ))

Note that a natural choice for γ in Theorem 5 is γ supf H,x X f(x) 2, presuming that one can bound this quantity based on the properties of H and X. Example 4 below discusses how Theorems 4 and 5 relate to known results in binary classiﬁcation.

Example 4. In [7], it is shown that the SPO loss corresponds exactly to the 0-1 loss in binary classiﬁcation when d = 1, S = [ 1/2, +1/2], and C = { 1, +1}. In this case, using our notation, the margin value of a prediction ˆc is cˆc. It is also easily seen that ωS(C) = ρ2(C) = 1, the γ-margin SPO loss corresponds exactly to the margin loss (or ramp loss) that interpolates between 1 and 0 when cˆc [0, γ], and the hard γ-margin SPO loss corresponds exactly to the margin loss that returns 1 when cˆc γ and 0 otherwise. Furthermore, note that the interval S = [ 1

2] is 2-strongly convex [8]. Thus, except for some worse absolute constants, Theorems 4 and 5 generalize the well-known results on margin guarantees based on Rademacher complexity for binary classiﬁcation [14].

As in the case of binary classiﬁcation, the utility of Theorems 4 and 5 is strengthened when the underlying distribution D has a favorable margin property. Namely, the bounds in Theorems 4 and 5 can be much stronger than those of Corollary 2 when the distribution D and the sample are such that there exists a relatively large value of γ such that the empirical γ-margin SPO loss is small. One is thus motivated to choose the value of γ in a data-driven way so that, given a prediction function ˆf trained on the data Sn, the upper bound on ˆRSPO( ˆf) is minimized. Since Theroem 5 is a uniform result over γ (0, γ], this data-driven procedure for choosing γ is indeed valid.

5 Conclusions and Future Directions

Our work extends learning theory, as developed for binary and multiclass classiﬁcation, to predictthen-optimize problems in two very signiﬁcant directions: (i) obtaining worst-case generalization bounds using combinatorial parameters that measure the capacity of function classes, and (ii) exploiting special structure in data by deriving margin-based generalization bounds that scale more gracefully w.r.t. problem dimensions. It also motivates several interesting avenues for future work. Beyond the margin theory, other aspects of the problem that lead to improvements over worst case rates should be studied. In this respect, developing a theory of local Rademacher complexity for predict-then-optimize problems would be a promising approach. It will be good to use minimax constructions to provide matching lower bounds for our upper bounds. Extending the margin theory for strongly convex sets, where the SPO loss is ill-behaved only near 0, to polyhedral sets, where it can be much more ill-behaved, is a challenging but fascinating direction. Developing a theory of surrogate losses, especially convex ones, that are calibrated w.r.t. the non-convex SPO loss will also be extremely important. Finally, the assumption that the optimization objective is linear could be relaxed to include non-linear objectives.

Acknowledgments

OB thanks Rayens Capital for their support. AE acknowledges the support of NSF via grant CMMI-1763000. PG acknowledges the support of NSF Awards CCF-1755705 and CMMI-1762744. AT acknowledges the support of NSF via CAREER grant IIS-1452099 and of a Sloan Research Fellowship.

[1] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463 482, 2002.

[2] D. P. Bertsekas and A. Scientiﬁc. Convex optimization algorithms. Athena Scientiﬁc Belmont, 2015.

[3] D. Bertsimas and N. Kallus. From predictive to prescriptive analytics. ar Xiv preprint ar Xiv:1402.5481, 2014.

[4] A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz. Multiclass learnability and the erm principle. The Journal of Machine Learning Research, 16(1):2377 2404, 2015.

[5] A. Daniely and S. Shalev-Shwartz. Optimal learners for multiclass problems. In Conference on Learning Theory, pages 287 316, 2014.

[6] P. Donti, B. Amos, and J. Z. Kolter. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, pages 5484 5494, 2017.

[7] A. N. Elmachtoub and P. Grigas. Smart "predict, then optimize". ar Xiv preprint ar Xiv:1710.08005, 2017.

[8] D. Garber and E. Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. In 32nd International Conference on Machine Learning, ICML 2015, 2015.

[9] Y. Guermeur. Vc theory of large margin multi-category classiﬁers. Journal of Machine Learning Research, 8(Nov):2551 2594, 2007.

[10] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517 553, 2010.

[11] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865 1890, June 2012.

[12] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pages 793 800, 2009.

[13] Y.-h. Kao, B. V. Roy, and X. Yan. Directed regression. In Advances in Neural Information Processing Systems, pages 889 897, 2009.

[14] V. Koltchinskii, D. Panchenko, et al. Empirical margin distributions and bounding the generalization error of combined classiﬁers. The Annals of Statistics, 30(1):1 50, 2002.

[15] Y. Lei, U. Dogan, A. Binder, and M. Kloft. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, pages 2035 2043, 2015.

[16] J. Li, Y. Liu, R. Yin, H. Zhang, L. Ding, and W. Wang. Multi-class learning: From theory to algorithm. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1586 1595. Curran Associates, Inc., 2018.

[17] A. Maurer. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3 17. Springer, 2016.

[18] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.

[19] B. K. Natarajan. On learning sets and functions. Machine Learning, 4(1):67 97, 1989.

[20] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

[21] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.