# Task-based End-to-end Model Learning in Stochastic Optimization

Priya L. Donti (Dept. of Computer Science and Dept. of Engr. & Public Policy, Carnegie Mellon University, Pittsburgh, PA 15213; pdonti@cs.cmu.edu), Brandon Amos (Dept. of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; bamos@cs.cmu.edu), J. Zico Kolter (Dept. of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; zkolter@cs.cmu.edu)

**Abstract.** With the increasing popularity of machine learning techniques, it has become common to see prediction algorithms operating within some larger process. However, the criteria by which we train these algorithms often differ from the ultimate criteria on which we evaluate them. This paper proposes an end-to-end approach for learning probabilistic machine learning models in a manner that directly captures the ultimate task-based objective for which they will be used, within the context of stochastic programming. We present three experimental evaluations of the proposed approach: a classical inventory stock problem, a real-world electrical grid scheduling task, and a real-world energy storage arbitrage task. We show that the proposed approach can outperform both traditional modeling and purely black-box policy optimization approaches in these applications.

## 1 Introduction

While prediction algorithms commonly operate within some larger process, the criteria by which we train these algorithms often differ from the ultimate criteria on which we evaluate them: the performance of the full closed-loop system on the ultimate task at hand. For instance, instead of merely classifying images in a standalone setting, one may want to use these classifications within planning and control tasks such as autonomous driving. While a typical image classification algorithm might optimize accuracy or log likelihood, in a driving task we may ultimately care more about the difference between classifying a pedestrian as a tree vs. classifying a garbage can as a tree. Similarly, when we use a probabilistic prediction algorithm to generate forecasts of upcoming electricity demand, we then want to use these forecasts to minimize the costs of a scheduling procedure that allocates generation for a power grid. As these examples suggest, rather than using a generic loss, we may want to learn a model that approximates the ultimate task-based true loss.

This paper considers an end-to-end approach for learning probabilistic machine learning models that directly capture the objective of their ultimate task. Formally, we consider probabilistic models in the context of stochastic programming, where the goal is to minimize some expected cost over the model's probabilistic predictions, subject to some (potentially also probabilistic) constraints. As mentioned above, it is common to approach these problems in a two-step fashion: first to fit a predictive model to observed data by minimizing some criterion such as negative log-likelihood, and then to use this model to compute or approximate the necessary expected costs in the stochastic programming setting. While this procedure can work well in many instances, it ignores the fact that the true cost of the system (the optimization objective evaluated on actual instantiations in the real world) may benefit from a model that actually attains worse overall likelihood, but makes more accurate predictions over certain manifolds of the underlying space.
We propose to train a probabilistic model not (solely) for predictive accuracy, but so that, when it is later used within the loop of a stochastic programming procedure, it produces solutions that minimize the ultimate task-based loss. This formulation may seem somewhat counterintuitive, given that a perfect predictive model would of course also be the optimal model to use within a stochastic programming framework. However, the reality that all models do make errors illustrates that we should indeed look to a final task-based objective to determine the proper error tradeoffs within a machine learning setting. This paper proposes one way to evaluate task-based tradeoffs in a fully automated fashion, by computing derivatives through the solution to the stochastic programming problem in a manner that can improve the underlying model.

We begin by presenting background material and related work in areas spanning stochastic programming, end-to-end training, and optimizing alternative loss functions. We then describe our approach within the formal context of stochastic programming, and give a generic method for propagating task loss through these problems in a manner that can update the models. We report on three experimental evaluations of the proposed approach: a classical inventory stock problem, a real-world electrical grid scheduling task, and a real-world energy storage arbitrage task. We show that the proposed approach outperforms traditional modeling and purely black-box policy optimization approaches.

## 2 Background and related work

**Stochastic programming** Stochastic programming is a method for making decisions under uncertainty by modeling or optimizing objectives governed by a random process. It has applications in many domains such as energy [1], finance [2], and manufacturing [3], where the underlying probability distributions are either known or can be estimated. Common considerations include how to best model or approximate the underlying random variable, how to solve the resulting optimization problem, and how to then assess the quality of the resulting (approximate) solution [4]. In cases where the underlying probability distribution is known but the objective cannot be solved analytically, it is common to use Monte Carlo sample average approximation methods, which draw multiple i.i.d. samples from the underlying probability distribution and then use deterministic optimization methods to solve the resultant problems [5]. In cases where the underlying distribution is not known, it is common to learn or estimate some model from observed samples [6].

**End-to-end training** Recent years have seen a dramatic increase in the number of systems building on so-called end-to-end learning. Generally speaking, this term refers to systems where the end goal of the machine learning process is directly predicted from raw inputs [e.g. 7, 8]. In the context of deep learning systems, the term now traditionally refers to architectures where, for example, there is no explicit encoding of hand-tuned features on the data, but the system directly predicts what the image, text, etc. is from the raw inputs [9, 10, 11, 12, 13].
The context in which we use the term end-to-end is similar, but slightly more in line with its older usage: instead of (just) attempting to learn an output (with known and typically straightforward loss functions), we are specifically attempting to learn a model based upon an end-to-end task that the user is ultimately trying to accomplish. We feel that this concept of describing the entire closed-loop performance of the system as evaluated on the real task at hand is beneficial to add to the notion of end-to-end learning.

Also highly related to our work are recent efforts in end-to-end policy learning [14], using value iteration effectively as an optimization procedure in similar networks [15], and multi-objective optimization [16, 17, 18, 19]. These lines of work fit more with the pure end-to-end approach we discuss later on (where models are eschewed for pure function approximation methods), but conceptually the approaches have similar motivations in modifying typically-optimized policies to address some task(s) directly. Of course, the actual methodological approaches are quite different, given our specific focus on stochastic programming as the black box of interest in our setting.

**Optimizing alternative loss functions** There has been a great deal of work in recent years on using machine learning procedures to optimize different loss criteria than those naturally optimized by the algorithm. For example, Stoyanov et al. [20] and Hazan et al. [21] propose methods for optimizing loss criteria in structured prediction that are different from the inference procedure of the prediction algorithm; this work has also recently been extended to deep networks [22]. Recent work has also explored using auxiliary prediction losses to satisfy multiple objectives [23], learning dynamics models that maximize control performance in Bayesian optimization [24], and learning adaptive predictive models via differentiation through a meta-learning optimization objective [25].

The work we have found in the literature that most closely resembles our approach is that of Bengio [26], which uses a neural network model for predicting financial prices, and then optimizes the model based on returns obtained via a hedging strategy that employs it. We view this approach of both using a model and then tuning that model to adapt to a (differentiable) procedure as a philosophical predecessor to our own work. In concurrent work, Elmachtoub and Grigas [27] also propose an approach for tuning model parameters given optimization results, but in the context of linear programming and outside the context of deep networks. Whereas Bengio [26] and Elmachtoub and Grigas [27] use hand-crafted (but differentiable) algorithms to approximately attain some objective given a predictive model, our approach is tightly coupled to stochastic programming, where the explicit objective is to attempt to optimize the desired task cost via an exact optimization routine, but given underlying randomness. The notions of stochasticity are thus naturally quite different in our work, but we do hope that our work can bring back the original idea of task-based model learning. (Despite the fact that Bengio's original paper [26] is nearly 20 years old, virtually all follow-on work has focused on the financial application, and not on what we feel is the core idea of using a surrogate model within a task-driven optimization procedure.)
## 3 End-to-end model learning in stochastic programming

We first formally define the stochastic modeling and optimization problems with which we are concerned. Let $(x \in \mathcal{X}, y \in \mathcal{Y}) \sim \mathcal{D}$ denote standard input-output pairs drawn from some (real, unknown) distribution $\mathcal{D}$. We also consider actions $z \in \mathcal{Z}$ that incur some expected loss $L_{\mathcal{D}}(z) = \mathbb{E}_{x,y \sim \mathcal{D}}[f(x, y, z)]$. For instance, a power systems operator may try to allocate power generators $z$ given past electricity demand $x$ and future electricity demand $y$; this allocation's loss corresponds to the over- or under-generation penalties incurred given future demand instantiations. If we knew $\mathcal{D}$, then we could select optimal actions $z^\star_{\mathcal{D}} = \operatorname{argmin}_z L_{\mathcal{D}}(z)$. However, in practice, the true distribution $\mathcal{D}$ is unknown.

In this paper, we are interested in modeling the conditional distribution $y \mid x$ using some parameterized model $p(y \mid x; \theta)$ in order to minimize the real-world cost of the policy implied by this parameterization. Specifically, we find some parameters $\theta$ to parameterize $p(y \mid x; \theta)$ (as in the standard statistical setting) and then determine optimal actions $z^\star(x;\theta)$ (via stochastic optimization) that correspond to our observed input $x$ and the specific choice of parameters $\theta$ in our probabilistic model. Upon observing the costs of these actions $z^\star(x;\theta)$ relative to true instantiations of $x$ and $y$, we update our parameterized model $p(y \mid x; \theta)$ accordingly, calculate the resultant new $z^\star(x;\theta)$, and repeat. The goal is to find parameters $\theta$ such that the corresponding policy $z^\star(x;\theta)$ optimizes the loss under the true joint distribution of $x$ and $y$.

Explicitly, we wish to choose $\theta$ to minimize the task loss $L(\theta)$ in the context of $x, y \sim \mathcal{D}$, i.e.

$$L(\theta) = \mathbb{E}_{x,y \sim \mathcal{D}}\left[f(x, y, z^\star(x;\theta))\right]. \quad (1)$$

Since in reality we do not know the distribution $\mathcal{D}$, we obtain $z^\star(x;\theta)$ via a proxy stochastic optimization problem for a fixed instantiation of parameters $\theta$, i.e.

$$z^\star(x;\theta) = \operatorname*{argmin}_{z} \; \mathbb{E}_{y \sim p(y|x;\theta)}\left[f(x, y, z)\right]. \quad (2)$$

The above setting specifies $z^\star(x;\theta)$ using a simple (unconstrained) stochastic program, but in reality our decision may be subject to both probabilistic and deterministic constraints. We therefore consider more general decisions produced through a generic stochastic programming problem¹

$$
\begin{aligned}
z^\star(x;\theta) = \operatorname*{argmin}_{z} \;\; & \mathbb{E}_{y \sim p(y|x;\theta)}\left[f(x, y, z)\right] \\
\text{subject to} \;\; & \mathbb{E}_{y \sim p(y|x;\theta)}\left[g_i(x, y, z)\right] \le 0, \quad i = 1, \ldots, n_{\mathrm{ineq}} \\
& h_i(z) = 0, \quad i = 1, \ldots, n_{\mathrm{eq}}.
\end{aligned}
\quad (3)
$$

¹ It is standard to presume in stochastic programming that equality constraints depend only on decision variables (not random variables), as non-trivial random equality constraints are typically not possible to satisfy.

In this setting, the full task loss is more complex, since it captures both the expected cost and any deviations from the constraints. We can write this, for instance, as

$$
L(\theta) = \mathbb{E}_{x,y \sim \mathcal{D}}\left[f(x, y, z^\star(x;\theta))\right]
+ \sum_{i=1}^{n_{\mathrm{ineq}}} \mathbb{I}\left\{\mathbb{E}_{x,y \sim \mathcal{D}}\left[g_i(x, y, z^\star(x;\theta))\right] \le 0\right\}
+ \mathbb{E}_{x}\left[\sum_{i=1}^{n_{\mathrm{eq}}} \mathbb{I}\left\{h_i(z^\star(x;\theta)) = 0\right\}\right]
\quad (4)
$$

(where $\mathbb{I}(\cdot)$ is the indicator function that is zero when its constraints are satisfied and infinite otherwise). However, the basic intuition behind our approach remains the same for both the constrained and unconstrained cases: in both settings, we attempt to learn parameters of a probabilistic model not to produce strictly accurate predictions, but such that when we use the resultant model within a stochastic programming setting, the resulting decisions perform well under the true distribution. Actually solving this problem requires that we differentiate through the argmin operator $z^\star(x;\theta)$ of the stochastic programming problem.
This differentiation is not possible for all classes of optimization problems (the argmin operator may be discontinuous), but as we will show shortly, in many practical cases (including cases where the function and constraints are strongly convex) we can indeed efficiently compute these gradients even in the context of constrained optimization.

### 3.1 Discussion and alternative approaches

We highlight our approach in contrast to two alternative existing methods: traditional model learning and model-free black-box policy optimization. In traditional machine learning approaches, it is common to use $\theta$ to minimize the (conditional) negative log-likelihood of observed data under the model $p(y \mid x; \theta)$. This method corresponds to approximately solving the optimization problem

$$\operatorname*{minimize}_{\theta} \; \mathbb{E}_{x,y \sim \mathcal{D}}\left[-\log p(y \mid x; \theta)\right]. \quad (5)$$

If we then need to use the conditional distribution $y \mid x$ to determine actions $z$ within some later optimization setting, we commonly use the predictive model obtained from (5) directly. This approach has obvious advantages, in that the model-learning phase is well-justified independent of any future use in a task. However, it is also prone to poor performance in the common setting where the true distribution $y \mid x$ cannot be represented within the class of distributions parameterized by $\theta$, i.e., where the procedure suffers from model bias. Conceptually, the log-likelihood objective implicitly trades off between model error in different regions of the input/output space, but does so in a manner largely opaque to the modeler, and may ultimately not employ the correct tradeoffs for a given task.

In contrast, there is an alternative approach to solving (1) that we describe as the model-free black-box policy optimization approach. Here, we forgo learning any model at all of the random variable $y$. Instead, we attempt to learn a policy mapping directly from inputs $x$ to actions $z^\star(x;\theta)$ that minimize the loss $L(\theta)$ presented in (4) (where here $\theta$ defines the form of the policy itself, not a predictive model). While such model-free methods can perform well in many settings, they are often very data-inefficient, as the policy class must have enough representational power to describe sufficiently complex policies without recourse to any underlying model.²

² This distinction is roughly analogous to the policy search vs. model-based settings in reinforcement learning. However, for the purposes of this paper, we consider much simpler stochastic programs without the multiple rounds that occur in RL, and the extension of these techniques to a full RL setting remains as future work.

Our approach offers an intermediate setting, where we do still use a surrogate model to determine an optimal decision $z^\star(x;\theta)$, yet we adapt this model based on the task loss instead of any model prediction accuracy. In practice, we typically want to minimize some weighted combination of log-likelihood and task loss, which can be easily accomplished given our approach.

### 3.2 Optimizing task loss

To solve the generic optimization problem (4), we can in principle adopt a straightforward (constrained) stochastic gradient approach, as detailed in Algorithm 1 below.

**Algorithm 1** Task Loss Optimization
1: **input:** $\mathcal{D}$ // samples from true distribution
2: initialize $\theta$ // some initial parameterization
3: **for** $t = 1, \ldots, T$ **do**
4: sample $(x, y) \sim \mathcal{D}$
5: compute $z^\star(x;\theta)$ via Equation (3)
6: // step in violated constraint or objective
7: **if** $\exists i$ s.t. $g_i(x, y, z^\star(x;\theta)) > 0$ **then**
8: update $\theta$ with $\nabla_\theta\, g_i(x, y, z^\star(x;\theta))$
9: **else**
10: update $\theta$ with $\nabla_\theta\, f(x, y, z^\star(x;\theta))$
11: **end if**
12: **end for**
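As a minimal sketch (not the authors' released implementation), the loop in Algorithm 1 might be written as follows; here `model`, `solve_proxy_problem`, `f`, and `g_list` are hypothetical stand-ins for the probabilistic model $p(y \mid x; \theta)$, a differentiable solver for problem (3) (see Section 3.3), the task objective, and the inequality constraints:

```python
import torch

def task_loss_optimization(data_loader, model, solve_proxy_problem, f, g_list,
                           n_epochs=10, lr=1e-3):
    """Sketch of Algorithm 1: stochastic gradient steps on the task loss.

    Assumptions (not from the paper's code): `model(x)` returns the parameters
    of p(y|x; theta); `solve_proxy_problem(pred)` solves (3) and is
    differentiable w.r.t. `pred` (Section 3.3); `f` and the entries of
    `g_list` return scalar tensors for the task objective and constraints.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        for x, y in data_loader:                 # sample (x, y) ~ D
            pred = model(x)                      # distribution parameters of p(y|x; theta)
            z_star = solve_proxy_problem(pred)   # z*(x; theta) via Equation (3)
            g_vals = [g(x, y, z_star) for g in g_list]
            violated = [gv for gv in g_vals if gv.item() > 0]
            # Step in a violated constraint if one exists, else in the objective f.
            loss = violated[0] if violated else f(x, y, z_star)
            opt.zero_grad()
            loss.backward()                      # backpropagates through the argmin
            opt.step()
    return model
```

A weighted log-likelihood term can also be added to `loss` to recover the combined objective mentioned above.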
At each iteration, we solve the proxy stochastic programming problem (3) to obtain $z^\star(x;\theta)$, using the distribution defined by our current values of $\theta$. Then, we compute the true loss $L(\theta)$ using the observed value of $y$. If any of the inequality constraints $g_i$ in $L(\theta)$ are violated, we take a gradient step in the violated constraint; otherwise, we take a gradient step in the optimization objective $f$. We note that if any inequality constraints are probabilistic, Algorithm 1 must be adapted to employ mini-batches in order to determine whether these probabilistic constraints are satisfied. Alternatively, because even the $g_i$ constraints are probabilistic, it is common in practice to simply move a weighted version of these constraints to the objective, i.e., we modify the objective by adding some appropriate penalty times the positive part of the function, $\lambda [g_i(x, y, z)]_+$, for some $\lambda > 0$. In practice, this has the effect of taking gradient steps jointly in all the violated constraints and the objective in the case that one or more inequality constraints are violated, often resulting in faster convergence. Note that we need only move stochastic constraints into the objective; deterministic constraints on the policy itself will always be satisfied by the optimizer, as they are independent of the model.

Figure 1: Features $x$, model predictions $y$, and policy $z$ for the three experiments: (a) inventory stock problem, (b) load forecasting problem, (c) price forecasting problem.

### 3.3 Differentiating the optimization solution to a stochastic programming problem

While the above presentation highlights the simplicity of the proposed approach, it avoids the chief technical challenge of this approach, which is computing the gradient of an objective that depends upon the argmin operation $z^\star(x;\theta)$. Specifically, we need to compute the term

$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial z^\star}\,\frac{\partial z^\star}{\partial \theta}, \quad (6)$$

which involves the Jacobian $\frac{\partial z^\star}{\partial \theta}$. This is the Jacobian of the optimal solution with respect to the distribution parameters $\theta$. Recent approaches have looked into similar argmin differentiations [28, 29], though the methodology we present here is more general and handles the stochasticity of the objective.

At a high level, we begin by writing the KKT optimality conditions of the general stochastic programming problem (3). Differentiating these equations and applying the implicit function theorem gives a set of linear equations that we can solve to obtain the necessary Jacobians (with expectations over the distribution $y \sim p(y \mid x; \theta)$ denoted $\mathbb{E}_y$, and where $g$ is the vector of inequality constraints):

$$
\begin{bmatrix}
\nabla_z^2 \mathbb{E}_y[f(z)] + \sum_{i=1}^{n_{\mathrm{ineq}}} \lambda_i \nabla_z^2 \mathbb{E}_y[g_i(z)] & \left(\nabla_z \mathbb{E}_y[g(z)]\right)^T & A^T \\
\operatorname{diag}(\lambda)\left(\nabla_z \mathbb{E}_y[g(z)]\right) & \operatorname{diag}\!\left(\mathbb{E}_y[g(z)]\right) & 0 \\
A & 0 & 0
\end{bmatrix}
\begin{bmatrix}
\frac{\partial z}{\partial \theta} \\ \frac{\partial \lambda}{\partial \theta} \\ \frac{\partial \nu}{\partial \theta}
\end{bmatrix}
= -
\begin{bmatrix}
\frac{\partial \nabla_z \mathbb{E}_y[f(z)]}{\partial \theta} + \sum_{i=1}^{n_{\mathrm{ineq}}} \lambda_i \frac{\partial \nabla_z \mathbb{E}_y[g_i(z)]}{\partial \theta} \\
\operatorname{diag}(\lambda)\,\frac{\partial \mathbb{E}_y[g(z)]}{\partial \theta} \\
0
\end{bmatrix}
\quad (7)
$$

The terms in these equations look somewhat complex, but fundamentally, the left side gives the optimality conditions of the convex problem, and the right side gives the derivatives of the relevant functions at the achieved solution with respect to the governing parameter $\theta$.
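To make the system (7) concrete, consider the special case in which the proxy problem (3) reduces to a quadratic program, $\operatorname{minimize}_z \tfrac{1}{2} z^T Q z + c^T z$ subject to $Gz \le h$, $Az = b$, where only $c$ depends on $\theta$. The following numpy sketch is illustrative only (the paper instead uses the fast batched QP differentiation approach of [31]); it forms the KKT matrix at a solution $(z^\star, \lambda^\star, \nu^\star)$ and solves for the Jacobian $\partial z^\star / \partial \theta$:

```python
import numpy as np

def dz_dtheta(Q, G, h, A, z_star, lam_star, dc_dtheta):
    """Solve the linearized KKT system (7) for dz*/dtheta in the QP
    minimize_z 0.5 z^T Q z + c^T z  s.t.  G z <= h, A z = b,
    assuming only c depends on theta, so the right-hand side reduces to
    [-dc/dtheta; 0; 0]. Illustrative sketch, not the paper's solver.
    """
    n_ineq, n = G.shape
    n_eq = A.shape[0]
    p = dc_dtheta.shape[1]                  # number of model parameters theta

    # Left-hand side: differentiated KKT conditions of the QP at the solution.
    K = np.block([
        [Q,                     G.T,                      A.T],
        [np.diag(lam_star) @ G, np.diag(G @ z_star - h),  np.zeros((n_ineq, n_eq))],
        [A,                     np.zeros((n_eq, n_ineq)), np.zeros((n_eq, n_eq))],
    ])

    # Right-hand side: derivatives of the KKT residuals w.r.t. theta.
    rhs = -np.vstack([
        dc_dtheta,                          # d(grad_z f)/d(theta) = dc/dtheta
        np.zeros((n_ineq, p)),              # complementarity terms vanish here
        np.zeros((n_eq, p)),                # equality constraints independent of theta
    ])

    sol = np.linalg.solve(K, rhs)           # stacked [dz/dtheta; dlam/dtheta; dnu/dtheta]
    return sol[:n, :]                       # Jacobian dz*/dtheta, shape (n, p)
```

For the general problem (3), $Q$, $G$, and the right-hand side are replaced by the expectation terms appearing in (7).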
In practice, we calculate the right-hand terms by employing sequential quadratic programming [30] to find the optimal policy $z^\star(x;\theta)$ for the given parameters $\theta$, using a recently proposed approach for fast argmin differentiation of QPs [31] to solve the necessary linear equations; we then take the derivatives at the optimum produced by this strategy. Details of this approach are described in the appendix.

## 4 Experiments

We consider three applications of our task-based method: a synthetic inventory stock problem, a real-world energy scheduling task, and a real-world battery arbitrage task. We demonstrate that the task-based end-to-end approach can substantially improve upon other alternatives. Source code for all experiments is available at https://github.com/locuslab/e2e-model-learning.

### 4.1 Inventory stock problem

**Problem definition** To highlight the performance of the algorithm in a setting where the true underlying model is known to us, we consider a conditional variation of the classical inventory stock problem [4]. In this problem, a company must order some quantity $z$ of a product to minimize costs over some stochastic demand $y$, whose distribution in turn is affected by some observed features $x$ (Figure 1a). There are linear and quadratic costs on the amount of product ordered, plus different linear/quadratic costs on over-orders $[z - y]_+$ and under-orders $[y - z]_+$. The objective is given by

$$
f_{\mathrm{stock}}(y, z) = c_0 z + \tfrac{1}{2} q_0 z^2 + c_b [y - z]_+ + \tfrac{1}{2} q_b \left([y - z]_+\right)^2 + c_h [z - y]_+ + \tfrac{1}{2} q_h \left([z - y]_+\right)^2, \quad (8)
$$

where $[v]_+ \equiv \max\{v, 0\}$. For a specific choice of probability model $p(y \mid x; \theta)$, our proxy stochastic programming problem can then be written as

$$\operatorname*{minimize}_{z} \; \mathbb{E}_{y \sim p(y|x;\theta)}\left[f_{\mathrm{stock}}(y, z)\right]. \quad (9)$$

To simplify the setting, we further assume that the demands are discrete, taking on values $d_1, \ldots, d_k$ with probabilities (conditional on $x$) $(p_\theta)_i \equiv p(y = d_i \mid x; \theta)$. Thus our stochastic programming problem (9) can be written succinctly as a joint quadratic program³

$$
\begin{aligned}
\operatorname*{minimize}_{z \in \mathbb{R},\, z_b, z_h \in \mathbb{R}^k} \;\; & c_0 z + \tfrac{1}{2} q_0 z^2 + \sum_{i=1}^{k} (p_\theta)_i \left[ c_b (z_b)_i + \tfrac{1}{2} q_b (z_b)_i^2 + c_h (z_h)_i + \tfrac{1}{2} q_h (z_h)_i^2 \right] \\
\text{subject to} \;\; & d - z\mathbf{1} \le z_b, \quad z\mathbf{1} - d \le z_h, \quad z, z_h, z_b \ge 0.
\end{aligned}
\quad (10)
$$

Further details of this approach are given in the appendix.

³ This is referred to as a two-stage stochastic programming problem (though a very trivial example of one), where first-stage variables consist of the amount of product to buy before observing demand, and second-stage variables consist of how much to sell back or additionally purchase once the true demand has been revealed.
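For concreteness, the proxy problem (10) is a small joint QP that can be posed directly in a modeling language. A minimal cvxpy sketch follows; the cost coefficients match the representative instantiation used in Figure 2, while the discrete demand values and probabilities are hypothetical placeholders for what $p(y \mid x; \theta)$ would produce in the actual pipeline:

```python
import cvxpy as cp
import numpy as np

# Sketch of the proxy newsvendor QP (10) for a single input x.
c0, q0, cb, qb, ch, qh = 10., 2., 30., 14., 10., 2.   # costs from Figure 2
d = np.array([1., 2., 5., 10., 20.])                  # hypothetical discrete demands d_1..d_k
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])               # hypothetical (p_theta)_i = p(y = d_i | x; theta)

k = len(d)
z = cp.Variable()                                     # order quantity
zb = cp.Variable(k)                                   # under-order slack [y - z]_+ per scenario
zh = cp.Variable(k)                                   # over-order slack  [z - y]_+ per scenario

objective = (c0 * z + 0.5 * q0 * cp.square(z)
             + p @ (cb * zb + 0.5 * qb * cp.square(zb)
                    + ch * zh + 0.5 * qh * cp.square(zh)))
constraints = [d - z <= zb, z - d <= zh, z >= 0, zb >= 0, zh >= 0]

cp.Problem(cp.Minimize(objective), constraints).solve()
print(float(z.value))                                 # z*(x; theta) for this p_theta
```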
**Experimental setup** We examine our algorithm under two main conditions: where the true model is linear, and where it is nonlinear. In all cases, we generate problem instances by randomly sampling some $x \in \mathbb{R}^n$ and then generating $p(y \mid x)$ according to either $p(y \mid x) \propto \exp\left((\theta^\star)^T x\right)$ (linear true model) or $p(y \mid x) \propto \exp\left(((\theta^\star)^T x)^2\right)$ (nonlinear true model) for some $\theta^\star \in \mathbb{R}^{n \times k}$. We compare the following approaches on these tasks: 1) the QP allocation based upon the true model (which performs optimally); 2) MLE approaches (with linear or nonlinear probability models) that fit a model to the data, and then compute the allocation by solving the QP; 3) pure end-to-end policy-optimizing models (using linear or nonlinear hypotheses for the policy); and 4) our task-based learning models (with linear or nonlinear probability models). In all cases, we evaluate test performance by running on 1000 random examples, and evaluate performance over 10 folds of different true $\theta^\star$ parameters.

Figures 2(a) and (b) show the performance of these methods given a linear true model, with linear and nonlinear model hypotheses, respectively. As expected, the linear MLE approach performs best, as the true underlying model is in the class of distributions that it can represent, and thus solving the stochastic programming problem is a very strong proxy for solving the true optimization problem under the real distribution. While the true model is also contained within the nonlinear MLE's generic nonlinear distribution class, we see that this method requires more data to converge, and when given less data makes error tradeoffs that are ultimately not the correct tradeoffs for the task at hand; our task-based approach thus outperforms this approach. The task-based approach also substantially outperforms the policy-optimizing neural network, highlighting the fact that it is more data-efficient to run the learning process through a reasonable model. Note that here it does not make a difference whether we use the linear or nonlinear model in the task-based approach.

Figures 2(c) and (d) show performance in the case of a nonlinear true model, with linear and nonlinear model hypotheses, respectively. Case (c) represents the non-realizable case, where the true underlying distribution cannot be represented by the model hypothesis class. Here, the linear MLE, as expected, performs very poorly: it cannot capture the true underlying distribution, and thus the resultant stochastic programming solution would not be expected to perform well. The linear policy model similarly performs poorly. Importantly, the task-based approach with the linear model performs much better here: despite the fact that it still has a misspecified model, the task-based nature of the learning process lets us learn a different linear model than the MLE version, one that is particularly tuned to the distribution and loss of the task. Finally, as is also to be expected, the nonlinear models perform better than the linear models in this scenario, but again with the task-based nonlinear model outperforming the nonlinear MLE and end-to-end policy approaches.

Figure 2: Inventory problem results for 10 runs over a representative instantiation of true parameters ($c_0 = 10$, $q_0 = 2$, $c_b = 30$, $q_b = 14$, $c_h = 10$, $q_h = 2$). Cost is evaluated over 1000 testing samples (lower is better). The linear MLE performs best for a true linear model. In all other cases, the task-based models outperform their MLE and policy counterparts.

### 4.2 Load forecasting and generator scheduling

We next consider a more realistic grid-scheduling task, based upon over 8 years of real electrical grid data. In this setting, a power system operator must decide how much electricity generation $z \in \mathbb{R}^{24}$ to schedule for each hour in the next 24 hours based on some (unknown) distribution over electricity demand (Figure 1b). Given a particular realization $y$ of demand, we impose penalties for both generation excess ($\gamma_e$) and generation shortage ($\gamma_s$), with $\gamma_s \gg \gamma_e$. We also add a quadratic regularization term, indicating a preference for generation schedules that closely match demand realizations. Finally, we impose a ramping constraint $c_r$ restricting the change in generation between consecutive timepoints, reflecting physical limitations associated with quick changes in electricity output levels. These are reasonable proxies for the actual economic costs incurred by electrical grid operators when scheduling generation, and can be written as the stochastic programming problem

$$
\begin{aligned}
\operatorname*{minimize}_{z \in \mathbb{R}^{24}} \;\; & \sum_{i=1}^{24} \mathbb{E}_{y \sim p(y|x;\theta)}\left[ \gamma_s [y_i - z_i]_+ + \gamma_e [z_i - y_i]_+ + \tfrac{1}{2}(z_i - y_i)^2 \right] \\
\text{subject to} \;\; & |z_i - z_{i-1}| \le c_r \;\; \forall i,
\end{aligned}
\quad (11)
$$

where $[v]_+ \equiv \max\{v, 0\}$.
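Before specializing to the Gaussian closed form used below, note that the expectation in (11) could also be handled by the sample average approximation discussed in Section 2: draw demand scenarios from $p(y \mid x; \theta)$ and average the cost over them. The following cvxpy sketch of this SAA variant is shown for illustration only; the paper instead integrates the Gaussian density analytically, as described next:

```python
import cvxpy as cp
import numpy as np

def schedule_generation(y_samples, gamma_s=50., gamma_e=0.5, c_r=0.4):
    """Sample-average approximation of the scheduling problem (11).

    y_samples: (m, 24) array of demand scenarios drawn from p(y|x; theta).
    Returns the 24-hour generation schedule z. Default penalties match the
    representative values used in Figure 4.
    """
    m, T = y_samples.shape
    z = cp.Variable(T)
    cost = 0
    for y in y_samples:                       # average the cost over the m scenarios
        cost += cp.sum(gamma_s * cp.pos(y - z)
                       + gamma_e * cp.pos(z - y)
                       + 0.5 * cp.square(z - y)) / m
    constraints = [cp.abs(z[1:] - z[:-1]) <= c_r]   # ramping constraint between hours
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return z.value

# Example: Gaussian scenarios from predicted means mu and variances sigma2:
# y_samples = np.random.randn(100, 24) * np.sqrt(sigma2) + mu
```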
Assuming (as we will in our model) that $y_i$ is a Gaussian random variable with mean $\mu_i$ and variance $\sigma_i^2$, this expectation has a closed form that can be computed by analytically integrating the Gaussian PDF.⁴ We then use sequential quadratic programming (SQP) to iteratively approximate the resultant convex objective as a quadratic objective, iterate until convergence, and then compute the necessary Jacobians using the quadratic approximation at the solution, which gives the correct Hessian and gradient terms. Details are given in the appendix.

⁴ Part of the philosophy behind applying this approach here is that we know the Gaussian assumption is incorrect: the true underlying load is neither Gaussian distributed nor homoskedastic. However, these assumptions are exceedingly common in practice, as they enable easy model learning and exact analytical solutions. Thus, training the (still Gaussian) system with a task-based loss retains computational tractability while still allowing us to modify the distribution's parameters to improve actual performance on the task at hand.

To develop a predictive model, we make use of a highly-tuned load forecasting methodology. Specifically, we input the past day's electrical load and temperature, the next day's temperature forecast, and additional features such as non-linear functions of the temperatures, binary indicators of weekends or holidays, and yearly sinusoidal features. We then predict the electrical load over all 24 hours of the next day. We employ a 2-hidden-layer neural network for this purpose, with an additional residual connection from the inputs to the outputs initialized to the linear regression solution. An illustration of the architecture is shown in Figure 3. We train the model to minimize the mean squared error between its predictions and the actual load (giving the mean prediction $\mu_i$), and compute $\sigma_i^2$ as the (constant) empirical variance between the predicted and actual values. In all cases we use 7 years of data to train the model, and 1.75 subsequent years for testing.

Figure 3: The 2-hidden-layer neural network used to predict hourly electric load for the next day.

Using the (mean and variance) predictions of this base model, we obtain $z^\star(x;\theta)$ by solving the generator scheduling problem (11) and then adjusting network parameters to minimize the resultant task loss. We compare against a traditional stochastic programming model that minimizes just the RMSE, as well as a cost-weighted RMSE that periodically reweights training samples given their task loss.⁵ (A pure policy-optimizing network is not shown, as it could not sufficiently learn the ramp constraints. We could not obtain good performance for the policy optimizer even when ignoring this infeasibility.)

⁵ It is worth noting that a cost-weighted RMSE approach is only possible when direct costs can be assigned independently to each decision point, i.e., when costs do not depend on multiple decision points (as in this experiment). Our task-based method, however, accommodates the (typical) more general setting.

Figure 4: Results for 10 runs of the generation-scheduling problem for representative decision parameters $\gamma_e = 0.5$, $\gamma_s = 50$, and $c_r = 0.4$. (Lower loss is better.) As expected, the RMSE net achieves the lowest RMSE for its predictions. However, the task net outperforms the RMSE net on task loss by 38.6%, and the cost-weighted RMSE on task loss by 8.6%.

Figure 4 shows the performance of the three models. As expected, the RMSE model performs best with respect to the RMSE of its predictions (its objective).
However, the task-based model substantially outperforms the RMSE model when evaluated on task loss, the actual objective that the system operator cares about: specifically, we improve upon the performance of the traditional stochastic programming method by 38.6%. The cost-weighted RMSE's performance is extremely variable, and overall, the task net improves upon this method by 8.6%.

### 4.3 Price forecasting and battery storage

Finally, we consider a battery arbitrage task, based upon 6 years of real electrical grid data. Here, a grid-scale battery must operate over a 24-hour period based on some (unknown) distribution over future electricity prices (Figure 1c). For each hour, the operator must decide how much to charge ($z_{\mathrm{in}} \in \mathbb{R}^{24}$) or discharge ($z_{\mathrm{out}} \in \mathbb{R}^{24}$) the battery, thus inducing a particular state of charge in the battery ($z_{\mathrm{state}} \in \mathbb{R}^{24}$). Given a particular realization $y$ of prices, the operator optimizes over: 1) profits, 2) flexibility to participate in other markets, by keeping the battery near half its capacity $B$ (with weight $\lambda$), and 3) battery health, by discouraging rapid charging/discharging (with weight $\epsilon$, where $\epsilon < \lambda$). The battery also has a charging efficiency ($\gamma_{\mathrm{eff}}$), limits on speed of charge ($c_{\mathrm{in}}$) and discharge ($c_{\mathrm{out}}$), and begins at half charge. This can be written as the stochastic programming problem

$$
\begin{aligned}
\operatorname*{minimize}_{z_{\mathrm{in}}, z_{\mathrm{out}}, z_{\mathrm{state}} \in \mathbb{R}^{24}} \;\; & \mathbb{E}_{y \sim p(y|x;\theta)}\left[ \sum_{i=1}^{24} y_i (z_{\mathrm{in}} - z_{\mathrm{out}})_i \right] + \lambda \left\lVert z_{\mathrm{state}} - \tfrac{B}{2} \right\rVert^2 + \epsilon \lVert z_{\mathrm{in}} \rVert^2 + \epsilon \lVert z_{\mathrm{out}} \rVert^2 \\
\text{subject to} \;\; & z_{\mathrm{state},i+1} = z_{\mathrm{state},i} - z_{\mathrm{out},i} + \gamma_{\mathrm{eff}} z_{\mathrm{in},i} \;\; \forall i, \quad z_{\mathrm{state},1} = B/2, \\
& 0 \le z_{\mathrm{in}} \le c_{\mathrm{in}}, \quad 0 \le z_{\mathrm{out}} \le c_{\mathrm{out}}, \quad 0 \le z_{\mathrm{state}} \le B.
\end{aligned}
\quad (12)
$$

Assuming (as we will in our model) that $y_i$ is a random variable with mean $\mu_i$, this expectation has a closed form that depends only on the mean. Further details are given in the appendix.
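Because this expectation depends only on the predicted mean prices $\mu$, the inner storage problem (12) becomes a deterministic convex program once $\mu$ is fixed. A minimal cvxpy sketch, using the battery attributes from Table 1 and one example $(\lambda, \epsilon)$ setting from that table, is:

```python
import cvxpy as cp
import numpy as np

def schedule_battery(mu, lam=1., eps=0.5, B=1., gamma_eff=0.9, c_in=0.5, c_out=0.2):
    """Inner storage problem (12) given predicted mean prices mu (length 24).

    Since the expectation in (12) depends only on the mean, plugging in mu is
    exact under the model. Battery attributes default to those in Table 1;
    lam and eps are example values from one of its hyperparameter rows.
    """
    T = len(mu)
    z_in, z_out, z_state = cp.Variable(T), cp.Variable(T), cp.Variable(T)
    objective = (mu @ (z_in - z_out)                      # expected charging cost minus revenue
                 + lam * cp.sum_squares(z_state - B / 2)  # stay near half capacity
                 + eps * cp.sum_squares(z_in)             # discourage rapid charging
                 + eps * cp.sum_squares(z_out))           # ...and rapid discharging
    constraints = [
        z_state[1:] == z_state[:-1] - z_out[:-1] + gamma_eff * z_in[:-1],
        z_state[0] == B / 2,
        z_in >= 0, z_in <= c_in,
        z_out >= 0, z_out <= c_out,
        z_state >= 0, z_state <= B,
    ]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return z_in.value, z_out.value, z_state.value
```

In the end-to-end pipeline, gradients of the task loss with respect to $\mu$ are then passed back through this argmin as in Section 3.3.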
To develop a predictive model for the mean, we use an architecture similar to that described in Section 4.2. In this case, we input the past day's prices and temperature, the next day's load forecasts and temperature forecasts, and additional features such as non-linear functions of the temperatures and temporal features similar to those in Section 4.2. We again train the model to minimize the mean squared error between the model's predictions and the actual prices (giving the mean prediction $\mu_i$), using about 5 years of data to train the model and 1 subsequent year for testing. Using the mean predictions of this base model, we then solve the storage scheduling problem (12), again learning network parameters by minimizing the task loss. We compare against a traditional stochastic programming model that minimizes just the RMSE. Table 1 shows the performance of the two models.

Table 1: Task loss results for 10 runs each of the battery storage problem, given a lithium-ion battery with attributes $B = 1$, $\gamma_{\mathrm{eff}} = 0.9$, $c_{\mathrm{in}} = 0.5$, and $c_{\mathrm{out}} = 0.2$. (Lower loss is better.) Our task-based net on average somewhat improves upon the RMSE net, and demonstrates more reliable performance.

| $\lambda$ | $\epsilon$ | RMSE net | Task-based net (our method) | % Improvement |
| --- | --- | --- | --- | --- |
| 0.1 | 0.05 | 1.45 ± 4.67 | 2.92 ± 0.30 | 1.02 |
| 1 | 0.5 | 4.96 ± 4.85 | 2.28 ± 2.99 | 0.54 |
| 10 | 5 | 131 ± 145 | 95.9 ± 29.8 | 0.27 |
| 35 | 15 | 173 ± 7.38 | 170 ± 2.16 | 0.02 |

As energy prices are difficult to predict due to numerous outliers and price spikes, the models in this case are not as well-tuned as in our load forecasting experiment; thus, their performance is relatively variable. Even then, in all cases, our task-based model demonstrates better average performance than the RMSE model when evaluated on task loss, the objective most important to the battery operator (although the improvements are not statistically significant). More interestingly, our task-based method shows less (and in some cases, far less) variability in performance than the RMSE-minimizing method. Qualitatively, our task-based method hedges against perverse events such as price spikes that could substantially affect the performance of a battery charging schedule. The task-based method thus yields more reliable performance than a pure RMSE-minimizing method in the case that the models are inaccurate due to a high level of stochasticity in the prediction task.

## 5 Conclusions and future work

This paper proposes an end-to-end approach for learning machine learning models that will be used in the loop of a larger process. Specifically, we consider training probabilistic models in the context of stochastic programming to directly capture a task-based objective. Preliminary experiments indicate that our task-based learning model substantially outperforms MLE and policy-optimizing approaches in all but the (rare) case that the MLE model perfectly characterizes the underlying distribution. Our method also achieves a 38.6% performance improvement over a highly-optimized real-world stochastic programming algorithm for scheduling electricity generation based on predicted load. In the case of energy price prediction, where there is a high degree of inherent stochasticity in the problem, our method demonstrates more reliable task performance than a traditional predictive method. The task-based approach thus demonstrates promise in optimizing in-the-loop predictions. Future work includes an extension of our approach to stochastic learning models with multiple rounds, and further to model predictive control and full reinforcement learning settings.

## Acknowledgments

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1252522, and by the Department of Energy Computational Science Graduate Fellowship.

## References

[1] Stein W Wallace and Stein-Erik Fleten. Stochastic programming models in energy. Handbooks in Operations Research and Management Science, 10:637–677, 2003.

[2] William T Ziemba and Raymond G Vickson. Stochastic Optimization Models in Finance, volume 1. World Scientific, 2006.

[3] John A Buzacott and J George Shanthikumar. Stochastic Models of Manufacturing Systems, volume 4. Prentice Hall, Englewood Cliffs, NJ, 1993.

[4] Alexander Shapiro and Andy Philpott. A tutorial on stochastic programming. Manuscript. Available at www2.isye.gatech.edu/ashapiro/publications.html, 17, 2007.

[5] Jeff Linderoth, Alexander Shapiro, and Stephen Wright. The empirical behavior of sampling methods for stochastic programming. Annals of Operations Research, 142(1):215–241, 2006.

[6] R Tyrrell Rockafellar and Roger J-B Wets. Scenarios and policy aggregation in optimization under uncertainty. Mathematics of Operations Research, 16(1):119–147, 1991.

[7] Yann LeCun, Urs Muller, Jan Ben, Eric Cosatto, and Beat Flepp. Off-road obstacle avoidance through end-to-end learning. In NIPS, pages 739–746, 2005.
[8] Ryan W Thomas, Daniel H Friend, Luiz A DaSilva, and Allen B MacKenzie. Cognitive networks: adaptation and learning to achieve end-to-end performance objectives. IEEE Communications Magazine, 44(12):51–57, 2006.

[9] Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1457–1464. IEEE, 2011.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[11] Tao Wang, David J Wu, Adam Coates, and Andrew Y Ng. End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3304–3308. IEEE, 2012.

[12] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, volume 14, pages 1764–1772, 2014.

[13] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.

[14] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

[15] Aviv Tamar, Sergey Levine, Pieter Abbeel, Yi Wu, and Garrett Thomas. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2146–2154, 2016.

[16] Ken Harada, Jun Sakuma, and Shigenobu Kobayashi. Local search for multiobjective function optimization: Pareto descent method. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pages 659–666. ACM, 2006.

[17] Kristof Van Moffaert and Ann Nowé. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15(1):3483–3512, 2014.

[18] Hossam Mossalam, Yannis M Assael, Diederik M Roijers, and Shimon Whiteson. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707, 2016.

[19] Marco A Wiering, Maikel Withagen, and Mădălina M Drugan. Model-based multi-objective reinforcement learning. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014 IEEE Symposium on, pages 1–6. IEEE, 2014.

[20] Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. International Conference on Artificial Intelligence and Statistics, 15:725–733, 2011.

[21] Tamir Hazan, Joseph Keshet, and David A McAllester. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems, pages 1594–1602, 2010.

[22] Yang Song, Alexander G Schwing, Richard S Zemel, and Raquel Urtasun. Training deep neural networks via direct loss minimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 2169–2177, 2016.

[23] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

[24] Somil Bansal, Roberto Calandra, Ted Xiao, Sergey Levine, and Claire J Tomlin. Goal-driven dynamics learning via Bayesian optimization. arXiv preprint arXiv:1703.09260, 2017.
[25] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[26] Yoshua Bengio. Using a financial training criterion rather than a prediction criterion. International Journal of Neural Systems, 8(4):433–443, 1997.

[27] Adam N Elmachtoub and Paul Grigas. Smart "predict, then optimize". arXiv preprint arXiv:1710.08005, 2017.

[28] Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.

[29] Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. arXiv preprint arXiv:1609.07152, 2016.

[30] Paul T Boggs and Jon W Tolle. Sequential quadratic programming. Acta Numerica, 4:1–51, 1995.

[31] Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443, 2017.

[32] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[33] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.