Multi-Objective Non-parametric Sequential Prediction

Guy Uziel
Computer Science Department
Technion - Israel Institute of Technology
guziel@cs.technion.ac.il

Ran El-Yaniv
Computer Science Department
Technion - Israel Institute of Technology
rani@cs.technion.ac.il

Abstract

Online-learning research has mainly focused on minimizing a single objective function. In many real-world applications, however, several objective functions have to be considered simultaneously. Recently, an algorithm for dealing with several objective functions in the i.i.d. case was presented. In this paper, we extend the multi-objective framework to the case of stationary and ergodic processes, thus allowing dependencies among observations. We first identify an asymptotic lower bound for any prediction strategy and then present an algorithm whose predictions achieve the optimal solution while fulfilling any continuous and convex constraining criterion.

1 Introduction

In the traditional online learning setting, and in particular in sequential prediction under uncertainty, the learner is evaluated by a single loss function that is not completely known at each iteration [6]. When dealing with multiple objectives, since it is impossible to minimize all of the objectives simultaneously, one objective is chosen as the main function to minimize, leaving the others to be bounded by pre-defined thresholds. Methods for dealing with one objective function can be transformed to deal with several objective functions by giving each objective a pre-defined weight. The difficulty, however, lies in assigning an appropriate weight to each objective so as to keep the objectives below the given thresholds. This approach is very problematic in real-world applications, where the player is required to satisfy certain constraints.
For example, in online portfolio selection [4], the player may want to maximize wealth while keeping the risk (i.e., variance) contained below a certain threshold. Another example is the Neyman-Pearson (NP) classification paradigm (see, e.g., [19]), which extends classical binary classification: the goal is to learn a classifier achieving low classification error whose type I error is kept below a given threshold. In the adversarial setting, it is known that satisfying multiple objectives is generally impossible when the constraints are unknown a priori [18]. In the stochastic setting, Mahdavi et al. [17] proposed a framework for dealing with multiple objectives in the i.i.d. case. They proved that if there exists a solution that minimizes the main objective function while keeping the other objectives below given thresholds, then their algorithm will converge to the optimal solution.

In this work, we study online prediction with multiple objectives, but now consider the challenging general case where the unknown underlying process is stationary and ergodic, thus allowing the observations to depend on each other arbitrarily. Single-objective sequential prediction under stationary and ergodic sources has been considered in many papers and in various application domains. For example, in online portfolio selection, [12, 9, 10] proposed non-parametric online strategies that guarantee, under mild conditions, the best possible outcome. Another interesting example in this regard is the work on time-series prediction by [2, 8, 3]. A common theme of all these results is that the asymptotically optimal strategies are constructed by combining the predictions of many simple experts. The above strategies use a countably infinite set of experts, and the guarantees provided for these strategies are always asymptotic.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
This is no coincidence: it is well known that finite-sample guarantees for these methods cannot be achieved without additional strong assumptions on the source distribution [7, 16]. Approximate implementations of non-parametric strategies (which apply only a finite set of experts), however, turn out to work exceptionally well and, despite the inevitable approximation, are reported [11, 10, 9] to significantly outperform strategies designed to work in an adversarial, no-regret setting, in various domains. The algorithm presented in this paper utilizes as a sub-routine the Weak Aggregating Algorithm (WAA) of [21] and [13] to handle multiple objectives. While we discuss here the case of only two objective functions, our theorems can be extended easily to any fixed number of functions.

2 Problem Formulation

We consider the following prediction game. Let $\mathcal{X} \subseteq [-D, D]^d \subset \mathbb{R}^d$ be a compact observation space, where $D > 0$. At each round $n = 1, 2, \ldots$, the player is required to make a prediction $y_n \in \mathcal{Y}$, where $\mathcal{Y} \subseteq \mathbb{R}^m$ is a compact and convex set, based on the past observations $X_1^{n-1} \triangleq (x_1, \ldots, x_{n-1})$, $x_i \in \mathcal{X}$ ($X_1^0$ is the empty observation). After making the prediction $y_n$, the observation $x_n$ is revealed and the player suffers two losses, $u(y_n, x_n)$ and $c(y_n, x_n)$, where $u$ and $c$ are real-valued continuous functions that are convex w.r.t. their first argument. We view the player's prediction strategy as a sequence $S \triangleq \{S_n\}_{n=1}^{\infty}$ of forecasting functions $S_n : \mathcal{X}^{n-1} \to \mathcal{Y}$; that is, the player's prediction at round $n$ is given by $S_n(X_1^{n-1})$ (for brevity, we write $S(X_1^{n-1})$). Throughout the paper we assume that $x_1, x_2, \ldots$ are realizations of random variables $X_1, X_2, \ldots$ such that the stochastic process $(X_n)_{-\infty}^{\infty}$ is jointly stationary and ergodic, and $P(X_i \in \mathcal{X}) = 1$. The player's goal is to play the game with a strategy that minimizes the average $u$-loss, $\frac{1}{N}\sum_{i=1}^{N} u(S(X_1^{i-1}), x_i)$, while keeping the average $c$-loss $\frac{1}{N}\sum_{i=1}^{N} c(S(X_1^{i-1}), x_i)$ bounded below a prescribed threshold $\gamma$.
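The round structure above can be sketched as a simple simulation loop. The quadratic $u$-loss, the absolute-value $c$-loss, the uniform source, and the running-mean strategy below are illustrative assumptions only, not the paper's construction:

```python
import numpy as np

# Toy instance of the prediction game: at each round the player predicts from
# the past, nature reveals x_n, and the player suffers the two losses u and c.
rng = np.random.default_rng(0)
D, gamma, N = 1.0, 0.5, 1000

def u(y, x):                              # assumed main loss (squared error)
    return (y - x) ** 2

def c(y, x):                              # assumed constraining loss (prediction magnitude)
    return abs(y)

def strategy(past):                       # naive strategy: running mean of past observations
    return float(np.mean(past)) if past else 0.0

past, u_losses, c_losses = [], [], []
for n in range(1, N + 1):
    y_n = strategy(past)                  # prediction based on X_1^{n-1}
    x_n = float(rng.uniform(-D, D))       # nature reveals x_n
    u_losses.append(u(y_n, x_n))
    c_losses.append(c(y_n, x_n))
    past.append(x_n)

avg_u, avg_c = np.mean(u_losses), np.mean(c_losses)
# This particular strategy keeps the average c-loss below gamma, i.e. it is
# gamma-bounded for this toy source; its average u-loss is what a gamma-universal
# strategy would have to drive down to the optimal value.
```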
Formally, we define the following:

Definition 1 ($\gamma$-bounded strategy). A prediction strategy $S$ will be called $\gamma$-bounded if

$\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} c(S(X_1^{i-1}), X_i) \le \gamma$ almost surely.

The set of all $\gamma$-bounded strategies will be denoted $\mathcal{S}_\gamma$.

The well-known result of [1] states that for the single-objective case, the best possible outcome is $E\big[\max_{y \in \mathcal{Y}(\mathcal{F}_\infty)} E_{P_\infty}[u(y, X_0)]\big]$, where $P_\infty$ is the regular conditional probability distribution of $X_0$ given $\mathcal{F}_\infty$ (the $\sigma$-algebra generated by the infinite past $X_{-1}, X_{-2}, \ldots$), and the maximization is over the $\mathcal{F}_\infty$-measurable functions. This motivates us to define the following:

Definition 2 ($\gamma$-feasible process). We say that the stationary and ergodic process $\{X_i\}_{-\infty}^{\infty}$ is $\gamma$-feasible w.r.t. the functions $u$ and $c$ if, for a threshold $\gamma > 0$, there exists some $y' \in \mathcal{Y}(\mathcal{F}_\infty)$ such that $E[c(y', X_0)] < \gamma$.

If $\gamma$-feasibility holds, then we will denote by $y^*$ ($y^*$ is not necessarily unique) the minimizer of the following minimization problem:

$\underset{y \in \mathcal{Y}(\mathcal{F}_\infty)}{\text{minimize}} \; E[u(y, X_0)] \quad \text{subject to} \quad E[c(y, X_0)] \le \gamma,$  (1)

and we define the $\gamma$-feasible optimal value as $V^* \triangleq E\big[E_{P_\infty}[u(y^*, X_0)]\big]$.

Note that problem (1) is a convex minimization problem over $\mathcal{Y}(\mathcal{F}_\infty)$. Therefore, the problem is equivalent to finding the saddle point of the Lagrangian function [15], namely, $\min_{y \in \mathcal{Y}(\mathcal{F}_\infty)} \max_{\lambda \in \mathbb{R}_+} \mathcal{L}(y, \lambda)$, where the Lagrangian is $\mathcal{L}(y, \lambda) \triangleq E[u(y, X_0)] + \lambda\,(E[c(y, X_0)] - \gamma)$. We denote the optimal dual variable by $\lambda^*$ and assume that $\mathcal{L}$ can be maximized by a unique $\mathcal{F}_\infty$-measurable function $\lambda^*(\cdot)$. Moreover, we set a constant $\lambda_{\max}$ such that $\lambda_{\max} > \lambda^*(\cdot)$ $P$-a.s., and set $\Lambda \triangleq [0, \lambda_{\max}]$. We also define the instantaneous Lagrangian function as

$l(y, \lambda, x) \triangleq u(y, x) + \lambda\,(c(y, x) - \gamma).$  (2)

In brief, we are seeking a strategy $S \in \mathcal{S}_\gamma$ that is as good as any other $\gamma$-bounded strategy in terms of the average $u$-loss, when the underlying process is $\gamma$-feasible. Such a strategy will be called $\gamma$-universal.

3 Optimality of $V^*$

In this section, we show that the average $u$-loss of any $\gamma$-bounded prediction strategy cannot be smaller than $V^*$, the $\gamma$-feasible optimal value.
This result is a generalization of the well-known result of [1] regarding the best possible outcome under a single objective. Before stating and proving this optimality result, we state three lemmas that will be used repeatedly in this paper. The first lemma is known as Breiman's generalized ergodic theorem. The second and third lemmas concern the continuity of the saddle point w.r.t. the probability distribution; their proofs appear in the supplementary material.

Lemma 1 (Ergodicity, [5]). Let $X = \{X_i\}_{-\infty}^{\infty}$ be a stationary and ergodic process. For each positive integer $i$, let $T^i$ denote the operator that shifts any sequence by $i$ places to the left. Let $f_1, f_2, \ldots$ be a sequence of real-valued functions such that $\lim_{n\to\infty} f_n(X) = f(X)$ almost surely, for some function $f$. Assume that $E[\sup_n |f_n(X)|] < \infty$. Then,

$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} f_i(T^i X) = E[f(X)]$ almost surely.

Lemma 2 (Continuity and Minimax). Let $\mathcal{Y}, \Lambda, \mathcal{X}$ be compact real spaces, and let $l : \mathcal{Y} \times \Lambda \times \mathcal{X} \to \mathbb{R}$ be a continuous function. Denote by $\mathcal{P}(\mathcal{X})$ the space of all probability measures on $\mathcal{X}$ (equipped with the topology of weak convergence). Then the following function $\mathcal{L}^* : \mathcal{P}(\mathcal{X}) \to \mathbb{R}$ is continuous:

$\mathcal{L}^*(Q) = \inf_{y \in \mathcal{Y}} \sup_{\lambda \in \Lambda} E_Q[l(y, \lambda, x)].$  (3)

Moreover, for any $Q \in \mathcal{P}(\mathcal{X})$,

$\inf_{y \in \mathcal{Y}} \sup_{\lambda \in \Lambda} E_Q[l(y, \lambda, x)] = \sup_{\lambda \in \Lambda} \inf_{y \in \mathcal{Y}} E_Q[l(y, \lambda, x)].$

Lemma 3 (Continuity of the optimal selection). Let $\mathcal{Y}, \Lambda, \mathcal{X}$ be compact real spaces. Then there exist two measurable selection functions $h_y, h_\lambda$ such that

$h_y(Q) \in \arg\min_{y \in \mathcal{Y}} \max_{\lambda \in \Lambda} E_Q[l(y, \lambda, x)], \qquad h_\lambda(Q) \in \arg\max_{\lambda \in \Lambda} \min_{y \in \mathcal{Y}} E_Q[l(y, \lambda, x)]$

for any $Q \in \mathcal{P}(\mathcal{X})$. Moreover, let $\mathcal{L}^*$ be as defined in Equation (3). Then the set

$\mathrm{Gr}(\mathcal{L}^*) \triangleq \{(u^*, v^*, Q) \mid u^* \in h_y(Q),\ v^* \in h_\lambda(Q),\ Q \in \mathcal{P}(\mathcal{X})\}$

is closed in $\mathcal{Y} \times \Lambda \times \mathcal{P}(\mathcal{X})$.

The importance of Lemma 3 stems from the fact that it establishes the continuity properties of the multi-valued correspondences $Q \to h_y(Q)$ and $Q \to h_\lambda(Q)$. In particular, if for the limiting distribution $Q^*$ the optimal set is a singleton, then $Q \to h_y(Q)$ and $Q \to h_\lambda(Q)$ are continuous at $Q^*$. We are now ready to prove the optimality of $V^*$.
Theorem 1 (Optimality of $V^*$). Let $\{X_i\}_{-\infty}^{\infty}$ be a $\gamma$-feasible process. Then, for any strategy $S \in \mathcal{S}_\gamma$, the following holds a.s.:

$\liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} u(S(X_1^{i-1}), X_i) \ge V^*.$

Proof. For any given strategy $S \in \mathcal{S}_\gamma$, we will look at the following sequence:

$\frac{1}{N}\sum_{i=1}^{N} l(S(X_1^{i-1}), \lambda_i^*, X_i),$  (4)

where $\lambda_i^* \triangleq h_\lambda(P_{X_i \mid X_1^{i-1}})$. Observe that

$\frac{1}{N}\sum_{i=1}^{N} l(S(X_1^{i-1}), \lambda_i^*, X_i) = \frac{1}{N}\sum_{i=1}^{N} A_i + \frac{1}{N}\sum_{i=1}^{N} E\big[l(S(X_1^{i-1}), \lambda_i^*, X_i) \mid X_1^{i-1}\big],$

where $A_i \triangleq l(S(X_1^{i-1}), \lambda_i^*, X_i) - E\big[l(S(X_1^{i-1}), \lambda_i^*, X_i) \mid X_1^{i-1}\big]$. Since $\{A_i\}$ is a martingale difference sequence, its average converges to 0 a.s. by the strong law of large numbers (see, e.g., [20]). Therefore,

$\liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(S(X_1^{i-1}), \lambda_i^*, X_i) = \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} E\big[l(S(X_1^{i-1}), \lambda_i^*, X_i) \mid X_1^{i-1}\big] \ge \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \min_{y} E\big[l(y, \lambda_i^*, X_i) \mid X_1^{i-1}\big],$  (5)

where the minimum is taken w.r.t. all the $\sigma(X_1^{i-1})$-measurable functions. Because the process is stationary, we get, for $\hat{\lambda}_i^* \triangleq h_\lambda(P_{X_0 \mid X_{-i}^{-1}})$,

$(5) = \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \min_{y} E\big[l(y, \hat{\lambda}_i^*, X_0) \mid X_{-i}^{-1}\big] = \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}^*(P_{X_0 \mid X_{-i}^{-1}}).$  (6)

Using Lévy's zero-one law, $P_{X_0 \mid X_{-i}^{-1}} \to P_\infty$ weakly as $i$ approaches $\infty$, and from Lemma 2 we know that $\mathcal{L}^*$ is continuous. Therefore, we can apply Lemma 1 and get that a.s.

$(6) = E[\mathcal{L}^*(P_\infty)] = E\big[E_{P_\infty}[l(y^*, \lambda^*, X_0)]\big].$  (7)

Note also that, due to the complementary slackness condition of the optimal solution, i.e., $E[\lambda^*(E_{P_\infty}[c(y^*, X_0)] - \gamma)] = 0$, we get $(7) = E[E_{P_\infty}[u(y^*, X_0)]] = V^*$. From the uniqueness of $\lambda^*$, and using Lemma 3, $\hat{\lambda}_i^* \to \lambda^*$ as $i$ approaches $\infty$. Moreover, since $l$ is continuous on a compact set, $l$ is also uniformly continuous. Therefore, for any given $\epsilon > 0$, there exists $\delta > 0$ such that if $|\lambda' - \lambda| < \delta$, then $|l(y, \lambda', x) - l(y, \lambda, x)| < \epsilon$ for any $y \in \mathcal{Y}$ and $x \in \mathcal{X}$. Therefore, there exists $i_0$ such that if $i > i_0$ then $|l(y, \hat{\lambda}_i^*, x) - l(y, \lambda^*, x)| < \epsilon$ for any $y \in \mathcal{Y}$ and $x \in \mathcal{X}$.
Thus, a.s.,

$\liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(S(X_1^{i-1}), \lambda^*, X_i) = \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \Big( l(S(X_1^{i-1}), \hat{\lambda}_i^*, X_i) + \big[ l(S(X_1^{i-1}), \lambda^*, X_i) - l(S(X_1^{i-1}), \hat{\lambda}_i^*, X_i) \big] \Big) \ge \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(S(X_1^{i-1}), \hat{\lambda}_i^*, X_i) - \epsilon,$

and since $\epsilon$ is arbitrary,

$\liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(S(X_1^{i-1}), \lambda^*, X_i) \ge \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(S(X_1^{i-1}), \hat{\lambda}_i^*, X_i).$

Therefore, we can conclude that $\liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(S(X_1^{i-1}), \lambda^*, X_i) \ge V^*$ a.s. We finish the proof by noticing that since $S \in \mathcal{S}_\gamma$, by definition $\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} c(S(X_1^{i-1}), X_i) \le \gamma$ a.s., and since $\lambda^*$ is non-negative, we get the desired result. ∎

The above theorem also provides the motivation to find the saddle point of the Lagrangian $\mathcal{L}$. Therefore, for the remainder of the paper we will use the loss function $l$ as defined in Equation (2).

4 Minimax Histogram Based Aggregation

We are now ready to present our algorithm, Minimax Histogram Based Aggregation (MHA), and to prove that its predictions are as good as those of the best strategy. By Theorem 1 we can restate our goal: find a prediction strategy $S \in \mathcal{S}_\gamma$ such that for any $\gamma$-feasible process $\{X_i\}_{-\infty}^{\infty}$ the following holds:

$\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} u(S(X_1^{i-1}), X_i) = V^*$ a.s.

Such a strategy will be called $\gamma$-universal. We do so by maintaining a countable set of experts $\{H_{k,h}\}$, $k, h = 1, 2, \ldots$, which are constructed in a similar manner to the experts used in [10].

Algorithm 1 Minimax Histogram Based Aggregation (MHA)
Input: countable set of experts $\{H_{k,h}\}$, $y_0 \in \mathcal{Y}$, $\lambda_0 \in \Lambda$, initial probabilities $\{\alpha_{k,h}\}$.
For $n = 0$ to $\infty$:
  Play $y_n, \lambda_n$. Nature reveals $x_n$; suffer loss $l(y_n, \lambda_n, x_n)$.
  Update the cumulative losses of the experts:
    $l_{y,n}^{k,h} \triangleq \sum_{i=0}^{n} l(y_i^{k,h}, \lambda_i, x_i), \qquad l_{\lambda,n}^{k,h} \triangleq \sum_{i=0}^{n} l(y_i, \lambda_i^{k,h}, x_i).$
  Update the experts' weights:
    $w_{n+1}^{y,(k,h)} \triangleq \alpha_{k,h} \exp\big(-\tfrac{1}{\sqrt{n}}\, l_{y,n}^{k,h}\big), \qquad p_{n+1}^{y,(k,h)} \triangleq \frac{w_{n+1}^{y,(k,h)}}{\sum_{h=1}^{\infty}\sum_{k=1}^{\infty} w_{n+1}^{y,(k,h)}},$
    $w_{n+1}^{\lambda,(k,h)} \triangleq \alpha_{k,h} \exp\big(\tfrac{1}{\sqrt{n}}\, l_{\lambda,n}^{k,h}\big), \qquad p_{n+1}^{\lambda,(k,h)} \triangleq \frac{w_{n+1}^{\lambda,(k,h)}}{\sum_{h=1}^{\infty}\sum_{k=1}^{\infty} w_{n+1}^{\lambda,(k,h)}}.$
  Choose $y_{n+1}$ and $\lambda_{n+1}$ as follows:
    $y_{n+1} = \sum_{k,h} p_{n+1}^{y,(k,h)}\, y_{n+1}^{k,h}, \qquad \lambda_{n+1} = \sum_{k,h} p_{n+1}^{\lambda,(k,h)}\, \lambda_{n+1}^{k,h}.$
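The weight update in Algorithm 1 follows the Weak Aggregating Algorithm pattern: each expert's weight scales exponentially with its cumulative loss divided by $\sqrt{n}$, and the played point is the weight-averaged expert prediction. A minimal sketch for the minimizing ($y$) side with a finite pool of constant experts follows; the pool, the loss, and the data stream are illustrative assumptions, and MHA's actual experts are the histogram-based ones of Theorem 2:

```python
import numpy as np

rng = np.random.default_rng(2)
experts = np.linspace(0.0, 1.0, 11)       # constant experts: each always plays its own y
alpha = np.full(experts.size, 1.0 / experts.size)   # initial probabilities alpha_{k,h}
cum_loss = np.zeros(experts.size)         # cumulative losses l^{k,h}_{y,n}

def loss(y, x):                           # assumed convex loss standing in for l(., lambda_n, x_n)
    return (y - x) ** 2

T, total = 5000, 0.0
for n in range(1, T + 1):
    w = alpha * np.exp(-cum_loss / np.sqrt(n))   # WAA weights (minimizing side)
    p = w / w.sum()
    y_n = float(p @ experts)              # aggregated prediction y_n
    x_n = 0.7 + 0.05 * rng.standard_normal()     # nature reveals x_n
    total += loss(y_n, x_n)
    cum_loss += loss(experts, x_n)        # every expert's loss is observed

avg_regret = total / T - cum_loss.min() / T      # vs. best expert in hindsight
```

For the maximizing ($\lambda$) side the sign in the exponent flips, rewarding experts with large cumulative loss. The WAA guarantee is that the aggregated average loss exceeds the best expert's by at most a term vanishing like $1/\sqrt{T}$, which is the role played by Equations (9) and (10) in the proof of Theorem 3.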
Each expert is defined using a histogram which becomes finer as $h$ grows, allowing us to construct an empirical measure on $\mathcal{X}$. An expert $H_{k,h}$ therefore outputs a pair $(y_i^{k,h}, \lambda_i^{k,h}) \in \mathcal{Y} \times \Lambda$ at round $i$. This pair is the minimax pair w.r.t. the expert's empirical measure. We show that those empirical measures converge weakly to $P_\infty$; thus, the experts' predictions will converge to $V^*$. Our algorithm outputs at round $i$ a pair $(y_i, \lambda_i) \in \mathcal{Y} \times \Lambda$, where the sequence of predictions $y_1, y_2, \ldots$ tries to minimize the average loss $\frac{1}{N}\sum_{i=1}^{N} l(y, \lambda_i, x_i)$ and the sequence of predictions $\lambda_1, \lambda_2, \ldots$ tries to maximize the average loss $\frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda, x_i)$. Each of $y_i$ and $\lambda_i$ is an aggregation of the predictions $y_i^{k,h}$ and $\lambda_i^{k,h}$, $k, h = 1, 2, \ldots$, respectively. In order to ensure that the performance of MHA will be as good as that of any single expert, for both the $y$ and the $\lambda$ predictions, we apply the Weak Aggregating Algorithm of [21] and [13] twice, alternately. Theorem 2 states that the sequence of points selected by the experts above converges to the optimal solution; the proof of Theorem 2 and the explicit construction of the experts appear in the supplementary material. Then, in Theorem 3, we prove that MHA applied to the experts defined in Theorem 2 generates a sequence of predictions that is $\gamma$-bounded and as good as any other strategy w.r.t. any $\gamma$-feasible process.

Theorem 2. Assume that $\{X_i\}_{-\infty}^{\infty}$ is a $\gamma$-feasible process. Then, it is possible to construct a countable set of experts $\{H_{k,h}\}$ for which

$\lim_{k\to\infty} \lim_{h\to\infty} \lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i^{k,h}, \lambda_i^{k,h}, X_i) = V^*$ a.s.,

where $(y_i^{k,h}, \lambda_i^{k,h})$ are the predictions made by expert $H_{k,h}$ at round $i$.

Before stating the main theorem regarding MHA, we state the following lemma (whose proof appears in the supplementary material), which is used in the proof of the main result regarding MHA.

Lemma 4. Let $\{H_{k,h}\}$ be a countable set of experts as defined in the proof of Theorem 2.
Then, the following relation holds a.s.:

$\inf_{k,h} \limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i^{k,h}, \lambda_i, X_i) \le V^* \le \sup_{k,h} \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i^{k,h}, X_i),$

where $(y_i, \lambda_i)$ are the predictions of MHA when applied to $\{H_{k,h}\}$.

We are now ready to state and prove the optimality of MHA.

Theorem 3 (Optimality of MHA). Let $(y_i, \lambda_i)$ be the predictions generated by MHA when applied to $\{H_{k,h}\}$ as defined in the proof of Theorem 2. Then, for any $\gamma$-feasible process $\{X_i\}_{-\infty}^{\infty}$, MHA is a $\gamma$-bounded and $\gamma$-universal strategy.

Proof. We first show that

$\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, X_i) = V^*$ a.s.  (8)

Applying Lemma 5 in [13], we know that the updates guarantee that for every expert $H_{k,h}$,

$\frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i) \le \frac{1}{N}\sum_{i=1}^{N} l(y_i^{k,h}, \lambda_i, x_i) + \frac{C_{k,h}}{\sqrt{N}},$  (9)

$\frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i) \ge \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i^{k,h}, x_i) - \frac{C'_{k,h}}{\sqrt{N}},$  (10)

where $C_{k,h}, C'_{k,h} > 0$ are constants independent of $N$. In particular, using Equation (9),

$\frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i) \le \inf_{k,h} \Big( \frac{1}{N}\sum_{i=1}^{N} l(y_i^{k,h}, \lambda_i, x_i) + \frac{C_{k,h}}{\sqrt{N}} \Big).$

Therefore, we get

$\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i) \le \limsup_{N\to\infty} \inf_{k,h} \Big( \frac{1}{N}\sum_{i=1}^{N} l(y_i^{k,h}, \lambda_i, x_i) + \frac{C_{k,h}}{\sqrt{N}} \Big) \le \inf_{k,h} \limsup_{N\to\infty} \Big( \frac{1}{N}\sum_{i=1}^{N} l(y_i^{k,h}, \lambda_i, x_i) + \frac{C_{k,h}}{\sqrt{N}} \Big) = \inf_{k,h} \limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i^{k,h}, \lambda_i, x_i),$  (11)

where in the last inequality we used the fact that $\limsup$ is sub-additive. Using Lemma 4, we get that

$(11) \le V^* \le \sup_{k,h} \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i^{k,h}, X_i).$  (12)

Using similar arguments and using Equation (10), we can show that $(12) \le \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i)$. Summarizing, we have

$\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i) \le V^* \le \liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i).$

Therefore, we can conclude that a.s. $\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, X_i) = V^*$.

To show that MHA is indeed a $\gamma$-bounded strategy, we use two special experts $H_{0,0}, H_{-1,-1}$ whose predictions are $\lambda_n^{0,0} = \lambda_{\max}$ and $\lambda_n^{-1,-1} = 0$ for every $n$; to shorten the notation, we denote $g(y, \lambda, x) \triangleq \lambda(c(y, x) - \gamma)$. First, from Equation (10) applied to the expert $H_{0,0}$, we get that

$\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda_{\max}, x_i) \le \limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda_i, x_i).$  (13)

Moreover, since $l$ is uniformly continuous, for any given $\epsilon > 0$, there exists $\delta > 0$ such that if $|\lambda' - \lambda| < \delta$, then $|l(y, \lambda', x) - l(y, \lambda, x)| < \epsilon$ for any $y \in \mathcal{Y}$ and $x \in \mathcal{X}$.
We also know from the proof of Theorem 2 that $\lim_{k} \lim_{h} \lim_{i} \lambda_i^{k,h} = \lambda^*$. Therefore, there exist $k_0, h_0, i_0$ such that $|\lambda_i^{k_0,h_0} - \lambda^*| < \delta$ for any $i > i_0$. Therefore,

$\frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda^*, x_i) - \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i) = \Big( \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda^*, x_i) - \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i^{k_0,h_0}, x_i) \Big) + \Big( \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i^{k_0,h_0}, x_i) - \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i) \Big).$

From the uniform continuity we learn that the first summand is asymptotically bounded above by $\epsilon$, and from Equation (10) we get that the limit superior of the second summand is bounded above by 0. Thus, since $\epsilon$ is arbitrary, we get that

$\limsup_{N\to\infty} \Big( \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda^*, x_i) - \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i) \Big) \le 0.$

Thus, $\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda^*, X_i) \le V^*$, and from Theorem 1 we can conclude that $\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda^*, X_i) = V^*$. Therefore, we can deduce that

$\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda_i, x_i) - \limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda^*, x_i) = \limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda_i, x_i) + \liminf_{N\to\infty} \Big( -\frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda^*, x_i) \Big) \le \limsup_{N\to\infty} \Big( \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda_i, x_i) - \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda^*, x_i) \Big) = \limsup_{N\to\infty} \Big( \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda_i, x_i) - \frac{1}{N}\sum_{i=1}^{N} l(y_i, \lambda^*, x_i) \Big) = 0,$

which results in

$\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda_i, x_i) \le \limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda^*, x_i).$

Combining the above with Equation (13), we get that

$\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda_{\max}, x_i) \le \limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} g(y_i, \lambda^*, x_i).$

Since $0 \le \lambda^* < \lambda_{\max}$, we get that MHA is $\gamma$-bounded. This also implies that

$\limsup_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \lambda_i(c(y_i, x_i) - \gamma) \le 0.$

Now, if we apply Equation (10) to the expert $H_{-1,-1}$, we get that

$\liminf_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \lambda_i(c(y_i, x_i) - \gamma) \ge 0.$

Hence, $\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \lambda_i(c(y_i, x_i) - \gamma) = 0$, and using Equation (8), we get that MHA is also $\gamma$-universal. ∎

5 Concluding Remarks

In this paper, we introduced the Minimax Histogram Aggregation (MHA) algorithm for multiple-objective sequential prediction. We considered the general setting where the unknown underlying process is stationary and ergodic. Given that the underlying process is $\gamma$-feasible, we extended the well-known result of [1] regarding the asymptotic lower bound of prediction with a single objective to the case of multiple objectives. We proved that MHA is a $\gamma$-bounded strategy whose predictions also converge to the optimal solution in hindsight.
In the proofs of the theorems and lemmas above, we used the fact that the initial weights of the experts, $\alpha_{k,h}$, are strictly positive, thus implying a countably infinite expert set. In practice, however, one cannot maintain an infinite set of experts. Therefore, it is customary to apply such algorithms with a finite number of experts (see [12, 9, 10]). Despite the fact that in the proofs we assumed that the observation set $\mathcal{X}$ is known a priori, the algorithm can also be applied when $\mathcal{X}$ is unknown by applying the doubling trick; for a further discussion on this point, see [8]. In our proofs, we also relied on the compactness of the set $\mathcal{X}$. It would be interesting to see whether the universality of MHA can be sustained for unbounded processes as well. A very interesting open question is to identify conditions allowing for finite-sample bounds when predicting with multiple objectives.

Acknowledgments

We would like to thank the anonymous reviewers for providing helpful comments. This research was supported by The Israel Science Foundation (grant No. 1890/14).

References

[1] P.H. Algoet. The strong law of large numbers for sequential decisions under uncertainty. IEEE Transactions on Information Theory, 40(3):609-633, 1994.
[2] G. Biau, K. Bleakley, L. Györfi, and G. Ottucsák. Nonparametric sequential prediction of time series. Journal of Nonparametric Statistics, 22(3):297-317, 2010.
[3] G. Biau and B. Patra. Sequential quantile prediction of time series. IEEE Transactions on Information Theory, 57(3):1664-1674, 2011.
[4] A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 2005.
[5] L. Breiman. The individual ergodic theorem of information theory. The Annals of Mathematical Statistics, 28(3):809-811, 1957.
[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[7] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31.
Springer Science & Business Media, 2013.
[8] L. Györfi and G. Lugosi. Strategies for sequential prediction of stationary time series. In Modeling Uncertainty, pages 225-248. Springer, 2005.
[9] L. Györfi, G. Lugosi, and F. Udina. Nonparametric kernel-based sequential investment strategies. Mathematical Finance, 16(2):337-357, 2006.
[10] L. Györfi and D. Schäfer. Nonparametric prediction. Advances in Learning Theory: Methods, Models and Applications, pages 339-354, 2003.
[11] L. Györfi, F. Udina, and H. Walk. Nonparametric nearest neighbor based empirical portfolio selection strategies. Statistics & Decisions, International Mathematical Journal for Stochastic Methods and Models, 26(2):145-157, 2008.
[12] L. Györfi, A. Urbán, and I. Vajda. Kernel-based semi-log-optimal empirical portfolio selection strategies. International Journal of Theoretical and Applied Finance, 10(03):505-516, 2007.
[13] Y. Kalnishkan and M. Vyugin. The weak aggregating algorithm and weak mixability. In International Conference on Computational Learning Theory, pages 188-203. Springer, 2005.
[14] D. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, 1997.
[15] U. von Luxburg and B. Schölkopf. Statistical learning theory: Models, concepts, and results. arXiv preprint arXiv:0810.4752, 2008.
[16] M. Mahdavi, T. Yang, and R. Jin. Stochastic convex optimization with multiple objectives. In Advances in Neural Information Processing Systems, pages 1115-1123, 2013.
[17] S. Mannor, J. Tsitsiklis, and J.Y. Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10(Mar):569-590, 2009.
[18] P. Rigollet and X. Tong. Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12(Oct):2831-2855, 2011.
[19] W. Stout. Almost Sure Convergence. Vol. 24 of Probability and Mathematical Statistics, 1974.
[20] V. Vovk. Competing with stationary prediction strategies.
In International Conference on Computational Learning Theory, pages 439-453. Springer, 2007.