# Optimization Over a Probability Simplex

Journal of Machine Learning Research 26 (2025) 1-35. Submitted 9/23; Revised 12/24; Published 4/25.

James Chok¹,² (james.chok@ed.ac.uk)
Geoffrey M. Vasil¹ (gvasil@ed.ac.uk)

¹School of Mathematics and Maxwell Institute for Mathematical Sciences, The University of Edinburgh, Edinburgh EH9 3FD, United Kingdom
²Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, United Kingdom

Editor: Silvia Villa

©2025 James Chok and Geoffrey M. Vasil. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1166.html.

Abstract: We propose a new iteration scheme, the Cauchy-Simplex, to optimize convex problems over the probability simplex $\{w \in \mathbb{R}^n \mid \sum_i w_i = 1 \text{ and } w_i \ge 0\}$. Specifically, we map the simplex to the positive quadrant of a unit sphere, envisage gradient descent in latent variables, and map the result back in a way that only depends on the simplex variable. Moreover, proving rigorous convergence results in this formulation leads inherently to tools from information theory (e.g., cross-entropy and KL divergence). Each iteration of the Cauchy-Simplex consists of simple operations, making it well-suited for high-dimensional problems. In continuous time, we prove that $f(w^T) - f(w^*) = O(1/T)$ for differentiable real-valued convex functions, where $T$ is the number of time steps and $w^*$ is the optimal solution. Numerical experiments of projection onto convex hulls show faster convergence than similar algorithms. Finally, we apply our algorithm to online learning problems and prove the convergence of the average regret for (1) prediction with expert advice and (2) universal portfolios.

Keywords: Constrained Optimization, Convex Hull, Simplex, Orthogonal Matrix, Gradient Flow

## 1. Introduction

Optimization over the probability simplex (i.e., the unit simplex) occurs in many subject areas, including portfolio management (Helmbold et al., 1998; Kalai and Vempala, 2003; Das and Banerjee, 2011; Canyakmaz et al., 2023), machine learning (Collins et al., 2008; Rakotomamonjy et al., 2008; Duan, 2020; Floto et al., 2023), population dynamics (Schuster and Sigmund, 1983; Bomze, 2002), and multiple others including statistics and chemistry (Kuznetsova and Strekalovsky, 2001; de Klerk et al., 2008; Amsler et al., 2018; Candelieri et al., 2025). In this paper, we look at minimizing differentiable real-valued convex functions $f(w)$, with $w \in \mathbb{R}^n$, over the probability simplex,

$$\min_{w \in \Delta_n} f(w), \quad \text{where } \Delta_n = \Big\{w \in \mathbb{R}^n \;\Big|\; \sum_i w_i = 1 \text{ and } w_i \ge 0\Big\}. \tag{1}$$

While quadratic programs can produce an exact solution under certain restrictions on $f$, they tend to be computationally expensive when $n$ is large (Nocedal and Wright, 2006). Here, we provide an overview of iterative algorithms that solve this problem approximately, and introduce a new algorithm for differentiable real-valued convex functions $f$. We also show how our method can be applied to other constraints, particularly orthogonal matrices.

Previous works (e.g., projected and exponentiated gradient descent) are equivalent to discretizations of an underlying continuous-time gradient flow. However, in each case, the gradient flow enforces only the positivity or only the unit-sum condition. Alternatively, the Frank-Wolfe algorithm is not derived from a gradient flow and is inherently a discrete-time method. We propose a continuous-time gradient flow that satisfies both constraints.
Being a continuous-time gradient flow allows us to prove convergence in a continuous-time setting, and we also prove convergence of its forward Euler discretization. Moreover, ideas from both projected and exponentiated gradient descent appear in our algorithm.

The remainder of this paper is structured as follows. In Section 2, we provide an overview of methods used to solve (1). In Section 3, we present our proposed algorithm, show how it can be derived using simple calculus, and reveal connections to previous works. Theoretical analysis of the algorithm is given in Section 4, proving a convergence rate of $O(1/T)$. Section 5 shows how to extend the method of Section 3 to constraints over orthogonal matrices. Finally, Section 6 provides theoretical applications in game theory and finance, as well as numerical applications in geometry and in finding how to weight survey questions to follow a desired distribution.

## 2. Previous Works

Two common classes of methods currently exist to find the minimum of a smooth convex function within a non-empty, convex, and compact set $R \subset \mathbb{R}^n$: (i) projection-based methods, which adjust guesses to stay in the set; and (ii) Frank-Wolfe methods, which reduce the function's value while explicitly staying in the set.

Projection-based methods use a projection to satisfy the constraints of $R$. This paper uses two definitions of a projection. Classically in the optimization literature (Galántai, 2004), projection of a point $y \in \mathbb{R}^n$ refers to the optimization procedure

$$\mathrm{proj}_R(y) = \arg\min_{x \in R} \|x - y\|. \tag{2}$$

Since $R$ is convex and compact, a unique solution exists. Any linear idempotent map generalizes the classical notion of projection (Halmos, 1998), i.e., $g : D \to D$ for $D$ a non-empty set with $g(g(x)) = g(x)$ for all $x \in D$. Defining the set $g(D) = \{g(x) \mid x \in D\} \subseteq D$, $g$ is said to project $D$ onto $g(D)$. This generalization still encompasses the optimization procedure: if we take $g(y) = \mathrm{proj}_R(y)$, then $g(g(y)) = \mathrm{proj}_R(g(y)) = g(y)$.
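As a concrete illustration of the optimization-based projection (2) in the case $R = \Delta_n$, the sketch below implements a standard $O(n\log n)$ sort-based routine in the spirit of Wang and Carreira-Perpiñán (2013). It is illustrative only, not the implementation used in the paper.

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection of y onto the probability simplex.

    Sort-based: find the threshold tau so that max(y - tau, 0) sums to 1.
    """
    u = np.sort(y)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]   # last feasible index
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - tau, 0.0)
```

Note that `proj_simplex` is idempotent: applying it to a point already in the simplex returns the point unchanged, matching the generalized definition of a projection above.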
Projected gradient descent (PGD) is a simple method to solve differentiable real-valued convex problems over $R$. The iteration scheme follows

$$w^{t+1} = \mathrm{proj}_R\big(w^t - \alpha_t \nabla_w f(w^t)\big), \quad \text{where } \mathrm{proj}_R(y) = \arg\min_{x\in R}\|x - y\|, \tag{3}$$

with learning rate $\alpha_t > 0$. In general, $w^t - \alpha_t\nabla_w f(w^t)$ satisfies neither the positivity nor the unit-sum constraint. Projection of this vector onto the unit simplex can be performed in $O(mn)$ operations (Michelot, 1986; Chen and Ye, 2011; Wang and Carreira-Perpiñán, 2013; Kyrillidis et al., 2013), where $n$ is the dimension of $w$ and $1 \le m \le n$ is the number of iterations of the projection algorithm. While this added cost is often negligible, it can become significant for large $n$. More recent applications increasingly require large dimensions, e.g. with $n \ge 1000$ (Markowitz et al., 1994).

This formulation of PGD can be simplified for linear constraints $Aw = b$, where $A$ is a $k \times n$ matrix and $b \in \mathbb{R}^k$. Luenberger (1997) projects $\nabla_w f(w^t)$ into the nullspace of $A$ and descends along the projected direction. For the unit-sum constraint, this algorithm requires solving the constrained optimization problem

$$\arg\min_x \tfrac12\big\|\nabla_w f(w^t) - x\big\|^2, \quad \text{with } \sum_i x_i = 0.$$

This problem yields to the method of Lagrange multipliers, giving the solution

$$w^{t+1}_i = w^t_i - \alpha_t\Big(\nabla_{w_i} f(w^t) - \tfrac1n \sum_j \nabla_{w_j} f(w^t)\Big).$$

While this scheme satisfies the unit-sum constraint similarly to (3), it does not satisfy the positivity constraint. Enforcing positivity requires the same projection used in PGD, thus costing $O(mn)$ operations (Nocedal and Wright, 2006; Yousefzadeh, 2021).

Exponentiated gradient descent (EGD), first presented by Nemirovsky and Yudin (1983) and later by Kivinen and Warmuth (1997), is a specific case of mirror descent. It considers gradient flow in the mirror space $(\Delta_n)^*$, mapping between $\Delta_n$ and $(\Delta_n)^*$ via the functions $\nabla h : \Delta_n \to (\Delta_n)^*$ and $(\nabla h)^{-1} : (\Delta_n)^* \to \Delta_n$, where $h(w) = \sum_i \big(w_i\log(w_i) - w_i\big)$ (Vishnoi, 2021). This yields the gradient flow in the mirror space

$$\frac{d\theta}{dt} = -\alpha\,\nabla_x f\big((\nabla h)^{-1}(\theta)\big),$$

for continuous learning rate $\alpha > 0$.
Mapping back into $\Delta_n$ yields the gradient flow

$$\frac{d}{dt}\log(w_i) = -\alpha\,\nabla_{w_i} f(w). \tag{4}$$

While this preserves positivity, the unit-sum constraint is not preserved: the differential equation can be rewritten as $dw_i/dt = -\alpha\, w_i \nabla_{w_i} f(w)$, and $\sum_i dw_i/dt$ is not zero in general.

A forward Euler discretization of (4) yields $w^{t+1}_i = w^t_i\exp(-\alpha_t\nabla_{w_i} f(w^t))$, thus mapping $\mathbb{R}^n_{\ge0} \to \mathbb{R}^n_{\ge0}$, where $\mathbb{R}^n_{\ge0} = \{x\in\mathbb{R}^n\setminus\{0\} \mid x_i \ge 0\}$. To enforce the unit-sum constraint, one can use the projection defined in (2) with $R = \Delta_n$; however, as suggested by Kivinen and Warmuth (1997), a simple idempotent map projecting $\mathbb{R}^n_{\ge0}$ onto $\Delta_n$ can be used instead, given by

$$w^{t+1}_i = \frac{w^t_i\exp\big(-\alpha_t\nabla_{w_i} f(w^t)\big)}{\sum_j w^t_j\exp\big(-\alpha_t\nabla_{w_j} f(w^t)\big)},$$

with discrete learning rate $\alpha_t > 0$. The projection, given by the normalization factor, takes $O(n)$ operations. Moreover, discretization is necessary for the algorithm to satisfy the constraint.

Frank-Wolfe-based methods exploit the convexity of the domain $R$, eliminating the need for a projection. The Frank-Wolfe method is a classic scheme (Frank and Wolfe, 1956) that is experiencing a recent surge in popularity (Lacoste-Julien et al., 2013; Bellet et al., 2015; Mu et al., 2016; Tajima et al., 2021). The method skips projection by assuming $R$ is convex. That is,

$$w^{t+1} = (1-\gamma_t)w^t + \gamma_t s^t, \quad \text{where } s^t = \arg\min_{s\in R}\, s\cdot\nabla_w f(w^t) \text{ and } 0 \le \gamma_t \le 1. \tag{5}$$

Since $R$ is convex, $w^{t+1}\in R$ automatically. For the simplex, the subproblem in (5) is known in closed form. Frank-Wolfe-based methods tend to be fast for sparse solutions but display oscillatory behavior near the solution, resulting in slow convergence (Lacoste-Julien and Jaggi, 2015; Bomze and Zeffiro, 2021). The Pairwise Frank-Wolfe (PFW) method improves upon the original by introducing an away-step that prevents oscillations, allowing faster convergence (Guélat and Marcotte, 1986; Jaggi, 2013; Lacoste-Julien and Jaggi, 2015).
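The two updates just described can be sketched as follows: `egd_step` is the normalized exponentiated-gradient update, and `frank_wolfe_step` uses the closed-form solution of the simplex subproblem in (5), which is the vertex minimizing the linearized objective. This is a minimal illustration, not the paper's code.

```python
import numpy as np

def egd_step(w, grad, lr):
    """Normalized EGD update; the O(n) normalization projects onto the simplex."""
    v = w * np.exp(-lr * grad)
    return v / v.sum()

def frank_wolfe_step(w, grad, gamma):
    """Frank-Wolfe update on the simplex: the linear subproblem
    argmin_{s in simplex} s . grad is solved by the vertex e_j, j = argmin_i grad_i."""
    s = np.zeros_like(w)
    s[np.argmin(grad)] = 1.0
    return (1.0 - gamma) * w + gamma * s
```

Both updates keep the iterate exactly on the simplex: EGD by renormalization, Frank-Wolfe by taking a convex combination of feasible points.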
While PFW, EGD, and PGD have guaranteed convergence under constant and decreasing step sizes (Vishnoi, 2021; Jaggi, 2013), a line search is often used in practice to improve run time (Nocedal and Wright, 2006). Taking $f$ in (1) to be quadratic (Bomze, 1998, 2002; Selvi et al., 2023), a line search has analytical solutions for the Frank-Wolfe and PFW methods but not for EGD and PGD. EGD and PGD require approximate methods (e.g., the backtracking method (Nocedal and Wright, 2006)), adding extra run time per iteration.

## 3. The Main Algorithm

For convex problems over a probability simplex, we propose what we name the Cauchy-Simplex (CS). For an initial vector $w^0 \in \mathrm{relint}(\Delta_n) = \{w\in\mathbb{R}^n \mid \sum_i w_i = 1,\ w_i > 0\}$, we propose the iteration scheme

$$w^{t+1}_i = w^t_i - \eta_t d^t_i, \tag{6}$$

with

$$d^t_i = w^t_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big), \qquad 0 < \eta_t < \eta_{t,\max} = \frac{1}{\max_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big)}.$$

Remark 1. Rewriting

$$\max_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big) = (e_j - w^t)\cdot\nabla f(w^t), \tag{7}$$

where $e_j\in\mathbb{R}^n$ is the standard unit vector with $j = \arg\max_i \nabla_{w_i} f(w^t)$. For finite iterations $T > 0$, $w^T\in\mathrm{relint}(\Delta_n)$ and thus $e_j - w^T \neq 0$. Therefore (7) is zero when $\nabla f(w^T) = 0$, i.e., the optimal solution has been reached.

When $\max_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big) = 0$, then $\eta_{t,\max} = \infty$ and $\eta_t$ is set to any finite positive constant. As seen in Remark 1, this condition being met implies $\nabla f(w^t) = 0$, and $w^{t+1} = w^t$ for finite $\eta_t$.

The upper bound $\eta_{t,\max}$ on the learning rate ensures that $w^{t+1}_i$ is positive for all $i$. Summing over the indices of $d^t$,

$$\sum_i d^t_i = \sum_i w^t_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big) = \big(w^t\cdot\nabla_w f(w^t)\big)\Big(1 - \sum_i w^t_i\Big).$$

Defining the linear operator $L(v) = \sum_i v_i$: if $L(w^t) = 1$, then $L(d^t) = 0$. This makes $w^{t+1}$ also satisfy the unit-sum constraint, as $L(w^{t+1}) = L(w^t) = 1$. However, it is important to note that once an index is set to zero, it will stay zero for all future iterations. Thus, having the initial condition in the relative interior of $\Delta_n$ ensures flow for each of the indices.
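One step of the linear scheme (6) can be sketched as below. Here $\eta_t$ is taken as a fixed fraction of $\eta_{t,\max}$; that fraction is an illustrative choice, not prescribed by the scheme.

```python
import numpy as np

def cauchy_simplex_step(w, grad, frac=0.5):
    """One linear Cauchy-Simplex update (6): w <- w - eta * d, with
    d_i = w_i * (grad_i - w . grad) and eta = frac * eta_max, frac < 1,
    so every weight stays strictly positive."""
    centred = grad - w @ grad          # grad_i - w . grad
    d = w * centred
    m = centred.max()
    if m <= 0:                         # gradient constant on the support: stationary
        return w
    eta = frac / m                     # eta_max = 1 / max_i centred_i
    return w - eta * d
```

Since the entries of `d` sum to zero, the unit sum is preserved exactly, and `frac < 1` guarantees $w^{t+1}_i > 0$ for all $i$.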
Another important implication is that taking $\eta_t = \eta_{t,\max}$ may cause an index to be incorrectly set to zero. As such, step sizes should satisfy $\eta_t < \eta_{t,\max}$ to prevent this.

### 3.1 Motivating Derivation

Our derivation begins by modeling $w \in \Delta_n$ through a latent variable $\psi\in\mathbb{R}^n$,

$$w_i = w_i(\psi) = \frac{\psi_i^2}{\sum_j \psi_j^2}, \tag{8}$$

which automatically satisfies positivity and unit probability. Thus, the optimization problem over the probability simplex can be considered as an unconstrained optimization problem

$$\min_{\psi\in\mathbb{R}^n} F(\psi), \quad \text{where } F(\psi) = f(w(\psi)).$$

Now consider the continuous-time gradient descent on $\psi$, $\dot\psi = -\alpha\nabla_\psi F(\psi)$, for continuous learning rate $\alpha > 0$. This induces a gradient flow in $w$. To find it, first note that

$$\frac{\partial w_i}{\partial\psi_j} = \frac{2\psi_i\delta_{ij}\|\psi\|^2 - 2\psi_i^2\psi_j}{\|\psi\|^4} = \frac{2}{\|\psi\|^2}\big(\psi_i\delta_{ij} - w_i\psi_j\big),$$

where $\delta_{ij} = 1$ if $i = j$ and zero otherwise. Using the notation $\dot\psi = d\psi/dt$, the chain rule gives the gradient flow

$$\frac{dw_i}{dt} = \sum_j \frac{\partial w_i}{\partial\psi_j}\dot\psi_j = \frac{2}{\|\psi\|^2}\Big(\psi_i\dot\psi_i - w_i\sum_j\psi_j\dot\psi_j\Big).$$

By equation (8),

$$\dot\psi_j = -\alpha\sum_i\frac{\partial w_i}{\partial\psi_j}\frac{\partial f}{\partial w_i} = -\frac{2\alpha}{\|\psi\|^2}\,\psi_j\big(\nabla_{w_j} f - w\cdot\nabla_w f\big), \qquad \sum_j\psi_j\dot\psi_j = -2\alpha\big(w\cdot\nabla_w f - w\cdot\nabla_w f\big) = 0.$$

Thus, the gradient flow in $w$ can be simplified to

$$\frac{dw_i}{dt} = \frac{2}{\|\psi\|^2}\,\psi_i\dot\psi_i = -\beta\,w_i\big(\nabla_{w_i} f(w) - w\cdot\nabla_w f(w)\big), \quad\text{where } \beta = \frac{4\alpha}{\|\psi\|^2}, \tag{9}$$

giving the iterative scheme

$$w^{t+1}_i = w^t_i - \eta_t d^t_i, \quad\text{where } d^t_i = w^t_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big). \tag{10}$$

While $w$ is initially modeled through a latent variable, the resulting gradient flow is only in terms of the constrained variable $w$. This allows the iteration scheme to reference only vectors in $\Delta_n$; we no longer need to consider the latent domain $\mathbb{R}^n$. This derivation makes a connection to recent work using a latent-variable approach (Lezcano-Casado and Martínez-Rubio, 2019; Bucci et al., 2022; Li et al., 2023).

### 3.2 On the Learning Rate

The proof of convergence in Section 4 assumes that $\eta_t < \eta_{t,\max}$; thus, all weights stay strictly positive but may be arbitrarily close to zero. However, accumulated rounding errors may result in some weights becoming zero or negative. As such, in our numerical implementation, a weight is set to zero once it falls below some threshold (we choose $10^{-10}$).
Once an index is set to zero, it will remain zero and can be ignored. This gives an altered maximum learning rate

$$\tilde\eta_{t,\max} = \frac{1}{\max_{i\in S}\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big)}, \quad\text{where } S = \{i \mid w_i > 0\}.$$

It follows that $\tilde\eta_{t,\max} \ge \eta_{t,\max}$, allowing larger step sizes to be taken.

The requirement of a maximum learning rate is not unique to the CS; it is shared with PGD and EGD under fixed learning rates (Lu et al., 2018). However, their maximum learning rate is based on the Lipschitz constant of $\nabla f$, which is typically unknown. While numerical methods to approximate the Lipschitz constant exist, they can cause a failure of convergence (Hansen et al., 1992). More generally, PGD and EGD converge with a line search bounded by an arbitrarily large positive constant (Xiu et al., 2007; Li and Cevher, 2018), which becomes an extra parameter of the method. In contrast, each iteration of the CS has an easily computable maximum possible learning rate, $\eta_{t,\max}$, to be used in a line search.

### 3.3 Connections to Previous Methods

There are two ways of writing the gradient flow for the Cauchy-Simplex. In terms of the flow in $w$:

$$\frac{dw}{dt} = -W\,\Pi_w\nabla_w f(w), \quad\text{where } \Pi_w = I - \mathbf{1}w^\top, \tag{11}$$

$W$ is the diagonal matrix filled with $w$, $I$ is the identity matrix, and $\mathbf{1}$ is the vector of ones.

In terms of the flow in $\log(w)$:

$$\frac{d}{dt}\log(w) = -\Pi_w\nabla_w f(w), \tag{12}$$

giving an alternative exponential iteration scheme

$$w^{t+1}_i = w^t_i\exp\Big(-\eta_t\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i\Big). \tag{13}$$

Claim 2. $\Pi_w$ projects $\mathbb{R}^n$ onto the null space of the operator $L_w(v) = \sum_i w_i v_i$, provided that $\sum_i w_i = 1$.

Proof. To see that $\Pi_w$ is an idempotent map,

$$\Pi_w^2 = I - 2\,\mathbf{1}w^\top + \mathbf{1}(w^\top\mathbf{1})w^\top = I - \mathbf{1}w^\top = \Pi_w,$$

since $w^\top\mathbf{1} = 1$; it is therefore a projection. To see that $\Pi_w(\mathbb{R}^n)$ is the null space of $L_w$: for any $u\in\mathbb{R}^n$, $L_w(\Pi_w u) = w^\top u - (w^\top\mathbf{1})(w^\top u) = 0$. For the converse, let $v$ be in the null space of $L_w$. By definition of the null space, $L_w(v) = \sum_i w_i v_i = 0$, hence $\Pi_w v = (I - \mathbf{1}w^\top)v = v$. Thus $\Pi_w$ is surjective, mapping $\mathbb{R}^n \to \mathrm{Nul}(L_w)$. ∎

Remark 3. While $\Pi_w$ is a projection, applying it takes only $O(n)$ operations.
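Claim 2 and Remark 3 can be checked numerically. The sketch below forms $\Pi_w$ explicitly, verifies idempotence and the null-space property, and shows the matrix-free $O(n)$ application.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.random(5)
w /= w.sum()                                # a point in relint of the simplex
Pi = np.eye(5) - np.outer(np.ones(5), w)    # Pi_w = I - 1 w^T

def apply_Pi(w, v):
    """Matrix-free O(n) application: Pi_w v = v - (w . v) 1."""
    return v - (w @ v) * np.ones_like(v)

v = rng.standard_normal(5)
```

The matrix form is shown only for verification; in practice only the $O(n)$ form is needed, which is what makes each CS iteration cheap.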
The formulations (11) and (12) draw a direct parallel to both PGD and EGD, as summarized in Table 1. PGD can be written in continuous form as

$$\frac{dw}{dt} = -\Pi_{1/n}\nabla_w f(w), \quad\text{where } \Pi_{1/n} = I - \tfrac1n\mathbf{1}\mathbf{1}^\top.$$

The projector helps PGD satisfy the unit-sum constraint, but not the positivity constraint for general $w$. In the CS flow (11), by contrast, the multiplication with the matrix $W$ slows the gradient flow for a given index $w_i$ as it nears the boundary, with zero flow once it hits the boundary, thus preserving positivity.

EGD, similarly, can be written in continuous form as

$$\frac{d}{dt}\log(w) = -\nabla_w f(w).$$

Performing the descent through $\log(w)$ helps EGD preserve positivity. Introducing the projector helps the resulting exponential iteration scheme (13) agree with its linear iteration scheme (10) up to $O(\eta_t^2)$ terms, thus helping preserve the unit-sum constraint.

Table 1: Comparison of gradient flows for different optimization methods and constraints.

| | Unconstrained sum | $\sum_i w_i = 1$ |
|---|---|---|
| $w_i$ unconstrained | GD: $\frac{dw}{dt} = -\nabla_w f(w)$ | PGD (Luenberger): $\frac{dw}{dt} = -\Pi_{1/n}\nabla_w f(w)$ |
| $w_i \ge 0$ | EGD: $\frac{d}{dt}\log(w) = -\nabla_w f(w)$ | CS: $\frac{d}{dt}\log(w) = -\Pi_w\nabla_w f(w)$ |

Claim 4. The Cauchy-Simplex exponentiated iteration scheme (13) agrees up to $O(\eta_t^2)$ with its linear iteration scheme (6). This can be seen by Taylor expanding (13) w.r.t. $\eta_t$ around zero.

Remark 5. Combining PGD and EGD as $\frac{d}{dt}\log(w) = -\Pi_{1/n}\nabla_w f(w)$ preserves positivity but not the unit-sum constraint: the differential equation can be rewritten as $dw_i/dt = -w_i\big(\Pi_{1/n}\nabla_w f(w)\big)_i$, and in general $\sum_i dw_i/dt$ is non-zero, so the unit sum is not an invariant of the gradient flow.

Unlike both PGD and EGD, the continuous-time dynamics of the CS are enough to enforce the probability-simplex constraint. This allows us to use the gradient flow of the CS, i.e. (9), to prove convergence when optimizing convex functions (Section 4). This contrasts with PGD and EGD, in which the continuous dynamics satisfy only one constraint; the discretization of those schemes is necessary to allow an additional projection step, thus satisfying both constraints.
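Claim 4's $O(\eta_t^2)$ agreement between the exponential scheme (13) and the linear scheme (6) can be checked numerically: halving $\eta$ should shrink the gap between the two iterates by roughly a factor of four.

```python
import numpy as np

def cs_linear(w, grad, eta):
    """Linear Cauchy-Simplex step (6)."""
    return w - eta * w * (grad - w @ grad)

def cs_exp(w, grad, eta):
    """Exponential Cauchy-Simplex step (13)."""
    return w * np.exp(-eta * (grad - w @ grad))

w = np.array([0.5, 0.3, 0.2])
grad = np.array([1.0, 2.0, 3.0])
d1 = np.abs(cs_exp(w, grad, 1e-2) - cs_linear(w, grad, 1e-2)).max()
d2 = np.abs(cs_exp(w, grad, 5e-3) - cs_linear(w, grad, 5e-3)).max()
```

Per coordinate the gap is $w_i\big(e^{-x} - (1 - x)\big) = w_i x^2/2 + O(x^3)$ with $x = \eta_t(\Pi_{w^t}\nabla f)_i$, which is the quadratic scaling the check exercises.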
### 3.4 The Algorithm

The pseudo-code of our method is given in Algorithm 1.

Algorithm 1: Our proposed algorithm

    Require: ε = 1e-10 (tolerance for the zero set)
    w^0 ← (1/n, ..., 1/n)
    while termination conditions not met do
        S ← {i = 1, ..., n | w^t_i > ε}
        Q ← {i = 1, ..., n | w^t_i ≤ ε}
        choose η_t > 0
        η_max ← 1 / max_{i∈S} (∇_{w_i} f(w^t) − w^t · ∇_w f(w^t))
        η_t ← min(η_t, η_max)
        ŵ^{t+1}_i ← w^t_i − η_t w^t_i (∇_{w_i} f(w^t) − w^t · ∇_w f(w^t)),  i ∈ S
        ŵ^{t+1}_j ← 0,  j ∈ Q
        w^{t+1}_i ← ŵ^{t+1}_i / Σ_j ŵ^{t+1}_j   (normalizing for numerical stability)
    end while

## 4. Convergence Proof

We prove the convergence of the Cauchy-Simplex via its gradient flow. We also state the theorems for the convergence of the discrete linear scheme but leave the proofs to the appendix.

Theorem 6. Let $f$ be convex with Lipschitz continuous gradient, real-valued and continuously differentiable w.r.t. $w^t$, and $w^t$ continuously differentiable w.r.t. $t$. For the Cauchy-Simplex gradient flow (9) with initial condition $w^0\in\mathrm{relint}(\Delta_n)$, $f(w^t)$ is a strictly decreasing function and stationary at the optimal solution, for all $t\in[0,T]$ and finite $T > 0$.

Proof. Notice that $dw^t/dt$ can be rewritten as

$$\frac{d}{dt}\log(w^t_i) = -\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i.$$

Since $\nabla f$ is Lipschitz and $\Delta_n$ is a compact subset of $\mathbb{R}^n$, $\nabla f(w^t)$ is bounded for all $w^t\in\Delta_n$. Thus, if $w^0\in\mathrm{relint}(\Delta_n)$, strict positivity of $w^t_i$ is preserved for $t\in[0,T]$. Furthermore, since $\sum_i dw_i/dt = -\sum_i\big(W\Pi_w\nabla_w f(w)\big)_i = 0$, $\sum_i w_i$ is an invariant quantity of the gradient flow. Therefore, if $w^0\in\mathrm{relint}(\Delta_n)$, then $w^t\in\mathrm{relint}(\Delta_n)$ for $t\in[0,T]$ and finite $T>0$.

By direct computation (absorbing the positive factor $\beta$ into the time scale),

$$\frac{df}{dt} = \sum_i \nabla_{w_i} f(w^t)\frac{dw^t_i}{dt} = -\sum_i w^t_i\,\nabla_{w_i} f(w^t)\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big) = -\Big[\sum_i w^t_i\big(\nabla_{w_i} f(w^t)\big)^2 - \big(w^t\cdot\nabla_w f(w^t)\big)^2\Big] =: -\mathrm{Var}\big[\nabla_w f(w^t)\mid w^t\big],$$

where $\mathrm{Var}[v \mid w]$ is the variance of a vector $v\in\mathbb{R}^n$ with respect to the discrete measure $w\in\Delta_n$. The variance can be rewritten as

$$\mathrm{Var}[v\mid w] = \sum_i w_i(v_i - v\cdot w)^2 = \sum_i w_i(\Pi_w v)_i^2,$$

for all $v\in\mathbb{R}^n$, and is thus non-negative. Since $w^t\in\mathrm{relint}(\Delta_n)$, it follows that $df/dt \le 0$ and that $f$ is decreasing in time. As $w^t\in\mathrm{relint}(\Delta_n)$, $df/dt = 0$ only when $\Pi_{w^t}\nabla_w f(w^t) = 0$.
However, this occurs only if $w^t$ is an optimal solution to (1), which can be verified by checking the KKT conditions with the constraints of the simplex, i.e., $\sum_i w_i = 1$ and $w_i \ge 0$, shown in Appendix C. Thus $f$ is strictly decreasing in time, and stationary only at the optimal solution. ∎

Theorem 7. Let $f$ be real-valued, convex, and continuously differentiable w.r.t. $w^t$, $w^t$ continuously differentiable w.r.t. $t$, and $w^*\in\Delta_n$ a solution to (1). Under the Cauchy-Simplex gradient flow (9), for all $t\in[0,T]$ and finite $T>0$, the relative entropy

$$D(w^*|w^t) = \sum_{w^*_i\neq0} w^*_i\log\frac{w^*_i}{w^t_i}$$

is a decreasing function in time for $w^t\in\mathrm{relint}(\Delta_n)$.

Proof. We rewrite the relative entropy as

$$D(w^*|w^t) = \sum_{w^*_i\neq0} w^*_i\log(w^*_i) - \sum_i w^*_i\log(w^t_i).$$

By direct computation,

$$\frac{d}{dt}D(w^*|w^t) = -\sum_i w^*_i\,\frac{d}{dt}\log(w^t_i) = \sum_i w^*_i\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i = \nabla_w f(w^t)\cdot(w^* - w^t).$$

Since $f$ is convex, and $w^*$ is a minimum of $f$ in the simplex,

$$\frac{d}{dt}D(w^*|w^t) \le f(w^*) - f(w^t) \le 0. \tag{14}$$

∎

Theorem 8. Let $f$ be real-valued, convex, and continuously differentiable w.r.t. $w^t$, $w^t$ continuously differentiable w.r.t. $t$, and $w^*\in\Delta_n$ a solution to (1). For $w^0 = (1/n,\dots,1/n)$ and finite $T>0$, the Cauchy-Simplex gradient flow (9) gives the bound

$$f(w^T) - f(w^*) \le \frac{\log(n)}{T}.$$

Proof. Integrating (14) w.r.t. $t$ gives

$$\int_0^T\big(f(w^t) - f(w^*)\big)\,dt \le -\int_0^T\frac{d}{dt}D(w^*|w^t)\,dt = D(w^*|w^0) - D(w^*|w^T).$$

By Jensen's inequality, the relative entropy can be bounded via

$$\sum_{u_i\neq0} u_i\log(v_i/u_i) \le \log\Big(\sum_{u_i\neq0} v_i\Big) \le 0,$$

for $u\in\Delta_n$ and $v\in\mathrm{relint}(\Delta_n)$. This can also be shown using the Bregman divergence (Chou et al., 2023). Thus, the relative entropy is non-negative (Jaynes, 2003; Gibbs, 2010), yielding

$$\int_0^T\big(f(w^t) - f(w^*)\big)\,dt \le D(w^*|w^0).$$

Completing the integral on the left side of the inequality and dividing by $T$ gives

$$\frac1T\int_0^T f(w^t)\,dt - f(w^*) \le \frac1T D(w^*|w^0).$$

Using Theorem 6, $f(w^t)$ is a decreasing function. Thus

$$f(w^T) = \frac1T\int_0^T f(w^T)\,dt \le \frac1T\int_0^T f(w^t)\,dt.$$

Let $w^0 = (1/n,\dots,1/n)$; then the relative entropy can be bounded by¹

$$D(u|w^0) = \sum_{u_i\neq0} u_i\log(u_i) + \log(n) \le \log(n),$$

for all $u\in\Delta_n$. This gives the required bound $f(w^T) - f(w^*) \le D(w^*|w^0)/T \le \log(n)/T$. ∎

Theorem 9 (Convergence of Linear Scheme). Let $f$ be a differentiable convex function that attains a minimum at $w^*$, with $\nabla f$ $L$-Lipschitz continuous. Let $w^0 = (1/n,\dots,1/n)$ and $\{\eta_t\}_{t=0}^{T-1}$ be a sequence that satisfies $0 < \eta_t < \min\{1/L, \eta_{t,\max}\}$ and²

$$\eta_t L \le \frac{2\,C_{\gamma_t}\; w^t\cdot\big(\Pi_{w^t}\nabla_w f(w^t)\big)^2}{\max_i\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i^2}, \tag{15}$$

with $\gamma_t = \eta_t/\eta_{t,\max}$, $C_{\gamma_t} = \gamma_t^{-2}\log\big(e^{-\gamma_t}/(1-\gamma_t)\big)$, and $\eta_{t,\max}$ defined in (6). Then the linear Cauchy-Simplex scheme (6) produces iterates $\{w^t\}_{t=0}^{T-1}$ such that, for finite integer $T>0$,

$$f(w^T) - f(w^*) \le \frac{\log(n)}{\sum_{t=0}^{T-1}\eta_t}.$$

The proof can be found in Appendix B.4.

In practice, finding $\eta_t$, and by extension $\gamma_t$, to satisfy the assumptions of Theorem 9 can be hard. Instead, we show asymptotic convergence of the linear Cauchy-Simplex scheme under a line search.

Lemma 10 (Asymptotic Convergence of Linear Scheme). Let $f$ be a differentiable convex function with $\nabla f$ Lipschitz continuous. The linear Cauchy-Simplex scheme (6) converges asymptotically when $\eta_t$ is chosen through a line search, i.e.,

$$f(w^{t+1}) = \min_{\eta_t\in[0,\,\eta_{t,\max}-\varepsilon]} f(w^t - \eta_t d^t),$$

for some $0 < \varepsilon \ll 1$ and $d^t_i = w^t_i\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i$. That is, $f(w^t)\to f(w^*)$ as $t\to\infty$. The proof can be found in Appendix B.3.

¹ Intuitively, this can be seen by computing the relative entropy between a state with maximal entropy (the uniform distribution) and one with minimal entropy (the Dirac mass).
² Note that $C_{\gamma_t}$ is an increasing function of $\gamma_t$, with $C_{\gamma_t}\ge0$ for $\gamma_t\in[0,1]$. Thus $\gamma_t = \eta_t/\eta_{t,\max}$ can also be chosen to satisfy (15) such that $\{\eta_t\}_t$ is a sequence with $0 < \eta_t < \min\{1/L, \eta_{t,\max}\}$.

## 5. Extension: Optimization over Orthogonal Matrices

Another often-explored constraint is the orthogonal matrix constraint

$$\min_{Q\in S_n} f(Q), \quad\text{where } S_n = \{Q\in\mathbb{R}^{n\times n}\mid QQ^\top = I\},$$

with $S_n$ also known as the Stiefel manifold.
To name a few, this problem occurs in blind source separation (Särelä and Valpola, 2005; Omlor and Giese, 2011), principal component analysis (Bertsimas and Kitane, 2023), and neural networks (Fiori, 2005; Achour et al., 2022; Woodworth et al., 2020; Chou et al., 2024). Various iterative methods have been suggested to solve this problem, split between Riemannian optimization, which uses expensive iterations that remain inside the Stiefel manifold (Wen and Yin, 2012; Jiang and Dai, 2014), and landing methods (Ablin and Peyré, 2022), which use cheap iterations that are not in the Stiefel manifold but over time become arbitrarily close to it.

Using a similar method as in Section 3, we can also derive an explicit scheme that preserves orthogonality up to arbitrary accuracy. Let $X\in\mathbb{R}^{n\times n}$ be an anti-symmetric matrix ($X = -X^\top$). Then the Cayley transform

$$Q = (I - X)(I + X)^{-1} = 2(I + X)^{-1} - I$$

parameterizes orthogonal matrices with $\det(Q) = 1$, where $I$ is the identity matrix. Performing similar calculations as above yields the gradient flow

$$\frac{dQ}{dt} = -\eta\,\Omega\big(\Lambda\Omega^\top - \Omega\Lambda^\top\big)\Omega, \quad\text{where } \Omega = Q + I \text{ and } \Lambda = \Omega^\top\nabla_Q f$$

(the full derivation can be found in Appendix A). However, an Euler discretization of this differential equation does not produce a scheme that preserves orthogonality. Instead, we consider a corrected iteration scheme

$$Q_{t+1} = (I + C)(Q_t - \eta\,dQ), \quad\text{where } dQ = \Omega\big(\Lambda\Omega^\top - \Omega\Lambda^\top\big)\Omega,$$

for some matrix $C$. An iteration scheme that preserves orthogonality up to arbitrary accuracy can then be made by matching the coefficients of powers of $\eta$ in the expansion of $Q_{t+1}Q_{t+1}^\top$ and solving for $C$. In particular, an $O(\eta^2)$-correct scheme can be written as

$$Q_{t+1} = \Big(I - \tfrac{\eta^2}{2}\,dQ\,dQ^\top\Big)(Q_t - \eta\,dQ),$$

and an $O(\eta^4)$-correct scheme is

$$Q_{t+1} = \Big(I - \tfrac{\eta^2}{2}\,dQ\,dQ^\top + \tfrac{3\eta^4}{8}\,dQ\,dQ^\top dQ\,dQ^\top\Big)(Q_t - \eta\,dQ).$$

As such, parameterizing the manifold by a latent Euclidean space yields a gradient flow that remains explicitly inside the Stiefel manifold. Moreover, this gradient flow is explicitly in terms of the orthogonal matrix.
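The effect of the correction can be illustrated numerically. The sketch below builds a random orthogonal $Q$ and a placeholder gradient $G$ (a stand-in, not from a real objective), then compares the plain Euler step with the $O(\eta^2)$-corrected step; $dQ$ is normalized only to make the comparison scale-free.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a random orthogonal matrix
G = rng.standard_normal((n, n))                   # placeholder for grad_Q f

Omega = Q + np.eye(n)
Lam = Omega.T @ G
dQ = Omega @ (Lam @ Omega.T - Omega @ Lam.T) @ Omega
dQ /= np.linalg.norm(dQ)                          # normalize for a clean comparison

eta = 1e-3
A = dQ @ dQ.T
Q_euler = Q - eta * dQ                                 # plain Euler step
Q_corr = (np.eye(n) - 0.5 * eta**2 * A) @ Q_euler      # O(eta^2)-corrected step

def orth_err(M):
    """Max-norm deviation of M from orthogonality."""
    return np.abs(M @ M.T - np.eye(n)).max()
```

Since $dQ\,Q^\top + Q\,dQ^\top = 0$ along the flow, the Euler step's orthogonality error is exactly $\eta^2\,dQ\,dQ^\top$, which is what the correction factor cancels.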
This contrasts with Riemannian optimization methods, which use a Riemannian gradient flow on the manifold. Moreover, since the tangent space at each point of the Stiefel manifold does not lie in the manifold itself, once such a scheme is discretized, an exponential map (or retraction) must be used to remain on the manifold. These often involve costly matrix operations like exponentials, square roots, and inversions (Jiang and Dai, 2014; Chen et al., 2021; Lezcano-Casado and Martínez-Rubio, 2019). In contrast, working in Euclidean space allows the discretized scheme to remain on the manifold using cheaper matrix additions and multiplications.

## 6. Applications

As noted in Section 3.2, approximations of the Lipschitz constant $L$ can cause a failure of convergence. As such, our experiments use the line search outlined in Lemma 10.

### 6.1 Projection onto the Convex Hull

Projection onto a convex hull arises in many areas, such as machine learning (Mizutani, 2014; Nandan et al., 2014; Grünewälder, 2018), collision detection (Wang et al., 2020), and imaging (Jung et al., 2009; Jenatton et al., 2011). It involves finding the point in the convex hull of a set of points $\{x_i\}_{1\le i\le n}$, with $x_i\in\mathbb{R}^d$, that is closest to an arbitrary point $y\in\mathbb{R}^d$, i.e.,

$$\min_w \big\|X^\top w - y\big\|^2 \quad\text{where } \sum_i w_i = 1 \text{ and } w_i \ge 0,$$

and $X = [x_1,\dots,x_n]^\top$ is an $n\times d$ matrix. This is also known as simplex-constrained regression.

Experimental Details: We look at a convex hull sampled from the unit hypercube $[0,1]^d$ for $d\in\{10, 15, \dots, 50\}$. For each hypercube, we sample 50 points uniformly on each of its surfaces, giving a convex hull $X$ with $n = 100d$ data points. Once $X$ is sampled, 50 $y$'s are created outside the hypercube, perpendicular to a surface and a unit length away from it. This is done by considering the 50 points in $X$ lying on a randomly selected surface of the hypercube. A point $y_{\mathrm{true}}$ is created as a random convex combination of these points.
The point $y$ can then be created perpendicular to this surface and a unit length away from $y_{\mathrm{true}}$, and thus also from the convex hull of $X$. Each $y$ is then projected onto $X$ using CS, EGD, and PFW. These algorithms are run until a solution $\hat y$ is found such that $\|\hat y - y_{\mathrm{true}}\| \le 10^{-5}$, or 10,000 iterations have been made. We do not implement PGD and Frank-Wolfe due to their inefficiency in practice.

Implementation Details: The learning rate for EGD, PFW, and CS is found through a line search. In the case of the PFW and CS algorithms, an explicit solution can be found and used, while EGD implements a backtracking line search with Armijo conditions (Bonnans et al., 2006) to find an approximate solution. Experiments were written in Python and run on Google Colab. The code can be found on GitHub³. The random data is seeded for reproducibility.

Results: The results can be seen in Fig. 1. For $d = 10$, PFW outperforms both CS and EGD in terms of the number of iterations required and the time taken. But for $d > 10$, on average, CS converges with the fewest iterations and the least time.

³ https://github.com/infamoussoap/ConvexHull

Figure 1: Number of steps and time required for PFW, EGD, and CS to project 50 randomly sampled points onto the $d$-hypercube; (a) average steps taken in log-scale, (b) average time taken in log-scale. The bars indicate the minimum and maximum values.

### 6.2 Optimal Question Weighting

It is often desirable that the distribution of exam marks matches a target distribution, but this rarely happens. Modern standardized tests (e.g., IQ exams) solve this problem by transforming the distribution of the raw score of a given age group so it fits a normal distribution (Bartholomew, 2004; Gottfredson, 2009; Mackintosh, 2011). While IQ exams have many criticisms (Richardson, 2002; Shuttleworth-Edwards et al., 2004; Shuttleworth-Edwards, 2016), we are interested in the raw score.
As noted by Gottfredson (2009), the raw score has no intrinsic meaning, as it can be boosted by adding easier questions to the test. We also argue that it is hard to predict the difficulty of a question relative to an age group and, thus, even harder to give it the correct weight, making the raw score a poor reflection of a person's performance. Here we propose a framework to find an optimal weighting of questions such that the weighted scores fit a target distribution. A demonstration can be seen in Fig. 2.

Consider $d$ students taking an exam with $n$ true-or-false questions. For simplicity, assume that person $j$ getting question $i$ correct can be modeled as a random variable $X_{i,j}$ for $i = 1,\dots,n$ and $j = 1,\dots,d$. Consider the discrete distribution of $\bar X_j = \sum_i w_i X_{i,j}$ for some $w\in\Delta_n$, the weighted mark of person $j$. This distribution can be approximated as a continuous distribution,

$$\rho_\varepsilon(z) = \frac1d\sum_{j=1}^d \mu_\varepsilon\big(z - \bar X_j\big), \quad\text{with } \mu_\varepsilon(x) = \frac1\varepsilon\,\mu\Big(\frac{x}{\varepsilon}\Big),$$

where $\varepsilon > 0$ and $\mu$ is a continuous probability distribution, i.e., $\mu \ge 0$ and $\int\mu(x)\,dx = 1$. This is also known as kernel density estimation (KDE).

We want to minimize the distance between $\rho_\varepsilon$ and some target distribution $f$. A natural choice is the relative entropy,

$$\min_{w\in\Delta_n} D(\rho_\varepsilon|f) = \min_{w\in\Delta_n}\int \rho_\varepsilon(x)\log\frac{\rho_\varepsilon(x)}{f(x)}\,dx,$$

of which we take its Riemann approximation,

$$\min_{w\in\Delta_n}\hat D(\rho_\varepsilon|f) = \min_{w\in\Delta_n}\sum_{k=1}^{M}\rho_\varepsilon(x_k)\log\frac{\rho_\varepsilon(x_k)}{f(x_k)}\,(x_k - x_{k-1}), \tag{16}$$

where $\{x_k\}_{0\le k\le M}$ is a partition of a finite interval $[a, b]$. We remark that this problem is not convex, as $\mu$ cannot both be convex w.r.t. $w$ and be a probability distribution.

This is similar to robust location parameter estimation (Huber, 1964; Maronna et al., 2006), which considers data sampled from a contaminated distribution (i.e., an uncertain mixture of two known distributions), for instance, when predicting apparatus failures from two populations. Specifically, for some contamination level $0 < \alpha \le 1$, data was sampled at a rate of $1-\alpha$ from a distribution $G$ (i.e.
easy questions) and at a rate of $\alpha$ from an alternate distribution $H$ (i.e., hard questions). Thus, data is sampled from the mixture $F = (1-\alpha)G + \alpha H$. Under this contaminated distribution, robust parameter estimation finds point estimates (e.g., mean and variance) that are robust when $\alpha$ is small. In this sense, robust parameter estimation is interested in finding weights $w$ that make the KDE, $\rho_\varepsilon(z)$, robust to the contamination level $\alpha$. Instead, we wish to make the KDE match a target distribution.

Experiment Details: We consider 25 randomly generated sets of exam marks, each having $d = 200$ students taking an exam with $n = 75$ true-or-false questions. For simplicity, we assume that $X_{i,j}\sim\mathrm{Bernoulli}(q_i s_j)$, where $0 < q_i < 1$ is the difficulty of question $i$ and $0 < s_j < 1$ is the $j$-th student's smartness. For each scenario, $q_i = 7/8$ for $1\le i\le 60$ and $q_i = 1/5$ for $60 < i \le 75$, while $s_j = 7/10$ for $1\le j\le 120$ and $s_j = 1/2$ for $120 < j\le 200$. The $X_{i,j}\sim\mathrm{Bernoulli}(q_i s_j)$ are then sampled. This setup results in a bimodal distribution, with an expected average of 0.532 and an expected standard deviation of 0.1206, as shown in Figure 2.

For the kernel density estimate, $\mu(x)$ is chosen as a unit normal distribution truncated to $[0, 1]$, with smoothing parameter $\varepsilon = 0.05$. Similarly, $f$ is a normal distribution with mean 0.5 and variance 0.1, truncated to $[0, 1]$. We take the partition $\{k/400\}_{0\le k\le 400}$ for the Riemann approximation (16). The algorithms CS, EGD, and PFW are run for 150 iterations.

Implementation Details: The learning rate for EGD, PFW, and CS is found through a line search. However, explicit solutions are not used; instead, a backtracking line search with Armijo conditions is used to find an approximate solution. Experiments were written in Python, run on Google Colab, and can be found on GitHub⁴. The random data is seeded for reproducibility.

Results: A table with the results can be seen in Table 2.
In summary, of the 25 scenarios, PFW always produces solutions with the smallest relative entropy, with CS producing the largest relative entropy 13 times and EGD 12 times. For the time taken to make the 150 steps, PFW is the quickest 15 times, EGD 7 times, and CS 3 times. At the same time, EGD is the slowest 13 times, CS 7 times, and PFW 5 times.

4. https://github.com/infamoussoap/Convex Hull

Figure 2: Optimal question weighting for (randomly generated) exam scores with 200 students and 75 questions. The setup follows the experimental details in Section 6.2. The kernel density estimate uses a truncated unit normal distribution with ε = 0.05, and the target distribution is a truncated normal distribution with a mean of 0.5 and a standard deviation of 0.1. We take w^0 = (0.01, ..., 0.01), and the Cauchy-Simplex is applied, with each step using a backtracking line search. The resulting weighted histogram and kernel density estimate are shown. The distribution of the weighted marks is shown on the top row, and its QQ plots against a normal distribution with mean 0.5 and standard deviation 0.1 are shown on the bottom row. At iterations 0, 5, and 20, the weighted scores have a mean of 0.499, 0.514, and 0.501, with standard deviations of 0.128, 0.124, and 0.109, respectively.

Table 2: Results for optimal question weighting after running CS, EGD, and PFW for 150 iterations. The final distance (relative entropy) and the time taken in seconds are shown.
Bold represents the minimum (best) value, and underline represents the maximum (worst) value.

              CS                  EGD                 PFW
Trial   Distance   Time     Distance   Time     Distance   Time
1       0.032432   8.14     0.032426   10.98    0.032114   6.02
2       0.010349   5.95     0.010535   5.08     0.010101   4.46
3       0.016186   5.74     0.016252   4.93     0.015848   5.50
4       0.025684   4.62     0.025726   6.19     0.025309   4.51
5       0.020561   4.63     0.020486   6.03     0.020213   4.50
6       0.016559   5.58     0.016514   5.00     0.016287   4.52
7       0.025957   5.42     0.025867   4.87     0.025757   5.38
8       0.014506   4.77     0.014343   6.14     0.013504   4.28
9       0.032221   4.68     0.032412   6.02     0.032028   4.55
10      0.023523   5.59     0.023528   4.92     0.023232   5.34
11      0.016153   4.93     0.016231   5.01     0.015792   5.63
12      0.035734   4.53     0.035738   6.07     0.035212   4.22
13      0.030205   4.53     0.030234   6.13     0.029859   4.37
14      0.021725   5.80     0.021598   4.99     0.021282   5.85
15      0.030026   4.44     0.029982   4.99     0.029751   5.64
16      0.009212   4.75     0.009182   5.94     0.008931   4.27
17      0.015573   5.22     0.015661   5.46     0.015188   4.48
18      0.017681   5.69     0.017618   4.92     0.017321   5.48
19      0.017888   4.64     0.017874   5.34     0.017283   5.11
20      0.013597   4.55     0.013719   6.04     0.013075   4.42
21      0.016933   5.76     0.016780   4.84     0.016687   4.77
22      0.032185   5.80     0.032141   5.03     0.032039   6.03
23      0.018377   4.69     0.018250   6.06     0.018084   4.54
24      0.031167   4.59     0.031211   6.10     0.030820   4.55
25      0.035608   5.46     0.035674   4.98     0.035408   5.98
Average            5.22                5.68                4.97

It is perhaps expected that PFW outperforms both EGD and CS due to the low dimensionality of this problem. However, CS produces relative entropies similar to EGD's while maintaining a lower average run time of 5.22 seconds, compared to 5.68 seconds for EGD.

6.3 Prediction from Expert Advice

Consider N experts (e.g., on Twitter) who give daily advice for 1 ≤ t ≤ T days. By taking advice from expert i on day t, we incur a loss l_i^t ∈ [0, 1]. However, this loss is not known beforehand, only once the advice has been taken.
Since the wisdom of the crowd is typically better than any given expert (Landemore and Elster, 2012), we would like to take a weighted average of our experts, giving a daily loss of w^t · l^t, where w^t ∈ Δ_N are the weights given to the experts. This problem is also known as the multi-armed bandit problem.

A simple goal is to generate a sequence of weight vectors {w^t}_t to minimize the average expected loss. This goal is, however, a bit too ambitious, as the loss vectors l^t are not known beforehand. An easier problem is to find a sequence {w^t}_t whose average expected loss approaches the average loss of the best expert as T → ∞, that is, (1/T) R_T → 0, where

R_T = Σ_{t=1}^{T} w^t · l^t − min_i Σ_{t=1}^{T} l_i^t.

R_T is commonly known as the regret of the strategy {w^t}_t. Previous works (Littlestone and Warmuth, 1994; Cesa-Bianchi et al., 1997; Freund and Schapire, 1997; Arora et al., 2012) all yield an O(√(T log N)) convergence rate for the regret.

Theorem 11 Consider a sequence of adversarial loss vectors l^t ∈ [0, 1]^N. For any u ∈ Δ_N, the regret generated by the Cauchy-Simplex scheme

w^{t+1} = w^t (1 − η_t Π_{w^t} ∇f_t), where ∇f_t = l^t,

is bounded by

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ D(u|w^1)/η + Tη/(2(1 − η)),

for a fixed learning rate η_t = η < 1. In particular, taking w^1 = (1/N, ..., 1/N) and η = √(2 log N)/(√(2 log N) + √T) gives the bound

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ √(2T log N) + log N.

Moreover, this holds when u = e_j, where j is the best expert and e_j is the standard basis vector. The proof can be found in Appendix B.5.

6.4 Universal Portfolio

Consider an investor with a fixed-time trading horizon, T, managing a portfolio of N assets. Define the price relative for the i-th stock at time t as x_i^t = C_i^t / C_i^{t−1}, where C_i^t is the closing price at time t of the i-th stock. So today's closing price of asset i equals x_i^t times yesterday's closing price, i.e., today's price relative to yesterday's. A portfolio at day t can be described as w^t ∈ Δ_N, where w_i^t is the proportion of an investor's total wealth in asset i at the beginning of the trading day.
Then the wealth of the portfolio at the beginning of day t + 1 is w^t · x^t times the wealth of the portfolio at day t. Consider the average log-return of the portfolio,

(1/T) log Π_{t=1}^{T} (w^t · x^t) = (1/T) Σ_{t=1}^{T} log(w^t · x^t).

Similarly to prediction with expert advice, it is too ambitious to find a sequence of portfolio vectors {w^t}_t that maximizes the average log-return. Instead, we wish to find such a sequence that approaches the best fixed-weight portfolio as T → ∞, i.e., (1/T) LR_T → 0, where

LR_T = Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t),

for any u ∈ Δ_N. If such a sequence can be found, {w^t}_t is a universal portfolio. LR_T is commonly known as the log-regret.

Two standard assumptions are made when proving universal portfolios: (1) for every day, all assets have a bounded price relative, and at least one is non-zero, i.e., 0 < max_i x_i^t < ∞ for all t, and (2) no stock goes bankrupt during the trading period, i.e., a := min_{i,t} x_i^t > 0, where a is known as the market variability parameter. This is also known as the no-junk-bond assumption.

Over the years, various bounds on the log-regret have been proven under both assumptions. Some examples include Cover (1991), in his seminal paper, with O(log T); Helmbold et al. (1998) with O(√(T log N)); Agarwal et al. (2006) with O(N^{1.5} log(NT)); Hazan and Kale (2015) with O(N log(T + N)); and Gaivoronski and Stella (2000) with O(C² log T), for a constant C = sup_{x ∈ Δ_N} ‖∇_b log(b · x)‖, under an extra assumption of independent price relatives. Each comes with varying levels of computational complexity.

Remark 12 Let x^t ∈ [a, b]^N be a bounded sequence of price-relative vectors for 0 ≤ a ≤ b < ∞. Since the log-regret is invariant under re-scalings of x^t, w.l.o.g. we can look at the log-regret for the re-scaled return vectors x^t/b ∈ [ã, 1]^N, with ã = a/b ∈ [0, 1].

Theorem 13 Consider a bounded sequence of price-relative vectors x^t ∈ [a, 1]^N for some positive constant 0 < a ≤ 1 (no-junk-bond), and max_i x_i^t = 1 for all t.
Then the log-regret generated by the Cauchy-Simplex

w^{t+1} = w^t (1 − η Π_{w^t} ∇f_t), where ∇f_t = −x^t/(w^t · x^t),

is bounded by

Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t) ≤ D(u|w^1)/η + Tη/(2a²(1 − η)),

for any u ∈ Δ_N and 0 < η ≤ 1. In particular, taking w^1 = (1/N, ..., 1/N) and η = a√(2 log N)/(a√(2 log N) + √T) gives the bound

Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t) ≤ √(2T log N)/a + log N.

The proof can be found in Appendix B.6.

Experimental Details: We look at the performance of our algorithm on four standard datasets used to study universal portfolios: (1) NYSE, a collection of 36 stocks traded on the New York Stock Exchange from July 3, 1962, to Dec 31, 1984; (2) DJIA, a collection of 30 stocks tracked by the Dow Jones Industrial Average from Jan 14, 2001, to Jan 14, 2003; (3) SP500, a collection of the 25 largest-market-cap stocks tracked by the Standard & Poor's 500 Index from Jan 2, 1988, to Jan 31, 2003; and (4) TSE, a collection of 88 stocks traded on the Toronto Stock Exchange from Jan 4, 1994, to Dec 31, 1998.5

Two other portfolio strategies are considered: (1) Helmbold et al. (1998) (EGD), which uses the EGD scheme w_i^{t+1} = w_i^t exp(η ∇_i f_t) / Σ_j w_j^t exp(η ∇_j f_t), with ∇f_t = x^t/(w^t · x^t), and (2) the Buy and Hold (B&H) strategy, in which one starts with an equally weighted portfolio and the portfolio is then left to its own devices.

Two metrics are used to evaluate the performance of the portfolio strategies: (1) the annualized percentage yield, APY = R^{1/y} − 1, where R is the total return over the full trading period and y = T/252, with 252 the average number of annual trading days; and (2) the Sharpe ratio, SR = (APY − R_f)/σ, where σ² is the variance of the daily returns of the portfolio and R_f is the risk-free interest rate. Intuitively, the Sharpe ratio measures the performance of the portfolio relative to a risk-free investment while also factoring in the portfolio's volatility. Following Li et al. (2012), we take R_f = 4%.
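For concreteness, the two evaluation metrics can be computed as follows. This is an illustrative helper, not the paper's code; `price_relatives` holds the portfolio's daily wealth factors, and following the formula above, σ is taken as the plain standard deviation of the daily returns with no annualization.

```python
import numpy as np

def apy_and_sharpe(price_relatives, rf=0.04):
    """Annualized percentage yield and Sharpe ratio from daily price
    relatives (wealth factors). Illustrative helper, not the paper's code."""
    r = np.asarray(price_relatives, dtype=float)
    R = r.prod()                 # total return over the full trading period
    y = len(r) / 252             # trading years, 252 trading days per year
    apy = R ** (1 / y) - 1       # APY = R^(1/y) - 1
    sigma = np.std(r - 1)        # volatility of daily returns
    return apy, (apy - rf) / sigma

# Example: a two-day period of +10% then -10% (a net loss once annualized)
apy, sr = apy_and_sharpe([1.10, 0.90])
```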
We take the learning rates used to prove that CS and EGD are universal portfolios. In particular, η_CS = a√(2 log N)/(a√(2 log N) + √T) and η_EGD = 2a√(2 log(N)/T), respectively, where a is the market variability parameter. We assume that the market variability parameter is given for each dataset. Experiments were written in Python and can be found on GitHub.6

Results: A table with the results can be seen in Table 3. For the NYSE, DJIA, and SP500 datasets, CS slightly outperforms EGD in both the APY and Sharpe ratio, with EGD having a slight edge on the APY for the NYSE dataset. But curiously, the B&H strategy outperforms both CS and EGD on the TSE.

We remark that this experiment does not reflect real-world performance, as the market variability parameter is assumed to be known, transaction costs are not factored into our analysis, and the no-junk-bond assumption tends to overestimate performance (Gilbert and Strugnell, 2010; Bailey et al., 2014; Bailey and de Prado, 2014). However, this is outside the scope of this paper; the experiment is shown only as a proof of concept.

5. The datasets were originally found at http://www.cs.technion.ac.il/~rani/portfolios/, which is now unavailable. They were retrieved using the Wayback Machine: https://web.archive.org/web/20220111131743/http://www.cs.technion.ac.il/~rani/portfolios/.
6. https://github.com/infamoussoap/Universal Portfolio

Table 3: Performance of different portfolio strategies on different datasets. Bold represents the maximum (best) value, and underline represents the minimum (worst) value.
             CS                  EGD                 B&H
Dataset   APY      Sharpe    APY      Sharpe    APY      Sharpe
NYSE      0.162    14.360    0.162    14.310    0.129     9.529
DJIA     -0.099    -8.714   -0.101    -8.848   -0.126   -10.812
SP500     0.104     4.595    0.101     4.395    0.061     1.347
TSE       0.124    10.225    0.123    10.204    0.127    10.629

7. Conclusion

This paper presents a new iterative algorithm, the Cauchy-Simplex, to solve convex problems over a probability simplex. Within this algorithm, we find ideas from previous works, each of which captures only a portion of the simplex constraint. Combining these ideas, the Cauchy-Simplex provides a numerically efficient framework with nice theoretical properties.

The Cauchy-Simplex maintains the linear form of projected gradient descent, allowing one to find analytical solutions to a line search for certain convex problems. But unlike projected gradient descent, this analytical solution remains in the probability simplex. A backtracking line search can be used when an analytical solution cannot be found. This normally requires an extra parameter, the maximum candidate step size; the Cauchy-Simplex provides a natural choice, since a maximum learning rate is already required to enforce positivity, in contrast to the exponentials used in Exponentiated Gradient Descent.

Since the Cauchy-Simplex satisfies both constraints of the probability simplex in its iteration scheme, its gradient flow can be used to prove convergence for differentiable convex functions. This implies the convergence of its discrete linear scheme. This is in contrast to EGD, PFW, and PGD, whose discrete nature is crucial in satisfying both constraints of the probability simplex. More surprisingly, we find that formulas natural to probability, e.g., variance and relative entropy, arise necessarily in the convergence proofs.

We believe that the strong numerical results, together with the simplicity of its motivating derivation, gradient flow, and iteration scheme, make the Cauchy-Simplex a strong choice for solving problems with a probability simplex constraint.

Acknowledgments

We thank Prof. Johannes Ruf for the helpful discussion and his suggestion of potential applications in the multi-armed bandit problem, which ultimately helped the proof for universal portfolios. We also thank the anonymous reviewers for their helpful suggestions for improving the presentation and readability of this paper. The authors declare no competing interests.
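One step of the Cauchy-Simplex update w^{t+1} = w^t(1 − η Π_{w^t}∇f_t) discussed above can be sketched as below. The fixed fraction γ of the maximum feasible learning rate and the final renormalization guard are our illustrative choices; the paper instead picks η by a line search.

```python
import numpy as np

def cauchy_simplex_step(w, grad, gamma=0.5):
    """One Cauchy-Simplex step: w <- w * (1 - eta * Pi_w grad).
    Here eta = gamma * eta_max, with eta_max = 1 / max_i (Pi_w grad)_i;
    gamma is an illustrative choice, the paper uses a line search."""
    pg = grad - np.dot(w, grad)      # (Pi_w grad)_i = grad_i - <w, grad>
    m = pg.max()
    if m <= 0:                       # Pi_w grad vanishes on the support: optimal
        return w
    eta = gamma / m                  # keeps every factor 1 - eta*pg_i > 0
    w_new = w * (1.0 - eta * pg)     # multiplicative update; sum stays 1 exactly
    return w_new / w_new.sum()       # renormalize only to absorb round-off

# Example: minimize the linear function f(w) = c . w over the simplex
c = np.array([1.0, 0.5, 2.0])
w = np.full(3, 1 / 3)
for _ in range(200):
    w = cauchy_simplex_step(w, c)    # drifts toward the vertex argmin_i c_i
```

Note that positivity and the unit sum are preserved by construction: η·(Π_w∇f)_i ≤ γ < 1, and the update's weighted mean of Π_w∇f is zero.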
Appendix A. Gradient Flow for Orthogonal Matrix Constraint

Consider the optimization problem min_{Q ∈ S_n} f(Q), where S_n = {Q ∈ R^{n×n} | QQ^T = I}. Orthogonal matrices with det(Q) = 1 can be parameterized using the Cayley transform

Q = 2(I − X)^{-1} − I,

where X ∈ R^{n×n} is an anti-symmetric matrix; let F(X) = f(Q). The anti-symmetric matrix X can be parameterized as

X = Σ_{j>k} X_{jk} E^{jk}, for X_{jk} ∈ R and (E^{jk})_{pq} = δ_{pj}δ_{qk} − δ_{pk}δ_{qj}.

Consider the gradient flow Ẋ_{jk} = −∂F/∂X_{jk}. Under the Cayley transform, writing Ω = (I − X)^{-1},

∂Q/∂X_{jk} = 2(I − X)^{-1} (∂X/∂X_{jk}) (I − X)^{-1},

and thus

∂Q_{ab}/∂X_{jk} = 2 Σ_{lm} Ω_{al} (E^{jk})_{lm} Ω_{mb} = 2(Ω_{aj}Ω_{kb} − Ω_{ak}Ω_{jb}).

The gradient flow for the latent variables X_{jk} is therefore

Ẋ_{jk} = −2 Σ_{ab} (∂f/∂Q_{ab}) (Ω_{aj}Ω_{kb} − Ω_{ak}Ω_{jb}) = −2(ΛΩ^T − ΩΛ^T)_{jk}, where Λ = Ω^T (∂f/∂Q).

Converting this to the gradient flow for the orthogonal matrix Q yields

Q̇_{pq} = −4 Σ_{j>k} (Ω_{pj}Ω_{kq} − Ω_{pk}Ω_{jq})(ΛΩ^T − ΩΛ^T)_{jk}
       = −2[Ω(ΛΩ^T − ΩΛ^T)Ω − Ω(ΛΩ^T − ΩΛ^T)^T Ω]
       = −4Ω(ΛΩ^T − ΩΛ^T)Ω,

as required.

Appendix B. Convergence Proofs

We will use the notation Π_t = Π_{w^t} and ∇f_t = ∇_w f(w^t) for the remainder of this section.

B.1 Decreasing Relative Entropy

Theorem 14 For any u ∈ Δ_n, each iteration of the linear Cauchy-Simplex (6) satisfies the bound

D(u|w^{t+1}) − D(u|w^t) ≤ η_t u · (Π_t∇f_t) + C_{γ_t} (η_t)² u · (Π_t∇f_t)²,

where C_{γ_t} = γ_t^{-2} log(e^{−γ_t}/(1 − γ_t)), η_t = γ_t η_{t,max} with γ_t ∈ (0, 1), and (Π_t∇f_t)² is taken elementwise.

Proof By the CS update scheme (6),

D(u|w^{t+1}) − D(u|w^t) = −Σ_i u_i log(w_i^{t+1}/w_i^t) = −Σ_i u_i log(1 − η_t Π_t∇_i f_t)
= η_t u · (Π_t∇f_t) + Σ_i u_i log( e^{−η_t Π_t∇_i f_t} / (1 − η_t Π_t∇_i f_t) ).

Since the learning rate can be written as η_t = γ_t η_{t,max} with γ_t ∈ (0, 1),

η_t Π_t∇_i f_t ≤ γ_t η_{t,max} max_i Π_t∇_i f_t = γ_t < 1.

It can be shown that x^{-2} log(e^{−x}/(1 − x)) > 0 for x ∈ (0, 1) and is an increasing function on (0, 1), with x^{-2} log(e^{−x}/(1 − x)) → ∞ as x → 1. Thus, on the interval x ∈ (0, γ_t] with γ_t < 1, it can be upper-bounded by

0 < x^{-2} log(e^{−x}/(1 − x)) ≤ γ_t^{-2} log(e^{−γ_t}/(1 − γ_t)) = C_{γ_t}.

This yields the inequality 0 < log(e^{−x}/(1 − x)) ≤ C_{γ_t} x² for 0 < x ≤ γ_t (the bound extends to x ≤ 0, where the left side is at most x²/2 ≤ C_{γ_t} x²). Since C_{γ_t} > 0 for γ_t ∈ (0, 1) and C_{γ_t} → ∞ as γ_t → 1, we have the bound

Σ_i u_i log( e^{−η_t Π_t∇_i f_t} / (1 − η_t Π_t∇_i f_t) ) ≤ C_{γ_t}(η_t)² Σ_i u_i (Π_t∇_i f_t)²,

giving the required inequality

D(u|w^{t+1}) − D(u|w^t) ≤ η_t u · (Π_t∇f_t) + C_{γ_t}(η_t)² u · (Π_t∇f_t)².
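The scalar inequality at the heart of this proof, log(e^{−x}/(1 − x)) ≤ C_γ x² for 0 < x ≤ γ, is easy to check numerically (a sanity check of ours, not part of the paper):

```python
import numpy as np

# C_gamma from Theorem 14: C(g) = g**-2 * log(exp(-g) / (1 - g))
def C(g):
    return np.log(np.exp(-g) / (1 - g)) / g ** 2

# Verify log(exp(-x)/(1-x)) <= C(gamma) * x**2 on (0, gamma]
gamma = 0.9
x = np.linspace(1e-4, gamma, 1000)
assert np.all(np.log(np.exp(-x) / (1 - x)) <= C(gamma) * x ** 2 + 1e-12)
```

The check passes for any γ ∈ (0, 1), consistent with x^{-2} log(e^{−x}/(1 − x)) being increasing on (0, 1).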
B.2 Progress Bound

Theorem 15 Let f be convex and differentiable with ∇f L-Lipschitz continuous. Then each iteration of the linear Cauchy-Simplex (6) guarantees

f(w^{t+1}) − f(w^t) ≤ −(η_t/2) Var[∇f_t | w^t],

for 0 < η_t < min{1/L, η_{t,max}}, where η_{t,max} is defined in (6).

Proof Since f is convex with ∇f Lipschitz continuous, the descent lemma yields (Bertsekas, 2016)

f(w^{t+1}) ≤ f(w^t) + ∇f_t · (w^{t+1} − w^t) + (L/2)‖w^{t+1} − w^t‖².   (17)

Our iteration scheme gives

‖w^{t+1} − w^t‖² = (η_t)² Σ_i (w_i^t Π_t∇_i f_t)² ≤ (η_t)² Σ_i w_i^t (Π_t∇_i f_t)²,

since 0 ≤ w_i^t ≤ 1. Hence

f(w^{t+1}) − f(w^t) ≤ −(1/2L)(2z − z²) Var[∇f_t | w^t], where z = η_t L.

However, 2z − z² ≥ z for 0 ≤ z ≤ 1. Therefore,

f(w^{t+1}) − f(w^t) ≤ −(η_t/2) Var[∇f_t | w^t], for 0 < η_t ≤ min{1/L, η_{t,max}}.

B.3 Proof of Lemma 10

Proof Notice that for η_t ∈ (0, η_{t,max} − ε], the bound (17) still holds, and a line search minimizes the left-hand side of (17). Bounding the right-hand side yields

f(w^{t+1}) − f(w^t) ≤ −(η̄_t/2) Var[∇f_t | w^t], for η̄_t = min{1/L, η_{t,max} − ε}.

As shown in Theorem 6, Var[∇f_t|w^t] = 0 only when w^t = w*. Thus, w.l.o.g., for finite T > 0, assume the sequence of vectors {w^t}_{t=0}^{T} generated by the line search satisfies Π_t∇f_t ≠ 0, i.e., the optimality condition has not been reached. Then {f(w^t)}_{t=0}^{T} is a strictly decreasing sequence. Since f is convex and Δ_n is compact, f is bounded from below on Δ_n. Hence f(w^T) → f(w*) as T → ∞.

B.4 Proof of Theorem 9

Proof W.l.o.g., we assume that the sequence {η_t}_{t=0}^{T}, which satisfies the assumptions of Theorem 9, generates a sequence {w^t}_{t=0}^{T} such that Π_t∇f_t ≠ 0, i.e., the optimality condition has not been reached. By Theorem 14,

D(w*|w^{t+1}) − D(w*|w^t) ≤ η_t ∇f_t · (w* − w^t) + C_{γ_t}(η_t)² w* · (Π_t∇f_t)².

By convexity of f, rearranging gives

η_t (f(w^t) − f(w*)) ≤ D(w*|w^t) − D(w*|w^{t+1}) + C_{γ_t}(η_t)² w* · (Π_t∇f_t)².   (18)

By Theorem 15, we have f(w^{t+1}) ≤ f(w^t) − (η_t/2) w^t · (Π_t∇f_t)². Repeatedly applying this inequality gives

f(w^T) + (1/2) Σ_{k=t}^{T−1} η_k w^k · (Π_k∇f_k)² ≤ f(w^t), for t ≤ T − 1.
Thus, (18) gives the bound

η_t (f(w^T) − f(w*)) ≤ D(w*|w^t) − D(w*|w^{t+1}) + C_{γ_t}(η_t)² w* · (Π_t∇f_t)² − (η_t/2) Σ_{k=t}^{T−1} η_k w^k · (Π_k∇f_k)².

Summing over time and collapsing the sum gives

(f(w^T) − f(w*)) Σ_{t=0}^{T−1} η_t ≤ D(w*|w^0) − D(w*|w^T)   (19)
  + Σ_{t=0}^{T−1} C_{γ_t}(η_t)² w* · (Π_t∇f_t)² − (1/2) Σ_{t=0}^{T−1} Σ_{k=t}^{T−1} η_k η_t w^k · (Π_k∇f_k)².

We can rewrite the last term by swapping the order of summation,

Σ_{t=0}^{T−1} Σ_{k=t}^{T−1} η_k η_t w^k · (Π_k∇f_k)² = Σ_{t=0}^{T−1} ( Σ_{k=0}^{t} η_k ) η_t w^t · (Π_t∇f_t)².

Thus, the last two terms of (19) can be bounded by

S := Σ_{t=0}^{T−1} η_t [ C_{γ_t} η_t w* · (Π_t∇f_t)² − (1/2)( Σ_{k=0}^{t} η_k ) w^t · (Π_t∇f_t)² ]
  ≤ Σ_{t=0}^{T−1} η_t [ C_{γ_t} η_t max_i (Π_t∇f_t)_i² − (1/2)( Σ_{k=0}^{t} η_k ) w^t · (Π_t∇f_t)² ],

as w* ∈ Δ_n. By assumption on C_{γ_t},

C_{γ_t} η_t max_i (Π_t∇f_t)_i² ≤ (1/2)( Σ_{k=0}^{t} η_k ) w^t · (Π_t∇f_t)².

It follows that S ≤ 0. Thus (19) gives

f(w^T) − f(w*) ≤ (D(w*|w^0) − D(w*|w^T)) / Σ_{t=0}^{T−1} η_t.

Taking w^0 = (1/n, ..., 1/n), then D(u|w^0) ≤ log(n) for all u ∈ Δ_n. Since relative entropy is non-negative,

f(w^T) − f(w*) ≤ log(n) / Σ_{t=0}^{T−1} η_t.

B.5 Proof of Theorem 11

Proof Rearranging Theorem 14 gives

−η_t u · (Π_t∇f_t) ≤ D(u|w^t) − D(u|w^{t+1}) + C_{γ_t}(η_t)² u · (Π_t∇f_t)².   (20)

Since ∇f_t = l^t, we have the inequality

−1 ≤ Π_t∇_i f_t = l_i^t − w^t · l^t ≤ 1, as l_i^t ∈ [0, 1],

so u · (Π_t∇f_t)² ≤ 1. Thus, dividing (20) by η_t gives

w^t · l^t − u · l^t ≤ (D(u|w^t) − D(u|w^{t+1}))/η_t + C_{γ_t} η_t,

where η_t = γ_t η_{t,max} for some γ_t ∈ (0, 1). Since the maximum learning rate has the lower bound

η_{t,max} = 1/(max_i l_i^t − w^t · l^t) ≥ 1/max_i l_i^t ≥ 1,

we can take a fixed learning rate η_t = η ∈ (0, 1). Moreover, γ_t = η/η_{t,max} ≤ η. Since C_{γ_t} is an increasing function of γ_t, C_{γ_t} ≤ C_η, thus giving the bound

w^t · l^t − u · l^t ≤ (D(u|w^t) − D(u|w^{t+1}))/η + C_η η.

Summing over time and collapsing the sum gives the bound

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ (D(u|w^1) − D(u|w^{T+1}))/η + (T/η) log(e^{−η}/(1 − η)),

by the definition of C_η. Using the inequality log(e^{−x}/(1 − x))/x ≤ x/(2(1 − x)) for 0 ≤ x < 1,

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ D(u|w^1)/η + Tη/(2(1 − η)).

Let w^1 = (1/N, ..., 1/N); then D(u|w^1) ≤ log(N) for all u ∈ Δ_N, giving the desired bound

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ log(N)/η + Tη/(2(1 − η)).

The right side of this inequality is minimized when η = √(2 log N)/(√(2 log N) + √T) < 1. Substitution gives the bound √(2T log N) + log N.

B.6 Proof of Theorem 13

Proof Rearranging Theorem 14 gives

−η_t u · (Π_t∇f_t) ≤ D(u|w^t) − D(u|w^{t+1}) + C_{γ_t}(η_t)² u · (Π_t∇f_t)².
Since ∇f_t = −x^t/(w^t · x^t), we have the inequality

1 − 1/a ≤ Π_t∇_i f_t = 1 − x_i^t/(w^t · x^t) ≤ 1 − a, as x_i^t ∈ [a, 1],

so u · (Π_t∇f_t)² ≤ (1/a − 1)² ≤ 1/a². Noting that −u · (Π_t∇f_t) = (u · x^t)/(w^t · x^t) − 1, dividing (20) by η_t gives the bound

(u · x^t)/(w^t · x^t) − 1 ≤ (D(u|w^t) − D(u|w^{t+1}))/η_t + C_{γ_t} η_t / a².

Using the inequality e^x ≥ 1 + x for all x gives

log(u · x^t) − log(w^t · x^t) ≤ (u · x^t)/(w^t · x^t) − 1 ≤ (D(u|w^t) − D(u|w^{t+1}))/η_t + C_{γ_t} η_t / a².

Since the maximum learning rate has the lower bound

η_{t,max} = 1/max_i (1 − x_i^t/(w^t · x^t)) = 1/(1 − min_i x_i^t/(w^t · x^t)) ≥ 1,

we can take a fixed learning rate η_t = η. Following the steps of Theorem 11 gives the bound

Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t) ≤ D(u|w^1)/η + Tη/(2a²(1 − η)).

Taking w^1 = (1/N, ..., 1/N) and minimizing the right-hand side of the inequality w.r.t. η gives η = a√(2 log N)/(a√(2 log N) + √T). This yields the bound

Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t) ≤ √(2T log N)/a + log N.

Appendix C. Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions are first-order conditions that are necessary but not sufficient for optimality in constrained optimization problems. For convex problems, they become sufficient for optimality. We give a brief overview of the KKT conditions here; for a full treatment of the subject, we suggest Kochenderfer and Wheeler (2019).

Consider a general constrained optimization problem

min_w f(w) s.t. g_i(w) ≤ 0 and h_j(w) = 0,   (21)

for i = 1, ..., n and j = 1, ..., m. The (primal) Lagrangian is defined as

L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_j β_j h_j(w).

Consider the new optimization problem

θ(w) = max_{α,β; α_i ≥ 0} L(w, α, β) = { f(w) if w satisfies the constraints, +∞ otherwise }.

Hence, α_i and β_j act as slack variables that render θ(w) infinite whenever a constraint is violated. To solve (21), we can instead consider the new optimization problem

min_w θ(w) = min_w max_{α,β; α_i ≥ 0} L(w, α, β).

Assume f and g_i are convex, h_j is affine, and the constraints are feasible.
A solution (w*, α*, β*) is an optimal solution to (21) if the following conditions, known as the KKT conditions, are satisfied for all i and j:

∇_{w_i} L(w*, α*, β*) = 0 (stationarity),
α_i* g_i(w*) = 0 (complementary slackness),
g_i(w*) ≤ 0, h_j(w*) = 0 (primal feasibility),
α_i* ≥ 0 (dual feasibility).

When the constraint is a simplex, the Lagrangian becomes

L(w, α, β) = f(w) − Σ_i α_i w_i + β(Σ_i w_i − 1).

Thus, stationarity gives ∇_{w_i} L = ∇_i f − α_i + β = 0. Let Q = {i : w_i = 0} be the active set and S = {i : w_i > 0} be the support. Complementary slackness requires α_i = 0 for i ∈ S, so stationarity gives β = −∇_i f for i ∈ S, i.e., ∇f is constant on the support. The active set's dual feasibility and stationarity conditions thus require α_i = β + ∇_i f ≥ 0 for i ∈ Q.

References

P. Ablin and G. Peyré. Fast and accurate optimization on the orthogonal manifold without retraction. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
E. M. Achour, F. Malgouyres, and F. Mamalet. Existence, stability and scalability of orthogonal convolutional neural networks. Journal of Machine Learning Research, 23(1), January 2022.
A. Agarwal, E. Hazan, S. Kale, and R. E. Schapire. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning. ACM Press, 2006.
M. Amsler, V. I. Hegde, S. D. Jacobsen, and C. Wolverton. Exploring the high-pressure materials genome. Physical Review X, 2018.
S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 2012.
D. H. Bailey and M. L. de Prado. The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting and non-normality. SSRN Electronic Journal, 2014.
D. H. Bailey, J. M. Borwein, M. L. de Prado, and Q. J. Zhu. Pseudomathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the AMS, 2014.
D. J. Bartholomew. Measuring intelligence.
Cambridge University Press, Cambridge, England, August 2004.
A. Bellet, Y. Liang, A. B. Garakani, M. F. Balcan, and F. Sha. A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning. Society for Industrial and Applied Mathematics, 2015.
D. P. Bertsekas. Nonlinear Programming, page 667. Athena Scientific, 3rd edition, 2016.
D. Bertsimas and D. L. Kitane. Sparse PCA: A geometric approach. Journal of Machine Learning Research, 2023.
I. M. Bomze, F. Rinaldi, and D. Zeffiro. Frank-Wolfe and friends: A journey into projection-free first-order optimization methods, 2021.
I. M. Bomze. On standard quadratic optimization problems. Journal of Global Optimization, 1998.
I. M. Bomze. Regularity versus degeneracy in dynamics, games, and optimization: A unified approach to different aspects. SIAM Review, 2002.
J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects. Springer-Verlag, 2006.
A. Bucci, L. Ippoliti, and P. Valentini. Comparing unconstrained parametrization methods for return covariance matrix prediction. Statistics and Computing, 2022.
A. Candelieri, A. Ponti, and F. Archetti. Bayesian optimization over the probability simplex. Annals of Mathematics and Artificial Intelligence, 2025.
I. Canyakmaz, W. Lin, G. Piliouras, and A. Varvitsiotis. Multiplicative updates for online convex optimization over symmetric cones. arXiv preprint arXiv:2307.03136, 2023.
N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 1997.
S. Chen, A. Garcia, M. Hong, and S. Shahrampour. Decentralized Riemannian gradient descent on the Stiefel manifold. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021.
Y. Chen and X. Ye. Projection onto a simplex. arXiv preprint arXiv:1101.6081, 2011.
H.-H. Chou, J. Maly, and H. Rauhut.
More is less: Inducing sparsity via overparameterization. Information and Inference: A Journal of the IMA, 2023.
H.-H. Chou, H. Rauhut, and R. Ward. Robust implicit regularization via weight normalization, 2024.
M. Collins, A. Globerson, T. Koo, X. Carreras, and P. L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 2008.
T. M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, January 1991.
P. Das and A. Banerjee. Meta optimization and its application to portfolio selection. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 2011.
E. de Klerk, D. den Hertog, and G. Elabwabi. On the complexity of optimization over the standard simplex. European Journal of Operational Research, 2008.
L. L. Duan. Latent simplex position model: High dimensional multi-view clustering with uncertainty quantification. Journal of Machine Learning Research, 2020.
S. Fiori. Quasi-geodesic neural learning algorithms over the orthogonal group: A tutorial. Journal of Machine Learning Research, 2005.
G. Floto, T. Jonsson, M. Nica, S. Sanner, and E. Z. Zhu. Diffusion on the probability simplex. arXiv preprint arXiv:2309.02530, 2023.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.
A. A. Gaivoronski and F. Stella. Stochastic nonstationary optimization for finding universal portfolios. Annals of Operations Research, 2000.
A. Galántai. Projectors and Projection Methods. Springer US, 2004.
J. W. Gibbs. Elementary Principles in Statistical Mechanics. Cambridge University Press, September 2010.
E. Gilbert and D. Strugnell. Does survivorship bias really matter?
An empirical investigation into its effects on the mean reversion of share returns on the JSE (1984–2007). Investment Analysts Journal, 2010.
L. S. Gottfredson. Logical fallacies used to dismiss the evidence on intelligence testing. In Correcting Fallacies about Educational and Psychological Testing. American Psychological Association, 2009.
S. Grünewälder. Compact convex projections. Journal of Machine Learning Research, 2018.
J. Guélat and P. Marcotte. Some comments on Wolfe's "away step". Mathematical Programming, 1986.
P. R. Halmos. Introduction to Hilbert Space and the Theory of Spectral Multiplicity: Second Edition. Dover Books on Mathematics. Dover Publications, 1998.
P. Hansen, B. Jaumard, and S. H. Lu. On using estimates of Lipschitz constants in global optimization. Journal of Optimization Theory and Applications, 1992.
E. Hazan and S. Kale. An online portfolio selection algorithm with regret logarithmic in price variation. Mathematical Finance, 2015.
D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth. On-line portfolio selection using multiplicative updates. Mathematical Finance, October 1998.
P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 1964.
M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 2011.
B. Jiang and Y.-H. Dai. A framework of constraint preserving update schemes for optimization on Stiefel manifold. Mathematical Programming, 2014.
S. W. Jung, T. H. Kim, and S. J. Ko. A novel multiple image deblurring technique using fuzzy projection onto convex sets. IEEE Signal Processing Letters, 2009.
A. Kalai and S. Vempala.
Efficient algorithms for universal portfolios. Journal of Machine Learning Research, 3:423–440, March 2003.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.
M. J. Kochenderfer and T. A. Wheeler. Algorithms for Optimization. The MIT Press, 2019.
A. Kuznetsova and A. Strekalovsky. On solving the maximum clique problem, 2001.
A. Kyrillidis, S. Becker, V. Cevher, and C. Koch. Sparse projections onto the simplex. In Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants, 2015.
S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
H. Landemore and J. Elster. Collective Wisdom: Principles and Mechanisms. Cambridge University Press, 2012.
M. Lezcano-Casado and D. Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.
B. Li, P. Zhao, S. C. H. Hoi, and V. Gopalkrishnan. PAMR: Passive aggressive mean reversion strategy for portfolio selection. Machine Learning, 2012.
Q. Li, D. McKenzie, and W. Yin. From the simplex to the sphere: Faster constrained optimization using the Hadamard parametrization. Information and Inference: A Journal of the IMA, 2023.
Y.-H. Li and V. Cevher. Convergence of the exponentiated gradient method with Armijo line search. Journal of Optimization Theory and Applications, 2018.
N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 1994.
H. Lu, R. M. Freund, and Y. Nesterov. Relatively smooth convex optimization by first-order methods, and applications.
SIAM Journal on Optimization, 2018.
D. G. Luenberger. Optimization by Vector Space Methods. John Wiley and Sons, Inc., USA, 1st edition, 1997.
N. Mackintosh. IQ and Human Intelligence. Oxford University Press, London, England, 2nd edition, 2011.
H. M. Markowitz, R. Lacey, J. Plymen, M. A. H. Dempster, and R. G. Tompkins. The general mean-variance portfolio selection problem [and discussion]. Philosophical Transactions: Physical Sciences and Engineering, 1994.
R. A. Maronna, R. D. Martin, and V. J. Yohai. Robust Statistics. Wiley, 2006.
C. Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. Journal of Optimization Theory and Applications, 1986.
T. Mizutani. Ellipsoidal rounding for nonnegative matrix factorization under noisy separability. Journal of Machine Learning Research, 2014.
C. Mu, Y. Zhang, J. Wright, and D. Goldfarb. Scalable robust matrix recovery: Frank-Wolfe meets proximal methods. SIAM Journal on Scientific Computing, 2016.
M. Nandan, P. P. Khargonekar, and S. S. Talathi. Fast SVM training using approximate extreme points. Journal of Machine Learning Research, 2014.
A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, 1983.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.
L. Omlor and M. A. Giese. Anechoic blind source separation using Wigner marginals. Journal of Machine Learning Research, 2011.
A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 2008.
K. Richardson. What IQ tests test. Theory & Psychology, 2002.
J. Särelä and H. Valpola. Denoising source separation. Journal of Machine Learning Research, 2005.
P. Schuster and K. Sigmund. Replicator dynamics. Journal of Theoretical Biology, 100(3):533–538, 1983.
A. Selvi, D. den Hertog, and W. Wiesemann.
A reformulation-linearization technique for optimization over simplices. Mathematical Programming, 2023.
A. B. Shuttleworth-Edwards. Generally representative is representative of none: Commentary on the pitfalls of IQ test standardization in multicultural settings. The Clinical Neuropsychologist, 2016.
A. B. Shuttleworth-Edwards, R. D. Kemp, A. L. Rust, J. G. L. Muirhead, N. P. Hartman, and S. E. Radloff. Cross-cultural effects on IQ test performance: A review and preliminary normative indications on WAIS-III test performance. Journal of Clinical and Experimental Neuropsychology, 2004.
K. Tajima, Y. Hirohashi, E. R. R. Zara, and T. Kato. Frank-Wolfe algorithm for learning SVM-type multi-category classifiers. Society for Industrial and Applied Mathematics, 2021.
N. K. Vishnoi. Mirror Descent and the Multiplicative Weights Update, pages 108–142. Cambridge University Press, 2021.
W. Wang and M. Á. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.
X. Wang, J. Zhang, and W. Zhang. The distance between convex sets with Minkowski sum structure: Application to collision detection. Computational Optimization and Applications, 2020.
Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 2012.
B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of the Thirty Third Conference on Learning Theory. PMLR, 2020.
N. Xiu, C. Wang, and L. Kong. A note on the gradient projection method with exact stepsize rule. Journal of Computational Mathematics, 2007.
R. Yousefzadeh. A sketching method for finding the closest point on a convex hull. arXiv preprint arXiv:2102.10502, 2021.