# Optimization Over a Probability Simplex

Journal of Machine Learning Research 26 (2025) 1-35. Submitted 9/23; Revised 12/24; Published 4/25.

James Chok¹,² (james.chok@ed.ac.uk)
Geoffrey M. Vasil¹ (gvasil@ed.ac.uk)

¹School of Mathematics and Maxwell Institute for Mathematical Sciences, The University of Edinburgh, Edinburgh EH9 3FD, United Kingdom
²Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, United Kingdom

Editor: Silvia Villa

©2025 James Chok and Geoffrey M. Vasil. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1166.html.

Abstract: We propose a new iteration scheme, the Cauchy-Simplex, to optimize convex problems over the probability simplex $\{w \in \mathbb{R}^n \mid \sum_i w_i = 1 \text{ and } w_i \ge 0\}$. Specifically, we map the simplex to the positive quadrant of a unit sphere, envisage gradient descent in latent variables, and map the result back in a way that only depends on the simplex variable. Moreover, proving rigorous convergence results in this formulation leads inherently to tools from information theory (e.g., cross-entropy and KL divergence). Each iteration of the Cauchy-Simplex consists of simple operations, making it well-suited for high-dimensional problems. In continuous time, we prove that $f(w^T) - f(w^*) = O(1/T)$ for differentiable real-valued convex functions, where $T$ is the number of time steps and $w^*$ is the optimal solution. Numerical experiments of projection onto convex hulls show faster convergence than similar algorithms. Finally, we apply our algorithm to online learning problems and prove the convergence of the average regret for (1) prediction with expert advice and (2) universal portfolios.

Keywords: Constrained Optimization, Convex Hull, Simplex, Orthogonal Matrix, Gradient Flow

## 1. Introduction

Optimization over the probability simplex (i.e., the unit simplex) occurs in many subject areas, including portfolio management (Helmbold et al., 1998; Kalai and Vempala, 2003; Das and Banerjee, 2011; Canyakmaz et al., 2023), machine learning (Collins et al., 2008; Rakotomamonjy et al., 2008; Duan, 2020; Floto et al., 2023), population dynamics (Schuster and Sigmund, 1983; Bomze, 2002), and multiple others including statistics and chemistry (Kuznetsova and Strekalovsky, 2001; de Klerk et al., 2008; Amsler et al., 2018; Candelieri et al., 2025). In this paper, we look at minimizing differentiable real-valued convex functions $f(w)$, with $w \in \mathbb{R}^n$, over the probability simplex,

$$\min_{w \in \Delta_n} f(w), \quad \text{where } \Delta_n = \Big\{w \in \mathbb{R}^n \;\Big|\; \sum_i w_i = 1 \text{ and } w_i \ge 0\Big\}. \tag{1}$$

While quadratic programs can produce an exact solution under certain restrictions on $f$, they tend to be computationally expensive when $n$ is large (Nocedal and Wright, 2006). Here, we provide an overview of iterative algorithms that solve this problem approximately, and introduce a new algorithm for differentiable real-valued convex functions $f$. We also show how our method can be applied to other constraints, particularly orthogonal matrices.

Previous works (e.g., projected and exponentiated gradient descent) are equivalent to discretizations of an underlying continuous-time gradient flow. However, in each case, the gradient flow enforces only the positivity or only the unit-sum condition. Alternatively, the Frank-Wolfe algorithm is not derived from a gradient flow and is inherently a discrete-time method. We propose a continuous-time gradient flow that satisfies both constraints.
Being a continuous-time gradient flow allows us to prove convergence in a continuous-time setting, and we also prove convergence of its forward Euler discretization. Moreover, ideas from both projected and exponentiated gradient descent appear in our algorithm.

The remainder of this paper is structured as follows. In Section 2, we provide an overview of methods used to solve (1). In Section 3, we present our proposed algorithm, show how it can be derived using simple calculus, and reveal connections to previous works. Theoretical analysis of the algorithm is given in Section 4, proving a convergence rate of $O(1/T)$. Section 5 shows how to extend the method of Section 3 to constraints over orthogonal matrices. Finally, Section 6 provides theoretical applications in game theory and finance, as well as numerical applications in geometry and in finding how to weight survey questions to follow a desired distribution.

## 2. Previous Works

Two common classes of methods currently exist to find the minimum of a smooth convex function within a non-empty, convex, and compact set $R \subset \mathbb{R}^n$: (i) projection-based methods, which adjust guesses to stay in the set; and (ii) Frank-Wolfe methods, which reduce the function's value while explicitly staying in the set.

Projection-based methods use a projection to satisfy the constraints of $R$. This paper uses two definitions of a projection. Classically in the optimization literature (Galántai, 2004), projection of a point $y \in \mathbb{R}^n$ refers to the optimization procedure

$$\mathrm{proj}_R(y) = \arg\min_{x \in R} \|x - y\|. \tag{2}$$

Since $R$ is convex and compact, a unique solution exists. Any linear idempotent map generalizes the classical notion of projection (Halmos, 1998), i.e., $g : D \to D$ for $D$ a non-empty set with $g(g(x)) = g(x)$ for all $x \in D$. Defining the set $g(D) = \{g(x) \mid x \in D\} \subseteq D$, $g$ is said to project $D$ onto $g(D)$. This generalization still encompasses the optimization procedure: if we take $g(y) = \mathrm{proj}_R(y)$, then $g(g(y)) = \mathrm{proj}_R(g(y)) = g(y)$.
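As a concrete illustration of the optimization-based projection (2) in the case $R = \Delta_n$, the sketch below implements a standard $O(n\log n)$ sort-based routine in the spirit of Wang and Carreira-Perpiñán (2013). It is illustrative only, not the implementation used in the paper.

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection of y onto the probability simplex.

    Sort-based: find the threshold tau so that max(y - tau, 0) sums to 1.
    """
    u = np.sort(y)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]   # last feasible index
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - tau, 0.0)
```

Note that `proj_simplex` is idempotent: applying it to a point already in the simplex returns the point unchanged, matching the generalized definition of a projection above.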
Projected gradient descent (PGD) is a simple method to solve differentiable real-valued convex problems over $R$. The iteration scheme follows

$$w^{t+1} = \mathrm{proj}_R\big(w^t - \alpha_t \nabla_w f(w^t)\big), \quad \text{where } \mathrm{proj}_R(y) = \arg\min_{x\in R}\|x - y\|, \tag{3}$$

with learning rate $\alpha_t > 0$. In general, $w^t - \alpha_t\nabla_w f(w^t)$ satisfies neither the positivity nor the unit-sum constraint. Projection of this vector onto the unit simplex can be performed in $O(mn)$ operations (Michelot, 1986; Chen and Ye, 2011; Wang and Carreira-Perpiñán, 2013; Kyrillidis et al., 2013), where $n$ is the dimension of $w$ and $1 \le m \le n$ is the number of iterations of the projection algorithm. While this added cost is often negligible, it can become significant for large $n$. More recent applications increasingly require large dimensions, e.g. with $n \ge 1000$ (Markowitz et al., 1994).

This formulation of PGD can be simplified for linear constraints $Aw = b$, where $A$ is a $k \times n$ matrix and $b \in \mathbb{R}^k$. Luenberger (1997) projects $\nabla_w f(w^t)$ into the nullspace of $A$ and descends along the projected direction. For the unit-sum constraint, this algorithm requires solving the constrained optimization problem

$$\arg\min_x \tfrac12\big\|\nabla_w f(w^t) - x\big\|^2, \quad \text{with } \sum_i x_i = 0.$$

This problem yields to the method of Lagrange multipliers, giving the solution

$$w^{t+1}_i = w^t_i - \alpha_t\Big(\nabla_{w_i} f(w^t) - \tfrac1n \sum_j \nabla_{w_j} f(w^t)\Big).$$

While this scheme satisfies the unit-sum constraint similarly to (3), it does not satisfy the positivity constraint. Enforcing positivity requires the same projection used in PGD, thus costing $O(mn)$ operations (Nocedal and Wright, 2006; Yousefzadeh, 2021).

Exponentiated gradient descent (EGD), first presented by Nemirovsky and Yudin (1983) and later by Kivinen and Warmuth (1997), is a specific case of mirror descent. It considers gradient flow in the mirror space $(\Delta_n)^*$, mapping between $\Delta_n$ and $(\Delta_n)^*$ via the functions $\nabla h : \Delta_n \to (\Delta_n)^*$ and $(\nabla h)^{-1} : (\Delta_n)^* \to \Delta_n$, where $h(w) = \sum_i \big(w_i\log(w_i) - w_i\big)$ (Vishnoi, 2021). This yields the gradient flow in the mirror space

$$\frac{d\theta}{dt} = -\alpha\,\nabla_x f\big((\nabla h)^{-1}(\theta)\big),$$

for continuous learning rate $\alpha > 0$.
Mapping back into $\Delta_n$ yields the gradient flow

$$\frac{d}{dt}\log(w_i) = -\alpha\,\nabla_{w_i} f(w). \tag{4}$$

While this preserves positivity, the unit-sum constraint is not preserved: the differential equation can be rewritten as $dw_i/dt = -\alpha\, w_i \nabla_{w_i} f(w)$, and $\sum_i dw_i/dt$ is not zero in general.

A forward Euler discretization of (4) yields $w^{t+1}_i = w^t_i\exp(-\alpha_t\nabla_{w_i} f(w^t))$, thus mapping $\mathbb{R}^n_{\ge0} \to \mathbb{R}^n_{\ge0}$, where $\mathbb{R}^n_{\ge0} = \{x\in\mathbb{R}^n\setminus\{0\} \mid x_i \ge 0\}$. To enforce the unit-sum constraint, one can use the projection defined in (2) with $R = \Delta_n$; however, as suggested by Kivinen and Warmuth (1997), a simple idempotent map projecting $\mathbb{R}^n_{\ge0}$ onto $\Delta_n$ can be used instead, given by

$$w^{t+1}_i = \frac{w^t_i\exp\big(-\alpha_t\nabla_{w_i} f(w^t)\big)}{\sum_j w^t_j\exp\big(-\alpha_t\nabla_{w_j} f(w^t)\big)},$$

with discrete learning rate $\alpha_t > 0$. The projection, given by the normalization factor, takes $O(n)$ operations. Moreover, discretization is necessary for the algorithm to satisfy the constraint.

Frank-Wolfe-based methods exploit the convexity of the domain $R$, eliminating the need for a projection. The Frank-Wolfe method is a classic scheme (Frank and Wolfe, 1956) that is experiencing a recent surge in popularity (Lacoste-Julien et al., 2013; Bellet et al., 2015; Mu et al., 2016; Tajima et al., 2021). The method skips projection by assuming $R$ is convex. That is,

$$w^{t+1} = (1-\gamma_t)w^t + \gamma_t s^t, \quad \text{where } s^t = \arg\min_{s\in R}\, s\cdot\nabla_w f(w^t) \text{ and } 0 \le \gamma_t \le 1. \tag{5}$$

Since $R$ is convex, $w^{t+1}\in R$ automatically. For the simplex, the subproblem in (5) is known in closed form. Frank-Wolfe-based methods tend to be fast for sparse solutions but display oscillatory behavior near the solution, resulting in slow convergence (Lacoste-Julien and Jaggi, 2015; Bomze and Zeffiro, 2021). The Pairwise Frank-Wolfe (PFW) method improves upon the original by introducing an away-step that prevents oscillations, allowing faster convergence (Guélat and Marcotte, 1986; Jaggi, 2013; Lacoste-Julien and Jaggi, 2015).
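The two updates just described can be sketched as follows: `egd_step` is the normalized exponentiated-gradient update, and `frank_wolfe_step` uses the closed-form solution of the simplex subproblem in (5), which is the vertex minimizing the linearized objective. This is a minimal illustration, not the paper's code.

```python
import numpy as np

def egd_step(w, grad, lr):
    """Normalized EGD update; the O(n) normalization projects onto the simplex."""
    v = w * np.exp(-lr * grad)
    return v / v.sum()

def frank_wolfe_step(w, grad, gamma):
    """Frank-Wolfe update on the simplex: the linear subproblem
    argmin_{s in simplex} s . grad is solved by the vertex e_j, j = argmin_i grad_i."""
    s = np.zeros_like(w)
    s[np.argmin(grad)] = 1.0
    return (1.0 - gamma) * w + gamma * s
```

Both updates keep the iterate exactly on the simplex: EGD by renormalization, Frank-Wolfe by taking a convex combination of feasible points.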
While PFW, EGD, and PGD have guaranteed convergence under constant and decreasing step sizes (Vishnoi, 2021; Jaggi, 2013), a line search is often used in practice to improve run time (Nocedal and Wright, 2006). Taking $f$ in (1) to be quadratic (Bomze, 1998, 2002; Selvi et al., 2023), a line search has analytical solutions for the Frank-Wolfe and PFW methods but not for EGD and PGD. EGD and PGD require approximate methods (e.g., the backtracking method (Nocedal and Wright, 2006)), adding extra run time per iteration.

## 3. The Main Algorithm

For convex problems over a probability simplex, we propose what we name the Cauchy-Simplex (CS). For an initial vector $w^0 \in \mathrm{relint}(\Delta_n) = \{w\in\mathbb{R}^n \mid \sum_i w_i = 1,\ w_i > 0\}$, we propose the iteration scheme

$$w^{t+1}_i = w^t_i - \eta_t d^t_i, \tag{6}$$

with

$$d^t_i = w^t_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big), \qquad 0 < \eta_t < \eta_{t,\max} = \frac{1}{\max_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big)}.$$

Remark 1. Rewriting

$$\max_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big) = (e_j - w^t)\cdot\nabla f(w^t), \tag{7}$$

where $e_j\in\mathbb{R}^n$ is the standard unit vector with $j = \arg\max_i \nabla_{w_i} f(w^t)$. For finite iterations $T > 0$, $w^T\in\mathrm{relint}(\Delta_n)$ and thus $e_j - w^T \neq 0$. Therefore (7) is zero when $\nabla f(w^T) = 0$, i.e., the optimal solution has been reached.

When $\max_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big) = 0$, then $\eta_{t,\max} = \infty$ and $\eta_t$ is set to any finite positive constant. As seen in Remark 1, this condition being met implies $\nabla f(w^t) = 0$, and $w^{t+1} = w^t$ for finite $\eta_t$.

The upper bound $\eta_{t,\max}$ on the learning rate ensures that $w^{t+1}_i$ is positive for all $i$. Summing over the indices of $d^t$,

$$\sum_i d^t_i = \sum_i w^t_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big) = \big(w^t\cdot\nabla_w f(w^t)\big)\Big(1 - \sum_i w^t_i\Big).$$

Defining the linear operator $L(v) = \sum_i v_i$: if $L(w^t) = 1$, then $L(d^t) = 0$. This makes $w^{t+1}$ also satisfy the unit-sum constraint, as $L(w^{t+1}) = L(w^t) = 1$. However, it is important to note that once an index is set to zero, it will stay zero for all future iterations. Thus, having the initial condition in the relative interior of $\Delta_n$ ensures flow for each of the indices.
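One step of the linear scheme (6) can be sketched as below. Here $\eta_t$ is taken as a fixed fraction of $\eta_{t,\max}$; that fraction is an illustrative choice, not prescribed by the scheme.

```python
import numpy as np

def cauchy_simplex_step(w, grad, frac=0.5):
    """One linear Cauchy-Simplex update (6): w <- w - eta * d, with
    d_i = w_i * (grad_i - w . grad) and eta = frac * eta_max, frac < 1,
    so every weight stays strictly positive."""
    centred = grad - w @ grad          # grad_i - w . grad
    d = w * centred
    m = centred.max()
    if m <= 0:                         # gradient constant on the support: stationary
        return w
    eta = frac / m                     # eta_max = 1 / max_i centred_i
    return w - eta * d
```

Since the entries of `d` sum to zero, the unit sum is preserved exactly, and `frac < 1` guarantees $w^{t+1}_i > 0$ for all $i$.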
Another important implication is that taking $\eta_t = \eta_{t,\max}$ may cause an index to be incorrectly set to zero. As such, step sizes should satisfy $\eta_t < \eta_{t,\max}$ to prevent this.

### 3.1 Motivating Derivation

Our derivation begins by modeling $w \in \Delta_n$ through a latent variable $\psi\in\mathbb{R}^n$,

$$w_i = w_i(\psi) = \frac{\psi_i^2}{\sum_j \psi_j^2}, \tag{8}$$

which automatically satisfies positivity and unit probability. Thus, the optimization problem over the probability simplex can be considered as an unconstrained optimization problem

$$\min_{\psi\in\mathbb{R}^n} F(\psi), \quad \text{where } F(\psi) = f(w(\psi)).$$

Now consider the continuous-time gradient descent on $\psi$, $\dot\psi = -\alpha\nabla_\psi F(\psi)$, for continuous learning rate $\alpha > 0$. This induces a gradient flow in $w$. To find it, first note that

$$\frac{\partial w_i}{\partial\psi_j} = \frac{2\psi_i\delta_{ij}\|\psi\|^2 - 2\psi_i^2\psi_j}{\|\psi\|^4} = \frac{2}{\|\psi\|^2}\big(\psi_i\delta_{ij} - w_i\psi_j\big),$$

where $\delta_{ij} = 1$ if $i = j$ and zero otherwise. Using the notation $\dot\psi = d\psi/dt$, the chain rule gives the gradient flow

$$\frac{dw_i}{dt} = \sum_j \frac{\partial w_i}{\partial\psi_j}\dot\psi_j = \frac{2}{\|\psi\|^2}\Big(\psi_i\dot\psi_i - w_i\sum_j\psi_j\dot\psi_j\Big).$$

By equation (8),

$$\dot\psi_j = -\alpha\sum_i\frac{\partial w_i}{\partial\psi_j}\frac{\partial f}{\partial w_i} = -\frac{2\alpha}{\|\psi\|^2}\,\psi_j\big(\nabla_{w_j} f - w\cdot\nabla_w f\big), \qquad \sum_j\psi_j\dot\psi_j = -2\alpha\big(w\cdot\nabla_w f - w\cdot\nabla_w f\big) = 0.$$

Thus, the gradient flow in $w$ can be simplified to

$$\frac{dw_i}{dt} = \frac{2}{\|\psi\|^2}\,\psi_i\dot\psi_i = -\beta\,w_i\big(\nabla_{w_i} f(w) - w\cdot\nabla_w f(w)\big), \quad\text{where } \beta = \frac{4\alpha}{\|\psi\|^2}, \tag{9}$$

giving the iterative scheme

$$w^{t+1}_i = w^t_i - \eta_t d^t_i, \quad\text{where } d^t_i = w^t_i\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big). \tag{10}$$

While $w$ is initially modeled through a latent variable, the resulting gradient flow is only in terms of the constrained variable $w$. This allows the iteration scheme to reference only vectors in $\Delta_n$; we no longer need to consider the latent domain $\mathbb{R}^n$. This derivation makes a connection to recent work using a latent-variable approach (Lezcano-Casado and Martínez-Rubio, 2019; Bucci et al., 2022; Li et al., 2023).

### 3.2 On the Learning Rate

The proof of convergence in Section 4 assumes that $\eta_t < \eta_{t,\max}$; thus, all weights stay strictly positive but may be arbitrarily close to zero. However, accumulated rounding errors may result in some weights becoming zero or negative. As such, in our numerical implementation, a weight is set to zero once it falls below some threshold (we choose $10^{-10}$).
Once an index is set to zero, it will remain zero and can be ignored. This gives an altered maximum learning rate

$$\tilde\eta_{t,\max} = \frac{1}{\max_{i\in S}\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big)}, \quad\text{where } S = \{i \mid w_i > 0\}.$$

It follows that $\tilde\eta_{t,\max} \ge \eta_{t,\max}$, allowing larger step sizes to be taken.

The requirement of a maximum learning rate is not unique to the CS; it is shared with PGD and EGD under fixed learning rates (Lu et al., 2018). However, their maximum learning rate is based on the Lipschitz constant of $\nabla f$, which is typically unknown. While numerical methods to approximate the Lipschitz constant exist, they can cause a failure of convergence (Hansen et al., 1992). More generally, PGD and EGD converge with a line search bounded by an arbitrarily large positive constant (Xiu et al., 2007; Li and Cevher, 2018), which becomes an extra parameter of the method. In contrast, each iteration of the CS has an easily computable maximum possible learning rate, $\eta_{t,\max}$, to be used in a line search.

### 3.3 Connections to Previous Methods

There are two ways of writing the gradient flow for the Cauchy-Simplex. In terms of the flow in $w$:

$$\frac{dw}{dt} = -W\,\Pi_w\nabla_w f(w), \quad\text{where } \Pi_w = I - \mathbf{1}w^\top, \tag{11}$$

$W$ is the diagonal matrix filled with $w$, $I$ is the identity matrix, and $\mathbf{1}$ is the vector of ones.

In terms of the flow in $\log(w)$:

$$\frac{d}{dt}\log(w) = -\Pi_w\nabla_w f(w), \tag{12}$$

giving an alternative exponential iteration scheme

$$w^{t+1}_i = w^t_i\exp\Big(-\eta_t\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i\Big). \tag{13}$$

Claim 2. $\Pi_w$ projects $\mathbb{R}^n$ onto the null space of the operator $L_w(v) = \sum_i w_i v_i$, provided that $\sum_i w_i = 1$.

Proof. To see that $\Pi_w$ is an idempotent map,

$$\Pi_w^2 = I - 2\,\mathbf{1}w^\top + \mathbf{1}(w^\top\mathbf{1})w^\top = I - \mathbf{1}w^\top = \Pi_w,$$

since $w^\top\mathbf{1} = 1$; it is therefore a projection. To see that $\Pi_w(\mathbb{R}^n)$ is the null space of $L_w$: for any $u\in\mathbb{R}^n$, $L_w(\Pi_w u) = w^\top u - (w^\top\mathbf{1})(w^\top u) = 0$. For the converse, let $v$ be in the null space of $L_w$. By definition of the null space, $L_w(v) = \sum_i w_i v_i = 0$, hence $\Pi_w v = (I - \mathbf{1}w^\top)v = v$. Thus $\Pi_w$ is surjective, mapping $\mathbb{R}^n \to \mathrm{Nul}(L_w)$. ∎

Remark 3. While $\Pi_w$ is a projection, applying it takes only $O(n)$ operations.
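Claim 2 and Remark 3 can be checked numerically. The sketch below forms $\Pi_w$ explicitly, verifies idempotence and the null-space property, and shows the matrix-free $O(n)$ application.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.random(5)
w /= w.sum()                                # a point in relint of the simplex
Pi = np.eye(5) - np.outer(np.ones(5), w)    # Pi_w = I - 1 w^T

def apply_Pi(w, v):
    """Matrix-free O(n) application: Pi_w v = v - (w . v) 1."""
    return v - (w @ v) * np.ones_like(v)

v = rng.standard_normal(5)
```

The matrix form is shown only for verification; in practice only the $O(n)$ form is needed, which is what makes each CS iteration cheap.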
The formulations (11) and (12) draw a direct parallel to both PGD and EGD, as summarized in Table 1. PGD can be written in continuous form as

$$\frac{dw}{dt} = -\Pi_{1/n}\nabla_w f(w), \quad\text{where } \Pi_{1/n} = I - \tfrac1n\mathbf{1}\mathbf{1}^\top.$$

The projector helps PGD satisfy the unit-sum constraint, but not the positivity constraint for general $w$. In the CS flow (11), by contrast, the multiplication with the matrix $W$ slows the gradient flow for a given index $w_i$ as it nears the boundary, with zero flow once it hits the boundary, thus preserving positivity.

EGD, similarly, can be written in continuous form as

$$\frac{d}{dt}\log(w) = -\nabla_w f(w).$$

Performing the descent through $\log(w)$ helps EGD preserve positivity. Introducing the projector helps the resulting exponential iteration scheme (13) agree with its linear iteration scheme (10) up to $O(\eta_t^2)$ terms, thus helping preserve the unit-sum constraint.

Table 1: Comparison of gradient flows for different optimization methods and constraints.

| | Unconstrained sum | $\sum_i w_i = 1$ |
|---|---|---|
| $w_i$ unconstrained | GD: $\frac{dw}{dt} = -\nabla_w f(w)$ | PGD (Luenberger): $\frac{dw}{dt} = -\Pi_{1/n}\nabla_w f(w)$ |
| $w_i \ge 0$ | EGD: $\frac{d}{dt}\log(w) = -\nabla_w f(w)$ | CS: $\frac{d}{dt}\log(w) = -\Pi_w\nabla_w f(w)$ |

Claim 4. The Cauchy-Simplex exponentiated iteration scheme (13) agrees up to $O(\eta_t^2)$ with its linear iteration scheme (6). This can be seen by Taylor expanding (13) w.r.t. $\eta_t$ around zero.

Remark 5. Combining PGD and EGD as $\frac{d}{dt}\log(w) = -\Pi_{1/n}\nabla_w f(w)$ preserves positivity but not the unit-sum constraint: the differential equation can be rewritten as $dw_i/dt = -w_i\big(\Pi_{1/n}\nabla_w f(w)\big)_i$, and in general $\sum_i dw_i/dt$ is non-zero, so the unit sum is not an invariant of the gradient flow.

Unlike both PGD and EGD, the continuous-time dynamics of the CS are enough to enforce the probability-simplex constraint. This allows us to use the gradient flow of the CS, i.e. (9), to prove convergence when optimizing convex functions (Section 4). This contrasts with PGD and EGD, in which the continuous dynamics satisfy only one constraint; the discretization of those schemes is necessary to allow an additional projection step, thus satisfying both constraints.
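Claim 4's $O(\eta_t^2)$ agreement between the exponential scheme (13) and the linear scheme (6) can be checked numerically: halving $\eta$ should shrink the gap between the two iterates by roughly a factor of four.

```python
import numpy as np

def cs_linear(w, grad, eta):
    """Linear Cauchy-Simplex step (6)."""
    return w - eta * w * (grad - w @ grad)

def cs_exp(w, grad, eta):
    """Exponential Cauchy-Simplex step (13)."""
    return w * np.exp(-eta * (grad - w @ grad))

w = np.array([0.5, 0.3, 0.2])
grad = np.array([1.0, 2.0, 3.0])
d1 = np.abs(cs_exp(w, grad, 1e-2) - cs_linear(w, grad, 1e-2)).max()
d2 = np.abs(cs_exp(w, grad, 5e-3) - cs_linear(w, grad, 5e-3)).max()
```

Per coordinate the gap is $w_i\big(e^{-x} - (1 - x)\big) = w_i x^2/2 + O(x^3)$ with $x = \eta_t(\Pi_{w^t}\nabla f)_i$, which is the quadratic scaling the check exercises.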
### 3.4 The Algorithm

The pseudo-code of our method is given in Algorithm 1.

Algorithm 1: Our proposed algorithm

    Require: ε = 1e-10 (tolerance for the zero set)
    w^0 ← (1/n, ..., 1/n)
    while termination conditions not met do
        S ← {i = 1, ..., n | w^t_i > ε}
        Q ← {i = 1, ..., n | w^t_i ≤ ε}
        choose η_t > 0
        η_max ← 1 / max_{i∈S} (∇_{w_i} f(w^t) − w^t · ∇_w f(w^t))
        η_t ← min(η_t, η_max)
        ŵ^{t+1}_i ← w^t_i − η_t w^t_i (∇_{w_i} f(w^t) − w^t · ∇_w f(w^t)),  i ∈ S
        ŵ^{t+1}_j ← 0,  j ∈ Q
        w^{t+1}_i ← ŵ^{t+1}_i / Σ_j ŵ^{t+1}_j   (normalizing for numerical stability)
    end while

## 4. Convergence Proof

We prove the convergence of the Cauchy-Simplex via its gradient flow. We also state the theorems for the convergence of the discrete linear scheme but leave the proofs to the appendix.

Theorem 6. Let $f$ be convex with Lipschitz continuous gradient, real-valued and continuously differentiable w.r.t. $w^t$, and $w^t$ continuously differentiable w.r.t. $t$. For the Cauchy-Simplex gradient flow (9) with initial condition $w^0\in\mathrm{relint}(\Delta_n)$, $f(w^t)$ is a strictly decreasing function and stationary at the optimal solution, for all $t\in[0,T]$ and finite $T > 0$.

Proof. Notice that $dw^t/dt$ can be rewritten as

$$\frac{d}{dt}\log(w^t_i) = -\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i.$$

Since $\nabla f$ is Lipschitz and $\Delta_n$ is a compact subset of $\mathbb{R}^n$, $\nabla f(w^t)$ is bounded for all $w^t\in\Delta_n$. Thus, if $w^0\in\mathrm{relint}(\Delta_n)$, strict positivity of $w^t_i$ is preserved for $t\in[0,T]$. Furthermore, since $\sum_i dw_i/dt = -\sum_i\big(W\Pi_w\nabla_w f(w)\big)_i = 0$, $\sum_i w_i$ is an invariant quantity of the gradient flow. Therefore, if $w^0\in\mathrm{relint}(\Delta_n)$, then $w^t\in\mathrm{relint}(\Delta_n)$ for $t\in[0,T]$ and finite $T>0$.

By direct computation (absorbing the positive factor $\beta$ into the time scale),

$$\frac{df}{dt} = \sum_i \nabla_{w_i} f(w^t)\frac{dw^t_i}{dt} = -\sum_i w^t_i\,\nabla_{w_i} f(w^t)\big(\nabla_{w_i} f(w^t) - w^t\cdot\nabla_w f(w^t)\big) = -\Big[\sum_i w^t_i\big(\nabla_{w_i} f(w^t)\big)^2 - \big(w^t\cdot\nabla_w f(w^t)\big)^2\Big] =: -\mathrm{Var}\big[\nabla_w f(w^t)\mid w^t\big],$$

where $\mathrm{Var}[v \mid w]$ is the variance of a vector $v\in\mathbb{R}^n$ with respect to the discrete measure $w\in\Delta_n$. The variance can be rewritten as

$$\mathrm{Var}[v\mid w] = \sum_i w_i(v_i - v\cdot w)^2 = \sum_i w_i(\Pi_w v)_i^2,$$

for all $v\in\mathbb{R}^n$, and is thus non-negative. Since $w^t\in\mathrm{relint}(\Delta_n)$, it follows that $df/dt \le 0$ and that $f$ is decreasing in time. As $w^t\in\mathrm{relint}(\Delta_n)$, $df/dt = 0$ only when $\Pi_{w^t}\nabla_w f(w^t) = 0$.
However, this occurs only if $w^t$ is an optimal solution to (1), which can be verified by checking the KKT conditions with the constraints of the simplex, i.e., $\sum_i w_i = 1$ and $w_i \ge 0$, shown in Appendix C. Thus $f$ is strictly decreasing in time, and stationary only at the optimal solution. ∎

Theorem 7. Let $f$ be real-valued, convex, and continuously differentiable w.r.t. $w^t$, $w^t$ continuously differentiable w.r.t. $t$, and $w^*\in\Delta_n$ a solution to (1). Under the Cauchy-Simplex gradient flow (9), for all $t\in[0,T]$ and finite $T>0$, the relative entropy

$$D(w^*|w^t) = \sum_{w^*_i\neq0} w^*_i\log\frac{w^*_i}{w^t_i}$$

is a decreasing function in time for $w^t\in\mathrm{relint}(\Delta_n)$.

Proof. We rewrite the relative entropy as

$$D(w^*|w^t) = \sum_{w^*_i\neq0} w^*_i\log(w^*_i) - \sum_i w^*_i\log(w^t_i).$$

By direct computation,

$$\frac{d}{dt}D(w^*|w^t) = -\sum_i w^*_i\,\frac{d}{dt}\log(w^t_i) = \sum_i w^*_i\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i = \nabla_w f(w^t)\cdot(w^* - w^t).$$

Since $f$ is convex, and $w^*$ is a minimum of $f$ in the simplex,

$$\frac{d}{dt}D(w^*|w^t) \le f(w^*) - f(w^t) \le 0. \tag{14}$$

∎

Theorem 8. Let $f$ be real-valued, convex, and continuously differentiable w.r.t. $w^t$, $w^t$ continuously differentiable w.r.t. $t$, and $w^*\in\Delta_n$ a solution to (1). For $w^0 = (1/n,\dots,1/n)$ and finite $T>0$, the Cauchy-Simplex gradient flow (9) gives the bound

$$f(w^T) - f(w^*) \le \frac{\log(n)}{T}.$$

Proof. Integrating (14) w.r.t. $t$ gives

$$\int_0^T\big(f(w^t) - f(w^*)\big)\,dt \le -\int_0^T\frac{d}{dt}D(w^*|w^t)\,dt = D(w^*|w^0) - D(w^*|w^T).$$

By Jensen's inequality, the relative entropy can be bounded via

$$\sum_{u_i\neq0} u_i\log(v_i/u_i) \le \log\Big(\sum_{u_i\neq0} v_i\Big) \le 0,$$

for $u\in\Delta_n$ and $v\in\mathrm{relint}(\Delta_n)$. This can also be shown using the Bregman divergence (Chou et al., 2023). Thus, the relative entropy is non-negative (Jaynes, 2003; Gibbs, 2010), yielding

$$\int_0^T\big(f(w^t) - f(w^*)\big)\,dt \le D(w^*|w^0).$$

Completing the integral on the left side of the inequality and dividing by $T$ gives

$$\frac1T\int_0^T f(w^t)\,dt - f(w^*) \le \frac1T D(w^*|w^0).$$

Using Theorem 6, $f(w^t)$ is a decreasing function. Thus

$$f(w^T) = \frac1T\int_0^T f(w^T)\,dt \le \frac1T\int_0^T f(w^t)\,dt.$$

Let $w^0 = (1/n,\dots,1/n)$; then the relative entropy can be bounded by¹

$$D(u|w^0) = \sum_{u_i\neq0} u_i\log(u_i) + \log(n) \le \log(n),$$

for all $u\in\Delta_n$. This gives the required bound $f(w^T) - f(w^*) \le D(w^*|w^0)/T \le \log(n)/T$. ∎

Theorem 9 (Convergence of Linear Scheme). Let $f$ be a differentiable convex function that attains a minimum at $w^*$, with $\nabla f$ $L$-Lipschitz continuous. Let $w^0 = (1/n,\dots,1/n)$ and $\{\eta_t\}_{t=0}^{T-1}$ be a sequence that satisfies $0 < \eta_t < \min\{1/L, \eta_{t,\max}\}$ and²

$$\eta_t L \le \frac{2\,C_{\gamma_t}\; w^t\cdot\big(\Pi_{w^t}\nabla_w f(w^t)\big)^2}{\max_i\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i^2}, \tag{15}$$

with $\gamma_t = \eta_t/\eta_{t,\max}$, $C_{\gamma_t} = \gamma_t^{-2}\log\big(e^{-\gamma_t}/(1-\gamma_t)\big)$, and $\eta_{t,\max}$ defined in (6). Then the linear Cauchy-Simplex scheme (6) produces iterates $\{w^t\}_{t=0}^{T-1}$ such that, for finite integer $T>0$,

$$f(w^T) - f(w^*) \le \frac{\log(n)}{\sum_{t=0}^{T-1}\eta_t}.$$

The proof can be found in Appendix B.4.

In practice, finding $\eta_t$, and by extension $\gamma_t$, to satisfy the assumptions of Theorem 9 can be hard. Instead, we show asymptotic convergence of the linear Cauchy-Simplex scheme under a line search.

Lemma 10 (Asymptotic Convergence of Linear Scheme). Let $f$ be a differentiable convex function with $\nabla f$ Lipschitz continuous. The linear Cauchy-Simplex scheme (6) converges asymptotically when $\eta_t$ is chosen through a line search, i.e.,

$$f(w^{t+1}) = \min_{\eta_t\in[0,\,\eta_{t,\max}-\varepsilon]} f(w^t - \eta_t d^t),$$

for some $0 < \varepsilon \ll 1$ and $d^t_i = w^t_i\big(\Pi_{w^t}\nabla_w f(w^t)\big)_i$. That is, $f(w^t)\to f(w^*)$ as $t\to\infty$. The proof can be found in Appendix B.3.

¹ Intuitively, this can be seen by computing the relative entropy between a state with maximal entropy (the uniform distribution) and one with minimal entropy (the Dirac mass).
² Note that $C_{\gamma_t}$ is an increasing function of $\gamma_t$, with $C_{\gamma_t}\ge0$ for $\gamma_t\in[0,1]$. Thus $\gamma_t = \eta_t/\eta_{t,\max}$ can also be chosen to satisfy (15) such that $\{\eta_t\}_t$ is a sequence with $0 < \eta_t < \min\{1/L, \eta_{t,\max}\}$.

## 5. Extension: Optimization over Orthogonal Matrices

Another often-explored constraint is the orthogonal matrix constraint

$$\min_{Q\in S_n} f(Q), \quad\text{where } S_n = \{Q\in\mathbb{R}^{n\times n}\mid QQ^\top = I\},$$

with $S_n$ also known as the Stiefel manifold.
To name a few, this problem occurs in blind source separation (Särelä and Valpola, 2005; Omlor and Giese, 2011), principal component analysis (Bertsimas and Kitane, 2023), and neural networks (Fiori, 2005; Achour et al., 2022; Woodworth et al., 2020; Chou et al., 2024). Various iterative methods have been suggested to solve this problem, split between Riemannian optimization, which uses expensive iterations that remain inside the Stiefel manifold (Wen and Yin, 2012; Jiang and Dai, 2014), and landing methods (Ablin and Peyré, 2022), which use cheap iterations that are not in the Stiefel manifold but over time become arbitrarily close to it.

Using a similar method as in Section 3, we can also derive an explicit scheme that preserves orthogonality up to arbitrary accuracy. Let $X\in\mathbb{R}^{n\times n}$ be an anti-symmetric matrix ($X = -X^\top$). Then the Cayley transform

$$Q = (I - X)(I + X)^{-1} = 2(I + X)^{-1} - I$$

parameterizes orthogonal matrices with $\det(Q) = 1$, where $I$ is the identity matrix. Performing similar calculations as above yields the gradient flow

$$\frac{dQ}{dt} = -\eta\,\Omega\big(\Lambda\Omega^\top - \Omega\Lambda^\top\big)\Omega, \quad\text{where } \Omega = Q + I \text{ and } \Lambda = \Omega^\top\nabla_Q f$$

(the full derivation can be found in Appendix A). However, an Euler discretization of this differential equation does not produce a scheme that preserves orthogonality. Instead, we consider a corrected iteration scheme

$$Q_{t+1} = (I + C)(Q_t - \eta\,dQ), \quad\text{where } dQ = \Omega\big(\Lambda\Omega^\top - \Omega\Lambda^\top\big)\Omega,$$

for some matrix $C$. An iteration scheme that preserves orthogonality up to arbitrary accuracy can then be made by matching the coefficients of powers of $\eta$ in the expansion of $Q_{t+1}Q_{t+1}^\top$ and solving for $C$. In particular, an $O(\eta^2)$-correct scheme can be written as

$$Q_{t+1} = \Big(I - \tfrac{\eta^2}{2}\,dQ\,dQ^\top\Big)(Q_t - \eta\,dQ),$$

and an $O(\eta^4)$-correct scheme is

$$Q_{t+1} = \Big(I - \tfrac{\eta^2}{2}\,dQ\,dQ^\top + \tfrac{3\eta^4}{8}\,dQ\,dQ^\top dQ\,dQ^\top\Big)(Q_t - \eta\,dQ).$$

As such, parameterizing the manifold by a latent Euclidean space yields a gradient flow that remains explicitly inside the Stiefel manifold. Moreover, this gradient flow is explicitly in terms of the orthogonal matrix.
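The effect of the correction can be illustrated numerically. The sketch below builds a random orthogonal $Q$ and a placeholder gradient $G$ (a stand-in, not from a real objective), then compares the plain Euler step with the $O(\eta^2)$-corrected step; $dQ$ is normalized only to make the comparison scale-free.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a random orthogonal matrix
G = rng.standard_normal((n, n))                   # placeholder for grad_Q f

Omega = Q + np.eye(n)
Lam = Omega.T @ G
dQ = Omega @ (Lam @ Omega.T - Omega @ Lam.T) @ Omega
dQ /= np.linalg.norm(dQ)                          # normalize for a clean comparison

eta = 1e-3
A = dQ @ dQ.T
Q_euler = Q - eta * dQ                                 # plain Euler step
Q_corr = (np.eye(n) - 0.5 * eta**2 * A) @ Q_euler      # O(eta^2)-corrected step

def orth_err(M):
    """Max-norm deviation of M from orthogonality."""
    return np.abs(M @ M.T - np.eye(n)).max()
```

Since $dQ\,Q^\top + Q\,dQ^\top = 0$ along the flow, the Euler step's orthogonality error is exactly $\eta^2\,dQ\,dQ^\top$, which is what the correction factor cancels.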
This contrasts with Riemannian optimization methods, which use a Riemannian gradient flow on the manifold. Moreover, since the tangent space at each point of the Stiefel manifold does not lie in the manifold itself, once such a scheme is discretized, an exponential map (or retraction) must be used to remain on the manifold. These often involve costly matrix operations like exponentials, square roots, and inversions (Jiang and Dai, 2014; Chen et al., 2021; Lezcano-Casado and Martínez-Rubio, 2019). In contrast, working in Euclidean space allows the discretized scheme to remain on the manifold using cheaper matrix additions and multiplications.

## 6. Applications

As noted in Section 3.2, approximations of the Lipschitz constant $L$ can cause a failure of convergence. As such, our experiments use the line search outlined in Lemma 10.

### 6.1 Projection onto the Convex Hull

Projection onto a convex hull arises in many areas, such as machine learning (Mizutani, 2014; Nandan et al., 2014; Grünewälder, 2018), collision detection (Wang et al., 2020), and imaging (Jung et al., 2009; Jenatton et al., 2011). It involves finding the point in the convex hull of a set of points $\{x_i\}_{1\le i\le n}$, with $x_i\in\mathbb{R}^d$, that is closest to an arbitrary point $y\in\mathbb{R}^d$, i.e.,

$$\min_w \big\|X^\top w - y\big\|^2 \quad\text{where } \sum_i w_i = 1 \text{ and } w_i \ge 0,$$

and $X = [x_1,\dots,x_n]^\top$ is an $n\times d$ matrix. This is also known as simplex-constrained regression.

Experimental Details: We look at a convex hull sampled from the unit hypercube $[0,1]^d$ for $d\in\{10, 15, \dots, 50\}$. For each hypercube, we sample 50 points uniformly on each of its surfaces, giving a convex hull $X$ with $n = 100d$ data points. Once $X$ is sampled, 50 $y$'s are created outside the hypercube, perpendicular to a surface and a unit length away from it. This is done by considering the 50 points in $X$ lying on a randomly selected surface of the hypercube. A point $y_{\mathrm{true}}$ is created as a random convex combination of these points.
The point $y$ can then be created perpendicular to this surface and a unit length away from $y_{\mathrm{true}}$, and thus also from the convex hull of $X$. Each $y$ is then projected onto $X$ using CS, EGD, and PFW. These algorithms are run until a solution $\hat y$ is found such that $\|\hat y - y_{\mathrm{true}}\| \le 10^{-5}$, or 10,000 iterations have been made. We do not implement PGD and Frank-Wolfe due to their inefficiency in practice.

Implementation Details: The learning rate for EGD, PFW, and CS is found through a line search. In the case of the PFW and CS algorithms, an explicit solution can be found and used, while EGD implements a backtracking line search with Armijo conditions (Bonnans et al., 2006) to find an approximate solution. Experiments were written in Python and run on Google Colab. The code can be found on GitHub³. The random data is seeded for reproducibility.

Results: The results can be seen in Fig. 1. For $d = 10$, PFW outperforms both CS and EGD in terms of the number of iterations required and the time taken. But for $d > 10$, on average, CS converges with the fewest iterations and the least time.

³ https://github.com/infamoussoap/ConvexHull

Figure 1: Number of steps and time required for PFW, EGD, and CS to project 50 randomly sampled points onto the $d$-hypercube; (a) average steps taken in log-scale, (b) average time taken in log-scale. The bars indicate the minimum and maximum values.

### 6.2 Optimal Question Weighting

It is often desirable that the distribution of exam marks matches a target distribution, but this rarely happens. Modern standardized tests (e.g., IQ exams) solve this problem by transforming the distribution of the raw score of a given age group so it fits a normal distribution (Bartholomew, 2004; Gottfredson, 2009; Mackintosh, 2011). While IQ exams have many criticisms (Richardson, 2002; Shuttleworth-Edwards et al., 2004; Shuttleworth-Edwards, 2016), we are interested in the raw score.
As noted by Gottfredson (2009), the raw score has no intrinsic meaning, as it can be boosted by adding easier questions to the test. We also argue that it is hard to predict the difficulty of a question relative to an age group and, thus, even harder to give it the correct weight, making the raw score a poor reflection of a person's performance. Here we propose a framework to find an optimal weighting of questions such that the weighted scores fit a target distribution. A demonstration can be seen in Fig. 2.

Consider $d$ students taking an exam with $n$ true-or-false questions. For simplicity, assume that person $j$ getting question $i$ correct can be modeled as a random variable $X_{i,j}$ for $i = 1,\dots,n$ and $j = 1,\dots,d$. Consider the discrete distribution of $\bar X_j = \sum_i w_i X_{i,j}$ for some $w\in\Delta_n$, the weighted mark of person $j$. This distribution can be approximated as a continuous distribution,

$$\rho_\varepsilon(z) = \frac1d\sum_{j=1}^d \mu_\varepsilon\big(z - \bar X_j\big), \quad\text{with } \mu_\varepsilon(x) = \frac1\varepsilon\,\mu\Big(\frac{x}{\varepsilon}\Big),$$

where $\varepsilon > 0$ and $\mu$ is a continuous probability distribution, i.e., $\mu \ge 0$ and $\int\mu(x)\,dx = 1$. This is also known as kernel density estimation (KDE).

We want to minimize the distance between $\rho_\varepsilon$ and some target distribution $f$. A natural choice is the relative entropy,

$$\min_{w\in\Delta_n} D(\rho_\varepsilon|f) = \min_{w\in\Delta_n}\int \rho_\varepsilon(x)\log\frac{\rho_\varepsilon(x)}{f(x)}\,dx,$$

of which we take its Riemann approximation,

$$\min_{w\in\Delta_n}\hat D(\rho_\varepsilon|f) = \min_{w\in\Delta_n}\sum_{k=1}^{M}\rho_\varepsilon(x_k)\log\frac{\rho_\varepsilon(x_k)}{f(x_k)}\,(x_k - x_{k-1}), \tag{16}$$

where $\{x_k\}_{0\le k\le M}$ is a partition of a finite interval $[a, b]$. We remark that this problem is not convex, as $\mu$ cannot both be convex w.r.t. $w$ and be a probability distribution.

This is similar to robust location parameter estimation (Huber, 1964; Maronna et al., 2006), which considers data sampled from a contaminated distribution (i.e., an uncertain mixture of two known distributions), for instance, when predicting apparatus failures from two populations. Specifically, for some contamination level $0 < \alpha \le 1$, data was sampled at a rate of $1-\alpha$ from a distribution $G$ (i.e.
easy questions) and at a rate of $\alpha$ from an alternate distribution $H$ (i.e., hard questions). Thus, data is sampled from the mixture $F = (1-\alpha)G + \alpha H$. Under this contaminated distribution, robust parameter estimation finds point estimates (e.g., mean and variance) that are robust when $\alpha$ is small. In this sense, robust parameter estimation is interested in finding weights $w$ that make the KDE, $\rho_\varepsilon(z)$, robust to the contamination level $\alpha$. Instead, we wish to make the KDE match a target distribution.

Experiment Details: We consider 25 randomly generated sets of exam marks, each having $d = 200$ students taking an exam with $n = 75$ true-or-false questions. For simplicity, we assume that $X_{i,j}\sim\mathrm{Bernoulli}(q_i s_j)$, where $0 < q_i < 1$ is the difficulty of question $i$ and $0 < s_j < 1$ is the $j$-th student's smartness. For each scenario, $q_i = 7/8$ for $1\le i\le 60$ and $q_i = 1/5$ for $60 < i \le 75$, while $s_j = 7/10$ for $1\le j\le 120$ and $s_j = 1/2$ for $120 < j\le 200$. The $X_{i,j}\sim\mathrm{Bernoulli}(q_i s_j)$ are then sampled. This setup results in a bimodal distribution, with an expected average of 0.532 and an expected standard deviation of 0.1206, as shown in Figure 2.

For the kernel density estimate, $\mu(x)$ is chosen as a unit normal distribution truncated to $[0, 1]$, with smoothing parameter $\varepsilon = 0.05$. Similarly, $f$ is a normal distribution with mean 0.5 and variance 0.1, truncated to $[0, 1]$. We take the partition $\{k/400\}_{0\le k\le 400}$ for the Riemann approximation (16). The algorithms CS, EGD, and PFW are run for 150 iterations.

Implementation Details: The learning rate for EGD, PFW, and CS is found through a line search. However, explicit solutions are not used; instead, a backtracking line search with Armijo conditions is used to find an approximate solution. Experiments were written in Python, run on Google Colab, and can be found on GitHub⁴. The random data is seeded for reproducibility.

Results: A table with the results can be seen in Table 2.
In summary, of the 25 scenarios, PFW always produces solutions with the smallest relative entropy, with CS producing the largest relative entropy 13 times and EGD 12 times. For the time taken to make the 150 steps, PFW is the quickest 15 times, EGD 7 times, and CS 3 times. At the same time, EGD is the slowest 13 times, CS 7 times, and PFW 5 times.

4. https://github.com/infamoussoap/Convex Hull

Figure 2: Optimal question weighting for (randomly generated) exam scores with 200 students and 75 questions. The setup follows the experimental details in Section 6.2. The kernel density estimate uses a truncated unit normal distribution with ε = 0.05, and the target distribution is a truncated normal distribution with a mean of 0.5 and a standard deviation of 0.1. We take w^0 = (0.01, ..., 0.01), and the Cauchy-Simplex is applied, with each step using a backtracking line search. The resulting weighted histogram and kernel density estimate are shown. The distribution of the weighted marks is shown on the top row, and its QQ plots against a normal distribution with mean 0.5 and standard deviation 0.1 are shown on the bottom row. At iterations 0, 5, and 20, the weighted scores have a mean of 0.499, 0.514, and 0.501, with standard deviations of 0.128, 0.124, and 0.109, respectively.

Table 2: Results for optimal question weighting after running CS, EGD, and PFW for 150 iterations. The final distance (relative entropy) and the time taken in seconds are shown.
Bold represents the minimum (best) value, and underline represents the maximum (worst) value.

              CS                  EGD                 PFW
Trial   Distance   Time     Distance   Time     Distance   Time
1       0.032432   8.14     0.032426   10.98    0.032114   6.02
2       0.010349   5.95     0.010535   5.08     0.010101   4.46
3       0.016186   5.74     0.016252   4.93     0.015848   5.50
4       0.025684   4.62     0.025726   6.19     0.025309   4.51
5       0.020561   4.63     0.020486   6.03     0.020213   4.50
6       0.016559   5.58     0.016514   5.00     0.016287   4.52
7       0.025957   5.42     0.025867   4.87     0.025757   5.38
8       0.014506   4.77     0.014343   6.14     0.013504   4.28
9       0.032221   4.68     0.032412   6.02     0.032028   4.55
10      0.023523   5.59     0.023528   4.92     0.023232   5.34
11      0.016153   4.93     0.016231   5.01     0.015792   5.63
12      0.035734   4.53     0.035738   6.07     0.035212   4.22
13      0.030205   4.53     0.030234   6.13     0.029859   4.37
14      0.021725   5.80     0.021598   4.99     0.021282   5.85
15      0.030026   4.44     0.029982   4.99     0.029751   5.64
16      0.009212   4.75     0.009182   5.94     0.008931   4.27
17      0.015573   5.22     0.015661   5.46     0.015188   4.48
18      0.017681   5.69     0.017618   4.92     0.017321   5.48
19      0.017888   4.64     0.017874   5.34     0.017283   5.11
20      0.013597   4.55     0.013719   6.04     0.013075   4.42
21      0.016933   5.76     0.016780   4.84     0.016687   4.77
22      0.032185   5.80     0.032141   5.03     0.032039   6.03
23      0.018377   4.69     0.018250   6.06     0.018084   4.54
24      0.031167   4.59     0.031211   6.10     0.030820   4.55
25      0.035608   5.46     0.035674   4.98     0.035408   5.98
Average            5.22                5.68                4.97

It is perhaps expected that PFW outperforms both EGD and CS due to the low dimensionality of this problem. However, CS produces relative entropies similar to EGD's while maintaining a lower average run time of 5.22 seconds, compared to 5.68 seconds for EGD.

6.3 Prediction from Expert Advice

Consider N experts (e.g., on Twitter) who give daily advice for 1 ≤ t ≤ T days. By taking advice from expert i on day t, we incur a loss l_i^t ∈ [0, 1]. However, this loss is not known beforehand, only once the advice has been taken.
Since the wisdom of the crowd is typically better than any given expert (Landemore and Elster, 2012), we would like to take a weighted average of our experts, giving a daily loss of w^t · l^t, where w^t ∈ Δ_N are the weights given to the experts. This problem is also known as the multi-armed bandit problem.

A simple goal is to generate a sequence of weight vectors {w^t}_t to minimize the average expected loss. This goal is, however, a bit too ambitious, as the loss vectors l^t are not known beforehand. An easier problem is to find a sequence {w^t}_t whose average expected loss approaches the average loss of the best expert as T → ∞, that is, (1/T) R_T → 0, where

R_T = Σ_{t=1}^{T} w^t · l^t − min_i Σ_{t=1}^{T} l_i^t.

R_T is commonly known as the regret of the strategy {w^t}_t. Previous works (Littlestone and Warmuth, 1994; Cesa-Bianchi et al., 1997; Freund and Schapire, 1997; Arora et al., 2012) all yield an O(√(T log N)) convergence rate for the regret.

Theorem 11 Consider a sequence of adversarial loss vectors l^t ∈ [0, 1]^N. For any u ∈ Δ_N, the regret generated by the Cauchy-Simplex scheme

w^{t+1} = w^t (1 − η_t Π_{w^t} ∇f_t), where ∇f_t = l^t,

is bounded by

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ D(u|w^1)/η + Tη/(2(1 − η)),

for a fixed learning rate η_t = η < 1. In particular, taking w^1 = (1/N, ..., 1/N) and η = √(2 log N)/(√(2 log N) + √T) gives the bound

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ √(2T log N) + log N.

Moreover, this holds when u = e_j, where j is the best expert and e_j is the standard basis vector. The proof can be found in Appendix B.5.

6.4 Universal Portfolio

Consider an investor with a fixed-time trading horizon, T, managing a portfolio of N assets. Define the price relative for the i-th stock at time t as x_i^t = C_i^t / C_i^{t−1}, where C_i^t is the closing price at time t of the i-th stock. So today's closing price of asset i equals x_i^t times yesterday's closing price, i.e., today's price relative to yesterday's. A portfolio at day t can be described as w^t ∈ Δ_N, where w_i^t is the proportion of an investor's total wealth in asset i at the beginning of the trading day.
Then the wealth of the portfolio at the beginning of day t + 1 is w^t · x^t times the wealth of the portfolio at day t. Consider the average log-return of the portfolio,

(1/T) log Π_{t=1}^{T} (w^t · x^t) = (1/T) Σ_{t=1}^{T} log(w^t · x^t).

Similarly to prediction with expert advice, it is too ambitious to find a sequence of portfolio vectors {w^t}_t that maximizes the average log-return. Instead, we wish to find such a sequence that approaches the best fixed-weight portfolio as T → ∞, i.e., (1/T) LR_T → 0, where

LR_T = Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t),

for any u ∈ Δ_N. If such a sequence can be found, {w^t}_t is a universal portfolio. LR_T is commonly known as the log-regret.

Two standard assumptions are made when proving universal portfolios: (1) for every day, all assets have a bounded price relative, and at least one is non-zero, i.e., 0 < max_i x_i^t < ∞ for all t, and (2) no stock goes bankrupt during the trading period, i.e., a := min_{i,t} x_i^t > 0, where a is known as the market variability parameter. This is also known as the no-junk-bond assumption.

Over the years, various bounds on the log-regret have been proven under both assumptions. Some examples include Cover (1991), in his seminal paper, with O(log T); Helmbold et al. (1998) with O(√(T log N)); Agarwal et al. (2006) with O(N^{1.5} log(NT)); Hazan and Kale (2015) with O(N log(T + N)); and Gaivoronski and Stella (2000) with O(C² log T), for a constant C = sup_{x ∈ Δ_N} ‖∇_b log(b · x)‖, under an extra assumption of independent price relatives. Each comes with varying levels of computational complexity.

Remark 12 Let x^t ∈ [a, b]^N be a bounded sequence of price-relative vectors for 0 ≤ a ≤ b < ∞. Since the log-regret is invariant under re-scalings of x^t, w.l.o.g. we can look at the log-regret for the re-scaled return vectors x^t/b ∈ [ã, 1]^N, with ã = a/b ∈ [0, 1].

Theorem 13 Consider a bounded sequence of price-relative vectors x^t ∈ [a, 1]^N for some positive constant 0 < a ≤ 1 (no-junk-bond), and max_i x_i^t = 1 for all t.
Then the log-regret generated by the Cauchy-Simplex

w^{t+1} = w^t (1 − η Π_{w^t} ∇f_t), where ∇f_t = −x^t/(w^t · x^t),

is bounded by

Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t) ≤ D(u|w^1)/η + Tη/(2a²(1 − η)),

for any u ∈ Δ_N and 0 < η ≤ 1. In particular, taking w^1 = (1/N, ..., 1/N) and η = a√(2 log N)/(a√(2 log N) + √T) gives the bound

Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t) ≤ √(2T log N)/a + log N.

The proof can be found in Appendix B.6.

Experimental Details: We look at the performance of our algorithm on four standard datasets used to study universal portfolios: (1) NYSE, a collection of 36 stocks traded on the New York Stock Exchange from July 3, 1962, to Dec 31, 1984; (2) DJIA, a collection of 30 stocks tracked by the Dow Jones Industrial Average from Jan 14, 2001, to Jan 14, 2003; (3) SP500, a collection of the 25 largest-market-cap stocks tracked by the Standard & Poor's 500 Index from Jan 2, 1988, to Jan 31, 2003; and (4) TSE, a collection of 88 stocks traded on the Toronto Stock Exchange from Jan 4, 1994, to Dec 31, 1998.5

Two other portfolio strategies are considered: (1) Helmbold et al. (1998) (EGD), which uses the EGD scheme w_i^{t+1} = w_i^t exp(η ∇_i f_t) / Σ_j w_j^t exp(η ∇_j f_t), with ∇f_t = x^t/(w^t · x^t), and (2) the Buy and Hold (B&H) strategy, in which one starts with an equally weighted portfolio and the portfolio is then left to its own devices.

Two metrics are used to evaluate the performance of the portfolio strategies: (1) the annualized percentage yield, APY = R^{1/y} − 1, where R is the total return over the full trading period and y = T/252, with 252 the average number of annual trading days; and (2) the Sharpe ratio, SR = (APY − R_f)/σ, where σ² is the variance of the daily returns of the portfolio and R_f is the risk-free interest rate. Intuitively, the Sharpe ratio measures the performance of the portfolio relative to a risk-free investment while also factoring in the portfolio's volatility. Following Li et al. (2012), we take R_f = 4%.
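For concreteness, the two evaluation metrics can be computed as follows. This is an illustrative helper, not the paper's code; `price_relatives` holds the portfolio's daily wealth factors, and following the formula above, σ is taken as the plain standard deviation of the daily returns with no annualization.

```python
import numpy as np

def apy_and_sharpe(price_relatives, rf=0.04):
    """Annualized percentage yield and Sharpe ratio from daily price
    relatives (wealth factors). Illustrative helper, not the paper's code."""
    r = np.asarray(price_relatives, dtype=float)
    R = r.prod()                 # total return over the full trading period
    y = len(r) / 252             # trading years, 252 trading days per year
    apy = R ** (1 / y) - 1       # APY = R^(1/y) - 1
    sigma = np.std(r - 1)        # volatility of daily returns
    return apy, (apy - rf) / sigma

# Example: a two-day period of +10% then -10% (a net loss once annualized)
apy, sr = apy_and_sharpe([1.10, 0.90])
```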
We take the learning rates used to prove that CS and EGD are universal portfolios. In particular, η_CS = a√(2 log N)/(a√(2 log N) + √T) and η_EGD = 2a√(2 log(N)/T), respectively, where a is the market variability parameter. We assume that the market variability parameter is given for each dataset. Experiments were written in Python and can be found on GitHub.6

Results: A table with the results can be seen in Table 3. For the NYSE, DJIA, and SP500 datasets, CS slightly outperforms EGD in both the APY and Sharpe ratio, with EGD having a slight edge on the APY for the NYSE dataset. But curiously, the B&H strategy outperforms both CS and EGD on the TSE.

We remark that this experiment does not reflect real-world performance, as the market variability parameter is assumed to be known, transaction costs are not factored into our analysis, and the no-junk-bond assumption tends to overestimate performance (Gilbert and Strugnell, 2010; Bailey et al., 2014; Bailey and de Prado, 2014). However, this is outside the scope of this paper; the experiment is shown only as a proof of concept.

5. The datasets were originally found at http://www.cs.technion.ac.il/~rani/portfolios/, which is now unavailable. They were retrieved using the Wayback Machine: https://web.archive.org/web/20220111131743/http://www.cs.technion.ac.il/~rani/portfolios/.
6. https://github.com/infamoussoap/Universal Portfolio

Table 3: Performance of different portfolio strategies on different datasets. Bold represents the maximum (best) value, and underline represents the minimum (worst) value.
             CS                  EGD                 B&H
Dataset   APY      Sharpe    APY      Sharpe    APY      Sharpe
NYSE      0.162    14.360    0.162    14.310    0.129     9.529
DJIA     -0.099    -8.714   -0.101    -8.848   -0.126   -10.812
SP500     0.104     4.595    0.101     4.395    0.061     1.347
TSE       0.124    10.225    0.123    10.204    0.127    10.629

7. Conclusion

This paper presents a new iterative algorithm, the Cauchy-Simplex, to solve convex problems over a probability simplex. Within this algorithm, we find ideas from previous works, each of which captures only a portion of the simplex constraint. Combining these ideas, the Cauchy-Simplex provides a numerically efficient framework with nice theoretical properties.

The Cauchy-Simplex maintains the linear form of projected gradient descent, allowing one to find analytical solutions to a line search for certain convex problems. But unlike projected gradient descent, this analytical solution remains in the probability simplex. A backtracking line search can be used when an analytical solution cannot be found. This normally requires an extra parameter, the maximum candidate step size; the Cauchy-Simplex provides a natural choice, since a maximum learning rate is already required to enforce positivity, in contrast to the exponentials used in Exponentiated Gradient Descent.

Since the Cauchy-Simplex satisfies both constraints of the probability simplex in its iteration scheme, its gradient flow can be used to prove convergence for differentiable convex functions. This implies the convergence of its discrete linear scheme. This is in contrast to EGD, PFW, and PGD, whose discrete nature is crucial in satisfying both constraints of the probability simplex. More surprisingly, we find that formulas natural to probability, e.g., variance and relative entropy, arise necessarily in the convergence proofs.

We believe that the strong numerical results, together with the simplicity of its motivating derivation, gradient flow, and iteration scheme, make the Cauchy-Simplex a strong choice for solving problems with a probability simplex constraint.

Acknowledgments

We thank Prof. Johannes Ruf for the helpful discussion and his suggestion of potential applications in the multi-armed bandit problem, which ultimately helped the proof for universal portfolios. We also thank the anonymous reviewers for their helpful suggestions for improving the presentation and readability of this paper. The authors declare no competing interests.
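One step of the Cauchy-Simplex update w^{t+1} = w^t(1 − η Π_{w^t}∇f_t) discussed above can be sketched as below. The fixed fraction γ of the maximum feasible learning rate and the final renormalization guard are our illustrative choices; the paper instead picks η by a line search.

```python
import numpy as np

def cauchy_simplex_step(w, grad, gamma=0.5):
    """One Cauchy-Simplex step: w <- w * (1 - eta * Pi_w grad).
    Here eta = gamma * eta_max, with eta_max = 1 / max_i (Pi_w grad)_i;
    gamma is an illustrative choice, the paper uses a line search."""
    pg = grad - np.dot(w, grad)      # (Pi_w grad)_i = grad_i - <w, grad>
    m = pg.max()
    if m <= 0:                       # Pi_w grad vanishes on the support: optimal
        return w
    eta = gamma / m                  # keeps every factor 1 - eta*pg_i > 0
    w_new = w * (1.0 - eta * pg)     # multiplicative update; sum stays 1 exactly
    return w_new / w_new.sum()       # renormalize only to absorb round-off

# Example: minimize the linear function f(w) = c . w over the simplex
c = np.array([1.0, 0.5, 2.0])
w = np.full(3, 1 / 3)
for _ in range(200):
    w = cauchy_simplex_step(w, c)    # drifts toward the vertex argmin_i c_i
```

Note that positivity and the unit sum are preserved by construction: η·(Π_w∇f)_i ≤ γ < 1, and the update's weighted mean of Π_w∇f is zero.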
Appendix A. Gradient Flow for Orthogonal Matrix Constraint

Consider the optimization problem min_{Q ∈ S_n} f(Q), where S_n = {Q ∈ R^{n×n} | QQ^T = I}. Orthogonal matrices with det(Q) = 1 can be parameterized using the Cayley transform

Q = 2(I − X)^{-1} − I,

where X ∈ R^{n×n} is an anti-symmetric matrix; let F(X) = f(Q). The anti-symmetric matrix X can be parameterized as

X = Σ_{j>k} X_{jk} E^{jk}, for X_{jk} ∈ R and (E^{jk})_{pq} = δ_{pj}δ_{qk} − δ_{pk}δ_{qj}.

Consider the gradient flow Ẋ_{jk} = −∂F/∂X_{jk}. Under the Cayley transform, writing Ω = (I − X)^{-1},

∂Q/∂X_{jk} = 2(I − X)^{-1} (∂X/∂X_{jk}) (I − X)^{-1},

and thus

∂Q_{ab}/∂X_{jk} = 2 Σ_{lm} Ω_{al} (E^{jk})_{lm} Ω_{mb} = 2(Ω_{aj}Ω_{kb} − Ω_{ak}Ω_{jb}).

The gradient flow for the latent variables X_{jk} is therefore

Ẋ_{jk} = −2 Σ_{ab} (∂f/∂Q_{ab}) (Ω_{aj}Ω_{kb} − Ω_{ak}Ω_{jb}) = −2(ΛΩ^T − ΩΛ^T)_{jk}, where Λ = Ω^T (∂f/∂Q).

Converting this to the gradient flow for the orthogonal matrix Q yields

Q̇_{pq} = −4 Σ_{j>k} (Ω_{pj}Ω_{kq} − Ω_{pk}Ω_{jq})(ΛΩ^T − ΩΛ^T)_{jk}
       = −2[Ω(ΛΩ^T − ΩΛ^T)Ω − Ω(ΛΩ^T − ΩΛ^T)^T Ω]
       = −4Ω(ΛΩ^T − ΩΛ^T)Ω,

as required.

Appendix B. Convergence Proofs

We will use the notation Π_t = Π_{w^t} and ∇f_t = ∇_w f(w^t) for the remainder of this section.

B.1 Decreasing Relative Entropy

Theorem 14 For any u ∈ Δ_n, each iteration of the linear Cauchy-Simplex (6) satisfies the bound

D(u|w^{t+1}) − D(u|w^t) ≤ η_t u · (Π_t∇f_t) + C_{γ_t} (η_t)² u · (Π_t∇f_t)²,

where C_{γ_t} = γ_t^{-2} log(e^{−γ_t}/(1 − γ_t)), η_t = γ_t η_{t,max} with γ_t ∈ (0, 1), and (Π_t∇f_t)² is taken elementwise.

Proof By the CS update scheme (6),

D(u|w^{t+1}) − D(u|w^t) = −Σ_i u_i log(w_i^{t+1}/w_i^t) = −Σ_i u_i log(1 − η_t Π_t∇_i f_t)
= η_t u · (Π_t∇f_t) + Σ_i u_i log( e^{−η_t Π_t∇_i f_t} / (1 − η_t Π_t∇_i f_t) ).

Since the learning rate can be written as η_t = γ_t η_{t,max} with γ_t ∈ (0, 1),

η_t Π_t∇_i f_t ≤ γ_t η_{t,max} max_i Π_t∇_i f_t = γ_t < 1.

It can be shown that x^{-2} log(e^{−x}/(1 − x)) > 0 for x ∈ (0, 1) and is an increasing function on (0, 1), with x^{-2} log(e^{−x}/(1 − x)) → ∞ as x → 1. Thus, on the interval x ∈ (0, γ_t] with γ_t < 1, it can be upper-bounded by

0 < x^{-2} log(e^{−x}/(1 − x)) ≤ γ_t^{-2} log(e^{−γ_t}/(1 − γ_t)) = C_{γ_t}.

This yields the inequality 0 < log(e^{−x}/(1 − x)) ≤ C_{γ_t} x² for 0 < x ≤ γ_t (the bound extends to x ≤ 0, where the left side is at most x²/2 ≤ C_{γ_t} x²). Since C_{γ_t} > 0 for γ_t ∈ (0, 1) and C_{γ_t} → ∞ as γ_t → 1, we have the bound

Σ_i u_i log( e^{−η_t Π_t∇_i f_t} / (1 − η_t Π_t∇_i f_t) ) ≤ C_{γ_t}(η_t)² Σ_i u_i (Π_t∇_i f_t)²,

giving the required inequality

D(u|w^{t+1}) − D(u|w^t) ≤ η_t u · (Π_t∇f_t) + C_{γ_t}(η_t)² u · (Π_t∇f_t)².
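The scalar inequality at the heart of this proof, log(e^{−x}/(1 − x)) ≤ C_γ x² for 0 < x ≤ γ, is easy to check numerically (a sanity check of ours, not part of the paper):

```python
import numpy as np

# C_gamma from Theorem 14: C(g) = g**-2 * log(exp(-g) / (1 - g))
def C(g):
    return np.log(np.exp(-g) / (1 - g)) / g ** 2

# Verify log(exp(-x)/(1-x)) <= C(gamma) * x**2 on (0, gamma]
gamma = 0.9
x = np.linspace(1e-4, gamma, 1000)
assert np.all(np.log(np.exp(-x) / (1 - x)) <= C(gamma) * x ** 2 + 1e-12)
```

The check passes for any γ ∈ (0, 1), consistent with x^{-2} log(e^{−x}/(1 − x)) being increasing on (0, 1).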
B.2 Progress Bound

Theorem 15 Let f be convex and differentiable with ∇f L-Lipschitz continuous. Then each iteration of the linear Cauchy-Simplex (6) guarantees

f(w^{t+1}) − f(w^t) ≤ −(η_t/2) Var[∇f_t | w^t],

for 0 < η_t < min{1/L, η_{t,max}}, where η_{t,max} is defined in (6).

Proof Since f is convex with ∇f Lipschitz continuous, the descent lemma yields (Bertsekas, 2016)

f(w^{t+1}) ≤ f(w^t) + ∇f_t · (w^{t+1} − w^t) + (L/2)‖w^{t+1} − w^t‖².   (17)

Our iteration scheme gives

‖w^{t+1} − w^t‖² = (η_t)² Σ_i (w_i^t Π_t∇_i f_t)² ≤ (η_t)² Σ_i w_i^t (Π_t∇_i f_t)²,

since 0 ≤ w_i^t ≤ 1. Hence

f(w^{t+1}) − f(w^t) ≤ −(1/2L)(2z − z²) Var[∇f_t | w^t], where z = η_t L.

However, 2z − z² ≥ z for 0 ≤ z ≤ 1. Therefore,

f(w^{t+1}) − f(w^t) ≤ −(η_t/2) Var[∇f_t | w^t], for 0 < η_t ≤ min{1/L, η_{t,max}}.

B.3 Proof of Lemma 10

Proof Notice that for η_t ∈ (0, η_{t,max} − ε], the bound (17) still holds, and a line search minimizes the left-hand side of (17). Bounding the right-hand side yields

f(w^{t+1}) − f(w^t) ≤ −(η̄_t/2) Var[∇f_t | w^t], for η̄_t = min{1/L, η_{t,max} − ε}.

As shown in Theorem 6, Var[∇f_t|w^t] = 0 only when w^t = w*. Thus, w.l.o.g., for finite T > 0, assume the sequence of vectors {w^t}_{t=0}^{T} generated by the line search satisfies Π_t∇f_t ≠ 0, i.e., the optimality condition has not been reached. Then {f(w^t)}_{t=0}^{T} is a strictly decreasing sequence. Since f is convex and Δ_n is compact, f is bounded from below on Δ_n. Hence f(w^T) → f(w*) as T → ∞.

B.4 Proof of Theorem 9

Proof W.l.o.g., we assume that the sequence {η_t}_{t=0}^{T}, which satisfies the assumptions of Theorem 9, generates a sequence {w^t}_{t=0}^{T} such that Π_t∇f_t ≠ 0, i.e., the optimality condition has not been reached. By Theorem 14,

D(w*|w^{t+1}) − D(w*|w^t) ≤ η_t ∇f_t · (w* − w^t) + C_{γ_t}(η_t)² w* · (Π_t∇f_t)².

By convexity of f, rearranging gives

η_t (f(w^t) − f(w*)) ≤ D(w*|w^t) − D(w*|w^{t+1}) + C_{γ_t}(η_t)² w* · (Π_t∇f_t)².   (18)

By Theorem 15, we have f(w^{t+1}) ≤ f(w^t) − (η_t/2) w^t · (Π_t∇f_t)². Repeatedly applying this inequality gives

f(w^T) + (1/2) Σ_{k=t}^{T−1} η_k w^k · (Π_k∇f_k)² ≤ f(w^t), for t ≤ T − 1.
Thus, (18) gives the bound

η_t (f(w^T) − f(w*)) ≤ D(w*|w^t) − D(w*|w^{t+1}) + C_{γ_t}(η_t)² w* · (Π_t∇f_t)² − (η_t/2) Σ_{k=t}^{T−1} η_k w^k · (Π_k∇f_k)².

Summing over time and collapsing the sum gives

(f(w^T) − f(w*)) Σ_{t=0}^{T−1} η_t ≤ D(w*|w^0) − D(w*|w^T)   (19)
  + Σ_{t=0}^{T−1} C_{γ_t}(η_t)² w* · (Π_t∇f_t)² − (1/2) Σ_{t=0}^{T−1} Σ_{k=t}^{T−1} η_k η_t w^k · (Π_k∇f_k)².

We can rewrite the last term by swapping the order of summation,

Σ_{t=0}^{T−1} Σ_{k=t}^{T−1} η_k η_t w^k · (Π_k∇f_k)² = Σ_{t=0}^{T−1} ( Σ_{k=0}^{t} η_k ) η_t w^t · (Π_t∇f_t)².

Thus, the last two terms of (19) can be bounded by

S := Σ_{t=0}^{T−1} η_t [ C_{γ_t} η_t w* · (Π_t∇f_t)² − (1/2)( Σ_{k=0}^{t} η_k ) w^t · (Π_t∇f_t)² ]
  ≤ Σ_{t=0}^{T−1} η_t [ C_{γ_t} η_t max_i (Π_t∇f_t)_i² − (1/2)( Σ_{k=0}^{t} η_k ) w^t · (Π_t∇f_t)² ],

as w* ∈ Δ_n. By assumption on C_{γ_t},

C_{γ_t} η_t max_i (Π_t∇f_t)_i² ≤ (1/2)( Σ_{k=0}^{t} η_k ) w^t · (Π_t∇f_t)².

It follows that S ≤ 0. Thus (19) gives

f(w^T) − f(w*) ≤ (D(w*|w^0) − D(w*|w^T)) / Σ_{t=0}^{T−1} η_t.

Taking w^0 = (1/n, ..., 1/n), then D(u|w^0) ≤ log(n) for all u ∈ Δ_n. Since relative entropy is non-negative,

f(w^T) − f(w*) ≤ log(n) / Σ_{t=0}^{T−1} η_t.

B.5 Proof of Theorem 11

Proof Rearranging Theorem 14 gives

−η_t u · (Π_t∇f_t) ≤ D(u|w^t) − D(u|w^{t+1}) + C_{γ_t}(η_t)² u · (Π_t∇f_t)².   (20)

Since ∇f_t = l^t, we have the inequality

−1 ≤ Π_t∇_i f_t = l_i^t − w^t · l^t ≤ 1, as l_i^t ∈ [0, 1],

so u · (Π_t∇f_t)² ≤ 1. Thus, dividing (20) by η_t gives

w^t · l^t − u · l^t ≤ (D(u|w^t) − D(u|w^{t+1}))/η_t + C_{γ_t} η_t,

where η_t = γ_t η_{t,max} for some γ_t ∈ (0, 1). Since the maximum learning rate has the lower bound

η_{t,max} = 1/(max_i l_i^t − w^t · l^t) ≥ 1/max_i l_i^t ≥ 1,

we can take a fixed learning rate η_t = η ∈ (0, 1). Moreover, γ_t = η/η_{t,max} ≤ η. Since C_{γ_t} is an increasing function of γ_t, C_{γ_t} ≤ C_η, thus giving the bound

w^t · l^t − u · l^t ≤ (D(u|w^t) − D(u|w^{t+1}))/η + C_η η.

Summing over time and collapsing the sum gives the bound

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ (D(u|w^1) − D(u|w^{T+1}))/η + (T/η) log(e^{−η}/(1 − η)),

by the definition of C_η. Using the inequality log(e^{−x}/(1 − x))/x ≤ x/(2(1 − x)) for 0 ≤ x < 1,

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ D(u|w^1)/η + Tη/(2(1 − η)).

Let w^1 = (1/N, ..., 1/N); then D(u|w^1) ≤ log(N) for all u ∈ Δ_N, giving the desired bound

Σ_{t=1}^{T} w^t · l^t − Σ_{t=1}^{T} u · l^t ≤ log(N)/η + Tη/(2(1 − η)).

The right side of this inequality is minimized when η = √(2 log N)/(√(2 log N) + √T) < 1. Substitution gives the bound √(2T log N) + log N.

B.6 Proof of Theorem 13

Proof Rearranging Theorem 14 gives

−η_t u · (Π_t∇f_t) ≤ D(u|w^t) − D(u|w^{t+1}) + C_{γ_t}(η_t)² u · (Π_t∇f_t)².
Since ∇f_t = −x^t/(w^t · x^t), we have the inequality

1 − 1/a ≤ Π_t∇_i f_t = 1 − x_i^t/(w^t · x^t) ≤ 1 − a, as x_i^t ∈ [a, 1],

so u · (Π_t∇f_t)² ≤ (1/a − 1)² ≤ 1/a². Noting that −u · (Π_t∇f_t) = (u · x^t)/(w^t · x^t) − 1, dividing (20) by η_t gives the bound

(u · x^t)/(w^t · x^t) − 1 ≤ (D(u|w^t) − D(u|w^{t+1}))/η_t + C_{γ_t} η_t / a².

Using the inequality e^x ≥ 1 + x for all x gives

log(u · x^t) − log(w^t · x^t) ≤ (u · x^t)/(w^t · x^t) − 1 ≤ (D(u|w^t) − D(u|w^{t+1}))/η_t + C_{γ_t} η_t / a².

Since the maximum learning rate has the lower bound

η_{t,max} = 1/max_i (1 − x_i^t/(w^t · x^t)) = 1/(1 − min_i x_i^t/(w^t · x^t)) ≥ 1,

we can take a fixed learning rate η_t = η. Following the steps of Theorem 11 gives the bound

Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t) ≤ D(u|w^1)/η + Tη/(2a²(1 − η)).

Taking w^1 = (1/N, ..., 1/N) and minimizing the right-hand side of the inequality w.r.t. η gives η = a√(2 log N)/(a√(2 log N) + √T). This yields the bound

Σ_{t=1}^{T} log(u · x^t) − Σ_{t=1}^{T} log(w^t · x^t) ≤ √(2T log N)/a + log N.

Appendix C. Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions are first-order conditions that are necessary but not sufficient for optimality in constrained optimization problems. For convex problems, they become sufficient for optimality. We give a brief overview of the KKT conditions here; for a full treatment of the subject, we suggest Kochenderfer and Wheeler (2019).

Consider a general constrained optimization problem

min_w f(w) s.t. g_i(w) ≤ 0 and h_j(w) = 0,   (21)

for i = 1, ..., n and j = 1, ..., m. The (primal) Lagrangian is defined as

L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_j β_j h_j(w).

Consider the new optimization problem

θ(w) = max_{α,β; α_i ≥ 0} L(w, α, β) = { f(w) if w satisfies the constraints, +∞ otherwise }.

Hence, α_i and β_j act as slack variables that render θ(w) infinite whenever a constraint is violated. To solve (21), we can instead consider the new optimization problem

min_w θ(w) = min_w max_{α,β; α_i ≥ 0} L(w, α, β).

Assume f and g_i are convex, h_j is affine, and the constraints are feasible.
A solution (w*, α*, β*) is an optimal solution to (21) if the following conditions, known as the KKT conditions, are satisfied for all i and j:

∇_{w_i} L(w*, α*, β*) = 0 (stationarity),
α_i* g_i(w*) = 0 (complementary slackness),
g_i(w*) ≤ 0, h_j(w*) = 0 (primal feasibility),
α_i* ≥ 0 (dual feasibility).

When the constraint is a simplex, the Lagrangian becomes

L(w, α, β) = f(w) − Σ_i α_i w_i + β(Σ_i w_i − 1).

Thus, stationarity gives ∇_{w_i} L = ∇_i f − α_i + β = 0. Let Q = {i : w_i = 0} be the active set and S = {i : w_i > 0} be the support. Complementary slackness requires α_i = 0 for i ∈ S, so stationarity gives β = −∇_i f for i ∈ S, i.e., ∇f is constant on the support. The active set's dual feasibility and stationarity conditions thus require α_i = β + ∇_i f ≥ 0 for i ∈ Q.

References

P. Ablin and G. Peyré. Fast and accurate optimization on the orthogonal manifold without retraction. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
E. M. Achour, F. Malgouyres, and F. Mamalet. Existence, stability and scalability of orthogonal convolutional neural networks. Journal of Machine Learning Research, 23(1), January 2022.
A. Agarwal, E. Hazan, S. Kale, and R. E. Schapire. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning. ACM Press, 2006.
M. Amsler, V. I. Hegde, S. D. Jacobsen, and C. Wolverton. Exploring the high-pressure materials genome. Physical Review X, 2018.
S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 2012.
D. H. Bailey and M. L. de Prado. The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting and non-normality. SSRN Electronic Journal, 2014.
D. H. Bailey, J. M. Borwein, M. L. de Prado, and Q. J. Zhu. Pseudomathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the AMS, 2014.
D. J. Bartholomew. Measuring intelligence.
Cambridge University Press, Cambridge, England, August 2004.
A. Bellet, Y. Liang, A. B. Garakani, M. F. Balcan, and F. Sha. A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning. Society for Industrial and Applied Mathematics, 2015.
D. P. Bertsekas. Nonlinear Programming, page 667. Athena Scientific, 3rd edition, 2016.
D. Bertsimas and D. L. Kitane. Sparse PCA: A geometric approach. Journal of Machine Learning Research, 2023.
I. M. Bomze, F. Rinaldi, and D. Zeffiro. Frank-Wolfe and friends: A journey into projection-free first-order optimization methods, 2021.
I. M. Bomze. On standard quadratic optimization problems. Journal of Global Optimization, 1998.
I. M. Bomze. Regularity versus degeneracy in dynamics, games, and optimization: A unified approach to different aspects. SIAM Review, 2002.
J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects. Springer-Verlag, 2006.
A. Bucci, L. Ippoliti, and P. Valentini. Comparing unconstrained parametrization methods for return covariance matrix prediction. Statistics and Computing, 2022.
A. Candelieri, A. Ponti, and F. Archetti. Bayesian optimization over the probability simplex. Annals of Mathematics and Artificial Intelligence, 2025.
I. Canyakmaz, W. Lin, G. Piliouras, and A. Varvitsiotis. Multiplicative updates for online convex optimization over symmetric cones. arXiv preprint arXiv:2307.03136, 2023.
N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 1997.
S. Chen, A. Garcia, M. Hong, and S. Shahrampour. Decentralized Riemannian gradient descent on the Stiefel manifold. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021.
Y. Chen and X. Ye. Projection onto a simplex. arXiv preprint arXiv:1101.6081, 2011.
H.-H. Chou, J. Maly, and H. Rauhut.
More is less: Inducing sparsity via overparameterization. Information and Inference: A Journal of the IMA, 2023.
H.-H. Chou, H. Rauhut, and R. Ward. Robust implicit regularization via weight normalization, 2024.
M. Collins, A. Globerson, T. Koo, X. Carreras, and P. L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 2008.
T. M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, January 1991.
P. Das and A. Banerjee. Meta optimization and its application to portfolio selection. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 2011.
E. de Klerk, D. den Hertog, and G. Elabwabi. On the complexity of optimization over the standard simplex. European Journal of Operational Research, 2008.
L. L. Duan. Latent simplex position model: High dimensional multi-view clustering with uncertainty quantification. Journal of Machine Learning Research, 2020.
S. Fiori. Quasi-geodesic neural learning algorithms over the orthogonal group: A tutorial. Journal of Machine Learning Research, 2005.
G. Floto, T. Jonsson, M. Nica, S. Sanner, and E. Z. Zhu. Diffusion on the probability simplex. arXiv preprint arXiv:2309.02530, 2023.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.
A. A. Gaivoronski and F. Stella. Stochastic nonstationary optimization for finding universal portfolios. Annals of Operations Research, 2000.
A. Galántai. Projectors and Projection Methods. Springer US, 2004.
J. W. Gibbs. Elementary Principles in Statistical Mechanics. Cambridge University Press, September 2010.
E. Gilbert and D. Strugnell. Does survivorship bias really matter?
An empirical investigation into its effects on the mean reversion of share returns on the JSE (1984–2007). Investment Analysts Journal, 2010.
L. S. Gottfredson. Logical fallacies used to dismiss the evidence on intelligence testing. In Correcting Fallacies about Educational and Psychological Testing. American Psychological Association, 2009.
S. Grünewälder. Compact convex projections. Journal of Machine Learning Research, 2018.
J. Guélat and P. Marcotte. Some comments on Wolfe's "away step". Mathematical Programming, 1986.
P. R. Halmos. Introduction to Hilbert Space and the Theory of Spectral Multiplicity: Second Edition. Dover Books on Mathematics. Dover Publications, 1998.
P. Hansen, B. Jaumard, and S. H. Lu. On using estimates of Lipschitz constants in global optimization. Journal of Optimization Theory and Applications, 1992.
E. Hazan and S. Kale. An online portfolio selection algorithm with regret logarithmic in price variation. Mathematical Finance, 2015.
D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth. On-line portfolio selection using multiplicative updates. Mathematical Finance, October 1998.
P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 1964.
M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 2011.
B. Jiang and Y.-H. Dai. A framework of constraint preserving update schemes for optimization on Stiefel manifold. Mathematical Programming, 2014.
S. W. Jung, T. H. Kim, and S. J. Ko. A novel multiple image deblurring technique using fuzzy projection onto convex sets. IEEE Signal Processing Letters, 2009.
A. Kalai and S. Vempala.
Efficient algorithms for universal portfolios. Journal of Machine Learning Research, 3:423–440, March 2003.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.
M. J. Kochenderfer and T. A. Wheeler. Algorithms for Optimization. The MIT Press, 2019.
A. Kuznetsova and A. Strekalovsky. On solving the maximum clique problem, 2001.
A. Kyrillidis, S. Becker, V. Cevher, and C. Koch. Sparse projections onto the simplex. In Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants, 2015.
S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
H. Landemore and J. Elster. Collective Wisdom: Principles and Mechanisms. Cambridge University Press, 2012.
M. Lezcano-Casado and D. Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.
B. Li, P. Zhao, S. C. H. Hoi, and V. Gopalkrishnan. PAMR: Passive aggressive mean reversion strategy for portfolio selection. Machine Learning, 2012.
Q. Li, D. McKenzie, and W. Yin. From the simplex to the sphere: Faster constrained optimization using the Hadamard parametrization. Information and Inference: A Journal of the IMA, 2023.
Y.-H. Li and V. Cevher. Convergence of the exponentiated gradient method with Armijo line search. Journal of Optimization Theory and Applications, 2018.
N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 1994.
H. Lu, R. M. Freund, and Y. Nesterov. Relatively smooth convex optimization by first-order methods, and applications.
SIAM Journal on Optimization, 2018.
D. G. Luenberger. Optimization by Vector Space Methods. John Wiley and Sons, Inc., USA, 1st edition, 1997.
N. Mackintosh. IQ and Human Intelligence. Oxford University Press, London, England, 2nd edition, 2011.
H. M. Markowitz, R. Lacey, J. Plymen, M. A. H. Dempster, and R. G. Tompkins. The general mean-variance portfolio selection problem [and discussion]. Philosophical Transactions: Physical Sciences and Engineering, 1994.
R. A. Maronna, R. D. Martin, and V. J. Yohai. Robust Statistics. Wiley, 2006.
C. Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. Journal of Optimization Theory and Applications, 1986.
T. Mizutani. Ellipsoidal rounding for nonnegative matrix factorization under noisy separability. Journal of Machine Learning Research, 2014.
C. Mu, Y. Zhang, J. Wright, and D. Goldfarb. Scalable robust matrix recovery: Frank-Wolfe meets proximal methods. SIAM Journal on Scientific Computing, 2016.
M. Nandan, P. P. Khargonekar, and S. S. Talathi. Fast SVM training using approximate extreme points. Journal of Machine Learning Research, 2014.
A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, 1983.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.
L. Omlor and M. A. Giese. Anechoic blind source separation using Wigner marginals. Journal of Machine Learning Research, 2011.
A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 2008.
K. Richardson. What IQ tests test. Theory & Psychology, 2002.
J. Särelä and H. Valpola. Denoising source separation. Journal of Machine Learning Research, 2005.
P. Schuster and K. Sigmund. Replicator dynamics. Journal of Theoretical Biology, 100(3):533–538, 1983.
A. Selvi, D. den Hertog, and W. Wiesemann.
A reformulation-linearization technique for optimization over simplices. Mathematical Programming, 2023.
A. B. Shuttleworth-Edwards. Generally representative is representative of none: Commentary on the pitfalls of IQ test standardization in multicultural settings. The Clinical Neuropsychologist, 2016.
A. B. Shuttleworth-Edwards, R. D. Kemp, A. L. Rust, J. G. L. Muirhead, N. P. Hartman, and S. E. Radloff. Cross-cultural effects on IQ test performance: A review and preliminary normative indications on WAIS-III test performance. Journal of Clinical and Experimental Neuropsychology, 2004.
K. Tajima, Y. Hirohashi, E. R. R. Zara, and T. Kato. Frank-Wolfe algorithm for learning SVM-type multi-category classifiers. Society for Industrial and Applied Mathematics, 2021.
N. K. Vishnoi. Mirror Descent and the Multiplicative Weights Update, pages 108–142. Cambridge University Press, 2021.
W. Wang and M. Á. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.
X. Wang, J. Zhang, and W. Zhang. The distance between convex sets with Minkowski sum structure: Application to collision detection. Computational Optimization and Applications, 2020.
Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 2012.
B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of the Thirty Third Conference on Learning Theory. PMLR, 2020.
N. Xiu, C. Wang, and L. Kong. A note on the gradient projection method with exact stepsize rule. Journal of Computational Mathematics, 2007.
R. Yousefzadeh. A sketching method for finding the closest point on a convex hull. arXiv preprint arXiv:2102.10502, 2021.