# aggregating_algorithm_and_axiomatic_loss_aggregation__04ede7a6.pdf

Published in Transactions on Machine Learning Research (09/2025)

Aggregating Algorithm and Axiomatic Loss Aggregation

Armando J. Cabrera Pacheco a.cabrera@uni-tuebingen.de University of Tübingen and Tübingen AI Center Tübingen, 72070 Germany

Rabanus Derr rabanus.derr@uni-tuebingen.de University of Tübingen and Tübingen AI Center Tübingen, 72070 Germany

Robert C. Williamson bob.williamson@uni-tuebingen.de University of Tübingen and Tübingen AI Center Tübingen, 72070 Germany

Reviewed on Open Review: https: // openreview. net/ forum? id= 4b Uu Wt Ou Dx

Supervised learning has gone beyond the empirical risk minimization framework. Central to most of these developments is the introduction of more general aggregation functions for losses incurred by the learner. In this paper, we turn towards online learning under expert advice. Via easily justified assumptions we characterize a set of reasonable loss aggregation functions as quasi-sums. Based upon this insight, we suggest how to tailor Vovk s Aggregating Algorithm to these more general aggregation functions. The change of variables we propose, let us highlight that weighting profiles determine the contribution of each expert to the next prediction according to their loss and the multiplicative structure of the weight updates in the Aggregating Algorithm translates into the additive structure of the loss aggregation in the regret bound. In addition, we suggest that the mixability of the loss function, which is functionally necessary for the Aggregating Algorithm, is intrinsically relative to the log loss, because the standard aggregation of losses in online learning is the sum. Finally, we conceptually and empirically argue that our generalized loss aggregation functions express the attitude of the learner towards losses.

1 Introduction

Whether it is framed as collaborative learning, distribution shift or data corruption, machine learning scholarship encountered the necessity to rethink the gold standard of learning theory: expected risk minimization.1

Central to this paradigm is the minimization of the average loss over a set of instances incurred by the learner. In the named, more recent learning scenarios the average is replaced by other aggregation functionals which take into account the different data sources (Haghtalab et al., 2022, Eq. (1)), the distribution shift (Rahimian and Mehrotra, 2019, Eq. (R0)) or structured noise (Iacovissi et al., 2023). See (Fröhlich and Williamson, 2024) for a nice stratification of reasonable aggregations and an axiomatic approach to loss aggregation in offline learning. However, an axiomatic approach to generalized loss aggregations remained within the borders of offline learning. An analogous development in online, adversarial learning does, to the best of our knowledge, not exist. In this work, we focus on learning under expert advice, where a learner, informed by expert predictions, crafts a prediction which ideally performs as good as the best expert in the

*Equal contribution. 1This list is far from being exhaustive, e.g., fair machine learning (Williamson and Menon, 2019).

Published in Transactions on Machine Learning Research (09/2025)

group of experts. The performance is measured by the regret which is the sum of losses between learners and experts. The sum as aggregation of the instance-wise losses is a deliberate choice which makes every incurred loss equally important.

There have been efforts made to go beyond expert learners on sum-aggregation. For instance, discounted aggregation of losses (Cesa-Bianchi and Lugosi, 2006, p. 32) down-weights the history of incurred losses of learners and experts (Chernov and Zhdanov, 2010). Or, (strongly) adaptive regret (Hazan and Seshadhri, 2007; Daniely et al., 2015) computes the maximal regret on a fixed-length subintervals of instances (Zhang et al., 2020). All of these regret types share that they change the aggregation function of the losses such that certain instance-wise losses are emphasized.2 Hence, not every prediction made is equally important to good performance, as its relative contribution to the regret differs. The generalization of the aggregation functions is motivated by decaying interest in past performance (discounted regret) or interest in good performance in changing environments ((strongly) adaptive regret) (cf. distribution shift in supervised learning theory).

In contrast, our considered aggregations are mainly motivated by judging the importance of large or small losses. For instance, this might be relevant in scenarios like weather-prediction in which catastrophic mispredictions are extremely undesired, as weather warning systems might to be triggered, but slight mispredictions are not considered to be harmful.

The aggregation function for the losses is part of the optimization objective. Hence, the design of the objective requires an informed choice. Motivated by the context which demand for general aggregations beyond the sum named before, we offer an axiomatic access to the choice of loss aggregations. The axioms construe a starting point for informed and consequence-aware objective design.

Contributions: First, we put forward an axiomatic approach to reasonable loss aggregations in online learning, providing a generalized definition of regret. More concretely, under easily justified assumptions the aggregation forms a quasi-sum generated by an appropriate function u: [0, ) [0, ), given by

Qu n(x1, . . . , xn) = u 1 n X

Based upon this insight, we suggest how to tailor the Aggregating Algorithm to these more general aggregation functions. The Aggregating Algorithm, first suggested by Vovk (1990), solves the learning under expert advice problem. It enjoys nice theoretical guarantees, e.g., a time-independent bound on regret and the recovery of Bayes updating under an appropriate choice of loss function, while keeping simplicity. The change of variables we propose in this work, leads us to several structural insights on the Aggregating Algorithm: (a) We identify a weighting profile which determines the contribution of each expert to the next prediction. (b) We argue that the multiplicative structure of the weight updates in the Aggregating Algorithm translates into the additive structure of the loss aggregation in the regret bound. (c) We show that the mixability of the loss function, which is functionally necessary for the Aggregating Algorithm, is intrinsically relative to the log loss, because the standard aggregation of losses in online learning is the sum. Finally, we argue that generalized aggregations express the attitude of the learner towards losses incurred by providing predictions. In particular, we can tune the generator u of the quasi-sum to express the aversity towards extreme losses (convex u) or the risk-seeking behavior of a learner (concave u). We provide experimental evidence corroborating this statement, which closes the loop by motivating the use of generalized aggregations in the first place.

2 Vovk s Aggregating Algorithm

For an intuitive and illustrated introduction to Vovk s Aggregating Algorithm see Appendix A. Here, we describe it in a rigorous way following (Vovk, 1990). We also use this part to set up notation and remark its main properties. A prediction game (Ω, Γ, Θ, λ, η) consists of the following objects:

2Note that dynamic regret Zinkevich (2003) does not fit into this picture, as dynamic regret changes the comparison baseline, but not the aggregation functional.

Published in Transactions on Machine Learning Research (09/2025)

Sample space Ω. This is regarded as the set of outcomes from nature. In general, we do not impose any structure on it.

Decision space Γ. This is thought as the allowed predictions. Γ is a topological space endowed with the σ-algebra generated by the open sets.

Parameter space Θ. We index the experts (or decision strategies) by θ Θ. We define (Θ, F, µ) as a measure space with µ being a base measure on the σ-algebra F on Θ. For the sake of simplicity we do not write out the base measure µ in the rest of the paper, e.g., we write the Lebesgue-integral for a measurable E F, µ(E) = R

Loss function λ: Ω Γ [0, ]. This gives us a way to measure the quality of the predictions.

Learning rate η > 0. A positive real number, typically as large as possible.

Example 2.1. A widely used loss function in the prediction game is the log-loss λln : Ω Γ [0, ). In particular, when elements in Γ are of the form γ : Ω [0, 1], we define it as λln(ω, γ) := ln(γ(ω)).

The prediction game works as follows. At each time t [T] = {1, ..., T} N,

1. The experts make their predictions. That is, we have a measurable map ξt : Θ Γ. ξt(θ) is interpreted by the prediction made by the expert θ.

2. The learner (who observes the predictions made by the experts) makes a prediction γt Γ.

3. Nature chooses the outcome ωt Ω.

The goal of the learner is to ensure that its cumulative loss is as good as the best expert s cumulative loss. In other words, we want to bound the regret,

RT (θ) := LT (learner) LT (θ) :=

t=1 λ(ωt, γt)

t=1 λ(ωt, ξt(θ)), θ Θ.

Vovk s Aggregating Algorithm (Vovk, 1990; 2001) (see Algorithm 1) gives a bound which is independent of the number of played rounds T,

LT (learner) LT (θ) ln n

η , θ Θ, (1)

when the number of experts is finite (n = |Θ| < ) and when the game is mixable, i.e., we assume the existence of a substitution function Σ for given λ and η such that λ(ω, Σ(ψ)) ψ(ω), for all w Ωand all pseudo-predictions ψ of the form (2) below.3 As one can easily see, mixability is required for the third step of the Algorithm 1. Since the first two steps are independent of how the substitution concretely looks like, they are sometimes summarized as the Aggregating Pseudo-Algorithm (APA). The separation as well simplifies the analysis of the algorithm. A fundamental property of the AA is that, under some conditions, the updating scheme (Step (1) in AA (Algorithm 1)) is reduced to Bayesian updating (see Appendix A.1).

3 Generalizing the loss aggregation in learning under expert advice

The primary goal of the AA, as well as other algorithms solving the learning under expert advice problem, is to bound the summed loss of the learner by the summed loss of each of the experts plus an error term. We now turn towards an approach to loss aggregation from first principles. Instead of presuming the standard summation we provide a list of axioms for loss aggregation functionals which we readily justify (see for example (Grabisch et al., 2009)). Clearly, the choice of axioms is, and should not be, universal. We consider our suggested list as a starting point for discussions which can be used for designers of machine

3For several games, the log-loss (Example 2.1) can be shown to be mixable for η 1.

Published in Transactions on Machine Learning Research (09/2025)

Algorithm 1: Aggregating Algorithm (AA) Data: Mixable prediction game (Ω, Γ, Θ, λ, η) and prior distribution P0. Result: Predictions (γt)t 0. for every t do

(1) Update experts weights Pt(θ) := e ηλ(ωt,ξt(θ))Pt 1(θ) ; (2) Provide pseudo-prediction ,

ψt(ω) := ln η

Θ e ηλ(ωt,ξt(θ))P t 1 dθ , with P t 1(θ) := Pt 1(θ) R

Θ Pt 1(θ) dθ. (2)

(3) Substitute the pseudo-prediction by an allowed prediction γt := Σ(ψt) such that λ(ω, Σ(ψt)) ψ(ω), for all ω Ω; end

learning systems to justify the desired objective in a certain context. In particular, the axioms structure the existing options for aggregations. Given an axiomatic characterization of aggregations certain properties of aggregations can be discussed, accepted as being important or neglected as being irrelevant.

Our specific choice of axioms emphasize the special role of quasi-sums as they allow for the switch from additive to multiplicative structure necessary for the AA. The set of axioms is potentially not minimal, in the sense that one axiom can be derived from the others. Furthermore, this list of axioms is definitely not exhaustive. However, it is sufficient to disentangle parts and properties of the AA as stated in the contributions in Section 1. Definition 3.1 (Aggregation Functions). A function A: S

n N[0, )n [0, ) is called an aggregation function. We write An(x1, . . . , xn) for the aggregation of n instances. Let x1, . . . , xn, x [0, ). We define the following properties of A.

(A1) Continuity. We say A is continuous if for every xi, i [n],

lim xi x An(x1, . . . , xi, . . . , xn) = An(x1, . . . , x, . . . , xn).

(A2) Monotonicity. We say that A is strictly increasing if for xi < x i, i [n], we have

An(x1, . . . , xi, . . . , xn) < An(x1, . . . , x i, . . . , xn).

(A3) Associativity. We say that A is associative if for all x [0, ) and i [n] we have

An(x1, . . . , xi, . . . , xn) = A2(Ai(x1, . . . , xi), An i(xi+1, . . . , xn)).

(A4) Loss compatibility. We say that A is loss compatible if A(0, ..., 0) = 0.

Properties (A1)-(A4) seem, dependent on context, relevant to impose on an aggregation A of losses since they can be interpreted as follows. If A is continuous, an infinitesimally small change in a loss will result in an infinitesimal change in the aggregate of the losses. If it is strictly increasing, more loss on any instance give more aggregated loss. If it is associative, it is irrelevant how the losses are grouped together for aggregation. Associativity is arguably the most contested axiom, which is not fulfilled by discounted aggregation or adaptive aggregation (cf. Section 1). Furthermore, associativity does most of the heavy-lifting in the characterization result below. However, note that even though this type of associativity immediately implies that the aggregation function is completely determined by the binary aggregation (Grabisch et al., 2009, p. 33), it does not directly imply commutativity.4 Loss compatibility means that 0 losses aggregate to 0. It turns out we can fully characterize aggregation functions satisfying the properties above.

4For instance, consider the aggregation which gives back the last value, in terms of the intrinsic ordering of the input, of all elements.

Published in Transactions on Machine Learning Research (09/2025)

Lemma 3.2 (axiomatic Characterization of Loss-Aggregations). Let A: S

n N[0, )n [0, ) be an aggregation function. Suppose that A satisfies (A1) - (A4). Then, there exists a continuous, strictly increasing function u: [0, ) [0, ), with u(0) = 0, such that

An(x1, . . . , xn) = u 1 n X

We call such aggregations quasi-sums generated by u, and we denote them by Qu. (Proof F.1)

Example 3.3 (p-Norms). Let u(x) = xp for p > 0. Then Qu n(x1, . . . , xn) = (Pn i=1 xp i )1/p.

If furthermore A is positively homogeneous, i.e., scaling the losses scales the aggregation of losses An(λx1, . . . , λxn) = λAn(x1, . . . , xn) for all λ > 0, then u(x) = xk for some k (0, ) in (3) (Proof F.1).

Lemma 3.4 (Quasi-Sum to Quasi-Product). Let Qu be a quasi-sum. Then, there exists a continuous, strictly decreasing function f : [0, ) (0, 1] with f(0) = 1, such that

Qu n(x1, ..., xn) = g (f(x1) ... f(xn)) , (4)

where g: (0, 1] [0, ) is the inverse of f.5 (Proof F.2)

Example 3.5. Let u(x) = x, then f(x) = e x and the corresponding aggregation of the form (4) is the usual sum:

An(x1, ..., xn) = g (f(x1)...f(xn)) = Ln(x1, ..., xn).

Remark 3.6. In the remainder of this work we will sometimes write a quasi-sum Qu as an aggregation function A of the form (4) when convenient. This will be explicitly stated or clear from context.

Equipped with this characterization of reasonable aggregation functions of losses, it is natural to ask how can we solve the extended learning under expert advice problem: In which way do we have to modify existing online learning algorithms in order to provide regret guarantees such that the aggregated loss of the learner is bound above in terms of the aggregated loss of any expert and some constant error term?

3.1 Relation to existing literature

The axiomatization of aggregation functionals has a long history (Grabisch et al., 2009). The set of idempotent aggregation functionals is of particular relevance to our work. Idempotent aggregations map an n-tuple of constant c s to c. Averages and generalizations thereof are often idempotent. Hence, offline learning setups usually fall back to idempotent loss aggregations. Online learning most often focuses on sums, a non-idempotent aggregation functional, to offer more fine-grained statements on the regret guarantees.

The set of quasi-arithmetic means forms the set of idempotent analogues to the set of quasi-sums characterized above (see (Grabisch et al., 2009, Section 6.5.1) for details). Kolmogorov and Castelnuovo (1930) and Nagumo (1930) independently provided an axiomatic characterization of this set which strongly resembles our Lemma 3.2. In (Li et al., 2023) the authors use a subset of those quasi-arithmetic means generated by a family of exponential functions, the entropic risk measures, to extend the standard empirical risk minimization framework.

Note that coherent risk measures (Delbaen, 2002), another set of generalized expectations proven useful to generalize empirical risk minimization (Fröhlich and Williamson, 2024), are idempotent and allow for an axiomatic description. However, an axiomatic description of the non-idempotent analogues of coherent risk measures is still an open question. The existence of such a description would be helpful for the use of such aggregations in online learning.

The axiomatic characterizations of aggregations for loss functions should not be conflated with axiomatic characterizations of decision behavior. Certainly, there are relations between the axiomatic characterization

5For the sake of readability we neglect the multiplication sign in further expressions.

Published in Transactions on Machine Learning Research (09/2025)

of expected utility behavior based on the seminal work by Von Neumann and Morgenstern (1953) and the axiomatic characterization of quasi-arithmetic means by Kolmogorov and Castelnuovo (1930) and Nagumo (1930) mentioned before (Muliere and Parmigiani, 1993). However, the goals of the axiomatic statement and hence the axioms are quite different. Again differently, an axiomatic approach to aggregations of experts, not the losses, is put forward in (Neyman and Roughgarden, 2023). The authors use quasi-means to pool forecaster in a low-regret fashion.

Finally, in Section 1 we already hinted towards other generalizations of aggregations in online learning. The mentioned discounted loss aggregation fulfill all axioms listed in Definition 3.1, including positive homogeneity, except for associativity. The aggregation implicitly defined in (strong) adaptive, however, are only loss compatible, positive homogeneous and continuous.

4 A change of variables for the AA

It turns out that the answer to the question how to learn under expert advice such that the learner can provide regret guarantees under generalized loss aggregations is relatively straightforward. Via a change of variables -trick we obtain regret bounds for generalized aggregations relying on the standard Aggregating Algorithm. Despite its simplicity, we take this observation as a starting point to re-investigate the structure and requirements of the Aggregating Algorithm.

Corollary 4.1 (Change of variables for the AA). Let G := (Ω, Γ, Θ, eλ, η) be a mixable prediction game with |Θ| = n. Furthermore, we assume eλ = u λ for some loss function λ and continuous, strictly increasing function u: [0, ) [0, ), with u(0) = 0. The AA applied on G achieves the following regret bound:

Qu T (learner) := Qu T (λ(ω1, Σ(ψ1))...λ(ωT , Σ(ψT ))) Qu 2

Qu T (θ), u ln(n)

for any θ Θ.

Proof. Under the assumptions on G the AA achieves the following regret bound (Vovk, 1990),

t=1 u (λ(ωt, Σ(ψt)))

t=1 u (λ(ωt, ξt(θ ))) + ln (n)

which is equivalent to

t=1 u (λ(ωt, Σ(ψt))) u

t=1 u (λ(ωt, ξt(θ )))

+ u u 1 ln (n)

and, because u is strictly increasing,

t=1 u (λ(ωt, Σ(ψt)))

t=1 u (λ(ωt, ξt(θ )))

+ u u 1 ln (n)

which is (5).

Remark 4.2. It is worth to point out that often one is interested in composing loss functions from the inside , i.e., reparametrizations (Williamson et al., 2016). Here the transformation is extrinsic, we push the loss curve via u obtaining a distorted version of the original loss function λ.

The lemma shows that it is a matter of perspective to consider the loss u ℓwith standard sum aggregation or the loss ℓwith Qu-aggregation. Obtained in this way, the regret bound seems artificial and the arithmetic juggling unmotivated. Therefore, we shift our focus to understanding what is happening in the background, that is, why can a loss distortion be translated into a change of aggregation. In order do so we detour through the development of an Aggregated Algorithm adapted to Quasi-Sums, which is a reparametrization of the standard Aggregating Algorithm. We learn that:

Published in Transactions on Machine Learning Research (09/2025)

(a) The Aggregating Algorithm involves a weighting function, by default fixed to be the negative exponential e x, which explains the downthe or up-weighting of experts based on their incurred loss. (Section 4.1)

(b) The multiplicative structure of the weight updates directly translates into the additive structure of the loss aggregation. (Section 4.2)

(c) The fundamentality of the log loss in the definition of mixability required for the Aggregating Algorithm is an artifact of the standard use of summation for loss aggregation. (Section 4.3)

Let G := (Ω, Γ, Θ, eλ, η) be a prediction game. In the following, we step-by-step go through the Aggregating Algorithm under the premise that the loss function eλ is decomposable, i.e., eλ = u λ, with λ being a loss function and u: [0, ) [0, ) being a continuous, strictly increasing function with u(0) = 0.

4.1 The Weight Updates

Step (1) in Algorithm 1 is the updating of the experts weights. For the moment, we set the learning rate η = 1. Under the assumption of decomposability of eλ, i.e., eλ = u λ, the update can be written as,

Pt(θ) := e eλ(ωt,ξt(θ))Pt 1(θ) = f(λ(ωt, ξt(θ)))Pt 1(θ),

for f(x) := e u(x). In particular, the function f allows for the interpretation as being the profile to judge the experts. It determines how much the expert θ contributes to the next pseudo-prediction (Step 2) depending on the incurred loss in the current time step. For this reason, we call f weighting profile. Weighting profiles fulfill three simple requirements directly derived from the properties of u (cf. Lemma F.2).

Definition 4.3 (Weighting Profile). We say that a continuous f : [0, ) (0, 1] is a weighting profile if (a) f(0) = 1, (b) f is strictly decreasing, and (c) lim x f(x) = 0.

However, the properties of f can be justified independently of the properties of u. The normalization f(0) = 1 means that weights should be positive and bounded from above by 1. The expert incurring 0 loss should get assigned full weight, while f being strictly decreasing implies that the higher the loss the less weight should be put on the expert. The limiting behavior of f says that an expert which incurs extremely large losses should be punished by getting down-weighted to 0. Note that most of our following statements are framed for a choice of weighting profile f instead for the tantamount choice of u. Hence, the change of aggregation Qu amounts to the change of weighting profile.

To illustrate Table 1 provides a short list of potential aggregation functions, their corresponding function u (see Section 5.1) and weighting profile. Note that we use the term focal aggregation in the table to emphasize that the composition of the corresponding u with the log-loss recovers the often used focal loss (with γ = 2) (Lin et al., 2017).

For a fixed loss function we can analyze and compare weighting profiles for different aggregation functions. For instance (see Table 1), observe that for Lp-norm aggregations switching the value of p from less than 1 to strictly bigger than 1 drastically changes the shape of the weighting profile. Let us shortly compare the L2-norm and sum. For the sake of simplicity, we focus on the arbitrary learning rate η = 0.5. The weighting profile corresponding to the L2-norm punishes higher losses stronger than smaller losses compared to the sum. This, however, comes with the cost that for small losses the L2 weighting profile does not finely distinguish between good and even better experts. Both of them get nearly updated with the same weight, in contrast to the sum.

Crucially, comparisons of weighting profiles require to fix a loss a priori. The domain and distribution of the loss values themselves strongly interact with the choice of aggregation. For instance, the Brier score only provides values between 0 and 1. Thus, the weighting profile beyond 1 on the x-axis is irrelevant for the comparison. Hence, conclusive interpretations are only possible for fixed losses and different aggregations. Note that mixability might not be maintained for arbitrary combinations of losses and weighting profiles (Section B). For this reason, we go beyond this qualitative interpretation of the effect of aggregations and

Published in Transactions on Machine Learning Research (09/2025)

Table 1: Aggregations, their corresponding function u and weighting profiles for different learning rates (lightseagreen: η = 0.001, limegreen: η = 0.5, mediumseagreen: η = 1, seagreen: η = 2, darkgreen: η = 10, darkslategray: η = 100).

Aggregation u(x) = Weighting Profile

L0.5-norm ( x1 + x2 + . . .)2 x

Sum x1 + x2 + . . . x

x2 1 + x2 2 + . . . x2

L10-norm (x10 1 + x10 2 + . . .) 1 10 x10

Focal aggregation (1 exp( x))2x

illustrate via an experiment how the aggregation changes the performance of the Aggregating Algorithm in Section 5.2.

The properties of a weighting profile f imply the existence of a continuous inverse, which we denote by g: (0, 1] [0, ), such that (a) g(1) = 0, (b) g is strictly decreasing, and (c) lim x 0+ g(x) = . Finally,

note that when choosing λ(x) = g(x) in the experts weight updates one recovers Bayes updating much as described in detail in (Vovk, 2001, Section 2.2) for λ(x) = log x.

4.2 The Pseudo-Prediction

Different to the exponential weight algorithm (cf. (Cesa-Bianchi and Lugosi, 2006, p. 14)), which shares the weight updates, the Aggregating Algorithm additional involves a pseudo-prediction step, which is then used to derive the actual prediction. For η = 1, eλ = u λ and f(x) := e u(x) as above, the pseudo-prediction (2) writes as,

ψt(ω) = lne 1 Z

Θ f(λ(ω, ξt(θ)))P t 1(θ) dθ , P t 1(θ) := Pt 1(θ) R

Θ Pt 1(θ) dθ.

For notational convenience, we introduce the following slight variant,

ψf t (ω) := u 1(ψt(ω)) = g Z

Θ f(λ(ω, ξt(θ)))P t 1(θ) dθ , (6)

Published in Transactions on Machine Learning Research (09/2025)

where g is the inverse of f as described above. For the sake of readability, we sometimes do not explicitly highlight the dependence of ψ on f in every instance. However, we use the ψf notation in those cases where the dependency on f is not clear or particularly important.

The pseudo-prediction allows for several interpretations:

Normalizing Factor At each step t we define the Ω-dependent family of measures on Θ, given by

pt(θ; ω) = f(λ(ω, ξt(θ)))f(ψf t (ω)) 1P t 1(θ),

where P t 1 is the normalization of Pt 1. Here, we do not specify ψf t : Ω [0, ). Imposing pt(θ; ω) to be a probability distribution on Θ yields to the pseudo-prediction given in (6). Hence, we can interpret ψf t as the normalizing factor of pt(θ; ω).

Loss Mapping Suppose that ωt Ωis revealed by nature. Then

ψf t (ωt) = g Z

Θ f(λ(ωt, ξt(θ)))P t 1(θ) dθ .

If λ(ωt, ξt(θ)) 1 for all θ Θ, by properties of the weighting profile f, the value of ψf t (ωt) will be very large. On the other hand, if λ(ωt, ξt(θ)) 0, for all θ Θ, then its value will be close to 0. In this sense, we can interpret ψf t (ωt) as the loss incurred by pseudo-prediction ψf t when ωt is observed (cf. (Vovk, 2001, Section 2.1)).

Concluding, we summarize the reparametrization of the first two steps of the Aggregating Algorithm in Algorithm 2. Leaning on Vovk (2001) s naming convention we call it the Aggregating Pseudo-Algorithm for Quasi-Sums (APA-QS). Given the loss interpretation of the pseudo-prediction, we can prove the following Lemma 4.4. Note that we consider the aggregation function A: S

n N[0, )n [0, ) given by

An(x1, ..., xn) := g(f(x1)f(x2)...f(xn)). (7)

for a fixed weighting profile f (and hence its inverse g). Recall that using (4) A can be expressed as a quasi-sum Qu via the relation f(x) = e u(x).

Algorithm 2: Aggregating Pseudo-Algorithm for Quasi-Sums (APA-QS) Data: Prediction game (Ω, Γ, Θ, λ, η) and prior distribution P0. Result: Predictions (γt). for every t do

(1) Update experts weights Pt(θ) := f(λ(ωt, ξt(θ)))Pt 1(θ) ; (2) Provide pseudo-prediction with P t 1(θ) := Pt 1(θ) R

Θ Pt 1(θ) dθ,

ψf t (ω) = g Z

Θ f(λ(ω, ξt(θ)))P t 1(θ) dθ ; (8)

Lemma 4.4 (Bound for APA-Loss). Let f : [0, ) R be a weighting profile and g its inverse. Then

AT (APA(P0)) := AT ψf 1 (ω1), ψf 2 (ω2), ..., ψf T (ωT ) = g Z

Θ f (AT (θ)) P0(θ) dθ .

Moreover, when |Θ| = n and P0 is the uniform probability distribution with weights 1/n,

AT (APA(P0)) A2 AT (θ ), g n 1 , (9)

for any expert θ Θ.(Proof F.3)

Published in Transactions on Machine Learning Research (09/2025)

The statement follows from yet another change of variables for Lemma 1 in (Vovk, 2001). However, for didactic reasons we include a full proof from scratch in the appendix (Lemma F.3). In particular, the proof reveals that the multiplicative structure of the weight updates directly translates into the multiplicative structure of AT . This gives rise to the additive structure by the exponential relation in (4).

Note that incorporating a learning rate η > 0 in the APA (as in (Vovk, 2001)) amounts to set f(x) = e ηx.

Corollary 4.5. Let f : [0, ) R be a weighting profile. Let η (0, ) be a learning rate and define gη(x) := (f(x)η) 1. When |Θ| = n and P0 is the uniform probability distribution with weights 1/n,

AT (APA(P0), η) = AT (APA(P0)) A2 AT (θ ), gη n 1 , (10)

for any expert θ Θ.(Proof F.4)

Finally, notice that by setting f(x) = e x in Corollary 4.5 we recover the bound found by Vovk (1990).

4.3 The Substitution Step

As a last step the Aggregating Algorithm derives an actual prediction from the pseudo-prediction. This is achieved by using a substitution function Σ. The substitution function maps a pseudo-prediction to an allowed prediction such that the loss of the allowed prediction is smaller than the value of the pseudoprediction evaluated in every possible outcome. More formally, a substitution function Σ makes the following inequality hold, eλ(ω, Σ(ψt)) ψt(ω) for all ω Ωand all pseudo-predictions ψt. The existence of such a substitution function is guaranteed by the mixability of the game G := (Ω, Γ, Θ, eλ, η) (see Section 2). Concretely, Vovk (2001) suggests the substitution function Σ(ψ) arg minγ Γ supω Ωeλ(ω, γ)/ψ(ω).6 Since eλ decomposes, we apply the change of variables as well to the definition of mixability. To this end, we introduce P(λ, f) as the set of all pseudo-predictions of the form,

ψf(ω) = g Z

Θ f(λ(ω, ξt(θ)))P t 1(θ) dθ = g Z

Γ f(λ(ω, γ))Q(γ) dγ , (11)

for some distribution Q on Γ.

Definition 4.6 ((f, η)-Mixability). Let (Ω, Γ, Θ, λ, η) be a prediction game. Let f : [0, ) [0, 1] be a weighting profile and consider pseudo-predictions ψf P(λ, fη) given by (11). We call (Ω, Γ, Θ, λ, η) (f, η)- mixable there exists a substitution function Σ such that

λ(ω, Σ(ψf)) ψf(ω), (12)

for all ω Ω. If the game is (f, η)-mixable for some η, we say the game is f-mixable.

Lemma 4.7 (f-Mixability is Mixability of Composite Loss). A loss λ is (f, η)-mixable if and only if eλ = u λ for u(x) := ln f(x) is η-mixable. (Proof F.5)

As noted by Erven et al. (2011) and Cabrera Pacheco and Williamson (2023), the definition of mixability is a comparison of the loss function λ and the learning rate η with the log-loss. Our (f, η)-mixability emphasizes the relativity to some function f. Note that if f(x) = e x = ( log x) 1, the definition directly reduces to the original definition.

Given the prediction game (Ω, Γ, Θ, λ, η) is (f, η) (respectively the prediction game (Ω, Γ, Θ, eλ, η) is ηmixable), the substitution steps completes the Aggregating Algorithm. We finally obtain Corollary 4.1 in a slightly different, but equivalent, formulation.

Corollary 4.8. Let f : [0, ) R be a weighting profile. Let η (0, ) be a learning rate. When (Ω, Γ, Θ, λ, η) is (f, η)-mixable, |Θ| = n and P0 is the uniform probability distribution with weights 1/n,

AT (learner) := AT (λ(ω1, Σ(ψf 1 )), ..., λ(ωT , Σ(ψf T ))) A2 AT (θ ), gη n 1 , (13)

for any expert θ Θ.

6When proper, mixable loss functions are considered in class probability estimation, Kamalaruban et al. (2015) argue that antipolar losses constitute another universal substitution function.

Published in Transactions on Machine Learning Research (09/2025)

Remark 4.9. The substitution function is a computational bottleneck which, in particular in highdimensional prediction tasks, might lead to suboptimal time performance. We haven t found literature to address this shortcoming.

Remark 4.10. Vovk (2015) argues that the log-loss is in a particular way fundamental. As Cabrera Pacheco and Williamson (2023) have shown, both Vovk s fundamentality and mixability (in the sufficiently differentiable case) are equivalent to a curvature comparison between a given loss and the log-loss. Definition 4.6 and the analysis of the AA with more general aggregation functions, emphasize that the choice of aggregation is tightly intertwined with the definition of mixability. Particularly, standard sum aggregation corresponds to standard mixability. Hence, the fundamentality of the log-loss is a particularity of the standard sum aggregation and does not imply that the log-loss is fundamental for other aggregations. However, Corollary 4.1 suggest a connection back to a type fundamentality of log-loss via u, which can potentially be described geometrically. We will not go deeper into this observation here.

The corollary requires the prediction game to be (f, η)-mixable. However, it turns out that we can generalize the bound on the aggregated loss of the learner to non-mixable losses.

4.3.1 The Mixability Constant for Non-Mixable Losses

To slacken the requirement of mixability (Vovk, 2001) introduced the mixability constant. In our symbols,

c(f) := inf{c R | ψf P(λ, f), γ Γ, ω, f(λ(ω, γ)) f(ψf(ω))c}, (14)

and set inf{ } := . Note that c(f) 1 (Lemma C.2). Our definition of the mixability constant is equivalent to Vovk (2001) s definition. This is easy to see, since f(λ(ω, γ)) f(ψf(ω))c eλ(ω, γ)) cψ(ω) (cf. Proof F.5). When the infimum in (14) is attained, a substitution function Σ (which also depends on η and λ) exists and satisfies

f(λ(ω, Σ(ψ))) f(ψ(ω))c(f), (15)

for all ω Ω.

Remark 4.11. Note that if we fix f and consider fη(x) = f(x)η, where η > 0 is the learning rate, then we can consider the constant c(fη) in (11) to depend only on η. When we do this we simply denote it by c(η).

With the mixability constant at hand one can provide a generalized bound on the aggregated loss of the learner to non-mixable losses (Theorem C.3), which is analogous to (Vovk, 1990, Eq. (15)). Even the optimality result for the AA (Vovk, 1995), i.e., the constants in this Theorem C.3 cannot be undercut by any other prediction algorithm, translates to our reparametrization and even slightly generalizes the former statement (Section C).

5 How aggregation changes prediction

In the previous sections we have argued that the Aggregating Algorithm is generally applicable for losses interacting nicely with reasonable aggregation functions. However, it is still unclear how the aggregation influences the actual predictions made by the Aggregating Algorithm.

We qualitatively approach this question in two ways. First, we propose to interpret the generator functions of aggregations as utility functions. Then, we illustrate in an experiment that aggregations can actually express the forecaster s attitude towards losses.

5.1 Aggregation and utility of losses

Aggregation functions for losses are, under mild conditions, quasi-sums Qu (cf. Lemma 3.2). On one hand, as we have shown, the u-quasi-sum of losses λ in the regret bound is tantamount to summing up distorted losses u λ (Corollary 4.1). On the other hand, u can be interpreted as the negative utility function for the losses λ of the predictors. It expresses the dis-satisfaction of the learner to incur certain losses. Therefore,

Published in Transactions on Machine Learning Research (09/2025)

whether we talk about a certain choice of negative utility of losses or whether we talk about u-quasi-sums as aggregation functional does not make a difference. For an illustrated, comparative example see Appendix D.

More generally, it is true that risk-avoider prefer convex u, i.e., high losses are up-valued, low losses are down-valued. In contrast, risk-taker consider concave u, which means that low losses are up-valued and high losses are down-valued (cf. (Winkler and Murphy, 1970), concavity and convexity are switched therein for reasons of sign flip). Hence, we can conclude: the type of aggregation captures the attitude of the forecaster towards losses.

5.2 Weather prediction via Aggregating Algorithm

The preceding discussions suggest that aggregation, weighting profile and utility are essentially different facets of the same object. We earlier asked, how does the type of aggregation change the behavior of the Aggregating Algorithm? The proposed interpretations as weighting profile and utility lead us to the following three hypothesis, which we substantiate by a real-world data prediction experiment.

H-(a) Convex additive generators express the aversion towards extreme losses.

H-(b) Concave additive generators express the risk-loving behavior in terms of accepting extreme losses, but seeking for the perfect predictor.

H-(c) Focal aggregation expresses aversion towards extreme losses.

We provide the log-loss (sometimes called cross-entropy loss) profile of several applications of the AA-QS for different aggregations on a sequential weather classification task. We use the Aggregating Algorithm to aggregate probabilistic predictions of 9 simple classification algorithms.The AA-QS was implemented using the log-loss. The task is detailed in Appendix E. Table 2 summarizes our findings for the data collection from the Zugspitze (Germany). See Appendix E.1 for further experiments. The loss histograms particularly reveal the difference between convex and concave utility function. Concave utility functions (i.e., L0.5 and Sum) lead to a high number of predictions with extremely small loss values, with the downside that some predictions incur a high loss. Convex utility functions (i.e., L2 and Sum) damp the tails of high losses, i.e., only few predictions with high losses are made. On the other hand, there are many predictions made incurring small but suboptimal loss. In this realm, L10 is the most extreme example in which many suboptimal predictions are made, but the tails are banned. The focal aggregation leads to behavior largely similar to the one induced by L2-aggregation. All three hypothesis about the change of prediction given a certain aggregation can be substantiated in this explorative study. An exhaustive experimental study would be required to provide more conclusive statements. Nevertheless, there is a conceptual problem in comparing different aggregation schemes on a given problem and asking which one is better . We are not proposing different algorithms for a given objective. We actually change the objective.

6 Conclusion

In this paper, we put forward a general, axiomatic approach to loss aggregation in online learning. Analogous to the development in offline learning, but differently motivated, we show that a set of reasonable aggregations in online learning is characterized by the set of quasi-sums. It turns out that the AA can be adapted to deal with those general aggregations. Not only can we transfer the nice theoretical properties of the AA to its modified variant, we also provide experimental evidence that the choice of general aggregation determines the extreme-loss seeking or extreme-loss averse behavior of the AA. Hence, we span up a new dimension of choice, for which we think that the modified AA is just a starting point. We believe that similar modifications can be provided for other online learning algorithms such as weighted majority algorithm (Littlestone and Warmuth, 1994), follow the regularized leader (Abernethy et al., 2009), or follow the perturbed leader (Kalai and Vempala, 2005).

Published in Transactions on Machine Learning Research (09/2025)

Table 2: Aggregations, and their corresponding utility function u. The loss histogram shows the number of predictions (y-axis) incurring the loss (x-axis) in a certain bin. The blue line depicts the average loss.

Aggregation u(x) = Loss Histogram (Weather - Zugspitze)

L0.5-norm x

L10-norm x10

Focal aggregation (1 exp( x))2x

Published in Transactions on Machine Learning Research (09/2025)

6.1 Broader impact and position statement

The generality of the learning under expert advice setting allows for the deployment of the adapted Aggregating Algorithm in a variety of settings, which do not exclude any malicious nor benevolent use. Note that the type of aggregation creates another choice parameter which, depending on the deployment context, might call for a participatory, democratic approach to the determination of the used quasi-sum. This contextualization requires further studies. We are convinced that our socialization has shaped our method and approach to research. We might have been ignorant to aspects of our work which against other socio-cultural backgrounds might miss or need reframing.

7 Acknowledgements

The authors appreciate and thank the International Max Planck Research School for Intelligent System (IMPRS-IS) for supporting the second author. Part of this project was conducted when the second author was visiting Aaron Roth at the University of Pennsylvania. The stay was supported by a fellowship of the German Academic Exchange Service (DAAD). This work was funded in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany s Excellence Strategy EXC number 2064/1 Project number 390727645; it was also supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center.

Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Conference on Learning Theory, number 110, 2009.

Jean Aczél. Sur les opérations définies pour nombres réels. Bulletin de la Société Mathématique de France, 76:59 64, 1948.

Jean Aczél. A solution of some problems of K. Borsuk and L. Jánossy. Acta physica, 4:351 362, 1955.

Joseph Aczél. A short course on functional equations: based upon recent applications to the social and behavioral sciences, volume 3. Springer Science & Business Media, 2012.

Armando J. Cabrera Pacheco and Robert C. Williamson. The geometry of mixability. Transactions on Machine Learning Research, 2023.

Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

Alexey Chernov and Fedor Zhdanov. Prediction with expert advice under discounted loss. In International Conference on Algorithmic Learning Theory, pages 255 269. Springer, 2010.

Robert Craigen and Zsolt Páles. The associativity equation revisited. aequationes mathematicae, 37:306 312, 1989.

Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405 1411. PMLR, 2015.

Freddy Delbaen. Coherent risk measures on general probability spaces. In Advances in finance and stochastics, pages 1 37. Springer, 2002.

Tim Erven, Mark D. Reid, and Robert C. Williamson. Mixability is bayes risk curvature relative to log loss. In Proceedings of the 24th Annual Conference on Learning Theory, pages 233 252. JMLR Workshop and Conference Proceedings, 2011.

Christian Fröhlich and Robert C. Williamson. Risk measures and upper probabilities: Coherence and stratification. Journal of Machine Learning Research, 25:1 99, 2024.

Richard J. Gardner and Markus Kiderlen. Operations between functions. Communications in Analysis and Geometry, 26(4):787 855, 2018.

Published in Transactions on Machine Learning Research (09/2025)

Michel Grabisch, Jean-Luc Marichal, Radko Mesiar, and Endre Pap. Aggregation functions, volume 127. Cambridge University Press, 2009.

Nika Haghtalab, Michael Jordan, and Eric Zhao. On-demand sampling: Learning optimally from multiple distributions. Advances in Neural Information Processing Systems, 35:406 419, 2022.

Elad Hazan and Comandur Seshadhri. Adaptive algorithms for online decision problems. In Electronic colloquium on computational complexity (ECCC), number 88 in 14, 2007.

Laura Iacovissi, Nan Lu, and Robert C. Williamson. A general framework for learning under corruption: Label noise, attribute noise, and beyond. ar Xiv preprint ar Xiv:2307.08643, 2023.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291 307, 2005.

Parameswaran Kamalaruban, Robert C. Williamson, and Xinhua Zhang. Exp-concavity of proper composite losses. In Conference on Learning Theory, pages 1035 1065. PMLR, 2015.

Andre ı Nikolaevich Kolmogorov and Guido Castelnuovo. Sur la notion de la moyenne. G. Bardi, tip. della R. Accad. dei Lincei, 1930.

Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. On tilted losses in machine learning: Theory and applications. Journal of Machine Learning Research, 24(142):1 79, 2023.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980 2988, 2017.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and computation, 108(2):212 261, 1994.

Pietro Muliere and Giovanni Parmigiani. Utility and means in the 1930s. Statistical Science, pages 421 432, 1993.

Mitio Nagumo. Über eine Klasse der Mittelwerte. In Japanese journal of mathematics: transactions and abstracts, volume 7, pages 71 79. The Mathematical Society of Japan, 1930.

Eric Neyman and Tim Roughgarden. From proper scoring rules to max-min optimal forecast aggregation. Operations Research, 2023.

Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. ar Xiv preprint ar Xiv:1908.05659, 2019.

John Von Neumann and Oskar Morgenstern. Theory of games and economic behavior. Princeton university press, 3rd edition, 1953.

Vladimir Vovk. Aggregating strategies. In Proceedings of 3rd Annu. Workshop on Comput. Learning Theory, pages 371 383, 1990.

Vladimir Vovk. A game of prediction with expert advice. In Proceedings of the 8th annual conference on Computational learning theory, pages 51 60, 1995.

Vladimir Vovk. The fundamental nature of the log loss function. Fields of logic and computation II: Essays dedicated To Yuri Gurevich on the Occasion of His 75th Birthday, pages 307 318, 2015.

Volodya Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213 248, 2001.

Robert C. Williamson and Aditya Menon. Fairness risk measures. In International conference on machine learning, pages 6786 6797. PMLR, 2019.

Robert C. Williamson, Elodie Vernet, and Mark D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17(222):1 52, 2016.

Published in Transactions on Machine Learning Research (09/2025)

Robert L. Winkler and Allan H. Murphy. Nonlinear utility and the probability score. Journal of Applied Meteorology and Climatology, 9(1):143 148, 1970.

Lijun Zhang, Shiyin Lu, and Tianbao Yang. Minimizing dynamic regret and adaptive regret simultaneously. In International Conference on Artificial Intelligence and Statistics, pages 309 319. PMLR, 2020.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning, pages 928 936, 2003.

A A simple introduction to Aggregating Algorithm

The Aggregating Algorithm is a relatively straightforward algorithm with strong theoretical guarantees. In order to motivate our theoretical development in this paper and to provide a low-threshold introduction to this algorithm, we first lead the reader through a sensibly simplified scenario of learning under expert advice.

Consider a binary class probability estimation task on the outcome space Ω:= {0, 1}. We repeatedly see 2 experts Θ := {θ1, θ2} predicting the probability that the next outcome will be 1, i.e., in every round t [T] := {1, . . . , T} each expert θ Θ provides a prediction ξt(θ) [0, 1]. After the predictions are given, the learner (who has seen the experts predictions) has to commit to a prediction γt [0, 1] as well. Then, nature reveals an outcome ωt {0, 1}. We measure the quality of the experts and learner s prediction by the log-loss, that is, λ(s, ωt) = Jωt = 0K ( ln(s)) + Jωt = 1K ( ln(1 s)), for s [0, 1] (s here refers to the prediction made either by the expert or the learner). Note that the set of possible predictions can be regarded as the probability simplex s 7 (s, 1 s) for s [0, 1], cf. top-left in Figure 1. Moreover, we can interpret the log-loss as an embedding of the simplex into R2, i.e., (s, 1 s) 7 ( ln(s), ln(1 s)), see top-right in Figure 1.

Now, the Aggregating Algorithm, as a learner, uses the embedding of the simplex into R2 and exponentiates it by the exponential mapping (λ0, λ1) 7 (e ηλ0, e ηλ1), where η > 0 is called the learning rate. Figure 1 right-top to bottom illustrates this step. In particular, the predictions of the experts can be traced through both mappings, the log-loss and the exponential mapping. The Aggregating Algorithm then forms a convex combination of the mapped experts predictions ψt. The weights on how much each experts prediction contributes to this mixture are based on a generalized Bayes updating (see Section A.1). The updating puts more weight on the experts which performed well in the past. As Figure 1 (orange-brown) illustrates, the obtained convex combination is not necessarily on the exponentiated embedding of the simplex anymore. We later call it a pseudo-prediction for this reason. Pseudo-predictions have nice theoretical properties (see Section 4.2), but they are not helpful as predictions, since they are not in the prediction space. That is why the Aggregating Algorithm requires a characteristical step: the substitution Σ of the pseudo-prediction by an actual prediction with similar theoretical properties. This substitution Σ is intuitively a projection of the pseudo-prediction ψ to the exponentiated embedded simplex (cf. Figure 1 darkgreen). Crucially, a property called mixability (see Definition 4.6) guarantees that the pseudo-prediction lies bottom-left to the exponentiated embedded simplex. The obtained actual predictions guarantee that the accumulated loss of the Aggregating Algorithm (learner), i.e., PT t=1 λ(γt, ωt) with γt = Σ(ψt) is always smaller than the accumulated loss of any expert, i.e., PT t=1 λ(ξt(θ), ωt) for all θ Θ, up to a constant C. In other words, the Aggregating Algorithm s regret is bounded above by a constant, independent of the number of played rounds:

t=1 λ(γt, ωt)

t=1 λ(ξt(θ), ωt) C, θ Θ.

A.1 Bayes updating for weights

A fundamental property of the AA is that, under some conditions, the updating scheme (Step (1) in AA (Algorithm 1)) is reduced to Bayesian updating (Vovk, 2001, Section 2.2). More precisely, make the following

Published in Transactions on Machine Learning Research (09/2025)

x1 := ξ(θ1)

x2 := ξ(θ2)

( ln s, ln 1 s)

(eη ln s, eη ln 1 s)

Figure 1: Graphical Summary of the Steps in the Aggregating Algorithm. Experts θ1 and θ2 provide predictions ξ(θ1) and ξ(θ2), respectively, which are placed in the simplex (top-left) as x1 := (ξ(θ1), 1 ξ(θ1)) and x2 := (ξ(θ2), 1 ξ(θ2)) via s 7 (s, 1 s). The log-loss embeds the simplex as a curve in R2 (top-right), i.e., s 7 ln s is applied coordinate-wise and maps x1 and x2 to x 1 and x 2. Then, the exponential mapping projects them into [0, 1]2. The Aggregating Algorithm forms a convex combination ψ of the projected predictions x 1 and x 2 based on weights updated by a Bayesian-type formula (orange-brown), called a pseudoprediction , which is substituted back to the simplex via a substitution function Σ (darkgreen).

Published in Transactions on Machine Learning Research (09/2025)

choices: (a) Ωis finite, (b) Γ = (Ω), the set of all probability measures on Ω, (c) loss function is the log-loss λln and (d) the learning rate η = 1.

For an expert θ whose prediction at time t is ξt(θ) (Ω), we write ξt(θ)(ωt) as the probability of seeing ωt forecasted by θ, who has observed all data points up to time step t 1. Let Pt(θ) denote the normalized density Pt(θ):

Pt(θ) = e ηλln(ωt,ξt(θ))Pt 1(θ) R e ηλln(ωt,ξt(θ))Pt 1(θ)dθ = ξt(θ)(ωt)Pt 1(θ) R ξt(θ)(ωt)Pt 1(θ)dθ.

One can interpret ξt(θ)(ωt) as the likelihood to observe ωt at time t given the expert θ. Analogously, Pt 1(θ) can be thought of as the prior on the experts after t 1 observations. Note that crucial in this derivation is the correspondence of the exponential projection e x (with η = 1) used in the definition of the pseudo-predictions and the log-loss λln.

B The Effect of u on the Mixability of Losses

The crucial property making the substitution step of the Aggregating Algorithm run, is mixability. A priori it is unclear how the composition of a loss function λ with a strictly increasing, continuous function u with u(0) = 0 effects the mixability. We shortly argue that all interesting cases are possible. For the sake of simplicity, we look a binary outcome sets and predictions p [0, 1]. Hence, we can write the loss function as λ(p) = (λ0(p), λ1(p)).

(1) non-mixable to mixable The absolute loss λabs(p) = (1 p, p) is not mixable for any learning rate η > 0. Applying the function u(x) = ln x gives λln(p) = ( ln 1 p, ln p) the log-loss which is mixable, e.g., for η = 0.

(2) mixable to non-mixable The Brier score λB(p) = ((1 p)2, p2) is mixable for η = 2. However, the square root of the Brier score recovers the absolute loss λabs, which not mixable for any η > 0.

(3) mixable to mixable The mixable (with η = 1) log loss λln(p) = ( ln 1 p, ln p) composed with u(x) = (1 ex)2 gives the Brier score λB(p) = ((1 p)2, p2) which is mixable for η = 2.

C AA-optimality for quasi-sums

In the majority of the paper we have assumed that the prediction game in consideration is (f, η)-mixable. In this case we obtain a direct bound for the aggregated loss of the predictions given via the substitution function and the APA-QS. As it happens with the usual AA, there is no guarantee that the given loss will be (f, η)-mixable. We deal with this situation in this section. With this at hand we also provide an optimality result for the Aggregating Algorithm under quasi-sum aggregation.

First, it will be useful to impose some mild conditions on the prediction game.

Definition C.1 (Regular Local Prediction Game (Vovk, 1995)). We call the tuple (Ω, Γ, λ) a local prediction game. A local prediction game is called regular if the following four assumptions hold

(a) Γ is a compact topological space.

(b) For each ω Ωthe function γ 7 λ(ω, γ) is continuous.

(c) There exists γ Γ such that, λ(ω, γ) < for all ω Ω.

(d) For all γ Γ there exists ω Ωsuch that λ(ω, γ) = 0.

In this section it will be always assumed that the (Ω, Γ, λ) is a local prediction game.

Published in Transactions on Machine Learning Research (09/2025)

C.1 Aggregation algorithm for non-mixable losses

Fix (Ω, Γ, Θ, λ) and a weighting profile f, as in Section 4.3. Note, we assume that we obtain a regular local prediction game when we drop the set of experts Θ. Recall that in this case, the pseudo-predictions belong to P(λ, f), that is, they are of the form

Γ f(λ(ω, γ)Q(γ) dγ ,

for some distribution Q on Γ.

First we show that c 1 as in (Vovk, 1990). The proof is basically the same and we include it for the reader s convenience. Lemma C.2. For given (Ω, Γ, θ, λ) and a weighting profile f, then c(f) 1.

Proof. Suppose that there is f such that c := c(f) < 1. Let γ Γ and Qγ be defined as Qγ (γ ) = 1 and 0 otherwise. Then, there exists γ Γ such that for all ω Ω,

f(λ(ω, γ)) f g Z

Γ f(λ(ω, γ)) Qγ dΓ c

f(λ(ω, γ ))c.

Since 0 < f(x) 1 for all x, we have

0 ln (f(λ(ω, γ))) c ( ln (f(λ(ω, γ )))) .

By assumption (c), we know there exists γ1 Γ such that λ(ω, γ1) < . By the argument above (with γ = γ1) we know there is γ2 Γ such that

0 ln (f(λ(ω, γ2))) c ( ln (f(λ(ω, γ1)))) .

Continuing this way, we obtain a sequence {γk} Γ such that

0 ln (f(λ(ω, γk+1))) c ( ln (f(λ(ω, γk)))) .

Using the compactness of Γ (assumption (a)), let γ Γ be a limit point of this sequence. By continuity (assumption (b)), ln (f(λ(ω, γ))) is the limit of a subsequence { ln (f(λ(ω, γk)))}.

Note that we have

0 ln (f(λ(ω, γk))) ck 1 ( ln (f(λ(ω, γ1)))) ,

thus when k we conclude that ln (f(λ(ω, γ)) = 0, that is λ(ω, γ) = 0, which contradicts (d).

We are now ready to obtain an analogous bound to (13). Theorem C.3. For given (Ω, Γ, θ, λ) and a weighting profile f. Let c := c(f). When |Θ| = n and P0 is the uniform probability distribution with weights 1/n, we have the bound

AT (learner) AT (APA-QS(P0)) g f(AT (θ ))c

for any expert θ Θ.

Moreover, if we consider a learning rate η > 0, fη(x) = f(x)η and cη := c(fη), we have

AT (learner) AT (APA(P0)) g(f(AT (θ ))cηf(gη(n 1))cη), (17)

where gη is the inverse of fη.

Published in Transactions on Machine Learning Research (09/2025)

Proof. Let ψ be a pseudo-prediction . Then, we have f(λ(ω, γ) f(ψ(ω))c.

AT (learner) = g(f(λ(ω1, γ1)...f(λ(ωT , γT )) g(f(ψ1(ω1))c...f(ψT (ωT ))c).

This motivate us to define a new aggregation function given by

Bc n(x1, ..., xn) := g (f(x1)c...f(xn)c) .

Using the fact that c 1 (Lemma C.2) and the notation in Lemma 4.4, we see that

f(Bc(θ))P0(θ) = f(λ(ω1, ξ1(θ))c....f(λ(ωT , ξT (θ))c P0(θ)

f(λ(ω1, ξ1(θ))....f(λ(ωT , ξT (θ))P0(θ)

Proceeding as the in the proof of Lemma 4.4, we obtain

Θ f (Bc T (θ)) P0(θ) dθ f(ψ1(ω1))...f(ψT (ωT )),

Θ f (Bc T (θ)) P0(θ) dθ g (f(ψ1(ω1))...f(ψT (ωT ))) = AT (APA(P0)). (18)

We are left to estimate the LHS of (18):

Θ f (Bc T (θ)) P0(θ) dθ = g Z

Θ f (AT (θ))c P0(θ) dθ

g f(AT (θ ))c

g f(AT (θ ))c

proving (16).

To obtain (17), let f = fη and cη := c(fη), then we have

Θ fη (Bη)cη T (θ) P0(θ) dθ = gη

Θ fη (Aη T (θ))cη P0(θ) dθ gη

fη(Aη T (θ ))cη

where (Bη)c n(x1, ..., xn) := gη(fη(x1)c, ..., fη(xn) = Bc n(x1, ..., xn).

The result follows since

fη(Aη T (θ ))cη

= (Bη)cη 2 (Aη T (θ ), gη(n 1))

= Bcη 2 (AT (θ ), gη(n 1))

=g(f(AT (θ ))cηf(gη(n 1))cη).

Published in Transactions on Machine Learning Research (09/2025)

Remark C.4. Recall that the weighting profile f in Theorem C.3 can be written in the form f(x) = e u(x)

for an appropriate u. In this case, for η > 0, we have

g(f(AT (θ ))cηf(gη(n 1))cη) = u 1 cηu(AT (θ )) + cηu(gη(n 1))

= u 1 cηu(AT (θ )) + cη

In particular, when u(x) = x, this gives

LT (APA(P0, η)) cηLT (θ ) + cη ln(n)

as in (Vovk, 1990).

C.2 AA-optimality

Surprisingly, it is possible to show that the Aggregating Algorithm is in a game-theoretic sense optimal (Vovk, 1995). In a general game between an adversarial environment, which gets to choose experts predictions and nature s outcome and a learner, the learner can only win if they achieve the bounds which are suggested by the Aggregating Algorithm, cf. Remark C.4. We formalize this statement and extend it to more general aggregation functions in the following. Definition C.5. Let A be a continuous, strictly increasing and associative aggregation function. Let (Ω, Γ, λ) be a regular local prediction game. We call the following full-information game G between environment E and learner L a global prediction game:

1. E chooses the size n of a set of experts Θ.

2. For every t [T],

(i) E chooses predictions ξt(θ) Γ for every θ Θ. (ii) L chooses a prediction γt Γ. (iii) E chooses an outcome ωt Ω.

(iv) At(θ) := A2(At 1(θ), λ(ωt, ξt(θ))) for all θ Θ.7

(v) At(learner) := A2(At 1(learner), λ(ωt, γt)).

Definition C.6. We say that the learner L wins the global prediction game G if for all t [T] and θ Θ there are constants c and a such that

At(learner) u 1(c u(At(θ)) + a ln(n)), (19)

otherwise, we say that nature wins. Note, that the aggregation function is a quasi-sum with generator u, i.e., A = Qu.

Optimality here is grounded in the global game specified above. Intuitively, the following theorem shows that in a worst-case scenario, concerning the choice of experts and outcomes, every learner under expert advice can at best achieve the regret bound parametrized by c and a of (19). Note that we don t put any restrictions on the abilities of the learner until this point. Strikingly, the Aggregating Algorithm can achieve this regret bound, hence is optimal. Theorem C.7 (Optimality of Constant Regret Bound for All Predictors). Let A be a continuous, strictly increasing and associative aggregation function. Consider the global prediction game following Definition C.5. There exists a learner L which against an arbitrary adversarial environment wins, i.e., for all T N and all θ Θ,

AT (L) := AT (λ(ω1, γ1), . . . , λ(ωT , γT )) u 1(c u(AT (θ )) + a ln(n)),

if and only if c c(η) and a c(η)

η for some η [0, ) with c(η) as defined in (14), and u is the generator of the aggregation, i.e., A = Qu.

7we set A0(θ) = 0 for all θ Θ.

Published in Transactions on Machine Learning Research (09/2025)

Proof. First, we note that the aggregation A fulfills (A1)-(A3). Hence, A = Qu for some generator u: [0, ) [0, ), continuous and strictly increasing with u(0) = 0 (Lemma 3.2).

Let us define the surrogate loss eλ := u λ. It is straightforward to check that (Ω, Γ, eλ) fulfills all conditions for a regular local prediction game. The first condition holds by assumption. The second condition is clear, since the composition of continuous functions is continuous. Thirdly, there exists γ Γ such that, λ(ω, γ) < for all ω Ω. It follows eλ(ω, γ) = u(λ(ω, γ)) < for all ω Ω. Finally, for all γ Γ there exists ω Ω such that λ(ω, γ) = 0, hence for all γ Γ there exists ω Ωsuch that eλ(ω, γ) = u(λ(ω, γ)) = 0, because u(0) = 0 and u strictly increasing. Concluding, (Ω, Γ, eλ) is a regular local prediction game.

Theorem 1 in (Vovk, 1995) states that in the specified game the learner L is guaranteed to achieve the regret bound, for all T N and all θ Θ,

t=1 eλ(ωt, γt)) c

t=1 eλ(ωt, ξt(θ))) + a ln (|Θ|) , (20)

if and only if c ec(η) and a ec(η)

η for some η [0, ), where

ec(η) := inf{c R | ψ P( λ, e ηx), γ Γ, ω, e η λ(ω,γ) e ηψ(ω)c},

and set inf{ } := , cf. (14). Note, that

ec(η) = inf{c R | ψ P(λ, e ηu(x)), γ Γ, ω, e ηu(λ(ω,γ)) e ηu(ψ(ω))c}

= inf{c R | ψ P(λ, f η), γ Γ, ω, f(λ(ω, γ))η (f(ψ(ω))η)c},

for f = e u, hence ec(η) coincides with c(η) defined in (14).

We give equivalent forms of (20). First, since u 1(x) is increasing, (20) is equivalent to

t=1 λ(ωt, γt))

t=1 λ(ωt, ξt(θ))) + a ln (|Θ|)

Furthermore, λ = u λ and f(x) = e u(x), so (20) is equivalent to

AT (λ(ω1, γ1), . . . , λ(ωT , γT )) u 1(c u(AT (θ )) + a ln(n)).

Remark C.8. The optimality of the Aggregating Algorithm under quasi-sum aggregation only refers to this specific definition of global prediction game. Note, if c > 1 the tightness of the regret-like bound depends on the performance of the experts. Standard O(

T)-regret algorithms in learning under expert advice, e.g., Exponential Weighting Algorithm, can potentially perform better, in terms of less loss, than the Aggregating Algorithm, even though the Aggregating Algorithm is optimal in the sense specified above (Cesa-Bianchi and Lugosi, 2006, p. 14).

D An illustrated, comparative example for different aggregations

Let us consider a comparative example: the simple negative utility function u(x) = x corresponds to the standard sum. The negative utility function u(x) = x2 generates the Euclidean norm aggregation. Compared to summation, in the Euclidean norm large loss values contribute relatively more to the result than small loss values. The analogous statement is true for the utility functions. As a negative utility function u(x) = x2, large loss values hurt, since higher negative utility (cf. orange-brown arrow in Figure 2), relatively more than small loss values (cf. darkgreen arrow in Figure 2).

Published in Transactions on Machine Learning Research (09/2025)

u(λ) u(x) = x2

Figure 2: Comparative Example of Linear and Squared Utility. The horizontal axis denotes the loss value. The vertical axis the negative utility of the loss. We compare the negative utility function u(x) = x to u(x) = x2. In particular, for two values highlighted by a darkgreen arrow, low value, and an orange-brown arrow, high value.

Table 3: Weather data collections from the DWD (German Weather Agency).

Place File Zugspitze tageswerte_KL_05792_19000801_20221231_hist.zip Potsdam tageswerte_KL_03987_18930101_20231231_hist.zip Putbus tageswerte_KL_04024_18530701_20231231_hist.zip Weissenburg-Emetzheim tageswerte_KL_05440_18790101_20231231_hist.zip

Place Days with Missing Values Final Number of Days (Train/Test) Zugspitze 15 8386 (6708/1678) Potsdam 7 8759 (7007/1752) Putbus 431 8323 (6658/1665) Weissenburg-Emetzheim 22 8744 (6995/1749)

E Sequential weather classification task

We run a tabular classification task on weather data collection. The data is curated and provided by the DWD (German Weather Agency). The data collections are publicly available at https://opendata.dwd.de/ climate_environment/CDC/observations_germany/climate/daily/kl/historical/. For the file names see Table 3. Note that the files are constantly updated, hence the file names potentially change. The file names are of the format tageswerte_KL_[location identifier]_[start date]]_[end date]_hist.zip. Each collection provides daily measurements of weather-related attributes for one place. In every collection we deleted days with missing values. For statistic of the data collections see Table 3.

Based on daily average air pressure, average temperature, average relative humidity, maximum temperature, minimum temperature and date we train 9 classifiers to distinguish between four classes of weather: cloudy, rainy/snowy, sunny and unsettled, which are defined according to Table 4. We use the first (in chronological order) 80% of the data points as training set for the following classifiers provided by the scikit learn package: logistic regression (LR), gaussian naive bayes (NB), support vector machine (SVC), linear model with stochastic gradient descent (SGD), decision tree (DT), k-nearest neighbors (KNN), random foreast (RF), bagging on decision trees (BAGGING), gradient boosting on decision trees (GB).

Published in Transactions on Machine Learning Research (09/2025)

Table 4: Definition of weather classes.

Precipitation 2mm Precipitation > 2mm Sunshine Hours > 4h sunny unsettled Sunshine Hours 4h cloudy rainy/snowy

Then, we run the classifiers on the remaining 20% of the data. We apply the Aggregating Algorithm for quasi-sums with the log-loss.8 We use different aggregations in the Aggregating Algorithm as specified in Table 1. We observed that the learning rate did not have a big impact on the loss histograms in our experiment. So slightly arbitrarily, we chose η = 1 for sum and focal aggregation, we chose η = 2 for L0.5, η = 0.001 for L10 and η = 0.5 for L2. We remark that the chosen learning rates (or the aggregations) don t necessarily guarantee that (f, η)-mixability with respect to the aggregation holds for the log-loss function. However, Theorem C.3 still applies for all aggregations.

The code was run on Mac Book Pro with Intel Core i9. However, the experiments only required a fraction of the compute power. We used Python 3.10.2, scikit learn 1.4.0, pandas 1.4.1, numpy 1.22.2, matplotlib 3.5.1.

For the loss histograms we used automatic binning on the interval 0.0 to 3.4. Higher values, which turned out to be np.inf, were cut off. For this reason, we introduced the -bars in the plots where necessary. See Section E.1 for more experiments.

8Note that the classifiers have not necessarily been trained using the log-loss function. This mismatch in classification tasks does not diminish the performance of the Aggregating Algorithm. However, it does allow for further optimization of the entire classification pipeline.

Published in Transactions on Machine Learning Research (09/2025)

E.1 More Experiments

The following tables provide more examples of the same experiment as described in Section 5.2. Note that in some of the experiments we observed -losses due to numerical instabilities close to 0.

Table 5: Aggregations, their corresponding utility function and loss histograms for the AA-QS on the weather data collection from Potsdam. The blue line depicts the average loss excluding -losses.

Aggregation u(x) = Loss Histogram (Weather Potsdam)

L0.5-norm x

L10-norm x10

Focal aggregation (1 exp( x))2x

Published in Transactions on Machine Learning Research (09/2025)

Table 6: Aggregations, their corresponding utility function and loss histograms for the AA-QS on the weather data collection from Putbus. The blue line depicts the average loss excluding -losses.

Aggregation u(x) = Loss Histogram (Weather Putbus)

L0.5-norm x

L10-norm x10

Focal aggregation (1 exp( x))2x

Published in Transactions on Machine Learning Research (09/2025)

Table 7: Aggregations, their corresponding utility function and loss histogram for the AA-QS on the weather data collection from Weissenburg-Emetzheim. The blue line depicts the average loss excluding -losses.

Aggregation u(x) = Loss Histogram (Weather Weissenburg-Emetzheim)

L0.5-norm x

L10-norm x10

Focal aggregation (1 exp( x))2x

Lemma F.1 (Proof of Lemma 3.2). Let A: S n N[0, )n [0, ) be an aggregation function. Suppose that A is continuous, strictly increasing, associative and loss compatible, i.e., it satisfies (A1) - (A4). Then,

Published in Transactions on Machine Learning Research (09/2025)

there exists a continuous, strictly increasing function u: [0, ) [0, ), with u(0) = 0, such that

An(x1, . . . , xn) = u 1 n X

If furthermore A is positively homogeneous, i.e., for every c [0, ) and n N we have An(cx1, . . . , cxn) = c An(x1, . . . , xi, . . . , xn), then u(x) = xk for some k (0, ) in (3).

Proof. Associativity (A3) guarantees that we can write

An(x1, . . . , xn) = A2(x1, A2(x2, A2(. . .))),

for all n 2 (cf. (Grabisch et al., 2009, p. 33))

In particular,

A3(x1, x2, x3) = A2(x1, A2(x2, x3)) = A2(A2(x1, x2), x3).

Hence, A2 : [0, ) [0, ) [0, ) is monotone, i.e., strictly increasing, continuous and associative in the sense of Aczél (1948) (for an English translation see (Aczél, 2012), for a different proof (Craigen and Páles, 1989)), where it is shown that there exists u: [0, ) [0, ) strictly increasing and continuous such that

A2(x1, x2) = u 1 (u(x1) + u(x2)) .

Since 0 = A2(0, 0) = u 1(u(0) + u(0)), it follows that u(0) = 0.

Finally, we obtain by induction (and associativity)

An(x1, . . . , xn) = A2(An 1(x1, ..., xn 1), xn)

For the second statement, we go back to A2, which is now not only strictly increasing, continuous, associative and loss compatible, but as well positive homogeneous, i.e.,

A2(cx1, cx2) = c A2(x1, x2),

for all c (0, ). Hence, Theorem 2 in (Aczél, 1955) (cf. (Gardner and Kiderlen, 2018, p. 797)) applies. This implies

A2(x1, x2) = xk 1 + xk 2 1

for some k (0, ). By induction as above, we have

An(x1, . . . , xn) =

Lemma F.2 (Proof of Lemma 3.4). Let Qu be a quasi-sum. Then, there exists a continuous, strictly decreasing function f : [0, ) (0, 1] with f(0) = 1, such that

Qn(x1, ..., xn) = g (f(x1)...f(xn)) , (22)

where g: (0, 1] [0, ) is the inverse of f.

Published in Transactions on Machine Learning Research (09/2025)

Proof. Let f(x) := e u(x). It is straightforward to check that f is strictly decreasing and that f(0) = 1. Let g(x) = u 1( ln(x)) be its inverse. Then, we can write

Qn(x1, ..., xn) = u 1 n X

i=1 u(xi)) = g (f(x1)...f(xn)) .

Lemma F.3 (Proof of Lemma 4.4). Let f : [0, ) R be a weighting profile. Then

AT (APA(P0)) := AT (ψ1(ω1), ψ2(ω2), ..., ψT (ωT )) = g Z

Θ f (AT (θ)) P0(θ) dθ .

Moreover, when |Θ| = n and P0 is the uniform probability distribution with weights 1/n,

AT (APA(P0)) A2 AT (θ ), g n 1 , (23)

for any expert θ Θ.

Proof. We will follow the general idea of the proof of Lemma 1 in (Vovk, 2001). Recall that using the updating rule (2), we have

PT (θ) = f(λ(ωT , ξT (θ)))...f(λ(ω1, ξ1(θ)))P0(θ).

It follows that:

f(AT (θ))P0(θ) = f (λ(ωT , ξT (θ))) f (λ(ωT 1, ξT 1(θ))) . . . f (λ(ω1, ξ1(θ))) P0(θ)

= f (λ(ωT , ξT (θ))) PT 1(θ)

Θ PT 1(θ) dθ R

Θ PT 1(θ) dθ

Θ PT 1(θ) dθ f (λ(ωT , ξT (θ))) P T 1(θ)

Θ PT 1(θ) dθ f(ψT (ωT ))f (λ(ωT , ξT (θ))) f(ψT (ωT )) 1P T 1(θ)

Θ PT 1(θ) dθ f(ψT (ωT ))p T (θ; ωT ).

Integrating with respect to θ we obtain Z

Θ f (AT (θ)) P0(θ) dθ = Z

Θ PT 1(θ) dθ f(ψT (ωT )) (24)

We now analyze R

Θ PT 1(θ) dθ. Using similar arguments as above, we have

PT 1(θ) = f(λ(ωT 1, ξT 1(θ)))f(λ(ωT 2, ξT 2(θ)))...f(λ(ω1, ξ1(θ)))P0(θ)

= f(λ(ωT 1, ξT 1(θ)))PT 2(θ)

Θ PT 2(θ) dθ R

Θ PT 2(θ) dθ

Θ PT 2(θ) dθ f(ψT 1(ωT 1))f(λ(ωT 1, ξT 1(θ)))f(ψT 1(ωT 1)) 1P T 2(θ)

Θ PT 2(θ) dθ f(ψT 1(ωT 1))p T 1(θ; ωT 1).

Integrating over Θ gives Z

Θ PT 1(θ) dθ = Z

Θ PT 2(θ) dθ f(ψT 1(ωT 1)).

Published in Transactions on Machine Learning Research (09/2025)

Continuing this process we arrive to

Θ f (AT (θ)) P0(θ) dθ = Z

Θ P0(θ) dθ f(ψ1(ω1))...f(ψT (ωT )) (25)

= f(ψ1(ω1))...f(ψT (ωT )) (26)

Applying g to both sides of (25), we obtain

Θ f (AT (θ)) P0(θ) dθ = g (f(ψ1(ω1)) . . . f(ψT (ωT ))) = AT (APA(P0)),

as desired.

If |Θ| = n and P0 is the uniform probability distribution with weights 1/n (cf. Vovk (1990)), we have for any fixed θ Θ,

Θ f (AT (θ)) P0(θ) dθ = g

g f (AT (θ ))

= g f (AT (θ )) f g n 1

= A2 AT (θ ), g n 1 .

Corollary F.4 (Proof of Corollary 4.5). Let f : [0, ) R be a weighting profile. Let η (0, ) be a learning rate. When |Θ| = n and P0 is the uniform probability distribution with weights 1/n,

AT (APA(P0), η) = AT (APA(P0)) A2 AT (θ ), gη n 1 , (27)

for any expert θ Θ.

Proof. Consider fη(x) = f(x)η to define Aη (see (7)), and notice that

Aη(x1, ...xn) = gη(fη(x1)...fη(xn)) = g(f(x1)...f(xn)) = A(x1, ..., xn). (28)

Thus, applying Lemma 4.4 with Aη we have

AT (APA(P0), η) := Aη T (APA(P0)) = gη

Θ fη (Aη T (θ)) P0(θ) dθ .

Further assuming that |Θ| = n and P0 is the uniform probability distribution with weights 1/n, we have (by the proof of Lemma 4.4)

Θ fη (Aη T (θ)) P0(θ) dθ Aη 2 Aη T (θ ), gη n 1 ,

for any θ Θ.

Using (28) again, we conclude that

AT (APA(P0), η) A2 AT (θ ), gη n 1 .

Published in Transactions on Machine Learning Research (09/2025)

Lemma F.5 (Proof of Lemma 4.7). A loss λ is (f, η)-mixable if and only if eλ = u λ for u(x) := ln f(x) is η-mixable.

Proof. If λ is (f, η)-mixable, there exists a substitution function Σ such that for all ψf P(λ, fη),

λ(ω, Σ(ψf)) ψf(ω), ω Ω

λ(ω, Σ(ψf)) gη

Γ fη(λ(ω, γ))Q(γ) dγ , ω Ω

λ(ω, Σ(ψf)) u 1 ln

Γ e ηu(λ(ω,γ))Q(γ) dγ , ω Ω

u(λ(ω, Σ (ψ))) lne η Z

Γ e ηu(λ(ω,γ))Q(γ) dγ , ω Ω,

where ψ := u ψf with ψ P(eλ, e ηx) and Σ is the mapping such that ψ 7 ψf 7 Σ(ψ). Hence, Σ is a substitution function for all ψ P(eλ, fη(x) = e ηx), which is the standard η-mixability.