# memoryless_sequences_for_general_losses__273c2fe5.pdf Journal of Machine Learning Research 21 (2020) 1-28 Submitted 1/18; Revised 3/20; Published 6/20 Memoryless Sequences for General Losses Rafael Frongillo raf@colorado.edu CU Boulder Andrew Nobel nobel@email.unc.edu UNC Chapel Hill Editor: Manfred Warmuth One way to define the randomness of a fixed individual sequence is to ask how hard it is to predict relative to a given loss function. A sequence is memoryless if, with respect to average loss, no continuous function can predict the next entry of the sequence from a finite window of previous entries better than a constant prediction. For squared loss, memoryless sequences are known to have stochastic attributes analogous to those of truly random sequences. In this paper, we address the question of how changing the loss function changes the set of memoryless sequences, and in particular, the stochastic attributes they possess. For convex differentiable losses we establish that the statistic or property elicited by the loss determines the identity and stochastic attributes of the corresponding memoryless sequences. We generalize these results to convex non-differentiable losses, under additional assumptions, and to non-convex Bregman divergences. In particular, our results show that any Bregman divergence has the same set of memoryless sequences as squared loss. We apply our results to price calibration in prediction markets. Keywords: algorithmic randomness, property elicitation, prediction markets. 1. Introduction Since the beginnings of probability theory there has been interest in understanding fixed, complex objects through the lens of randomness. One manifestation of this interest is the problem of assessing and defining the randomness of an individual numerical sequence, a representative question being whether, and in what sense, the digits of π are random. One influential approach to this problem is algorithmic randomness, introduced by Martin Löf (1966), under which a sequence is said to be random if it passes every statistical test performed by some class of algorithms (typically Turing machines). Perhaps the strongest mathematical model of randomness is that of an independent, identically distributed (i.i.d.) sequence of random variables, a paradigmatic example being independent tosses of a fair coin. Under appropriate moment conditions, i.i.d sequences obey the basic limit laws of probability, including the strong law of large numbers, the central limit theorem, and the law of the iterated logarithm. Sequences of i.i.d. random variables also have the property of being unpredictable: beyond any information they provide about their marginal distribution, the first n elements in the sequence do not convey information about the exact values of the elements that follow. The unpredictability of an i.i.d. sequence follows from its randomness. In this paper we study the randomness of an individual sequence by inverting this relationship. Following c 2020 Rafael Frongillo and Andrew Nobel. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-042.html. Frongillo and Nobel earlier work of Nobel (2004), we regard an individual sequence as being random if, in the limit, the average loss of the best constant predictor is no greater than the average loss of any rule that predicts the next entry from a continuous function of a fixed, finite number of previous entries, and call such sequences memoryless. (A precise definition of memoryless sequences in the general setting of this paper is given in Section 2.2.) Consideration of finitememory predictors is a natural desideratum, for example, in pseudorandomness, where one may only be concerned about the ability of a bystander with limited computational resources to guess the next bit in sequence, and not concerned with whether an algorithm having access to the entire sequence could distinguish it from a random one. We note that the definition of memoryless sequences naturally applies to sequences of real numbers, in contrast to Turing machines which require a careful theory of computation over the reals. Nobel (2004) studied real-valued memoryless sequences under the squared loss. He established that memoryless sequences exhibit a number of stochastic properties, including a law of large numbers and a version of the central limit theorem. These and other results follow from the fact that the weak limits of the empirical distributions of a memoryless sequence are stationary martingale difference sequences. The central role of the squared loss in this previous work leads to a number of interesting questions. How do the properties of memoryless sequences depend on the loss? For which other loss functions are memoryless sequences related to martingale difference sequences? Does some analog of the martingale difference property hold for memoryless sequences under general losses? This paper provides answers to these questions for convex loss functions, and establishes close connections between property elicitation and memoryless sequences for general losses. We show that, in a manner reminiscent of Blackwell approachability, the one-shot statistical attributes of the loss function alone determine which sequences are memoryless with respect to that loss. We establish that the property elicited by the loss function uniquely determines which sequences are memoryless, and under mild assumptions, the weak limits of these sequences are such that the conditional value of the property is constant. When the loss is a Bregman divergence, it elicits the mean, and the weak limits of memoryless sequences reduce to martingale differences. In particular, we establish that the family of memoryless sequences investigated by Nobel (2004) remains unchanged when the squared loss is replaced by any Bregman divergence. Our results rely on a key lemma concerning the relationship between the optimal sequential prediction and an asymptotic form of orthogonality. We establish the orthogonality lemma for convex differentiable losses and, with further assumptions, extend it to nondifferentiable losses, allowing us to cover many loss functions used in practice, such as those eliciting the median (absolute loss) and quantiles (pinball loss). We conclude the paper with applications to prediction markets, concerning the calibration of forecasts, and future work. 1.1. Related Work The literature on the randomness of individual sequences began with Von Mises (1919), Kolmogorov (1965), and Martin-Löf (1966). The survey of Uspenskii et al. (1990) gives an account of this early work. V yugin (1998) established an ergodic theorem for Martin Löf random (typical) individual sequences under certain computability conditions. A good Memoryless Sequences for General Losses overview of algorithmic randomness, including recent advances, can be found in the book of Downey and Hirschfeldt (2010). Our work is related to the literature on no-regret online learning algorithms (Foster and Vohra, 1999; Cesa-Bianchi et al., 1999; Cesa-Bianchi and Lugosi, 2006) and game-theoretic probability (Shafer and Vovk, 2005), in that the definition of memorylessness employs a infinite-horizon regret quantity. For example, Haussler, Kivinen, and Warmuth (1998) study binary individual sequence prediction under potentially nonconvex loss functions, showing that losses exhibit one of two possible finite-time regret bounds. An important distinction is that our learning algorithm will be restricted to employ a fixed map from the last k elements of the sequence to predict the next element. Dawid and Vovk (1999; 2001) studied a game-theoretic setting in which a learner attempts to predict a sequence of outcomes that may adapt to her predictions. The conclusion is that the learner can achieve low loss (squared or logarithmic), or the outcome sequence is random in the sense that the martingale law of large numbers and the law of the iterated logarithm hold. A important feature of this work is that the outcome sequence in online learning and game-theoretic probability is allowed to adapt to the choices of the learner. As a consequence, the resulting notion of unpredictability is much stronger than the notion of memorylessness considered here, and does not apply to fixed individual sequences. See Section 7 for further discussion. In other work, Vovk (1988) established a law of the iterated logarithm for Kolmogorov random binary sequences, namely binary sequences x1, x2, . . . such that the Kolmogorov complexity of the prefix x1, . . . , xn is of order n. In general, memoryless sequences need not be Kolmogorov random; many have constant Kolmogorov complexity. For details, and additional references, we refer the reader to Nobel (2004). Finally, the literature on property elicitation extends that of proper scoring rules (Brier, 1950; Good, 1952; Mc Carthy, 1956; Savage, 1971; Gneiting and Raftery, 2007) and proper losses (Reid and Williamson, 2010; Vernet et al., 2016), which concerns the design of loss functions L(p, y) such that when the outcome y is drawn from some distribution q, the optimal prediction is p = q. Property elicitation is the partial-information analog, where one only wishes to score predictions about some property or statistic of the distribution. The modern literature begins with Osband (1985) and Lambert et al. (2008) and continues in computer science, economics, and statistics (Lambert and Shoham, 2009; Lambert, 2018; Gneiting, 2011; Abernethy and Frongillo, 2012; Steinwart et al., 2014; Agarwal and Agarwal, 2015; Frongillo and Kash, 2015). 2. Memoryless Sequences and Elicitation This section is devoted to the definition and discussion of memoryless sequences and an overview of elicitation. We begin with the basic components and assumptions of the predictive setting, and then turn our attention to memoryless sequences. We establish close connections between memoryless sequences and elicitation in Section 4.2 2.1. Basic Assumptions for Predictions, Outcomes, and Loss The predictive setting of this paper has three primary components: a fixed prediction space X Rd, a fixed outcome space Y Rd, and a loss function ℓ: X Y R. In particular, Frongillo and Nobel ℓ(x, y) is the loss incurred when the element x X is predicted and the outcome y Y occurs. The following assumptions will be made throughout the paper: A1. the prediction space X Rd is open and convex; A2. the outcome space Y Rd is closed; A3. the loss ℓ( , y) is convex for each fixed y Y; A4. the loss function ℓ(x, y) is jointly continuous in (x, y); A5. the derivative ℓ(x, y) of the loss ℓ(x, y) with respect to its first argument exists for all (x, y) X Y and is jointly continuous. The continuity assumption A4 can be expressed equivalently as follows: If (x1, y1), (x2, y2), . . . X Y converge to (x, y) X Y then ℓ(xn, yn) converges to ℓ(x, y). Our assumption that X is open and Y is closed ensures that these sets, and their cartesian product X Y are Borel-measurable. Measurability of the loss ℓthen follows by standard arguments from the fact that it is continuous. Similar remarks apply to condition (A5) concerning the gradient of the loss. In what follows we will consider infinite sequences of predictions x = (x1, x2, . . .) and outcomes y = (y1, y2, . . .) taking values, respectively, in the sequence spaces X N and YN. To reduce notation, we will usually omit the parentheses when specifying infinite or finite sequences. 2.2. Memoryless Sequences Let X, Y Rd, and ℓ: X Y R satisfy A1, A2, and A4 above, and let y = y1, y2, . . . Y be a fixed individual sequence. Consider the problem of sequentially predicting the entries yi of y by elements xi X: each xi is obtained by applying a prediction rule to the previous entries y1, . . . , yi 1 of y, and the performance of these predictions at time n is measured by the average loss n 1 Pn i=1 ℓ(xi, yi). A set of prediction rules, one for each time index i 1, is called a prediction scheme. Let us say that the sequence y is unpredictable with respect to a family of prediction schemes if no scheme in the family can outperform a simple reference scheme that always predicts the same (constant) value. Then, informally, a sequence y is memoryless if it is unpredictable with respect to the family of finite order continuous Markov prediction schemes. In more detail, for each k 1 let Ck = Cb(Yk : X) be the family of bounded continuous functions g : Yk X, and for 1 i j let yj i = yi, yi+1, . . . , yj. Each function g Ck represents a continuous kth order Markov prediction scheme for y that predicts the outcome yi by xi = g(yi 1 i k), which is the value of g applied to the k previous values in the sequence. The family of continuous finite order Markov prediction schemes corresponds to the union k 1Ck of the families Ck. Definition 1 A sequence y YN is ℓ-memoryless if there exists a predictor c X such that for every k 1 and every function g Ck i=1 ℓ(g(yi 1 i k), yi) 1 i=1 ℓ(c, yi) Memoryless Sequences for General Losses Here we adopt the convention that g(yi 1 i k) is equal to a fixed predictor x0 X for i k. When (1) holds we will say that y is ℓ-memoryless for c. Note that neither average in (1) is assumed to converge. Thus a sequence is memoryless if no bounded continuous function of finitely many arguments outperforms the lazy prediction scheme that ignores the past and always predicts the next value of the sequence by c. We note that the finite sample performance of a prediction scheme is measured by its average loss, and that schemes are compared via the limiting difference of their finite sample performances. Markov prediction schemes have been widely studied in information theory, see for example the survey of Merhav and Feder (1998). In terms of standard no-regret online learning guarantees, equation (1) says that all Markov predictors have nonnegative regret with respect to the best constant predictor in hindsight. Memoryless real-valued sequences under the squared loss ℓ(x, y) = (x y)2 were previously defined and studied by Nobel (2004). The main focus of that work was establishing that memoryless sequences exhibit asymptotic behavior analogous to that of i.i.d. sequences, including versions of the law of large numbers, the central limit theorem, and Hoeffding s inequality. Our primary focus here is on general convex loss functions ℓ, and the close connection between memoryless sequences and properties elicited by ℓ(elicitation is defined and discussed in the next section). It follows from the results here that if ℓis a Bregman divergence, then ℓ-memoryless sequences have a stochastic nature analogous to that under the squared loss. Details and discussion can be found in Section 4.3. We note that the memoryless property is defined asymptoticly. A finite order prediction scheme g may perfectly predict the initial elements of a sequence y, and yet the sequence could still be memoryless. In particular, a memoryless sequence y that is padded with a large but finite number of initial 0s (or some fixed element of Y) is still memoryless. In this sense, memorylessness is a somewhat weak notion of randomness, in a manner analogous to online learning: just as online learning algorithms can produce arbitrary outputs for an initial block of time and still achieve no regret, here a learner can perform well for a finite amount of time and still fail to predict the sequence y in an asymptotic sense. 2.3. Property Elicitation One of the key conclusions of this work is Theorem 9, which establishes a close connection between ℓ-memoryless sequences and properties elicited by the loss function ℓ. This terminology is defined below. Definition 2 Let spaces X and Y satisfy A1-A2, let loss function ℓ(x, y) satisfy A4, and let Q be the set of all compactly supported probability measures on Y. A property is a set-valued function Γ : Q 2X that associates each probability measure Q Q with a subset Γ(Q) of the prediction space X. The loss function ℓelicits property Γ if Γ(Q) = argmin x X EQℓ(x, Y ) for all Q Q, where EQℓ(x, Y ) = R Y ℓ(x, y) d Q(y). A property is elicitable if it is elicited by some loss. Frongillo and Nobel It is well known, and easy to check, that the squared loss ℓ(x, y) = (x y)2 elicits the mean, and that the absolute loss ℓ(x, y) = |x y| elicits the median. While the squared loss is differentiable, the absolute loss is not, nor is any loss function that elicits the median (Gneiting, 2011). This fact will play a role in our results (see 3). Squared loss is a special case of a broader family of loss functions, called Bregman divergences, that measure the error of a linear approximation to a convex function. Definition 3 Let D Rd be convex, X D be convex and open, and Y D be closed. Given a strictly convex function G : D R which is differentiable on X, its associated Bregman divergence is the loss function ℓ: X Y R given by ℓ(x, y) = G(y) G(x) G(x), y x . (2) While much more general than squared loss, it is easy to see that Bregman divergences also elicit the mean. In fact, they are the only such losses, even when removing the differentiability assumption (Frongillo and Kash, 2015). Theorem 4 (Savage (1971)) If ℓis the Bregman divergence associated with G and EQ|ℓ(x, Y )| is finite for each x X and Q Q, then ℓelicits the mean Γ : Q 7 {EQ Y }. Proof Letting x = EQ Y , note that EQ ℓ(x , Y ) = EQ G(Y ) G(x ). Expanding and simplifying, we find that EQ ℓ(x , Y ) EQ ℓ(x, Y ) = G(x ) G(x) G(x), x x , which is nonnegative by the subgradient inequality. Uniqueness of the argmin follows from strict convexity of G. We briefly mention two other examples of properties elicited by convex losses: quantiles, ratios of expectations, and expectiles. Quantiles. The classic loss function eliciting the α-quantile is the so-called pinball loss given by ℓ(x, y) = (1{x y} α)(x y), for which absolute loss is the special case α = 1/2. Here ℓis convex, but not differentiable in x when x = y. Ratios of expectations. Consider a measurable function b : Y [0, ) such that EQ b(Y ) > 0 for all Q Q. If G is as in Definition 3, then the transformed Bregman divergence ℓ(x, y) = b(y) G(y) b(y) G(x) G(x), y b(y)x elicits the ratio of expectations Γ : Q 7 {EQ Y/EQ b(Y )} (Gneiting, 2011; Frongillo and Kash, 2015). (To see this, consider the ratio EQ ℓ(x, Y )/EQ b(Y ), ignore terms depending on Y but not x, and appeal to Theorem 4.) Expectiles. The τ-expectile, introduced by Newey and Powell, is a type of generalized quantile defined as the unique solution x = µτ to the equation EQ [|1x y τ|(x Y )] = 0, where Y = R and τ (0, 1). The expectile µτ is elicited by asymmetric squared loss ℓ(x, y) = |1x y τ|(x y)2, which is strictly convex and differentiable (Newey and Powell, 1987; Gneiting, 2011). As we show in 4.2, our main results can be cast in terms of elicitation. In particular we find that individual sequences that are hard to predict under a loss ℓare determined only by the property the loss elicits. Memoryless Sequences for General Losses 3. An Orthogonality Condition To make headway in characterizing memoryless sequences and understanding their stochastic behavior, our first step will be to reduce the no-regret definition of memorylessness to a statement of orthogonality, as captured in this section. Given an outcome sequence y YN and a family X of bounded prediction sequences, the results of this section show that a sequence x X is optimal for predicting y under the average loss if and only if for every x X the difference sequence x x is orthogonal, in an appropriate sense, to the gradients sequence ( ℓ(x i , yi))i 1. We first show this result for the case of differentiable losses, then demonstrate why it does not extend to the nondifferentiable case without further assumptions, and finally make such an extension under mild assumptions on the weak limits of y. We begin with a short review of weak convergence, which plays an important role in our principal results and proofs. 3.1. Weak Convergence Several of the key results and proofs in this paper rely on the notion of weak convergence of probability measures, which we briefly review here. A succinct treatment of weak convergence can be found in Chapter 2 of van der Vaart (2000). A sequence {νn : n 1} of probability measures on Rp is said to converge weakly to a limiting probability measure ν, written νn ν, if R f dνn R f dν for every bounded continuous function f : Rp R. A sequence {νn : n 1} of probability measures on Rp is tight if for every ϵ > 0 there exists a compact set K Rp such that νn(Kc) < ϵ for each n 1. Thus any sequence of measures supported on a common compact set is tight. Prokhorov s theorem states that if {νn} is tight then any subsequence {νnk} has a further subsequence {νmk} that converges weakly to a limiting measure. Of particular interest in this paper are the empirical measures of vector sequences. Let u = u1, u2, . . . a sequence of vectors Rp. For each n 1 define a probability measure νn on Rp by assigning mass 1/n to each of the vectors u1, . . . , un. More formally, i=1 I{ui A} (3) for all Borel sets A Rp. If the sequence u is bounded, then it is easy to see that the empirical measures {νn} are tight and, as a consequence of Prohorov s theorem, any subsequence of {νn} has a further subsequence that converges weakly to a limiting measure on Rp. Although the empirical measures νn are discrete, their weak limits may be absolutely continuous with respect to Lebesgue measure. Finally, we note that any measure ν on Ym Rdm corresponds to a sequence of random vectors Y1, . . . , Ym Y, defined on a common probability space and having ν as their joint distribution. Thus we may write νn ν equivalently as νn (Y1, . . . , Ym) where (Y1, . . . , Ym) ν. 3.2. Differentiable Losses We begin our study of orthogonality in the case where assumption (A5) holds, namely the loss ℓ(x, y) is differentiable with respect to x for each y Y. The following definitions will be used in Lemma 7 below. Frongillo and Nobel Definition 5 A sequence u = u1, u2, . . . with values ui Rd is bounded if there exists a constant L < such that ||ui|| L for all i 1. The closure cl(u) of u is the (ordinary) closure in Rd of the countable set {u1, u2, . . .} containing its elements. We will say that u is interior to an open set U Rd if cl(u) is contained in U. Definition 6 Let U be a subset of a vector space. The star interior of U is given by starint(U) = {u U : v U α0 > 0 α [ α0, α0], u + α(v u) U}. Note that the star interior of U is a subset of the relative interior of U. Lemma 7 (Orthogonality, differentiable case) Let X, Y be subsets of Rd satisfying assumptions A1 and A2, and let ℓ(x, y) be a loss function satisfying assumptions A3-A5. Let y YN be a bounded sequence, and let X X N be a family of bounded sequences x that are interior to X. Then for all x starint(X), the following two statements are equivalent: i=1 ℓ(xi, yi) 1 i=1 ℓ(x i , yi) 0 for all x X (4) i=1 xi x i , ℓ(x i , yi) = 0 for all x X . (5) Proof Let x starint(X) be fixed. To show that (5) implies (4), we use convexity of ℓin its first coordinate. By the subgradient inequality ℓ(xi, yi) ℓ(x i , yi) xi x i , ℓ(x i , yi) . Summing over 1 i n, dividing by n, and taking the limit inferior, inequality (4) follows from (5). To establish the converse, suppose that (5) fails to hold. Then there exists a sequence x X and δ > 0 such that the average in (5) has limit inferior less than δ or limit superior greater than δ. We consider the former case; the argument for the latter is similar. For each α R define xα by xα i = x i + α(xi x i ) = αxi + (1 α)x i . As x is in the star interior of X by assumption, there exists 0 < α0 1 such that xα X for all α [0, α0]. Note that for each such α and each i 1, ℓ(xα i , yi) ℓ(x i , yi) = Gα(yi, xi, x i ) + α xi x i , ℓ(x i , yi) (6) where Gα : Y X X R is defined by Gα(u, v, w) = ℓ(w + α(v w), u) ℓ(w, u) α v w, ℓ(w, u) (7) and is equal to the Bregman divergence of ℓ( , u) evaluated at w + α(v w) and w. In particular, Gα is non-negative. Define the set K = cl(y) cl(x) cl(x ). As Y is closed and x, x are interior to X by assumption, K is a subset of Y X X. Moreover, the boundedness of y, x, x implies that K is a compact subset of (Rd)3. Our assumptions on ℓ() ensure that Gα is continuous and Memoryless Sequences for General Losses bounded on K. For n 1 let νn( ) = n 1 Pn i=1 I((yi, xi, x i ) ) be the empirical measure on K of the finite sequence of triples (y1, x1, x 1), . . . , (yn, xn, x n). By assumption, there is a subsequence {nl} of the positive integers such that i=1 xi x i , ℓ(x i , yi) δ. (8) As K is compact the sequence {νn} is tight, there is a subsequence {nk} of {nl} such that νnk converges weakly to some probability measure ν on K. Using equation (6), we find that for each 0 < α < α0, i=1 ℓ(xα i , yi) 1 i=1 ℓ(x i , yi) i=1 ℓ(xα i , yi) 1 i=1 ℓ(x i , yi) = lim inf k i=1 Gα(yi, xi, x i ) + α i=1 xi x i , ℓ(x i , yi) = lim inf k "Z Gα dνnk + α i=1 xi x i , ℓ(x i , yi) = Z Gα dν + α lim inf k 1 nk i=1 xi x i , ℓ(x i , yi) Z Gα dν α δ. The last equality above follows from the weak convergence of νnk to ν, and the final inequality follows from (8). It suffices to show that the final term of the previous display is negative for some α > 0, and for this it is enough to show that R Gα dν = o(α). Note that for each triple (u, v, w) in K the existence of the gradient ℓ(w, u) implies that α 1Gα(u, v, w) 0 as α 0. We show that the functions Gα for 0 < α < α0 are dominated by a constant function, and the result then follows from the dominated convergence theorem. To this end, let M be the maximum of the continuous function || ℓ(w, u)|| over u in cl(y) and w in the convex hull W of cl(x) cl(x ), which is a compact subset of X. By standard results (Shalev-Shwartz, 2012, Lem 2.6), |ℓ(w1, u) ℓ(w2, u)| M ||w1 w2|| for all u cl(y) and w1, w2 W. From this bound and the Cauchy-Schwarz inequality, we find that for all (u, v, w) K |Gα(u, v, w)| α |ℓ(w + α(v w), u) ℓ(w, u)| ||α(v w)|| ||v w|| + | v w, ℓ(w, u) | 2M ||v w|| 2MD Frongillo and Nobel where D is the diameter of the set cl(x) cl(x ). As x and x are bounded, D is finite, and the proof is complete. 3.3. A Counterexample While Lemma 7 applies to differentiable loss functions such as squared loss, it does not address continuous, but non-differentiable, loss functions. Perhaps the simplest non-differentiable loss function is absolute loss ℓ1(x, y) = |x y|. Given the prominence of absolute loss in statistics and machine learning, it is natural to ask whether the conclusion of Lemma 7 continues to hold for ℓ1 with ℓ1(x, y) replaced by a subgradient of ℓ1(x, y) with respect to x at a fixed value of y. As the following example shows, the answer is no in general, essentially because the weak limits of a sequence y may concentrate on kinks of the loss. We will then show in 3.4 that if the weak limits do not concentrate on kinks, the result does go through. Example. Consider a setting with X = [ 1, 1] and Y = (0, 2), and let X = {c N : c X} be the set of constant predictors, with xi = c for all i for some c X. Now let y = (1/i)i 1 and x = 0. Then for any x = (c, c, c, . . .) X, i=1 ℓ(xi, yi) 1 i=1 ℓ(x i , yi) = lim inf n i=1 |c 1/i| 1 lim inf n 1 n i |c| 1 |1/i| + X i>|c| 1 (|c| 2/i) |c| lim sup n 1 n [2 log n] = |c| 0 , where the first inequality follows from |c 1/i| |c| |1/i| , and the second from properties of the harmonic numbers. Thus y and x satisfy the no-regret condition (4) of Lemma 7 with respect to absolute loss. They do not satisfy the orthogonality condition (5), however, as we have i=1 xi x i , ℓ(x i , yi) = lim n 1 n i=1 (c x 1) ( 1) = x 1 c = c , which is non-zero when c = 0. In fact, when c > 0, it appears from this expression that the loss would decrease by moving x 1 = 0 slightly toward c, as the derivative of the loss is strictly negative, but as we know from the above, any movement away from 0 eventually will be bounded away from y. This tells us that the derivative of the regret term in (4) does not commute with the limit operator, even when all the derivatives ℓ(x i , yi) exist in (5). 3.4. Non-differentiable Losses As the counterexample above makes clear, in order to establish a version of Lemma 7 in the non-differentiable setting, additional assumptions are required to compensate for kinks Memoryless Sequences for General Losses in the loss function, specifically (x, y) pairs where ℓ(x, y) is not differentiable in its first argument. We formulate one such assumption below, which ensures that the sequences of interest avoid such pairs. Assume that X and Y are subsets of Rd satisfying assumptions A1 and A2, respectively, and that ℓ(x, y) is a loss function satisfying assumptions A3 and A4. It follows from A3 that the function ℓ(x, y) is subdifferentiable in x for each y Y. Let ℓ(x, y) denote the (non-empty) set of subgradients for ℓ(x, y) at x X. Furthermore, let S X Y be the set of points (x, y) where the ordinary gradient ℓ(x, y) exists and is jointly continuous in both arguments. Lemma 8 (Orthogonality, non-differentiable case) Assume the subgradients ℓ(x, y) are bounded on bounded subsets of X Y. Let y YN be bounded, X X N a family of bounded sequences that are interior to X, and x starint(X). If every weak limit η of the empirical measures of {(x i , yi) : i 1} is such that η(S) = 1 then the following two statements are equivalent: (i) For all x X i=1 ℓ(xi, yi) 1 i=1 ℓ(x i , yi) (ii) For all x X and all sequences z1, z2, . . . with zi ℓ(x i , yi) i=1 xi x i , zi = 0. (10) Proof The proof follows that for the differentiable case, with minor changes. We sketch the argument below. The argument that (10) implies (9) is identical to the differentiable case. To establish the converse, we may assume that for some x X and some sequence zi ℓ(x i , yi), there exists δ > 0 such that the average in (10) has limit inferior less than δ. Note that if (x i , yi) S then zi = ℓ(x i , yi). Let xα be defined as before. For sufficiently small α > 0 and each i 1, we have ℓ(xα i , yi) ℓ(x i , yi) = Hα(yi, xi, x i , zi) + α xi x i , zi where Hα : Y X 2 Rd R is defined by Hα(u, v, w, z) = ( Gα(u, v, w) if (w, u) S ℓ(w + α(v w), u) ℓ(w, u) α v w, z if (w, u) Sc, and Gα is defined as in (7). By assumption, the sequence (yi, xi, x i , zi) takes values in a compact set K Y X 2 Rd. Moreover, Hα is continuous and bounded on the set KS = {(u, v, w, z) K : (w, u) S}. By arguments like those in the differentiable case, there is a subsequence {nk} of the positive integers with the following properties: (i) the limit of n 1 k Pnk i=1 yi y i , gi exists and is at most δ; (ii) the empirical measures γnk of Frongillo and Nobel the sequence (yi, xi, x i , zi) converge weakly to a measure γ supported on K; and (iii) for all sufficiently small α > 0, i=1 ℓ(xα i , yi) 1 i=1 ℓ(x i , yi) "Z Hα dγnk + α i=1 xi x i , zi Let η be the (w, z)-marginal of γ on X Rd. It is easy to see that η is a weak limit of the sequence (x i , yi), and therefore γ(Kc S) η(Sc) = 1 η(S) = 0 by assumption. It follows from the Portmanteau Theorem that Z Hα dγnk = Z Hα dγ = Z KS Hα dγ = Z By arguments like those in the differentiable case, the final integral above is o(α), and the result follows as before. 4. Weak Limits of Memoryless Sequences We now turn to our principal result, which establishes that memoryless sequences are characterized by an optimality type property of their limiting empirical measures. We begin by noting that any infinite sequence y = y1, y2, . . . YN gives rise, for each m 1, to a sequence of empirical measures on Ym that are obtained by placing mass on overlapping blocks of m successive terms in the sequence. Any weak limit of these measures can be represented as a sequence of random vectors Y1, . . . , Ym with Yi Y. Using the orthogonality lemma, we establish in Theorem 9 that y is ℓ-memoryless for c if and only if c is the best predictor of Yk+1 given Y1, . . . , Yk for k = 1, . . . , m 1. In Section 4.2 we show that this optimality can be expressed equivalently in terms of the property Γ elicited by the loss ℓ. In particular, ℓ-memoryless sequences are characterized and determined by the property Γ elicited by the loss ℓ. For losses that elicit the mean, the limiting random vectors Y1, . . . , Ym are closely related to martingale difference sequence, and this fact can be used to extend results of Nobel (2004) on the stochastic behavior of memoryless sequences to these loss functions (see Section 4.3). 4.1. Principal Result We require several preliminaries concerning measures derived from an individual sequence taking values in the outcome space Y. Let y = y1, y2, . . . YN and let m 1 be a blocklength. For each and n 1 define the n-sample m-dimensional empirical measure of y by µn,m(A) = 1 i=0 I{(yi+1, . . . , yi+m) A} (11) for all Borel sets A Ym. The measure µn,m(A) places mass 1/n at each of the n successive m-tuples ym 1 , ym+1 2 , . . . , yn+m n+1 in Ym obtained by sliding a window of width m along the sequence y one component at a time. Memoryless Sequences for General Losses If y is bounded then the empirical measures {µn,m : n 1} are tight, and thus every subsequence has a further subsequence that converges weakly to a limiting measure µm on Ym. We will express the convergence µnk,m µm in the equivalent form µnk,m (Y1, . . . , Ym), where Y1, . . . , Ym Y are random vectors with joint distribution µm. Recall that a sequence Y1, . . . , Ym is stationary if for each s, j 1 with s + j m the sequence (Ys, . . . , Ys+j) has the same joint distribution as (Y1, . . . , Yj+1). Theorem 9 Let X, Y be subsets of Rd satisfying assumptions A1 and A2, and let ℓbe a loss function satisfying assumptions A3-A5. Let c X and let y YN be bounded. The following are equivalent: (i) The sequence y is ℓ-memoryless for c; (ii) For each m 1 every weak limit (Y1, . . . , Ym) of the m-dimensional empirical measures {µn,m : n 1} of y is stationary, bounded, and satisfies c argmin x X E[ ℓ(x, Yk+1) | Y k 1 ] wp1 (12) for 1 k n 1. Proof Let y be a bounded sequence with values in Y. Suppose that for some m 2 and some subsequence {nl} of the positive integers µnl,m (Y1, . . . , Ym) as l , where µnl,m are defined as in (11). Then for each s, j 1 with s+j m, and every bounded continuous function f : Yj+1 R, Eg(Ys, . . . , Ys+j) = lim l 1 nl i=0 g(yi+s, . . . , yi+s+j) = lim l 1 nl i=0 g(yi+1, . . . , yi+j+1) = Eg(Y1, . . . , Yj+1). It follows that (Ys, . . . , Ys+j) has the same joint distribution as (Y1, . . . , Yj+1), and as this is true for each choice of s, j above, the sequence Y1, . . . , Ym is stationary. By the Portmanteau theorem (see, for example, Lemma 2.2 of Vaart (2000)) P(Yk cl(y)) lim sup l i=0 I(yi cl(y)) = 1, and since cl(y) is bounded by assumption, each random variable Yk is bounded as well. Suppose that y satisfies condition (i) of the theorem. As X is open, there exists δ > 0 such that B(c, 2δ) X where B(c, γ) = {x : ||c x|| < γ} is the open ball of radius γ centered at c. Let X be the set of all infinite sequences x = x1, x2, . . . X such that, for some k 0 and some continuous function g : Yk B(c, δ), x1 = = xk = c and xi = g(yi 1 i k) for i k + 1. One may easily verify that the conditions of Lemma 7 are satisfied with x identically equal to c, and therefore i=1 xi c, ℓ(c, yi) = 0 for all x X. (13) Frongillo and Nobel Let 1 k < m and let g : Yk B(c, δ) be any continuous function. Then the function f : Yk+1 R defined by f(uk+1 1 ) = g(uk 1) c, ℓ(c, uk+1) is continuous and is bounded on the compact set cl(y)k+1 supporting (Y1, . . . , Yk+1). By appropriate choice of x X, the relation (13) implies that E D g(Y k 1 ) c, ℓ(c, Yk+1) E = lim l Z f dµnl,m = lim l 1 nl i=0 f(yi+k+1 i+1 ) = 0. As the function g : Yk B(c, δ) was arbitrary, a routine argument shows that E[ ℓ(c, Yk+1) | Y k 1 ] = 0. Now fix x X. As ℓ(u, v) is convex in its first argument, ℓ(x, y) ℓ(c, y) y c, ℓ(c, y) . Replacing y by Yk+1 and taking the conditional expectation with respect to Y k 1 , we find that E[ ℓ(x, Yk+1) | Y k 1 ] E[ ℓ(c, Yk+1) | Y k 1 ] 0 with probability 1, giving condition (ii). Suppose now that y fails to satisfy (i). It follows from Lemma 7 (or the subgradient inequality) that there exists k 0, g Ck, and a subsequence {nr} of the positive integers such that g(yi 1 i k) c, ℓ(c, yi) < 0. (14) Let {nl} be a further subsequence of {nr} such that µnl,k+1 converges in law to a sequence (Y1, . . . , Yk+1). It follows from (14) that E g(Y k 1 ) c, ℓ(c, Yk+1) is non-zero, and therefore the conditional expectation E[ ℓ(c, Yk+1) | Y k 1 ] is non-zero with positive probability. Thus there exists γ Rd and δ > 0 such that E[ γ, ℓ(c, Yk+1) | Y k 1 ] = D γ, E[ ℓ(c, Yk+1) | Y k 1 ] E = δ. Our assumptions on ℓ( , ) ensure that for each compact set K Y the supremum sup y K |ℓ(x, y) ℓ(c, y) ℓ(c, y)(x c)| = o(||x c||) as x c. Replacing y by Yk+1 and x by xα = αγ + c, it follows from the previous two displays that E[ ℓ(xα, Yk+1) | Y k 1 ] = E[ ℓ(c, Yk+1) | Y k 1 ] αδ + o(α). Thus for α > 0 sufficiently small, we find that E[ ℓ(xα, Yk+1) | Y k 1 ] < E[ ℓ(c, Yk+1) | Y k 1 ]. Thus condition (ii) fails to hold, and the proof is complete. Theorem 9 can be extended to non-differentiable loss functions ℓsatisfying A3-A4 by using Lemma 8 instead of Lemma 7 in equation 13 and the subsequent display. For c X let Sc be the (open) set of points y where the derivative ℓ(c, y) exists and is continuous in a neighborhood of (c, y). We state the following result without proof. Theorem 10 Suppose that the subgradients ℓ(x, y) are bounded on bounded subsets of X Y, and let y YN be a bounded sequence such that every weak limit of {µn,1} assigns probability one to Sc. Then y is ℓ-memoryless for c if and only if for each m 1 every weak limit (Y1, . . . , Ym) of {µn,m} is stationary, bounded, and satisfies c argminx X E[ ℓ(x, Yk+1) | Y k 1 ] with probability one for 1 k m 1. Memoryless Sequences for General Losses 4.2. Memoryless Sequences and Property Elicitation Theorem 9 establishes a close connection between memoryless sequences and the property elicited by the loss function ℓ. Recall that ℓelicits the property Γℓ: Q 2X defined by Γℓ(Q) = argminx X R ℓ(x, y) d Q(y). Condition (12) in Theorem 9 can be stated equivalently as c Γℓ(Yk+1|Y k 1 ) wp1 (15) where Yk+1|Y k 1 denotes the conditional distribution of Yk+1 given Y1, . . . , Yk when k 1, and the distribution of Y1 when k = 0. Thus Theorem 9 implies that ℓ-memoryless sequences are characterized by the property Γℓelicited by ℓ. As an immediate consequence, two losses eliciting the same property have the same family of memoryless sequences. Corollary 11 Let X, Y be subsets of Rd satisfying assumptions A1 and A2, and let ℓand ℓ be loss functions satisfying A3-A5. If ℓand ℓ elicit the same property Γ : Q 2X , then a bounded sequence y YN is ℓ-memoryless for c if and only if it is ℓ -memoryless for c. Thus any convex loss eliciting the mean, which must be a Bregman divergence, will have the same memoryless sequences as squared loss. As discussed in 4.3 below, the weak limits of such a sequence are martingale difference sequences, and this can be used to establish stochastic properties of the sequence. Another common loss function is the absolute loss ℓ1(x, y) = |x y|, which is convex and elicits the median, rather than the mean. Theorem 10 implies that a real-valued individual sequence whose weak limits assign no probability to the point 0 is ℓ1-memoryless with c = 0 if and only if its weak limits have conditional medians equal to 0 with probability one. Linton and Whang (2007) and Coudin and Dufour (2009) have studied such sequences, which they term mediangale differences, as well as more general quantilegale difference sequences. 1 Following this terminology, sequences Y1, . . . , Ym satisfying condition (15) might be termed Γ-gale differences for an elicitable property Γ, or lossingale differences for condition (12). Similar statements can be made for expectiles and ratios of expectations. For example, as we saw in 2.3, a ratio of expectations Γ(Q) = EQY/EQb(Y ) is elicited by the transformed Bregman divergence ℓ(x, y) = b(y)G(y) b(y)G(x) G(x), y b(y)x , which is convex whenever the original divergence is convex. For such losses, Theorem 9 together with eq. (15) imply that a sequence is ℓ-memoryless for c if and only if its weak limits satisfy c = Γ(Yk+1|Y k 1 ) = E[Yk+1|Y k 1 ]/E[b(Yk+1)|Y k 1 ]. In particular, a sequence is ℓ-memoryless at c if and only if {Yk c b(Yk)}m k=1 is a martingale difference sequence. 4.3. Mean Elicitation, Martingale Differences, and Stochastic Properties When the loss ℓelicits the mean Theorem 9 establishes a connection between the weak limits of memoryless sequences and multivariate martingale differences, generalizing earlier work of Nobel (2004), who considered the case of squared loss with X, Y R. As its proof of the following result is routine, we omit the details. 1. Coudin and Dufour use the term mediangale to refer to the difference sequence. Medians in higherdimensional spaces are often defined in terms of minimizing ℓ (x, y) = x y 1 (Fekete et al., 2005), in which case the same result would apply: sequences whose weak limits avoid 0 Rk will be ℓ -memoryless if and only if their weak limits are (generalized) mediangale differences. Frongillo and Nobel Proposition 12 Let ℓbe a loss function satisfying A3-A5 that elicits the mean in the sense of Definition 2, and let y YN be bounded and ℓ-memoryless for c X. Then c is unique, and the centered sequence z = y c with zi = yi c is ℓ-memoryless for 0. In particular, every weak limit Z1, . . . , Zm of the m-dimensional empirical measures of z satisfies E[Zk+1 | Zk 1 ] = 0 for 1 k n 1. Nobel (2004) showed how the martingale difference property of Proposition 12 could be leveraged to establish stochastic properties of ℓ-memoryless sequences. These properties include a Hoeffding-type inequality and a central limit type theorem. These arguments extend without modification to any loss that elicits the mean and satisfies the conditions of Theorem 9 or Theorem 10. We state the Hoeffding inequality and central limit theorem without proof. In Section 5 we show how to relax convexity and differentiability of the loss, thereby extending the results below to all Bregman divergences. Proposition 13 Let ℓbe a loss function satisfying A3-A5 that elicits the mean. If the sequence y = y1, y2, . . . [a, b] is ℓ-memoryless for 0, then for every m 1 and every t > 0, lim sup n 1 n 2e mt2/2(b a)2. Remark: As an easy corollary of Proposition 13 we obtain a law of large numbers for y, namely n 1 Pn i=1 yi 0 as n tends to infinity. Let the sequence y YN be given. For each i, m 1 define the root-normalized partial sums si,m = m 1/2 Pm j=1 yi+j. For each n 1 let νn,m be the empirical measure of s1,m, . . . , sn,m, that is, νn,m(A) = 1 i=1 I{si,m A} for each set A B where B denotes the Borel subsets of R. Note that νn,m depends only on x1, . . . , xn+m. Let ρ( , ) be any metric for probability measures on R that is compatible with weak convergence, in the sense that νn ν if and only if ρ(νn; ν) 0. One example is the Prokhorov metric ρ(ν; η) = inf{ϵ > 0 s.t. ν(A) η(Aϵ) for every A B}, where Aϵ = {u : |u v| < ϵ for some v A} is the ϵ-blow-up of A. Let N(0, σ2) denote the normal distribution with mean 0 and variance σ2. Theorem 14 Let y = y1, y2, . . . R be a bounded sequence, and let ℓbe a loss satisfying A3-A5 that elicits the mean. If y is ℓ-memoryless for 0 and y2 := y2 1, y2 2, . . . is ℓ-memoryless for σ2 > 0 then lim m lim sup n ρ( νn,m , N(0, σ2) ) = 0. Equivalently, for every ϵ > 0 there exists a block length m0 (depending on ϵ) and a sample size n0 (depending on ϵ and m0) such that ρ( νn,m0, N(0, σ2) ) < ϵ for each n n0. Memoryless Sequences for General Losses Remark. Theorem 14 is expressed in terms of double limits: the first is taken as the number n of sliding blocks increases, and the second is taken with increasing block size m. The first limit accounts for the stochastic behavior of the sequence, essentially replacing the m-dimensional distribution of a random sequence with the limiting empirical distribution of the m blocks of x. The second limit corresponds to increasing sample size. When Y is finite and X is a subset of distributions on Y, an important special case of Bregman divergences is the class of proper losses, which elicit the full distribution on Y when this distribution is in X. Other than squared loss, the most common proper loss is log loss ℓlog : X Y R, given by ℓlog(x, y) = log x(y), where X is the relative interior of the probability simplex (Y), and x(y) denotes the probability x assigns to y. Up to terms depending only on y, log loss can be written as the Kullback Leibler divergence, which is the Bregman divergence with respect to (negative) Shannon entropy. For proper losses, we can interpret the stochastic results established below as giving certain calibration guarantees; see 6 for how to reformulate proper losses to satisfy assumptions A1-A5 in the context of prediction markets, where proper losses correspond to the complete market case. 4.4. Memoryless Sequences from Random Processes The characterization of memoryless sequences in Theorem 9 suggests that the sample paths of appropriate sequences of random variables should be memoryless. This is the conclusion of the following result, the proof of which can be found in Appendix A. Proposition 15 Let X, Y, and ℓsatisfy assumptions A1-A5, and let Γℓbe the property elicited by ℓ. Let Y = Y1, Y2, . . . be a sequence of random vectors defined on a common probability space, and taking values in a fixed compact subset of Y such that k 1 Γℓ(Yk+1|Y k 1 ) wp1 (16) for some predictor c X. Then with probability one the sequence Y is ℓ-memoryless for c. Proposition 15 relies on the following elementary characterization of conditional loss optimality. Note that one could view this result, and equation (17) in particular, in terms of identification functions from the property elicitation literature, see for example Steinwart et al. (2014); Lambert (2018). Lemma 16 Let Y be a random vector defined on a probability space (Ω, G, Q) and taking values in a compact subset of Y, and let G0 be a sub-sigma field of G. For any fixed c X the following statements are equivalent: E[ ℓ(c, Y ) | G0 ] = 0 with probability 1 , (17) E[ ℓ(c, Y ) | G0 ] E[ ℓ(x, Y ) | G0 ] with probability 1 for every x X . (18) 5. Extensions to Non-Convex and Non-Differentiable Losses We now show that, under appropriate conditions, one may extend Theorem 9 to certain loss functions ℓthat do not satisfy condition A3 (convexity) or A5 (differentiability), including Frongillo and Nobel some non-differentiable losses that are not covered by Theorem 10. The key idea is to express a non-convex or non-differentiable loss function in a composite form via a continuous invertible link function ψ that effectively replaces the native space X by a surrogate space ˆ X on which the loss is convex and differentiable. Suitable assumptions on the link function ensure that the set of memoryless sequences is preserved by the link function, and that Theorem 9 applies. In particular, we argue that essentially all Bregman divergences are composite in the sense above, and we can thereby show that Theorem 9 and the results of Nobel (2004) extend to the full class of losses eliciting the mean. Let X and Y satisfy conditions A1 and A2, respectively, and let ℓ: X Y R be a loss function of interest that elicits a property Γ. Suppose that ℓfails to satisfy one or both of conditions A3 (convexity) or A5 (differentiability), but that ℓcan be written in the form ℓ(x, y) = ˆℓ(ψ(x), y) (19) where ˆℓ: ˆ X Y R satisfies conditions A3 A5, ˆ X Rd satisfies condition A1, and the link function ψ : X ˆ X is a homeomorphism (a continuous bijection with a continuous inverse) of X and ˆ X. Then Theorem 9 applies to ℓand the property Γ. Proposition 17 Under the conditions above, a bounded sequence y YN is ℓ-memoryless for c X if and only if the weak limits of y are stationary, bounded, and satisfy the optimality relation (12). Proof Let y YN be bounded, and let ˆΓ be the property elicited by ˆℓ. As ψ : X ˆ X is a homeomorphism, it is easy to see that every bounded continuous function ˆg : Yk ˆ X can be written in the form ˆg = ψ g for some bounded continuous function g : Yk X. Similarly, every bounded continuous function g : Yk X can be written in the form g = ψ 1 ˆg for some bounded continuous function ˆg : Yk ˆ X. It follows from (19) that y is ℓ-memoryless for c X if and only if y is ˆℓ-memoryless for ψ(c) ˆ X which, by Theorem 9, holds if and only if the weak limits of y are stationary, bounded, and satisfy (12) with c replaced by ψ(c). A straightforward argument using (19) shows that Γ = ψ 1 ˆΓ and the result follows. Note that Proposition 17 allows us to characterize memoryless sequences for certain nondifferentiable losses without needing the assumptions of Theorem 10. In particular, if ψ : X ˆ X is a nondifferentiable homeomorphism, then in general ℓ(x, y) = ˆℓ(ψ(x), y) will be nondifferentiable, yet Proposition 17 will still apply. Now let ℓG(x, y) = G(y) G(x) G(x), y x be the Bregman divergence of a convex function G : D R. The results of Section 4 imply that the weak limits of an ℓG-memoryless sequences are shifted martingale difference sequences if conditions A3-A5 hold. However, the convexity condition A3 is somewhat restrictive: while ℓG is always convex in y, it is generally not convex in x. For example, when G(x) = ex the corresponding divergence ℓG(x, y) = ey ex(1+y x) is nonconvex in x for all y (see Figure 1). However, for suitable convex functions G the Bregman divergences ℓG can be written in the composite form (19) using the mixed Bregman divergence (Gordon, 1999). This leads to the following extension of Theorem 9, which relies on Proposition 17. For more details, and the proof of the result, see Appendix B. Memoryless Sequences for General Losses 5 10 15 20 x Figure 1: Three loss functions evaluated at y = 3. The first, given by ˆℓ(ˆx, y) = ey + ˆx log ˆx ˆx yˆx, satisfies the conditions A3-A5, and thus Theorem 9 applies. The second, ℓ(x, y) = ey ex(1+y x), is a nonconvex Bregman divergence ℓG for G(x) = ex, but can be made convex via the invertible link ψ(x) = ex as for all x, y we have ℓ(x, y) = ˆℓ(ψ(x), y), so Proposition 17 applies. The third is both nonconvex and nondifferentiable, but again we may apply Proposition 17 as ℓ (x, y) = ˆℓ(ψ (x), y) for ψ (x) = ex for x 2 and 1 + ex/2 for x < 2. Proposition 18 Let D Rd be convex, Y D be closed, and X D be open and convex. If G : D R is strictly convex, closed, and differentiable on X, then the characterization of Theorem 9 applies to the Bregman divergence ℓG. 6. Application to Prediction Markets We now apply our results in the setting of prediction markets, which are markets designed to elicit and aggregate predictions from traders about some future outcome Z in the arena of athletics, finance, entertainment, or politics. Our conclusion will be that, under the efficient market hypothesis, the outcomes of a sequence of markets are memoryless with respect to the market prices. Prediction markets work by offering financial contracts whose payoffs are contingent in some way on the eventually-observed value of Z; by revealed preference, the choices of traders in such a market can be interpreted as predictions about Z, and the final prices of the market can be viewed as an aggregate, or consensus belief of the traders (Hanson, 2003; Wolfers and Zitzewitz, 2004). Formally, our setting is as follows. The set Z will represent the possible outcomes, and thus the possible values of Z. The market will support the buying and selling of d different securities, whose payoffvalues are each contingent on which outcome z Z materializes. In particular, we define the payoffs of these securities by a vector-valued function φ : Z Rd, where the component φi(z) denotes the payoffof security i upon outcome z. The prediction space X = Rd will represent vectors of shares in these d securities, which traders can buy and sell. Thus, if a trader holds a bundle of shares r X and outcome z Z materializes, then the trader is owed r, φ(z) . Intuitively, a risk-neutral trader (i.e. one seeking to maximize expected payoff) who buys a bundle r for a cost of c reveals a belief r, Epφ(Z) > c; that is, the trader must believe the expected value of r, φ(Z) to be greater than c. In this way, the market prices are thought to reflect the consensus belief about the expected value of the securities φ. An important special case is when |Z| = d and φi(z) = 1{z = zi} is the indicator of the element zi Z, corresponding to a complete market. In this case the expected value of φ(Z) is simply the distribution p over Z. Frongillo and Nobel Market maker initializes state x0 0 Rd for all traders t = 1, . . . , T do Trader t decides to purchase bundle rt Rd Market maker updates the state xt xt 1 + rt Trader pays the market maker C(xt) C(xt 1) end Outcome z Z is revealed and market maker pays rt, φ(z) to trader t = 1, . . . , T Algorithm 1: The cost-function-based market maker Due to thin market problems, it is common to employ an automated market maker framework, which is simply a central entity in the market through which all transactions must be made. A popular mechanism to determine the cost of each purchase is the costfunction-based market, introduced by Abernethy, Chen, and Wortman Vaughan (2013). Here the market maker chooses a convex potential function C : Rd R, and maintains a vector xt = Pt i=1 ri describing the total number of shares bought and sold of each security up to time t. The cost of purchasing a bundle rt X of shares at time t N is then given by the difference in the potential, C(xt 1 + rt) C(xt 1), at which point we set xt = xt 1 + rt. This procedure is described more formally in Algorithm 1. Typically one regards the gradient C(x) of C at the current market state x as the market price . The reason is that C corresponds to the instantaneous prices of the securities: C(x)/ xi is the price per unit of an infinitessimal quantity of security i. One can check that if a trader believes the outcome to be drawn from some distribution p over Z, then monotonicity of C implies that a risk-neutral trader would have an incentive to buy or sell shares until the market state satisfied C(x) = Epφ(Z), as discussed above. In this sense, the market is giving traders incentives to predict the value of Epφ(Z). For the market to satisfy standard axioms, C must be strictly convex and differentiable, and the set of possible gradients C(Rd) should be equal to D := relint(conv(φ(Z))), the relative interior of the convex hull of the security payoffs (Abernethy et al., 2013). This is equivalent to being able to write C(x) = supµ D µ, x G(µ) for some differentiable and strictly convex function G : D R with G(D) = Rd (Abernethy et al., 2013). Now consider running this market, from initalization to the final outcome revelation, many times for many events. One can ask, if the final market prices are correct , in the sense that the expected value of the securities really matched the price vector, i.e., C(x) = Epφ(Z) where p is the true distribution over Z? In attempting to answer this question one quickly encounters issues concerning the nature of probability and whether or not the market outcome is truly random. A convenient way around these issues is to appeal to individual sequences, as we do in this paper, and simply ask whether a given (infinite) series of market prices are memoryless with respect to the corresponding outcomes. To do so, we will need to translate the cost-function-based market setting to our own. To capture this prediction market setting, let X = Rd as above and let Y = {φ(z) : z Z} Rd be the possible security payoffs. Our loss function will take the form of the mixed Bregman divergence ˆℓG(x, y) = C(x)+G(y) x, y , with C and G as above, which satisfies Memoryless Sequences for General Losses assumptions A3-A5 as C and G are both convex and continuously differentiable (see the proof in Appendix B). Note that if the current market state is x , and a trader moves the state to x = x + r by purchasing bundle r, then ˆℓG(x, y) ˆℓG(x , y) = C(x + r) C(x ) r, y ; this is precisely the net loss of the trader in Algorithm 1, namely the up-front cost of bundle r, minus the eventual payoffof the securities y = φ(z). Now translating Lemma 7, fixing an outcome sequence z ZN and set of sequences X X N to which the initial market states x belong, we have, lim inf n 1 n i=1 (C(x i + ri) C(x i ) ri, φ(zi) ) 0 for all x + r X (20) i=1 ri, C(x i ) φ(zi) = 0 for all x + r X , (21) where now i denotes the run of the market, so that x i , ri, and zi represent, respectively, the initial market state, trader s purchase, and outcome in the ith run of the market. Thus, one can interpret the application of Lemma 7 to prediction markets as a version of the efficient market hypothesis: either the market prices are calibrated with respect to the class of trading algorithms whose outputs belong to X, in the sense of eq. (21), or some sequence of trades in X can make an infinite profit over the course of these market runs. For example, if X contains all constant sequences, and x is constant, eq. (21) implies a version of the law of large numbers in that the average security payoffmust approach the initial market price. Turning now to Theorem 9, we can say something stronger. Observe that the loss ˆℓG elicits the property Γ such that Γ(p) is the set of share vectors whose price matches the expected security payoffs under p; that is, C(Γ(p)) = {Epφ(Z)}. From the discussion in 4, we find that for any sequence z ZN, no continuous finite-memory trading strategy can garner (linearly) infinite profits from a series of markets initialized at x X, if and only if the weak limits of the security payoffsequence y = φ(z) are Γ-centered at the initial price C(x ), in the sense of eq. (15). In particular, subtracting off C(x ), the weak limits of the security payoffs form a multidimensional martingale difference sequence. Theorem 19 Let outcome space Z, securities φ : Z Rd, strictly convex and differentiable cost function C : Rd R with C(Rd) = relint(conv(φ(Z))), and inital share vector x X be given. Then for all z ZN, we have the following for every k 1 and g Ck, lim inf n 1 n C(x + g(zi 1 i k)) C(x ) g(zi 1 i k), φ(zi) 0 , (22) if and only if every weak limit (Z1, . . . , Zm), m 1, of the empirical measures {µn,m : n 1} of z is stationary, bounded, and satisfies Ep[φ(Zk+1)|Zk 1 ] = C(x ) wp1 for 1 k n 1. Proof As discussed above, we take yi = φ(zi), y = φ(z), and predictors g = g + x ; note that g is a continuous Markov predictor if and only if g is. Memorylessness is equivalent to eq. (22) as argued above. For the conditional expectation, from Theorem 9 we have x argminx X E[ ℓ(x, Yk+1) | Y k 1 ] wp1. As we assume C is strictly convex and differentiable, from Lemma 16 we see that x achieves the minimum loss if and only if Frongillo and Nobel 0 = E[ x (C(x + x) C(x ) x, φ(Zk+1) ) | Zk 1 ] = E[ C(x ) φ(Zk+1) | Zk 1 ]. We conclude with some remarks. First, the class of finite-memory trading algorithms is perhaps restrictive in this setting; ideally, we would allow our trading algorithms to use the entire past history of prices and outcomes. This immediately becomes problematic, however, as it is difficult to exclude algorithms that know the outcome sequence z. (The restriction to finite-memory and continuity in the definition of memoryless accomplishes this, for generic outcome sequences.) Intuitively, one should allow the outcome sequence to be independent of the prediction sequence, but this would stray from our focus on individual sequences. Nonetheless, it is possible that our techniques can be extended to such online settings, which could allow for a formal link to similar statements made in game-theoretic probability (Shafer and Vovk, 2005; Vovk, 2014). Finally, we note that the connection laid out in this section also applies to the broader family of scoring rule markets, by replacing the Bregman divergence with the corresponding score or loss (Lambert et al., 2008; Frongillo and Waggoner, 2018). 7. Discussion and Future Work We have extended the notion of memoryless sequences, introduced in Nobel (2004), to higher dimensions and general convex differentiable losses, as well as some nondifferentiable and nonconvex cases that encompass Bregman divergences. Our results establish that memoryless sequences are characterized by the stochastic behavior of their finite-dimensional weak limits, and that the distribution of these limits is governed by the property elicited by the loss function. In particular, the broad class of Bregman divergences have the same set of memoryless sequences as squared loss, and hence their weak limits form martingale difference sequences. Similarly, the memoryless sequences for absolute loss correspond to mediangales, whose conditional medians are 0. Finally, we applied our results to the question of price calibration in prediction markets, showing that if no trader can make an infinite profit, then prices are calibrated. It would be interesting to establish similar results in a more online setting, as discussed in Section 6, where the outcome sequence can adapt to the predictions adversarially. Such a setting would be closer to online learning and game-theoretic probability, allowing a more direct comparison to that literature. Another interesting future direction would be to incorporate a dependent variable, and study the relationship between the complexity of the prediction function class and the resulting set of memoryless sequences. Acknowledgements RM Frongillo was supported in part by NSF grant CCF-1657598. AB Nobel was supported in part by NSF grants DMS-1613072 and DMS-1613261, and NIH grant HG009125-01. Appendix A. Stochastic Sources of Memoryless Sequences We assume below that the prediction space X and outcome space Y satisfy assumptions A1 and A2, respectively, and that the loss ℓ(x, y) satisfies assumptions A3-A5. Memoryless Sequences for General Losses Proof [of Lemma 16]. Let c X be fixed, and assume without loss of generality that Y is compact. The subgradient inequality ensures that ℓ(x, Y ) ℓ(c, Y ) x c, ℓ(c, Y ) . Taking conditional expectations of both sides and using linearity of the conditional expectation, it is clear that (17) implies (18), and we turn our attention to the converse. As X is open, there exists δ > 0 such that the closed ball B(c, δ) of radius δ centered at c is contained in X. For each x B(c, δ) and each y Y define r(x : y) by ℓ(x, y) = ℓ(c, y) + x c, ℓ(c, y) + r(x : y). By assumption (A5), for each y Y the ratio r(x : y)/||x c|| 0 as x c. By arguments like those in the proof of Lemma 7 one may readily establish that r(x : y)/||x c|| a M for all (x, y) B(c, δ) Y, where M is the maximum of || ℓ(x, y)|| over the compact set B(c, δ) Y, and a is a constant. Note that M is finite as the gradient is assumed to be continuous. Now fix k 1 and let x B(c, δ). It follows from assumption (18) and the linearity of the conditional expectation that 0 E ℓ(x, Y ) ℓ(c, Y ) | G0 = x c, E ℓ(c, Y ) | G0 + E r(x : Y ) | G0 . By the dominated convergence theorem the second expectation above is o(||x c||), and we conclude that for every x B(0, δ) 0 x, E ℓ(c, Y ) | G0 + o(|| x||). Let v Rd be any unit vector, and let x = αv for some 0 < α < δ. It follows from the previous display that 0 α v, E ℓ(c, Y ) | G0 + o(α). Dividing both sides above by α and letting α tend to zero we find v, E ℓ(c, Y ) | G0 0; replacing v by v it follows that the inner product is equal to zero. As the unit vector v was a arbitrary, we conclude that E ℓ(c, Y ) | G0 = 0, as desired. We note that statement (18) of Lemma 16 can be written in the equivalent form c argmin x X E[ ℓ(x, Y ) | G0 ] wp1 Corollary 20 Let Y1, . . . , Yn be random vectors taking values in a compact subset of Y. If c argminx X E[ ℓ(x, Yk+1) | Y k 1 ] for 1 k n 1, then the gradients ℓ(c, Y1), . . . , ℓ(c, Yn) are a martingale difference sequence with respect to the filtration Fk = σ(Y k 1 ). Corollary 21 Let the random vector Y and sigma field G0 be as in Lemma 16. If the random vector X0 X is measurable G0 then E[ ℓ(X0, Y ) | G0 ] = 0 with probability one if and only if X0 argminx X E[ ℓ(x, Y ) | G0 ] with probability one. Frongillo and Nobel Proof [of Proposition 15] Let c X be as in the statement of the theorem, and assume without loss of generality that Y is compact. Recall that Ck denotes the family of (bounded) continuous functions from Yk to X. Using the orthogonality lemma it suffices to show that with probability one, for each k 1 and each g Ck, g(Y i 1 i k ) c, ℓ(c, Yi) = 0. (23) As Y is compact, the family of functions Ck contains a countable subset Ck that is dense in the supremum norm. Let g be any function in Ck and for m k + 1 define Zm = g(Y m 1 m k ) c, ℓ(c, Ym) . Lemma 16 ensures that E[Zm | Y m 1 1 ] = 0, and therefore Zk+1, Zk+2, . . . is a bounded martingale difference sequence. It follows from the Azuma-Hoeffding inequality and the Borel Cantelli lemma that n 1 Pn m=k+1 Zm 0 with probability one as n tends to infinity, which establishes (23) for the function g. As g was an arbitrary element of the countable dense family Ck, one may extend (23) to all of Ck by a straightforward approximation argument. Finally, excluding an exceptional set of probability zero for each family Ck, we find that, with probability one, (23) holds for all k 1 and all g Ck, as desired. Appendix B. Proof of Proposition 18 Assume that D Rd is convex, Y D is closed, and X D is open and convex. Let G : D R be a closed, strictly convex function that is differentiable on X. Recall that the (usual) Bregman divergence ℓG is given by ℓG(x, y) = G(y) G(x) G(x), y x . Let F(x) := supˆx D ˆx, x G(ˆx) be the convex conjugate of G and ˆ X dom F. Define the mixed Bregman divergence ˆℓG : ˆ X Y R by ˆℓG(ˆx, y) = F(ˆx) + G(y) ˆx, y . We claim that ˆℓG( G(c), y) = ℓG(c, y). To see this, we use the fact that F( G(c)) = G(c), c G(c), which essentially follows from the first-order optimality condition in the definition of F (Urruty and Lemaréchal, 2001, Corollary E.1.4.4). Thus ˆℓG( G(c), y) = F( G(c)) + G(y) G(c), y = G(c), c G(c) + G(y) G(c), y = G(y) G(c) G(c), y c = ℓG(c, y), as claimed. Proof [of Proposition 18] As G is convex and differentiable on X, it must be continuously differentiable (Urruty and Lemaréchal, 2001, Remark D.6.2.6), and thus G is continuous on X. We conclude that ℓG satisfies A4 (continuity). Memoryless Sequences for General Losses As G is strictly convex and closed, F is continuously differentiable on the interior of its domain, denoted int dom F (Urruty and Lemaréchal, 2001, Theorem E.4.1.1). Define ˆ X := G(X); we will now show ˆ X int dom F. As G is closed, we have s G(x) x F(s) (Rockafellar, 1997, Theorem 23.5). Now suppose s = G(x) / int dom F. By the above, we have F(s) = , and therefore F(s) must be an unbounded set (Rockafellar, 1997, Theorem 23.4). Taking any x F(s), x = x, we have again by the above that G(x) = s G(x ), which contradicts the strict convexity of G. Now as F is continuous on ˆ X int dom F, ˆ X must be an open set as the preimage of X under F. Moreover, the gradient maps G and F (restricted to X and ˆ X, respectively) are inverses of each other, and therefore homeomorphisms. We have now established A1 (for ˆ X) and A2. Letting ˆℓG : ˆ X Y R be the mixed Bregman divergence with first argument restricted to ˆ X, we have properties A3 A5 for ˆℓG. (Convexity is immediate, and continuity and differentiability follow from continuity of G, G, F, and F.) Now Proposition 17 applies with ℓ= ℓG, ˆℓ= ˆℓG, and ψ(x) = G(x). Jacob Abernethy and Rafael Frongillo. A characterization of scoring rules for linear properties. In Proceedings of the Conference on Learning Theory, pp. 1 27, 2012. Jacob Abernethy, Yiling Chen, and Jennifer Wortman Vaughan. Efficient market making via convex optimization, and a connection to online learning. ACM Transactions on Economics and Computation, 1(2):1 39, 2013. Arpit Agarwal and Shivani Agarwal. On consistent surrogate risk minimization and property elicitation. In Proceedings of the Conference on Learning Theory, pp. 4-22, 2015. Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1 3, 1950. Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. Nicolo Cesa-Bianchi and Gábor Lugosi. On prediction of individual sequences. The Annals of Statistics, 27(6):1865 1895, 1999. Elise Coudin and Jean-Marie Dufour. Finite-sample distribution-free inference in linear median regressions under heteroscedasticity and non-linear dependence of unknown form. The Econometrics Journal, 12:S19 S49, 2009. A. Philip Dawid and Vladimir G. Vovk. Prequential probability: principles and properties. Bernoulli, 5(1):125 162, 1999. Rodney G. Downey and Denis R. Hirschfeldt. Algorithmic Randomness and Complexity. Springer Science & Business Media, 2010. Sándor P. Fekete, Joseph S.B. Mitchell, and Karin Beurer. On the continuous Fermat-Weber problem. Operations Research, 53(1):61 76, 2005. Frongillo and Nobel Dean P. Foster and Rakesh Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1):7 35, 1999. Rafael Frongillo and Ian Kash. Vector-valued property elicitation. In Proceedings of the Conference on Learning Theory, pp. 710 727, 2015. Rafael Frongillo and Bo Waggoner. An axiomatic study of scoring rule markets. In Innovations in Theoretical Computer Science, 2018. Tilmann Gneiting. Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746 762, 2011. Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, 2007. Irving J. Good. Rational decisions. Journal of the Royal Statistical Society, Series B (Methodological), 14(1):107 114, 1952. Geoffrey J. Gordon. Regret bounds for prediction problems. In Proceedings of the 12th annual conference on Computational learning theory, pages 29 40. ACM, 1999. Robin Hanson. Combinatorial information market design. Information Systems Frontiers, 5(1):107 119, 2003. David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906 1925, 1998. Andrei N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1 7, 1965. Nicolas S. Lambert and Yoav Shoham. Eliciting truthful answers to multiple-choice questions. In Proceedings of the 10th ACM Conference on Electronic Commerce, pp. 109 118, 2009. Nicolas S. Lambert, David M. Pennock, and Yoav Shoham. Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, pp. 129 138, 2008. Nicolas S. Lambert. Elicitation and evaluation of statistical forecasts. Working paper, 2019. URL https://web.stanford.edu/~nlambert/papers/elicitability.pdf. Oliver Linton and Yoon-Jae Whang. The quantilogram: With an application to evaluating directional predictability. Journal of Econometrics, 141(1):250 282, 2007. Per Martin-Löf. The definition of random sequences. Information and Control, 9(6):602 619, 1966. John Mc Carthy. Measures of the value of information. Proceedings of the National Academy of Sciences of the United States of America, 42(9):654, 1956. Memoryless Sequences for General Losses Neri Merhav and Meir Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124 2147, 1998. Richard von Mises. Grundlagen der wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 5(191):52 99, 1919. Whitney K. Newey and James L. Powell. Asymmetric least squares estimation and testing. Econometrica: Journal of the Econometric Society, 819 847, 1987. Andrew B. Nobel. Some stochastic properties of memoryless individual sequences. IEEE Transactions on Information Theory, 50(7):1497 1505, 2004. Kent Harold Osband. Providing Incentives for Better Cost Forecasting (Prediction, Uncertainty Elicitation). University of California, Berkeley, 1987. Mark D. Reid and Robert C. Williamson. Composite binary losses. The Journal of Machine Learning Research, 11:2387 2422, 2010. R. Tyrell Rockafellar. Convex Analysis. Princeton University Press, 1997. Leonard J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783 801, 1971. Glenn Shafer and Vladimir Vovk. Probability and Finance: It s Only a Game! John Wiley & Sons, 2005. Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107 194, 2012. Ingo Steinwart, Chloé Pasin, Robert Williamson, and Siyu Zhang. Elicitation and identification of properties. In Proceedings of the Conference on Learning Theory, pp. 482 526, 2014. Jean-Baptiste Hiriart Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001. Vladimir A. Uspenskii, Alexei L. Semenov, and Alexander Kh. Shen. Can an individual sequence of zeros and ones be random? Russian Mathematical Surveys, 45(1):121-189, 1990. Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, June 2000. Elodie Vernet, Rorbert C. Williamson, and Mark D. Reid. Composite multiclass losses. The Journal of Machine Learning Research, 17(1):7860 7911, 2016. Vladimir G. Vovk. Probability theory for the Brier game. Theoretical Computer Science, 261(1):57 79, June 2001. Vladimir G. Vovk. Laws of probabilities in efficient markets. http://www. probabilityandfinance.com/GTP2014/Slides/Vovk.pdf. November 2014. Frongillo and Nobel Vladimir G. Vovk. The law of the iterated logarithm for random Kolmogorov, or chaotic, sequences. Theory of Probability & Its Applications, 32(3):413 425, 1988. Vladimir V. V yugin. Effective convergence in probability and an ergodic theorem for individual random sequences. Theory of Probability & Its Applications, 42(1):39 50, 1998. Justin Wolfers and Eric Zitzewitz. Prediction Markets. Journal of Economic Perspective, 18(2):107 126, 2004.