Journal of Machine Learning Research 20 (2019) 1-29. Submitted 8/17; Revised 1/19; Published 9/19.

# Generalized Maximum Entropy Estimation

Tobias Sutter (tobias.sutter@epfl.ch), Risk Analytics and Optimization Chair, EPFL, Switzerland
David Sutter (suttedav@phys.ethz.ch), Institute for Theoretical Physics, ETH Zurich, Switzerland
Peyman Mohajerin Esfahani (P.MohajerinEsfahani@tudelft.nl), Delft Center for Systems and Control, TU Delft, The Netherlands
John Lygeros (lygeros@control.ee.ethz.ch), Department of Electrical Engineering and Information Technology, ETH Zurich, Switzerland

Editor: Benjamin Recht

Abstract. We consider the problem of estimating a probability distribution that maximizes the entropy while satisfying a finite number of moment constraints, possibly corrupted by noise. Based on duality of convex programming, we present a novel approximation scheme using a smoothed fast gradient method that is equipped with explicit bounds on the approximation error. We further demonstrate how the presented scheme can be used for approximating the chemical master equation through the zero-information moment closure method, and for an approximate dynamic programming approach in the context of constrained Markov decision processes with uncountable state and action spaces.

Keywords: entropy maximization, convex optimization, relative entropy minimization, fast gradient method, approximate dynamic programming

1. Introduction

This article investigates the problem of estimating an unknown probability distribution given a finite number of observed moments that might be corrupted by noise. Given that the observed moments are consistent, i.e., there exists a probability distribution that satisfies all the moment constraints, the problem is underdetermined and has infinitely many solutions. This raises the question of which solution to choose. A natural choice would be to pick the one with the highest entropy, called the Max Ent distribution.
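As a simple illustration of the finite special case (an example not taken from this paper): Jaynes' classical dice problem asks for the Max Ent distribution on {1, ..., 6} with prescribed mean 4.5. The maximizer has the exponential form $p_i \propto \exp(\lambda i)$, and the scalar dual variable $\lambda$ can be found by bisection on the mean, a minimal sketch of which is:

```python
import math

# Max Ent distribution on {1,...,6} subject to the single moment
# constraint E[X] = 4.5 (Jaynes' dice example). The maximizer has the
# exponential form p_i proportional to exp(lam * i); we locate lam by
# bisection, using that the mean is increasing in lam.

def mean_of(lam):
    w = [math.exp(lam * i) for i in range(1, 7)]
    z = sum(w)
    return sum(i * wi for i, wi in zip(range(1, 7), w)) / z

lo, hi = -5.0, 5.0
for _ in range(100):                 # bisection on the dual variable
    mid = 0.5 * (lo + hi)
    if mean_of(mid) < 4.5:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
w = [math.exp(lam * i) for i in range(1, 7)]
z = sum(w)
p = [wi / z for wi in w]             # the Max Ent distribution
```

With a mean above the uniform value 3.5, the resulting distribution tilts toward the larger faces, as the exponential family form suggests.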
(© 2019 Tobias Sutter, David Sutter, Peyman Mohajerin Esfahani, John Lygeros. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-486.html.)

The main reason why the Max Ent distribution is a natural choice is a concentration phenomenon described by Jaynes (2003):

> If the information incorporated into the maximum-entropy analysis includes all the constraints actually operating in the random experiment, then the distribution predicted by maximum entropy is overwhelmingly the most likely to be observed experimentally.

See Jaynes (2003); Grünwald (2008) for a rigorous statement. This maximum entropy estimation problem subject to moment constraints, also known as the principle of maximum entropy, is applicable to large classes of problems in the natural and social sciences, in particular in economics; see Golan (2008) for a comprehensive survey. Furthermore, it has important applications in approximation methods for dynamical systems, such as in systems biology, where Max Ent distributions are key objects in the so-called moment closure method to approximate the chemical master equation Smadbeck and Kaznessis (2013), or more recently in approximate dynamic programming Mohajerin Esfahani et al. (2018), where Max Ent distributions act as a regularizer, leading to computationally more efficient optimization programs. Their operational significance motivates the study of numerical methods to compute Max Ent distributions, which are the solutions of an infinite-dimensional convex optimization problem and as such computationally intractable in general.
Since the Max Ent distribution subject to a finite number of moment constraints (if it exists) belongs to the exponential family of distributions Csiszár (1975), its computation can be reduced to solving a system of nonlinear equations whose dimension equals the number of moment constraints Mead and Papanicolaou (1984). Furthermore, the system of nonlinear equations involves evaluating integrals over the support set K that are computationally difficult in general. Even if K is finite, finding the Max Ent distribution is not straightforward, since solving a system of nonlinear equations can be computationally demanding. In this article, we present a new approximation scheme to minimize the relative entropy subject to noisy moment constraints. This is a generalization of the maximum entropy problem introduced above and extends the principle of maximum entropy to the so-called principle of minimum discriminating information Kullback (1959). We show that its dual problem exhibits a particularly favorable structure that allows us to apply Nesterov's smoothing method Nesterov (2005) and hence tackle the presented problem with a fast gradient method, obtaining provable convergence properties, unlike Lasserre (2009). Computing the Max Ent distribution has applications in randomized rounding and the design of approximation algorithms. More precisely, it has been shown how to improve the approximation ratio for the symmetric and the asymmetric traveling salesman problem via Max Ent distributions Asadpour et al. (2010); Gharan et al. (2011). Often, it is important to compute the Max Ent distribution efficiently. For example, the zero-information moment closure method Smadbeck and Kaznessis (2013) (see Section 6), a recent approximate dynamic programming method for constrained Markov decision processes (see Section 7), as well as the approximation of the channel capacity of a large class of memoryless channels Sutter et al.
(2015) deal with iterative algorithms that require the numerical computation of the Max Ent distribution in each iteration step.

Related results. Before comparing the approach presented in this article with existing methods, we provide a brief digression on the moment problem. Consider a one-dimensional moment problem formulated as follows: Given a set $K \subseteq \mathbb{R}$ and a sequence $(y_i)_{i \in \mathbb{N}} \subset \mathbb{R}$ of moments, does there exist a measure $\mu$ supported on $K$ such that

$$\int_K x^i\,\mu(\mathrm{d}x) = y_i \quad \text{for all } i \in \mathbb{N}\,? \tag{1}$$

For $K = \mathbb{R}$ and $K = [a, b]$ with $-\infty < a < b < \infty$, the above moment problem is known as the Hamburger moment problem and the Hausdorff moment problem, respectively. If the moment sequence is finite, the problem is called a truncated moment problem. In both the full and the truncated case, a measure $\mu$ that satisfies (1) is called a representing measure of the sequence $(y_i)_{i\in\mathbb{N}}$. If a representing measure is unique, it is said to be determined by its moments. From the Stone-Weierstrass theorem it follows directly that every nontruncated representing measure with compact support is determined by its moments. In the Hamburger moment problem, given a representing measure $\mu$ for a moment sequence $(y_i)_{i\in\mathbb{N}}$, a sufficient condition for $\mu$ being determined by its moments is the so-called Carleman condition, i.e., $\sum_{i=1}^{\infty} y_{2i}^{-1/(2i)} = \infty$. Roughly speaking, this says that the moments should not grow too fast; see Akhiezer (1965) for further details. For the Hamburger and the Hausdorff moment problem, there are necessary and sufficient conditions for the existence of a representing measure for a given moment sequence $(y_i)_{i\in\mathbb{N}}$, in both the full and the truncated setting, that exploit the rich algebraic connection with Hankel matrices; see (Lasserre, 2009, Theorems 3.2, 3.3, 3.4). In (Lasserre, 2009, Section 12.3) it is shown that the maximum entropy distribution subject to finite moment constraints can be approximated by using duality of convex programming.
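The Hankel-matrix connection mentioned above can be illustrated numerically (a sketch, not code from the paper): positive semidefiniteness of the Hankel matrix built from $(y_0, \ldots, y_{2n})$ is a necessary condition for the existence of a representing measure on $\mathbb{R}$; the full necessary-and-sufficient statements are in Lasserre (2009).

```python
import numpy as np

# Necessary condition for the truncated Hamburger problem: if
# (y_0,...,y_{2n}) admits a representing measure on R, the Hankel
# matrix H_n with H_n[i, j] = y_{i+j} must be positive semidefinite.

def hankel_psd(y, tol=1e-10):
    n = (len(y) - 1) // 2
    H = np.array([[y[i + j] for j in range(n + 1)] for i in range(n + 1)])
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Moments of the uniform measure on [0, 1]: y_i = 1/(i+1) (a Hilbert
# matrix, which is positive definite).
y_good = [1 / (i + 1) for i in range(5)]
# A sequence with y_2 < y_1^2, which violates Cauchy-Schwarz and hence
# cannot come from any measure.
y_bad = [1.0, 0.9, 0.1, 0.05, 0.02]
```

Here `hankel_psd` is an illustrative helper; checking the eigenvalues of the Hankel matrix is only the necessary half of the characterization.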
The problem can be reduced to an unconstrained finite-dimensional convex optimization problem, and an approximation hierarchy of its gradient and Hessian in terms of two single semidefinite programs involving two linear matrix inequalities is presented. The desired accuracy is controlled by the size of the linear matrix inequalities. The method seems to be powerful in practice; however, a rate of convergence has not been proven. Furthermore, it is not clear how the method extends to the case of uncertain moment constraints. In a finite-dimensional setting, Dudik et al. (2007) present a treatment of the maximum entropy principle with generalized regularization measures, which contains the setting presented here as a special case. However, convergence rates of the algorithms presented there are not known, and again it is not clear how the method extends to the case of uncertain moment constraints. The case of a discrete support set K has been studied in more detail in the past. It has been shown that the maximum entropy problem in the discrete case has a succinct description that is polynomial-size in the input and can be efficiently computed Singh and Vishnoi (2014); Straszak and Vishnoi (2017). Furthermore, it was shown that the maximum entropy problem is equivalent to the counting problem Singh and Vishnoi (2014).

Structure. The layout of this paper is as follows: In Section 2 we formally introduce the problem setting. Our results on an approximation scheme in a continuous setting are reported in Section 3. In Section 4, we show how these results simplify in the finite-dimensional case. Section 5 discusses the gradient approximation, which is the dominant step of the proposed approximation method from a computational perspective. The theoretical results are applied in Section 6 to the zero-information moment closure method and in Section 7 to constrained Markov decision processes.
We conclude in Section 8 with a summary of our work and comment on possible subjects of further research.

Notation. The logarithm with base 2 and base e is denoted by $\log(\cdot)$ and $\ln(\cdot)$, respectively. We define the standard $n$-simplex as $\Delta_n := \{x \in \mathbb{R}^n : x \geq 0,\ \sum_{i=1}^n x_i = 1\}$. For a probability mass function $p \in \Delta_n$ we denote its entropy by $H(p) := -\sum_{i=1}^n p_i \log p_i$. Let $B(y, r) := \{x \in \mathbb{R}^n : \|x - y\|_2 \leq r\}$ denote the ball with radius $r$ centered at $y$. Throughout this article, measurability always refers to Borel measurability. For a probability density $p$ supported on a measurable set $B \subseteq \mathbb{R}$ we denote the differential entropy by $h(p) := -\int_B p(x) \log p(x)\,\mathrm{d}x$. For $A \subseteq \mathbb{R}$ and $1 \leq p \leq \infty$, let $L^p(A)$ denote the space of $L^p$-functions on the measure space $(A, \mathcal{B}(A), \mathrm{d}x)$, where $\mathcal{B}(A)$ denotes the Borel $\sigma$-algebra and $\mathrm{d}x$ the Lebesgue measure. Let $X$ be a compact metric space, equipped with its Borel $\sigma$-field $\mathcal{B}(X)$. The space of all probability measures on $(X, \mathcal{B}(X))$ will be denoted by $\mathcal{P}(X)$. The relative entropy (or Kullback-Leibler divergence) between any two probability measures $\mu, \nu \in \mathcal{P}(X)$ is defined by

$$D(\mu\|\nu) := \begin{cases} \int_X \log\!\left(\dfrac{\mathrm{d}\mu}{\mathrm{d}\nu}\right) \mathrm{d}\mu, & \text{if } \mu \ll \nu \\ +\infty, & \text{otherwise}, \end{cases}$$

where $\ll$ denotes absolute continuity of measures and $\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$ is the Radon-Nikodym derivative. The relative entropy is non-negative, and is equal to zero if and only if $\mu = \nu$. Let $X$ be restricted to a compact metric space and consider the pair of vector spaces $(\mathcal{M}(X), \mathcal{B}(X))$, where $\mathcal{M}(X)$ denotes the space of finite signed measures on $\mathcal{B}(X)$ and $\mathcal{B}(X)$ is the Banach space of bounded measurable functions on $X$ with respect to the sup-norm, together with the bilinear form $\langle \mu, f \rangle := \int_X f(x)\,\mu(\mathrm{d}x)$. This induces the total variation norm as the dual norm on $\mathcal{M}(X)$, since by (Hernández-Lerma and Lasserre, 1999, p. 2) $\|\mu\| = \sup_{\|f\| \leq 1} \langle \mu, f \rangle = \|\mu\|_{\mathrm{TV}}$, making $\mathcal{M}(X)$ a Banach space. In the light of (Hernández-Lerma and Lasserre, 1999, p. 206) this is a dual pair of Banach spaces; we refer to (Anderson and Nash, 1987, Section 3) for the details of the definition of dual pairs.
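For discrete measures, the relative entropy just defined can be evaluated directly; a minimal sketch (log base 2 as in the notation above; `kl_bits` is an illustrative helper, not from the paper):

```python
import math

# Relative entropy D(mu || nu) in bits (log base 2), for discrete
# measures given as probability vectors on a common finite support.
# Returns +inf when mu is not absolutely continuous w.r.t. nu.

def kl_bits(mu, nu):
    d = 0.0
    for m, n in zip(mu, nu):
        if m > 0 and n == 0:
            return float("inf")   # mu not << nu
        if m > 0:
            d += m * math.log2(m / n)
    return d
```

For instance, a point mass against the uniform measure on four atoms gives $\log 4 = 2$ bits, and any mass that $\nu$ does not support yields $+\infty$, matching the definition above.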
The Lipschitz norm is defined as $\|u\|_L := \sup_{x, x' \in X} \{|u(x)|,\ |u(x) - u(x')| / \|x - x'\|\}$ and $\mathcal{L}(X)$ denotes the space of Lipschitz functions on $X$.

2. Problem Statement

Let $K \subset \mathbb{R}$ be compact and consider the scenario where a probability measure $\mu \in \mathcal{P}(K)$ is unknown and only observed via the following measurement model

$$y_i = \langle \mu, x^i \rangle + u_i, \quad u_i \in U_i \quad \text{for } i = 1, \ldots, M, \tag{2}$$

where $u_i$ represents the uncertainty of the obtained data point $y_i$ and $U_i \subset \mathbb{R}$ is compact, convex and $0 \in U_i$ for all $i = 1, \ldots, M$. Given the data $(y_i)_{i=1}^M \subset \mathbb{R}$, the goal is to estimate a probability measure $\mu$ that is consistent with the measurement model (2). This problem (given that $M$ is finite) is underdetermined and has infinitely many solutions. Among all possible solutions for (2), we aim to find the solution that maximizes the entropy. Define the set $T := \prod_{i=1}^M \{y_i - u : u \in U_i\} \subset \mathbb{R}^M$ and the linear operator $A : \mathcal{M}(K) \to \mathbb{R}^M$ by

$$(A\mu)_i := \langle \mu, x^i \rangle = \int_K x^i\,\mu(\mathrm{d}x) \quad \text{for all } i = 1, \ldots, M.$$

The operator norm is defined as $\|A\| := \sup_{\|\mu\|_{\mathrm{TV}} = 1,\ \|y\|_2 = 1} \langle A\mu, y \rangle$. Note that due to the compactness of $K$ the operator norm is bounded; see Lemma 4 for a formal statement. The adjoint operator of $A$ is given by $A^* : \mathbb{R}^M \to \mathcal{B}(K)$, where $A^* z(x) := \sum_{i=1}^M z_i x^i$; note that the domain and image spaces of the adjoint operator are well defined since $(\mathcal{B}(K), \mathcal{M}(K))$ is a topological dual pair and the operator $A$ is bounded (Hernández-Lerma and Lasserre, 1999, Proposition 12.2.5). Given a reference measure $\nu \in \mathcal{P}(K)$, the problem of minimizing the relative entropy subject to the moment constraints (2) can be formally described by

$$J^\star = \min_{\mu \in \mathcal{P}(K)} \big\{ D(\mu\|\nu) : A\mu \in T \big\}. \tag{3}$$

We note that the reference measure $\nu \in \mathcal{P}(K)$ is always fixed a priori. A typical choice for $\nu$ is the uniform measure over $K$.

Proposition 1 (Existence & uniqueness of (3)) The optimization problem (3) attains an optimal feasible solution that is unique.
Proof The variational representation of the relative entropy (Boucheron et al., 2013, Corollary 4.15) implies that the mapping $\mu \mapsto D(\mu\|\nu)$ is lower semicontinuous Luenberger (1969). Note also that the space of probability measures on $K$ is compact (Aliprantis and Border, 2007, Theorem 15.11). Moreover, since the linear operator $A$ is bounded, it is continuous. As a result, the feasible set of problem (3) is compact and hence the optimization problem attains an optimal solution. Finally, the strict convexity of the relative entropy Csiszár (1975) ensures uniqueness of the optimizer.

Note that if $U_i = \{0\}$ for all $i = 1, \ldots, M$, i.e., there is no uncertainty in the measurement model (2), Proposition 1 reduces to a known result Csiszár (1975). Consider the special case where the reference measure $\nu$ is the uniform measure on $K$ and let $p$ denote the Radon-Nikodym derivative $\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$ (whose existence can be assumed without loss of generality). Since $A$ is weakly continuous and the differential entropy is known to be weakly lower semicontinuous Boucheron et al. (2013), we can restrict attention to a (weakly) dense subset of the feasible set and hence assume without loss of generality that $p \in L^1(K)$. Problem (3) then reduces to

$$\max_{p \in L^1(K)} \Big\{ h(p) : \int_K p(x)\,\mathrm{d}x = 1,\ \int_K x^i p(x)\,\mathrm{d}x \in T_i,\ i = 1, \ldots, M \Big\}. \tag{4}$$

Problem (4) is a generalized maximum entropy estimation problem that, in case $U_i = \{0\}$ for all $i = 1, \ldots, M$, simplifies to the standard entropy maximization problem subject to $M$ moment constraints. In this article, we present a new approach to solve (3) that is based on its dual formulation. It turns out that the dual problem of (3) has a particular structure that allows us to apply Nesterov's smoothing method Nesterov (2005) to accelerate convergence. Furthermore, we will show how an ε-optimal solution to (3) can be reconstructed. This is done by solving the dual problem of (3).
To achieve additional feasibility guarantees for the ε-optimal solution to (3), we introduce a second smoothing step that is motivated by Devolder et al. (2012). The problem of entropy maximization subject to uncertain moment constraints (4) can be seen as a special case of (3).

3. Relative entropy minimization

We start by recalling that an unconstrained minimization of the relative entropy with an additional linear term in the cost admits a closed-form solution. Let $c \in \mathcal{B}(K)$, $\nu \in \mathcal{P}(K)$ and consider the optimization problem

$$\min_{\mu \in \mathcal{P}(K)}\ D(\mu\|\nu) - \langle \mu, c \rangle. \tag{5}$$

Lemma 2 (Gibbs distribution) The unique optimizer of problem (5) is given by the Gibbs distribution, i.e.,

$$\mu^\star(\mathrm{d}x) = \frac{2^{c(x)}\,\nu(\mathrm{d}x)}{\int_K 2^{c(x)}\,\nu(\mathrm{d}x)} \quad \text{for } x \in K,$$

which leads to the optimal value $-\log \int_K 2^{c(x)}\,\nu(\mathrm{d}x)$.

Proof The result is standard and follows from Csiszár (1975), or alternatively from (Sutter et al., 2015, Lemma 3.10).

Let $\mathbb{R}^M \ni z \mapsto \sigma_T(z) := \max_{x \in T} \langle x, z \rangle \in \mathbb{R}$ denote the support function of $T$, which is continuous since $T$ is compact (Rockafellar, 1997, Corollary 13.2.2). The primal-dual pair of problem (3) can be stated as

$$\text{(primal program)}: \quad J^\star = \min_{\mu \in \mathcal{P}(K)} \Big\{ D(\mu\|\nu) + \sup_{z \in \mathbb{R}^M} \big[ \langle A\mu, z \rangle - \sigma_T(z) \big] \Big\} \tag{6}$$

$$\text{(dual program)}: \quad J^\star_D = \sup_{z \in \mathbb{R}^M} \Big\{ -\sigma_T(z) + \min_{\mu \in \mathcal{P}(K)} \big[ D(\mu\|\nu) + \langle A\mu, z \rangle \big] \Big\}, \tag{7}$$

where the dual function is given by

$$F(z) = -\sigma_T(z) + \min_{\mu \in \mathcal{P}(K)} \big\{ D(\mu\|\nu) + \langle A\mu, z \rangle \big\}. \tag{8}$$

Note that the primal program (6) is an infinite-dimensional convex optimization problem. The key idea of our analysis is driven by Lemma 2, which indicates that the dual function, although it involves a minimization over an infinite-dimensional space, is analytically available. As such, the dual problem becomes an unconstrained finite-dimensional convex optimization problem, which is amenable to first-order methods.

Lemma 3 (Zero duality gap) There is no duality gap between the primal program (6) and its dual (7), i.e., $J^\star = J^\star_D$. Moreover, if there exists $\bar\mu \in \mathcal{P}(K)$ such that $A\bar\mu \in \mathrm{int}(T)$, then the set of optimal dual variables in (7) is compact.
Proof Recall that the relative entropy is known to be lower semicontinuous and convex in the first argument, which can be seen as a direct consequence of the duality relation for the relative entropy (Boucheron et al., 2013, Corollary 4.15). Hence, the desired zero duality gap follows by Sion's minimax theorem (Sion, 1958, Theorem 4.2). The compactness of the set of dual optimizers is due to (Bertsekas, 2009, Proposition 5.3.1).

Because the dual function (8) turns out to be non-smooth, in the absence of any additional structure the efficiency estimate of a black-box first-order method is of order $O(1/\varepsilon^2)$, where $\varepsilon$ is the desired absolute additive accuracy of the approximate solution in function value Nesterov (2004). We show, however, that the generalized entropy maximization problem (6) has a certain structure that allows us to deploy the recent developments in Nesterov (2005) for approximating non-smooth problems by smooth ones, leading to an efficiency estimate of order $O(1/\varepsilon)$. This, together with the low complexity of each iteration step in the approximation scheme, offers a numerical method that has an attractive computational complexity. In the spirit of Nesterov (2005); Devolder et al. (2012), we introduce a smoothing parameter $\eta := (\eta_1, \eta_2) \in \mathbb{R}^2_{>0}$ and consider a smooth approximation of the dual function

$$F_\eta(z) := -\max_{x \in T} \Big\{ \langle x, z \rangle - \frac{\eta_1}{2}\|x\|_2^2 \Big\} + \min_{\mu \in \mathcal{P}(K)} \big\{ D(\mu\|\nu) + \langle A\mu, z \rangle \big\} - \frac{\eta_2}{2}\|z\|_2^2, \tag{9}$$

with respective optimizers denoted by $x^\star_z$ and $\mu^\star_z$. Consider the projection operator $\pi_T : \mathbb{R}^M \to T$, $\pi_T(y) = \arg\min_{x \in T} \|x - y\|_2^2$. It is straightforward to see that the optimizer $x^\star_z$ is given by

$$x^\star_z = \arg\min_{x \in T} \big\|x - \eta_1^{-1} z\big\|_2^2 = \pi_T\big(\eta_1^{-1} z\big).$$
Hence, the complexity of computing $x^\star_z$ is determined by the projection operator onto $T$; for simple enough cases (e.g., 2-norm balls, hypercubes) the solution is analytically available, while for more general cases (e.g., the simplex, 1-norm balls) it can be computed at relatively low computational effort; see (Richter, 2012, Section 5.4) for a comprehensive survey. The optimizer $\mu^\star_z$, according to Lemma 2, is given by

$$\mu^\star_z(B) = \frac{\int_B 2^{-A^* z(x)}\,\nu(\mathrm{d}x)}{\int_K 2^{-A^* z(x)}\,\nu(\mathrm{d}x)}, \quad \text{for all } B \in \mathcal{B}(K).$$

Lemma 4 (Lipschitz gradient) The dual function $F_\eta$ defined in (9) is $\eta_2$-strongly concave and differentiable. Its gradient $\nabla F_\eta(z) = -x^\star_z + A\mu^\star_z - \eta_2 z$ is Lipschitz continuous with Lipschitz constant $\frac{1}{\eta_1} + \sum_{i=1}^M B^{2i} + \eta_2$, where $B := \max\{|x| : x \in K\}$.

Proof The proof follows along the lines of (Nesterov, 2005, Theorem 1), in particular by recalling that the relative entropy (in the first argument) is strongly convex with convexity parameter one, together with Pinsker's inequality, which says that for any $\mu \in \mathcal{P}(K)$ we have

$$\|\mu - \nu\|_{\mathrm{TV}} \leq \sqrt{2\,D(\mu\|\nu)}. \tag{10}$$

Moreover, we use the bound

$$\|A\| = \sup_{\lambda \in \mathbb{R}^M,\ \mu \in \mathcal{P}(K)} \big\{ \langle A\mu, \lambda \rangle : \|\lambda\|_2 = 1,\ \|\mu\|_{\mathrm{TV}} = 1 \big\} \leq \sup_{\lambda,\ \mu} \big\{ \|A\mu\|_2\,\|\lambda\|_2 : \|\lambda\|_2 = 1,\ \|\mu\|_{\mathrm{TV}} = 1 \big\} \tag{11}$$

$$\leq \sup_{\mu \in \mathcal{P}(K)} \big\{ \|A\mu\|_1 : \|\mu\|_{\mathrm{TV}} = 1 \big\} = \sup_{\mu \in \mathcal{P}(K)} \Big\{ \sum_{i=1}^M \Big| \int_K x^i\,\mu(\mathrm{d}x) \Big| : \|\mu\|_{\mathrm{TV}} = 1 \Big\} \leq \sum_{i=1}^M B^i,$$

where (11) is due to the Cauchy-Schwarz inequality.

Note that $F_\eta$ is $\eta_2$-strongly concave and, according to Lemma 4, its gradient is Lipschitz continuous with constant $L(\eta) := \frac{1}{\eta_1} + \|A\|^2 + \eta_2$. We finally consider the approximate dual program given by

$$\text{(smoothed dual program)}: \quad J^\star_\eta = \sup_{z \in \mathbb{R}^M} F_\eta(z). \tag{12}$$

It turns out that (12) belongs to a favorable class of smooth and strongly convex optimization problems that can be solved by the fast gradient method given in Algorithm 1 (see Nesterov (2004)) with an efficiency estimate of the order $O(1/\sqrt{\varepsilon})$.
Algorithm 1: Optimal scheme for smooth & strongly convex optimization

Choose $w_0 = y_0 \in \mathbb{R}^M$ and $\eta \in \mathbb{R}^2_{>0}$.
For $k \geq 0$ do
  Step 1: Set $y_{k+1} = w_k + \frac{1}{L(\eta)}\,\nabla F_\eta(w_k)$
  Step 2: Compute $w_{k+1} = y_{k+1} + \frac{\sqrt{L(\eta)} - \sqrt{\eta_2}}{\sqrt{L(\eta)} + \sqrt{\eta_2}}\,(y_{k+1} - y_k)$
[The stopping criterion is explained in Remark 7.]

Under an additional regularity assumption, solving the smoothed dual problem (12) provides an estimate of the primal and dual variables of the original non-smooth problems (6) and (7), respectively, as summarized in the next theorem (Theorem 5). The main computational difficulty of the presented method lies in the gradient evaluation $\nabla F_\eta$. We refer to Section 5 for a detailed discussion of this subject.

Assumption 1 (Slater point) There exists a strictly feasible solution to (3), i.e., $\mu_0 \in \mathcal{P}(K)$ such that $A\mu_0 \in T$ and $\delta := \min_{y \in T^c} \|A\mu_0 - y\|_2 > 0$.

Note that finding a Slater point $\mu_0$ such that Assumption 1 holds can in general be difficult. In Remark 8 we present a constructive way of finding such an interior point. Given Assumption 1, for $\varepsilon > 0$ define

$$C := D(\mu_0\|\nu), \qquad D := \tfrac{1}{2}\max_{x \in T}\|x\|_2^2, \qquad \eta_1(\varepsilon) := \frac{\varepsilon}{4D}, \qquad \eta_2(\varepsilon) := \frac{\varepsilon\,\delta^2}{\varepsilon^2\delta^2 + 2\|A\|^2 C^2}, \tag{13}$$

and let $N_1(\varepsilon)$ and $N_2(\varepsilon)$ denote the corresponding iteration bounds, which grow as $O\big((1/\varepsilon)\ln(1/\varepsilon)\big)$ in terms of the constants $C$, $D$, $\delta$ and $\|A\|$.

Theorem 5 (Convergence rate) Given Assumption 1 and (13), let $\varepsilon > 0$ and $N(\varepsilon) := \max\{N_1(\varepsilon), N_2(\varepsilon)\}$. Then $N(\varepsilon)$ iterations of Algorithm 1 produce approximate solutions to the problems (7) and (6) given by

$$\hat z_{k,\eta} := y_k \quad \text{and} \quad \hat\mu_{k,\eta}(B) := \frac{\int_B 2^{-A^*\hat z_{k,\eta}(x)}\,\nu(\mathrm{d}x)}{\int_K 2^{-A^*\hat z_{k,\eta}(x)}\,\nu(\mathrm{d}x)}, \quad \text{for all } B \in \mathcal{B}(K), \tag{14}$$

which satisfy

dual ε-optimality: $\quad 0 \leq J^\star - F(\hat z_{k(\varepsilon)}) \leq \varepsilon$ (15a)
primal ε-optimality: $\quad |D(\hat\mu_{k(\varepsilon)}\|\nu) - J^\star| \leq 2(1 + \sqrt{2})\,\varepsilon$ (15b)
primal ε-feasibility: $\quad d(A\hat\mu_{k(\varepsilon)}, T) \leq \dfrac{2\varepsilon\delta}{C}$ (15c)

where $d(\cdot, T)$ denotes the distance to the set $T$, i.e., $d(x, T) := \min_{y \in T} \|x - y\|_2$.

In some applications, Assumption 1 does not hold, as for example in the classical case where $U_i = \{0\}$ for all $i = 1, \ldots, M$, i.e., there is no uncertainty in the measurement model (2).
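Algorithm 1 can be sketched in a few lines; the following minimal illustration (not the paper's implementation) runs the two steps on a toy strongly concave quadratic standing in for $F_\eta$, with `L` and `mu2` playing the roles of $L(\eta)$ and $\eta_2$:

```python
import math

# Sketch of Algorithm 1: a gradient ascent step (Step 1) followed by a
# momentum step (Step 2) with the constant factor
# (sqrt(L) - sqrt(mu2)) / (sqrt(L) + sqrt(mu2)).

def fast_gradient(grad, z0, L, mu2, iters):
    q = (math.sqrt(L) - math.sqrt(mu2)) / (math.sqrt(L) + math.sqrt(mu2))
    y = w = list(z0)
    for _ in range(iters):
        g = grad(w)
        y_next = [wi + gi / L for wi, gi in zip(w, g)]          # Step 1
        w = [yn + q * (yn - yo) for yn, yo in zip(y_next, y)]   # Step 2
        y = y_next
    return y

# Toy objective F(z) = -0.5 * sum(a_i * (z_i - c_i)^2), maximized at c;
# gradient -a_i * (z_i - c_i), smoothness L = max(a), concavity mu2 = min(a).
a, c = [1.0, 10.0], [2.0, -1.0]
grad = lambda z: [-ai * (zi - ci) for ai, zi, ci in zip(a, z, c)]
z_hat = fast_gradient(grad, [0.0, 0.0], L=10.0, mu2=1.0, iters=200)
```

For the smoothed dual (12), `grad` would be $\nabla F_\eta$ from Lemma 4, and the linear convergence rate is governed by the ratio $\eta_2 / L(\eta)$.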
Moreover, in other cases, satisfying Assumption 1 via the construction described in Remark 8 might be computationally expensive. Interestingly, Algorithm 1 can be run irrespective of whether Assumption 1 holds or not, i.e., for any choice of $C$ and $\delta$. While the explicit error bounds of Theorem 5, as well as the a-posteriori error bound discussed below, then no longer hold, the asymptotic convergence is not affected.

Proof Using Assumption 1, note that the constant

$$\iota := \frac{D(\mu_0\|\nu) - \min_{\mu \in \mathcal{P}(K)} D(\mu\|\nu)}{\min_{y \in T^c} \|A\mu_0 - y\|_2} = \frac{C}{\delta}$$

can be shown to be an upper bound on the optimal dual multiplier (Nedić and Ozdaglar, 2008, Lemma 1), i.e., $\|z^\star\|_2 \leq \iota$. The dual function can be bounded from above by $C$, since weak duality ensures $F(z) \leq J^\star \leq D(\mu_0\|\nu) = C$ for all $z \in \mathbb{R}^M$. Moreover, if we recall the preparatory Lemmas 3 and 4, we are finally in a setting where the presented error bounds can be derived from Devolder et al. (2012); see Appendix A for a detailed derivation.

Theorem 5 directly implies that we need at most $O(1/\varepsilon)$ iterations of Algorithm 1 to achieve ε-optimality of the primal and dual solutions as well as an ε-feasible primal variable. Note that Theorem 5 provides an explicit bound on the so-called a-priori errors, together with approximate optimizers of the primal (6) and dual (7) problems. The latter allow us to derive an a-posteriori error depending on the approximate optimizers, which is often significantly smaller than the a-priori error.

Corollary 6 (Posterior error estimation) Given Assumption 1, the approximate primal and dual variables $\hat\mu$ and $\hat z$ given by (14) satisfy the following a-posteriori error bound

$$F(\hat z) \leq J^\star \leq D(\hat\mu\|\nu) + \frac{C}{\delta}\, d(A\hat\mu, T),$$

where $d(\cdot, T)$ denotes the distance to the set $T$, i.e., $d(x, T) := \inf_{y \in T} \|x - y\|_2$.

Proof The two key ingredients of the proof are Theorem 5 and the Lipschitz continuity of the so-called perturbation function of convex programming. Let $z^\star$ denote the dual optimizer of (7).
We introduce the perturbed program as

$$J^\star(\varepsilon) = \min_{\mu \in \mathcal{P}(K)} \big\{ D(\mu\|\nu) : d(A\mu, T) \leq \varepsilon \big\}$$
$$= \min_{\mu \in \mathcal{P}(K)} D(\mu\|\nu) + \sup_{\lambda \geq 0} \Big\{ \inf_{y \in T} \lambda\,\|A\mu - y\|_2 - \lambda\varepsilon \Big\}$$
$$= \sup_{\lambda \geq 0} \Big\{ -\lambda\varepsilon + \inf_{\mu \in \mathcal{P}(K),\, y \in T}\ \sup_{\|z\|_2 \leq \lambda} \big[ \langle A\mu - y, z \rangle + D(\mu\|\nu) \big] \Big\} \tag{16a}$$
$$= \sup_{\lambda \geq 0,\ \|z\|_2 \leq \lambda} \Big\{ -\lambda\varepsilon + \inf_{\mu \in \mathcal{P}(K),\, y \in T} \big[ \langle A\mu - y, z \rangle + D(\mu\|\nu) \big] \Big\} \tag{16b}$$
$$\geq -\|z^\star\|_2\,\varepsilon + \inf_{\mu \in \mathcal{P}(K),\, y \in T} \big[ \langle A\mu - y, z^\star \rangle + D(\mu\|\nu) \big] = -\|z^\star\|_2\,\varepsilon + J^\star.$$

Equation (16a) uses the strong duality property, which follows from the existence of a Slater point that is due to the definition of the set $T$; see Section 2. Step (16b) follows by Sion's minimax theorem (Sion, 1958, Theorem 4.2). Hence, we have shown that the perturbation function is Lipschitz continuous with constant $\|z^\star\|_2$. Finally, recalling $\|z^\star\|_2 \leq \frac{C}{\delta}$, established in the proof of Theorem 5, completes the proof.

Remark 7 (Stopping criterion of Algorithm 1) There are two alternatives for defining a stopping criterion for Algorithm 1. Choose a desired accuracy $\varepsilon > 0$.
(i) A-priori stopping criterion: Theorem 5 provides the required number of iterations $N(\varepsilon)$ to ensure an ε-close solution.
(ii) A-posteriori stopping criterion: Choose the smoothing parameter $\eta$ as in (13). Fix a (small) number of iterations $\ell$ that are run using Algorithm 1. Compute the a-posteriori error $D(\hat\mu\|\nu) + \frac{C}{\delta}\,d(A\hat\mu, T) - F(\hat z)$ according to Corollary 6; if it is smaller than $\varepsilon$, terminate the algorithm. Otherwise, continue with another $\ell$ iterations.

Remark 8 (Slater point computation) To compute the respective constants in Assumption 1, we need to construct a strictly feasible point for (3). For this purpose, we consider a polynomial density of degree $r$ defined as $p_r(\alpha, x) := \sum_{i=0}^{r-1} \alpha_i x^i$. For notational simplicity we assume that the support set is the unit interval ($K = [0, 1]$), so that the moments induced by the polynomial density are given by

$$\langle p_r(\alpha, \cdot), x^i \rangle = \int_0^1 \sum_{j=0}^{r-1} \alpha_j x^{j+i}\,\mathrm{d}x = \sum_{j=0}^{r-1} \frac{\alpha_j}{j + i + 1}, \quad \text{for } i = 0, \ldots, M.$$

Consider $\beta \in \mathbb{R}^{M+1}$, where $\beta_1 = 1$ and $\beta_i = y_{i-1}$ for $i = 2, \ldots, M+1$.
Hence, the feasibility requirement of (3) can be expressed as the linear constraint $\mathbf{A}\alpha = \beta$, where $\mathbf{A} \in \mathbb{R}^{(M+1) \times r}$, $\alpha \in \mathbb{R}^r$, $\beta \in \mathbb{R}^{M+1}$ and $\mathbf{A}_{i,j} = \frac{1}{i + j - 1}$, and finding a strictly feasible solution reduces to the following feasibility problem

$$\max_{\alpha \in \mathbb{R}^r}\ \text{const} \quad \text{s.t.} \quad \mathbf{A}\alpha = \beta, \quad p_r(\alpha, x) \geq 0 \ \ \forall x \in [0, 1], \tag{17}$$

where $p_r$ is a polynomial function in $x$ of degree $r$ with coefficients $\alpha$. The second constraint of program (17) (i.e., $p_r(\alpha, x) \geq 0$ for all $x \in [0, 1]$)¹ can be equivalently reformulated as linear matrix inequalities of dimension $r/2$, using a standard result in polynomial optimization; see (Lasserre, 2009, Chapter 2) for details. We note that for small degree $r$ the set of feasible solutions to problem (17) may be empty; however, by choosing $r$ large enough and assuming that the moments can be induced by a continuous density, problem (17) becomes feasible. Moreover, if $0 \in \mathrm{int}(T)$, the Slater point leads to a $\delta > 0$ in Assumption 1.

Example 1 (Density estimation) We are given the first 3 moments of an unknown probability measure defined on $K = [0, 1]$ as²

$$y := \left( \frac{1 - \ln 2}{\ln 2},\ \frac{\ln 4 - 1}{\ln 4},\ \frac{5 - \ln 64}{\ln 64} \right) \approx (0.44,\ 0.28,\ 0.20).$$

The uncertainty sets in the measurement model (2) are assumed to be $U_i = [-u, u]$ for all $i = 1, \ldots, 3$. A Slater point is constructed using the method described in Remark 8, where $r = 5$ is enough for problem (17) to be feasible, leading to the constant $C = 0.0288$. The Slater point is depicted in Figure 1, and its differential entropy can be numerically computed as $-0.0288$. We consider two simulations for two different uncertainty sets (namely, $u = 0.01$ and $u = 0.005$). The underlying maximum entropy problem (4) is solved using Algorithm 1. The respective features of the a-priori guarantees from Theorem 5 as well as the a-posteriori guarantees (upper and lower bounds) from Corollary 6 are reported in Table 1. Recall that $\hat\mu_{k(\varepsilon)}$ denotes the approximate primal variable after $k$ iterations of Algorithm 1 as defined in Theorem 5 and that $d(A\hat\mu_{k(\varepsilon)}, T)$ (resp. $\frac{2\varepsilon\delta}{C}$) represent the a-posteriori (resp.
a-priori) feasibility guarantees. It can be seen in Table 1 that increasing the uncertainty set $U$ leads to a higher entropy, where the uniform density clearly has the highest entropy. This is also intuitively expected, since enlarging the uncertainty set is equivalent to relaxing the moment constraints in the respective maximum entropy problem. The corresponding densities are graphically visualized in Figure 1.

¹ In a multi-dimensional setting one has to consider a tightening (i.e., $p_r(\alpha, x) > 0$ for all $x \in [0, 1]^n$).
² The considered moments are actually induced by the probability density $f(x) := (\ln 2\,(1 + x))^{-1}$. We, however, do not use this information at any point of this example.

Table 1: Some specific simulation points of Example 1.

| | $U = [-0.01, 0.01]$ | | | | $U = [-0.005, 0.005]$ | | | |
|---|---|---|---|---|---|---|---|---|
| a-priori error $\varepsilon$ | 1 | 0.1 | 0.01 | 0.001 | 1 | 0.1 | 0.01 | 0.001 |
| $J_{\mathrm{UB}}$ | -0.0174 | -0.0189 | -0.0194 | -0.0194 | -0.0223 | -0.0236 | -0.0237 | -0.0238 |
| $J_{\mathrm{LB}}$ | -0.0220 | -0.0279 | -0.0204 | -0.0195 | -0.0263 | -0.0298 | -0.0244 | -0.0238 |
| iterations $k(\varepsilon)$ | 99 | 551 | 5606 | 74423 | 232 | 1241 | 12170 | 157865 |
| $d(A\hat\mu_{k(\varepsilon)}, T)$ | 0.0008 | 0.0036 | 0.0005 | 0 | 0 | 0.001 | 0.0001 | 0 |
| $2\varepsilon\delta/C$ | 0.69 | 0.069 | 0.0069 | 0.00069 | 0.35 | 0.035 | 0.0035 | 0.00035 |
| runtime [s]³ | 1.4 | 1.4 | 2.3 | 12.9 | 1.4 | 1.5 | 3.3 | 26.1 |

Figure 1: Maximum entropy densities obtained by Algorithm 1 ($\hat\mu_{k(\varepsilon)}$ for $\varepsilon = 0.001$ and $U = [-0.1, 0.1]$, $[-0.01, 0.01]$, $[-0.005, 0.005]$) for the different uncertainty sets. As a reference, the Slater point density, computed as described in Remark 8, is depicted in red.

4. Finite-dimensional case

We consider the finite-dimensional case where $K = \{1, \ldots, N\}$ and hence we optimize in (3) over the probability simplex $\mathcal{P}(K) = \Delta_N$. One substantial simplification in the finite-dimensional setting is that the Shannon entropy is non-negative and bounded from above (by $\log N$). Therefore, we can substantially weaken Assumption 1 to the following assumption.
³ Runtime includes Slater point computation. Simulations were run with Matlab on a laptop with a 2.2 GHz Intel Core i7 processor.

Assumption 2 (Regularity)
(i) There exists $\delta > 0$ such that $B(0, \delta) \subseteq \{x - A\mu : \mu \in \Delta_N,\ x \in T\}$.
(ii) The reference measure $\nu \in \Delta_N$ has full support, i.e., $\min_{1 \leq i \leq N} \nu_i > 0$.

Consider the definitions given in (13) with $C := \max_{1 \leq i \leq N} \log \frac{1}{\nu_i}$; then the following finite-dimensional equivalent of Theorem 5 holds.

Corollary 9 (A-priori error bound) Given Assumption 2, $C := \max_{1 \leq i \leq N} \log \frac{1}{\nu_i}$ and the definitions (13), let $\varepsilon > 0$ and $N(\varepsilon) := \max\{N_1(\varepsilon), N_2(\varepsilon)\}$. Then $N(\varepsilon)$ iterations of Algorithm 1 produce the approximate solutions to the problems (7) and (6), given by

$$\hat z_{k(\varepsilon)} := y_{k(\varepsilon)} \quad \text{and} \quad \hat\mu_{k(\varepsilon)}(B) := \frac{\sum_{i \in B} 2^{-(A^* \hat z_{k(\varepsilon)})_i}\,\nu_i}{\sum_{i=1}^N 2^{-(A^* \hat z_{k(\varepsilon)})_i}\,\nu_i} \quad \text{for all } B \subseteq \{1, 2, \ldots, N\}, \tag{18}$$

which satisfy

dual ε-optimality: $\quad 0 \leq J^\star - F(\hat z_{k(\varepsilon)}) \leq \varepsilon$ (19a)
primal ε-optimality: $\quad |D(\hat\mu_{k(\varepsilon)}\|\nu) - J^\star| \leq 2(1 + \sqrt{2})\,\varepsilon$ (19b)
primal ε-feasibility: $\quad d(A\hat\mu_{k(\varepsilon)}, T) \leq \dfrac{2\varepsilon\delta}{C}$ (19c)

where $d(\cdot, T)$ denotes the distance to the set $T$, i.e., $d(x, T) := \min_{y \in T} \|x - y\|_2$.

Proof Under Assumption 2 the dual optimal solutions in (7) are bounded as $\|z^\star\|_2 \leq \frac{1}{\delta} \max_{1 \leq i \leq N} \log \frac{1}{\nu_i}$. This bound on the dual optimizer follows along the lines of (Devolder et al., 2012, Theorem 6.1). The presented error bounds can then be derived along the lines of Theorem 5.

In addition to the explicit error bound provided by Corollary 9, the a-posteriori upper and lower bounds presented in Corollary 6 directly apply to the finite-dimensional setting as well.

5. Gradient Approximation

The computationally demanding element of Algorithm 1 is the evaluation of the gradient $\nabla F_\eta(\cdot)$ given in Lemma 4. In particular, Theorem 5 and Corollary 9 assume that this gradient is known exactly. While this is not restrictive if, for example, $K$ is a finite set, in general $\nabla F_\eta(\cdot)$ involves an integration that can only be computed approximately.
In particular, if we consider a multi-dimensional setting (i.e., K ⊆ ℝ^d), the evaluation of the gradient ∇F_η(·) represents a multi-dimensional integration problem. This raises the question of how the fast gradient method (and also Theorem 5) behaves in the case of inexact first-order information. Roughly speaking, the fast gradient method of Algorithm 1, while being more efficient than the classical gradient method (if applicable), is less robust when dealing with inexact gradients Devolder et al. (2014). Therefore, depending on the computational complexity of the gradient, one may consider the possibility of replacing Algorithm 1 with a classical gradient method. A detailed mathematical analysis of this tradeoff is a topic of further research, and we refer the interested reader to Devolder et al. (2014) for further details in this regard. In this section we discuss two numerical methods to approximate this gradient. Note that in Lemma 4, given that T is simple enough, the optimizer x*_z is analytically available, so what remains is to compute Aµ*_z, which according to Lemma 2 is given by

    (Aµ*_z)_i = ∫_K x^i 2^{(A*z)(x)} ν(dx) / ∫_K 2^{(A*z)(x)} ν(dx)  for all i = 1, . . . , M .   (21)

Semidefinite programming. Due to the specific structure of the considered problem, (21) represents an integration of exponentials of polynomials, for which an efficient approximation in terms of two single semidefinite programs (SDPs) involving two linear matrix inequalities has been derived, where the desired accuracy is controlled by the size of the linear matrix inequality constraints; see Bertsimas et al. (2008); Lasserre (2009) for a comprehensive study and for the construction of those SDPs. While the mentioned hierarchy of SDPs provides a certificate of optimality that is easy to evaluate and asymptotic convergence (in the size of the SDPs), a convergence rate that explicitly quantifies the size of the SDPs required for a desired accuracy is unknown.
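Whichever integration scheme is chosen, (21) is a ratio of integrals of exponentials. A minimal sampling sketch for K = [0, 1] with a uniform reference measure and monomial moment functions (all names and parameter values here are illustrative assumptions, not the paper's implementation); the exponent is shifted by its maximum so every weight lies in (0, 1], in the spirit of Remark 10 below, and low-discrepancy (QMC) points could replace the pseudo-random samples:

```python
import numpy as np

def gradient_estimate(z, samples):
    """Sampling estimate of (A mu*_z)_i = E[x^i 2^{<z, phi(x)>}] / E[2^{<z, phi(x)>}]
    on K = [0, 1], with monomial features phi(x) = (x, x^2, ..., x^M)."""
    M = len(z)
    phi = samples[:, None] ** np.arange(1, M + 1)   # shape (N, M)
    expo = phi @ z                                  # <z, phi(x)> per sample
    w = np.exp2(expo - expo.max())                  # stable weights in (0, 1]
    return (phi * w[:, None]).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
x = rng.random(50_000)                 # samples from the uniform reference measure
g = gradient_estimate(np.zeros(3), x)
# with z = 0 all weights are equal, so g estimates (E[x], E[x^2], E[x^3])
```

The O(N^{-1/2}) Monte Carlo error of such an estimate, and its QMC improvement, are exactly what the discussion below quantifies.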
In practice, the hierarchy often converges in a few iteration steps; this, however, depends on the problem and is not known a priori.

Quasi-Monte Carlo. The most popular methods for integration problems of the form (21) are Monte Carlo (MC) schemes; see Robert and Casella (2004) for a comprehensive summary. The main advantage of MC methods is that the root-mean-square error of the approximation converges to 0 with a rate of O(N^{-1/2}) that is independent of the dimension, where N is the number of samples used. In practice, this convergence often is too slow. Under mild assumptions on the integrand, MC methods can be significantly improved with a more recent technique known as Quasi-Monte Carlo (QMC) methods. QMC methods can reach a convergence rate arbitrarily close to O(N^{-1}) with a constant not depending on the dimension of the problem. We refer the reader to Dick et al. (2013); Sloan and Woźniakowski (1998); Kuo and Sloan (2005); Niederreiter (2010); Thély (2017) for a detailed discussion of the theory of QMC methods.

Remark 10 (Computational stability) The evaluation of the gradient in Lemma 4 involves the term Aµ*_z, where µ*_z is the optimizer of the second term in (9). By invoking Lemma 2 and the definition of the operator A, the gradient evaluation reduces to

    ∫_K x^i 2^{Σ_{j=1}^{M} z_j x^j} dx / ∫_K 2^{Σ_{j=1}^{M} z_j x^j} dx  for i = 1, . . . , M .   (22)

Note that a straightforward computation of the gradient via (22) is numerically difficult. To alleviate this difficulty, we follow the suggestion of (Nesterov, 2005, p. 148), which we briefly elaborate here. Consider the functions f(z, x) := Σ_{j=1}^{M} z_j x^j, f̄(z) := max_{x∈K} f(z, x) and g(z, x) := f(z, x) − f̄(z). Note that g(z, x) ≤ 0 for all (z, x) ∈ ℝ^M × K. One can show that (22) equals

    ∫_K ∂_{z_i} g(z, x) 2^{g(z,x)} dx / ∫_K 2^{g(z,x)} dx + ∂_{z_i} f̄(z)  for i = 1, . . . , M ,

which can be computed with significantly smaller numerical error than (22), since the exponents are always non-positive, so that all exponentials take values of at most 1.

6. Zero-information moment closure method

In chemistry, physics, systems biology and related fields, stochastic chemical reactions are described by the chemical master equation (CME), which is a special case of the Chapman-Kolmogorov equation as applied to Markov processes van Kampen (1981); Wilkinson (2006). These equations are usually infinite-dimensional and analytical solutions are generally unavailable. Hence, effort has been directed toward developing a variety of numerical schemes for efficient approximation of the CME, such as stochastic simulation techniques (SSA) Gillespie (1976). In practical cases, one is often interested in the first few moments of the number of molecules involved in the chemical reactions. This motivated the development of approximation methods for those low-order moments that avoid solving the underlying infinite-dimensional CME. One such approximation method is the so-called moment closure method Gillespie (2009), which, briefly described, works as follows. First the CME is recast in terms of moments as a linear ODE of the form

    d/dt µ(t) = Aµ(t) + Bζ(t) ,   (23)

where µ(t) denotes the moments up to order M at time t and ζ(t) is a vector containing the moments of order M + 1 or higher. In general ζ can be an infinite vector, but for most of the standard chemical reactions considered in, e.g., systems biology it turns out that only a finite number of higher-order moments affect the evolution of the first M moments. Indeed, if the chemical system involves only the so-called zeroth and first order reactions, the vector ζ has dimension zero (it reduces to a constant affine term), whereas if the system also involves second order reactions then ζ contains some moments of order M + 1 only.
It is widely believed that reactions up to second order suffice to realistically model most systems of interest in chemistry and biology Gillespie (2007); Gillespie et al. (2013). The matrix A and the linear operator B (which may potentially be infinite-dimensional) can be found analytically from the CME. The ODE (23), however, is intractable due to its dependence on higher-order moments. The approximation step is introduced by a so-called closure function, where the higher-order moments are approximated as a function of the lower-order moments; see Singh and Hespanha (2006, 2007). A closure function that has recently attracted interest is known as the zero-information closure function (of order M) Smadbeck and Kaznessis (2013), and is given by

    (ϕ(µ))_i = ⟨p*_µ, x^{M+i}⟩ ,  for i = 1, 2, . . . ,   (24)

where p*_µ denotes the maximizer of problem (4) with T = Π_{i=1}^{M} {⟨µ, x^i⟩ + u : u ∈ U_i}, where U_i = [−κ, κ] ⊂ ℝ for all i and a given κ > 0 that acts as a regularizer, in the sense of Assumption 1. This approximation reduces the infinite-dimensional ODE (23) to the finite-dimensional ODE

    d/dt µ(t) = Aµ(t) + Bϕ(µ(t)) .   (25)

To numerically solve (25) it is crucial to have an efficient evaluation of the closure function ϕ. In the zero-information closure scheme this evaluation amounts to solving the entropy maximization problem (4) and as such can be addressed using Algorithm 1. To illustrate this point, we consider a reversible dimerisation reaction where two monomers (M) combine in a second-order bimolecular reaction to form a dimer (D); the reverse reaction is first order and involves the decomposition of the dimer into the two monomers. This gives rise to the chemical reaction system

    2M ⇌ D ,   (26)

with forward and reverse reaction rate constants k1, k2 > 0. Note that the system as described has a single degree of freedom, since M = M0 + 2D0 − 2D, where M denotes the count of the monomers, D the count of dimers, and M0 and D0 the corresponding initial conditions.
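The closure ODE (25) can be sketched end-to-end for this reaction system. In the sketch below, the moment equations are re-derived from the CME under mass-action propensities k1·M·(M−1) and k2·D; the rate values and the slightly spread initial law are hypothetical choices for illustration, and the max-ent subproblem is solved through its smooth convex dual in the natural-log convention (the maximizing distribution is the same as for log base 2):

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

k1, k2 = 0.1, 1.0                        # hypothetical reaction rates
M0, D0 = 10, 0
S0 = M0 + 2 * D0                         # conserved quantity M + 2D
xs = np.arange(S0 + 1.0)                 # support of the monomer count

# Second-order moment equations d/dt (1, <M>, <M^2>) = A mu + B <M^3>,
# derived from the CME with the propensities stated above.
A = np.array([[0.0, 0.0, 0.0],
              [k2 * S0, 2 * k1 - k2, -2 * k1],
              [2 * k2 * S0, 2 * k2 * (S0 - 1) - 4 * k1, 8 * k1 - 2 * k2]])
B = np.array([0.0, 0.0, -4 * k1])

feats = np.vstack([xs, xs ** 2]).T       # sufficient statistics of the max-ent family

def closure(m1, m2):
    """Third moment of the max-ent law on {0,...,S0} matching (m1, m2),
    obtained by minimizing the smooth convex dual log-partition function."""
    def dual(z):
        e = feats @ z
        return e.max() + np.log(np.exp(e - e.max()).sum()) - z[0] * m1 - z[1] * m2
    z = minimize(dual, np.zeros(2), method='BFGS').x
    e = feats @ z
    p = np.exp(e - e.max())
    p /= p.sum()
    return p @ xs ** 3

def rhs(t, mu):                          # the closed ODE (25)
    return A @ mu + B * closure(mu[1], mu[2])

mu0 = np.array([1.0, 9.0, 83.0])         # (1, <M>, <M^2>) with a little variance
sol = solve_ivp(rhs, (0.0, 1.0), mu0)
# <M> relaxes from 9 toward its equilibrium value
```

This mirrors the structure of the paper's experiment, with the smoothed fast gradient method of Algorithm 1 replaced by a generic convex solver for the inner max-ent problem.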
Therefore, the matrices can be reduced to include only the moments of one component as a simplification, and as such the zero-information closure function (24) consists of solving a one-dimensional entropy maximization problem of the form (4), where the support is the set of natural numbers (upper bounded by M0 + 2D0 and hence compact). For illustrative purposes, let us look at a second-order closure scheme, where the corresponding moment vectors are defined as µ = (1, ⟨M⟩, ⟨M²⟩) ∈ ℝ³ and ζ = ⟨M³⟩ ∈ ℝ, and the corresponding matrices are given by

    A = [ 0        0                     0
          k2·S0    2k1 − k2             −2k1
          2k2·S0   2k2(S0 − 1) − 4k1    8k1 − 2k2 ] ,   B = (0, 0, −4k1)ᵀ ,

where S0 = M0 + 2D0. The simulation results in Figure 2 show the time trajectory of the average and the second moment of the number of M molecules in the reversible dimerization model (26), as calculated for the zero-information closure (25) using Algorithm 1, for a second-order closure as well as a third-order closure. To solve the ODE (25) we use an explicit Runge-Kutta (4,5) formula (ode45) built into MATLAB. The results are compared to the average of 10^6 SSA trajectories Wilkinson (2006). It can be seen how increasing the order of the closure method improves the approximation accuracy.

7. Approximate dynamic programming for constrained Markov decision processes

In this section, we show that problem (3) naturally appears in the context of approximate dynamic programming, which is at the heart of reinforcement learning (see Bertsekas and Tsitsiklis (1996); Bertsekas (1995); Recht (2018) and references therein).
We consider constrained Markov decision processes (MDPs), which form an important class of stochastic control problems with applications in many areas; see Piunovskiy (1997); Altman (1999) and the comprehensive bibliography therein. A look at these references shows that most of the literature concerns constrained MDPs where the state and action spaces are either finite or countable. Inspired by the recent work Mohajerin Esfahani et al. (2018), we show here that the entropy maximization problem (3) is a key element to approximate constrained MDPs on general, possibly uncountable, state and action spaces.

[Figure 2: Reversible dimerization system (26) with reaction constants K = k2/k1: comparison of the zero-information moment closure method (25), solved using Algorithm 1, with the average of 10^6 SSA trajectories, for K = 1 and K = 10. Panels (a) and (c) show a second-order moment closure; panels (b) and (d) a third-order moment closure. The initial conditions are M0 = 10 and D0 = 0, and the regularization term is κ = 0.01.]

7.1. Constrained MDP problem formulation

Consider a discrete-time constrained MDP (S, A, {A(s) : s ∈ S}, Q, c, d, κ), where S (resp. A) is a metric space called the state space (resp. action space) and, for each s ∈ S, the measurable set A(s) ⊆ A denotes the set of feasible actions when the system is in state s ∈ S. The transition law is a stochastic kernel Q on S given the feasible state-action pairs in K := {(s, a) : s ∈ S, a ∈ A(s)}.
A stochastic kernel acts on real-valued measurable functions u from the left as Qu(s, a) := ∫_S u(s′) Q(ds′|s, a), for all (s, a) ∈ K, and on probability measures µ on K from the right as µQ(B) := ∫_K Q(B|s, a) µ(d(s, a)), for all B ∈ B(S). The (measurable) function c : K → ℝ₊ denotes the so-called one-stage cost function, and the (measurable) function d : K → ℝ^q the one-stage constraint cost, along with the preassigned budget level κ ∈ ℝ^q. In short, the MDP model described above reads as follows: when the system at the state s ∈ S deploys the action a ∈ A(s), it incurs the one-stage cost and constraint cost c(s, a) and d(s, a), respectively, and subsequently lands in the next state, whose distribution is supported on S and described via Q(·|s, a). We consider the expected long-run average cost criterion,4 i.e., for any measurable function ψ : K → ℝ^p with p ∈ {1, q}, we define

    J_ψ(π, ν) := lim sup_{n→∞} (1/n) E^π_ν [ Σ_{t=0}^{n−1} ψ(s_t, a_t) ] .

Using the definition above, the central object of this section is the optimization problem

    inf_{ν,π} J_c(π, ν)  s.t.  J_d(π, ν) ≤ κ ,  ν ∈ P(S) ,  π ∈ Π ,   (27)

where Π denotes the set of all control policies. We refer the reader to Hernández-Lerma and Lasserre (1996); Hernández-Lerma et al. (2003); Altman (1999) for a detailed mathematical treatment of this setting. It is well known that the MDP problem (27) can be stated equivalently as an infinite-dimensional linear program and its corresponding dual:

    inf_µ ⟨µ, c⟩  s.t.  µ(B × A) = µQ(B) for all B ∈ B(S) ,  ⟨µ, d⟩ ≤ κ ,  µ ∈ P(K) ,   (28a)

    sup_{u,ρ,γ} ρ − ⟨γ, κ⟩  s.t.  ρ + u(x) − Qu(x, a) ≤ c(x, a) + ⟨γ, d(x, a)⟩ for all (x, a) ∈ K ,  u ∈ L(S) ,  ρ ∈ ℝ ,  γ ∈ ℝ^q₊ .   (28b)

4. We refer the interested reader to Mohajerin Esfahani et al. (2018) for the extension to the discounted cost.

The following regularity assumption is required in order to ensure that the solutions are well posed and that the equivalence between (27) and the LPs (28a), (28b) holds.
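For a finite discretization, the two kernel actions above are plain tensor contractions. A minimal sketch with hypothetical sizes and a randomly generated kernel (purely illustrative data, not from the paper):

```python
import numpy as np

# Q[s, a, s2] = probability of landing in state s2 from the state-action pair (s, a).
rng = np.random.default_rng(1)
n_s, n_a = 4, 3
Q = rng.random((n_s, n_a, n_s))
Q /= Q.sum(axis=2, keepdims=True)        # each Q[s, a, :] is a distribution on S

u = rng.random(n_s)                      # a function on states
mu = rng.random((n_s, n_a))
mu /= mu.sum()                           # a probability measure on K

# Left action on functions: (Qu)(s, a) = sum_s2 Q(s2 | s, a) u(s2)
Qu = Q @ u                               # shape (n_s, n_a)

# Right action on measures: (mu Q)(s2) = sum_{s,a} Q(s2 | s, a) mu(s, a)
muQ = np.einsum('sa,sab->b', mu, Q)      # shape (n_s,)

# mu Q is again a probability measure on S:
assert np.isclose(muQ.sum(), 1.0)
```

In this notation, the invariance constraint of (28a) is exactly `muQ == mu.sum(axis=1)`, i.e., the state marginal of µ is a fixed point of the right action.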
Assumption 3 (Control model) We stipulate that (i) the set of feasible state-action pairs is the unit hypercube K = [0, 1]^{dim(S×A)}; (ii) the transition law Q is Lipschitz continuous, i.e., there exists L_Q > 0 such that for all k, k′ ∈ K and all continuous functions u, |Qu(k) − Qu(k′)| ≤ L_Q ∥u∥_∞ ∥k − k′∥_∞; (iii) the cost function c is non-negative and Lipschitz continuous on K and d is continuous on K.

Under this assumption, strong duality between the linear programs (28a) and (28b) holds (i.e., the supremum and infimum are attained and J_P = J_D). Moreover, the LP formulation is equivalent to the original problem (27) in the sense that J = J_P = J_D; see (Hernández-Lerma et al., 2003, Theorem 5.2). Finding exact solutions to either (28a) or (28b) is in general impossible, as the linear programs are infinite-dimensional. This challenge has given rise to a wealth of approximation schemes in the literature under the name of approximate dynamic programming. Typically, one restricts the decision space in (28b) to a finite-dimensional subspace spanned by basis functions {u_i}_{i=1}^{n} ⊂ L(S), denoted by U_n := {Σ_{i=1}^{n} α_i u_i : ∥α∥₂ ≤ θ}. Motivated by de Farias and Van Roy (2004); Mohajerin Esfahani et al. (2018), we then approximate the solution J by

    inf_µ ⟨µ, c⟩ + θ ∥T_n µ − e∥₂  s.t.  ⟨µ, d⟩ ≤ κ ,  µ ∈ P(K) ,   (29)

where the operator T_n : P(K) → ℝ^{n+1} is defined as (T_n µ)_1 = 1, (T_n µ)_{i+1} = ⟨Qu_i − u_i, µ⟩ for i = 1, . . . , n, and e := (1, 0, . . . , 0) ∈ ℝ^{n+1}. The optimization problem (29) can be solved with the accelerated first-order method provided by Algorithm 2, stated below, where each iteration step involves solving an entropy maximization problem of the form (3). We define

    c_{α,ζ} := c + Σ_{i=1}^{n} α_i(Qu_i − u_i) ,  Y := {µ ∈ P(K) : ⟨µ, d⟩ ≤ κ} ,  T(q, α) := (α − q) min{1, θ/∥q − α∥₂} ,  y*_ζ(α) := arg min_{y∈Y} { D(y‖λ) + ⟨y, c_{α,ζ}⟩ } .   (30)

Steps 2 and 3 of Algorithm 2 are simple arithmetic operations. Step 1 relies on the solution to the optimization problem (30), which can be reduced to (3).
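If the constraint set Y in (30) were the entire simplex, the entropy-regularized subproblem would admit the classical Gibbs closed form y_i ∝ λ_i e^{−c_i} (natural-log convention for simplicity); handling the additional constraint ⟨y, d⟩ ≤ κ is what requires the dual machinery of Section 3. A minimal finite-dimensional sketch with hypothetical data:

```python
import numpy as np

def gibbs_minimizer(lam, c):
    """Closed-form minimizer of D(y || lam) + <y, c> over the probability simplex:
    y_i proportional to lam_i * exp(-c_i)."""
    w = lam * np.exp(-(c - c.min()))     # shift c by its min for numerical stability
    return w / w.sum()

def objective(y, lam, c):
    return np.sum(y * np.log(y / lam)) + y @ c

lam = np.full(4, 0.25)                   # reference measure (hypothetical)
c = np.array([1.0, 2.0, 0.5, 3.0])       # linear cost term (hypothetical)
y_star = gibbs_minimizer(lam, c)

rng = np.random.default_rng(0)
for _ in range(100):                     # y_star is optimal among random candidates
    y = rng.dirichlet(np.ones(4))
    assert objective(y_star, lam, c) <= objective(y, lam, c) + 1e-12
```

The optimal value is −log Σ_i λ_i e^{−c_i}, the negative log-partition function; this is exactly the structure that the smoothing analysis of Section 3 exploits.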
To see this, note that the additional linear term in the objective function can be directly integrated into the analysis of Section 3. We provide the explicit construction in Section 7.2 for an inventory management system. We are interested in establishing bounds on the quality of the approximate solution J^{(k)}_{n,ζ} obtained by Algorithm 2. The results of (Mohajerin Esfahani et al., 2018, Theorems 3.3 and 5.3) allow us to obtain the following bound.

Algorithm 2: Approximate dynamic programming scheme
Input: n, k ∈ ℕ, ζ, θ > 0, and w(0) ∈ ℝ^{n+1} such that ∥w(0)∥₂ ≤ θ
For 0 ≤ ℓ ≤ k do
    Step 1: Define r(ℓ) := (ζ/(4n)) (e − T_n y*_ζ(w(ℓ)))
    Step 2: Let z(ℓ) := T( Σ_{j=0}^{ℓ} ((j+1)/2) r(j), 0 ) and β(ℓ) := T(r(ℓ), w(ℓ))
    Step 3: Set w(ℓ+1) := (2/(ℓ+3)) z(ℓ) + ((ℓ+1)/(ℓ+3)) β(ℓ)
Output: J^{(k)}_{n,ζ} := ⟨c, ŷ_ζ⟩ + θ ∥T_n ŷ_ζ − e∥₂ with ŷ_ζ := Σ_{j=0}^{k} (2(j+1)/((k+1)(k+2))) y*_ζ(w(j))

Theorem 11 (Approximation error) Under Assumption 3, Algorithm 2 provides an approximation to (27) with the following error bound:

    |J^{(k)}_{n,ζ} − J| ≤ (1 + max{L_Q, 1}) ∥u* − Π_{U_n}(u*)∥_L + 4n²θ/(k²ζ) + dim(K) ζ max{log(β/ζ), 1} ,

where β := e dim(K) (θ √n (max{L_Q, 1} + 1) + ∥c∥_L). Note that the bound depends on the projection residual ∥u* − Π_{U_n}(u*)∥_L, where the projection mapping is defined as Π_{U_n}(u*) := arg min_{u∈U_n} ∥u − u*∥₂, on the parameters of the problem (notably the dimensions of the state and action spaces, the stage cost, and the Lipschitz constant L_Q of the kernel), and on the design choices of the algorithm (the number of basis functions n, the norm bound θ, and ζ).

Remark 12 (Tuning parameters) (i) The residual error ∥u* − Π_{U_n}(u*)∥_L can be approximated by leveraging results from the literature on universal function approximation. Prior information about the value function u* may offer explicit quantitative bounds. For instance, for MDPs under Assumption 3 we know that u* is Lipschitz continuous.
For appropriate choice of basis functions, we can therefore ensure a convergence rate of n^{-1/dim(S)}; see for instance Farouki (2012) for polynomials and Olver (2009) for Fourier basis functions. (ii) The regularization parameter θ has to be chosen such that θ > ∥c∥_L. An optimal choice for θ and ζ is described in (Mohajerin Esfahani et al., 2018, Remark 4.6 and Theorem 5.3).

7.2. Inventory management system

Consider an inventory model in which the state variable describes the stock level s_t at the beginning of period t. The control or action variable a_t at time t is the quantity ordered and immediately supplied at the beginning of period t, and the disturbance or exogenous variable ξ_t is the demand during that period. We assume the ξ_t to be i.i.d. random variables following an exponential distribution with parameter λ. The system equation Hernández-Lerma and Lasserre (1996) is

    s_{t+1} = max{0, s_t + a_t − ξ_t} =: (s_t + a_t − ξ_t)₊ ,  t = 0, 1, 2, . . . .   (31)

We assume that the system has a finite capacity C. Therefore S = A = [0, C] and, since the current stock plus the amount ordered cannot exceed the system's capacity, the set of feasible actions is A(s) = [0, C − s] for every s ∈ S. Supposing we wish to maximize an expected profit for operating the system, we might take the net profit at stage t to be

    r(s_t, a_t, ξ_t) := v min{s_t + a_t, ξ_t} − p a_t − h(s_t + a_t) ,   (32)

which is of the form profit = sales − production cost − holding cost. In (32), v, p and h are positive numbers denoting the unit sale price, the unit production cost, and the unit holding cost, respectively. To write the cost (32) in the form of our control model (27), we define

    c(s, a) := E[−r(s_t, a_t, ξ_t) | s_t = s, a_t = a] = −v(s + a)e^{−λ(s+a)} − (v/λ)(1 − e^{−λ(s+a)}(λ(s + a) + 1)) + pa + h(s + a) .   (33)

Note that non-negativity of the cost can be ensured by subtracting the term 2vC from (32).
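The closed form (33) follows from E[min{y, ξ}] = (1 − e^{−λy})/λ for ξ ~ Exp(λ) with y = s + a, and can be checked against a direct simulation of (32). A minimal sketch with hypothetical parameter values:

```python
import numpy as np

def expected_cost(s, a, v=1.0, p=1.0, h=0.1, lam=0.5):
    """Closed-form one-stage cost c(s, a) = E[-r(s, a, xi)], xi ~ Exp(lam),
    via E[min{y, xi}] = (1 - exp(-lam * y)) / lam with y = s + a."""
    y = s + a
    return -v * (1.0 - np.exp(-lam * y)) / lam + p * a + h * y

def mc_cost(s, a, v=1.0, p=1.0, h=0.1, lam=0.5, n=200_000, seed=0):
    """Monte Carlo estimate of E[-r(s, a, xi)] by sampling the demand."""
    xi = np.random.default_rng(seed).exponential(scale=1.0 / lam, size=n)
    y = s + a
    r = v * np.minimum(y, xi) - p * a - h * y
    return -r.mean()

s, a = 0.3, 0.4
# the closed form and the simulation agree to Monte Carlo accuracy
print(expected_cost(s, a), mc_cost(s, a))
```

The same sampling pattern also verifies the unsimplified form of (33), since (s+a)e^{−λ(s+a)} + (1/λ)(1 − e^{−λ(s+a)}(λ(s+a)+1)) = (1 − e^{−λ(s+a)})/λ.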
We assume that there are regulatory constraints on the required stock level s_t, for example to avoid the risk of running into a shortage of a certain critical product. For simplicity, let us assume that the regulator enforces constraints on the long-term first and second moments of the stock s_t in the following sense:

    lim sup_{n→∞} (1/n) E^π_ν [ Σ_{t=0}^{n−1} s_t ] ≥ ℓ1  and  lim sup_{n→∞} (1/n) E^π_ν [ Σ_{t=0}^{n−1} s_t² ] ≤ ℓ2 ,   (34)

for given ℓ1, ℓ2 ∈ ℝ₊, where we assume that ℓ1² < ℓ2.5 To express this in the form of our control model (27), we define d1(s, a) := −s, d2(s, a) := s², and κ := (−ℓ1, ℓ2) ∈ ℝ². From the described assumptions on the constants ℓ1 and ℓ2, it can be directly seen that the Slater point assumption described in Lemma 3 holds (consider the set T := {x ∈ ℝ² : x1 ≤ −ℓ1, x2 ≤ ℓ2, x1² ≤ x2}). In the following, we describe how Algorithm 2 and the respective error bounds given by Theorem 11 can be applied to the inventory management system. In particular, we show how the computationally demanding part of Algorithm 2, given by (30), is directly addressed by the methodology presented in Section 3. To fulfill Assumption 3, we equivalently reformulate the above problem using the dynamics

    s_{t+1} = (min{C, s_t + a_t} − ξ_t)₊ ,  t = 0, 1, 2, . . . ,   (35)

where the admissible action set is now the state-independent set A = [0, C]. Finally, by normalizing the state and action variables through the definitions s̄_t := s_t/C and ā_t := a_t/C, we have

    s̄_{t+1} = (min{1, s̄_t + ā_t} − ξ̄_t)₊ ,  t = 0, 1, 2, . . . .   (36)

Furthermore, it can be seen directly using the Leibniz rule that the transition law is Lipschitz continuous with L_Q ≤ 2Cλ. It remains to argue how (30) can be addressed by the methodology presented in Section 3. We introduce the linear operator A : P(K) → ℝ², defined by (Aµ)_i := ⟨µ, d_i⟩ for i = 1, 2.

5. Note that the enforced constraints (34) imply the upper and lower bounds ℓ1 ≤ lim sup_{n→∞} (1/n) E^π_ν [Σ_{t=0}^{n−1} s_t] ≤ √ℓ2 and ℓ1² ≤ lim sup_{n→∞} (1/n) E^π_ν [Σ_{t=0}^{n−1} s_t²] ≤ ℓ2.
Then, the optimization problem (30) can be expressed as

    (primal program):  J = min_{µ∈P(K)} { D(µ‖ν) + ⟨µ, c_{α,ζ}⟩ + sup_{z∈ℝ^M} { ⟨Aµ, z⟩ − σ_T(z) } } ,   (37)

    (dual program):  J_D = sup_{z∈ℝ^M} { −σ_T(z) + min_{µ∈P(K)} { D(µ‖ν) + ⟨µ, A*z + c_{α,ζ}⟩ } } ,   (38)

which, apart from the additional linear term, are in the form of (6) and (7). Due to our assumptions on ℓ1 and ℓ2 described above, there exists a strictly feasible solution to (30), i.e., µ0 ∈ P(K) such that Aµ0 ∈ T and δ := min_{y∈T^c} ∥Aµ0 − y∥₂ > 0. Hence, the method from Section 3 is applicable.

Numerical Simulation. For a given set of model parameters (C = 1, λ = 1/2, v = 1, p = 1/10) we consider four different scenarios: (i) the unconstrained case (i.e., ℓ1 = 0 and ℓ2 = 1); (ii) ℓ1 = 0.5 and ℓ2 = 0.4; (iii) ℓ1 = 0.5 and ℓ2 = 0.3, which is strictly more constrained than scenario (ii); (iv) ℓ1 = 0.1 and ℓ2 = 0.1. For each scenario we run Algorithm 2 and plot the value of the resulting approximation J^{(k)}_{n,ζ} as a function of the number of iterations in Figure 3. In each iteration of Algorithm 2 the variable y*_ζ(·) from (30) is required, which is computed with Algorithm 1. The following simulation parameters were used. Algorithm 1: η1 = η2 = 10^{-3}, 1500 iterations; Algorithm 2: ζ = 10^{-1.5}, θ = 3, n = 10, Fourier basis u_{2i−1}(s) = (C/(2iπ)) cos(2iπs/C) and u_{2i}(s) = (C/(2iπ)) sin(2iπs/C) for i = 1, . . . , n. As shown in Figure 3, the value of J^{(k)}_{n,ζ} converges at around 1000 iterations of Algorithm 2.6 The fact that scenario (ii) is a (strict) relaxation of scenario (iii) in terms of the constraints is reflected in the numerical simulation, as the expected profit in scenario (ii) is consistently higher than in scenario (iii). Figure 3 also indicates that in scenario (iv) the constraints are the most restrictive.

8. Conclusion and future work

We presented an approximation scheme for a generalization of the classical problem of estimating a density via a maximum entropy criterion, given some moment constraints.
The key idea is to apply smoothing techniques to the non-smooth dual function of the entropy maximization problem, which enables us to solve the dual problem efficiently with fast gradient methods. Due to the favorable structure of the considered entropy maximization problem, we provide explicit bounds on the approximation error as well as a-posteriori error estimates. The proposed method requires one to evaluate the gradient (21) in every iteration step, which, as highlighted in Section 5, in the infinite-dimensional setting involves an integral. As such, the method used to compute those integrals has to be included in the complexity of the proposed algorithm and, in higher dimensions, may become the dominant factor. Therefore, it would be interesting to investigate this integration step in more detail. Two approaches, one based on semidefinite programming and another invoking Quasi-Monte Carlo integration techniques, were briefly sketched. What remains open is to quantify the accuracy required in the gradient approximations, which could be done along the lines of Devolder et al. (2014). Another potential direction would be to test the proposed numerical method in the context of approximating the channel capacity of a large class of memoryless channels Sutter et al. (2015), as mentioned in the introduction.

[Figure 3: The expected profit of the inventory system, approximated by J^{(k)}_{n,ζ} resulting from Algorithm 2, displayed for the four different scenarios (i)-(iv) representing four different constraints on the inventory system.]

6. 1000 iterations of Algorithm 2 took around 5.5 hours with Matlab on a laptop with a 2.2 GHz Intel Core i7 processor.
Finally, it should be mentioned that the approximation scheme proposed in this article can be further generalized to quantum mechanical entropies. In this setup, probability mass functions are replaced by density matrices (i.e., positive semidefinite matrices whose trace is equal to one). The von Neumann entropy of such a density matrix ρ is defined by H(ρ) := −tr(ρ log ρ), which reduces to the (Shannon) entropy in case the density matrix ρ is diagonal. Also the relative entropy can be generalized to the quantum setup Umegaki (1962), and the general treatment of our approximation scheme and its analysis can be lifted to this (strictly) more general framework. As demonstrated in Sutter et al. (2016), (quantum) entropy maximization problems can be used to efficiently approximate the classical capacity of quantum channels.

Appendix A. Detailed proof of Theorem 5

Given the bounds ∥z∥₂ ≤ C/δ and F(z) ≤ J ≤ D(µ0‖ν) = C for all z ∈ ℝ^M, as argued above, we show how the error bounds of Theorem 5 follow from Devolder et al. (2012). The dual ε-optimality (15a) is derived from (Devolder et al., 2012, Equation (7.8)), which in our setting states

    0 ≤ J − F(ẑ_k) ≤ 3ε/4 + 5 (F(z*) − F(0) + ε) e^{−k √(η2(ε)/L(η))} .   (39)

Using the parameters as defined in Theorem 5, i.e., C := D(µ0‖ν), D := (1/2) max_{x∈T} ∥x∥₂², η1(ε) := ε/(4D), η2(ε) := εδ²/(4C²), the fact that F(z*) − F(0) = F(z*) ≤ C, and the Lipschitz constant L(η) = ∥A∥²/η1 + η2 derived in Lemma 4, inequality (39) ensures 0 ≤ J − F(ẑ_k) ≤ ε if

    k ≥ N1(ε) := (2/(εδ)) √(ε²δ² + 2∥A∥²C²) ln(10(ε + 2C)/ε) .

The primal ε-optimality bound (15b), following the derivations in Devolder et al. (2012), is implied if k is chosen large enough such that ∥∇F_η(z_k)∥₂ ≤ 2εδ/C.
According to (Devolder et al., 2012, Equation (7.11)), the gradient can be bounded by

    ∥∇F_η(z_k)∥₂ ≤ √( 4 L(η) (F(z*) − F(0) + ε) ) e^{−(k/2) √(η2(ε)/L(η))} .

Again using the parameters and constants as we did for the dual ε-optimality bound, we find that ∥∇F_η(z_k)∥₂ ≤ 2εδ/C holds whenever k ≥ N2(ε). The primal ε-feasibility (15c) finally follows directly from (Devolder et al., 2012, Equation (7.14)) and the bound ∥z*∥₂ ≤ C/δ.

Acknowledgments

The authors thank Andreas Milias-Argeitis for helpful discussions regarding the moment closure example. TS and JL acknowledge SystemsX.ch under the grant SignalX. DS acknowledges support by the Swiss National Science Foundation (SNSF) via the National Centre of Competence in Research QSIT and by the Air Force Office of Scientific Research (AFOSR) via grant FA9550-16-1-0245. PME was supported by the Swiss National Science Foundation under grant P2EZP2 165264.

References

N. I. Akhiezer. The classical moment problem and some related questions in analysis. Translated by N. Kemmer. Hafner Publishing Co., New York, 1965.

Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer, 2007. ISBN 9783540326960. doi: 10.1007/3-540-29587-9.

Eitan Altman. Constrained Markov decision processes. Stochastic Modeling. Chapman & Hall/CRC, Boca Raton, FL, 1999. ISBN 0-8493-0382-6.

Edward J. Anderson and Peter Nash. Linear programming in infinite-dimensional spaces: theory and applications. Wiley-Interscience Series in Discrete Mathematics and Optimization. Wiley, 1987. ISBN 9780471912507.

Arash Asadpour, Michel X. Goemans, Aleksander Mądry, Shayan Oveis Gharan, and Amin Saberi. An O(log n/ log log n)-approximation Algorithm for the Asymmetric Traveling Salesman Problem, pages 379-389. 2010. doi: 10.1137/1.9781611973075.32. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611973075.32.

D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.

D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming.
Athena Scientific, 1996.

Dimitri P. Bertsekas. Convex Optimization Theory. Athena Scientific Optimization and Computation Series. Athena Scientific, 2009. ISBN 9781886529311.

Dimitris Bertsimas, Xuan Vinh Doan, and Jean Lasserre. Approximating integrals of multivariate exponentials: a moment approach. Oper. Res. Lett., 36(2):205-210, 2008. ISSN 0167-6377. doi: 10.1016/j.orl.2007.07.002. URL http://dx.doi.org/10.1016/j.orl.2007.07.002.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5. doi: 10.1093/acprof:oso/9780199535255.001.0001. URL http://dx.doi.org/10.1093/acprof:oso/9780199535255.001.0001.

I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3:146-158, 1975. doi: 10.1214/aop/1176996454.

D.P. de Farias and B. Van Roy. On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming. Mathematics of Operations Research, 29(3):462-478, 2004.

Olivier Devolder, François Glineur, and Yurii Nesterov. Double smoothing technique for large-scale linearly constrained convex optimization. SIAM Journal on Optimization, 22(2):702-727, 2012. doi: 10.1137/110826102.

Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Math. Program., 146(1-2, Ser. A):37-75, 2014. ISSN 0025-5610. doi: 10.1007/s10107-013-0677-5. URL https://doi.org/10.1007/s10107-013-0677-5.

Josef Dick, Frances Y. Kuo, and Ian H. Sloan. High-dimensional integration: The Quasi-Monte Carlo way. Acta Numerica, 22:133-288, 5 2013. ISSN 1474-0508. doi: 10.1017/S0962492913000044. URL http://journals.cambridge.org/article_S0962492913000044.

Miroslav Dudik, Steven J. Phillips, and Robert E. Schapire.
Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8:1217-1260, 2007. ISSN 1532-4435.

Rida T. Farouki. The Bernstein polynomial basis: A centennial retrospective. Computer Aided Geometric Design, 29(6):379-419, 2012.

Shayan Oveis Gharan, Amin Saberi, and Mohit Singh. A randomized rounding approach to the traveling salesman problem. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS '11, pages 550-559, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4571-4. doi: 10.1109/FOCS.2011.80. URL http://dx.doi.org/10.1109/FOCS.2011.80.

C. S. Gillespie. Moment-closure approximations for mass-action models. IET Systems Biology, 3(1):52-58, January 2009. ISSN 1751-8849. doi: 10.1049/iet-syb:20070031.

Daniel T. Gillespie. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Computational Phys., 22(4):403-434, 1976. ISSN 0021-9991. doi: 10.1016/0021-9991(76)90041-3.

Daniel T. Gillespie. Stochastic simulation of chemical kinetics. Annual Review of Physical Chemistry, 58(1):35-55, 2007. doi: 10.1146/annurev.physchem.58.032806.104637. URL https://doi.org/10.1146/annurev.physchem.58.032806.104637. PMID: 17037977.

Daniel T. Gillespie, Andreas Hellander, and Linda R. Petzold. Perspective: Stochastic algorithms for chemical kinetics. The Journal of Chemical Physics, 138(17):170901, 2013. doi: 10.1063/1.4801941.

Amos Golan. Information and entropy econometrics - a review and synthesis. Foundations and Trends in Econometrics, 2:1-145, 2008. ISSN 1551-3076. doi: 10.1561/0800000004. URL http://dx.doi.org/10.1561/0800000004.

Peter Grünwald. Entropy concentration and the empirical coding game. Statistica Neerlandica, 62(3):374-392, 2008. ISSN 1467-9574. doi: 10.1111/j.1467-9574.2008.00401.x. URL http://dx.doi.org/10.1111/j.1467-9574.2008.00401.x.

O. Hernández-Lerma and J.B.
Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Applications of Mathematics Series. Springer, 1996. ISBN 9780387945798.

O. Hernández-Lerma and J.B. Lasserre. Further topics on discrete-time Markov control processes. Applications of Mathematics Series. Springer, 1999. ISBN 9780387986944. doi: 10.1007/978-1-4612-0561-6.

Onésimo Hernández-Lerma, Juan González-Hernández, and Raquiel R. López-Martínez. Constrained average cost Markov control processes in Borel spaces. SIAM J. Control Optim., 42(2):442-468, 2003. ISSN 0363-0129. doi: 10.1137/S0363012999361627. URL https://doi.org/10.1137/S0363012999361627.

E. T. Jaynes. Probability theory: The logic of science. Cambridge University Press, Cambridge, 2003. ISBN 0-521-59271-2. doi: 10.1017/CBO9780511790423. URL http://dx.doi.org/10.1017/CBO9780511790423.

Solomon Kullback. Information theory and statistics. John Wiley and Sons, Inc., New York; Chapman and Hall, Ltd., London, 1959.

Frances Y. Kuo and Ian H. Sloan. Lifting the curse of dimensionality. Notices of the AMS, 52(11):1320-1328, 2005.

Jean B. Lasserre. Moments, Positive Polynomials and Their Applications. Imperial College Press Optimization Series. Imperial College Press, 2009. ISBN 9781848164468. doi: 10.1142/9781848164468_0001.

David G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, Inc., New York-London-Sydney, 1969.

Lawrence R. Mead and N. Papanicolaou. Maximum entropy in the problem of moments. J. Math. Phys., 25(8):2404-2417, 1984. ISSN 0022-2488. doi: 10.1063/1.526446. URL http://dx.doi.org/10.1063/1.526446.

Peyman Mohajerin Esfahani, Tobias Sutter, Daniel Kuhn, and John Lygeros. From infinite to finite programs: explicit error bounds with applications to approximate dynamic programming. SIAM J. Optim., 28(3):1968-1998, 2018. ISSN 1052-6234. doi: 10.1137/17M1133087. URL https://doi.org/10.1137/17M1133087.

Angelia Nedić and Asuman Ozdaglar.
Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4):1757 1780, 2008. Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Springer, 2004. ISBN 9781402075537. doi: 10.1007/978-1-4419-8853-9. Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127 152, 2005. ISSN 0025-5610. doi: 10.1007/s10107-004-0552-5. URL http: //dx.doi.org/10.1007/s10107-004-0552-5. Harald Niederreiter. Quasi-Monte Carlo methods. Encyclopedia of quantitative finance, 2010. Sheehan Olver. On the convergence rate of a modified Fourier series. Mathematics of Computation, 78(267):1629 1645, 2009. Sutter, Sutter, Mohajerin Esfahani and Lygeros A. B. Piunovskiy. Optimal control of random sequences in problems with constraints, volume 410 of Mathematics and its Applications. Kluwer Academic Publishers, Dordrecht, 1997. ISBN 0-7923-4571-1. doi: 10.1007/978-94-011-5508-3. URL https://doi.org/10.1007/ 978-94-011-5508-3. With a preface by V. B. Kolmanovskii and A. N. Shiryaev. Benjamin Recht. A Tour of Reinforcement Learning: The View from Continuous Control. ar Xiv e-prints, art. ar Xiv:1806.09460, June 2018. Stefan Richter. Computational complexity certification of gradient methods for real-time model predictive control. Ph D thesis, ETH Zurich, 2012. doi: 10.3929/ethz-a-007587480. C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2 edition, 2004. R. Tyrrell Rockafellar. Convex analysis. Princeton Landmarks in Mathematics and Physics Series. Princeton University Press, 1997. ISBN 9780691015866. A. Singh and J. P. Hespanha. Lognormal moment closures for biochemical reactions. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 2063 2068, Dec 2006. doi: 10.1109/CDC.2006.376994. Abhyudai Singh and Jo ao Pedro Hespanha. A derivative matching approach to moment closure for the stochastic logistic model. Bull. Math. 
Biol., 69(6):1909 1925, 2007. ISSN 0092-8240. doi: 10.1007/s11538-007-9198-9. URL http://dx.doi.org/10.1007/ s11538-007-9198-9. Mohit Singh and Nisheeth K. Vishnoi. Entropy, optimization and counting. In Proceedings of the Forty-sixth Annual ACM Symposium on Theory of Computing, STOC 14, pages 50 59, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2710-7. doi: 10.1145/2591796. 2591803. URL http://doi.acm.org/10.1145/2591796.2591803. Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8:171 176, 1958. ISSN 0030-8730. Ian H Sloan and Henryk Wo zniakowski. When are Quasi-Monte Carlo algorithms efficient for high dimensional integrals? Journal of Complexity, 14(1):1 33, 1998. Patrick Smadbeck and Yiannis N. Kaznessis. A closure scheme for chemical master equations. Proceedings of the National Academy of Sciences, 110(35):14261 14265, 2013. doi: 10.1073/pnas.1306481110. URL http://www.pnas.org/content/110/35/14261. abstract. Damian Straszak and Nisheeth K. Vishnoi. Computing maximum entropy distributions everywhere, 2017. available at ar Xiv:1711.02036. D. Sutter, T. Sutter, P. Mohajerin Esfahani, and R. Renner. Efficient approximation of quantum channel capacities. IEEE Transactions on Information Theory, 62(1):578 598, Jan 2016. ISSN 0018-9448. doi: 10.1109/TIT.2015.2503755. T. Sutter, D. Sutter, P. Mohajerin Esfahani, and J. Lygeros. Efficient approximation of channel capacities. IEEE Transactions on Information Theory, 61(4):1649 1666, April 2015. ISSN 0018-9448. doi: 10.1109/TIT.2015.2401002. Generalized maximum entropy estimation Maxime Th ely. Stochastic programming with Quasi-Monte Carlo methods. Master s thesis, ETH Zurich, Switzerland, 2017. H. Umegaki. Conditional expectation in an operator algebra. Kodai Mathematical Seminar Reports, 14:59 85, 1962. doi: 10.2996/kmj/1138844604. N. G. van Kampen. Stochastic processes in physics and chemistry, volume 888 of Lecture Notes in Mathematics. 
North-Holland Publishing Co., Amsterdam-New York, 1981. ISBN 0-444-86200-5. Darren James Wilkinson. Stochastic modelling for systems biology. Chapman & Hall/CRC Mathematical and Computational Biology Series. Chapman & Hall/CRC, Boca Raton, FL, 2006. ISBN 978-1-58488-540-5; 1-58488-540-8.