# Distributionally Robust Optimization with Markovian Data

Mengmeng Li, Tobias Sutter, Daniel Kuhn

Risk Analytics and Optimization Chair, École Polytechnique Fédérale de Lausanne. Correspondence to: Mengmeng Li. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. We study a stochastic program where the probability distribution of the uncertain problem parameters is unknown and only indirectly observed via finitely many correlated samples generated by an unknown Markov chain with $d$ states. We propose a data-driven distributionally robust optimization model to estimate the problem's objective function and optimal solution. By leveraging results from large deviations theory, we derive statistical guarantees on the quality of these estimators. The underlying worst-case expectation problem is nonconvex and involves $O(d^2)$ decision variables. Thus, it cannot be solved efficiently for large $d$. By exploiting the structure of this problem, we devise a customized Frank-Wolfe algorithm with convex direction-finding subproblems of size $O(d)$. We prove that this algorithm finds a stationary point efficiently under mild conditions. The efficiency of the method is predicated on a dimensionality reduction enabled by a dual reformulation. Numerical experiments indicate that our approach has better computational and statistical properties than the state-of-the-art methods.

1. Introduction

Decision problems under uncertainty are ubiquitous in machine learning, engineering and economics. Traditionally, such problems are modeled as stochastic programs that seek a decision $x \in X \subseteq \mathbb{R}^n$ minimizing an expected loss $\mathbb{E}_{\mathbb{P}}[L(x, \xi)]$, where the expectation is taken with respect to the distribution $\mathbb{P}$ of the random problem parameter $\xi$. However, more often than not, $\mathbb{P}$ is unknown to the decision maker and can only be observed indirectly via a finite number of training samples $\xi_1, \ldots, \xi_T$. The classical sample average approximation (SAA) replaces the unknown probability distribution $\mathbb{P}$ with the empirical distribution corresponding to the training samples and solves the resulting empirical risk minimization problem (Shapiro et al., 2014). Fuelled by modern applications in machine learning, however, there has recently been a surge of alternative methods for data-driven optimization complementing SAA. Ideally, any meaningful approach to data-driven optimization should display the following desirable properties.

(i) Avoiding the optimizer's curse. It is well understood that decisions achieving a low empirical cost on the training dataset may perform poorly on a test dataset generated by the same distribution. In decision analysis, this phenomenon is called the optimizer's curse, and it is closely related to overfitting in statistics (Smith & Winkler, 2006). Practically useful schemes should mitigate this detrimental effect.

(ii) Statistical guarantees. Most statistical guarantees for existing approaches to data-driven optimization critically rely on the assumption that the training samples are independent and identically distributed (i.i.d.). This assumption is often not justifiable or even wrong in practice. Practically useful methods should offer guarantees that remain valid when the training samples display serial dependencies.

(iii) Computational tractability.
For a data-driven scheme to be practically useful it is indispensable that the underlying optimization problems can be solved efficiently.

While the SAA method is computationally tractable, it is susceptible to the optimizer's curse if training data is scarce; see, e.g., (Van Parys et al., 2021). Distributionally robust optimization (DRO) is an alternative approach that mitigates overfitting effects (Delage & Ye, 2010; Goh & Sim, 2010; Wiesemann et al., 2014). DRO seeks worst-case optimal decisions that minimize the expected loss under the most adverse distribution from within a given ambiguity set, that is, a distribution family characterized by certain known properties of the unknown data-generating distribution. DRO has been studied since Scarf's seminal treatise on the ambiguity-averse newsvendor problem (Scarf, 1958), but it has gained thrust only with the advent of modern robust optimization techniques (Bertsimas & Sim, 2004; Ben-Tal et al., 2009). In many cases of practical interest, DRO problems can be reformulated exactly as finite convex programs that are solvable in polynomial time. Indeed, such reformulations are available for many different ambiguity sets defined through generalized moment constraints (Delage & Ye, 2010; Goh & Sim, 2010; Wiesemann et al., 2014; Bertsimas et al., 2018), φ-divergences (Ben-Tal et al., 2013; Namkoong & Duchi, 2016), Wasserstein distances (Mohajerin Esfahani & Kuhn, 2018; Kuhn et al., 2019), or maximum mean discrepancy distances (Staib & Jegelka, 2019; Kirschner et al., 2020).

With the notable exceptions of (Dou & Anitescu, 2019; Derman & Mannor, 2020; Sutter et al., 2020; Duchi et al., 2021), we are not aware of any data-driven DRO models for non-i.i.d. data. In this paper we apply the general framework by Sutter et al. (2020) to data-driven DRO models with Markovian training samples and propose an efficient algorithm for their solution. Our DRO scheme is perhaps most similar to the one studied by Duchi et al. (2021), which can also handle Markovian data. However, this scheme differs from ours in two fundamental ways. First, while Duchi et al. (2021) work with φ-divergence ambiguity sets, we use ambiguity sets inspired by a statistical optimality principle recently established by Van Parys et al. (2021) and Sutter et al. (2020) using ideas from large deviations theory. Second, the statistical guarantees by Duchi et al. (2021) depend on unknown constants, whereas our confidence bounds are explicit and easy to evaluate. Dou & Anitescu (2019) assume that the training samples are generated by a first-order autoregressive process, which is neither a generalization nor a special case of a finite-state Markov chain. Indeed, while Dou & Anitescu (2019) can handle continuous state spaces, our model does not assume a linear dependence on the previous state. Derman & Mannor (2020) investigate a finite-state Markov decision process (MDP) with an unknown transition kernel, and they develop a DRO approach using a Wasserstein ambiguity set for estimating its value function. While MDPs are more general than the static optimization problems considered here, the out-of-sample guarantees in (Derman & Mannor, 2020, Theorem 4.1) rely on the availability of several i.i.d. sample trajectories. In contrast, our statistical guarantees require only one single trajectory of (correlated) training samples. We also remark that MDP models can be addressed with online mirror descent methods (Jin & Sidford, 2020).
Unfortunately, it is doubtful whether such methods could be applied to our DRO problems because the statistically optimal ambiguity sets considered in this paper are nonconvex.

In summary, while many existing data-driven DRO models are tractable and can mitigate the optimizer's curse, there are hardly any statistical performance guarantees that apply when the training samples fail to be i.i.d. and when there is only one trajectory of correlated training samples. This paper addresses this gap. Specifically, we study data-driven decision problems where the training samples are generated by a time-homogeneous, ergodic Markov chain with $d$ states. Sutter et al. (2020) show that statistically optimal data-driven decisions for such problems are obtained by solving a DRO model with a conditional relative entropy ambiguity set. The underlying worst-case expectation problem is nonconvex and involves $d^2$ decision variables. To our best knowledge, as of now there exist no efficient algorithms for solving this hard optimization problem. We highlight the following main contributions of this paper.

- We apply the general framework by Sutter et al. (2020), which uses ideas from large deviations theory to construct statistically optimal data-driven DRO models, to decision problems where the training data is generated by a time-homogeneous, ergodic finite-state Markov chain. We prove that the resulting DRO models are asymptotically consistent.
- We develop a customized Frank-Wolfe algorithm for solving the underlying worst-case expectation problems in an efficient manner. This is achieved via the following steps. (i) We first reparametrize the problems to move the nonconvexities from the feasible set to the objective function. (ii) We then develop a Frank-Wolfe algorithm for solving the reparametrized nonconvex problem of size $O(d^2)$. (iii) Using a duality argument, we show that the direction-finding subproblems with a linearized objective function are equivalent to convex programs of size $O(d)$ with a rectangular feasible set that can be solved highly efficiently.
- We prove that the proposed Frank-Wolfe algorithm converges to a stationary point of the nonconvex worst-case expectation problem at a rate $O(1/\sqrt{M})$, where $M$ denotes the number of iterations. Each iteration involves the solution of a convex $d$-dimensional minimization problem with a smooth objective function and a rectangular feasible set. The solution of this minimization problem could be further accelerated by decomposing it into $d$ one-dimensional optimization problems that can be processed in parallel.
- We propose a Neyman-Pearson-type hypothesis test for determining the Markov chain that generated a given sequence of training samples, and we construct examples of Markov chains that are arbitrarily difficult to distinguish on the basis of a finite training dataset alone.
- In the context of a revenue maximization problem where customers display a Markovian brand switching behavior, we show that the proposed DRO models for Markovian data outperform classical DRO models, which are either designed for i.i.d. data or based on different ambiguity sets.

Notation. The inner product of two vectors $a, b \in \mathbb{R}^m$ is denoted by $\langle a, b \rangle = a^\top b$, and the $n$-dimensional probability simplex is defined as $\Delta_n = \{x \in \mathbb{R}^n_+ : \sum_{i=1}^n x_i = 1\}$. The relative entropy between two probability vectors $p, q \in \Delta_n$ is defined as $D(p \,\|\, q) = \sum_{i=1}^n p_i \log(p_i/q_i)$, where we use the conventions $0 \log(0/q) = 0$ for $q \geq 0$ and $p \log(p/0) = \infty$ for $p > 0$. The closure and the interior of a subset $D$ of a topological space are denoted by $\mathrm{cl}\, D$ and $\mathrm{int}\, D$, respectively. For any $n \in \mathbb{N}$ we set $[n] = \{1, \ldots, n\}$. We use $A_{i\cdot}$ and $A_{\cdot i}$ to denote the $i$-th row and column of a matrix $A \in \mathbb{R}^{m \times n}$, respectively. All proofs are relegated to the Appendix.
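For reference, the relative entropy with the above conventions is straightforward to evaluate; the following minimal NumPy sketch (the function name and array encoding of the distributions are our own illustrative choices, not part of the paper) is used by later sketches as well.

```python
import numpy as np

def relative_entropy(p, q):
    """D(p || q) = sum_i p_i log(p_i / q_i), with the conventions
    0 * log(0/q) = 0 for q >= 0 and p * log(p/0) = +inf for p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0                      # terms with p_i = 0 contribute nothing
    if np.any(q[support] == 0):          # some p_i > 0 faces q_i = 0
        return np.inf
    return float(np.sum(p[support] * np.log(p[support] / q[support])))
```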
2. Data-driven DRO with Markovian Data

In this paper we study the single-stage stochastic program

$$\min_{x \in X} \; \mathbb{E}_{\mathbb{P}}[L(x, \xi)], \tag{1}$$

where $X \subseteq \mathbb{R}^n$ is compact, $\xi$ denotes a random variable valued in $\Xi = \{1, 2, \ldots, d\}$, and $L : X \times \Xi \to \mathbb{R}$ represents a known loss function that is continuous in $x$. We assume that all random objects (e.g., $\xi$) are defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ whose probability measure $\mathbb{P}$ is unknown but belongs to a known ambiguity set $\mathcal{A}$ to be described below. We further assume that $\mathbb{P}$ can only be observed indirectly through samples $\{\xi_t\}_{t \in \mathbb{N}}$ from a time-homogeneous ergodic Markov chain with state space $\Xi$ and deterministic initial state $\xi_0 \in \Xi$. We finally assume that the distribution of $\xi$ coincides with the unique stationary distribution of the Markov chain.

In the following we let $\theta^\star \in \mathbb{R}^{d \times d}_{++}$ be the unknown stationary probability mass function of the doublet $(\xi_t, \xi_{t+1}) \in \Xi^2$, that is, we set

$$\theta^\star_{ij} = \lim_{t \to \infty} \mathbb{P}(\xi_t = i, \xi_{t+1} = j) \quad \forall i, j \in \Xi. \tag{2}$$

By construction, the row sums of $\theta^\star$ must coincide with the respective column sums, and thus $\theta^\star$ must belong to

$$\Theta = \Big\{ \theta \in \mathbb{R}^{d \times d}_{++} : \textstyle\sum_{i,j=1}^d \theta_{ij} = 1, \;\; \sum_{j=1}^d \theta_{ij} = \sum_{j=1}^d \theta_{ji} \;\; \forall i \in \Xi \Big\},$$

that is, the set of all strictly positive doublet probability mass functions with balanced marginals. Below we call elements of $\Theta$ simply models. Note that any model $\theta \in \Theta$ encodes an ergodic time-homogeneous Markov chain with a unique row vector $\pi_\theta \in \mathbb{R}^{1 \times d}_{++}$ of stationary probabilities and a unique transition probability matrix $P_\theta \in \mathbb{R}^{d \times d}_{++}$ defined through $(\pi_\theta)_i = \sum_{j \in \Xi} \theta_{ij}$ and $(P_\theta)_{ij} = \theta_{ij}/(\pi_\theta)_i$, respectively. Hence, the elements of $\pi_\theta$ sum to 1, $P_\theta$ represents a row-stochastic matrix, and $\pi_\theta = \pi_\theta P_\theta$. Moreover, any $\theta \in \Theta$ induces a probability measure $\mathbb{P}_\theta$ on $(\Omega, \mathcal{F})$ with

$$\mathbb{P}_\theta(\xi_t = i_t \;\; \forall t = 1, \ldots, T) = (P_\theta)_{\xi_0 i_1} \prod_{t=1}^{T-1} (P_\theta)_{i_t i_{t+1}}$$

for all $(i_1, \ldots, i_T) \in \Xi^T$ and $T \in \mathbb{N}$. We are now ready to define the ambiguity set as $\mathcal{A} = \{\mathbb{P}_\theta : \theta \in \Theta\}$. Below we use $\mathbb{E}_\theta$ to denote the expectation operator with respect to $\mathbb{P}_\theta$. By construction, there exists a model $\theta^\star \in \Theta$ corresponding to the unknown true probability measure $\mathbb{P}$ such that $\mathbb{P}_{\theta^\star} = \mathbb{P}$. Estimating $\mathbb{P}$ is thus equivalent to estimating the unknown true parameter $\theta^\star$. Given a finite training dataset $\{\xi_t\}_{t=1}^T$, a natural estimator for $\theta^\star$ is the empirical doublet distribution $\hat\theta_T \in \mathbb{R}^{d \times d}_+$ defined through

$$(\hat\theta_T)_{ij} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}_{(\xi_{t-1}, \xi_t) = (i,j)} \tag{3}$$

for all $i, j \in \Xi$. The ergodic theorem ensures that $\hat\theta_T$ converges $\mathbb{P}_\theta$-almost surely to $\theta$ for any model $\theta \in \Theta$ (Ross, 2010, Theorem 4.1). In the remainder we denote by $\overline\Theta$ the set of all doublet probability mass functions on $\Xi \times \Xi$ (the probability simplex in $\mathbb{R}^{d \times d}_+$), which serves as the state space of $\hat\theta_T$ and is strictly larger than $\Theta$. For ease of notation, we also define the model-based predictor $c(x, \theta) = \mathbb{E}_\theta[L(x, \xi)] = \sum_{i,j=1}^d L(x, i)\,\theta_{ij}$ as the expected loss of $x$ under model $\theta$. Note that $c(x, \theta)$ is jointly continuous in $x$ and $\theta$ due to the continuity of $L(x, \xi)$ in $x$. As we will show below, it is natural to measure the discrepancy between a model $\theta \in \Theta$ and an estimator realization $\theta' \in \overline\Theta$ by the conditional relative entropy.

Definition 1 (Conditional relative entropy). The conditional relative entropy of $\theta' \in \overline\Theta$ with respect to $\theta \in \mathrm{cl}\,\Theta$ is

$$D_c(\theta' \,\|\, \theta) = \sum_{i=1}^d (\pi_{\theta'})_i \, D\big((P_{\theta'})_{i\cdot} \,\big\|\, (P_\theta)_{i\cdot}\big) = \sum_{i,j=1}^d \theta'_{ij} \log\!\left( \frac{\theta'_{ij} \big/ \sum_{k=1}^d \theta'_{ik}}{\theta_{ij} \big/ \sum_{k=1}^d \theta_{ik}} \right),$$

where the invariant distributions $\pi_\theta, \pi_{\theta'} \in \mathbb{R}^{1 \times d}_+$ and the transition probability matrices $P_\theta, P_{\theta'} \in \mathbb{R}^{d \times d}_+$ corresponding to $\theta$ and $\theta'$, respectively, are defined in the usual way.
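For concreteness, the estimator (3) and Definition 1 take only a few lines of code. The following NumPy sketch assumes that states are encoded as 0, ..., d-1 and that both doublet distributions are strictly positive; the function names are ours, not from the paper's code.

```python
import numpy as np

def empirical_doublet(xi, d):
    """Empirical doublet distribution (3) of a trajectory xi = (xi_0, ..., xi_T),
    with states encoded as 0, ..., d-1."""
    theta_hat = np.zeros((d, d))
    for prev, curr in zip(xi[:-1], xi[1:]):
        theta_hat[prev, curr] += 1.0
    return theta_hat / (len(xi) - 1)

def conditional_relative_entropy(theta_p, theta):
    """D_c(theta' || theta) from Definition 1, assuming both doublet
    distributions are strictly positive."""
    P_p = theta_p / theta_p.sum(axis=1, keepdims=True)  # transition matrix of theta'
    P = theta / theta.sum(axis=1, keepdims=True)        # transition matrix of theta
    # (pi_theta')_i (P_theta')_ij = theta'_ij, so the weighted sum of row-wise
    # relative entropies collapses to a single sum over doublets
    return float(np.sum(theta_p * np.log(P_p / P)))
```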
Note that if the training samples $\{\xi_t\}_{t \in \mathbb{N}}$ are i.i.d., then the conditional relative entropy of $\theta'$ with respect to $\theta$ collapses to the relative entropy of $\pi_{\theta'}$ with respect to $\pi_\theta$. Using the conditional relative entropy, we can now define the distributionally robust predictor $\hat c_r : X \times \overline\Theta \to \mathbb{R}$ through

$$\hat c_r(x, \theta') = \max_{\theta \in \mathrm{cl}\,\Theta} \; \big\{ c(x, \theta) : D_c(\theta' \,\|\, \theta) \leq r \big\} \tag{4a}$$

if (4a) is feasible and $\hat c_r(x, \theta') = \max_{\theta \in \mathrm{cl}\,\Theta} c(x, \theta)$ otherwise. Note that $\hat c_r(x, \theta')$ represents the worst-case expected cost of $x$ with respect to all probability measures in the ambiguity set $\{\mathbb{P}_\theta : D_c(\theta' \,\|\, \theta) \leq r\}$, which can be viewed as an image (in $\mathcal{A}$) of a conditional relative entropy ball (in $\Theta$) of radius $r \geq 0$ around $\theta'$. We know from (Sutter et al., 2020, Proposition 5.1) that the sublevel sets of $D_c(\theta' \,\|\, \theta)$ are compact for all fixed $\theta'$, and thus the maximum in (4a) is attained whenever the problem is feasible. The distributionally robust predictor (4a) also induces a distributionally robust prescriptor $\hat x_r : \overline\Theta \to X$ defined through

$$\hat x_r(\theta') \in \arg\min_{x \in X} \; \hat c_r(x, \theta'). \tag{4b}$$

Embracing the distributionally robust approach outlined above, if we only have access to $T$ training samples, we will implement the data-driven decision $\hat x_r(\hat\theta_T)$ and predict the corresponding expected cost as $\hat c_r(\hat x_r(\hat\theta_T), \hat\theta_T)$, which we will refer to, by slight abuse of terminology, as the in-sample risk. To assess the quality of the data-driven decision $\hat x_r(\hat\theta_T)$ we use two performance measures that stand in direct competition with each other: the out-of-sample risk $c(\hat x_r(\hat\theta_T), \theta)$ and the out-of-sample disappointment

$$\mathbb{P}_\theta\Big[ c\big(\hat x_r(\hat\theta_T), \theta\big) > \hat c_r\big(\hat x_r(\hat\theta_T), \hat\theta_T\big) \Big]$$

with respect to any fixed model $\theta$. Intuitively, the out-of-sample risk represents the true expected loss of $\hat x_r(\hat\theta_T)$, and the out-of-sample disappointment quantifies the probability that the in-sample risk (the predicted loss) strictly underestimates the out-of-sample risk (the actual loss) under $\mathbb{P}_\theta$. Following Sutter et al. (2020), we will consider a data-driven decision as desirable if it has a low out-of-sample risk and a low out-of-sample disappointment under the unknown true model $\theta^\star$. This is reasonable because the objective of the original problem (1) is to minimize expected loss. Optimistically underestimating the loss could incentivize naïve decisions that overfit to the training data and therefore lead to disappointment in out-of-sample tests. Thus, it makes sense to keep the out-of-sample disappointment small.
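For very small state spaces, the worst-case expectation problem (4a) can be attacked directly with a generic nonlinear programming solver, which is useful as a sanity check before resorting to the scalable method of Section 4. The following naive sketch uses SciPy's SLSQP solver; it assumes a strictly positive empirical doublet distribution (e.g., after the perturbation discussed in Section 6.2) and is not the algorithm proposed in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def worst_case_expectation(L_x, theta_hat, r):
    """Naive solver for the predictor (4a), practical only for small d.

    L_x[i]    : loss L(x, i) for the fixed decision x
    theta_hat : empirical doublet distribution (d x d), strictly positive
    r         : radius of the conditional relative entropy ball
    """
    d = len(L_x)
    p_hat = theta_hat / theta_hat.sum(axis=1, keepdims=True)

    def neg_obj(t):              # maximize c(x, theta) = sum_ij L(x, i) theta_ij
        return -np.sum(L_x[:, None] * t.reshape(d, d))

    def cond_rel_entropy(t):     # D_c(theta_hat || theta)
        p = t.reshape(d, d) / t.reshape(d, d).sum(axis=1, keepdims=True)
        return np.sum(theta_hat * np.log(p_hat / p))

    cons = [
        {"type": "eq", "fun": lambda t: t.sum() - 1.0},                              # pmf
        {"type": "eq", "fun": lambda t: t.reshape(d, d).sum(axis=1)
                                        - t.reshape(d, d).sum(axis=0)},              # balanced marginals
        {"type": "ineq", "fun": lambda t: r - cond_rel_entropy(t)},                  # entropy ball
    ]
    res = minimize(neg_obj, theta_hat.flatten(), method="SLSQP",
                   bounds=[(1e-9, 1.0)] * (d * d), constraints=cons)
    return -res.fun, res.x.reshape(d, d)
```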
3. Statistical Guarantees

The distributionally robust optimization model (4) with a conditional relative entropy ambiguity set has not received much attention in the extant literature and may thus seem exotic at first sight. As we will show below, however, this model offers in a precise sense optimal statistical guarantees if the training samples are generated by an (unknown) time-homogeneous ergodic Markov chain. The reason for this is that the empirical doublet distribution (3) is a sufficient statistic for $\theta$ and satisfies a large deviation principle with rate function $D_c(\theta' \,\|\, \theta)$ under $\mathbb{P}_\theta$ for any $\theta \in \Theta$. Before formalizing these statistical guarantees, we first need to recall some basic concepts from large deviations theory. We refer to the excellent textbooks by Dembo & Zeitouni (2010) and den Hollander (2008) for more details.

Definition 2 (Rate function (Dembo & Zeitouni, 2010, Section 2.1)). A function $I : \overline\Theta \times \Theta \to [0, \infty]$ is called a rate function if $I(\theta', \theta)$ is lower semi-continuous in $\theta'$.

Definition 3 (Large deviation principle). An estimator $\hat\theta_T \in \overline\Theta$ satisfies a large deviation principle (LDP) with rate function $I$ if for all $\theta \in \Theta$ and Borel sets $D \subseteq \overline\Theta$ we have

$$-\inf_{\theta' \in \mathrm{int}\, D} I(\theta', \theta) \;\leq\; \liminf_{T \to \infty} \frac{1}{T} \log \mathbb{P}_\theta\big(\hat\theta_T \in D\big) \;\leq\; \limsup_{T \to \infty} \frac{1}{T} \log \mathbb{P}_\theta\big(\hat\theta_T \in D\big) \;\leq\; -\inf_{\theta' \in \mathrm{cl}\, D} I(\theta', \theta). \tag{5}$$

Conveniently, the empirical doublet distribution (3) obeys an LDP with the conditional relative entropy as rate function.

Lemma 1 (LDP for Markov chains (Dembo & Zeitouni, 2010, Theorem 3.1.13)). The empirical doublet distribution (3) satisfies an LDP with rate function $D_c(\theta' \,\|\, \theta)$.

Lemma 1 is the key ingredient to establish the following out-of-sample guarantees for the distributionally robust predictor and the corresponding prescriptor.

Theorem 2 (Out-of-sample guarantees (Sutter et al., 2020, Theorems 3.1 & 3.2)). The distributionally robust predictor and prescriptor defined in (4) offer the following guarantees.
(i) For all $\theta \in \Theta$ and $x \in X$, we have $\limsup_{T \to \infty} \frac{1}{T} \log \mathbb{P}_\theta\big[ c(x, \theta) > \hat c_r(x, \hat\theta_T) \big] \leq -r$.
(ii) For all $\theta \in \Theta$, we have $\limsup_{T \to \infty} \frac{1}{T} \log \mathbb{P}_\theta\big[ c(\hat x_r(\hat\theta_T), \theta) > \hat c_r(\hat x_r(\hat\theta_T), \hat\theta_T) \big] \leq -r$.

Theorem 2 asserts that the out-of-sample disappointment of the distributionally robust predictor and the corresponding prescriptor decays at least as fast as $e^{-rT + o(T)}$ asymptotically as the sample size $T$ grows, where the decay rate $r$ coincides with the radius of the conditional relative entropy ball in (4). Note also that $\hat x_r(\hat\theta_T)$ can be viewed as an instance of a data-driven prescriptor, that is, any Borel measurable function that maps the training samples $\{\xi_t\}_{t=1}^T$ to a decision in $X$. One can then prove that $\hat x_r(\hat\theta_T)$ offers the best (least possible) out-of-sample risk among the vast class of data-driven prescriptors whose out-of-sample disappointment decays at a prescribed rate of at least $r$. Maybe surprisingly, this optimality property holds uniformly across all models $\theta \in \Theta$ and thus in particular for the unknown true model $\theta^\star$; see (Sutter et al., 2020, Theorem 3.2). It is primarily this Pareto dominance property that makes the distributionally robust optimization model (4) interesting.

Next, we prove that if the radius $r$ of the conditional relative entropy ball tends to 0 at a sufficiently slow speed as $T$ increases, then the distributionally robust predictor converges $\mathbb{P}_\theta$-almost surely to the true expected loss of the decision $x$.

Theorem 3 (Strong asymptotic consistency). If $r_T \geq d_T$ for all $T \in \mathbb{N}$ and $\lim_{T \to \infty} r_T = 0$, then $\lim_{T \to \infty} \hat c_{r_T}(x, \hat\theta_T) = c(x, \theta)$ $\mathbb{P}_\theta$-almost surely for all $x \in X$ and $\theta \in \Theta$.

4. Numerical Solution

Even though the distributionally robust optimization model (4) offers powerful statistical guarantees, it is practically useless unless the underlying optimization problems can be solved efficiently. In this section we develop a numerical procedure to compute the predictor (4a) for a fixed $x$ and $\theta'$. This is challenging for the following reasons. First, the conditional relative entropy $D_c(\theta' \,\|\, \theta)$ is nonconvex in $\theta$ (see Remark 13 in Appendix A), and thus problem (4a) has a nonconvex feasible set. In addition, problem (4a) involves $d^2$ decision variables (the entries of the matrix $\theta$) and thus its dimension is high already for moderate values of $d$. In the following we reformulate (4a) as an equivalent optimization problem with a convex feasible set and a nonconvex objective function, which is amenable to efficient numerical solution via a customized Frank-Wolfe algorithm (Frank & Wolfe, 1956; Jaggi, 2013).
To this end, recall from Definition 1 that the conditional relative entropy $D_c(\theta' \,\|\, \theta)$ can be expressed as a weighted sum of relative entropies between corresponding row vectors of the transition probability matrices $P_{\theta'}$ and $P_\theta$. By construction, $P_\theta$ is a row-stochastic matrix, and the stationary distribution $\pi_\theta$ corresponding to $P_\theta$ is a normalized non-negative row vector with $\pi_\theta P_\theta = \pi_\theta$. Routine calculations show that

$$\pi_\theta^\top = \big(A_d(P_\theta)\big)^{-1} (0, 0, \ldots, 1)^\top, \tag{6}$$

where the matrix $A_d(P_\theta) \in \mathbb{R}^{d \times d}$ is defined through

$$\big(A_d(P_\theta)\big)_{ij} = \begin{cases} (P_\theta)_{ii} - 1 & \text{for } i = j,\; i < d, \\ (P_\theta)_{ji} & \text{for } i \neq j,\; i < d, \\ 1 & \text{for } i = d. \end{cases}$$

From now on we denote by $\mathcal{P} \subseteq \mathbb{R}^{d \times d}$ the set of all transition probability matrices of ergodic Markov chains on $\Xi$ with strictly positive entries. Parametrizing models by their transition probability matrices $P_\theta$ instead of $\theta$ and using (6), we can now reformulate the optimization problem (4a) in terms of $P_\theta$.

Lemma 4 (Reformulation). If $\theta' > 0$, then the worst-case expectation problem (4a) is equivalent to

$$\hat c_r(x, \theta') = \max_{P \in \mathcal{D}_r(\theta')} \Psi(x, P), \tag{7}$$

where $\Psi(x, P) = \sum_{i=1}^d L(x, i) \big(A_d(P)^{-1}\big)_{id}$ and $\mathcal{D}_r(\theta') = \big\{ P \in \mathcal{P} : \sum_{i=1}^d (\pi_{\theta'})_i \, D\big((P_{\theta'})_{i\cdot} \,\|\, P_{i\cdot}\big) \leq r \big\}$.

In the following we will sometimes abbreviate $\mathcal{D}_r(\theta')$ by $\mathcal{D}$ when the dependence on $\theta'$ and $r$ is irrelevant. Problem (7) is more attractive than (4a) from a computational point of view because its feasible set is closely related to the popular relative entropy uncertainty sets, which have received considerable attention in robust optimization (Ben-Tal et al., 2013). On the other hand, while the objective function $c(x, \theta)$ of the original problem (4a) was linear in $\theta$, the objective function $\Psi(x, P)$ of (7) is nonconvex in $P$. These properties suggest that (7) can be solved efficiently and exactly if its objective function is replaced with a linear approximation. We thus develop a customized Frank-Wolfe algorithm that generates approximate solutions for problem (7) by solving a sequence of linearized oracle subproblems. While our main Frank-Wolfe routine is presented in Algorithm 1, its key subroutine for solving the oracle subproblems is described in Algorithm 2.

Algorithm 1 Frank-Wolfe algorithm for solving (7)
1: Input: $r > 0$, $\theta' > 0$, $x \in X$, $\varepsilon > 0$, $\pi_{\theta'}$, $P_{\theta'}$
2: Output: $P$
3: Initialize $P^{(0)}$, $g^{(0)} = 2\varepsilon$, $m = 0$
4: while $g^{(m)} > \varepsilon$ do
5: &nbsp;&nbsp;Compute $S^{(m)} \in \arg\max_{S \in \mathcal{D}} \langle S, \nabla_P \Psi(x, P^{(m)}) \rangle$
6: &nbsp;&nbsp;Compute $g^{(m)} = \langle S^{(m)} - P^{(m)}, \nabla_P \Psi(x, P^{(m)}) \rangle$
7: &nbsp;&nbsp;if $g^{(m)} \leq \varepsilon$ then
8: &nbsp;&nbsp;&nbsp;&nbsp;Return $P = P^{(m)}$
9: &nbsp;&nbsp;end if
10: &nbsp;Compute $\gamma_m \in \arg\max_{\gamma \in [0,1]} \Psi\big(x, P^{(m)} + \gamma(S^{(m)} - P^{(m)})\big)$
11: &nbsp;$P^{(m+1)} = P^{(m)} + \gamma_m (S^{(m)} - P^{(m)})$
12: &nbsp;$m \leftarrow m + 1$
13: end while

[Figure 1. Flow diagram of Algorithm 2: the primal oracle subproblem $S^{(m)} \in \arg\max_{S \in \mathcal{D}} \langle S, \nabla_P \Psi(x, P^{(m)}) \rangle$ is solved via the dual oracle subproblem $\eta^\star \in \arg\min_{\underline\eta \leq \eta \leq \bar\eta} Q(\eta)$ (using strong duality), SGD is used to find $\eta^\star$, and the primal maximizer $S^{(m)}$ is reconstructed from the dual minimizer $\eta^\star$.]

Algorithm 1 generates a sequence of approximate solutions for problem (7) that enjoy rigorous optimality guarantees.

Theorem 5 (Convergence of Algorithm 1). For any fixed $\varepsilon > 0$ and $\theta' > 0$, Algorithm 1 finds an approximate stationary point for problem (7) with a Frank-Wolfe gap of at most $\varepsilon$ in $O(1/\varepsilon^2)$ iterations.

The proof of Theorem 5 leverages convergence results for generic Frank-Wolfe algorithms by Lacoste-Julien (2016). A detailed proof is relegated to Appendix C. We now explain the implementation of Algorithm 1. Note that the only nontrivial step is the solution of the oracle subproblem in line 5. To construct this subproblem we need the gradient $\nabla_P \Psi(x, P)$, which can be computed in closed form. Full details are provided in Appendix C.
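For intuition, evaluating the reparametrized objective $\Psi(x, P)$ only requires solving one $d$-dimensional linear system, as in (6). The sketch below assumes $P$ is the transition matrix of an ergodic chain (so that $A_d(P)$ is invertible); the helper names are ours.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P via (6):
    pi = A_d(P)^{-1} e_d, where rows i < d of A_d(P) encode (P^T - I) pi = 0
    and the last row encodes sum_i pi_i = 1."""
    d = P.shape[0]
    A = P.T - np.eye(d)          # (A_d(P))_{ij} = P_{ji} - 1{i=j} for i < d
    A[-1, :] = 1.0               # last row: all entries equal to 1
    e_d = np.zeros(d)
    e_d[-1] = 1.0
    return np.linalg.solve(A, e_d)

def Psi(L_x, P):
    """Reparametrized objective of Lemma 4:
    Psi(x, P) = sum_i L(x, i) (A_d(P)^{-1})_{id} = sum_i L(x, i) pi_i."""
    return float(L_x @ stationary_distribution(P))
```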
Even though the oracle subproblem is convex by construction (see Proposition 14), it is still computationally expensive because it involves $d^2$ decision variables $P \in \mathbb{R}^{d \times d}$ subject to $O(d)$ linear constraints as well as a single conditional relative entropy constraint. Algorithm 2 thus solves the dual oracle subproblem, which involves only $d$ decision variables $\eta \in \mathbb{R}^d$ subject to box constraints. Strong duality holds because $P_{\theta'}$ represents a Slater point for the primal problem (see Proposition 18). The dual oracle subproblem is amenable to efficient numerical solution via stochastic gradient descent (SGD), and a primal maximizer can be recovered from any dual minimizer by using the problem's first-order optimality conditions (see Proposition 18). The high-level structure of Algorithm 2 is visualized in Figure 1.

To construct the dual oracle subproblem, fix a decision $x \in X$ and an anchor point $P^{(m)} \in \mathcal{P}$, and set $C = \nabla_P \Psi(x, P^{(m)})$. In addition, define the box bounds $\underline\eta_i = \max_j \{C_{ij}\}$ and $\bar\eta_i$, an explicit constant that depends only on $r$, $\max_{i,j}\{C_{ij}\}$ and $\mathrm{tr}(C^\top P_{\theta'})$ (its exact expression is given in the appendix). The dual subproblem minimizes $Q(\eta) = \sum_{i=1}^d Q_i(\eta)$ over the box $[\underline\eta, \bar\eta]$, where $Q_i(\eta) = \eta_i - \lambda^\star(\eta)/d$ and $\lambda^\star(\eta)$ is a convex function given by an exponential of $\sum_{i,j=1}^d (\pi_{\theta'})_i (P_{\theta'})_{ij} \log(\eta_i - C_{ij})$ (see the appendix for its precise formula). As $Q(\eta)$ is reminiscent of an empirical average, the dual oracle subproblem lends itself to efficient numerical solution via SGD, which obviates costly high-dimensional gradient calculations. In addition, the partial derivatives of $Q_i(\eta)$ are readily available in closed form in terms of $(\pi_{\theta'})_i$, $(P_{\theta'})_{ij}$ and $\eta_i - C_{ij}$. We set the SGD learning rate to $\ell = K N^{-1/2}$ for some constant $K > 0$, which ensures that the suboptimality of the iterates generated by the SGD algorithm decays as $O(1/\sqrt{N})$ after $N$ iterations; see (Nemirovski et al., 2009, Section 2.2). A summary of Algorithm 2 in pseudocode is given below.

Algorithm 2 Solution of the oracle subproblem
1: Input: $\underline\eta$, $\bar\eta$, $\ell$, $P_{\theta'}$, $N$
2: Output: $S^{(N)}$, an approximate maximizer of $\max_{S \in \mathcal{D}} \langle S, \nabla_P \Psi(x, P^{(m)}) \rangle$
3: Initialize $\eta^{(0)}$ with $\underline\eta_i \leq \eta^{(0)}_i \leq \bar\eta_i$, $i = 1, \ldots, d$
4: for $n = 0, \ldots, N$ do
5: &nbsp;&nbsp;for $i = 1, \ldots, d$ do
6: &nbsp;&nbsp;&nbsp;&nbsp;$y^{(n)}_i = \eta^{(n)}_i - \ell \, \partial Q_i(\eta^{(n)}) / \partial \eta_i$
7: &nbsp;&nbsp;end for
8: &nbsp;&nbsp;$\eta^{(n+1)} \in \arg\min_{\underline\eta_i \leq s_i \leq \bar\eta_i,\, i=1,\ldots,d} \|s - y^{(n)}\|_2^2$
9: end for
10: $\eta^\star = \eta^{(N+1)}$, $\lambda^\star = \lambda^\star(\eta^\star)$
11: $(S^{(N)})_{ij} = (\pi_{\theta'})_i (P_{\theta'})_{ij} \, \lambda^\star / (\eta^\star_i - C_{ij})$ for $i, j = 1, \ldots, d$

Theorem 6 (Convergence of Algorithm 2). For any fixed $\varepsilon > 0$, Algorithm 2 outputs an $\varepsilon$-suboptimal solution of the primal oracle subproblem in $O(1/\varepsilon^2)$ iterations.

Theorem 6 follows from standard convergence results in (Nemirovski et al., 2009, Section 2.2) applied to the dual oracle subproblem. We finally remark that the stepsize $\gamma_m$ to be computed in line 10 of Algorithm 1 can be obtained via a direct line search method.
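The inner loop of Algorithm 2 is ordinary projected gradient descent over a box, where the projection in line 8 reduces to clipping. The sketch below mirrors lines 4-9; the callback dQ stands in for the closed-form partial derivative of $Q_i$ from the appendix and is a placeholder, not the paper's actual expression.

```python
import numpy as np

def dual_oracle_sgd(eta0, eta_lo, eta_hi, dQ, N, K=1.0):
    """Projected gradient loop of Algorithm 2 (lines 4-9).

    eta0       : starting point inside the box [eta_lo, eta_hi]
    dQ(eta, i) : placeholder for the closed-form partial derivative
                 of Q_i with respect to eta_i
    N, K       : number of iterations and learning-rate constant,
                 so that the step size is l = K * N**(-1/2)
    """
    eta = np.asarray(eta0, dtype=float).copy()
    step = K / np.sqrt(N)
    for _ in range(N):
        y = np.array([eta[i] - step * dQ(eta, i) for i in range(eta.size)])
        eta = np.clip(y, eta_lo, eta_hi)   # Euclidean projection onto the box
    return eta
```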
Remark 7 (Overall complexity of Algorithm 1). Denote by $\varepsilon_1$ the error of the Frank-Wolfe algorithm (Algorithm 1) provided by Theorem 5, and by $\varepsilon_2$ the error of the subroutine (Algorithm 2) provided by Theorem 6. That is, suppose we solve the Frank-Wolfe subproblem approximately in the sense that

$$\big\langle S^{(m)}, \nabla_P \Psi(x, P^{(m)}) \big\rangle \;\geq\; \max_{S \in \mathcal{D}} \big\langle S, \nabla_P \Psi(x, P^{(m)}) \big\rangle - \kappa \varepsilon_2$$

for some constant $\kappa > 0$. We denote by $C_f$ the curvature constant of the function $\Psi$. By (Lacoste-Julien, 2016), after running $O((1 + 2\kappa\varepsilon_2/C_f)/\varepsilon_1^2)$ iterations of Algorithm 1 we achieve an accumulated error below $\varepsilon_1$. As Algorithm 1 calls Algorithm 2 in total $O(1/\varepsilon_1^2)$ times, the overall complexity coincides with the number of operations needed to solve $O(1/\varepsilon_1^2)$ times a $d$-dimensional minimization problem of a smooth convex function over a compact box. This significantly reduces the computational cost that would be needed to solve the original $d^2$-dimensional nonconvex problem (4a) directly in each step. A further reduction of complexity is possible by applying a randomized Frank-Wolfe algorithm, as suggested by Reddi et al. (2016).

Given an efficient computational method to solve the predictor problem (4a), we now address the solution of the prescriptor problem (4b). Since our Frank-Wolfe algorithm only enables us to access the values of the objective function $\hat c_r(x, \theta')$ for fixed values of $x \in X$ and $\theta' \in \overline\Theta$, we cannot use gradient-based algorithms to optimize over $x$. Nevertheless, we can resort to derivative-free optimization (DFO) approaches, which can find local or global minima under appropriate regularity conditions on $\hat c_r(x, \theta')$. Besides popular heuristic algorithms such as simulated annealing (Metropolis et al., 1953) or genetic algorithms (Holland, 1992), there are two major classes of DFO methods: direct search and model-based methods. The direct search methods use only comparisons between function evaluations to generate new candidate points, while the model-based methods construct a smooth surrogate objective function. We focus here on direct search methods due to their solid worst-case convergence guarantees. Descriptions of these algorithms and convergence proofs can be found in the recent textbook by Audet & Hare (2017). For a survey of model-based DFO see (Conn et al., 2009). Off-the-shelf software to solve constrained DFO problems includes fminsearch in Matlab or NOMAD in C++; see (Rios & Sahinidis, 2013) for an extensive list of solvers.

When applied to the extreme barrier function formulation of the prescriptor problem (4b), which is obtained by ignoring the constraints and adding a high penalty to each infeasible decision (Audet & Dennis, 2006), the directional direct-search method by Vicente (2013) offers an attractive worst-case convergence guarantee thanks to (Vicente, 2013, Theorem 2 & Corollary 1). The exact algorithm that we use to solve (4b) is outlined in Algorithm 3 in Appendix E.

Theorem 8 (Convergence of Algorithm 3 (Vicente, 2013)). Fix any $\theta' \in \overline\Theta$ and $\varepsilon > 0$. If $\nabla_x \hat c_r(x, \theta')$ is Lipschitz continuous in $x$, then Algorithm 3 outputs a solution $x_N$ with $\|\nabla_x \hat c_r(x_N, \theta')\| \leq \varepsilon$ after $N \in O(\varepsilon^{-2})$ iterations. If $\hat c_r(x, \theta')$ is also convex in $x$, then we further have $\hat c_r(x_N, \theta') - \min_{x \in X} \hat c_r(x, \theta') \leq \varepsilon$.
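The extreme barrier idea is easy to illustrate with an off-the-shelf direct-search solver. The sketch below uses SciPy's Nelder-Mead method (the same family as Matlab's fminsearch) rather than the paper's Algorithm 3, assumes a continuous decision set (the binary revenue problem of Section 6 would require a discrete search instead), and uses helper names and a finite penalty value that are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def solve_prescriptor(c_hat_r, x0, is_feasible, penalty=1e9):
    """Derivative-free sketch for the prescriptor problem (4b).

    c_hat_r(x)     : black-box evaluation of the predictor (4a),
                     e.g. obtained by running Algorithm 1
    is_feasible(x) : membership test for the feasible set X
    The extreme barrier replaces infeasible evaluations by a large
    penalty, so the unconstrained direct-search solver never accepts
    an infeasible point.
    """
    def barrier_objective(x):
        return c_hat_r(x) if is_feasible(x) else penalty

    result = minimize(barrier_objective, np.asarray(x0, dtype=float),
                      method="Nelder-Mead")
    return result.x
```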
5. An Optimal Hypothesis Test

We now develop a Neyman-Pearson-type hypothesis test for determining which one of two prescribed Markov chains has generated a given sequence $\xi_1, \ldots, \xi_T$ of training samples. Specifically, we assume that these training samples were either generated by the Markov chain corresponding to the doublet probability mass function $\theta^{(1)} \in \Theta$ or by the one corresponding to $\theta^{(2)} \in \Theta$, that is, we have $\theta^\star \in \{\theta^{(1)}, \theta^{(2)}\}$. Next, we construct a decision rule encoded by the open set

$$B = \big\{ \theta' \in \overline\Theta : D_c(\theta' \,\|\, \theta^{(1)}) < D_c(\theta' \,\|\, \theta^{(2)}) \big\}, \tag{8}$$

which predicts $\theta^\star = \theta^{(1)}$ if $\hat\theta_T \in B$ and predicts $\theta^\star = \theta^{(2)}$ otherwise. The quality of any such decision rule is conveniently measured by the type I and type II error probabilities

$$\alpha_T = \mathbb{P}_{\theta^{(1)}}\big(\hat\theta_T \notin B\big) \quad \text{and} \quad \beta_T = \mathbb{P}_{\theta^{(2)}}\big(\hat\theta_T \in B\big) \tag{9}$$

for $T \in \mathbb{N}$, respectively. Specifically, $\alpha_T$ represents the probability that the proposed decision rule wrongly predicts $\theta^\star = \theta^{(2)}$ if the training samples were generated by $\theta^{(1)}$, and $\beta_T$ represents the probability that the decision rule wrongly predicts $\theta^\star = \theta^{(1)}$ if the data was generated by $\theta^{(2)}$.

Proposition 9 (Decay of the error probabilities). If $\bar\theta \in \arg\min_{\theta' \notin B} D_c(\theta' \,\|\, \theta^{(1)})$ and $\bar r = D_c(\bar\theta \,\|\, \theta^{(1)}) > 0$, then $\lim_{T \to \infty} \frac{1}{T} \log \alpha_T = -\bar r$ and $\lim_{T \to \infty} \frac{1}{T} \log \beta_T = -\bar r$.

Proposition 9 ensures that the proposed decision rule is powerful in the sense that both error probabilities decay exponentially with the sample size $T$ at some rate $\bar r$ that can at least principally be computed. The next theorem is inspired by the celebrated Neyman-Pearson lemma (Cover & Thomas, 2006, Theorem 11.7) and shows that this decision rule is actually optimal in a precise statistical sense.

Theorem 10 (Optimality of the hypothesis test). Let $\alpha'_T$ and $\beta'_T$ represent the type I and type II error probabilities of the decision rule obtained by replacing $B$ with any other open set $B' \subseteq \overline\Theta$. If $\limsup_{T \to \infty} \frac{1}{T} \log \alpha'_T \leq \lim_{T \to \infty} \frac{1}{T} \log \alpha_T$, then $\liminf_{T \to \infty} \frac{1}{T} \log \beta'_T \geq \lim_{T \to \infty} \frac{1}{T} \log \beta_T$.

Theorem 10 implies that the proposed hypothesis test is Pareto optimal in the following sense. If any test with the same test statistic $\hat\theta_T$ but a different set $B'$ has a faster decaying type I error probability, then it must necessarily have a slower decaying type II error probability. In Appendix D.1 we construct two different Markov chains that are almost indistinguishable on the basis of a finite training dataset. Indeed, we will show that the (optimal) decay rates of the type I and type II error probabilities can be arbitrarily small.
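In code, the decision rule (8) amounts to a single comparison of conditional relative entropies; the sketch below reuses the conditional_relative_entropy helper from the sketch after Definition 1 and assumes strictly positive doublet distributions.

```python
def markov_hypothesis_test(theta_hat, theta_1, theta_2):
    """Decision rule (8): predict chain 1 if the empirical doublet
    distribution theta_hat is closer to theta_1 than to theta_2 in
    conditional relative entropy, and chain 2 otherwise."""
    d1 = conditional_relative_entropy(theta_hat, theta_1)
    d2 = conditional_relative_entropy(theta_hat, theta_2)
    return 1 if d1 < d2 else 2
```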
6. Revenue Maximization under a Markovian Brand Switching Model

We now test the performance of our approach in the context of a revenue maximization model, which aims to recognize and exploit repeat-buying and brand-switching behavior. This model assumes that the probability of a customer buying a particular brand depends on the brand purchased last (Herniter & Magee, 1961; Chintagunta et al., 1991; Gensler et al., 2007; Leeflang et al., 2015). Adopting the perspective of a medium-sized retailer, we then formulate an optimization problem that seeks to maximize profit from sales by anticipating the long-term average demand of each brand based on the brand transition probabilities between different customer segments characterized by distinct demographic attributes. For instance, younger people might prefer relatively new brands, while senior people might prefer more established brands. Once such insights are available, the retailer selects the brands to put on offer with the goal to maximize long-run average revenue. We assume that the brand purchasing behavior of the customers is exogenous, which is realistic unless the retailer is a monopolist or has at least significant market power. Therefore, the retailer's ordering decisions have no impact on the behavior of the customers. The ergodicity of the brand choice dynamics of each customer segment can be justified by the ergodicity of demographic characteristics (Ezzati, 1974).

6.1. Problem Formulation

Assume that there are $d$ different brands of a particular good, and denote by $a \in \mathbb{R}^d$ the vector of retail prices per unit of the good for each brand. The seller needs to decide which brands to put on offer. This decision is encoded by a binary vector $x \in \{0, 1\}^d$ with $x_j = 1$ if brand $j$ is offered and $x_j = 0$ otherwise. To quantify the seller's revenue, we need to specify the demand for each brand. To this end, assume that the customers are clustered into $n$ groups with similar demographic characteristics (Kuo et al., 2002). The percentage of customers in group $k$ is denoted by $w_k \in \mathbb{R}_+$, and the aggregate brand preferences $\xi^k_t$ of the customers in group $k$ and period $t$ represent an ergodic Markov chain on $\Xi = \{1, \ldots, d\}$ with stationary distribution $\pi_{\theta^k}$. Thus, the long-run average cost per customer and time period is

$$c(x, \theta) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{n} \sum_{j=1}^{d} w_k \, a_j x_j \, \mathbb{1}_{\xi^k_t = j} = \sum_{k=1}^{n} w_k \sum_{j=1}^{d} a_j x_j (\pi_{\theta^k})_j = \sum_{k=1}^{n} w_k \, c_k(x, \theta^k),$$

where $c_k(x, \theta^k) = \sum_{j=1}^d a_j x_j (\pi_{\theta^k})_j$ and $\theta = (\theta^k)_{k=1}^n$. Here, the second equality exploits the ergodic theorem (Ross, 2010, Theorem 4.1). It is easy to show that $c_k(x, \theta^k)$ can be expressed as an expected loss with respect to a probability measure encoded by $\theta^k$, much like the objective function of problem (1). Finally, we define $X = \{x \in \{0, 1\}^d : Cx \leq b\}$ for some $C \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$. The linear inequalities $Cx \leq b$ may capture budget constraints or brand exclusivity restrictions etc. The cost minimization problem $\min_{x \in X} c(x, \theta)$ can thus be viewed as an instance of (1). If $\theta$ is unknown and only $T$ training samples are available, we construct the empirical doublet distributions $\hat\theta^k_T$ for all $k = 1, \ldots, n$ and solve the DRO problem

$$\min_{x \in X} \; \sum_{k=1}^{n} w_k \, \hat c^k_r(x, \hat\theta^k_T), \tag{10}$$

$$\text{where} \quad \hat c^k_r(x, \hat\theta^k_T) = \max_{\theta^k \in \mathrm{cl}\,\Theta} \big\{ c_k(x, \theta^k) : D_c(\hat\theta^k_T \,\|\, \theta^k) \leq r \big\}. \tag{11}$$

Below we denote by $\hat c_T(x, \hat\theta_T)$ the objective function and by $\hat x_r(\hat\theta_T)$ a minimizer of (10), where $\hat\theta_T = (\hat\theta^k_T)_{k=1}^n$.
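To make the formulation concrete, the nominal group-wise objective and the feasibility test can be written as follows. This is a sketch with our own helper names; pis would contain the stationary distributions $\pi_{\theta^k}$, e.g. computed with the stationary_distribution helper from the sketch in Section 4.

```python
import numpy as np

def group_objective(x, a, w, pis):
    """Nominal long-run objective of Section 6.1:
    sum_k w_k sum_j a_j x_j (pi_{theta^k})_j, where pis[k] is the
    stationary distribution of the Markov chain of customer group k."""
    return float(sum(w_k * np.dot(a * x, pi_k) for w_k, pi_k in zip(w, pis)))

def is_admissible(x, C, b):
    """Membership test for the feasible set X = {x in {0,1}^d : C x <= b}."""
    return bool(np.all(np.isin(x, (0, 1))) and np.all(C @ x <= b))
```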
6.2. Numerical Experiments

We now validate the out-of-sample guarantees of the proposed Markov DRO approach experimentally when the observable data is indeed generated by ergodic Markov chains and compare our method against three state-of-the-art approaches to data-driven decision making from the literature. The first baseline is the Wasserstein DRO approach by Derman & Mannor (2020), which replaces the worst-case expectation problem (11) for customer group $k$ with

$$\hat c^k_r(x, \hat\theta^k_T) = \max_{P \in \mathcal{D}_r(\hat\theta^k_T)} \Psi(x, P),$$

where the reparametrized objective function $\Psi(x, P)$ is defined as in Lemma 4, and the ambiguity set is defined as

$$\mathcal{D}_r(\hat\theta^k_T) = \big\{ P \in \mathcal{P} : d_W\big((P_{\hat\theta^k_T})_{i\cdot}, P_{i\cdot}\big) \leq r \;\; \forall i \big\}.$$

Here, $d_W$ denotes the 1-Wasserstein distance, where the transportation cost between two states $i, j \in \Xi$ is set to $|i - j|$. We also compare our method against the classical SAA approach (Shapiro et al., 2014) and the DRO approach by Van Parys et al. (2021), which replaces the conditional relative entropy in the ambiguity set (11) with the ordinary relative entropy (i.e., the Kullback-Leibler divergence). We stress that both of these approaches were originally designed for serially independent training samples. For small $T$ it is likely that $\hat\theta_T$ fails to be strictly positive, in which case Algorithm 1 may not converge; see Theorem 5. In this case, we slightly perturb and renormalize $\hat\theta_T$ to ensure that $\hat\theta_T > 0$ and $\hat\theta_T \in \overline\Theta$.

Synthetic data. In the first experiment we solve a random instance of the revenue maximization problem with $n = 5$ customer groups and $d = 10$ brands, where the weight vector $w$ and the price vector $a$ are sampled from the uniform distributions on $\Delta_n$ and $\{1, \ldots, 10\}^d$, respectively. To construct the transition probability matrix of the Markov chain reflecting the brand switching behavior of any group $k$, we first sample a random matrix from the uniform distribution on $[0, 1]^{d \times d}$, increase two random elements of this matrix to 4 and 5, respectively, and normalize all rows to render the matrix stochastic. As the data-generating Markov chains in this synthetic experiment are known, the true out-of-sample risk of any data-driven decision can be computed exactly. (The Matlab code for reproducing all results is available from https://github.com/mkvdro/DRO_Markov.)

[Figure 2. Out-of-sample risk of different data-driven decisions based on synthetic data (Markov DRO, Wasserstein, i.i.d. DRO, SAA, and the true minimum). Solid lines and shaded areas represent empirical averages and empirical 90% confidence intervals, respectively, computed using 100 independent training datasets.]

[Figure 3. Empirical out-of-sample disappointment of different data-driven decisions based on synthetic data, computed using 100 independent training datasets.]

Real-world data. The second experiment is based on a marketing dataset from Kaggle (https://www.kaggle.com/khalidnasereddin/retail-dataset-analysis), which tracks the purchasing behavior of 2,000 customers with respect to $d = 5$ brands of chocolates. The customers are clustered into $n = 5$ groups based on their age, education level and income by using the K-means++ clustering algorithm by Vassilvitskii & Arthur (2006). The resulting clusters readily induce a weight vector $w$. The price vector $a$ is sampled randomly from the uniform distribution on $\{1, \ldots, 10\}^d$. In addition, we concatenate the purchase histories of all customers in any group $k$ and interpret the resulting time series as a trajectory of the unknown Markov chain corresponding to group $k$.

[Figure 4. Out-of-sample risk of different data-driven decisions based on 200 test samples from the Kaggle dataset.]

Results. Figures 2 and 4 show that the proposed Markov DRO method outperforms the SAA scheme and the DRO approach by Van Parys et al. (2021) tailored to i.i.d. data (denoted as "i.i.d. DRO") in that its out-of-sample risk displays a smaller mean as well as a smaller variance. For small training sample sizes $T$, our method is also superior to the Wasserstein DRO approach by Derman & Mannor (2020), but for $T \geq 300$ the two methods display a similar behavior. Note also that the mean as well as the variance of the out-of-sample risk increase with $r$ for all three DRO approaches. This can be explained by the increasing inaccuracy of a model that becomes more and more pessimistic. In fact, if no transitions are observed from a certain brand A to another brand B, given that the data follows an ergodic Markov process, then brand B's true long-term value is overestimated when the data is falsely treated as i.i.d. This phenomenon explains the discrepancy between the out-of-sample risk incurred by methods that treat the data as Markovian or as i.i.d., respectively; see Figure 4. Note also that if the prescribed decay rate $r$ drops to 0, then all methods should collapse to the SAA approach if $\hat\theta_T > 0$. If $\hat\theta_T$ has zero entries, however, the worst-case expected risk with respect to a conditional relative entropy ambiguity set does not necessarily reduce to the empirical risk as $r$ tends to 0. We shed more light on this phenomenon in Remark 12. Figure 3 further shows that the Markov DRO approach results in the smallest out-of-sample disappointment among all tested methods for a wide range of rate parameters $r$. Comparing the charts for $T = 10$, $T = 300$ and $T = 500$, we also see that the out-of-sample disappointment of the Markov DRO method decays with $T$, which is consistent with Theorem 2. In practice, the optimal choice of the decay rate $r$ remains an open question.
To tune $r$ with the aim to trade off in-sample performance against out-of-sample disappointment, one could use the rolling window heuristic for model selection in time series models by Bergmeir & Benítez (2012).

Scalability. We also compare the scalability of the proposed Frank-Wolfe algorithm against that of an interior point method by Waltz et al. (2006) (with or without exact gradient information), which represents a state-of-the-art method in nonlinear optimization. To this end, we solve 10 instances of the worst-case expectation problem (11) with rate parameter $r = 1$ for a fixed decision $x$ sampled from the uniform distribution on $\{0, 1\}^d$. The 10 instances involve independent training sets of size $T = 5{,}000$, each of which is sampled from the same fixed Markov chain constructed as in the experiments with synthetic data. The problem instances are modelled in MATLAB, and all experiments are run on an Intel i5-5257U CPU (2.7GHz) computer with 16GB RAM. Figure 5 shows that our Frank-Wolfe algorithm is significantly faster than both baseline methods whenever the Markov chain accommodates at least $d = 100$ states. In particular, note that the interior point methods run out of memory as soon as $d$ exceeds 200.

[Figure 5. Runtimes (in seconds) of different methods for solving problem (4a) as a function of the number of states $d$ (interior point with approximate gradient, interior point with exact gradient, customized Frank-Wolfe). Lines represent means and shaded areas represent the ranges between the smallest and the largest runtimes.]

Acknowledgements. We thank Bart Van Parys for valuable comments on the paper and for suggesting the Markov coin example in Appendix D.1. This research was supported by the Swiss National Science Foundation under the NCCR Automation, grant agreement 51NF40 180545.

References

Audet, C. and Dennis, J. E. Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization, 17(1):188–217, 2006.

Audet, C. and Hare, W. Derivative-Free and Blackbox Optimization. Springer, 2017.

Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust Optimization. Princeton University Press, 2009.

Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.

Bergmeir, C. and Benítez, J. M. On the use of cross-validation for time series predictor evaluation. Information Sciences, 191:192–213, 2012.

Berman, A. and Plemmons, R. J. Nonnegative Matrices in the Mathematical Sciences. SIAM, 1994.

Bertsimas, D. and Sim, M. The price of robustness. Operations Research, 52(1):35–53, 2004.

Bertsimas, D., Gupta, V., and Kallus, N. Data-driven robust optimization. Mathematical Programming, 167(2):235–292, 2018.

Chintagunta, P. K., Jain, D. C., and Vilcassim, N. J. Investigating heterogeneity in brand preferences in logit models for panel data. Journal of Marketing Research, 28(4):417–428, 1991.

Conn, A. R., Scheinberg, K., and Vicente, L. N. Introduction to Derivative-Free Optimization. SIAM, 2009.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 2006.

Delage, E. and Ye, Y. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.

Dembo, A. and Zeitouni, O. Large Deviations Techniques and Applications. Springer, 2010.

den Hollander, F. Large Deviations. American Mathematical Society, 2008.
Derman, E. and Mannor, S. Distributional robustness and regularization in reinforcement learning. arXiv e-prints, 2020.

Dou, X. and Anitescu, M. Distributionally robust optimization with correlated data from vector autoregressive processes. Operations Research Letters, 47(4):294–299, 2019.

Duchi, J., Glynn, P., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 2021. Forthcoming.

Eriksson, K., Estep, D., and Johnson, C. Applied Mathematics: Body and Soul. Volume 1: Derivatives and Geometry in R3. Springer, 2004.

Ezzati, A. Forecasting market shares of alternative home-heating units by Markov process using transition probabilities estimated from aggregate time series data. Management Science, 21(4):462–473, 1974.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.

Gensler, S., Dekimpe, M. G., and Skiera, B. Evaluating channel performance in multi-channel environments. Journal of Retailing and Consumer Services, 14(1):17–23, 2007.

Goh, J. and Sim, M. Distributionally robust optimization and its tractable approximations. Operations Research, 58(4):902–917, 2010.

Herniter, J. D. and Magee, J. F. Customer behavior as a Markov process. Operations Research, 9(1):105–122, 1961.

Holland, J. H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT Press, 1992.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pp. 427–435, 2013.

Jin, Y. and Sidford, A. Efficiently solving MDPs with stochastic mirror descent. In International Conference on Machine Learning, pp. 4890–4900, 2020.

Kirschner, J., Bogunovic, I., Jegelka, S., and Krause, A. Distributionally robust Bayesian optimization. In Artificial Intelligence and Statistics, pp. 2174–2184, 2020.

Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pp. 130–166, 2019.

Kuo, R., Ho, L., and Hu, C. M. Integration of self-organizing feature map and k-means algorithm for market segmentation. Computers & Operations Research, 29(11):1475–1493, 2002.

Lacoste-Julien, S. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv e-prints, 2016.

Leeflang, P. S., Wieringa, J. E., Bijmolt, T. H., and Pauwels, K. H. Modeling Markets. Springer, 2015.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166, 2018.

Namkoong, H. and Duchi, J. C. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pp. 2208–2216, 2016.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Perron, O. Zur Theorie der Matrices. Mathematische Annalen, 64(2):248–263, 1907.
Reddi, S. J., Sra, S., Póczos, B., and Smola, A. Stochastic Frank-Wolfe methods for nonconvex optimization. In Annual Conference on Communication, Control, and Computing, pp. 1244–1251, 2016.

Rios, L. M. and Sahinidis, N. V. Derivative-free optimization: A review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247–1293, 2013.

Ross, S. M. Introduction to Probability Models. Elsevier, 2010.

Scarf, H. A min-max solution of an inventory problem. In Arrow, K. J., Karlin, S., and Scarf, H. (eds.), Studies in the Mathematical Theory of Inventory and Production, pp. 201–209. Stanford University Press, 1958.

Shapiro, A., Dentcheva, D., and Ruszczyński, A. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.

Smith, J. and Winkler, R. The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311–322, 2006.

Staib, M. and Jegelka, S. Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems, pp. 9134–9144, 2019.

Sutter, T., Van Parys, B. P. G., and Kuhn, D. A general framework for optimal data-driven optimization. arXiv e-prints, 2020.

Van Parys, B. P. G., Mohajerin Esfahani, P., and Kuhn, D. From data to decisions: Distributionally robust optimization is optimal. Management Science, 2021. Articles in Advance.

Vassilvitskii, S. and Arthur, D. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035, 2006.

Vicente, L. N. Worst case complexity of direct search. EURO Journal on Computational Optimization, 1(1-2):143–153, 2013.

Waltz, R. A., Morales, J. L., Nocedal, J., and Orban, D. An interior algorithm for nonlinear optimization that combines line search and trust region steps. Mathematical Programming, 107(3):391–408, 2006.

Wiesemann, W., Kuhn, D., and Sim, M. Distributionally robust convex optimization. Operations Research, 62(6):1358–1376, 2014.