Planning and Learning with Stochastic Action Sets

Craig Boutilier, Alon Cohen, Avinatan Hassidim, Yishay Mansour, Ofer Meshi, Martin Mladenov and Dale Schuurmans
Google Research
{cboutilier,aloncohen,avinatan,mansour,meshi,schuurmans}@google.com

Abstract

In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities, and a sampling-based model for unknown distributions. We develop polynomial-time value and policy iteration methods for both cases, and provide a polynomial-time linear programming solution for the first case.

1 Introduction

Markov decision processes (MDPs) are the standard model for sequential decision making under uncertainty, and provide the foundations for reinforcement learning (RL). With the recent emergence of RL as a practical AI technology in combination with deep learning [Mnih et al., 2013; 2015], new use cases are arising that challenge basic MDP modeling assumptions. One such challenge is that many practical MDP and RL problems have stochastic sets of feasible actions; that is, the set $A_s$ of feasible actions at state $s$ varies stochastically with each visit to $s$. For instance, in online advertising, the set of available ads differs at distinct occurrences of the same state (e.g., same query, user, contextual features), due to exogenous factors like campaign expiration or budget throttling. In recommender systems with large item spaces, often a set of candidate recommendations is first generated, from which top-scoring items are chosen; exogenous factors often induce non-trivial changes in the candidate set. With the recent application of MDP and RL models in ad serving and recommendation [Charikar et al., 1999; Li et al., 2009; Archak et al., 2010; 2012; Amin et al., 2012; Silver et al., 2013; Theocharous et al., 2015; Mladenov et al., 2017], understanding how to capture the stochastic nature of available action sets is critical.

Somewhat surprisingly, this problem seems to have been largely unaddressed in the literature. Standard MDP formulations [Puterman, 1994] allow each state $s$ to have its own feasible action set $A_s$, and it is not uncommon to allow the set $A_s$ to be non-stationary or time-dependent. However, they do not support the treatment of $A_s$ as a stochastic random variable. In this work, we: (a) introduce the stochastic action set MDP (SAS-MDP) and provide its theoretical foundations; (b) describe how to account for stochastic action sets in model-free RL (e.g., Q-learning); and (c) develop tractable algorithms for solving SAS-MDPs in important special cases.

An obvious way to treat this problem is to embed the set of available actions into the state itself.
This provides a useful analytical tool, but it does not immediately provide tractable algorithms for learning and optimization, since each state is augmented with all possible subsets of actions, incurring an exponential blow-up in state space size. To address this issue, we show that SAS-MDPs possess an important property: the Q-value of an available action $a$ at a state $s$ is independent of the availability of other actions. This allows us to prove that optimal policies can be represented compactly using (state-specific) decision lists (or orderings) over the action set. This special structure allows one to solve the SAS RL problem effectively using, for example, Q-learning. We also devise model-based algorithms that exploit this policy structure. We develop value and policy iteration schemes, showing they converge in a polynomial number of iterations (w.r.t. the size of the underlying base MDP). We also show that per-iteration complexity is polynomial time for two important special forms of action availability distribution: (a) when action availabilities are independent, both methods are exact; (b) when the distribution over sets $A_s$ is sampleable, we obtain approximation algorithms with polynomial sample complexity. In fact, policy iteration is strongly polynomial under additional assumptions (for a fixed discount factor). We show that a linear program for SAS-MDPs can be solved in polynomial time as well. Finally, we offer a simple empirical demonstration of the importance of accounting for stochastic action availability when computing an MDP policy. Additional discussion and full proofs of all results can be found in a longer version of this paper [Boutilier et al., 2018].

2 MDPs with Stochastic Action Sets

We first introduce SAS-MDPs and provide a simple example illustrating how action availability impacts optimal decisions. See [Puterman, 1994] for more background on MDPs.

2.1 The SAS-MDP Model

Our formulation of MDPs with Stochastic Action Sets (SAS-MDPs) derives from a standard, finite-state, finite-action MDP (the base MDP) $M$, with $n$ states $S$, base actions $B_s$ for $s \in S$, and transition and reward functions $P : S \times B \to \Delta(S)$ and $r : S \times B \to \mathbb{R}$. We use $p^k_{s,s'}$ and $r^k_s$ to denote the probability of transition to $s'$ and the accrued reward, respectively, when action $k$ is taken at state $s$. For notational ease, we assume that the feasible action sets for each $s \in S$ are identical, so $B_s = B$ (allowing distinct base sets at different states has no impact on what follows). Let $|B| = m$ and $M = |S \times B| = nm$. We assume an infinite-horizon, discounted objective with fixed discount rate $\gamma$, $0 \le \gamma < 1$.

In a SAS-MDP, the set of actions available at state $s$ at any stage $t$ is a random subset $A^{(t)}_s \subseteq B$. We assume a family of action availability distributions $P_s \in \Delta(2^B)$, defined over the powerset of $B$. These can depend on $s \in S$ but are otherwise history-independent, hence $\Pr(A^{(t)}_s \mid s^{(1)}, \ldots, s^{(t)}) = \Pr(A^{(t)}_s \mid s^{(t)})$. Only actions $k \in A^{(t)}_s$ in the realized available action set can be executed at stage $t$. Apart from this, the dynamics of the MDP is unchanged: when an (available) action is taken, state transitions and rewards are prescribed as in the base MDP. In what follows, we assume that some action is always available, i.e., $\Pr(A^{(t)}_s = \emptyset) = 0$ for all $s, t$. (Models that trigger process termination when $A^{(t)}_s = \emptyset$ are well-defined, but we set aside this model variant here.) Note that a SAS-MDP does not conform to the usual definition of an MDP.
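To make the model concrete, the following minimal Python sketch represents a SAS-MDP as a base MDP plus per-state availability distributions over action subsets, and simulates one step. All names (SASMDP, sample_action_set, step) and the tiny numerical instance are illustrative assumptions, not part of the paper.

```python
import random

class SASMDP:
    """Minimal sketch of a SAS-MDP: a base MDP plus per-state
    distributions over available action subsets (illustrative only)."""

    def __init__(self, n_states, n_actions, p, r, gamma, avail_dist):
        self.n, self.m, self.gamma = n_states, n_actions, gamma
        self.p = p                    # p[s][k][s'] = transition probability
        self.r = r                    # r[s][k]     = reward
        self.avail_dist = avail_dist  # avail_dist[s] = list of (prob, frozenset of actions)

    def sample_action_set(self, s):
        """Draw the realized available set A_s^(t) ~ P_s (history-independent)."""
        u, acc = random.random(), 0.0
        for prob, subset in self.avail_dist[s]:
            acc += prob
            if u <= acc:
                return subset
        return self.avail_dist[s][-1][1]

    def step(self, s, k):
        """Execute an available action: base-MDP dynamics are unchanged."""
        s_next = random.choices(range(self.n), weights=self.p[s][k])[0]
        return self.r[s][k], s_next

# Tiny illustrative instance: 2 states, 2 actions; action 1 at state 1
# is available only with probability 0.3 (all numbers are made up).
mdp = SASMDP(
    n_states=2, n_actions=2,
    p=[[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]],
    r=[[0.5, 0.5], [0.0, 1.0]],
    gamma=0.9,
    avail_dist=[
        [(1.0, frozenset({0, 1}))],
        [(0.7, frozenset({0})), (0.3, frozenset({0, 1}))],
    ],
)
A = mdp.sample_action_set(1)          # realized available set at state 1
reward, s_next = mdp.step(1, max(A))  # take some available action
```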
2.2 Example

The following simple MDP shows the importance of accounting for stochastic action availability when making decisions. The MDP below has two states. Assume the agent starts at state $s_1$, where two actions (indicated by directed edges for their transitions) are always available: one (Stay) stays at $s_1$, and the other (Go) transitions to state $s_2$, both with reward 1/2. At $s_2$, the action Down returns to $s_1$, is always available and has reward 0. A second action Up also returns to $s_1$, but is available only with probability $p$ and has reward 1.

A naive solution that ignores action availability is as follows: we first compute the optimal Q-function assuming all actions are available (this can be derived from the optimal value function, computed using standard techniques). Then at each stage, we use the best action available at the current state, where actions are ranked by Q-value. Unfortunately, this leads to a suboptimal policy when the Up action has low availability, specifically if $p < 1/2$. The best naive policy always chooses to move to $s_2$ from $s_1$; at $s_2$, it picks the best action available. This yields a reward of 1/2 at even stages, and an expected reward of $p$ at odd stages. However, by anticipating the possibility that action Up is unavailable at $s_2$, the optimal (SAS) policy always stays at $s_1$, obtaining reward 1/2 at all stages. For $p < 1/2$, the latter policy dominates the former: the fraction of the optimal (SAS) value lost by the naive policy grows as the availability probability $p$ decreases. This example also illustrates that as action availability probabilities approach 1, the optimal policy for the base MDP is also optimal for the SAS-MDP.

2.3 Related Work

While a general formulation of MDPs with stochastic action availability does not appear in the literature, there are two strands of closely related work. In the bandits literature, sleeping bandits are bandit problems in which the arms available at each stage are determined randomly or adversarially (sleeping experts are similar, but with complete rather than bandit feedback) [Kleinberg et al., 2010; Kanade et al., 2009]. Best action orderings (analogous to our decision list policies for SAS-MDPs) are often used to define regret in these models. The goal is to develop exploration policies that minimize regret. Since these models have no state, if the action reward distributions are known, the optimal policy is trivial: always take the best available action. By contrast, a SAS-MDP, even with a known model, induces a difficult optimization problem, since the quality of an action depends not just on its immediate reward, but also on the availability of actions at reachable (future) states. This is our focus.

The second closely related branch of research comes from the field of stochastic routing. The Canadian Traveller Problem, the problem of minimizing travel time in a graph with unavailable edges, was introduced by Papadimitriou and Yannakakis [1991], who give intractability results (under weaker assumptions about edge availability, e.g., adversarial). Polychronopoulos and Tsitsiklis [1996] consider a stochastic version of the problem, where edge availabilities are random but static (any unavailable edge remains so throughout the scenario).
Most similar to our setting is the work of Nikolova and Karger [2008], who discuss the case of resampling edge costs at each node visit; however, the proposed solution is well-defined only when the edge costs are finite and does not easily extend to unavailable actions (infinite edge costs). Due to the specificity of their modeling assumptions, none of the solutions found in this line of research can be adapted in a straightforward way to SAS-MDPs.

3 Two Reformulations of SAS-MDPs

The randomness of feasible actions means that SAS-MDPs do not conform to the usual definition of an MDP. In this section, we develop two reformulations of SAS-MDPs that transform them into MDPs. We discuss the relative advantages of each, outline key properties and relationships between these models, and describe important special cases of the SAS-MDP model itself.

3.1 The Embedded MDP

We first consider a reformulation of the SAS-MDP in which we embed the (realized) available action set into the state space itself. This is a straightforward way to recover a standard MDP. The embedded MDP $M_e$ for a SAS-MDP has state space $S_e = \{ s_A : s \in S, A \subseteq B \}$, with $s_A$ having feasible action set $A$. (Embedded states whose embedded action subsets have zero probability are unreachable and can be ignored.) The history independence of $P_s$ allows transitions to be defined as

$p^k_{s_A, s'_{A'}} = P(s'_{A'} \mid s_A, k) = p^k_{s,s'} \, P_{s'}(A'), \quad k \in A.$

Rewards are defined similarly: $r^k(s_A) = r^k(s)$ for $k \in A$. In our earlier example, the embedded MDP has three states: $s_{1,\{\mathrm{Stay,Go}\}}$, $s_{2,\{\mathrm{Up,Down}\}}$, and $s_{2,\{\mathrm{Down}\}}$ (other action subsets have probability 0, hence their corresponding embedded states are unreachable). The feasible actions at each state are given by the embedded action set, and the only stochastic transition occurs when Go is taken at $s_1$: it moves to $s_{2,\{\mathrm{Up,Down}\}}$ with probability $p$ and to $s_{2,\{\mathrm{Down}\}}$ with probability $1 - p$.

Clearly, the induced reward process and dynamics are Markovian, hence $M_e$ is in fact an MDP under the usual definition. Given the natural translation afforded by the embedded MDP, we view this as providing the basic semantic underpinnings of the SAS-MDP model. This translation affords the use of standard MDP analytical tools and methods. A (stationary, deterministic, Markovian) policy $\pi : S_e \to B$ for $M_e$ is restricted so that $\pi(s_A) \in A$. The policy backup operator $T^\pi_e$ and Bellman operator $T^*_e$ for $M_e$ decompose naturally as follows:

$T^\pi_e V_e(s_{A_s}) = r^{\pi(s_{A_s})}_s + \gamma \sum_{s'} p^{\pi(s_{A_s})}_{s,s'} \sum_{A'_{s'} \subseteq B} P_{s'}(A'_{s'}) \, V_e(s'_{A'_{s'}})$,   (1)

$T^*_e V_e(s_{A_s}) = \max_{k \in A_s} \Big[ r^k_s + \gamma \sum_{s'} p^k_{s,s'} \sum_{A'_{s'} \subseteq B} P_{s'}(A'_{s'}) \, V_e(s'_{A'_{s'}}) \Big]$.   (2)

Their fixed points, $V^\pi_e$ and $V^*_e$ respectively, can be expressed similarly. Obtaining an MDP from an SAS-MDP via action-set embedding comes at the expense of a (generally) exponential blow-up in the size of the state space, which can increase by a factor of $2^{|B|}$.
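To illustrate the embedded construction and the blow-up just noted, here is a small sketch (building on the hypothetical SASMDP class above; all names are illustrative) that enumerates the reachable embedded states and their transition probabilities $p^k_{s_A,s'_{A'}} = p^k_{s,s'} P_{s'}(A')$.

```python
def build_embedded_mdp(mdp):
    """Enumerate reachable embedded states s_A and their transitions.

    Returns (states, trans) where states is a list of (s, A) pairs with
    P_s(A) > 0, and trans[(s, A, k)][(s', A')] = p^k_{s,s'} * P_{s'}(A').
    Illustrative sketch only: the embedded state space can grow by a
    factor of up to 2^|B|, which is exactly the blow-up discussed above.
    """
    states = [(s, A) for s in range(mdp.n)
              for prob, A in mdp.avail_dist[s] if prob > 0]
    trans = {}
    for (s, A) in states:
        for k in A:                        # only available actions are feasible
            dist = {}
            for (s2, A2) in states:
                p_avail = sum(pr for pr, sub in mdp.avail_dist[s2] if sub == A2)
                dist[(s2, A2)] = mdp.p[s][k][s2] * p_avail
            trans[(s, A, k)] = dist
    return states, trans
```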
3.2 The Compressed MDP

The embedded MDP provides a natural semantics for SAS-MDPs, but is problematic from an algorithmic and learning perspective given the state space blow-up. Fortunately, the history independence of the availability distributions gives rise to an effective, compressed representation. The compressed MDP $M_c$ recasts the embedded MDP in terms of the original state space, using expectations to express value functions, policies, and backups over $S$ rather than over the (exponentially larger) $S_e$. As we will see below, the compressed MDP induces a blow-up in action space rather than state space, but offers significant computational benefits.

Formally, the state space for $M_c$ is $S$. To capture action availability, the feasible action set for $s \in S$ is the set of state policies, or mappings $\mu_s : 2^B \to B$ satisfying $\mu_s(A_s) \in A_s$. In other words, once we reach $s$, $\mu_s$ dictates what action to take for any realized action set $A_s$. A policy for $M_c$ is a family $\mu_c = \{\mu_s : s \in S\}$ of such state policies. Transitions and rewards use expectations over $A_s$:

$p^{\mu_s}_{s,s'} = \sum_{A_s \subseteq B} P_s(A_s) \, p^{\mu_s(A_s)}_{s,s'}$  and  $r^{\mu_s}_s = \sum_{A_s \subseteq B} P_s(A_s) \, r^{\mu_s(A_s)}_s$.

In our earlier example, the compressed MDP has only two states, $s_1$ and $s_2$. Focusing on $s_2$, its actions in the compressed MDP are the set of state policies, or mappings from the realizable available sets $\{\{\mathrm{Up, Down}\}, \{\mathrm{Down}\}\}$ into action choices (as above, we ignore unrealizable action subsets). In this case, there are two such state policies: the first selects Up for $\{\mathrm{Up, Down}\}$ and (obviously) Down for $\{\mathrm{Down}\}$; the second selects Down for $\{\mathrm{Up, Down}\}$ and Down for $\{\mathrm{Down}\}$.

It is not hard to show that the dynamics and reward process defined above over this compressed state space and expanded action set (i.e., the set of state policies) are Markovian. Hence we can define policies, value functions, optimality conditions, and policy and Bellman backup operators in the usual fashion. For instance, the Bellman and policy backup operators, $T^*_c$ and $T^\mu_c$, on compressed value functions are:

$T^*_c V_c(s) = \mathbb{E}_{A_s \subseteq B}\Big[ \max_{k \in A_s} \big( r^k_s + \gamma \sum_{s'} p^k_{s,s'} V_c(s') \big) \Big]$,   (3)

$T^\mu_c V_c(s) = \mathbb{E}_{A_s \subseteq B}\Big[ r^{\mu_s(A_s)}_s + \gamma \sum_{s'} p^{\mu_s(A_s)}_{s,s'} V_c(s') \Big]$.   (4)

It is easy to see that any state policy $\mu$ induces a Markov chain over base states, hence we can define a standard $n \times n$ transition matrix $P^\mu$ for such a policy in the compressed MDP, where $p^\mu_{s,s'} = \mathbb{E}_{A \subseteq B}\big[ p^{\mu(s)(A)}_{s,s'} \big]$. When additional independence assumptions hold, this expectation over subsets can be computed efficiently (see Section 3.4).

Critically, we can show that there is a direct equivalence between policies and their value functions (including optimal policies and values) in $M_c$ and $M_e$. Define the action-expectation operator $E : \mathbb{R}^{n 2^m} \to \mathbb{R}^n$ to be a mapping that compresses a value function $V_e$ for $M_e$ into a value function $V^e_c$ for $M_c$:

$V^e_c(s) = E V_e(s) = \mathbb{E}_{A_s \subseteq B}\big[ V_e(s_{A_s}) \big] = \sum_{A_s \subseteq B} P_s(A_s) \, V_e(s_{A_s}).$

We emphasize that $E$ transforms an (arbitrary) value function $V_e$ in embedded space into a new value function $V^e_c$ defined in compressed space (hence, $V^e_c$ is not defined w.r.t. $M_c$).

Lemma 1  $E T^*_e V_e = T^*_c E V_e$. Hence, $T^*_c$ has a unique fixed point $V^*_c = E V^*_e$.

Proof:
$E T^*_e V_e(s) = \mathbb{E}_{A \subseteq B}\big[ T^*_e V_e(s_A) \big]$
$= \mathbb{E}_{A \subseteq B}\Big[ \max_{k \in A} \big( r^k_s + \gamma \sum_{s'_{A'}} p^k_{s_A, s'_{A'}} V_e(s'_{A'}) \big) \Big]$
$= \mathbb{E}_{A \subseteq B}\Big[ \max_{k \in A} \big( r^k_s + \gamma \sum_{s'} p^k_{s,s'} \, \mathbb{E}_{A' \subseteq B}\big[ V_e(s'_{A'}) \big] \big) \Big]$
$= \mathbb{E}_{A \subseteq B}\Big[ \max_{k \in A} \big( r^k_s + \gamma \sum_{s'} p^k_{s,s'} \, E V_e(s') \big) \Big]$
$= T^*_c E V_e(s).$

Lemma 2  Given the optimal value function $V^*_c$ for $M_c$, the optimal policy $\pi^*_e$ for $M_e$ can be constructed directly. Specifically, for any $s_A$, the optimal policy $\pi^*_e(s_A)$ and optimal value $V^*_e(s_A)$ at that embedded state can be computed in polynomial time.

Proof Sketch: Given $s_A$, the expected value of each action $k \in A$ can be computed using a one-step backup of $V^*_c$. Then $\pi^*_e(s_A)$ is the action with maximum value, and $V^*_e(s_A)$ is its backed-up expected value.
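As a concrete reading of Lemma 2, the following sketch (again using the hypothetical SASMDP class above, and assuming V_c_star holds the optimal compressed value function, however obtained) recovers the optimal embedded action and value at a realized state $s_A$ with one one-step backup per available action.

```python
def embedded_optimum(mdp, V_c_star, s, A):
    """Given V*_c, compute pi*_e(s_A) and V*_e(s_A) via one-step backups.

    The optimal Q-value of an available action depends only on the base
    state, so we back up each k in A and take the max (cf. Lemma 2)."""
    def q(k):
        return mdp.r[s][k] + mdp.gamma * sum(
            mdp.p[s][k][s2] * V_c_star[s2] for s2 in range(mdp.n))
    best_action = max(A, key=q)
    return best_action, q(best_action)
```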
Therefore, it suffices to work directly with the compressed MDP, which allows one to use value functions (and Q-functions) over the original state space. The price is that one needs to use state policies, since the best action at $s$ depends on the available set $A_s$. In other words, while the embedded MDP causes an exponential blow-up in state space, the compressed MDP causes an exponential blow-up in action space. We now turn to assumptions that allow us to effectively manage this action space blow-up.

3.3 Decision List Policies

The embedded and compressed MDPs do not, prima facie, offer much computational or representational advantage, since they rely on an exponential increase in the size of the state space (embedded MDP) or decision space (compressed MDP). Fortunately, SAS-MDPs have optimal policies with a useful, concise form. We first focus on the policy representation itself, then describe the considerable computational leverage it provides.

A decision list (DL) policy $\mu$ is a type of policy for $M_e$ that can be expressed compactly using $O(nm \log m)$ space and executed efficiently. Let $\Sigma_B$ be the set of permutations over base action set $B$. A DL policy $\mu : S \to \Sigma_B$ associates a permutation $\mu(s) \in \Sigma_B$ with each state, and is executed at embedded state $s_A$ by executing $\mu(s)(i^*)$, where $i^* = \min\{ i \in \{1, \ldots, m\} : \mu(s)(i) \in A \}$. In other words, whenever base state $s$ is encountered and $A$ is the available set, the first action $k \in A$ in the order dictated by DL $\mu(s)$ is executed. Equivalently, we can view $\mu(s)$ as a state policy $\mu_s$ for $s$ in $M_c$. In our earlier example, one DL $\mu(s_2)$ is [Up, Down], which requires taking (base) action Up if it is available, otherwise taking Down. For any SAS-MDP, we have optimal DL policies:

Theorem 1  $M_e$ has an optimal policy that can be represented using a decision list. The same policy is optimal for the corresponding $M_c$.

Proof Sketch: Let $V^*$ be the (unique) optimal value function for $M_e$ and $Q^*$ its corresponding Q-function (see Sec. 5.1 for a definition). A simple inductive argument shows that no DL policy is optimal only if there is some state $s$, action sets $A \neq A'$, and (base) actions $j \neq k$, s.t. (i) $j, k \in A \cap A'$; (ii) for some optimal policy, $\pi^*(s_A) = j$ and $\pi^*(s_{A'}) = k$; and (iii) either $Q^*(s_A, j) > Q^*(s_A, k)$ or $Q^*(s_{A'}, k) > Q^*(s_{A'}, j)$. However, the fact that the optimal Q-value of any action $k \in A$ at state $s_A$ is independent of the other actions in $A$ (i.e., it depends only on the base state) implies that these conditions are mutually contradictory.

3.4 The Product Distribution Assumption

The DL form ensures that optimal policies and value functions for SAS-MDPs can be expressed polynomially in the size of the base MDP $M$. However, their computation still requires the computation of expectations over action subsets, e.g., in Bellman or policy backups (Eqs. 3, 4). This will generally be infeasible without some assumptions on the form of the action availability distributions $P_s$. One natural assumption is the product distribution assumption (PDA). PDA holds when $P_s(A)$ is a product distribution in which each action $k \in B$ is available with probability $\rho^k_s$, so that subset $A \subseteq B$ has probability $\rho^A_s = \prod_{k \in A} \rho^k_s \prod_{k \in B \setminus A} (1 - \rho^k_s)$. This assumption is a reasonable approximation in the settings discussed above, where state-independent exogenous processes determine the availability of actions (e.g., the probability that one advertiser's campaign has budget remaining is roughly independent of another's). For ease of notation, we assume that $\rho^k_s$ is identical for all states $s$ (allowing different availability probabilities across states has no impact on what follows).
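To make the DL representation and the PDA weights concrete, here is a brief sketch (hypothetical helper names; a permutation is given as a list of action indices, best first) of executing a decision list against a realized action set, and of the PDA probability that the $i$-th ranked action is the first one available. These weights are exactly the coefficients that appear in the backups discussed next.

```python
def execute_dl(perm, available):
    """Execute a decision list: take the first action in perm that is available."""
    for k in perm:
        if k in available:
            return k
    raise ValueError("empty action set (excluded by assumption)")

def first_available_probs(perm, rho):
    """Under PDA, the i-th ranked action is executed iff all better-ranked
    actions are unavailable and it is available:
        Pr[execute perm[i]] = prod_{j<i} (1 - rho[perm[j]]) * rho[perm[i]].
    Returns these weights in rank order (they need not sum to 1 unless
    some action is available with probability 1)."""
    weights, miss_all_better = [], 1.0
    for k in perm:
        weights.append(miss_all_better * rho[k])
        miss_all_better *= (1.0 - rho[k])
    return weights
```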
To ensure the MDP is well-founded, we assume some default action (e.g., no-op) is always available (we omit this default action from the analysis for ease of exposition). Our earlier running example trivially satisfies PDA: at $s_2$, Up's availability probability ($p$) is independent of the availability of Down (which is 1).

When the PDA holds, the DL form of policies allows the expectations in policy and Bellman backups to be computed efficiently without enumeration of subsets $A \subseteq B$. For example, given a fixed DL policy $\mu$, we have

$T^\mu_c V_c(s) = \sum_{i=1}^{m} \Big[ \prod_{j=1}^{i-1} \big(1 - \rho^{\mu(s)(j)}_s\big) \Big] \rho^{\mu(s)(i)}_s \Big( r^{\mu(s)(i)}_s + \gamma \sum_{s'} p^{\mu(s)(i)}_{s,s'} V_c(s') \Big)$.   (5)

The Bellman operator has a similar form. We exploit this below to develop tractable value iteration and policy iteration algorithms, as well as a practical LP formulation.

3.5 Arbitrary Distributions with Sampling (ADS)

We can also handle the case where, at each state, the availability distribution is unknown, but is sampleable. Using samples to approximate expectations w.r.t. available action subsets provides a means to estimate values and approximate optimal policies. Critically, the required sample size is polynomial in $|B|$, and not in the size of the support of the distribution (see below). Of course, this approach does not allow us to compute the optimal policy exactly. However, it has important implications for the sample complexity of learning algorithms like Q-learning.

We note that the ability to sample available action subsets is quite natural in many domains. For instance, in ad domains, it may not be possible to model the process by which eligible ads are generated (e.g., specific and evolving advertiser targeting criteria, budgets, frequency capping, etc.). But the eligible subset of ads considered for each impression opportunity is an action subset sampled from this process.

Under ADS, we compute approximate backup operators as follows. Let $\mathcal{A}_s = \{A^{(1)}_s, \ldots, A^{(T)}_s\}$ be an i.i.d. sample of size $T$ of action subsets in state $s$. For a subset of actions $A$, an index $i$ and a decision list $\mu$, define $I[i, A, \mu]$ to be 1 if $\mu(i) \in A$ and $\mu(j) \notin A$ for each $j < i$, and 0 otherwise. Similar to Eq. (5), we define:

$\hat{T}^\mu_c V_c(s) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{m} I\big[i, A^{(t)}_s, \mu(s)\big] \Big( r^{\mu(s)(i)}_s + \gamma \sum_{s'} p^{\mu(s)(i)}_{s,s'} V_c(s') \Big).$

We now consider the quality of the policies generated using this approximation (see the longer paper [Boutilier et al., 2018] for proofs and more details). The following lemma is a direct application of Hoeffding's concentration inequality along with a union bound (for any $0 < \delta < 1$ and error tolerance $\epsilon$):

Lemma 3  Given samples $A^{(1)}_s, \ldots, A^{(T)}_s$ for each $s \in S$, if $T = \Omega\big( (m + \log(n/\delta)) / \epsilon^2 \big)$, then with probability $1 - \delta$, for each DL policy $\mu$ and each $s \in S$ we have:

$\Big| \mathbb{E}_{A \subseteq B}\big[ Q^\mu(s, \mu_s(A)) \big] - \frac{1}{T} \sum_{t=1}^{T} Q^\mu(s, \mu_s(A^{(t)}_s)) \Big| \le \epsilon.$

The action-set samples induce an approximate SAS-MDP, defined using the empirical distributions above. Under the conditions stated, Lemma 3 leads to bounds on the quality of the optimal policy in the approximate SAS-MDP:

Theorem 2  Let $\hat\mu$ be the optimal policy for the approximate SAS-MDP and $Q^*$ the optimal Q-function for the true SAS-MDP. With probability $1 - \delta$ we have, for each $s \in S$, $k \in B$: $Q^{\hat\mu}(s, k) \ge Q^*(s, k) - \epsilon$.

In the sequel, we focus largely on PDA; in most cases equivalent results can be derived in the ADS model.
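A minimal sketch of the sampled (ADS) backup above, reusing the hypothetical execute_dl helper from the Sec. 3.4 sketch; samples is a list of realized action sets for state $s$, and q is the one-step backed-up value $r^k_s + \gamma \sum_{s'} p^k_{s,s'} V_c(s')$ for each action (assumed precomputed).

```python
def ads_policy_backup(perm, samples, q):
    """Monte Carlo estimate of the DL policy backup T^mu_c V_c(s):
    average, over sampled action sets, the backed-up value of the first
    action in the decision list `perm` that appears in each sample."""
    total = 0.0
    for A in samples:
        k = execute_dl(perm, A)   # first available action under the DL
        total += q[k]
    return total / len(samples)
```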
4 Q-Learning in the Compressed MDP

Suppose we are faced with learning the optimal value function or policy for an SAS-MDP from a collection of trajectories. The (implicit) learning of the transition dynamics and rewards can proceed as usual; the novel aspect of the SAS model is that the action availability distribution must also be considered. Remarkably, Q-learning can be readily augmented to incorporate stochastic action sets: we require only that our training trajectories are augmented with the set of actions that were available at each state,

$\ldots, s^{(t)}, A^{(t)}, k^{(t)}, r^{(t)}, s^{(t+1)}, A^{(t+1)}, k^{(t+1)}, r^{(t+1)}, \ldots$

where: $s^{(t)}$ is the realized state at time $t$ (drawn from distribution $P(\cdot \mid s^{(t-1)}, k^{(t-1)})$); $A^{(t)}$ is the realized available set at time $t$, drawn from $P_{s^{(t)}}$; $k^{(t)} \in A^{(t)}$ is the action taken; and $r^{(t)}$ is the realized reward. Such augmented trajectory data is typically available. In particular, the required sampling of available action sets is usually feasible (e.g., in ad serving as discussed above).

SAS-Q-learning can be applied directly to the compressed MDP $M_c$, requiring only a minor modification of the standard Q-learning update for the base MDP. We simply require that each Q-update maximize over the realized available actions $A^{(t+1)}$:

$Q_{\mathrm{new}}(s^{(t)}, k^{(t)}) \leftarrow (1 - \alpha_t) Q_{\mathrm{old}}(s^{(t)}, k^{(t)}) + \alpha_t \big[ r^{(t)} + \gamma \max_{k \in A^{(t+1)}} Q_{\mathrm{old}}(s^{(t+1)}, k) \big].$

Here $Q_{\mathrm{old}}$ is the previous Q-function estimate and $Q_{\mathrm{new}}$ is the updated estimate (this formulation encompasses both online and batch Q-learning, experience replay, etc.), and $0 \le \alpha_t < 1$ is our (adaptive) learning rate. It is straightforward to show that, under the usual exploration conditions, SAS-Q-learning will converge to the optimal Q-function for the compressed MDP, since the expected maximum over sampled action sets at any particular state will converge to the expected maximum at that state.

Theorem 3  The SAS-Q-learning algorithm will converge w.p. 1 to the optimal Q-function for the (discounted, infinite-horizon) compressed MDP $M_c$ if the usual stochastic approximation requirements are satisfied. That is, if (a) rewards are bounded and (b) the subsequence of learning rates $\alpha_{t(s,k)}$ applied to $(s, k)$ satisfies $\sum_t \alpha_{t(s,k)} = \infty$ and $\sum_t \alpha^2_{t(s,k)} < \infty$ for all state-action pairs $(s, k)$ (see, e.g., [Watkins and Dayan, 1992]).

Moreover, function approximation techniques, such as DQN [Mnih et al., 2015], can be directly applied with the same action-set-sample maximization. Implementing an optimal policy is also straightforward: given a state $s$ and the realization $A_s$ of the available actions, one simply executes $\arg\max_{k \in A_s} Q(s, k)$. We note that extracting the optimal value function $V_c(s)$ for the compressed MDP from the learned Q-function is not viable without some information about the action availability distribution. Fortunately, one need not know the expected value at a state to implement the optimal policy. (It is, of course, straightforward to learn an optimal value function if desired.)
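A compact sketch of the SAS-Q-learning update on tabular Q-values; the single change from standard Q-learning is the max over the realized next action set. The transition-tuple layout and learning-rate handling are illustrative assumptions.

```python
def sas_q_update(Q, transition, gamma, alpha):
    """One SAS-Q-learning step on a tabular Q (dict keyed by (s, k)).

    `transition` = (s, A, k, r, s_next, A_next), i.e., a standard tuple
    augmented with the realized available sets. The only difference from
    vanilla Q-learning is that the bootstrap max ranges over A_next.
    """
    s, A, k, r, s_next, A_next = transition
    target = r + gamma * max(Q[(s_next, k2)] for k2 in A_next)
    Q[(s, k)] = (1 - alpha) * Q[(s, k)] + alpha * target
    return Q
```

With exploration restricted to the realized set (e.g., epsilon-greedy over A), the usual conditions of Theorem 3 apply unchanged.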
5 Value Iteration in the Compressed MDP

Computing a value function for $M_c$, with its small state space $S$, suffices to execute an optimal policy. We develop an efficient value iteration (VI) method to do this.

5.1 Value Iteration

Solving an SAS-MDP using VI is challenging in general due to the required expectations over action sets. However, under PDA, we can derive an efficient VI algorithm whose complexity depends only polynomially on the base set size $|B|$. Assume a current iterate $V^t$, where $V^t(s) = \mathbb{E}_{A_s}\big[\max_{k \in A_s} Q^t(s, k)\big]$. We compute $V^{t+1}$ as follows:

1. For each $s \in S$, $k \in B$, compute its $(t+1)$-stage-to-go Q-value: $Q^{t+1}(s, k) = r^k_s + \gamma \sum_{s'} p^k_{s,s'} V^t(s')$.
2. Sort these Q-values in descending order. For convenience, we re-index each action by its Q-value rank (i.e., $k_{(1)}$ is the action with largest Q-value, and $\rho_{(1)}$ is its availability probability, $k_{(2)}$ the second-largest, etc.).
3. For each $s \in S$, compute its $(t+1)$-stage-to-go value:

$V^{t+1}(s) = \mathbb{E}_{A_s}\big[ \max_{k \in A_s} Q^{t+1}(s, k) \big] = \sum_{i=1}^{m} \Big[ \prod_{j=1}^{i-1} \big(1 - \rho_{(j)}\big) \Big] \rho_{(i)} \, Q^{t+1}(s, k_{(i)}).$

Under ADS, we use the approximate Bellman operator

$\hat{V}^{t+1}(s) = \hat{\mathbb{E}}_{A_s}\big[ \max_{k \in A_s} \hat{Q}^{t+1}(s, k) \big] = \frac{1}{T} \sum_{t'=1}^{T} \sum_{i=1}^{m} I\big[i, A^{(t')}_s, \mu(s)\big] \, \hat{Q}^{t+1}(s, \mu(s)(i)),$

where $\mu(s)$ is the DL resulting from sorting the $\hat{Q}^{t+1}$-values. The Bellman operator under PDA is tractable:

Observation 1  The compressed Bellman operator $T^*_c$ can be computed in $O(nm \log m)$ time.

Therefore the per-iteration time complexity of VI for $M_c$ compares favorably to the $O(nm)$ time of VI in the base MDP. The added complexity arises from the need to sort Q-values (the products of the action availability probabilities can be computed in linear time via caching). Conveniently, this sorting process immediately provides the desired DL state policy for $s$. Using standard arguments, we obtain the following results, which immediately yield a polytime approximation method.

Lemma 4  $T^*_c$ is a contraction with modulus $\gamma$, i.e., $\|T^*_c v_c - T^*_c v'_c\| \le \gamma \|v_c - v'_c\|$.

Corollary 1  For any precision $\varepsilon < 1$, the compressed value iteration algorithm converges to an $\varepsilon$-approximation of the optimal value function in $O(\log(L/\varepsilon))$ iterations, where $L = [\max_{s,k} r^k_s]/(1 - \gamma)$ is an upper bound on $\|V^*_e\|$.
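Here is a minimal sketch of one compressed VI sweep under PDA, reusing the hypothetical SASMDP class and first_available_probs helper from the earlier sketches; rho[s][k] is the assumed availability probability of action k at state s.

```python
def sas_value_iteration_sweep(mdp, V, rho):
    """One sweep of compressed VI under PDA (sketch of the steps above).

    For each state: back up Q-values, sort actions by Q-value (this yields
    the DL state policy), and weight each Q-value by the probability that
    its action is the first available one."""
    V_new = [0.0] * mdp.n
    policy = []
    for s in range(mdp.n):
        q = [mdp.r[s][k] + mdp.gamma *
             sum(mdp.p[s][k][s2] * V[s2] for s2 in range(mdp.n))
             for k in range(mdp.m)]
        perm = sorted(range(mdp.m), key=lambda k: -q[k])   # decision list for s
        weights = first_available_probs(perm, rho[s])
        V_new[s] = sum(w * q[k] for w, k in zip(weights, perm))
        policy.append(perm)
    return V_new, policy
```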
We provide an even stronger result next: VI, in fact, converges to an optimal solution in polynomial time.

5.2 The Complexity of Value Iteration

Given its polytime per-iteration complexity, to ensure VI is polytime, we must show that it converges to a value function that induces an optimal policy in polynomially many iterations. To do so, we exploit the compressed representation and adapt the technique of [Tseng, 1990]. Assume, w.r.t. the base MDP $M$, that the discount factor $\gamma$, rewards $r^k_s$, and transition probabilities $p^k_{s,s'}$ are rational numbers represented with a precision of $1/\delta$ ($\delta$ is an integer). Tseng shows that VI for a standard MDP is strongly polynomial, assuming constant $\gamma$ and $\delta$, by proving that: (a) if the $t$-th value function produced by VI satisfies $\|V^t - V^*\| < 1/(2\delta^{2n+2} n^n)$, then the policy induced by $V^t$ is optimal; and (b) VI achieves this bound in polynomially many iterations.

We derive a similar bound on the number of VI iterations needed for convergence in an SAS-MDP, using the same input parameters as in the base MDP, and applying the same precision $\delta$ to the action availability probabilities. We apply Tseng's result by exploiting the fact that: (a) the optimal policy for the embedded MDP $M_e$ can be represented as a DL; (b) the transition function for any DL policy can be expressed using an $n \times n$ matrix (we simply take expectations, see above); and (c) the corresponding linear system can be expressed over the compressed rather than the embedded state space to determine $V^*_c$ (rather than $V^*_e$).

Tseng's argument requires some adaptation to apply to the compressed VI algorithm. We extend his precision assumption to account for our action availability probabilities as well, ensuring $\rho^k_s$ is also represented up to precision $1/\delta$. Since $M_c$ is an MDP, Tseng's result applies; but notice that each entry of the transition matrix for any DL $\mu$ at state $s$, which serves as an action in $M_c$, is a product of $m+1$ probabilities, each with precision $1/\delta$. We have that $p^\mu_{s,s'}$ has precision $1/\delta^{m+1}$. Thus the required precision parameter for our MDP is at most $\delta^{m+1}$. Plugging this into Tseng's bound, VI applied to $M_c$ must induce an optimal policy at the $t$-th iteration if

$\|V^t - V^*_c\| < 1/\big(2(\delta^{m+1})^{2n} n^n\big) = 1/\big(2\delta^{2n(m+1)} n^n\big).$

This in turn gives us a bound on the number of iterations of VI needed to reach an optimal policy:

Theorem 4  VI applied to $M_c$ converges to a value function whose greedy policy is optimal in $t$ iterations, where $t \le \log\big(2\delta^{2n(m+1)} n^n M\big) / \log(1/\gamma)$.

Combined with Obs. 1, we have:

Corollary 2  VI yields an optimal policy for the SAS-MDP corresponding to $M_c$ in polynomial time.

Under ADS, VI merely approximates the optimal policy. In fact, one cannot compute an exact optimal policy without observing the entire support of the availability distributions (requiring exponential sample size).

6 Policy Iteration in the Compressed MDP

We now outline a policy iteration (PI) algorithm.

6.1 Policy Iteration

The concise DL form of optimal policies can be exploited in PI as well. Indeed, the greedy policy $\pi_V$ with respect to any value function $V$ in the compressed space is representable as a DL. Thus the policy improvement step of PI can be executed using the same independent evaluation of action Q-values and sorting as used in VI above:

$Q_V(s, k) = r(s, k) + \gamma \sum_{s'} p^k_{s,s'} V(s'), \quad Q_V(s, A_s) = \max_{k \in A_s} Q_V(s, k), \quad \text{and} \quad \pi_V(s, A_s) = \arg\max_{k \in A_s} Q_V(s, k).$

The DL policy form can also be exploited in the policy evaluation phase of PI. The tractability of policy evaluation requires a tractable representation of the action availability probabilities, which PDA provides, leading to the following PI method that exploits PDA:

1. Initialize an arbitrary policy $\pi$ in decision list form.
2. Evaluate $\pi$ by solving the following linear system over variables $V^\pi(s)$, $s \in S$, where actions are re-indexed at each state by their position in the DL $\pi(s)$ and $Q^\pi(s, k)$ denotes the relevant linear expression $r^k_s + \gamma \sum_{s'} p^k_{s,s'} V^\pi(s')$ over $V^\pi$:

$V^\pi(s) = \sum_{i=1}^{m} \Big[ \prod_{j=1}^{i-1} \big(1 - \rho_{(j)}\big) \Big] \rho_{(i)} \, Q^\pi(s, k_{(i)}), \quad s \in S.$

3. Let $\pi'$ denote the greedy policy w.r.t. $V^\pi$, which can be expressed in DL form for each $s$ by sorting the Q-values $Q^\pi(s, k)$ as above (with standard tie-breaking rules). If $\pi'(s) = \pi(s)$ for all $s$, terminate; otherwise replace $\pi$ with $\pi'$ and repeat (Steps 2 and 3).

Under ADS, PI can use the approximate Bellman operator, giving an approximately optimal policy.

6.2 The Complexity of Policy Iteration

The per-iteration complexity of PI in $M_c$ is polynomial: as in standard PI, policy evaluation solves an $n \times n$ linear system (naively, $O(n^3)$), plus the additional overhead (linear in $M$) to compute the compounded availability probabilities; and policy improvement requires $O(mn^2)$ computation of action Q-values, plus $O(nm \log m)$ overhead for sorting Q-values (to produce improving DLs for all states). An optimal policy is reached in a number of iterations no greater than that required by VI, since: (a) the sequence of value functions for the policies generated by PI contracts at least as quickly as the value functions generated by VI (see, e.g., [Meister and Holzbaur, 1986; Hansen et al., 2013]); (b) our precision argument for VI ensures that the greedy policy extracted at that point will be optimal; and (c) once PI finds an optimal policy, it will terminate (with one extra iteration). Hence, PI is polytime (assuming a fixed discount $\gamma < 1$).

Theorem 5  PI yields an optimal policy for the SAS-MDP corresponding to $M_c$ in polynomial time.
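A minimal sketch of the PI scheme of Sec. 6.1 under PDA, reusing the earlier hypothetical helpers and numpy for the linear solve: a DL policy is evaluated exactly by solving the $n \times n$ linear system induced by its expected rewards and transitions, and improved greedily by sorting Q-values.

```python
import numpy as np

def evaluate_dl_policy(mdp, policy, rho):
    """Exact evaluation of a DL policy under PDA: build the expected reward
    vector r^mu and transition matrix P^mu, then solve (I - gamma P^mu) V = r^mu."""
    P_mu = np.zeros((mdp.n, mdp.n))
    r_mu = np.zeros(mdp.n)
    for s in range(mdp.n):
        weights = first_available_probs(policy[s], rho[s])
        for w, k in zip(weights, policy[s]):
            r_mu[s] += w * mdp.r[s][k]
            P_mu[s] += w * np.asarray(mdp.p[s][k])
    return np.linalg.solve(np.eye(mdp.n) - mdp.gamma * P_mu, r_mu)

def sas_policy_iteration(mdp, rho):
    """PI sketch: evaluate the current DL policy, then improve by sorting Q-values."""
    policy = [list(range(mdp.m)) for _ in range(mdp.n)]   # arbitrary initial DLs
    while True:
        V = evaluate_dl_policy(mdp, policy, rho)
        new_policy = []
        for s in range(mdp.n):
            q = [mdp.r[s][k] + mdp.gamma * float(np.dot(mdp.p[s][k], V))
                 for k in range(mdp.m)]
            new_policy.append(sorted(range(mdp.m), key=lambda k: -q[k]))
        if new_policy == policy:
            return policy, V
        policy = new_policy
```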
In the longer version of the paper [Boutilier et al., 2018], we adapt more direct proof techniques [Ye, 2011; Hansen et al., 2013] to derive polynomial-time convergence of PI for SAS-MDPs under additional assumptions. Concretely, for a policy $\mu$ and actions $k_1, k_2$, let $\eta_\mu(s, k_1, k_2)$ be the probability, over action sets, that at state $s$ the optimal policy selects $k_1$ and $\mu$ selects $k_2$. Let $q > 0$ be such that $\eta_\mu(s, k_1, k_2) \ge q$ whenever $\eta_\mu(s, k_1, k_2) > 0$. We show:

Theorem 6  The number of iterations it takes policy iteration to converge is bounded by a quantity polynomial in $m$, $1/q$, and $1/(1-\gamma)$ (the precise bound is given in the longer version of the paper).

Under PDA, the theorem implies strongly-polynomial convergence of PI if each action is available with constant probability. In this case, for any $\mu$, $k_i$, $k_j$, and $s$, we have $\eta_\mu(s, k_i, k_j) \ge \rho^{k_i}_s \rho^{k_j}_s = \Omega(1)$, which in turn implies that we can take $q = \Omega(1)$ in the bound above.

7 Linear Programming in the Compressed MDP

An alternative model-based approach is linear programming (LP). The primal formulation for the embedded MDP $M_e$ is straightforward (since it is a standard MDP), but requires exponentially many variables (one per embedded state) and constraints (one per pair of embedded state and base action). A (non-linear) primal formulation for the compressed MDP $M_c$ reduces the number of variables to $|S|$:

$\min_{v} \sum_{s \in S} \alpha_s v_s \quad \text{s.t.} \quad v_s \ge \mathbb{E}_{A_s}\Big[ \max_{k \in A_s} Q(s, k) \Big] \quad \forall s.$   (6)

Here $\alpha$ is an arbitrary, positive state-weighting (over the embedded states corresponding to each base state), and $Q(s, k) = r^k_s + \gamma \sum_{s' \in S} p^k_{s,s'} v_{s'}$ abbreviates the linear expression of the action-value backup at the state and action in question w.r.t. the value variables $v_s$. This program is valid given the definition of $M_c$ and the fact that a weighting over embedded states corresponds to a weighting over base states by taking expectations. Unfortunately, this formulation is non-linear, due to the max term in each constraint. And while it has only $|S|$ variables, it has factorially many constraints; moreover, the constraints themselves are not compact due to the presence of the expectation in each constraint.

PDA can be used to render this formulation tractable. Let $\sigma$ denote an arbitrary (inverse) permutation of the action set (so $\sigma(i) = j$ means that action $j$ is ranked in position $i$). As above, the optimal policy at base state $s$ w.r.t. a Q-function is expressible as a DL (with actions sorted by Q-values), and its expected value is given by the expression derived below. Specifically, if $\sigma$ reflects the relative ranking of the (optimal) Q-values of the actions at some fixed state $s$, then $V(s) = Q(s, \sigma(1))$ with probability $\rho_{\sigma(1)}$, i.e., the probability that $\sigma(1)$ occurs in $A_s$. Similarly, $V(s) = Q(s, \sigma(2))$ with probability $(1 - \rho_{\sigma(1)}) \rho_{\sigma(2)}$, and so on. We define the Q-value of a DL $\sigma$ as follows:

$Q^V_s(\sigma) = \sum_{i=1}^{m} \Big[ \prod_{j=1}^{i-1} \big(1 - \rho_{\sigma(j)}\big) \Big] \rho_{\sigma(i)} \, Q(s, \sigma(i)).$   (7)

Thus, for any fixed action permutation $\sigma$, the constraint that $v_s$ at least matches the expectation of the maximum action's Q-value is linear. Hence, the program can be recast as an LP by enumerating action permutations for each base state, replacing the constraints in Eq. (6) as follows:

$v_s \ge Q^V_s(\sigma) \quad \forall s \in S, \ \sigma \in \Sigma_B.$   (8)

The constraints in this LP are now each compactly represented, but it still has factorially many constraints. Despite this, it can be solved in polynomial time. First, we observe that the LP is well-suited to constraint generation. Given a relaxed LP with a subset of constraints, a greedy algorithm that simply sorts actions by Q-value to form a permutation can be used to find the maximally violated constraint at any state. Thus we have a practical constraint generation algorithm for this LP, since (maximally) violated constraints can be found in polynomial time. More importantly from a theoretical standpoint, the constraint generation algorithm can be used as a separation oracle within an ellipsoid method for this LP. This directly yields an exact, (weakly) polynomial time algorithm for this LP [Grötschel et al., 1988].
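A sketch of the constraint-generation idea using scipy's LP solver (assumed available; all function names, the initial constraint seeding, and the tolerance are illustrative): each round solves the relaxed LP, then for each state sorts actions by backed-up Q-value under the current solution to form the permutation whose constraint (8) is most violated, adding it if violated.

```python
import numpy as np
from scipy.optimize import linprog

def solve_sas_lp(mdp, rho, alpha, max_rounds=100, tol=1e-8):
    """Constraint generation for the compressed-MDP LP (Eqs. 6-8), a sketch.

    Variables: v_s for each base state. Each kept constraint corresponds to a
    permutation sigma at a state s:  v_s >= sum_i w_i(sigma) * Q(s, sigma(i)),
    where the weights w_i are the PDA first-available probabilities and
    Q(s, k) = r^k_s + gamma * sum_s' p^k_{s,s'} v_{s'} is linear in v."""
    n, m, gamma = mdp.n, mdp.m, mdp.gamma
    A_ub, b_ub = [], []          # rows of A_ub @ v <= b_ub

    def add_constraint(s, perm):
        w = first_available_probs(perm, rho[s])
        row = np.zeros(n)
        row[s] -= 1.0                                    # -v_s
        rhs = 0.0
        for wi, k in zip(w, perm):
            row += gamma * wi * np.asarray(mdp.p[s][k])  # + gamma * E[p] v
            rhs -= wi * mdp.r[s][k]
        A_ub.append(row); b_ub.append(rhs)

    # seed with one arbitrary DL constraint per state so the LP is bounded
    for s in range(n):
        add_constraint(s, list(range(m)))

    v = None
    for _ in range(max_rounds):
        res = linprog(c=alpha, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      bounds=[(None, None)] * n, method="highs")
        v = res.x
        violated = False
        for s in range(n):
            q = [mdp.r[s][k] + gamma * float(np.dot(mdp.p[s][k], v))
                 for k in range(m)]
            perm = sorted(range(m), key=lambda k: -q[k])  # maximally violated sigma
            w = first_available_probs(perm, rho[s])
            if sum(wi * q[k] for wi, k in zip(w, perm)) > v[s] + tol:
                add_constraint(s, perm)
                violated = True
        if not violated:
            break
    return v
```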
8 Empirical Illustration

We now provide a somewhat more substantial empirical demonstration of the effects of stochastic action availability. Consider an MDP that corresponds to a routing problem on a real-world road network in the San Francisco Bay Area (Fig. 1). The shortest path between the source and destination locations is sought. The dashed edge in Fig. 1 represents a bridge, available only with probability $p$, while all other edges correspond to action choices available with probability 0.5. At each node, a no-op action (waiting) is available at constant cost; otherwise the edge costs are the geodesic lengths of the corresponding roads on the map. The optimal policies for different choices of $p$ (0.1, 0.2, and 0.4) are depicted in Fig. 1, where line thickness and color indicate traversal probabilities under the corresponding optimal policies. We see that lower values of $p$ lead to policies with more redundancy (i.e., more alternate routes).

[Figure 1: Stochastic action MDPs applied to routing.]

Fig. 2 shows the effect of solving the routing problem when ignoring stochastic action availability (i.e., assuming actions are always available). The SAS-optimal policy allows graceful scaling of the expected travel time from source to destination as bridge availability decreases. The effects of violating the PDA assumption are also investigated in the longer version of this paper [Boutilier et al., 2018].

[Figure 2: Expected trip time from source to destination under the SAS-optimal policy vs. under the oblivious optimal policy (the MDP solved as if actions are fully available) as a function of bridge availability.]

9 Concluding Remarks

We have developed a new MDP model, SAS-MDPs, that extends the usual finite-action MDP model by allowing the set of available actions to vary stochastically. This captures an important use case that arises in many practical applications (e.g., online advertising, recommender systems). We have shown that embedding action sets in the state gives a standard MDP, supporting tractable analysis at the cost of an exponential blow-up in state space size. Despite this, we demonstrated that (optimal and greedy) policies have a useful decision list structure. We showed how this DL format can be exploited to construct tractable Q-learning, value and policy iteration, and linear programming algorithms.

While our work offers firm foundations for stochastic action sets, most practical applications will not use the algorithms described here explicitly. For example, in RL, we generally use function approximators for generalization and scalability in large state/action problems. We have successfully applied Q-learning using DNN function approximators (i.e., DQN) with sampled/logged available actions in ads and recommendations domains as described in Sec. 4. This has allowed us to apply SAS-Q-learning to problems of significant, commercially viable scale.
Model-based methods such as VI, PI, and LP also require suitable (e.g., factored) representations of MDPs and structured implementations of our algorithms that exploit these representations. For instance, extensions of approximate linear programming or structured dynamic programming to incorporate stochastic action sets would be extremely valuable. Other important questions include developing a polynomial-sized direct LP formulation; deriving sample-complexity results for RL algorithms like Q-learning is also of particular interest, especially as it pertains to the sampling of the action distribution. Finally, we are quite interested in relaxing the strong assumptions embodied in the PDA model; of particular interest is the extension of our algorithms to less extreme forms of action availability independence, for example, as represented using concise graphical models (e.g., Bayes nets).

Acknowledgments

Thanks to the reviewers for their helpful suggestions.

References

[Amin et al., 2012] K. Amin, M. Kearns, P. Key, and A. Schwaighofer. Budget optimization for sponsored search: Censored learning in MDPs. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI-12), pp. 543–553, Catalina, CA, 2012.

[Archak et al., 2010] N. Archak, V. S. Mirrokni, and S. Muthukrishnan. Mining advertiser-specific user behavior using adfactors. In Proceedings of the 19th Intl. World Wide Web Conference (WWW 2010), pp. 31–40, Raleigh, NC, 2010.

[Archak et al., 2012] N. Archak, V. Mirrokni, and S. Muthukrishnan. Budget optimization for online campaigns with positive carryover effects. In Proceedings of the 8th Intl. Workshop on Internet and Network Economics (WINE-12), pp. 86–99, Liverpool, 2012.

[Boutilier et al., 2018] C. Boutilier, A. Cohen, A. Hassidim, Y. Mansour, O. Meshi, M. Mladenov, and D. Schuurmans. Planning and learning in Markov decision processes with stochastic action sets. Technical Report arXiv:1805.02363 [cs.AI], arXiv, May 2018.

[Charikar et al., 1999] M. Charikar, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. On targeting Markov segments. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC-99), pp. 99–108, Atlanta, 1999.

[Grötschel et al., 1988] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization, Vol. 2 of Algorithms and Combinatorics. Springer, 1988.

[Hansen et al., 2013] T. D. Hansen, P. B. Miltersen, and U. Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1), 2013. Article 1, 16pp.

[Kanade et al., 2009] V. Kanade, H. B. McMahan, and B. Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. In Proceedings of the 12th Intl. Conference on Artificial Intelligence and Statistics (AISTATS-09), pp. 272–279, Clearwater Beach, FL, 2009.

[Kleinberg et al., 2010] R. Kleinberg, A. Niculescu-Mizil, and Y. Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2–3):245–272, 2010.

[Li et al., 2009] T. Li, N. Liu, J. Yan, G. Wang, F. Bai, and Z. Chen. A Markov chain model for integrating behavioral targeting into contextual advertising. In Proceedings of the 3rd Intl. Workshop on Data Mining and Audience Intelligence for Advertising, pp. 1–9, Paris, 2009.
[Meister and Holzbaur, 1986] U. Meister and U. Holzbaur. A polynomial time bound for Howard's policy improvement algorithm. OR Spektrum, 8:37–40, 1986.

[Mladenov et al., 2017] M. Mladenov, C. Boutilier, D. Schuurmans, O. Meshi, G. Elidan, and T. Lu. Logistic Markov decision processes. In Proceedings of the 26th Intl. Joint Conference on Artificial Intelligence (IJCAI-17), pp. 2486–2493, Melbourne, 2017.

[Mnih et al., 2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[Mnih et al., 2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[Nikolova and Karger, 2008] E. Nikolova and D. R. Karger. Route planning under uncertainty: The Canadian traveller problem. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI-08), pp. 969–974, 2008.

[Papadimitriou and Yannakakis, 1991] C. H. Papadimitriou and M. Yannakakis. Shortest paths without a map. Theoretical Computer Science, 84(1):127–150, 1991.

[Polychronopoulos and Tsitsiklis, 1996] G. H. Polychronopoulos and J. N. Tsitsiklis. Stochastic shortest path problems with recourse. Networks, 27(2):133–143, 1996.

[Puterman, 1994] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.

[Silver et al., 2013] D. Silver, L. Newnham, D. Barker, S. Weller, and J. McFall. Concurrent reinforcement learning from customer interactions. In Proceedings of the 30th Intl. Conference on Machine Learning (ICML-13), pp. 924–932, Atlanta, 2013.

[Theocharous et al., 2015] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In Proceedings of the 24th Intl. Joint Conference on Artificial Intelligence (IJCAI-15), pp. 1806–1812, Buenos Aires, 2015.

[Tseng, 1990] P. Tseng. Solving H-horizon, stationary Markov decision problems in time proportional to log(H). Operations Research Letters, 9(5):287–297, 1990.

[Watkins and Dayan, 1992] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

[Ye, 2011] Y. Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011.