# Equi-Reward Utility Maximizing Design in Stochastic Environments

Sarah Keren, Luis Pineda, Avigdor Gal, Erez Karpas, Shlomo Zilberstein
Technion - Israel Institute of Technology; College of Information and Computer Sciences, University of Massachusetts Amherst
sarahn@technion.ac.il, lpineda@cs.umass.edu (last three authors: avigal@ie.technion.ac.il, karpase@technion.ac.il, shlomo@cs.umass.edu)

We present the Equi-Reward Utility Maximizing Design (ER-UMD) problem for redesigning stochastic environments to maximize agent performance. ER-UMD fits well with contemporary applications that require offline design of environments in which robots and humans act and cooperate. To find an optimal modification sequence we present two novel solution techniques: a compilation that embeds design into a planning problem, allowing the use of off-the-shelf solvers to find a solution, and a heuristic search in the space of modifications, for which we present an admissible heuristic. Evaluation shows the feasibility of the approach using standard benchmarks from the probabilistic planning competition and a benchmark we created for a vacuum cleaning robot setting.

1 Introduction

We are surrounded by physical and virtual environments with a controllable design. Hospitals are designed to minimize the daily distance covered by staff, computer networks are structured to maximize message throughput, human-robot assembly lines are designed to maximize productivity, etc. Common to all these environments is that they are designed with the intention of maximizing some user benefit while accounting for different forms of uncertainty. Typically, design is performed manually, often leading to far from optimal solutions. We therefore suggest automating the design process and formulate the Equi-Reward Utility Maximizing Design (ER-UMD) problem, in which a system controls the environment by applying a sequence of modifications in order to maximize agent utility.

We assume a fully observable stochastic setting and use Markov decision processes [Bellman, 1957] to model the agent environment. We exploit the alignment of system and agent utility to show a compilation of the design problem into a planning problem and piggyback on the search for an optimal policy to find an optimal sequence of modifications. In addition, we exploit the structure of the offline design process and offer a heuristic search in the space of modifications that yields optimal design strategies. We formulate the conditions for heuristic admissibility and propose an admissible heuristic based on environment simplification. Finally, for settings where practicality is prioritized over optimality, we present a way to efficiently acquire sub-optimal solutions.

The contributions of this work are threefold. First, we formulate the ER-UMD problem as a special case of environment design [Zhang et al., 2009]. ER-UMD supports arbitrary modification methods; in particular, for stochastic settings, we propose modifying probability distributions, an approach that offers a wide range of subtle environment modifications. Second, we present two new approaches for solving ER-UMD problems, specify the conditions for acquiring an optimal solution, and present an admissible heuristic to support the solution.
Finally, we evaluate our approaches given a design budget, using probabilistic benchmarks from the International Planning Competitions, which introduce a variety of stochastic shortest path MDPs [Bertsekas, 1995], and on a domain we created for a vacuum cleaning robot. We show how redesign substantially improves expected utility, expressed via reduced cost, with a small modification budget. Moreover, the techniques we develop outperform the exhaustive approach, reducing calculation effort by up to 30%.

The remainder of the paper is organized as follows. Section 2 describes the ER-UMD framework. In Section 3, we describe our novel techniques for solving the ER-UMD problem. Section 4 describes an empirical evaluation, followed by related work (Section 5) and concluding remarks (Section 6).

2 Equi-Reward Utility Maximizing Design

The equi-reward utility maximizing design (ER-UMD) problem takes as input an environment with stochastic action outcomes, a set of allowed modifications, and a set of constraints, and finds an optimal sequence of modifications (atomic changes such as additions and deletions of environment elements) to apply to the environment for maximizing agent expected utility under the constraints. We refer to sequences rather than sets to support settings where different application orders impact the model differently. Such a setting may involve, for example, modifications that add preconditions necessary for the application of other modifications (e.g., a docking station can only be added after adding a power outlet).

Figure 1: An example of an ER-UMD problem

We consider stochastic environments defined by the quadruple $\epsilon = \langle S_\epsilon, A_\epsilon, f_\epsilon, s_{0,\epsilon} \rangle$ with a set of states $S_\epsilon$, a set of actions $A_\epsilon$, a stochastic transition function $f_\epsilon : S_\epsilon \times A_\epsilon \times S_\epsilon \to [0, 1]$ specifying the probability $f(s, a, s')$ of reaching state $s'$ after applying action $a$ in $s \in S_\epsilon$, and an initial state $s_{0,\epsilon} \in S_\epsilon$. We let $E$, $S_E$ and $A_E$ denote the set of all environments, states and actions, respectively. Adopting the notation of Zhang and Parkes (2008) for environment design, we define the ER-UMD model as follows.

Definition 1 An equi-reward utility maximizing design (ER-UMD) model ω is a tuple $\langle \epsilon^0_\omega, R_\omega, \gamma_\omega, \Delta_\omega, F_\omega, \Phi_\omega \rangle$ where:
- $\epsilon^0_\omega \in E$ is an initial environment.
- $R_\omega : S_E \times A_E \times S_E \to \mathbb{R}$ is a Markovian and stationary reward function, specifying the reward $r(s, a, s')$ an agent gains from transitioning from state $s$ to $s'$ by the execution of $a$.
- $\gamma_\omega$ is a discount factor in $(0, 1]$, representing the depreciation of agent rewards over time.
- $\Delta_\omega$ is a finite set of atomic modifications a system can apply. A modification sequence is an ordered set of modifications $\vec{\Delta} = \langle \Delta_1, \ldots, \Delta_n \rangle$ s.t. $\Delta_i \in \Delta_\omega$. We denote by $\vec{\Delta}_\omega$ the set of all such sequences.
- $F_\omega : \Delta_\omega \times E \to E$ is a deterministic modification transition function, specifying the result of applying a modification to an environment.
- $\Phi_\omega : \vec{\Delta}_\omega \times E \to \{0, 1\}$ is an indicator function that specifies the allowed modification sequences in an environment.

Whenever ω is clear from the context we use $\epsilon^0$, $R$, $\gamma$, $\Delta$, $F$, and $\Phi$. Note that a reward becomes a cost when negative.

The reward function $R$ and discount factor $\gamma$ form, together with an environment $\epsilon \in E$, an infinite horizon discounted reward Markov decision process (MDP) [Bertsekas, 1995] $\langle S, A, f, s_0, R, \gamma \rangle$.
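For concreteness, the following is a minimal Python sketch (ours, not part of the formal model) of how an environment and an ER-UMD model could be represented; all class, field, and function names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
ModSeq = Tuple[str, ...]  # a modification sequence, e.g. ("add-outlet", "add-dock")

@dataclass
class Environment:
    """A stochastic environment <S, A, f, s0>."""
    states: List[State]
    actions: List[Action]
    # transitions[(s, a)] maps each successor s' to its probability f(s, a, s')
    transitions: Dict[Tuple[State, Action], Dict[State, float]]
    initial_state: State

@dataclass
class ERUMDModel:
    """An ER-UMD model <eps0, R, gamma, Delta, F, Phi> in the spirit of Definition 1."""
    initial_env: Environment
    reward: Callable[[State, Action, State], float]       # R(s, a, s')
    gamma: float                                           # discount factor in (0, 1]
    modifications: List[str]                               # the atomic modification set Delta
    apply_mod: Callable[[str, Environment], Environment]   # F: one modification -> new environment
    allowed: Callable[[ModSeq, Environment], bool]         # Phi: is this sequence allowed?

    def apply_sequence(self, seq: ModSeq) -> Environment:
        """The environment obtained by applying the modifications of `seq` in order."""
        env = self.initial_env
        for mod in seq:
            env = self.apply_mod(mod, env)
        return env
```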
The solution of an MDP is a control policy $\pi : S \to A$ describing the appropriate action to perform at each state. We let $\Pi_\epsilon$ represent the set of all possible policies in ϵ. Optimal policies $\Pi^*_\epsilon \subseteq \Pi_\epsilon$ yield maximal expected accumulated reward for every state $s \in S$ [Bellman, 1957]. We assume agents are optimal and let $V^*(\omega)$ represent the discounted expected agent reward of following an optimal policy from the initial state $s_0$ in a model ω.

Modifications can be defined arbitrarily, supporting all the changes applicable to a deterministic environment [Herzig et al., 2014]. For example, we can allow adding a transition between previously disconnected states. Particular to a stochastic environment is the option of modifying the transition function by increasing and decreasing the probability of specific outcomes. Each modification may be associated with a system cost $C : \Delta \to \mathbb{R}^+$ and each sequence with a cost $C(\vec{\Delta}) = \sum_i C(\Delta_i)$. Given a sequence $\vec{\Delta}$ such that $\Phi(\vec{\Delta}, \epsilon) = 1$ (i.e., $\vec{\Delta}$ can be applied to $\epsilon \in E$), we let $\epsilon_{\vec{\Delta}}$ represent the environment that results from applying $\vec{\Delta}$ to ϵ, and $\omega_{\vec{\Delta}}$ the same model with $\epsilon_{\vec{\Delta}}$ as its initial environment.

The solution to an ER-UMD problem is a modification sequence to apply to $\epsilon^0_\omega$ that maximizes agent utility $V^*(\omega_{\vec{\Delta}})$ under the constraints, formulated as follows.

Problem 1 Given a model $\omega = \langle \epsilon^0, R, \gamma, \Delta, F, \Phi \rangle$, the ER-UMD problem finds a modification sequence

$\vec{\Delta}^* \in \operatorname{argmax}_{\vec{\Delta} \,|\, \Phi(\vec{\Delta})=1} V^*(\omega_{\vec{\Delta}})$

We let $\vec{\Delta}^*_\omega$ represent the set of solutions to Problem 1 and $V^*_{max}(\omega) = \max_{\vec{\Delta} \,|\, \Phi(\vec{\Delta})=1} V^*(\omega_{\vec{\Delta}})$ represent the maximal agent utility achievable via design in ω. In particular, we seek solutions $\vec{\Delta}^* \in \vec{\Delta}^*_\omega$ that minimize the design cost $C(\vec{\Delta}^*)$.

Example 1 As an example of a controllable environment where humans and robots co-exist, consider Figure 1 (left), where a vacuum cleaning robot is placed in a living room. The set $E$ of possible environments specifies possible room configurations. The robot's utility, expressed via the reward $R$ and discount factor $\gamma$, may be defined in various ways; it may try to clean an entire room as quickly as possible or cover as much space as possible before its battery runs out. (Re)moving a piece of furniture from or within the room (Figure 1, center) may impact the robot's utility. For example, removing a chair from the room may create a shortcut to a specific location but may also create access to a corner the robot may get stuck in. Accounting for uncertainty, there may be locations in which the robot tends to slip, ending up in a different location than intended. Increasing friction, e.g., by introducing a high friction tile (Figure 1, right), may reduce the probability of undesired outcomes. All types of modifications, expressed by Δ and F, are applied offline (since such robots typically perform their task unsupervised) and should be applied economically in order to maintain usability of the environment. These types of constraints are reflected by Φ, which can restrict the design process by a predefined budget or by disallowing specific room configurations.

A baseline method for finding an optimal modification sequence involves applying an exhaustive best first search (BFS) in the space of allowed sequences and selecting one that maximizes system utility; a sketch of this baseline appears below. This approach was used for finding the optimal set of modifications in a goal recognition design setting [Keren et al., 2014; Wayllace et al., 2016]. The state space pruning applied there assumes that disallowing actions is the only allowed modification, making it inapplicable to ER-UMD, which supports arbitrary modification methods. We therefore present two novel techniques for finding an optimal design strategy for ER-UMD.
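A minimal sketch of the exhaustive baseline just described, reusing the ERUMDModel sketch above; value_of is assumed to be an external MDP solver returning the optimal expected utility of an environment from its initial state, and is not specified here.

```python
from itertools import permutations
from typing import Callable, Optional

def exhaustive_design(model: "ERUMDModel", budget: int,
                      value_of: Callable[["Environment"], float]) -> Optional[tuple]:
    """Enumerate every allowed modification sequence of length <= budget and return one
    that maximizes agent utility.  Feasible only for very small modification sets."""
    best_seq, best_value = None, float("-inf")
    for length in range(budget + 1):
        # ordered selections, since the application order may matter
        for seq in permutations(model.modifications, length):
            if not model.allowed(seq, model.initial_env):
                continue
            value = value_of(model.apply_sequence(seq))
            if value > best_value:
                best_seq, best_value = seq, value
    return best_seq
```

Because the number of candidate sequences grows combinatorially with the budget and each candidate requires solving an MDP, this baseline quickly becomes impractical, which motivates the two techniques presented next.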
3 Solving the ER-UMD Problem

3.1 ER-UMD Compilation to Planning

As a first approach, we embed design into a planning problem description. The Design Comp compilation (Definition 2) extends the agent's underlying MDP by adding pre-process operators that modify the environment offline. After initialization, the agent acts in the new, optimized environment. The compilation uses the PPDDL notation [Younes and Littman, 2004], which relies on a factored MDP representation. Accordingly, an environment $\epsilon \in E$ is represented as a tuple $\langle X_\epsilon, s_{0,\epsilon}, A_\epsilon \rangle$ with states specified as a combination of state variables $X_\epsilon$ and a transition function embedded in the description of actions. An action $a \in A_\epsilon$ is represented by $\langle prec, \langle p_1, add_1, del_1 \rangle, \ldots, \langle p_m, add_m, del_m \rangle \rangle$, where $prec$ is the set of literals that need to be true as a precondition for applying $a$. The probabilistic effects are represented by $p_i$, the probability of the $i$-th effect; when outcome $i$ occurs, $add_i$ and $del_i$ are the literals added to and removed from the state description, respectively [Mausam and Kolobov, 2012].

The policy of the compiled planning problem has two stages: design, in which the system modifies the environment, and execution, describing the policy agents follow to maximize utility. Accordingly, the compiled domain has two action types: $A_{des}$, corresponding to modifications applied by the design system, and $A_{exe}$, executed by the agent. To separate the stages we use a fluent $execution$, initially false to allow the application of $A_{des}$, and a no-cost action $a_{start}$ that sets $execution$ to true, rendering $A_{exe}$ applicable.

The compilation process supports two modification types. Modifications $\Delta_X$ change the initial state by modifying the value of state variables $X_\Delta \subseteq X$. Modifications $\Delta_A$ change the action set by enabling actions $A_\Delta$. Accordingly, the definition includes a set of design actions $A_{des} = A_{des\text{-}s_0} \cup A_{des\text{-}A}$, where $A_{des\text{-}s_0}$ are actions that change the initial value of variables and $A_{des\text{-}A}$ includes actions $A_\Delta$ that are originally disabled but can be enabled in the modified environment. In particular, we include in $A_\Delta$ actions that share the same structure as actions in the original environment except for a modified probability distribution. The following definition of Design Comp supports a design budget $B$, implemented using a timer mechanism as in [Keren et al., 2015]. The timer advances with up to $B$ design actions that can be applied before performing $a_{start}$. This constraint is represented by $\Phi^B$, which returns 0 for any modification sequence that exceeds the given budget.

Definition 2 For an ER-UMD problem $\omega = \langle \epsilon^0_\omega, R_\omega, \gamma_\omega, \Delta_\omega, F_\omega, \Phi^B_\omega \rangle$ where $\Delta_\omega = \Delta_X \cup \Delta_A$, we create a planning problem $P' = \langle X', s'_0, A', R', \gamma' \rangle$ where:
- $X' = X_{\epsilon^0_\omega} \cup \{execution\} \cup \{time_t \mid t \in 0, \ldots, B\} \cup \{enabled_a \mid a \in A_\Delta\}$
- $s'_0 = s_{0,\epsilon^0_\omega} \cup \{time_0\}$
- $A' = A_{exe} \cup A_{des\text{-}s_0} \cup A_{des\text{-}A} \cup \{a_{start}\}$, where
  - $A_{exe} = \{\langle prec(a) \cup \{execution\}, \mathit{eff}(a) \rangle \mid a \in A_{\epsilon^0}\} \cup \{\langle prec(a) \cup \{execution, enabled_a\}, \mathit{eff}(a) \rangle \mid a \in A_\Delta\}$
  - $A_{des\text{-}s_0} = \{\langle \{\neg execution, time_i\}, \langle 1, \{x, time_{i+1}\}, \{time_i\} \rangle \rangle \mid x \in X_\Delta\}$
  - $A_{des\text{-}A} = \{\langle \{\neg execution, time_i\}, \langle 1, \{enabled_a, time_{i+1}\}, \{time_i\} \rangle \rangle \mid a \in A_\Delta\}$
  - $a_{start} = \langle \emptyset, \langle 1, \{execution\}, \emptyset \rangle \rangle$
- $R'(a) = R(a)$ if $a \in A_{exe}$, and $0$ if $a \in A_{des} \cup \{a_{start}\}$
- $\gamma' = \gamma$

Optimally solving the compiled problem $P'$ yields an optimal policy $\pi^*_{P'}$ with two components, separated by the execution of $a_{start}$. The initialization component consists of a possibly empty sequence of deterministic design actions, denoted by $\vec{\Delta}_{P'}$, while the execution component represents the optimal policy in the modified environment.
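The following Python sketch mirrors the spirit of the Definition 2 construction for the design stage only: it emits, as simple dictionaries, the timer-counted design actions and the $a_{start}$ switch. Fluent and action names are our own illustrative choices; a real implementation would emit PPDDL and would also rewrite the execution actions to require the $execution$ fluent (and, for enabled actions, the corresponding $enabled_a$ fluent).

```python
def compile_design_stage(x_mods, a_mods, budget):
    """Build the design-stage actions of the compilation as {'name', 'pre', 'add', 'del'}
    dicts.  `x_mods` are state variables whose initial value may be set (Delta_X) and
    `a_mods` are names of initially disabled alternative actions (Delta_A)."""
    actions = []
    for i in range(budget):                       # the timer allows at most `budget` design steps
        pre = [f"time{i}", "not-execution"]       # design actions only apply before a_start
        add_tick, del_tick = [f"time{i+1}"], [f"time{i}"]
        for x in x_mods:                          # A_des-s0: set a state variable
            actions.append({"name": f"set-{x}-at-{i}", "pre": pre,
                            "add": [x] + add_tick, "del": del_tick})
        for a in a_mods:                          # A_des-A: enable an alternative action
            actions.append({"name": f"enable-{a}-at-{i}", "pre": pre,
                            "add": [f"enabled-{a}"] + add_tick, "del": del_tick})
    # a_start: zero-cost switch from the design stage to the execution stage
    actions.append({"name": "a-start", "pre": [],
                    "add": ["execution"], "del": ["not-execution"]})
    return actions

# e.g. one friction-tile variable, one safer variant of a move action, budget 2
design_actions = compile_design_stage(["high-friction-t3"], ["move-right-safe"], budget=2)
print(len(design_actions))  # 5: two design actions per timer step, plus a-start
```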
The next two propositions establish the correctness of the compilation; proofs are omitted due to space constraints. We first argue that $V^*(P')$, the expected reward from the initial state in the compiled planning problem, is equal to the expected reward in the optimal modified environment.

Lemma 1 Given an ER-UMD problem ω and an optimal modification sequence $\vec{\Delta}^* \in \vec{\Delta}^*_\omega$, $V^*(P') = V^*(\omega_{\vec{\Delta}^*})$.

An immediate corollary is that the compilation outcome is indeed an optimal sequence of modifications.

Corollary 1 Given an ER-UMD problem ω and the compiled model $P'$, $\vec{\Delta}_{P'} \in \vec{\Delta}^*_\omega$.

The reward function $R'$ assigns zero cost to all design actions $A_{des}$. To ensure the compilation not only respects the budget $B$ but also minimizes design cost, we can assign a small cost (negative reward) $c_d$ to design actions $A_{des}$. If $c_d$ is too high, it might lead the solver to omit design actions that improve utility by less than $c_d$. However, the loss of utility will be at most $c_d \cdot B$. Thus, by bounding the minimum improvement in utility from a modification, we can still ensure optimality.

3.2 Design as Informed Search

The key benefit of compiling ER-UMD to planning is the ability to use any off-the-shelf solver to find a design strategy. However, this approach does not fully exploit the special characteristics of the offline design setting we address. We therefore observe that embedding design into the definition of a planning problem results in an MDP with a special structure, depicted in Figure 2. The search for an optimal redesign policy is illustrated as a tree comprising two components. The design component, at the top of the figure, describes the deterministic offline design process, with nodes representing the different possibilities of modifying the environment. The execution component, at the bottom of the figure, represents the stochastic modified environments in which agents act.

Figure 2: Search space of an ER-UMD problem (top: design - system policy; bottom: execution - agent policy)

Each design node represents a different ER-UMD model, characterized by the sequence $\vec{\Delta}$ of modifications that has been applied to the environment and a constraint set Φ specifying the allowed modifications in the subtree rooted at the node. With the original ER-UMD problem ω at the root, each successor design node represents a sub-problem $\omega_{\vec{\Delta}}$ of the ancestor ER-UMD problem, accounting for all modification sequences that have $\vec{\Delta}$ as their prefix. The constraint set of each successor is updated with relation to its parent node. For example, when a design budget is specified, it is reduced when moving down the tree from a node to its successor. When a design node is associated with a valid modification sequence, it is connected to a leaf node representing an ER-UMD model with the environment $\epsilon_{\vec{\Delta}}$ that results from applying the sequence. To illustrate, invalid modification sequences are crossed out in Figure 2.

Using this search tree, we propose an informed search in the space of allowed modifications, using heuristic estimates to guide the search more effectively by focusing attention on more promising redesign options.

Algorithm 1 Best First Design (BFD)
BFD(ω, h)
1: create OPEN list for unexpanded nodes
2: $n_{cur} = \langle design, \langle\rangle \rangle$ (initial model)
3: while $n_{cur}$ do
4:   if IsExecution($n_{cur}$) then
5:     return $n_{cur}.\vec{\Delta}$ (best modification found - exit)
6:   end if
7:   for each $n_{suc} \in$ GetSuccessors($n_{cur}$, ω) do
8:     put $\langle \langle design, n_{suc}.\vec{\Delta} \rangle, h(n_{suc}) \rangle$ in OPEN
9:   end for
10:  if $\Phi(n_{cur}.\vec{\Delta}) = 1$ then
11:    put $\langle \langle execution, \vec{\Delta}_{new} \rangle, V^*(\omega_{\vec{\Delta}_{new}}) \rangle$ in OPEN
12:  end if
13:  $n_{cur}$ = ExtractMax(OPEN)
14: end while
15: return error
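A compact Python sketch of Algorithm 1, with the model's components abstracted into callables (our own decomposition, not the paper's interface): successors(seq) yields the allowed one-step extensions of a modification sequence, allowed(seq) implements Φ, h(seq) is an admissible estimate for the corresponding design node, and value_of(seq) computes the exact $V^*$ of the modified environment.

```python
import heapq
from itertools import count
from typing import Callable, Iterable

def best_first_design(successors: Callable[[tuple], Iterable[tuple]],
                      allowed: Callable[[tuple], bool],
                      h: Callable[[tuple], float],
                      value_of: Callable[[tuple], float]) -> tuple:
    """Best First Design (Algorithm 1).  Nodes are (kind, seq) pairs; 'design' nodes are
    scored by the admissible estimate h, 'execution' nodes by their exact value, and a
    max-priority queue is emulated by pushing negated scores."""
    open_list = []
    tie = count()                                  # tie-breaker so heapq never compares nodes
    cur = ("design", ())                           # root: the original model, empty sequence
    while cur is not None:
        kind, seq = cur
        if kind == "execution":
            return seq                             # best modification sequence found
        for suc in successors(seq):                # expand design successors (lines 7-9)
            heapq.heappush(open_list, (-h(suc), next(tie), ("design", suc)))
        if allowed(seq):                           # evaluate this sequence exactly (lines 10-12)
            heapq.heappush(open_list, (-value_of(seq), next(tie), ("execution", seq)))
        cur = heapq.heappop(open_list)[2] if open_list else None
    raise RuntimeError("no allowed modification sequence found")
```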
The Best First Design (BFD) algorithm (detailed in Algorithm 1) accepts as input an ER-UMD model ω and a heuristic function h. The algorithm starts by creating an OPEN priority queue (line 1) holding the frontier of unexpanded nodes. In line 2, $n_{cur}$ is assigned the original model, represented by a design flag and the empty modification sequence. The iterative exploration of the currently most promising node in the OPEN queue is given in lines 3-14. If the current node represents an execution model (indicated by the execution flag), the search ends successfully in line 5, returning the modification sequence associated with the node. Otherwise, the successor design nodes of the current node are generated by GetSuccessors in line 7. Each successor sub-problem $n_{suc}$ is placed in the OPEN list with its associated heuristic value $h(n_{suc})$ (line 8), to be discussed in detail next. In addition, if the modification sequence $n_{cur}.\vec{\Delta}$ associated with the current node is valid according to Φ, an execution node is generated and assigned a value that corresponds to the actual value $V^*(\omega_{\vec{\Delta}_{new}})$ in the resulting environment (lines 10-12). The next node to explore is extracted from OPEN in line 13.

Both termination and completeness of the algorithm depend on the implementation of GetSuccessors, which controls the graph search strategy by generating the sub-problem design nodes related to the current node. For example, when a modification budget is specified, GetSuccessors generates a sub-problem for every modification that is appended to the sequence of the parent node, discarding sequences that violate the budget and updating it for the valid successors.

For optimality, we require the heuristic function h to be admissible. An admissible estimate of a design node n is one that never underestimates $V^*_{max}(\omega_{\vec{\Delta}})$, the maximal system utility in the ER-UMD problem $\omega_{\vec{\Delta}}$ represented by n.¹ Running BFD with an admissible heuristic is guaranteed to yield an optimal modification sequence.

¹ When utility is a cost, an admissible estimate must not overestimate the real cost.

Theorem 1 Given an ER-UMD model ω and an admissible heuristic h, BFD(ω, h) returns $\vec{\Delta}^* \in \vec{\Delta}^*_\omega$.

The proof of Theorem 1 bears similarity to the proof of optimality of A* [Nilsson, 1980] and is omitted here for the sake of brevity.

The simplified-environment heuristic

To produce efficient over-estimates of the maximal system utility $V^*_{max}(\omega)$, we suggest a heuristic that requires a single pre-processing simplification of the original environment, which is then used to produce estimates for the design nodes of the search.

Definition 3 Given an ER-UMD model ω, a function $f : E \to E$ is an environment simplification in ω if for every $\epsilon, \epsilon' \in E_\omega$ s.t. $\epsilon' = f(\epsilon)$, $V^*(\omega) \leq V^*(\omega_f)$, where $\omega_f$ is the ER-UMD model with $f(\epsilon)$ as its initial environment.

The simplified-environment heuristic, denoted by $h_{sim}$, estimates the value of applying a modification sequence $\vec{\Delta}$ to ω by the value of applying it to $\omega_f$:

$h_{sim}(\omega_{\vec{\Delta}}) \stackrel{\mathrm{def}}{=} V^*_{max}(\omega_{f,\vec{\Delta}})$   (1)

The search applies modifications to the simplified model and uses its optimal solution as an estimate of the value of applying the same modifications in the original setting. In particular, the simplified model can be solved using the Design Comp compilation presented in the previous section.
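A sketch of how Eq. (1) can be realized, together with one possible simplification f (the all-outcome determinization discussed next), reusing the Environment sketch from Section 2. solve_design is assumed to be an external routine that returns the maximal utility of the simplified sub-problem, e.g., by solving its Design Comp compilation; it is not given here.

```python
def all_outcome_determinization(env: "Environment") -> "Environment":
    """One common simplification: each probabilistic outcome of every action becomes its
    own deterministic action, so the planner may 'choose' any outcome."""
    det_transitions, det_actions = {}, set()
    for (s, a), outcomes in env.transitions.items():
        for i, succ in enumerate(outcomes):             # iterate over successor states
            det_a = (a, i)                               # copy of `a` committed to outcome i
            det_actions.add(det_a)
            det_transitions[(s, det_a)] = {succ: 1.0}
    return Environment(env.states, sorted(det_actions, key=str),
                       det_transitions, env.initial_state)

def make_h_sim(model: "ERUMDModel", simplify, solve_design):
    """The simplified-environment heuristic h_sim of Eq. (1): simplify the original
    environment once, then score each design node by the best utility achievable when
    its modification sequence is applied to the simplified model."""
    simplified_env = simplify(model.initial_env)         # single pre-processing step
    def h_sim(seq):
        return solve_design(simplified_env, seq)         # V*max of the simplified sub-problem
    return h_sim
```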
The literature is rich with simplification approaches, including adding macro actions that do more with the same cost, removing some action preconditions, eliminating negative effects of actions (delete relaxation), or eliminating undesired outcomes [Holte et al., 1996]. Particular to stochastic settings is the commonly used all-outcome determinization [Yoon et al., 2007], which creates a deterministic action for each probabilistic outcome of every action.

Lemma 2 Given an ER-UMD model ω, the simplified-environment heuristic with f implemented as an all-outcome determinization function is admissible.

The proof of Lemma 2, omitted for brevity, uses the observation that f only adds solutions with higher reward (lower cost) to a given problem (either before or after redesign). Similar reasoning applies to the commonly used delete relaxation and to the other approaches discussed above.

Note that the admissibility of a heuristic function depends on specific characteristics of the ER-UMD setting. In particular, the simplified-environment heuristic is not guaranteed to produce admissible estimates for policy teaching [Zhang and Parkes, 2008] or goal recognition design [Keren et al., 2014; Wayllace et al., 2016], where design is performed to incentivize agents to follow specific policies. This is because the relaxation itself may change the set of optimal agent policies and therefore underestimate the value of a modification.

| Domain | change init | reduce probability |
| --- | --- | --- |
| BOX | relocate a truck | driving to wrong destination |
| BLOCK | | dropping a block or tower |
| EX-BLOCK | | as for Blocks World |
| TIRE | add spare tires | having a flat tire |
| ELEVATOR | add elevator shaft | falling to initial state |
| VACUUM | (re)move furniture | add high friction tile |

Table 1: Allowed modifications for each domain

| Domain | B=1 solved | B=1 reduc | B=2 solved | B=2 reduc | B=3 solved | B=3 reduc |
| --- | --- | --- | --- | --- | --- | --- |
| BOX | 8 | 28 | 8 | 42 | 7 | 44 |
| BLOCK | 6 | 21 | 3 | 24 | 3 | 24 |
| EX-BLOCK | 10 | 42 | 9 | 42 | 9 | 42 |
| TIRE | 9 | 44 | 8 | 51 | 6 | 54 |
| ELEVATOR | 9 | 22 | 7 | 24 | 1 | 18 |
| VACUUM | 8 | 15 | 6 | 17 | 0 | |

Table 2: Utility improvement for optimal solvers

4 Empirical Evaluation

We evaluated the ability to maximize agent utility given a design budget in various ER-UMD problems, as well as the performance of both optimal and approximate techniques. We used five PPDDL domains from the probabilistic tracks of the sixth and eighth International Planning Competitions² (IPPC06 and IPPC08) representing stochastic shortest path MDPs with uniform action cost: Box World (IPPC08/BOX), Blocks World (IPPC08/BLOCK), Exploding Blocks World (IPPC08/EX-BLOCK), Triangle Tire (IPPC08/TIRE), and Elevators (IPPC06/ELEVATOR). In addition, we implemented the vacuum cleaning robot setting from Example 1 (VACUUM) as an adaptation of the Taxi domain [Dietterich, 2000]. The robot moves in a grid and collects pieces of dirt. It cannot move through furniture, represented by occupied cells, and may fail to move, remaining in its current position. In all domains, agent utility is expressed as expected cost and constraints as a design budget. For each domain, we examined at least two possible modifications, including at least one that modifies the probability distribution. Modifications by domain are specified in Table 1, with change init marking modifications of the initial state and reduce probability marking modifications of the probability function.

² http://icaps-conference.org/index.php/main/competitions
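As a toy illustration (numbers are ours, not from the evaluation) of why a probability-modifying change such as the high-friction tile of Example 1 reduces expected cost: if a move out of cell i fails with probability p_i, the expected number of attempts to leave that cell is 1/(1 - p_i).

```python
def expected_moves(slip_probs):
    """Expected number of move actions to traverse a row of cells when the move out of
    cell i fails (the robot stays put) with probability slip_probs[i]."""
    return sum(1.0 / (1.0 - p) for p in slip_probs)

before = expected_moves([0.4, 0.4, 0.4])   # original room: three slippery cells
after = expected_moves([0.4, 0.1, 0.4])    # a high-friction tile on the middle cell
print(round(before, 2), round(after, 2))   # 5.0 vs. 4.44: lower expected cost after redesign
```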
4.1 Optimal Solutions

Setup For each domain, we created 10 smaller instances optimally solvable within a time bound of five minutes. Each instance was optimally solved using:
- EX: exhaustive exploration of possible modification sequences;
- DC: solution of the Design Comp compilation;
- BFD: Algorithm 1 with the simplified-environment heuristic, using the delete relaxation to simplify the model and the Design Comp compilation to solve it optimally.

We used a portfolio of three admissible heuristics:
- h0 assigns 0 to all states and serves as a baseline for assessing the value of more informative heuristics;
- h0+ assigns 1 to all non-goal states and 0 otherwise;
- hMinMin solves the all-outcome determinization using the zero heuristic [Bonet and Geffner, 2005].

Each problem was tested on an Intel(R) Xeon(R) CPU X5690 machine with a budget of 1, 2, and 3. Design actions were assigned a cost of 10⁻⁴, and problems were solved using LAO* [Hansen and Zilberstein, 1998] with a convergence error bound of 10⁻⁶. Each run had a 30-minute time limit.

Results Separated by domain and budget, Table 2 summarizes the number of solved instances (solved) and the average percentage of expected cost reduction over the instances solved (reduc). In all domains, the complexity brought by an increased budget reduces the number of solved instances, while the actual reduction varies among domains. As for solution improvement, all domains show an improvement of 15% to 54%.

Table 3 compares solution performance. Each column represents a solver and heuristic pair. Results are separated by domain and budget, depicting the average running time for problems solved by all approaches for a given budget and, in parentheses, the number of instances solved (na indicates no instances were solved within the time limit). The dominating approach for each row (representing a domain and budget) is emphasized in bold. In all cases, the use of informed search outperformed the exhaustive approach.

| Domain | B | EX-h0 | EX-h0+ | EX-hMinMin | DC-h0 | DC-h0+ | DC-hMinMin | BFD-h0 | BFD-h0+ | BFD-hMinMin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BOX | 1 | 158.4 (8) | 159.0 (8) | 158.9 (8) | 163.9 (8) | 70.7 (8) | 221.4 (8) | 157.4 (8) | 68.2 (8) | 216.4 (8) |
| BOX | 2 | 264.7 (7) | 264.9 (7) | 267.8 (7) | 270.6 (7) | 92.1 (8) | 332.7 (7) | 260.8 (7) | 88.0 (8) | 325.3 (7) |
| BOX | 3 | 238.5 (4) | 236.5 (4) | 235.6 (4) | 241.5 (4) | 73.5 (4) | 271.7 (4) | 234.3 (4) | 70.2 (7) | 265.94 (4) |
| BLOCK | 1 | 50.5 (6) | 50.5 (6) | 50.8 (6) | 50.7 (6) | 41.7 (6) | 77.1 (6) | 50.3 (6) | 41.6 (6) | 74.4 (6) |
| BLOCK | 2 | 28.0 (4) | 28.2 (4) | 28.0 (4) | 28.2 (4) | 17.4 (4) | 36.4 (3) | 28.0 (4) | 17.2 (4) | 35.4 (3) |
| BLOCK | 3 | 348.9 (2) | 347.3 (2) | 348.2 (2) | 354.5 (2) | 194.6 (3) | 363.5 (2) | 352.2 (2) | 118.2 (3) | 354.8 (2) |
| EX-BLOCK | 1 | 69.4 (9) | 70.2 (9) | 69.9 (9) | 68.4 (9) | 38.7 (9) | 6.7 (10) | 69.5 (9) | 40.3 (9) | 60.3 (9) |
| EX-BLOCK | 2 | 161.7 (9) | 170.9 (9) | 168.1 (9) | 153.1 (9) | 88.2 (9) | 30.2 (10) | 153.9 (9) | 85.6 (9) | 135.0 (9) |
| EX-BLOCK | 3 | 250.7 (9) | 265.9 (9) | 292.2 (9) | 252.5 (9) | 134.9 (9) | 88.8 (8) | 285.9 (9) | 160.9 (9) | 237.4 (9) |
| TIRE | 1 | 32.9 (9) | 32.9 (9) | 33.1 (9) | 33.3 (9) | 30.2 (9) | 36.8 (9) | 33.0 (9) | 29.5 (9) | 36.9 (9) |
| TIRE | 2 | 55.2 (7) | 55.4 (7) | 55.0 (7) | 55.5 (7) | 51.1 (8) | 88.8 (8) | 55.0 (7) | 50.9 (8) | 89.1 (8) |
| TIRE | 3 | 270.3 (6) | 136.5 (6) | 258.4 (6) | 269.7 (6) | 136.5 (6) | 258.4 (6) | 267.6 (6) | 188.3 (6) | 256.2 (6) |
| ELEVATOR | 1 | 300.4 (8) | 299.6 (8) | 301.6 (8) | 301.9 (8) | 236.2 (9) | 192.6 (9) | 302.6 (8) | 238.3 (9) | 176.6 (9) |
| ELEVATOR | 2 | 361.8 (5) | 360.9 (5) | 366.2 (5) | 363.4 (5) | 261.0 (5) | 243.89 (5) | 360.8 (5) | 258.6 (5) | 231.2 (5) |
| ELEVATOR | 3 | na | na | na | na | 1504.6 (1) | 1117.4 (1) | na | 1465.8 (1) | 1042.5 (1) |
| VACUUM | 1 | 0.15 (9) | 0.16 (9) | 0.15 (9) | 0.17 (9) | 0.099 (9) | 0.15 (9) | 0.15 (9) | 0.096 (9) | 0.13 (9) |
| VACUUM | 2 | 3.6 (9) | 3.27 (9) | 3.44 (9) | 3.25 (9) | 2.13 (9) | 2.49 (9) | 3.39 (9) | 2.021 (9) | 2.61 (9) |
| VACUUM | 3 | na | na | na | na | na | na | na | na | na |

Table 3: Running time and number of instances solved for optimal solvers
4.2 Approximate Solutions

Setup For approximate solutions we used an MDP reduced-model approach that creates simplified MDPs accounting for the full probabilistic model a bounded number of times (for each execution history) and treats the problem as deterministic afterwards [Pineda and Zilberstein, 2014]. The deterministic problems were solved using the FF classical planner [Hoffmann and Nebel, 2001], as explained in [Pineda and Zilberstein, 2017]. We used Monte-Carlo simulations to evaluate each policy's probability of reaching a goal state and its expected cost. In particular, we gave the planner 20 minutes to solve each problem 50 times. We used the first 10 instances of each competition domain mentioned above, excluding Box World due to limitations of the planner. For the VACUUM domain we generated ten room configurations with grids of up to 5x7 cells, based on Figure 1.

Results Table 4 reports three measures (per budget): the number of problems completed within the allocated time (solved), the improvement in the probability of reaching a goal of the resulting policies with respect to the policies obtained without design (δPs), and the average percentage of reduction in expected cost after redesign (reduc); δPs and reduc are averaged only over problems solved 50 times when using both the original and the modified model. In general, we observed that redesign enables improvement either in expected cost or in probability of success (and sometimes both) across all budgets. For BLOCK and EX-BLOCK, a budget of 2 yielded the best results, while for ELEVATOR, TIRE, and VACUUM a budget of 3 was better. However, the increased difficulty of the compiled problem sometimes resulted in a lower number of solved problems (e.g., solving only 3 problems on TIRE with a budget of 3). Nevertheless, these results demonstrate the feasibility of obtaining good solutions when compromising optimality.

| Domain | B=1 solved | B=1 δPs | B=1 reduc | B=2 solved | B=2 δPs | B=2 reduc | B=3 solved | B=3 δPs | B=3 reduc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLOCK | 8 | 0 | 19.1 | 8 | 0 | 21.2 | 8 | 0 | 18.6 |
| EX-BLOCK | 10 | 0.42 | 0 | 10 | 0.50 | 0 | 10 | 0.41 | 0 |
| TIRE | 7 | 0 | 6.98 | 7 | 0 | 17.9 | 3 | 0 | 33 |
| ELEVATOR | 10 | -0.33 | 25 | 10 | 0.1 | 30 | 10 | 0.1 | 38.3 |
| VACUUM | 10 | 0.2 | 8.12 | 10 | 0.2 | 4.72 | 10 | 0.3 | 9.72 |

Table 4: Utility improvement for the sub-optimal solver

4.3 Discussion

For all examined domains, the results indicate the benefit of using heuristic search over an exhaustive search in the modification space. However, the dominating heuristic approach varied between domains, and for TIRE also between budget allocations. Investigating the reasons for this variance, we note a key difference between BFD and DC. While DC applies modifications to the original model, BFD uses the simplified-environment heuristic, which applies them to a simplified model. Poor performance of BFD can therefore be due either to the applied simplification having only a minor effect on computational complexity or to an exaggerated effect that limits the informative value of the heuristic estimates. In particular, this can happen when the design process and the simplification overlap. To illustrate, by applying the all-outcome determinization to the VACUUM domain depicted in Example 1, we ignore the undesired outcome of slipping. Consequently, the heuristic completely overlooks the value of adding high-friction tiles, while providing informative estimates of the value of (re)moving furniture.
This observation may explain the poor performance of BFD on EX-BLOCK, where simplification via the delete relaxation ignores the possibility of blocks exploding and thus overlooks the value of the proposed modifications. The estimates used by BFD may therefore be improved by developing a heuristic that aggregates several estimates. Also, when the order of application is immaterial, a closed list of examined sets may be used in the BFD approach, but not with DC. Finally, a combination of relaxation approaches may enhance the performance of sub-optimal solvers.

5 Related Work

Environment design [Zhang et al., 2009] provides a framework in which an interested party (the system) seeks an optimal way to modify an environment to maximize some utility. Among the many ways to instantiate the general model, policy teaching [Zhang and Parkes, 2008; Zhang et al., 2009] enables a system to modify the reward function of a stochastic agent to entice it to follow specific policies. We focus on a different special case, in which the system is altruistic and redesigns the environment in order to maximize agent utility. The techniques used for solving policy teaching do not apply to our setting, which supports arbitrary modifications.

The Design Comp compilation is inspired by the technique of Göbelbecker et al. (2010) of coming up with good excuses for why there is no solution to a planning problem. Our compilation extends the original approach in four directions. First, we move from a deterministic environment to a stochastic one. Second, we maximize agent utility rather than only moving from unsolvable to solvable. Third, we embed support for a design budget. Finally, we support arbitrary modification alternatives, including modifications specific to stochastic settings as well as all those suggested for deterministic settings [Herzig et al., 2014; Menezes et al., 2012].

6 Conclusions

We presented the ER-UMD framework for maximizing agent utility by the offline design of stochastic environments. We presented two solution approaches: a compilation-based method that embeds design into the definition of a planning problem, and an informed heuristic search in the modification space, for which we provided an admissible heuristic. Our empirical evaluation supports the feasibility of the approaches and shows substantial utility gain on all evaluated domains. In future work, we will explore creating tailored heuristics to improve planner performance. We will also extend the model to deal with partial observability using POMDPs, as well as automatically finding possible modifications, similarly to [Göbelbecker et al., 2010]. In addition, we plan to extend the offline design paradigm by accounting for online design that can be dynamically applied to a model.

Acknowledgements

The work was supported in part by the National Science Foundation, grant number IIS-1405550.

References

[Bellman, 1957] Richard Bellman. A Markovian decision process. Indiana University Mathematics Journal, 6:679–684, 1957.

[Bertsekas, 1995] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

[Bonet and Geffner, 2005] Blai Bonet and Héctor Geffner. mGPT: A probabilistic planner based on heuristic search. Journal of Artificial Intelligence Research, 24:933–944, 2005.

[Dietterich, 2000] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
[Göbelbecker et al., 2010] Moritz Göbelbecker, Thomas Keller, Patrick Eyerich, Michael Brenner, and Bernhard Nebel. Coming up with good excuses: What to do when no plan can be found. Cognitive Robotics, 2010.

[Hansen and Zilberstein, 1998] Eric A. Hansen and Shlomo Zilberstein. Heuristic search in cyclic AND/OR graphs. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), pages 412–418, 1998.

[Herzig et al., 2014] Andreas Herzig, Viviane Menezes, Leliane Nunes de Barros, and Renata Wassermann. On the revision of planning tasks. In Proceedings of the Twenty-First European Conference on Artificial Intelligence (ECAI), pages 435–440, 2014.

[Hoffmann and Nebel, 2001] Jörg Hoffmann and Bernhard Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253–302, 2001.

[Holte et al., 1996] Robert C. Holte, M. B. Perez, R. M. Zimmer, and Alan J. MacDonald. Hierarchical A*: Searching abstraction hierarchies efficiently. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI), pages 530–535, 1996.

[Keren et al., 2014] Sarah Keren, Avigdor Gal, and Erez Karpas. Goal recognition design. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling (ICAPS), pages 154–162, 2014.

[Keren et al., 2015] Sarah Keren, Avigdor Gal, and Erez Karpas. Goal recognition design for non-optimal agents. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pages 3298–3304, 2015.

[Mausam and Kolobov, 2012] Mausam and Andrey Kolobov. Planning with Markov Decision Processes: An AI Perspective. Morgan & Claypool Publishers, 2012.

[Menezes et al., 2012] Maria Viviane Menezes, Leliane N. de Barros, and Silvio do Lago Pereira. Planning task validation. In ICAPS Workshop on Scheduling and Planning Applications (SPARK), pages 48–55, 2012.

[Nilsson, 1980] Nils J. Nilsson. Principles of Artificial Intelligence. Tioga Publishers, Palo Alto, California, 1980.

[Pineda and Zilberstein, 2014] Luis Pineda and Shlomo Zilberstein. Planning under uncertainty using reduced models: Revisiting determinization. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling (ICAPS), pages 217–225, 2014.

[Pineda and Zilberstein, 2017] Luis Pineda and Shlomo Zilberstein. Generalizing the role of determinization in probabilistic planning. Technical Report UM-CS-2017-006, University of Massachusetts Amherst, 2017.

[Wayllace et al., 2016] Christabel Wayllace, Ping Hou, William Yeoh, and Tran Cao Son. Goal recognition design with stochastic agent action outcomes. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI), pages 3279–3285, 2016.

[Yoon et al., 2007] Sung Wook Yoon, Alan Fern, and Robert Givan. FF-Replan: A baseline for probabilistic planning. In Proceedings of the Seventeenth International Conference on Automated Planning and Scheduling (ICAPS), pages 352–359, 2007.

[Younes and Littman, 2004] Håkan L. S. Younes and Michael L. Littman. PPDDL1.0: The language for the probabilistic part of IPC-4. In Proceedings of the International Planning Competition (IPC), 2004.
[Zhang and Parkes, 2008] Haoqi Zhang and David Parkes. Value-based policy teaching with active indirect elicitation. In Proceedings of the Twenty-Third Conference on Artificial Intelligence (AAAI), pages 208–214, 2008.

[Zhang et al., 2009] Haoqi Zhang, Yiling Chen, and David Parkes. A general approach to environment design with one agent. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI), pages 2002–2008, 2009.