# Solving Partially Observable Stochastic Shortest-Path Games

Petr Tomášek¹, Karel Horák¹, Aditya Aradhye¹, Branislav Bošanský¹ and Krishnendu Chatterjee²

¹Artificial Intelligence Center, Dept. of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague
²Institute of Science and Technology Austria

{petr.tomasek, karel.horak, aditya.aradhye, branislav.bosansky}@aic.fel.cvut.cz, krishnendu.chatterjee@ist.ac.at

## Abstract

We study the two-player zero-sum extension of the partially observable stochastic shortest-path problem where one agent has only partial information about the environment. We formulate this problem as a partially observable stochastic game (POSG): given a set of target states and negative rewards for each transition, the player with imperfect information maximizes the expected undiscounted total reward until a target state is reached. The second player, who has perfect information, aims for the opposite. We base our formalism on POSGs with one-sided observability (OS-POSGs) and make the following contributions: (1) we introduce a novel heuristic search value iteration algorithm that iteratively solves depth-limited variants of the game, (2) we derive a bound on the depth that guarantees an arbitrary precision, (3) we propose a novel upper-bound estimation that allows early termination, and (4) we experimentally evaluate the algorithm on a pursuit-evasion game.

## 1 Introduction

The stochastic shortest path (SSP) problem [Bertsekas and Tsitsiklis, 1991; Bertsekas, 1995] is a classical problem in which an agent aims to find an optimal plan for reaching a target state in a stochastic environment (a problem with indefinite horizon). The problem can be modeled as a Markov Decision Process (MDP) where the objective is an undiscounted sum of rewards (throughout the paper we assume that agents maximize their rewards, i.e., a cost or penalty corresponds to a negative reward): for each transition the agent receives a penalty, and the agent therefore wants to reach a target state where no additional penalty is received. This model naturally arises in many fields including robotics [Lim et al., 2013; Saisubramanian et al., 2019], wireless networks [Chen et al., 2007], and model checking [Norman et al., 2005].

Throughout the years, many variants of the original problem have emerged. We focus on two of them with significant practical implications: (i) partially observable SSPs [Patek, 1999; Egorov et al., 2016; Horák et al., 2018; Delamer et al., 2019] and (ii) planning in the presence of an adversary, i.e., SSP games [Patek and Bertsekas, 1999; Neu et al., 2012; Rosenberg and Mansour, 2020; Chen et al., 2020]. The partially observable SSP (also referred to as a Goal-POMDP) generalizes the model to better reflect real-world scenarios where perfect information is not always available (e.g., robotic sensors are imprecise, and the true position of the robot in an environment may be unknown). For the latter variant, formulating the problem as a game against an opponent allows the agent to find robust plans in an adversarial environment. Until now, however, the combination of these two variants, an SSP game with partial observability, has not been sufficiently covered by existing work. We address this gap and, to the best of our knowledge, solve Partially Observable Stochastic Shortest-Path Games (POSSPGs) for the first time. We focus on the two-player zero-sum setting.
From the game-theoretic perspective, a POSSPG is a variant of Partially Observable Stochastic Games (POSGs), which are intractable in general. One of the main reasons for the intractability of POSGs is reasoning about uncertainty: one player must consider not only a belief over the possible states of the environment, but also the opponent's beliefs, beliefs over beliefs, and a belief hierarchy in general. To avoid such nesting, we restrict ourselves to a simplified setting where one player has perfect information about the course of the game. In a POSSPG, this is quite natural, since we can assume that the adversary has perfect information and the agent seeks a robust plan against a well-informed opponent. Moreover, the existing works on POSGs with such an information asymmetry (termed one-sided POSGs [Horák et al., 2017]; OS-POSGs) offer algorithms for solving such games [Horák et al., 2017; Horák et al., 2020]. Unfortunately, the heuristic search value iteration (HSVI) algorithm introduced for one-sided POSGs uses the discounted-sum objective, and the proof of convergence of the algorithm relies on the discount factor being strictly smaller than 1. Therefore, the existing algorithm cannot be used for POSSPGs, where agents maximize the undiscounted sum of rewards.

We address these challenges and introduce a new algorithm for solving POSSPGs based on HSVI. Although the high-level idea of our algorithm is simple, realizing it in a way that yields theoretical guarantees in the game-theoretic context is nontrivial. In our algorithm, we solve a sequence of depth-limited variants of the original game (termed k-cutoff games) while gradually increasing the depth limit. The main challenge stems from determining whether the currently found solution is sufficiently close to the optimum or whether the algorithm should continue with an increased depth limit. To this end, we (i) derive a theoretical bound on the depth limit guaranteeing a desired precision $\varepsilon$ and (ii) extend HSVI with an auxiliary upper-bound estimation of the value of the original unbounded game. This auxiliary upper bound then allows earlier termination if it is sufficiently close to the lower bound obtained from the k-cutoff games. We evaluate the performance of our algorithm on a pursuit-evasion game on a graph and demonstrate that using the original HSVI for OS-POSGs with very large discount factors is not a feasible alternative to our new HSVI algorithm for POSSPGs.

## 2 One-sided Partially Observable Stochastic Games (OS-POSGs)

We start by reviewing the closely related model of one-sided partially observable stochastic games (OS-POSGs) [Horák et al., 2017; Horák et al., 2020]: these are two-player zero-sum infinite-horizon games played on a graph, where one of the players (player 1, P1) is imperfectly informed about the course of the game, while his adversary (player 2, P2) is perfectly informed. Unlike in our proposed model, however, the objective of the players is to optimize the discounted sum of rewards obtained in the course of the game.

Formally, an OS-POSG is a tuple $G = \langle S, A_1, A_2, O, T, R, b^{init} \rangle$. The game starts in one of the states $s^{(0)} \in S$, where $S$ is the finite set of states. The state $s^{(0)}$ is sampled from the probability distribution over states $b^{init} \in \Delta(S)$ called the initial belief. Then, in each stage $t$ of the game ($t \geq 1$), the players choose their actions $a^{(t)}_1 \in A_1$ and $a^{(t)}_2 \in A_2$ simultaneously and independently of each other.
Based on their choice, the game transitions to a new state $s^{(t)}$ and an observation $o^{(t)} \in O$ is generated for P1. The set $O$ contains all possible observations for P1, which provide P1 with partial information about the new state. Formally, the transition function $T$ ensures that $s^{(t)}$ and $o^{(t)}$ are generated with probability $T(o^{(t)}, s^{(t)} \mid s^{(t-1)}, a^{(t)}_1, a^{(t)}_2)$. For this transition, P1 receives the stage reward $R(s^{(t-1)}, a^{(t)}_1, a^{(t)}_2)$. P2 can use the entire past history of the game $s^{(0)} (a^{(t')}_1, a^{(t')}_2, o^{(t')}, s^{(t')})_{t'=1}^{t-1}$ to make the decision about his upcoming action $a^{(t)}_2$. P1, on the other hand, can only consider his own past actions and observations $(a^{(t')}_1, o^{(t')})_{t'=1}^{t-1}$ when making his decision. The total expected utility of P1 is defined as the expected infinite discounted sum of stage rewards, that is, $\mathbb{E}[\sum_{t=1}^{\infty} \gamma^{t-1} R(s^{(t-1)}, a^{(t)}_1, a^{(t)}_2)]$. The discount factor $\gamma$ is assumed to be positive and strictly less than 1. The goal of P1 is to maximize his total expected utility, while P2 aims at minimizing the total expected utility of P1.

### 2.1 Solving OS-POSGs

One of the options for solving OS-POSGs is to (approximately) find the optimal value function $V^* : \Delta(S) \rightarrow \mathbb{R}$ of the game $G$. This function maps beliefs $b \in \Delta(S)$ of P1 to the discounted payoff $V^*(b)$ he can achieve in $G$ given that $b$ is the current distribution over the possible states of the game, and it can be characterized by a Bellman-style fixed-point equation [Horák et al., 2017]:

$$V^*(b) = [HV^*](b) = \max_{\pi_1} \min_{\pi_2} \Big[ \mathbb{E}_{b,\pi_1,\pi_2}[R(s, a_1, a_2)] + \gamma \sum_{a_1, o} \mathbb{P}_{b,\pi_1,\pi_2}[a_1, o] \cdot V^*(\tau(b, a_1, \pi_2, o)) \Big] \qquad (1)$$

where $\pi_1 \in \Delta(A_1)$ and $\pi_2 : S \rightarrow \Delta(A_2)$ are the stage strategies players 1 and 2 use to choose their actions $a_1$ and $a_2$ in the current stage of the game, and $\tau(b, a_1, \pi_2, o)(s') = \mathbb{P}_{b,\pi_2}[s' \mid a_1, o]$ is the Bayesian belief update.

Similarly to the single-player case (POMDP), $V^*$ can be approximated using more scalable algorithms. For OS-POSGs, there exists a modification of the heuristic search value iteration algorithm (HSVI) [Horák et al., 2017] that extends HSVI [Smith and Simmons, 2004; Smith and Simmons, 2005] to two-player zero-sum games. This algorithm (Algorithm 1) uses self-play between P1 and P2 to identify relevant beliefs in the game and aims at achieving a close approximation of $V^*$ primarily in these relevant beliefs. Piecewise linear and convex lower and upper bounds $\underline{V}$ and $\overline{V}$ on $V^*$ are maintained and gradually refined over time. Since $V^*$ is $\delta$-Lipschitz continuous for $\delta = (\max R(\cdot) - \min R(\cdot))/(2(1-\gamma))$, $\delta$-Lipschitz bounds $\underline{V}$ and $\overline{V}$ are considered. The goal of the algorithm is to ensure that a desired precision $\overline{V}(b^{init}) - \underline{V}(b^{init}) \leq \varepsilon$ is eventually reached (line 1). To this end, the algorithm performs a sequence of trials (represented by the Explore procedure) that aim at ensuring that the beliefs $b^{(k)}$ reached at the $k$-th level of recursion are approximated within precision $\rho(k)$. The sequence $\rho$ is defined as

$$\rho(1) = \varepsilon, \qquad \rho(k+1) = [\rho(k) - 2\delta D]/\gamma \qquad (2)$$

where $\delta$ is the Lipschitz constant of $V^*$ and $D > 0$ is a parameter chosen such that $\rho$ is monotonically increasing and unbounded.
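Both the stage games used in the trials described below and the excess-gap computations evaluate the Bayesian update $\tau(b, a_1, \pi_2, o)$ from Equation (1). The following minimal sketch (our own illustration in Python, assuming hypothetical dictionary-based encodings of $b$, $\pi_2$ and $T$; it is not the authors' implementation) shows how this update and the probability of the corresponding $(a_1, o)$ branch can be computed:

```python
# Illustrative sketch of the Bayesian belief update tau(b, a1, pi2, o).
# Assumed (hypothetical) encodings: b maps states to probabilities,
# pi2[s] maps player-2 actions to probabilities, and T[(s, a1, a2)]
# maps pairs (observation, next_state) to probabilities.

def belief_update(b, a1, pi2, o, T):
    """Return (tau(b, a1, pi2, o), P_{b, pi2}[o | a1])."""
    unnormalized = {}
    for s, p_s in b.items():
        for a2, p_a2 in pi2[s].items():
            for (obs, s_next), p_tr in T[(s, a1, a2)].items():
                if obs == o:
                    unnormalized[s_next] = unnormalized.get(s_next, 0.0) + p_s * p_a2 * p_tr
    p_obs = sum(unnormalized.values())   # probability of observing o given that a1 was played
    if p_obs == 0.0:
        return None, 0.0                 # the branch (a1, o) is unreachable under (b, pi2)
    return {s_next: p / p_obs for s_next, p in unnormalized.items()}, p_obs
```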
In each stage of a trial, the players choose their stage strategies $\pi^{(k)}_1$ and $\pi^{(k)}_2$ (lines 5–6). The maximizing player 1 chooses his strategy by solving the game corresponding to $[H\overline{V}](b)$, while the minimizing player 2 chooses his strategy by solving $[H\underline{V}](b)$ (i.e., each player obtains his own strategy by reasoning about the variant of the stage game that is optimistic from his perspective). ([Horák et al., 2020] shows that for piecewise linear and convex bounds $\underline{V}$ and $\overline{V}$, these strategies can be obtained by means of linear programming.) Each trial constructs a sequence of beliefs (line 7) such that (1) $b^{(k+1)}$ is reachable from $b^{(k)}$ when employing the stage strategies $\pi^{(k)}_1$ and $\pi^{(k)}_2$, and (2) $b^{(k+1)}$ has the highest excess gap

$$\mathrm{excess}_{k+1}(\tau(b^{(k)}, a_1, \pi^{(k)}_2, o)) = \overline{V}(\tau(b^{(k)}, a_1, \pi^{(k)}_2, o)) - \underline{V}(\tau(b^{(k)}, a_1, \pi^{(k)}_2, o)) - \rho(k+1) \qquad (3)$$

among the reachable beliefs, weighted by the probability $\mathbb{P}_{b^{(k)}, \pi^{(k)}_1, \pi^{(k)}_2}[a_1, o]$ of reaching that belief. Upon reaching a belief $b^{(k)}$ where sufficient accuracy is achieved (i.e., $\mathrm{excess}_k(b^{(k)}) \leq 0$), the trial terminates, and a sequence of point-based updates in the beliefs $b^{(1)}, \ldots, b^{(k)}$ is performed on line 9 to refine the bounds $\underline{V}$ and $\overline{V}$ (the updates ensure that $\underline{V}(b) := [H\underline{V}](b)$ and $\overline{V}(b) := [H\overline{V}](b)$ for the selected beliefs $b$).

Algorithm 1: HSVI for discounted OS-POSGs
1: while $\overline{V}(b^{init}) - \underline{V}(b^{init}) > \varepsilon$ do
2:   Explore($b^{init}$, 1)
3: procedure Explore($b^{(t)}$, $t$)
4:   if $\mathrm{excess}_t(b^{(t)}) \leq 0$ then return
5:   $\pi^{(t)}_1 \leftarrow$ optimal strategy of P1 in $[H\overline{V}](b^{(t)})$
6:   $\pi^{(t)}_2 \leftarrow$ optimal strategy of P2 in $[H\underline{V}](b^{(t)})$
7:   $b^{(t+1)} \leftarrow \arg\max_{a_1, o} \mathbb{P}_{b^{(t)}, \pi^{(t)}_1, \pi^{(t)}_2}[a_1, o] \cdot \mathrm{excess}_{t+1}(\tau(b^{(t)}, a_1, \pi^{(t)}_2, o))$
8:   Explore($b^{(t+1)}$, $t+1$)
9:   perform a point-based update of $\underline{V}$ and $\overline{V}$ in $b^{(t)}$

**Convergence properties.** When discussing our proposed modifications of the HSVI algorithm, we will refer to the following key properties of the HSVI algorithm for OS-POSGs that suffice to ensure its convergence:

(1) Each trial terminates in some $b^{(k)}$ with $\mathrm{excess}_k(b^{(k)}) \leq 0$. The payoff in the game is bounded, and so is the gap between $\underline{V}$ and $\overline{V}$. As $\rho$ is unbounded, $\rho(k)$ will eventually exceed $\overline{V}(b^{(k)}) - \underline{V}(b^{(k)})$. Observe that this also means that all beliefs reachable from $b^{(k-1)}$ have non-positive excess gap on line 4.

(2) The point-based update in $b^{(k-1)}$ ensures that all beliefs within a hypersphere with radius $D$ centered in $b^{(k-1)}$ have non-positive excess gap [Horák et al., 2017, Lemma 4].

(3) Eventually, no beliefs with $\mathrm{excess}_k(b^{(k)}) > 0$ remain (unless the algorithm converges earlier). By (1) and (2), each trial identifies (at least) one belief $b^{(k-1)}$ with positive excess gap and makes all beliefs within the $D$-neighborhood of $b^{(k-1)}$ have non-positive excess gap. By a standard packing argument, the hyperspheres with radius $D$ describing beliefs with non-positive excess gap eventually cover the entire belief space, hence no beliefs with positive excess gap remain [Horák et al., 2017, Theorem 3]. Observe that the absence of beliefs with positive excess gap means that $\mathrm{excess}_1(b^{init}) \leq 0$ and hence $\overline{V}(b^{init}) - \underline{V}(b^{init}) \leq \rho(1) = \varepsilon$.
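The key computational step of a trial is the choice of the successor belief on line 7 of Algorithm 1: among the beliefs reachable under $(\pi^{(t)}_1, \pi^{(t)}_2)$, the trial descends into the one with the largest excess gap weighted by the probability of reaching it. A minimal sketch of this selection (our own illustration, reusing the hypothetical `belief_update` sketch above; `V_lower`, `V_upper` and `rho` are assumed callables evaluating the current bounds and the gap-target sequence):

```python
# Illustrative sketch of line 7 of Algorithm 1: pick the reachable belief
# with the largest weighted excess gap. pi1 maps player-1 actions to
# probabilities; observations is the set O.

def select_next_belief(b, pi1, pi2, observations, T, V_lower, V_upper, rho, t):
    best_belief, best_score = None, float("-inf")
    for a1, p_a1 in pi1.items():
        if p_a1 == 0.0:
            continue
        for o in observations:
            nb, p_obs = belief_update(b, a1, pi2, o, T)
            if nb is None:
                continue
            weight = p_a1 * p_obs                              # P_{b, pi1, pi2}[a1, o]
            excess = V_upper(nb) - V_lower(nb) - rho(t + 1)    # excess_{t+1}(nb)
            if weight * excess > best_score:
                best_score, best_belief = weight * excess, nb
    return best_belief
```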
## 3 Partially Observable Stochastic Shortest-Path Games (POSSPGs)

Partially observable stochastic shortest-path games (POSSPGs) extend the partially observable version of the stochastic shortest-path problem to a game played in the presence of a well-informed adversary. The goal of player 1 is to reach one of the states in the set $S_G$ of goal states with minimal cost, while the adversary tries to prevent reaching $S_G$, or at least to render the cost of reaching $S_G$ as high as possible.

We formalize POSSPGs based on the formalism of one-sided POSGs (Section 2) as a tuple $G = \langle S, A_1, A_2, O, T, R, b^{init}, S_G \rangle$. Unlike in OS-POSGs, where the objective was the maximization of the discounted sum of rewards, the objective in POSSPGs is the maximization of the total sum of rewards (for reasons of notational consistency with prior work on OS-POSGs, we use negative rewards instead of positive costs):

$$U^{\infty}_b(\sigma_1, \sigma_2) = \mathbb{E}_{b,\sigma_1,\sigma_2}\Big[ \sum_{t=1}^{\infty} R(s^{(t-1)}, a^{(t)}_1, a^{(t)}_2) \Big] \qquad (4)$$

where $b$ is the current belief of the game, and $(\sigma_1, \sigma_2)$ are the strategies used by the players. We aim for the worst case and assume that player 2 is perfectly informed (hence his strategy $\sigma_2 : S(A_1 A_2 O S)^* \rightarrow \Delta(A_2)$ conditions on the entire history of the game), while player 1, who wants to reach $S_G$, has limited observability of the game ($\sigma_1 : (A_1 O)^* \rightarrow \Delta(A_1)$). Similarly to OS-POSGs, we use $V^*(b)$ to denote the value of the POSSPG with initial belief $b$ (i.e., the payoff the players are guaranteed to achieve under optimal strategies). We also use the notation $\mathrm{val}^{\sigma_1}(b)$ to denote the payoff that strategy $\sigma_1$ of player 1 achieves in a game with initial belief $b$ (i.e., $\mathrm{val}^{\sigma_1}(b) = \min_{\sigma_2} U^{\infty}_b(\sigma_1, \sigma_2)$).

Since the undiscounted total-sum payoff can be infinite in general, we impose the following assumptions on POSSPGs:

- The goal of player 1 is to reach some $s \in S_G$. The game ends upon reaching a state in $S_G$ (we model this by setting $T(o^{reach}, s \mid s, \cdot) = 1$ for every $s \in S_G$, where $o^{reach} \in O$ informs player 1 about reaching the goal), and player 1 incurs no further costs (i.e., $R(s, \cdot) = 0$ for all $s \in S_G$). All other costs are strictly positive (i.e., $R(s, \cdot) \leq \bar{R} < 0$ for all $s \notin S_G$). This ensures that any strategy $\sigma_1$ with finite value $\mathrm{val}^{\sigma_1}$ reaches $S_G$ almost surely.
- The uniform strategy $\sigma_{unif}$ of player 1 has finite value, $\mathrm{val}^{\sigma_{unif}} > -\infty$. In other words, player 2 cannot prevent a rational player 1 from reaching $S_G$, and the value of the game is finite, since $-\infty < \mathrm{val}^{\sigma_{unif}}(b) \leq \max_{\sigma_1} \mathrm{val}^{\sigma_1}(b) = V^*(b) \leq 0$.

Although POSSPGs are similar to OS-POSGs, it is not possible to simply set the discount factor to $\gamma = 1$ and apply Algorithm 1, which was designed for discounted problems. This is primarily because the convergence results for Algorithm 1 rely on the $\gamma$-contractivity of the backup operator $H$ (Equation 1). [Horák et al., 2018] provides further evidence supporting this claim by showing counterexamples illustrating that the standard HSVI algorithm for discounted problems cannot be applied to Goal-POMDPs. Since the POSSPG is an extension of the Goal-POMDP model, these results naturally apply. To this end, we propose a novel algorithm that circumvents these issues by solving a finite-horizon variant of POSSPGs, and we show that this allows us to obtain a close approximation of the solution of the infinite-horizon POSSPG.

### 3.1 k-cutoff Game

Similarly to [Horák et al., 2018], we can aim at solving a finite variant of the problem. We term this finite-horizon variant a k-cutoff game, and we show that this game can be used as an approximation of the infinite-horizon POSSPG.

A k-cutoff game (denoted $G_k$) models a decision-making problem where player 1 can play arbitrarily in the first $k$ stages of the POSSPG $G$ and is forced to follow the uniform strategy $\sigma_{unif}$ in the rest of the game. This corresponds to a finite-horizon POSG associated with $G$ with the objective

$$U^{k}_b(\sigma_1, \sigma_2) = \mathbb{E}_{b,\sigma_1,\sigma_2}\Big[ \sum_{t=1}^{k} R(s^{(t-1)}, a^{(t)}_1, a^{(t)}_2) + \mathrm{val}^{\sigma_{unif}}(b^{(k+1)}) \Big] \qquad (5)$$

where $b^{(k+1)}$ stands for the belief of player 1 after the $k$-th stage of the game.
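The continuation value $\mathrm{val}^{\sigma_{unif}}$ appearing in Equation (5) (and later used to initialize the lower bound in our algorithm) can be precomputed state-wise: once player 1 commits to uniform play, the game reduces to a minimizing MDP for the fully informed player 2, and $\mathrm{val}^{\sigma_{unif}}(b) = \sum_s b(s) \cdot \mathrm{val}^{\sigma_{unif}}(s)$. The following minimal value-iteration sketch is our own illustration with hypothetical encodings (it converges under the paper's assumption that $\mathrm{val}^{\sigma_{unif}}$ is finite, and it is not the authors' implementation):

```python
# Illustrative sketch: state-wise value of the uniform strategy of player 1.
# Assumed (hypothetical) encodings: S, A1, A2 are lists of states/actions,
# goal is the set S_G, R[(s, a1, a2)] is the (negative) stage reward, and
# trans[(s, a1, a2)] maps next states to probabilities.

def uniform_strategy_values(S, A1, A2, goal, R, trans, max_iters=100_000, tol=1e-9):
    val = {s: 0.0 for s in S}
    for _ in range(max_iters):
        new_val = {}
        for s in S:
            if s in goal:
                new_val[s] = 0.0          # no further cost once S_G is reached
                continue
            # player 2 (minimizer) best-responds to the uniform play of player 1
            new_val[s] = min(
                sum((R[(s, a1, a2)]
                     + sum(p * val[sn] for sn, p in trans[(s, a1, a2)].items())) / len(A1)
                    for a1 in A1)
                for a2 in A2
            )
        if max(abs(new_val[s] - val[s]) for s in S) < tol:
            return new_val
        val = new_val
    return val
```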
Observe that if the game terminates within the first $k$ stages, player 1 has been notified of the termination by the observation $o^{reach}$ and $\mathrm{Supp}(b^{(k+1)}) \subseteq S_G$, hence the value of the uniform strategy $\mathrm{val}^{\sigma_{unif}}(b^{(k+1)}) = 0$. Similarly to the infinite-horizon case, we define $V^*_k(b)$ as the value of the k-cutoff game starting in belief $b$, and we use $\mathrm{val}^{\sigma_1}_k(b) = \min_{\sigma_2} U^k_b(\sigma_1, \sigma_2)$ to refer to the value of a strategy $\sigma_1$.

We now formally prove that the k-cutoff game $G_k$ can be used to approximate the solution of the infinite-horizon POSSPG $G$. Namely, we show that (1) the value of $G_k$ forms a lower bound on the value of $G$, and (2) we can choose $k$ to make the approximation of the value of $G$ arbitrarily tight.

**Proposition 1.** Let $k \geq 1$ and $b \in \Delta(S)$ be arbitrary. Then $\mathrm{val}^{\sigma_{unif}}(b) \leq V^*_k(b) \leq V^*(b)$.

*Proof.* Let $\sigma^k_1$ be an optimal strategy of player 1 in $G_k$ and let $\sigma^i_1$ be the strategy of player 1 in $G$ which imitates $\sigma^k_1$ for the first $k$ stages and plays uniformly afterwards. Note that the game $G_k$ is played for $k$ stages only. After the cutoff, the game ends and player 1 pays the one-time penalty associated with playing the uniform strategy from the current belief (see the $\mathrm{val}^{\sigma_{unif}}(b^{(k+1)})$ term in Equation (5)). So $\sigma^k_1$ defines the behaviour of player 1 only for $k$ stages. On the other hand, as $G$ is an infinite-horizon game, $\sigma^i_1$ defines the behavior of player 1 in the infinite setting; in this strategy, the decision to play uniformly after the $k$ stages is a deliberate choice of the player (unlike in the case of $\sigma^k_1$ in $G_k$, where the uniform play is part of the game rules). As $\sigma^k_1$ is an optimal strategy of player 1 in $G_k$, we have $V^*_k(b) = \mathrm{val}^{\sigma^k_1}_k(b)$. As $\sigma^i_1$ in $G$ imitates $\sigma^k_1$ for the first $k$ stages and plays uniformly afterwards, the two strategies yield the same value in their respective games, hence $\mathrm{val}^{\sigma^i_1}(b) = \mathrm{val}^{\sigma^k_1}_k(b)$. We have $V^*(b) = \mathrm{val}^{\sigma^*_1}(b) \geq \mathrm{val}^{\sigma^i_1}(b) = \mathrm{val}^{\sigma^k_1}_k(b) = V^*_k(b)$, where $\sigma^*_1$ is an optimal strategy of player 1 in $G$. Thus $V^*(b) \geq V^*_k(b)$. Now let $\sigma_1$ be the strategy of player 1 in $G_k$ that plays uniformly for the first $k$ stages. Then $V^*_k(b) = \mathrm{val}^{\sigma^k_1}_k(b) \geq \mathrm{val}^{\sigma_1}_k(b) = \mathrm{val}^{\sigma_{unif}}(b)$. □

**Theorem 1.** Let $k \geq 1$. Then $V^*(b^{init}) - V^*_k(b^{init}) \leq -(1 - p_k) \min_b \mathrm{val}^{\sigma_{unif}}(b)$ for $p_k = [k\bar{R} - \mathrm{val}^{\sigma_{unif}}(b^{init})]/(k\bar{R})$.

*Proof.* Let us define a game $G^{k+}$ as a game with payoff $U^{k+}_b(\sigma_1, \sigma_2) = \mathbb{E}_{b,\sigma_1,\sigma_2}[\sum_{t=1}^{k} R(s^{(t-1)}, a^{(t)}_1, a^{(t)}_2)]$ (i.e., the total sum of rewards in the first $k$ stages) and let $V^{k+}$ denote the value function of $G^{k+}$. Since $R(\cdot) \leq 0$, we have $U^{\infty}_b(\sigma_1, \sigma_2) \leq U^{k+}_b(\sigma_1, \sigma_2)$ and $V^*_k(b) \leq V^*(b) \leq V^{k+}(b)$ (the first inequality follows from Proposition 1). We prove the statement by deriving a bound on $V^{k+}(b^{init}) - V^*_k(b^{init})$. Take $(\sigma^-_1, \sigma^-_2)$ and $(\sigma^+_1, \sigma^+_2)$ to be Nash equilibrium strategies in $G_k$ and $G^{k+}$, respectively. Since both games $G_k$ and $G^{k+}$ correspond to the game $G$ being played over $k$ stages, the strategies can be used interchangeably, and we get

$$U^{k}_{b^{init}}(\sigma^+_1, \sigma^-_2) \leq U^{k}_{b^{init}}(\sigma^-_1, \sigma^-_2) = V^*_k(b^{init}) \leq V^{k+}(b^{init}) = U^{k+}_{b^{init}}(\sigma^+_1, \sigma^+_2) \leq U^{k+}_{b^{init}}(\sigma^+_1, \sigma^-_2). \qquad (6)$$

We now claim that the strategy profile $(\sigma^+_1, \sigma^-_2)$ reaches a goal state within the first $k$ stages with probability at least $p = [k\bar{R} - \mathrm{val}^{\sigma_{unif}}(b^{init})]/(k\bar{R})$. To prove this claim, we use the following facts:

1. We have $\mathrm{val}^{\sigma_{unif}}(b^{init}) \leq U^{k+}_{b^{init}}(\sigma^+_1, \sigma^-_2)$. This is based on two inequalities: $V^*_k(b^{init}) \leq U^{k+}_{b^{init}}(\sigma^+_1, \sigma^-_2)$ (see Equation (6)) and $\mathrm{val}^{\sigma_{unif}}(b^{init}) \leq V^*_k(b^{init})$ (see Proposition 1).

2. Player 1 accumulates a negative reward of at most $\bar{R} = \max R(\cdot) < 0$ in every stage of the game (recall that all rewards are negative). Hence, if a goal state is not reached within the $k$ stages considered in $G^{k+}$, he accumulates a negative payoff of at most $k\bar{R}$.

3. Assuming that $(\sigma^+_1, \sigma^-_2)$ reaches a goal state within $k$ stages with probability $p$, the total accumulated negative payoff of $(\sigma^+_1, \sigma^-_2)$ is $U^{k+}_{b^{init}}(\sigma^+_1, \sigma^-_2) \leq (1-p) k\bar{R}$.

Combining (1) and (3), we know that $\mathrm{val}^{\sigma_{unif}}(b^{init}) \leq (1-p) k\bar{R}$. Solving this inequality for $p$ gives us $p \geq [k\bar{R} - \mathrm{val}^{\sigma_{unif}}(b^{init})]/(k\bar{R})$. By definition, $U^{k+}_{b^{init}}(\sigma^+_1, \sigma^-_2) - U^{k}_{b^{init}}(\sigma^+_1, \sigma^-_2) = \mathbb{E}_{b^{init}, \sigma^+_1, \sigma^-_2}[-\mathrm{val}^{\sigma_{unif}}(b^{(k+1)})]$. The game reaches $S_G$ with probability at least $p$ (in which case $\mathrm{val}^{\sigma_{unif}}(b^{(k+1)}) = 0$), hence we have $\mathbb{E}_{b^{init}, \sigma^+_1, \sigma^-_2}[-\mathrm{val}^{\sigma_{unif}}(b^{(k+1)})] \leq -(1-p) \min_b \mathrm{val}^{\sigma_{unif}}(b)$. As $V^*(b^{init}) - V^*_k(b^{init}) \leq U^{k+}_{b^{init}}(\sigma^+_1, \sigma^-_2) - U^{k}_{b^{init}}(\sigma^+_1, \sigma^-_2)$, the claim follows. □

Observe that $p_k$ in Theorem 1 is monotonically increasing in $k$ with limit $\lim_{k \to \infty} p_k = 1$. Hence we can find $k$ such that $-(1 - p_k) \min_b \mathrm{val}^{\sigma_{unif}}(b) \leq \varepsilon$.

**Corollary 1.** Let $\varepsilon > 0$. Then for $k \geq K = -[\min_b \mathrm{val}^{\sigma_{unif}}(b) \cdot \mathrm{val}^{\sigma_{unif}}(b^{init})]/(\varepsilon \bar{R})$, we have $V^*(b^{init}) - V^*_k(b^{init}) \leq \varepsilon$.

*Proof.* By Theorem 1, for $k \geq 1$, we have $V^*(b^{init}) - V^*_k(b^{init}) \leq -[\min_b \mathrm{val}^{\sigma_{unif}}(b) \cdot \mathrm{val}^{\sigma_{unif}}(b^{init})]/(k\bar{R})$. Hence, for $k \geq K$, we get $V^*(b^{init}) - V^*_k(b^{init}) \leq \varepsilon$. □
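As a purely illustrative computation of the depth bound from Corollary 1 (with made-up numbers that are not taken from the paper):

```python
# Illustrative only: the depth bound K from Corollary 1 evaluated on
# hypothetical values of the uniform strategy; eps is the target precision
# and R_bar the largest (i.e., least negative) per-stage reward.
def depth_bound(min_val_unif, val_unif_init, eps, R_bar):
    return -(min_val_unif * val_unif_init) / (eps * R_bar)

print(depth_bound(min_val_unif=-50.0, val_unif_init=-10.0, eps=1.0, R_bar=-1.0))  # 500.0
```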
### 3.2 Finite-Horizon OS-POSGs

The k-cutoff game introduced in Section 3.1 corresponds to a finite-horizon OS-POSG with a total-sum objective. While the prior results on OS-POSGs consider only infinite-horizon problems with the discounted-sum objective, a straightforward reduction can be used to show that most of the results carry over to the setting with a finite horizon $H$ and the total sum of rewards $\mathbb{E}[\sum_{t=1}^{H} R(s^{(t-1)}, a^{(t)}_1, a^{(t)}_2)]$ as the objective. Consider an OS-POSG game structure $G = \langle S, A_1, A_2, O, T, R, b^{init} \rangle$ with finite horizon $H$ and the total-sum objective. Let us define an equivalent infinite-horizon discounted-sum game $G^{\gamma} = \langle S^{\gamma}, A_1, A_2, O, T^{\gamma}, R^{\gamma}, b^{init} \rangle$ by expanding the state space of $G$:

- $S^{\gamma} = \{ s_h \mid s \in S, 0 \leq h \leq H \}$ (i.e., $s_h$ corresponds to a state $s$ in $G$ when $h$ stages remain to be played),
- $T^{\gamma}(o, s'_{h-1} \mid s_h, a_1, a_2) = T(o, s' \mid s, a_1, a_2)$ for $h \geq 1$ (i.e., after each transition there is one less stage to be played in the remainder of the game),
- $T^{\gamma}(\bar{o}, s_0 \mid s_0, a_1, a_2) = 1$ for an arbitrary fixed observation $\bar{o} \in O$ (i.e., when 0 stages remain, transitions in the game stop),
- $R^{\gamma}(s_h, a_1, a_2) = R(s, a_1, a_2)/\gamma^{H-h}$ for $h \geq 1$ and $R^{\gamma}(s_0, a_1, a_2) = 0$.

Observe that the definition of the rewards $R^{\gamma}$ ensures that no reward is allocated after $H$ stages of the game pass (since a state $s_0$ is reached). Moreover, the discounting terms in the discounted-sum objective $\mathbb{E}[\sum_{t=1}^{H} \gamma^{t-1} R^{\gamma}(s^{(t-1)}, a^{(t)}_1, a^{(t)}_2)]$ cancel out, and hence the objective of $G^{\gamma}$ coincides with the total-sum objective of $G$.

Applying the HSVI algorithm to the OS-POSG $G^{\gamma}$ is, however, impractical, as the value of a state $s_h$ does not correspond to the total-sum payoff P1 achieves in the remaining $h$ stages: due to the multiplication of the rewards by $1/\gamma^{H-h}$, the value of $s_h$ is also multiplied by $1/\gamma^{H-h}$. We show that we can avoid this issue and solve $G^{\gamma}$ for $\gamma = 1$ (referred to as $G^{\gamma=1}$) by altering the sequence $\rho$ used in the HSVI algorithm (see Equation (2)); we term the resulting algorithm FH-HSVI. For simplicity of discussion, let us partition the optimal value function $V^*$ of $G^{\gamma=1}$ based on the number of remaining stages in the game,

$$V^*_h(b) = V^*(b|_h) \quad \text{for } b \in \Delta(S),\ b|_h \in \Delta(S^{\gamma}) : b|_h(s_h) = b(s). \qquad (7)$$

This allows us to rewrite Equation (1) as

$$V^*_h(b) = [HV^*_{h-1}](b) = \max_{\pi_1} \min_{\pi_2} \Big[ \mathbb{E}_{b,\pi_1,\pi_2}[R(s, a_1, a_2)] + \sum_{a_1, o} \mathbb{P}_{b,\pi_1,\pi_2}[a_1, o] \cdot V^*_{h-1}(\tau(b, a_1, \pi_2, o)) \Big] \qquad (8)$$

and thus obtain a Bellman equation for $G^{\gamma=1}$ in the traditional form for finite-horizon problems. Observe that $V^*_0(b) = 0$ clearly holds for every $b \in \Delta(S)$, as no rewards are allocated after $H$ stages, when the game reaches a state $s_0$. We can therefore set the initial lower and upper bounds $\underline{V}_0$ and $\overline{V}_0$ on $V^*_0$ in the HSVI algorithm to 0 as well. This means that the Explore procedure eventually reaches a belief $b|_0$ where the gap $\overline{V}_0(b) - \underline{V}_0(b) = 0$ (unless it terminates before reaching the $H$-th stage). Unlike in the discounted setting, the sequence of desired gaps $\rho$ hence need not be strictly increasing. Instead, we let $\rho(t) = \varepsilon - (t-1)\varepsilon/H$ (i.e., $\rho(H+1) = 0$ after the $H$-th stage is considered by Explore, which is sufficient for the termination of the trial, as the gap $\overline{V}_0(b) - \underline{V}_0(b) = 0$). Importantly, since the payoff of any strategy in an $H'$-horizon game can be bounded by $[H' \min R(\cdot), H' \max R(\cdot)]$, the value function $V^*_{H'}$ is $\delta_{H'}$-Lipschitz continuous for $\delta_{H'} = H' [\max R(\cdot) - \min R(\cdot)]/2$ (cf. [Horák et al., 2017, Lemma 4]). Similarly to the discounted case, this allows us to consider only $\delta_{H'}$-Lipschitz continuous lower and upper bounds $\underline{V}_{H'}$ and $\overline{V}_{H'}$ on $V^*_{H'}$.

As a consequence, we can verify that the convergence properties discussed in Section 2.1 hold for FH-HSVI:

(1) Each trial terminates for some $t \leq H+1$. After the $H$-th recursion step, the game is in one of the states with zero remaining time, and $\overline{V}_0(b) - \underline{V}_0(b) = 0 \leq \rho(H+1)$.

(2) A point-based update in $b^{(t-1)}$ ensures that all beliefs within a hypersphere with radius $\varepsilon/(2H\delta_H)$ centered in $b^{(t-1)}$ have non-positive excess gap (Proposition 2).

(3) By the same packing argument, the hyperspheres induced by the beliefs with non-positive excess gap eventually cover the entire belief space (unless we achieve convergence before that).

**Proposition 2.** Let $b^{(t-1)} \in \Delta(S)$ and let $\pi^{(t)}_1, \pi^{(t)}_2$ be Nash equilibrium strategies of P1 and P2 in $[H\overline{V}_{H-t}](b^{(t-1)})$ and $[H\underline{V}_{H-t}](b^{(t-1)})$, respectively. Assume that all beliefs $\tau(b^{(t-1)}, a_1, \pi^{(t)}_2, o)$ reachable when following $(\pi^{(t)}_1, \pi^{(t)}_2)$ have non-positive excess gap, $\mathrm{excess}_t(\tau(b^{(t-1)}, a_1, \pi^{(t)}_2, o)) \leq 0$. Then, after performing the point-based update in $b^{(t-1)}$, we have $\mathrm{excess}_{t-1}(b') \leq 0$ for every $b'$ such that $\| b^{(t-1)} - b' \| \leq \varepsilon/(2H\delta_H)$.

*Proof.* Let $u_{V,b}(\pi_1, \pi_2)$ denote the utility in the stage game $[HV](b)$ when the players use stage strategies $(\pi_1, \pi_2)$. Let $(\tilde{\pi}_1, \pi^{(t)}_2)$ and $(\pi^{(t)}_1, \tilde{\pi}_2)$ be Nash equilibrium strategies in the stage games $[H\underline{V}_{H-t}](b^{(t-1)})$ and $[H\overline{V}_{H-t}](b^{(t-1)})$, respectively. By an argument similar to the proof of [Horák et al., 2017, Lemma 4], we have

$$[H\overline{V}_{H-t}](b^{(t-1)}) - [H\underline{V}_{H-t}](b^{(t-1)}) = u_{\overline{V}_{H-t}, b^{(t-1)}}(\pi^{(t)}_1, \tilde{\pi}_2) - u_{\underline{V}_{H-t}, b^{(t-1)}}(\tilde{\pi}_1, \pi^{(t)}_2) \leq u_{\overline{V}_{H-t}, b^{(t-1)}}(\pi^{(t)}_1, \pi^{(t)}_2) - u_{\underline{V}_{H-t}, b^{(t-1)}}(\pi^{(t)}_1, \pi^{(t)}_2) \leq \sum_{a_1, o} \mathbb{P}_{b^{(t-1)}, \pi^{(t)}_1, \pi^{(t)}_2}[a_1, o] \Big[ \overline{V}_{H-t}(\tau(b^{(t-1)}, a_1, \pi^{(t)}_2, o)) - \underline{V}_{H-t}(\tau(b^{(t-1)}, a_1, \pi^{(t)}_2, o)) \Big] \leq \rho(t) = \rho(t-1) - \varepsilon/H.$$

The point-based update in $b^{(t-1)}$ hence ensures that $\overline{V}_{H-t+1}(b^{(t-1)}) - \underline{V}_{H-t+1}(b^{(t-1)}) \leq \rho(t-1) - \varepsilon/H$. Since $\overline{V}_{H-t+1}$ and $\underline{V}_{H-t+1}$ are $\delta_H$-Lipschitz continuous (observe that $\delta_{H'} \leq \delta_H$ for every $0 \leq H' \leq H$), we have $\overline{V}_{H-t+1}(b') - \underline{V}_{H-t+1}(b') \leq \rho(t-1)$ for every belief $b'$ such that $\| b' - b^{(t-1)} \| \leq \varepsilon/(2H\delta_H)$. □

**Non-zero termination utilities.** The definition of the k-cutoff game assumes that a one-time payoff $\mathrm{val}^{\sigma_{unif}}(b^{(k+1)})$ is allocated after the $k$ stages of the game are played, depending on the belief $b^{(k+1)}$ at that time. This means that the k-cutoff game is technically not a finite-horizon OS-POSG in which no rewards are assigned after $H = k$ stages.
However, we argue that the same algorithm can be applied upon setting $V^*_0(b) = \underline{V}_0(b) = \overline{V}_0(b) = \mathrm{val}^{\sigma_{unif}}(b)$, since (i) in the worst case, the Explore procedure reaches a belief $b|_0$ where $\overline{V}_0(b) - \underline{V}_0(b) = 0$, and (ii) the payoff of any strategy in an $H'$-horizon game is bounded by $[\min_b V^*_0(b) + H' \min R(\cdot),\ \max_b V^*_0(b) + H' \max R(\cdot)]$, and hence $V^*_{H'}$ is Lipschitz continuous.

### 3.3 Solving POSSPGs

Due to Corollary 1, we can use a k-cutoff game with $k \geq K$ to approximate the value of a POSSPG within $\varepsilon$ accuracy. The theoretical value of $K$, however, can be impractically large. For practical use, we suggest keeping track of an auxiliary upper bound $\overline{V}^+$ on the value $V^*$ of the POSSPG throughout the course of the algorithm. This allows us to terminate as soon as $\overline{V}^+(b^{init}) - \underline{V}_k(b^{init}) \leq \varepsilon$, which can happen much earlier than reaching $K$.

The auxiliary upper bound $\overline{V}^+$ is constructed as follows. Let $G^{k+}$ be a variant of the k-cutoff game in which player 1 is perfectly informed and, if a goal state is not reached in the first $k$ stages, the reward of player 1 is 0 instead of the continuation reward $\mathrm{val}^{\sigma_{unif}}(b^{(k+1)})$. Let $V^{k+}$ denote the value function of $G^{k+}$. The function $\overline{V}^+$ is initialized as $V^{0+}$. Whenever the procedure solvecutoff($k$) performs a point-based update on $\underline{V}_k$ and $\overline{V}_k$ at a belief $b$, a point-based update is also performed on $\overline{V}^+$ at the belief $b$. As $V^{0+}(b^{init}) = 0$ and $V^*(b^{init}) \leq 0$, the initial value of $\overline{V}^+(b^{init})$ is indeed an upper bound on $V^*(b^{init})$, and the point-based updates ensure that $\overline{V}^+(b^{init})$ remains an upper bound on $V^*(b^{init})$.

Our algorithm (Algorithm 2) starts by setting $k = 1$ and initializing $\underline{V}_k$, $\overline{V}_k$, and $\overline{V}^+$. For each value of $k$, we perform the following operations:

1. Solve the k-cutoff game $G_k$ using FH-HSVI from Section 3.2. This gives a lower bound $\underline{V}_k(b^{init})$ on $V^*_k(b^{init})$ and consequently a lower bound on $V^*(b^{init})$. (Note that the course of the exploration (line 16) relies solely on the strategies computed for the k-cutoff game (lines 13 and 14).) The gap $\overline{V}_k(b^{init}) - \underline{V}_k(b^{init})$ converges faster than the gap $\overline{V}^+(b^{init}) - \underline{V}_k(b^{init})$ used for the termination of the algorithm. Therefore, when solving $G_k$, we aim for a gap that is tighter than the target gap $\varepsilon$; the amount of tightening is controlled by a parameter $\eta$, $0 < \eta \leq 1$ (line 9 of Algorithm 2). This ensures that the termination gap $\overline{V}^+(b^{init}) - \underline{V}_k(b^{init})$ can reach the desired value and the algorithm can terminate.
2. For every point-based update on $\underline{V}_k$ and $\overline{V}_k$ at a belief $b$ (line 17 of Algorithm 2), perform a point-based update on $\overline{V}^+$ at the belief $b$ (line 18 of Algorithm 2).
3. If the gap $\overline{V}^+(b^{init}) - \underline{V}_k(b^{init}) < \varepsilon$ or if $k > K$, the algorithm terminates. Otherwise, $k$ is incremented by 1.
Algorithm 2: HSVI algorithm for POSSPGs
1: $k \leftarrow 1$
2: $\underline{V}_k \leftarrow \mathrm{val}^{\sigma_{unif}}$
3: $\overline{V}^+ \leftarrow V^{0+}$
4: while $\overline{V}^+(b^{init}) - \underline{V}_k(b^{init}) > \varepsilon$ and $k \leq K$ do
5:   $\overline{V}_k \leftarrow V^{0+}$
6:   solvecutoff($k$)
7:   $k \leftarrow k + 1$
8: procedure solvecutoff($k$)
9:   while $\overline{V}_k(b^{init}) - \underline{V}_k(b^{init}) > \eta\varepsilon$ and $\overline{V}^+(b^{init}) - \underline{V}_k(b^{init}) > \varepsilon$ do
10:    Explore($b^{init}$, 1)
11: procedure Explore($b^{(t)}$, $t$)
12:   if $\mathrm{excess}_t(b^{(t)}) \leq 0$ then return
13:   $\pi^{(t)}_1 \leftarrow$ optimal strategy of P1 in $[H\overline{V}_{k-t}](b^{(t)})$
14:   $\pi^{(t)}_2 \leftarrow$ optimal strategy of P2 in $[H\underline{V}_{k-t}](b^{(t)})$
15:   $b^{(t+1)} \leftarrow \arg\max_{a_1, o} \mathbb{P}_{b^{(t)}, \pi^{(t)}_1, \pi^{(t)}_2}[a_1, o] \cdot \mathrm{excess}_{t+1}(\tau(b^{(t)}, a_1, \pi^{(t)}_2, o))$
16:   Explore($b^{(t+1)}$, $t+1$)
17:   perform a point-based update of $\underline{V}_{k-t+1}$ and $\overline{V}_{k-t+1}$ in $b^{(t)}$
18:   perform a point-based update of $\overline{V}^+$ in $b^{(t)}$

**Final improvements.** We now describe some modifications that we use in our implementation and that decrease the runtime of the algorithm. Instead of initializing the k-cutoff game upper bound $\overline{V}_k$ (line 5) and the auxiliary upper bound $\overline{V}^+$ (line 3) to $V^{0+} = 0$ for $k > 1$, we initialize them using the points representing $\overline{V}^+$ in the $(k-1)$-cutoff game. This initialization is indeed an upper bound on $V^*$, since adding an additional round in which the game is not played as the perfect-information version of the POSSPG can only result in the same or worse utility of player 1 in the k-cutoff game compared to the $(k-1)$-cutoff game. To avoid increasing the memory complexity of the algorithm, we keep only the points that are not dominated when moving from the k-cutoff game to the $(k+1)$-cutoff game. Furthermore, as $V^*_k$ is a lower bound on $V^*_{k+1}$, the lower bound $\underline{V}_k$ on $V^*_k$ (generated on line 6) is also a lower bound on $V^*_{k+1}$. So, when performing solvecutoff($k+1$) (line 6), we use $\underline{V}_k$ to initialize $\underline{V}_{k+1}$ instead of initializing it with the value of the uniform strategy. As $\underline{V}_k$ is greater than the value of the uniform strategy, this initialization decreases the runtime of the algorithm. Finally, the target value $\eta\varepsilon$ of the gap $\overline{V}_k(b^{init}) - \underline{V}_k(b^{init})$ affects the runtime as well; we observed the best results with the parameter $\eta = 0.9$.

## 4 Experiments

In this section, we present an experimental evaluation of the proposed algorithm (Algorithm 2) on the domain of pursuit-evasion games and show how it performs compared to traditional solution approaches with a discount factor $\gamma < 1$.

### 4.1 Experiment Settings

In pursuit-evasion games [Chung et al., 2011; Isler and Karnad, 2008], a team of $K$ centrally controlled pursuers (we consider a team of $K = 2$) is trying to locate and capture an evader who is trying to avoid capture. The game is played on a grid (dimensions $3 \times N$), with the pursuers starting in the top-left corner and the evader in the bottom-right corner (see Figure 1). In each step, the units move to one of their adjacent locations (i.e., the actions of the evader are $A_2 = \{\text{left}, \text{right}, \text{up}, \text{down}\}$, while the actions available to the team of pursuers are joint actions of all units in the team, $A_1 = (A_2)^K$). The game ends when one of the units from the team of pursuers enters the same cell as the evader or directly swaps position with the evader. The reward for every transition in the game is $-1$. The team of pursuers knows the locations of its own units, but the current location of the evader is not known to it.

Figure 1: Example of the pursuit-evasion game.

All computational results have been obtained on computers equipped with Intel Xeon Scalable Gold 6146 processors, while limiting the runtime to 10 hours and the RAM to 128 GB. We used CPLEX 12.9 to solve the linear programs. All solution methods were required to find an $\varepsilon$-optimal solution with $\varepsilon$ set to 1. Since the reward for every transition in the game is $-1$, this setting allows us to find a solution that is within 1 move of the optimum.
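For concreteness, the following sketch outlines one possible encoding of these pursuit-evasion instances (our own illustration under stated assumptions, e.g., that a unit bumping into a wall stays in place; it is not the authors' implementation):

```python
# Illustrative encoding of the 3 x N pursuit-evasion POSSPG: a state combines
# the positions of the two pursuers (jointly controlled by player 1) and the
# evader; every transition has reward -1; capture = same cell or a direct swap.
from itertools import product

MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

def move(cell, action, rows, cols):
    r, c = cell[0] + MOVES[action][0], cell[1] + MOVES[action][1]
    return (r, c) if 0 <= r < rows and 0 <= c < cols else cell  # assumed: stay on wall bump

def build_states(rows, cols, n_pursuers=2):
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    return [(*ps, e) for ps in product(cells, repeat=n_pursuers) for e in cells]

def captured(pursuers_before, pursuers_after, evader_before, evader_after):
    same_cell = evader_after in pursuers_after
    swapped = any(pb == evader_after and pa == evader_before
                  for pb, pa in zip(pursuers_before, pursuers_after))
    return same_cell or swapped

A2 = list(MOVES)                   # evader actions
A1 = list(product(A2, repeat=2))   # joint actions of the two pursuers
REWARD = -1.0                      # cost of one step
```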
### 4.2 Algorithm Scalability

We compare the proposed algorithm (HSVI for POSSPGs) with the original HSVI algorithm for discounted OS-POSGs. Recall that the original algorithm [Horák et al., 2017] is only capable of solving games with a discount factor $\gamma < 1$, and the obtained solution can thus be considered only an approximation. Since the approximation quality is directly related to the discount factor (the undiscounted setting can be seen as the limit solution as $\gamma \to 1$), we use several values of the discount factor, namely $\gamma = 0.95$, $\gamma = 0.99$, and $\gamma = 0.999$. We report the results in Table 1.

Observe that our algorithm is significantly faster than the original HSVI algorithm solving the discounted approximation of the game with discount factor $\gamma = 0.999$: the computational time required by our algorithm is 25–37× smaller compared to the prior approach. If we further sacrifice the accuracy of the discounted approximation and reduce the discount factor $\gamma$, we can expect further improvements in the computational time of the discounted approximation. The setting with $\gamma = 0.95$ is indeed solved substantially faster compared to our approach with $\gamma = 1$. However, we must then expect a significant degradation in solution quality, as the contribution of future rewards diminishes quickly with $\gamma = 0.95$.

| Instance ($3 \times N$) | Our approach | Original HSVI, $\gamma = 0.95$ | Original HSVI, $\gamma = 0.99$ | Original HSVI, $\gamma = 0.999$ |
|---|---|---|---|---|
| $3 \times 3$ | 2 s | 1 s | 4 s | 58.5 s |
| $3 \times 4$ | 54.5 s | 4 s | 133.5 s | 2,032 s |
| $3 \times 5$ | 607 s | 7.5 s | 1,385.5 s | 15,259 s |

Table 1: Scalability in the size of the grid dimension $N$.

The computational benefits of our approach compared to using the original HSVI algorithm to solve a discounted approximation with $\gamma = 0.999$ are further highlighted when solving the $3 \times 6$ instance. While our approach is able to approximate the value of the game within a precision of 1.47 after 10 hours (measured as the gap between the upper and lower bounds computed by the algorithm), the gap of the original approach is 4.956.

## 5 Conclusion

We introduce a new algorithm for solving two-player zero-sum partially observable stochastic shortest-path games (POSSPGs), a variant of partially observable stochastic games with the undiscounted sum of rewards as the objective. We assume that the adversary has perfect information, and thus our algorithm allows the agent to find robust strategies. We provide theoretical guarantees for the convergence of our algorithm and compare its performance with the algorithm for the discounted case. The results show that it is not feasible to use very large discount factors to approximate the total-reward objective, and thus our novel algorithm applies in all scenarios where future rewards should not be discounted. Our algorithm is the first to solve the class of partially observable SSP games, and thus follow-up research focused on scalability improvements would open up possibilities for various applications in robotics or, for example, network security. As a second direction for future work, we intend to analyze reachability/safety objectives, where the probability of reaching some of the target states is optimized.

## Acknowledgements

This research was supported by the Czech Science Foundation (no. 19-24384Y), by the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics", by the ERC CoG 863818 (ForM-SMArt), and by the Combat Capabilities Development Command Army Research Laboratory, and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA).
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Combat Capabilities Development Command Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

## References

[Bertsekas and Tsitsiklis, 1991] Dimitri P. Bertsekas and John N. Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.

[Bertsekas, 1995] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

[Chen et al., 2007] Y. Chen, Q. Zhao, V. Krishnamurthy, and D. Djonin. Transmission scheduling for optimizing sensor network lifetime: A stochastic shortest path approach. IEEE Transactions on Signal Processing, 55(5):2294–2309, 2007.

[Chen et al., 2020] Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Minimax regret for stochastic shortest path with adversarial costs and known transition. arXiv preprint arXiv:2012.04053, 2020.

[Chung et al., 2011] Timothy H. Chung, Geoffrey A. Hollinger, and Volkan Isler. Search and pursuit-evasion in mobile robotics. Autonomous Robots, 31(4):299–316, 2011.

[Delamer et al., 2019] Jean-Alexis Delamer, Yoko Watanabe, and Caroline P. Carvalho Chanel. Solving path planning problems in urban environments based on a priori sensor availability and execution error propagation. In AIAA SciTech 2019 Forum, 2019.

[Egorov et al., 2016] Maxim Egorov, Mykel J. Kochenderfer, and Jaak J. Uudmae. Target surveillance in adversarial environments using POMDPs. In 30th AAAI Conference on Artificial Intelligence (AAAI 2016), pages 2473–2479, 2016.

[Horák et al., 2017] Karel Horák, Branislav Bošanský, and Michal Pěchouček. Heuristic search value iteration for one-sided partially observable stochastic games. In 31st AAAI Conference on Artificial Intelligence, pages 558–564, 2017.

[Horák et al., 2018] Karel Horák, Branislav Bošanský, and Krishnendu Chatterjee. Goal-HSVI: Heuristic search value iteration for goal-POMDPs. In International Joint Conference on Artificial Intelligence (IJCAI), pages 4764–4770, 2018.

[Horák et al., 2020] Karel Horák, Branislav Bošanský, Vojtěch Kovařík, and Christopher Kiekintveld. Solving zero-sum one-sided partially observable stochastic games. arXiv preprint arXiv:2010.11243, 2020.

[Isler and Karnad, 2008] Volkan Isler and Nikhil Karnad. The role of information in the cop-robber game. Theoretical Computer Science, 399(3):179–190, 2008.

[Lim et al., 2013] Sejoon Lim, Christian Sommer, Evdokia Nikolova, and Daniela Rus. Practical route planning under delay uncertainty: Stochastic shortest path queries. In Robotics: Science and Systems, volume 8, pages 249–256, 2013.

[Neu et al., 2012] Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pages 805–813, 2012.

[Norman et al., 2005] Gethin Norman, David Parker, Marta Kwiatkowska, Sandeep Shukla, and Rajesh Gupta. Using probabilistic model checking for dynamic power management. Formal Aspects of Computing, 17(2):160–176, 2005.

[Patek and Bertsekas, 1999] Stephen D. Patek and Dimitri P. Bertsekas. Stochastic shortest path games. SIAM Journal on Control and Optimization, 37(3):804–824, 1999.

[Patek, 1999] Stephen D. Patek. On partially observed stochastic shortest path problems. Systems Engineering, pages 1–14, 1999.

[Rosenberg and Mansour, 2020] Aviv Rosenberg and Yishay Mansour. Stochastic shortest path with adversarially changing costs. arXiv preprint arXiv:2006.11561, 2020.

[Saisubramanian et al., 2019] S. Saisubramanian, K. H. Wray, L. Pineda, and S. Zilberstein. Planning in stochastic environments with goal uncertainty. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1649–1654, 2019.

[Smith and Simmons, 2004] Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In 20th Conference on Uncertainty in Artificial Intelligence (UAI), pages 520–527, 2004.

[Smith and Simmons, 2005] Trey Smith and Reid Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In 21st Conference on Uncertainty in Artificial Intelligence (UAI), pages 542–549, 2005.