# Belief-State Query Policies for User-Aligned POMDPs

Daniel Bramblett and Siddharth Srivastava
Autonomous Agents and Intelligent Robots Lab
School of Computing and Augmented Intelligence
Arizona State University, AZ, USA
{drbrambl,siddharths}@asu.edu

Planning in real-world settings often entails addressing partial observability while aligning with users' requirements. We present a novel framework for expressing users' constraints and preferences about agent behavior in a partially observable setting using parameterized belief-state query (BSQ) policies in the setting of goal-oriented partially observable Markov decision processes (gPOMDPs). We present the first formal analysis of such constraints and prove that while the expected cost function of a parameterized BSQ policy w.r.t. its parameters is not convex, it is piecewise constant and yields an implicit discrete parameter search space that is finite for finite horizons. This theoretical result leads to novel algorithms that optimize gPOMDP agent behavior with guaranteed user alignment. Analysis proves that our algorithms converge to the optimal user-aligned behavior in the limit. Empirical results show that parameterized BSQ policies provide a computationally feasible approach for user-aligned planning in partially observable settings.

1 Introduction

Users of sequential decision-making (SDM) agents in partially observable settings often have requirements and preferences on expected behavior, ranging from safety concerns to high-level knowledge of task-completion requirements. However, users are ill-equipped to specify desired behaviors for such agents. For instance, although reward engineering can often encode fully observable preferences [Devidze et al., 2021, Gupta et al., 2023], it requires significant trial-and-error and can produce unintended behavior even when done by experts working on simple domains [Booth et al., 2023].
These challenges are compounded in partially observable environments, where the agent will not know the full state on which the user's requirements and preferences are typically defined. For example, defining a reward function on the belief state to align the agent's behavior with the user can result in wireheading [Everitt and Hutter, 2016] (see Sec. 2 for further discussion of related work). Consider a simplified, minimal example designed to illustrate the key principles (Fig. 1(a)). A robot located on a spaceship experiences a communication error with the ship and needs to decide whether to attempt to repair itself or the ship. Importantly, while a robot error is harder to detect, the user would rather risk repairing the robot than repairing the ship, as each repair risks introducing additional failures. In other words, the user may expect the robot to work with the following goals and preferences: "The objective is to fix the communication channel. First, if there is a high likelihood that the robot is broken, it should try to repair itself; otherwise, if there is a high likelihood that the ship is broken, it should try to repair that." Such preferences go beyond preferences in fully observable settings: they use queries on the current belief state for expressing the user's requirements while using the conventional paradigm of stating objectives in terms of the true underlying state. Such a formulation avoids wireheading, allowing users to express their constraints and preferences in partially observable settings.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Figure 1: (a) Spaceship Repair running example. (b) Parameterized BSQ policy for the user preference from the Introduction. (c) The expected cumulative cost function for (b) with a horizon of 12; it is piecewise constant and non-convex.

Although such constraints on behavior are intuitive and common, they leave a significant amount of uncertainty to be resolved by the agent: it needs to optimize the threshold values of "high likelihood" under which each rule would apply while attempting to achieve the objective. We introduce mathematical and algorithmic foundations for addressing these problems by defining constraints on behaviors in terms of properties of the belief state, expressed through belief-state queries (BSQs). We prove the surprising result that although the space of possible threshold values in preferences such as the one listed above is uncountably infinite, only a finite number of evaluations is required for computing optimal, user-aligned policies for finite-horizon problems. We use this result to develop a probabilistically complete algorithm for computing optimal constrained policies. Our main contributions are:

1. A framework for encoding user requirements and preferences over agent behavior in goal-oriented partially observable Markov decision processes (Sec. 3).
2. Mathematical analysis proving that the expected cost function of a parameterized BSQ policy w.r.t. its parameters is piecewise constant but generally non-convex (Sec. 4).
3. A probabilistically complete algorithm for computing optimal user-aligned policies in goal-oriented POMDPs (Sec. 5).
4. Empirical evaluation on a diverse set of problems showing both the efficiency of our algorithm and the quality of the computed user-aligned policies (Sec. 7).

2 Related Work

Planning over preferences has been well studied in fully observable settings [Baier et al., 2007, Aguas et al., 2016]. Voloshin et al. [2022] present an approach for complying with an LTL specification while carrying out reinforcement learning. Other approaches for using LTL specifications use the grounded state to create a reward function to teach reinforcement learning agents [Toro Icarte et al., 2018, Vaezipoor et al., 2021].
These approaches do not extend to partially observable settings, as they consider agents that can access the complete state. In partially observable settings, existing approaches for using domain knowledge and preferences require extensive, error-prone reward design and/or do not guarantee compliance. LTL specifications have been incorporated either by designing a reward function that incentivizes actions more likely to adhere to these specifications [Liu et al., 2021, Tuli et al., 2022] or by imposing a compliance threshold [Ahmadi et al., 2020]. In both approaches, the user calibrates rewards for user alignment against those for objective completion; it is difficult to ensure user alignment. We focus on the problem of guaranteeing user alignment without reward engineering. Mazzi et al. [2021, 2023] proposed expressing domain control knowledge using belief-state probabilities. Mazzi et al. [2021] used expert-provided rule templates and execution traces to construct a shield that prevents irregular actions. Mazzi et al. [2023] used execution traces and domain-specified belief-state queries to learn action preconditions over the belief state. Both approaches use input traces and focus on ensuring a policy is consistent with previously observed behavior. We address the complementary problem of computing user-aligned policies without past traces. Belief-state queries have been used to solve POMDPs with uniform parameter sampling [Srivastava et al., 2012], but formal analysis, the feasibility of optimizing BSQ policies, and the existence of provably convergent algorithms remained open research questions prior to this work.

3 Formal Framework

This section formally defines the BSQ framework, which expresses user requirements on an agent's belief and is designed for relational goal-oriented partially observable Markov decision processes.
3.1 Goal-Oriented Partially Observable Markov Decision Process

Partially observable Markov decision processes (POMDPs) constitute a standard mathematical framework for modeling SDM problems in partially observable, stochastic settings [Kaelbling et al., 1998, Smallwood and Sondik, 1973]. State-of-the-art POMDP solvers often rely on approximate online approaches [Silver and Veness, 2010, Somani et al., 2013], where recent work addresses the problem of obtaining performance bounds [Barenboim and Indelman, 2023, Lim et al., 2023]. We use goal-oriented POMDPs (gPOMDPs), where the agent aims to complete one of the tasks/goals. This eliminates the burden of error-prone reward engineering by using a default cost function that associates a constant cost with each timestep before reaching the goal. E.g., the Spaceship Repair problem (Sec. 1) has two objects: the robot and the spaceship. A state is defined using a Boolean function broken(o), representing whether object o needs repair, and an integer-valued function rlocation(), representing the robot's location. Neither function is observable. The agent has two types of actions: try to repair object o (repair(o)) or wait (wait()). A transition function expresses the distribution of rlocation() depending on the action taken and the robot's previous location. At each timestep, the robot receives a noisy observation obs_err(o) regarding the status of object o. The set of observations can be expressed as {obs_err(robot), obs_err(ship)}. Due to noisy perception, obs_err(o) may not match broken(o). An observation function denotes the probability of each observation conditioned on the (hidden) current state. The goal is to reach the repair station corresponding to the truly broken component. We define gPOMDPs formally as follows.

Definition 1.
A goal-oriented partially observable Markov decision process P is defined as ⟨C, F, A, O, T, Ω, G, Cost, H, b₀⟩ where C is the finite set of constant symbols and F is the finite set of functions. The set of state variables for F, V_F, is defined as all instantiations of functions in F with constants in C. The set of states S is the set of all possible valuations of V_F; A is a finite set of actions; O ⊆ F is the set of observation predicates; T: S × A × S → [0, 1] is the transition function, T(s, a, s′) = Pr(s′ | s, a); G ⊆ S is the set of goal states, which are also sink states; Ω: S × A × O → [0, 1] is the observation function, Ω(s, a, o) = Pr(o | s, a); Cost(s) = 0 if s ∈ G, else 1, is the cost function; H is the horizon; and b₀ is the initial belief state. A solution for a gPOMDP is a policy that has a non-zero probability of reaching G in H − 1 timesteps.

3.2 Belief-State Queries and Policies

Computing a policy for any gPOMDP requires planning around state uncertainty. This is done using the concept of a belief state, a probability distribution over the currently possible states. Formally, the belief state constitutes a sufficient statistic for observation-action histories [Astrom et al., 1965]. We express user requirements using queries on the current belief state. For any belief state b, when action a is taken and observation o is observed, the updated belief state is computed as b′(s′) = α Ω(s′, a, o) Σ_s T(s, a, s′) b(s), where α is the normalization factor. We refer to this belief propagation as b′ = bp(b, a, o). We extend the notation to the sequential application of this equation to arbitrary bounded histories: bp(b₀, a₁, o₁, …, a_n, o_n) = bp(… bp(bp(b₀, a₁, o₁), a₂, o₂) …). For example, the Spaceship Repair user preference contains the expression "a high likelihood that the robot is broken". This can be expressed as a query on a belief state b: Pr⟦broken(robot)⟧_b > Θ_rob, where Θ_rob is a parameter.
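The belief propagation b′ = bp(b, a, o) defined above can be sketched directly. Below is a minimal illustration assuming tabular representations of T and Ω as dictionaries keyed by (state, action, state′) and (state, action, observation); these containers are hypothetical stand-ins for the gPOMDP components:

```python
def belief_update(b, a, o, states, T, Omega):
    """One step of belief propagation bp(b, a, o):
    b'(s') = alpha * Omega(s', a, o) * sum_s T(s, a, s') * b(s)."""
    unnormalized = {
        s2: Omega[(s2, a, o)] * sum(T[(s, a, s2)] * b[s] for s in states)
        for s2 in states
    }
    alpha = sum(unnormalized.values())  # normalization factor
    if alpha == 0.0:
        raise ValueError("observation has zero probability under this belief/action")
    return {s2: p / alpha for s2, p in unnormalized.items()}
```

Sequential application over a history (a₁, o₁, …, aₙ, oₙ) is then just a left fold of this function over the history.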
If rlocation() is fully observable, the expression "the robot's location is smaller than Θ_l" in a belief state b can be expressed as Pr⟦rlocation() < Θ_l⟧_b = 1. We can combine both queries to express "a high likelihood the robot is broken and its location is lower than Θ_l" as: Pr⟦broken(robot)⟧_b > Θ_rob ∧ Pr⟦rlocation() < Θ_l⟧_b = 1. Formally, BSQs use the vocabulary of the underlying gPOMDP. There are two types of queries we can ask: (1) whether formula φ is true with a probability that satisfies a threshold Θ; (2) whether the fully observable portion of the state satisfies a formula φ containing a threshold Θ. These thresholds are the parameters of a parameterized BSQ policy. The agent must optimize these parameters to achieve the goal while aligning with the user's requirements. BSQs can be combined using conjunctions or disjunctions to express more complex requirements, which we define as compound BSQs in Def. 3. We omit subscripts when clear from context.

Definition 2. A belief-state query λ_P(b; φ, ⋈, Θ), where b is a belief state, φ is a first-order logic formula composed of functions in gPOMDP P, ⋈ is any comparison operator, and Θ ∈ ℝ is a parameter, is defined as λ_P(b; φ, ⋈, Θ) ≡ Pr⟦φ⟧_b ⋈ Θ.

Definition 3. A compound BSQ Ψ(b; Θ), where b is a belief state and Θ ∈ ℝⁿ, is either a conjunction or a disjunction of BSQs containing n total parameters.

We use BSQs to formally express user requirements of the form discussed in the Introduction by mapping BSQs with variable parameters to actions. Fig. 1(b) illustrates this with a parameterized BSQ policy for the Spaceship Repair problem. Formally,

Definition 4. Let b be a belief state and Θ be a tuple of n parameter variables over ℝ. An n-parameter parameterized belief-state query policy π(b, Θ) is a tuple of rules {r₁, …, r_m} where each rᵢ = Ψᵢ → aᵢ is composed of a compound BSQ Ψᵢ and an action aᵢ ∈ A. The set {Ψ₁, …, Ψ_m} is mutually exclusive and covers the n-dimensional parameter space ℝⁿ.
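Defs. 2–4 can be made concrete in a few lines. The sketch below evaluates a BSQ Pr⟦φ⟧_b ⋈ Θ against a belief state represented as a distribution over states, and assembles a tiny two-rule parameterized BSQ policy in the style of Fig. 1(b). The state labels and the broken-robot predicate are hypothetical illustrations, not the paper's implementation:

```python
import operator

def prob(b, phi):
    """Pr[[phi]]_b: probability mass of the states in belief b that satisfy phi."""
    return sum(p for s, p in b.items() if phi(s))

def bsq(b, phi, cmp, theta):
    """A belief-state query (Def. 2): Pr[[phi]]_b <cmp> theta."""
    return cmp(prob(b, phi), theta)

def make_policy(theta_rob):
    """A two-rule parameterized BSQ policy (Def. 4): rules are tried in order;
    the catch-all True condition guarantees mutually exclusive coverage."""
    rules = [
        (lambda b: bsq(b, lambda s: s == "robot_broken", operator.gt, theta_rob),
         "repair(robot)"),
        (lambda b: True, "wait()"),
    ]
    def policy(b):
        for condition, action in rules:
            if condition(b):
                return action
    return policy
```

Fixing theta_rob instantiates an executable BSQ policy in the sense of Def. 5: every belief state maps to exactly one action.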
In practice, mutually exclusive coverage is easily achieved using an if-then-else structure, where each condition includes a conjunction of the negations of preceding conditions and the list of rules includes a terminal else with the catch-all BSQ True (Fig. 1). Any assignment of values ϑ ∈ ℝⁿ to the parameters Θ of a parameterized BSQ policy produces an executable policy that maps every possible belief state to an action:

Definition 5. A BSQ policy π(b, ϑ) is a parameterized BSQ policy π(b, Θ) with an assignment in ℝ to each of the n parameters Θ. Let Pr^π_t(G) be the probability that an execution of a policy π reaches a state in G within t timesteps. A BSQ policy π(b, ϑ) is said to be a solution to a gPOMDP with goal G and horizon H iff Pr^{π(b,ϑ)}_{H−1}(G) > 0.

The quality of a BSQ policy is defined as its expected cost; due to the uniform cost function in the definition of gPOMDPs, the expected cost of a BSQ policy is the expected time taken to reach a goal state. Formally, the expected cost of a BSQ policy π(b, ϑ) is E_π(ϑ; H) = Σ_{t=1}^{H} t × Pr_{G,t}[π(b, ϑ)], where H is the horizon and Pr_{G,t}[π(b, ϑ)] is the probability of policy π(b, ϑ) reaching a goal state for the first time at timestep t. Thus, given a gPOMDP P with goal G and a parameterized BSQ policy π(b, Θ), the objective is to compute: ϑ* = argmin_ϑ {E_π(ϑ; H) : Pr^{π(b,ϑ)}_{H−1}(G) > 0}.

4 Formal Analysis

Our main theoretical result is that the continuous space of policy parameters is, in fact, partitioned into finitely many constant-valued convex sets. This insight allows the development of scalable algorithms for computing low-cost user-aligned policies. We introduce formal concepts and key steps in proving this result here; complete proofs for all results are available in the Appendix. We begin with the notion of strategy trees to conceptualize the search process for BSQ policies.

4.1 Strategy Trees

Every parameterized BSQ policy π(b, Θ) and gPOMDP P defines a strategy tree (e.g., Fig.
2(a)) that captures the possible decisions at each execution step. Intuitively, the tree starts at a belief node representing the initial belief state. Outgoing edges from belief nodes represent rule selection in π(b, Θ), resulting in action nodes. Outgoing edges from action nodes represent possible observations, leading to belief nodes representing the corresponding updated belief.

Figure 2: (a) Strategy tree created from the parameterized BSQ policy in Fig. 1 and the Spaceship Repair gPOMDP with a horizon of 2 (red and blue mark two distinct braids). (b) Complete partitions of the parameter space with two of the braids highlighted. Error-detection sensor accuracy for the robot and ship is 60% and 75%, respectively.

If the tree is truncated at horizon H, each leaf represents the outcome of a unique trajectory of rules from π(b, Θ) and observations. Each belief node represents a belief state that can be calculated using the rule-observation trajectory leading to that node. A labeling function l: V_B ∪ V_A → B ∪ A maps the set of belief nodes V_B to belief states in B and the set of action nodes V_A to actions in A. For ease of notation we define b_i = l(vᵢ) for all belief nodes vᵢ ∈ V_B and a_j = l(vⱼ) for all action nodes vⱼ ∈ V_A.

Definition 6. Let P be a gPOMDP, π(b, Θ) be a parameterized BSQ policy for P, and b₀ be the initial belief state. The strategy tree T_π(b₀) is defined as T_π(b₀) = ⟨V, E⟩, where the set V = V_B ∪ V_A contains belief nodes V_B and action nodes V_A, and the set E = E_B ∪ E_A contains edges from belief nodes to action nodes (E_B ⊆ V_B × V_A) and edges from action nodes to belief nodes (E_A ⊆ V_A × V_B). E_B is defined as {(vᵢ, r, vⱼ) | vᵢ ∈ V_B, vⱼ ∈ V_A, r ∈ π(b, Θ), and ∃Ψ : r = Ψ → a_j}. E_A is defined as {(v_m, o, v_n) | v_m ∈ V_A, v_n ∈ V_B, o ∈ O, ∃(v_p, r = Ψ → a, v_m) ∈ E_B; b_n = bp(b_p, a, o)}.
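Conceptually, the strategy tree alternates rule branches (from belief nodes) and observation branches (from action nodes) up to the horizon. The sketch below enumerates such a tree; the node representation and the caller-supplied bp function are hypothetical simplifications of Def. 6:

```python
def build_strategy_tree(b0, rules, observations, bp, H):
    """Enumerate a strategy tree of depth H.

    rules: list of (rule_name, action) pairs; observations: observation labels;
    bp(b, a, o): belief propagation. Belief nodes branch over rules, action
    nodes branch over observations; depth-H belief nodes are the leaves.
    """
    def expand(b, depth):
        if depth == H:
            return {"belief": b, "children": None}  # leaf
        children = []
        for rule_name, action in rules:
            # One action node per rule; its outgoing edges cover the observations.
            obs_edges = {o: expand(bp(b, action, o), depth + 1) for o in observations}
            children.append({"rule": rule_name, "action": action, "obs": obs_edges})
        return {"belief": b, "children": children}
    return expand(b0, 0)

def count_leaves(node):
    if node["children"] is None:
        return 1
    return sum(count_leaves(child)
               for action_node in node["children"]
               for child in action_node["obs"].values())
```

With m rules and |O| observations the full tree has (m·|O|)^H leaves, which makes the exponential growth (and the value of the pruning discussed in Sec. 4.1) concrete.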
Non-convexity of the expected cost function. Each parameterized BSQ policy permits infinitely many BSQ policies, one for each assignment of real values to its parameters. Unfortunately, the expected cost of a parameterized BSQ policy is not a convex function of these parameters. Fig. 1(c) shows this with a counterexample using the parameterized BSQ policy from Fig. 1(b), a horizon of 12, and the robot's initial distance from each repair station set to 5. This plot was constructed by sampling the expected cost for 251,001 equally spaced parameter assignments to the Fig. 1(b) parameterized BSQ policy. E_π(ϑ; H) is clearly not convex: the expected cost along the line Θ₂ = Θ₁ − 0.25 has two inflection points, at Θ₁ = 0.6 and Θ₁ = 0.8. This creates two local minima: Θ₁ ≤ 0.16, and Θ₁ ≥ 0.83 ∧ Θ₂ ≤ 0.1. Intuitively, this is due to the short horizon, which causes the optimal strategy to be selecting a repair station and traversing to it regardless of the observations. This complicates finding good BSQ policies using existing solvers. However, every possible BSQ policy can be associated with a set of strategy tree leaves that are reachable under that policy. Thus, for a given horizon, there are only finitely many expected costs for BSQ policies for a given problem. The main challenge in computing good BSQ policies is that the set of possible BSQ policies with distinct expected costs grows exponentially with the horizon, and good BSQ parameters could be distributed arbitrarily in the high-dimensional, continuous space of parameter values. We use strategy trees to define groups of leaves called braids, which we then use to prove that the space of BSQ policy parameters is in fact well-structured in terms of the expected cost function.

Braids. We refer to the set of all leaves reachable under a policy π(b, ϑ) as the braid of ϑ.
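Expected-cost samples like those behind Fig. 1(c) can be obtained by Monte Carlo rollouts of an executable BSQ policy. A minimal sketch follows, where step and is_goal are hypothetical simulator hooks standing in for the gPOMDP dynamics and goal test:

```python
def estimate_expected_cost(policy, b0, step, is_goal, H, n_rollouts=1000):
    """Estimate E_pi(theta; H) as the average time to reach the goal.

    step(b, a) samples an observation and returns the updated belief;
    is_goal(b) reports whether the (sink) goal has been reached. Rollouts
    that never reach G within H steps incur the full horizon cost H,
    matching the uniform per-timestep cost of gPOMDPs.
    """
    total = 0
    for _ in range(n_rollouts):
        b, t = b0, 0
        while t < H and not is_goal(b):
            b = step(b, policy(b))
            t += 1
        total += t
    return total / n_rollouts
```

Sweeping this estimator over a grid of parameter assignments reproduces the kind of sampled cost surface shown in Fig. 1(c).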
Due to the mutual exclusivity of rules, for every assignment of parameter values to a parameterized BSQ policy, at most one outgoing edge can be taken from each belief node (as these correspond to the rules and actions). However, the stochasticity of dynamics and observations allows multiple outgoing edges to be possible from action nodes. E.g., in the strategy tree for the Spaceship Repair problem (Fig. 2(a)), leaves ℓ₂ and ℓ₁₀ cannot both be reachable under a BSQ policy, because that would require rules r₁ and r₂ to be satisfied at the same belief. However, both ℓ₁ and ℓ₅ may be reachable under the same BSQ policy, since their paths diverge at an action node. Formally,

Definition 7. Let H be the horizon, and let π(b, Θ) be a parameterized BSQ policy for a gPOMDP P. The braid of a parameter assignment ϑ, braid_{π,H}(ϑ), is the set of all leaves in the strategy tree T_π(b₀) rooted at the initial belief b₀ that can be reached while executing π(b, ϑ): braid_{π,H}(ϑ) = {ℓ_H : the path to ℓ_H is (r₁, o₁, …, r_H, o_H); ∀i, rᵢ = Ψᵢ → aᵢ, bᵢ = bp(b₀, r₁, o₁, …, rᵢ, oᵢ), and ϑ satisfies Ψᵢ}.

The unique interval of parameter values where a leaf is reachable can be calculated by taking the intersection of the parameter intervals needed to satisfy each rule on the path to that leaf. This is because for any compound BSQ Ψ, we can compute the unique interval of parameter values I(Ψ) under which b will satisfy Ψ by substituting each BSQ in Ψ with its corresponding inequality:

Lemma 1. Let Ψ(b; Θ) be an n-dimensional compound BSQ. There exists a set of intervals I(Ψ) ⊆ ℝⁿ s.t. Ψ(b; Θ) evaluates to true iff Θ ∈ I(Ψ).

We can use this result to compute the unique interval of parameter values consistent with a braid by taking the intersection of the intervals of each leaf contained in that braid (Def. 8):

Definition 8. Let π(b, Θ) be a parameterized BSQ policy, P be a gPOMDP, b₀ be the initial belief state, and H be the horizon.
The interval of leaf ℓ, I(ℓ), is defined as the intersection of intervals ∩ᵢ I(Ψᵢ) of the conditions of each rule rᵢ that occurs on the path to that leaf. The interval for a set of leaves L is defined as I(L) = ∩_{ℓᵢ∈L} I(ℓᵢ).

Any leaf or braid with an empty parameter interval does not align with the user's requirements. For example, in Fig. 2, note that r₁ is the only rule satisfiable if r₁ is selected from b₀ and the robot is observed to be broken. Using the Fig. 1(b) policy and assuming the sensor accuracy is 60%, picking a rule other than r₁ implies that a 50% likelihood was high enough to fix the robot yet 60% was not, which is a contradiction. Removing misaligned leaves and braids prunes the tree.

4.2 BSQ Policies are Piecewise Constant

We now use the concept of braids to prove that the continuous, high-dimensional space of parameter values of a parameterized BSQ policy reduces to a finite set of contiguous, convex partitions, with each partition having a constant expected cost. This surprising result implies that although the expected cost of BSQ policies is not a convex function of parameter assignments, optimizing a parameterized BSQ policy requires optimization over a finite set rather than over a continuous space. We first define a notion of similarity over assignments to parameterized BSQ policies that define BSQ policies:

Definition 9. Let π(b, Θ) be a parameterized BSQ policy, P be a gPOMDP, and H be the horizon. Two assignments ϑ₁, ϑ₂ to Θ are said to be similar, ϑ₁ ≡_H ϑ₂, iff braid_{π,H}(ϑ₁) = braid_{π,H}(ϑ₂).

It is trivial to show that ≡_H is transitive, symmetric, and reflexive, making it an equivalence relation over ℝⁿ. As such, ≡_H defines a partition over the same space:

Theorem 1. Let π(b, Θ) be a parameterized BSQ policy, P be a gPOMDP, b₀ be the initial belief state, and H be the horizon. The relation ≡_H partitions ℝⁿ.
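For threshold queries of the form Pr⟦φ⟧_b ⋈ Θ, the interval computations of Lemma 1 and Def. 8 reduce to intersecting axis-aligned boxes over the parameters. A minimal sketch, with boxes represented as per-parameter (low, high) pairs; the parameter names used below are hypothetical:

```python
def intersect_boxes(box1, box2):
    """Intersect two axis-aligned parameter boxes; None if the result is empty."""
    result = {}
    for name in box1.keys() | box2.keys():
        lo1, hi1 = box1.get(name, (float("-inf"), float("inf")))
        lo2, hi2 = box2.get(name, (float("-inf"), float("inf")))
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo >= hi:
            return None  # empty interval: this leaf/braid is misaligned and prunable
        result[name] = (lo, hi)
    return result

def leaf_interval(rule_intervals):
    """I(l) = intersection of I(Psi_i) over the rule conditions on the path (Def. 8)."""
    box = {}
    for rule_box in rule_intervals:
        box = intersect_boxes(box, rule_box)
        if box is None:
            return None
    return box
```

A None result corresponds exactly to the misaligned leaves and braids described above, which can be pruned from the strategy tree.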
However, this result is not sufficient to define the structure of the partitions induced in ℝⁿ, which will be required for an efficient optimization algorithm. Based on Sec. 4.1, we know that leaves whose trajectories diverge due to different rules must not be in the same braid. Furthermore, a belief state can only lead to one set of possible observations for an action, regardless of the BSQ policy being followed. Intuitively, this prevents braids from being proper subsets of each other, which implies that the parameter intervals of two distinct braids can never overlap. This gives us the desired structure for the partitions induced in the space of parameter values for parameterized BSQ policies: there are parameter intervals corresponding to distinct braids in the policy tree. In other words, the set of braids partitions the parameter space into contiguous, high-dimensional intervals. This can be proved formally and stated as follows:

Theorem 2. Let π(b, Θ) be a parameterized BSQ policy, b₀ be the initial belief state, and H be the horizon. Each partition ρ created by the relation ≡_H partitioning ℝⁿ is a set of disjoint intervals ρ ⊆ ℝⁿ where ∀ϑ ∈ ρ, braid_{π,H}(ϑ) = L, where L is a fixed set of leaves.

Since each partition corresponds to a braid, and each braid corresponds to a fixed set of leaves, which defines the expected cost for all policies corresponding to that braid, all policies defined by a partition of the parameter space have a constant expected cost. As such, the domain of the expected cost function E_π(ϑ; H) for gPOMDP P can be represented as the disjoint intervals of the braid partitions. Thus, E_π(ϑ; H) is piecewise constant. The following result formalizes this.

Theorem 3. Let π(b, Θ) be a parameterized BSQ policy, P be a gPOMDP, b₀ be the initial belief state, and H be the horizon. Each partition created by ≡_H on ℝⁿ has a constant expected cost.
In some situations, the braids that partition the parameter space can be calculated in closed form (e.g., see the Appendix for the partitions of the Spaceship Repair problem). The next section develops a general approach for computing the braids and intervals corresponding to a parameterized BSQ policy, for evaluating the expected cost of each such partition, and for optimizing over these partitions.

5 Partition Refinement Search

Algorithm 1 Partition Refinement Search (PRS)
1: Inputs: gPOMDP P, parameterized BSQ policy π(b, Θ), horizon H
2: Output: minimum-cost partition and its expected cost ⟨ρ_opt, Ê[ρ_opt]⟩
3: ρ_init ← ∏_{Θᵢ∈Θ} D_{Θᵢ}
4: X ← {⟨ρ_init, ∞⟩}; X_opt ← ⟨ρ_init, ∞⟩
5: while !TimeOut() do
6:   ⟨ρ, Ê[ρ]⟩ ← SelectPartition(X)
7:   ϑ_s ← UniformSample(ρ)
8:   ℓ, E_ℓ ← Rollout(P, π(b, ϑ_s), H)
9:   X ← (X \ {⟨ρ, Ê[ρ]⟩}) ∪ {⟨ρ \ I(ℓ), Ê[ρ]⟩}
10:  X ← X ∪ {⟨ρ ∩ I(ℓ), Ê[ρ] ∪ E_ℓ⟩}
11:  X_opt ← arg min_{⟨ρ,Ê[ρ]⟩∈X} Ê[ρ]
12: end while
13: return X_opt

In this section, we present a novel algorithm for optimizing the parameters of a parameterized BSQ policy using the theory of braids developed above. The Partition Refinement Search (PRS) algorithm (Algo. 1) constructs the set of partitions using hierarchical partition selection and refinement: a partition is selected to be refined, a leaf that can occur in that partition is sampled and evaluated, and the partitions are refined to isolate the interval of the braid corresponding to the sample. The hypothesized optimal partition is tracked and returned as the final result after timeout. PRS constructs the first parameter-space interval as the domain of all possible parameter values (line 3). This is set as the initial hypothesized optimal partition (line 4). In each iteration, a partition ρ is selected using the exploration-exploitation approaches discussed in Sec. 6 (line 6).
A leaf ℓ is sampled from ρ by uniformly sampling a parameter value ϑ from ρ's parameter intervals and performing rollouts from the initial belief state to a reachable leaf using the BSQ policy π(b, ϑ) (lines 7 and 8). The sampled leaf ℓ is used to refine partition ρ using the insight that braids cannot overlap (Sec. 4.2). If there exists a subinterval of ρ where ℓ does not occur, a new partition for this subinterval is constructed containing ρ's previous leaves and expected cost (line 9). The remaining portion of ρ, where ℓ can occur, is used to construct a partition with an updated expected cost representing ρ's previous leaves and ℓ (line 10). The hypothesized optimal partition is then updated (line 11). PRS converges to the true optimal BSQ policies in the limit:

Theorem 4. Let π(b, Θ) be a parameterized BSQ policy, P be a gPOMDP, b₀ the initial belief state, and H be the horizon. The likelihood of the Partition Refinement Search algorithm returning the optimal parameter interval converges to one in the limit of infinite samples.

Complexity analysis. While the theoretical space and time complexity are linear in the number of leaves, due to PRS grouping leaves from the strategy tree (Def. 6), there is good reason to expect better performance in practice. As discussed in Sec. 4.1, strategy trees can be pruned by removing branches and leaves that do not align with the user's requirements. For example, in the Spaceship Repair problem with the Fig. 1 parameterized BSQ policy, a third of the possible leaves are pruned at a horizon of two, and the pruning becomes even more significant for longer horizons. Additionally, empirical results suggest that rules occurring earlier in rule-observation trajectories are more important in dictating the partitions. Furthermore, selecting and refining partitions can be performed in parallel, further improving performance.

6 Partition Selection Approaches

We explored multiple partition selection approaches with a multiprocessing version of PRS.
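The refinement step of PRS (lines 9–10 of Algo. 1) can be sketched for one-dimensional parameter intervals; this is an illustrative simplification (a real implementation would track n-dimensional boxes and richer cost statistics):

```python
def refine(partitions, rho, leaf_box, leaf_cost):
    """Split partition rho around a sampled leaf's interval I(l).

    partitions maps an interval (lo, hi) to the list of sampled rollout costs
    (the running estimate E^[rho]); rho is the partition being refined.
    """
    lo, hi = rho
    llo, lhi = leaf_box
    inner = (max(lo, llo), min(hi, lhi))
    assert inner[0] < inner[1], "sampled leaf must be reachable within rho"
    old_costs = partitions.pop(rho)
    # Subintervals of rho where the leaf cannot occur keep the old estimate (line 9).
    if lo < inner[0]:
        partitions[(lo, inner[0])] = list(old_costs)
    if inner[1] < hi:
        partitions[(inner[1], hi)] = list(old_costs)
    # Where the leaf can occur, fold in the new sample (line 10).
    partitions[inner] = old_costs + [leaf_cost]
    return partitions
```

Because braids cannot overlap, repeated refinement monotonically converges toward the braid partitions of Theorem 2, which is the intuition behind the convergence guarantee of Theorem 4.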
Each approach used the same dynamic exploration rate er that diminished over time. Each thread managed a subset of partitions X′ ⊆ X and updated a global hypothesized optimal partition. Additionally, we warm-start PRS by randomly selecting 20 points in the parameter search space and evaluating each 40 times to build an initial set of partitions. Also, partitions that have a lower expected cost than the hypothesized optimum are sampled up to 40 times before the hypothesized optimum is updated. In this paper, we focus on three selection approaches and discuss two others in the Appendix.

Epsilon Greedy (PRS-Epsilon). We explore er percent of the time by uniformly sampling s ∼ U(0, 1) and checking whether s ≤ er. If we are exploring, we select a partition uniformly at random from X′. Otherwise, the partition with minimum expected cost, arg min_{⟨ρ,Ê[ρ]⟩∈X′} Ê[ρ], is selected.

Boltzmann Exploration (PRS-Bolt). Partitions are selected in a weighted random fashion, with the probability of selecting partition ρ being α × exp(−Ê[ρ]/er), where α is the normalization factor.

Local Thompson Sampling (PRS-Local). Each thread treats the problem as a multi-armed bandit problem where the expected cost of the next sample from each partition is simulated using N(µ_c, σ_c), with µ_c and σ_c being the partition's mean and standard deviation, respectively. The partition with the lowest estimated expected cost is selected.

7 Empirical Results

We implemented PRS and evaluated it on four challenging risk-averse problems. Complete source code is available in the supplementary material. We describe the problems and user preferences here; further details, including the parameterized BSQ policies, can be found in the Appendix.

Lane merger (LM). In this problem, an autonomous vehicle driving on a two-lane road must switch lanes safely before reaching a lane merger. However, there is a car in the other lane whose location and speed the agent does not know.
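The three selection rules can be sketched over a thread-local list of (partition, Ê[ρ]) pairs. This is an illustrative sketch of the strategies described above, not the authors' implementation; per-partition statistics are assumed to be maintained by the caller:

```python
import math
import random

def select_epsilon_greedy(parts, er, rng=random):
    """PRS-Epsilon: explore with probability er, else pick the min-cost partition."""
    if rng.random() <= er:
        return rng.choice(parts)
    return min(parts, key=lambda p: p[1])

def select_boltzmann(parts, er, rng=random):
    """PRS-Bolt: weighted random choice with Pr(rho) proportional to exp(-E^[rho]/er)."""
    weights = [math.exp(-cost / er) for _, cost in parts]
    return rng.choices(parts, weights=weights, k=1)[0]

def select_thompson(parts_stats, rng=random):
    """PRS-Local: parts_stats holds (partition, mean, std); simulate each
    partition's next-sample cost from N(mean, std) and pick the lowest."""
    return min(parts_stats, key=lambda p: rng.gauss(p[1], p[2]))
```

Note that as er shrinks over time, both PRS-Epsilon and PRS-Bolt shift smoothly from exploration toward exploiting the current minimum-cost partition.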
Switching lanes too close to this car risks a severe accident. The autonomous vehicle has a noisy detection system that returns whether a vehicle is located in regions around the car. The user's preference is: "If there's a high likelihood of safely switching lanes, do so. If there is a high likelihood of the other car being in close proximity and it is possible to slow down, slow down. Otherwise, keep going."

Spaceship repair (SR). This is a modified version of the running example with the parameterized BSQ policy of Fig. 1(b). The robot starts 7 steps and 5 steps away from the robot and ship repair stations, respectively. Additionally, the robot's sensor is 75% accurate at detecting errors with the robot and only 55% accurate for the ship. With the short horizon H = 12, this results in a non-convex expected cost function with multiple local minima of differing expected costs.

Graph rock sample (GRS). We modified the classic RockSample(n, k) problem [Smith and Simmons, 2004] by replacing the grid with a graph of waypoints where some waypoints contain rocks. Additionally, we introduced risk by causing the robot to break beyond repair if it samples a rock not worth sampling. We also categorized the rocks into types, and the rover's goal is to bring a sample of each type to the desired location if a safe rock of that type exists. This goal requires a longer horizon compared to the other problems. The user's preference is: "Evaluating rocks of types not yet sampled in order r₁, …, r_n: if the rock has a high likelihood of being safe to sample, go and take a sample of it; else, if the rock has a moderate likelihood of being safe to sample, get close enough and scan it. Otherwise, move towards the exit if no rocks are worth sampling or scanning."

Store visit (SV). This problem is based on the partially observable OpenStreetMap problem in Liu et al. [2021].
A robot is located in a city where some locations are unsafe (e.g., construction, traffic) and can terminally damage the robot. The robot is initially uncertain of its location, but it can scan its surroundings to determine its general location. The agent traverses the city and can visit the closest building. The goal is to visit a bank and then a store. This problem features a nuanced parameterized BSQ policy: If you are significantly unsure of your current location, scan the area. If you have visited a bank, do the following to visit a store; otherwise, do it to visit a bank. If you are sufficiently sure the current location has the building you are looking for, visit it. Otherwise, move towards where you think that building is while avoiding unsafe locations. If all else fails, scan the current area.

Figure 3: Empirical results tracking the performance of the hypothesized optimal partition. Panels plot expected cost and % goal achievement against time (seconds) for Lane Merger, Graph Rock Sample, Spaceship Repair, and Store Visit, comparing PRS (Ours), Nelder-Mead, and Particle Swarm. Samples are taken at equally spaced points across PRS evaluation time and at each iteration of Nelder-Mead and Particle Swarm. The error displayed is the standard deviation.

7.1 Baselines

We evaluated PRS against three different types of baselines.

RCompliant Selects parameter values uniformly at random from the parameter space to produce user-aligned policies.

Hyperparameter optimization algorithms To measure the benefits of PRS over existing hyperparameter optimization algorithms, we implemented both Nelder-Mead [Nelder and Mead, 1965] and Particle Swarm [Kennedy and Eberhart, 1995]. The expected cost of a parameter space point ϑ, for a parameterized BSQ policy π, was computed by averaging 1,000 parallel runs of the policy π(b; ϑ).
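As a concrete illustration, this per-point evaluation is a plain Monte Carlo average. The sketch below is not the paper's implementation: `toy_rollout` is a hypothetical stand-in for a full simulation of π(b; ϑ), with a noisy piecewise-constant cost that mimics the structure proved in Section 3.

```python
import random

def estimate_expected_cost(rollout, theta, n_runs=1000, rng=random):
    """Monte Carlo estimate of the expected cost at parameter point theta:
    average the cost of n_runs independent rollouts of the policy."""
    return sum(rollout(theta, rng) for _ in range(n_runs)) / n_runs

def toy_rollout(theta, rng):
    """Hypothetical rollout: a piecewise-constant cost (3.0 below an
    assumed threshold of 0.5, else 7.0) plus zero-mean noise."""
    base = 3.0 if theta < 0.5 else 7.0
    return base + rng.uniform(-0.5, 0.5)

# A point below the threshold should average out near 3.0.
cost = estimate_expected_cost(toy_rollout, 0.2, n_runs=2000,
                              rng=random.Random(0))
```

With enough rollouts, the constant cost level of the partition containing each sampled point is recoverable despite the rollout noise.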
For Nelder-Mead optimization, we used a simplex with one more vertex than the number of parameters in the parameterized BSQ policy being optimized. We warm-start by initially evaluating 100 random points and constructing the initial simplex from the best-performing ones. For Particle Swarm optimization, 10 particles were used, with the location and momentum of each particle clipped to the search space. The coefficients changed based on the number of steps since the last improvement.

Unconstrained POMDP solvers To measure the differences between BSQ policies and unconstrained POMDP solvers, we implemented variations of our problems in POMDPX and solved them with DESPOT [Somani et al., 2013] and SARSOP [Kurniawati et al., 2009] for 1,000 evaluation runs. To measure whether an action-observation trajectory produced by these solvers aligns with the user's requirements, we check whether there exist parameter values ϑ for which the policy π(b; ϑ) could produce that trajectory. We use this to evaluate the solutions produced.

7.2 Analysis of Results

For each problem, we evaluated each baseline and PRS variant ten times. The horizon was 12 for Spaceship Repair and 100 for the other problems. The timeout for PRS was set on a problem-by-problem basis. The timeout for Nelder-Mead and Particle Swarm was one hour. Note that the highest possible expected cost equals the horizon due to the default cost function. The performance of each PRS partition selection approach can be found in Figure 4, and the quality of solutions over time compared to the baselines is shown in Figure 3.

Partition selection approach evaluations All PRS partition selection approaches converged to policies of similar quality. The main difference was convergence time: approaches that did not rely on the standard deviation converged faster, because the standard deviation is lower near the optimal solution, causing selection approaches that used it to explore the wrong partitions.
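For concreteness, the three selection rules from Section 6 being compared here can be sketched as follows. This is a minimal sketch under our own assumptions, not the paper's implementation: each partition is represented as a dict carrying the running sample mean and standard deviation of its cost, er is treated as a probability/temperature, and the negative sign in the Boltzmann weight reflects that lower cost should mean higher selection probability.

```python
import math
import random

def select_epsilon(partitions, er, rng=random):
    """Epsilon-greedy: explore with probability er, otherwise pick the
    partition with the minimum estimated expected cost."""
    if rng.random() <= er:
        return rng.choice(partitions)
    return min(partitions, key=lambda p: p["mean"])

def select_boltzmann(partitions, er, rng=random):
    """Boltzmann: sample a partition with probability proportional to
    exp(-mean_cost / er), so lower-cost partitions get more weight."""
    weights = [math.exp(-p["mean"] / er) for p in partitions]
    r, acc = rng.random() * sum(weights), 0.0
    for p, w in zip(partitions, weights):
        acc += w
        if r <= acc:
            return p
    return partitions[-1]

def select_local_thompson(partitions, rng=random):
    """Local Thompson sampling: draw a cost estimate from a per-partition
    Gaussian N(mean, std) and pick the partition with the lowest draw."""
    return min(partitions, key=lambda p: rng.gauss(p["mean"], p["std"]))
```

Each rule trades off exploiting the partition that currently looks best against continuing to sample partitions whose cost estimates are still uncertain.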
We use PRS-Bolt as a representative when comparing against the other baselines.

Figure 4: Results for PRS with different partition selection approaches from Section 6.

PRS solution quality PRS produced higher-quality policies than RCompliant. For Spaceship Repair, the simplest problem with the shortest horizon, policies produced by PRS-Bolt had a 15.68% lower expected cost and a 3.47% higher goal achievement rate. For the other problems, policies produced by RCompliant had more than triple the expected cost and achieved only half the success rate on both Graph Rock Sample and Store Visit. These results demonstrate that optimizing BSQ parameter values has a significant impact on the performance of user-aligned policies.

Hyperparameter optimization evaluation Compared to traditional hyperparameter optimization algorithms, PRS always found the user-aligned policy with the lowest expected cost, with little performance deviation. This is because Nelder-Mead and Particle Swarm struggle to optimize a non-convex, piecewise-constant function from noisy data, exhibiting well-known failure modes of local-search algorithms: getting stuck in sub-optimal local minima and exploring the wrong regions of the space. Additionally, PRS converged first because it is more sample-efficient: updating the belief state is computationally expensive, so poor-quality solutions, which take longer to reach the goal, are more expensive to evaluate. PRS requires only a few evaluations of a partition before spending its computational resources on more promising areas. An interesting result is that, in Spaceship Repair, solutions found by Nelder-Mead and Particle Swarm both had a 7.73% higher expected cost and an 18.31% higher goal achievement rate than the PRS-Epsilon solutions. There is likely a strong negative correlation between the expected cost and the goal achievement rate, and PRS is better at optimizing the stated objective of minimizing the expected cost.
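The alignment check from Section 7.1 (does any ϑ exist under which π(b; ϑ) could produce a given trajectory?) reduces, for a single threshold parameter, to intersecting the interval constraints the trajectory induces at each step. A one-parameter sketch, with an interval encoding of our own:

```python
def feasible_theta(constraints):
    """Each step of a trajectory constrains theta to an open interval
    (lo, hi); the trajectory is user-aligned iff the intersection of
    all step intervals is non-empty."""
    lo, hi = float("-inf"), float("inf")
    for step_lo, step_hi in constraints:
        lo, hi = max(lo, step_lo), min(hi, step_hi)
        if lo >= hi:
            return None  # no theta can reproduce this trajectory
    return (lo, hi)

# Steps demanding theta < 0.8 and theta > 0.3 are jointly satisfiable...
aligned = feasible_theta([(float("-inf"), 0.8), (0.3, float("inf"))])
# ...while theta < 0.3 and theta > 0.8 are not.
misaligned = feasible_theta([(float("-inf"), 0.3), (0.8, float("inf"))])
```

A trajectory passes the check exactly when the returned interval is non-empty; with multiple parameters the same intersection is carried out per dimension.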
Unconstrained solver evaluation Without guidance from the parameterized BSQ policies, DESPOT and SARSOP struggled with this set of problems. SARSOP failed to converge to a policy due to the long problem horizons. DESPOT could not run on Lane Merger, which had the largest state space and branching factor, and achieved the goal only 0.5% of the time on Graph Rock Sample. DESPOT achieved a 20.0% and 13.3% lower expected cost on variations of Spaceship Repair and Store Visit, respectively. However, DESPOT's policy never aligned with the user's requirements on Store Visit, and aligned only 7.3% of the time on Spaceship Repair. This indicates that the BSQ framework offers a new approach for expressing both domain knowledge and user requirements.

8 Conclusion

We presented the BSQ policy framework for expressing users' requirements over the belief state in partially observable settings and for computing user-aligned agent behavior. We performed a formal analysis of these policies, proving that the parameter value space introduced by parameterized BSQ policies can be partitioned, making parameterized BSQ policies optimizable through a hierarchical optimization paradigm. We introduced the probabilistically complete Partition Refinement Search algorithm to perform this optimization. Our empirical results show that it converges to the optimal user-aligned policy more quickly and more consistently than existing approaches. The results indicate that parameterized BSQ policies provide a promising approach for solving diverse real-world problems requiring user alignment.

Limitations and future work There are many interesting directions for future work based on the current BSQ policy framework. BSQ representations can be made more expressive by allowing deterministic functions, which would not compromise the presented theoretical results.
Furthermore, there exists a natural extension of this work to finite-memory controllers that allows temporally extended requirements to be encoded with the same theoretical results. Relaxing the constraint of mapping each belief state to a single action would expand usability. For more complex problems, a belief-state approximation approach would be required, but the underlying strategy tree discussed in this work would remain mostly unchanged. Another interesting research direction is to develop methods that help users express their requirements in the BSQ framework.

Acknowledgments and Disclosure of Funding

This work was supported in part by ONR grant N000142312416 and NSF grant IIS 1942856.

References

Rati Devidze, Goran Radanovic, Parameswaran Kamalaruban, and Adish Singla. Explicable reward design for reinforcement learning agents. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 20118-20131. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/a7f0d2b95c60161b3f3c82f764b1d1c9-Paper.pdf.

Dhawal Gupta, Yash Chandak, Scott Jordan, Philip S. Thomas, and Bruno C. da Silva. Behavior alignment via reward function optimization. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 52759-52791. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/a5357781c204d4412e44ed9cbcdb08d5-Paper-Conference.pdf.

Serena Booth, W. Bradley Knox, Julie Shah, Scott Niekum, Peter Stone, and Alessandro Allievi. The perils of trial-and-error reward design: Misdesign through overfitting and invalid task specifications. Proceedings of the AAAI Conference on Artificial Intelligence, 37(5):5920-5929, Jun. 2023. doi: 10.1609/aaai.v37i5.25733. URL https://ojs.aaai.org/index.php/AAAI/article/view/25733.
Tom Everitt and Marcus Hutter. Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence: 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings 9, pages 12-22. Springer, 2016.

Jorge A. Baier, Christian Fritz, and Sheila A. McIlraith. Exploiting procedural domain control knowledge in state-of-the-art planners. In ICAPS, pages 26-33, 2007.

Javier Segovia Aguas, Sergio Jiménez Celorrio, and Anders Jonsson. Generalized planning with procedural domain control knowledge. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 26, pages 285-293, 2016.

Cameron Voloshin, Hoang Le, Swarat Chaudhuri, and Yisong Yue. Policy optimization with linear temporal logic constraints. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 17690-17702. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/70b8505ac79e3e131756f793cd80eb8d-Paper-Conference.pdf.

Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, and Sheila A. McIlraith. Teaching multiple tasks to an RL agent using LTL. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 452-461, 2018.

Pashootan Vaezipoor, Andrew C. Li, Rodrigo A. Toro Icarte, and Sheila A. McIlraith. LTL2Action: Generalizing LTL instructions for multi-task RL. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10497-10508. PMLR, 18-24 Jul 2021. URL https://proceedings.mlr.press/v139/vaezipoor21a.html.

Jason Liu, Eric Rosen, Suchen Zheng, Stefanie Tellex, and George Konidaris. Leveraging temporal structure in safety-critical task specifications for POMDP planning. 2021.
Mathieu Tuli, Andrew Li, Pashootan Vaezipoor, Toryn Klassen, Scott Sanner, and Sheila McIlraith. Learning to follow instructions in text-based games. Advances in Neural Information Processing Systems, 35:19441-19455, 2022.

Mohamadreza Ahmadi, Rangoli Sharan, and Joel W. Burdick. Stochastic finite state control of POMDPs with LTL specifications. arXiv preprint arXiv:2001.07679, 2020.

Giulio Mazzi, Alberto Castellini, and Alessandro Farinelli. Rule-based shielding for partially observable Monte-Carlo planning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 31, pages 243-251, 2021.

Giulio Mazzi, Daniele Meli, Alberto Castellini, and Alessandro Farinelli. Learning logic specifications for soft policy guidance in POMCP. arXiv preprint arXiv:2303.09172, 2023.

Siddharth Srivastava, Xiang Cheng, Stuart Russell, and Avi Pfeffer. First-order open-universe POMDPs: Formulation and algorithms. Technical report EECS-2013-243, EECS Department, UC Berkeley, 2012.

Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99-134, 1998.

Richard D. Smallwood and Edward J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071-1088, 1973.

David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper_files/paper/2010/file/edfbe1afcf9246bb0d40eb4d8027d90f-Paper.pdf.

Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. Advances in Neural Information Processing Systems, 26, 2013.

Moran Barenboim and Vadim Indelman. Online POMDP planning with anytime deterministic guarantees. In A.
Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 79886-79902. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/fc6bd0eef19459655d5b097af783661d-Paper-Conference.pdf.

Michael H. Lim, Tyler J. Becker, Mykel J. Kochenderfer, Claire J. Tomlin, and Zachary N. Sunberg. Optimality guarantees for particle belief approximation of POMDPs. Journal of Artificial Intelligence Research, 77:1591-1636, 2023.

Karl J. Astrom et al. Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, 10(1):174-205, 1965.

Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI '04, pages 520-527, Arlington, Virginia, USA, 2004. AUAI Press. ISBN 0974903906.

John A. Nelder and Roger Mead. A simplex method for function minimization. The Computer Journal, 7(4):308-313, 1965.

James Kennedy and Russell Eberhart. Particle swarm optimization. In Proceedings of ICNN'95 - International Conference on Neural Networks, volume 4, pages 1942-1948. IEEE, 1995.

Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. 2009.

A Appendix Organization

The Appendix is organized as follows. Appendix B contains the proofs showing that the expected cost function is piecewise constant. Appendix C contains the proof that PRS is probabilistically complete. Appendix D discusses the evaluation problems and provides the parameterized BSQ policies used. Appendix E discusses additional implementation details of both Nelder-Mead and Particle Swarm. Appendix F discusses two additional partition selection approaches we tested and provides additional analysis of our results.
Appendix G discusses the experimental setup and computational cost of our experiments. Appendix H contains the calculated closed-form solution of the partitions for the Spaceship Repair problem. Appendix I discusses the broader impacts of our work. Finally, Appendix J discusses additional limitations not discussed in the main paper.

B Lemmas and Proofs From Formal Analysis [Section 3]

In this section, we provide the formal proofs for Lemma 1, Theorem 1, Theorem 2, and Theorem 3 from Section 3, where we proved that braids partition the parameter space, resulting in the expected cost function of a parameterized BSQ policy w.r.t. its parameters being piecewise constant. We define and prove Lemmas 2, 3, 4, and 5 in this section as building blocks for these proofs.

First, we prove that the similarity operator ≡_H for braids (Def. 9) is reflexive, symmetric, and transitive. As such, ≡_H defines an equivalence relation over the n-dimensional parameter space ℝ^n, meaning it defines a partition over ℝ^n.

Theorem 1. Let π(b; Θ) be a parameterized BSQ policy, P be a gPOMDP, b0 be the initial belief state, and H be the horizon. The operator ≡_H partitions ℝ^n.

Proof. Let ϑ ∈ ℝ^n be a tuple of n parameter values and H be the horizon. By way of contradiction, assume that ϑ is not similar to itself, ϑ ≢_H ϑ. This would mean that braid_{H,1}(ϑ) ≠ braid_{H,2}(ϑ) for two evaluations of the braid of ϑ. As such, there must exist a leaf ℓ which is in one braid but not the other. Note that ℓ represents a unique rule-observation trajectory {r1, o1, ..., rH, oH}. Additionally, for ℓ to be in one of these braids, it must be that ∀i, ri.Ψ(bi, ϑ) is satisfied, where bi = bp(b0, r1, o1, ..., ri, oi) (Def. 7). However, this would hold for the other braid as well, making it a contradiction for ℓ to be exclusive to either braid_{H,1}(ϑ) or braid_{H,2}(ϑ). As such, ϑ must be similar to itself, so the reflexive property holds.

Let ϑ1, ϑ2, ϑ3 ∈ ℝ^n where ϑ1 ≡_H ϑ2 and ϑ2 ≡_H ϑ3.
Therefore, braid_{π,H}(ϑ1) = braid_{π,H}(ϑ2) and braid_{π,H}(ϑ2) = braid_{π,H}(ϑ3) (Def. 7). By substitution, braid_{π,H}(ϑ1) = braid_{π,H}(ϑ3), meaning ϑ1 ≡_H ϑ3. As such, the transitive property holds. Since set equality is symmetric, the symmetric property holds. Thus, the operator ≡_H is an equivalence relation over ℝ^n, so ≡_H defines a partition over ℝ^n. □

For compound BSQs Ψ, we now prove that there exist unique intervals of the parameter space where Ψ is satisfied, and that we can calculate them.

Lemma 1. Let Ψ(b; Θ) be an n-dimensional compound BSQ. There exists a set of intervals I(Ψ) ⊆ ℝ^n s.t. Ψ(b; Θ) evaluates to true iff Θ ∈ I(Ψ).

Proof. Let P be a gPOMDP, b be a belief state, Θ ∈ ℝ be a parameter, ▷◁ be a comparison operator, and φ be a first-order logic formula composed of functions from P. There are two possible forms for a BSQ (Def. 2). First, let λ_p(b; φ, ▷◁, Θ) ≜ Pr⟦φ⟧_b ▷◁ Θ. Note that Pr⟦φ⟧_b evaluates to the probability of φ being satisfied in belief state b. Therefore, we can simplify λ_p(b; φ, ▷◁, Θ) to p ▷◁ Θ, where p ∈ [0, 1], meaning this type of BSQ simplifies to an inequality. Now, let λ_p(b; φ, ▷◁, Θ) ≜ Pr⟦φ⟧_b = 1, where φ is composed of Θ and fully observable functions in P. We assume that Θ cannot be used as a function parameter, meaning that it must be an operand of a relational operator in φ. Since the functions are fully observable, they can be evaluated for b, leaving the inequalities involving Θ to dictate whether φ is satisfied. Thereby, BSQs evaluate to inequalities involving Θ.

A compound BSQ Ψ comprises conjunctions/disjunctions of BSQs by Definition 3. By substituting each BSQ with its inequalities, we can calculate the interval of Ψ, I(Ψ). Assume that Θ ∈ I(Ψ). By way of contradiction, assume that Θ does not satisfy Ψ. If Ψ is a conjunction of BSQs, there exists at least one BSQ that is not satisfied by Θ. If Ψ is a disjunction, all the BSQs are unsatisfied by Θ.
However, this would mean that Θ cannot satisfy the inequalities from these BSQs, so Θ cannot be in I(Ψ), since I(Ψ) is constructed from the regions of the parameter space that satisfy the necessary BSQs, which is a contradiction. Conversely, assume that Θ satisfies Ψ. This means that one or all of the BSQs are satisfied by Θ, depending on whether Ψ is a disjunction or a conjunction. If Θ were not in I(Ψ), there could not exist a set of BSQs satisfied for Ψ to be satisfied. Thus, for a belief state b, an n-parameter compound BSQ Ψ has an interval in the parameter space I(Ψ) s.t. ∀Θ ∈ ℝ^n, Θ ∈ I(Ψ) iff Ψ(b; Θ) evaluates to true. □

As mentioned in Section 3, braids cannot be proper subsets of each other, which we now prove in Lemma 2. As high-level intuition, removing a leaf can only occur if a rule along that leaf's rule-observation trajectory is not satisfied, in which case another rule must be satisfied, since Def. 2 guarantees coverage of the belief state and parameter space. This results in at least one leaf being added to the braid that removes the first leaf, making the new braid not a subset of the other one.

Lemma 2. Let π(b; Θ) be a parameterized BSQ policy, P be a gPOMDP, b0 be the initial belief, and H be the horizon. ∀ϑ1, ϑ2 ∈ ℝ^n, if braid_{π,H}(ϑ1) ⊆ braid_{π,H}(ϑ2) then braid_{π,H}(ϑ1) = braid_{π,H}(ϑ2).

Proof. By way of contradiction, assume there exist ϑ1, ϑ2 ∈ ℝ^n s.t. braid_{π,H}(ϑ1) ⊂ braid_{π,H}(ϑ2), implying there exists a leaf ℓ2 ∈ braid_{π,H}(ϑ2) ∖ braid_{π,H}(ϑ1). Let ℓ1 ∈ braid_{π,H}(ϑ1) be the leaf with the largest shared rule-observation trajectory prefix τ0 with ℓ2 before differing. The trajectory for ℓ1 can be expressed as τ0τ1, where τ1 is the remaining trajectory for reaching ℓ1. Similarly, the trajectory for ℓ2 can be expressed as τ0τ2. Note that τ0 represents the actions executed and observations received from the initial belief state until just before the divergence, so the belief state b is the same for both leaves up to this point.
If the first element of τ1 and τ2 is a rule, note that braid_{π,H}(ϑ2) must also contain ℓ1. This would imply that π(b; ϑ2) is not mutually exclusive, since two rules would be satisfied at one element of the strategy tree. This is a contradiction by Def. 4. If the first element of τ1 and τ2 is an observation, different observations occurred after executing the last shared action in τ0. Due to the observation model and the shared belief state b at this point, both observations must be possible. This means a leaf in braid_{π,H}(ϑ1) must share a larger trajectory prefix with ℓ2 than ℓ1 does, which is a contradiction. Thus, braids cannot be strict subsets of each other. □

Since braids cannot be proper subsets of each other, we can now prove that two distinct braids must each contain leaves the other does not have. In turn, this prevents the intervals of braids from overlapping. Note that the interval of a braid can be calculated by intersecting the intervals of the leaves contained in that braid (Def. 8): I(braid_{π,H}(ϑ)) = ⋂_{ℓ ∈ braid_{π,H}(ϑ)} I(ℓ).

Lemma 3. Let π(b; Θ) be a parameterized BSQ policy, P be a gPOMDP, b0 be the initial belief, and H be the horizon. ∀ϑ1, ϑ2 ∈ ℝ^n, if braid_{π,H}(ϑ1) ∩ braid_{π,H}(ϑ2) ≠ ∅ and braid_{π,H}(ϑ1) ≠ braid_{π,H}(ϑ2), then I(braid_{π,H}(ϑ1)) ∩ I(braid_{π,H}(ϑ2)) = ∅.

Proof. Let ϑ1, ϑ2 ∈ ℝ^n where braid_{π,H}(ϑ1) ∩ braid_{π,H}(ϑ2) ≠ ∅ and braid_{π,H}(ϑ1) ≠ braid_{π,H}(ϑ2). The braids cannot be proper subsets of each other (Lemma 2), so both braids must contain leaves that are not in the other braid: braid_{π,H}(ϑ2) ∖ braid_{π,H}(ϑ1) ≠ ∅ and braid_{π,H}(ϑ1) ∖ braid_{π,H}(ϑ2) ≠ ∅. By Definition 8, the interval of a braid is the intersection of the intervals of the leaves it contains. Using the associative and commutative properties, this can be rewritten as the intersection of two sets: the interval of the shared leaves and the interval of the leaves that are not shared.
I(braid_{π,H}(ϑ1)) = I(braid_{π,H}(ϑ1) ∩ braid_{π,H}(ϑ2)) ∩ I(braid_{π,H}(ϑ1) ∖ braid_{π,H}(ϑ2))
I(braid_{π,H}(ϑ2)) = I(braid_{π,H}(ϑ1) ∩ braid_{π,H}(ϑ2)) ∩ I(braid_{π,H}(ϑ2) ∖ braid_{π,H}(ϑ1))

A braid's interval must exclude its unreachable leaves, since a braid consists of all reachable leaves (Def. 7). As such, I(braid_{π,H}(ϑ2)) must not overlap with I(braid_{π,H}(ϑ1) ∖ braid_{π,H}(ϑ2)), and I(braid_{π,H}(ϑ1)) must not overlap with I(braid_{π,H}(ϑ2) ∖ braid_{π,H}(ϑ1)). However, due to the intersections above, I(braid_{π,H}(ϑ1)) ⊆ I(braid_{π,H}(ϑ1) ∖ braid_{π,H}(ϑ2)) and I(braid_{π,H}(ϑ2)) ⊆ I(braid_{π,H}(ϑ2) ∖ braid_{π,H}(ϑ1)). Thus, the intervals I(braid_{π,H}(ϑ1)) and I(braid_{π,H}(ϑ2)) cannot overlap. □

The fact that two braids cannot have overlapping intervals allows us to prove that sets of parameter values are similar iff they share the same braid interval.

Lemma 4. ∀ϑ1, ϑ2 ∈ ℝ^n, ϑ1 ≡_H ϑ2 iff I(braid_{π,H}(ϑ1)) = I(braid_{π,H}(ϑ2)).

Proof. Let ϑ1 ≡_H ϑ2, meaning braid_{π,H}(ϑ1) = braid_{π,H}(ϑ2) = L, where L is the set of reachable leaves (Defs. 7 and 9). By Definition 8, the interval of a set of leaves is the intersection of the intervals of the leaves in the set, so both braids must have the same interval. Conversely, let I(braid_{π,H}(ϑ1)) = I(braid_{π,H}(ϑ2)). By way of contradiction, assume braid_{π,H}(ϑ1) ≠ braid_{π,H}(ϑ2). By Lemma 3, this would mean I(braid_{π,H}(ϑ1)) ∩ I(braid_{π,H}(ϑ2)) = ∅, which is a contradiction. Thus, braid_{π,H}(ϑ1) = braid_{π,H}(ϑ2), meaning ϑ1 ≡_H ϑ2 (Def. 9). □

We can now prove that the partitions produced by ≡_H partitioning the parameter space ℝ^n each represent a single braid, causing each partition to be a disjoint interval on which a constant set of leaves is reachable.

Theorem 2. Let π(b; Θ) be a parameterized BSQ policy, b0 be the initial belief state, and H be the horizon. Each partition ρ ⊆ ℝ^n created by the operator ≡_H partitioning ℝ^n is a disjoint interval where ∀ϑ ∈ ρ, braid_{π,H}(ϑ) = L for a fixed set of leaves L.

Proof. Let ρ be a partition produced by ≡_H partitioning the parameter space ℝ^n. This means that the parameter value sets contained in ρ must be similar (Def. 9): ∀ϑ1, ϑ2 ∈ ρ, ϑ1 ≡_H ϑ2. As such, all parameter values have the same braid (Def. 7), meaning there exists a set of leaves L that are reachable in ρ. By Def. 8, this set's interval must be I(L) = ⋂_{ℓ ∈ L} I(ℓ). By Lemma 3, the intervals of other braids cannot overlap with I(L). Also, there are no proper subsets (Lemma 2), meaning that no other braid can occur in I(L), making it disjoint. By Def. 8, I(L) must be contained in ρ, since all parameter value sets in I(L) have the same braid of leaves L. If ρ contained parameter value sets not in I(L), this would imply that there exists ϑ outside of I(L) for which exactly the leaves in L are reachable, which is a contradiction, since I(L) is the only interval of the space in which exactly the leaves of L are reachable. Hence the interval of ρ is exactly I(L). Thus, each partition represents a disjoint interval in which exactly the leaves of L are reachable. □

Since braid intervals do not overlap, we can prove that the parameter value sets contained in a braid's interval must have a constant expected cost.

Lemma 5. Let π(b; Θ) be a parameterized BSQ policy and H be the horizon. ∀ϑ1 ∈ ℝ^n, ∀ϑ2, ϑ3 ∈ I(braid_{π,H}(ϑ1)), E_π(ϑ2; H) = E_π(ϑ3; H).

Proof. Let ϑ2, ϑ3 ∈ I(braid_{π,H}(ϑ1)), where ϑ1 ∈ ℝ^n is a tuple of n parameters. Note that braid_{π,H}(ϑ1) = braid_{π,H}(ϑ2) = braid_{π,H}(ϑ3), since there are no strict subsets (Lemma 2) and the leaves reachable under ϑ2 and ϑ3 must be reachable under ϑ1. Each braid represents a policy tree (Def. 5), and the expected cost is based on the probability distribution over the leaves of the braid. Since ϑ2 and ϑ3 represent the same policy tree, they must have identical expected cost values. □

It is now straightforward to show that each partition represents a disjoint interval of the parameter space on which the expected cost is constant.

Theorem 3. Let π(b; Θ) be a parameterized BSQ policy, P be a gPOMDP, b0 be the initial belief state, and H be the horizon. Each partition created by ≡_H on ℝ^n has a constant expected cost.

Proof.
Let ρ be a partition created by partitioning ℝ^n with ≡_H. By Theorem 2, all parameter value sets in the disjoint interval of ρ must have the same braid. As such, by Lemma 5, the expected cost is constant for all these parameter sets. Thus, the disjoint interval of each partition must have a constant expected cost. □

C Proofs For Partition Refinement Search [Section 5]

In this section, we provide the formal proof of Theorem 4, showing that the Partition Refinement Search algorithm introduced in Section 5 is probabilistically complete. We define and prove Lemmas 6, 7, and 8 as building blocks for this proof.

When PRS refines a partition ρ using a leaf ℓ, it can produce up to two possible partitions: a partition of ρ where ℓ is reachable and a partition of ρ where ℓ is not (if it exists). We now show that this process never produces empty partitions.

Lemma 6. Let π(b; Θ) be a parameterized BSQ policy, P be a gPOMDP, b0 be the initial belief state, and H be the horizon. For each partition ρ constructed by Partition Refinement Search, ρ ≠ ∅.

Proof. Let ρ ⊆ ℝ^n be a partition constructed by PRS. Since PRS creates partitions based on whether sampled leaves are included or excluded, let L_i be the leaves PRS included in partition ρ and L_e be the leaves excluded. Therefore, ρ = I(L_i) ∖ I(L_e). By way of contradiction, assume ρ = ∅. There are two cases in which this could occur: (1) excluding a leaf ℓ caused ρ = ∅, or (2) including ℓ caused ρ = ∅. For case (1), we explicitly do not add partitions if excluding the leaf results in an empty interval, so this cannot happen. For case (2), this implies that there exists a previous partition ρ0 where sampling leaf ℓ0 resulted in the partition constructed from ρ0 by including ℓ0 being ρ, with ρ = ∅. Since ℓ0 was obtained by uniform sampling from ρ0, ℓ0 must be reachable in ρ0, meaning ρ0 ∩ I(ℓ0) ≠ ∅. However, line 10 of Algo. 1 calculates the interval of ρ as ρ0 ∩ I(ℓ0), meaning ρ ≠ ∅, which is a contradiction. Thus, all partitions must be non-empty. □
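For intuition, a single refinement step over a one-dimensional parameter space can be sketched as follows: sampling a leaf ℓ splits a partition ρ into the include-part ρ ∩ I(ℓ) and the exclude-part ρ ∖ I(ℓ), and empty pieces are dropped, matching case (1) above. The open-interval representation and function names are our own simplification of Algorithm 1, not the paper's implementation.

```python
def refine(rho, leaf_interval):
    """Split partition rho = (lo, hi) on leaf interval I(l) = (a, b):
    return the include-part rho ∩ I(l) and the (possibly two-piece)
    exclude-part rho \\ I(l); empty pieces are dropped."""
    lo, hi = rho
    a, b = leaf_interval
    include = (max(lo, a), min(hi, b))
    if include[0] >= include[1]:
        include = None  # the leaf is unreachable from rho
    exclude = [iv for iv in [(lo, min(hi, a)), (max(lo, b), hi)]
               if iv[0] < iv[1]]
    return include, exclude

inc, exc = refine((0.0, 1.0), (0.4, 0.7))
# inc = (0.4, 0.7); exc = [(0.0, 0.4), (0.7, 1.0)]
```

Note that when the leaf interval covers all of ρ, the exclude list comes back empty, so no empty partition is ever added, mirroring the argument in Lemma 6.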
A critical property of PRS is that each partition it constructs converges to represent a single braid.

Lemma 7. Let π(b; Θ) be a parameterized BSQ policy, P be a gPOMDP, b0 be the initial belief state, and H be the horizon. Let ρ be a partition constructed by Partition Refinement Search. If all leaves reachable in ρ ⊆ ℝ^n have been sampled, then ∀ϑ ∈ ρ, I(braid_{π,H}(ϑ)) = ρ.

Proof. Let ρ ⊆ ℝ^n be a partition constructed by PRS. Let L_ρ = {ℓ1, ..., ℓn} be the n unique leaves sampled for ρ. Let all leaves reachable from ρ be sampled: ∀ℓ, ℓ ∈ L_ρ ↔ [∃ϑ ∈ ρ, ℓ ∈ braid_{π,H}(ϑ)]. Since ρ ≠ ∅ (Lemma 6) and parameterized BSQ policies cover ℝ^n (Def. 4), there must exist a non-empty set of leaves L reachable within ρ. Since all leaves have been sampled, we know that L ⊆ L_ρ. However, there cannot be proper subsets (Lemma 2), meaning L = L_ρ. This means that the interval of ρ must also equal the interval I(L) of these leaves. Thus, ρ must represent a braid. □

Since partitions are constructed by including/excluding sampled leaves hierarchically, we can prove that each partition represents a unique braid.

Lemma 8. Let π(b; Θ) be a parameterized BSQ policy, P be a gPOMDP, b0 be the initial belief state, and H be the horizon. Let ρ1, ρ2 ⊆ ℝ^n be distinct partitions constructed by Partition Refinement Search. If all leaves reachable in ρ1 and ρ2 have been sampled, then ∀ϑ1 ∈ ρ1, ∀ϑ2 ∈ ρ2, braid_{π,H}(ϑ1) ≠ braid_{π,H}(ϑ2).

Proof. Let ρ1, ρ2 ⊆ ℝ^n be two different partitions constructed by PRS. Note that PRS partitions ℝ^n by refining one partition using a leaf ℓ into two partitions, explicitly including I(ℓ) in one partition and excluding I(ℓ) from the other (Algorithm 1). Hence ρ1 and ρ2 cannot overlap. Since both partitions represent a possible non-empty braid (Lemma 7), there exists a set of leaves reachable in each partition. However, by the partition construction process, there must exist at least one leaf included in one partition but excluded from the other.
Since braid intervals do not overlap, the two partitions must therefore represent two different braids (Lemma 2). Thus, all partitions must represent unique braids. □

Using the property that each partition in PRS represents a unique braid, we can now prove that PRS is probabilistically complete.

Theorem 4. Let π(b; Θ) be a parameterized BSQ policy, P be a gPOMDP, b0 be the initial belief state, and H be the horizon. The likelihood of the Partition Refinement Search algorithm returning the optimal parameter interval converges to one in the limit of infinite samples.

Proof. Note that gPOMDPs have a finite set of observations and a finite horizon (Def. 1), and parameterized BSQ policies have a finite number of rules (Def. 4). As such, there exists a finite number of unique rule-observation trajectories in the strategy tree (Def. 6). Therefore, there exists a finite number of leaves, each leaf having a unique rule-observation trajectory. This results in there being only a finite set of braids, namely all possible combinations of reachable leaves (Def. 7). Since each partition represents a unique braid (Lemma 8), the number of partitions must be finite.

Let ρ ⊆ ℝ^n be a partition constructed by PRS that is not yet equivalent to a braid. By Lemma 7, this means there exists a leaf ℓ reachable in ρ that has not been sampled yet. This also means there must exist a non-empty interval ρ ∩ I(ℓ) from which sampling can reach ℓ. Since parameter values are selected by uniform sampling when sampling a leaf to refine the partition (line 7 of Algorithm 1), the probability of selecting a parameter value that could sample ℓ can be calculated as |ρ ∩ I(ℓ)| / |ρ| = Pr(I(ℓ) | ρ). Note that ℓ represents a unique rule-observation trajectory {r1, o1, ..., rH, oH}, and that the probability of an observation o being received in belief state b after action a is executed is Pr(o | b, a) = ∑_{s′} [Ω(s′, a, o) ∑_s T(s, a, s′) b(s)].
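This observation probability can be computed directly from the belief and from the transition and observation models. A minimal two-state sketch, where the 75%-accurate sensor loosely echoes the Spaceship Repair robot sensor; all model values here are illustrative:

```python
def obs_probability(b, a, o, T, Omega, states):
    """Pr(o | b, a) = sum over s' of Omega(s', a, o) * sum over s of
    T(s, a, s') * b(s)."""
    return sum(
        Omega(s2, a, o) * sum(T(s, a, s2) * b[s] for s in states)
        for s2 in states
    )

# Two-state example: action "check" keeps the state unchanged; the
# sensor reports the true state with probability 0.75.
states = ["ok", "broken"]
b = {"ok": 0.5, "broken": 0.5}
T = lambda s, a, s2: 1.0 if s == s2 else 0.0
Omega = lambda s2, a, o: 0.75 if o == s2 else 0.25

p = obs_probability(b, "check", "broken", T, Omega, states)
# p = 0.75 * 0.5 + 0.25 * 0.5 = 0.5
```

Multiplying such terms along a rule-observation trajectory gives the reachability probability of a leaf used in the argument that follows.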
This means that the probability of reaching ℓ during a rollout is Pr(ℓ) = Π_i Pr(o_i | b_i), where b_i = bp(b0, r₁, o₁, ..., r_i, o_i). Since ℓ is reachable, Pr(ℓ) > 0. We assume that the partition selection approaches discussed in Section 6 have a non-zero probability of refining any partition; let Pr(ρ) be the probability of ρ being selected. Then, in any refinement step, the probability of sampling leaf ℓ is Pr(ℓ) · Pr(I(ℓ) | ρ) · Pr(ρ). Since each factor is greater than zero, the probability of any non-sampled leaf being sampled must be greater than zero. Therefore, with enough refinement steps, all leaves will be sampled, since there is only a finite number of them. Thus, the set of partitions will be refined to the set of braids as the number of samples increases to infinity.

Note that each partition represents a unique braid (Lemma 8) with a fixed probability distribution over outcomes based on the reachable leaves. Due to the non-zero probability Pr(ρ) of refining a partition, the sampled expected cost of a partition converges to its actual expected cost by the law of large numbers. Therefore, within a finite number of samples, the partitions constructed by PRS will accurately represent the set of braids along with accurate estimates of their expected costs. Thus, PRS will find the minimal expected cost partition as the number of samples increases to infinity.

D Evaluation Problems' Belief-State Query Preferences

In this section, we provide the parameterized BSQ policies for the Lane Merger, Graph Rock Sample, and Store Visit problems discussed in Section 7. To do this, we first describe the functions that compose each problem's states and actions. For clarity, we use loops and quantifiers in the parameterized BSQ policies; these can be unrolled on a problem-by-problem basis.

D.1 Lane Merger

In the Lane Merger problem, there are two lanes, and the agent must merge into the other lane within a certain distance.
In this other lane, there is another car whose exact location and speed are unknown. Therefore, there are two objects in the environment: the agent (agent) and the other car (other). For either object o, the location and speed are tracked using the unary integer functions loc(o) and speed(o). For actions, the agent can increase its speed (speed_up()), decrease its speed (slow_down()), remain in its current lane at its current speed (keep_speed()), or attempt to merge lanes (merge()). Using these functions, the parameterized BSQ policy π_lm(b; Θ₁, Θ₂) is formally defined as follows.

π_lm(b; Θ₁, Θ₂):
  If Pr⟦loc(agent) > loc(other) + speed(other)/2 ∨ loc(agent) + speed(agent)/2 < loc(other)⟧_b > Θ₁ → merge()
  Else if Pr⟦|loc(agent) − loc(other)| ≤ 1⟧_b > Θ₂ ∧ Pr⟦speed(agent) > 0⟧_b = 1 → slow_down()
  Else keep_speed()

D.2 Graph Rock Sample

In the Graph Rock Sample problem, a rover has pre-programmed waypoints, some of which contain rocks. These rocks have been categorized into types, and whether a rock is safe for the rover to sample is unknown. The objective of the rover is to sample a safe rock of each needed type before traversing to a dropoff location. The objects are the waypoints, including the rocks {r₁, ..., rₙ} and the dropoff location (dropoff). The rover knows whether or not it is located at waypoint w using the unary Boolean function loc(w). The rover also knows whether it needs to sample rocks of type t using the unary Boolean function needed(t). For any rock r, the distance from the rover, whether the rock is of type t, and whether the rock is safe to sample are tracked using the unary double function distance(r) and the Boolean functions type(r, t) and safe(r), respectively. The rover can move to a neighboring waypoint w (move(w)), sample rock r at its current waypoint (sample(r)), and scan any rock r (scan(r)). For clarity, we use the function goto(w) to specify taking the edge that moves the rover closer to waypoint w.
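As an aside, the ordered-rule structure of the Lane Merger policy π_lm from D.1 can be sketched in executable form, evaluating each belief-state query as the total belief mass of the states satisfying it. The state tuples (loc_agent, speed_agent, loc_other, speed_other), the half-speed lookahead, and the helper `bsq` are illustrative assumptions, not the paper's code.

```python
# Sketch: a parameterized BSQ policy as an ordered list of rules, where
# each precondition is a query over the belief (a {state: prob} dict).

def bsq(belief, cond):
    """Probability that predicate `cond` holds under the belief."""
    return sum(p for s, p in belief.items() if cond(s))

def pi_lm(belief, theta1, theta2):
    # state tuple: (loc_agent, speed_agent, loc_other, speed_other)
    gap_ok = lambda s: (s[0] > s[2] + s[3] / 2) or (s[0] + s[1] / 2 < s[2])
    close = lambda s: abs(s[0] - s[2]) <= 1
    moving = lambda s: s[1] > 0
    if bsq(belief, gap_ok) > theta1:
        return "merge"
    if bsq(belief, close) > theta2 and bsq(belief, moving) == 1.0:
        return "slow_down"
    return "keep_speed"
```

The first rule whose query clears its threshold fires; a rule with threshold 1 (the speed check) acts as a hard constraint that must hold with certainty.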
Using these functions, the parameterized BSQ policy π_grs(b; Θ₁, Θ₂, Θ₃) is formally defined as follows.

π_grs(b; Θ₁, Θ₂, Θ₃):
  For r_c ∈ {r₁, ..., rₙ}:
    If Pr⟦∃t. type(r_c, t) ∧ needed(t) ∧ loc(r_c) ∧ safe(r_c)⟧_b ≥ Θ₁ → sample(r_c)
    Else if Pr⟦∃t. type(r_c, t) ∧ needed(t) ∧ ¬loc(r_c) ∧ safe(r_c)⟧_b ≥ Θ₁ → goto(r_c)
    Else if Pr⟦∃t. type(r_c, t) ∧ needed(t) ∧ safe(r_c)⟧_b ≥ Θ₂ ∧ Pr⟦distance(r_c) ≤ Θ₃⟧_b = 1 → scan(r_c)
    Else if Pr⟦∃t. type(r_c, t) ∧ needed(t) ∧ safe(r_c)⟧_b ≥ Θ₂ ∧ Pr⟦distance(r_c) > Θ₃⟧_b = 1 → goto(r_c)
  Else goto(dropoff)

D.3 Store Visit

The Store Visit problem involves an agent in a city with a grid-based layout. Some locations are unsafe, while others contain a bank or a store. The objective is for the agent to safely visit a bank and then a store. The objects are the agent, the set of stores {s₁, ..., sₙ}, and the set of banks {b₁, ..., bₘ}. Labeling functions bank(o) and store(o) check whether object o is a bank or a store, respectively. The ternary Boolean function loc(o, x, y) tracks the current (x, y) location of object o. Similarly, whether location (x, y) is safe is tracked by the binary Boolean function is_safe(x, y). Lastly, the state keeps track of whether the agent has visited a bank using the nullary Boolean function vbank(). The agent can move left (left()), right (right()), up (up()), and down (down()) in the grid. The agent can also visit a building at its current location (visit()) or scan its surroundings to figure out its location (scan()). Using these functions, the parameterized BSQ policy π_sv(b; Θ₁, Θ₂, Θ₃) is formally defined as follows.
π_sv(b; Θ₁, Θ₂, Θ₃):
  If ∀x, y. Pr⟦loc(agent, x, y)⟧_b < Θ₃ → scan()
  Else if Pr⟦∃s, x, y. vbank() ∧ store(s) ∧ loc(s, x, y) ∧ loc(agent, x, y)⟧_b ≥ Θ₁ → visit()
  For s_c ∈ {s₁, ..., sₙ}:
    Else if Pr⟦∃x₁, y₁, x₂, y₂. vbank() ∧ store(s_c) ∧ loc(agent, x₁, y₁) ∧ loc(s_c, x₂, y₂) ∧ x₁ < x₂ ∧ is_safe(x₁ + 1, y₁)⟧_b ≥ Θ₂ → right()
    Else if Pr⟦∃x₁, y₁, x₂, y₂. vbank() ∧ store(s_c) ∧ loc(agent, x₁, y₁) ∧ loc(s_c, x₂, y₂) ∧ x₁ > x₂ ∧ is_safe(x₁ − 1, y₁)⟧_b ≥ Θ₂ → left()
    Else if Pr⟦∃x₁, y₁, x₂, y₂. vbank() ∧ store(s_c) ∧ loc(agent, x₁, y₁) ∧ loc(s_c, x₂, y₂) ∧ y₁ > y₂ ∧ is_safe(x₁, y₁ − 1)⟧_b ≥ Θ₂ → down()
    Else if Pr⟦∃x₁, y₁, x₂, y₂. vbank() ∧ store(s_c) ∧ loc(agent, x₁, y₁) ∧ loc(s_c, x₂, y₂) ∧ y₁ < y₂ ∧ is_safe(x₁, y₁ + 1)⟧_b ≥ Θ₂ → up()
  Else if Pr⟦∃k, x, y. ¬vbank() ∧ bank(k) ∧ loc(k, x, y) ∧ loc(agent, x, y)⟧_b ≥ Θ₁ → visit()
  For k_c ∈ {k₁, ..., kₘ}:
    Else if Pr⟦∃x₁, y₁, x₂, y₂. ¬vbank() ∧ bank(k_c) ∧ loc(agent, x₁, y₁) ∧ loc(k_c, x₂, y₂) ∧ x₁ < x₂ ∧ is_safe(x₁ + 1, y₁)⟧_b ≥ Θ₂ → right()
    Else if Pr⟦∃x₁, y₁, x₂, y₂. ¬vbank() ∧ bank(k_c) ∧ loc(agent, x₁, y₁) ∧ loc(k_c, x₂, y₂) ∧ x₁ > x₂ ∧ is_safe(x₁ − 1, y₁)⟧_b ≥ Θ₂ → left()
    Else if Pr⟦∃x₁, y₁, x₂, y₂. ¬vbank() ∧ bank(k_c) ∧ loc(agent, x₁, y₁) ∧ loc(k_c, x₂, y₂) ∧ y₁ > y₂ ∧ is_safe(x₁, y₁ − 1)⟧_b ≥ Θ₂ → down()
    Else if Pr⟦∃x₁, y₁, x₂, y₂. ¬vbank() ∧ bank(k_c) ∧ loc(agent, x₁, y₁) ∧ loc(k_c, x₂, y₂) ∧ y₁ < y₂ ∧ is_safe(x₁, y₁ + 1)⟧_b ≥ Θ₂ → up()
  Else scan()

E Hyperparameter Optimization Algorithms Implementation

As baseline comparisons, we implemented Nelder-Mead and Particle Swarm, hyperparameter optimization algorithms that solve for parameter values of a parameterized BSQ policy that minimize the expected cost of the resulting BSQ policy. Both algorithms evaluate points in the parameter space to decide which areas to explore next. For both, we evaluate a parameter point by taking a thousand parallel runs of the BSQ policy with those values to approximate the expected cost.

Nelder-Mead We used a simplex whose number of vertices is one more than the number of parameters in the parameterized BSQ policy being optimized.
To start with a better initial simplex, we randomly sampled a hundred points and tracked the points that had lower expected costs and were at least 0.4 distance away from each of the better-performing points. Closer points were saved but given a lower priority. Each iteration followed the standard Nelder-Mead steps, with the sum quality of all the vertices of the simplex calculated. If five iterations pass without an increase in quality, the run is deemed to have converged, and the best quality point of the simplex is returned as the solution.

Particle Swarm Particle swarm used 10 particles randomly selected from within the parameter space, each with a random velocity. Let t be the number of iteration steps since the last improvement in the best quality point found. For each iteration, the cognitive coefficient is 1.0 − 0.1t and the social coefficient is 0.1 + 0.1t, which causes the particles to become greedier as the time since the last improvement increases. The momentum is statically set to 0.6, with the velocity clipped to [−0.5, 0.5]. The location of points is also clipped to the parameter search space. If 10 iteration steps pass without an improvement, the run is deemed to have converged, and the best quality point of the swarm is returned as the solution.

Figure 5: Performance of the hypothesized optimal partition while solving the Lane Merger, Graph Rock Sample, Spaceship Repair, and Store Visit problems (expected cost and % goal achievement over time in seconds, for PRS-Bolt, PRS-Epsilon, PRS-Local, PRS-Global, and PRS-Max). Each line is the average over 10 independent runs with the standard deviation error shown.

F Additional Results

In this section, we provide additional results from the experiments performed. This includes introducing two additional partition selection approaches we evaluated: Global Thompson Sampling and Maximum Confidence.
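The particle update described above can be sketched as follows; the 1-D setup, function name, and default bounds are illustrative assumptions, with the coefficients, momentum, and clipping taken from the text.

```python
# Sketch of one particle-swarm step: cognitive coefficient 1.0 - 0.1*t,
# social coefficient 0.1 + 0.1*t (t = iterations since last improvement),
# momentum 0.6, velocity clipped to [-0.5, 0.5], position clipped to bounds.

import random

def pso_step(pos, vel, personal_best, global_best, t, lo=0.0, hi=1.0):
    cognitive = 1.0 - 0.1 * t
    social = 0.1 + 0.1 * t
    new_vel = (0.6 * vel
               + cognitive * random.random() * (personal_best - pos)
               + social * random.random() * (global_best - pos))
    new_vel = max(-0.5, min(0.5, new_vel))     # clip velocity
    new_pos = max(lo, min(hi, pos + new_vel))  # clip to parameter space
    return new_pos, new_vel
```

As t grows, the cognitive pull toward each particle's own best shrinks while the social pull toward the swarm's best grows, which is the greediness schedule described above.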
We also provide graphs of the performance of the hypothesized optimal partition across all PRS variants. Finally, we provide a results table for all five partition selection approaches and the baseline RCompliant. PRS is implemented for multiprocessing by having each process manage a subset of the partitions X′ ⊆ X while sharing a global hypothesis of the optimal partition. Also, a dynamic exploration rate er is used that diminishes over the solving time. Using this framework, two additional partition selection approaches were explored.

Maximum Confidence (PRS-Max) We explore er percent of the time by uniformly sampling s ∼ U(0, 1) and checking if s ≤ er. If exploring, we select a partition uniformly at random from X′. Otherwise, the partition with the maximum standard deviation of its estimated expected cost, arg max_{⟨ρ, Ê[ρ]⟩ ∈ X′} σ(Ê[ρ]), is selected.

Global Thompson Sampling (PRS-Global) Unlike the other partition selection approaches, each processor iterates over all partitions it manages before selecting multiple partitions to refine. Partitions are chosen for two reasons: (1) the partition has fewer than the minimum number of samples, or (2) the partition has the potential of being better than the current global hypothesized optimal partition. The latter is simulated for each partition by drawing from N(µ_c, σ_c · er), with µ_c and σ_c being the mean and standard deviation of that partition's estimated expected cost, respectively. If the sample drawn from this normal distribution has a lower expected cost than the hypothesized optimal partition, the partition is selected for refinement.

Performance of the hypothesized optimal In Figure 5, the hypothesized optimal over the runtime of PRS for each partition selection approach is shown. On Lane Merger and Spaceship Repair, the performance of each PRS variant is quite similar, with the solver quickly converging to a near-optimal policy. However, PRS-Global has a much slower convergence rate due to trying to evaluate all promising partitions rather than focusing on the most promising ones.
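The Global Thompson Sampling selection rule can be sketched as follows; the `Partition` record and the `min_samples` threshold are illustrative assumptions, not values from the paper.

```python
# Sketch: per-processor Thompson-style selection. Each partition is drawn
# from N(mean_cost, std_cost * er); it is selected for refinement if it is
# under-sampled or its draw beats the hypothesized optimal partition's cost.

import random
from collections import namedtuple

Partition = namedtuple("Partition", ["mean_cost", "std_cost", "n_samples"])

def select_partitions(partitions, best_cost, er, min_samples=10, rng=random):
    selected = []
    for i, p in enumerate(partitions):
        if p.n_samples < min_samples:
            selected.append(i)  # reason (1): below the minimum sample count
            continue
        draw = rng.gauss(p.mean_cost, p.std_cost * er)
        if draw < best_cost:
            selected.append(i)  # reason (2): potentially better than best
    return selected
```

Scaling the draw's standard deviation by the diminishing exploration rate er makes the selection increasingly exploitative as solving proceeds.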
This resulted in PRS-Global not converging before timeout on Store Visit. PRS-Max is expected to perform poorly due to its poor partition-selection strategy. These results highlight that, with a competent partition-selection strategy, PRS will converge to the optimal policy that minimizes the expected cost, with the main variation being the convergence time.

Tabulated performance In Table 1 and Table 2, the expected cost and the goal achievement rate have been tabulated, showing the near-identical performance of four of the partition refinement approaches. The solutions for Nelder-Mead and Particle Swarm are taken at PRS's timeout to give each solver the same solving time. For all the problems, the more effective partition-selection approaches

Problems        | Lane Merger   | Graph Rock Sample | Spaceship Repair | Store Visit
PRS-Bolt        | 4.39 ± 0.04   | 18.19 ± 0.82      | 8.52 ± 0.00      | 19.81 ± 2.09
PRS-Epsilon     | 4.40 ± 0.04   | 17.84 ± 0.25      | 8.52 ± 0.00      | 22.20 ± 4.70
PRS-Global      | 4.40 ± 0.03   | 18.47 ± 1.72      | 8.61 ± 0.31      | 38.07 ± 13.63
PRS-Local       | 4.39 ± 0.03   | 20.61 ± 2.96      | 8.57 ± 0.05      | 21.57 ± 5.80
PRS-Max         | 4.48 ± 0.08   | 34.12 ± 7.00      | 8.58 ± 0.07      | 58.16 ± 18.33
Nelder-Mead     | 4.89 ± 0.65   | 19.56 ± 1.44      | 8.82 ± 0.38      | 22.39 ± 8.24
Particle Swarm  | 4.91 ± 0.56   | 21.12 ± 2.34      | 8.69 ± 0.33      | 18.95 ± 2.42
RCompliant      | 22.10 ± 15.70 | 60.91 ± 18.46     | 9.97 ± 0.77      | 56.64 ± 33.04

Table 1: Expected cost of Partition Refinement Search, Nelder-Mead, Particle Swarm, and RCompliant on the Lane Merger, Graph Rock Sample, Spaceship Repair, and Store Visit problems. The performance was measured over ten runs to calculate the performance average and standard deviation.
Problems        | Lane Merger   | Graph Rock Sample | Spaceship Repair | Store Visit
PRS-Bolt        | 99.6% ± 0.1%  | 96.0% ± 1.8%      | 49.8% ± 0.0%     | 95.6% ± 2.7%
PRS-Epsilon     | 99.6% ± 0.1%  | 97.3% ± 1.3%      | 49.8% ± 0.0%     | 92.0% ± 5.8%
PRS-Global      | 99.6% ± 0.0%  | 96.3% ± 2.9%      | 50.6% ± 2.6%     | 72.6% ± 16.3%
PRS-Local       | 99.6% ± 0.1%  | 92.2% ± 4.0%      | 49.3% ± 0.4%     | 92.7% ± 7.2%
PRS-Max         | 99.5% ± 0.2%  | 83.6% ± 9.6%      | 49.3% ± 0.6%     | 48.2% ± 21.3%
Nelder-Mead     | 98.9% ± 1.2%  | 94.0% ± 2.7%      | 52.7% ± 7.9%     | 92.4% ± 10.6%
Particle Swarm  | 99.0% ± 1.1%  | 91.3% ± 3.5%      | 52.0% ± 5.2%     | 96.6% ± 3.5%
RCompliant      | 86.9% ± 16.0% | 41.3% ± 20.5%     | 48.1% ± 10.2%    | 49.8% ± 38.4%

Table 2: Goal achievement rate of Partition Refinement Search, Nelder-Mead, Particle Swarm, and RCompliant on the Lane Merger, Graph Rock Sample, Spaceship Repair, and Store Visit problems. The performance was measured over ten runs to calculate the performance average and standard deviation.

discussed in the main paper achieved equal, if not better, performance than Nelder-Mead and Particle Swarm.

G Experimental Setup and Computational Cost

In this section, we describe the empirical setup of the experiments performed in Section 7 and include an estimate of the computational cost of running the experiments for this paper. All experiments were performed on an Intel(R) Xeon(R) W-2102 CPU @ 2.90GHz without using a GPU. The Partition Refinement Search algorithm was implemented using a manager-worker design pattern in which 8 workers were initialized when solving. The manager maintained the hypothesized optimal partition and the current exploration rate. Table 3 shows the timeout and sample rate used for each problem for PRS. PRS was allowed to use an additional minute beyond the timeout in case the hypothesized optimal partition had fewer than the minimum allowed samples; however, this case did not occur. Both solutions and recorded hypothesized optimal partitions were evaluated using the same random seed to ensure that the same initial states were assessed. This evaluation process was carried out in parallel using a manager-worker design pattern with 16 workers.
25,000 independent runs were conducted for each solution to determine the expected cost and goal achievement rate. Additionally, for each recorded hypothesized optimal partition, 10,000 runs were performed.

Problem            | Timeout (seconds) | Sample Rate (seconds)
Lane Merger        | 120               | 0.5
Graph Rock Sample  | 120               | 1
Spaceship Repair   | 30                | 0.125
Store Visit        | 300               | 2.5

Table 3: The timeout and the sample rate of the hypothesized optimal partition for PRS for the evaluation problems.

The average performance and standard deviation error were calculated by averaging the results of ten runs for each combination of problem and solver. A similar approach was used to evaluate the random-parameter user-compliant policy RCompliant: instead of using solved policies, ten parameter value sets were selected uniformly at random from the parameter space, and each set was evaluated for 25,000 runs. These results are presented in Figure 3. For constructing the Spaceship Repair heatmap (Figure 1), all combinations of parameters Θ₁ and Θ₂ were evaluated, with parameter values sampled from 0 to 1 in increments of 0.002. This produced 251,001 equally-spaced parameter value pairs. Each pair was evaluated on 300 runs with a horizon of 12 to calculate the expected cost.

Computational cost Running the Partition Refinement Search algorithm for the empirical evaluation section (Section 7) involved nine processes running simultaneously for 25 minutes across ten trials for each of the five partition-selection approaches. This resulted in 1.58 hours of CPU usage when run in parallel, equivalent to 14.22 hours if executed sequentially. Evaluation was computationally significant, with substantial variance in time per run across problem types. For instance, evaluating the solutions and hypothesized optimal partitions for the Lane Merger problem using 17 processes took approximately 48 hours in parallel.
The overall CPU usage for the main experiment approximates to 360 hours (15 days) in parallel, translating to about 6,288 hours (262 days) if run sequentially. Additionally, constructing the Spaceship Repair heatmap (Figure 1) required approximately 24 hours of CPU time using 11 processes. These experiments were conducted thrice, culminating in an estimated total computational cost of 2,160 hours (90 days) using an Intel(R) Xeon(R) W-2102 CPU @ 2.90GHz, or 19,656 hours (819 days) if operations were performed sequentially.

H Spaceship Repair Partitions Closed Form

In this section, we calculate the braids that partition the parameter space for the Spaceship Repair problem with the parameterized BSQ policy from Fig. 1. First, we give the exact observation model used. From Section 3.1, the Spaceship Repair state is composed of two functions: broken(o) and rlocation(). This means each state is expressed as {broken(robot), broken(ship), rlocation()}. Additionally, each observation can be expressed as {obs_err(robot), obs_err(ship)}. Let p_r and p_s be the probabilities of the observation reflecting the actual state of the robot and the spaceship, respectively. The probability of observation o in state s after action a is executed is calculated as follows:

Pr(o = {obs_err(robot), obs_err(ship)} | s = {broken(robot), broken(ship), rlocation()}) =
  p_r · p_s,              if broken(robot) = obs_err(robot) ∧ broken(ship) = obs_err(ship)
  p_r · (1 − p_s),        if broken(robot) = obs_err(robot) ∧ broken(ship) ≠ obs_err(ship)
  (1 − p_r) · p_s,        if broken(robot) ≠ obs_err(robot) ∧ broken(ship) = obs_err(ship)
  (1 − p_r) · (1 − p_s),  otherwise    (1)

Note that observations are independent of the robot's location and actions. For clarity, we express the states in terms of whether or not the robot and spaceship are broken, {broken(robot), broken(ship)}. This means there are four possible states depending on whether the robot and the ship are broken.
For ease of notation, we represent these states as S = {s_TT, s_TF, s_FT, s_FF}, where s_TF represents the state where the robot is broken and the spaceship is not. Similarly, let the four possible observations be represented as O = {o_TT, o_TF, o_FT, o_FF}. The precondition of the first rule of the Spaceship Repair parameterized BSQ policy is ⟦broken(robot)⟧_b > Θ₁ (Figure 1). For any belief state b, the probability of the robot being broken is the total probability of the states where that holds: ⟦broken(robot)⟧_b = b(s_TT) + b(s_TF). Let {a₁, o₁, ..., a_t, o_t} be an action-observation trajectory for t timesteps, where at each timestep an action is executed and then an observation is observed. We can calculate the probability of the state where both the robot and the spaceship are broken, s_TT, as follows:

b_t(s_TT) = α Pr(o_t | s_TT, a_t) Σ_s T(s, a_t, s_TT) b_{t−1}(s)    (2)

Note that, due to the observations being independent of the robot's location, the observation and transition functions are independent of the action. For example, there is no action the robot can perform to change whether the robot or spaceship is broken, because the problem is to reach the broken component rather than to fix it. We can therefore simplify Equation 2 significantly:

b_t(s_TT) = α Pr(o_t | s_TT) b_{t−1}(s_TT)    (3)

We can now rewrite Equation 3 by unrolling the recursion. Note that α is the normalization factor, so we need not normalize at each timestep: the final normalization will account for all these changes. Additionally, because the initial belief over states is uniform, we do not need to keep track of the initial belief. Also, note there are four possible observations. Due to the commutativity of multiplication, we can rearrange the factors by observation. Let c_TT, c_TF, c_FT, and c_FF be the counts of each observation, where c_TT + c_TF + c_FT + c_FF = t.
b_t(s_TT) = α Pr(o_TT | s_TT)^{c_TT} Pr(o_TF | s_TT)^{c_TF} Pr(o_FT | s_TT)^{c_FT} Pr(o_FF | s_TT)^{c_FF}    (4)

Using Equation 1, the probability of this state can be written in terms of p_r and p_s:

b_t(s_TT) = α (p_r p_s)^{c_TT} (p_r(1 − p_s))^{c_TF} ((1 − p_r)p_s)^{c_FT} ((1 − p_r)(1 − p_s))^{c_FF}    (5)

b_t(s_TT) = α p_r^{c_TT + c_TF} p_s^{c_TT + c_FT} (1 − p_r)^{c_FT + c_FF} (1 − p_s)^{c_TF + c_FF}    (6)

This same process can be applied to the other three states to obtain their likelihoods:

b_t(s_TF) = α p_r^{c_TT + c_TF} p_s^{c_TF + c_FF} (1 − p_r)^{c_FT + c_FF} (1 − p_s)^{c_TT + c_FT}    (7)

b_t(s_FT) = α p_r^{c_FT + c_FF} p_s^{c_TT + c_FT} (1 − p_r)^{c_TT + c_TF} (1 − p_s)^{c_TF + c_FF}    (8)

b_t(s_FF) = α p_r^{c_FT + c_FF} p_s^{c_TF + c_FF} (1 − p_r)^{c_TT + c_TF} (1 − p_s)^{c_TT + c_FT}    (9)

We can group the states into two groups depending on whether or not the robot is broken. By factoring, we get the following:

b_t(s_TT) + b_t(s_TF) = α p_r^{c_TT + c_TF} (1 − p_r)^{c_FT + c_FF} [p_s^{c_TT + c_FT} (1 − p_s)^{c_TF + c_FF} + p_s^{c_TF + c_FF} (1 − p_s)^{c_TT + c_FT}]    (10)

b_t(s_FT) + b_t(s_FF) = α p_r^{c_FT + c_FF} (1 − p_r)^{c_TT + c_TF} [p_s^{c_TT + c_FT} (1 − p_s)^{c_TF + c_FF} + p_s^{c_TF + c_FF} (1 − p_s)^{c_TT + c_FT}]    (11)

Note that in Equations 10 and 11 the bracketed factor is shared, which is due to the individual observations of the spaceship and the robot being independent of each other. For normalization, we simply divide by the sum of Equations 10 and 11, which equals the total probability over all states. Additionally, note that Equation 10 is equivalent to the BSQ precondition ⟦broken(robot)⟧_{b_t}. Substituting into this BSQ and simplifying, we get the following:

⟦broken(robot)⟧_{b_t} = p_r^{c_TT + c_TF} (1 − p_r)^{c_FT + c_FF} / [p_r^{c_TT + c_TF} (1 − p_r)^{c_FT + c_FF} + p_r^{c_FT + c_FF} (1 − p_r)^{c_TT + c_TF}]    (12)

Note that there are two exponent values: the number of times the robot is observed to be broken and the number of times it is not. Let d_r = (c_TT + c_TF) − (c_FT + c_FF) be the difference between the number of times the robot is observed to be broken and the number of times it is not. If d_r > 0, then the robot has been observed to be broken more often than not.
By substituting c_FT + c_FF = (c_TT + c_TF) − d_r into Equation 12, the equation simplifies down to:

⟦broken(robot)⟧_{b_t} = p_r^{d_r} / (p_r^{d_r} + (1 − p_r)^{d_r})    (13)

Following a similar process, the BSQ from the second rule of the parameterized BSQ policy in Figure 1 can be written similarly. Let d_s be the difference between the number of times the spaceship is observed to be broken and the number of times it is not. If d_s > 0, then the spaceship has been observed to be broken more often than not.

⟦broken(ship)⟧_{b_t} = p_s^{d_s} / (p_s^{d_s} + (1 − p_s)^{d_s})    (14)

Note that the observation model used p_r = 0.6 and p_s = 0.75 for the heatmap in Figure 1. The horizontal thresholds can be calculated using Equation 14 and the vertical ones with Equation 13. Therefore, a partition corresponds to a specific pair of values d_r ∈ ℤ and d_s ∈ ℤ, which is equivalent to saying: The objective is to fix the communication channel. If the difference between the number of times the robot has been observed to be broken and the number of times it has not is at least d_r, it should try to repair itself; otherwise, if the difference between the number of times the spaceship has been observed to be broken and the number of times it has not is at least d_s, it should try to repair the ship. Formally, this partition represents the parameter space where

p_r^{d_r − 1} / (p_r^{d_r − 1} + (1 − p_r)^{d_r − 1}) ≤ Θ₁ < p_r^{d_r} / (p_r^{d_r} + (1 − p_r)^{d_r})

and

p_s^{d_s − 1} / (p_s^{d_s − 1} + (1 − p_s)^{d_s − 1}) ≤ Θ₂ < p_s^{d_s} / (p_s^{d_s} + (1 − p_s)^{d_s}),

where all parameter value sets that satisfy both inequalities are similar. Because d_r and d_s are differences between observation counts, the set of possible partitions is finite for finite horizons. We explored solving the Spaceship Repair problem directly using these inequalities. The belief state reflects the probability of each outcome, meaning the main challenge is calculating the average number of timesteps to reach the goal. This can be solved by finding the expected duration of the Gambler's Ruin problem. One possible direction of future work is exploring solving parameterized BSQ policies and gPOMDPs this way.
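The closed forms in Equations 13 and 14 can be sketched numerically; the function name is ours, while the values p_r = 0.6 and p_s = 0.75 follow the text.

```python
# Sketch of Eqs. (13)-(14): the belief that a component is broken depends
# only on the observation-count difference d, via p^d / (p^d + (1-p)^d).

def belief_broken(p, d):
    """Probability the component is broken after being observed broken
    d more times than not (d may be negative)."""
    return p ** d / (p ** d + (1.0 - p) ** d)
```

With no informative observations (d = 0) the belief is uniform at 0.5, and each additional "broken" observation pushes the belief monotonically toward 1; the successive values belief_broken(p, d) for integer d are exactly the finitely many thresholds that separate the partitions.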
I Broader Impacts

The primary positive impact of parameterized BSQ policies is their accessibility to non-experts, allowing them to input their requirements directly into a solver that optimizes task completion while remaining aligned with the user. Moreover, parameterized BSQ policies enable encoding safety constraints with enforceable guarantees over the belief state. Thus, this paper represents an important step in making AI more usable for non-experts, particularly in encoding constraints and preferences, while addressing safety concerns in real-world applications. A potential negative impact of making AI more accessible through parameterized BSQ policies is that they could also be exploited by bad actors who might encode harmful preferences. To mitigate this risk, one approach is to design goals such that negative outcomes inherently prevent goal completion, thereby teaching the agent to avoid these outcomes. Additionally, future work can explore methods for prioritizing certain constraints to ensure that the AI does not align with harmful intentions.

J Additional Limitations

While we discussed some of the limitations of this work in Section 8, one additional limitation is an essential direction of future work: aligning user and problem objectives. For example, in Graph Rock Sample, if the encoded goal for the gPOMDP did not require collecting rocks but the user still wanted to collect one rock of each type using the parameterized BSQ policy in Appendix D.2, PRS would optimize the parameters so that no rock appears worth scanning or sampling, exiting as fast as possible to minimize the expected cost. While this case is an obvious misalignment between the gPOMDP and the parameterized BSQ policy, such misalignments can be more subtle, leading to the optimal policy not behaving as intended. Therefore, future work needs to explore catching misalignments so that the user can understand and fix them.
NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: The claims made in both the abstract and the introduction reflect the paper: we introduced a new framework for user preferences (Section 3), performed a formal analysis of it (Section 4), introduced an algorithm based on its piecewise constant structure (Section 5), and empirically evaluated this algorithm (Section 7).

Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: Please refer to both Section 8 and Appendix J.

Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: Refer to Appendix B and Appendix C for the formal proofs of the lemmas and theorems defined in the paper.

Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems.
The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in the appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: The code has been provided in the supplementary material, and the detailed methodology can be found in Appendix G.

Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general,
releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code used for this paper has been provided in the supplementary material. Guidelines: The answer NA means that paper does not include experiments requiring code. 
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: Refer to Appendix G for a detailed methodology.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: In both Section 7 and Appendix F, all results are reported with standard-deviation error bars, and both sections state explicitly that standard deviation is the measure used.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: Refer to Appendix G.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: We have reviewed the NeurIPS Code of Ethics and confirm that our research conforms to it.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: Refer to Appendix I.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: This paper poses no risk of being misused.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: This paper does not use existing assets.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: Documentation on the code used in this paper is provided with the code.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: This paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: This paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.