# Value Alignment Verification

Daniel S. Brown*¹, Jordan Schneider*², Anca Dragan¹, Scott Niekum²

*Equal contribution. ¹University of California, Berkeley, USA. ²University of Texas at Austin, USA. Correspondence to: Daniel Brown, Jordan Schneider. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important to be able to efficiently evaluate an agent's performance and correctness. In this paper we formalize and theoretically analyze the problem of efficient value alignment verification: how to efficiently test whether the behavior of another agent is aligned with a human's values. The goal is to construct a kind of driver's test that a human can give to any agent which will verify value alignment via a minimal number of queries. We study alignment verification problems with both idealized humans that have an explicit reward function as well as problems where they have implicit values. We analyze verification of exact value alignment for rational agents and propose and analyze heuristic and approximate value alignment verification tests in a wide range of gridworlds and a continuous autonomous driving domain. Finally, we prove that there exist sufficient conditions such that we can verify exact and approximate alignment across an infinite set of test environments via a constant-query-complexity alignment test.

1. Introduction

If we desire autonomous agents that can interact with and assist humans and other agents in performing complex, risky tasks, then it is important that humans can verify that these agents' policies are aligned with what is expected and desired. This alignment is often termed value alignment and is defined in the Asilomar AI Principles (https://futureoflife.org/ai-principles/) as follows: "Highly autonomous AI systems should be designed so that their goals and behaviors can be assured to align with human values throughout their operation." In this paper, we provide a theoretical analysis of the problem of efficient value alignment verification: how to efficiently test whether a robot is aligned with a human's values. Existing work on value alignment often focuses on qualitative evaluation of trust (Huang et al., 2018) or asymptotic alignment of an agent's performance via interactions and active learning (Hadfield-Menell et al., 2016; Christiano et al., 2017; Sadigh et al., 2017). By contrast, our work analyzes the difficulty of efficiently evaluating another agent's correctness by formally defining value alignment and seeking efficient tests for value alignment verification that are applicable when two or more agents have already learned a policy or reward function and want to efficiently test compatibility. To the best of our knowledge, we are the first to define and analyze the problem of value alignment verification. In particular, we propose exact, approximate, and heuristic tests that one agent can use to quickly and efficiently verify value alignment with another agent. As depicted in Figure 1, the goal of value alignment verification is to construct a kind of driver's test that a human can give to any agent which will verify value alignment using only a small number of queries.
We define values in the reinforcement learning sense, i.e., with respect to a reward function: a robot is exactly value aligned with a human if the robot's policy is optimal under the human's reward function. The two agents in a value alignment verification problem (human and robot) may have different communication mechanisms and different value introspection abilities. Thus, the way we analyze value alignment verification will depend on whether the human's and robot's access to their values is explicit (e.g., able to write down a value function or reward function) or implicit (e.g., able to answer preference queries or sample actions from a policy). The most general version of value alignment verification involves a human with implicit values who seeks to verify the value alignment of a robot with implicit values, e.g., a black-box policy. This setting motivates our work; however, it is challenging and we postpone many questions for future research. We follow a ground-up approach where we analyze the difficulty of value alignment verification starting in the most idealized setting, and then gradually relax our assumptions. We first analyze sufficient conditions under which efficient exact value alignment verification is possible in the explicit human, explicit robot setting, where an idealized human tester knows their reward function and so does the robot. When the robot is rational with respect to a reward function that is a linear combination of known features, we show that it is possible to provably verify the alignment of any rational explicit robot via a succinct test consisting of either reward queries, value queries, or trajectory preference queries. We next consider the explicit human, implicit robot setting, where an idealized human knows their reward function, but seeks to efficiently verify the alignment of a black-box policy via action queries. We study heuristics for generating value alignment verification tests in this setting and compare their performance on a range of gridworlds. Finally, in Section 4.5 we study the most general setting of implicit human, implicit robot. We propose an algorithm for approximate value alignment verification in continuous state and action spaces and provide empirical results in a continuous autonomous driving domain where the human can only query the robot for preferences over trajectories. We conclude with a brief discussion of the challenge of designing value alignment verification tests that generalize across multiple MDPs. Somewhat surprisingly, we provide initial theory demonstrating that if the human can create the test environment for the robot, then exact and approximate value alignment across an infinite family of MDPs can be verified by observing the robot's policy in only two carefully constructed test environments. Source code and videos are available at https://sites.google.com/view/icml-vav.

2. Related work

Value Alignment: Most work on value alignment focuses on how to iteratively train a learning agent such that its final behavior is aligned with a user's intentions (Leike et al., 2018; Russell et al., 2015; Amodei et al., 2016). One example is cooperative inverse reinforcement learning (CIRL) (Hadfield-Menell et al., 2016; Fisac et al., 2020; Shah et al., 2020), which formulates value alignment as a game between a human and a robot, where both try to maximize a shared reward function that is only known by the human.
CIRL and other research on value alignment focus on ensuring the learning agent asymptotically converges to the same values as the human teacher, but do not provide a way to check whether value alignment has been achieved. By contrast, we are interested in value alignment verification. Rather than assuming a cooperative setting, we assume the robot being tested has already learned a policy or reward function and the human wants to efficiently verify whether the robot is value aligned.

Reward Learning: Inverse reinforcement learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004; Arora & Doshi, 2018) and active preference learning (Wirth et al., 2017; Christiano et al., 2017; Bıyık et al., 2019) algorithms aim to determine the reward function of a human via offline demonstrations or online queries. In contrast, value alignment verification only seeks to answer the question of whether two agents are aligned, without concern for the exact reward function of the robot. In Section 6 we prove that value alignment verification can be performed in a constant number of queries, whereas active reward learning requires a logarithmic number of queries (Amin & Singh, 2016; Amin et al., 2017). In cases where the human has implicit values, active reward learning can be used to infer the reward function of the human tester, and then this inferred reward function can be used to automatically generate a high-confidence value alignment test. While active reward learning may be a subcomponent of value alignment verification, it focuses on customizing reward inference queries for a single agent, whereas value alignment verification seeks to design a single alignment test that works for all agents.

Machine Teaching: In machine teaching (Zhu et al., 2018), a teacher seeks to optimize a minimal set of training data such that a student (running a particular learning algorithm) learns a desired set of model parameters. Value alignment verification can be seen as a form of machine testing rather than teaching: machine teaching algorithms typically search for a minimal set of training data that will teach a learner a specific model, whereas we seek a minimal set of questions that will allow a tester to verify whether another agent's learned model is correct. Thus, in machine teaching, the teacher provides examples and their answers, but in machine testing the tester provides examples and then queries the testee for the answer. While machine teaching has been applied to sequential decision making domains (Cakmak & Lopes, 2012; Huang et al., 2017; Brown & Niekum, 2019), we are not aware of any work that considers the problem of value alignment verification.

Policy Evaluation: Policy evaluation (Sutton & Barto, 1998) aims to answer the question, "How much return would another agent achieve according to my values?" By focusing on the simpler decision problem, "Is the robot value aligned with the human?", we seek tests that are much more sample-efficient than running a full policy evaluation. Off-policy evaluation (OPE) seeks to perform policy evaluation without executing the testee's policy (Precup, 2000; Thomas et al., 2015; Hanna et al., 2017). However, OPE is often sample-inefficient, provides high-variance estimates, and typically assumes explicit access to the tester's reward function and the tester and testee policies. Value alignment verification is applicable in settings where the policies and reward functions of both agents may be implicit and only accessible indirectly.
Figure 1. The tester provides a reward function either explicitly or implicitly to a test generation algorithm which distills the human's values into a succinct alignment test. This single test is used to efficiently verify the value alignment of any agent.

3. Notation

We adopt notation proposed by Amin et al. (2017), where a Markov Decision Process (MDP) $M$ consists of an environment $E = (S, A, P, S_0, \gamma)$ and a reward function $R : S \to \mathbb{R}$. An environment $E$ consists of a set of states $S$, a set of actions $A$, a transition function $P : S \times A \times S \to [0, 1]$ from state-action pairs to a distribution over next states, a discount factor $\gamma \in [0, 1)$, and a distribution over initial states $S_0$. A policy $\pi : S \times A \to [0, 1]$ is a mapping from states to a distribution over actions. The state and state-action values of a policy $\pi$ are $V^\pi_R(s) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s]$ and $Q^\pi_R(s, a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s, a_0 = a]$ for $s \in S$ and $a \in A$. We denote $V^*_R(s) = \max_\pi V^\pi_R(s)$ and $Q^*_R(s, a) = \max_\pi Q^\pi_R(s, a)$. The expected value of a policy is denoted by $V^\pi_R = \mathbb{E}_{s \sim S_0}[V^\pi_R(s)]$. We assume that the arg max operator returns a set, i.e., $\arg\max_x f(x) := \{x \mid f(y) \leq f(x), \forall y\}$. We let $\pi^*_R \in \arg\max_\pi V^\pi_R$ denote an optimal policy under reward function $R$. We also let $A_R(s) = \arg\max_{a' \in A} Q^*_R(s, a')$ denote the set of all optimal actions at state $s$ under reward function $R$. Thus, $A_R(s) = \{a \in A \mid \pi^*_R(a|s) > 0\}$. As is common (Ziebart et al., 2008; Barreto et al., 2017; Brown et al., 2020), we assume that the reward function is linear under features $\phi : S \to \mathbb{R}^k$, so that $R(s) = w^T \phi(s)$, where $w \in \mathbb{R}^k$. Thus, we use $R$ and $w$ interchangeably. Note that this assumption of a linear reward function is not restrictive, as these features can be arbitrarily complex nonlinear functions of the state and could be obtained via unsupervised learning from raw state observations (Laskin et al., 2020; Brown et al., 2020). Given that $R(s) = w^T \phi(s)$, the state-action value function can be written in terms of discounted expectations over features (Abbeel & Ng, 2004): $Q^\pi_R(s, a) = w^T \Phi^{(s,a)}_\pi$, where $\Phi^{(s,a)}_\pi = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid s_0 = s, a_0 = a]$.

4. Value Alignment Verification

In this section we first explicitly define value alignment and value alignment verification. Next, we discuss how assuming rationality of the robot enables efficient provable value alignment verification. We then examine how to perform (approximate) value alignment verification in tabular MDPs under different forms of test queries, including reward, value, preference, and action queries. We conclude this section by presenting a method for approximate value alignment verification when the tester is a human with implicit values and the state and action spaces are continuous. We first formalize value alignment. Consider two agents: a human and a robot. We will assume that the human has a (possibly implicit) reward function that provides the ground truth for determining value alignment verification of the robot. We define (approximate) value alignment as follows:

Definition 1. Given reward function $R$, policy $\pi$ is $\epsilon$-value aligned in environment $E$ if and only if
$$V^*_R(s) - V^\pi_R(s) \leq \epsilon, \quad \forall s \in S. \qquad (1)$$
Exact value alignment is achieved when $\epsilon = 0$.

We are interested in efficient value alignment verification where we can correctly classify agents as aligned or misaligned within certain error tolerances while keeping the total test size small.
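As a concrete illustration of Definition 1, the following minimal Python sketch (an editorial illustration, not the implementation released with the paper) checks ε-value alignment of a tabular policy by comparing $V^\pi_R$, computed with iterative policy evaluation, against $V^*_R$, computed with value iteration. The array layout (state-only rewards, `(S, A, S)` transitions) and the convergence tolerance are assumptions made for the example.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Compute V*_R for a tabular MDP.
    P: transition probabilities, shape (S, A, S); R: state rewards, shape (S,)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R[:, None] + gamma * P @ V          # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_evaluation(P, R, gamma, pi, tol=1e-8):
    """Compute V^pi_R for a (possibly stochastic) policy pi of shape (S, A)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R[:, None] + gamma * P @ V
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def is_eps_aligned(P, R, gamma, pi, eps=0.0):
    """Definition 1: V*_R(s) - V^pi_R(s) <= eps for every state s."""
    gap = value_iteration(P, R, gamma) - policy_evaluation(P, R, gamma, pi)
    return bool(np.all(gap <= eps + 1e-8))
```

This direct check requires full knowledge of the robot's policy and amounts to exhaustive policy evaluation; the remainder of the section is about verifying alignment far more cheaply.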
Formally, efficient (approximate) value alignment verification is a solution to the following optimization problem:
$$\min_{T \in \mathcal{T}} |T|, \quad \text{s.t.} \quad \forall \pi \in \Pi: \qquad (2)$$
$$\big(\exists s \in S,\ V^*_R(s) - V^\pi_R(s) > \epsilon\big) \Rightarrow \Pr[\pi \text{ passes test } T] \leq \delta_{\text{fpr}},$$
$$\big(\forall s \in S,\ V^*_R(s) - V^\pi_R(s) \leq \epsilon\big) \Rightarrow \Pr[\pi \text{ fails test } T] \leq \delta_{\text{fnr}},$$
where $\mathcal{T}$ is the choice set of possible test queries, $\Pi$ denotes the set of robot policies for which we design the test, $\delta_{\text{fpr}}, \delta_{\text{fnr}} \in [0, 1]$ denote the allowable false positive rate and false negative rate, and $|T|$ denotes the cardinality, or complexity, of the test $T$. If $\epsilon = \delta_{\text{fpr}} = 0$, then we seek the test that enables exact value alignment verification.

4.1. Query Types and Rational Agents

The difficulty of solving Equation (2) can change significantly as a function of $\epsilon$, $\delta_{\text{fpr}}$, $\delta_{\text{fnr}}$, the set of policies $\Pi$ for which we design the test, and the type of queries available in the choice set $\mathcal{T}$. For example, exact alignment verification is impossible in settings where one can only query for actions (see Appendix A.1). Even when possible, achieving high confidence may require multiple action queries at every state. One of the main goals of this paper is to understand under what settings we can achieve efficient, provable value alignment verification. Towards this end, we assume that the robot behaves rationally with respect to some reward function $R'$. A rational agent is one that picks actions to maximize its utility (Russell & Norvig, 2016). Formally, $\pi'$ is a rational agent if:
$$\forall a \in A, \quad \pi'(a|s) > 0 \Rightarrow a \in \arg\max_{a'} Q^*_{R'}(s, a'), \qquad (3)$$
where $\arg\max_{a'} Q^*_{R'}(s, a')$ returns the set of all optimal actions at state $s$ under $R'$. Note that rationality in itself does not restrict the set of policies $\Pi$ for which we can test, since all policies are rational under the trivial all-zero reward function (Ng & Russell, 2000). Rationality also does not limit the choice set $\mathcal{T}$, since a rational agent can answer any question related to its policy or values. The rationality assumption is helpful because it directly connects the behavior of the agent to a reward function: given behavior we can infer rewards and given rewards we can infer behavior. It also allows us to extrapolate robot behavior to new situations, enabling efficient value alignment verification.

4.2. Exact Value Alignment

We start with the idealized query setting of explicit human, explicit robot. In this section we discuss exact value alignment ($\epsilon = 0$, $\delta_{\text{fpr}} = 0$) of a rational robot and review related work by Ng & Russell (2000) on sets of rewards consistent with an optimal policy. Then, in the next section, we examine how to construct verification tests for exact alignment. We assume that both the human and robot know the states' reward features $\phi(s)$, and that the robot acts rationally with respect to a reward function linear in these features. Consider two rational agents with reward functions $R$ and $R'$. Because there are infinitely many reward functions that lead to the same optimal policy (Ng & Russell, 2000), determining that $\exists s \in S$ with $R(s) \neq R'(s)$ does not necessarily imply misalignment. For ease of notation, we define
$$\text{OPT}(R) = \{\pi \mid \pi(a|s) > 0 \Rightarrow a \in \arg\max_{a'} Q^*_R(s, a')\}$$
as the set of all optimal (potentially stochastic) policies in MDP $(E, R)$. Combining Definition 1 and Equation (3) immediately gives us that a rational robot is aligned with a human if all optimal policies under the robot's reward function are also optimal policies under the human's reward function. We formally state this as the following corollary.
Figure 2. (a) Policy $\pi$; (b) $\text{CRS}(\pi)$. An example of the consistent reward set (CRS) for a policy $\pi$ in a simple gridworld and a linear reward function with two binary reward features (white and gray) with reward weights $w_0$ and $w_1$, respectively.

Corollary 1. We have exact value alignment in environment $E$ between a rational robot with reward function $R'$ and a human with reward function $R$ if $\text{OPT}(R') \subseteq \text{OPT}(R)$.

We now review foundational work on IRL by Ng & Russell (2000) which inspires our proposed approach for efficient value alignment verification.

Definition 2. Given an environment $E$, the consistent reward set (CRS) of a policy $\pi$ in environment $E$ is defined as the set of reward functions under which $\pi$ is optimal:
$$\text{CRS}(\pi) = \{R \mid \pi \in \text{OPT}(R)\}. \qquad (4)$$

When $R(s) = w^T \phi(s)$, the CRS is the following polytope:

Corollary 2. (Ng & Russell, 2000; Brown & Niekum, 2019) Given an environment $E$, $\text{CRS}(\pi)$ is given by the following intersection of half-spaces:
$$\{w \in \mathbb{R}^k \mid w^T (\Phi^{(s,a)}_\pi - \Phi^{(s,b)}_\pi) \geq 0, \ \forall a \in \arg\max_{a' \in A} Q^\pi_R(s, a'), \ \forall b \in A, \ \forall s \in S\}. \qquad (5)$$

As an example, consider the gridworld MDP shown in Figure 2. The CRS is an intersection of half-spaces which define all reward functions under which $\pi$ is optimal. Note that the all-zero reward function and the reward function where white cells have zero reward are included; however, not all optimal policies under these reward functions lead to the policy shown in Figure 2a. Thus, we cannot directly use Corollary 2 to verify alignment with a human's optimal policy: Corollary 2 only provides a necessary, but not sufficient, condition for testing whether a reward function $R'$ is value aligned with a policy $\pi$. Consider the example of the trivial all-zero reward function: it is always in the CRS of any policy; however, an agent optimizing the zero reward can end up with any arbitrary policy. Even ignoring the all-zero reward, rewards on the boundaries of the CRS polytope can be consistent with a policy but not value aligned, since they lead to more than one optimal policy, one or more of which may not be optimal under the human's reward function. In the next section we show that if we remove all such edge cases, we can construct an aligned reward polytope (ARP) similar to the CRS, which enables provable value alignment verification. Furthermore, we show that the aligned reward polytope can be used for alignment verification even when the human cannot directly query for the robot's reward function.
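The half-space normals in Corollary 2 (and in the aligned reward polytope constructed in the next section) are differences of expected feature counts $\Phi^{(s,a)}_\pi$. The sketch below is an editorial illustration of how these quantities can be computed for a tabular MDP with a Bellman-style fixed-point iteration, analogous to successor features (Barreto et al., 2017); the array shapes and the convergence tolerance are assumptions.

```python
import numpy as np

def feature_expectations(P, phi, gamma, pi, tol=1e-8):
    """Phi[s, a] = E_pi[ sum_t gamma^t phi(s_t) | s_0 = s, a_0 = a ].
    P: (S, A, S) transition probabilities; phi: (S, k) state features; pi: (S, A) policy."""
    S, A, _ = P.shape
    k = phi.shape[1]
    Phi = np.zeros((S, A, k))
    while True:
        # expected feature counts of following pi from each successor state
        Phi_pi = (pi[:, :, None] * Phi).sum(axis=1)                      # (S, k)
        Phi_new = phi[:, None, :] + gamma * np.einsum('sat,tk->sak', P, Phi_pi)
        if np.max(np.abs(Phi_new - Phi)) < tol:
            return Phi_new
        Phi = Phi_new
```

A row of the CRS polytope in Equation (5) is then simply `Phi[s, a] - Phi[s, b]` for an optimal action `a` and an arbitrary action `b`.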
4.3. Sufficient Conditions for Provable Verification of Exact Value Alignment

We seek an efficient value alignment verification test which enables a human to query the robot to determine exact value alignment as in Corollary 1. The following theorem demonstrates that provable verification of exact value alignment is possible under a variety of query types.

Theorem 1. Under the assumption of a rational robot that shares linear reward features with the human, efficient exact value alignment verification is possible in the following query settings: (1) query access to the reward function weights $w'$, (2) query access to samples of the reward function $R'(s)$, (3) query access to $V^*_{R'}(s)$ and $Q^*_{R'}(s, a)$, and (4) query access to preferences over trajectories.

4.3.1. Case 1: Reward Weight Queries

We first consider the case where the human can directly query the robot for its reward function weights $w'$. While this problem setting is mainly of theoretical interest, we will show that Cases (2) and (3) also reduce to this setting. Querying directly for the robot's reward function is maximally efficient since by definition it only requires a single query. Although one can solve for the optimal policy under a given $w'$ and evaluate it under the human's reward function $w$, this brute-force approach is computationally demanding and must be repeated for each robot that needs to be tested. By contrast, we will prove that there exists a single efficient verification test that does not require solving for the robot's optimal policy and can be used to verify the alignment of any robot. As mentioned in the previous section, the CRS for the human's optimal policy does not provide a sufficient test for value alignment verification. Under the assumption of a rational robot, a sufficient condition for value alignment verification is to test whether the robot's reward function lies in the following set:

Definition 3. Given an MDP $M$ composed of environment $E$ and reward function $R$, the aligned reward set (ARS) is defined as the following set of reward functions:
$$\text{ARS}(R) = \{R' \mid \text{OPT}(R') \subseteq \text{OPT}(R)\}. \qquad (6)$$

Using Definition 3, we prove the following lemma which will enable efficient verification of exact value alignment. As a reminder, we use the notation $Q^\pi_R(s, a) = w^T \Phi^{(s,a)}_\pi$, for $\Phi^{(s,a)}_\pi = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid s_0 = s, a_0 = a]$, and $A_R(s) = \arg\max_{a' \in A} Q^*_R(s, a')$.

Lemma 1. Given an MDP $M = (E, R)$, assuming the human's reward function $R$ and the robot's reward function $R'$ can be represented as linear combinations of features $\phi(s) \in \mathbb{R}^k$, i.e., $R(s) = w^T \phi(s)$, $R'(s) = w'^T \phi(s)$, and given an optimal policy $\pi^*_R$ under $R$, then
$$R' \in \bigcap_{(s,a,b) \in O} \mathcal{H}^R_{s,a,b} \ \Rightarrow \ R' \in \text{ARS}(R), \qquad (7)$$
where $\mathcal{H}^R_{s,a,b} = \{w' \mid w'^T (\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R}) > 0\}$ and $O = \{(s, a, b) \mid s \in S, a \in A_R(s), b \notin A_R(s)\}$.

Proof sketch. First we show $\pi^*_R$ is optimal under $R'$ using the policy improvement theorem. Then, using the uniqueness of the optimal value function, we show that all optimal actions under $R'$ are also optimal actions under $R$, and so all optimal policies under $R'$ are optimal under $R$ (see Appendix A.3 for the full proof).

Lemma 1 provides a sufficient condition for verifying exact value alignment. We now have the necessary theory to construct an efficient value alignment verification test in the explicit human, explicit robot setting. We aim to efficiently verify whether the robot's reward function $R'$ is within the above intersection of half-spaces, which we call the Aligned Reward Polytope (ARP), as this gives a sufficient condition for $R'$ being value aligned with the human's reward function $R$. Our analysis in this section will be useful later when we consider approximate tests for value alignment verification when one or both of the agents have implicit values. (Our results may also be of interest in the analysis of explicit robot, explicit robot teaming, e.g., ad hoc teamwork (Stone et al., 2010), where value alignment verification could provide a framework for verifying whether two robots can work together.) The verification test is constructed by precomputing the following matrix representation of the ARP:
$$\Delta = \begin{bmatrix} \vdots \\ (\Phi^{(s,a)}_{\pi^*_R} - \Phi^{(s,b)}_{\pi^*_R})^T \\ \vdots \end{bmatrix}, \qquad (8)$$
where each row corresponds to a tuple $(s, a, b) \in O$. Thus, $a$ is an optimal action and $b$ is a suboptimal action under $R$, and each row of $\Delta$ represents the normal vector for a strict half-space constraint based on feature count differences between an optimal and a suboptimal action. Note that, using this notation, exact value alignment can now be verified by checking whether $\Delta w' > 0$. This test can be made more efficient by only including non-redundant half-space normal vectors in $\Delta$.
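To make the Case 1 test concrete, here is a minimal sketch (an editorial illustration, not the paper's implementation) that assembles $\Delta$ for a tabular MDP and checks $\Delta w' > 0$. It reuses `value_iteration` and `feature_expectations` from the earlier sketches; the tie-breaking tolerance `q_tol` and the greedy choice of $\pi^*_R$ are assumptions.

```python
import numpy as np

def build_delta(P, phi, gamma, w, q_tol=1e-6):
    """Assemble Delta (Equation 8): rows (Phi^{(s,a)}_{pi*_R} - Phi^{(s,b)}_{pi*_R})^T
    for every state s, optimal action a in A_R(s), and suboptimal action b."""
    R = phi @ w                                      # R(s) = w^T phi(s)
    S, A, _ = P.shape
    V_star = value_iteration(P, R, gamma)            # from the earlier sketch
    Q_star = R[:, None] + gamma * P @ V_star         # (S, A)
    opt = Q_star >= Q_star.max(axis=1, keepdims=True) - q_tol   # mask of A_R(s)
    pi_star = np.eye(A)[Q_star.argmax(axis=1)]       # one deterministic optimal policy pi*_R
    Phi = feature_expectations(P, phi, gamma, pi_star)           # (S, A, k)
    rows = [Phi[s, a] - Phi[s, b]
            for s in range(S)
            for a in range(A) if opt[s, a]
            for b in range(A) if not opt[s, b]]
    return np.array(rows).reshape(-1, phi.shape[1])

def verify_alignment(delta, w_robot):
    """Case 1 test: the robot passes iff every strict half-space constraint holds, i.e. Delta w' > 0
    (in practice a small positive numerical margin may be used)."""
    return bool(np.all(delta @ w_robot > 0))
```

Redundant rows of $\Delta$ can be pruned with a small linear program, as discussed in Appendix G.2 of the paper, without changing the test's outcome.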
In Appendix G.2 we discuss a straightforward linear programming technique to efficiently obtain the minimal set of half-space constraints that define the intersection of half-spaces specified in Lemma 1.

4.3.2. Case 2: Reward Queries

We now consider the case where the tester can query for samples of the robot's reward function $R'(s)$. Verifying alignment via queries to $R'(s)$ can be reduced to Case (1) by querying the robot for $R'(s)$ over a sufficient number of states and then solving a system of linear equations to recover $w'$, since we assume both the human and robot have access to the reward features $\phi(s)$. (Our results also hold for rewards that are functions of $(s, a)$ and $(s, a, s')$.) Let $\Phi$ be defined as the matrix where each row corresponds to the feature vector $\phi(s)^T$ for a distinct state $s \in S$. Then, the number of required queries is equal to $\text{rank}(\Phi)$, since we only need samples corresponding to linearly independent rows of $\Phi$. Thus, if $w' \in \mathbb{R}^k$, in the worst case we only need $k$ samples from the robot's reward function, since we have $\text{rank}(\Phi) \leq k$. If there is noise in the sampling procedure, then linear regression can be used to efficiently estimate the robot's weight vector $w'$. Given $w'$, we can verify value alignment by checking whether $\Delta w' > 0$.

4.3.3. Case 3: Value Function Queries

Given query access to the robot's state and state-action value functions, $w'$ can be determined by noting that $R'(s) = w'^T \phi(s)$ and
$$R'(s) = Q^*_{R'}(s, a) - \gamma \, \mathbb{E}_{s'}[V^*_{R'}(s')]. \qquad (9)$$
Computing the expectation requires enumerating successor states. If we define the maximum degree of the MDP transition function as
$$d_{\max} = \max_{s \in S, a \in A} |\{s' \in S \mid P(s, a, s') > 0\}|, \qquad (10)$$
then at most $d_{\max}$ next-state value queries are needed to evaluate the expectation. Thus, at most $\text{rank}(\Phi)(d_{\max} + 1)$ queries to the robot's value functions are needed to recover $w'$, and the tester can verify value alignment via Case (1). Since $\text{rank}(\Phi) \leq k$ as before, at most $k(d_{\max} + 1)$ queries are required for $w' \in \mathbb{R}^k$.

4.3.4. Case 4: Preference Queries

Finally, we consider the implicit robot setting where the tester can only query the robot for preferences over trajectories $\xi$. Each preference over trajectories, $\xi_A \prec \xi_B$, induces the constraint $w'^T (\Phi(\xi_B) - \Phi(\xi_A)) > 0$, where $\Phi(\xi) = \sum_{i=1}^{n} \gamma^i \phi(s_i)$ is the vector of cumulative discounted reward features along a trajectory. Thus, our choice set of tests $\mathcal{T}$ consists of all trajectory preference queries, and we can guarantee value alignment if we have a test $T$ such that $w'^T (\Phi(\xi_B) - \Phi(\xi_A)) > 0$ for all $(\xi_A, \xi_B) \in T$ implies that $w' \in \bigcap \mathcal{H}^R_{s,a,b}$. We can then construct $\Delta$ in a similar fashion as above, except each row corresponds to a half-space normal resulting from a preference over individual trajectories (see Appendix A.3). Only a logarithmic number of preferences over randomly generated trajectories is needed to accurately represent $\bigcap \mathcal{H}^R_{s,a,b}$ via the intersection of half-spaces formed by the rows of $\Delta$ (Brown et al., 2019).
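Returning to Case 2 above, the reduction to Case 1 can be sketched as follows (an editorial illustration; the greedy state-selection rule and the use of least squares for noisy answers are choices made for the example, and `query_robot_reward` is a hypothetical query interface):

```python
import numpy as np

def independent_states(phi):
    """Greedily pick states whose feature vectors are linearly independent
    (at most rank(Phi) <= k of them are needed)."""
    chosen = []
    for s in range(phi.shape[0]):
        if np.linalg.matrix_rank(phi[chosen + [s]]) == len(chosen) + 1:
            chosen.append(s)
    return chosen

def recover_weights(phi, query_robot_reward):
    """Case 2: recover w' from samples of R'(s) = w'^T phi(s).
    query_robot_reward(s) returns the robot's reported reward at state s."""
    states = independent_states(phi)
    F = phi[states]                                    # (m, k), m = rank(Phi)
    r = np.array([query_robot_reward(s) for s in states])
    # least squares tolerates noisy answers; with exact answers this is a linear solve
    w_robot, *_ = np.linalg.lstsq(F, r, rcond=None)
    return w_robot
```

Given the recovered `w_robot`, alignment is then verified with `verify_alignment(build_delta(P, phi, gamma, w), w_robot)` from the previous sketch; only the component of $w'$ in the row space of $\Phi$ affects $\Delta w'$, and that is exactly what the minimum-norm least-squares solution recovers when the answers are exact.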
In the next section we relax our assumptions on the robot and consider the explicit human, implicit robot setting, where the human seeks to verify value alignment but the robot has a black-box policy that only affords action queries.

4.4. Value Alignment Verification Heuristics

In this case, we resort to heuristics for value alignment, as exact value alignment verification becomes impossible and ε-value alignment verification by directly attempting to solve Equation (2) when $\mathcal{T}$ consists of state-action queries is computationally intractable. As we discuss in detail in Appendix B, a direct optimization approach would involve estimating $\Pi$ by computing the optimal policies for a large number of different reward functions, evaluating each policy under $w$ to determine which policies are not ε-aligned with the tester's reward function $R$, and then solving a combinatorial optimization problem over all possible state queries. Instead, we resort to efficient heuristics. We consider three heuristic alignment tests designed to work in the black-box value alignment verification setting, where the tester can only ask the robot policy action queries over states. Each heuristic test consists of a method for selecting states at which to test the robot by querying for an action from the robot's policy and checking if that action is an optimal action under the human's reward function. Note that querying only a subset of states for robot actions is fundamentally limited to value alignment verification tests with $\delta_{\text{fpr}} > 0$, since we will never know for sure that the agent will not take a different action in that state if we query its policy again. Thus, receiving the "right answer" (an optimal action under the tester's reward $R$) to an action query in a state is not a sufficient condition for exact value alignment. We briefly discuss three action query heuristics, with full details in Appendix C. Figure 3 shows examples of the state queries generated by each heuristic in a simple gridworld.

Critical States Heuristic: Our first heuristic is inspired by the notion of critical states: states where $Q^*_R(s, \pi^*_R(s)) - \frac{1}{|A|} \sum_{a \in A} Q^*_R(s, a) > t$, where $t$ is a user-defined threshold (Huang et al., 2018). We adapt this idea to form a critical state alignment heuristic test (CS) consisting of critical states under the human's reward function $R$. Intuitively, these states are likely to be important; however, many critical states will often be redundant since different states are often important for similar reasons (see Figure 3).
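A minimal sketch of the CS selection rule and the shared action-query test harness (an editorial illustration reusing `value_iteration` from the first sketch; the threshold handling and the hypothetical `query_robot_action` interface are assumptions):

```python
import numpy as np

def critical_states(P, phi, gamma, w, t):
    """States where the best action's Q-value exceeds the mean Q-value by more than t,
    i.e., Q*_R(s, pi*_R(s)) - (1/|A|) sum_a Q*_R(s, a) > t (Huang et al., 2018)."""
    R = phi @ w
    V_star = value_iteration(P, R, gamma)
    Q_star = R[:, None] + gamma * P @ V_star            # (S, A)
    gap = Q_star.max(axis=1) - Q_star.mean(axis=1)
    return np.flatnonzero(gap > t)

def action_query_test(states, query_robot_action, opt_mask):
    """Pass iff the robot's answer at every queried state is optimal under the tester's R.
    opt_mask[s, a] is True iff a is in A_R(s)."""
    return all(opt_mask[s, query_robot_action(s)] for s in states)
```

The same `action_query_test` harness applies to the SCOT and ARP-bb heuristics below; only the state-selection strategy changes.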
Machine Teaching Heuristic: Our next heuristic is based on Set Cover Optimal Teaching (SCOT) (Brown & Niekum, 2019), a machine teaching algorithm that approximates the minimal set of maximally informative state-action trajectories necessary to teach a specific reward function to an IRL agent. Brown & Niekum (2019) prove that the learner will recover a reward function in the intersection of half-spaces that define the CRS (Corollary 2). We generate informative trajectories using SCOT and turn them into alignment tests by querying the robot for its action at each state along the trajectories. SCOT replaces the explicit checking of half-space constraints in Section 4.3 with implicit half-space constraints that are inferred by querying for robot actions at states along trajectories, thus introducing approximation error and the possibility of false positives. Furthermore, generating a test using SCOT is more computationally intensive than generating a test via the CS heuristic; however, unlike CS, SCOT seeks to avoid redundant queries by reasoning about reward features over a collection of trajectories.

ARP Heuristic: Our third heuristic takes inspiration from the definition of the ARP to define a black-box alignment heuristic (ARP-bb). ARP-bb first computes $\Delta$ (see Equation (8)), removes redundant half-space constraints via linear programming, and then only queries for robot actions at the states corresponding to the non-redundant constraints (rows) in $\Delta$. Intuitively, the states queried by ARP-bb are important in the sense that taking different actions reveals important information about the reward function. However, ARP-bb uses single-state action queries to approximate checking each half-space constraint. Thus, ARP-bb trades off smaller query and computational complexity against the potential for larger approximation error.

4.5. Implicit Value Alignment Verification

We now discuss value alignment verification in the implicit human, implicit robot setting. Without an explicit representation of the human's values we cannot directly compute the aligned reward polytope (ARP) via enumeration over states and actions to create an intersection of half-spaces as described above. Instead, we propose the pipeline outlined in Figure 1, where an AI system elicits and distills human preferences and then generates a test which can be used to approximately verify the alignment of any rational agent. As is common for active reward learning algorithms (Bıyık et al., 2019), we assume that the preference elicitation algorithm outputs both a set of preferences over trajectories $\mathcal{P} = \{(\xi_i, \xi_j) : \xi_i \succ \xi_j\}$ and a set of reward weights sampled from the posterior distribution, $\{w_i\} \sim P(w \mid \mathcal{P})$. Given $\mathcal{P}$ and $P(w \mid \mathcal{P})$, the ARP of the human's implicit reward function can be approximated as
$$\bigcap_{(\xi_i, \xi_j) \in \mathcal{P}} \{w \mid w^T (\Phi(\xi_i) - \Phi(\xi_j)) > 0\}, \qquad (11)$$
which generalizes the definition of the ARP to MDPs with continuous states and actions. To see this, note that the intersection of half-spaces in Lemma 1 enumerates over states and pairs of optimal and suboptimal actions under the human's reward $R$ to create the set of half-space normal vectors, where each normal vector is a difference of expected feature counts. This enumeration can only be done in discrete MDPs. Equation (11) approximates the ARP for continuous MDPs via half-space normal vectors constructed from empirical feature count differences obtained from pairs of actual trajectories over continuous states and actions. This test can be further generalized to ε-value alignment (Definition 1) to test agents with bounded rationality or slightly misspecified reward functions. One method of constructing an ε-alignment test is to use the mean posterior reward $\mathbb{E}[w]$ to approximate the value difference of each pair of trajectories, $\mathbb{E}[w]^T (\Phi(\xi_i) - \Phi(\xi_j))$, and only include preference queries with estimated value differences of at least ε. A robot with implicit values is verified as ε-value aligned by test $T$ if its preferences over each pair of trajectories in $T$ match the preferences provided by the human (see Appendix F for more details).
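A sketch of how such an ε-alignment test might be assembled from elicited preferences and posterior samples, and then administered to a robot with implicit values (an editorial illustration; `feature_fn` and `query_robot_preference` are hypothetical interfaces, and the filtering rule follows the mean-posterior heuristic described above):

```python
import numpy as np

def build_eps_test(pref_pairs, feature_fn, w_samples, eps):
    """pref_pairs: list of (xi_i, xi_j) with xi_i preferred to xi_j by the human.
    feature_fn(xi) returns the discounted feature counts Phi(xi).
    Keep only pairs whose estimated value gap under the mean posterior reward is >= eps."""
    w_mean = np.mean(w_samples, axis=0)
    test = []
    for xi_i, xi_j in pref_pairs:
        diff = feature_fn(xi_i) - feature_fn(xi_j)
        if w_mean @ diff >= eps:
            test.append((xi_i, xi_j, diff))
    return test

def robot_passes(test, query_robot_preference):
    """The robot passes iff, for every retained pair, it prefers xi_i over xi_j,
    matching the human's elicited preference."""
    return all(query_robot_preference(xi_i, xi_j) for xi_i, xi_j, _ in test)
```

In practice the retained pairs can also be pruned for redundant half-spaces, as in the explicit setting, so that the administered test stays succinct.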
5. Experiments

We now study the empirical performance of value alignment verification tests, first in the explicit human setting and then in the implicit human setting.

5.1. Value Alignment Verification with Explicit Human

We first study the explicit human setting and analyze the efficiency and accuracy of exact value alignment verification tests and heuristics. We consider querying for the weight vector of the robot (ARP-w), querying for trajectory preferences (ARP-pref), and the action-query heuristics CS, SCOT, and ARP-bb described in Section 4.4.

5.1.1. Case Study

To illustrate the types of test queries found via value alignment verification, we consider two domains inspired by the AI safety gridworlds (Leike et al., 2017). The first domain, island navigation, is shown in Figure 3. Figure 3a shows the optimal policy under the tester's reward function, $R(s) = 50 \cdot \mathbb{1}_{\text{green}}(s) - 1 \cdot \mathbb{1}_{\text{white}}(s) - 50 \cdot \mathbb{1}_{\text{blue}}(s)$, where $\mathbb{1}_{\text{color}}(s)$ is an indicator feature for the color of the grid cell. Shown in Figures 3b and 3c are the two preference queries generated by ARP-pref, which consist of pairwise trajectory queries (black is preferable to orange under $R$). Preference query 1 verifies that the robot would rather move to the terminal state (green) than visit more white cells. Preference query 2 verifies that the robot would rather visit white cells than blue cells. Figures 3d, 3e, and 3f show action query tests designed using the ARP-bb, SCOT, and CS heuristics. The robot is asked which action its policy would take in each of the states marked with a question mark. To pass the test, the agent must respond with an optimal action under the human's policy in each of these states. ARP-bb chooses two states based on the half-space constraints defined by the expected feature counts of $\pi^*_R$, resulting in a small but myopic test. SCOT queries over a maximally informative trajectory that starts near the water, but includes several redundant states. CS only reasons about Q-value differences and asks many redundant queries (see Appendix D for more results).

5.1.2. Sensitivity Analysis

We also analyze the accuracy and efficiency of value alignment verification in the explicit human, explicit robot and explicit human, implicit robot settings for verifying exact value alignment. We analyze performance across a suite of random grid navigation domains with varying numbers of states and reward features. We summarize our results here and refer the reader to Appendix E for more details. As expected, ARP-w and ARP-pref result in perfect accuracy. SCOT uses fewer samples than the CS heuristic while achieving nearly perfect accuracy. ARP-bb results in high-accuracy tests, but generates more false positives than SCOT. CS has significantly higher sample cost than the other methods and requires careful tuning of the threshold $t$ to obtain good performance. Our results indicate that in the implicit robot setting, ARP-pref and ARP-bb provide highly efficient verification tests. Among the action query heuristics, SCOT achieved the highest accuracy, with larger sample complexity than ARP-bb but lower sample complexity than CS.

5.2. Value Alignment Verification with Implicit Human

We next analyze approximate value alignment verification in the continuous autonomous driving domain from Sadigh et al. (2017), shown in Figure 4a, where we study the implicit human, implicit robot setting and consider verifying ε-value alignment. As depicted in Figure 1, we analyze the use of active preference elicitation (Bıyık et al., 2019) to perform value alignment verification with implicit human values. We first analyze implicit value alignment verification using preference queries to a synthetic human oracle with an unobserved ground-truth reward function $R$. We collected varying numbers of oracle preferences and computed a non-redundant ε-alignment test as described in Section 4.5 and Appendix G.2. Tests were evaluated for accuracy relative to a set of test reward weights. See Appendix G for experimental parameters and details of the testing reward generation protocol.
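The evaluation above scores a test against a pool of held-out candidate reward weights. A minimal sketch of how such scoring might look (an editorial illustration; the paper's exact protocol is in Appendices E and G, and the ground-truth alignment labels for the candidates are assumed to be given):

```python
import numpy as np

def weight_passes(test, w_candidate):
    """An explicit candidate weight vector passes iff it satisfies every half-space in the test.
    `test` holds (xi_i, xi_j, diff) tuples with diff = Phi(xi_i) - Phi(xi_j)."""
    return all(w_candidate @ diff > 0 for _, _, diff in test)

def test_metrics(passed, aligned):
    """Accuracy, false positive rate, and false negative rate of a verification test.
    passed[i]: candidate i passed the test; aligned[i]: candidate i is truly (eps-)aligned."""
    passed, aligned = np.asarray(passed, bool), np.asarray(aligned, bool)
    accuracy = np.mean(passed == aligned)
    fpr = np.mean(passed[~aligned]) if np.any(~aligned) else 0.0   # misaligned candidates that pass
    fnr = np.mean(~passed[aligned]) if np.any(aligned) else 0.0    # aligned candidates that fail
    return accuracy, fpr, fnr
```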
Figure 4b displays the results of the synthetic human experiments. The best tests achieved 100% accuracy. Although collecting additional synthetic human queries consistently improved verification accuracy, above 50 human queries the accuracy gains were minimal, demonstrating the potential for human-in-the-loop preference elicitation. Furthermore, the generated verification tests were often succinct: one of the tests with perfect accuracy required only six questions out of the original 100 elicited preferences. Additional experiments and results are detailed in Appendix G, including false positive and false negative rate plots and different methods of estimating the value gap of questions. We also ran an initial pilot study using real human preference labels which resulted in a verification test that achieves 72% accuracy.

Figure 3. Examples of exact and heuristic value alignment verification tests for an island navigation gridworld (Leike et al., 2017). Panels: (a) optimal policy, (b) preference query 1, (c) preference query 2, (d) ARP-bb queries, (e) SCOT queries, (f) CS queries. Only two preference queries, (b) and (c), are required to provably verify any robot policy (black should be preferred over orange). Panels (d)-(f) show heuristic tests that query for actions at individual states.

Figure 4. Implicit human, implicit robot: ε-value alignment verification in a continuous autonomous driving domain. (a) Driving domain: a preference query in which the human is asked whether they prefer the blue or the red trajectory with respect to the trajectory of the white car. (b) Alignment test accuracy vs ε: 80% confidence intervals on verification accuracy for different values of ε and different human query budgets n (10, 25, 50, 100), averaged over ten seeds.

6. Generalization to Multiple MDPs

Up to this point, we have considered designing value alignment tests for a single MDP; however, it is also interesting to design value alignment verification tests that enable generalization, e.g., if a robot passes the test, then this verifies value alignment across many different MDPs. As a step towards this goal, we present a result in the explicit human, explicit robot setting where the human can construct testing environments. We consider the idealized setting of an omnipotent tester that is able to construct a set of arbitrary test MDPs and can query directly for the entire optimal policy of the robot in each MDP. This tester aims to verify value alignment across an infinite family of environments that share the same reward function. Our result builds on prior analysis of the related problem of omnipotent active reward learning. Amin & Singh (2016) prove that an active learner can determine the reward function of another agent within ε precision via O(log |S| + log(1/ε)) policy queries. By contrast, we prove in the following theorem that the sample complexity of ε-value alignment verification is only O(1) (see Appendix A.5 for the proof).

Theorem 2. Given a testing reward R (not necessarily linear in known features), there exists a two-query test (complexity O(1)) that determines ε-value alignment of a rational agent over all MDPs that share the same state space and reward function R, but may differ in actions, transitions, discount factors, and initial state distribution.

We also note that if the human has access a priori to a finite set of MDPs over which they want to verify value alignment, then our results from earlier sections on exact, heuristic, and approximate value alignment could be extended to this setting. For example, we can define a generalized aligned reward polytope for a family of MDPs as the intersection of the aligned reward polytopes for each individual MDP.
This intersection of half-spaces provides a sufficient condition for testing value alignment across the entire family of MDPs. We leave this as a promising area for future work.

7. Discussion

We analyzed the problem of efficient value alignment verification: how to generate an efficient test that can be used to verify the value alignment of another agent with respect to the human's reward function. We developed a theoretical foundation for value alignment verification and proved sufficient conditions for verifying the alignment of a rational agent under explicit and implicit values for both the human and robot. Our empirical analysis demonstrates that action query heuristics can achieve low sample complexity and high accuracy while only requiring black-box access to an agent's policy. When the human has only implicit access to their values, we analyzed active preference elicitation algorithms as a potential means to automatically construct an approximate value alignment test that can efficiently test another agent with implicit values. The biggest assumption we make is that the reward function is a linear combination of features shared by both the human and robot. We would like to emphasize three points. First, on representing rewards as linear combinations of features, note that the features can be arbitrarily complex and can even be features learned via a deep neural network which are then combined by a final linear layer (Brown et al., 2020). Second, there is the issue of the human and the robot sharing the features. The reason this might actually be a reasonable assumption is that recent techniques enable robots to detect when they cannot explain human input with their existing features and ask for new input specific to the missing features (Bobu et al., 2020; 2021), thereby explicitly aligning the robot's reward representation with the human's reward representation. Third, even if the features are not perfectly aligned, our approach can still provide value by learning a linear combination of features that approximates the human's reward function to design an alignment test. Our pilot study with the driving simulation hints that this might be the case, as it gives evidence that value alignment verification is possible when using real human preferences that are determined from pixel-based observations. Furthermore, the only true requirement for generating value alignment tests that query for robot actions or preferences is for the tester to have a reward function that can be approximated by a linear combination of features. Thus, these tests could possibly be applied in cases where a human uses a linear combination of learned or human-designed features to construct an approximate alignment test for robots that have pixel-based policies and/or rewards. In conclusion, we believe that value alignment verification is an important problem of practical interest, as it seeks to enable humans to verify and build trust in AI systems. It may also be possible for a robot to use value alignment verification to verify the performance of a human, e.g., AI-generated assessment tests.
Future work also includes relaxing rationality assumptions, analyzing value alignment verification tests in more complex domains, and performing a full user study to better analyze the use of human preferences for alignment verification.

Acknowledgements

We would like to thank the anonymous reviewers and Stephen Giguere for their suggestions for improving the paper. This work was funded in part by NSF, AFOSR, ARO, NSF NRI SCHOOL, and ONR YIP.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.

Amin, K. and Singh, S. Towards resolving unidentifiability in inverse reinforcement learning. arXiv preprint arXiv:1601.06569, 2016.

Amin, K., Jiang, N., and Singh, S. Repeated inverse reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1815-1824, 2017.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Arora, S. and Doshi, P. A survey of inverse reinforcement learning: Challenges, methods and progress. arXiv preprint arXiv:1806.06877, 2018.

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4055-4065, 2017.

Bıyık, E. and Sadigh, D. Batch active preference-based learning of reward functions. In Conference on Robot Learning (CoRL), PMLR, 2018.

Bıyık, E., Palan, M., Landolfi, N. C., Losey, D. P., and Sadigh, D. Asking easy questions: A user-friendly approach to active reward learning. In Conference on Robot Learning (CoRL), 2019.

Bobu, A., Bajcsy, A., Fisac, J. F., Deglurkar, S., and Dragan, A. D. Quantifying hypothesis space misspecification in learning from human-robot demonstrations and physical corrections. IEEE Transactions on Robotics, 36(3):835-854, 2020.

Bobu, A., Wiggert, M., Tomlin, C., and Dragan, A. D. Feature expansive reward learning: Rethinking human input. In Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, pp. 216-224, 2021.

Brown, D. S. and Niekum, S. Machine teaching for inverse reinforcement learning: Algorithms and applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 7749-7758, 2019.

Brown, D. S., Goo, W., and Niekum, S. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning (CoRL), 2019.

Brown, D. S., Niekum, S., Coleman, R., and Srinivasan, R. Safe imitation learning via fast Bayesian reward inference from preferences. In International Conference on Machine Learning, 2020.

Cakmak, M. and Lopes, M. Algorithmic and human teaching of sequential decision tasks. In AAAI, 2012.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299-4307, 2017.

Fisac, J. F., Gates, M. A., Hamrick, J. B., Liu, C., Hadfield-Menell, D., Palaniappan, M., Malik, D., Sastry, S. S., Griffiths, T. L., and Dragan, A. D. Pragmatic-pedagogic value alignment. In Robotics Research, pp. 49-57. Springer, 2020.

Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems 29, pp. 3909-3917, 2016.

Hanna, J. P., Stone, P., and Niekum, S.
Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, 2017.

Huang, S. H., Held, D., Abbeel, P., and Dragan, A. D. Enabling robots to communicate their objectives. In Robotics: Science and Systems, 2017.

Huang, S. H., Bhatia, K., Abbeel, P., and Dragan, A. D. Establishing appropriate trust via critical states. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3929-3936. IEEE, 2018.

Laskin, M., Srinivas, A., and Abbeel, P. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp. 5639-5650. PMLR, 2020.

Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.

Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., and Legg, S. Scalable agent alignment via reward modeling: A research direction. arXiv preprint arXiv:1811.07871, 2018.

Ng, A. Y. and Russell, S. J. Algorithms for inverse reinforcement learning. In ICML, pp. 663-670, 2000.

Paulraj, S., Sumathi, P., et al. A comparative study of redundant constraints identification methods in linear programming problems. Mathematical Problems in Engineering, 2010.

Precup, D. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.

Russell, S., Dewey, D., and Tegmark, M. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105-114, 2015.

Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.

Sadigh, D., Dragan, A. D., Sastry, S. S., and Seshia, S. A. Active preference-based learning of reward functions. In Proceedings of Robotics: Science and Systems (RSS), July 2017. doi: 10.15607/RSS.2017.XIII.053.

Shah, R., Freire, P., Alex, N., Freedman, R., Krasheninnikov, D., Chan, L., Dennis, M., Abbeel, P., Dragan, A., and Russell, S. Benefits of assistance over reward learning. Workshop on Cooperative AI (Cooperative AI @ NeurIPS), 2020.

Stone, P., Kaminka, G. A., Kraus, S., Rosenschein, J. S., et al. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In AAAI, 2010.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In AAAI, pp. 3000-3006, 2015.

Wirth, C., Akrour, R., Neumann, G., Fürnkranz, J., et al. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1-46, 2017.

Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. An overview of machine teaching. arXiv preprint arXiv:1801.05927, 2018.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, 2008.