# Distributionally Robust Imitation Learning

Mohammad Ali Bashiri, Brian D. Ziebart, Xinhua Zhang
Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607
{mbashi4, zhangx, bziebart}@uic.edu

Abstract

We consider the imitation learning problem of learning a policy in a Markov Decision Process (MDP) setting where the reward function is not given, but demonstrations from experts are available. Although the goal of imitation learning is to learn a policy that produces behaviors nearly as good as the experts' for a desired task, the assumption that demonstrated behaviors are consistently optimal is often violated in practice. Finding a policy that is distributionally robust against noisy demonstrations based on an adversarial construction potentially solves this problem by avoiding optimistic generalizations of the demonstrated data. This paper studies Distributionally Robust Imitation Learning (DROIL) and establishes a close connection between DROIL and Maximum Entropy Inverse Reinforcement Learning. We show that DROIL can be seen as a framework that maximizes a generalized concept of entropy. We develop a novel approach to transform the objective function into a convex optimization problem over a polynomial number of variables for a class of loss functions that are additive over state and action spaces. Our approach lets us optimize both stationary and non-stationary policies and, unlike prevalent previous methods, it does not require repeatedly solving an inner reinforcement learning problem. We experimentally show the significant benefits of DROIL's new optimization method on synthetic data and a highway driving environment.

1 Introduction

We consider the imitation learning setting of learning to perform a task based only on demonstrations that are provided by experts. There are two main approaches often considered for this learning problem: Behavioral Cloning [19] and Inverse Reinforcement Learning [18, 1, 24, 14]. In behavioral cloning, the learner attempts to learn a policy in a supervised learning manner, in which a direct mapping from states to actions is estimated from the demonstrated trajectories. Behavioral cloning, while simple, often generalizes poorly when attempting to predict goal-directed sequential decisions due to compounding errors caused by covariate shift, and tends to succeed only when given large amounts of data [22, 23]. Alternatively, inverse reinforcement learning (IRL) rationalizes demonstrated trajectories by estimating a reward function that makes the expert's policy optimal. The problem of determining the reward function is inherently ill-posed, since a single policy can be optimal for multiple reward functions. To obtain a unique solution in IRL, many methods have been proposed, with the maximum entropy principle [31, 32] and margin maximization [20] being two widely employed approaches. Maximum (causal) entropy IRL [32], in particular, seeks an entropy-maximizing distribution over sequences that matches expected feature counts [1] with those observed from demonstrations. For imitation learning in very high dimensional and continuous spaces, where function approximations such as deep neural networks are often used, IRL methods have generally been less efficient than behavioral cloning methods [13] since they require reinforcement learning as an inner loop.
Recent adversarial IRL methods [10, 9, 25], however, seem more likely to be effective.

A common assumption in imitation learning is that all expert behaviors have the same level of trustworthiness and are optimal or near-optimal [30]. However, noisy expert behaviors commonly violate this optimality assumption in practice. Thus, relying heavily on the optimality of the expert's behavior may degrade an imitation learner, making it prone to failure [30]. Existing methods have proposed to inject noise into the expert's policy demonstrations to obtain a more robust policy [15], or to train a discriminator that distinguishes between expert trajectories and generated trajectories and learns reward functions that are robust to changes in dynamics [11]. An alternative approach to obtaining a robust policy, which we adopt in this work, is to search for a policy that is distributionally robust given the training data. For this purpose, the learner's policy is obtained by solving a game between a learner and an adversary [5], where the learner seeks to minimize a loss defined between them, and the adversary seeks to maximize this loss by choosing a distribution over policies subject to a set of constraints that match statistics from the training data. This approach leverages two uncertainty sets, as opposed to typical Distributionally Robust Optimization (DRO) methods [4, 6] in which the uncertainty set is only defined over the adversary (the demonstrator-estimator) and the learner's policy is assumed to have a specific parametric form. Previous work [5] proposed to solve imitation learning under an adversarial formulation using the Double Oracle method [17] for the corresponding optimization. However, this method may take time that is exponential in the size of the state-action space and the decision horizon.

In this paper, we connect Distributionally Robust Imitation Learning (DROIL) and Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) by showing that MaxEnt IRL is a special case of DROIL when a certain loss function and policy description are used. We then show that DROIL is a framework for maximizing a general entropy function that is defined based on a particular loss of interest. We then cast DROIL's objective function into a convex optimization problem over a polynomial number of variables, which is simpler to understand and implement and also significantly improves the training time. We extend the formulation to stationary policies, which enables us to experimentally show the benefits of learning a robust policy in a highway driving simulation.

2 Preliminaries

We model sequential decision making problems using discrete Markov Decision Processes (MDPs). An MDP $M$ is specified by a tuple $(\mathcal{S}, \mathcal{A}, \Gamma, R, \gamma, P_0)$ where: $\mathcal{S}$ and $\mathcal{A}$ are the state and action (control input) spaces, assumed to be finite ($|\mathcal{S}|, |\mathcal{A}| < \infty$); $P_0$ is the initial state distribution; $\Gamma$ represents the transition probabilities, i.e., the distribution over next states upon taking action $a$ in state $s$, $P(s' \mid s, a)$; $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function; and $\gamma \in (0, 1]$ is the discount factor. We assume feature vectors $\phi : \mathcal{S} \times \mathcal{A} \to [0, 1]^d$ over state-action pairs that capture the most salient properties distinguishing preferred and non-preferred trajectories, such that $R$ can be written as a (linear) function of these feature vectors given a reward vector $w$: $R(s, a) = w^\top \phi(s, a)$. A policy $\pi \in \Pi$ specifies the probability of taking action $a$ in state $s$, $\pi(a \mid s) = P(a \mid s)$, and $\Pi$ represents the set of all possible stochastic policies.
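To make these objects concrete, below is a minimal sketch of a tabular MDP container in Python. It is not the paper's implementation; the class name `TabularMDP` and the array layouts (transitions as an `(S, A, S)` tensor, features as an `(S, A, d)` tensor) are illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularMDP:
    """Finite MDP (S, A, Gamma, R, gamma, P0) with linear rewards R(s, a) = w . phi(s, a)."""
    P: np.ndarray      # transition probabilities Gamma: P[s, a, s'] = P(s' | s, a), shape (S, A, S)
    phi: np.ndarray    # state-action features with entries in [0, 1], shape (S, A, d)
    p0: np.ndarray     # initial state distribution P0, shape (S,)
    gamma: float       # discount factor in (0, 1]

    def reward(self, w: np.ndarray) -> np.ndarray:
        # R(s, a) = w . phi(s, a) for every state-action pair, returned as an (S, A) array
        return self.phi @ w
```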
Demonstrations by an expert are given as a set of trajectories $D = \{\tau^1, \dots, \tau^m\}$. A trajectory is a sequence of state-action pairs $\tau = (s_0, a_0, \dots, s_T, a_T)$ over horizon $T$. From an optimization perspective, imitation learning with a general loss finds a policy $\hat\pi$ that minimizes the difference between the learner's behavior and the expert's behavior $\pi_E$:

$$\hat\pi \in \operatorname*{argmin}_{\pi} L(\pi, \pi_E),$$

where $L$ measures the dissimilarity between the two policies' behaviors. (We overload $L$ for any type of loss in different contexts, e.g., losses between deterministic policies or between states.) Since only sampled demonstrations of a policy are available for training, it is essential to quantify the behavior of a policy. One approach, commonly used in behavioral cloning, is to measure the marginal distribution of states and actions, $P(s, a)$, induced by a policy. Note that in the case of infinite-horizon MDPs, the expected (discounted) number of visits to state-action pairs is used. A more desirable measure for long horizons is the expectation of trajectory features [1]. This approach is often used in inverse reinforcement learning, where a reward function of the features is learned. The expected (discounted if $\gamma \in (0, 1)$) feature of a policy $\pi$ (similar to [1]) is defined as:

$$\mu_\pi = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t \phi(s_t, a_t) \,\middle|\, \pi, \Gamma\right]. \quad (1)$$

3 Robust Imitation Learning and Maximum Entropy

Foundational work [27, 12] has developed a close relationship between robust Bayes decisions, maximum entropy, and minimizing worst-case expected loss, also known as adversarial learning. Several works have followed this framework of robustness for different supervised learning problems, such as cost-sensitive classification [2], multivariate loss prediction [29], ordinal regression [7], and graphical models [8]. For the imitation learning setting, [5] applied the adversarial learning framework to the inverse reinforcement learning problem for a restricted class of loss functions and proposed a double oracle algorithm [17] to solve the corresponding optimization problem. In the following, we define a general framework of distributional robustness for imitation learning to obtain the policy that robustly minimizes an imitative loss:

Definition 1. Given an imitative loss function $L$ that measures the distance between two policies' behaviors, Distributionally Robust Imitation Learning (DROIL) is defined as a two-player zero-sum game between the learner and the demonstrator, in which each player chooses a distribution over control policies, constructing stochastic policies $\hat\pi$ and $\check\pi \in \Xi$, and the players then receive the loss between the behaviors, $L(\hat\pi, \check\pi)$, as their payoff. The minimax strategy for the learner is given by:

$$\min_{\hat\pi} \max_{\check\pi \in \Xi} L(\hat\pi, \check\pi), \quad (2)$$

where $\Xi$ is a set of constraints characterized by the demonstrated data. Generally, $\Xi$ can take the form of matching the moments in (1), as is common in inverse reinforcement learning:

$$\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t \phi(s_t, a_t) \,\middle|\, \check\pi, \Gamma\right] = \mu \triangleq \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t \phi(s_t, a_t) \,\middle|\, \pi_E, \Gamma\right],$$

where $\pi_E$ represents the policy from which the demonstrated trajectories are generated. This formulation assumes that, except for certain properties of the limited samples of available demonstrated behavior, the demonstrator's policy is the worst-case possible for the learner. This approach avoids generalizing from available demonstrations in an optimistic manner that may be unrealistic and lead to a policy that does not work well in practice, especially in situations where there is noise in the demonstrated behaviors.
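As an illustration of how the moment vector $\mu$ in Definition 1 is estimated in practice, the sketch below computes the empirical (discounted) feature expectation from demonstrated trajectories. The function name and the encoding of trajectories as lists of `(state, action)` index pairs are assumptions for this example.

```python
import numpy as np

def empirical_feature_expectation(trajectories, phi, gamma):
    """Estimate mu = E[sum_t gamma^t phi(s_t, a_t)] from demonstrated trajectories.

    trajectories: list of demonstrations, each a list of (state, action) index pairs
    phi:          feature tensor of shape (S, A, d)
    gamma:        discount factor in (0, 1]
    """
    mu = np.zeros(phi.shape[-1])
    for tau in trajectories:
        for t, (s, a) in enumerate(tau):
            mu += (gamma ** t) * phi[s, a]
    return mu / len(trajectories)   # average over the m demonstrated trajectories
```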
The minimax formulation in Definition 1 is closely related to the principle of maximum entropy, which is used in Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) [31]. MaxEnt IRL provides a probabilistic approach under the constraint of matching the reward value of demonstrated behavior to resolve the ambiguity in choosing a distribution over decisions. Under this model, trajectories with equivalent rewards have equal probabilities, and trajectories with higher rewards are exponentially more preferred, according to the following:

$$P(\zeta_i \mid w) = \frac{1}{Z(w)} e^{w^\top \phi_{\zeta_i}}, \quad (3)$$

where $\zeta_i$ and $\phi_{\zeta_i}$ represent a trajectory and its corresponding sum of features. Note that a distribution over trajectories (paths) is an alternative policy description that provides a stochastic policy, where the probability of an action is weighted by the expected exponentiated rewards of all paths that begin with that action: $P(a) \propto \sum_{\zeta : a \in \zeta_{t=0}} P(\zeta \mid w)$.

To show the connection between DROIL and MaxEnt IRL, we employ a tool from the general theory of exponential families [3] showing that, for certain classes of two-player zero-sum games, there exists a parametric distribution for the minimax strategy, as shown in Lemma 1.

Lemma 1 (Barndorff-Nielsen [3]). Let $p \in \Xi$ be a probability distribution over space $X$, where $\Xi = \{p : \mathbb{E}_p[T(X)] = C\}$ describes a mean-value constraint, $T(X)$ represents a vector-valued statistic, and $C$ is a constant. Let $q$ also be a distribution chosen from the set of all probability mass functions. For the maximum entropy distribution, $\max_{p \in \Xi} \min_{q} \mathbb{E}_p[-\log q(X)]$, the solution $p^* = q^*$ exists as a parameterized function of the form $p^* = \exp\{\alpha_0 + \alpha^\top T(X)\}$ with parameters $\alpha_0$ and $\alpha$.

Equipped with Lemma 1, we develop a connection between DROIL and MaxEnt IRL in Theorem 1.

Theorem 1. The stochastic policy $P(\zeta_i \mid w)$ obtained from MaxEnt IRL in Equation (3) is obtained from the DROIL minimax strategy in Definition 1 when the logarithmic loss is used.

Proof. We construct a stochastic policy in Definition 1 as a probability distribution over all possible trajectories $\{\zeta_1, \dots, \zeta_M\}$. Let $\hat\pi = P(\zeta)$ and $\check\pi = Q(\zeta)$ be the probability distributions of the learner and the demonstrator, respectively. The expected feature of the demonstrator can now be written as $\mu_{\check\pi} = \sum_i Q(\zeta_i)\,\phi_{\zeta_i}$, which matches the mean-value constraint in Lemma 1; therefore, for the logarithmic loss, $P^*(\zeta_i) = Q^*(\zeta_i) = \exp\{w_0 + w^\top \phi_{\zeta_i}\}$.

We can generalize beyond the logarithmic loss and the resulting maximum entropy policy (Theorem 1) to other losses of interest by using generalized entropy functions, $H(P) := \inf_Q \mathbb{E}[L(P, Q)]$. Proposition 1 describes the relation between DROIL and generalized maximum entropy.

Proposition 1. For any policy description and loss function for which DROIL in Definition 1 has a Nash equilibrium, $\hat\pi$ is a robust action and $\check\pi$ is the maximizer of a generalized entropy function.

Proposition 1 provides a general approach to resolving the ambiguity of the matching constraints, where many policies lead to the same feature counts, by choosing either a policy $\check\pi$ that does not exhibit any additional preferences beyond matching feature expectations with respect to a loss of interest $L$ (maximum generalized entropy) or a policy $\hat\pi$ that minimizes the worst-case expected loss. This can be seen as Maximum Generalized Entropy Inverse Reinforcement Learning, where the choice of loss function is not restricted to the logarithmic loss.
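For a problem small enough that trajectories can be enumerated, Equation (3) is simply a softmax over summed trajectory features. The minimal sketch below illustrates the distribution under that assumption (it does not perform the dynamic-programming computation used in practice for MaxEnt IRL); the function name and the `(M, d)` feature-sum matrix are hypothetical.

```python
import numpy as np

def maxent_trajectory_distribution(traj_features, w):
    """P(zeta_i | w) = exp(w . phi_{zeta_i}) / Z(w) over an enumerable set of trajectories.

    traj_features: array of shape (M, d), one summed feature vector per trajectory zeta_i
    w:             reward weight vector of shape (d,)
    """
    logits = traj_features @ w
    logits -= logits.max()        # subtract the maximum for numerical stability
    p = np.exp(logits)
    return p / p.sum()            # normalize by the partition function Z(w)
```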
We extend the following lemma, which describes a class of loss functions and policies for which the solution of the game in Definition 1 exists and thus Proposition 1 can be applied.

Lemma 2. For the game in Definition 1, where the learner $\hat\pi$ and demonstrator $\check\pi$ simultaneously choose a stochastic policy, if the loss function is additive and the payoff can be written as a bilinear function, $\sum_{i=1}^{n}\sum_{j=1}^{m} l_{ij}\,\hat\pi_i\,\check\pi_j$, then the game has a solution and a minimax (maximin) strategy producing the solution, as long as $m$ and $n$ are finite.

Proof. In his generalized minimax theorem, von Neumann [28] shows that bilinear games with arbitrary nonempty, closed, bounded convex sets of actions have saddle points as long as $m$ and $n$ are finite. Assuming the demonstrated data are generated from a policy (optimality is not necessary), the convex moment-matching constraint in Definition 1 keeps $\Xi$ closed, bounded, and non-empty.

Example 1. One illustrative example is to let each player's pure strategy be a deterministic policy $\delta$ and define the loss as the expectation, over the players' mixtures, of a given loss between individual deterministic policies. A mixed strategy then represents a stochastic policy $\pi$, which is a probability distribution over the set of all deterministic policies, and Equation (2) can be written as a bilinear game with payoff $\sum_{i=1}^{n}\sum_{j=1}^{m} l_{ij}\, P(\hat\delta_i)\, P(\check\delta_j)$, where $l_{ij} = L(\hat\delta_i, \check\delta_j)$.

Lemma 2 describes a restricted class of loss functions (to which, e.g., the logarithmic loss does not belong) for which the resulting policy is the maximizer of a more flexible concept of entropy. A class of loss functions of interest are those that are additive over state and action spaces. Similar to Example 1, previous work [5] constructs a policy using a distribution over deterministic policies and, for additive loss functions, writes Equation (2) as a bilinear game and solves the corresponding optimization. We also focus on this class of loss functions. However, we provide an efficient optimization algorithm for this class that avoids the exponential worst-case time complexity of the previous double oracle [17] approach.

4 Learning and Inference

To solve the optimization problem in Equation (2), one needs to specify the policy description $\pi$ and the distance measure $L$ between two policies' behaviors. The choice of $\pi$ and $L$ can result in different algorithmic approaches.

4.1 Double Oracle

| | $\check\delta_1$ | $\check\delta_2$ | $\cdots$ | $\check\delta_j$ |
|---|---|---|---|---|
| $\hat\delta_1$ | $L(\hat\delta_1, \check\delta_1) + \psi(\check\delta_1)$ | $L(\hat\delta_1, \check\delta_2) + \psi(\check\delta_2)$ | $\cdots$ | $L(\hat\delta_1, \check\delta_j) + \psi(\check\delta_j)$ |
| $\hat\delta_2$ | $L(\hat\delta_2, \check\delta_1) + \psi(\check\delta_1)$ | $L(\hat\delta_2, \check\delta_2) + \psi(\check\delta_2)$ | $\cdots$ | $L(\hat\delta_2, \check\delta_j) + \psi(\check\delta_j)$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ |
| $\hat\delta_i$ | $L(\hat\delta_i, \check\delta_1) + \psi(\check\delta_1)$ | $L(\hat\delta_i, \check\delta_2) + \psi(\check\delta_2)$ | $\cdots$ | $L(\hat\delta_i, \check\delta_j) + \psi(\check\delta_j)$ |

Table 1: Payoff matrix of $G(\hat\pi, \check\pi \mid w)$ with loss function $L$, deterministic policies $\delta$, and Lagrangian potentials $\psi$.

In [5], the authors construct the learner's and the demonstrator's stochastic policies as mixtures of deterministic non-stationary policies. Concretely, let $\Upsilon = \{\delta_1, \delta_2, \dots\}$ be the set of all possible non-stationary deterministic policies; then $\hat\pi := P(\{\hat\delta \in \Upsilon\})$ and $\check\pi := P(\{\check\delta \in \Upsilon\})$. Note that $|\Upsilon| \in O(|\mathcal{A}|^{|\mathcal{S}|T})$, which is exponential in the size of the state space and the time horizon, $|\mathcal{S}|$ and $T$. Assuming the loss function is additive over state-action pairs, [5] shows that Equation (2) can be written as a bilinear game:

$$\min_{\hat\pi} \max_{\check\pi} \underbrace{\sum_{i}\sum_{j} p(\hat\delta_i)\, p(\check\delta_j)\, L(\hat\delta_i, \check\delta_j) + \sum_{j} p(\check\delta_j)\, \underbrace{w^\top \mathbb{E}[\phi \mid \check\delta_j]}_{\psi(\check\delta_j)}}_{G(\hat\pi, \check\pi \mid w)}.$$

This results in a matrix game that is exponentially large in the number of states $|\mathcal{S}|$ and the time horizon, as shown in Table 1.
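To see concretely why working with the explicit payoff matrix is problematic, the sketch below solves a generic zero-sum matrix game, such as the one in Table 1 with the potentials $\psi$ folded into the entries, by linear programming. With $O(|\mathcal{A}|^{|\mathcal{S}|T})$ pure strategies, merely enumerating the matrix is already intractable. The helper name and the use of `scipy.optimize.linprog` are assumptions for illustration, not the double oracle procedure of [5].

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(L):
    """Minimax mixed strategy for the row (minimizing) player of a zero-sum matrix game.

    L[i, j] is the loss paid when row strategy i meets column strategy j.
    Returns the row player's mixed strategy p and the game value v.
    """
    n, m = L.shape
    # decision variables: p_1, ..., p_n and the value v; objective: minimize v
    c = np.zeros(n + 1)
    c[-1] = 1.0
    # for every column j:  sum_i p_i L[i, j] - v <= 0
    A_ub = np.hstack([L.T, -np.ones((m, 1))])
    b_ub = np.zeros(m)
    # p must be a probability distribution: sum_i p_i = 1, p_i >= 0
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]
```

For example, `solve_matrix_game(np.array([[0., 1.], [1., 0.]]))` returns the uniform strategy with value 0.5; the difficulty in the imitation setting is that the number of rows and columns grows exponentially with $|\mathcal{S}|$ and $T$.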
To obtain the Lagrange variables $w$, this matrix game needs to be solved repeatedly to compute the gradient with respect to $w$. This requires solving a linear program with $O(|\mathcal{A}|^{|\mathcal{S}|T})$ variables under a simplex constraint, which is impractical for even modestly sized problems. To mitigate this problem, [5] employed the double oracle method [17] in an attempt to construct a smaller sub-portion of the matrix by gradually adding pure actions through solving a time-varying control problem. However, there is no guarantee that the support set of the Nash equilibrium of the defined game is small, and the algorithm may need to solve up to an exponential number of time-varying optimal control problems.

4.2 State-Action Distribution

We propose an alternative way to transform the optimization in Equation (2) into a convex problem. Our approach is based on using state-action marginals to construct the stochastic policies of the learner and the demonstrator, where the number of required variables is linear in $|\mathcal{S}|$ and $|\mathcal{A}|$, and it can also be extended to stationary policies. In the following, we first treat the non-stationary case and then extend our method to stationary policies.

A policy $\pi$ induces a probability distribution $P_t(s)$ over the states of an MDP $M$ at each time step. State-action marginals are similarly defined as $P_t(s, a) = P_t(s)\,\pi_t(a \mid s)$. A valid state-action marginal is a set of simplices, one per time step, that satisfy the Bellman flow constraints for a given MDP $M$. Let $P_t(s, a)$ be the probability of the state-action pair $(s, a)$ at time $t$; a valid marginal distribution $P \in \Omega$ satisfies the following affine constraints: for all $s'$,

$$\sum_{s, a} P_t(s, a)\, P(s' \mid s, a) = \sum_{a} P_{t+1}(s', a),$$

where $P(s' \mid s, a) \in \Gamma$ and $\Omega$ represents the set of all valid marginal distributions for a given MDP $M$. Employing state-action marginals allows us to write the objective function in Equation (2) as a convex problem over $O(|\mathcal{A}||\mathcal{S}|T)$ variables, as shown in Theorem 2 in vectorized form.

Theorem 2. For an additive loss function $L$ over states and actions, solving the DROIL optimization in Equation (2) is equivalent to solving the following convex minimax problem over the marginal state-action probabilities of the learner $P = (p_1, \dots, p_T)$ and demonstrator $Q = (q_1, \dots, q_T)$, parameterized with Lagrange multipliers $w$:

$$\min_{w} \max_{Q \in \Omega} \min_{P \in \Omega} \sum_{t=0}^{T} p_t^\top L\, q_t + w^\top\Big(\Phi \sum_{t=0}^{T} q_t - \mu\Big),$$

where $p_t$ (similarly $q_t$) is a vector of size $|\mathcal{A}||\mathcal{S}|$ storing the marginal probabilities of state-action pairs at time $t$, $P_t(s, a)$; $L$ is the loss matrix defined over state-action pairs, $L \in \mathbb{R}^{|\mathcal{A}||\mathcal{S}| \times |\mathcal{A}||\mathcal{S}|}$; and $\Phi$ is a $d \times |\mathcal{A}||\mathcal{S}|$ matrix storing the feature function for each state-action pair. We denote the system dynamics by $\Gamma$, which is a matrix of size $|\mathcal{A}||\mathcal{S}| \times |\mathcal{S}|$ storing the transition probabilities $P(s' \mid s, a)$.

Intuitively, each player in the above formulation searches over valid state-action distributions to reach an equilibrium. One of the benefits of constructing a policy using marginals is that the feature expectation can be written as an inner product of the state-action distribution and the state-action features. Thus, the demonstrator's expected feature is $\mu(Q) = \sum_{t=0}^{T} \Phi q_t$, and the feature matching constraint is realized by minimizing $w^\top\big(\sum_{t=0}^{T} \Phi q_t - \mu\big)$ over the dual variables $w$. Writing the objective function in Equation (2) in terms of state-action marginal probabilities reduces the number of variables needed to represent the equilibrium from $O(|\mathcal{A}|^{|\mathcal{S}|T})$ to $O(|\mathcal{A}||\mathcal{S}|T)$.
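The sketch below checks the Bellman flow constraints that define $\Omega$, assuming marginals stored as a `(T, S, A)` array and transitions as an `(S, A, S)` tensor. It also checks that the time-0 marginal is consistent with the initial state distribution $p_0$, which the constraint set implies but the displayed recursion leaves implicit; the function name is illustrative.

```python
import numpy as np

def flow_residual(marginals, P, p0):
    """Maximum violation of the Bellman flow constraints for non-stationary marginals.

    marginals: array of shape (T, S, A) with marginals[t, s, a] = P_t(s, a)
    P:         transition tensor of shape (S, A, S) with P[s, a, s'] = P(s' | s, a)
    p0:        initial state distribution of shape (S,)
    """
    err = np.abs(marginals[0].sum(axis=1) - p0).max()            # sum_a P_0(s, a) = p0(s)
    for t in range(marginals.shape[0] - 1):
        inflow = np.einsum("sa,sak->k", marginals[t], P)         # sum_{s,a} P_t(s, a) P(s'|s, a)
        outflow = marginals[t + 1].sum(axis=1)                   # sum_a P_{t+1}(s', a)
        err = max(err, np.abs(inflow - outflow).max())
    return err                                                   # ~0 for a member of Omega
```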
The unconstrained optimization over the dual variables $w$ can be solved using any gradient descent method, where the gradient is given by $\sum_{t=0}^{T} \Phi q_t^* - \mu$ and $Q^* = (q_1^*, \dots, q_T^*)$ is the solution of the following game for the current iterate $w_t$:

$$G(w_t) = \max_{Q \in \Omega} \min_{P \in \Omega} \sum_{t=0}^{T} p_t^\top L\, q_t + w_t^\top \Phi\, q_t.$$

4.3 Stationary Policy

Our approach extends, with some modifications, to the stationary policy setting. Stationary policies are desirable because they are simpler to describe and are more natural and intuitive in terms of the behavior that they prescribe. Similar to state-action marginals, we utilize an occupancy measure $\rho_\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{\ge 0}$ to characterize a stationary policy $\pi$. It is defined as the expected (discounted) number of visits to the state-action pair $(s, a)$ when following policy $\pi$, and the set of valid occupancy measures can be written as a feasible set of affine constraints:

$$\mathcal{G} = \Big\{\rho : \rho \ge 0 \;\Big|\; \sum_{a} \rho(s', a) = p_0(s') + \gamma \sum_{s, a} P(s' \mid s, a)\,\rho(s, a), \;\; \forall s' \in \mathcal{S}\Big\}, \quad (6)$$

where $P(s' \mid s, a) \in \Gamma$ and $p_0$ is the distribution of starting states. For a given additive loss $L$, with the use of occupancy measures, we write the expected loss $\mathbb{E}[L(\pi_1, \pi_2)]$ as $\rho_{\pi_1}^\top L\, \rho_{\pi_2}$ and the expected discounted feature as $\mu_\pi = \Phi\, \rho_\pi$. Theorem 3 shows how we can write Equation (2) as a convex optimization using occupancy measures:

Theorem 3. For an additive loss function $L$ over states and actions, solving the DROIL optimization in Equation (2) is equivalent to solving the following convex minimax problem over the occupancy measures of the learner $P$ and demonstrator $Q$:

$$\min_{w} \max_{Q \in \mathcal{G}} \min_{P \in \mathcal{G}} P^\top L\, Q + w^\top(\Phi Q - \mu). \quad (7)$$

The convex optimization in Equation (7) can also be solved using any gradient-based method, where the gradient is obtained by solving a constrained game with $O(|\mathcal{A}||\mathcal{S}|)$ variables.

4.4 Inferred Policy

In the non-stationary case, after obtaining $w^*$, one can use either $Q^*$ or $P^*$ as the produced non-stationary Markovian stochastic policy by computing $\pi^*_t(a \mid s) = p_t(a \mid s) = \frac{p_t(s, a)}{\sum_{a'} p_t(s, a')}$. $Q^*$ corresponds to the policy that maximizes the generalized entropy, and $P^*$ corresponds to the policy that minimizes the worst-case expected loss. For the stationary case, [26] proved that there is a one-to-one mapping between $\mathcal{G}$ and $\Pi$, in the sense that for an occupancy measure $P \in \mathcal{G}$, $\pi(a \mid s) \triangleq \frac{P(s, a)}{\sum_{a'} P(s, a')}$ is the only policy whose occupancy measure is $P$. Therefore, one can similarly use $P^*$ or $Q^*$ as the produced stationary stochastic policy. In either case, assuming the reward function can be written as $R(s, a) = w^\top \phi(s, a)$, $w^*$ plays the role of the reward weight vector that rationalizes the demonstrated behaviors. Consequently, we can use $w^{*\top} \Phi$ to obtain the recovered reward function $R^*$ and use an MDP solver to obtain the corresponding (deterministic) policy.

5 Optimization

In both the stationary and non-stationary cases, any gradient descent method can be used to optimize over the dual variables $w$. To compute the gradient, one needs to compute the inner solution $Q^*$ (in the non-stationary case, $Q^* = (q_1^*, \dots, q_T^*)$). However, since $w$ is unconstrained, by adding a squared-norm regularizer on $w$ with hyperparameter $\lambda$, one can solve for $w$ in closed form and substitute it back into the objective. Since the optimization algorithms are similar in both cases, we only present the non-stationary case:

$$\min_{w} \max_{Q \in \Omega} \min_{P \in \Omega} \sum_{t=0}^{T} p_t^\top L\, q_t + w^\top\Big(\Phi \sum_{t=0}^{T} q_t - \mu\Big) + \frac{\lambda}{2}\|w\|^2,$$

and setting

$$w^* = \frac{\mu - \Phi \sum_{t=0}^{T} q_t}{\lambda},$$

we have

$$\max_{Q \in \Omega} \min_{P \in \Omega} \sum_{t=0}^{T} p_t^\top L\, q_t - \frac{1}{2\lambda}\Big\|\Phi \sum_{t=0}^{T} q_t - \mu\Big\|^2, \quad (9)$$

which is a constrained quadratic optimization in $Q$. Using Danskin's theorem, the gradient with respect to $q_t$ is given by $\frac{1}{\lambda}\big(\Phi^\top \mu - \Phi^\top \Phi \sum_{t=0}^{T} q_t\big) + L^\top p_t^*$, where $P^* = (p_1^*, \dots, p_T^*)$ can be found efficiently with a standard linear programming toolbox (a linear objective with affine constraints).
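The closed-form substitution of $w^*$ and the Danskin gradient can be sketched as follows, assuming flattened `(T, |S||A|)` marginal arrays and that the inner minimizer $P^*$ has already been obtained from the linear program over $\Omega$. The sign convention follows the objective in Equation (9) as written above; all names are illustrative.

```python
import numpy as np

def droil_gradient(Q, P_star, L, Phi, mu, lam):
    """Gradient of the regularized DROIL objective (Equation (9)) with respect to each q_t.

    Q, P_star: arrays of shape (T, SA) holding the demonstrator-estimator marginals q_t
               and the learner marginals p_t* that solve the inner linear program
    L:         (SA, SA) additive loss matrix; Phi: (d, SA) feature matrix; mu: (d,)
    lam:       regularization weight lambda on the squared norm of w
    """
    w_star = (mu - Phi @ Q.sum(axis=0)) / lam          # closed-form dual variable w*
    grad = np.empty_like(Q)
    for t in range(Q.shape[0]):
        # Danskin: treat p_t* as fixed; d/dq_t of p_t' L q_t is L' p_t*, and the
        # quadratic penalty contributes Phi' w* = (Phi' mu - Phi' Phi sum_t q_t) / lambda
        grad[t] = L.T @ P_star[t] + Phi.T @ w_star
    return grad, w_star
```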
An alternative approach is to solve the dual of the optimization over $p_t$ and maximize it jointly with $Q$.

Algorithm 1 Distributionally Robust Imitation Learning (DROIL)
Input: $D = \{\tau^1, \tau^2, \dots, \tau^m\}$, $\Gamma$, $p_0$
Initialize $Q_0$, compute $\mu$ using $D$, and set $i = 0$
repeat
    Compute the gradient $\nabla f(Q_i)$ of Equation (9)
    Take a gradient step using $\nabla f(Q_i)$ to obtain a candidate $\tilde{Q}_{i+1}$
    if $\tilde{Q}_{i+1} \in \Omega$ then
        $Q_{i+1} = \tilde{Q}_{i+1}$
    else
        $Q_{i+1} = \mathrm{project}(\tilde{Q}_{i+1})$, where the projection is defined in Equation (10)
    end if
    $i = i + 1$
until convergence

Projection Step. At each iteration of the algorithm, we need to project $Q$ onto a convex domain to maintain a valid state-action distribution given $\Gamma$ and the initial state distribution. Essentially, writing $Q' = (q_1', \dots, q_T')$ for the candidate iterate, the projection solves

$$\min_{Q} \sum_{t=0}^{T} \|q_t - q_t'\|^2 \quad \text{s.t.} \quad Q \in \Omega, \quad (10)$$

from which, by using a Lagrangian method and strong duality, we obtain:

$$\max_{V} \min_{Q} \sum_{t=0}^{T} \frac{1}{2}\|q_t - q_t'\|^2 + \big(q_{t-1}^\top \Gamma - q_t^\top Z\big)\, v_t, \quad (11)$$

where the $Z$ operator computes the state distribution: $(q_t^\top Z)(s) = \sum_{a} q_t(s, a)$. To compute the gradient with respect to $V$, a quadratic program over $Q$ with probability simplex constraints needs to be solved. This can be determined analytically by sorting each $q_t$, which takes $O(|\mathcal{S}||\mathcal{A}| \log(|\mathcal{S}||\mathcal{A}|))$ time.

6 Experimental Results

In our experiments, we compare DROIL with prior methods on several imitation learning tasks. We investigate: 1) how our convex optimization improves the training time compared to the double oracle method; 2) how the choice of loss function affects the performance of DROIL; and 3) how accurately DROIL predicts actions compared to other IRL methods.

6.1 Training Time

| Size | DROIL time | DROIL cost | DO time | DO cost |
|---|---|---|---|---|
| 128 | 1.6 | 7.39 | 30.1 | 7.39 |
| 432 | 1.8 | 21.61 | 42.1 | 21.74 |
| 1024 | 6.0 | 20.0 | 141.1 | 20.0 |
| 2000 | 98.6 | 30.1 | 608.1 | 30.1 |
| 3500 | 496 | 39.2 | 2020 | 39.1 |
| 5500 | 881 | 58.7 | 4970 | 58.9 |

Table 2: Elapsed time in seconds until convergence with $10^{-3}$ tolerance.

To compare the training time of our proposed convex optimization with the double oracle approach [5], we adopt their experimental setup in GridWorld. In this experiment, trajectories are collected from simulated navigation across a discrete 2D grid where the agent starts from a random starting point and navigates through the grid to reach a specified target location. Taking a step in the grid has a cost, and the agent's goal is to reach the target location while minimizing the accumulated navigation cost (maximizing the reward). This problem can be formulated as an optimal sequential decision-making problem in a finite Markov decision process where the optimal policy is non-stationary. We consider a linear cost function $C(s) = w^\top \phi(s) + \epsilon(s)$, where the feature function $\phi$ and weight vector $w$ are drawn from $U(0, 1)^d$ and $\epsilon \sim U(0, 1)$. The transition function is non-deterministic with parameter $p_m \in (0, 1]$: with probability $(1 - p_m)$, the agent moves to a randomly chosen neighbor. The loss function for this experiment is set to $\mathbb{E}\big[\sum_{t=0}^{T} (\hat{X}_t - \check{X}_t)^2 + (\hat{Y}_t - \check{Y}_t)^2\big]$, where $(X, Y)$ represents the grid position of the agent. We generate trajectories from the optimal policy obtained by solving the MDP with the true reward function and train DROIL and DO until convergence. Along with the expected loss, we report the elapsed time until convergence for our proposed method and double oracle (DO) in Table 2, averaged over eight repetitions of the experiment (as are our later results). As Table 2 shows, our proposed method requires significantly less training time for the same performance and scales very well with the size of the state space.
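As an example of assembling an additive loss like the one above into the matrix $L$ used by Theorem 2, the following sketch builds the squared-Euclidean loss between grid positions; the row-major state indexing `s = y * width + x` and the function name are assumptions, and this state-level matrix would still need to be lifted to state-action pairs for the full formulation.

```python
import numpy as np

def grid_euclidean_loss_matrix(width, height):
    """Squared Euclidean loss between grid cells: L[s, s'] = (X - X')^2 + (Y - Y')^2.

    States are indexed row-major as s = y * width + x, so the matrix has shape (S, S)
    with S = width * height.
    """
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)      # (S, 2) grid coordinates (x, y)
    diff = coords[:, None, :] - coords[None, :, :]           # pairwise coordinate differences
    return (diff ** 2).sum(axis=-1).astype(float)            # (S, S) squared distances
```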
6.2 Loss Function Effect

Figure 1: Performance of DROIL when different loss functions are used (0-1, action-loss, random, Euclidean): misprediction rate for 2 to 64 demonstrated examples.

For the second question, we show that different choices of loss function result in different performance for the learned policy in the imitation learning setting. The choice of loss function therefore provides an extra tool to incorporate domain knowledge and to design a problem-specific loss that potentially results in better produced policies. To compare different losses, we revisit the GridWorld environment; however, we now train several stationary policies with different losses. We consider the 0-1 loss, which equally penalizes any mismatch between state-action pairs; the action-loss, which incurs loss only when an action differs from another in the same state; a random loss, which is drawn from a uniform distribution; and finally the Euclidean distance between two positions in the grid. It is clear from Figure 1 that the policy produced with the Euclidean loss outperforms the policies produced with the other losses, which shows the benefit of using a task-specific loss function.

6.3 Highway Driving

To evaluate DROIL in a more complex environment with more realistic behavior and a non-linear reward function, we compare it with several other IRL methods in a highway driving simulator: MMP [20], the projection algorithm of Abbeel and Ng [1], and LEARCH [21]. Following the setting in Levine et al. [16], the task is to drive a car on a three-lane highway in which the agent can switch lanes and drive at up to four times the speed of traffic while all other vehicles move at a constant speed. The set of features includes the distance to the nearest vehicle in each lane (in front and behind), the current speed, and the current lane. We also evaluate each method on the original environment and on four additional random environments, denoted as "transfer". We set a uniformly random loss for DROIL and train all methods using examples sampled from the stochastic MaxEnt IRL policy, which can intuitively be viewed as noisy samples of an underlying optimal policy.

To evaluate the performance of each method, we use the misprediction rate, defined as the ratio of incorrectly predicted actions relative to the optimal policy, and the expected value difference, which measures the suboptimality of the learned policy under the true reward. Since DROIL is able to produce a stochastic policy, by the same argument as [16] we could evaluate the optimal stochastic policies; however, this would unfairly penalize the other methods. Therefore, we first obtain the reward weight vector and find the optimal deterministic policy for the corresponding reward function. Then, we measure its expected sum of discounted rewards under the true reward function and subtract this quantity from the expected sum of discounted rewards of the optimal policy.

Figure 2: Expected value differences for 64-car-length highways with varying example counts. Lower values are better.

As shown in Figure 2 and Figure 3, DROIL performs very well in terms of the obtained reward and the accuracy of the produced policy in both the original and transfer environments. In contrast, we find that MMP and Abbeel & Ng's approach degrade as the number of examples increases. This matches theory, since the suboptimality of the demonstrations becomes more apparent as the number of examples increases.
This indicates that under noisy demonstrations (samples from a stochastic policy), a robust approach has the potential to outperform alternative approaches.

Figure 3: Misprediction rate results for 64-car-length highways with varying example counts. Lower values are better.

7 Discussion & Conclusion

We demonstrated a connection between DROIL, which accepts any loss of interest, and Maximum Entropy Inverse Reinforcement Learning, one of the most widely used IRL approaches, which robustly minimizes the logarithmic loss. We showed that MaxEnt IRL is a special case of the DROIL framework when the logarithmic loss function is used, and that DROIL can be seen as the maximizer of a generalized concept of entropy. We provided a novel approach to cast DROIL's objective into a convex optimization over a polynomial number of variables and experimentally showed that our proposed algorithm provides faster training. DROIL is naturally designed to perform robustly against noisy demonstrations; our experiment in the highway driving task showed that when demonstrations are noisy, it robustly learns an appropriate policy.

Improvements in imitation learning have the potential for both societal benefits and harms. For example, better imitating top surgeons could scale their abilities to broader populations that are medically underserved. Enabling robots that better imitate effective soldiers could cause great harm if used inappropriately. Like all general-purpose tools, avoiding intentional harms while still providing benefits is an unsolved challenge. We take the position that providing methods that are more robust to noise will help to avoid unintentional harms: the application of methods in a well-intentioned manner that fails to maximize their benefits and may instead produce harms through their fragility.

Our presented experiments are restricted to discrete, low-dimensional decision processes. For future work, we are interested in finding a connection between DROIL and the generalized exponential family and in applying DROIL to very high dimensional state and action spaces that require function approximators such as deep neural networks.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant Nos. 1652530 and 1910146.

References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 1–8. ACM, 2004.

[2] Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2015.

[3] Ole E. Barndorff-Nielsen. Information and Exponential Families. 1978.

[4] Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.

[5] Xiangli Chen, Mathew Monfort, Brian D. Ziebart, and Peter Carr. Adversarial inverse optimal control for general imitation learning losses and embodiment transfer. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2016.

[6] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.

[7] Rizal Fathony, Mohammad Ali Bashiri, and Brian Ziebart. Adversarial surrogate losses for ordinal regression. In Advances in Neural Information Processing Systems, pages 563–573, 2017.
[8] Rizal Fathony, Ashkan Rezaei, Mohammad Ali Bashiri, Xinhua Zhang, and Brian Ziebart. Distributionally robust graphical models. In Advances in Neural Information Processing Systems, pages 8344–8355, 2018.

[9] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.

[10] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the International Conference on Machine Learning, pages 49–58, 2016.

[11] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

[12] Peter D. Grünwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2004.

[13] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[14] John Rust. Maximum likelihood estimation of discrete control processes. SIAM Journal on Control and Optimization, 26(5):1006–1024, 1988.

[15] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. DART: Noise injection for robust imitation learning. arXiv preprint arXiv:1703.09327, 2017.

[16] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. Advances in Neural Information Processing Systems, 24:19–27, 2011.

[17] H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the International Conference on Machine Learning, pages 536–543, 2003.

[18] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 663–670, 2000.

[19] Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.

[20] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the International Conference on Machine Learning, pages 729–736, 2006.

[21] Nathan D. Ratliff, David Silver, and J. Andrew Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.

[22] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 661–668, 2010.

[23] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

[24] Stuart J. Russell. Learning agents for uncertain environments. In COLT, volume 98, pages 101–103, 1998.

[25] Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Steven Wu. Of moments and matching: A game-theoretic framework for closing the imitation gap. In Proceedings of the International Conference on Machine Learning, pages 10022–10032, 2021.

[26] Umar Syed, Michael Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the International Conference on Machine Learning, pages 1032–1039, 2008.
[27] Flemming Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.

[28] John von Neumann. Über ein ökonomisches Gleichungssystem und eine Verallgemeinerung des Brouwerschen Fixpunktsatzes. In Ergebnisse eines mathematischen Kolloquiums, volume 8, pages 73–83, 1937.

[29] Hong Wang, Wei Xing, Kaiser Asif, and Brian Ziebart. Adversarial prediction games for multivariate losses. In Advances in Neural Information Processing Systems, pages 2710–2718, 2015.

[30] Jiangchuan Zheng, Siyuan Liu, and Lionel M. Ni. Robust Bayesian inverse reinforcement learning with sparse behavior noise. In Proceedings of the AAAI Conference on Artificial Intelligence, 2014.

[31] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.

[32] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modeling interaction via the principle of maximum causal entropy. In Proceedings of the International Conference on Machine Learning, 2010.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] Theoretical claims are supported by theorems and experimental claims are supported empirically.
   (b) Did you describe the limitations of your work? [Yes] See Section 7.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 7.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] We discuss them in each theorem.
   (b) Did you include complete proofs of all theoretical results? [Yes] We include these in the supplementary materials.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We provide these in the supplementary materials.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We describe these details in the supplementary materials.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 6.
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] See the supplementary materials.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
       [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

A Proofs of Theorems 2 and 3

Suppose the loss function $L$ is decomposable over state-action pairs. Then we can write Equation (2) as follows:

$$\min_{\hat\pi} \max_{\check\pi} \sum_{t=0}^{T} \sum_{\hat s, \hat a} \sum_{\check s, \check a} P(\hat S_t = \hat s, \hat A_t = \hat a \mid \hat\pi, \Gamma)\, L(\hat s, \check s, \hat a, \check a)\, P(\check S_t = \check s, \check A_t = \check a \mid \check\pi, \Gamma) \quad (12)$$

subject to:

$$\sum_{t=0}^{T} \sum_{\check s, \check a} P(\check S_t = \check s, \check A_t = \check a \mid \check\pi, \Gamma)\, \phi(\check s, \check a) = \mu.$$

By introducing dual variables $w$ for the feature expectation constraints, the Lagrangian function of Equation (12) is given by:

$$\min_{\hat\pi} \max_{\check\pi} \min_{w} \sum_{t=0}^{T} \sum_{\hat s, \hat a} \sum_{\check s, \check a} P(\hat S_t = \hat s, \hat A_t = \hat a \mid \hat\pi, \Gamma)\, L(\hat s, \check s, \hat a, \check a)\, P(\check S_t = \check s, \check A_t = \check a \mid \check\pi, \Gamma) + w^\top \Big( \sum_{t=0}^{T} \sum_{\check s, \check a} P(\check S_t = \check s, \check A_t = \check a \mid \check\pi, \Gamma)\, \phi(\check s, \check a) - \mu \Big). \quad (13)$$

The optimization in Equation (13) is over $\hat\pi$ and $\check\pi$. However, the objective in Equation (13) decomposes over the state-action distributions induced by the policies $\hat\pi$ and $\check\pi$. We therefore directly optimize over the marginals:

$$\min_{(p_1, p_2, \dots, p_T)} \max_{(q_1, q_2, \dots, q_T)} \min_{w} \sum_{t=0}^{T} \sum_{\hat s, \hat a} \sum_{\check s, \check a} p_t(\hat s, \hat a)\, L(\hat s, \check s, \hat a, \check a)\, q_t(\check s, \check a) + w^\top \Big( \sum_{t=0}^{T} \sum_{\check s, \check a} q_t(\check s, \check a)\, \phi(\check s, \check a) - \mu \Big), \quad (14)$$

where $p_t(\hat s, \hat a) = P(\hat S_t = \hat s, \hat A_t = \hat a)$ and $q_t(\check s, \check a) = P(\check S_t = \check s, \check A_t = \check a)$. This optimization needs to be over valid state-action marginals (marginals induced by a policy), so the following constraints need to be satisfied:

$$\sum_{\hat s, \hat a} p_t(\hat s, \hat a)\, P(\hat s' \mid \hat s, \hat a) = \sum_{\hat a} p_{t+1}(\hat s', \hat a) \quad \forall t, s',$$

and similarly for $q_t$,

$$\sum_{\check s, \check a} q_t(\check s, \check a)\, P(\check s' \mid \check s, \check a) = \sum_{\check a} q_{t+1}(\check s', \check a) \quad \forall t, s'.$$

Since Equation (14) is convex in all variables $p_t$, $q_t$, and $w$, the order of optimization can be changed:

$$\min_{w} \max_{Q \in \Omega} \min_{P \in \Omega} \sum_{t=0}^{T} \Big[ \sum_{\hat s, \hat a} \sum_{\check s, \check a} p_t(\hat s, \hat a)\, L(\hat s, \check s, \hat a, \check a)\, q_t(\check s, \check a) + w^\top \sum_{\check s, \check a} q_t(\check s, \check a)\, \phi(\check s, \check a) \Big] - w^\top \mu,$$

where $P = (p_1, p_2, \dots, p_T)$ and $Q = (q_1, q_2, \dots, q_T)$. The proof for the stationary case is similar.