# Goal Recognition as Reinforcement Learning

Leonardo Amado¹, Reuth Mirsky², Felipe Meneguzzi³,¹

¹ Pontifícia Universidade Católica do Rio Grande do Sul, Brazil
² Bar Ilan University, Israel and The University of Texas at Austin, USA
³ University of Aberdeen, Scotland

leorosaamado@gmail.com, mirskyr@cs.biu.ac.il, felipe.meneguzzi@abdn.ac.uk

Most approaches for goal recognition rely on specifications of the possible dynamics of the actor in the environment when pursuing a goal. These specifications suffer from two key issues. First, encoding these dynamics requires careful design by a domain expert, which is often not robust to noise at recognition time. Second, existing approaches often need costly real-time computations to reason about the likelihood of each potential goal. In this paper, we develop a framework that combines model-free reinforcement learning and goal recognition to alleviate the need for careful, manual domain design, and the need for costly online executions. This framework consists of two main stages: offline learning of policies or utility functions for each potential goal, and online inference. We provide a first instance of this framework using tabular Q-learning for the learning stage, as well as three mechanisms for the inference stage. The resulting instantiation achieves state-of-the-art performance against goal recognizers on standard evaluation domains and superior performance in noisy environments.

## Introduction

Goal recognition (GR) is a key task in artificial intelligence, where a recognizer infers the goal of an actor based on a sequence of observations. Consider a service robot that wishes to assist a person in the kitchen by fetching appropriate utensils without interrupting the task execution or demanding attention for specifying instructions (Kautz and Allen 1986; Monteiro et al. 2017; Granada et al. 2020; Bishop et al. 2020). A common approach to enable the robot to perceive and infer the person's goal in this situation consists of a pipeline of activity recognition from raw images and translation into actions for a symbolic GR algorithm (Figure 1). After processing the raw images into observations, a goal recognizer further processes a sequence of these observations into a goal or a distribution over goals.

Most GR approaches rely on arduously informing the recognizer about the feasibility and likelihood of the different actions that the actor can execute. This process might include crafting elaborate domain theories, multiple planner executions in real time, intricate domain optimizations, or any combination of these tasks, leading to the following limitations:

- **Cost of Domain Description:** Crafted domain theories require deliberate design and accurate specification of domain dynamics, a process usually carried out manually by an expert. In highly complex environments, manual elicitation of such a model might even be impossible.
- **Noise Susceptibility:** As specifying accurate domain dynamics is costly, many specifications are incomplete and cannot inform the recognizer about unlikely observations or partial observation sequences. This property makes many goal recognizers susceptible to noise.
- **Online Costs:** Some recognizers require costly online computations, such as multiple planner or parser executions. These computations can hinder the recognizer's real-time inference ability, especially when observations are processed incrementally and the goal of the actor needs to be re-evaluated many times throughout the plan execution.

We address these limitations by replacing manually crafted representations and online executions with model-free Reinforcement Learning (RL) techniques. The resulting framework performs efficient and noise-resistant GR without the need to craft a domain model and without any planner or parser executions during recognition. The three key contributions of this paper are:

1. Revisiting the GR problem definition to accommodate RL-based domains and developing a new framework that relies on policies or utility functions derived from any model-free RL technique. This framework consists of two main stages: offline learning for each potential goal, and an online inference stage that compares an observed trajectory to the learned policies.
2. A first instance of the new framework for tabular RL, using an off-the-shelf implementation of a Q-learning algorithm and three recognition measures: accumulated utility, a modified KL-divergence, and Divergence Point.
3. An evaluation of the new framework on domains with partial and noisy observability.

Experiments show that even with a very short learning process, we can still accurately and robustly perform GR on challenging problems. We show this framework's ability to perform comparably to a state-of-the-art goal recognizer on standard evaluation domains, and its superior performance in noisy environments.

Figure 1: A comparison of existing model-based approaches for goal recognition (left) and our proposed framework (right). The key changes in our approach are presented in red.

## Background

We begin by defining a GR problem in a way that is consistent with existing literature (Meneguzzi and Pereira 2021; Mirsky, Keren, and Geib 2021). Given a domain theory T, a set of possible goals G, and a sequence of observations O, a goal recognition problem consists of finding a goal g ∈ G that *explains* O. The semantics of T and of *explains* can vary greatly between goal recognizers. For example, in Ramírez and Geffner (2009), a domain theory is a planning domain instantiated in a specific initial state s_0, and a goal g is explained by O if there is some optimal plan for g, generated by a planner, that begins in s_0 and is compatible with O. They refine this interpretation to rank goals by likelihood when there is more than one goal with an optimal plan that is compatible with O (Ramírez and Geffner 2010). In this work we propose multiple semantics for *explains*, but we first focus on defining our RL-based domain theory. For that, we use the definitions of a Markov Decision Process (MDP), a policy, and a Q-function (Sutton and Barto 2018).
**Definition 1 (MDP)** A Markov Decision Process M is a 4-tuple (S, A, p, r), where S is the set of possible states of the environment, A is the set of actions the actor can execute, p(s' | s, a) is a transition function that gives the probability of transitioning from state s to state s' after taking action a, and r(s, a, s') is a reward function.

A policy π(a | s) for an MDP is a function that defines the probability of the agent taking action a ∈ A in state s ∈ S. Some RL algorithms, such as Q-learning, compute the policy of an agent using a Q-function Q(s, a), which is an estimate of the expected return starting from s after taking action a. In our new framework, a domain theory T consists of the state and action spaces of an MDP and a set of policies or Q-functions. Unlike planning-based GR, where the domain theory is decoupled from the problem instance (the set of possible goals G), here T depends on the set of goals. We define two types of domain theories:

**Definition 2 (Utility-based Domain Theory)** A utility-based domain theory T_Q(G) is a tuple (S, A, Q) such that Q is a set of Q-functions {Q_g}_{g ∈ G}.

**Definition 3 (Policy-based Domain Theory)** A policy-based domain theory T_π(G) is a tuple (S, A, Π) such that Π is a set of policies {π_g}_{g ∈ G}.

Essentially, we can view both domain theories as a set of MDPs with the same transitions but different reward functions for different goals. Our aim is to learn either a good policy or a utility function that represents the expected behavior of actors under each of these MDPs. We use this formulation to provide a new definition of a goal recognition problem, in which we replace the abstract notion of T and fold the goal set G into these domain theories.

**Definition 4 (Goal Recognition Problem)** Given a domain theory T_Q(G) or T_π(G) and a sequence of observations O, output a goal g ∈ G that explains O.

Note that every GR approach that uses a policy-based domain theory T_π(G) can also be given a utility-based domain theory T_Q(G): we can generate, for each goal g, a softmax policy π_g based on Q_g, as shown in Equation 1:

$$\pi_g(a \mid s) = \frac{Q_g(s, a)}{\sum_{a' \in A} Q_g(s, a')} \tag{1}$$

Consider a case where all values in a Q-function are negative. This case would actually result in the policy π_g being more likely to take the worst action. Thus, if the utility function has some negative values, we rescale the function by adding the additive inverse of its most negative value (some number C): Q'(s, a) = Q(s, a) + C. This modification ensures that the resulting policy π_g prioritizes high-value actions. For brevity, from now on we refer to our framework as one that relies on Q-functions, unless we explicitly wish to discuss specific properties of a policy-based domain theory.
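As a concrete illustration of Equation 1 and the negative-value shift described above, the following sketch converts a tabular Q-function, stored as a state → {action: value} dictionary (a representation we assume for illustration, not the paper's own code), into a stochastic policy:

```python
def q_to_policy(Q, actions):
    """Convert a tabular Q-function into the stochastic policy of Equation 1.

    Q is assumed to be a dict mapping state -> {action: value}; this
    representation and the function name are illustrative choices only.
    """
    policy = {}
    for s, q_values in Q.items():
        values = {a: q_values.get(a, 0.0) for a in actions}
        # Shift by the additive inverse of the most negative value so that
        # normalization does not end up favoring the worst action.
        lowest = min(values.values())
        if lowest < 0:
            values = {a: v - lowest for a, v in values.items()}
        total = sum(values.values())
        if total == 0:
            # Uninformative state: fall back to a uniform distribution.
            policy[s] = {a: 1.0 / len(actions) for a in actions}
        else:
            policy[s] = {a: v / total for a, v in values.items()}
    return policy
```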
Using this new problem definition, we develop our framework to solve these goal recognition problems, discuss how to learn T_Q(G) or T_π(G), and decide which goal g best explains the observations O.

## The Goal Recognition as Reinforcement Learning Framework

Our new framework consists of two main stages: (1) learning a set of Q-functions; and (2) inferring the goal of an actor given a sequence of observations. Figure 2 illustrates this process.

Figure 2: The GR as RL framework (Stage 1: Learn, e.g., tabular Q-learning; Stage 2: Infer, e.g., MaxUtil, KL, DP). The Goal Recognition as Q-Learning (GRAQL) instance appears in italics.

First, the initial inputs are the state and action spaces, S and A, and a set of goals G. There is no restriction on the properties of S and A; they can be either discrete or continuous, and no transition or reward function is required a priori. Our framework can employ any off-the-shelf RL algorithm to learn a Q-function for each goal, {Q_g}_{g ∈ G}, which together with the original S and A constructs the domain theory T_Q(G) for the recognition stage. Once the framework receives an observation sequence, a measure between an observation sequence and a Q-function computes a distance between the observations O and each Q_g. Here we focus primarily on state-action observation sequences, where O = ⟨s_0, a_0, s_1, a_1, …⟩, but the next section contains examples of how to handle state-only (O_s = ⟨s_0, s_1, …⟩) or action-only (O_a = ⟨a_0, a_1, …⟩) observations. The inferred goal g* is the one that minimizes the measured distance between its respective Q-function and the observations, as defined in Equation 2:

$$g^* = \arg\min_{g \in G} \mathrm{DISTANCE}(Q_g, O) \tag{2}$$

**Stage 1 (Learning):** The first part of the GR as RL framework is learning a set of Q-functions (or policies), one per goal, following Algorithm 1. It defines a set of n RL problems, where n is the number of goals in G: for each goal g, we generate a reward function in which there is some positive gain when reaching the goal g (Line 3), and no reward otherwise (Line 2). This basic setting allows us to leverage additional reward shaping or other optimizations to improve the learning of the Q-functions, just as in any other RL problem. However, in the most naive form of this problem, we require no penalty for actions that fail to advance the agent towards the goal, nor any specific discount factor. We explicitly chose this simple problem formulation to highlight the efficacy of our approach, as its aim is to generate informative-enough Q-functions, not to perfectly maximize the reward of the RL agent. As we show in our empirical evaluation, even though we do not train the Q-functions until convergence, and they make poor solvers for reaching their respective goals, they suffice to create an accurate and robust domain theory for our goal recognizers. This formulation enables us to use well-established RL research to learn a set of Q-functions given the properties of our environment: discrete or continuous, deterministic or stochastic transitions, etc. Acquiring the domain theory then becomes an RL problem with its respective challenges: selecting the most appropriate learning algorithm, tuning its hyperparameters, etc.

Algorithm 1: Learn a Q-function for each goal. Require: S, A (state and action spaces); G (a set of candidate goals).

1. for all g ∈ G do
2. ∀a, ∀s ≠ g: r(s, a) ← 0 (create a reward function)
3. ∀a: r(g, a) ← C (reaching g yields some positive value)
4. Q_g ← LEARN(S, A, r)
5. return {Q_g}_{g ∈ G}

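The LEARN call in line 4 of Algorithm 1 can be any off-the-shelf algorithm; below is a minimal tabular Q-learning sketch. It assumes a Gym-style environment with `reset()` and `step()` methods, a discrete `env.actions` list, hashable states, and a goal that can be tested by state equality; these interface details, like the helper name, are our own illustrative assumptions. The default hyperparameters mirror the values reported later in the evaluation (goal reward 100, discount 0.9, linearly decaying ε from 1 to 0.01, 500 episodes).

```python
import random
from collections import defaultdict

def learn_q_function(env, goal, episodes=500, alpha=0.1, gamma=0.9,
                     goal_reward=100.0, max_steps=200):
    """Tabular Q-learning for a single candidate goal (Algorithm 1, line 4)."""
    Q = defaultdict(lambda: {a: 0.0 for a in env.actions})
    epsilon, eps_min = 1.0, 0.01
    decay = (epsilon - eps_min) / episodes   # linearly decaying exploration

    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(Q[state], key=Q[state].get)
            next_state, _, done, _ = env.step(action)   # Gym-style step (assumed)
            # Goal-conditioned reward: positive only when this goal is reached.
            reward = goal_reward if next_state == goal else 0.0
            target = reward + gamma * max(Q[next_state].values())
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done or next_state == goal:
                break
        epsilon = max(eps_min, epsilon - decay)
    return Q

# One Q-function per candidate goal, as in Algorithm 1 (hypothetical usage):
# q_functions = {g: learn_q_function(env, g) for g in goals}
```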
**Stage 2 (Inference):** Once we have a set of Q-functions {Q_g}_{g ∈ G} and an observation sequence O, the next stage in the framework is online GR: inferring the Q-function (i.e., goal) of the actor that explains O. Traditional GR algorithms require complex computations, such as planner or parser executions, to reason about the similarity of O to each goal. In this work, we take a measure-based approach instead and present several potential measures. Algorithm 2 implements the inference process using DISTANCE as a measure function, which implements Equation 2: given O, find for each goal g the distance between O and Q_g (Line 3). The algorithm then chooses the goal with the minimum distance as the most likely goal of the actor (Line 6).

Algorithm 2: Infer the most likely goal for the observations. Require: T_Q(G) = ⟨S, A, {Q_g}_{g ∈ G}⟩ (state and action spaces, and a Q-function per goal); O (an observation sequence ⟨s_0, a_0, s_1, a_1, …⟩).

1. m* ← ∞ (init shortest distance)
2. for all g ∈ G do (compute distances from O)
3. m_g ← DISTANCE(Q_g, O) (use distance measure)
4. if m_g < m* then
5. g* ← g and m* ← m_g
6. return g*

This formulation of the GR task aligns well with Ramírez and Geffner (2010), who introduce the notion of similarity between an observation sequence and optimal, goal- and observation-dependent plans. As the algorithm computes the similarity between the observation sequence and a Q-function that is defined over all state-action pairs rather than a single trajectory, it inherently reasons about noisy and missing observations, as our empirical evaluation shows.

## Goal Recognition as Q-Learning

In this section, we detail the components of the first instance of our framework, specialized for tabular Q-learning approaches: Goal Recognition as Q-Learning (GRAQL). We focus this first instance on tabular domains to enable an evaluation against existing GR baselines. Planning-based GR algorithms use PDDL as their domain descriptions, which can be easily translated into tabular representations (Ramírez and Geffner 2009; Amado et al. 2018). Figure 2 illustrates the specific components that require implementation in italics. We start by explaining the hyperparameters and discussing the choices in the learning stage. Then, we introduce three different measures between an observation sequence and a Q-function.

### Learning {Q_g}_{g ∈ G} Using Q-Learning

For the learning stage, we use an off-the-shelf Q-learning algorithm. As learning the Q-functions for each goal is a means to an end rather than our ultimate aim, we do not focus on techniques to optimize this stage, but rather employ a single solution, showing that we can acquire an informative domain theory with minimal effort. We set the reward for reaching the goal to 100, and 0 otherwise, and the discount factor to 0.9. As exploration is more important in this case than maximizing the reward, the sampling strategy we use is ε-greedy with linearly decaying values (ε = 1…0.01). Instead of relying only on random exploration to reach the goal, shaping the initial policy can speed up the learning process: for each goal g, an optimal planner generates a single trajectory to the goal, p_g = ⟨(s_0, a_0), (s_1, a_1), …⟩. We can use this trajectory to initialize Q_g with positive values for the state-action pairs that are part of its goal's optimal path p_g:

$$Q_g(s, a) = \begin{cases} 1 & \text{if } \langle s, a \rangle \in p_g \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

This shaping initializes the Q-function to give high utility to a single trajectory, which is similar to the original formulation of planning-based GR, where a single optimal plan constitutes the baseline for the actor's presumed path to the goal (Ramírez and Geffner 2009). In that sense, GRAQL bridges a gap between planning-based GR and RL: in that formulation, a planner outputs a single optimal plan for goal g, which might not be the plan the actor chooses to follow. Later work overcomes this problem by searching for a set of diverse plans for each goal (Sohrabi, Riabov, and Udrea 2016). Instead of using search, GRAQL refines a single optimal plan into a policy that captures the cost of alternative plans, even if these plans are not necessarily close to optimal.
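The shaping initialization of Equation 3 amounts to seeding each Q-table with a small bonus along the single planned trajectory. A minimal sketch, assuming the same state → {action: value} dictionary representation used above and a hypothetical planner-produced list of (state, action) pairs:

```python
def shape_q_with_plan(Q, planned_trajectory, bonus=1.0):
    """Equation 3: give an initial positive value to the state-action pairs
    on the planner's single optimal path for this goal; everything else
    keeps its default value of 0."""
    for state, action in planned_trajectory:
        Q.setdefault(state, {})[action] = bonus
    return Q
```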
We implemented our approach with and without this shaping process. To ensure that it does not overfit or create an unfair bias in the Q-functions towards the planning-based observation sequences used in our evaluation, we explicitly chose problems with multiple optimal plans per goal and ran different planners for shaping (LAMA (Richter and Westphal 2010)) and for testing (Fast Downward (Helmert 2006)), so that p_g is not the only possible optimal path. The resulting Q-functions with and without shaping were not significantly different, so our empirical results only show the performance of the Q-functions without the shaping process.

### Measures for Tabular Q-functions

For the inference stage, we use three different measures of the distance between a Q-function Q_g and a state-action observation sequence O, inspired by three common RL measures: MaxUtil, KL-divergence, and Divergence Point. We then extend MaxUtil's definition to handle state-only and action-only observation sequences.

**MaxUtil** is an accumulation of the utilities collected from the observed trajectory:

$$\mathit{MaxUtil}(Q_g, O) = \sum_{i \le |O|} Q_g(s_i, a_i) \tag{4}$$

**KL-Divergence** is a measure of the distance between two distributions, so we construct two policies, π_g and π_O, for Q_g and O respectively. The goal-dependent policy π_g is defined as a softmax stochastic policy (Equation 1). The observation policy π_O is a pseudo-policy in which π_O(a_i | s_i) = 1 for each (s_i, a_i) ∈ O, and which provides a uniform distribution over all actions in unobserved states:

$$\mathit{KL}(Q_g, O) = D_{KL}(\pi_g \parallel \pi_O) = \sum_{i \le |O|} \pi_g(a_i \mid s_i) \log \frac{\pi_g(a_i \mid s_i)}{\pi_O(a_i \mid s_i)} \tag{5}$$

**Divergence Point (DP)** is a measure adapted from Macke et al. (2021): given a trajectory O and a policy π, it is defined as the earliest point in time at which the action taken in O has zero probability of being chosen by π. We implement a softer version of the original measure, where the probability threshold is a parameter δ instead of exactly 0. The reason for this softened version of DP is that, for similar enough goals, the probability of an action being chosen under both goals is unlikely to be exactly 0. Finally, as DP gets higher values when O and π are more similar, we take its additive inverse to obtain a distance compatible with the minimization formulation of Algorithm 2. Here too, the goal-dependent policy π_g is defined as the softmax stochastic policy from Equation 1:

$$\mathit{DP}(Q_g, O) = \min\{\, t \mid \pi_g(a_{t-1} \mid s_{t-1}) \le \delta \,\} \tag{6}$$

**MaxUtil for state-only O** applies to state-only observation sequences O_s = ⟨s_0, s_1, …⟩. In this case, similar to offline policy learning, we optimistically take the action with the highest Q-value:

$$\mathit{MaxUtil}(Q_g, O_s) = \sum_{i \le |O|} \max_{a \in A} Q_g(s_i, a) \tag{7}$$

**MaxUtil for action-only O** applies to action-only observation sequences O_a = ⟨a_0, a_1, …⟩. We optimistically take the state with the highest Q-value in this case as well. To do that, we first need to find the set of all states for which the observation a_i is an optimal action according to Q_g. This set can be formally defined as Opt(a_i | Q_g) = {s ∈ S | Q_g(s, a_i) ≥ Q_g(s, a) ∀a ∈ A}. From this set, we choose the state with the maximal utility (presumed to be the state on the optimal path) and associate this utility with the observation:

$$\mathit{MaxUtil}(Q_g, O_a) = \sum_{i \le |O|} \max_{s \in \mathit{Opt}(a_i \mid Q_g)} Q_g(s, a_i) \tag{8}$$
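A minimal sketch of how the three measures and the arg-min inference of Equation 2 might be implemented for tabular Q-functions stored as state → {action: value} dictionaries. The helper names, the dictionary representation, the uniform fallback for states missing from the table, the sign conventions, and the handling of trajectories that never diverge are our own illustrative assumptions rather than the authors' code.

```python
import math

def policy_prob(Q, s, a, actions):
    """Probability of action a in state s under the Equation 1 policy for Q;
    states missing from Q fall back to a uniform distribution (our choice)."""
    q_values = Q.get(s, {})
    values = [q_values.get(b, 0.0) for b in actions]
    shift = -min(values) if min(values) < 0 else 0.0
    total = sum(v + shift for v in values)
    if total == 0:
        return 1.0 / len(actions)
    return (q_values.get(a, 0.0) + shift) / total

def maxutil(Q, obs):
    """Eq. 4: accumulated utility along the observed (state, action) pairs,
    negated so that the arg-min inference prefers high accumulated utility."""
    return -sum(Q.get(s, {}).get(a, 0.0) for s, a in obs)

def kl(Q, obs, actions):
    """Eq. 5: KL-style sum between the goal policy (Eq. 1) and the observation
    pseudo-policy, which assigns probability 1 to each observed action."""
    total = 0.0
    for s, a in obs:
        p_g = policy_prob(Q, s, a, actions)
        if p_g > 0:
            total += p_g * math.log(p_g / 1.0)
    return total

def dp(Q, obs, actions, delta=0.1):
    """Eq. 6: first time step whose observed action has probability <= delta
    under the goal policy, negated as the paper uses the additive inverse of
    DP as a distance. Never-diverging trajectories get the lowest value
    (an assumption on our part)."""
    for t, (s, a) in enumerate(obs, start=1):
        if policy_prob(Q, s, a, actions) <= delta:
            return -t
    return -(len(obs) + 1)

def infer_goal(q_functions, obs, distance):
    """Eq. 2 / Algorithm 2: the goal whose Q-function minimizes the measure."""
    return min(q_functions, key=lambda g: distance(q_functions[g], obs))

# Example usage with hypothetical inputs:
# g_star = infer_goal(q_functions, observations,
#                     lambda Qg, O: kl(Qg, O, actions))
```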
## Experimental Evaluation

To be able to compare GRAQL and planning-based GR, we use PDDLGym (Silver and Chitnis 2020) as our evaluation environment. PDDLGym is a Python framework that automatically constructs OpenAI Gym environments from PDDL domains and problems. Thus, for each PDDL domain used by state-of-the-art GR algorithms, we generate the parallel representation in Gym for GRAQL. We use three domains from the PDDLGym library for their similarity with commonly used GR evaluation domains: Blocks, Hanoi, and SkGrid (the latter closely resembles common GR navigation domains such as those used by Masters and Sardina (2019)). For each domain, we generate 10 GR problems with 4 candidate goals in G. We manually choose ambiguous goals, i.e., goals that are close to one another rather than in different corners of a grid. Each problem has 7 variants, including partial and noisy observations: 5 variants with varying degrees of observability (10%, 30%, 50%, 70%, and full observability), and 2 variants that include noisy observations with varying degrees of observability (50% and full observability). Thus, our test set includes 210 GR problems. These GR problems can pose a real challenge to existing recognizers, especially when they are partially observable or noisy. Note that, while PDDLGym uses a PDDL description to drive the simulations, our approach only has access to the same data available through a regular Gym simulator, i.e., discrete observations, action names, and the reward function.

Next, we discuss the hyperparameters used in this evaluation and explain our three sets of performance tests, which use different types of O as input: state-action pairs where O is fully observable; state-action pairs with missing observations, noisy observations, or both; and state-only or action-only trajectories with missing observations.

We use standard machine learning metrics in our evaluation: accuracy, precision, recall, and F-score. We note that the accuracy metric reported here is different from the accuracy metric in Ramírez and Geffner (2010), which refers to the ratio between the number of problems in which the recognizer ranks the true goal (i.e., true positives) with the highest likelihood and the total number of problems tested. That measure differs from standard ML accuracy, which also takes into consideration whether the algorithm successfully ranks incorrect goals lower than the correct one (i.e., true positives + true negatives). In our experiments, we report accuracy using the latter definition.
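As a worked micro-example of the difference between the two notions, the sketch below assumes each recognizer returns the set of top-ranked goals for a problem (possibly more than one in case of ties); the function names and data layout are hypothetical.

```python
def rg_accuracy(predictions, true_goals):
    """Ramirez & Geffner-style accuracy: fraction of problems in which the
    true goal is among the top-ranked goals."""
    hits = sum(1 for preds, truth in zip(predictions, true_goals) if truth in preds)
    return hits / len(true_goals)

def ml_accuracy(predictions, true_goals, goals):
    """Standard ML accuracy over per-goal binary decisions:
    (true positives + true negatives) / all decisions."""
    correct = total = 0
    for preds, truth in zip(predictions, true_goals):
        for g in goals:
            predicted, actual = (g in preds), (g == truth)
            correct += int(predicted == actual)
            total += 1
    return correct / total

# Example: 4 candidate goals, 2 problems, one tie in the first problem.
# preds = [{"g1", "g2"}, {"g3"}]; truths = ["g1", "g3"]
# rg_accuracy -> 1.0, while ml_accuracy -> 7/8 (the spurious "g2" counts against it).
```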
### Hyperparameters and Algorithmic Choices

For the learning stage of our experiments, we use an off-the-shelf model-free Q-learning algorithm (Sutton 1988). For each goal, we run the learner for a fixed number of episodes, whether it reaches convergence or not. We evaluated greedy policy executions after training the Q-functions for 500, 10k, and 30k episodes. The difference in the resulting plan lengths was around 5%, and went both ways: sometimes the policy learned after 500 episodes was not informed enough and drifted away from the goal during execution, and sometimes policies learned after 30k episodes overfitted a specific solution. The key differentiator is often whether the learner reaches the goal once during training, as after that the learned Q-function can generate an optimal plan for that goal. Since the performance of these training regimes does not vary much, we report our empirical results with a consistent value of 500 training episodes. For the inference stage, we compare our framework with each of our three proposed measures, denoted MaxUtil, KL, and DP, as well as the probabilistic planning-based recognizer by Ramírez and Geffner (2010), denoted R&G. For the DP measure, we use a threshold probability for divergence of δ = 0.1.

As we wish to compare GRAQL against a planning-based algorithm, we focus our evaluation on discrete state and action spaces, but we conjecture that the framework can directly handle continuous spaces by using the same measures, only with a different learning algorithm.

### Testing under Full Observability

First, we show the performance of MaxUtil, KL, DP, and R&G on fully observable, optimal state-action trajectories. The average length of these trajectories is 8 (min = 5, max = 15) steps for Blocks, 10.4 (min = 5, max = 15) for Hanoi, and 12.4 (min = 6, max = 24) for SkGrid. Figure 3 summarizes the performance of each approach over ten instances per setup.

Figure 3: Comparison of R&G, MaxUtil, KL, and DP by their accuracy, precision, recall, and F-score under full observability.

With full observability, almost all results are above 80%, and KL-divergence achieves perfect performance, consistently predicting the actor's goal correctly. Notably, DP performs worse than the other GRAQL measures, especially in the SkGrid environment. As DP requires an action to be highly unlikely for some goal in order to rank that goal as incorrect, the high variability of SkGrid presents a real challenge to it: in this domain, the same goal might have a number of equally optimal trajectories requiring diagonal movement, due to the nature of square-grid navigation environments. All GRAQL approaches perform similarly to or better than R&G, except for DP on SkGrid, where the largest difference was in precision (DP: 0.53, R&G: 0.69). On the other hand, the performance of R&G in the Hanoi environment is inferior to all GRAQL methods in terms of accuracy (DP: 0.95, R&G: 0.78), precision (DP: 0.83, R&G: 0.48), and F-score (DP: 0.91, R&G: 0.65). Hanoi has many actions that appear in plans for different goals, causing high ambiguity at recognition time, which makes it especially challenging for R&G to distinguish between those goals. These results show that GRAQL achieves results comparable to the state of the art with fully observable trajectories.

### Testing under Partial Observability and Noise

We evaluate our approaches under partial observability with varying degrees of observability (10%, 30%, 50%, 70%, and full observability). We use the same trajectories in all experiments, removing steps randomly to achieve a specific observability ratio.
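A minimal sketch of how such partially observable variants could be produced from a full trajectory, assuming a hypothetical list of observed steps; this illustrates the stated procedure, not the authors' exact script.

```python
import random

def subsample_observations(trajectory, observability, seed=None):
    """Randomly drop observed steps, keeping roughly `observability` (a
    fraction in (0, 1]) of the trajectory while preserving temporal order."""
    rng = random.Random(seed)
    keep = max(1, round(observability * len(trajectory)))
    kept_indices = sorted(rng.sample(range(len(trajectory)), keep))
    return [trajectory[i] for i in kept_indices]

# e.g., subsample_observations(obs, 0.3) keeps about 30% of the steps.
```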
Table 1 shows the average performance of each approach over ten instances per setup.

Table 1: Impact of partial observability: comparing MaxUtil (MxU), KL, DP, and R&G under varying observability levels of O. Each cell lists MxU / KL / DP / R&G.

| O | Domain | Accuracy | Precision | Recall | F-score |
|---|--------|----------|-----------|--------|---------|
| 10% | Blocks | 0.93 / 0.90 / 0.93 / 0.44 | 0.82 / 0.80 / 0.77 / 0.25 | 0.90 / 0.80 / 1.00 / 0.90 | 0.86 / 0.80 / 0.87 / 0.39 |
| 10% | Hanoi | 0.93 / 0.95 / 0.90 / 0.50 | 0.77 / 0.90 / 0.71 / 0.29 | 1.00 / 0.90 / 1.00 / 1.00 | 0.87 / 0.90 / 0.83 / 0.44 |
| 10% | SkGrid | 0.80 / 0.90 / 0.55 / 0.72 | 0.60 / 0.80 / 0.36 / 0.42 | 0.60 / 0.80 / 1.00 / 1.00 | 0.60 / 0.80 / 0.53 / 0.59 |
| 30% | Blocks | 1.00 / 0.95 / 0.97 / 0.74 | 1.00 / 0.90 / 0.91 / 0.42 | 1.00 / 0.90 / 1.00 / 0.80 | 1.00 / 0.90 / 0.95 / 0.55 |
| 30% | Hanoi | 0.95 / 0.95 / 0.93 / 0.68 | 0.83 / 0.90 / 0.77 / 0.38 | 1.00 / 0.90 / 1.00 / 1.00 | 0.91 / 0.90 / 0.87 / 0.55 |
| 30% | SkGrid | 0.90 / 0.95 / 0.70 / 0.88 | 0.80 / 0.90 / 0.45 / 0.63 | 0.80 / 0.90 / 1.00 / 1.00 | 0.80 / 0.90 / 0.62 / 0.77 |
| 50% | Blocks | 1.00 / 1.00 / 0.97 / 0.80 | 1.00 / 1.00 / 0.91 / 0.50 | 1.00 / 1.00 / 1.00 / 0.90 | 1.00 / 1.00 / 0.95 / 0.64 |
| 50% | Hanoi | 0.95 / 0.95 / 0.93 / 0.72 | 0.83 / 0.90 / 0.77 / 0.42 | 1.00 / 0.90 / 1.00 / 1.00 | 0.91 / 0.90 / 0.87 / 0.59 |
| 50% | SkGrid | 0.80 / 0.90 / 0.72 / 0.88 | 0.60 / 0.80 / 0.48 / 0.63 | 0.60 / 0.80 / 1.00 / 1.00 | 0.60 / 0.80 / 0.65 / 0.77 |
| 70% | Blocks | 1.00 / 1.00 / 0.97 / 0.94 | 1.00 / 1.00 / 0.91 / 0.77 | 1.00 / 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 0.95 / 0.87 |
| 70% | Hanoi | 0.95 / 0.90 / 0.93 / 0.72 | 0.83 / 0.80 / 0.77 / 0.42 | 1.00 / 0.80 / 1.00 / 1.00 | 0.91 / 0.80 / 0.87 / 0.59 |
| 70% | SkGrid | 0.85 / 0.95 / 0.72 / 0.92 | 0.70 / 0.90 / 0.47 / 0.71 | 0.70 / 0.90 / 0.90 / 1.00 | 0.70 / 0.90 / 0.62 / 0.83 |
| 100% | Blocks | 1.00 / 1.00 / 0.97 / 0.96 | 1.00 / 1.00 / 0.91 / 0.85 | 1.00 / 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 0.95 / 0.92 |
| 100% | Hanoi | 0.97 / 1.00 / 0.95 / 0.78 | 0.91 / 1.00 / 0.83 / 0.48 | 1.00 / 1.00 / 1.00 / 1.00 | 0.95 / 1.00 / 0.91 / 0.65 |
| 100% | SkGrid | 0.90 / 1.00 / 0.78 / 0.90 | 0.80 / 1.00 / 0.53 / 0.69 | 0.80 / 1.00 / 0.90 / 0.90 | 0.80 / 1.00 / 0.67 / 0.78 |
| Avg | Blocks | 0.98 / 0.97 / 0.96 / 0.78 | 0.96 / 0.94 / 0.88 / 0.56 | 0.98 / 0.94 / 1.00 / 0.92 | 0.97 / 0.94 / 0.93 / 0.67 |
| Avg | Hanoi | 0.95 / 0.95 / 0.93 / 0.68 | 0.83 / 0.90 / 0.77 / 0.40 | 1.00 / 0.90 / 1.00 / 1.00 | 0.91 / 0.90 / 0.87 / 0.56 |
| Avg | SkGrid | 0.85 / 0.94 / 0.69 / 0.86 | 0.70 / 0.88 / 0.45 / 0.62 | 0.70 / 0.88 / 0.96 / 0.98 | 0.70 / 0.88 / 0.61 / 0.75 |

It is clear that as observability decreases, so does the performance of all approaches. However, partial observability affects R&G most strongly, with values that drop to about half at the 10% observability level (e.g., accuracy of 0.96 drops to 0.44 in Blocks). In general, KL and MaxUtil perform better than DP and R&G, except for the recall metric, where the latter two are the superior approaches. The reason for this loss of recall is that DP and R&G are more likely to produce ties between potential goals, so when these approaches cannot distinguish between goals using O, they return multiple goals, trading precision for recall.

Finally, we evaluate our approaches with the addition of noise in the observations. We add noise by first generating an optimal plan using the Fast Downward planner. We then randomly choose a step index and, from that step, inject two consecutive non-optimal actions, together with the states they induce along this sub-optimal path, as state-action pairs, thus forcibly deviating the observed plan from the optimal one. Finally, to get back to the goal with no additional noise, we rerun the planner to obtain an optimal plan to the goal from the state reached after executing these two noisy actions. Using this process, we add four noisy actions per trajectory in all of our environments: two that make the agent drift away from its goal, and two additional actions to backtrack. We chose this form of noise because these actions are still valid, even if not optimal. An alternative form of noise could have been an injection of invalid actions, or of random state-action pairs that are not part of the generated trajectory. However, R&G and most other planning-based approaches cannot trivially reason about this type of noise, as they would simply label such a noisy plan as impossible for the goal.
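The noise-injection procedure could be sketched as follows, where the simulator and planner are stand-ins: `simulate`, `suboptimal_step`, and `replan` are hypothetical helpers, not part of PDDLGym or Fast Downward.

```python
import random

def inject_noise(plan, goal, simulate, suboptimal_step, replan, rng=random):
    """Sketch of the noise-injection procedure described above.

    `plan` is a list of (state, action) pairs; `simulate(state, action)`
    returns the next state, `suboptimal_step(state)` returns a valid but
    non-optimal action, and `replan(state, goal)` returns an optimal plan
    as (state, action) pairs. All three are assumed helpers.
    """
    cut = rng.randrange(len(plan))            # random step index
    noisy_prefix = list(plan[:cut])
    state = plan[cut][0]
    for _ in range(2):                        # two consecutive non-optimal actions
        action = suboptimal_step(state)
        noisy_prefix.append((state, action))
        state = simulate(state, action)
    # Re-plan optimally from the deviated state back to the goal.
    return noisy_prefix + replan(state, goal)
```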
We tested the noisy trajectories with full observability and with partial observability set to 0.5. Table 2 shows these results.

Table 2: Impact of noise: comparing MaxUtil (MU), KL, DP, and R&G with varying observability and with 4 noisy observations in O. Each cell lists MU / KL / DP / R&G.

| O | Domain | Accuracy | Precision | Recall | F-score |
|---|--------|----------|-----------|--------|---------|
| 50% | Blocks | 0.95 / 0.62 / 0.93 / 0.84 | 0.95 / 0.33 / 0.77 / 0.56 | 0.90 / 0.50 / 1.00 / 1.00 | 0.90 / 0.40 / 0.87 / 0.71 |
| 50% | Hanoi | 0.97 / 0.90 / 0.93 / 0.68 | 0.91 / 0.80 / 0.77 / 0.38 | 1.00 / 0.80 / 1.00 / 1.00 | 0.95 / 0.80 / 0.87 / 0.56 |
| 50% | SkGrid | 0.75 / 0.75 / 0.57 / 0.88 | 0.50 / 0.50 / 0.35 / 0.64 | 0.50 / 0.50 / 0.80 / 0.90 | 0.50 / 0.50 / 0.48 / 0.75 |
| 100% | Blocks | 1.00 / 1.00 / 0.95 / 0.96 | 1.00 / 1.00 / 0.83 / 0.83 | 1.00 / 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 0.91 / 0.91 |
| 100% | Hanoi | 1.00 / 0.95 / 0.90 / 0.78 | 1.00 / 0.90 / 0.71 / 0.48 | 1.00 / 0.90 / 1.00 / 1.00 | 1.00 / 0.90 / 0.83 / 0.65 |
| 100% | SkGrid | 0.85 / 0.95 / 0.65 / 0.90 | 0.70 / 0.90 / 0.40 / 0.69 | 0.70 / 0.90 / 0.80 / 0.90 | 0.70 / 0.90 / 0.53 / 0.78 |
| Avg | Blocks | 0.97 / 0.81 / 0.94 / 0.90 | 0.97 / 0.60 / 0.80 / 0.70 | 0.95 / 0.75 / 1.00 / 1.00 | 0.95 / 0.67 / 0.89 / 0.81 |
| Avg | Hanoi | 0.99 / 0.93 / 0.91 / 0.73 | 0.95 / 0.85 / 0.74 / 0.43 | 1.00 / 0.85 / 1.00 / 1.00 | 0.98 / 0.85 / 0.85 / 0.61 |
| Avg | SkGrid | 0.80 / 0.85 / 0.61 / 0.89 | 0.60 / 0.70 / 0.37 / 0.67 | 0.60 / 0.70 / 0.80 / 0.90 | 0.60 / 0.70 / 0.51 / 0.77 |

Unlike the noise-free case, in these results MaxUtil outperforms KL in most setups in terms of accuracy, precision, and F-score. When comparing the overall performance of each approach with and without noise, KL and R&G are more noise-sensitive than MaxUtil and DP.

### State-Only and Action-Only Observations

Some recognizers use only actions in their inference process but no states (or vice versa), even if both are available; R&G, for example, uses only actions. Our next set of experiments shows the performance of GRAQL when given such observations, and compares it with R&G. We use the same observation sequences as in our partial observability experiments, but we provide our MaxUtil-based approaches with either the full sequences, only the states O_s = ⟨s_0, s_1, …⟩ (MaxUtil for state-only, Equation 7), or only the actions O_a = ⟨a_0, a_1, …⟩ (MaxUtil for action-only, Equation 8). Figure 4 summarizes these results, where each bar represents the average performance over all observability levels (from 10% to full).

Figure 4: Performance comparison of R&G and MaxUtil with O_s (state-only), O_a (action-only), and O (state-action).

All versions of MaxUtil perform well in Blocks and Hanoi and outperform R&G in all metrics but recall. The main issue is the action-only version in the SkGrid domain (e.g., accuracy of 0.54 and precision of 0.23), which underperforms significantly compared to the other versions. Equation 8 estimates states using the most optimistic of all the states for which the observed action is optimal. Given SkGrid's structure, every action is an optimal action for about half of the states, so this optimistic approach is unlikely to be accurate.
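Continuing the earlier measure sketch, the state-only and action-only MaxUtil variants of Equations 7 and 8 could look as follows for the same state → {action: value} representation; the names and the negation (to fit the arg-min inference, as before) are our own choices.

```python
def maxutil_state_only(Q, observed_states):
    """Eq. 7: optimistically score each observed state by its best action's
    Q-value; negated, as before, to fit the arg-min inference."""
    return -sum(max(Q[s].values()) for s in observed_states if s in Q and Q[s])

def maxutil_action_only(Q, observed_actions):
    """Eq. 8: for each observed action, restrict attention to the states in
    which that action is Q-optimal and take the highest utility among them."""
    total = 0.0
    for a in observed_actions:
        candidates = [q[a] for q in Q.values()
                      if a in q and q[a] >= max(q.values())]
        if candidates:
            total += max(candidates)
    return -total
```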
## Related Work

A large body of work involves learning for planning domains (Zimmerman and Kambhampati 2003; Arora et al. 2018). While some approaches learn action models from data, they do not link these action models to policies for reaching specific goals (Amir and Chang 2008; Amado et al. 2019; Asai and Muise 2020; Juba, Le, and Stern 2021). For example, Zeng et al. (2018) use inverse reinforcement learning (IRL) to learn the rewards of the actor and then use an MDP-based GR. However, for GR, the motivation (rewards) that leads the actor to choose one action over another is redundant. By directly using RL, we skip this stage and learn utility functions or policies based on past actor experiences towards achieving specific goals. Amado et al. (2018) learn domain theories for GR using autoencoders; however, they require observing all possible transitions of a domain in order to infer its encoding, whereas we need only a small sample of transitions to learn a utility function informative enough to carry out GR effectively.

Unlike approaches that learn models for planning, we do not reason about the plan of the acting agent, but rather about the plan of another agent. In this case, we cannot control the actor's choices, and we might not know or care how the actor represents the environment and the task. Nevertheless, we need to be able to find a good-enough explanation for its actions in order to assist it (as in the kitchen example from Figure 1). This setup is not the one used in existing work on learning another agent's behavior, e.g., the LOPE system (García-Martínez and Borrajo 2000), Safaei and Ghassem-Sani (2007), and IRALe (Rodrigues et al. 2011), as these systems choose the execution sequences they learn from. We can, however, use observed actions of other agents to improve our learning process; Gil (1994) does so by investigating cases where executing new experiments can refine operators. Other metric-based GR approaches use distance metrics between an optimal plan and the observation sequence, which can somewhat alleviate the need for online planner executions (Masters and Sardina 2017; Mirsky et al. 2019). This work differs from that problem statement, as it relies on the distance between a Q-function and an observation sequence rather than between an optimal plan and an observation sequence.

## Discussion and Conclusion

In this paper, we introduce a new framework for Goal Recognition (GR) as model-free reinforcement learning, which obviates the need for an explicit model of the environment and candidate goals in the GR process. Our framework uses learned Q-values that implicitly represent the agents under observation, in lieu of the explicit goals of traditional GR. This approach allows us to solve GR problems by minimizing the distance between an observation sequence and the Q-values representing goal hypotheses, or policies extracted from them. The GRAQL instantiation includes several possible distance measures derived from the Q-tables, based on MaxUtil, KL-divergence, and Divergence Point. Our distance measures are competitive with the reference approach from the literature (Ramírez and Geffner 2009) in all experimental environments, and some distance measures outperform the reference approach in most domains, especially when the observation sequence is noisy or partial.
Besides recognition performance, GR needs to be computationally efficient so that an observer can quickly make decisions in response to the recognized goal in real time. In this respect, our approach differs substantially from recent planning-based GR, as it shifts almost all of the computational load to a pre-processing stage instead of relying on costly online planner runs. While computing the policies for each candidate goal is even more costly than running a planner for each goal, this computation can be done once, prior to the recognition stage; after that, processing the observations is trivial and proportional to the number of observations. Finally, learning the policies saves the time of a domain expert who would otherwise need to carefully design the planning dynamics, a cost that is even harder to quantify.

In closing, our work paves the way for a new class of GR approaches based on model-free reinforcement learning. Future work will focus on new, more robust distance measures and on mechanisms to handle noise explicitly, as well as on experimenting with models learned using function approximation (e.g., neural networks). While our work is theoretically compatible with non-tabular representations of the value functions, we chose to focus our experiments on domains that are translatable to PDDL so that our approach can be compared to planning-based GR. GRAQL does not depend on PDDL, as we can apply the learning stage to any domain in which we can compute a policy, and then infer the correct goal using the set of observations and the learned policies.

## References

- Amado, L.; Pereira, R. F.; Aires, J.; Magnaguagno, M.; Granada, R.; and Meneguzzi, F. 2018. Goal recognition in latent space. In 2018 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE.
- Amado, L. R.; Aires, J. P.; Fraga Pereira, R.; Magnaguagno, M. C.; Granada, R.; Licks, G. P.; and Meneguzzi, F. R. 2019. LatRec: Recognizing goals in latent space. In International Conference on Automated Planning and Scheduling (ICAPS).
- Amir, E.; and Chang, A. 2008. Learning partially observable deterministic action models. Journal of Artificial Intelligence Research, 33: 349–402.
- Arora, A.; Fiorino, H.; Pellier, D.; Métivier, M.; and Pesty, S. 2018. A review of learning planning action models. Knowledge Engineering Review, 33.
- Asai, M.; and Muise, C. 2020. Learning Neural-Symbolic Descriptive Planning Models via Cube-Space Priors: The Voyage Home (to STRIPS). In Bessiere, C., ed., International Joint Conference on Artificial Intelligence (IJCAI), 2676–2682.
- Bishop, J.; Burgess, J.; Ramos, C.; Driggs, J. B.; Williams, T.; Tossell, C. C.; Phillips, E.; Shaw, T. H.; and de Visser, E. J. 2020. CHAOPT: A testbed for evaluating human-autonomy team collaboration using the video game Overcooked! 2. In 2020 Systems and Information Engineering Design Symposium (SIEDS), 1–6. IEEE.
- García-Martínez, R.; and Borrajo, D. 2000. An integrated approach of learning, planning, and execution. Journal of Intelligent and Robotic Systems, 29(1): 47–78.
- Gil, Y. 1994. Learning by experimentation: Incremental refinement of incomplete planning domains. In Machine Learning Proceedings 1994, 87–95. Elsevier.
- Granada, R.; Monteiro, J.; Gavenski, N.; and Meneguzzi, F. 2020. Object-Based Goal Recognition Using Real-World Data. In Mexican International Conference on Artificial Intelligence, 325–337. Springer.
- Helmert, M. 2006. The Fast Downward planning system. Journal of Artificial Intelligence Research, 26: 191–246.
- Juba, B.; Le, H. S.; and Stern, R. 2021. Safe Learning of Lifted Action Models. In International Conference on Principles of Knowledge Representation and Reasoning (KR).
- Kautz, H. A.; and Allen, J. F. 1986. Generalized plan recognition. In Conference on Artificial Intelligence (AAAI), volume 86, 5.
- Macke, W.; Mirsky, R.; and Stone, P. 2021. Expected Value of Communication for Planning in Ad Hoc Teamwork. In Conference on Artificial Intelligence (AAAI).
- Masters, P.; and Sardina, S. 2017. Cost-based goal recognition for path-planning. In Conference on Autonomous Agents and MultiAgent Systems.
- Masters, P.; and Sardina, S. 2019. Cost-based goal recognition in navigational domains. Journal of Artificial Intelligence Research, 64: 197–242.
- Meneguzzi, F.; and Pereira, R. F. 2021. A Survey on Goal Recognition as Planning. In International Joint Conference on Artificial Intelligence (IJCAI), 4524–4532.
- Mirsky, R.; Keren, S.; and Geib, C. 2021. Introduction to Symbolic Plan and Goal Recognition. Synthesis Lectures on Artificial Intelligence and Machine Learning, 16(1): 1–190.
- Mirsky, R.; Majadly, A.; Gal, K.; Puzis, R.; Felner, A.; et al. 2019. New goal recognition algorithms using attack graphs. In International Symposium on Cyber Security Cryptography and Machine Learning, 260–278. Springer.
- Monteiro, J.; Granada, R.; Barros, R.; and Meneguzzi, F. 2017. Deep Neural Networks for Kitchen Activity Recognition. In 2017 International Joint Conference on Neural Networks (IJCNN). IEEE.
- Ramírez, M.; and Geffner, H. 2009. Plan recognition as planning. In Twenty-First International Joint Conference on Artificial Intelligence (IJCAI).
- Ramírez, M.; and Geffner, H. 2010. Probabilistic plan recognition using off-the-shelf classical planners. In AAAI Conference on Artificial Intelligence.
- Richter, S.; and Westphal, M. 2010. The LAMA planner: Guiding cost-based anytime planning with landmarks. Journal of Artificial Intelligence Research, 39: 127–177.
- Rodrigues, C.; Gérard, P.; Rouveirol, C.; and Soldano, H. 2011. Active learning of relational action models. In International Conference on Inductive Logic Programming, 302–316. Springer.
- Safaei, J.; and Ghassem-Sani, G. 2007. Incremental learning of planning operators in stochastic domains. In International Conference on Current Trends in Theory and Practice of Computer Science, 644–655. Springer.
- Silver, T.; and Chitnis, R. 2020. PDDLGym: Gym Environments from PDDL Problems. CoRR, abs/2002.06432.
- Sohrabi, S.; Riabov, A. V.; and Udrea, O. 2016. Plan Recognition as Planning Revisited. In International Joint Conference on Artificial Intelligence (IJCAI), 3258–3264.
- Sutton, R. S. 1988. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3: 9–44.
- Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.
- Zeng, Y.; Xu, K.; Yin, Q.; Qin, L.; Zha, Y.; and Yeoh, W. 2018. Inverse Reinforcement Learning Based Human Behavior Modeling for Goal Recognition in Dynamic Local Network Interdiction. In AAAI Workshops, 646–653.
- Zimmerman, T.; and Kambhampati, S. 2003. Learning-assisted automated planning: Looking back, taking stock, going forward. AI Magazine, 24(2): 73–73.