Online Agent Supervision in the Situation Calculus

Bita Banihashemi, York University, Toronto, Canada, bita@cse.yorku.ca
Giuseppe De Giacomo, Sapienza Università di Roma, Roma, Italy, degiacomo@dis.uniroma1.it
Yves Lespérance, York University, Toronto, Canada, lesperan@cse.yorku.ca

Abstract

Agent supervision is a form of control/customization where a supervisor restricts the behavior of an agent to enforce certain requirements, while leaving the agent as much autonomy as possible. In this work, we investigate supervision of an agent that may acquire new knowledge about her environment during execution, for example, by sensing. Thus we consider an agent's online executions, where, as she executes the program, at each time point she must make decisions on what to do next based on what her current knowledge is. This is done in a setting based on the situation calculus and a variant of the ConGolog programming language. The main results of this paper are (i) a formalization of the online maximally permissive supervisor, (ii) a sound and complete technique for execution of the agent as constrained by such a supervisor, and (iii) a new type of lookahead search construct that ensures nonblockingness over such online executions.

1 Introduction

In many settings, we want to restrict an agent's behavior to conform to a set of specifications. For instance, the activities of agents in an organization have to adhere to some business rules and privacy/security protocols. Similarly, a mobile robot has to conform to safety specifications and avoid causing injuries to others. One form of this is customization, where a generic process for performing a task or achieving a goal is refined to satisfy a client's constraints or preferences. Process customization includes personalization [Fritz and McIlraith, 2006] and configuration [Liaskos et al., 2012] and finds applications in a number of areas. A key challenge in such settings is ensuring conformance to specifications while preserving the agent's autonomy.

Motivated by this and inspired by supervisory control of discrete event systems [Wonham and Ramadge, 1987; Wonham, 2014; Cassandras and Lafortune, 2008], De Giacomo, Lespérance and Muise [De Giacomo et al., 2012] (DLM) proposed agent supervision as a form of control/customization of an agent's behavior. The DLM framework is based on the situation calculus [McCarthy and Hayes, 1969; Reiter, 2001] and a variant of the ConGolog [De Giacomo et al., 2000] programming language. DLM represent the agent's possible behaviors as a nondeterministic ConGolog process. Another ConGolog process represents the supervision specification, i.e., which behaviors are acceptable/desirable. If it is possible to control all of the agent's actions, then it is easy to obtain the behaviors of the supervised agent through a kind of synchronous concurrent execution of the agent process and the supervision specification process. However, some of the agent's actions may be uncontrollable. DLM formalize a notion of maximally permissive supervisor that minimally constrains the behavior of the agent in the presence of uncontrollable actions so as to enforce the desired behavioral specifications.

The original DLM account of agent supervision assumes that the agent does not acquire new knowledge about her environment while executing. This means that all reasoning is done using the same knowledge base. The resulting executions are said to be offline executions.
In this paper we study how we can apply the DLM framework in the case where the agent may acquire new knowledge while executing, for example through sensing. This means that the knowledge base that the agent uses in her reasoning needs to be updated during the execution. For instance, consider a travel planner agent that needs to book a seat on a certain flight. Only after querying the airline web service offering that flight will the agent know if there are seats available on the flight. Technically, this requires switching from offline executions to online executions [De Giacomo and Levesque, 1999; Sardiña et al., 2004], which, differently from offline executions, can only be defined meta-theoretically (unless one adds a knowledge operator/fluent), since at every time point the knowledge base used by the agent to deliberate about the next action is different.

Based on online executions, we formalize the notion of online maximally permissive supervisor and show its existence and uniqueness, as in the simpler case of DLM. Moreover, we meta-theoretically define a program construct (i.e., a supervision operator) for online supervised execution that, given the agent and specification, executes them so as to obtain only runs allowed by the maximally permissive supervisor, and we show its soundness and completeness. We also define a new lookahead search construct that ensures the agent can successfully complete the execution (i.e., ensures nonblockingness).

2 Preliminaries

The situation calculus (SC) is a well known predicate logic language for representing and reasoning about dynamically changing worlds. Within the language, one can formulate action theories that describe how the world changes as the result of actions [Reiter, 2001]. We assume that there is a finite number of action types A. Moreover, we assume that the terms of object sort are in fact a countably infinite set N of standard names for which we have the unique name assumption and domain closure. As a result, a basic action theory (BAT) D is the union of the following disjoint sets: the foundational, domain independent, (second-order, or SO) axioms of the situation calculus (Σ); (first-order, or FO) precondition axioms stating when actions can be legally performed (Dposs); (FO) successor state axioms describing how fluents change between situations (Dssa); (FO) unique name axioms for actions and domain closure on action types (Dca); (SO) unique name axioms and domain closure for object constants (Dcoa); and (FO) axioms describing the initial configuration of the world (DS0). A special predicate Poss(a, s) is used to state that action a is executable in situation s; precondition axioms in Dposs characterize this predicate. The abbreviation Executable(s) means that every action performed in reaching situation s was possible in the situation in which it occurred. In turn, successor state axioms encode the causal laws of the world being modeled; they replace the so-called effect axioms and provide a solution to the frame problem. We write do([a1, a2, . . . , an−1, an], s) as an abbreviation for the situation term do(an, do(an−1, . . . , do(a2, do(a1, s)) . . .)); for an action sequence ~a, we often write do(~a, s) for do([~a], s).

To represent and reason about complex actions or processes obtained by suitably executing atomic actions, various so-called high-level programming languages have been defined.
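For concreteness, the following is a small illustrative BAT fragment, written by us in the spirit of the travel planner scenario above (the fluents Booked and HasSeats and the actions book and cancel are our own hypothetical examples, not part of the paper's theories): a precondition axiom from Dposs and a successor state axiom from Dssa.

```latex
% Illustrative only: hypothetical fluents and actions, not taken from the paper.
\begin{align*}
  \mathit{Poss}(\mathit{book}(f), s) &\equiv \mathit{HasSeats}(f, s) \land \neg\mathit{Booked}(f, s)\\
  \mathit{Booked}(f, \mathit{do}(a, s)) &\equiv a = \mathit{book}(f) \lor
      \big(\mathit{Booked}(f, s) \land a \neq \mathit{cancel}(f)\big)
\end{align*}
```

Under these assumptions, DS0 might contain, e.g., HasSeats(F1, S0), and Executable(do([book(F1)], S0)) would then hold in every model of the theory.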
Here, we concentrate on (a fragment of) ConGolog that includes the following constructs:

δ ::= α | φ? | δ1; δ2 | δ1|δ2 | πx.δ | δ* | δ1‖δ2 | δ1 & δ2

In the above, α is an action term, possibly with parameters, and φ is a situation-suppressed formula, i.e., a SC formula with all situation arguments in fluents suppressed. As usual, we denote by φ[s] the formula obtained from φ by restoring the situation argument s into all fluents in φ. Program δ1|δ2 allows for the nondeterministic choice between programs δ1 and δ2, while πx.δ executes program δ for some nondeterministic choice of a legal binding for variable x (observe that such a choice is, in general, unbounded). δ* performs δ zero or more times. Program δ1‖δ2 represents the interleaved concurrent execution of programs δ1 and δ2. The intersection/synchronous concurrent execution of programs δ1 and δ2 (introduced by DLM) is denoted by δ1 & δ2.

Formally, the semantics of ConGolog is specified in terms of single-step transitions, using two predicates [De Giacomo et al., 2000]: (i) Trans(δ, s, δ′, s′), which holds if one step of program δ in situation s may lead to situation s′ with δ′ remaining to be executed; and (ii) Final(δ, s), which holds if program δ may legally terminate in situation s. The definitions of Trans and Final we use are as in [De Giacomo et al., 2010]; differently from [De Giacomo et al., 2000], the test construct φ? does not yield any transition, but is final when satisfied. Thus, it is a synchronous version of the original test construct (it does not allow interleaving). As a result, in our version of ConGolog, every transition involves the execution of an action. Predicate Do(δ, s, s′) means that program δ, when executed starting in situation s, has s′ as a legal terminating situation, and is defined as Do(δ, s, s′) ≐ ∃δ′. Trans*(δ, s, δ′, s′) ∧ Final(δ′, s′), where Trans* denotes the reflexive transitive closure of Trans.

A ConGolog program δ is situation-determined (SD) in a situation s [De Giacomo et al., 2012] if for every sequence of actions, the remaining program is determined by the resulting situation, i.e.,

SituationDetermined(δ, s) ≐ ∀s′, δ′, δ″. Trans*(δ, s, δ′, s′) ∧ Trans*(δ, s, δ″, s′) ⊃ δ′ = δ″.

For example, program (a; b) | (a; c) is not SD, while a; (b | c) is (assuming the actions involved are always executable). Thus, a (partial) execution of an SD program is uniquely determined by the sequence of actions it has produced. Hence a program in a starting situation generates a set/language of action sequences, its executions, and operations like intersection and union become natural. In the rest, we use C to denote the axioms defining the ConGolog programming language.
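To make the single-step semantics and the SD property concrete, here is a minimal interpreter sketch for a tiny fragment of the language (atomic actions, sequence, and nondeterministic choice only). It is our own illustration, not the paper's formalization: situations, theories, tests, pick, iteration, and concurrency are all omitted, programs are plain Python tuples, and the SD check terminates only for star-free programs.

```python
# A toy interpreter mirroring the Trans/Final style of semantics sketched above,
# for programs built from ('act', a), ('seq', d1, d2), ('choice', d1, d2), ('nil',).
NIL = ('nil',)

def final(d):
    """Final(d): the program may legally terminate (no situation argument here)."""
    tag = d[0]
    if tag == 'nil':
        return True
    if tag == 'act':
        return False
    if tag == 'seq':
        return final(d[1]) and final(d[2])
    if tag == 'choice':
        return final(d[1]) or final(d[2])
    raise ValueError(d)

def trans(d):
    """Trans(d): the set of (action, remaining program) single steps.
    Every transition performs an action, as in the synchronous-test semantics."""
    tag = d[0]
    if tag == 'nil':
        return set()
    if tag == 'act':
        return {(d[1], NIL)}
    if tag == 'seq':
        steps = {(a, ('seq', rest, d[2])) for (a, rest) in trans(d[1])}
        if final(d[1]):
            steps |= trans(d[2])
        return steps
    if tag == 'choice':
        return trans(d[1]) | trans(d[2])
    raise ValueError(d)

def situation_determined(d):
    """Check, by exhaustive unfolding, that after any action sequence the
    remaining program is unique (terminates only for star-free programs)."""
    frontier = {(): {d}}
    while frontier:
        new_frontier = {}
        for seq, progs in frontier.items():
            if len(progs) > 1:
                return False
            for prog in progs:
                for a, rest in trans(prog):
                    new_frontier.setdefault(seq + (a,), set()).add(rest)
        frontier = new_frontier
    return True

# (a; b) | (a; c) is not SD: after doing a, either b or c may remain.
d1 = ('choice', ('seq', ('act', 'a'), ('act', 'b')),
                ('seq', ('act', 'a'), ('act', 'c')))
# a; (b | c) is SD.
d2 = ('seq', ('act', 'a'), ('choice', ('act', 'b'), ('act', 'c')))

print(situation_determined(d1))  # False
print(situation_determined(d2))  # True
```

Running this on the two programs discussed above reports that (a; b) | (a; c) is not situation-determined, since after action a either b or c may remain, whereas a; (b | c) is.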
3 Agents Executing Online

In our account of agent supervision, we want to accommodate agents that can acquire new knowledge about their environment during execution, for example by sensing, and whose knowledge base is updated with this new knowledge. Thus we consider an agent's online executions, where, as she executes the program, at each time point, she makes decisions on what to do next based on what her current knowledge is.

Sensing. A crucial aspect of online executions is that the agent can take advantage of sensing. Similarly to [Lespérance et al., 2008], we model sensing as an ordinary action that queries a sensor, followed by the reporting of a sensor result in the form of an exogenous action. Specifically, to sense whether fluent P holds within a program, we use a macro:

SenseP ≐ QryIfP; (repValP(1) | repValP(0)),

where QryIfP is an ordinary action that is always executable and is used to query (i.e., sense) if P holds, and repValP(x) is an exogenous action with no effect that informs the agent whether P holds through its precondition axiom, which is of the form:

Poss(repValP(x), s) ≡ (P(s) ∧ x = 1) ∨ (¬P(s) ∧ x = 0).

Thus, we can understand that SenseP reports value 1 through the execution of repValP(1) if P holds, and 0 through the execution of repValP(0) otherwise. For example, consider the following agent program:

δi = SenseP; ([P?; A] | [¬P?; B])

and assume the agent does not know if P holds initially. So initially we have D ∪ C ⊨ Trans(δi, S0, δ′, S1), where S1 = do(QryIfP, S0) and δ′ = (nil; (repValP(1) | repValP(0))); ([P?; A] | [¬P?; B]). At S1, the agent knows that either of the exogenous actions repValP(0) or repValP(1) could occur, but does not know which. After the occurrence of one of these actions, the agent learns whether P holds. For example, if repValP(1) occurs, the agent's knowledge base is updated to D ∪ C ∪ {Poss(repValP(1), S1)}. With this updated knowledge, she knows which action to do next:

D ∪ C ∪ {Poss(repValP(1), S1)} ⊨ Trans(nil; ([P?; A] | [¬P?; B]), do(repValP(1), S1), nil, do([repValP(1), A], S1)).

Notice that with this way of doing sensing, we essentially store the sensing results in the situation (which includes all actions executed so far, including the exogenous actions used for sensing). In particular, the current KB after having performed the sequence of actions ~a is D ∪ C ∪ {Executable(do(~a, S0))}. Note that this approach also handles the agent's acquiring knowledge from an arbitrary exogenous action.

Agent online configurations and transitions. We denote an agent by a pair σ = ⟨D, δi⟩, where δi is the initial program of the agent, expressed in ConGolog, and D is a BAT that represents the agent's initial knowledge (which may be incomplete). We assume that we have a finite set of primitive action types A, which is the disjoint union of a set of ordinary primitive action types Ao and a set of exogenous primitive action types Ae. An agent configuration is modeled as a pair ⟨δ, ~a⟩, where δ is the remaining program and ~a is the sequence of actions performed so far starting from S0. The initial configuration is ci = ⟨δi, ε⟩, where ε is the empty sequence of actions. The online transition relation between agent configurations is a (meta-theoretic) binary relation defined as follows: ⟨δ, ~a⟩ →A(~n) ⟨δ′, ~a·A(~n)⟩ if and only if either

A ∈ Ao, ~n ∈ N^k and D ∪ C ∪ {Executable(do(~a, S0))} ⊨ Trans(δ, do(~a, S0), δ′, do(A(~n), do(~a, S0))), or

A ∈ Ae, ~n ∈ N^k and D ∪ C ∪ {Executable(do(~a, S0)), Trans(δ, do(~a, S0), δ′, do(A(~n), do(~a, S0)))} is satisfiable.

Here, ⟨δ, ~a⟩ →A(~n) ⟨δ′, ~a·A(~n)⟩ means that configuration ⟨δ, ~a⟩ can make a single-step online transition to configuration ⟨δ′, ~a·A(~n)⟩ by performing action A(~n). If A(~n) is an ordinary action, the agent must know that the action is executable and know what the remaining program is afterwards. If A(~n) is an exogenous action, the agent need only think that the action may be possible with δ′ being the remaining program, i.e., it must be consistent with what she knows that the action is executable and that δ′ is the remaining program. As part of the transition, the theory is (implicitly) updated in that the new exogenous action A(~n) is added to the action sequence, and Executable(do([~a, A(~n)], S0)) will be added to the theory when it is queried in later transitions, thus incorporating the fact that Poss(A(~n), do(~a, S0)) is now known to hold.
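The following toy sketch, which is ours and not the paper's formalization, illustrates the asymmetry between ordinary and exogenous actions in the online transition relation. The agent's knowledge base is approximated by the set of truth values of P it still considers possible, remaining programs are omitted for brevity, and the action names mirror the SenseP macro above.

```python
# "Entailed" = true in all possible worlds; "consistent" = true in some world.
ORDINARY = {'qry_if_P', 'A', 'B'}
EXOGENOUS = {'rep_val_P_1', 'rep_val_P_0'}

def poss(action, world):
    """Precondition of each action in a world; a world is the truth value of P."""
    if action == 'rep_val_P_1':
        return world is True      # report 1 is only possible if P holds
    if action == 'rep_val_P_0':
        return world is False     # report 0 is only possible if P does not hold
    return True                   # qry_if_P, A, B are always executable

def online_step(worlds, action):
    """Return the updated set of possible worlds if the online transition is
    allowed, else None.  Ordinary actions must be *known* executable; exogenous
    actions need only be *consistent* with what is known, and observing one
    prunes the worlds in which it was impossible (the implicit theory update)."""
    if action in ORDINARY:
        return set(worlds) if all(poss(action, w) for w in worlds) else None
    if action in EXOGENOUS:
        compatible = {w for w in worlds if poss(action, w)}
        return compatible if compatible else None
    raise ValueError(action)

worlds = {True, False}                      # the agent does not know whether P holds
worlds = online_step(worlds, 'qry_if_P')    # ordinary query action: allowed
print(worlds)                               # {True, False}: still unknown
worlds = online_step(worlds, 'rep_val_P_1') # exogenous report: consistent, so allowed
print(worlds)                               # {True}: the agent now knows P holds
print(online_step(worlds, 'rep_val_P_0'))   # None: no longer consistent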
The (meta-theoretic) relation c →*~a c′ is the reflexive-transitive closure of c →A(~n) c′ and denotes that online configuration c′ can be reached from online configuration c by performing a sequence of online transitions involving the sequence of actions ~a. We also define a (meta-theoretic) predicate c✓ meaning that the online configuration c is known to be final: ⟨δ, ~a⟩✓ if and only if D ∪ C ∪ {Executable(do(~a, S0))} ⊨ Final(δ, do(~a, S0)).

Online situation-determined agents. In this paper, we are interested in programs that are SD, i.e., given a program, a situation and an action, we want the remaining program to be determined. However, this is not sufficient when considering online executions. We want to ensure that the agent always knows what the remaining program is after any sequence of actions. We say that an agent is online situation-determined (online SD) if for any sequence of actions that the agent can perform online, the resulting agent configuration is unique. Formally, an agent σ = ⟨D, δi⟩ with initial configuration ci = ⟨δi, ε⟩ is online SD if and only if for all sequences of actions ~a, if ci →*~a c′ and ci →*~a c″ then c′ = c″. In [Banihashemi et al., 2016b], it is shown that for an agent to be online SD, it is sufficient that the agent always knows what the remaining program is after an exogenous action. From now on, we assume that the agent is online SD.

Online runs. For an agent σ that is online SD, online executions can be succinctly represented by runs formed by the corresponding sequence of actions. The set RR(σ) of (partial) runs of an online SD agent σ with starting configuration ci is the set of action sequences that can be produced by executing ci from S0: RR(σ) = {~a | ∃c. ci →*~a c}. A run is complete if it reaches a final configuration. Formally, we define the set CR(σ) of complete runs as CR(σ) = {~a | ∃c. ci →*~a c ∧ c✓}. Finally, we say that a run is good if it can be extended to a complete run. Formally, we define the set GR(σ) of good runs as GR(σ) = {~a | ∃c, c′, ~a′. ci →*~a c ∧ c →*~a′ c′ ∧ c′✓}.
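As a concrete, entirely hand-coded illustration of the three sets of runs, the sketch below enumerates them over a tiny finite configuration graph. The configurations and transitions are made up for the example; in the paper they come from the online transition relation and the theory D.

```python
# Illustrative only: RR/CR/GR over a finite, hand-coded online configuration graph.
STEPS = {                       # config -> {action: next config}
    'start': {'A': 'done', 'B': 'stuck'},
    'done':  {},
    'stuck': {},                # no transition and not final: a blocking config
}
FINAL = {'done'}

def runs(init):
    """Exhaustively enumerate (run, reached config) pairs from init."""
    out, frontier = set(), [((), init)]
    while frontier:
        run, cfg = frontier.pop()
        out.add((run, cfg))
        for act, nxt in STEPS[cfg].items():
            frontier.append((run + (act,), nxt))
    return out

reachable = runs('start')
RR = {run for run, _ in reachable}                       # all partial runs
CR = {run for run, cfg in reachable if cfg in FINAL}     # complete runs
GR = {run for run in RR                                  # good runs: extendable
      if any(cr[:len(run)] == run for cr in CR)}         # to some complete run

print(sorted(RR))   # [(), ('A',), ('B',)]
print(sorted(CR))   # [('A',)]
print(sorted(GR))   # [(), ('A',)] -- ('B',) is a run but not a good run
```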
4 Online Agent Supervision

Agent supervision aims at restricting an agent's behavior to ensure that it conforms to a supervision specification while leaving it as much autonomy as possible. DLM's account of agent supervision is based on offline executions and does not accommodate agents that acquire new knowledge during a run. DLM represent the agent's possible behaviors by a (nondeterministic) SD ConGolog program δi relative to a BAT D. The supervision specification is represented by another SD ConGolog program δs. First note that if it is possible to control all the actions of the agent, then it is straightforward to specify the result of supervision as the intersection of the agent and the specification processes (δi & δs). However, in general, some of the agent's actions may be uncontrollable. These are often the result of the interaction of an agent with external resources, or may represent aspects of the agent's behavior that must remain autonomous and cannot be controlled directly. This is modeled by the special fluent Au(a, s), which means that action a is uncontrollable in situation s.

DLM say that a supervision specification δs is controllable wrt the agent program δi in situation s iff

∀~a, au. (∃~b. Do(δs, s, do([~a, ~b], s))) ∧ Au(au, do(~a, s)) ⊃ (∃~d. Do(δi, s, do([~a, au, ~d], s)) ⊃ ∃~b′. Do(δs, s, do([~a, au, ~b′], s))),

i.e., if we postfix an action sequence ~a that is a good offline run for δs (i.e., such that ∃~b. Do(δs, s, do([~a, ~b], s)) holds) with an uncontrollable action au which is good for δi, then au must also be good for δs.

Then, DLM define the offline maximally permissive supervisor (offline MPS) mps_o(δi, δs, s) of the agent behavior δi which fulfills the supervision specification δs as

mps_o(δi, δs, s) = set(⋃_{E∈𝔼} E), where 𝔼 = {E | ∀~a ∈ E. Do(δi & δs, s, do(~a, s)) and set(E) is controllable wrt δi in s}.

This says that the offline MPS is the union of all sets of action sequences that are complete offline runs of both δi and δs (i.e., such that Do(δi & δs, s, do(~a, s))) and that are controllable wrt δi in situation s. The above definition uses the set(E) construct introduced by DLM, which is a sort of infinitary nondeterministic branch; it takes an arbitrary set of sequences of actions E and turns it into a program. We define its semantics as follows:

Trans(set(E), s, δ′, s′) ≡ ∃a, ~a. a·~a ∈ E ∧ Poss(a, s) ∧ s′ = do(a, s) ∧ δ′ = set({~a | a·~a ∈ E ∧ Poss(a, s)})
Final(set(E), s) ≡ ε ∈ E

Therefore set(E) can be executed to produce any of the sequences of actions in E. (Obviously, there are certain sets that can be expressed directly in ConGolog, e.g., when E is finite. However, in the general case, the object domain may be infinite, and set(E) may not be representable as a finitary ConGolog program.)

DLM show that their notion of offline MPS, mps_o(δi, δs, s), has many nice properties: it always exists and is unique, it is controllable wrt the agent behavior δi in s, and it is the largest set of offline complete runs of δi that is controllable wrt δi in s and satisfies the supervision specification δs in s, i.e., it is maximally permissive. However, the notion of offline MPS is inadequate in the context of online execution, as the following example shows.

Example 1 Suppose that we have an agent that does not know whether P holds initially, i.e., D ⊭ P(S0) and D ⊭ ¬P(S0). Suppose that the agent's initial program is

δi_4 = [P?; ((A; (C | U)) | (B; D))] | [¬P?; ((A; D) | (B; (C | U)))],

where all actions are ordinary, always executable, and controllable except for U, which is always uncontrollable. Suppose that the supervision specification is

δs_4 = (πa. a ≠ U?; a)*,

i.e., any action except U can be performed. It is easy to show that the offline MPS obtained using DLM's definition is different depending on whether P holds or not:

D ∪ C ⊨ (P(S0) ⊃ mps_o(δi_4, δs_4, S0) = set({[B, D]})) ∧ (¬P(S0) ⊃ mps_o(δi_4, δs_4, S0) = set({[A, D]})).

For models of the theory where P holds, the offline MPS is set({[B, D]}), as the set of complete offline runs of δi_4 & δs_4 is {[B, D], [A, C]} and set({[A, C]}) is not controllable wrt δi_4 in S0. For models where P does not hold, the offline MPS is set({[A, D]}), since the set of complete offline runs of δi_4 & δs_4 in S0 is {[A, D], [B, C]} and set({[B, C]}) is not controllable wrt δi_4 in S0. Since it is not known if P holds, it seems that a correct supervisor should neither allow A nor B.

As the above example illustrates, we have an offline MPS for each model of the theory. Instead, we want a single online MPS that works for all models and takes sensing information into account when it is acquired. The difference between the offline MPS and the online MPS is analogous to the difference between classical plans and conditional plans that include sensing in the planning literature [Ghallab et al., 2004].
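Since the relevant run sets in Example 1 are finite, the offline MPS for a fixed model can be computed with the standard supremal-controllable-sublanguage fixpoint from discrete event systems. The sketch below is our own illustration of that computation (not an algorithm given in the paper) for the model where P holds, with the relevant sets of complete runs hand-coded as Python tuples.

```python
# Start from the complete runs of delta_i & delta_s and iteratively discard runs
# with a prefix after which some uncontrollable action is good for the agent but
# not for the candidate set, until a fixpoint is reached.
def prefixes(runs):
    return {run[:k] for run in runs for k in range(len(run) + 1)}

def offline_mps(joint_runs, agent_runs, uncontrollable):
    K = set(joint_runs)
    while True:
        good_K, good_agent = prefixes(K), prefixes(agent_runs)
        bad = {p for p in good_K for u in uncontrollable
               if p + (u,) in good_agent and p + (u,) not in good_K}
        pruned = {r for r in K if not any(r[:len(p)] == p for p in bad)}
        if pruned == K:
            return K
        K = pruned

# Model where P holds: the agent behaves as (A;(C|U)) | (B;D), the spec forbids U.
agent_runs = {('A', 'C'), ('A', 'U'), ('B', 'D')}   # complete runs of delta_i
joint_runs = {('A', 'C'), ('B', 'D')}               # complete runs of delta_i & delta_s
print(offline_mps(joint_runs, agent_runs, {'U'}))   # {('B', 'D')}, as in Example 1
```

The run [A, C] is discarded because its prefix [A] can be extended by the uncontrollable U, which is good for the agent but not for the candidate set, exactly as argued in Example 1.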
Online maximally permissive supervisor. In our account of supervision, we consider agents that may acquire knowledge through sensing and exogenous actions as they operate and make decisions based on what they know, and we model these as online SD agents. To see how we can formalize supervision for such agents, assume that we have an online SD agent σ = ⟨D, δi⟩ whose behavior we want to supervise. Also suppose that we have a supervision specification δs of what behaviors we want to allow in the supervised system and that the system ⟨D, δs⟩ is also online SD.

We say that a specification δs is online controllable wrt the online SD agent σ = ⟨D, δi⟩ iff

∀~a, au. ~a ∈ GR(⟨D, δs⟩) and D ∪ {Executable(do(~a, S0))} ⊭ ¬Au(au, do(~a, S0)) implies that if ~a·au ∈ GR(σ) then ~a·au ∈ GR(⟨D, δs⟩),

i.e., if we postfix a good online run ~a for ⟨D, δs⟩ with an action au that is not known to be controllable and which is good for σ (and so ~a must be good for σ as well), then au must also be good for ⟨D, δs⟩ (~a·au ∈ GR(σ) and ~a·au ∈ GR(⟨D, δs⟩) together imply that ~a·au ∈ GR(⟨D, δi & δs⟩)). This definition, differently from DLM's, applies to online runs. Moreover, it treats actions that are not known to be controllable as uncontrollable, thus ensuring that δs is controllable in all possible models/worlds compatible with what the agent knows. As DLM, we focus on good runs of the process, assuming that the agent will not perform actions that don't lead to a final configuration of δi. The supervisor only ensures that, given this, the process always conforms to the specification.

Given this, we can then define the online maximally permissive supervisor mpsonl(δs, σ) of the online SD agent σ = ⟨D, δi⟩ which fulfills the supervision specification δs:

mpsonl(δs, σ) = set(⋃_{E∈𝔼} E), where 𝔼 = {E | E ⊆ CR(⟨D, δi & δs⟩) and set(E) is online controllable wrt σ},

i.e., the online MPS is the union of all sets of action sequences that are complete online runs of both δi and δs and that are online controllable wrt the agent σ. We can show that:

Theorem 1 For the online maximally permissive supervisor mpsonl(δs, σ) of the online SD agent σ = ⟨D, δi⟩ which fulfills the supervision specification δs, where ⟨D, δs⟩ is also online SD, the following properties hold:
1. mpsonl(δs, σ) always exists and is unique;
2. ⟨D, mpsonl(δs, σ)⟩ is online SD;
3. mpsonl(δs, σ) is online controllable wrt σ;
4. for every possible online controllable supervision specification δ̂s for σ such that CR(⟨D, δi & δ̂s⟩) ⊆ CR(⟨D, δi & δs⟩), we have that CR(⟨D, δi & δ̂s⟩) ⊆ CR(⟨D, mpsonl(δs, σ)⟩), i.e., mpsonl is maximally permissive;
5. RR(⟨D, mpsonl(δs, σ)⟩) = GR(⟨D, mpsonl(δs, σ)⟩), i.e., mpsonl(δs, σ) is non-blocking.

Example 2 If we return to the agent of Example 1, who does not know whether P holds initially, it is easy to show that our definition of online MPS yields the correct result, i.e., mpsonl(δs_4, ⟨D, δi_4⟩) = set({}), the supervisor that allows no runs at all.
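The sketch below is a direct finite transcription of the online controllability condition just given. The good-run sets and the "not known to be controllable" test are supplied by hand: all names and run sets here are hypothetical (loosely shaped like the language of Example 1), whereas in the paper membership in GR and the entailment test are settled by reasoning over D.

```python
# Check the online controllability condition over finite, hand-supplied run sets.
def online_controllable(gr_spec, gr_agent, maybe_uncontrollable):
    for run in gr_spec:
        # candidate next actions taken from the agent's good-run extensions of run
        for act in {r[len(run)] for r in gr_agent
                    if len(r) > len(run) and r[:len(run)] == run}:
            extended = run + (act,)
            if (maybe_uncontrollable(run, act)
                    and extended in gr_agent
                    and extended not in gr_spec):
                return False
    return True

# Hypothetical agent good runs: it may do A then C or U, or B then D.
gr_agent = {(), ('A',), ('A', 'C'), ('A', 'U'), ('B',), ('B', 'D')}
maybe_unc = lambda run, act: act == 'U'     # only U is not known to be controllable

print(online_controllable({(), ('A',), ('A', 'C')}, gr_agent, maybe_unc))  # False
print(online_controllable({(), ('B',), ('B', 'D')}, gr_agent, maybe_unc))  # True
```

The first specification is rejected because after the good run (A) the possibly uncontrollable U is good for the agent but not for the specification; the second never exposes such an action.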
Example 3 Supervision can also depend on the information that the agent acquires as it executes. Again, suppose that we have an agent that does not know whether P holds initially. Suppose also that the agent's initial program is δi_5 = SenseP; δi_4. We can show that

D ∪ C ⊨ (P(S0) ⊃ mps_o(δi_5, δs_4, S0) = set({[QryIfP, repValP(1), B, D]})) ∧ (¬P(S0) ⊃ mps_o(δi_5, δs_4, S0) = set({[QryIfP, repValP(0), A, D]})).

Again, we have different offline MPSs depending on whether P holds. But since the exogenous report makes the truth value of P known after the first action, we get a single online MPS for this agent:

mpsonl(δs_4, ⟨D, δi_5⟩) = set({[QryIfP, repValP(1), B, D], [QryIfP, repValP(0), A, D]}).

Because the agent queries whether P holds, the supervisor has enough information to decide the maximal set of runs from then on in each case. So if the reported value of P is true, then the online supervisor should eliminate the complete run [A, C], as it is not controllable, and if P does not hold, the run [B, C] should be eliminated for the same reason. As well, an action's controllability, or whether it satisfies the specification, may depend on a condition whose truth only becomes known during the execution. Such cases cannot be handled by DLM's original offline account, but our online supervision account does handle them correctly.

5 Online Supervision Operator

We also introduce a meta-theoretic version of a synchronous concurrency operator δi &onl_Au δs that captures the maximally permissive execution of an agent ⟨D, δi⟩ under online supervision for specification δs. Wlog, we assume that both δi and δs start with a common controllable action (if not, it is trivial to add a dummy action in front of both). We define δi &onl_Au δs by extending the online transition relation as follows:

⟨δi &onl_Au δs, ~a⟩ →a ⟨δi′ &onl_Au δs′, ~a·a⟩ if and only if
⟨δi, ~a⟩ →a ⟨δi′, ~a·a⟩ and ⟨δs, ~a⟩ →a ⟨δs′, ~a·a⟩ and
if D ∪ {Executable(do(~a, S0))} ⊨ ¬Au(a, do(~a, S0)) then
for all ~au s.t. D ∪ {Executable(do(~a·a·~au, S0)), Au(~au, do(~a·a, S0))} is satisfiable,
if ~a·a·~au ∈ GR(⟨D, [~a; δi]⟩), then ~a·a·~au ∈ GR(⟨D, [~a; δs]⟩),

where Au(~au, s) means that action sequence ~au is uncontrollable in situation s, and is inductively defined on the length of ~au as the smallest predicate such that: (i) Au(ε, s) ≡ true; (ii) Au(au·~au, s) ≡ Au(au, s) ∧ Au(~au, do(au, s)).

Thus, the online maximally permissive supervised execution of δi for the specification δs is allowed to perform action a in situation do(~a, S0) if a is allowed by both δi and δs and, moreover, if a is known to be controllable, then for every sequence of actions ~au not known to be controllable, if ~au may be performed by δi right after a on one of its complete runs, then it must also be allowed by δs (on one of its complete runs). Essentially, a controllable action a by the agent must be forbidden if it can be followed by some sequence of actions not known to be controllable that violates the specification. The final configurations are extended as follows:

(⟨δi &onl_Au δs, ~a⟩)✓ if and only if (⟨δi, ~a⟩)✓ and (⟨δs, ~a⟩)✓.

We can show that, firstly, if both the agent and supervision specification processes are online SD, then so is the program obtained using the online supervision operator, and moreover, this program is controllable wrt the agent process:

Theorem 2
1. If ⟨D, δs⟩ and ⟨D, δi⟩ are online SD, then so is ⟨D, δi &onl_Au δs⟩.
2. δi &onl_Au δs is online controllable wrt ⟨D, δi⟩.

Moreover, the complete runs of the program obtained using the online supervision operator are exactly the same as the complete runs generated under synchronous concurrency of the agent and mpsonl(δs, σ):

Theorem 3 CR(⟨D, δi &onl_Au δs⟩) = CR(⟨D, δi & mpsonl(δs, σ)⟩).
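To make the transition condition above concrete, the sketch below transcribes it over finite, hand-supplied run sets, with a per-action approximation of "not known to be controllable". All run sets and names are hypothetical; in the paper, membership in RR and GR and the controllability tests are settled by reasoning over D.

```python
def unc_continuations_ok(base, gr_agent, gr_spec, maybe_unc):
    """Every extension of base in gr_agent whose added suffix consists only of
    actions not known to be controllable must also lie in gr_spec."""
    for r in gr_agent:
        if (len(r) >= len(base) and r[:len(base)] == base
                and all(maybe_unc(a) for a in r[len(base):])
                and r not in gr_spec):
            return False
    return True

def supervised_step_allowed(run, act, rr_agent, rr_spec, gr_agent, gr_spec, maybe_unc):
    """Finite transcription of the supervised-execution step condition above."""
    nxt = run + (act,)
    if nxt not in rr_agent or nxt not in rr_spec:
        return False              # both the agent and the specification must allow act
    if maybe_unc(act):
        return True               # act not known controllable: no further check
    return unc_continuations_ok(nxt, gr_agent, gr_spec, maybe_unc)

# Hypothetical run sets: the agent may do A then C or U, or B then D; the
# specification's runs exclude U.
rr_agent = {(), ('A',), ('A', 'C'), ('A', 'U'), ('B',), ('B', 'D')}
gr_agent = rr_agent
rr_spec  = {(), ('A',), ('A', 'C'), ('B',), ('B', 'D')}
gr_spec  = rr_spec
unc = lambda a: a == 'U'

print(supervised_step_allowed((), 'A', rr_agent, rr_spec, gr_agent, gr_spec, unc))  # False
print(supervised_step_allowed((), 'B', rr_agent, rr_spec, gr_agent, gr_spec, unc))  # True
```

Here A is forbidden because the possibly uncontrollable U may follow it on a good agent run while violating the specification, whereas B has no such continuation.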
While δi &onl_Au δs and mpsonl(δs, σ) have the same complete runs, they differ in their sets of partial runs. In general, RR(⟨D, δi &onl_Au δs⟩) ≠ GR(⟨D, δi &onl_Au δs⟩), i.e., the program obtained using the online supervision operator is not necessarily non-blocking. This contrasts with mpsonl(δs, σ), which is guaranteed to be non-blocking (Theorem 1).

Example 4 Suppose we have the agent program

δi_6 = A | [B; C; (U1 | U2; D)],

where all actions except U1 and U2 are ordinary and controllable. Moreover, assume the supervision specification is

δs_6 = (πa. a ≠ D?; a)*,

i.e., any action except D can be performed. The online MPS for this agent is simply set({[A]}), since CR(⟨D, δi_6 & δs_6⟩) = {[A], [B, C, U1]} and set({[B, C, U1]}) is not controllable wrt δi_6. However, under online supervised execution, the agent may execute the action B. We have ⟨δi_6, ε⟩ →B ⟨δ′_6, [B]⟩, where δ′_6 is what remains of δi_6 after executing B. The resulting program is not final in do(B, S0), yet there is no transition from this state, as the action C could be followed by the uncontrollable action U2, and it is not possible to ensure successful completion of the process, as the action D is not allowed. Thus, one must do lookahead search over online executions of δi_6 to obtain good/complete runs. We propose such a search/lookahead construct next.

6 Search Over a Controllable Process

When we have a specification/process δs that is (online) controllable wrt an agent ⟨D, δi⟩ (e.g., δi &onl_Au δs), for any choice of uncontrollable action that is on a good run of δi, it is always possible to find a way to continue executing δs until the process successfully completes. We define a search construct that makes an arbitrary choice of action that is on a good run of δi when the action is not known to be controllable, while still only performing actions that are on a good run of δs otherwise. (In IndiGolog, a simple type of search is provided that only allows a transition if the remaining program can be executed to reach a final state [De Giacomo and Levesque, 1999]; however, that search does not deal with sensing and online executions.) We call this construct weak online search w_onl(δs, δi) and define it (meta-theoretically) as follows:

⟨w_onl(δs, δi), ~a⟩ →a ⟨w_onl(δs′, δi′), ~a·a⟩ if and only if
⟨δs, ~a⟩ →a ⟨δs′, ~a·a⟩ and ⟨δi, ~a⟩ →a ⟨δi′, ~a·a⟩ and
if D ∪ {Executable(do(~a, S0))} ⊨ ¬Au(a, do(~a, S0)) then ~a·a ∈ GR(⟨D, [~a·a; δs′]⟩) else ~a·a ∈ GR(⟨D, [~a·a; δi′]⟩).

The final configurations are extended as follows:

(⟨w_onl(δs, δi), ~a⟩)✓ iff (⟨δs, ~a⟩)✓ and (⟨δi, ~a⟩)✓.

(Since δi can include exogenous actions, in general, executions of the process could actually perform exogenous actions that are not on a good run of δi. However, in this paper we are interested in the case where the exogenous actions are mainly sensor reports and external requests, rather than the actions of an adversary, and we assume that this will not occur; handling adversarial nondeterminism in δi is left for future work.)

It is easy to show that:

Theorem 4 If ⟨D, δs⟩ and ⟨D, δi⟩ are online SD, then so is ⟨D, w_onl(δs, δi)⟩.

Now, we can show that the weak online search construct has many nice properties when the process is controllable:

Theorem 5 Suppose that we have an agent ⟨D, δi⟩ and a supervision specification δs which are online SD. Suppose also that δs is online controllable with respect to ⟨D, δi⟩, and that CR(⟨D, δs⟩) ⊆ CR(⟨D, δi⟩). Then we have that:
1. CR(⟨D, w_onl(δs, δi)⟩) = CR(⟨D, δs⟩), i.e., the complete runs of w_onl(δs, δi) are the complete runs of δs.
2. If CR(⟨D, δs⟩) ≠ ∅, then RR(⟨D, w_onl(δs, δi)⟩) = GR(⟨D, δs⟩), i.e., the partial runs of w_onl(δs, δi) are the good runs of δs.
3. If CR(⟨D, δs⟩) ≠ ∅, then RR(⟨D, w_onl(δs, δi)⟩) = GR(⟨D, w_onl(δs, δi)⟩), i.e., partial runs must be good runs, and the resulting program is non-blocking.

It is also easy to show that none of these properties hold for arbitrary non-controllable processes.
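Below is a finite transcription of the weak-search step condition, in the same style as the earlier sketches. The concrete run sets are read off Example 4 as described in the text (the supervised process δi_6 &onl_Au δs_6 may do B but is then blocked) rather than computed; deciding membership in the GR sets is where the actual lookahead, i.e., reasoning over D, happens in the paper.

```python
# Illustrative transcription of the weak online search step condition above.
def weak_search_step_allowed(run, act, rr_agent, rr_spec,
                             gr_agent, gr_spec, known_controllable):
    nxt = run + (act,)
    if nxt not in rr_agent or nxt not in rr_spec:
        return False              # both programs must be able to take act
    if known_controllable(act):
        return nxt in gr_spec     # stay on a good run of the first argument
    return nxt in gr_agent        # otherwise stay on a good run of the agent

# Agent delta_i_6 = A | [B; C; (U1 | U2; D)]; the first argument is the supervised
# process delta_i_6 &onl_Au delta_s_6, whose runs are as described in Example 4.
rr_agent = {(), ('A',), ('B',), ('B', 'C'), ('B', 'C', 'U1'),
            ('B', 'C', 'U2'), ('B', 'C', 'U2', 'D')}
gr_agent = rr_agent                    # every agent run can be completed
rr_sup   = {(), ('A',), ('B',)}        # the supervised process may do B ...
gr_sup   = {(), ('A',)}                # ... but is then blocked (Example 4)
ctrl     = lambda a: a not in {'U1', 'U2'}

print(weak_search_step_allowed((), 'A', rr_agent, rr_sup, gr_agent, gr_sup, ctrl))  # True
print(weak_search_step_allowed((), 'B', rr_agent, rr_sup, gr_agent, gr_sup, ctrl))  # False
```

With these inputs, weak search prunes the transition on B, matching the discussion of Example 4 below.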
Now we can show that if we apply this weak lookahead search to δi &onl_Au δs, we obtain a program that has the same partial runs as mpsonl(δs, σ) and is thus non-blocking:

RR(⟨D, w_onl(δi &onl_Au δs, δi)⟩) = RR(⟨D, δi & mpsonl(δs, σ)⟩).

If we apply the weak online search construct over δi_6 &onl_Au δs_6 in Example 4, we no longer have an online transition involving action B; the only possible online transition is ⟨w_onl(δi_6 &onl_Au δs_6, δi_6), ε⟩ →A ⟨w_onl(nil &onl_Au δs_6, nil), [A]⟩, where action A is performed, after which we have (⟨w_onl(nil &onl_Au δs_6, nil), [A]⟩)✓.

7 Discussion

A popular approach to automated service composition [McIlraith and Son, 2002; Sohrabi et al., 2006] customizes a generic ConGolog process based on the user's constraints. [Sardiña and De Giacomo, 2009], on the other hand, synthesize a controller that orchestrates the concurrent execution of a library of ConGolog programs to realize a target program not in the library. However, they assume complete information on the initial situation, and their controller is not maximally permissive. In related work, [De Giacomo et al., 2013b] synthesize a controller generator that represents all possible compositions of the target behavior and may adapt reactively based on runtime feedback. In [Yadav et al., 2013], optimal realization of the target behavior (in the presence of uncontrollable exogenous events) is considered when its full realization is not possible. [Alechina et al., 2015] regulate multiagent systems using regimented norms: a transition system describes the behavior of a (multi-)agent system and a guard function can enable/disable options that (could) violate norms after a system history (possibly using bounded lookahead). Finally, the approach in [Aucher, 2014] reformulates the results of supervisory control theory in terms of model checking problems in an epistemic temporal logic. While these approaches model behaviors as (nondeterministic) finite state transition systems, our approach enables users to express the system model and the specifications in a high-level expressive language. Moreover, due to its first-order logic foundations, it can handle infinite object domains and infinite states.

In this paper we have developed an account of supervision for agents that execute online and can acquire new knowledge as they operate. The framework uses a truly first-order representation of states and allows for an infinite object domain and infinite states. Proofs and examples of using online agent supervision to customize a travel planner agent are presented in [Banihashemi et al., 2016a]. If the object domain is finite, then finite-state techniques developed for discrete event systems [Wonham and Ramadge, 1987] can be adapted to synthesize a program that characterizes the online MPS. It should also be possible to effectively synthesize supervisors for agents that use bounded action theories [De Giacomo et al., 2013a; 2014]; verification of temporal properties over such agents is known to be decidable.

Acknowledgments

We acknowledge the support of the Sapienza 2015 project "Immersive Cognitive Environments" and the Natural Sciences and Engineering Research Council of Canada.
References

[Alechina et al., 2015] Natasha Alechina, Nils Bulling, Mehdi Dastani, and Brian Logan. Practical run-time norm enforcement with bounded lookahead. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 443–451. ACM, 2015.

[Aucher, 2014] Guillaume Aucher. Supervisory control theory in epistemic temporal logic. In International Conference on Autonomous Agents and Multi-Agent Systems. IFAAMAS/ACM, 2014.

[Banihashemi et al., 2016a] Bita Banihashemi, Giuseppe De Giacomo, and Yves Lespérance. Online agent supervision in the situation calculus – extended version. Technical Report EECS-2016-02, York University, 2016.

[Banihashemi et al., 2016b] Bita Banihashemi, Giuseppe De Giacomo, and Yves Lespérance. Online situation-determined agents and their supervision. In Proceedings of the 15th International Conference on Principles of Knowledge Representation and Reasoning. AAAI, 2016.

[Cassandras and Lafortune, 2008] C. G. Cassandras and S. Lafortune. Introduction to Discrete Event Systems, Second Edition. Springer, 2008.

[De Giacomo and Levesque, 1999] Giuseppe De Giacomo and Hector J. Levesque. An incremental interpreter for high-level programs with sensing. In Logical Foundations for Cognitive Agents: Contributions in Honor of Ray Reiter, pages 86–102. Springer Berlin Heidelberg, 1999.

[De Giacomo et al., 2000] Giuseppe De Giacomo, Yves Lespérance, and Hector J. Levesque. ConGolog, a concurrent programming language based on the situation calculus. Artificial Intelligence, 121(1–2):109–169, 2000.

[De Giacomo et al., 2010] Giuseppe De Giacomo, Yves Lespérance, and Adrian R. Pearce. Situation calculus based programs for representing and reasoning about game structures. In Principles of Knowledge Representation and Reasoning: Proceedings of the Twelfth International Conference. AAAI Press, 2010.

[De Giacomo et al., 2012] Giuseppe De Giacomo, Yves Lespérance, and Christian J. Muise. On supervising agents in situation-determined ConGolog. In International Conference on Autonomous Agents and Multiagent Systems, pages 1031–1038. IFAAMAS, 2012.

[De Giacomo et al., 2013a] Giuseppe De Giacomo, Yves Lespérance, and Fabio Patrizi. Bounded epistemic situation calculus theories. In IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence. IJCAI/AAAI, 2013.

[De Giacomo et al., 2013b] Giuseppe De Giacomo, Fabio Patrizi, and Sebastian Sardiña. Automatic behavior composition synthesis. Artificial Intelligence, 196:106–142, 2013.

[De Giacomo et al., 2014] Giuseppe De Giacomo, Yves Lespérance, Fabio Patrizi, and Stavros Vassos. LTL verification of online executions with sensing in bounded situation calculus. In ECAI 2014 – 21st European Conference on Artificial Intelligence, volume 263 of Frontiers in Artificial Intelligence and Applications, pages 369–374. IOS Press, 2014.

[Fritz and McIlraith, 2006] Christian Fritz and Sheila A. McIlraith. Decision-theoretic Golog with qualitative preferences. In Proceedings, Tenth International Conference on Principles of Knowledge Representation and Reasoning, pages 153–163. AAAI Press, 2006.

[Ghallab et al., 2004] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: Theory and Practice. Morgan Kaufmann/Elsevier, San Francisco, CA, USA, 2004.

[Lespérance et al., 2008] Yves Lespérance, Giuseppe De Giacomo, and Atalay Nafi Ozgovde. A model of contingent planning for agent programming languages. In 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 477–484. IFAAMAS, 2008.
[Liaskos et al., 2012] Sotirios Liaskos, Shakil M. Khan, Marin Litoiu, Marina Daoud Jungblut, Vyacheslav Rogozhkin, and John Mylopoulos. Behavioral adaptation of information systems through goal models. Information Systems, 37(8):767–783, 2012.

[McCarthy and Hayes, 1969] J. McCarthy and P. J. Hayes. Some philosophical problems from the standpoint of artificial intelligence. Machine Intelligence, 4:463–502, 1969.

[McIlraith and Son, 2002] Sheila A. McIlraith and Tran Cao Son. Adapting Golog for composition of semantic web services. In Proceedings of the Eighth International Conference on Principles of Knowledge Representation and Reasoning, pages 482–496. Morgan Kaufmann, 2002.

[Reiter, 2001] Ray Reiter. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical Systems. The MIT Press, 2001.

[Sardiña and De Giacomo, 2009] Sebastian Sardiña and Giuseppe De Giacomo. Composition of ConGolog programs. In IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 904–910, 2009.

[Sardiña et al., 2004] Sebastian Sardiña, Giuseppe De Giacomo, Yves Lespérance, and Hector J. Levesque. On the semantics of deliberation in IndiGolog – from theory to implementation. Annals of Mathematics and Artificial Intelligence, 41(2–4):259–299, 2004.

[Sohrabi et al., 2006] Shirin Sohrabi, Nataliya Prokoshyna, and Sheila A. McIlraith. Web service composition via generic procedures and customizing user preferences. In Proceedings of the 5th International Semantic Web Conference (ISWC-06), volume 4273, pages 597–611. Springer, 2006.

[Wonham and Ramadge, 1987] W. M. Wonham and P. J. Ramadge. On the supremal controllable sublanguage of a given language. SIAM Journal on Control and Optimization, 25(3):637–659, 1987.

[Wonham, 2014] W. M. Wonham. Supervisory Control of Discrete-Event Systems. University of Toronto, 2014 edition, 2014.

[Yadav et al., 2013] Nitin Yadav, Paolo Felli, Giuseppe De Giacomo, and Sebastian Sardiña. Supremal realizability of behaviors with uncontrollable exogenous events. In IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pages 1176–1182. IJCAI/AAAI, 2013.