# Learning Structured Decision Problems with Unawareness

Craig Innes, Alex Lascarides (University of Edinburgh, UK)

**Abstract.** Structured models of decision making often assume an agent is aware of all possible states and actions in advance. This assumption is sometimes untenable. In this paper, we learn influence diagrams from both domain exploration and expert assertions in a way which guarantees convergence to optimal behaviour, even when the agent starts unaware of actions or belief variables that are critical to success. Our experiments show that our agent learns optimal behaviour on small and large decision problems, and that allowing an agent to conserve information upon discovering new possibilities results in faster convergence.

## 1. Introduction

Probabilistic graphical models have proven useful in representing richly structured decision tasks (Koller, 2009). Many techniques exist to learn a model's structure and parameters from experience (Tsamardinos et al., 2006; Bartlett & Cussens, 2017) and expert advice (Masegosa & Moral, 2013). Unfortunately, all such methods assume the way the domain is conceptualized (the belief and action variables used to describe the problem) is completely known in advance. Often, this assumption does not hold. In medicine, for example, suppose an agent prescribes a drug, but later a senior pharmacologist objects to the prescription based on a reason unforeseen by the agent: the patient carries a newly discovered genetic trait, and the drug produces harmful side effects in its carriers. Further, this discovery may occur after the agent has already learned a lot about how other (foreseen) factors impact the drug's effectiveness. As Coenen et al. (2017) point out, such scenarios are common in human discussion: the answer to a question may not only provide information about which of the questioner's existing hypotheses are likely, but also reveal unforeseen hypotheses not yet considered. This example also shows that while it may be infeasible for an agent to gather all relevant factors of a problem before learning, it may be easy for an expert to offer contextually relevant advice during learning. Another example is in robotic skill learning. Methods like that of Cakmak & Thomaz (2012) teach a robot how to perform a new action it was unaware of, but not how to integrate it into an existing decision model.

Current learning and decision-making models don't address these issues; they assume the task is to use data to refine a distribution over a fixed hypothesis space. The above examples, however, illustrate a sort of "reverse Bayesianism" (Karni & Viero, 2013), where the hypothesis space expands over time. Rather than overcoming unawareness of states and actions, we could instead represent unawareness as an infinite number of hidden variables (Wood et al., 2006), or abandon learning a structured model and use densely connected hidden layers to implicitly learn structure from raw sensory input (Mnih et al., 2015). Both approaches have drawbacks. First, neither currently addresses how to adapt what one has learned when a new action is discovered.
More importantly, the hidden layers/variables are not tied to grounded concepts with explicit meaning, so an agent cannot easily justify its decisions to a user, or articulate the limits of its current understanding to solicit help from others. We instead propose an agent which explicitly overcomes its unawareness, and constructs an interpretable model of the problem.

This paper makes three contributions. First, an algorithm which incrementally learns all components of a decision problem as an Influence Diagram (ID). This includes the reward function and probabilistic dependencies between variables, but also the set of belief and action variables themselves (Sections 2, 3.2). Second, an agent-expert communication protocol (Section 3.1) which interleaves contextual advice with learning, and guarantees our agent converges to optimal behaviour despite starting unaware of factors critical to success. Third, experiments on decision tasks of varying sizes showing our agent successfully learns optimal behaviour in practice (Section 4). Such contributions should be most relevant in interactive task learning (Laird et al., 2017), where agents must learn the underlying concepts which define a task.

## 2. Learning with Full Awareness

We focus on learning observable, single-stage decision problems with discrete states and actions. In this paper, we learn optimal behaviour by learning an Influence Diagram (a Bayesian network extended with actions and rewards), which defines the optimal policy. We start with an agent who is aware of all states and actions, then address the task where the agent is unaware of relevant states and actions.

### 2.1. Making Optimal Decisions

Our agent's goal is to learn the optimal policy $\pi^+$ which, given evidence $e$, chooses the action $a$ which maximizes its expected utility $EU(a \mid e) = \sum_{s'} P(s' \mid a, e)\, R(s')$ across all states $s'$. If $P(s \mid a, e)$ or the reward function $R$ are unknown, the agent must learn them via trial and error. Our agent balances exploiting its current estimate of $\pi^+$ with exploring the domain by using an $\epsilon$-greedy policy (one which acts randomly with probability $\epsilon > 0$, and according to the current estimate of $\pi^+$ with probability $1 - \epsilon$). For any policy $\pi$, we can measure the expected loss in reward against the true optimal policy $\pi^+$ using the policy error (1):

$$\sum_{e} P(e)\,\big(EU(\pi^+(e) \mid e) - EU(\pi(e) \mid e)\big) \tag{1}$$

If there are many states, evaluating the expected utility directly is infeasible. The next sections show how to exploit a problem's inherent structure to make computations tractable.
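To make these quantities concrete, here is a minimal sketch (not from the paper) of the expected utility, the $\epsilon$-greedy action choice, and the policy error (1), assuming a small tabular problem where the transition model `P`, the reward function `R`, and the set of states can be enumerated directly; all function names are illustrative.

```python
import random

def expected_utility(a, e, P, R, states):
    """EU(a|e) = sum over s' of P(s'|a, e) * R(s'), for a tabular model."""
    return sum(P(s, a, e) * R(s) for s in states)

def epsilon_greedy(e, actions, P, R, states, eps=0.1):
    """Act randomly with probability eps, otherwise follow the current estimate of pi+."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: expected_utility(a, e, P, R, states))

def policy_error(policy, opt_policy, evidence_dist, P, R, states):
    """Equation (1): expected loss of `policy` relative to the optimal policy."""
    return sum(
        p_e * (expected_utility(opt_policy(e), e, P, R, states)
               - expected_utility(policy(e), e, P, R, states))
        for e, p_e in evidence_dist.items())
```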
### 2.2. Bayesian Networks

Bayesian networks (BNs) decompose the state space of the problem into a set of belief variables $\mathbf{X} = \{X_1, \ldots, X_n\}$. Each state $s$ is a joint assignment to all variables in $\mathbf{X}$ (i.e., $s \in v(\mathbf{X})$, where $v(\mathbf{X})$ is the cross-product of possible assignments to each variable in $\mathbf{X}$). A BN is a directed acyclic graph (DAG) $Pa$ and parameters $\theta$ which exploit the conditional independencies between variables in $\mathbf{X}$. For each $X \in \mathbf{X}$, $Pa_X \in Pa$ is the parent set of $X$. Given $Pa_X$, the value of $X$ is conditionally independent of all $Y \in \mathbf{X} \setminus Pa_X$ which are non-descendants of $X$. The parameters $\theta_X \in \theta$ then define the conditional probability distributions $\theta_X = P(X \mid Pa_X)$. Given $\mathbf{X}$, we can learn the most likely structure $Pa^*$ and parameters $\theta^*$ from observed data $D_{0:t} = [d_0, \ldots, d_t]$ (where $d_i \in v(\mathbf{X})$) by learning $P(Pa \mid D_{0:t})$ and $P(\theta \mid D_{0:t}, Pa)$.

A common way to calculate $P(Pa \mid D_{0:t})$ is to use the Bayesian Dirichlet score (BD-score) (Heckerman et al., 1995):

$$P(Pa \mid D_{0:t}) \propto \prod_{X \in \mathbf{X}} BD_t(X, Pa_X) \tag{2}$$

$$BD_t(X, Pa_X) = \prod_{j} \frac{\beta\big(N^t_{0|j} + \alpha_{0|j}, \ldots, N^t_{m|j} + \alpha_{m|j}\big)}{\beta\big(\alpha_{0|j}, \ldots, \alpha_{m|j}\big)} \tag{3}$$

Here, $\beta(n_1, \ldots, n_m)$ is the multivariate beta function, and $N^t_{X=i|Pa_X=j}$ is the number of states in $D_{0:t}$ where $X$ has assignment $i$ and its parents have joint assignment $j$.¹ The $\alpha_{i|j}$ parameters come from the Dirichlet priors over $\theta$ and act as pseudo-counts when data is sparse. We typically choose the prior $P(Pa)$ to penalize complex structures, as in equation (4), which assigns a cost of $\rho < 0.5$ for each extra parent:

$$P(Pa_X) = \rho^{|Pa_X|}\,(1 - \rho)^{|\mathbf{X} \setminus Pa_X|} \tag{4}$$

Unfortunately, evaluating the full posterior of (2), or even finding its maximum, is infeasible for even a moderate number of variables, as the number of possible BNs is hyper-exponential in the size of $\mathbf{X}$ (Tian & He, 2009). To tackle this, most methods approximate $Pa^*$ via local search (Teyssier & Koller, 2005), sampling methods (Madigan et al., 1995), or by reducing the search space of structures considered reasonable (Buntine, 1991). For our task, we have another issue: most BN learning methods operate once on a single batch of data, and thus consider the collection of sufficient statistics (i.e., the BD-scores of each $Pa_X$ and their associated $N_{i|j}$) to be a negligible pre-computed cost. In a decision problem, agents gather data incrementally, and exploit their current beliefs during learning, so we must update the sufficient statistics at each time step.

To address these issues, we follow Buntine (1991) by constructing a reasonable parent lattice $P_X$ for each $X$. Figure 1 shows an example lattice for $X_1$. Starting from the empty set, we construct larger parent sets by combining sets seen so far, and track the highest-scoring set according to (3). Any parent set with a score lower than some proportion $\kappa$ of the best score is considered "unreasonable", and is not expanded further.² Once we have our reduced set $P_X$ for each $X$, we can find the combination of reasonable parent sets which maximizes (2) and forms a valid DAG. In our task we will be learning $\mathbf{X}$, so we cannot assume we know a total ordering over $\mathbf{X}$ (as Buntine (1991) does), and thus cannot choose each $Pa_X$ independently. Instead, following Bartlett & Cussens (2017), we treat finding $Pa^*$ as the linear program given below:

¹ Where the context is clear, we compress this notation to $N^t_{i|j}$.
² In contrast to Buntine (1991), we also consider all subsets of a reasonable parent set to be reasonable.

*Figure 1. Parent lattice for $X_1$. Grey nodes are unreasonable.*

$$\begin{aligned}
\max \quad & \sum_{X} \sum_{Pa_X \in P_X} I(Pa_X \to X)\, \ln BD(X, Pa_X) \\
\text{s.t.} \quad & \sum_{Pa_X \in P_X} I(Pa_X \to X) = 1 \quad \forall X \\
& \sum_{X \in C} \;\; \sum_{Pa_X \in P_X :\, Pa_X \cap C = \emptyset} I(Pa_X \to X) \geq 1 \quad \forall C \subseteq \mathbf{X} \\
& I(Pa_X \to X) \in \{0, 1\} \quad \forall X, Pa_X
\end{aligned} \tag{5}$$

The variables $I(Pa_X \to X)$ denote whether we chose $Pa_X$ as $X$'s parent set, while the constraints ensure each $X$ has only one parent set and that the parent sets form a DAG. Once we have learned the most likely structure $Pa^*$, we can learn its most likely parameters $\theta^*$ via (6):

$$\mathbb{E}_{\theta \mid D_{0:t}, Pa^*}\big[\theta_{i|j}\big] = \frac{N_{i|j} + \alpha_{i|j}}{N_{\cdot|j} + \alpha_{\cdot|j}} \tag{6}$$
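As an illustration of equations (3), (4) and (6), the following sketch scores a single family $(X, Pa_X)$ from counts and Dirichlet pseudo-counts; the dictionary-based data layout is an assumption made for readability, not the paper's implementation.

```python
from math import lgamma

def log_beta(alphas):
    """Log of the multivariate beta function beta(a_1, ..., a_m)."""
    return sum(lgamma(a) for a in alphas) - lgamma(sum(alphas))

def log_bd_family(counts, alphas):
    """log BD_t(X, Pa_X) as in (3). `alphas[j][i]` and `counts[j][i]` give the
    pseudo-count and observed count for child value i under parent config j;
    the score multiplies the beta-function ratio over all parent configs j."""
    score = 0.0
    for j, alpha_j in alphas.items():
        n_j = counts.get(j, {})
        score += (log_beta([n_j.get(i, 0) + alpha_j[i] for i in alpha_j])
                  - log_beta(list(alpha_j.values())))
    return score

def structure_prior(parent_set, all_vars, rho=0.1):
    """Structure prior (4): each extra parent costs a factor rho < 0.5."""
    k = len(parent_set)
    return rho ** k * (1 - rho) ** (len(all_vars) - k)  # |X \ Pa_X| non-parents

def posterior_mean_params(counts_j, alphas_j):
    """Expected parameters theta_{i|j} for one parent configuration j, eq. (6)."""
    total = sum(counts_j.get(i, 0) + alphas_j[i] for i in alphas_j)
    return {i: (counts_j.get(i, 0) + alphas_j[i]) / total for i in alphas_j}
```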
Our method has three main advantages. First, reducing the number of reasonable parent sets and using the modular BD-score means we can incrementally update the probability of each $Pa_X$ using past calculations of (3) via (7):³

$$BD_t(X, Pa_X) = BD_{t-1}(X, Pa_X) \cdot \frac{N^t_{i|j} + \alpha_{i|j} - 1}{N^t_{\cdot|j} + \alpha_{\cdot|j} - 1} \tag{7}$$

Second, expressing structure learning as a linear program makes it easy to add extra structural constraints later (as we do in Section 3.1). Third, each lattice $P_X$ implicitly captures an approximate distribution over each $X$'s parents.⁴ In Section 3.2, the agent uses this distribution to conserve information as its awareness of $\mathbf{X}$ expands.

³ We include a proof of (7) in the technical supplement.
⁴ Technically, without the acyclicity constraint, each $P_X$ is a distribution over Markov blankets for $X$, but we enforce acyclicity, so this is not an issue.

### 2.3. Learning to Act with Influence Diagrams

An ID (Howard & Matheson, 2005) is a BN augmented with actions and rewards. It is a tuple $\langle \mathbf{A}, \mathbf{X}, R, Pa, \theta \rangle$. Here, $\mathbf{A} = \{A_1, \ldots, A_m\}$ are the action variables the agent controls (where a full action $a$ is a member of $v(\mathbf{A})$). As before, $\mathbf{X}$ are the belief variables, but they are now partitioned into $\mathbf{X} = \mathbf{B} \cup \mathbf{O}$, where $\mathbf{B}$ is the set of variables the agent observes before taking an action, and $\mathbf{O}$ are the variables which describe the outcome of that action. The reward function $R : v(scope(R)) \to \mathbb{R}$ gives the reward received for each state, where $scope(R) \subseteq \mathbf{X}$ are the variables which determine the reward. While $Pa$ and $\theta$ still describe the probabilistic dependencies, each $Pa_X$ can now include belief and action variables (i.e., for each $X \in \mathbf{X}$, $Pa_X \subseteq \mathbf{X} \cup \mathbf{A}$). A domain trial at time $t$ is a tuple $d_t = \langle s_t, a_t, r_t \rangle$, where $s_t = \langle b_t, o_t \rangle$, $b_t \in v(\mathbf{B})$, $o_t \in v(\mathbf{O})$, $a_t \in v(\mathbf{A})$ and $r_t = R(s_t)$.

*Figure 2. ID for the medical example. Rectangles are action variables, ovals are belief variables, and diamonds are reward nodes.*

Figure 2 shows an example ID. Here, the agent chooses assignments for action variables TOPICAL CREAM and ORAL DRUG based on observing GENE X, then receives a reward based on outcomes DISEASE and SIDE EFFECTS. Note that IDs often also include information arcs: edges from belief variables to action variables that show which variables an agent observes before making each decision. For simplicity of explanation, we omit information arcs, and assume in this paper that all action variables are set simultaneously, with access to observations of all variables in $\mathbf{B}$.

We can use the same techniques from Section 2.2 to learn $Pa^*$ and $\theta^*$ for IDs. Similarly, we can learn the reward function $R_t$ from $scope(R)$ and $D_{0:t}$ by setting $R_t(s_t[scope(R)]) = r_t$,⁵ and by defaulting to indifference for unseen states (e.g., by setting the reward of unseen states to 0). Given a learned ID, the agent then chooses a policy which maximizes the expected utility from Section 2.1 (where $a \in v(\mathbf{A})$ and $e \in v(\mathbf{B})$).

⁵ Here, $s[\mathbf{Y}]$ projects assignment $s$ over the variables in $\mathbf{Y}$.

## 3. Overcoming Unawareness

So far, we've assumed our agent was aware of all relevant variables in $\mathbf{X} \cup \mathbf{A}$ and all members of $scope(R)$. We now drop this assumption. From here onward we denote the true set of belief variables, actions, and reward scope as $\mathbf{X}^+$, $\mathbf{A}^+$ and $scope^+(R)$, and the learner's awareness of them at time $t$ as $\mathbf{X}^t$, $\mathbf{A}^t$, and $scope^t(R)$. Suppose $\mathbf{X}^+ = \{X_1, X_2\}$, $\mathbf{X}^0 = \{X_1\}$, $\mathbf{A}^+ = \{A_1, A_2\}$, $\mathbf{A}^t = \{A_1\}$. We assume the agent can't observe the value of variables it is unaware of. In the medical example from Section 1, if $X_2$ is a particular gene, we assume the agent cannot detect the presence of that gene if it is unaware it exists.
We also assume the agent cannot perform actions it is unaware of (i.e., if the agent is unaware of variable $A_2$, then it cannot set it to a value other than its default value).⁶ This means at time $t = 0$, the agent does not observe the true trial $d_0$, but rather $\langle s_0[\mathbf{X}^0], a_0[\mathbf{A}^0], r_0 \rangle$. But awareness of those missing factors may be crucial to learning $\pi^+$. For example, the best action may depend upon $v(X_2)$, or the optimal policy may sometimes involve setting $A_2$ to true. The next sections thus answer two main questions. First, how can an agent discover and overcome its own unawareness by asking for help? Second, when an agent discovers a new variable, how can it integrate it into its current model while conserving what it has learned from past experience?

⁶ This assumption, while reasonable, may not always hold (e.g., an agent may lean on a button while unaware it is part of the task).

### 3.1. Expert Guidance

Our agent can expand its awareness via advice from an expert. Teacher-apprentice learning is common in the real world, as it allows learners to receive contextually relevant advice which may inform them of unforeseen concepts. This paper assumes the expert is cooperative and infallible. Further, we abstract away the complexity of grounding natural language statements in a formal semantics, and instead assume that the agent and expert communicate via a pre-specified formal language (though see, e.g., Zettlemoyer & Collins (2007) for work on this problem). We do not, however, assume the expert knows the agent's beliefs about the ID. The variables $\mathbf{X}_e$ and $\mathbf{A}_e$ express the expert's knowledge of our agent's awareness, and include only those actions the expert has explicitly seen the agent perform, or variables the agent mentions in conversation.

As argued in the introduction, the goal is to provide a minimal set of communicative acts so that interaction between the agent and expert mirrors human teacher-apprentice interaction. Concretely, this means we want our system to have two properties. First, the expert should mostly allow the agent to learn by itself, interjecting only when the agent performs sufficiently poorly, or when it explicitly asks a question. Second, our expert should be able to give non-exhaustive answers to queries. This is because, in practice, it is unlikely a human expert will be able to give exhaustive answers to all questions, due to either cognitive bounds or a lack of information. Following the maxims of Grice (1975), a cooperative speaker should give answers which provide just enough information to resolve the agent's current dilemma. The next sections identify four advice types, whose combination guarantees the agent eventually behaves optimally, regardless of initial awareness.

#### 3.1.1. Better Action Advice

If the expert sees the agent perform a sub-optimal action, it can tell the agent a better action to take instead. For example: "When it is raining, take your umbrella instead of your sun hat." Our goal is to avoid interrupting the agent every time it makes a mistake, so we specify the following conditions for when the agent performs poorly enough to warrant correction. Let $t$ be the current time step, and $t'$ the time the expert last uttered advice.
When (8-12) hold:

$$t - t' > \mu \tag{8}$$

$$\frac{\sum_{i=t'+1}^{t} \big( EU(\pi^+(b_i) \mid b_i) - EU(a_i \mid b_i) \big)}{t - t'} > \beta \tag{9}$$

$$EU(a' \mid b_t) > r_t \tag{10}$$

$$\exists \mathbf{A}',\ a' \in v(\mathbf{A}_e \cup \mathbf{A}'),\ EU(a' \mid b_t) > EU(a_t \mid b_t) \tag{11}$$

$$\neg \exists \mathbf{A}'' \big( \exists a'' \in v(\mathbf{A}_e \cup \mathbf{A}''),\ EU(a'' \mid b_t) > EU(a_t \mid b_t) \wedge |\mathbf{A}''| < |\mathbf{A}'| \big) \tag{12}$$

then the expert will utter advice of the form (13):

$$EU(a' \mid w_{b_t}) > EU(a_t[\mathbf{A}^t \cup \mathbf{A}'] \mid w_{b_t}) \tag{13}$$

Equation (8) ensures some minimum time $\mu$ has passed since the expert last gave advice, while (9) ensures the expert won't interrupt unless its estimate of the agent's policy error is above some threshold $\beta$. Together, $\mu$ and $\beta$ define how tolerant the expert will be of the agent's poor performance before deciding to give advice. Equation (10) ensures that the expert's suggested action has an expected reward higher than what the agent received at time $t$. Finally, (11-12) ensure not only that a better action $a'$ exists, but also that $a'$ introduces the minimum number of potentially unforeseen action variables $\mathbf{A}'$ needed to improve the agent's behaviour. This means there isn't another action $a''$ which could be advised while revealing fewer unaware variables to the agent. This follows from our desire to give non-exhaustive advice, by first suggesting improvements which use concepts the agent is already aware of, rather than introducing many new action variables all at once.

In (13), the expert's utterance uses an ambiguous term $w_{b_t}$, whose intended meaning is the true state $b_t$ (i.e., $[\![ w_{b_t} ]\!] \in v(\mathbf{B}^+)$), but whose default interpretation by the agent is $b_t[\mathbf{B}^t]$. In words, the expert says "In the last step, $a'$ would have been better than $a_t$." Because of this ambiguity, the agent can interpret (13) in two ways. The first is as a partial description of the true ID, which is true regardless of what it learns in future. On hearing (13), the agent adds (14-15) to its knowledge:

$$\forall A \in \mathbf{A}',\ A \in \mathbf{A}^+ \wedge \exists X \in scope^+(R),\ anc(A, X) \tag{14}$$

$$\exists s \in v(\mathbf{X}^+),\ s[\mathbf{B}^t] = s_t[\mathbf{B}^t] \wedge R^+(s) > r_t \tag{15}$$

where $anc(X, Y)$ means $X$ is an ancestor of $Y$ (i.e., there is a directed path from $X$ to $Y$ in the true ID). Equations (14-15) imply that any variable the expert utters must be relevant to the problem. In other words, it must exert influence on at least one variable in $scope^+(R)$, and there must exist some state the agent could have reached which would yield a better reward than the one it got. In fact, we can enforce all equations of the form (14) when learning $Pa^*$ by adding the constraint (16) to our linear program:

$$\forall Y \notin scope(R) : \;\; \sum_{X \in \mathbf{X}} \;\; \sum_{Pa_X \in P_X :\, Y \in Pa_X} I(Pa_X \to X) \geq 1 \tag{16}$$

The second way the agent could use (13) is by adding its default interpretation of the advice to its current knowledge:

$$\forall b \in v(\mathbf{B}^+) : \; b[\mathbf{B}^t] = b_t[\mathbf{B}^t] \rightarrow EU(a' \mid b) > EU(a_t \mid b) \tag{17}$$

The agent can then enforce (17) by choosing $a'$ whenever $b[\mathbf{B}^t] = b_t[\mathbf{B}^t]$, regardless of what seems likely from $D_{0:t}$. We should now see that even with a cooperative, infallible expert, and even abstracting away issues of grounding natural language, misunderstandings can still happen due to differences in agent and expert awareness. As the next section shows, such misunderstandings can reveal gaps in the agent's awareness and help to articulate queries whose answers guarantee the agent expands its awareness. Lemma 1 guarantees the expert's advice strategy continues to reveal unforeseen actions to the agent so long as its performance in trials exceeds the expert's tolerance.⁷

**Lemma 1.** Consider an agent with awareness $\mathbf{A}^t \subseteq \mathbf{A}^+$, and an expert following (8-13). As $k \to \infty$, either $PE(\pi_{t+k}) \leq \beta$ or the expert utters (13) where $\mathbf{A}' \neq \emptyset$.

⁷ We include proofs of all lemmas in the technical supplement.
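The interruption test and the choice of minimal advice can be pictured with the sketch below. It assumes the expert can query expected utilities under the true ID via a hypothetical `eu_true(a, b)`, and it reads condition (10) as requiring the advised action to beat the reward the agent actually received; none of these helper names come from the paper.

```python
from itertools import combinations

def expert_should_interrupt(t, t_last, history, eu_true, opt_action, mu=10, beta=0.01):
    """Conditions (8)-(9): enough time has passed since the last advice, and the
    expert's estimate of the agent's recent policy error exceeds its tolerance."""
    if t - t_last <= mu:                                            # (8)
        return False
    window = history[t_last + 1:t + 1]                              # trials since the last advice
    est_error = sum(eu_true(opt_action(b), b) - eu_true(a, b)
                    for (b, a, r) in window) / (t - t_last)
    return est_error > beta                                         # (9)

def best_minimal_advice(b_t, a_t, r_t, aware_actions, all_action_vars, eu_true, joint_actions):
    """Conditions (10)-(12): look for a better action a' that introduces as few
    unforeseen action variables A' as possible.  `joint_actions(vars)` is a
    hypothetical helper enumerating joint assignments to a set of action variables."""
    unaware = [A for A in all_action_vars if A not in aware_actions]
    for k in range(len(unaware) + 1):               # try candidate A' of increasing size: (12)
        for extra in combinations(unaware, k):
            candidates = joint_actions(set(aware_actions) | set(extra))
            better = [a for a in candidates
                      if eu_true(a, b_t) > r_t                       # (10), as read here
                      and eu_true(a, b_t) > eu_true(a_t, b_t)]       # (11)
            if better:
                return max(better, key=lambda a: eu_true(a, b_t)), set(extra)
    return None
```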
#### 3.1.2. Resolving a Misunderstanding

We noted before that the agent's default interpretation of (13) could lead it to misunderstand the expert's intended meaning. To illustrate, suppose the agent receives advice (18) and (19) at times $t - k$ and $t$:

$$EU(a \mid w_{b_{t-k}}) > EU(a' \mid w_{b_{t-k}}) \tag{18}$$

$$EU(a \mid w_{b_t}) < EU(a' \mid w_{b_t}) \tag{19}$$

While the intended meaning of each statement is true, the agent's default interpretations of $w_{b_{t-k}}$ and $w_{b_t}$ may be identical. That is, $b_{t-k}[\mathbf{B}^t] = b_t[\mathbf{B}^t]$. From the agent's perspective, (18) and (19) conflict, and thus give the agent a clue that its current awareness of $\mathbf{X}^+$ is deficient. To resolve this conflict, the agent asks (20) (in words, "which $B$ has distinct values in $b_{t-k}$ and $b_t$?") and receives an answer which provides monotonic information of the form (21):

$$?\lambda B\,\big(B \in \mathbf{B}^+ \wedge s_{t-k}[B] \neq s_t[B]\big) \tag{20}$$

$$B' \in \mathbf{B}^+ \wedge \exists X \in scope^+(R),\ anc(B', X) \tag{21}$$

Notice there may be many variables in $\mathbf{B}^+ \setminus \mathbf{B}^t$ whose assignments differ in $b_{t-k}$ and $b_t$. Thus, the expert's answer can be non-exhaustive. This means the agent must abandon its previous defeasible interpretations of the form (17), but can keep (14-15), as these are true regardless of what the true ID is. Lemma 2 guarantees the expert will reveal new belief variables provided misunderstandings can still arise.

**Lemma 2.** Consider an agent with awareness $\mathbf{X}^t \subseteq \mathbf{X}^+$, $\mathbf{A}^t = \mathbf{A}^+$. If $\exists b, b'$ such that $b \neq b'$, $b[\mathbf{B}^t] = b'[\mathbf{B}^t]$ and $\pi^+(b) \neq \pi^+(b')$, then as $k \to \infty$, either $PE(\pi_{t+k}) \leq \beta$, or the expert utters (21) such that $B' \notin \mathbf{B}^t$.

#### 3.1.3. Unforeseen Rewards

In typical IDs (where we assume the agent starts fully aware of $\mathbf{X}^+$, $\mathbf{A}^+$, and $scope^+(R)$), we tend only to think of the trials as providing counts. For an unaware agent, a trial $d_t = \langle s_t, a_t, r_t \rangle$ also encodes monotonic information:

$$\exists s,\ s[\mathbf{X}^t] = s_t[\mathbf{X}^t] \wedge R^+(s) = r_t \tag{22}$$

This, along with (15), constrains the form of $R$ the agent can learn. Recall that $scope^t(R)$ may be only a subset of $scope^+(R)$, so it might be impossible to construct an $R : v(scope^t(R)) \to \mathbb{R}$ satisfying all descriptions of the form (22) gathered so far. Further, those extra variables in $scope^+(R) \setminus scope^t(R)$ may not be in $\mathbf{X}^t$. To resolve this, if the agent fails to construct a valid $R$, it asks (23) (in words, "which variable $X$ (that I don't already know) is in $scope(R)$?"), receiving an answer of the form (24):

$$?\lambda X\,\big(X \in scope^+(R) \wedge \forall X' \in scope^t(R),\ X \neq X'\big) \tag{23}$$

$$X' \in scope^+(R) \wedge X' \in \mathbf{X}^+ \tag{24}$$

Again, (24) may be non-exhaustive. Even so, we can guarantee that the agent's reward function eventually equals $R^+$.

**Lemma 3.** Consider an $\epsilon$-greedy agent with awareness $\mathbf{A}^t = \mathbf{A}^+$, $scope^t(R) \subseteq scope^+(R)$. As $k \to \infty$, there exists a $K$ such that $\forall s$, $\forall k \geq K$, $R_{t+k}(s) = R^+(s)$.

#### 3.1.4. Unknown Effect

Recall that, to keep the problem tractable, our agent searches for IDs in the space of reasonable parent sets $P$. Unfortunately, there might be no valid ID within $P$ which also satisfies the constraints of the form (16). The most obvious case of this is when a variable $X \notin scope^t(R)$ has no children in any reasonable DAG (i.e., $\forall Y$, $\forall Pa_Y \in P_Y$, $X \notin Pa_Y$). Here, the agent can ask why $X$ is relevant by asking (25) ("what $V$ does $X$ affect directly?"). The answer (e.g., $X \in Pa_{V'}$) imparts information (26) to the agent:

$$?\lambda V\,(X \in Pa_V) \tag{25}$$

$$V' \in \mathbf{X}^+ \wedge \exists Y \in scope^+(R),\ anc(V', Y) \tag{26}$$
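A sketch of the two checks that trigger queries (23) and (25) is given below, under the simplifying assumption that trials and parent lattices are stored as plain Python dictionaries; the helper names are ours, not the paper's.

```python
def reward_conflict(trials, scope_t):
    """Section 3.1.3: each trial asserts R+(s) = r_t for some full state s agreeing
    with the observed projection (22).  If two trials agree on the variables in
    scope_t(R) but report different rewards, no reward function over scope_t(R)
    can satisfy them all, so the agent should ask question (23)."""
    seen = {}
    for (s, a, r) in trials:                      # s is a dict of observed variable assignments
        key = tuple(sorted((v, s[v]) for v in scope_t if v in s))
        if key in seen and seen[key] != r:
            return True                           # trigger query (23)
        seen[key] = r
    return False

def childless_relevant_vars(relevant_vars, scope_t, reasonable_parents):
    """Section 3.1.4: a variable known to be relevant but outside scope_t(R) must have
    at least one child (constraint (16)).  If it appears in no reasonable parent set
    of any other variable, the agent should ask question (25) about it."""
    return [X for X in relevant_vars
            if X not in scope_t and not any(
                X in pa for parent_sets in reasonable_parents.values() for pa in parent_sets)]
```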
### 3.2. Adapting to Unforeseen Factors

Section 3.1 showed four ways to expand the agent's awareness of $\mathbf{X}$, $\mathbf{A}$, and $scope(R)$. To improve on simply restarting learning, we must now say how the agent adapts its beliefs about $Pa$ and $\theta$ when its awareness expands.

#### 3.2.1. Adding a New Belief Variable

When the agent discovers a belief variable $Z$, its old distributions over parent sets no longer cover all possible parents. That is, for $X \in \mathbf{X}^{t-1}$, $P_X$ did not cover cases where $Z$ was a parent of $X$. Worse, the agent cannot observe $Z$'s values in $D_{0:t}$, so cannot observe $N^{t-k}_{Z=i|j}$, or $N^{t-k}_{X=i|j}$ when $Z \in Pa_X$. The $\alpha$-parameters involving $Z$ are also undefined, yet we need them to calculate $Pa^*$ and $\theta^*$ via (6-7). Discovering a new variable makes the size of each (observed) state dynamic, in contrast to standard problems where it is static (e.g., $\langle X_1 = 0, X_2 = 1 \rangle$ becomes $\langle X_1 = 0, X_2 = 1, Z = ? \rangle$).

We could phrase this as a missing data problem: $Z$ was hidden in the past but visible in future states, so estimate the missing values and structure via, e.g., expectation maximization (Friedman, 1998). But such methods commit us to costly passes over the full batch of data, and make finding $Pa^*$ for problems with more than a handful of variables intractable by eliminating the modularity of the BD-score. Alternatively, we could ignore states with missing information when we require counts involving $Z$. For example, we could use $P(Pa_X \mid D_{t:n})$ to score $Pa_X$ when $Z \in Pa_X$, but use $P(Pa_X \mid D_{0:n})$ when $Z \notin Pa_X$. But as Friedman & Goldszmidt (1997) point out, most structure scores, including (2), assume we evaluate models with the same data. If we compare models using different data sets (even two sets from the same distribution), the learner favours models evaluated with the smaller data set.

Instead, our method discards the data $D_{0:t-1}$ gathered during the learner's old view of the hypothesis space, but conserves the posterior probabilities learned from it to create new priors for $Pa$ and $\theta$ in the expanded space. On discovering the new variable $Z$, the agent updates each distribution over the parent sets $Pa_X$ ($X \neq Z$) to cover parent sets which include $Z$. Equation (27) creates a new prior $P'(Pa_X)$ from the old posterior. Here, $C$ is the normalizing constant and $0 < \gamma \leq 1$ expresses how much we trust our current $P$:

$$P'(Pa_X) = \gamma\, P'(Pa_X \mid P_X) + (1 - \gamma)\, P(Pa_X) \tag{27}$$

$$P'(Pa_X \mid P_X) = \begin{cases} 0 & \text{if } Pa_X \notin P_X \\ \frac{(1 - \rho)}{C}\, BD_{t-1}(X, Pa_X) & \text{if } Z \notin Pa_X \\ \frac{\rho}{C}\, BD_{t-1}(X, Pa'_X) & \text{if } Pa_X = Pa'_X \cup \{Z\} \end{cases}$$

This update preserves the relative likelihood among the parent sets without $Z$. It also preserves our bias towards simple structures by giving only a small portion $\rho$ of the probability mass of parent sets without $Z$ to those with $Z$. This still leaves us to define a distribution over $Pa_Z$ itself. The agent has no evidence on $Z$'s parents at the moment of $Z$'s discovery, so it defaults to using the initial prior (4).

To adapt $\theta^*$, we return to the issue of the counts $N^t_{i|j}$ and $\alpha$-parameters. As before, we wish to avoid the complexity of estimating $Z$'s past values. Instead, we throw away the past states $D_{0:t-1}$ and their counts $N^{t-1}_{i|j}$, but keep the relative likelihoods they gave rise to by packing these into new $\alpha$-parameters, as shown in (29) ($K$ is a constant):

$$N^t_{i|j} = 0 \ \text{ for all } i, j, X, Pa_X \qquad \alpha_{i|j} = \begin{cases} \frac{K}{|Z|}\, P(j \mid \theta^{t-1}) & \text{if } X = Z \\ \frac{K}{|Z|}\, P(i, j[Pa_X \setminus Z] \mid \theta^{t-1}) & \text{if } Z \in Pa_X \\ K\, P(i, j \mid \theta^{t-1}) & \text{otherwise} \end{cases} \tag{29}$$

Equation (29) summarizes the counts from $D_{0:t-1}$ based on inferences from the old ID, then encodes these inferences in the $\alpha$-parameters of the new parameter prior. The new $\alpha$-parameters ensure that the likelihoods inferred from past states bias the estimate of $Pa^*$ and $\theta^*$ as more trials arrive. The larger $K$ is, the more what the agent learned before discovering $Z$ influences its reasoning after discovering $Z$.
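A compact sketch of the prior-conservation step (27) for one variable $X$ after discovering $Z$: `old_lattice` and `bd_scores` stand in for $P_X$ and the cached $BD_{t-1}$ values, the $\rho$-split mirrors the piecewise definition above, and the final mixing with the base structure prior follows the $\gamma$-weighting in (27). The data layout is an assumption.

```python
def conserve_parent_prior(old_lattice, bd_scores, Z, rho=0.1, gamma=0.99, base_prior=None):
    """Sketch of (27): build a new prior over parent sets of X that (i) keeps the
    relative likelihoods BD_{t-1} of the old parent sets, and (ii) gives each old
    set Pa_X a copy Pa_X | {Z} carrying only a rho-fraction of its mass."""
    conserved = {}
    for pa in old_lattice:
        score = bd_scores[frozenset(pa)]
        conserved[frozenset(pa)] = (1 - rho) * score              # Z not in Pa_X
        conserved[frozenset(set(pa) | {Z})] = rho * score         # Pa_X = Pa'_X with Z added
    C = sum(conserved.values())
    conserved = {pa: v / C for pa, v in conserved.items()}        # normalise (the constant C)
    if base_prior is None:
        return conserved
    # Mix with the original structure prior (4), weighted by how much we trust P (gamma).
    keys = set(conserved) | set(base_prior)
    return {pa: gamma * conserved.get(pa, 0.0) + (1 - gamma) * base_prior.get(pa, 0.0)
            for pa in keys}
```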
#### 3.2.2. Adding a New Action Variable

When the agent discovers a new action variable $A$, the same issue arises: the agent did not consider $A$ as a possible parent of any $X \in \mathbf{X}^{t-1}$, so it must revise $P$ with $A$ as a possibility. Unlike for belief variables, however, we don't need to discard $D_{0:t-1}$. Since we assume that the agent cannot influence action variables it is unaware of, we can simply fill in the default value of $A$ in all past trials $D_{0:t-1}$.

#### 3.2.3. Expanding the Reward Function Scope

Learning that a variable $X$ is in the scope of $R$ may cause us to revise $Pa^*$, even if $X \in \mathbf{X}^{t-1}$. This is because expanding $scope^t(R)$ loosens the constraints of the form (16), which may allow us to construct a higher-scoring ID structure that was previously invalid.

Algorithm 1 outlines the entire learning process. In most steps, the agent incrementally revises its model based on the latest trial and the current reasonable parents. If enough time $\tau$ passes, or the agent's awareness expands, the agent makes larger changes to its model, including revising $P$.

```
Algorithm 1: Learning IDs with Unawareness
Input: A^0, X^0, Pa^0, θ^0, P^0
for t = 1 ... maxTrials do
    b_t ← generate new state
    o_t, a_t, r_t ← ε-GREEDY(b_t[X^t])
    N^t_{i|j} ← update with b_t[X^t], o_t[X^t]
    BD_t(X, Pa_X) ← update via (7) for Pa_X ∈ P_X
    if t ≡ 0 (mod τ) then
        revise P^t
    end if
    {Unknown Effect (Section 3.1.4)}
    ask (25); update X^t, P^t if any X has no children
    {Unforeseen Rewards (Section 3.1.3)}
    R^t ← update with constraints (22), (15)
    ask (23); update scope^t(R), X^t, P^t if the update fails
    {Better Action Advice (Section 3.1.1)}
    if (8-12) are true then
        actAdvice ← expert advises (13)
        update A^t, P^t according to actAdvice
        {Resolving a Misunderstanding (Section 3.1.2)}
        ask (20) and update X^t, P^t when actAdvice conflicts with past utterances
    end if
    if X^{t-1} ≠ X^t then
        update N^t_{i|j}, α_{i|j}, P, P(Pa) via (4), (27), (29)
    end if
    Pa^t, θ^t ← solve (5) (with (16)) and (6)
end for
```

Given Algorithm 1, Theorem 1 guarantees our agent converges to a near-optimal policy (where "near-optimal" is bounded by the expert's tolerance $\beta$), regardless of initial awareness (provided $P$ includes the true structure).

**Theorem 1.** Consider an agent with initial awareness $\mathbf{X}^0 \subseteq \mathbf{X}^+$, $\mathbf{A}^0 \subseteq \mathbf{A}^+$, $scope^0(R) \subseteq scope^+(R)$ following Algorithm 1 without any pruning of the parent lattice (i.e., $\kappa = 0$). As $t \to \infty$, $PE(\pi_t) \leq \beta$.

## 4. Experiments

Our experiments show that agents following Algorithm 1 converge to near-optimal behaviour not only in theory but also in practice. We also show that conserving information improves results. We do not investigate assigning explicit costs to agent-expert communication, but do show how varying the expert's tolerance affects the agent's performance. We tested agents on three randomly generated IDs of increasing size: 12, 24, and 36 variables. Our results were similar across all sizes, but the differences between agents were most pronounced on the largest case, so we present those here (full ID specifications and results for the small and medium cases are included in the technical supplement). In each, our agent begins with minimal awareness of the true ID ($\mathbf{X}^0 = \{O_1\}$, $\mathbf{A}^0 = \{A_1\}$, $scope^0(R) = \{O_1\}$). The agent acts over 5000 trials, using an $\epsilon$-greedy policy ($\epsilon = 0.1$). We repeat experiments 100 times and average the results. We use the cumulative reward across trials as our evaluation metric, which acts as a proxy for the quality of the agent's policy over time.
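The evaluation loop we have in mind looks roughly like the sketch below: repeated independent runs, a fixed number of trials per run, and per-trial rewards averaged across runs (cumulative or discounted sums can then be computed from the averaged curve). `make_agent`, `make_environment`, and the agent/environment methods are placeholders, not an API from the paper.

```python
import random

def run_experiment(make_agent, make_environment, n_runs=100, n_trials=5000, seed=0):
    """Run each agent for n_trials trials, repeat n_runs times, and return the
    average reward obtained at each trial index across runs."""
    random.seed(seed)  # fix randomness for reproducibility; the agents themselves are stochastic
    rewards = [[0.0] * n_trials for _ in range(n_runs)]
    for run in range(n_runs):
        env, agent = make_environment(), make_agent()
        for t in range(n_trials):
            b = env.new_state()
            a = agent.act(b)            # epsilon-greedy w.r.t. the currently learned ID
            o, r = env.step(b, a)
            agent.update(b, a, o, r)    # one pass of the learning loop (cf. Algorithm 1)
            rewards[run][t] = r
    return [sum(col) / n_runs for col in zip(*rewards)]
```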
To make the results more readable, we apply a discount of 0.99 at each step, resulting in the metric $R^{disc}_t = r_t + 0.99 \cdot R^{disc}_{t-1}$ for all $t$. We test several variants of our agent to show our approach is effective. The default agent follows Algorithm 1 as is, with parameters $\kappa = 0.001$, $\tau = 100$, $\rho = 0.1$, $\gamma = 0.99$, $K = 5.0$, $\mu = 10$, $\beta = 0.01$ in equations (4), (8), (9), (27), and (29). The non-conservative agent does not conserve information about $P$, $Pa$, nor $\theta$ via (27) and (29) when it discovers a new factor. Instead, it discards all trials and reverts to the original prior of (4). We include this agent to show the value of conserving information as $\mathbf{X}$ and $\mathbf{A}$ expand. The non-relevant agent is like the default, but does not include any constraints of the form (16) when searching for $Pa^*$. This means the agent might learn IDs where variables are completely disconnected. The true-policy and random agents start knowing the true ID, and execute an $\epsilon$-greedy version of $\pi^+$, or choose a random action, respectively. These agents give an upper/lower bound on performance. The low-tolerance and high-tolerance agents change the expert's tolerance to $\beta = 0.001$ and $\beta = 0.1$.

### 4.1. Results

Across all IDs, our default agent converges to the optimal policy, despite starting unaware of factors critical to success. Figure 3a shows the cumulative reward of the default agent compared to the non-conservative, non-relevant, true-policy and random agents.

*Figure 3. Rewards (a and b) and size of $|\mathbf{A}|$ (c) over time on the 36-variable task. Results for other tasks are included in the supplement.*

The default agent converges to the optimal policy, and does so faster than its non-conservative counterpart. This shows that conserving one's beliefs on discovering new factors has value. The difference in reward compared to the non-relevant agent is smaller than expected. We suspect this is because actions which exert a strong causal influence over $scope^+(R)$ will be an ancestor of $scope^+(R)$ in $Pa^*$ regardless of whether we enforce (16). Actions the non-relevant agent leaves disconnected are those which exert little causal influence over $scope^+(R)$ (and thus usually have little effect on the optimal policy). This means that how much abandoning connectivity affects performance depends on how noisy the causation in the domain model is, which we don't know with certainty in advance.

Figure 3b shows how the agent's performance differs when varying the expert's tolerance level: as the tolerance increases, the agent takes longer to converge towards a good policy, and also learns a worse final policy. Figure 3c shows why: early on, the agent's policy is good enough for the high-tolerance expert, meaning the expert does not reveal the last few extra action variables, which would net only small increases in the agent's total reward.

## 5. Related Literature

Models of unawareness exist in logic (Board et al., 2011; Heifetz et al., 2013) and game theory (Feinberg, 2012), but interpret (un)awareness from an omniscient perspective. We instead model awareness from the agent's perspective and offer methods to overcome one's unawareness. Contextual bandits (Langford & Zhang, 2007) also deal with single-shot decision making as covered in this paper. Unlike our work, they focus on unstructured, noisy rewards, and don't address unawareness of actions or states. Rong (2016) presents Markov Decision Processes with Unawareness, which let an agent learn optimal behaviour even when unaware of some actions.
We build on this work by providing a concrete mechanism for discovering unforeseen factors and by allowing the agent to discover explicit belief variables rather than atomic states; this is necessary to tackle large, richly structured problems. McCallum & Ballard (1996) also learn an increasingly complex representation of the state space by gradually distinguishing between states which yield different rewards. Rather than dealing with unawareness, the focus of such approaches is on refining an existing state space. In other words, they do not support introducing unforeseen states or actions that the learner was unaware of before learning. Several works correct agent behaviour via expert interventions (Torrey & Taylor, 2013; Stone, 2009) or via agent-driven active learning (Masegosa & Moral, 2013; Murphy, 2001). Yet all such methods assume the expert's intended meaning can be understood without expanding the agent's conception of the state and action space. Our work allows experts to utter advice where ambiguity arises from their greater awareness of the problem.

## 6. Conclusion

We have presented an expert-assisted framework for learning structured decision problems, even when the learner starts unaware of factors critical to success. Further, our experiments show that being conservative about one's beliefs improves the effectiveness of learning. In future work, we intend to extend our model to sequential tasks, as well as single-shot decision problems (Innes & Lascarides, 2019). Additionally, we aim to lift some assumptions imposed on the expert, and expand the range of situations in which the agent can ask for advice. For instance, we could let the expert be fallible (Masegosa & Moral, 2013), or leverage clues about the location of hidden variables present in the structure of the learner's graphical model (Elidan et al., 2000) to guide questions asked to the expert.

## Acknowledgements

This work is supported by EPSRC (UK). We would like to thank Kobi Gal, Subramanian Ramamoorthy, Benji Rosman, Stefano Albrecht, Vaishak Belle, and the anonymous reviewers for their helpful feedback.

## References

Bartlett, M. and Cussens, J. Integer linear programming for the Bayesian network structure learning problem. Artificial Intelligence, 244:258-271, 2017. doi: 10.1016/j.artint.2015.03.003.

Board, O. J., Chung, K.-S., and Schipper, B. C. Two models of unawareness: Comparing the object-based and the subjective-state-space approaches. Synthese, 179(1):13-34, 2011.

Buntine, W. L. Theory refinement on Bayesian networks. CoRR, abs/1303.5709, 1991. URL http://arxiv.org/abs/1303.5709.

Cakmak, M. and Thomaz, A. Designing robot learners that ask good questions. In Proceedings of the 7th Annual ACM/IEEE International Conference on Human-Robot Interaction, 2012.

Coenen, A., Nelson, J. D., and Gureckis, T. M. Asking the right questions about human inquiry. PsyArXiv, 2017.

Elidan, G., Lotner, N., Friedman, N., and Koller, D. Discovering hidden variables: A structure-based approach. In NIPS, volume 13, pp. 479-485, 2000. URL http://ai.stanford.edu/~nir/Papers/ELFK1.pdf.

Feinberg, Y. Games with unawareness. Technical report, Stanford University, Graduate School of Business, 2012.
Friedman, N. The Bayesian structural EM algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 129-138. Morgan Kaufmann Publishers Inc., 1998.

Friedman, N. and Goldszmidt, M. Sequential update of Bayesian network structure. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pp. 165-174. Morgan Kaufmann Publishers Inc., 1997. URL http://dl.acm.org/citation.cfm?id=2074246.

Grice, H. P. Logic and conversation. pp. 41-58, 1975.

Heckerman, D., Geiger, D., and Chickering, D. M. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197-243, 1995.

Heifetz, A., Meier, M., and Schipper, B. C. Dynamic unawareness and rationalizable behavior. Games and Economic Behavior, 81:50-68, 2013.

Howard, R. A. and Matheson, J. E. Influence diagrams. Decision Analysis, 2(3):127-143, 2005.

Innes, C. and Lascarides, A. Learning factored Markov decision processes with unawareness. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), Montreal, Canada, 2019.

Karni, E. and Viero, M.-L. "Reverse Bayesianism": A choice-based theory of growing awareness. American Economic Review, 103(7):2790-2810, 2013.

Koller, D. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Laird, J. E., Gluck, K., Anderson, J., Forbus, K. D., Jenkins, O. C., Lebiere, C., Salvucci, D., Scheutz, M., Thomaz, A., and Trafton, G. Interactive task learning. IEEE Intelligent Systems, 32(4):6-21, 2017.

Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pp. 817-824, 2007.

Madigan, D., York, J., and Allard, D. Bayesian graphical models for discrete data. International Statistical Review / Revue Internationale de Statistique, pp. 215-232, 1995.

Masegosa, A. R. and Moral, S. An interactive approach for Bayesian network learning using domain/expert knowledge. International Journal of Approximate Reasoning, 54(8):1168-1181, 2013. doi: 10.1016/j.ijar.2013.03.009.

McCallum, A. K. and Ballard, D. Reinforcement learning with selective perception and hidden state. PhD thesis, Dept. of Computer Science, University of Rochester, 1996.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Murphy, K. P. Active learning of causal Bayes net structure. 2001. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8206.

Rong, N. Learning in the Presence of Unawareness. PhD thesis, Cornell University, 2016.

Knox, W. B. and Stone, P. Interactively shaping agents via human reinforcement: The TAMER framework. 2009. URL http://www.cs.utexas.edu/users/ai-lab/?KCAP09-knox.

Teyssier, M. and Koller, D. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pp. 584-590. AUAI Press, 2005.

Tian, J. and He, R. Computing posterior probabilities of structural features in Bayesian networks. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 538-547. AUAI Press, 2009.
Torrey, L. and Taylor, M. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1053-1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

Tsamardinos, I., Brown, L. E., and Aliferis, C. F. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31-78, 2006.

Wood, F., Griffiths, T. L., and Ghahramani, Z. A non-parametric Bayesian method for inferring hidden causes. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pp. 536-543. AUAI Press, 2006.

Zettlemoyer, L. and Collins, M. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.