# Improving Reinforcement Learning with Confidence-Based Demonstrations

Zhaodong Wang, School of EECS, Washington State University, zhaodong.wang@wsu.edu
Matthew E. Taylor, School of EECS, Washington State University, taylorm@eecs.wsu.edu

Reinforcement learning has had many successes, but in practice it often requires significant amounts of data to learn high-performing policies. One common way to improve learning is to allow a trained (source) agent to assist a new (target) agent. The goals in this setting are to 1) improve the target agent's performance, relative to learning unaided, and 2) allow the target agent to outperform the source agent. Our approach leverages source agent demonstrations, removing any requirements on the source agent's learning algorithm or representation. The target agent then estimates the source agent's policy and improves upon it. The key contribution of this work is to show that leveraging the target agent's uncertainty in the source agent's policy can significantly improve learning in two complex simulated domains, Keepaway and Mario.

1 Introduction

Reinforcement learning (RL) [Sutton and Barto, 1998] methods have been successfully applied to both virtual and physical robots. In some complex domains, the learning speed may be too slow to be feasible. One common way to speed up learning is transfer learning [Taylor and Stone, 2009], where one (source) agent is used to speed up learning in a second (target) agent. Unfortunately, many transfer learning methods make assumptions about the source and/or target agent's internal representation, learning method, prior knowledge, etc. Instead of requiring a particular type of knowledge to be transferred, past work on the Human Agent Transfer (HAT) algorithm [Taylor et al., 2011] allowed the source agent to demonstrate its policy, the target agent to bootstrap based on this policy, and then the target agent to improve its performance over that of the source agent. So that there are no restrictions on how the source agent learns, HAT records data from the source agent as state-action pairs. In this work the source agent could be either a human or a virtual agent, underscoring how different the source and target agents can be. (Footnote 1: Because of the small number of sub-optimal demonstrations from source agents, experience replay [Lin, 1992] would have limited use in complex tasks.) We also note that this approach is different from much of the existing learning from demonstration literature [Argall et al., 2009], as the target agent can autonomously improve upon (and outperform) the source agent's policy via RL.

The HAT algorithm can be briefly summarized in three steps. First, the source agent acts for a time in the task and the target agent records a set of demonstrations. Second, a decision tree learning method (e.g., Quinlan's J48 [1993]) summarizes the demonstrated policy as a static mapping from states to actions. Third, these rules are used by the target agent as a bias in the early stages of its learning. The key component of HAT is that it uses the learned classifier to bias its exploration. Initially, the target task agent follows the classifier, attempting to mimic the source agent. Over time, it balances exploration and exploitation of its own learned knowledge against exploitation of the classifier, effectively improving its performance relative to the source agent.
Immediately after performing transfer, it is unlikely that the target agent will be optimal, due to multiple sources of error. First, the source agent may be suboptimal. Second, the source agent (or source human) may be inconsistent, resulting in an inability to correctly summarize the source agent's policy. Third, the source data must be summarized, not memorized, because the decision tree will not exhaustively memorize all possible states. When it combines multiple (similar) states, some states may be classified incorrectly. Fourth, the source agent typically cannot exhaustively demonstrate all possible state-action pairs; the learned decision tree must generalize to unseen states, and this generalization may be incorrect. Different types and qualities of demonstrations may be more or less effective, depending on these four types of potential errors.

Error types two and three, and possibly error type four, may be addressed by considering the uncertainty in the classifier. Rather than blindly following a decision tree to select an action in a given state, as is done by HAT, this paper shows the benefits of leveraging the measured uncertainty in the transferred information. This work takes a critical first step in this direction by introducing CHAT (confidence-HAT), an enhancement to the HAT algorithm leveraging confidence-based demonstration. We evaluate CHAT using the domains of simulated robot soccer and Mario, empirically showing it outperforms both HAT and learning without transfer. We instantiate CHAT with three confidence models, Gaussian process (GPHAT), neural network (NNHAT), and decision tree (DTHAT), to show that measuring uncertainty helps. Even when low amounts of demonstration data are used, the initial performance (jumpstart) and overall performance (total reward) are significantly improved. By leveraging uncertainty in the estimate of the source agent's policy, CHAT may be useful in domains where initial performance is critical and demonstrations from a trained agent (or human) are available but nontrivial to collect.

2 Background

This section presents the basic techniques discussed in the paper: reinforcement learning, learning from demonstration, and the HAT algorithm.

2.1 Reinforcement Learning

Reinforcement learning is a process where an agent learns through experience by exploring the environment. RL algorithms typically leverage the Markov decision process (MDP) formulation. In an MDP, A is a set of actions an agent can take and S is a set of states. There are two (initially unknown) functions within this process: a transition function ($T : S \times A \mapsto S$) and a reward function ($R : S \times A \mapsto \mathbb{R}$). Different RL algorithms have different ways of learning to maximize the expected reward. In this paper, we use ϵ-greedy action selection with SARSA [Rummery and Niranjan, 1994; Singh and Sutton, 1996]:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma Q(s', a') - Q(s, a)\right]$$

and Q-learning [Watkins and Dayan, 1992]:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$

In cases where the state space is continuous or very large, Q cannot be represented as a table. In such cases, some type of function approximation is needed. In this paper we use a CMAC tile coding function approximator [Albus, 1981], where a state is represented by a vector of state variables.
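To make the update rules concrete, the following is a minimal, illustrative sketch of tabular SARSA with ϵ-greedy action selection in Python. It is not the implementation used in the paper (which relies on CMAC tile coding over continuous states); the environment interface (`env.reset()`, `env.step()`) and the parameter defaults are assumptions for illustration only.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=100, alpha=0.05, gamma=1.0, epsilon=0.1):
    """Tabular SARSA: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()                        # assumed environment API
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)      # assumed environment API
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```

The Q-learning variant would replace the on-policy target `Q[(s_next, a_next)]` with `max_a Q[(s_next, a)]`.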
2.2 Demonstration

Demonstrations are typically recorded as a vector of state-action pairs $\langle x, a \rangle$, in which x is the state vector (multiple state features composed to describe a state s) and a is the corresponding action. There are many ways of collecting this data, from visual observation to directly recording actions during teleoperation. Learning from demonstration methods typically try to mimic this collected data. HAT differs from much of the existing work [Argall et al., 2009] because its goal is to improve upon the initial demonstration data. Most relevant to this work is that of Chernova and Veloso [2009], which showed that a nearest-neighbor distance metric provides a measurement of confidence in pure LfD, allowing the agent to know when to use existing demonstrations and when to request additional demonstrations from a human expert. This paper focuses on leveraging confidence measures to help RL agents select actions to improve upon demonstrated data.

2.3 Human Agent Transfer

HAT's goal is to leverage data from a source agent or source human, and then improve upon its performance with RL. HAT leverages rule transfer [Taylor and Stone, 2007], and the demonstrated knowledge is summarized via a decision tree. The following steps summarize HAT:

1. Learn a policy from the source task: A source agent has some policy ($\pi : S \mapsto A$) in a task and takes actions following that policy. The state-action pairs are stored as demonstration data.
2. Train a decision tree: A decision tree is trained to summarize the state-action pairs. The decision tree is essentially a static set of rules.
3. Bootstrap the target task with the decision tree: Instead of randomly exploring, the agent uses the learned rules to guide action selection.

There are three ways of using the decision tree to improve learning performance, but this paper focuses on probabilistic policy reuse (PPR). In PPR, there is a parameter Φ that determines whether the learning agent should follow the classifier: the RL agent reuses the transferred rule policy with probability Φ, acts randomly with probability ϵ, and exploits its Q-values with probability 1 − Φ − ϵ. Φ typically starts near 1 and decays exponentially, forcing the agent to initially follow the source policy and to increasingly rely on its own learned Q-values over time.
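As a concrete illustration of PPR's three-way action choice, here is a short Python sketch. The interface of the transferred policy (`source_policy`) and the placement of the decay step are assumptions for illustration, not the authors' code.

```python
import random

def ppr_action(state, Q, actions, source_policy, phi, epsilon):
    """Probabilistic policy reuse: follow the transferred policy with
    probability phi, explore randomly with probability epsilon, and
    otherwise exploit the learned Q-values."""
    roll = random.random()
    if roll < phi:
        return source_policy(state)                       # reuse transferred rules
    if roll < phi + epsilon:
        return random.choice(actions)                      # random exploration
    return max(actions, key=lambda a: Q[(state, a)])       # exploit learned Q-values

# Example decay: phi starts near 1 and shrinks each time step,
# e.g. phi *= 0.999 (Keepaway) or phi *= 0.9999 (Mario), per Section 4.3.
```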
3 Confidence Measurement of HAT (CHAT)

In this section we introduce improved methods (CHAT) that leverage the confidence of the demonstration, based on three models: Gaussian process (GPHAT), neural network (NNHAT), and decision tree (DTHAT). Once calculated, a learning agent can use this confidence in multiple ways. When our agent attempts to exploit source knowledge, it executes the action suggested by the provided demonstration if its confidence is above some confidence threshold. Otherwise, it executes the default action (null action or random exploration). We build upon PPR, letting Φ decay. To implement CHAT, we 1) record data from a source policy, 2) train a confidence-aware classifier on this dataset, and 3) use Algorithm 1 to learn the task.

Gaussian Process (GPHAT)

A Gaussian model is typically defined as:

$$P(\omega_i \mid x) = \frac{1}{\sqrt{2\pi|\Sigma_i|}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}$$

where $\omega_i$ is the predicted label, $\Sigma_i$ is the covariance matrix of the data of class i, and $\mu_i$ is the mean of the data of class i. Considering Bayes decision rules, we have the prediction by a classifier:

$$\omega_i = \arg\max_{\omega_i} \left[\ln P(\omega_i \mid x) + \ln P(\omega_i)\right] = \arg\max_{\omega_i} \left[-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\ln 2\pi|\Sigma_i| + \ln P(\omega_i)\right] = \arg\min_{\omega_i} \left[d(x, \mu_i) + \alpha_i\right]$$

where $d(x, \mu_i) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)$ and $\alpha_i = \ln 2\pi|\Sigma_i| - 2\ln P(\omega_i)$. This classifier is derived from Bayes decision rules and it optimizes the boundary between the data with different labels. If we directly use the above classifier, we only receive a binary decision. Instead, we define a confidence function for class label i:

$$C_i(x) = \exp\{-d(x, \mu_i) - \alpha_i\} \quad (1)$$

Notice that a typical GP classifier maps from the input space (data) to the output space (class), which still provides only a classification result. Because we want a confidence value along with the classification output, we build on the original GP and define the confidence function above to calculate confidence.

Algorithm 1: GPHAT: Bootstrap target learning
Input: confidence model GP, confidence threshold T, PPR initial probability Φ0, PPR decay factor ΦD

    Φ ← Φ0
    for each episode do
        initialize state s to the start state
        for each step of the episode do
            if rand() ≤ Φ then
                use the GP to compute Ci, as shown in (1), for each action
                if max Ci ≥ T then
                    a ← the corresponding ai
                else
                    a ← the default action a0
            else if rand() ≤ ϵ then
                a ← a random action
            else
                a ← the action that maximizes Q
            execute action a
            observe new state s' and reward r
            update Q (SARSA, Q-learning, etc.)
            Φ ← Φ × ΦD

Neural Network (NNHAT)

We use a 2-hidden-layer neural network as our confidence model. To calculate the uncertainty of the demonstration, we apply softmax regression [Bishop, 2006, pp. 206-209] at the output layer:

$$C_i(x) = \frac{\exp(\theta_i^T x)}{\sum_j \exp(\theta_j^T x)}$$

$C_i(x)$ is then used as the confidence of class i.

Decision Tree (DTHAT)

We use the accuracy of each leaf node as an estimate of the confidence. Assuming the training and test data have the same distribution, we use the heuristic that the more data a node correctly classifies, the less uncertainty we expect in the node's decision. The percentage of correctly classified data at each leaf node is used as the classification confidence.
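To make the Gaussian confidence model concrete, here is a minimal NumPy sketch of Equation (1) and the thresholded suggestion step used inside Algorithm 1. The class name, the covariance regularization, and the handling of the default action are illustrative assumptions, not the authors' released code.

```python
import numpy as np

class GaussianConfidence:
    """One-vs.-all Gaussian confidence model for one action (class)."""
    def __init__(self, X, prior, reg=1e-6):
        # X: (n_samples, n_features) demonstration states labeled with this action
        self.mu = X.mean(axis=0)
        self.sigma = np.cov(X, rowvar=False) + reg * np.eye(X.shape[1])
        self.sigma_inv = np.linalg.inv(self.sigma)
        # alpha_i = ln(2*pi*|Sigma_i|) - 2*ln P(omega_i), as defined in the text
        self.alpha = np.log(2 * np.pi * abs(np.linalg.det(self.sigma))) - 2 * np.log(prior)

    def confidence(self, x):
        """C_i(x) = exp{-d(x, mu_i) - alpha_i}, Equation (1)."""
        diff = x - self.mu
        d = diff @ self.sigma_inv @ diff      # Mahalanobis-style distance d(x, mu_i)
        return np.exp(-d - self.alpha)

def gphat_suggest(x, models, threshold, default_action=None):
    """Return the demonstrated action whose confidence clears the threshold,
    otherwise fall back to the default action (as in Algorithm 1)."""
    confidences = {a: m.confidence(x) for a, m in models.items()}
    best_action = max(confidences, key=confidences.get)
    if confidences[best_action] >= threshold:
        return best_action
    return default_action
```

Within the PPR loop, this suggestion would only be consulted with probability Φ; otherwise the agent explores randomly or exploits its own Q-values.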
4 Experimental Setting

This section discusses the two experimental domains and our methodology.

4.1 Mario

Mario is a benchmark domain [Karakovskiy and Togelius, 2012], based on Mario Brothers. In this simulation, Mario (the learning agent) is trained to score as many points as possible. The game state is represented as a 27-tuple vector, indicating the state and position information of Mario and his enemies [Suay et al., 2016]. This vector space allows for $3.65 \times 10^{10}$ different states, indicating the complexity of the learning problem. The action space for Mario is generated from three groups: {no direction, left, right}, {don't jump, jump}, and {run/fire, don't run/fire}. By selecting one sub-action from each of the three groups simultaneously, Mario has a total of 12 (3 × 2 × 2) different actions.

4.2 Keepaway Simulation

Keepaway is a simulated robot soccer game. We use version 9.4.5 of the RoboCup Soccer Server [Noda et al., 1998] and version 0.9 of the Keepaway player framework [Stone et al., 2006]. There are 3 keepers and 2 takers, playing within a bounded square. Keepers learn to keep control of the ball while takers follow hard-coded rules to chase after the ball. An episode of the simulated game starts from an initial state and ends with an interception by the takers or the ball going out of bounds. The game is mapped into a discrete time sequence to make it possible to control every player. We use a continuous 13-tuple vector to represent the states (e.g., position information such as distances and angles). Once a keeper gets the ball, it must make a decision among three actions: Hold (hold the ball), Pass1 (pass the ball to the closer teammate), or Pass2 (pass the ball to the further teammate). The two keepers without the ball follow a fixed policy to try to get open for a pass. The reward is +1 per time step for every keeper.

4.3 Methodology

Demonstrations (state-action trajectories) are collected from a human participant via a visualizer or directly from an agent. We first evaluate CHAT with the three confidence models in Mario. For GPHAT, we train one-vs.-all Gaussian classifiers for each action. For NNHAT, we build a 2-hidden-layer network (50 nodes in each hidden layer). For DTHAT, we train a J48 tree with the default parameters of Weka 3.6. Second, we also evaluate GPHAT in Keepaway. We train Gaussian classifiers only on the actions Pass1 and Pass2, as the action Hold is executed roughly 70% of the time, making the data unbalanced. Notice that the Gaussian classifiers for Pass1 and Pass2 are one-vs.-all since it is a multiclass problem. Similarly, two one-vs.-all decision trees are trained exclusively for Pass1 and Pass2.

To achieve better classification accuracy, we first use clustering to help determine the number of Gaussian classifiers in GPHAT (which can be greater than or equal to the number of actions). We cluster the data of each class into N groups using the Expectation-Maximization (EM) algorithm [Celeux and Govaert, 1992], with default parameter settings in Weka 3.6 [Witten and Frank, 2005], and then train N Gaussian classifiers for that class. We determine N by comparing the average performance over the first few episodes. By having several smaller data clusters, the precision of the classifier can be increased. We only use clustering for the Gaussian model.

We use SARSA in Keepaway and Q-learning in Mario to be consistent with previous work. SARSA uses α = 0.05, ϵ = 0.1, and γ = 1. Q-learning uses α = 1/(10 × 32), ϵ = 0.1, and γ = 0.9. Notice that these parameters are consistent with previous research in these domains. The parameter Φ determines when the agent listens to prior knowledge. Φ is multiplied by a decay factor, ΦD, on every time step. Among {0.9, 0.99, 0.999, 0.9999}, preliminary experiments found ΦD = 0.999 to be the best for Keepaway and ΦD = 0.9999 to be the best for Mario (explored further in Section 5.2).

We evaluate learning performance in terms of jumpstart and total reward. Jumpstart is the average initial performance before learning; a higher jumpstart indicates that the prior knowledge is more useful. Overall performance is measured by the area under a learning curve.
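As an illustration of the two evaluation metrics, the sketch below computes jumpstart and total reward from a list of per-episode returns. The number of initial episodes averaged for the jumpstart, and the hypothetical experiment driver in the usage comment, are assumptions for illustration.

```python
def jumpstart(episode_returns, n_initial=10):
    """Average performance over the first episodes, before (much) learning occurs."""
    return sum(episode_returns[:n_initial]) / n_initial

def total_reward(episode_returns):
    """Area under the learning curve, approximated by summing per-episode returns."""
    return sum(episode_returns)

# Example usage (run_learner is a hypothetical experiment driver):
# chat_returns = run_learner(chat_agent)
# base_returns = run_learner(baseline_agent)
# print(jumpstart(chat_returns) - jumpstart(base_returns))
```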
5 Mario Results

In this section, we show our results of learning performance by leveraging confidence in the Mario domain. We also discuss and evaluate techniques that help improve CHAT. Simulation results are all averaged over 10 trials.

5.1 CHAT Outperforms HAT

In Mario, we collect demonstration data of 20 episodes (roughly 15 minutes) from a human player with an average score of 1876 points. To benchmark CHAT in the Mario domain, we compare our algorithm with HAT and with learning without any prior. Figure 1 shows the learning curves: CHAT successfully outperforms HAT and RL with no prior. In particular, the jumpstart of GPHAT, relative to HAT, was statistically significant (p < 10^-4 via t-tests). Here GPHAT uses four (on average) clusters for each action, PPR Φ0 = ΦD = 0.9999, and a confidence threshold of 0.8. Note that the learned performance remains below that of the average demonstrator for the training times considered, due to the domain's complexity.

[Figure 1: Learning curves of Confidence-HAT compared with HAT and RL without any bootstrapping in Mario.]

We next evaluate the two other confidence models (see Figure 1). The confidence threshold of the neural network is 0.6, while that of the decision tree is 0.85. We again see improvement relative to HAT.

To highlight the contribution of confidence-based demonstration, we perform an additional validation to see how performance changes if the agent selects actions based only on its learned experience rather than on prior knowledge during the learning process ("without prior" in Figure 2). Performance is averaged over 1000 episodes, with and without prior knowledge, every 5000 episodes. The difference in Figure 2 shows that performance improves significantly when leveraging confidence measures.

[Figure 2: Validation with and without prior knowledge using the confidence neural network.]

5.2 Tuning CHAT's Reuse Probability

Taking the Gaussian model (GPHAT) as an example, it uses prior knowledge with a decaying probability, as mentioned before. To see the effect of this parameter, comparison results are plotted in Figure 3. A lower reuse factor (e.g., 0.9991) would lead to a decrease in performance shortly after the start, while a higher factor (e.g., 0.9999) would not. Notice that this does not indicate that the reuse probability should be as close to 1 as possible: once the reuse probability becomes too high, exploration is decreased to the point that it is difficult for the agent to learn to outperform the source demonstration.

[Figure 3: Different Φ0 = ΦD settings with a Gaussian model.]

5.3 Tuning CHAT's Confidence Threshold

In Mario, the Gaussian's confidence threshold T is 0.8, determined through initial parameter tuning. We now discuss how policy consistency interacts with this parameter, where the behaviors of two agents are defined as consistent when they select the same action in the same state. First, we let a trained agent play Mario using its fixed policy (following its fixed Q-values) to generate 20 demonstration episodes. Second, we train GPHAT (with the same settings as above) on that demonstration. Third, we compare the actions suggested by the GPHAT classifier with the actual actions made by the fixed-policy agent to see how often they are the same.

Figure 4 shows how GPHAT acts with different confidence thresholds. For each confidence threshold, we show the number of actions made by GPHAT and the rate of consistency with respect to the fixed-policy agent. When the confidence threshold is too low, actions made by GPHAT are less likely to be the same as the source task agent's actions. When the confidence threshold is too high, the selected actions are consistent, but very few actions will be selected.

[Figure 4: Behavior transfer consistency at different confidence intervals.]
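A small sketch of the consistency check described above: for each candidate threshold, count how many states clear the threshold and how often the classifier's suggestion matches the fixed policy's action. The data format and the reuse of `gphat_suggest` from the earlier Section 3 sketch are illustrative assumptions.

```python
def consistency_by_threshold(states, fixed_actions, models, thresholds):
    """For each threshold, report (number of confident actions, consistency rate)
    of the confidence classifier against a fixed-policy demonstrator."""
    results = {}
    for t in thresholds:
        taken, matches = 0, 0
        for x, a_fixed in zip(states, fixed_actions):
            a = gphat_suggest(x, models, threshold=t)   # from the Section 3 sketch
            if a is not None:                           # only count confident suggestions
                taken += 1
                matches += (a == a_fixed)
        results[t] = (taken, matches / taken if taken else 0.0)
    return results

# e.g. consistency_by_threshold(S, A, models, thresholds=[0.5, 0.6, 0.7, 0.8, 0.9])
```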
6 Keepaway Results

This section evaluates our methods in a continuous domain, showing the benefits of our methods and investigating different types of demonstrations. Simulation results are all averaged over 10 trials.

6.1 Improvement Over Baselines

To compare different human demonstrations, we consider four demonstrations (each with 20 episodes); their source and performance (the average episode duration and standard deviation) are summarized in Table 1. (Footnote 2: Our code and demonstration datasets are available at the first author's website: irll.eecs.wsu.edu/lab-members/zhaodong-wang/.)

Table 1: Summary of the Keepaway demonstration datasets.

| Demonstration | Source | Average Duration |
|---|---|---|
| Simple-Response | Human | 10.5s ± 3.5s |
| Complex-Strategy | Human | 10.1s ± 3.8s |
| Novice | Human | 7.45s ± 2.2s |
| Learned-Policy | Learned Agent | 10.1s |

The human player demonstrated three qualitatively different policies:

1. Simple-Response: The player holds the ball until the takers are very close to the keeper with the ball. The player only passes the ball when necessary.
2. Complex-Strategy: The player is more flexible and active in this setting. The player tries to pass the ball more often, requiring the keepers to move more. However, the player also tries to act inconsistently when possible, so that the player does not always take the same action, as long as the chosen actions are still rational.
3. Novice: An even worse case, only slightly better than a random policy, in which many sub-optimal actions are demonstrated.

We show learning curves using the first two demonstrations in Figure 5. Calculated total rewards are in Table 3. Notice that HAT with double DTs works better than HAT with a single DT; we therefore focus on comparing GPHAT with double-DT HAT in the remainder of this section. As expected, both sets of demonstrations allow HAT and GPHAT to outperform learning without any prior, and GPHAT improves more than HAT. However, there is a significant difference between the two datasets. The Complex-Strategy data is harder to train classifiers on. This is supported by Table 2: the J48 pruned tree needed to be deeper but still had lower accuracy, indicating that the Complex-Strategy demonstration needs a decision tree with more complexity relative to the Simple-Response demonstration.

[Figure 5: Learning curves of GPHAT compared with HAT and RL without any bootstrapping in Keepaway.]

Table 2: Comparisons among different methods. For double DTs, depth and accuracy are averaged over the two trees. For Gaussian processes, the confidence threshold is 0.9.

| Demonstration | Single-DT Jumpstart | Single-DT Depth | Single-DT Accuracy | Double-DT Jumpstart | Double-DT Depth | Double-DT Accuracy | GPHAT Jumpstart | GPHAT Clusters | GPHAT Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| Simple-Response | +2.23 | 4 | 87.52% | +2.53 | 4 | 90.61% | +3.42 | 2 | 83.21% |
| Complex-Strategy | +1.76 | 7 | 67.21% | +2.26 | 6 | 84.36% | +3.37 | 3 | 80.16% |
| Novice | +1.13 | 5 | 86.24% | +1.44 | 5 | 92.86% | +3.49 | 2 | 84.37% |
| Learned-Policy | +3.18 | 4 | 88.67% | +3.26 | 4 | 91.22% | +4.55 | 2 | 86.12% |

Table 3: Total rewards of different methods in Keepaway.

| Method | Total Reward (5 hours) | Total Reward (20 hours) |
|---|---|---|
| GPHAT (Simple-Response) | 76.9 | 290.4 |
| GPHAT (Complex-Strategy) | 75.1 | 283.6 |
| HAT (Simple-Response) | 72.8 | 270.3 |
| HAT (Complex-Strategy) | 62.7 | 251.6 |
| No-Prior | 47.2 | 219.8 |

In addition, we compare the robustness of GPHAT and HAT on the Novice demonstration in Figure 5 and Table 2. GPHAT still improves learning performance even on such poor data because it can put less weight on actions that have lower confidence (i.e., instances where the classifier is less certain about the source agent's policy). These results allow us to conclude that 1) CHAT can outperform both an RL agent with no bias and a HAT agent for a variety of demonstration data qualities, and 2) CHAT agents are able to successfully outperform the demonstrated policies.

6.2 Ensembles of Confidence Thresholds

Rather than tuning CHAT's confidence threshold, we also consider ensemble methods [Dietterich, 2000] made of learners with different Φ0's. The ensemble uses majority voting, weighted by the confidence threshold of each prediction; the voting weight used in this paper is the confidence threshold of each classifier. For example, if we considered a confidence threshold of 0.7, classifiers in GPHAT with confidence higher than 0.7 vote for the final action selection, but with a scaled weight (multiplied by the threshold, 0.7 in this case). Intuitively, we would like predictions with lower confidence to still be considered in the final action selection, but with less significance. By doing this at different confidence thresholds, we select the final action with the highest vote total. Figure 6 shows the result of an ensemble of 5 confidence thresholds (from 0.5 to 0.9). GPHAT with an ensemble can outperform the best single confidence threshold (0.9) from the previous section, at the expense of additional computation (linear in the number of ensemble members).

[Figure 6: Performance improvement using an ensemble of different confidence thresholds.]
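A minimal sketch of the confidence-weighted voting described above. It reuses the `gphat_suggest` helper from the earlier Section 3 sketch; the tie-breaking and the fallback behavior are illustrative assumptions rather than the authors' exact scheme.

```python
def ensemble_vote(x, models, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Confidence-weighted majority vote over the same classifiers run at
    several confidence thresholds."""
    votes = {}
    for t in thresholds:
        a = gphat_suggest(x, models, threshold=t)   # from the Section 3 sketch
        if a is not None:
            # each confident suggestion votes with a weight equal to its threshold,
            # so low-confidence suggestions still count, but with less significance
            votes[a] = votes.get(a, 0.0) + t
    if not votes:
        return None            # no classifier was confident; fall back to the default action
    return max(votes, key=votes.get)
```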
7 Conclusion and Future Work

This paper has introduced and evaluated CHAT, showing successful transfer from a human to an RL agent in two complex domains. CHAT outperformed both an existing method and the original demonstrations. Such improvements are most important when learning is slow or initial performance is critical. Results have shown that applying different confidence models yields different learning performance, which can depend on the type and amount of human-demonstrated data. In our domains, CHAT with the Gaussian model converged to the best performance, even when there is little demonstrated data. Additional results investigated how parameters in CHAT affect performance.

Having shown the potential of CHAT, future work will consider a number of extensions. First, we will investigate whether the confidence factor could be used to target where additional human demonstrations are needed. Second, we will use CHAT to learn from multiple agents; we expect that the confidence of a classifier and the demonstrated ensemble method will both be useful when the target agent is deciding which source agent to follow. Third, we have also shown how, in the Keepaway domain, the actions executed are unbalanced and have unequal importance. To make transfer more efficient, the demonstration data could be modified to focus on the most important data, eliminating redundant data. Fourth, we will investigate adaptive methods that can judge the significance of demonstration data, further improving learning performance.

Acknowledgements

We thank Tim Brys for sharing code for Mario. This research has taken place in part at the Intelligent Robot Learning (IRL) Lab, which is supported in part by NASA NNX16CD07C, NSF IIS-1149917, NSF IIS-1643614, and USDA 2014-67021-22174.
References

[Albus, 1981] J. S. Albus. Brains, Behavior, and Robotics. Byte Books, Peterborough, NH, 1981.
[Argall et al., 2009] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469-483, 2009.
[Bishop, 2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[Celeux and Govaert, 1992] Gilles Celeux and Gérard Govaert. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3):315-332, 1992.
[Chernova and Veloso, 2009] Sonia Chernova and Manuela Veloso. Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34:1-25, 2009.
[Dietterich, 2000] Thomas G. Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pages 1-15. Springer, 2000.
[Karakovskiy and Togelius, 2012] Sergey Karakovskiy and Julian Togelius. The Mario AI benchmark and competitions. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):55-67, 2012.
[Lin, 1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293-321, 1992.
[Noda et al., 1998] Itsuki Noda, Hitoshi Matsubara, Kazuo Hiraki, and Ian Frank. Soccer Server: A tool for research on multiagent systems. Applied Artificial Intelligence, 12(2-3):233-250, 1998.
[Quinlan, 1993] Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[Rummery and Niranjan, 1994] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical report, Cambridge University Engineering Department, 1994.
[Singh and Sutton, 1996] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123-158, 1996.
[Stone et al., 2006] Peter Stone, Gregory Kuhlmann, Matthew E. Taylor, and Yaxin Liu. Keepaway soccer: From machine learning testbed to benchmark. In RoboCup 2005: Robot Soccer World Cup IX, pages 93-105. Springer, 2006.
[Suay et al., 2016] Halit Bener Suay, Tim Brys, Matthew E. Taylor, and Sonia Chernova. Learning from demonstration for shaping through inverse reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 429-437. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[Taylor and Stone, 2007] Matthew E. Taylor and Peter Stone. Cross-domain transfer for reinforcement learning. In Proceedings of the 24th International Conference on Machine Learning, pages 879-886. ACM, 2007.
[Taylor and Stone, 2009] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633-1685, 2009.
[Taylor et al., 2011] Matthew E. Taylor, Halit Bener Suay, and Sonia Chernova. Integrating reinforcement learning with human demonstrations of varying ability. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2011.
[Watkins and Dayan, 1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
[Witten and Frank, 2005] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.