# madras__multi_agent_driving_simulator__758a6395.pdf

Journal of Artificial Intelligence Research 70 (2021) 1517-1555 Submitted 11/2020; published 04/2021

MADRaS: Multi Agent Driving Simulator

Anirban Santara nrbnsntr@gmail.com Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, WB, India

Sohan Rudra sohanrudra@iitkgp.ac.in Department of Mathematics, Indian Institute of Technology Kharagpur, Kharagpur, WB, India

Sree Aditya Buridi buridiaditya@iitkgp.ac.in Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, WB, India

Meha Kaushik meha.kaushik@microsoft.com Microsoft, Vancouver, Canada

Abhishek Naik abhishek.naik@ualberta.ca Department of Computing Science, University of Alberta, Alberta, Canada

Bharat Kaul bharat.kaul@intel.com Parallel Computing Lab, Intel Labs, Intel, Bengaluru, KA, India

Balaraman Ravindran ravi@cse.iitm.ac.in Robert Bosch Center for Data Science and Artificial Intelligence, Indian Institute of Technology Madras, Chennai, TN, India

Autonomous driving has emerged as one of the most active areas of research as it has the promise of making transportation safer and more efficient than ever before. Most real-world autonomous driving pipelines perform perception, motion planning and action in a loop. In this work we present MADRaS, an open-source multi-agent driving simulator for use in the design and evaluation of motion planning algorithms for autonomous driving. Given a start and a goal state, the task of motion planning is to solve for a sequence of position, orientation and speed values in order to navigate between the states while adhering to safety constraints. These constraints often involve the behaviors of other agents in the environment. MADRaS provides a platform for constructing a wide variety of highway and track driving scenarios where multiple driving agents can be trained for motion planning tasks using reinforcement learning and other machine learning algorithms. MADRaS is built on TORCS, an open-source car-racing simulator. TORCS offers a variety of cars with different dynamic properties and driving tracks with different geometries and surface properties. MADRaS inherits these functionalities from TORCS and introduces support for multi-agent training, inter-vehicular communication, noisy observations, stochastic actions,

© 2021 AI Access Foundation. All rights reserved.

Santara, Rudra, Buridi, Kaushik, Naik, Kaul, Ravindran

and custom traffic cars whose behaviors can be programmed to simulate challenging traffic conditions encountered in the real world. MADRaS can be used to create driving tasks whose complexities can be tuned along eight axes in well-defined steps. This makes it particularly suited for curriculum and continual learning. MADRaS is lightweight and it provides a convenient OpenAI Gym interface for independent control of each car. Apart from the primitive steering-acceleration-brake control mode of TORCS, MADRaS offers a hierarchical track-position speed control mode that can potentially be used to achieve better generalization. MADRaS uses a UDP-based client-server model where the simulation engine is the server and each client is a driving agent. MADRaS uses multiprocessing to run each agent as a parallel process for efficiency and integrates well with popular reinforcement learning libraries like RLlib. We show experiments on single and multi-agent reinforcement learning with and without curriculum.

1. Introduction

Inefficient driving habits of humans result in accidents, congestion and environmental pollution. These issues can be addressed efficiently if cars are able to operate autonomously. Additionally, humans lose hours of productivity in their cars during their daily commute. These possibilities have, of late, spurred an unprecedented amount of interest in self-driving car technology from researchers around the world.

Although realization of fully autonomous driving still seems far off, some specific low-level tasks pertaining to driving, such as adaptive cruise control, lane keep assistance and parking assistance, have already been automated at production scale in the form of Advanced Driver-Assistance Systems (ADAS) (Dikmen & Burns, 2016; Minster, Haghighat, Chu, & Vogt, 2018). Safe, optimal and fast motion planning in complex, multi-modal, multi-agent, and partially observable environments is the foremost technological challenge towards achieving full autonomy. Achieving these goals tractably using traditional motion planning algorithms like Model Predictive Control, RRT, A*, and Dijkstra is only possible under certain simplifying assumptions on the complexity of the environment (LaValle, 2006). On the other hand, Machine Learning based approaches including Reinforcement Learning (RL) (Sutton & Barto, 2018) and Learning from Demonstration (LfD) (Argall, Chernova, Veloso, & Browning, 2009) are capable of fast, reactive control under fewer assumptions (Shalev-Shwartz, Shammah, & Shashua, 2016; Bojarski, Yeres, Choromanska, Choromanski, Firner, Jackel, & Muller, 2017; Sharifzadeh, Chiotellis, Triebel, & Cremers, 2016; You, Lu, Filev, & Tsiotras, 2019). However, the training phase of these algorithms is often data-hungry (Fayjie, Hossain, Oualid, & Lee, 2018; Talpaert, Sobh, Kiran, Mannion, Yogamani, El-Sallab, & Perez, 2019), especially for those using highly expressive and complex models like deep neural networks. RL based methods also require online interaction with the environment, which entails risk (Shalev-Shwartz & Shashua, 2016; Santara, Naik, Ravindran, Das, Mudigere, Avancha, & Kaul, 2018). Driving simulators attempt to address these problems by rendering realistic driving conditions and traffic patterns in which agents can collect training data many times faster than real time. They also provide a sandbox environment where the agent can run into catastrophic situations while learning to drive without causing physical damage in the real world.

Real-world driving scenarios have a high degree of variability and require the driver to optimize for multiple, often conflicting, objectives depending on the situation they are in. Curriculum learning (Bengio, Louradour, Collobert, & Weston, 2009) and continual learning (Parisi, Kemker, Part, Kanan, & Wermter, 2019) are two families of machine learning algorithms that are relevant in this case. Curriculum learning provides a way of learning complex skills efficiently by breaking up the problem into a hierarchy of sub-tasks and learning to accomplish them in the order of increasing complexity. Continual learning, on the other hand, deals with learning to accomplish new tasks without forgetting previously acquired skills. A simulator for curriculum and continual learning of autonomous driving agents should be able to create a large variety of driving scenarios with fine-grained control over their complexities.

Since the early days of autonomous driving research, simulators have been used in the development of different parts of the perceive-plan-act pipeline (Sulkowski, Bugiel, & Izydorczyk, 2018). Most of these simulators cater to the task of perception. Back in 1989, the creator of ALVINN, Dean A. Pomerleau, used a simulator to generate training images for road detection (Pomerleau, 1989). Thanks to the recent advances in computer graphics, modern driving simulators and games like Grand Theft Auto V¹ can render photo-realistic driving scenes with accurate depiction of illumination, weather and other physical phenomena. They also simulate real-life sensors that can be used to collect synthetic data from these scenes to augment real-world driving datasets. Recent works (Chen, Seff, Kornhauser, & Xiao, 2015; Richter, Vineet, Roth, & Koltun, 2016; Richter, Hayder, & Koltun, 2017; Ros, Sellart, Materzynska, Vazquez, & Lopez, 2016) have demonstrated that training perception algorithms on these augmented datasets results in better generalization in the real world, which is crucial for safe and reliable autonomous driving. The most notable open-source driving simulators in this category are CARLA (Dosovitskiy, Ros, Codevilla, Lopez, & Koltun, 2017), Microsoft AirSim (Shah, Dey, Lovett, & Kapoor, 2018), DeepDrive.io and Udacity's Self Driving Car Simulator (Brown et al., 2018). These simulators can, in principle, also be used for planning tasks. However, an agent learning to face real-world driving scenarios must learn to be invariant to road geometries, traffic patterns and vehicular dynamics. These simulators do not offer the variability along these dimensions that is necessary to learn such invariances. In a typical driving scene, multiple entities (cars, buses, bikes, and pedestrians) try to achieve their objectives of getting from one place to another fast, yet safely and reliably. A simulator for such an environment should provide an easy way to create arbitrary traffic configurations. The task of negotiating in traffic is akin to finding the winning strategy in a multi-agent game (Dresner & Stone, 2008). Hence, an autonomous driving simulator should be able to simulate different varieties of traffic and support multiple agents learning to negotiate and drive through cooperation and competition. Among the aforementioned simulators, AirSim, DeepDrive.io and Udacity provide some preset driving conditions, mostly without traffic. They do not provide any straightforward way to create custom traffic or train multiple agents. CARLA does provide an API for independent control of cars that can be used for multi-agent training and creating custom traffic cars. However, most of the variability presented by CARLA is in the perceived inputs and not in the behavioral dynamics of the ego-vehicle or the traffic agents. This motivated us to develop a dedicated simulator for learning to plan in autonomous driving with a focus on learning invariances to road geometries, traffic patterns and vehicular dynamics in both single and multi-agent learning settings.

1. Available online at https://www.rockstargames.com/V/

In this paper we present MADRaS, a Multi-Agent DRiving Simulator for motion planning in autonomous driving, and demonstrate its ability to create driving scenarios with high degrees of variability. We present results of training reinforcement learning agents to accomplish challenging tasks like driving vehicles with drastically different dynamics, maneuvering through a variety of track geometries, navigating through a narrow road while avoiding collisions with both moving and parked traffic cars, and making two cars learn to cooperate and pass through a traffic bottleneck. We also demonstrate how curriculum learning can help in reducing the sample complexity of some of these tasks. Built on top of the TORCS platform (Wymann, Espié, Guionneau, Dimitrakakis, Coulom, & Sumner, 2000), MADRaS uses simplified physics simulation and representative graphics to reduce the computational overhead for perception and action. It allows for the addition of learning and non-learning cars to a driving scene to create custom traffic configurations and train multiple agents simultaneously. Each driving agent gets a high-level object-oriented representation of the world as observation and an OpenAI Gym (Brockman, Cheung, Pettersson, Schneider, Schulman, Tang, & Zaremba, 2016) interface for independent control. MADRaS is open source² and aims to contribute to the democratization of artificial intelligence.

The rest of the paper is organised as follows. Section 2 introduces the theoretical concepts that guide the organization of MADRaS. Section 3 describes our contributions in this project in detail. Section 4 presents six experimental studies that highlight the ability of MADRaS to simulate driving tasks of high variance. Finally, Section 5 concludes the paper with directions for future work.

2. Background

In this section, we introduce the concepts of Markov Decision Process (MDP), Markov Game (MG), Reinforcement Learning (RL) and Episodic Learning that comprise the foundation of MADRaS.

2.1 Reinforcement Learning in Markov Games

A Markov Decision Process (MDP) is a mathematical construct that is commonly used in the artificial intelligence literature to describe an environment in which agents learn to act (Sutton & Barto, 2018). In a single-agent learning set-up, an MDP can be expressed as a 4-tuple $M = \langle S, A, P, R \rangle$. It consists of a state space $S$, an action space $A$, a transition dynamics function $P : S \times A \times S \to [0, 1]$ that gives the probability distribution over next states for each action taken in a given state, and a reward function $R : S \times A \to \mathbb{R}$ that qualifies the task at hand. An agent learning to act in this environment receives observations about the current state and samples actions from its policy $\pi : S \times A \to [0, 1]$, which is a conditional distribution over $A$ given a state in $S$. The reward function $R$ gives scalar feedback about these actions that indicates the agent's progress towards the goal. The agent optimizes the parameters of its policy to maximize the cumulative reward received from the environment. This form of learning through trial and error with feedback from the environment is known as Reinforcement Learning (RL).

2. Code available at https://github.com/madras-simulator/MADRaS

In a multi-agent reinforcement learning set-up, the environment is described as a Markov Game (MG), which is a generalization of the MDP that captures the interplay of multiple agents (Littman, 1994; Bu, Babu, De Schutter, et al., 2008; Bowling & Veloso, 2000; Da Silva & Costa, 2019; Lin, Beling, & Cogill, 2017; Yu, Song, & Ermon, 2019; Lin, Adams, & Beling, 2018). An MG is a tuple $\langle S, \{\alpha_i\}_{i=1}^{n}, \{A_i\}_{i=1}^{n}, P, \{R_i\}_{i=1}^{n} \rangle$. Here, $\{\alpha_i\}_{i=1}^{n}$ denotes a set of $n$ agents that simultaneously learn to act in an environment with state space $S$ and transition dynamics function $P$. $A_i$ and $R_i$ denote the set of actions and the reward function of agent $\alpha_i$.

2.2 Episodic Learning

In episodic learning (Seel, 2011), an agent's experience happens in the form of episodes. Each episode begins with the agent in one of the initial states of the environment. The state of the environment changes in response to the agent's actions and the episode ends when the environment sends a "done" signal to the agent. In a general multi-agent learning setting, the environment may send a "done" signal to each agent separately at different time steps, resulting in different episode lengths for each agent. When the episodes of all the agents end, the environment resets itself to one of its initial states and starts new episodes for each agent.
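The episode logic described above can be illustrated with a minimal Python sketch. It is not MADRaS code: `ToyMultiAgentEnv`, its per-agent episode lengths and the placeholder rewards are hypothetical stand-ins chosen only to show how agents that receive their "done" signals at different steps can share one environment that is reset once all of them have finished.

```python
import random


class ToyMultiAgentEnv:
    """Hypothetical toy environment with a separate 'done' signal per agent."""

    def __init__(self, agent_ids, max_steps, seed=0):
        self.agent_ids = list(agent_ids)
        self.max_steps = dict(max_steps)  # assumed per-agent episode length
        self.rng = random.Random(seed)
        self.steps = {}

    def reset(self):
        # Start a new episode for every agent.
        self.steps = {a: 0 for a in self.agent_ids}
        return {a: 0.0 for a in self.agent_ids}

    def step(self, actions):
        # Only agents whose episodes are still running submit actions.
        obs, rewards, dones = {}, {}, {}
        for a, action in actions.items():
            self.steps[a] += 1
            obs[a] = self.rng.random()
            rewards[a] = -abs(action)  # placeholder reward
            dones[a] = self.steps[a] >= self.max_steps[a]
        return obs, rewards, dones


def run_episode(env, policies):
    """Step the environment until every agent has received its 'done' signal."""
    obs = env.reset()
    active = set(env.agent_ids)
    lengths = {a: 0 for a in env.agent_ids}
    while active:
        actions = {a: policies[a](obs[a]) for a in active}
        obs, rewards, dones = env.step(actions)
        for a in list(active):
            lengths[a] += 1
            if dones[a]:
                active.remove(a)  # this agent's episode has ended
    return lengths


env = ToyMultiAgentEnv(["a", "b"], {"a": 3, "b": 5})
lengths = run_episode(env, {"a": lambda o: 0.1, "b": lambda o: -0.2})
```

In this sketch the agents finish after different numbers of steps, and the caller would invoke `run_episode` again (and thus `reset`) only after both are done.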

3. MADRaS Simulator

In this section we describe the structure and organization of the MADRaS simulator, which constitutes the main contribution of this paper. The current version of MADRaS is focused on track driving. Track driving is traditionally used in the automotive world to benchmark driver skill and car agility. We first present a brief overview of the TORCS simulator and associated prior works that MADRaS builds upon. Then we describe the new features that we develop in this project and present a thorough empirical analysis of their relevance in the context of planning in autonomous driving.

3.1 TORCS Simulator

MADRaS is based on TORCS, which stands for The Open Racing Car Simulator (Wymann et al., 2000). It is capable of simulating the essential elements of vehicular dynamics such as mass, rotational inertia, collision, mechanics of suspensions, links and differentials, friction and aerodynamics. Physics simulation is simplified and is carried out through Euler integration of differential equations at a temporal discretization level of 0.002 seconds. The rendering pipeline is lightweight, based on OpenGL (Neider, Davis, & Woo, 1993), and can be turned off for faster training. TORCS offers a large variety of tracks and cars as free assets that we discuss later in this section. It also provides a number of programmed robot cars with different levels of performance that can be used to benchmark the performance of human players and software driving agents. TORCS was built with the goal of developing Artificial Intelligence for vehicular control and has been used extensively by the machine learning community ever since its inception (Li, Song, & Ermon, 2017; Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, & Wierstra, 2019; Loiacono, Prete, Lanzi, & Cardamone, 2010b; Koutník, Cuccu, Schmidhuber, & Gomez, 2013; Koutník, Schmidhuber, & Gomez, 2014; Onieva, Cardamone, Loiacono, & Lanzi, 2010).

Table 1: Comparison of Gym TORCS (Yoshida, 2016) with MADRaS

Feature | Gym TORCS | MADRaS
scr-server architecture
observation noise
stochastic outcomes of actions
parallel rollout support
multi-agent training
inter-vehicular communication
custom traffic cars
domain randomization
centralized configuration
modular reward and done functions
hierarchical action space

Figure 1: Architecture of the MADRaS simulation environment. Each double-headed arrow indicates one UDP communication channel between the TORCS server and one of the clients (traffic or MADRaS agents). The server listens to the $i$-th client through a dedicated port denoted by $p_i$ in the figure. MADRaS assigns these ports in order, first to the traffic agents and then to the learning agents. The Markov Game terms are also marked in their respective places of definition in the figure.

3.2 SCR Server-Client Architecture

The Simulated Car Racing (SCR) Championship (Loiacono, Lanzi, Togelius, Onieva, Pelta, Butz, Lönneker, Cardamone, Perez, Sáez, et al., 2010a) is an annual car-racing competition where participants submit controllers for racing in the TORCS environment. It provides a software patch for TORCS known as scr_server (Loiacono, Cardamone, & Lanzi, 2013). It sets up a UDP-based client-server architecture in which the competing cars can operate independently of one another. The server runs the TORCS simulator. Each client represents a car that runs as a separate process and communicates with the server through a dedicated UDP port. The patch also provides a layer of abstraction over TORCS in which each car has access to an egocentric view of the environment and not the entire game state. The server polls actions from the clients and updates the game state every 0.02 seconds of simulated time. The official build of TORCS supports up to 10 SCR clients at a time, but with modifications like those in (Kaushik, Prasad, Krishna, & Ravindran, 2018) the number of clients can be increased arbitrarily.
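The two time scales mentioned in this section (a 0.002 s Euler step for the physics and a 0.02 s interval at which the server polls actions) imply that each action is held for ten physics steps. The following Python sketch illustrates that relationship. The longitudinal model (`max_accel`, `drag`) is a hypothetical toy and is not the TORCS physics engine.

```python
PHYSICS_DT = 0.002  # seconds per Euler step, as stated for TORCS
CONTROL_DT = 0.02   # seconds between action polls, as stated for the SCR server
SUBSTEPS = round(CONTROL_DT / PHYSICS_DT)  # physics steps per polled action


def euler_step(speed, throttle, dt, max_accel=5.0, drag=0.1):
    """One explicit Euler step of a toy model dv/dt = max_accel*throttle - drag*v."""
    accel = max_accel * throttle - drag * speed
    return speed + accel * dt


def control_tick(speed, throttle):
    """Hold one action constant while the physics advances by CONTROL_DT."""
    for _ in range(SUBSTEPS):
        speed = euler_step(speed, throttle, PHYSICS_DT)
    return speed


speed_after_one_tick = control_tick(0.0, 1.0)
```

Under these assumptions an agent observes the world only once per ten integration steps, which is why the simulated control frequency, and not the physics frequency, sets the length of an episode in agent steps.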

3.3 Gym TORCS Environment

Gym TORCS (Yoshida, 2016) is an OpenAI Gym (Brockman et al., 2016) wrapper for SCR cars built for use in Reinforcement Learning experiments. It uses a custom library called SnakeOil to create a client for communicating with the TORCS server through the scr_server interface. SnakeOil also provides plug-ins for automatic transmission, traction control and throttle control which can be used to provide different control modes to the driving agent. Gym TORCS is popular in the reinforcement learning community for experiments on driving tasks (Kaushik et al., 2018; Liu, Siravuru, Prabhakar, Veloso, & Kantor, 2017; de Bruin, Kober, Tuyls, & Babuška, 2018; Dossa, Lian, Nomoto, Matsubara, & Uehara, 2019). MADRaS builds on Gym TORCS by increasing its stability and ease of use and adding features like multi-agent training and custom traffic cars.


3.4 MADRaS: Multi-Agent Driving Simulator

Having described TORCS and associated prior works that form the foundation of MADRaS, we now present our contributions in this project. As Gym TORCS is predominantly designed for single-agent training, the environment is inherently structured as an MDP. This restricts its usage for multi-agent training. MADRaS is Gym TORCS restructured as an MG with some added functionalities (see Table 1). Figure 1 describes the architecture of MADRaS. A MADRaS Environment consists of a MADRaS World and a given number of MADRaS Agents ($\{\alpha_i\}_i$). The MADRaS World consists of a TORCS server and a given number of traffic agents, each of which executes an independently configured behavior. The state space ($S$) and the transition dynamics ($P$) of the MG are defined by the MADRaS World. Each MADRaS Agent $\alpha_i$ runs as an SCR client with a modified SnakeOil interface that has its own action space $A_i$ and reward function $R_i$, which are independent of the action spaces and reward functions of the other agents. Unlike in Gym TORCS, MADRaS Agents cannot reset the TORCS server. This allows multiple agents to complete their episodes independently. The MADRaS Environment resets its MADRaS World, and in turn its TORCS server, when all the agents have terminated their episodes. MADRaS also provides a number of ways to configure the initial state of the environment for the task at hand. The initial distance from the start line and the position with respect to the track edges can be specified individually for the learning cars as well as the traffic agents. Thus MADRaS harnesses the full potential of the SCR server-client architecture and enables multi-agent training. We describe the salient features of MADRaS in the remaining part of this section.

3.4.1 Traffic Agents

MADRaS introduces support for adding non-learning traffic agents in the environment that execute a pre-defined behaviour. These are different from the robot cars that come bundled with TORCS for benchmarking racing agents. MADRaS provides a base class that can be used as a template to create traffic cars with interesting behavioral patterns and some sample traffic classes as free assets (see Table 2). The base class also comes equipped with methods to prevent collisions and going off the track. Each traffic agent runs as a parallel process independent of the learning agent and has an SCR client that talks to the TORCS server through a dedicated port. MADRaS takes care of the configuration and assignment of the requisite number of server ports for connecting all the learning and traffic agents properly at the start of each episode. The number and behavior of traffic agents can be varied between episodes.
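As a rough illustration of the port bookkeeping described above, the sketch below assigns one UDP port per client, following the order given in the caption of Figure 1 (traffic agents first, then learning agents). The function name, the client labels and the base port of 3001 are assumptions made for this example and are not taken from the MADRaS source.

```python
def assign_ports(num_traffic, num_learning, base_port=3001):
    """Assign one UDP port per client: traffic agents first, then learning agents.

    base_port is an assumed starting value used only for illustration.
    """
    ports = {}
    port = base_port
    for i in range(num_traffic):
        ports[f"traffic_{i}"] = port
        port += 1
    for i in range(num_learning):
        ports[f"agent_{i}"] = port
        port += 1
    return ports


# Two traffic cars and one learning agent.
port_map = assign_ports(num_traffic=2, num_learning=1)
```

Because the number of traffic agents can change between episodes, a mapping of this kind would have to be recomputed at the start of each episode.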

3.4.2 Tracks

One of the major advantages of TORCS as the platform of choice for building MADRaS is the availability of a large number of tracks with different geometric (see Figure 2) and surface properties. At the time of writing this paper, TORCS offers 9 oval, 21 road, and 8 dirt tracks. It also offers a software package³ that can be used to create different variants of these tracks. MADRaS inherits these free assets from the TORCS project. A limitation of Gym TORCS is that a track chosen at the beginning of a training experiment remains fixed throughout. This often causes the agent to memorize the track, resulting in poor generalization. MADRaS ameliorates this by introducing an option to select a track at the beginning of each episode. Thus the agent can be exposed to multiple tracks during training.

3. Official track-editor package of TORCS: http://www.berniw.org/trb/download/trackeditor-0.6.2c.tar.bz2

Table 2: Sample traffic agents in MADRaS.

| Name | Behaviour |
| --- | --- |
| ConstVelTrafficAgent | Drives at a given speed at a given track-position. |
| SinusoidalSpeedAgent | Varies the speed sinusoidally while driving at a given track-position. |
| RandomLaneSwitchAgent | Agent switches lanes randomly while driving. |
| DriveAndParkAgent | Agent drives to a given distance and track-position and parks itself. |
| ParkedAgent | Agent remains parked at a given distance and track-position throughout. |
| RandomStoppingAgent | Agent halts randomly while driving. |

Figure 2: Schematic diagrams of road tracks in TORCS (Wymann et al., 2000).

3.4.3 Car Models

TORCS provides 42 car models with a wide range of dynamic properties. However, Gym TORCS only supports a single default car type named car1-trb1. MADRaS is capable of changing cars at the beginning of each training episode. This makes it possible to train an agent to drive cars with drastically different dynamic properties. Also, the learning and traffic agents can be assigned different car types for visual distinction.
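Selecting a track and a car at the start of each episode amounts to a small sampling step before the environment is reset. The sketch below shows one way this could look, supporting either random choice or a fixed round-robin order. The helper `choose` and the names `track_a`, `track_b` and `car_b` are hypothetical placeholders; only `car1-trb1` appears in the text above.

```python
import random


def choose(options, episode, rng=None):
    """Pick an option for this episode: at random if rng is given, else round-robin."""
    if rng is not None:
        return rng.choice(options)
    return options[episode % len(options)]


tracks = ["track_a", "track_b"]  # placeholder track names
cars = ["car1-trb1", "car_b"]    # car_b is a placeholder car name

rng = random.Random(0)
# One (track, car) pair per episode, with tracks cycled in order and cars sampled.
setups = [
    {"track": choose(tracks, ep), "car": choose(cars, ep, rng)}
    for ep in range(4)
]
```

Exposing the agent to a different pair in each episode is what discourages it from memorizing a single track or a single set of vehicle dynamics.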


3.4.4 Modular Configuration

As Reinforcement Learning (RL) is one of the most powerful and actively researched approaches for robot motion planning, MADRaS has some features tailor-made for that purpose. The exercise of tuning an RL algorithm for a given task usually involves tweaking the reward function and the episode termination ("done") criteria. It is important to keep accurate track of these parameters across experiments to be able to arrive at the optimal training configuration. Gym TORCS has particularly poor configurability as it requires the user to make changes in the Python source code, which are difficult to keep track of. The entire MADRaS environment, including the initial state and the reward and done functions, is configurable through a single file named madras_config.yml. A copy of this configuration file can be saved in the training directory for effortless tracking across experiments. Please refer to Appendix A for a detailed discussion of the commonly used configuration variables in madras_config.yml and their functions. Appendix C explains how the initial state can be configured for each episode.

The reward and done functions are usually composed of multiple parts that try to capture events like arrival at the goal state, crashes and damages. Modularity of these definitions in code is essential for fast iteration. MADRaS provides MadrasReward and MadrasDone base classes as templates for defining the components of the reward and done functions. Specifying a reward or done function in MADRaS is as simple as listing the names of its components in the configuration file. Each MADRaS Agent comes with a reward handler and a done handler that organize the listed components and set up the corresponding functions. This modular architecture makes it easy to define new reward and done functions and plug them in and out of experiments.
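The component-and-handler pattern described here can be sketched in a few lines of Python. This is only an illustration of the idea for the done function: the class names `DoneComponent`, `OutOfTrack`, `TimeOut` and `DoneHandler`, the registry, and the state keys are all hypothetical and do not reproduce the actual MADRaS classes. The sketch also assumes a normalized track position whose absolute value exceeds 1 when the car leaves the track.

```python
class DoneComponent:
    """Hypothetical base class for a single termination criterion."""

    name = "base"

    def check(self, state):
        raise NotImplementedError


class OutOfTrack(DoneComponent):
    name = "OutOfTrack"

    def check(self, state):
        # Assumption: |trackPos| > 1 means the car has left the track.
        return abs(state["trackPos"]) > 1.0


class TimeOut(DoneComponent):
    name = "TimeOut"

    def __init__(self, max_steps=1000):
        self.max_steps = max_steps

    def check(self, state):
        return state["step"] >= self.max_steps


# Components are looked up by the names listed in a configuration.
REGISTRY = {cls.name: cls for cls in (OutOfTrack, TimeOut)}


class DoneHandler:
    """Builds the done function from a list of component names."""

    def __init__(self, component_names):
        self.components = [REGISTRY[n]() for n in component_names]

    def get_done_signal(self, state):
        # The episode ends as soon as any listed criterion is met.
        return any(c.check(state) for c in self.components)


handler = DoneHandler(["OutOfTrack", "TimeOut"])
still_running = handler.get_done_signal({"trackPos": 0.2, "step": 10})
```

A reward handler could follow the same design, with each component returning a scalar instead of a boolean. Adding a new criterion then only requires a new subclass and one more name in the configured list.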

3.4.5 Observation Space

The SnakeOil library of Gym TORCS provides a parser for the state information returned by the TORCS server. These state variables include odometry, range data, obstacle detection, engine statistics and metadata regarding the position of the ego vehicle relative to the other cars on the road. Such a high-level representation of the world is common in practical autonomous driving pipelines (Bansal, Krizhevsky, & Ogale, 2018) as it helps in decoupling the perception and planning modules, allowing them to be improved independently, and also reduces the sample complexity of machine learning based planning algorithms (Shalev-Shwartz & Shashua, 2016). Raw visual inputs in the form of a stream of images are also available. For a full list of state variables please refer to the Simulated Car Race Championship paper (Loiacono et al., 2013). The observation vector of a MADRaS agent is composed of a selection of these normalized state variables. For modularity and ease of configuration, MADRaS provides an observation handler class that can toggle between different sets of observed variables. The observations can optionally be made noisy to simulate a partially observed driving scenario.
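A minimal sketch of an observation handler with optional noise is given below. It is an assumption-laden illustration and not the MADRaS class: the name `ObservationHandler`, its constructor arguments and the use of additive Gaussian noise on the observations are choices made for this example, and the state keys are used only as sample variable names.

```python
import random


class ObservationHandler:
    """Hypothetical handler that selects state variables and optionally adds noise."""

    def __init__(self, variables, noise_std=0.0, seed=None):
        self.variables = list(variables)  # names of the variables to observe
        self.noise_std = noise_std        # 0.0 disables observation noise
        self.rng = random.Random(seed)

    def get_obs(self, state):
        # Keep only the selected (already normalized) state variables.
        obs = [float(state[name]) for name in self.variables]
        if self.noise_std > 0.0:
            # Assumed noise model: independent zero-mean Gaussian per variable.
            obs = [x + self.rng.gauss(0.0, self.noise_std) for x in obs]
        return obs


state = {"speedX": 0.5, "trackPos": -0.1, "rpm": 0.3}
clean = ObservationHandler(["speedX", "trackPos"]).get_obs(state)
noisy = ObservationHandler(["speedX", "trackPos"], noise_std=0.05, seed=1).get_obs(state)
```

Switching between sets of observed variables then reduces to constructing the handler with a different list of names.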


3.4.6 Action Space

The SnakeOil library allows Gym TORCS agents to control cars via steering, acceleration and brake commands. MADRaS inherits this primitive control mode and provides a generalised interface that supports both hierarchical and non-hierarchical controllers. We show experiments with both kinds of controllers and compare their relative performances in this paper. The hierarchical controller used in our experiment implements a track-position speed control mode. In track-position speed control mode, a MADRaS agent produces its desired position with respect to the left and right edges of the track and its desired speed. A low-level controller takes these non-primitive actions (desires) as inputs and calculates a sequence of steering, acceleration and brake commands. The architecture of MADRaS does not restrict the class of low-level controllers. We use a simple PID controller in the experiments presented in this paper and we plan to add more tuned low-level controllers to the repository in the future.

The PID controller used in our experiments works in feedback mode over a number of time steps denoted by PID latency. The PID latency controls the relative time scales of the higher and lower level action spaces. The following is the expression of a PID controller for control variable u.

$$u(t) = K_p e(t) + K_i \int_0^t e(t')\,dt' + K_d \frac{de(t)}{dt} \tag{1}$$

$K_p$, $K_i$ and $K_d$ are the constants for the proportional, integral and derivative terms respectively. Appendix B gives a detailed account of the implementation and behavior of the PID controller. The track-position speed action space is inspired by Shalev-Shwartz et al. (2016), where the authors note that training an RL agent to generate high-level desires while relegating the low-level implementation of the desires to an analytical controller like PID significantly reduces real-world risk and increases the explainability of the agent's behavior. High-level actions have also been reported to show better generalizability across vehicular platforms (Behere & Törngren, 2016). All actions are normalized between -1 and 1 for ease of optimization of neural network policies. The outcomes of the agent's actions can optionally be made stochastic. MADRaS implements this stochasticity by adding zero-mean Gaussian noise to actions before sending them to the TORCS server.
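The PID expression and the stochastic action mechanism can be illustrated with a short discrete-time sketch. It is not the MADRaS implementation (Appendix B describes that): the class `PID`, the unit time step, the rectangular approximation of the integral and the clipping of noisy actions back to the normalized range are all assumptions of this example. The gains are the acceleration PID values reported in Table 3.

```python
import random


class PID:
    """Discrete-time PID controller (illustrative sketch, assumed unit time step)."""

    def __init__(self, kp, ki, kd, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        # Rectangular approximation of the integral term.
        self.integral += error * self.dt
        # Backward-difference approximation of the derivative term.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def clip(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))


def noisy_action(action, std, rng):
    """Add zero-mean Gaussian noise, then clip to [-1, 1] (clipping is assumed)."""
    return clip(action + rng.gauss(0.0, std))


accel_pid = PID(kp=10.5, ki=0.05, kd=2.8)  # acceleration gains from Table 3
u1 = accel_pid.update(0.10)  # speed error at the first low-level step
u2 = accel_pid.update(0.05)  # error has halved by the second step
command = noisy_action(clip(u1), std=0.1, rng=random.Random(0))
```

In a hierarchical set-up, `update` would be called once per low-level step for the number of steps given by the PID latency, while the high-level desired speed and track position stay fixed.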

3.4.7 Inter-vehicular Communication

The most salient feature of MADRaS is its support for multi-agent training. The success of multi-agent learning is contingent on the ability of the agents to communicate among themselves and plan actions taking into account the states and actions of the other agents (Lowe, Wu, Tamar, Harb, Abbeel, & Mordatch, 2017). MADRaS provides a highly flexible framework for inter-vehicular communication through a communication buffer and an agent mapping function. The agent mapping function allows the user to specify a list of variables that the $i$-th agent wants to observe from the $j$-th agent. The communication buffer records these shared variables from step $t-1$ and makes them a part of the agents' observation vectors at step $t$.

Table 3: Parameters of the PID controller used in our experiments.

| Controller | Kp | Ki | Kd |
| --- | --- | --- | --- |
| acceleration PID | 10.5 | 0.05 | 2.8 |
| steering PID | 5.1 | 0.001 | 0.000001 |
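The communication buffer and agent mapping function of Section 3.4.7 can be sketched as follows. The class `CommunicationBuffer`, its method names, the nested-dictionary form of the mapping and the zero default used before anything has been recorded are all assumptions made for illustration.

```python
class CommunicationBuffer:
    """Hypothetical buffer that shares variables from step t-1 at step t."""

    def __init__(self, agent_mapping):
        # agent_mapping[i][j] lists the variables agent i observes from agent j.
        self.agent_mapping = agent_mapping
        self.last_states = {}  # states recorded at the previous step

    def record(self, states):
        """Store a copy of every agent's state at the end of a step."""
        self.last_states = {agent: dict(s) for agent, s in states.items()}

    def shared_obs(self, agent):
        """Return the variables this agent receives from the other agents."""
        out = []
        for other, names in self.agent_mapping.get(agent, {}).items():
            prev = self.last_states.get(other, {})
            # Assumed default of 0.0 when nothing has been recorded yet.
            out.extend(prev.get(n, 0.0) for n in names)
        return out


mapping = {"a": {"b": ["speedX", "trackPos"]}, "b": {}}
buf = CommunicationBuffer(mapping)
before = buf.shared_obs("a")  # nothing recorded yet
buf.record({
    "a": {"speedX": 0.3, "trackPos": 0.1},
    "b": {"speedX": 0.7, "trackPos": -0.2},
})
after = buf.shared_obs("a")   # agent a now sees agent b's previous state
```

In this example the mapping is asymmetric: agent `a` observes two variables from agent `b`, while `b` observes nothing from `a`.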

3.4.8 Curriculum Design for Driving Agents

MADRaS has been designed to provide a playground for reinforcement learning agents to learn to drive any car on any track in any kind of traffic within the TORCS environment. In order to construct a driving problem of high variance, MADRaS can present an agent with a different car to drive on a different track with a different number of traffic cars of different behaviors, chosen randomly or in a given order, in every training episode. MADRaS can also produce additional stochasticity by making the outcome of an action probabilistic. Training deep neural network policies in high-variance environments poses a highly non-convex problem that is difficult to optimize. Curriculum learning (Bengio et al., 2009) has been shown to be effective in reducing the sample complexity in such problems. Curriculum learning involves training an agent on a sequence of tasks of increasing complexity. MADRaS is designed with curriculum learning in mind. The complexity of the driving task in MADRaS can be systematically increased in well-defined steps along the following eight dimensions:

1. Number of learning agents.

2. Number of cars to be presented to the agent to drive.

3. Number of tracks to be presented to the agent to drive.

4. Number of traﬃc agents.

5. Level of obstructive behavior from the traﬃc agents.

6. Target speed of the learning agent(s).

7. Degree of stochasticity to action-outcomes.

8. Presence of noise in observations.
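One way to picture a curriculum over these eight dimensions is as a sequence of task descriptions in which each stage changes only a few fields. The sketch below is illustrative: the `Stage` class and its field names are hypothetical and do not correspond to MADRaS configuration keys.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Stage:
    # Hypothetical field names mirroring the eight dimensions listed above.
    n_learning_agents: int = 1
    n_cars: int = 1
    n_tracks: int = 1
    n_traffic_agents: int = 0
    traffic_obstruction: float = 0.0
    target_speed_kmph: float = 50.0
    action_noise_std: float = 0.0
    noisy_observations: bool = False


def harder(stage, **changes):
    """Return a copy of `stage` that differs only in the given dimensions."""
    return replace(stage, **changes)


curriculum = [Stage()]
curriculum.append(harder(curriculum[-1], n_traffic_agents=2))
curriculum.append(
    harder(curriculum[-1], target_speed_kmph=100.0, action_noise_std=0.1)
)
```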

In the following section we present a set of experiments to highlight the key features of MADRa S.

4. Experiments

In this section we present the results of six experiments on single and multi-agent RL for learning to drive in MADRa S. The purpose of these experiments is to highlight the features of MADRa S that were discussed in the previous section as an improvement over Gym TORCS.


4.1 Experimental Setup

We demonstrate how MADRaS can be used to create a wide variety of driving tasks that can be addressed by RL. Table 4 presents a brief outline of our experiments and their individual motivations. We use the Proximal Policy Optimization (PPO) algorithm (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017) for RL in all our experiments. PPO is a trust-region based local policy optimization algorithm that has been shown to be very effective in learning policies for continuous control tasks (Andrychowicz, Baker, Chociej, Jozefowicz, McGrew, Pachocki, Petron, Plappert, Powell, Ray, et al., 2020). In the interest of brevity, we leave the comparison of different RL algorithms on MADRaS tasks for a future paper. All the performance statistics presented in this section are estimated over at least 100 episodes. All experiments with the track-position-speed action space have a PID latency of 5 time steps. The reward functions of the RL agents are defined as weighted sums of reward ($r$) and penalty ($p$) components with weights $w_r$ and $w_p$, respectively:

$$\text{agent reward} = \sum_{r \in \text{rewards}} w_r \, r - \sum_{p \in \text{penalties}} w_p \, p \tag{2}$$
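As a worked instance of Equation (2), the sketch below combines named reward and penalty components into a single scalar. The function name and dictionary keys are hypothetical; the example weights follow Table 5.

```python
def agent_reward(rewards, penalties, w_r, w_p):
    """Weighted sum of reward components minus weighted sum of penalties."""
    total = sum(w_r[name] * value for name, value in rewards.items())
    total -= sum(w_p[name] * value for name, value in penalties.items())
    return total


# Example step: partial progress, one collision, mild steering oscillation.
w_r = {"progress": 1.0}
w_p = {"collision": 10.0, "angular_acceleration": 5.0}
step_reward = agent_reward(
    rewards={"progress": 0.8},
    penalties={"collision": 1.0, "angular_acceleration": 0.1},
    w_r=w_r,
    w_p=w_p,
)
# 1.0 * 0.8 - (10.0 * 1.0 + 5.0 * 0.1) = -9.7
```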

Some general purpose reward and penalty components that are used in all the experiments are as follows:

Progress Reward: The Progress Reward rewards the agent for making progress at every time step. We calculate progress relative to a target speed. We reward the agent in proportion to its speed until it reaches the target speed. If the speed goes beyond the target speed, we do not give the agent any extra reward. This way we prevent the agent from maximizing its cumulative reward by running fast and crashing rather than finishing the race. Let $d(t)$ be the distance (in meters) covered by the agent in the $t$-th time step and let $s_{\text{target}}$ denote the target speed in meters per step. The progress reward is given by:

$$\text{progress reward}(t) = \min\left(1, \frac{d(t)}{s_{\text{target}}}\right) \tag{3}$$

Average Speed Reward: The Average Speed Reward rewards the agent for maintaining a high average speed, but only if it manages to complete a full lap of the track. Suppose the average speed of the agent over a lap is $s_{\text{avg}}$. The Average Speed Reward is calculated as:

$$\text{average speed reward} = \frac{s_{\text{avg}}}{s_{\text{target}}} \tag{4}$$

The Average Speed Reward is also scaled (but not capped) relative to the target speed $s_{\text{target}}$ of the agent.
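The difference between the capped Progress Reward and the uncapped Average Speed Reward can be seen in a short sketch. The function names are hypothetical, and both speeds are assumed to be expressed in the same units.

```python
def progress_reward(d_t, s_target):
    """Per-step reward: proportional to speed, capped at 1 at the target speed."""
    return min(1.0, d_t / s_target)


def average_speed_reward(s_avg, s_target, lap_completed):
    """End-of-lap reward: scaled by the target speed but not capped."""
    return s_avg / s_target if lap_completed else 0.0


slow = progress_reward(d_t=0.5, s_target=1.0)      # below target
fast = progress_reward(d_t=2.0, s_target=1.0)      # above target, capped
lap_bonus = average_speed_reward(s_avg=1.5, s_target=1.0, lap_completed=True)
```

Exceeding the target speed earns no extra per-step reward, while a completed lap at a high average speed is still rewarded in full.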

Angular Acceleration Penalty: This penalty is meant to discourage the agent from making frequent unnecessary sideways movements while running down a track. We calculate a numerical approximation of the angular acceleration from the past 3 recorded values of the angle between the car's direction and the direction of the track axis. We scale the penalty with respect to a reference $\alpha_{\text{reference}}$. Let $a_{t-2}$, $a_{t-1}$, $a_t$ be three consecutive angles of the agent. We calculate the Angular Acceleration Penalty as:


$$\text{angular acceleration penalty}(t) = \frac{\lvert a_t + a_{t-2} - 2a_{t-1} \rvert}{\alpha_{\text{reference}}} \tag{5}$$

We set αreference to 2.0 in all our experiments.
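The penalty in Equation (5) is a second-order finite difference of the heading angle. A minimal sketch, with a hypothetical function name and angles assumed to be in radians:

```python
def angular_acceleration_penalty(a_tm2, a_tm1, a_t, alpha_reference=2.0):
    """Second-order finite difference of the last three angles, scaled."""
    return abs(a_t + a_tm2 - 2.0 * a_tm1) / alpha_reference


# A steadily changing angle has zero second difference and is not penalized.
steady = angular_acceleration_penalty(0.0, 0.1, 0.2)
# Swinging out and straight back produces a non-zero penalty:
# |0.0 + 0.0 - 2 * 0.1| / 2.0 = 0.1
zigzag = angular_acceleration_penalty(0.0, 0.1, 0.0)
```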

Turn Backward Penalty: A ﬁxed penalty of 1 if the car turns backwards.

Collision Penalty: A ﬁxed penalty of 1 if the car collides with obstacles or other cars and incurs a damage.

Apart from these we also use task speciﬁc rewards that we deﬁne separately in each experiment.

We terminate an episode if any of the following events happens:

- the car turns backwards,
- the car goes out of track,
- the car collides with an obstacle,
- the agent fails to complete its task within the maximum allowable duration of an episode,
- the agent successfully completes the task at hand.
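These termination checks can be sketched as a single function. The thresholds below are assumptions made for illustration, not the MADRaS definitions: the car is taken to have turned backwards when the cosine of its angle to the track axis is negative, to be off track when the magnitude of the normalized track position exceeds 1, and to have collided when its damage reading increases.

```python
import math


def done_reason(angle, track_pos, damage, prev_damage,
                step, max_steps, task_completed):
    """Return the name of the first termination event that applies, or None."""
    if math.cos(angle) < 0.0:      # assumed test for facing backwards
        return "turn_backward"
    if abs(track_pos) > 1.0:       # assumed test for leaving the track
        return "out_of_track"
    if damage > prev_damage:       # assumed test for a damaging collision
        return "collision"
    if task_completed:
        return "task_completed"
    if step >= max_steps:
        return "time_out"
    return None


running = done_reason(0.1, 0.2, 0.0, 0.0, 10, 15000, False)
crashed = done_reason(0.1, 0.2, 5.0, 0.0, 10, 15000, False)
```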

Unless otherwise stated, we set the learning rate to $5 \times 10^{-5}$. The policy and value functions are modelled using fully connected neural networks with 2 hidden layers and 256 tanh units in each layer. We use the PPO implementation of RLlib (Liang, Liaw, Moritz, Nishihara, Fox, Goldberg, Gonzalez, Jordan, & Stoica, 2018) for all our experiments for its stability and support for multi-agent training. The PID parameters used for track-position-speed control are given in Table 3. Although ideally these parameters must be tuned for each car and for each speed range, we use the same set of parameters (originally tuned for medium-low speeds of car1-trb1) everywhere to check whether RL agents can be taught to be robust to imperfections in the low-level controller.
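To make the role of the gains in Table 3 concrete, here is a textbook discrete PID update instantiated with those values. It is a generic sketch, not the controller shipped with MADRaS; the class name and the unit time step are assumptions.

```python
class PID:
    """Textbook discrete PID controller (illustrative only)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error, dt=1.0):
        # Accumulate the integral term and difference the error for the
        # derivative term, then combine the three contributions.
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Gains from Table 3.
accel_pid = PID(kp=10.5, ki=0.05, kd=2.8)
steer_pid = PID(kp=5.1, ki=0.001, kd=0.000001)
```

In the T-S control mode the RL agent would supply the set-points (desired track position and speed), and controllers of this kind would convert the resulting errors into steering and acceleration commands.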

The remaining part of this section is dedicated to a detailed discussion of our experiments and the major observations⁴ that can be made from them.

Experiment 1: Generalization across tracks with higher level actions

In our first experiment, we compare two RL agents, one using the high-level track-position-speed (T-S) control mode and the other using the low-level steer-acceleration-brake (S-A-B) control mode, on their ability to generalize across multiple driving tracks in MADRaS. We train the agents to drive car1-stock1 in the Alpine-1 track and evaluate them on the

4. Accompanying video: https://youtu.be/io5mP0HUytY


Table 4: Outline of the experiments presented in this paper.

| Exp No. | Exp Name | Motivation |
| --- | --- | --- |
| 1 | Generalization across tracks with higher level actions | Comparison of primitive and high-level control modes offered by MADRaS in terms of generalization and handling. |
| 2 | Generalization across vehicular dynamics through random car selection | Demonstration of how one of the task-randomization modes of MADRaS can be leveraged to train a single agent to drive a wide range of cars with different vehicular dynamics by RL. |
| 3 | Curriculum learning for driving in the Spring track | Showcasing how the complexity of a driving task in MADRaS can be tuned in well defined steps for designing curricula for learning agents. |
| 4 | Learning under partial observability and stochastic outcomes of actions | Learning robust driving policies using the ability of MADRaS to simulate noisy sensor and imprecise control scenarios. |
| 5 | Learning to drive in traffic | Example of how MADRaS's library of driving agents with pre-defined behaviors can be used to simulate a variety of real-world scenarios for learning to negotiate complex traffic situations. |
| 6 | Learning to navigate safely through a traffic bottleneck by multi-agent cooperation and RL | Demonstration of the inter-vehicular communication architecture and the multi-agent training infrastructure of MADRaS. |


Table 5: RL training criteria for Experiments 1-3. Please refer to (Loiacono et al., 2013) for details on the observed variables.

| Reward function component | Weightage |
| --- | --- |
| Progress Reward | 1.0 |
| Average Speed Reward | 1.0 |
| Collision Penalty | 10.0 |
| Turn Backward Penalty | 10.0 |
| Angular Acceleration Penalty | 5.0 |

Observed variables: angle, track, trackPos, speedX, speedY, speedZ

Done criteria: One Lap Completed, Time Out, Collision, Turn Backward, Out of Track

other road tracks. Table 5 lists the observed variables and the components of the reward and done functions. We set the maximum duration of an episode at 15000 time steps and the target speed at 100 km/hour. We evaluate the agents in terms of the average fraction of lap covered in an episode, average speed and successful lap completion rate.

Table 7 presents the results of this experiment. We see that the agent with high-level track-position-speed (T-S) control generalizes significantly better than the one with low-level steer-acceleration-brake (S-A-B) control, as indicated by its higher average scores on the test tracks. The low-level S-A-B control mode gives the agent tighter control of the car, which can be exploited to perform maneuvers very specific to the training track in order to navigate the twists and turns while maintaining a high average speed (see the accompanying video). As a result, the agent overfits to the training track and fails to make any significant progress in some of the test tracks. Implementing a desired track-position and speed may require different sequences of low-level actions in different tracks. Relegating the low-level control to a PID controller gives the T-S agent better generalization to track geometries than the S-A-B agent.

Experiment 2: Generalization across vehicular dynamics through random car selection

In our second experiment, we leverage the ability of MADRaS to change the agent's car at the beginning of each episode to train a driving policy that generalizes to multiple cars with significantly different vehicular dynamics. Table 6 gives some physical parameters of the cars used in this experiment that characterize their handling and dynamics. Heavier cars with a low centre of gravity, e.g. car1-stock1, car3-trb1 and car1-stock2, are more stable and handle better with less body-roll around tight corners. The variation of torque with the RPM (Rotations Per Minute) of a car's engine plays a crucial role in deciding its dynamics. The torque produced by an engine decides how fast the car can accelerate. Torque is usually a strong function of engine RPM. While running at a given RPM, a car can accelerate faster if its engine can produce higher torque at that RPM. Figure 3 gives the torque-RPM curves for the cars used in this experiment. The cars fall in two broad categories in terms of the overall shape of this curve. Cars with a U-shaped curve, e.g. buggy, baja-bug and 155-DTM, have high torque at low (< 1000) and high (> 10000) RPM and significantly lower values in the middle. The other category of cars, e.g. car1-stock1, car3-trb1 and car1-stock2, have a hat (∩)-shaped curve with low torque at low and high RPM and high values in the middle. When the agent needs high torque to accelerate from a standstill, speed up or climb uphill, it needs to take the engine RPM to the high-torque zone with a suitable sequence of accelerator inputs. The high-torque zones of the two categories of cars are roughly opposite to one another. This makes it challenging for a driving agent to generalize to both kinds of cars.

We choose the Alpine-1 track for this experiment. The Alpine-1 track is one of the hardest road tracks of MADRa S with sharp left and right turns and a few stretches of slippery road. We set the maximum duration of an episode to 20000 time steps and the target speed to 100 km/hour. We evaluate the agent in terms of average fraction of the lap covered per episode and average speed.

First, we train two PPO agents to drive car1-stock1 (hat-shaped torque-RPM curve) and buggy (U-shaped torque-RPM curve) using the S-A-B control mode. We evaluate them on five test cars of different dynamic properties. Table 8 presents the results. We see that an agent trained on a car of one torque-RPM category has difficulty generalizing to the cars of the other category. While the car1-stock1 agent generalizes to car3-trb1, kc-2000gt and car1-stock2 with hat-shaped torque-RPM curves, it fails to drive 155-DTM and baja-bug, which have U-shaped torque-RPM curves. The buggy agent, on the other hand, generalizes fairly well to 155-DTM and baja-bug but fails to drive the other three test cars due to the mismatch in dynamic properties. To aid generalization through domain randomization, we leverage the ability of MADRaS to randomly switch cars between episodes and present car1-stock1 and buggy to the same agent with equal probability. We observe that this training strategy brings remarkable generalization across both categories of test vehicles, with significant improvement both in terms of average fraction of lap covered in an episode and average speed.
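The per-episode car randomization used in the "random" setting amounts to a uniform draw from the list of training cars at every reset. A minimal sketch, with a hypothetical helper name and a seeded generator for reproducibility:

```python
import random


def sample_car(cars, rng):
    """Pick the car for the next episode uniformly at random."""
    return rng.choice(cars)


rng = random.Random(0)
training_cars = ["car1-stock1", "buggy"]
episode_cars = [sample_car(training_cars, rng) for _ in range(1000)]
```

Over many episodes each car is presented roughly half the time, so the policy cannot specialize to a single torque-RPM profile.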

Experiment 3: Curriculum learning for driving in the Spring track

In our third experiment, we demonstrate how the ability of MADRaS to control the complexity of a driving task in well defined steps can be used to design curricula for an RL agent to accomplish complex tasks in a sample-efficient way. We attempt to train a PPO agent to drive car1-stock1 on the Spring track using the primitive S-A-B action space. With a length of 22.1 km, Spring is the longest track in TORCS. It has the largest number of turns with different grades of sharpness, both in the left and right directions. It also has ramps and declines, and its surface texture varies from place to place. These properties make it the toughest road track to drive in TORCS. We set the target speed to 100 km/h and the maximum episode length to 40000 steps. Figure 4 and Table 9 show the results of this study. We see that an agent trained from scratch on Spring fails to complete one lap of the track


Table 6: Some physical properties of the cars used in Experiment 2 that play an important role in determining their vehicular dynamics. RWD and 4WD stand for Rear Wheel Drive and Four Wheel Drive, respectively.

| Car Name | Drive Type | Mass (kg) | Height of CG (m) |
| --- | --- | --- | --- |
| car1-stock1 | RWD | 1550.0 | 0.3 |
| car1-stock2 | RWD | 1550.0 | 0.3 |
| 155-DTM | 4WD | 1100.0 | 0.2 |
| car3-trb1 | RWD | 1150.0 | 0.2 |
| kc-2000gt | RWD | 1200.0 | 0.25 |
| buggy | RWD | 650.0 | 0.45 |
| baja-bug | RWD | 600.0 | 0.35 |

Table 7: Generalization of an agent trained on Alpine-1 to other road tracks (Experiment 1). S-A-B (Steering - Acceleration - Brake) and T-S (Track position - Speed) denote the control mode used.

| Track | Avg. fraction of lap covered (S-A-B) | Avg. fraction of lap covered (T-S) | Avg. Speed (S-A-B) | Avg. Speed (T-S) | Lap completion rate (S-A-B) | Lap completion rate (T-S) |
| --- | --- | --- | --- | --- | --- | --- |
| Training Track | | | | | | |
| alpine-1 | 0.75 | 0.73 | 91.89 | 83.32 | 0.68 | 0.58 |
| Test Tracks | | | | | | |
| aalborg | 0.001 | 0.11 | 0.10 | 59.39 | 0.0 | 0.0 |
| alpine-2 | 0.38 | 0.31 | 89.95 | 72.64 | 0.04 | 0.0 |
| brondehach | 0.001 | 0.72 | 0.1 | 81.01 | 0.0 | 0.3 |
| g-track-1 | 0.001 | 0.98 | 0.06 | 79.42 | 0.0 | 0.91 |
| g-track-2 | 0.002 | 0.97 | 0.11 | 75.99 | 0.0 | 0.95 |
| g-track-3 | 0.001 | 0.84 | 0.09 | 79.90 | 0.0 | 0.44 |
| corkscrew | 0.0008 | 0.64 | 0.06 | 81.39 | 0.0 | 0.0 |
| e-road | 0.001 | 0.94 | 0.11 | 85.63 | 0.0 | 0.88 |
| e-track-2 | 0.07 | 0.39 | 8.38 | 75.21 | 0.0 | 0.0 |
| e-track-3 | 0.31 | 0.68 | 25.88 | 77.96 | 0.03 | 0.57 |
| e-track-4 | 0.0005 | 0.95 | 0.08 | 78.41 | 0.0 | 0.85 |
| e-track-6 | 0.0009 | 0.83 | 0.09 | 80.65 | 0.0 | 0.58 |
| forza | 0.001 | 0.79 | 0.08 | 71.63 | 0.0 | 0.70 |
| ole-road-1 | 0.29 | 0.40 | 101.22 | 78.06 | 0.0 | 0.11 |
| ruudskogen | 0.97 | 0.97 | 100.87 | 81.15 | 0.95 | 0.93 |
| street-1 | 0.03 | 0.87 | 1.76 | 74.67 | 0.0 | 0.67 |
| wheel-1 | 0.0009 | 0.95 | 0.09 | 78.08 | 0.0 | 0.76 |
| wheel-2 | 0.36 | 0.81 | 81.69 | 81.51 | 0.0 | 0.64 |
| spring | 0.14 | 0.29 | 104.76 | 82.55 | 0.0 | 0.0 |
| Average Scores (Test) | 0.14 | 0.71 | 27.12 | 77.64 | 0.04 | 0.49 |

Table 8: Generalization of PPO policies using the S-A-B control mode across vehicles with different dynamics (Experiment 2). "random" refers to the setting in which the agent is presented with both car1-stock1 and buggy, each with a probability of 0.5 during training. Column headers give the training car.

| Test Car | Avg. Fraction of Lap Covered (car1-stock1) | Avg. Fraction of Lap Covered (buggy) | Avg. Fraction of Lap Covered (random) | Avg. Speed in km/h (car1-stock1) | Avg. Speed in km/h (buggy) | Avg. Speed in km/h (random) |
| --- | --- | --- | --- | --- | --- | --- |
| 155-DTM | 0.37 | 0.05 | 0.37 | 104.22 | 22.71 | 99.78 |
| car3-trb1 | 0.002 | 0.017 | 0.62 | 0.12 | 0.97 | 58.95 |
| kc-2000gt | 0.77 | 0.013 | 0.30 | 80.44 | 0.71 | 22.02 |
| car1-stock2 | 0.001 | 0.016 | 0.54 | 0.09 | 0.91 | 50.23 |
| baja-bug | 0.35 | 0.92 | 0.55 | 59.45 | 61.45 | 54.91 |
| Average Scores | 0.30 | 0.20 | 0.48 | 48.86 | 17.35 | 57.18 |


Figure 3: Variation of torque with engine RPM of cars studied in Experiment 2. (a) Torque-vs-RPM of the cars that are presented to our agent with equal probability during training. (b) Torque-vs-RPM of the cars that we test our agent on.


Figure 4: Variation of episode reward over iterations of PPO for learning from scratch on spring compared with ﬁrst learning on simpler tracks alpine-1 and corkscrew and then ﬁne-tuning on spring (Experiment 3).

Table 9: Curriculum learning results for driving in Spring track (Experiment 3).

| Training strategy | Fraction of lap covered | Average Speed (km/h) | Lap completion rate (%) |
| --- | --- | --- | --- |
| Training from scratch | 0.18 | 101.9 | 0.0 |
| Pre-training in Alpine-1 | 0.57 | 103.5 | 27.0 |
| Pre-training in Corkscrew | 0.54 | 100.6 | 45.8 |

even after 2500 iterations. When we use a curriculum of first training on the Alpine-1 or Corkscrew track followed by fine-tuning on Spring, the agent learns to complete the entire lap with high success rates and average speed. In our curriculum learning experiments, we pick the policy that gives the highest mean trajectory reward in the first phase of training (obtained after 701 iterations in Alpine-1 and 561 iterations in Corkscrew) and use it to initialize the policy in the second phase. The total number of training iterations and the total number of training samples for the curriculum learning strategies (considering both the pre-training and fine-tuning stages) are kept equal to those of training from scratch for fairness of comparison. For fine-tuning, we choose a learning rate of $1 \times 10^{-6}$ for the Alpine-1 policy and $5 \times 10^{-7}$ for the Corkscrew policy. We evaluate the agents in terms of the average fraction of lap covered in an episode, average speed and successful lap completion rate.


Table 10: Results of a single PPO agent learning to drive in traffic by RL. The agent was trained to drive in the presence of 4 or 5 traffic cars with equal probability (Experiment 5).

| Number of traffic agents | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Successful task completion rate | 99.5% | 98.1% | 95.5% | 96% | 95.5% | 95.7% | 92.8% |

Experiment 4: Learning under partial observability and stochastic outcomes of actions

In this experiment we compare the performances of PPO agents trained to drive car1-stock1 around the Corkscrew track with and without observation noise under different levels of stochasticity in the outcome of actions. The training was performed with the primitive S-A-B action space. Observed variables, episode termination criteria and evaluation metrics are the same as in Experiment 1. The reward function is the same as in Experiments 1-3 (see Table 5) with the weightage for the angular acceleration penalty increased to 8. As described in Section 3, stochastic outcomes of actions are implemented by adding zero-mean Gaussian noise to the actions. Figure 5 shows the learning curves. All these agents are tested on the same Corkscrew track in the presence of both observation noise and action noise with a standard deviation of 0.5. Table 11 compares the performance statistics. We observe that the agents trained in the presence of both observation and action noise perform better than the others. This demonstrates the ability of MADRaS to serve as a platform for evaluating the resilience of learning agents to observation noise and environmental stochasticity.
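Adding zero-mean Gaussian noise to an action vector can be sketched in a few lines. Clipping the perturbed action back into a valid range is an assumption of this sketch, as are the function name and the [-1, 1] bounds; the source only states that Gaussian noise is added.

```python
import random


def noisy_action(action, std, rng, low=-1.0, high=1.0):
    """Perturb each action component with zero-mean Gaussian noise, then clip."""
    return [min(high, max(low, a + rng.gauss(0.0, std))) for a in action]


rng = random.Random(0)
clean = noisy_action([0.3, -0.2], std=0.0, rng=rng)        # no perturbation
perturbed = noisy_action([0.9, -0.9, 0.0], std=0.5, rng=rng)
```

With a standard deviation of 0.5, as used at evaluation time in this experiment, the executed action can differ substantially from the one the policy selected.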

Figure 5: Learning to drive under partial observability and stochastic outcomes of actions in the Corkscrew track (Experiment 4).


Table 11: Learning to drive in the Corkscrew track with and without observation noise and different levels of stochasticity in the outcome of actions, and evaluation with observation noise and 0.5 std action noise (Experiment 4).

| Training condition | Avg. Fraction of Lap Covered | Avg. Speed (km/h) |
| --- | --- | --- |
| No noise | 0.38 | 52.54 |
| Observation noise | 0.19 | 30.78 |
| Stochastic actions (noise std 0.1) | 0.12 | 29.99 |
| Stochastic actions (noise std 0.5) | 0.64 | 48.67 |
| Observation noise and Stochastic actions (noise std 0.1) | | |
| Observation noise and Stochastic actions (noise std 0.5) | | |

Figure 6: Schematic diagram of the environment design for Experiment 5. The task of the learning agent is to overtake all the traffic cars without colliding with any of them or going off-track.

Experiment 5: Learning to drive in traﬃc

In this experiment we use the ability of MADRaS to generate custom traffic to train an agent to navigate through a narrow road without colliding with any traffic car, moving or parked. Figure 6 shows a schematic diagram of the training environment. We choose the Aalborg track for this study since it is one of the narrowest tracks of TORCS, and we further reduce its width to half, resulting in an effective track width of 5 m.

Table 12: RL training criteria for Experiment 5. Please refer to (Loiacono et al., 2013) for details on the observed variables.

| Reward function component | Weightage |
| --- | --- |
| Progress Reward | 1.0 |
| Average Speed Reward | 1.0 |
| Collision Penalty | 10.0 |
| Turn Backward Penalty | 10.0 |
| Angular Acceleration Penalty | 1.0 |
| Overtake Reward | 5.0 |
| Rank 1 Reward | 100.0 |

Observed variables: angle, track, trackPos, speedX, speedY, speedZ, opponents

Done criteria: Rank 1, Time Out, Collision, Turn Backward, Out of Track

The traffic agents used in this experiment are Drive And Park Agents (see Table 2). MADRaS positions the traffic cars ahead of the learning car at the start of the race. When an episode begins, the Drive And Park Agents start driving at their given target speeds (50 km/h) towards their given parking locations (specified in terms of distance from the start of the race and track position) using PID controllers. This way, the learning agent sees moving cars in the beginning and parked cars towards the end of each episode. This forces it to learn to avoid collision with both static and moving obstacles. We set the parking locations of the traffic cars on alternate sides of the road so that the agent must learn to turn both left and right to overtake all the traffic cars. We maintain a gap of at least 10 m between consecutive parking locations along the length of the road to make sure that the learning car has enough space to maneuver between the traffic cars. To create variance in the environment, we randomly vary each parking location within an area of 5 m along the track length and 0.25 m along the track width. We also switch the number of traffic cars between 4 and 5 with equal probability. Changing the number of traffic cars also ensures that the learning agent is initialized in the left and right halves of the track with equal probability. We use the T-S control mode and set the target speed of the learning agent to 50 km/h. Table 12 gives the training criteria for this experiment.
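The randomized parking layout described above can be sketched as follows. Several details are assumptions made for illustration: the jitter is taken to be a window of 5 m along the track and 0.25 m across it, centred on a nominal spot; nominal spots are spaced 15 m apart so that the 10 m minimum gap always holds; and track position is treated as a signed lateral offset whose sign selects the side of the road.

```python
import random


def sample_parking(n_cars, first_dist, spacing, rng,
                   jitter_len=5.0, jitter_width=0.25, side=0.5):
    """Return (distance, track_position) pairs on alternating sides."""
    spots = []
    for k in range(n_cars):
        # Jitter the nominal spot along the track length.
        dist = first_dist + k * spacing
        dist += rng.uniform(-jitter_len / 2, jitter_len / 2)
        # Alternate sides of the road, then jitter across the track width.
        pos = side if k % 2 == 0 else -side
        pos += rng.uniform(-jitter_width / 2, jitter_width / 2)
        spots.append((dist, pos))
    return spots


rng = random.Random(1)
n_traffic = rng.choice([4, 5])   # 4 or 5 traffic cars with equal probability
spots = sample_parking(n_traffic, first_dist=100.0, spacing=15.0, rng=rng)
```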

The agent gets an Overtake Reward every time it overtakes a traﬃc agent and Rank 1 Reward at the end of the episode if it manages to overtake all the traﬃc agents. The agent is evaluated in terms of the fraction of times it overtakes all the traﬃc cars successfully.

Table 10 presents the results of this experiment. We observe that the agent learns to generalize to both fewer and more traffic agents than it encountered during training and to navigate its way through them collision-free with a high success rate. Figure 12 in the Appendix shows how the instances of the agent colliding and driving off-track reduce as training progresses, while the frequency of it finishing at Rank 1 and successfully completing the episode increases.

Experiment 6: Learning to navigate safely through a traﬃc bottleneck by multi-agent cooperation and RL

One of the biggest aspirations of autonomous driving is the avoidance of traffic congestion through cooperation. In this experiment we utilize the multi-agent training infrastructure of MADRaS and its framework for inter-vehicular communication to solve a simplified version of this task by multi-agent reinforcement learning.

Figure 7: Schematic diagram of the multi-agent task studied in Experiment 6. The task for the two learning agents is to coordinate with each other and pass through the gap between the parked traffic cars without making any collision. The top row shows an example of undesirable behavior in which both the agents attempt to pass through the bottleneck at the same time, resulting in a collision. The bottom row gives a feasible solution to the problem in which one of the agents stops or slows down to wait for the other agent to pass through the gap. Only after the gap is clear does it attempt to pass through, thus avoiding collision with any of the other cars.

Table 13: Dimensions of cars used in Experiment 6.

| Agent | Car Model | Length (m) | Width (m) |
| --- | --- | --- | --- |
| Traffic Car | car1-trb1 | 4.52 | 1.94 |
| PPO Agent-1 | car3-trb1 | 4.55 | 1.95 |
| PPO Agent-2 | car5-trb1 | 4.67 | 1.94 |

The training environment consists of two PPO agents and two traﬃc agents on the Corkscrew track. The PPO agents communicate their actions to each other at every step. We park the traﬃc agents next to each other with a small gap in between that is suﬃcient only for one car to pass through. The task of the PPO agents is to pass through the gap one by one without colliding with each other or with any traﬃc agent (see Figure 7). Thus the agents must learn a collaborative strategy in which the agent trying to pass through

Table 14: Curriculum for multi-agent RL in Experiment 6.

| Iterations of training | Parking Distance (m) | Gap Width (m) |
| --- | --- | --- |
| 1–240 | 30–40 | 2.76–4.06 |
| 240–300 | 30–35 | 2.76–3.46 |


Figure 8: Learning curves for multi-agent training in Experiment 6: (a) Agent-1 training curves, (b) Agent-2 training curves, (c) joint learning curves. The cross symbol denotes the transition point in the agent's curriculum where the first task ends and the second task begins.


Table 15: RL training criteria for Experiment 6. peerActions refers to the actions of the other learning agent from the previous time step. Please refer to (Loiacono et al., 2013) for details on the other observed variables.

| Reward function component | Weightage |
| --- | --- |
| Progress Reward | 1.0 |
| Average Speed Reward | 1.0 |
| Collision Penalty | 10.0 |
| Turn Backward Penalty | 10.0 |
| Angular Acceleration Penalty | 5.0 |

Observed variables: angle, track, trackPos, opponents, speedX, speedY, speedZ, peerActions

Done criteria: Race Over, Time Out, Collision, Turn Backward, Out of Track

the gap ﬁrst should be given enough time to pass through completely by the other agent before it makes its own attempt.

Table 13 gives the cars assigned to the learning and traffic agents and their dimensions. Both PPO agents use the T-S action space. Table 14 describes the curriculum used for the training. We randomly vary the parking distance of each traffic car and the gap between them at the start of each episode for improved generalization. Table 15 gives the details of the observed variables, reward functions and done criteria. The agents must learn the following distinct skills to be able to accomplish this task:

- Running forward without going off-track.
- Not colliding with each other.
- Not colliding with any of the parked cars.
- Learning to collaborate and pass through the bottleneck one by one.

We jointly evaluate the agents in terms of the rate of successful passage of both the agents through the traffic bottleneck. Figure 8 shows the individual and joint learning curves during training. The final evaluation is done over 100 episodes with stochastically parked traffic agents, and the PPO agents demonstrate a joint task completion rate of 83.3%.

5. Conclusion

In this paper we present MADRaS, an open-source Multi-Agent Driving Simulator for autonomous driving. MADRaS builds on TORCS, a popular car racing platform, and adds a suite of features like hierarchical control modes, domain randomization, custom traffic, partial observability, stochastic outcomes of actions and support for multi-agent training. We present a suite of experiments that illustrate how MADRaS can be used to simulate rich highway and track driving scenarios of high variance and complexity that are valuable for autonomous driving research and for investigating the robustness and generalization abilities of RL algorithms. We compare primitive and abstract (or high-level) control modes at the task of generalizing to a multitude of driving tracks and observe that the abstract control mode achieves superior generalization while the primitive control mode offers tighter handling. We learn a policy that generalizes to a wide range of vehicular dynamics simply by training on two car models from the extreme ends of the spectrum and leveraging MADRaS's ability to change the agent's car in every episode. We use the ability of MADRaS to inject varying levels of noise into the observation and action spaces to study driving under stochasticity and partial observability. MADRaS offers a powerful set of tools for simulating traffic. We present experiments on learning to navigate through static and moving traffic without colliding or going off-track and on learning multi-agent cooperation for passing through traffic bottlenecks safely. We wish to develop features specific to fuel management and vehicular safety in the future.

Acknowledgements

The authors would like to thank Professor Pabitra Mitra of the Department of Computer Science and Engineering, IIT Kharagpur for his helpful feedback on the structure of the paper and Manish Prajapat of ETH Zurich for his useful tips on the implementation of inter-vehicular communication in MADRaS. The authors would also like to thank Intel Labs India for incubating the early stage of this project. Anirban Santara's work in this project was supported by Google India, under the Google India PhD Fellowship grant, and Intel Inc. under the Intel Student Ambassador Program. Anirban Santara and Sohan Rudra contributed equally to this project.


Appendix A. Conﬁguring MADRa S

The structure of MADRaS focuses on ease of use and encourages custom modifications. In this section we describe the configuration variables of MADRaS. All these variables are specified in the envs/data/madras_config.yml file. The YAML (or yml) format provides a powerful yet convenient way of specifying most data types and basic data structures like lists and dictionaries.

The madras_config.yml file has three sections:

1. Server configuration: This section contains the global configurations of the MADRaS environment. Since MADRaS can randomly vary the driving tracks, the model of car for the learning agents, and the number of traffic cars between episodes, these terms are specified as lists and ranges. The maximum number of cars in the environment (including learning and traffic agents) can be specified as max_cars and the minimum number of traffic cars as min_traffic_cars. The number of learning agents ($N_l$) is specified in the agent configuration section. These quantities must satisfy

$$N_l + \texttt{min\_traffic\_cars} \leq \texttt{max\_cars}$$

The list of car models to choose from for the learning agent can be specified in learning_car. The list of tracks to choose from for each episode can be specified in track_names. If randomize_env = True, the car model, track and number of traffic agents are chosen randomly for each episode.

2. Agent configuration: The agents section contains the configurations of the learning agents. The target speed, the pid_settings for the low-level controller if pid_assist is True, the configuration of the observation space (according to the modes in utils/observation_handler.py), the reward function (to be parsed by utils/reward_handler.py) and the done function (to be parsed by utils/done_handler.py) can be specified individually for each agent in this section.

3. Traffic configuration: The traffic section can be used to specify the details of the traffic agents in the environment. If $N_t$ traffic agents need to be chosen in a given episode, their configurations will be set to the first $N_t$ elements from the list of agents in this section. These configurations are parsed by traffic/traffic.py. The target_speed, target_lane_pos, collision avoidance properties and pid_settings of the traffic cars can be specified here. If the traffic agents need to be parked in certain locations of the track (specified in terms of their distance from the start line and track position) before the start of each episode, that can also be specified in this section.

The full list of the configuration variables is available in Tables A1, A2 and A3.
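The constraint on the number of cars and the "first $N_t$ elements" rule for traffic agents can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the dictionary layout are hypothetical and are not taken from the MADRaS code base; only the keys max_cars and min_traffic_cars come from the text above.

```python
def validate_server_config(server_cfg, num_learning_agents):
    """Check N_l + min_traffic_cars <= max_cars (hypothetical helper).

    Returns the largest number of traffic cars that still leaves room
    for all learning agents.
    """
    max_cars = server_cfg["max_cars"]
    min_traffic = server_cfg["min_traffic_cars"]
    if num_learning_agents + min_traffic > max_cars:
        raise ValueError(
            "N_l + min_traffic_cars must not exceed max_cars: "
            f"{num_learning_agents} + {min_traffic} > {max_cars}"
        )
    return max_cars - num_learning_agents


def select_traffic_configs(traffic_section, num_traffic):
    """Pick the first N_t entries of the traffic list (hypothetical helper)."""
    if num_traffic > len(traffic_section):
        raise ValueError("not enough traffic agents configured")
    return traffic_section[:num_traffic]


# Illustrative values only; they are not defaults of the simulator.
server_cfg = {"max_cars": 5, "min_traffic_cars": 2}
max_traffic = validate_server_config(server_cfg, num_learning_agents=2)
chosen = select_traffic_configs(
    [{"name": "a"}, {"name": "b"}, {"name": "c"}], num_traffic=2
)
```

With these illustrative values, up to three traffic cars fit alongside the two learning agents, and the first two traffic entries are the ones used.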

MADRaS supports inter-vehicular communication (IV-Comm) between the learning agents. The settings for the IV-Comm system can be specified in envs/data/communications.yml. The user can specify the list of variables (vars) that each learning agent wants to observe from a list of communicating agents (comms) for a given number of previous time steps (buff_size).
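One way to picture the roles of vars, comms and buff_size is a per-agent ring buffer that keeps only the requested variables. The sketch below is an assumption-laden illustration, not the MADRaS implementation: the class CommBuffer, its methods, and the agent and variable names are all hypothetical.

```python
from collections import deque


class CommBuffer:
    """Keeps the last `buff_size` values of selected variables per peer (hypothetical)."""

    def __init__(self, comms, vars_, buff_size):
        self.vars = list(vars_)
        # One bounded queue per communicating agent; old entries drop off automatically.
        self.buffers = {agent: deque(maxlen=buff_size) for agent in comms}

    def push(self, agent, full_obs):
        # Store only the requested variables, and only for agents we listen to.
        if agent in self.buffers:
            self.buffers[agent].append({v: full_obs[v] for v in self.vars})

    def read(self, agent):
        # Oldest entry first, newest entry last.
        return list(self.buffers[agent])


# Illustrative names and values only.
buf = CommBuffer(comms=["agent_1"], vars_=["speed", "trackPos"], buff_size=2)
for speed in [10.0, 20.0, 30.0]:
    buf.push("agent_1", {"speed": speed, "trackPos": 0.0, "angle": 0.1})
history = buf.read("agent_1")
```

After three pushes with a buffer size of 2, only the two most recent entries remain, and the unrequested variable (angle here) is never stored.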


Table A1: Server Configuration Parameters

| Parameter | Description | Possible values |
| --- | --- | --- |
| torcs_server_port | Port for communication with the TORCS server. | $\mathbb{Z}^+$ |
| max_cars | Maximum number of vehicles to be spawned. | $\mathbb{Z}^+$ |
| min_traffic_cars | Minimum number of traffic cars to be spawned. | $\mathbb{Z}^+$ |
| track_names | List of tracks on which the simulation will run. | List of track names |
| track_limits | Restricts the agent to remain within a given range of track-position values. | $(\mathbb{R}, \mathbb{R})$ |
| distance_to_start | Starting distance of the cars from the start line. | $\mathbb{Z}^+$ |
| torcs_server_config_dir | Location of the TORCS server racing config directory. | Path string |
| scr_server_config_dir | Location of the available cars config directory. | Path string |
| traffic_car | Type of car to be used for traffic. | Car name |
| learning_car | List of car models to use for the learning agent. | List of car names |
| randomize_env | Flag for turning randomization on. | boolean |
| add_noise_to_actions | Flag for adding a small Gaussian noise to the actions before sending them to the TORCS server. | boolean |
| action_noise_std | Standard deviation of the Gaussian used for the noise addition. | [0, 1] |
| noisy_observations | Toggles the TORCS flag for enabling noisy observations. | boolean |
| visualise | Flag for setting the display on and off. | boolean |
| no_of_visualisations | Number of training instances to visualize. | $\mathbb{Z}^+$ |
| max_steps | Maximum number of steps that the environment takes before resetting. | $\mathbb{Z}^+$ |

Appendix B. PID Response

In this section we describe our implementation of the PID controller used for low-level control in our experiments with the track-position-speed control mode of MADRaS. Please note that this implementation can be easily swapped out for a more sophisticated one by creating a derived class of PIDController defined in controllers/pid.py. The error function ($e_{TP}$) for the track-position PID controller is defined as a function of the track-position ($\mathrm{TP}$) and the angle ($\theta$) that the car's heading makes with the center line. The output of this controller is the steer-angle of the vehicle for the current time-step ($t$) that would bring the car closer to the desired track-position ($\mathrm{TP}_{\mathrm{desired}}$).

$$e_{TP}(t) = \theta(t-1) - \left(\mathrm{TP}(t-1) - \mathrm{TP}_{\mathrm{desired}}\right) \times \mathrm{scale} \tag{6}$$


Table A2: Agent Configuration Parameters

| Parameter | Description | Possible values |
| --- | --- | --- |
| vision | Flag for activating visual input instead of the usual sensor-based one. | boolean |
| throttle | Flag for turning throttle control on and off. | boolean |
| gear_change | Flag for turning gear control on and off. | boolean |
| client_max_steps | Maximum number of steps that the client is available to take. | $\mathbb{Z}^+ \cup \{-1\}$ |
| target_speed | Target speed setting of the agent car. | $\mathbb{Z}^+$ |
| state_dim | Dimension of the observation space. | $\mathbb{Z}^+$ |
| normalize_actions | Toggle to turn on action normalization. | boolean |
| pid_assist | Toggle to turn on T-S control mode. | boolean |
| pid_settings[accel_pid] | Kp, Ki, Kd for the throttle PID. | List of floats |
| pid_settings[steer_pid] | Kp, Ki, Kd for the steering PID. | List of floats |
| accel_scale | Acceleration scaling. | $\mathbb{R}^+$ |
| steer_scale | Steering scaling. | $\mathbb{R}^+$ |
| pid_latency | Number of time-steps for which the control command sticks to the server. | $\mathbb{Z}^+$ |
| observations[mode] | Name of the observation class. | string |
| observations[multi_flag] (multi-agent mode only) | Toggle for turning on communication for agent i. | boolean |
| observations[buff_size] | Buffer size of action. | $\mathbb{Z}^+$ |
| observations[normalize] | Toggle to turn on observation normalization. | boolean |
| obs_min | Minimum values for certain observation attributes. | dict |
| obs_max | Maximum values for certain observation attributes. | dict |
| rewards[name, scale] | List of the reward classes and a scaling factor for the rewards. | List of names and dict |
| dones | Done conditions currently in use. | List of dones |

Table A3: Common Traffic Configuration Parameters

| Parameter | Description | Possible values |
| --- | --- | --- |
| name | Traffic agent type. | string |
| target_speed | Traffic agent speed. | $\mathbb{R}^+$ |
| initial_distance | Traffic agent initial distance from the start line (range). | 2-tuple of floats |
| initial_trackpos | Traffic agent initial track-position (range). | 2-tuple of floats |
| track_len | Length of the current track. | $\mathbb{R}^+$ |
| pid_settings[accel_pid] | Kp, Ki, Kd values for acceleration. | List of floats |
| pid_settings[steer_pid] | Kp, Ki, Kd values for steering. | List of floats |
| accel_scale | Acceleration scaling. | $\mathbb{R}^+$ |
| steer_scale | Steering scaling. | $\mathbb{R}^+$ |
| collision_time_window | Describes the collision region for the traffic agent. | $\mathbb{R}^+$ |


Figure 9: Speed control accuracy and convergence of our PID controller at different initial speeds over time-steps and distance travelled. The PID latency is set to 5. The track used in this study is the f-speedway oval track. The lane-position command is fixed at 0.0, which refers to the center of the track.


Figure 10: This plot demonstrates the position control accuracy and convergence over time at different speeds of the PID controller used in our experiments. The track used is the f-speedway oval track and the PID latency is set to 5. The plot covers the range of speeds used in our experiments.


Figure 11: This plot demonstrates the position control accuracy and convergence over distance at different speeds of the PID controller used in our experiments. The track used is the f-speedway oval track and the PID latency is set to 5. The plot covers the range of speeds used in our experiments.


Figure 12: This plot demonstrates how the causes of episode termination (done reason) vary as the agent makes progress in training. We observe that the frequencies of collisions and out-of-track terminations drop while the Rank 1 frequency increases as training progresses. This experiment was a replica of Experiment 5 with 2 or 3 traffic cars at equal probability.

The error function for the speed PID controller ($e_V$) is a function of the forward velocity ($V$). The output of the controller is the value of acceleration and braking that would bring the speed closer to the target speed of the vehicle ($V_{\mathrm{target}}$).

$$e_V(t) = \left(V(t-1) - V_{\mathrm{target}}\right) \times \mathrm{scale} \tag{7}$$
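To make the two error signals concrete, the following minimal sketch reads Equations (6) and (7) as $e_{TP} = \theta - (\mathrm{TP} - \mathrm{TP}_{\mathrm{desired}}) \times \mathrm{scale}$ and $e_V = (V - V_{\mathrm{target}}) \times \mathrm{scale}$ and feeds them to a textbook PID update. The class SimplePID and both helper functions are hypothetical; they are not the PIDController class from controllers/pid.py, and the gains and inputs are illustrative. How the PID output is mapped to steering or throttle commands (sign, clipping, latency) is not covered here.

```python
class SimplePID:
    """Textbook discrete PID update (hypothetical stand-in, not the MADRaS class)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error, dt=1.0):
        # Accumulate the integral term and take a finite-difference derivative.
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track_pos_error(theta_prev, tp_prev, tp_desired, scale):
    """Track-position error as read from Equation (6)."""
    return theta_prev - (tp_prev - tp_desired) * scale


def speed_error(v_prev, v_target, scale):
    """Speed error as read from Equation (7)."""
    return (v_prev - v_target) * scale


# Illustrative inputs only.
e_tp = track_pos_error(theta_prev=0.1, tp_prev=0.4, tp_desired=0.0, scale=0.5)
e_v = speed_error(v_prev=50.0, v_target=60.0, scale=0.1)

steer_pid = SimplePID(kp=1.0, ki=0.0, kd=0.0)
steer_signal = steer_pid.update(e_tp)
```

Keeping the error functions separate from the PID update mirrors the text, where only the error definition differs between the track-position and speed controllers.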

Figures 9, 10 and 11 show the responses of the PID controller used in our experiments with the high-level track-position-speed action space. For testing the controller response over time, we change the input signal (track-position or speed) every 500 steps and monitor the output. For testing the controller response over distance, we change the input signal after the agent has driven every 100 meters for speed control and every 400 meters for track-position control. We use the f-speedway oval track for this study. For speed control (Figure 9) we change the target signal in incremental steps of 10 km/hour from 0 km/hour to 100 km/hour and back to 0 km/hour, keeping the track-position input fixed at 0.0, the center of the track. For position control we increment the signal in steps of 0.4 starting from 0.4 (default initial track-position) towards the extreme left (up to 0.8) and then towards the extreme right (up to -0.8). We observe that the controller responds faithfully within the range of speeds and track positions used in our experiments.
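The speed-control test signal described above (0 to 100 km/hour and back in steps of 10 km/hour, changed every 500 steps) can be generated with a short helper. This is a sketch of the schedule only, assuming integer targets and a range that is an exact multiple of the step; the function names are hypothetical and the code does not drive the simulator.

```python
def staircase(low, high, step):
    """Targets from `low` up to `high` and back down to `low` in increments of `step`."""
    up = list(range(low, high + step, step))
    # Mirror the ramp without repeating the peak value.
    return up + up[-2::-1]


def target_at(step_idx, schedule, hold_steps):
    """Target held for `hold_steps` simulation steps before moving to the next value."""
    idx = min(step_idx // hold_steps, len(schedule) - 1)
    return schedule[idx]


speed_schedule = staircase(0, 100, 10)  # km/hour
first_target = target_at(0, speed_schedule, hold_steps=500)
peak_target = target_at(5000, speed_schedule, hold_steps=500)
after_peak = target_at(5500, speed_schedule, hold_steps=500)
```

The schedule has 21 entries, so a full sweep at 500 steps per entry spans 10,500 simulation steps; after the last entry the helper keeps returning the final target.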

Appendix C. Initial State Distribution

The initial state of an episode in MADRaS can be configured in terms of the set of parameters listed below. The madras_config.yml file has the randomize_env flag that can be enabled to randomly assign values for these parameters at the start of each episode.


Vehicle Model: The model of the car assigned to the learning agent(s) can be specified using the learning_car field. This can also be randomly selected from a categorical distribution over a list of car models when randomize_env = True.

Number of Traffic Cars: The number of traffic cars can be specified using the min_traffic_cars field. When randomize_env = True, the number of traffic cars is assigned randomly between min_traffic_cars and (max_cars - (number of learning agents)).

Track Position of Traffic Cars: Some traffic cars can be assigned a certain track position to stick to. For ParkedAgent, it can be specified as the parking_lane_pos, while for ConstVelTrafficAgent, SinusoidalSpeedAgent and RandomStoppingAgent it can be specified using the target_lane_pos field. If randomize_env = True, the track position is sampled randomly from a continuous uniform distribution between specified high and low limits for these parameters.

Parking Distance of Traffic Agents from the Start Line: The distance from the start line of ParkedAgent traffic agents can be set using the parking_dist_from_start parameter. When randomize_env = True, it is sampled uniformly from a fixed range specified by high and low values for the same parameter.
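The sampling rules listed above can be summarized in a short sketch. It assumes a flat dictionary with hypothetical keys lane_pos_range and parking_dist_range holding (low, high) pairs, treats the traffic-car range as inclusive at both ends, and falls back to the first listed car and the low end of each range when randomization is off; none of these choices is confirmed by the text, and the function is not part of MADRaS.

```python
import random


def sample_initial_state(cfg, num_learning_agents, rng):
    """Draw one episode's initial-state parameters (hypothetical helper)."""
    n_min = cfg["min_traffic_cars"]
    n_max = cfg["max_cars"] - num_learning_agents
    if not cfg["randomize_env"]:
        # Assumption: without randomization, use the first car and the low limits.
        return {
            "learning_car": cfg["learning_car"][0],
            "num_traffic_cars": n_min,
            "lane_pos": cfg["lane_pos_range"][0],
            "parking_dist": cfg["parking_dist_range"][0],
        }
    return {
        # Categorical choice over the listed car models.
        "learning_car": rng.choice(cfg["learning_car"]),
        # Integer in [min_traffic_cars, max_cars - N_l], both ends included.
        "num_traffic_cars": rng.randint(n_min, n_max),
        # Continuous uniform draws between the low and high limits.
        "lane_pos": rng.uniform(*cfg["lane_pos_range"]),
        "parking_dist": rng.uniform(*cfg["parking_dist_range"]),
    }


# Illustrative configuration; car names and ranges are placeholders.
cfg = {
    "randomize_env": True,
    "max_cars": 5,
    "min_traffic_cars": 1,
    "learning_car": ["car_a", "car_b"],
    "lane_pos_range": (-0.5, 0.5),
    "parking_dist_range": (20.0, 80.0),
}
state = sample_initial_state(cfg, num_learning_agents=2, rng=random.Random(0))
```

Passing an explicit random.Random instance keeps the draw reproducible, which is convenient when comparing agents on identical initial conditions.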

Santara, Rudra, Buridi, Kaushik, Naik, Kaul, Ravindran

Andrychowicz, O. M., Baker, B., Chociej, M., Jozefowicz, R., Mc Grew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. (2020). Learning dexterous inhand manipulation. The International Journal of Robotics Research, 39(1), 3 20.

Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5), 469 483.

Bansal, M., Krizhevsky, A., & Ogale, A. (2018). Chauﬀeurnet: Learning to drive by imitating the best and synthesizing the worst..

Behere, S., & T orngren, M. (2016). A functional reference architecture for autonomous driving. Information and Software Technology, 73, 136 150.

Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41 48. ACM.

Bojarski, M., Yeres, P., Choromanska, A., Choromanski, K., Firner, B., Jackel, L., & Muller, U. (2017). Explaining how a deep neural network trained with end-to-end learning steers a car..

Bowling, M., & Veloso, M. (2000). An analysis of stochastic game theory for multiagent reinforcement learning. Tech. rep., Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). Openai gym..

Brown, A., et al. (2018). Udacity self-driving car simulator. In Git Hub Repository https: // github. com/ udacity/ self-driving-car-sim .

Bu, L., Babu, R., De Schutter, B., et al. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2), 156 172.

Chen, C., Seﬀ, A., Kornhauser, A., & Xiao, J. (2015). Deepdriving: Learning aﬀordance for direct perception in autonomous driving. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2722 2730. IEEE.

Da Silva, F. L., & Costa, A. H. R. (2019). A survey on transfer learning for multiagent reinforcement learning systems. Journal of Artiﬁcial Intelligence Research, 64, 645 703.

de Bruin, T., Kober, J., Tuyls, K., & Babuˇska, R. (2018). Integrating state representation learning into deep reinforcement learning. IEEE Robotics and Automation Letters, 3(3), 1394 1401.

Dikmen, M., & Burns, C. M. (2016). Autonomous driving in the real world: Experiences with tesla autopilot and summon. In Proceedings of the 8th international conference on automotive user interfaces and interactive vehicular applications, pp. 225 228. ACM.

Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). Carla: An open urban driving simulator..

MADRa S : Multi Agent Driving Simulator

Dossa, R. F. J., Lian, X., Nomoto, H., Matsubara, T., & Uehara, K. (2019). A human-like agent based on a hybrid of reinforcement and imitation learning. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1 8. IEEE.

Dresner, K., & Stone, P. (2008). A multiagent approach to autonomous intersection management. Journal of artiﬁcial intelligence research, 31, 591 656.

Fayjie, A. R., Hossain, S., Oualid, D., & Lee, D. (2018). Driverless car: Autonomous driving using deep reinforcement learning in urban environment. In 2018 15th International Conference on Ubiquitous Robots (UR), pp. 896 901.

Kaushik, M., Prasad, V., Krishna, K. M., & Ravindran, B. (2018). Overtaking maneuvers in simulated highway driving using deep reinforcement learning. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1885 1890. IEEE.

Koutn ık, J., Cuccu, G., Schmidhuber, J., & Gomez, F. (2013). Evolving large-scale neural networks for vision-based reinforcement learning. In Proceedings of the 15th annual conference on Genetic and evolutionary computation, pp. 1061 1068. ACM.

Koutn ık, J., Schmidhuber, J., & Gomez, F. (2014). Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pp. 541 548. ACM.

La Valle, S. M. (2006). Planning algorithms. Cambridge university press.

Li, Y., Song, J., & Ermon, S. (2017). Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3812 3822.

Liang, E., Liaw, R., Moritz, P., Nishihara, R., Fox, R., Goldberg, K., Gonzalez, J. E., Jordan, M. I., & Stoica, I. (2018). Rllib: Abstractions for distributed reinforcement learning..

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2019). Continuous control with deep reinforcement learning..

Lin, X., Adams, S. C., & Beling, P. A. (2018). Multi-agent inverse reinforcement learning for general-sum stochastic games. Ar Xiv, abs/1806.09795.

Lin, X., Beling, P. A., & Cogill, R. (2017). Multiagent inverse reinforcement learning for two-person zero-sum games. IEEE Transactions on Games, 10(1), 56 68.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp. 157 163. Elsevier.

Liu, G.-H., Siravuru, A., Prabhakar, S., Veloso, M., & Kantor, G. (2017). Learning end-toend multimodal sensor policies for autonomous navigation. In Levine, S., Vanhoucke, V., & Goldberg, K. (Eds.), Proceedings of the 1st Annual Conference on Robot Learning, Vol. 78 of Proceedings of Machine Learning Research, pp. 249 261. PMLR.

Loiacono, D., Cardamone, L., & Lanzi, P. L. (2013). Simulated car racing championship: Competition software manual..

Loiacono, D., Lanzi, P. L., Togelius, J., Onieva, E., Pelta, D. A., Butz, M. V., Lonneker, T. D., Cardamone, L., Perez, D., S aez, Y., et al. (2010a). The 2009 simulated car racing championship. IEEE Transactions on Computational Intelligence and AI in Games, 2(2), 131 147.

Santara, Rudra, Buridi, Kaushik, Naik, Kaul, Ravindran

Loiacono, D., Prete, A., Lanzi, P. L., & Cardamone, L. (2010b). Learning to overtake in torcs using simple reinforcement learning. In IEEE Congress on Evolutionary Computation, pp. 1 8. IEEE.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, p. 6382 6393, Red Hook, NY, USA. Curran Associates Inc.

Minster, G., Haghighat, S., Chu, K., & Vogt, K. (2018). System and method for autonomous vehicle driving behavior modiﬁcation.. US Patent 10,035,519.

Neider, J., Davis, T., & Woo, M. (1993). Open GL programming guide, Vol. 14. Addison Wesley Reading, MA.

Onieva, E., Cardamone, L., Loiacono, D., & Lanzi, P. L. (2010). Overtaking opponents with blocking strategies using fuzzy logic. In Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, pp. 123 130. IEEE.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., & Wermter, S. (2019). Continual lifelong learning with neural networks: A review. Neural Networks, 113, 54 71.

Pomerleau, D. A. (1989). Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pp. 305 313.

Richter, S. R., Hayder, Z., & Koltun, V. (2017). Playing for benchmarks. In International conference on computer vision (ICCV), Vol. 2.

Richter, S. R., Vineet, V., Roth, S., & Koltun, V. (2016). Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pp. 102 118. Springer.

Ros, G., Sellart, L., Materzynska, J., Vazquez, D., & Lopez, A. M. (2016). The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3234 3243.

Santara, A., Naik, A., Ravindran, B., Das, D., Mudigere, D., Avancha, S., & Kaul, B. (2018). Rail: Risk-averse imitation learning. In Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, AAMAS 18, p. 2062 2063, Richland, SC. International Foundation for Autonomous Agents and Multiagent Systems.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. Co RR, abs/1707.06347.

Seel, N. M. (2011). Encyclopedia of the Sciences of Learning. Springer Science & Business Media.

Shah, S., Dey, D., Lovett, C., & Kapoor, A. (2018). Airsim: High-ﬁdelity visual and physical simulation for autonomous vehicles. In Field and service robotics, pp. 621 635. Springer.

Shalev-Shwartz, S., Shammah, S., & Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving..

MADRa S : Multi Agent Driving Simulator

Shalev-Shwartz, S., & Shashua, A. (2016). On the sample complexity of end-to-end training vs. semantic abstraction training..

Sharifzadeh, S., Chiotellis, I., Triebel, R., & Cremers, D. (2016). Learning to drive using inverse reinforcement learning and deep q-networks. Co RR, abs/1612.03653.

Sulkowski, T., Bugiel, P., & Izydorczyk, J. (2018). In search of the ultimate autonomous driving simulator. In 2018 International Conference on Signals and Electronic Systems (ICSES), pp. 252 256. IEEE.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

Talpaert., V., Sobh., I., Kiran., B. R., Mannion., P., Yogamani., S., El-Sallab., A., & Perez., P. (2019). Exploring applications of deep reinforcement learning for real-world autonomous driving systems. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP,, pp. 564 572. INSTICC, Sci Te Press.

Wymann, B., Espi e, E., Guionneau, C., Dimitrakakis, C., Coulom, R., & Sumner, A. (2000). Torcs, the open racing car simulator. Software available at http://torcs.sourceforge.net, 4(6).

Yoshida, N. (2016). Gym-torcs. https://github.com/ugonama-kun/gymtorcs.

You, C., Lu, J., Filev, D., & Tsiotras, P. (2019). Advanced planning for autonomous vehicles using reinforcement learning and deep inverse reinforcement learning. Robotics and Autonomous Systems, 114, 1 18.

Yu, L., Song, J., & Ermon, S. (2019). Multi-agent adversarial inverse reinforcement learning. In International Conference on Machine Learning, pp. 7194 7201. PMLR.