# Exploration via State Influence Modeling

Yongxin Kang1,2*, Enmin Zhao2,1*, Kai Li2, Junliang Xing2

1 School of Artificial Intelligence, University of Chinese Academy of Sciences
2 Institute of Automation, Chinese Academy of Sciences

{kangyongxin2018, zhaoenmin2018, kai.li}@ia.ac.cn, jlxing@nlpr.ia.ac.cn

This paper studies the challenging problem of reinforcement learning (RL) in hard exploration tasks with sparse rewards. It focuses on the exploration stage before the agent gets the first positive reward, in which case traditional RL algorithms with simple exploration strategies often work poorly. Unlike previous methods that use some attribute of a single state as the intrinsic reward to encourage exploration, this work leverages the social influence between different states to permit more efficient exploration. It introduces a general intrinsic reward construction method to evaluate the social influence of states dynamically. Three kinds of social influence are introduced for a state: conformity, power, and authority. By measuring state influence, the agent quickly finds the focus state during the exploration process. The proposed RL framework with state influence evaluation works well in hard exploration tasks. Extensive experimental analyses and comparisons on Grid Maze and many hard exploration Atari 2600 games demonstrate its high exploration efficiency.

## Introduction

Reinforcement learning (RL) in hard exploration tasks with sparse rewards is an essential problem in artificial intelligence. Unlike typical RL problems, hard exploration tasks with sparse rewards often consist of two stages. First, there is a long period of exploration before the agent obtains a new reward, which we term the no-reward exploration stage. Second, after the agent obtains some local rewards, it follows a process of experience utilization and continued exploration, which we term the local-reward exploitation stage. In this paper, we focus on the first and more difficult no-reward exploration stage. During the no-reward exploration stage, traditional RL algorithms based on value functions (Mnih et al. 2015; Van Hasselt, Guez, and Silver 2016) or policy gradients (Schulman et al. 2015, 2017) often get trapped in some confusing states because the state value they use is evaluated only by the reward. Since only a few states contain rewards, the agent cannot distinguish between those no-reward states, even if it has experienced them many times.

*Equal contribution, corresponding author. Copyright 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: (a) The social influence of one person in a social network. Three characteristics are often used to represent a person's social influence: 1) the size of the node indicates his conformity, 2) the blue dotted line indicates his power, which is usually related to the number of connections within his group, and 3) the solid orange line indicates his authority, that is, how many followers a person has. The most influential person in a social network is called the focus. Similarly, we regard the states in (b) the traditional MDP model in RL as the nodes in (c) a social network and obtain the focus state by measuring the states' social influence.
The focus state is related to more states than others, and exploring it can accelerate the agent's cognition of the environment.

To deal with hard exploration tasks with sparse rewards, many previous works try to imitate expert demonstrations (DQfD) (Hester et al. 2018) or the agent's own successful experience (SIL) (Oh et al. 2018). In practice, however, expert demonstrations are often unavailable, and SIL focuses on the second stage, where the agent has already received local rewards. In the no-reward exploration stage, HER (Andrychowicz et al. 2017) randomly sets virtual goals from the experience replay buffer, regardless of which experience might be the most valuable, so it suffers from low sampling efficiency. To explore in more meaningful directions, some researchers design intrinsic rewards based on the curiosity of states, where curiosity is measured by prediction errors (Pathak et al. 2017), reachability (Savinov et al. 2019), or pseudo-counts (Bellemare et al. 2016). Although these methods have achieved some success, they only consider the attributes of the state itself and ignore the relationships between states.

"But the human essence is no abstraction inherent in each single individual. In its reality it is the ensemble of the social relations" (Marx and Engels 1969). Social influence refers to the change in a person's behavior after an interaction with other people or organizations. It consists of the process by which individual opinions can be changed by the influence of another individual or other individuals (Friedkin 2006). Based on these considerations, when estimating the value of a state in RL, it is better to consider its relationships with other states in addition to its own attributes.

Inspired by the concept of social influence in the social network analysis community (Friedkin 2006), we regard each state as an individual, the relationship between states as a link in a social network, and the exploration process as a series of accesses to the opinion-leader states (Figure 1). A general intrinsic reward construction method is thus introduced to measure the social influence of states dynamically, termed the Social Influence (SI) based intrinsic reward function. It comprises three kinds of social attributes: conformity, power, and authority. In particular, conformity measures how often a state is visited, power measures its relations with its former states in the MDP, and authority measures the relations it might have with its followers. By evaluating these social attributes of states, the agent can find the focus state and exploit this information to accelerate the exploration process.

We form a general RL framework using this SI-based intrinsic reward function. The new framework applies to value-based RL algorithms like DQN (Mnih et al. 2015) and Dueling DQN (Wang et al. 2016), policy-gradient-based RL algorithms like PPO (Schulman et al. 2017) and TRPO (Schulman et al. 2015), and hybrid ones like A3C (Mnih et al. 2016). For a series of hard exploration tasks such as Grid Maze and Montezuma's Revenge, we develop corresponding learning algorithms based on the proposed RL framework. The results demonstrate that, with the introduction of social influence, all the evaluated algorithms significantly improve their learning efficiency and quickly accomplish the goal of each task. Specifically, on the Grid Maze game, the proposed method markedly reduces the total number of exploration steps compared with a classical Q-learning based method. On Montezuma's Revenge, compared with existing algorithms, the proposed method converges faster and obtains higher scores.
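To make the three attributes concrete, the minimal sketch below shows one way conformity, power, and authority could be estimated from observed transitions and combined into a single exploration bonus. The count- and graph-based estimates, the linear weighting, and all names in the snippet are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict


class SocialInfluenceBonus:
    """Minimal sketch of a tabular state-influence estimate (illustrative only).

    Conformity is approximated by visitation frequency, power by the number of
    distinct predecessor states seen so far, and authority by the number of
    distinct successor states seen so far. Assumes discrete, hashable states
    (e.g., tabular states or discretized observations).
    """

    def __init__(self, w_conformity=1.0, w_power=0.1, w_authority=0.1):
        self.visits = defaultdict(int)        # state -> visit count
        self.predecessors = defaultdict(set)  # state -> states leading into it
        self.successors = defaultdict(set)    # state -> states reachable from it
        self.total_visits = 0
        self.w_c, self.w_p, self.w_a = w_conformity, w_power, w_authority

    def update(self, s, s_next):
        """Record one observed transition s -> s_next."""
        self.visits[s] += 1
        self.total_visits += 1
        self.predecessors[s_next].add(s)
        self.successors[s].add(s_next)

    def bonus(self, s):
        """Rarely visited (low conformity) but well-connected (high power and
        authority) states receive a larger exploration bonus."""
        conformity = self.visits[s] / max(self.total_visits, 1)
        power = len(self.predecessors[s])
        authority = len(self.successors[s])
        return -self.w_c * conformity + self.w_p * power + self.w_a * authority
```

Under this sketch, the state with the largest combined score would play the role of the focus state in Figure 1.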
To summarize, the main contributions of this work are threefold:

- We point out that in the no-reward exploration stage, it is not enough to define intrinsic rewards based on the attributes of a state alone. The relationship between different states is introduced as a new part of the intrinsic reward.
- Following the measurement of individual social influence in social network analysis, a generalized intrinsic reward function is defined for each state, covering both the attributes of the state itself and the relationships between the state and others.
- A new RL framework with the SI-based intrinsic reward function is proposed and applied to Q-learning and A2C to improve performance in several hard exploration games.

The source code, trained models, and all the experimental results will be released to facilitate further studies on reinforcement learning in hard exploration tasks.

## Background

To explain the proposed intrinsic motivation model, we first introduce some background knowledge on basic reinforcement learning, intrinsic rewards, and social influence.

Basic RL. The standard RL formulation involves an agent interacting with an environment. An MDP is a tuple M = ⟨S, A, R, T, γ⟩, consisting of a set of states S, a set of actions A, a reward function R : S × A → ℝ, a transition probability model T(s_{t+1}, r_{t+1} | s_t, a_t), and a discount factor γ ∈ [0, 1]. A policy π maps a state to an action, π : S → A. An episode starts with an initial state s_0, and at each time step t the agent chooses an action a_t = π(s_t) based on the current state s_t. The environment produces a reward r_{t+1}, and the agent reaches the next state s_{t+1} sampled from the distribution T(s_{t+1}, r_{t+1} | s_t, a_t). The reward is discounted by the factor γ at each time step, and the goal of the agent is to maximize the accumulated reward

$$\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}. \quad (1)$$

Intrinsic reward. Intrinsic rewards become critical when extrinsic rewards are sparse (Pathak et al. 2017). They guide the agent based on the change in prediction error or learning progress (Bellemare et al. 2016; Schmidhuber 1991; Oudeyer, Kaplan, and Hafner 2007). If e_n(A) is the error made by the agent at time n over some event A, and e_{n+1}(A) is the same error after observing a new piece of information, then the learning progress is e_n(A) − e_{n+1}(A). To further quantify the learning process, researchers provide an information-gain-related method to explain the intrinsic reward (Bellemare et al. 2016). At each time step, the agent is trained with the reward r_t = e_t + β i_t, where e_t is the extrinsic reward provided by the environment, i_t is the intrinsic reward generated by the agent, and β > 0 is a scalar balancing the intrinsic and extrinsic rewards (Taiga et al. 2020). The overall optimization problem solves the following Bellman equation,

$$V(s) = \max_{a \in A}\big[e_t + \beta i_t + \gamma \mathbb{E}_{\pi}[V(s')]\big]. \quad (2)$$
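As a minimal illustration of how the combined reward r_t = e_t + β i_t enters a learning update, the sketch below plugs it into a tabular Q-learning target, the action-value analogue of Eq. (2). The gym-style environment interface and the `intrinsic_fn` callable are assumptions made for illustration.

```python
import random
from collections import defaultdict


def q_learning_with_intrinsic_reward(env, intrinsic_fn, n_actions,
                                     beta=0.1, alpha=0.1, gamma=0.99,
                                     epsilon=0.1, episodes=500):
    """Tabular Q-learning driven by r = e_t + beta * i_t (sketch).

    Assumes `env` follows a gym-style reset()/step() interface with hashable
    states, and `intrinsic_fn(s)` returns an intrinsic bonus such as the
    social-influence score sketched earlier.
    """
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s_next, extrinsic, done, _ = env.step(a)
            # combined reward: extrinsic plus weighted intrinsic bonus, cf. Eq. (2)
            r = extrinsic + beta * intrinsic_fn(s_next)
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```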
Social influence. In social network analysis, the social influence of a node is often characterized by three main features (Friedkin 2006): 1) conformity, which occurs when an individual expresses a particular opinion in order to meet the expectations of a given other, even without necessarily believing that the opinion is appropriate; 2) power, which is the ability to force someone to behave in a particular way by controlling his/her outcomes; and 3) authority, which is power believed to be legitimate by those who are subject to it.

## The Model

There are two main obstacles in the process of exploring a sparse-reward environment. The first comes from the vast state space; the other comes from the uncertainty across states. Too many no-reward states and unknown transitions between states confuse the agent and slow down the exploration process. Figure 1(b) is a simple example that regards each state as a node and each transition as an edge in a dynamic directed graph. As the number of states grows during exploration, the number of nodes in the graph increases and the number of possible edges increases as well, which makes the exploration task much harder. Social networks can also be modeled by dynamic graphs (Figure 1(a)). However, as the population size increases, a social network can still quickly find an effective way to spread information without being confused by this uncertainty. This benefits from the efficient utilization of the focus person and from modeling the relationships between persons. Based on these observations, we introduce concepts from social networks into the exploration process to narrow the exploration space and reduce the uncertainty.

Intuitively, in the exploration task shown in Figure 1(c), to explore state S8 the agent must first reach the corresponding state S6. Whether the arrival at and re-exploration of a state are instructive for policy improvement depends not only on the current state itself, but also on how the state is reached and how much potential it offers in the future. The exploration process thus cannot simply be summarized as "explore what surprises the agent", as most previous methods did; it should also take "exploit what influences the environment" into account. This parallels the focus person in the social network in Figure 1(a), whose importance to information broadcasting relates not only to his conformity (occurrence frequency) but also to his power and authority (connections with other people). Therefore, we give a more reasonable formulation of the intrinsic reward according to the concepts in social influence analysis, which consists of a state's own characteristics and the relationships between states.

### State Influence

Definition 1 (Conformity function) We define the conformity function on the state space S, f^C : S → ℝ, mapping a state to a conformity level. ∀ s_i, s_j ∈ S, if f^C(s_i) < f^C(s_j), we say state s_j is visited more often than state s_i.

In the study of social networks, conformity indicates that the opinion of an individual is the same as that of the majority, or that the opinion of the individual is expected by the public. In the exploration task of RL, we regard each state encountered as an individual and its visitation characteristics as opinions. The conformity function f^C measures how often a state is visited. In social networks, the focus person does not blindly follow others, which means less conformity. Likewise, in our exploration problems, we should avoid revisiting already familiar states. In episodic RL, we formalize the conformity as

$$f^C(s_i) := p(s_i). \quad (3)$$
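As a small worked example of Eq. (3), the snippet below approximates p(s_i) by the empirical visitation frequency over the states seen so far. Reading p(s_i) as this empirical frequency, and the toy trajectory, are assumptions made for illustration.

```python
from collections import Counter


def conformity(state, visit_counts: Counter) -> float:
    """Empirical estimate of f^C(s_i) := p(s_i): the fraction of all visits
    that landed on `state` (sketch; assumes hashable, e.g. discretized, states)."""
    total = sum(visit_counts.values())
    return visit_counts[state] / total if total > 0 else 0.0


# usage: update counts as the agent explores, then query the conformity level
counts = Counter()
for s in ["s0", "s1", "s0", "s2", "s0"]:  # toy trajectory
    counts[s] += 1
print(conformity("s0", counts))  # 0.6 -> s0 is a familiar, high-conformity state
```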
Definition 2 (Power function) We define the power function f^P on the state space S, f^P : S → ℝ, mapping a state to a power level. ∀ s_i, s_j ∈ S, if f^P(s_i)