Published as a conference paper at ICLR 2020

THE GAMBLER'S PROBLEM AND BEYOND

Baoxiang Wang
Department of Computer Science and Engineering
The Chinese University of Hong Kong
bxwang@cse.cuhk.edu.hk

Shuai Li
John Hopcroft Center for Computer Science
Shanghai Jiao Tong University
shuaili8@sjtu.edu.cn

Jiajin Li
Department of SEEM
The Chinese University of Hong Kong
jjli@se.cuhk.edu.hk

Siu On Chan
Department of Computer Science and Engineering
The Chinese University of Hong Kong
siuon@cse.cuhk.edu.hk

We analyze the Gambler's problem, a simple reinforcement learning problem where the gambler has the chance to double or lose their bets until the target is reached. This is an early example introduced in the reinforcement learning textbook by Sutton & Barto (2018), where they mention an interesting pattern of the optimal value function, with high-frequency components and repeating non-smooth points, but leave it without further investigation. We provide the exact formula for the optimal value function for both the discrete and the continuous cases. Simple as the problem might seem, the value function is pathological: it is fractal and self-similar, its derivative is either zero or infinity, it is not smooth on any interval, and it cannot be written in terms of elementary functions. It is in fact one of the generalized Cantor functions, and it carries a complexity that has been uncharted thus far. Our analyses could lead to insights for improving value function approximation, gradient-based algorithms, and Q-learning, in real applications and implementations.

1 INTRODUCTION

We analytically investigate a deceptively simple problem, the Gambler's problem, introduced in the reinforcement learning textbook by Sutton & Barto (2018) as Example 4.3, Chapter 4, page 84. The problem setting is natural and simple enough that little discussion was given in the book apart from an algorithmic solution by value iteration. A close inspection will however show that the problem, as a representative of the entire family of Markov decision processes (MDP), involves a level of complexity and curiosity uncharted in years of reinforcement learning research.

The problem describes a gambler's casino game, in which the gambler places multiple rounds of bets. The gambler gains the bet amount if they win a round and loses the bet if they lose the round. The probability of losing each round is p ≥ 0.5, independently across rounds. The game ends when the gambler's capital reaches either their goal of N or 0. In each round, the gambler must decide what portion of the capital to stake. In the discrete setting this bet amount must be an integer, while it can be a real number in the continuous setting. To formulate the problem as an MDP, we let the state s be the current capital and the action a be the bet amount. The reward is +1 when the state reaches s = N, and zero otherwise. Our goal is to solve the optimal value function of the problem.

We first give the solution to the discrete Gambler's problem. Denote by N the target capital, by n the current capital (which is the state in the discrete setting), by p > 0.5 the probability of losing a bet, and by γ the discount factor. The special case of N = 100, γ = 1 corresponds to the original setting in Sutton and Barto's book.

Figure 1: The optimal state-value function z(n) of the discrete Gambler's problem, n = 0, ..., 100, for p ∈ {0.6, 0.7} and γ ∈ {1.0, 0.95}.

Proposition 1. Let 0 ≤ γ ≤ 1 and p > 0.5. The optimal value function z(n) in the discrete setting of the Gambler's problem is v(n/N), where v(·) is the optimal value function under the continuous case defined in Theorem 12.
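For concreteness, the curves in Figure 1 can be reproduced by plain value iteration over the finite state space, as in the algorithmic solution the textbook gives. The following is a minimal sketch; it is our own illustration (assuming NumPy), not code from the paper, and the tolerance and parameter values are arbitrary choices.

```python
import numpy as np

def gambler_value_iteration(N=100, p=0.6, gamma=1.0, tol=1e-9):
    """Value iteration for the discrete Gambler's problem.

    States are capitals n = 0, ..., N; a bet a is lost with probability p.
    The boundary values z(0) = 0 and z(N) = 1 are held fixed, matching the
    Bellman equation z(n) = max_a gamma * (p z(n - a) + (1 - p) z(n + a)).
    """
    z = np.zeros(N + 1)
    z[N] = 1.0
    while True:
        delta = 0.0
        for n in range(1, N):
            best = 0.0
            for a in range(1, min(n, N - n) + 1):
                q = gamma * (p * z[n - a] + (1 - p) * z[n + a])
                best = max(best, q)
            delta = max(delta, abs(best - z[n]))
            z[n] = best
        if delta < tol:
            return z

if __name__ == "__main__":
    z = gambler_value_iteration(N=100, p=0.6, gamma=1.0)
    print(z[25], z[50], z[75])  # the repeating non-smooth pattern of Figure 1
```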
The above statement reduces the discrete problem to the continuous problem by a uniform discretization. The rest of the discussion will be on the more general continuous setting. In this setting, the target capital is 1, the state space is [0, 1], and the action space at state s is 0 < a ≤ min{s, 1 − s}, meaning that the bet can be any fraction of the current capital as long as the capital after winning does not exceed 1. We state the optimal value function below and give an intuitive description of it later in this section.

Theorem 12. Let 0 ≤ γ ≤ 1 and p > 0.5. Under the continuous setting of the Gambler's problem, the optimal state-value function is v(1) = 1 and

    v(s) = \sum_{i=1}^{\infty} (1-p)\,\gamma^i\, b_i \prod_{j=1}^{i-1} \big((1-p) + (2p-1)\,b_j\big)    (1)

for 0 ≤ s < 1, where s = 0.b_1 b_2 ... b_ℓ ... (base 2) is the binary representation of the state s.

Next, we solve the Bellman equation of the continuous Gambler's problem. In the strictly discounted setting 0 ≤ γ < 1, the solution of the Bellman equation

    f(0) = 0, \quad f(1) = 1, \quad f(s) = \max_{0 < a \le \min\{s, 1-s\}} \, p\gamma f(s-a) + (1-p)\gamma f(s+a)

is unique and coincides with v(s) (Proposition 21). When γ = 1 the solution is no longer unique (Theorem 22): let γ = 1, p > 0.5, and f(·) be a real function on [0, 1]; then f(s) solves the Bellman equation if and only if either f(s) is v(s) defined in Theorem 12, or f(0) = 0, f(1) = 1, and f(s) = C for all 0 < s < 1, for some constant C ≥ 1.

Under the corner case of γ = 1, p = 0.5 (where the gambler does not lose capital in expectation when betting), the problem involves midpoint concavity (Sierpiński, 1920a;b) and Cauchy's functional equation. The measurable solutions of the Bellman equation are exactly f(s) = C s + B on s ∈ (0, 1), for constants with C + B ≥ 1. Additionally, under the Axiom of Choice, f(s) can also be some non-constructive, non-Lebesgue-measurable function described by a Hamel basis (Theorem 27 and its lemmas).

Figure 2: The optimal state-value function v(s) of the continuous Gambler's problem, s ∈ [0, 1], for p ∈ {0.6, 0.7} and γ ∈ {1.0, 0.95}.

Though the description of the Gambler's problem seems natural and simple, Theorem 12 shows that its simplicity is deceptive. The optimal value function is fractal, self-similar, and non-rectifiable (see Corollary 14 and Lemma 8). It is thus not smooth on any interval, which can be unexpected given that a significant line of reinforcement learning studies is based on function approximation, such as discretization and neural networks. The value function (1) also cannot be simplified into a formula of elementary functions, which makes it difficult to interpret. The function is monotonically increasing with v(0) = 0 and v(1) = 1, but its derivative is 0 almost everywhere, which is counterintuitive. This is known as singularity, a famous pathological property of functions. v(s) is continuous almost everywhere but not absolutely continuous; moreover, when γ is strictly smaller than 1, it is discontinuous on a dense set of infinitely many points. These properties indicate that assumptions like smoothness, continuity, and approximability are not satisfied in this problem. In general, it is reasonable to doubt whether these assumptions can be imposed in reinforcement learning.

To better understand the pathology of v(s), we compare it to the Cantor function, which is well known in analysis as a counterexample to many seemingly true statements (Dovgoshey et al., 2006). In fact, v(s) is a generalized Cantor function, and the above descriptions are true for both v(s) and the Cantor function.
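The series in Equation (1) is straightforward to evaluate numerically by truncating the binary expansion of s. The sketch below is our own illustration (the function and parameter names are ours); it spot-checks the closed-form values v(1/2) and v(3/4) and the Bellman equation at one state.

```python
def v(s, p=0.6, gamma=1.0, n_bits=60):
    """Equation (1): v(s) = sum_i (1-p) * gamma^i * b_i * prod_{j<i} ((1-p) + (2p-1) b_j),
    where 0.b_1 b_2 ... is the binary expansion of s in [0, 1)."""
    if s >= 1.0:
        return 1.0
    total, prefix = 0.0, 1.0        # prefix carries the product over the earlier bits
    for i in range(1, n_bits + 1):
        s *= 2
        b = int(s)                  # the i-th binary digit of s
        s -= b
        if b:
            total += (1 - p) * gamma ** i * prefix
        prefix *= (1 - p) + (2 * p - 1) * b
    return total

if __name__ == "__main__":
    p, gamma = 0.6, 1.0
    print(v(0.5, p, gamma), (1 - p) * gamma)                              # v(1/2)
    print(v(0.75, p, gamma), (1 - p) * gamma + p * (1 - p) * gamma ** 2)  # v(3/4)
    s, a = 0.375, 0.375   # one Bellman check with the maximal bet a = min{s, 1 - s}
    print(v(s, p, gamma), gamma * (p * v(s - a, p, gamma) + (1 - p) * v(s + a, p, gamma)))
```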
Intuitive description of v(s). All the statements above require the definition of v(s). In fact, in this paper, v(s) is important enough that its definition does not change with the context. The function cannot be written as a combination of elementary functions. Nevertheless, we give an intuitive way to understand it for the original, undiscounted problem. The function can be regarded as generated by the following iterative process. First we fix v(0) = 0 and v(1) = 1, and compute v(1/2) = p v(0) + (1 − p) v(1) = 1 − p. That is, v(1/2) is the weighted average of the two neighbors v(0) and v(1) that have already been evaluated. The same operation then applies to v(1/4) and v(3/4), where v(1/4) = p v(0) + (1 − p) v(1/2) = (1 − p)^2 and v(3/4) = p v(1/2) + (1 − p) v(1) = (1 − p) + p(1 − p), and so forth to v(1/8), etc. This process evaluates v(s) on the dense set ⋃_{ℓ≥1} Gℓ of the dyadic rationals, where Gℓ = {k 2^{−ℓ} | k ∈ {1, ..., 2^ℓ − 1}}. With the fact that v(s) is monotonically strictly increasing, its values on the dyadic rationals determine the function v(s) uniquely.

This iterative process can also be explained from the analytical formula of v(s). Starting from the first bit, a bit of 0 does not change the value, while a bit of 1 at position i adds (1 − p) ∏_{j=1}^{i−1} ((1 − p) + (2p − 1) b_j) to the value. This term can also be written as (1 − p) ((1 − p)^{#0 bits} p^{#1 bits}), where the numbers of bits are counted over all previous bits. The value (1 − p)^{#0 bits} p^{#1 bits} determines the gap between two neighboring existing points in the above process when a new point is inserted in the middle. This insertion corresponds to the iteration on Gℓ over ℓ.

We provide high-resolution plots of z(n) with N = 100 and of v(s) in Figure 1 and Figure 2, respectively. The non-smoothness and the self-similar fractal patterns can be clearly observed from the figures. Also, v(s) is continuous when γ = 1, while v(s) is discontinuous at infinitely many points when γ < 1. In fact, when γ < 1, the function is discontinuous on the dyadic rationals ⋃_{ℓ≥1} Gℓ and continuous on their complement, as we will rigorously show later.

Self-similarity. The function v(s) on [s̄, s̄ + 2^{−ℓ}], for any s̄ = 0.b_1 b_2 ... b_ℓ (base 2) and ℓ ≥ 1, is self-similar to the function itself on [0, 1]. To see this, let s = 0.b_1 b_2 ... b_ℓ b_{ℓ+1} ... (base 2) ∈ [s̄, s̄ + 2^{−ℓ}]; then

    v(s) = \sum_{i=1}^{\infty} (1-p)\gamma^i b_i \prod_{j=1}^{i-1}\big((1-p)+(2p-1)b_j\big)
         = \sum_{i=1}^{\ell} (1-p)\gamma^i b_i \prod_{j=1}^{i-1}\big((1-p)+(2p-1)b_j\big)
           + \gamma^{\ell}\Big(\prod_{j=1}^{\ell}\big((1-p)+(2p-1)b_j\big)\Big) \sum_{i=1}^{\infty} (1-p)\gamma^i b_{\ell+i} \prod_{j=1}^{i-1}\big((1-p)+(2p-1)b_{\ell+j}\big)
         = v(\bar{s}) + \gamma^{\ell} \prod_{j=1}^{\ell}\big((1-p)+(2p-1)b_j\big)\, v\big(2^{\ell}(s-\bar{s})\big).    (2)

The self-similarity can be compared with that of the Cantor function (Dovgoshey et al., 2006; Mandelbrot, 1985), whose formula uses the ternary expansion of s instead. The Cantor function is self-similar to itself on [s̄, s̄ + 3^{−ℓ}], when s̄ = 0.b_1 b_2 ... b_ℓ (base 3) and bℓ ≠ 1. Both v(s) and the Cantor function are uniquely determined by their self-similarity, their monotonicity, and the boundary conditions.

Optimal policies. It is immediate from Theorem 12 and Lemma 8 that π(s) = min{s, 1 − s} is one of the (Blackwell) optimal policies. Here, Blackwell optimality means optimality uniformly over every 0 ≤ γ ≤ 1. This policy agrees with the intuition that, in a game that favors the casino (p > 0.5), the gambler should bet the maximum so as to finish the game with as little cumulative bet as possible. In fact, the probability of reaching the target equals the expected amount of capital by the end of the game, which is negatively linear in the cumulative bet. The optimal policy is not unique though; for example, π(15/32) = 1/32 is also optimal (for any γ). Under γ = 1, the original undiscounted setting, small bets can also be optimal: when s can be written with finitely many bits, s = 0.b_1 b_2 ... b_ℓ (base 2) in binary (assume bℓ = 1), π(s) = 2^{−ℓ} is also an optimal policy. This policy repeatedly rounds the capital by clearing or carrying its last bit, keeping the game within at most ℓ rounds of bets.
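Since under γ = 1 the value v(s) is exactly the probability of eventually reaching the target, the policy π(s) = min{s, 1 − s} can be cross-checked by simulation. The following minimal sketch is our own illustration (p = 0.6 is an arbitrary choice); it compares the empirical success rate with the closed-form values v(1/4), v(1/2), and v(3/4) from the iterative construction above.

```python
import random

def bold_play_success(s, p=0.6, episodes=200_000, seed=1):
    """Empirical probability of reaching capital 1 from s under pi(s) = min{s, 1 - s},
    in the undiscounted (gamma = 1) continuous Gambler's problem."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(episodes):
        x = s
        while 0.0 < x < 1.0:
            a = min(x, 1.0 - x)
            x = x - a if rng.random() < p else x + a   # lose the bet with probability p
        wins += (x >= 1.0)
    return wins / episodes

if __name__ == "__main__":
    p = 0.6
    # closed-form values from the iterative construction: v(1/4), v(1/2), v(3/4)
    exact = {0.25: (1 - p) ** 2, 0.5: 1 - p, 0.75: (1 - p) + p * (1 - p)}
    for s, val in exact.items():
        print(s, bold_play_success(s, p), val)
```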
1.1 PRELIMINARIES

We use the canonical formulation of the discrete-time Markov decision process (MDP), denoted as the tuple (S, A, T, r, ρ0, γ). Here S is the state space, A the action space, T : S × A × S → R+ the transition probability function, r : S → R the reward function, ρ0 : S → R+ the initial state distribution, and γ ∈ [0, 1] the unnormalized discount factor. A deterministic policy π : S → A is a map from the state space to the action space. In this problem, T(s, a, s − a) and T(s, a, s + a) are p and 1 − p respectively for s ∈ S, a ∈ A, and T is otherwise 0.

Our goal is to solve the optimal value function of the Gambler's problem. In this problem, the state-value function is the probability of the gambler eventually reaching the target capital from a state. The state-value function of an MDP with respect to state s and policy π is defined as

    f^{\pi}(s) = E\Big[\sum_{t} \gamma^t r_t \,\Big|\, s_0 = s,\ a_t = \pi(s_t),\ s_{t+1} \sim T(s_t, a_t, \cdot),\ r_t = r(s_t),\ t = 0, 1, \ldots\Big].

When π is one of the optimal policies, f^π(s) is the optimal state-value function. Although there may exist more than one optimal policy, this optimal state-value function is unique (Sutton & Barto, 2018; Szepesvári, 2010).

1.2 IMPLICATIONS

Our results indicate hardness of reinforcement learning (Papadimitriou & Tsitsiklis, 1987; Littman et al., 1995; Thelen & Smith, 1998) and suggest revisions of existing reinforcement learning algorithms. It is worth noting that similar patterns of fractals and self-similarity have been observed empirically, for example by Chockalingam (2019) for the Mountain Car problem. With these characterizations observed in problems as simple as Mountain Car and the Gambler's problem, our results are expected to generalize to a variety of reinforcement learning settings.

The first implication is naturally on value function approximation, a well-developed topic in reinforcement learning (Lee et al., 2008; Lusena et al., 2001). By the fractal property of the optimal value function, any such representation must be inexact (Tikhonov, 2014). When discretization is used for value function representation, the approximation error is at least O(1/N), where N is the number of bins.

Proposition 19. When N ∈ N+, N ≥ 4 is a power of 2, let v1(s) be piecewise constant on each of the intervals s ∈ (k/N, (k + 1)/N), k = 0, ..., N − 1; then

    \int_0^1 |v(s) - v_1(s)|\, ds \;\ge\; \frac{1}{N}\cdot\frac{(2-\gamma)(1-p)\gamma}{1-p\gamma} + o\Big(\frac{1}{N}\Big).

Alternatively, when a subclass of L-Lipschitz continuous functions is used to represent v(s), the error is at least (1/L) · (1 − p)^2 γ^2 (1 − γ)^2 / (4 − 4pγ) by the discontinuity of v(s) (Proposition 20). It is worth remarking that although this specific lower bound vanishes as γ approaches 1, the approximation error remains nonzero for arbitrarily large L under γ = 1, as the derivative of v(s) can be infinite (Fact 16). Notably, neural networks are within this family of functions, where the Lipschitz constant L is determined by the network architecture. By the proposition, it is not possible to obtain the optimal value function exactly when a neural network is used, despite the universal approximation theorem (Csáji, 2001; Dovgoshey et al., 2006; Levesley et al., 2007).
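To make the first implication concrete, the sketch below is our own illustration (under the assumptions γ = 1 and p = 0.6): it builds v on a fine dyadic grid by the iterative midpoint construction from the introduction and measures the L1 error of a piecewise-constant approximation for increasing numbers of bins N. The product N × error stays bounded away from zero, in line with the Ω(1/N) behavior.

```python
import numpy as np

def v_on_dyadic_grid(K, p=0.6):
    """v on the grid {k / 2^K : k = 0, ..., 2^K} for gamma = 1, built by the iterative
    construction v(midpoint) = p * v(left) + (1 - p) * v(right)."""
    v = np.zeros(2 ** K + 1)
    v[-1] = 1.0
    step = 2 ** K
    while step > 1:
        half = step // 2
        for left in range(0, 2 ** K, step):
            v[left + half] = p * v[left] + (1 - p) * v[left + step]
        step = half
    return v

if __name__ == "__main__":
    K, p = 16, 0.6
    vals = v_on_dyadic_grid(K, p)
    ds = 2.0 ** -K
    for N in (4, 16, 64, 256):
        width = 2 ** K // N
        err = 0.0
        for k in range(N):
            seg = vals[k * width:(k + 1) * width + 1]
            y = seg[width // 2]        # constant chosen as the value at the bin midpoint
            err += np.sum(np.abs(seg[:-1] - y)) * ds
        print(N, err, N * err)         # N * err stays roughly constant
```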
The second implication is, by Theorem 12 and Fact 16, that the derivative of v(s) satisfies

    \lim_{\Delta s \to 0^+} \frac{v(s+\Delta s) - v(s)}{\Delta s} = 0, \qquad
    \lim_{\Delta s \to 0^-} \frac{v(s+\Delta s) - v(s)}{\Delta s} =
    \begin{cases} +\infty, & \text{if } s = 0 \text{ or } s \in \bigcup_{\ell \ge 1} G_\ell, \\ 0, & \text{otherwise.} \end{cases}

This means that the derivative of the value function can never be obtained exactly: it is 0 almost everywhere, except on the dyadic rationals ⋃_{ℓ≥1} Gℓ, where v(s) has a left derivative of infinity and a right derivative of 0. Algorithms relying on ∂v(s)/∂s and ∂Q(s, a)/∂a (Lillicrap et al., 2015; Gu et al., 2017; Heess et al., 2015; Fairbank & Alonso, 2012; Fairbank, 2008; Pan et al., 2019; Lim et al., 2018), where Q(s, a) is the action-value function (Sutton & Barto, 2018), can suffer from the resulting estimation error or even exhibit unpredictable behavior. In practice, the binary implementation of floating-point numbers can further increase this error, as every state s that can be implemented lies in Gℓ for some ℓ. A precise evaluation would require all these derivatives to be infinite when the derivative is taken in a two-sided sense (the average of the left and right derivatives), which cannot be represented in a computer system.

The third implication is on Q-learning (Mnih et al., 2015; Watkins & Dayan, 1992; Baird, 1995), by Theorem 22 and its supporting lemmas. It is proved that when γ = 1, Q-learning has multiple convergence points, as the Bellman equation has multiple solutions, namely v(s) and any function with f(0) = 0, f(1) = 1, and f(s) = C for all 0 < s < 1, for some constant C ≥ 1. Therefore, even when the Q-learning algorithm converges, it may not converge to the optimal value function v(s). In fact, since the solution can be either the ground truth of the optimal value function or a large constant function, and a constant function is easier to approximate than the optimal value function, converging to the large constant yields a relatively lower Bellman error. This challenges Q-learning under γ = 1, where the return (cumulative reward) is unbiased. Though the artificial γ was originally introduced to prevent the return from diverging, it can also be necessary to prevent the algorithm from converging to a large constant in Q-learning, which is not desired.
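The multiple-fixed-point phenomenon is easy to check numerically. The tiny sketch below is our own illustration (the action maximum is restricted to two representative bets, which suffices for a constant function): a constant C ≥ 1 on the interior has zero Bellman residual exactly when γ = 1, and a strictly positive residual as soon as γ < 1.

```python
def bellman_residual(f, p=0.6, gamma=1.0, grid=64):
    """Largest Bellman residual of f over interior grid states, with the max over
    actions approximated by {min(s, 1 - s), min(s, 1 - s) / 2}."""
    worst = 0.0
    for k in range(1, grid):
        s = k / grid
        m = min(s, 1.0 - s)
        best = max(gamma * (p * f(s - a) + (1 - p) * f(s + a)) for a in (m, m / 2))
        worst = max(worst, abs(f(s) - best))
    return worst

if __name__ == "__main__":
    # f = C on (0, 1) with the boundary values f(0) = 0, f(1) = 1 and C >= 1
    const = lambda s: 0.0 if s == 0.0 else (1.0 if s == 1.0 else 2.0)
    print(bellman_residual(const, gamma=1.0))    # 0.0 -- a constant also solves the equation at gamma = 1
    print(bellman_residual(const, gamma=0.99))   # about 0.02 -- but not once gamma < 1
```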
2 DISCRETE CASE

The analysis of the discrete case of the Gambler's problem gives an exact solution. It also explains why the plot in the book has a strange pattern of repeating spurious points. The discrete case can be described by the following MDP. The state space is {0, ..., N}; the action space at state n is A(n) = {a : 0 < a ≤ min{n, N − n}}; the transition from state n under action a is to n − a and n + a with probability p and 1 − p, respectively; the reward function is r(N) = 1 and r(n) = 0 for 0 ≤ n ≤ N − 1. The MDP terminates at n ∈ {0, N}. We use a time-discount factor 0 ≤ γ ≤ 1, so that the agent receives a reward of γ^T r(N) if it reaches the state n = N at time T. Let z(n), n ∈ N, 0 ≤ n ≤ N, be the value function.

The exact solution of the discrete case below relies on Theorem 12, our main theorem, which describes the exact solution of the continuous case. That theorem is discussed and proved later in Section A.1.

Proposition 1. Let 0 ≤ γ ≤ 1 and p > 0.5. The optimal value function z(n) in the discrete setting of the Gambler's problem is v(n/N), where v(·) is the optimal value function under the continuous case defined in Theorem 12.

Proof. We first verify the Bellman equation. By the definition of v(·) we have

    z(n) = v(n/N) = \max_{a \in A(n)} \, p\gamma\, v\Big(\frac{n-a}{N}\Big) + (1-p)\gamma\, v\Big(\frac{n+a}{N}\Big).

Throughout the appendix, the Bellman equation is referred to as condition (A) and the boundary conditions f(0) = 0, f(1) = 1 as condition (B); the following conditions label the settings under which the system (AB) is solved:

    0 ≤ γ ≤ 1, p > 0.5, f(s) ≤ 1 for all s, f(s) is continuous at s = 0.¹    (X)

Respectively, the unbounded version (Y) of the problem leads to the solutions of the Bellman equation:

    0 ≤ γ ≤ 1, p > 0.5.    (Y)

The results extend to p = 0.5 in general, except in the extreme corner case γ = 1, p = 0.5, where the monotonicity in Lemma 3 no longer applies. This case (Z) involves arguments over measurability and the acceptance of the Axiom of Choice, which we discuss at the end of Section A:

    γ = 1, p = 0.5, f(s) possibly unbounded.    (Z)

We are mostly interested in two settings. The first setting (ABX), with its solution in Theorem 12, collects a set of necessary conditions for f(s) to be the optimal value function of the Gambler's problem; as we show later that the solution of (ABX) is unique, this solution must be the optimal value function. The second setting (ABY), with its solutions in Proposition 21 and Theorem 22, characterizes all functions that satisfy the Bellman equation; these functions are the fixed points to which value iteration and Q-learning algorithms may converge. The setting (ABZ) is interestingly connected to the foundations of mathematics, namely the choice of axioms, and is discussed in Theorem 27.

¹ This continuity assumption is only for a better organization of the settings. The more general problem (AB) is solved in Section A.2 without this assumption.

The analysis section rigorously supports the statements on the Gambler's problem and its Bellman equation with proofs and discussions. It is deferred to the appendix due to the page limit.

5 CONCLUSION AND FUTURE WORKS

We give a complete solution to the Gambler's problem, a simple and classic problem in reinforcement learning, under a variety of settings. We show that its optimal value function is complicated and even pathological. Despite the apparent simplicity of the problem, these results have not been clearly pointed out in previous studies. Our contributions are the theoretical findings and their implications. We believe it is worthwhile to bring the current results to the community to start the discussion.

Indicated by the Gambler's problem, current algorithmic approaches in reinforcement learning might underestimate this complexity. We expect more evidence to be found in the future and new algorithms and implementations to be brought out. It would also be interesting to see how these results on the Gambler's problem generalize to other MDPs. Finding such characterizations of MDPs is in general an important step toward understanding reinforcement learning and sequential decision processes.

ACKNOWLEDGEMENT

We thank Richard S. Sutton and Andrew Barto for raising the Gambler's problem in their book, and Rich S. Sutton for the discussions on our theorems and implications. We thank Andrej Bogdanov for pointing out the connection to the Axiom of Choice, namely Theorem 27, and Chengyu Lin for the discussions on the properties of v(s), namely Lemmas 6, 8, and 9. This paper was largely improved by the reviews and comments. We especially would like to thank Csaba Szepesvári, Kirby Banman, and the ICLR 2020 anonymous reviewers for their helpful feedback.

REFERENCES

Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37. Elsevier, 1995.
Valliappa Chockalingam. The role of interest in prediction and control, 2019. URL https://youtu.be/aFXdpCDAG2g?t=395. The plot of the empirical optimal value function of the Mountain Car problem first appears at 6:35. Some follow-up discussions start at 33:20.

Balázs Csanád Csáji. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary, 24:48, 2001.

Oleksiy Dovgoshey, Olli Martio, Vladimir Ryazanov, and Matti Vuorinen. The Cantor function. Expositiones Mathematicae, 24(1):1–37, 2006.

Michael Fairbank. Reinforcement learning by value gradients. arXiv preprint arXiv:0803.3539, 2008.

Michael Fairbank and Eduardo Alonso. Value-gradient learning. In The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2012.

Shixiang Shane Gu, Timothy Lillicrap, Richard E. Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3846–3855, 2017.

Mance E. Harmon and Leemon C. Baird III. Spurious solutions to the Bellman equation. Technical Report WL-TR-96, Wright Laboratory, Wright-Patterson Air Force Base, Ohio, 1996.

Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952, 2015.

Thomas J. Jech. The Axiom of Choice. Courier Corporation, 2008.

Takashi Kamihigashi and Cuong Le Van. Necessary and sufficient conditions for a solution of the Bellman equation to be the value function: A general principle. Documents de travail du Centre d'Economie de la Sorbonne 2015.07, 2015. ISSN 1955-611X.

Peter Latham. Bellman's equation has a unique solution, 2008. URL http://www.gatsby.ucl.ac.uk/~mandana/RLRG/bellman_converges.pdf.

Wee S. Lee, Nan Rong, and David Hsu. What makes some POMDP problems easy to approximate? In Advances in Neural Information Processing Systems, pp. 689–696, 2008.

Jason Levesley, Cem Salp, and Sanju L. Velani. On a problem of K. Mahler: Diophantine approximation and Cantor sets. Mathematische Annalen, 338(1):97–118, 2007.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Sungsu Lim, Ajin Joseph, Lei Le, Yangchen Pan, and Martha White. Actor-expert: A framework for using action-value methods in continuous action spaces. arXiv preprint arXiv:1810.09103, 2018.

Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 394–402. Morgan Kaufmann Publishers Inc., 1995.

Christopher Lusena, Judy Goldsmith, and Martin Mundhenk. Nonapproximability results for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14:83–103, 2001.

Benoit B. Mandelbrot. Self-affine fractals and fractal dimension. Physica Scripta, 32(4):257, 1985.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.
Nature, 518(7540):529 533, 2015. Yangchen Pan, Hengshuai Yao, Amir-massoud Farahmand, and Martha White. Hill climbing on value estimates for search-control in dyna. ar Xiv preprint ar Xiv:1906.07791, 2019. Christos H Papadimitriou and John N Tsitsiklis. The complexity of markov decision processes. Mathematics of operations research, 12(3):441 450, 1987. MJ Pelling. Formulae for the arc-length of a curve in rn. The American Mathematical Monthly, 84 (6):465 467, 1977. Wacław Sierpi nski. Sur l equation fonctionnelle f (x+ y)= f (x)+ f (y). Fundamenta Mathematicae, 1(1):116 122, 1920a. Wacław Sierpi nski. Sur les fonctions convexes mesurables. Fundamenta Mathematicae, 1(1):125 128, 1920b. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Csaba Szepesv ari. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1 103, 2010. Esther Thelen and Linda B Smith. Dynamic systems theories. Handbook of child psychology, 1998. Yu V Tikhonov. On the rate of approximation of singular functions by step functions. Mathematical Notes, 95(3-4):530 543, 2014. Published as a conference paper at ICLR 2020 Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279 292, 1992. Shangtong Zhang. Python implementation of Reinforcement learning: An introduction, 2019. URL https://github.com/Shangtong Zhang/ reinforcement-learning-an-introduction. A.1 ANALYSIS OF THE GAMBLER S PROBLEM In this section we show that v(s) defined below is a unique solution of the system (ABX). Since the optimal state-value function must satisfies the system (ABX), v(s) is the optimal state-value function of the Gambler s problem. This statement is rigorously proved in Theorem 12. Let 0 γ 1 and p > 0.5. We define v(1) = 1, and i=1 (1 p)γibi j=1 ((1 p) + (2p 1)bj) (1) for 0 s < 1, where s = 0.b1b2 . . . bℓ. . .(2) is the binary representation of s. It is obvious that the series converges for any 0 s < 1. The notation v(s) will always refer to the definition above in this paper and will not change with the context. We show later that this v(s) is the optimal value function of the problem. We use the notation f(s) to denote a general solution of a system, which varies according to the required properties. Let the set of dyadic rationals Gℓ= k2 ℓ| k 1, . . . , 2ℓ 1 (3) such that Gℓis the set of numbers that can be represented by at most ℓbinary bits. The general idea to verify the Bellman equation (AB) is to prove v(s) = max a Gℓ A(s) (1 p)γ v(s + a) + pγ v(s a) for any s Gℓ by induction of ℓ= 1, 2, . . . , and generalize this optimality to the entire interval s (0, 1). It then suffices to show the uniqueness of v(s) that solves the system (ABX). This is proved by assuming the existence of a solution f(s) and derive the identity f(s) = v(s), condition on the Bellman property that v(s) satisfies (AB). For presentation purposes, the uniqueness is discussed first. As an overview, Lemma 2, 3, and 4 describe the system (ABX). Claim 5, Lemma 6, 8, and 9 describe the properties of v(s). Lemma 2 (Uniqueness under existence). Let f(s) : [0, 1] R be a real function. If v(s) and f(s) both satisfy (ABX), then v(s) = f(s) for all 0 s 1. Proof. We proof the lemma by contradiction. Assume that f(s) is also a solution of the system such that f(s) is not identical with v(s) at some s. Define δ = sup0 δ ϵ. Let a0 = min{s0, 1 s0}, we have v(s0) = (1 p)γ v(s0 a0) + pγ v(s0 + a0) (1 p)γ f(s0 a0) + pγ f(s0 + a0) + pγ δ f(s0) + pγ δ. 
Published as a conference paper at ICLR 2020 The above inequality is due to at least one of s0 a0 = 0 and s0 + a0 = 1 must hold. Thus at least one of v(s0 a0) f(s0 a0) and v(s0 +a0) f(s0 +a0) must be zero. The inequality contradicts v(s0) f(s0) > δ ϵ. Hence, δ cannot be zero. We discuss under δ > 0 for the rest of the proof. Case (I): γ < 1. In this case, we choose ϵ = (1 γ)δ. By the definition of δ we specify s0 such that f(s0) > v(s0) + δ ϵ. In fact, the existence of s0 is by the condition γ < 1. Let a0 argmaxa A(s0) pγ f(s0 a) + (1 p)γ f(s0 + a), we have f(s0) = pγ f(s0 a0) + (1 p)γ f(s0 + a0) pγ (v(s0 a0) + δ) + (1 p)γ (v(s0 + a0) + δ) = pγ v(s0 a0) + (1 p)γ v(s0 + a0) + γδ v(s0) + δ ϵ. The inequality f(s0) v(s0) + δ ϵ contradicts f(s0) > v(s0) + δ ϵ. Hence, the lemma is proved for the case γ < 1. Case (II): γ = 1. When there exists an s such that f(s ) v(s ) = δ, we show the contradiction. Let S = {s|f(s) v(s) = δ, 0 < s < 1} = . For any s S and a argmaxa A(s ) pγ f(s a) + (1 p)γ f(s + a), we have f(s ) = pγ f(s a ) + (1 p)γ f(s + a ) = p f(s a ) + (1 p) f(s + a ) ( ) p (v(s a ) + δ) + (1 p) (v(s + a ) + δ) = p v(s a ) + (1 p) v(s + a ) + δ ( ) v(s ) + δ. Thus, the equality in ( ) and ( ) must hold. We specify s0 S, and by the equality of ( ) we have f(s0 a0) = v(s0 a0) + δ, thus s0 a0 S. Let s1 = s0 a0, and we recursively specify an arbitrary at argmaxa A(st)(1 p)γ f(st + a) + pγ f(st a) and st+1 = st at, for t = 1, 2, . . . , until s T = 0 for some T, or indefinitely if such an s T does not exist. If s T exists and the sequence {st} terminates at s T = 0, then f(s T ) = v(s T ) + δ = δ by ( ), which contradicts the boundary condition f(s T ) = f(0) = 0. We desire to show the existence of T. When there exists t and ℓsuch that st Gℓ, by Corollary 7 we have st+1 Gℓand inductively st Gℓfor all t t. Consider that {st} is strictly decreasing and there are finite many elements in Gℓ, {st} cannot be infinite. Otherwise st / Gℓfor any t, ℓ 1. Then by Corollary 10 the uniqueness of optimal action we have st+1 = 2st 1 if st 1 2, and st+1 = 0 if st 1 2. After finite many steps of st+1 = 2st 1 we will have st = 0 for some t. It amounts to show the existence of s such that f(s ) v(s ) = δ. By Lemma 9 we have the continuity of v(s). Lemma 3 indicates the monotonicity of f(s) on [0, 1). The upper bound f(s) f(1) in (X) extends this monotonicity to the closed interval [0, 1]. Then by Lemma 4 we have the continuity of f(s) on (0, 1]. By (X) this extends to [0, 1]. Thus we have the continuity of f(s) v(s), and consequently the existence of max0 s 1 f(s ) v(s ). As f(0) v(0) = f(1) v(1) = 0 and δ > 0, this maximum must be attained at some s (0, 1). Therefore we have the existence of max0 0.5. If a real function f(s) satisfies (AB) then f(s) is monotonically increasing on [0, 1). Proof. We prove the claim by contradiction. Assume that there exists s1 < s2 where f(s1) > f(s2). Denote s = s2 s1 > 0 and f = f(s1) f(s2) > 0. By induction we have f(s2 2 ℓ s) f(s2) pℓ f for an arbitrary integer ℓ 1. Then when s2 + 2 ℓ s < 1, by f(s2) pf(s2 2 ℓ s) + (1 p)f(s2 + 2 ℓ s), f(s2 + 2 ℓ s) 1 1 pf(s2) p 1 pf(s2 2 ℓ s) Published as a conference paper at ICLR 2020 = f(s2) + p 1 p(f(s2) f(s2 2 ℓ s)) f(s2) + f(s2) f(s2 2 ℓ s). This concludes f(s2 + 2 ℓ s) f(s2) f(s2) f(s2 2 ℓ s). By induction we have f(s2 + k2 ℓ s) f(s2 + (k 1)2 ℓ s) f(s2 + (k 1)2 ℓ s) f(s2 (k 2)2 ℓ s) for k = 1, 2, . . . , when s2 + k2 ℓ s < 1. 
We sum this inequality over k and get f(s2 + k2 ℓ s) f(s2) k(f(s2) f(s2 2 ℓ s)) By letting k = 2n, ℓ= n + n0, s2 + 2 n0 s < 1, and n + , we have s2 + k2 ℓ s < 1 and kpℓ f . The arbitrarity of n indicates the non-existence of f(s2 + k2 ℓ s), which contradicts the existence of the solution f( ). Lemma 4 (Continuity). Let γ = 1 and p 0.5. If a real function f(s) is monotonically increasing on (0, 1] and it satisfies (AB), then f(s) is continuous on (0, 1]. Proof. We show the continuity by contradiction. Suppose that there exists a point s (0, 1) such that f(s) is discontinuous at s , then there exists ϵ, δ > 0 where f(s + ϵ1) f(s ϵ2) δ for any ϵ1 + ϵ2 = ϵ. Then, by 4ϵ) p f(s ϵ) + (1 p) f(s + 2 4ϵ) f(s ϵ) (1 p)δ/p. Similarly, for k = 1, 2, . . . , 4k ϵ) f(s 1 4k 1 ϵ) (1 p)δ/p. Let k > ((1 p)δ/p) 1, we have f(s 1 4k ϵ) f(s ϵ) 1. This contradict with the fact that f(s) is bounded between 0 and 1. The continuity follows on (0, 1). If the function is discontinuous on s = 1, then there exists ϵ, δ > 0 where f(1) f(1 ϵ1) δ for any ϵ1 ϵ. The same argument holds by observing f(1 1 2k 1 ϵ) p f(1 1 2k ϵ) f(1). The lemma follows. When f(s) is only required to be monotonically increasing on (0, 1), the continuity still holds but only on (0, 1). For simplicity define Qv(s, a) = pγ v(s a) + (1 p)γ v(s + a). (4) As v(s) is the optimal state-value function (to be proved later in Theorem 12), Qv(s, a) is in fact the optimal action-value function (Sutton & Barto, 2018; Szepesv ari, 2010). Recall that Gℓis the set of dyadic rationals {k2 ℓ| k {1, . . . , 2ℓ 1}}. Claim 5. For any s = 0.b1b2 . . . bℓ(2) Gℓ {0}, v s + 2 (ℓ+1) v(s) = (1 p)γℓ+1 ℓ Y j=1 ((1 p) + (2p 1)bj) (1 p)pℓγℓ+1. (5) Published as a conference paper at ICLR 2020 For any s = 0.b1b2 . . . bk(2) Gℓwith bk = 1 and 1 k ℓ, v(s) v s 2 (ℓ+1) pℓ k+1(1 p)γℓ+1 k 1 Y j=1 ((1 p) + (2p 1)bj) . (6) Also, v(1) v 1 2 (ℓ+1) pℓ+1γℓ+1. (7) The equality of (6) and (7) holds if and only if γ = 1. Proof. Inequality (5) and (7) are obtained by the definition of v(s). To derive inequality (6), denote k = max {1 i ℓ: bi = 0} and then v(s) v s 2 (ℓ+1) = (1 p)γk k 1 Y j=1 ((1 p) + (2p 1)bj) i=k+1 (1 p)γi 1 j=1 ((1 p) + (2p 1)bj) (1 p) = (1 p)γk k 1 Y j=1 ((1 p) + (2p 1)bj) i=k+1 γi kpi k 1 ! = (1 p)γk k 1 Y j=1 ((1 p) + (2p 1)bj) 1 (1 p)γ 1 (γp)ℓ k+1 (1 p)γk k 1 Y j=1 ((1 p) + (2p 1)bj) 1 1 (γp)ℓ k+1 = (1 p)pℓ k+1γℓ+1 k 1 Y j=1 ((1 p) + (2p 1)bj). The arguments are due to the fact that s 2 (ℓ+1) = 0.b1b2 . . . bk 10k1k+1 . . . 1ℓ+1(2) and the inequality is by (1 p)γ 1 γp. Lemma 6. Let ℓ 1, 0 < γ 1, 0.5 p < 1. For any s Gℓ, max a (Gℓ+1\Gℓ) A(s) Qv(s, a) max a Gℓ A(s) Qv(s, a). Proof. Case (I): First we prove that for ℓ> 1, any s Gℓ, a Gℓ A(s), a > 2 ℓ, and s+a < 1, Qv s, a 2 (ℓ+1) max Qv(s, a), Qv s, a 2 ℓ . Note that in this case, a 2 (ℓ+1) Gℓ+1 A(s) and a 2 ℓ Gℓ A(s). Let s a = 0.c1c2 . . . cℓ(2) = 0.c1c2 . . . 0k1k+1 . . . 1ℓ(2), where k = max {1 i ℓ: ci = 0} is the index of the last 0 bit in s a. Such k must exist since 0 s a 1 3 2 ℓ< 1. Similarly, let s + a = 0.d1d2 . . . dℓ(2) = 0.d1d2 . . . 1k (2) where k = max {1 i ℓ: di = 1} is the index of the last 1 bit in s + a. Such k must exist since 3 2 ℓ s + a < 1. Also, s + a 2 (ℓ+1) = 0.d1d2 . . . 0k 1k +1 . . . 1ℓ+1(2). To prove Qv s, a 2 (ℓ+1) Qv(s, a), it is equivalent to prove v(s + a) v s + a 2 (ℓ+1) p 1 p v s a + 2 (ℓ+1) v(s a) . 
Published as a conference paper at ICLR 2020 Then by applying inequality (6), (7) and inequality (5) in Claim 5 on the LHS and RHS respectively, it suffices to prove j=1 ((1 p) + (2p 1)dj) j=1 ((1 p) + (2p 1)cj) = pℓ k(1 p) j=1 ((1 p) + (2p 1)cj). Let Mc = c1+ +ck 1, Md = d1+ +dk 1 be the number of 1s in {c1, . . . , ck} , {d1, . . . , dk } respectively. Then Qv s, a 2 (ℓ+1) Qv(s, a) holds when p = 0.5 or p > 0.5, Mc + k Mb + k . To prove Qv s, a 2 (ℓ+1) Qv s, a 2 ℓ , it is equivalent to prove v s a + 2 ℓ v s a + 2 (ℓ+1) 1 p v s + a 2 (ℓ+1) v s + a 2 ℓ . Note that s a + 2 ℓ= 0.c1c2 . . . 1k(2) and s + a 2 ℓ= 0.d1d2 . . . 0k 1k +1 . . . 1ℓ(2). Then by inequality (6), (7) and inequality (5) on the LHS and RHS respectively, it suffices to prove pℓ k+2 k 1 Y j=1 ((1 p) + (2p 1)cj) (1 p)2pℓ k k 1 Y j=1 ((1 p) + (2p 1)dj). Then Qv s, a 2 (ℓ+1) Qv s, a 2 ℓ holds when p = 0.5 or p > 0.5, Mc+k +2 Md+k. As at least one of Mc+k +1 Md+k and Md+k Mc+k +1 holds, thus Qv s, a 2 (ℓ+1) < max Qv(s, a), Qv s, a 2 ℓ . We cover two corner cases for the completeness of the proof. Case (II): Next we prove for ℓ 1, any s Gℓ, a Gℓ A(s) and s + a = 1, Qv s, a 2 (ℓ+1) Qv(s, a). Similar to above, it is equivalent prove v(1) v 1 2 (ℓ+1) p 1 p v s a + 2 (ℓ+1) v(s a) . Note that by Claim 5, v(1) v 1 2 (ℓ+1) pℓ+1γℓ+1 = p 1 p (1 p)pℓγℓ+1 v s a + 2 (ℓ+1) v(s a) , which concludes the proof. Case (III): Last we prove for ℓ> 1, any s Gℓand a = 2 ℓ, s < 1 2 ℓ. When s = 0.b1b2 . . . 0m1m+1 . . . 1ℓ(2) with 1 m < ℓ, to prove Qv s, 2 (ℓ+1) Qv s, 2 ℓ , it is equivalent to prove v s + 2 ℓ v s + 2 (ℓ+1) p 1 p v s a + 2 (ℓ+1) v(s a) . In this case, s + 2 ℓ= 0.b1b2 . . . 1m(2), s 2 ℓ= 0.b1b2 . . . 0m1m+1 . . . 1ℓ 1(2) and Mc = Md, k = k = m, thus Md + k Mc + k , which concludes the proof similar to the first part of the (I) case. When s = 0.b1b2 . . . 1m 0m +1 . . . 0ℓ(2) with 1 m < ℓ, let Mb = b1 + . . . bm . Then, Qv s, 2 m Qv s, 2 (ℓ+1) Published as a conference paper at ICLR 2020 = (1 p)γ(v(s + 2 m ) v(s 2 m )) (1 p)γ(v(s) v(s 2 m )) (1 p)γ(v(s + 2 ℓ) v(s)) pγ(v(s 2 ℓ) v(s 2 m )) (1 p)γ(pγ)Mb((1 p)γ)m 2 Mb(1 p)γ (1 p)γ(pγ)Mb((1 p)γ)m 1 Mb(1 p)γ (1 p)γ(pγ)Mb+1((1 p)γ)ℓ 2 Mb(1 p)γ pγ(pγ)Mb((1 p)γ)m Mb(1 + (pγ) + . . . (pγ)ℓ m 1)(1 p)γ = p Mb(1 p)m Mbγm p Mb(1 p)m +1 Mbγm +1 p Mb+1(1 p)ℓ Mbγℓ+1 p Mb+1(1 p)m +1 Mbγm +2(1 (pγ)ℓ m )/(1 pγ) p Mb(1 p)m Mbγm p Mb(1 p)m +1 Mbγm +1 p Mb+1(1 p)ℓ Mbγℓ+1 p Mb+1(1 p)m Mbγm +1(1 (pγ)ℓ m ) p Mb+1(1 p)ℓ Mbγℓ+1 p Mb+1(1 p)m Mbγm +1( (pγ)ℓ m ) 0. The arguments in the proof that either Mc + k Md + k + 1 or Md + k Mc + k must hold is tight for integers Mc and Md. This is the case for a Gℓ+1 \ Gℓ. When a / Gℓ+1, this sufficient condition becomes even looser. The lemma imposes Gℓto be the only set of possible optimal actions, given s Gℓ. Corollary 7. Let ℓ 1. For any s Gℓ, argmax a A(s) Qv(s, a) Gℓ. Now we verify the Bellman property on S ℓ 1 Gℓ. Lemma 8. Let ℓ 1. For any s Gℓ+1, min{s, 1 s} argmax a Gℓ+1 A(s) Qv(s, a). Proof. We prove the lemma by induction over ℓ. When ℓ= 1, it is obvious since G1 has only one element. The base case ℓ= 2 is also immediate by exhausting a {2 1, 2 2} for s = 2 1. Now we assume that for any s Gℓ, min{s, 1 s} argmaxa Gℓ A(s) Qv(s, a). We aim to prove this lemma for ℓ+ 1. For s Gℓ, by Lemma 6, argmaxa Gℓ+1 A(s) Qv(s, a) Gℓ. Then by the induction assumption, min{s, 1 s} argmaxa Gℓ A(s) Qv(s, a) argmaxa Gℓ+1 A(s) Qv(s, a). Hence, the lemma holds for s Gℓ. We discuss under s Gℓ+1 \ Gℓfor the rest of the proof. 
We start with two inductive properties of v(s) to reduce the problem from s Gℓ+1 to s Gℓ, where s is either 2s or 2s 1. For any s 2 1, that is, s = 0.c1c2 . . . cℓ+1(2) Gℓ+1 with c1 = 1, i=1 (1 p)γici j=1 ((1 p) + (2p 1)cj) i=2 (1 p)γici j=1 ((1 p) + (2p 1)cj) i=1 (1 p)γi+1ci+1((1 p) + (2p 1)c1) j=1 ((1 p) + (2p 1)cj+1) = (1 p)γ + pγ v(0.c2 . . . cℓ+1(2)) = (1 p)γ + pγ v(2s 1). Similarly, for any s < 2 1, that is, s = 0.c1c2 . . . cℓ+1(2) Gℓ+1 with c1 = 0, i=1 (1 p)2γi+1ci+1 j=1 ((1 p) + (2p 1)cj+1) Published as a conference paper at ICLR 2020 = (1 p)γv(2s). Armed with the properties, we split the discussion into four cases 2 1 + 2 2 s < 1, 2 1 s < 2 1 + 2 2, 2 1 2 2 < s < 2 1, and 0 < s 2 1 2 2. When s 2 1 + 2 2, As a 1 s, we have s a 2 1 and s + a 2 1. Hence, the first bit after the decimal of s a and s + a is 1. Hence, Qv(s, a) = pγ v(s a) + (1 p)γ v(s + a) = (1 p)γ2 + pγ (pγ v(2s 2a 1)) + (1 p)γ v(2s + 2a 1) = (1 p)γ2 + pγ (pγ v((2s 1) 2a) + (1 p)γ v((2s 1) + 2a)) = (1 p)γ2 + pγ Qv(2s 1, 2a). As 2s 1 Gℓand 2a Gℓ, by the induction assumption the maximum of Qv(2s 1, 2a) is obtained at a = 1 s. Hence, 1 s argmaxa Gℓ+1 A(s) Qv(s, a) as desired. When 2 1 s < 2 1 + 2 2, if s a 2 1, then first bit after the decimal of s a and s + a is 1 and the lemma follows the same arguments as the above case. Otherwise, if s a < 2 1, we have Qv(s, a) = pγ v(s a) + (1 p)γ v(s + a) = (1 p)2γ2 + p(1 p)γ2 v(2s 2a) + p(1 p)γ2 v(2s + 2a 1) = (1 p)γ (pγ v((2s 2 1) (2a 2 1)) + (1 p)γ v((2s 2 1) + (2a 2 1))) + (1 p)(2p 1)γ2 v(2s + 2a 1) + (1 p)2γ2 = (1 p)γ Qv(2s 2 1, 2a 2 1) + (1 p)(2p 1)γ2 v(2s + 2a 1) + (1 p)2γ2. As 2s 2 1 Gℓand 2a 2 1 Gℓwhenever l 2, by the induction assumption Qv(2s 2 1, 2a 2 1) obtains its maximum at a = 1 s. By Claim 5, v(s) is monotonically increasing on Gℓfor any ℓ 2. Hence, v(2s + 2a 1) obtains the maximum at the maximum feasible a, which is a = 1 s. Since both terms takes their respective maximum at a = 1 s, we conclude that 1 s argmaxa Gℓ+1 A(s) Qv(s, a) as desired. The other two cases, 2 1 2 2 < s < 2 1 and 0 < s 2 1 2 2, follow similar arguments. The lemma follows. Lemma 9. Both v(s) and v (s) = maxa A(s) Qv(s, a) are continuous at s if there does not exist an ℓsuch that s Gℓ. Proof. We first proof the continuity of v(s). For s = b1b2 . . . bℓ. . .(2), s / Gℓindicates that for any integer N there exists n1 N such that bn1 = 1 and n0 N such that bn0 = 0. The monotonicity of v(s) is obvious from Equation (1) that flipping a 0 bit to a 1 bit will always yield a greater value. For any s 2 N s s + 2 N, we specify n1 and n0 such that s 2 n1 s s + 2 n0. By the monotonicity of v(s) we have v(s) v(s ) v(s) v(s 2 n1) = (1 p)γn1 n1 1 Y j=1 ((1 p) + (2p 1)bj) (1 + i=n1+1 γi n1bip j=n1+1 ((1 p) + (2p 1)bj)) (1 p)γn1 n1 1 Y j=1 ((1 p) + (2p 1)bj) i=n1+1 γi n1bi(1 p) j=n1+1 ((1 p) + (2p 1)bj) = (1 p)γn1 n1 1 Y j=1 ((1 p) + (2p 1)bj) (1 i=n1+1 γi n1bi(2p 1) j=n1+1 ((1 p) + (2p 1)bj)) (1 p)γn1pn1 1 (1 + i=n1+1 γi n1(2p 1)pn1 i 1) Published as a conference paper at ICLR 2020 2(1 p)γNp N 1. And similarly, v(s) v(s ) v(s) v(s + 2 n0) (1 p)γn0pn0 1 (1 + i=n0+1 γi n0(2p 1)pn0 i 1) 2(1 p)γNp N 1. Hence, |v(s) v(s )| is bounded by 2(1 p)γNp N 1 for s 2 N s s + 2 N. As 2(1 p)γNp N 1 converges to zero when N approaches infinity, v(s) is continuous as desired. We then show the continuity of v (s) = maxa A(s) Qv(s, a). We first argue that v (s) is monotonically increasing. In fact, for s s and 0 < a min{s, 1 s}, either 0 < a min{s , 1 s } or 0 < a + s s min{s , 1 s } must be satisfied. 
Therefore a A(s) indicates at least one of a A(s ) and a + s s A(s ). Let a be a or a + s s whoever is in A(s ), we have both s + a s + a and s a s a. Specify a such that v (s) = Qv(s, a), we have v (s ) Qv(s , a ) v (s). The monotonicity follows. Let s = b1b2 . . . bℓ. . .(2). Similarly, for any N, specify n1 N such that bn1 = 1 and n0 N + 2 such that bn0 = 0. Also let s0 = b1b2 . . . b N (2). Then for the neighbourhood set s0 2 (N+1) s s0 + 2 (N+1), v (s) = v(s) for both the ends of the interval s0 2 (N+1), s0 + 2 (N+1) GN+1. |v (s) v (s )| is then bounded by |v(s0 2 (N+1)) v(s0+2 (N+1))|. According to Claim 5, this value converges to zero when N approaches infinity. The continuity of v (s) follows. The continuity of v(s) extends to the dyadic rationals S ℓ 1 Gℓwhen γ = 1, which means that v(s) is continuous everywhere on [0, 1] under γ = 1. It worth note that similar to the Cantor function, v(s) is not absolutely continuous. In fact, v(s) shares more common properties with the Cantor function, as they both have a derivative of zero almost everywhere, both their value go from 0 to 1, and their range is every value in between of 0 and 1. The continuity of v (s) = maxa A(s) Qv(s, a) indicates that the optimal action is uniquely min{s, 1 s} on s / Gℓ. This optimal action agrees with the optimal action we specified on s Gℓ in Lemma 8, which makes π(s) = min{s, 1 s} an optimal policy for every state (condition on that v(s) is the optimal value function, which will be proved later). Corollary 10. If s / Gℓfor any ℓ 1, argmax a A(s) Qv(s, a) = {min {s, 1 s}} . Lemma 11. v(s) is the unique solution of the system (ABX). Proof. Let v (s) = maxa A(s) Qv(s, a). As per Lemma 8 we have v(s) = v (s) on the dyadic rationals S ℓ 1 Gℓ. Since S ℓ 1 Gℓis dense and compact on (0, 1), v(s) = v (s) holds whenever both v(s) and v (s) are continuous at s. By Lemma 9 v(s) and v (s) are continuous for any s if there does not exist an ℓ 1 such that s Gℓ, which then indicates v(s) = v (s) on the complement of S ℓ 1 Gℓ. Therefore v(s) = v (s) is satisfied on (0, 1), which verifies the Bellman property (AB). The boundary conditions (X) holds obviously. Finally as per Lemma 2, v(s) is the unique solution to the system of Bellman equation and the boundary conditions. Theorem 12. Let 0 γ 1 and p > 0.5. Under the continuous setting of the Gambler s problem, the optimal state-value function is v(1) = 1 and v(s) defined in Equation (1) for 0 s < 1. Proof. As the optimal state-value function must satisfy the system (ABX) and v(s) is the unique solution to the system, v(s) is the optimal state-value function. Published as a conference paper at ICLR 2020 Corollary 13. The policy π(s) = min{s, 1 s} is (Blackwell) optimal. It is worth noting that when γ = 1 and s Gℓ\ Gℓ 1 for some ℓ, then π (s) = 2 ℓis also an optimal policy at s. Theorem 12 and the proof of Lemma 8 and also induce the following statement that the optimal value function v(s) is fractal and self-similar. The derivation of the corollary is in the introduction. Corollary 14. The curve of the value function v(s) on the interval [k2 ℓ, (k + 1)2 ℓ] is similar (in geometry) to the curve of v(s) itself on [0, 1], for any integer ℓ 1 and 0 k 2ℓ 1. Some other notable facts about v(s) are as below: Fact 15. The expectation Z 1 0 v(s)ds = (1 p)γ = v(1 Fact 16. The derivative lim s 0+ v(s + s) s = 0, lim s 0 v(s + s) ( + , if s = 0 or s S ℓ 1 Gℓ, 0, otherwise. (8) Fact 17. The length of the arc y = v(s), 0 s 1 equals 2. In fact, any singular function (zero derivative a.e.) 
has an arc length of 2 (Pelling, 1977) if it go from (0, 0) to (1, 1) monotonically. This can be intuitively understood as that the curve either goes horizontal, when the derivative is zero, or vertical, when the derivative is infinity. Therefore the arc length is the Manhattan distance between (0, 0) and (1, 1), which equals 2. argmin 0 s 1 v(s) s = {2 It is natural that by the fractal characterization of v(s) the approximation must be inexact. The following two propositions give quantitative lower bounds on such approximation errors. Proposition 19. When N N+, N 4 is a power of 2, let v1(s) be piecewise constant on any of the intervals s (k/N, (k + 1)/N), k = 0, . . . , N 1, then Z s |v(s) v1(s)| ds 1 N (2 γ)(1 p)γ 1 pγ + o( 1 Proof. When N is a power of 2, for k {0, . . . , N 1}, the curve of v(s) on each of the intervals (k/N, (k + 1)/N) is self-similar to v(s) itself on (0, 1). We consider this segment of the curve. By Equation (2), v(s) = v( s) + γℓ ℓ Y j=1 ((1 p) + (2p 1)bj) v(2ℓ(s s)), where ℓ= log2 N, s = k/N and s = 0.b1b2 . . . bℓ. . .(2) ( s, s + 1 Let s = 1/N, y = v((k + 1)/N) v(k/N), we have for 0 < s < s v( s + s) = v( s) + v(s/ s) y. As v(s) is monotonically increasing on every interval (k/N, (k + 1)/N), the minimum over y Z s+ s s= s |v(s) y| ds is obtained when y s = v( s + 1 2 s) (intuitively, the median of v(s) on the interval). This results in an approximation error of s= s |v(s) y| ds Published as a conference paper at ICLR 2020 s= s v( s + 1 2 s) v(s) ds + Z s+ s 2 s v(s) v( s + 1 s= s v(s) ds + Z s+ s 2 s v(s) ds s=0 v( s) + (v( s + 1 2 s) v( s))v(s) ds s=0 v( s + 1 2 s) + (v( s + s) v( s + 1 2 s))v(s) ds 2 s((1 (1 p)γ)(v( s + 1 2 s) v( s)) + (1 p)γ(v( s + s) v( s + 1 This error is then summed over s = 0, 1/N, . . . , (N 1)/N such that s= s |v(s) y s| ds 1 2N ((1 (1 p)γ)v(N 1 2 N ) + (1 p)γ(1 v( 1 = 1 2N ((1 (1 p)γ)(1 p)γ 1 pγ (1 (pγ)log2 N+1) + (1 p)γ(1 ((1 p)γ)log2 N+1)) = N 1 (2 γ)(1 p)γ 1 pγ N 1+log2 pγ (1 (1 p)γ)p(1 p)γ2 N 1+log2(1 p)γ (1 p)2γ2 N (2 γ)(1 p)γ 1 pγ + o( 1 An error bound in O(1/N) can be generated to any integer N, as we can relax N to 2 log2 N 1 so that at least one self-similar segment of size 1/N is included in each interval. For Lipschitz continuous functions like neural networks, the following proposition shows an approximation error lower bound in O(1/L), where L is the Lipschitz constant. Proposition 20. Let L (1 p)γ(1 γ)/(1 pγ). If v2(s) is Lipschitz continuous on s (0, 1) where L is the Lipschitz constant, then Z s |v(s) v2(s)| ds 1 L (1 p)2γ2(1 γ)2 Proof. We consider v( 1 2) = (1 p)γ and 2 v(s) = (1 p)2γ2 When 0 < γ < 1, we have 2 v(s) = (1 p)γ(1 γ) Denote h = (1 p)γ(1 γ)/(1 pγ). By the monotonicity of v(s), using v2(s) to approximate v(s) has an error at least R s |ξ(s) v2(s)| ds, where ξ(s) denotes the step function on [0, 1], ξ(s) = 0 0 s < 1 2, h 1 2 s 1. In this case, the optimal v2(s) is 2 h 2L, h 2 + L(s 1 2) 1 2 h 2L s 1 2 + h 2L, h 1 2 + h 2L < s 1. Published as a conference paper at ICLR 2020 Hence, we have Z s |v(s) v2(s)| ds Z s |ξ(s) v2(s)| ds h2 as desired. A.2 ANALYSIS OF THE BELLMAN EQUATION We have proved that v(s) is the optimal value function in Theorem 12, by showing the existence and uniqueness of the solution of the system (ABX). However, the condition (X) is derived from the context of the Gambler s problem. It is rigorous enough to find the optimal value function, but we are also interested in solutions purely derived from the MDP setting. 
Also, algorithmic approaches such as Q-learning (Watkins & Dayan, 1992; Baird, 1995; Mnih et al., 2015) optimize the MDP by solving the Bellman equation, without eliciting the context of the problem. Studying such systems will help to understand the behavior of these algorithms. In this section, we inspect the system of Bellman equation (AB) of the Gambler s problem. We aim to solve the general case (ABY) where p > 0.5 and the corner case (ABZ) where p = 0.5. When p > 0.5, the value function v(s) is obviously still a solution of the system (ABY) without condition (X). The natural question is if there exist any other solutions. The answer is two-fold: When γ < 1, f(s) = v(s) is unique; when γ = 1, the solution is either v(s) or a constant function at least 1. This indicates that algorithms like Q-learning have constant functions as their set of converging points, apart from v(s). As v(s) is harder to approximate due to the non-smoothness, a constant function in fact induces a smaller approximation error and thus has a better optimality for Q-learning with function approximation. It is immediate to generate this result to general MDPs, as function of a large constant solves MDPs with episodic rewards. This indicates that Q-learning may have more than one converging points and may diverge from the optimal value function under γ = 1. This leads to the need of γ, which is artificially introduced and biases the learning objective. More generally, the Bellman equation may have a continuum of finite solutions in an infinite state space, even with γ < 1. Some studies exist on the necessary and sufficient conditions for a solution of the Bellman equation to be the value function (Kamihigashi & Le Van, 2015; Latham, 2008; Harmon & Baird III, 1996), though the majority of this topic remains open. The discussions above are supported by a series of rigorous statements. We begin with the following proposition that when the discount factor is strictly less than 1, the solution toward the Bellman equation is the optimal value function. Proposition 21. When γ < 1, v(s) is the unique solution of the system (ABY). Proof. The uniqueness has been shown in Lemma 2 for the system (ABX). When γ < 1 it corresponds to case (I), where neither the upper bound f(s) 1 nor the continuity at s = 0 in condition (X) is used. Therefore Lemma 2 holds for (ABY) under γ < 1, so follows Lemma 11 the uniqueness as desired. This uniqueness no longer holds under γ = 1. Theorem 22. Let γ = 1, p > 0.5, and f( ) be a real function on [0, 1]. f(s) satisfies the Bellman equation (ABY) if and only if either f(s) is v(s) defined in Theorem 12, or f(0) = 0, f(1) = 1, and f(s) = C for all 0 < s < 1, for some constant C 1. Proof. It is obvious that both f(s) defined above are the solutions of the system. It amounts to show that they are the only solutions. Without the bound condition (X), the function f(s) is not necessarily continuous on s = 0 and s = 1 and is not necessarily monotonic on s = 1. Therefore the same arguments in the proof of Lemma 2 will not hold. However, the arguments can be extended to (Y) by considering the limit of f(s) when s approaches 0 and 1. Published as a conference paper at ICLR 2020 By Lemma 4 the function is continuous on the open interval (0, 1). Let C0 = lim s 0+ f(s), C1 = lim s 1 f(s). Then by Lemma 3, 0 C0 f(s) C1 for s (0, 1). Here we eliminate the possibility of C0 = + and C1 = + . 
This is because if there is a sequence of st 0 such that f(st) > t, then we have f( 1 2) p f(st) + (1 p) f(1 st) (1 p)t for any t. Then f( 1 2) does not exist. Similar arguments show that C1 cannot be + . Now specify a sequence at 1 2, such that C0 f( 1 2 at) C0 + 1 2 + at) C1. Then we have 2 at) + (1 p) f(1 p C0 + (1 p) C1 1 As t is arbitrary we have f( 1 2) p C0 + (1 p)C1. By induction on ℓit holds on s S f(s) C0 + (C1 C0)v(s). By Lemma 4 the continuity of f(s) and v(s) under γ = 1, this lower bound extends beyond the dyadic rationals to the entire interval (0, 1). Define f(s) = C0 + (C1 C0)v(s) for s (0, 1), f(0) = C0, f(1) = C1. It is immediate to verify that for any C1 > C0 0, f(s) solves the system (AY) (without (B) the boundary conditions). If C1 C0 = 0, by Lemma 2 Case (II) f(s) on (0, 1) is the unique solution of the system (AY), given monotonicity, continuity, and the lower bound f(s). With the boundary conditions (B), we have 0 = f(0) = C0 and 1 = f(1) = C1, therefore f(s) = v(s). This case C1 C0 = 0 induces the first possible solution. It amounts to determine f(s) when C1 C0 = 0, that is, when 0 C0 = f(s) = C1 for s (0, 1) (by Lemma 3). It suffices to prove that C0 < 1 is not possible. In fact, if C0 < 1 then f( 3 4) < p f( 1 2)+(1 p)f(1), which contradicts (A). Then, f(s) = C0 for some C0 1 is the only set of solutions when C1 C0 = 0, as desired. The fact that a large constant function can also be a solution toward the Bellman equation can be extended to a wide range of MDPs. The below proposition lists one of the sufficient conditions but even without this condition it is likely to hold in practice. Proposition 23. For an arbitrary MDP with episodic rewards where every state has an action to transit to a non-terminal state almost surely, f(s) = C for all non-terminal states s is a solution of the Bellman equation system for any C greater or equal to the maximum one-step reward. Proof. The statement is immediate by verifying the Bellman equation. The rest of the section discusses the Gambler s problem under p = 0.5, where the gambler does not lose capital by betting in expectation. In this case, the optimal value function is still v(s) by similar arguments of Theorem 12. It is worth noting that when γ = 1, Theorem 12 indicates v(s) = s. This agrees with the intuition that the gambler does not lose their capital by placing bets in expectation, therefore the optimal value function should be linear to s. Proposition 21 also holds that v(s) is the unique solution of the Bellman equation, given γ < 1. The remaining problem is to find the solution of the Bellman equation under γ = 1 and p = 0.5. This corresponds to the system (ABZ). When p = 0.5, condition (A) implies midpoint concavity such that for all a A(s), 2f(s a) + 1 2f(s + a), (9) where the equality must hold for some a. As Lemma 3 no longer holds, a solution f(s) may have negative values for some s. Though, if it does not have a negative value, it must be concave, and thus linear by condition (A) (will be proved later in Theorem 27). It suffices to satisfy f(s) s for any Published as a conference paper at ICLR 2020 s. Therefore the non-negative solution is f(0) = 0, f(1) = 1, and f(s) = C s + B on 0 < s < 1 for some constants C + B 1. If f(s) does have a negative value at some s, then the midpoint concavity (9) does not imply concavity. In this case, by recursively applying (9) we can show that the set {(s, f(s)) | s (0, 1)} is dense and compact on (0, 1) R. Then the function becomes pathological, if it exists. 
Despite this, the following lemma shows that f(s) need to be positive on the rationals Q. Lemma 24. Let f(s) satisfies (ABZ). If there exists 0 s < s+ 1 and a constant C such that f(s ), f(s+) C, then f(s) C for all s {s + w(s+ s ) | w Q, 0 w 1}. Proof. The statement is immediate for w {0, 1}. For 0 < w < 1 we prove the lemma by contradiction. Let f(s + w(s+ s )) < C for some w Q while 0 < w < 1. We define s0 = s + w(s+ s ) and st+1 = 2st s for st < 1 2(s + s+) and st+1 = 2st s+ for st > 1 2(s + s+), respectively. st+1 will be undefined if st = 1 2(s + s+). Since w Q, let w = m/n where m and n are integers and the greatest common divisor gcd(m, n) = 1. Then (st s )/(s+ s ) = mt/n, where mt = 2tm mod n. As Zn is finite, {st}t 0 can only take finite many values. Thus either the sequence {st} is periodic, or it terminates at some st = 1 Then we show that f(st) is strictly decreasing by induction. Assume that f(s0) > > f(st). When st < 1 2(s +s+), by (9) we have f(st) 1 2f(st+1), which indicates that f(st+1) f(st) f(st) f(s ) < f(s0) f(s ) < 0. When st > 1 2(s + s+), by (9) we have f(st) 1 2f(st+1) + 1 2f(s+), which indicates f(st+1) f(st) f(st) f(s+) < f(s0) f(s+) < 0. The base case f(s1) < f(s0) holds as at least one of f(s0) 1 2f(s ) + 1 2f(s1) and f(s0) 1 2f(s1) + 1 2f(s+) must be true. Thus we conclude that f(st) is strictly decreasing. If the sequence terminates at some st = 1 2(s + s+), then f(st) < f(s1) < C, which contradicts f(st) = f( 1 2(s +s+)) 1 2f(s+) C. Otherwise st is periodic and indefinite. Denote the period as T we have f(st+T ) < f(st), which indicates f(st) < f(st) as a contradiction. Lemma 24 agrees with the statement that the midpoint concavity indicates rational concavity. The below statements give some insights into the irrational points. Lemma 25. Let f(s) satisfies (ABZ). If there exists an s R \ Q such that f( s) 0, then f(s) 0 for all s {w s + u | w, u Q, 0 w, u, 1, w + u 1}. Proof. Specify s = s and s+ = 1 in Lemma 24, we have f( s + u w+u(1 s)) 0 whenever 0 u w+u 1 and u w+u Q. This is satisfied when 0 w, u 1, w + u > 0, w, u Q. Specify s = 0 and s+ = s + u w+u(1 s) in Lemma 24, we have f(w s + u) = f((w + u)( s + u w+u(1 s))) 0 whenever w + u 1. Thus f(w s + u) 0 for 0 < w, u < 1, w, u Q, and 0 < w + u 1. Since the case w = u = 0 is immediate, the statement follows with s {w s + u | w, u Q, 0 w, u, 1, w + u 1}. Corollary 26. Let f(s) satisfies (ABZ). If there exists an s R \ Q such that f( s) < 0, then f(w s) is monotonically decreasing with respect to w for w Q, 1 w < 1/ s. Lemma 25 and Corollary 26 indicate that when there exists a negative or positive value, infinitely many other points (that are not necessarily in its neighbor) must have negative or positive values as well. It is intuitive that the solution with a negative value, if it exists, must be complicated and pathological. In fact, Sierpi nski has shown that a midpoint concave but non-concave function is not Lebesgue measurable and is non-constructive (Sierpi nski, 1920a;b), so does f(s) if it solves (ABZ) while having a negative value. Such an f(s) exists if and only if we assume Axiom of Choice (Jech, 2008; Sierpi nski, 1920a;b). We consider the vector space by field extension R/Q. With the axiom specify a basis B = {1} {gi}i I, known as a Hamel basis. With this basis B every real number can be written uniquely as a linear combination of the elements in B with rational coefficients. Therefore, denote a real number s uniquely as a vector (w, wi)i I, w, wi Q, such that s = w + P i I wigi. 
We correspond each f(s) = f (w, {wi}i I) uniquely to f defined on Q|B| and use the two spaces interchangeably. Published as a conference paper at ICLR 2020 The solution f(s) to (9) is any concave function f on the vector space R/Q. Based on this solution we extend the system (9) to (ABZ). To this end, f(s) need to attain the equality in (9) at every s, which holds if and only if for every s = w+P i I wigi there exists s1 = w1+P i I w1 i gi and s2 = w2 +P i I w2 i gi such that f (λw1 +(1 λ)w2, λw1 i + (1 λ)w2 i i I) is λf(s1)+(1 λ)f(s2) for any 0 < λ < 1 (intuitively, local linearity of f on at least one direction everywhere). By specifying a B, the condition can be met if f(s) = f (w, {wi}i I) = α(wj) + β(wj) for some wj {w} {wi | i I}, where α is a linear function, β is a concave function, and wj denotes {w} {wi | i I} \ {wj}. When wj is w, f(s) is in fact w + β({wi}i I) for some concave function β. This is equivalent to specifying a function ω(s) : R Q such that ω maps reals to rationals additively ω(s1 + s2) = ω(s1) + ω(s2) and ω is not constantly zero, and then write f(s) as ω(s) + β1(s ω(s)) for some concave real function β1. Otherwise if wj is not w, f(s) is in the aforementioned form with the boundary conditions (B). While we have shown that under Axiom of Choice f(s) = α(wj) + β(wj) is a set of solutions that can be described by the infinite-dimensional vector space R/Q and its basis, we do not know if they are the only solutions. Nevertheless, combining our analysis and the literature we conclude the following statement about the system (ABZ). Theorem 27. Let γ = 1 and p = 0.5. A real function f(s) satisfies (ABZ) if and only if either f(s) = C s + B on s (0, 1), for some constants C + B 1, or f(s) is some non-constructive, not Lebesgue measurable function under Axiom of Choice. Proof. The first bullet correspond to the function f(s) that is non-negative on [0, 1]. By the midpoint concavity (9) and the fact that f(s) is non-negative, f(s) is concave on [0, 1] (Sierpi nski, 1920a;b). We specify s0 = 1 2. By (A) we have 2f(s0 a0) + 1 2f(s0 + a0) for some a0. Since f(s) is concave, f(s) must be linear on the interval [s0 a0, s0 + a0]. Consider the nonempty set A1 = a0 A(s0) f(s0) = 1 2f(s0 a0) + 1 2f(s0 + a0) . We show by contradiction that sup A1 is 1 2. If a1 = sup A1 < 1 2, then by the continuity a1 A1, where the continuity is implied by the convexity. This indicates that f(s) = f(s0) + f(s0+a1) f(s0 a1) 2a1 (s s0) when s0 a1 s s0 + a1. But as a1 is the maximum element of A1, on at least one of the intervals [0, s0 a1) and (s0 + a1] we have by the convexity f(s) < f(s0) + f(s0+a1) f(s0 a1) 2a1 (s s0). Therefore, for at least one of s {s0 a1, s0 + a1} we have f(s) > 1 2f(s a) + 1 2f(s + a) for all a, which contradicts condition (A). Hence sup A1 is 1 2, which implies that f(s) is linear on (0, 1). Writing f(s) as C s + B , by the boundary condition (B) we have C + B 1. It is immediate to verify that f(s) = C s + B with C + B 1 is sufficient to satisfy the system (ABZ), and is thus the non-negative solution of the system. If f(s) is negative at some s, then by (9) and (B) f(s) must not be concave. By Sierpi nski (1920a;b) such an f(s) exists only if we assume Axiom of Choice, and is non-constructive and not Lebesgue measurable if it exists (some discussion on this function is given above the statement of this theorem).