Two steps to risk sensitivity

Chris Gagne
MPI for Biological Cybernetics, Tübingen, Germany
christopher.gagne@tuebingen.mpg.de

Peter Dayan
MPI for Biological Cybernetics and University of Tübingen, Tübingen, Germany
dayan@tue.mpg.de

Abstract

Distributional reinforcement learning (RL), in which agents learn about all the possible long-term consequences of their actions and not just the expected value, is of great recent interest. One of the most important affordances of a distributional view is facilitating a modern, measured approach to risk when outcomes are not completely certain. By contrast, psychological and neuroscientific investigations into decision making under risk have utilized a variety of more venerable theoretical models, such as prospect theory, that lack axiomatically desirable properties such as coherence. Here, we consider a particularly relevant risk measure for modeling human and animal planning, called conditional value-at-risk (CVaR), which quantifies worst-case outcomes (e.g., vehicle accidents or predation). We first adopt a conventional distributional approach to CVaR in a sequential setting and reanalyze the choices of human decision-makers in the well-known two-step task, revealing substantial risk aversion that had been lurking under stickiness and perseveration. We then consider a further critical property of risk sensitivity, namely time consistency, showing alternatives to this form of CVaR that enjoy this desirable characteristic. We use simulations to examine settings in which the various forms differ in ways that have implications for human and animal planning and behavior.

1 Introduction

Risk is integral to decision making, arising whenever there are uncertain outcomes. It is especially critical when those outcomes are potentially calamitous, and plays an important role in psychiatric illness [1].
However, psychological investigations into choice under risk (i) have yet to embrace the strong formal foundations being developed in finance, AI and machine learning; and (ii) have mostly focused on one-shot decision tasks, despite the ubiquity of situations in the real world that require planning, and growing interest in multi-step tasks for elucidating richer mechanisms of choice [2, 3]. To address these lacunæ, we consider a modern, theoretically well-founded, coherent [4] risk measure called conditional value-at-risk (CVaR). This exactly quantifies the lower tail of possible outcomes: those which are important for survival and also tend to drive our most persistent worries. CVaR has been applied to sequential decision problems, notably by means of distributional reinforcement learning (RL; [5–7]). In this paradigm, the agent (or decision maker) estimates whole distributions over the potential outcomes that can arise from its actions. Risk measures, such as CVaR, can be applied to these distributions to adjust decision making to any desired level of risk sensitivity. In this paper, we combine CVaR with a distributional approach to examine risk sensitivity in the intensively-investigated two-step sequential decision task [2]. We find that the choices of many individuals reflect a large degree of risk aversion. Moreover, we show that more standard analyses ascribe this to enhanced choice perseveration and/or reduced estimates of the learning rate, thus misspecifying the effects of relatively high levels of uncertainty. Incorporating risk sensitivity into choice, through CVaR or other risk measures, however, raises subtle issues regarding the consistency of choices over time [8–11].

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
Such issues are well explored in psychology in the context of hyperbolic temporal discounting [12], but their thorough investigation in finance and operations research has yet to permeate judgement and decision making research. We discuss how a direct incorporation of CVaR into distributional RL can lead to time inconsistency, and point to two additional, time-consistent approaches to CVaR for sequential decision making. Finally, we show how these three approaches can give rise to distinct patterns of behavior in a problem designed to tease them apart, which can serve as the basis for future empirical investigation.

Preliminaries: Conditional value-at-risk (CVaR)

A risk measure ρ(Z) maps a probability distribution of outcomes associated with a random variable Z to a real number. In distributional RL, Z is typically a discounted sum of rewards minus costs (i.e. the return). ρ(Z) represents the risk inherent in the uncertainty about Z, and is often interpreted as the amount one is willing to pay to avoid adopting this risk. Well-known risk measures include the variance and the value-at-risk (VaRα), which is defined as:

$\mathrm{VaR}_\alpha(Z) = F^{-1}(\alpha) := \inf\{z : F(z) \geq \alpha\}$    (1)

for a cumulative distribution function F(z) and α-quantile; see Figure 1a. While variance and VaR are venerable risk measures, neither satisfies the full set of axiomatically desirable properties associated with coherent risk measures [4]. For instance, the VaR is not sub-additive (so fails to reward diversification), and variance is neither sub-additive nor positively homogeneous (as it changes non-linearly with the units in which costs or rewards are measured). Conditional value-at-risk (CVaR) was introduced as a coherent risk measure [4], and is particularly popular in finance, robotics, operations research, and recently machine learning and RL [6, 13–19].
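Both VaR and the tail average it implies (the CVaR, defined formally in equation (2) below) are easy to estimate from Monte Carlo samples of the return. The following is a minimal sketch using the lower-tail convention adopted throughout this paper; the function name and the sorting-based estimator are illustrative, not taken from the paper:

```python
import numpy as np

def var_cvar(samples, alpha):
    """Empirical lower-tail VaR and CVaR at level alpha from return samples."""
    z = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(np.ceil(alpha * len(z))))       # size of the lower alpha-tail
    return float(z[k - 1]), float(z[:k].mean())    # (alpha-quantile, mean of the tail)

# At alpha = 1 the CVaR is just the mean; smaller alpha focuses on the worst outcomes.
returns = np.arange(1, 101)                        # toy returns 1..100
print(var_cvar(returns, 1.0))                      # (100.0, 50.5)
print(var_cvar(returns, 0.1))                      # (10.0, 5.5)
```

The `max(1, ...)` guard simply ensures the tail is never empty for very small α with few samples.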
For the lower tail of a continuous distribution, it is defined as the average of the values lower than the VaR (Figure 1a):¹

$\mathrm{CVaR}_\alpha[Z] := \mathbb{E}[Z \mid Z < \mathrm{VaR}_\alpha(Z)]$    (2)

α determines risk aversion by emphasizing the lower tail of the distribution more or less (Figure 1b). That CVaR concentrates on the lower tail makes it particularly attractive for capturing aspects of normal and pathological reasoning in animals, whose lives often hang on rather thin threads, and humans, who can catastrophize about diverse, unlikely possibilities. But it is then particularly important to consider CVaR in sequential decision making, since this represents the ecological norm and is etched into the neural structure of decision making. We therefore start by examining CVaR in the simplest such problem that is widely studied in humans: the two-step task.

2 Modeling risk sensitivity in human planning: CVaR in the two-step task

The two-step task (Figure 2) is a popular option for studying sequential choice in humans (and animals) [2]. It was originally designed to investigate model-based (MB) and model-free (MF) learning and planning, distinguishing the two by examining the consequences of progressive changes in the probabilities of rewards. However, these changes necessarily induce uncertainty, which could affect risk-sensitive subjects. We therefore used a CVaR-based form of MB and MF reasoning to fit a very substantial dataset of human behavior in this task (out of more than 2000 participants in [3], the 792 who responded on every trial).

Task: On each of 200 trials, participants make decisions at two successive stages or steps. At the first stage, participants choose between two actions (the fractals in Figure 2a). As shown by the arrows in the figure, depending on their choice, they then transition to one of two possible second-stage states either 70% or 30% of the time.
Each second-stage state has its own pair of options (the colored squares) between which the participants must then choose. According to this second choice, the participant receives a binary outcome, which is drawn according to a probability that drifts randomly across the trials (shown by color in Figure 2b). Before the actual experiment, participants were given 40 trials of practice, during which they learned about the general structure of the task, including the 70%/30% transition probabilities. This structure was not presented to participants explicitly, but participants were tested about their knowledge of the task and excluded if they failed this test. Participants were instructed about the drifting reward probabilities.

¹For discrete random variables, alternative definitions such as $\mathrm{CVaR}_\alpha(Z) := \sup_\nu\{\nu - \frac{1}{\alpha}\mathbb{E}[(\nu - Z)^+]\}$ [20] are used.

Figure 1: Conditional value-at-risk (CVaR). (a) CVaRα is the average of the values in the lower α-fraction of a distribution, i.e. below the VaRα. (b) Adjusting α emphasizes the lower tail of the distribution more or less and sets the level of risk aversion. At α = 1.0, CVaR is equal to the mean. As α decreases to 0, it approaches the minimum.

Figure 2: Two-step task. (a) On each trial, participants choose between two options (depicted as fractals) in stage 1 and then transition stochastically to one of two states (purple/blue or orange/red) in stage 2. They again choose between two options (the colored squares) and receive a binary reward. (b) This reward is sampled according to a probability that drifts randomly across trials (separately for each option). Distributions over reward probabilities are assumed by the model to be estimated for each option (depicted at the bottom of panel a).

CVaR-based model: We included CVaR in a model that is conventionally applied to the two-step task.
We assume that participants learn a distribution of values [5] associated with each of the four second-stage options using an approximate Kalman filter:

$\mu_{t+1} = \mu_t + \lambda(r_t - \mu_t), \qquad \Psi^2_{t+1} = (1 - \phi^2)\Psi^2_t + \eta^2 - \lambda\Psi^2_t$    (3)

updating the mean µt of each distribution on each trial using the observed outcome rt via a delta-rule with a participant-specific learning rate λ, and also updating the variance Ψ²t. When the outcome is not observed (i.e. when the participant chose a different option), the mean is updated towards 0.5, and the variance is updated without the last term in equation (3). Thus, the dispersion parameter η² controls the increase of the variance whether or not outcomes are observed, and the learning rate λ controls the amount it decreases when they are. The term (1 − φ²) controls the asymptotic variance for unobserved outcomes (in relation to η²) and was set based on the other two parameters (see later). The CVaRα,t(a) at risk preference α (estimated per participant) was calculated for each option a using the mean µt(a) and variance Ψ²t(a) on each trial t. Note that because this variance represents uncertainty about the reward probabilities themselves, CVaR here captures ambiguity- as well as risk-sensitivity; however, to keep terminology consistent with the finance and machine learning communities [21–24], we simply refer to this as risk-sensitivity. Equipped with these CVaRα,t(a) values, participants' second-stage choices were modeled using a soft-max choice rule:

$P(\text{2nd stage choice} = a) \propto \exp(\beta_{2nd}\,\mathrm{CVaR}_{\alpha,t}(a))$    (4)

where the parameter β2nd controls the relative stochasticity/determinism. Decisions between the two first-stage options were modeled as involving a combination of model-free (MF) and model-based (MB) approaches to value estimation, both of which were modified to include CVaR.
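As a concrete sketch of equations (3) and (4), the following implements one trial of the filter together with a CVaR read-out. Two points are assumptions of ours rather than details given in the text: the paper does not spell out the parametric form used to compute CVaR from µ and Ψ², so a Gaussian approximation is used here, and the rate of the drift towards 0.5 for unchosen options is likewise assumed:

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def kalman_update(mu, psi2, r, lam, eta2, phi2, observed):
    """One trial of the approximate Kalman filter in equation (3)."""
    if observed:
        mu = mu + lam * (r - mu)                      # delta-rule update of the mean
        psi2 = (1 - phi2) * psi2 + eta2 - lam * psi2  # variance shrinks when observed
    else:
        mu = mu + lam * (0.5 - mu)                    # drift towards 0.5 (rate assumed)
        psi2 = (1 - phi2) * psi2 + eta2               # variance grows: no -lam*psi2 term
    return mu, psi2

def gaussian_cvar(mu, psi2, alpha):
    """Lower-tail CVaR of N(mu, psi2), for alpha in (0, 1)."""
    q = NormalDist().inv_cdf(alpha)                   # standard-normal alpha-quantile
    phi_q = exp(-q * q / 2) / sqrt(2 * pi)            # standard-normal pdf at q
    return mu - sqrt(psi2) * phi_q / alpha

# Lower alpha penalizes uncertain options more heavily, all else being equal.
print(gaussian_cvar(0.5, 0.01, 0.5))   # ~0.42
print(gaussian_cvar(0.5, 0.04, 0.5))   # ~0.34
```

The two printed values illustrate the mechanism behind Figure 4: at matched means, the option with the larger variance has the lower CVaR, so uncertainty alone can make an option unattractive.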
MF estimates were calculated using the same formulæ, learning rate λ and dispersion parameter η² as at the second stage, to learn an additional pair of means and variances based on the actual outcomes received in the second stage. For model-based estimation, the 70%/30% transition probabilities were used to calculate a mixture distribution for each of the top-stage actions, from which the CVaR was calculated directly. The CVaRα,t(a) for the model-free and model-based first-stage distributions again determined the choice probability through a soft-max choice rule:

$P(\text{1st stage choice} = a) \propto \exp(\beta_{MB}\,\mathrm{CVaR}^{MB}_{\alpha,t}(a) + \beta_{MF}\,\mathrm{CVaR}^{MF}_{\alpha,t}(a) + \beta_{sticky}\,\delta_{a,a_{t-1}})$    (5)

Parameter βsticky was included to capture the tendency of participants to repeat (or to switch) the previously chosen action regardless of its value.

Parameter estimation: The 7 parameters of the CVaR-model were estimated for each participant: CVaR-based risk-sensitivity α ∈ [0.1, 1], learning rate λ ∈ [0.01, 0.99], dispersion η² ∈ [0.001, 0.09], perseveration βsticky ∈ [0, 20] and three inverse temperature parameters β2nd, βMB, βMF ∈ [0, 30]. Parameters were estimated in Python using L-BFGS-B. Parameter recovery analyses were conducted to investigate parameter estimability. Preliminary recovery simulations suggested that estimating both η² and (1 − φ²) was difficult, so we determined the value of (1 − φ²) that would pin the asymptotic variance to 0.1. This was done so that a never-chosen option with an estimated mean of 0.5 would have a CVaR of 0 at the lower-bound value for α (i.e. 0.1), which is appropriate since the outcomes themselves were between 0 and 1. The learning rate lay between 0.01 and 0.99 and was additionally constrained such that λ < (1 − φ²).
With these constraints, we ran further parameter recovery analyses, generating new data based on participants' parameter estimates and re-estimating the model; a rank-correlation of 0.71 between the generative and recovered CVaR α parameter indicated moderately good estimability/identifiability (Supplemental Figure 1).

Baseline and alternate models: Setting α = 1 arranges for CVaR choices that depend only on expected values (in which case the variance estimation can also be removed, because it does not influence choice probability). This mean-model is used as a baseline against the risk-sensitive CVaRα<1 model; it is very similar to typical models used in the two-step task [2, 3], but is more directly comparable to its risk-sensitive counterpart. We provide a more detailed description of the differences between the current and previously used models in Appendix A.2. We also tried two forgetful beta-binomial models (one with and one without risk-sensitivity), since rewards are binary, and a risk-seeking version of the CVaR-model, which emphasized the upper rather than lower tail. However, these models fit participants' data less well, so we omit them for the sake of parsimony. We first compared the mean-model (CVaRα=1) to the risk-sensitive model (CVaRα<1). Using BIC to account for the two extra parameters, the risk-sensitive model was slightly preferred overall (avg. BIC=368.8 versus BIC=371.5; diff=−2.67, t(790)=−3.54, p=0.004), albeit in a minority (40.2%) of participants. For many such subjects, however, the improved fit was substantial (red points in Figure 3a); and many of them had α < 0.2 (Figure 3b), indicating substantial aversion to risk. To investigate the improved fit, and for insight into the choice characteristics associated with this risk aversion, we analyzed how other parameters altered from the α=1 case for significantly risk-averse subjects. There were two related changes: participants' estimated learning rates were higher in the CVaR-model (Figure 3c), and their estimated perseveration parameters were lower (Figure 3d). One potential reason for these is that participants tended not to prefer less frequently chosen (and thus more uncertain) options, even when the more certain option is apparently worse. To demonstrate this directly, we simulated choices from the CVaR-model at increasing levels of risk sensitivity (α ∈ {1.0, 0.6, 0.3, 0.1}) in response to a predetermined set of outcomes, with the other parameters held constant (Figure 4). Option A is chosen by design for the first 6 trials, but the model is then allowed to switch to an option B after observing what it thinks is a sufficient number of negative outcomes. For α=1, this switch occurs when the mean estimate for option A crosses the mean estimate for option B at 0.5 (trial 10 in panel b). For α<1 (panels c-e), this switch occurs when the CVaRα<1.0 estimates for the options cross. Due to the uncertainty around the mean and the fact that the model is more uncertain about option B than A, the crossing point occurs later for lower α (i.e. on trials 11, 12, and 13). Thus, what looks like a tendency to stick with an apparently worse option here arises from an aversion to uncertainty.

Figure 3: Two-step task modeling results. (a) Many participants are better fit by the CVaR-model than the mean-model, and (b) are substantially risk averse, as estimated by the CVaR α parameter. (c-d) Including risk sensitivity increases/decreases estimates of learning rate and perseveration, respectively, in these participants. (e-f) Simulations show that fitting a risk-agnostic model to risk-sensitive behavior can lead to underestimated learning rates (e) and/or apparent perseveration (f). Shading around the regression lines in panels e-f indicates 95%-confidence intervals. Note that 3 outliers located near (x=12, y=10) were removed from panel d.
For α = 1, this can only occur if the learning rate is low (so the value of the chosen option does not decrease too much upon experiencing an unfortunate outcome), and/or with a high perseveration parameter. But, as a consequence, genuine risk aversion (or uncertainty aversion) might be misattributed to one of these parameters; this is shown directly in Figure 3e,f, where generating choices from a non-perseverative CVaR-model with low α leads to inferred learning rates that are too low (panel e) and inferred non-zero perseveration (panel f). Note, though, that in participants' data, some perseveration remains even with CVaRα<1, as removing perseveration from the CVaR-model (i.e. setting βsticky = 0) led to worse BIC values.

3 Three approaches to sequential risk and time (in)consistency

One of the most important issues for sequential decision making is time consistency: that the choices the decision maker at time t assumes will be executed at time t + 1 are indeed carried out.² Inconsistency, famously caused by hyperbolic discounting [12], can lead to reneging on past resolutions, or require potentially sub-optimal and expensive commitment behavior to prevent that from occurring. Another, less well-known, route to time inconsistency comes from mishandling risk sensitivity [8–11, 26–34] and, indeed, different variants of CVaR have been developed to deal with this issue. The two-step task lacks sufficient temporal complexity to make this issue clear. Therefore, in this section we examine it in greater depth, and, using simulations, suggest a task that highlights the differences between these CVaR variants.

²An intimately related, yet differently defined, notion of time consistency involves the consistency of successive risk evaluations of a stochastic process rather than choices per se. Of course, inconsistent evaluation can lead to inconsistent choices. See [25–27] for a more in-depth discussion.

Figure 4: CVaR-based perseveration. (a) The CVaR-model is simulated at four values of α on a series of choices between options A and B. For low α (high risk sensitivity), the model chooses option A for more trials, despite receiving 0 outcomes after trial 7; binary outcomes are shown as dots in panels b-e. (b-e) The switch from A to B occurs when the model's estimate for the CVaR (or mean for α = 1.0) of A (in color) goes below that of B (in gray). The crossing point occurs later for lower levels of α, due to uncertainty around the mean, which is greater for B.

Fixed, precommitted, and nested CVaR

In the two-step task, we followed recent work on distributional RL [6] in assuming that participants applied CVaR at the same risk preference (α) to the estimated value distributions at both the first and second stage. We call this the fixed approach, or fCVaR. To see why this approach is time-inconsistent, consider calculating the optimal fCVaRα=0.1 policy for the simplified two-stage choice scenario depicted in Figure 5a.

Figure 5: Three approaches to CVaR in a two-stage problem. (a) A choice of left or right can be made in both state A and state B (if state B is visited). Stochastic transitions are denoted by dashed lines (with the probabilities shown), and the possible terminal outcomes are [−4, −2, 1, 0, 2]. The fixed approach to CVaR (discussed in the text) would choose policy π′ at state A, but policy π at state B, illustrating issues with time consistency. The precommitted approach would consistently choose policy π′. (b) In an adjusted problem, additional state transitions (colored in gray) are appended to state B. The nested approach to CVaR (with α = 0.1) chooses policy π′′, taking a sure loss of 4, despite the only extremely remote chance of getting −5.
From the perspective of the top state A, consider the two policies π = {a1 = right, a2 = right}, which leads to a distribution of outcomes p = {0.1 : 0, 0.9 : 2}, and π′ = {a1 = right, a2 = left}, which leads to a distribution of outcomes p′ = {0.01 : −2, 0.09 : 1, 0.9 : 2}. Just considering these overall distributions, π has a CVaR0.1 of 0 and π′ has a CVaR0.1 of 0.7, so π′ would be preferred. However, if the agent takes action a1 = right and arrives at state B, the remainder of π′ (i.e. action a2 = left) now looks worse than the remainder of π (i.e. a2 = right), with CVaR0.1's of −2 and 0, respectively; as a result, the fCVaR agent will defect on its original plan π′. One way to circumvent this issue with time consistency is to skirt dynamic evaluation altogether and precommit to evaluating risk with respect to only one stage or time-point (the start being most natural), thus enforcing a commitment to π′ at state A. One way to make this contract is to change the value of α after each transition in the light of the probability of its happening. For instance, the decision in π′ made at state A to go left rather than right at state B was based on considering all the outcomes in state B (i.e. the decision was based on considering CVaR at α = 1.0 rather than α = 0.1). This dynamic adjustment of α (in this case from 0.1 in state A to 1.0 in state B) prevents time inconsistency by allowing later risk evaluations and decisions all to be coordinated with respect to the single risk preference at the start stage [31]. A second way to deal with time consistency is to evaluate risk dynamically using a series of nested one-step (conditional) risk measures [11, 35]:

$\rho_{k,N}(R_k, \ldots, R_N) = R_k + \rho_k(R_{k+1} + \rho_{k+1}(R_{k+2} + \cdots + \rho_{N-2}(R_{N-1} + \rho_{N-1}(R_N)) \cdots ))$,    (6)

where Rk is the reward (or cost) for time step k. If CVaRα with the same value of α is used for each conditional risk measure ρk, this is known as nested CVaR, or nCVaRα [36].
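The numbers in the Figure 5a example are easy to verify with a discrete-distribution CVaR. The helper below is ours (a direct implementation of the lower-tail-average definition); distributions are (probability, value) pairs taken from the figure:

```python
def cvar(dist, alpha):
    """Lower-tail CVaR of a discrete distribution given as (probability, value) pairs."""
    items = sorted(dist, key=lambda pv: pv[1])  # ascending by value
    remaining, acc = alpha, 0.0
    for p, v in items:
        take = min(p, remaining)                # probability mass drawn from the tail
        acc += take * v
        remaining -= take
        if remaining <= 1e-12:
            break
    return acc / alpha

# Whole-trajectory distributions evaluated from state A:
pi_rr = [(0.1, 0), (0.9, 2)]                    # pi:  a1=right, a2=right
pi_rl = [(0.01, -2), (0.09, 1), (0.9, 2)]       # pi': a1=right, a2=left
print(cvar(pi_rr, 0.1), cvar(pi_rl, 0.1))       # 0.0 vs ~0.7: pi' preferred at A

# Conditional distributions once state B is actually reached:
print(cvar([(0.1, -2), (0.9, 1)], 0.1))         # a2=left:  -2.0
print(cvar([(1.0, 0)], 0.1))                    # a2=right:  0.0, so the agent defects
```

The reversal between the two pairs of print-outs is exactly the time inconsistency of fCVaR: the plan that wins from A loses once B is reached.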
Although, like fCVaR, it keeps risk preferences fixed across time, it applies the CVaR at each stage to subsequent CVaR evaluations (which themselves are random scalars), rather than to the full distribution of future random costs or rewards under the policy lower in the tree. One consequence of this nesting or compounding of risk evaluations is that nCVaR can sometimes become much more conservative than the other two approaches. To see this, consider panel b in Figure 5. Here, the same decision tree is appended with a set of alternative transitions (depicted in gray) from state B. Now, each possible action is additionally associated with a 10% chance of ending up in state D, which either returns 0 with probability 0.9 or makes a transition to yet another state E with probability 0.1. The outcomes in this state are either −5 (the worst possible outcome) or again 0. Importantly, from the perspective of state B, the −5 is so remote (having a probability of merely 0.001) that it barely impacts the distribution there (and therefore pCVaR and fCVaR). However, due to the nested structure of nCVaR (at α = 0.1), the −5 is propagated backwards, first to E, then to D, and then, in fact, all the way back to the top state A. As a consequence, the nCVaR agent would rather choose π′′ = {a1 = left, N/A}, for a certain loss of −4. Of course, at α = 1.0, all three approaches are equivalent to using the expected value, which is time consistent. As α approaches 0, all three approaches again become equivalent, but to the worst-case risk measure [37], which is also time consistent.

The three approaches in a gridworld

As noted, the two-step task is too simple to surface these issues (evaluating the other forms of CVaR leads to equivalent results for the 792 subjects). Thus, we used the simulations in Figure 6 to highlight their differences as the basis of possible future tests.
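The conservatism of nCVaR in Figure 5b can be checked in a few lines, using the transition probabilities given in the text (the −5 is reached from E with probability 0.1, so its marginal probability from D is only 0.01, and from B only 0.001); the helper reimplements the discrete lower-tail CVaR:

```python
def cvar(dist, alpha):
    """Lower-tail CVaR of a discrete distribution of (probability, value) pairs."""
    items = sorted(dist, key=lambda pv: pv[1])
    remaining, acc = alpha, 0.0
    for p, v in items:
        take = min(p, remaining)
        acc += take * v
        remaining -= take
        if remaining <= 1e-12:
            break
    return acc / alpha

alpha = 0.1
# Nested evaluation backs up CVaRs of CVaRs along the chain D <- E:
v_E = cvar([(0.1, -5), (0.9, 0)], alpha)         # -5.0: the tail is exactly the -5 branch
v_D = cvar([(0.1, v_E), (0.9, 0)], alpha)        # -5.0 again: the worst case propagates intact
# A static (fixed or precommitted) evaluation flattens the marginal at D instead:
v_D_flat = cvar([(0.01, -5), (0.99, 0)], alpha)  # -0.5: the remote -5 is diluted
print(v_E, v_D, v_D_flat)
```

Because each nested backup re-applies the α = 0.1 tail to a value that already equals the worst case, −5 survives every step, whereas the static evaluation dilutes it by its small marginal probability.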
Consider an agent (or decision maker) who starts in a gridworld in the top left corner and can move either right or left. Exiting the gridworld on the right-hand side informally represents a goal and is associated with a +3 reward, while exiting the map on the left-hand side represents quitting and a loss of −2. The agent's actions are stochastic, with the possibility of moving downwards with some error probability. If the agent falls off the bottom of the gridworld, a substantial loss of −15 is incurred (schematized by the lava pit). The right action has twice the error probability of the left action, meaning that heading towards the goal is riskier than attempting to quit; this leads to interesting differences between the three approaches.

Figure 6: Three approaches to CVaR in a gridworld. (a) The agent starts in the upper left corner and can choose to go left or right in each state. Actions result in stochastic transitions (as depicted at the top). Exiting the map on the left, right, or bottom results in an outcome of −2, +3, or −15, respectively. (b) The optimal policy is shown for each of the three approaches, highlighting their distinct behaviors.

Optimal policies for pCVaR, fCVaR and nCVaR were calculated using dynamic programming equations (see Appendix B; based on [11, 16]) for various levels of α (for pCVaR, which requires interpolating between different values of α, we used 21 log-spaced points, following [16]). The optimal policies for each approach to CVaR at a moderate-to-high level of risk sensitivity (i.e. α = 0.18) are depicted in Figure 6b. For pCVaR (with a start-state risk of α = 0.18), the optimal policy in every state is to head towards the goal. This is the case even in the bottom row, where rightward actions are much more risky due to the proximity of the lava pit. In fact, if the agent started in the bottom row with the same α = 0.18, it would head left in an attempt to quit.
However, since the pCVaR approach coordinates risk with respect to the start state, and these bottom states are part of the start state's lower tail, the pCVaR agent knows that it is better to make riskier decisions here for the sake of its former self.³ Indeed, doing so yields a higher start-state CVaR than abandoning its plans. In contrast, abandoning the pursuit of the goal is exactly what the fCVaR agent does, as it re-evaluates risk using α = 0.18 at every state; i.e. it heads towards the left in rows 2 and 3 after getting knocked off course. Similarly to fCVaR, the nCVaR agent also re-evaluates risk using α = 0.18 at every state. However, it chooses to quit from the start due to the nesting of risk (risk evaluated on top of future risk evaluations): even from the start, the chance of the distal lava pit looms larger than it does for the other two approaches (at the same nominal level of α). At lower and higher values of α, the behavior of the three approaches more closely aligns, either quitting from the start or pursuing the goal from every state.

4 Discussion and related work

We first showed that adopting the modern risk measure CVaR [4], in a form based on ideas from distributional RL [6], suggests that a large minority of healthy volunteer subjects exhibit quite significantly risk-averse behavior. This was true even in a simple two-step task with extremely low stakes and only rather limited amounts of uncertainty. That this risk aversion masqueraded as enhanced perseveration and slowed learning is a reminder of the complexities of building models of behavior, and an invitation to consider risk in models of other common tasks. These effects of risk in behavior can be complemented by considerations of its effects in the sort of off-line planning that has recently been suggested to model rumination in anxiety disorders, as subjects struggle to find ways of mitigating potential future threats [38].
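The qualitative gridworld contrast from section 3 (pressing on towards the goal versus quitting) can be caricatured with an nCVaR backup on a two-step corridor. This is a toy sketch of ours, with illustrative error probabilities and the paper's outcome values +3, −2 and −15, not the actual dynamic program of Figure 6:

```python
def cvar(dist, alpha):
    """Lower-tail CVaR of a discrete distribution of (probability, value) pairs."""
    items = sorted(dist, key=lambda pv: pv[1])
    remaining, acc = alpha, 0.0
    for p, v in items:
        take = min(p, remaining)
        acc += take * v
        remaining -= take
        if remaining <= 1e-12:
            break
    return acc / alpha

GOAL, QUIT, PIT = 3.0, -2.0, -15.0

def ncvar_backup(v_next, alpha, p_err=0.1):
    """One nested backup: go on (reach v_next, or fall in the pit with p_err) or quit now."""
    go = cvar([(1 - p_err, v_next), (p_err, PIT)], alpha)
    return max(go, QUIT)

for alpha in (1.0, 0.2):
    v = GOAL
    for _ in range(2):                 # two risky steps between the start and the goal
        v = ncvar_backup(v, alpha)
    print(alpha, round(v, 2))          # 1.0 -0.42 (press on); 0.2 -2.0 (quit immediately)
```

At α = 1 the backup reduces to the expected value and going on (−0.42) still beats quitting (−2), whereas at α = 0.2 the nested tail evaluation makes quitting optimal from the very first state, mirroring the nCVaR policy in Figure 6b.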
That distributional RL has recently been suggested as a way of understanding facets of the activity of major neural systems involved in processing affective outcomes [7] offers a highly attractive link to understanding risk aversion in animals (and also tempts us to consider the role of other neuromodulators that have been implicated in representing and processing uncertainty; [39]).

³This increase in risk sensitivity can be seen in the average values of the adjusted alphas in each state, shown in the upper right-hand corner of each grid cell.

Since pathological effects of uncertainty play a malign role in many psychiatric conditions [1], it would be of great interest to design tasks, for instance based on the gridworld of section 3, that decompose its various facets to help determine which might be responsible. Indeed, this task can be seen as a structurally rich form of a popular method for assessing impulsivity: the balloon analogue risk task (BART; [40]). Many potential variants would be of interest, for instance studying forms of precommitment by allowing subjects to buy a form of insurance at the outset against one or more unlucky downwards transitions. We could modify the task further with intermediate rewards and punishments to examine the observation that individuals only take on greater risk after a loss if it is not yet realized (in our terms, before reaching one of the two sides of the grid) [41]. Then, we might expect precommitments to be abandoned, and risk to be re-evaluated, to the degree to which previous outcomes seem irreversible or the current situation seems sufficiently distinct from the one in which the commitment was made. One could examine consistency between subjects' choices in the environment and their willingness to pay to protect themselves ahead of time.
Furthermore, richer distributions of possible outcomes, and a diverse set of navigation environments with features that differentiate the three approaches, could be used to reverse engineer uncertainty calculations in the brain, given stable risk preferences. One particular facet of uncertainty that we did not explore is that, in some circumstances, it is possible to collect more information that reduces it to acceptable levels. Indeed, modern risk measures have recently been applied in settings with model uncertainty (i.e. POMDPs and Bayes-adaptive MDPs; [21–24]). Information gathering, as explored in the sequential information sampling task [42], has been of particular interest in obsessive compulsive disorder, where subjects have been algorithmically modelled as exhibiting less urgency to make up their minds [43]; it would be interesting to model them computationally as being more risk averse. Similarly, if trust is seen as willingness to risk vulnerability to others [44, 45], then risk sensitivity could be an important factor in social pathologies such as borderline personality disorder. We showed that fCVaR, although intuitive, is not time consistent. The other forms, precommitted pCVaR and nested nCVaR, are, which provides them with more attractive formal properties. In fact, one can see fCVaR as living in between pCVaR and nCVaR (rather literally, in the problem shown in Figure 6b). There are other potential interpolants, for instance cases in which the updating of α following lucky or unlucky transitions (which justifies a form of gambler's fallacy; [38]) is incomplete (or perhaps asymmetric). Such incompetent calculations have been suggested as underlying psychiatric conditions themselves [46]. Finally, although we focused on CVaR, there are many other risk measures that satisfy the coherence axioms, and indeed other more stringent conditions.
Adding the requirements of comonotonicity and law invariance leads to the class of distortion risk measures [47–49] (or, equivalently, spectral risk measures [50]), which apply a distortion function to cumulative probabilities. This allows them to be linked directly to the dual theory of choice [51], thereby inheriting an additional set of rational choice axioms (i.e. the axioms of expected utility theory, with an alternative independence axiom) [52]. CVaR itself is part of this class, and can be used as a basis to construct all other members [34, 49]. Interestingly, the probability weighting function from cumulative prospect theory [53, 54], a popular model in psychology, can be considered a distortion risk measure, even though full prospect theory adds reference dependence and loss aversion. While prospect theory has been well validated for single decisions, there is substantial opportunity for theoretical and empirical investigation of its realization in sequential decision-making.

5 Broader Impact

Coherent risk measures, despite their widespread use in formal applications and favorable theory, have yet to permeate fully psychological and neuroscientific models of decision making, especially for sequential problems. We take early steps in this direction using a widely-recognized and commonly used experimental paradigm, and a psychologically relevant coherent risk measure, CVaR. In doing so, we highlight a key issue, namely time inconsistency, that can arise with a naive application of risk measures to sequential choice, and discuss two alternative (time-consistent) approaches. We expect that bringing awareness to this issue will lead to interesting future discussions and empirical tests in the psychological and neuroscientific communities. We do not anticipate our research to selectively impact some groups at the expense of others.
Our study had the limitation of not examining how subjects might compute these various forms of risk sensitivity in a neurobiologically credible manner, or showing that the risk aversion that we inferred from behaviour in the two-step task would generalize to other choices that the subjects might make.

Acknowledgements

The authors have no competing interests to disclose. We thank Fabian Renz for his contributions to preliminary analyses of the two-step task and for helpful discussions. CG and PD are funded by the Max Planck Society. PD is also funded by the Alexander von Humboldt Foundation.

References

[1] Matthias Brand, Kirsten Labudda, and Hans J Markowitsch. Neuropsychological correlates of decision-making in ambiguous and risky situations. Neural Networks, 19(8):1266–1276, 2006.
[2] Nathaniel D Daw, Samuel J Gershman, Ben Seymour, Peter Dayan, and Raymond J Dolan. Model-based influences on humans' choices and striatal prediction errors. Neuron, 69(6):1204–1215, 2011.
[3] Claire M Gillan, Michal Kosinski, Robert Whelan, Elizabeth A Phelps, and Nathaniel D Daw. Characterizing a psychiatric symptom dimension related to deficits in goal-directed control. eLife, 5:e11305, 2016.
[4] Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
[5] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017.
[6] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pages 1096–1105. PMLR, 2018.
[7] Will Dabney, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis Hassabis, Rémi Munos, and Matthew Botvinick. A distributional code for value in dopamine-based reinforcement learning. Nature, 577(7792):671–675, 2020.
[8] Kang Boda and Jerzy A Filar.
Time consistent dynamic risk measures. Mathematical Methods of Operations Research, 63(1):169–186, 2006.
[9] Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, David Heath, and Hyejin Ku. Coherent multiperiod risk adjusted values and Bellman's principle. Annals of Operations Research, 152(1):5–22, 2007.
[10] Alexander Shapiro. On a time consistency concept in risk averse multistage stochastic programming. Operations Research Letters, 37(3):143–147, 2009.
[11] Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.
[12] George Ainslie and Nick Haslam. Hyperbolic discounting. 1992.
[13] Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
[14] Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Parametric return density estimation for reinforcement learning. arXiv preprint arXiv:1203.3497, 2012.
[15] Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
[16] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
[17] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.
[18] Ramtin Keramati, Christoph Dann, Alex Tamkin, and Emma Brunskill. Being optimistic to be conservative: Quickly learning a CVaR policy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4436–4443, 2020.
[19] Núria Armengol Urpí, Sebastian Curi, and Andreas Krause. Risk-averse offline reinforcement learning. arXiv preprint arXiv:2102.05371, 2021.
[20] R Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471, 2002.
[21] Jingnan Fan and Andrzej Ruszczyński. Dynamic risk measures for finite-state partially observable Markov decision problems. In 2015 Proceedings of the Conference on Control and its Applications, pages 153–158. SIAM, 2015.
[22] Apoorva Sharma, James Harrison, Matthew Tsao, and Marco Pavone. Robust and adaptive planning under model uncertainty. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 410–418, 2019.
[23] Mohamadreza Ahmadi, Masahiro Ono, Michel D Ingham, Richard M Murray, and Aaron D Ames. Risk-averse planning under uncertainty. In 2020 American Control Conference (ACC), pages 3305–3312. IEEE, 2020.
[24] Marc Rigter, Bruno Lacerda, and Nick Hawes. Risk-averse Bayes-adaptive reinforcement learning. arXiv preprint arXiv:2102.05762, 2021.
[25] Michel De Lara and Vincent Leclère. Building up time-consistency for risk measures and dynamic optimization. European Journal of Operational Research, 249(1):177–187, 2016.
[26] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2014.
[27] Birgit Rudloff, Alexandre Street, and Davi M Valladão. Time consistency and risk averse dynamic decision models: Definition, interpretation and practical consequences. European Journal of Operational Research, 234(3):743–750, 2014.
[28] David Kreps and Evan Porteus. Temporal von Neumann–Morgenstern and induced preferences. 1979.
[29] Larry G Epstein and Martin Schneider. Recursive multiple-priors. Journal of Economic Theory, 113(1):1–31, 2003.
[30] Berend Roorda and Johannes M Schumacher.
Time consistency conditions for acceptability measures, with an application to tail value at risk. Insurance: Mathematics and Economics, 40(2):209–230, 2007.
[31] Georg Ch Pflug and Alois Pichler. Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research, 41(2):682–699, 2016.
[32] Alois Pichler and Alexander Shapiro. Risk averse stochastic programming: time consistency and optimal stopping. arXiv preprint arXiv:1808.10807, 2018.
[33] Rouven Schur, Jochen Gönsch, and Michael Hassler. Time-consistent, risk-averse dynamic pricing. European Journal of Operational Research, 277(2):587–603, 2019.
[34] Anirudha Majumdar and Marco Pavone. How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In Robotics Research, pages 75–84. Springer, 2020.
[35] Andrzej Ruszczyński and Alexander Shapiro. Conditional risk mappings. Mathematics of Operations Research, 31(3):544–561, 2006.
[36] Georg Ch Pflug and Werner Römisch. Modeling, measuring and managing risk. World Scientific, 2007.
[37] Stefano P Coraluppi and Steven I Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.
[38] Christopher Gagne and Peter Dayan. Peril, prudence and planning as risk, avoidance and worry. PsyArXiv, 2021.
[39] Angela J Yu and Peter Dayan. Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692, 2005.
[40] Carl W Lejuez, Jennifer P Read, Christopher W Kahler, Jerry B Richards, Susan E Ramsey, Gregory L Stuart, David R Strong, and Richard A Brown. Evaluation of a behavioral measure of risk taking: the Balloon Analogue Risk Task (BART). Journal of Experimental Psychology: Applied, 8(2):75, 2002.
[41] Alex Imas. The realization effect: Risk-taking after realized versus paper losses. American Economic Review, 106(8):2086–2109, 2016.
[42] Luke Clark, Trevor W Robbins, Karen D Ersche, and Barbara J Sahakian.
Reflection impulsivity in current and former substance users. Biological Psychiatry, 60(5):515–522, 2006.
[43] Tobias U Hauser, Michael Moutoussis, Peter Dayan, and Raymond J Dolan. Increased decision thresholds trigger extended information gathering across the compulsivity spectrum. Translational Psychiatry, 7(12):1–10, 2017.
[44] Roger C Mayer, James H Davis, and F David Schoorman. An integrative model of organizational trust. Academy of Management Review, 20(3):709–734, 1995.
[45] Denise M Rousseau, Sim B Sitkin, Ronald S Burt, and Colin Camerer. Not so different after all: A cross-discipline view of trust. Academy of Management Review, 23(3):393–404, 1998.
[46] Quentin JM Huys, Marc Guitart-Masip, Raymond J Dolan, and Peter Dayan. Decision-theoretic psychiatry. Clinical Psychological Science, 3(3):400–421, 2015.
[47] Shaun Wang. Premium calculation by transforming the layer premium density. ASTIN Bulletin: The Journal of the IAA, 26(1):71–92, 1996.
[48] Shaun S Wang. A class of distortion operators for pricing financial and insurance risks. Journal of Risk and Insurance, pages 15–36, 2000.
[49] Hans Föllmer and Alexander Schied. Stochastic finance. de Gruyter, 2016.
[50] Carlo Acerbi. Spectral measures of risk: A coherent representation of subjective risk aversion. Journal of Banking & Finance, 26(7):1505–1518, 2002.
[51] Menahem E Yaari. The dual theory of choice under risk. Econometrica, pages 95–115, 1987.
[52] Hans Peter Wächter and Thomas Mazzoni. Consistent modeling of risk averse behavior with spectral risk measures. European Journal of Operational Research, 229(2):487–495, 2013.
[53] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, March 1979.
[54] Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323, 1992.