Introspective Forecasting

Loizos Michael
Open University of Cyprus
loizos@ouc.ac.cy

Abstract

Science ultimately seeks to reliably predict aspects of the future; but how is this even possible in light of the logical paradox that making a prediction may cause the world to evolve in a manner that defeats it? We show how learning can naturally resolve this conundrum. The problem is studied within a causal or temporal version of the Probably Approximately Correct semantics, extended so that a learner's predictions are first recorded in the states upon which the learned hypothesis is later applied. On the negative side, we make concrete the intuitive impossibility of predicting reliably, even under very weak assumptions. On the positive side, we identify conditions under which a generic learning schema, akin to randomized trials, supports agnostic learnability.

1 Introduction

Several scientific disciplines seek to reliably forecast a future state of affairs. Often, utility is derived not simply by having one's predictions verified, but by acting upon the predictions prior to their verification. Being able to predict the stock market, for example, would be of little value if one were not to invest based on that prediction, or get paid to share it. In certain cases, taking such actions is accompanied by the distinct risk of altering the temporal evolution of the environment in a way that would lead to a different future than the one anticipated.

The problem is modeled as that of a forecaster in a possibly adversarial environment, where the environment becomes aware of the forecaster's prediction before it commits to the outcome that the forecaster attempts to predict. In this context, a forecaster is required to be introspective, and acknowledge the effect of its predictions on their realizability.

The contributions of this work are four-fold: (i) it formalizes introspective forecasting (cf. Definition 4) as an extension of the PAC semantics [Valiant, 1984; Michael, 2011]; (ii) it identifies a realizability quantity (cf. Definition 3) that measures the optimal correctness that any introspective forecaster can achieve; (iii) it proposes a general schema for constructing introspective forecasters and identifies conditions (cf. Theorem 1) under which it applies; (iv) it establishes limitations on what can be introspectively forecasted (cf. Theorem 2) and discusses situations under which these limitations can be lifted (cf. Theorem 3).

Of course, introspective forecasting is not meant to be used to make self-fulfilling predictions, which are realized simply because they were acted upon. Instead, having the ability to make such self-fulfilling predictions allows one to rationally choose whether it is warranted to act upon them, if this indeed happens to lead to a future that is preferred over one where no action is taken. We defer a discussion of the philosophical issues that inevitably arise to an extended version of this paper, and focus herein on the technical framework. We start with an overview of certain key aspects of this work.

2.1 Causal Learnability

Causal learnability [Michael, 2011] extends the PAC learning semantics [Valiant, 1984] to a causal or temporal setting. Below we present a simplified version of that framework.
Given as inputs T = ∪n Tn, S = ∪n Sn, and ε, δ ∈ (0, 1], where Sn is a set of states of size n, and Tn is a set of functions that map Sn to Sn, the learning algorithm proceeds as follows: During a training phase, the learner observes pairs of states ⟨s, t(s)⟩, such that s is drawn independently at random from some arbitrary but fixed probability distribution Dn, and such that t is a fixed target function in Tn. After time upper bounded by a fixed polynomial in n, 1/ε, 1/δ, the learner returns, with probability at least 1 − δ, a hypothesis function h : Sn → Sn such that the probability that h(s) = t(s) on a state s randomly drawn from Dn is at least 1 − ε. If the learner can meet the stated requirements for every choice of n, ε, δ ∈ (0, 1], Dn, and t ∈ Tn, then the class T is causally learnable. Previous work [Michael, 2011] investigates which classes are causally learnable, by establishing, among others, connections to PAC concept learning [Valiant, 1984].

One may allow functions in Tn that map Sn to a set Ln, when modeling scenarios where a learner seeks to predict not the entire successor state t(s) of s ∈ Sn, but only some aspect, or label in Ln, of t(s). When |Ln| = 2, this resembles PAC concept learning, but with a key conceptual difference: the causal dependence of the label of t(s) on s. Even when s is observed, the label of t(s) is not determined until the successor state t(s) materializes; if we were to somehow affect s after observing it, this would also affect the label of t(s).

2.2 Recording Predictions

To make the aforementioned conceptual distinction between PAC concept learning and causal learning explicit, let Tn be as in the original causal learning framework, and introduce a new set Fn of functions that map Sn to Ln. Like concepts in PAC learning, each f ∈ Fn maps a state to a label of that same state. Unlike PAC learning, however, the class F = ∪n Fn does not aim to act as a hidden structure to be learned; the hidden structure is the temporal dynamics of the environment as captured by the class T. Instead, F is available for the learner, which may choose to apply any f ∈ Fn on the successor state t(s) ∈ Sn to produce a label (i.e., to extract a particular feature). For any chosen function f ∈ Fn, then, the definition of causal learnability applies mutatis mutandis as before, except that the learner has access to Ln and f, and aims to return a hypothesis function h : Sn → Ln such that h(s) = f(t(s)). If the learner can meet the learning requirements for every choice of n, ε, δ ∈ (0, 1], Dn, t ∈ Tn, and f ∈ Fn, then the class T is extrospectively learnable through the class F.

Extrospective learnability adopts an assumption that is implicit in several supervised learning frameworks: the learner operates outside the environment, in that the label of a learning instance is determined irrespectively of the learner's prediction. This is so even when the learning instances are chosen adversarially with access to the learner's current hypothesis (e.g., in mistake-bounded learning [Littlestone, 1988]), since even then the label of each (adversarially) chosen learning instance is still independent of the learner. Closer in spirit to the causal nature of extrospective learnability, the stated assumption is also followed by certain statistical and learning approaches that are used to predict the temporal evolution of a sequence of states [Murphy, 2002; Box et al., 2008].
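For concreteness, the extrospective protocol can be sketched programmatically. The sketch below is illustrative only and not part of the formal framework: the callables draw_state, t, and f stand for Nature's hidden choices (sampling from Dn, the temporal dynamics, and the chosen feature, respectively).

```python
def extrospective_examples(draw_state, t, f, m):
    """Training oracle of extrospective (causal) learning: m pairs (s, f(t(s))).
    The label depends only on Nature's choices t, f and on s itself,
    never on anything the learner announces."""
    examples = []
    for _ in range(m):
        s = draw_state()              # s ~ Dn
        examples.append((s, f(t(s))))
    return examples

def extrospective_error(h, draw_state, t, f, trials=10_000):
    """Monte Carlo estimate of Pr[h(s) != f(t(s))] over s ~ Dn;
    the learner succeeds if this stays below epsilon."""
    mistakes = 0
    for _ in range(trials):
        s = draw_state()
        if h(s) != f(t(s)):
            mistakes += 1
    return mistakes / trials
```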
To account for situations where the very act of making a prediction may potentially affect the correct / expected prediction, we introduce the notion of a recorder function ⊕ that maps Sn × Ln to Sn; thus, a recorder function records into a state s ∈ Sn a given label l ∈ Ln and produces a new state s ⊕ l ∈ Sn. This new state belongs in the same set Sn of states as state s, since the manner in which the environment is represented is independent of whether predictions are made. As in extrospective learnability, the goal is to make a prediction h(s) on the label of the successor state when given only s ∈ Sn. The successor state, however, is now t(s ⊕ h(s)). Having observed s, the learner's prediction h(s) is announced and recorded in s, so that the initial state becomes s ⊕ h(s). It is this affected initial state that brings about the successor state t(s ⊕ h(s)), whose label f(t(s ⊕ h(s))) is to be predicted. For any recorder ⊕, then, the definition of extrospective learnability applies mutatis mutandis as before, except that the learner has access to ⊕, and aims to return a hypothesis function h : Sn → Ln such that h(s) = f(t(s ⊕ h(s))). If the learner can meet the learning requirements for every choice of n, ε, δ ∈ (0, 1], Dn, t ∈ Tn, f ∈ Fn, and ⊕, then the class T is introspectively learnable through the class F.
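In code, the introspective requirement differs from the extrospective one only in that the announced prediction is recorded before the temporal dynamics are applied. The following is a minimal sketch under the same assumptions as before, with record standing for the recorder ⊕:

```python
def introspective_error(h, draw_state, t, f, record, trials=10_000):
    """Monte Carlo estimate of Pr[h(s) != f(t(record(s, h(s))))] over s ~ Dn:
    the prediction h(s) is first recorded into s, and only the resulting
    state evolves under the temporal dynamics t."""
    mistakes = 0
    for _ in range(trials):
        s = draw_state()                       # s ~ Dn
        prediction = h(s)                      # announced label
        successor = t(record(s, prediction))   # evolve s (+) h(s)
        if prediction != f(successor):
            mistakes += 1
    return mistakes / trials
```

With a recorder that returns s unchanged (the trivial recorder of Section 3), the criterion reduces to the extrospective one.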
2.3 Introspective Setting

In the extrospective case (and other typical supervised learning models), the hypothesis function h that is sought is guaranteed to exist; simply set h(·) = f(t(·)). In the introspective case, however, the hypothesis function h appears on both sides of the equality h(s) = f(t(s ⊕ h(s))) that it seeks to obey; it is not necessarily possible to factor out h from the equation. This leads to the consideration of an agnostic setting, where h is required to obey the equality with probability 1 − α − ε, where 1 − α is the optimal probability that any hypothesis can achieve. As is typical in agnostic or noisy learning models, the learner is allowed time polynomial in the usual learning parameters, but also in 1/(1 − α), to compensate for the adversarial environment it faces. These extensions give rise to the final definition of introspective learnability that we adopt.

Under this final definition, we establish certain connections between extrospective (and, in particular, causal) learnability and introspective learnability. We first show that T is introspectively learnable through F if T is causally learnable (cf. Theorem 1), but causal learnability is not also a necessary condition (cf. Theorem 2). We then introduce a metric of how expressive the set Fn is (cf. Definition 5), in terms of how many functions in the set are needed to invert the mappings they produce, and show that if running time polynomial in this invertibility dimension of Fn is also allowed, then the necessity of causal learnability can be established (cf. Theorem 3).

2.4 Randomized Trials

The sufficiency of causal learnability for introspective learnability comes through a process analogous to randomized trials, often used in clinical research and certain social sciences. Roughly, a randomized trial examines the effects of competing intervention strategies by applying a randomly chosen one on each instance of interest. In the case of clinical research, for example, one administers a randomly chosen drug among a certain set of such drugs to each patient participating in the study.

The need for randomized trials arises from the fact that administering a drug to cure / prevent a predicted illness in the future may have inadvertent and unpredictable effects. What is needed, then, is making predictions that are introspectively accurate. Ideally, a doctor should administer drugs based not on the extrospectively predicted illness of a patient (i.e., "take this drug because without it you will remain / become sick"), but on the introspectively predicted wellness of the patient (i.e., "take this drug because with it you will remain / become well"). Randomized trials offer an empirical exploration of the space of possible introspectively accurate solutions.

Our setting differs non-trivially from the one above. A prediction in clinical research is effectively always that the patient will not exhibit the illness specified by a certain study. Thus, h(s) is thought of as a constant k, and one seeks to solve the equation h(s) = f(t(s ⊕ h(s))), or k = f(t(s ⊕ k)), by choosing an appropriate recorder ⊕ (i.e., how to act upon the prediction) among a small set of such recorders (e.g., administer a new drug, an existing one, or a placebo). By contrast, our results apply to all possible recorders, and our aim is not to identify one of them, but rather, for each of them, to identify the hypothesis h that solves the equation h(s) = f(t(s ⊕ h(s))). Further, the hypothesis is a function, and one cannot simply exhaustively consider its application on every state s ∈ Sn to empirically determine the value of h(s). This last consideration necessitates the use of learning techniques along with randomized trials to efficiently and reliably solve the equation.

2.5 Reinforcement Learning

Although introspective learnability descends from a supervised learning model [Valiant, 1984], it blurs the dividing line between supervised (generalization and pattern recognition) and reinforcement (trial-and-error) learning [Sutton and Barto, 1998]. In the latter case, an agent (partially) observes a state of its environment and chooses an action, whose execution (stochastically) transitions the environment to another state and presents the agent with an immediate reward. The agent's goal in this setting is to minimize its overall long-term regret. The driving force behind the vast literature on reinforcement learning is primarily the tension between exploration to identify an optimal long-term strategy, and exploitation to get rewarded. Locally-optimal actions may transition the environment to parts of the state space that defeat the agent's long-term prospects. By contrast, an introspective forecaster observes only an initial and a successor state, and can opt for the locally-optimal prediction, ignoring long-term effects.

Because of the central concern for identifying a long-term policy, reinforcement learning remains interesting even under the assumption that the immediate reward for each available action in each state is known to the agent. In this case, reinforcement learning effectively reduces to a planning problem in a known environment, with the agent's goal being to find an optimal sequence of actions from an initial state. In a sense, then, introspective forecasting is the dual problem to planning, since it focuses exactly on the learning aspect of reinforcement learning, ignoring any planning considerations.
Among the related literature on multi-armed bandits [Robbins, 1952], most relevant is the study of the contextual bandit problem [Li et al., 2010], where an agent observes a context, takes an action, and is then rewarded accordingly. As in introspective forecasting, the problem becomes rather interesting when running time is allowed to grow only polynomially in the description size of the contexts (or states, in our case). Even considering only the obvious difference that an introspective forecaster does not get rewarded suffices to see that we assume the availability of a richer feedback mechanism for the agent. Attempting, for instance, to use binary rewards / bandits to simulate prediction fulfillment would result in not seeing the label of the resulting state in case the prediction is not fulfilled. More to the point, in our setting we assume, in fact, that the agent gets to see the entire successor state itself. Most importantly, the contextual bandit problem is viewed through the reinforcement learning paradigm, seeking to minimize regret by balancing exploration and exploitation. Our work approaches the problem through the supervised learning paradigm, seeking to provide guarantees on predictive accuracy after an initial training phase. At the end of the day, contextual bandits and our work could be seen as complementary attempts moving towards the same point, but from opposing directions: to unify the two learning paradigms by investigating learning algorithms that identify and exploit both the intra-state and inter-state structure of their environment.

3 Framework

The making of highly-accurate predictions can be viewed as a game between Nature and the Predictor. The game's parameters are determined by an environment E = ⟨L, S, F, T, H⟩, which defines for each positive integer n: a set Ln ⊆ L of label representations; a set Sn ⊆ S of state representations; a set Fn ⊆ F of functions from Sn to Ln; a set Tn ⊆ T of functions from Sn to Sn; a set Hn ⊆ H of all functions from Sn to Ln. We assume that Ln, Sn, Fn, Tn, Hn are finite (but unbounded, as n grows), and that functions in Fn, Tn can be evaluated on their inputs in time polynomial in the input size.

The game is played in three rounds. First, the Predictor chooses a function f ∈ Fn, encoding the feature of states to be predicted. Second, Nature chooses: (i) a function t ∈ Tn, encoding the laws governing the temporal evolution of states; (ii) a probability distribution Dn over Sn, encoding the process of choosing states. Third, the Predictor chooses a function h ∈ Hn, encoding the process that it uses to make a prediction.

Once both players make their choices, the predictive ability of h is evaluated thus: A state s ∈ Sn is chosen at random from Dn, and the prediction h(s) is computed. State s is mapped to its temporal successor t(s), and the label f(t(s)) of this latter state is computed. Function h is said to be extrospectively (1 − ε)-accurate w.r.t. ⟨f, t, Dn⟩ if h(s) = f(t(s)) except with probability ε over the choice of state s from Dn.

Next, we extend the framework to account for recordings.

Definition 1 (Recorder). A recorder ⊕ of labels Ln in states Sn is a function ⊕ : Sn × Ln → Sn that can be evaluated on its inputs in time polynomial in the input size. The recorder ⊕ is degenerate for a pair of labels l1, l2 ∈ Ln if for every state s ∈ Sn, it holds that s ⊕ l1 = s ⊕ l2; i.e., the recording of the two labels is not distinguishable. The recorder ⊕ is degenerate if it is degenerate for every pair of labels in Ln.

Our notion of a recorder is the broadest possible given the requirement of efficiency. It may fully or only partially record a label, so that distinct labels affect a state in distinct or similar ways. A recorder is degenerate if it records only the fact that some prediction is announced, but ignores the prediction itself. Among the degenerate recorders, the trivial recorder maps each state to itself, so predictions are not announced at all. Unless stated otherwise, we deal with arbitrary recorders.
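To fix intuitions, below is a toy instantiation of Definition 1, with states as bit strings and labels as single bits; the particular encoding is ours and only illustrative, and the degeneracy check brute-forces the state space, which is feasible only at toy sizes.

```python
from itertools import product

# Toy instantiation: states are bit strings of length n, labels are {0, 1}.
n = 4
STATES = [''.join(bits) for bits in product('01', repeat=n)]
LABELS = ['0', '1']

def record(s, l):
    """A non-degenerate recorder: overwrite the last bit of the state
    with the announced label."""
    return s[:-1] + l

def trivial_record(s, l):
    """The trivial recorder: predictions are not announced at all."""
    return s

def is_degenerate(recorder, states=STATES, labels=LABELS):
    """Definition 1: degenerate iff every pair of labels is recorded
    indistinguishably in every state (checked by brute force)."""
    return all(recorder(s, l1) == recorder(s, l2)
               for s in states for l1 in labels for l2 in labels)

print(is_degenerate(record))          # False: the two labels are distinguishable
print(is_degenerate(trivial_record))  # True: the recording ignores the label
```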
The game is played in three rounds, as before, with an extra choice for the Predictor during the first round: the Predictor chooses a recorder ⊕ of Ln in Sn. Once both players make their choices, the predictive ability of h is evaluated thus: A state s ∈ Sn is chosen at random from Dn, and the prediction h(s) is computed and recorded in state s, giving rise to s ⊕ h(s). The resulting state s ⊕ h(s) is mapped to its temporal successor t(s ⊕ h(s)), and the label f(t(s ⊕ h(s))) of this latter state is computed. Function h is said to be introspectively (1 − ε)-accurate w.r.t. ⟨⊕, f, t, Dn⟩ if h(s) = f(t(s ⊕ h(s))) except with probability ε over the choice of state s from Dn.

Unlike in the extrospective setting, a highly-accurate hypothesis h does not necessarily exist (and typically does not).

Proposition 1 (Non-Predictability Scenarios). Assume that the Predictor chooses during the first round (i) any non-degenerate recorder ⊕ of Ln in Sn, and (ii) any non-constant function f ∈ Fn, such that the recorder ⊕ is non-degenerate for a pair of labels l1, l2 ∈ Ln in the image of f. Then, Nature can always choose during the second round (i) a function t ∈ Tn, and (ii) a probability distribution Dn over Sn, such that for every function h ∈ Hn that is introspectively (1 − ε)-accurate w.r.t. ⟨⊕, f, t, Dn⟩, it holds that ε = 1.

Proof. Since ⊕ is a non-degenerate recorder for the pair of (distinct) labels l1, l2, then by Definition 1 there exists a state s0 ∈ Sn such that s0 ⊕ l1 ≠ s0 ⊕ l2. Since l1, l2 are in the image of f, there exist states s1, s2 ∈ Sn such that f(s1) = l1 and f(s2) = l2. Define the probability distribution Dn to assign non-zero probability only to state s0. For every state s ∈ Sn, define t(s) to equal s1 if s = s0 ⊕ l2 and to equal s2 if s ≠ s0 ⊕ l2. The result follows by case analysis on h(s0).

The very weak conditions of Proposition 1 corroborate and explain the strong intuition that an announced prediction can be easily defeated. This ability derives from the possibility that the temporal dynamics t ∈ Tn may invariably respond to the recorded prediction itself in an adversarial manner. We seek conditions that quantify such adversarial behavior, which presumably is limited in those real-world scenarios where announced predictions are not usually falsified.

Definition 2 (Response Function). Consider a recorder ⊕ of Ln in Sn, a function f ∈ Fn, a function t ∈ Tn, and a state s ∈ Sn. The response function for s under ⟨⊕, f, t⟩ is the function rs : Ln → Ln such that for every label l ∈ Ln, rs(l) = f(t(s ⊕ l)).

Definition 3 (Realizability Degree). Consider a recorder ⊕ of Ln in Sn, a function f ∈ Fn, a function t ∈ Tn, and a probability distribution Dn over Sn. Nature's choice ⟨t, Dn⟩ is (1 − α)-realizable for ⟨⊕, f⟩ if the response function rs for a randomly chosen state s under ⟨⊕, f, t⟩ has a fixed-point except with probability α over the choice of state s from Dn.

Definition 3 captures precisely the likelihood α for Nature to present the Predictor with a state s that, under Nature's chosen temporal dynamics t, makes every prediction made by the Predictor (using any hypothesis) in state s self-defeating.
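Definitions 2 and 3 admit a direct empirical reading when one has simulation access to Nature's choices. The sketch below is only illustrative; record, t, f, and draw_state stand for ⊕, t, f, and sampling from Dn, respectively.

```python
def response_function(s, record, f, t):
    """Definition 2: the response function r_s(l) = f(t(s (+) l))."""
    return lambda l: f(t(record(s, l)))

def has_fixed_point(r_s, labels):
    """Check whether some label l satisfies r_s(l) = l."""
    return any(r_s(l) == l for l in labels)

def estimate_realizability(draw_state, record, f, t, labels, trials=10_000):
    """Monte Carlo estimate of the realizability degree 1 - alpha
    (Definition 3): the probability, over s ~ Dn, that the response
    function for s admits a fixed-point."""
    hits = 0
    for _ in range(trials):
        s = draw_state()
        if has_fixed_point(response_function(s, record, f, t), labels):
            hits += 1
    return hits / trials
```

Note that checking a single state may cost up to |Ln| evaluations of the response function, foreshadowing the linear dependence on |Ln| established in Proposition 3.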
Proposition 2 (Condition for Predicting Introspectively). Consider a recorder ⊕ of Ln in Sn, a function f ∈ Fn, a function t ∈ Tn, and a probability distribution Dn over Sn. Then, there exists a function h ∈ Hn that is introspectively (1 − α)-accurate w.r.t. ⟨⊕, f, t, Dn⟩ if and only if Nature's choice ⟨t, Dn⟩ is (1 − α)-realizable for ⟨⊕, f⟩.

Proof. The claim follows directly from Definitions 2 and 3.

The Predictor's goal is, then, to find a function h ∈ Hn that is close to being introspectively (1 − α)-accurate (i.e., close to the optimal value that any possible hypothesis could achieve).

4 Learnability

The learning model formalized in this section operates under the following premises / principles:

(P1) The Predictor passively observes the current state of affairs that is drawn from some distribution.
(P2) A label is recorded exactly once and only according to the Predictor's chosen recorder, but recordings can also be simulated without affecting the actual state of affairs.
(P3) The Predictor observes the entire state of affairs that results according to Nature's temporal dynamics, irrespectively of the prediction task that it undertakes.
(P4) The Predictor seeks introspectively accurate predictions to the extent that they exist by the choices of Nature, with no a priori assumptions.
(P5) The Predictor has direct access to all of its choices (predicted feature and recorder), but not to those of Nature.
(P6) Learning is computationally efficient.

Definition 4 (Introspective Learnability). The environment E = ⟨L, S, F, T, H⟩ is information-theoretically introspectively learnable if there exists an algorithm A such that for every positive integer n, every real number ε ∈ (0, 1], every real number δ ∈ (0, 1], every recorder ⊕ of Ln in Sn, every function f ∈ Fn, every function t ∈ Tn, and every probability distribution Dn over Sn, if ⟨t, Dn⟩ is (1 − α)-realizable for ⟨⊕, f⟩ then there exists a function h ∈ Hn such that:
(i) given E, n, ⊕, f, ε, δ as inputs, A enters a training phase at the end of which it returns h, and then terminates;
(ii) during the training phase the following are repeated: a state s ∈ Sn is drawn from Dn and given to A; a label l ∈ Ln is chosen by A; the state t(s ⊕ l) is given to A;
(iii) h is introspectively (1 − α − ε)-accurate w.r.t. ⟨⊕, f, t, Dn⟩, except with probability δ over the randomness employed by A and Dn during the training phase.
If, in addition, algorithm A runs in time polynomial in |Ln|, log |Sn|, log |Fn|, log |Tn|, log |Hn|, 1/ε, 1/δ, 1/(1 − α), then the environment E is (efficiently) introspectively learnable.
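Condition (ii) of Definition 4 fixes the interaction protocol between the learner and the environment during training. A minimal environment-side sketch of one training round follows; draw_state and t stand for Nature's choices, record for the Predictor's recorder, and learner for any object exposing (hypothetical) choose_label and observe methods.

```python
def training_round(learner, draw_state, t, record):
    """One iteration of Definition 4, condition (ii): a state s ~ Dn is
    shown to the learner, the learner picks a label, the label is recorded
    in s, and the successor t(s (+) l) is handed back as feedback."""
    s = draw_state()                   # state drawn from Dn, given to A
    l = learner.choose_label(s)        # label chosen by A
    successor = t(record(s, l))        # state t(s (+) l), given to A
    learner.observe(s, l, successor)   # A sees the entire successor state (P3)
    return s, l, successor
```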
Condition (i) follows the typical Learning Theory requirement of having certain problem parameters available as inputs (cf. Principle (P5)). Condition (ii) follows passive supervised learning (cf. Principles (P1) and (P3)), while embracing aspects of active learning by having the algorithm partly affect, through recording, the input state (cf. Principle (P2)). Condition (iii) follows the typical PAC learning semantics in insisting on predictive guarantees, while also incorporating aspects of agnostic learning by acknowledging that the optimal accuracy might be exogenously restricted (cf. Principle (P4)). Efficiency (cf. Principle (P6)) adopts the usual polynomial dependence of the running time on the problem parameters.

Time linear in log |Sn|, log |Fn|, log |Tn|, log |Hn| is needed even to read or write an element of Sn, Fn, Tn, Hn. Time linear in 1/ε and logarithmic in 1/δ is needed to compensate for the arduous requirement of achieving arbitrarily high accuracy and confidence in the returned hypothesis [Ehrenfeucht et al., 1989]. Time linear in 1/(1 − α) is needed to compensate for the (possibly small) probability 1 − α of drawing states for which a correct prediction exists; an analogous treatment applies for noisy instances [Angluin and Laird, 1988]. To support a general, task-independent formulation, the introspective learning algorithm is a priori oblivious to the actual value of α (as determined by Nature's choices), even if its running time may depend on it. Lastly, time linear in |Ln| (as opposed to log |Ln|) is needed even to enumerate the labels in Ln during the training phase, which is shown next to be necessary. In effect, the following result shows the necessity of employing some form of randomized trials for introspective learning, at least when the labels lack any usable internal structure.

Proposition 3 (Linear Dependence on Number of Labels). Any algorithm that information-theoretically introspectively learns an environment ⟨L, S, F, T, H⟩ runs in time Ω(|Ln|).

Proof. Consider a presumed counterexample algorithm to the claim. Let Dn assign probability 1 to a single state s, and let the response function rs for s under ⟨⊕, f, t⟩ have exactly one fixed-point. For ε, δ < 1, the algorithm must identify a fixed-point of rs, except with probability δ over its randomness. For large n, however, the algorithm's running time is less than (1 − δ) · |Ln|, and probing rs for its fixed-point fails with probability more than δ.

Given the allowed computational resources, then, an introspective learner A seeks to find a fixed-point of the response function rs for s under ⟨⊕, f, t⟩, for states s of sufficient probability mass under Dn. To appreciate the central aspects of the learning model, we next consider a possible training strategy for algorithm A.

Following Condition (ii) of Definition 4, algorithm A first obtains a state s0 ∈ Sn drawn from Dn. Having no indication as yet of what the sought function h might look like, the algorithm may choose an arbitrary label l0 ∈ Ln, and compute the state s1 = s0 ⊕ l0 that results from recording l0 in s0. After the label is recorded, the algorithm receives the state s2 = t(s1), and may compute the label l2 = f(s2). At this point the algorithm may seek to exploit the information that has been made available so far. We examine the algorithm's behavior by case analysis.

Assume first that l0 = l2, so that the algorithm predicts correctly. It can, then, safely set h(s0) := l0. Assume now that l0 ≠ l2, so that the algorithm predicts incorrectly. Note that, unlike in typical supervised learning models, the correct prediction is not revealed to the algorithm. Indeed, label l2 is not necessarily the appropriate value to assign to h(s0); for the same state s0, Nature may have responded to a prediction of l2 with l2′, to a prediction of l2′ with l2′′, and so on, without any of these labels corresponding to a fixed-point of the response function rs0 for s0 under ⟨⊕, f, t⟩. Thus, algorithm A can safely infer in this case only that h(s0) ≠ l0, but not that h(s0) = l2. In either case, the algorithm has access to a pair ⟨s1, s2⟩ comprising the initial / successor states of the environment according to the temporal dynamics t.
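The case analysis above amounts to the following per-interaction bookkeeping, sketched under the same black-box assumptions as before (the function name and the returned tags are ours):

```python
def process_interaction(s0, l0, s2, f):
    """Given one training interaction (state s0, announced label l0,
    observed successor s2 = t(s0 (+) l0)), return what can be safely
    inferred about h(s0)."""
    l2 = f(s2)                        # label of the observed successor
    if l0 == l2:
        return ('fixed_point', l0)    # safe to set h(s0) := l0
    # Nature's response l2 is NOT a safe candidate for h(s0); the only
    # safe inference is that l0 is ruled out as a value for h(s0).
    return ('ruled_out', l0)
```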
The algorithm may seek, then, to solve the auxiliary learning problem of approximating t by an auxiliary hypothesis g (that agrees with t on sufficiently many inputs) [Michael, 2011]. If algorithm A obtains such a hypothesis g, it can compute a highly-accurate h: consider the response function rs(l) = f(g(s ⊕ l)) ≈ f(t(s ⊕ l)) for s under ⟨⊕, f, g⟩; systematically test it on each label in Ln to identify the least fixed-point l; if such an l exists, set h(s) := l. Since the algorithm has access to the recorder ⊕ and the function f, it can compute f(g(s ⊕ l)). Note that the computation of h in this manner cannot be done explicitly during the training phase, but it is done lazily when hypothesis h is applied during the testing phase to make a prediction on state s.

Why is the problem not solved, then, by simply learning t and computing h? Although the algorithm obtains a pair ⟨s1, s2⟩ of initial / successor states even when l0 ≠ l2, in that case state s1 is not drawn from the right probability distribution (in which only correct predictions are recorded)! Thus, while ⟨s1, s2⟩ can be used to learn t, it might offer little in ensuring that g approximates t on those states that are important (for computing h). The right probability distribution is not Dn from which Nature draws states, but a probability distribution D+n over states s ⊕ l induced by drawing a state s ∈ Sn from Dn, and then recording the least fixed-point l (or the one selected by any other prescribed tie-breaking function) of the response function rs for s under ⟨⊕, f, t⟩, if one exists. It is on states of sufficiently high probability mass under the probability distribution D+n that g should approximate t. However, sampling from D+n seems to presuppose a highly-accurate h to identify an appropriate label l for state s. Are we back to square one? Fortunately not quite, since it suffices to sample from D+n only sufficiently often.
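This discussion suggests the generic schema that underlies Theorem 1 below: record uniformly random labels during training (a randomized trial over Ln), hand the resulting state pairs to a causal learner for T, and evaluate h lazily by searching for a fixed-point of the learned response function. The sketch below is ours and only illustrative: causal_learner_B stands in for the assumed causal learning algorithm B (invoked here in batch on the collected pairs, and assumed to be run with a suitably small accuracy parameter), and the remaining callables for Nature's and the Predictor's choices.

```python
import random

def train_introspective(causal_learner_B, draw_state, t, f, record, labels, rounds):
    """Randomized-trials schema: record a uniformly random label in each
    drawn state, and feed the pair (s (+) l, t(s (+) l)) to a causal
    learner B for the temporal dynamics t."""
    examples = []
    for _ in range(rounds):
        s = draw_state()              # s ~ Dn
        l = random.choice(labels)     # randomized trial over the labels
        s1 = record(s, l)             # s (+) l
        s2 = t(s1)                    # observed successor
        examples.append((s1, s2))
    g = causal_learner_B(examples)    # hypothesis g approximating t

    def h(s):
        """Lazy evaluation of h: the first fixed-point of l -> f(g(s (+) l)),
        or an arbitrary (fixed) label if no fixed-point exists."""
        for l in sorted(labels):
            if f(g(record(s, l))) == l:
                return l
        return labels[0]

    return h
```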
Theorem 1 (Sufficient Learnability Condition w.r.t. T). The environment ⟨L, S, F, T, H⟩ is introspectively learnable if the class T is causally learnable.

Proof. Let B be a causal learning algorithm for the class T. Consider an environment E = ⟨L, S, F, T, H⟩ and the following algorithm A: Given inputs E, n, ⊕, f, ε, δ, algorithm A starts by simulating the execution of algorithm B on inputs Sn, Tn, ε(1 − α)/|Ln|, δ/2; we assume for now that α is known, and deal with the task of identifying it later. Whenever algorithm B requests a training instance, algorithm A samples a state s ∈ Sn from the probability distribution Dn, selects a label l ∈ Ln uniformly at random, computes s1 = s ⊕ l, receives s2 = t(s1), and passes ⟨s1, s2⟩ to algorithm B. Once algorithm B returns a hypothesis g and terminates, algorithm A constructs the hypothesis h ∈ Hn defined so that on any input s ∈ Sn, h(s) equals the lexicographically first fixed-point of the response function rs(l) = f(g(s ⊕ l)) for s under ⟨⊕, f, g⟩; if no such fixed-point exists, then h(s) equals an arbitrary (but fixed) label. Algorithm A returns h, and terminates.

Let D+n be the probability distribution defined as in our discussion above. Since a fixed-point exists with probability 1 − α, and since algorithm A records a label chosen uniformly at random in s, it follows that with probability (1 − α)/|Ln| the chosen label is the least fixed-point of rs. Overall, then, algorithm A obtains a pair ⟨s1, s2⟩ satisfying s2 = t(s1), so that state s1 is sampled from D+n with probability (1 − α)/|Ln|, and from some other probability distribution, say D−n, with the remaining probability.

It follows that if g(s1) = t(s1) except with probability ε(1 − α)/|Ln| on states s1 sampled from the probability distribution induced by D+n and D−n, then g(s1) = t(s1) except with probability ε on states s1 sampled from D+n. The correctness of algorithm A follows. The running time is as required by Definition 4 even if α is replaced with any upper bound α′ of α such that 1 − α′ is polynomially related to 1 − α. Such an α′ can be efficiently computed iteratively, as done when learning with an unknown noise rate [Laird, 1988; Angluin and Laird, 1988].

Although algorithm A's inputs resemble those of a causal learner, causal learnability is not a necessary condition for introspective learnability.

Theorem 2 (Counterexample to Necessity w.r.t. T). There exists an environment ⟨L, S, F, T, H⟩ that is introspectively learnable, but the class T is not causally learnable. (The result holds even if the classes F and F ◦ T are causally learnable.)

Proof. Consider an environment E = ⟨L, S, F, T, H⟩ such that F comprises constant functions only, and the class T is not causally learnable (see, e.g., [Michael, 2011] for such a class). Observe, however, that the environment E is (trivially) introspectively learnable: given f ∈ Fn as input, the learning algorithm can return the hypothesis h(·) := f(·), for which it holds that h(s) = f(s) = f(t(s ⊕ h(s))) for every state s ∈ Sn. With regards to the parenthetical constraint, note that Fn ◦ Tn = Fn, and F is causally learnable.

Theorem 2 illustrates a scenario where the temporal dynamics T are unlearnable, but the features F of the successor states are simple enough to nullify this unlearnability. Results similar in spirit can be established for more expressive classes F, all attesting to the following point: if the features F that the Predictor can reliably predict are insufficiently expressive, then it does not follow that the Predictor is able to learn the temporal dynamics T of Nature. We formalize this below.

Definition 5 (Invertibility Dimension). Consider a set Cn of functions with domain Xn and codomain Yn. The invertibility dimension of Cn equals the minimum non-negative integer d for which there exist functions c1, c2, . . . , cd ∈ Cn and a function u : (Yn)^d → Xn such that (i) all of the d + 1 functions can be evaluated on their inputs in time polynomial in the input size, and (ii) for every element x ∈ Xn, it holds that u(c1(x), c2(x), . . . , cd(x)) = x. If no such d exists, the invertibility dimension of Cn is infinite.

Recall that each f ∈ Fn captures a feature of some state of affairs. The invertibility dimension d is, then, the minimum number of such features that would be required to recover full information about a given state of affairs.

Theorem 3 (Necessary Learnability Condition w.r.t. T). The environment ⟨L, S, F, T, H⟩ is introspectively learnable only if the class T is causally learnable in non-uniform time polynomial in the usual learning parameters and also in the invertibility dimension d of Fn. For uniform time, the necessary condition needs to account also for the time required to identify d, u, c1, c2, . . . , cd as given in Definition 5.

Proof. Let A be an introspective learning algorithm for the environment E = ⟨L, S, F, T, H⟩. Consider the following algorithm B: Given inputs Ln, Sn, Fn, Tn, n, ε, δ, algorithm B starts by simulating the execution of d independent copies of algorithm A, so that the i-th copy is simulated on inputs E, n, ⊕, ci, ε/d, δ/2d, where ⊕ is the trivial recorder, d is the invertibility dimension of Fn, and u, c1, c2, . . . , cd are as in Definition 5 for Cn = Fn, Xn = Sn, Yn = Ln.
Whenever a copy of algorithm A requests a training instance, algorithm B samples a pair ⟨s, t(s)⟩ with s ∈ Sn drawn from the probability distribution Dn, passes s to algorithm A and receives a label l ∈ Ln, sets s1 = s ⊕ l and s2 = t(s1), and passes s2 to algorithm A. Once each copy of algorithm A returns a hypothesis hi and terminates, algorithm B constructs the hypothesis h ∈ Hn defined so that on any input s ∈ Sn, h(s) equals u(h1(s), h2(s), . . . , hd(s)). Algorithm B returns h, and terminates.

Correctness follows by observing that each copy of algorithm A is simulated given the trivial recorder ⊕, which, in particular, implies α = 0. The computational complexity is non-uniform, since given Fn we assume access to d and u, c1, c2, . . . , cd.

Analogous results to Theorems 1 and 2 can be obtained w.r.t. the class F ◦ T (the composition of functions from F and T).

5 Conclusions

Understanding how the realizability of predictions is affected by the actions effected by the predictions themselves is key in designing intelligent and autonomous machines. We have put forward a learning-theoretic model within which the problem can be formally investigated. Further connections to the general issue of usefully accommodating predictions in the process of learning [Michael, 2008; 2014] remain to be studied.

Future work can extend the model to settings where: state sensing is noisy [Angluin and Laird, 1988; Shackelford and Volper, 1988] or partial [Michael, 2010]; only statistical information is available [Kearns, 1998]; the successor state is not always available [Balcan and Blum, 2010]; the temporal dynamics are stochastic [Kearns and Schapire, 1994]; the initial state is partially chosen / fixed by the learner [Angluin, 1988; Michael, 2011] or a teacher [Goldman and Mathias, 1996]; the labels exhibit learnable internal structure that allows introspective learnability with sub-linear (and even logarithmic) time-dependence on the number of labels (cf. Proposition 3). Further work can be done on developing unbiased introspective learners that return a uniformly at random chosen hypothesis among all highly-accurate ones, addressing, thus, concerns of manipulation by a forecaster [Simon, 1954, p. 251].

On a second front, future work can seek to approach, as special instances of introspective forecasting, tasks whose understanding would prima facie benefit from such a treatment: (i) accommodating for the side-effects of announcing poll results [Simon, 1954]; (ii) personalizing the education of students without risking to overwhelm and discourage them [Michael, 2012]; (iii) anticipating how spam senders adapt to the spam filtering tools in use [Fawcett, 2003]¹; and (iv) playing in general game competitions [Genesereth et al., 2005], where a technique known as Monte Carlo Tree Search [Browne et al., 2012], which effectively reacts to a randomly predicted strategy for an agent's opponent, was found to be very effective. Approaching randomized trials in clinical research as special instances of introspective forecasting could conceivably yield optimal interventions more efficiently and less intrusively.

A third front relates to investigating the deeper connection of this work to fixed-points. We have sought to identify fixed-points to the extent that they happen to exist (as given by α), sidestepping concerns on the conditions under which they exist [Brouwer, 1911], and without a need to make multiple queries to any one of Nature's response functions [Sperner, 1928].
We have considered the identification of exact fixed-points for a set of nominal-valued response functions, and sought to efficiently ε-approximate the subset of functions for which this was possible. One could examine the case of identifying approximately the fixed-points themselves [Etessami and Yannakakis, 2010], either by introducing a distance metric between labels, or by clustering them into equivalence classes. Investigating learnability in such coarser settings could be informed by work on learning to approximate real-valued functions [Balcan and Harvey, 2012], while also accommodating environments with sets Ln, Sn, Fn, Tn, Hn of infinite size. Much work remains to be done in exploring the intricacies of introspective forecasting and its utilization in various settings. Yet, its demonstrable amenability to a task-independent formalization already offers a glimpse into its possible role as a convergence point across disciplines, contrary to arguments offered against the unity of science [Fodor, 1974]. The gist of an oft-cited argument is that prediction "tend[s] to change the initial course of developments" in the social sciences, whereas "[t]his is not true of prediction in fields which do not pertain to human conduct" [Merton, 1936, p. 903]. Our work responds by demonstrating that, to the extent that prediction is possible in any given setting and discipline, the same methodology can yield optimal, in a defined sense, predictions.

¹ Offered as a challenge to Knowledge Discovery and Data Mining: "[this perspective is] foreign to most data mining researchers: the data are mined and the results are deployed, but the data environment is not considered to be an active entity that will react in turn".

References

[Angluin and Laird, 1988] Dana Angluin and Philip Laird. Learning From Noisy Examples. Machine Learning, 2(4):343–370, 1988.
[Angluin, 1988] Dana Angluin. Queries and Concept Learning. Machine Learning, 2(4):319–342, 1988.
[Balcan and Blum, 2010] Maria-Florina Balcan and Avrim Blum. A Discriminative Model for Semi-Supervised Learning. Journal of the ACM, 57(3):19:1–19:46, 2010.
[Balcan and Harvey, 2012] Maria-Florina Balcan and Nicholas Harvey. Learning Submodular Functions. In Proc. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2012), pages 846–849, 2012.
[Box et al., 2008] George Box, Gwilym Jenkins, and Gregory Reinsel. Time Series Analysis: Forecasting and Control. John Wiley & Sons, Inc., Hoboken, New Jersey, U.S.A., 4th edition, 2008.
[Brouwer, 1911] Luitzen Brouwer. Über Abbildungen von Mannigfaltigkeiten. Mathematische Annalen, 71(1):97–115, 1911.
[Browne et al., 2012] Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[Ehrenfeucht et al., 1989] Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A General Lower Bound on the Number of Examples Needed for Learning. Information and Computation, 82(3):247–261, 1989.
[Etessami and Yannakakis, 2010] Kousha Etessami and Mihalis Yannakakis. On the Complexity of Nash Equilibria and Other Fixed Points. SIAM Journal on Computing, 39(6):2531–2597, 2010.
[Fawcett, 2003] Tom Fawcett. In Vivo Spam Filtering: A Challenge Problem for KDD. SIGKDD Explorations Newsletter, 5(2):140–148, 2003.
[Fodor, 1974] Jerry Fodor. Special Sciences (or: The Disunity of Science as a Working Hypothesis). Synthese, 28(2):97–115, 1974.
[Genesereth et al., 2005] Michael Genesereth, Nathaniel Love, and Barney Pell. General Game Playing: Overview of the AAAI Competition. AI Magazine, 26(2):62–72, 2005.
[Goldman and Mathias, 1996] Sally Goldman and David Mathias. Teaching a Smarter Learner. Journal of Computer and System Sciences, 52(2):255–267, 1996.
[Kearns and Schapire, 1994] Michael Kearns and Robert Schapire. Efficient Distribution-Free Learning of Probabilistic Concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.
[Kearns, 1998] Michael Kearns. Efficient Noise-Tolerant Learning from Statistical Queries. Journal of the ACM, 45(6):983–1006, 1998.
[Laird, 1988] Philip Laird. Learning from Good and Bad Data. Kluwer Academic Publishers, Norwell, Massachusetts, U.S.A., 1988.
[Li et al., 2010] Lihong Li, Wei Chu, John Langford, and Robert Schapire. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proc. 19th International Conference on World Wide Web (WWW 2010), pages 661–670, 2010.
[Littlestone, 1988] Nick Littlestone. Learning Quickly when Irrelevant Attributes Abound: A New Linear-Threshold Algorithm. Machine Learning, 2(4):285–318, 1988.
[Merton, 1936] Robert Merton. The Unanticipated Consequences of Purposive Social Action. American Sociological Review, 1(6):894–904, 1936.
[Michael, 2008] Loizos Michael. Autodidactic Learning and Reasoning. Doctoral Dissertation, Harvard University, U.S.A., 2008.
[Michael, 2010] Loizos Michael. Partial Observability and Learnability. Artificial Intelligence, 174(11):639–669, 2010.
[Michael, 2011] Loizos Michael. Causal Learnability. In Proc. 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), pages 1014–1020, 2011.
[Michael, 2012] Loizos Michael. Primum Non Nocere for Personalized Education. In Proc. Workshop on Personalizing Education With Machine Learning (PEwML 2012), 2012.
[Michael, 2014] Loizos Michael. Simultaneous Learning and Prediction. In Proc. 14th International Conference on Principles of Knowledge Representation and Reasoning (KR 2014), pages 348–357, 2014.
[Murphy, 2002] Kevin Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Doctoral Dissertation, University of California, Berkeley, U.S.A., 2002.
[Robbins, 1952] Herbert Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
[Shackelford and Volper, 1988] George Shackelford and Dennis Volper. Learning k-DNF with Noise in the Attributes. In Proc. 1st Annual Workshop on Computational Learning Theory (COLT 1988), pages 97–103, 1988.
[Simon, 1954] Herbert Simon. Bandwagon and Underdog Effects and the Possibility of Election Predictions. The Public Opinion Quarterly, 18(3):245–253, 1954.
[Sperner, 1928] Emanuel Sperner. Neuer Beweis für die Invarianz der Dimensionszahl und des Gebietes. Abhandlungen aus dem Mathematischen Seminar der Universität Hamburg, 6(1):265–272, 1928.
[Sutton and Barto, 1998] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, Massachusetts, U.S.A., 1998.
[Valiant, 1984] Leslie Valiant. A Theory of the Learnable. Communications of the ACM, 27(11):1134–1142, 1984.