# Provable Interactive Learning with Hindsight Instruction Feedback

Dipendra Misra¹, Aldo Pacchiano², Robert Schapire¹

*Equal contribution. ¹Microsoft Research. ²Broad Institute of MIT and Harvard, Boston University. Correspondence to: Aldo Pacchiano, Dipendra Misra, Robert Schapire.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. We study interactive learning in a setting where the agent has to generate a response (e.g., an action or trajectory) given a context and an instruction. In contrast to typical approaches that train the system using reward or expert supervision on responses, we study learning with hindsight labeling, where a teacher provides an instruction that is most suitable for the agent's generated response. This hindsight labeling of instructions is often easier to provide than expert supervision of the optimal response, which may require expert knowledge or can be impractical to elicit. We initiate the theoretical analysis of interactive learning with hindsight labeling. We first provide a lower bound showing that, in general, the regret of any algorithm must scale with the size of the agent's response space. Next, we study a specialized setting where the underlying instruction-response distribution can be decomposed as a low-rank matrix. We introduce an algorithm called LORIL for this setting, and show that it is a no-regret algorithm whose regret scales as $\sqrt{T}$, depends on the intrinsic rank, but does not depend on the size of the agent's response space. We provide experiments showing the performance of LORIL in practice in two domains.

1. Introduction

The success of a machine learning approach is intimately tied to the ease of getting training data. For example, language models (Brown et al., 2020; Achiam et al., 2023), which are one of the most successful applications of machine learning, are trained on an abundance of language data which is both easy to elicit from non-expert users and is available offline. In contrast, consider the task of a robot following instructions specified by a human user (Misra et al., 2016; Blukis et al., 2019; Myers et al., 2023). It is expensive to collect ground-truth robot trajectories, making standard imitation learning (IL) approaches (Pomerleau, 1991) expensive to apply, whereas reinforcement learning (RL) (Sutton & Barto, 2018) approaches suffer from high sample complexity. This makes IL and RL, the two most common ways of training agents, expensive in practice.

Motivated by the limitations of IL and RL, a line of work has proposed using hindsight labeling, where the agent (the robot in our example) generates a response (a trajectory) given an instruction, and a teacher, instead of providing an expensive ground-truth response, provides the instruction that is suitable for the agent's response (Fried et al., 2018; Nguyen et al., 2021). This reverses the labeling problem into an easier one, since instructions are typically in a format such as natural language, which can be inexpensively elicited from non-expert users, in contrast to robot trajectories. While this approach has been applied empirically, a theoretical understanding remains absent. In this work, we initiate the theoretical study of interactive learning from hindsight instruction. We consider the learning setup illustrated in Figure 1.
In this setup, a teacher is teaching an agent to navigate in a virtual home environment. In each round, the world gives an instruction and a context to the agent. The instruction in this case is expressed in natural language. The context is an image that provides information about the environment, such as the position, color, and sizes of different objects. The goal of the agent is to generate a trajectory that follows the given instruction. In the beginning, the agent lacks any language understanding and, therefore, cannot generate correct trajectories. We assume access to a teacher that can provide an instruction that best describes the agent's trajectory. This type of feedback can be viewed as a hindsight instruction, as it was the correct instruction in hindsight for the trajectory generated by the agent. In each round, the agent generates a trajectory and receives a hindsight instruction from the teacher. This allows the agent to learn a mapping from the instruction space to the trajectory space, which helps improve the agent's policy. We call this learning approach Learning from Hindsight Instruction (LHI).

[Figure 1. Sketch of our interactive Learning from Hindsight Instruction (LHI) setting. Step 1: the agent is given an instruction and a context. Step 2: the agent generates a trajectory to follow the instruction. Step 3: the teacher generates an instruction for the agent's trajectory. The agent interacts with the world iteratively. In each round (or time step), the agent is given an instruction $x_t$ and a context $s_t$. In our case, the context $s_t$ is the house layout. In response, the agent generates a trajectory (response) $y_t$, which is then labeled by a teacher model with an instruction $x'_t \sim P(X \mid Y = y_t)$ (hindsight instruction). The agent never receives any expert response or rewards.]

There are several different approaches for training a decision-making agent. One of the most commonly used approaches is imitation learning (IL), where a teacher provides access to expert demonstrations allowing the agent to learn the right behavior (Ross et al., 2011). For the example in Figure 1, this requires the teacher to understand the agent's action space and dynamics. This often requires domain expertise, can only be provided by expert teachers, and may require specialized tools.¹ In contrast, a non-expert user can easily provide an instruction for the red trajectory in Figure 1.

¹One such commonly used approach is motion capture, where a human can record behavior that can be transferred to a humanoid agent, but this requires specialized tools.

Reinforcement learning (RL) is another widely used approach for training agents that overcomes the expense of collecting expert demonstrations by directly optimizing a reward function, which is more user-friendly to provide (Sutton & Barto, 2018). However, RL approaches are less sample efficient than IL approaches, making them less suited for real-world settings. In contrast, learning from hindsight instruction uses instruction feedback, which is user-friendly and more natural for humans to provide. Further, instruction feedback contains significantly richer information than scalar rewards, which can help reduce the sample complexity compared to RL (Nguyen et al., 2021).
Because of its promise, learning from hindsight labeling has been explored in various applications (Andrychowicz et al., 2017; Fried et al., 2018; Nguyen et al., 2021). However, a principled understanding of this setting remains absent despite these empirical results. In particular, we focus on an interactive learning setting where a teacher trains the agent using hindsight instructions. For these settings, the natural evaluation metric is regret, which penalizes the agent for failing to follow a given instruction. A key challenge in designing algorithms for this setting is that the agent has to both exploit to follow the given instruction and explore to improve its understanding capabilities more generally.

In this work, we initiate the theoretical study of this setting. We first present a formal interactive learning setting and define a notion of regret. Motivated by natural settings where the teacher is a human user, we assume access to a black-box teacher that can generate a sampled instruction given the agent's response, but where the agent does not have access to the teacher's probability values. The agent is evaluated using a hidden reward given by the probability of the teacher labeling the agent's response with the originally given instruction. We first prove a lower bound for this setting showing that, in the worst case, the regret bound of any algorithm must scale polynomially with the size of the agent's response space. In many applications, such as our robot navigation example, the agent's response is a trajectory, leading to an exponentially large response space. However, in practice, function approximation and featurization enable generalization in extremely large spaces. Motivated by this, we introduce a low-rank setting where the agent has access to a feature representation of the response and context, and the teacher's distribution admits a low-rank decomposition in this feature space. We introduce an algorithm LORIL for this setting and derive regret bounds that scale as $\sqrt{T}$ with the horizon $T$. Importantly, the regret does not depend on the size of the agent's response space or the size of the instruction or context space, and instead depends on the rank of the teacher's distribution, which can be significantly smaller in practice.

We evaluate LORIL on two tasks. In the first, we use a synthetic task where the low-rank assumption holds and show that LORIL achieves lower regret compared to baselines. In the second, we apply LORIL to a setting with natural language instructions and images and show that insights from LORIL help even when the low-rank assumption does not hold. We include a discussion of the related literature after presenting our results, in Section 7. The code for all experiments in the paper can be found at https://github.com/microsoft/Intrepid.

Protocol 1  Learning from Hindsight Instruction (LHI). The line in blue needs to be implemented by an algorithm implementing the protocol.
1: for t = 1, 2, ..., T do
2:   World presents $s_t, x_t$, possibly adversarially
3:   Agent generates a response $y_t \in Y$
4:   Teacher describes the response $x'_t \sim P(X \mid y_t, s_t)$
5:   Evaluate using a hidden reward $r_t = P(x_t \mid y_t, s_t)$
6: Return $\sum_{t=1}^T r_t$
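To make the protocol concrete, the following is a minimal Python sketch of the interaction loop in Protocol 1. The `world`, `agent`, and `teacher` objects and their methods are hypothetical placeholders (not part of the released code); a concrete algorithm such as LORIL supplies the agent's response and update logic.

```python
# Minimal sketch of Protocol 1 (LHI). `world`, `agent`, and `teacher` are
# assumed interfaces; only the agent's response/update step is algorithm-specific.
def run_lhi(world, agent, teacher, T):
    total_hidden_reward = 0.0
    for t in range(1, T + 1):
        s_t, x_t = world.present(t)            # context and instruction, possibly adversarial
        y_t = agent.respond(x_t, s_t)          # agent's response (the line implemented by the algorithm)
        x_hat_t = teacher.describe(y_t, s_t)   # hindsight instruction x'_t ~ P(. | y_t, s_t)
        agent.update(x_t, s_t, y_t, x_hat_t)   # the agent only ever observes the hindsight label
        # Hidden reward r_t = P(x_t | y_t, s_t): used for evaluation only,
        # never revealed to the agent.
        total_hidden_reward += world.hidden_reward(x_t, y_t, s_t)
    return total_hidden_reward
```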
2. Preliminary and Overview

We first introduce mathematical notation before providing an overview of our setup.

Notation. For a given $N \in \mathbb{N}$, we define $[N] = \{1, 2, \ldots, N\}$. For a given countable set $U$, we denote the set of all distributions over $U$ by $\Delta(U)$. For a given positive-definite matrix $A \in \mathbb{R}^{d \times d}$, we define the induced norm of a vector $v \in \mathbb{R}^d$ as $\|v\|_A = \sqrt{v^\top A v}$.

Interactive Learning from Hindsight Instruction. We define the space of contexts as $S$, the space of instructions as $X$, and the space of all possible agent responses as $Y$. We assume $S$, $X$, and $Y$ to be finite for our analysis but allow them to be arbitrarily large. Finiteness is only a mild assumption in practice, as it still allows us to handle the most common data types. For example, if $X$ denotes the space of all $m \times n$ RGB images with each pixel taking values in $\{0, 1, \ldots, 255\}$, then $|X| = 256^{3mn}$, which is exponentially large but finite.

The agent interacts with the world repeatedly over $T$ rounds. Protocol 1 shows our learning framework. In the $t$-th round, the world presents a context $s_t$ and an instruction $x_t$ sampled according to a distribution $D_t(\cdot, \cdot \mid x_1, y_1, s_1, \ldots, x_{t-1}, s_{t-1}, y_{t-1})$ that can depend on the past history. Given the instruction and the context, the agent generates a response $y_t \in Y$. Ideally, we want the agent to generate a response that fulfills the intent of the instruction. After generating the response, the agent receives a hindsight instruction $x'_t$ sampled from a fixed conditional distribution $P(X \mid Y = y_t, S = s_t)$. This conditional distribution models a teacher that provides an instruction that is most appropriate for the agent's response. In a typical setting, this teacher will be modeled using a human in the loop. The agent does not have access to $P(X \mid Y = y_t, S = s_t)$ but can observe a sample from this distribution by generating a response and asking the teacher to label it with an instruction. This is because, in a human-in-the-loop setting, we don't have access to the human teacher's distribution. The teacher model $P(X \mid Y, S)$ is in practice highly stochastic, since there can be many possible ways to describe a given response. Further, the space of all possible responses and instructions can be impractically large, necessitating the use of function approximation.

Computing Regret. Given a context $s_t$ and an instruction $x_t$, the ideal response should maximize the probability of the teacher labeling the response with the given instruction $x_t$, i.e., the agent should play $y = \arg\max_{y \in Y} P(x_t \mid y, s_t)$. We can, therefore, view $P(x_t \mid y, s_t)$ as a latent reward for generating response $y$. This leads to a natural notion of regret:

$$\mathrm{Reg}(T) = \sum_{t=1}^T \Big( \max_{y \in Y} P(x_t \mid y, s_t) - P(x_t \mid y_t, s_t) \Big), \qquad (1)$$

where $s_t, x_t$ are the context and instruction in round $t$ and $y_t$ is the response generated by the agent.

There are alternative ways to define the regret in Equation 1. Log-probabilities $\log P(X \mid S, Y)$ may appear more natural to use than probabilities; however, the former is unbounded, which makes it ill-suited for defining a reward. For example, if in the first round the agent generates a response $y_1$ for which $P(x_1 \mid y_1, s_1) = 0$, then the agent's regret is unbounded irrespective of the agent's performance in later rounds. Another choice of reward is the likelihood of the response $P(y \mid x_t, s_t)$. However, this requires assuming a prior distribution over $y$, which can be hard to realize.
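As an illustration of Equation 1, the snippet below computes cumulative regret for an evaluator that, unlike the agent, has access to the teacher's probabilities; `teacher_prob` is a hypothetical dense table of $P(x \mid y, s)$ for a fixed context, used only for evaluation.

```python
import numpy as np

def cumulative_regret(teacher_prob, instructions, responses):
    """Equation 1 for a fixed context: teacher_prob is an |X| x |Y| array with
    teacher_prob[x, y] = P(x | y, s); instructions[t] and responses[t] are the
    instruction shown and the response played in round t."""
    regret = 0.0
    for x_t, y_t in zip(instructions, responses):
        best = teacher_prob[x_t].max()           # max_y P(x_t | y, s)
        regret += best - teacher_prob[x_t, y_t]  # per-round latent-reward gap
    return regret
```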
3. Lower Bound in the General Case

We first prove that it is impossible to design an algorithm for Protocol 1 with a regret bound that does not scale polynomially with the size of the response space $|Y|$. We introduce the concept of stochastic worlds to prove our lower bound. A stochastic world $W$ consists of a set of instructions and contexts, a marginal distribution $P_W(X, S)$, and a conditional distribution of instructions given responses and contexts $P_W(X \mid Y, S)$. When at time $t$ an agent $A$ interacts via Protocol 1 with a stochastic world $W$, the world produces an instruction $x_t$ and a context $s_t$ sampled from the time-independent distribution $P_W(X, S)$. We use the notation $P_{W,A}$ and $E_{W,A}$ to denote the measure and expectations over trajectories $(s_1, x_1, y_1, x'_1, \ldots, s_T, x_T, y_T, x'_T)$ resulting from the interaction between $A$ and world $W$.

We show that for any $K \in \mathbb{N}$ and any algorithm $A$, there is a stochastic world $W$ where the regret of algorithm $A$ satisfies $\mathrm{Reg}(T) \geq \Omega(\sqrt{KT})$ when $A$ interacts with $W$ through Protocol 1. To prove our main result we exhibit a family of stochastic worlds $\{W_i\}_{i \in [K]}$, where world $W_i$ is defined by context space $S = \{s_o\}$, instruction space $X = \{A, B\}$, response set $Y = [K]$, marginal $P_{W_i}(X, S) = \mathrm{Uniform}((A, s_o), (B, s_o))$ for all $i \in [K]$, and conditional

$$P_{W_i}(X \mid y_j) = \tfrac{1}{2} + \sqrt{K/T} \cdot \mathbb{1}(j = i) \cdot \big(1 - 2 \cdot \mathbb{1}(X = B)\big).$$

The context distribution is a point mass on the context $s_o$. In world $W_i$, the optimal response for instruction $X = A$ is $y_i$, and the optimal response for instruction $X = B$ is any $y_j$ with $j \neq i$. Any suboptimal decision, regardless of the instruction, incurs regret of order $\sqrt{K/T}$. Our main result is the following.

Theorem 1. Let $T \geq 256 \log(2e)$ and $K \geq 8e$. For any algorithm, there is at least one stochastic world $W_{\hat{i}}$ such that $\mathrm{Reg}(T) \geq \sqrt{KT}/8$ with probability at least $1/(4e)$.

The proof can be found in Appendix A. Theorem 1 implies the following expected-regret lower bound.

Corollary 2. If the conditions of Theorem 1 hold, then for any algorithm there exists at least one stochastic world $W_{\hat{i}}$ such that $\mathrm{Reg}(T) \geq \Omega(\sqrt{KT})$, where

$$\mathrm{Reg}(T) = E_{W_{\hat{i}}, A}\Big[ \sum_{t=1}^T \max_{y \in Y} P(x_t \mid y, s_o) - P(x_t \mid y_t, s_o) \Big].$$

The proof of this result can also be found in Appendix A. Theorem 1 shows that for tractable hindsight learning it is necessary to impose structural assumptions on the conditional probabilities $P(X \mid Y, S)$. We explore one such assumption in the next sections.
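The hard instances used in Theorem 1 are simple enough to write down directly. The sketch below builds the conditional table of a world $W_i$; it is only meant to make the construction concrete, and the particular values of $i$, $K$, and $T$ are illustrative.

```python
import numpy as np

def hard_world(i, K, T):
    """Conditional table P_{W_i}(X | y_j): rows are X in {A, B}, columns are y_1..y_K.
    Every response is uninformative except y_i, which shifts eps of mass toward A."""
    eps = np.sqrt(K / T)              # per-round gap of a suboptimal response
    P = np.full((2, K), 0.5)          # row 0: X = A, row 1: X = B
    P[0, i] += eps                    # P(A | y_i) = 1/2 + eps
    P[1, i] -= eps                    # P(B | y_i) = 1/2 - eps
    return P

P = hard_world(i=3, K=16, T=1024)
assert np.allclose(P.sum(axis=0), 1.0)
# An algorithm that fails to locate y_i among the K responses pays roughly
# eps = sqrt(K/T) per round, i.e. order sqrt(KT) regret over T rounds.
```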
4. Provable Learning in Low-Rank Setting

The analysis in Section 3 shows that the regret scales as $\Omega(\sqrt{|Y|})$, which makes this an intractable setting when $Y$ is extremely large. For settings with large input or output spaces, it is natural in practice to use function approximation. For example, a trajectory can be encoded by a neural network into a representation that contains the relevant information. In statistical learning theory, significant progress has been made in the study of learning with function approximation (Misra et al., 2020; Sekhari et al., 2021; Foster et al., 2021). In particular, problems with low-rank structure (Agarwal et al., 2020; Jin et al., 2020) have received significant attention due to their ability to model commonly occurring settings and the success of the corresponding algorithms in real-world problems even where the low-rank assumption is violated (Henaff et al., 2022). Motivated by this, we introduce and study a setting where the teacher model $P(X \mid Y, S)$ admits a low-rank decomposition.

Low-Rank Teacher Model. We consider a specialization of our general setup where the teacher model follows a low-rank decomposition. Formally, we assume that there exist $f^\star : X \to \mathbb{R}^d$ and $g^\star : Y \times S \to \mathbb{R}^d$ such that

$$\forall s \in S,\ x \in X,\ y \in Y, \qquad P(x \mid y, s) = f^\star(x)^\top g^\star(y, s),$$

where $d$ is the intrinsic dimension of the problem, which is much smaller than the sizes of $S$, $X$, and $Y$, all of which can be arbitrarily large. We assume that the agent has knowledge of $g^\star$ but does not know $f^\star$. We assume access to a model class $F$ to learn $f^\star$. Our goal is to get regret guarantees that do not scale with $|X|$, $|Y|$, $|S|$ and instead depend only on the intrinsic dimension $d$ of the problem and the statistical complexity of $F$.

LORIL Algorithm. We present the Learning in LOw-Rank models from Instruction Labels (LORIL) algorithm for low-rank teacher models in Algorithm 1. The algorithm assumes access to the embedding function $g^\star$ for encoding the agent's response. In practice, such a function can be obtained either from a pre-trained representation model or from a self-supervised learning objective such as autoencoding. We discuss some implementation choices later in the experiment section.

LORIL implements Protocol 1. In the $t$-th round, the algorithm first computes a maximum likelihood estimate $\hat{f}_t$ of $f^\star$ using the historical data (line 3). We use this to define a policy $\pi_t$ to generate a response $y_t$. LORIL is based on the principle of optimism under uncertainty. Following this principle, we first compute an uncertainty measure $b_t(y, s_t)$ for a response $y \in Y$ such that, with high probability, the true value of a response, i.e., $f^\star(x_t)^\top g^\star(y, s_t)$, lies in

$$\big[\hat{f}_t(x_t)^\top g^\star(y, s_t) - b_t(y, s_t),\ \hat{f}_t(x_t)^\top g^\star(y, s_t) + b_t(y, s_t)\big].$$

As $\hat{f}_t(x_t)^\top g^\star(y, s_t)$ is the current estimate of the value of a response $y$ in the $t$-th round, we can view $b_t(y, s_t)$ as defining a confidence interval for a given response $y$ and context $s_t$. Second, we take the action that has the maximum possible value within the confidence interval, namely:

$$y_t = \arg\max_{y \in Y}\ \underbrace{\hat{f}_t(x_t)^\top g^\star(y, s_t)}_{\text{estimated model value}} + \underbrace{b_t(y, s_t)}_{\text{bonus}}. \qquad (2)$$

For low-rank models, we will show that $b_t(y, s_t)$ can be expressed as $O(\|g^\star(y, s_t)\|_{\hat{\Sigma}_t^{-1}})$, where $\hat{\Sigma}_t = \lambda_t I + \sum_{l=1}^{t-1} g^\star(y_l, s_l) g^\star(y_l, s_l)^\top$ is a positive definite matrix capturing information about the historical data. This can be viewed as the sum of a positive definite matrix $\lambda_t I$ with $\lambda_t > 0$ and a sum of rank-one positive semi-definite matrices $g^\star(y_l, s_l) g^\star(y_l, s_l)^\top$, which together form the covariance matrix $\sum_{l=1}^{t-1} g^\star(y_l, s_l) g^\star(y_l, s_l)^\top$. The quantity $b_t(y, s_t)$, which acts as a bonus in Equation 2, is known as an elliptic bonus in the literature and appears frequently in the study of linear models (Abbasi-Yadkori et al., 2011) and low-rank models (Agarwal et al., 2020).² The agent computes the optimistic response $y_t = \pi_t(x_t, s_t)$ (line 6) and plays it. In response, the teacher provides a description $x'_t$ (line 7), which is added along with the agent's response to the historical data. Note that the agent never has direct access to $f^\star$ or the true model $P(X \mid Y, S); it only has access to the feedback generated by the teacher model and to its knowledge of $g^\star$ and $F$. Further, the agent is unaware of the true horizon length $T$, which is often unknown in practice.

²The word elliptic comes from the fact that for a positive definite matrix $\Sigma$, the set $\{y \mid y^\top \Sigma^{-1} y \leq 1\}$ is an ellipsoid centered at 0.

Computational Efficiency. The covariance matrix and the bonus $b_t$ for a given response can both be computed easily.
The inverse of the covariance matrix can be computed efficiently and in a numerically stable way using the Sherman–Morrison formula (Sherman, 1949). The two main computationally expensive steps in LORIL are the maximum-likelihood estimation and solving the optimization in lines 5–6. Maximum likelihood estimation is routinely performed for complex function classes such as deep neural networks in practice. However, in this case, the main challenge is in defining a function class $F$ such that $f(\cdot)^\top g^\star(y, s)$ is a well-defined distribution. This question has been addressed for low-rank models (Zhang et al., 2022), and we expect the same tools to also help here. The optimization in lines 5–6 can be trivially solved when $Y$ is small enough to be enumerated. When $Y$ is exponentially large, this step can be challenging. One strategy is to use a proposal distribution $q(y \mid x_t, s_t)$ to generate a set of $K$ responses, and then pick the response with the maximum objective value in Equation 4. The proposal distribution can be trained by performing MLE on the historical data and modeling $y$ autoregressively. However, we leave a computational study with exponentially large $Y$ for future work.

Algorithm 1  LORIL($g^\star$, $F$): Learning in LOw-Rank models from Instruction Labels
Require: Response embedding function $g^\star : Y \times S \to \mathbb{R}^d$
Require: Model class $F = \{f : X \to \mathbb{R}^d\}$
1: Define $\lambda_t = 1/t$.
2: for t = 1, 2, ..., T do
3:   Compute the MLE estimator $\hat{f}_t$ using $\{x'_\ell, y_\ell, s_\ell\}_{\ell=1}^{t-1}$:
       $\hat{f}_t = \arg\max_{f \in F} \sum_{\ell=1}^{t-1} \ln\big(f(x'_\ell)^\top g^\star(y_\ell, s_\ell)\big)$
4:   Define the empirical covariance matrix
       $\hat{\Sigma}_t = \lambda_t I + \sum_{\ell=1}^{t-1} g^\star(y_\ell, s_\ell) g^\star(y_\ell, s_\ell)^\top$   (3)
5:   Define the policy for this round
       $\pi_t : (x, s) \mapsto \arg\max_{y \in Y}\ \hat{f}_t(x)^\top g^\star(y, s) + b_t(y, s)$,   (4)
     where $b_t(y, s) = \big(C\sqrt{\log(t|F|/\delta)} + \sqrt{\lambda_t}\, B\big)\, \|g^\star(y, s)\|_{\hat{\Sigma}_t^{-1}}$ and $C$ is determined by Equation 7 in Lemma 1.
6:   Agent generates the response $y_t = \pi_t(x_t, s_t)$
7:   Teacher describes the response $x'_t \sim P(X \mid y_t, s_t)$
8:   Evaluate using a hidden reward $r_t = P(x_t \mid y_t, s_t)$
Return $\sum_{t=1}^T r_t$
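When $Y$ is small enough to enumerate, the per-round computations of Algorithm 1 reduce to a handful of linear-algebra operations, as in the following sketch. The MLE step is left as an external placeholder, and for simplicity the regularizer $\lambda$ is kept fixed so that the inverse covariance can be maintained incrementally with the Sherman–Morrison formula; with the paper's schedule $\lambda_t = 1/t$ one can instead recompute the inverse each round. The class name and the scalar `beta` (standing in for $C\sqrt{\log(t|F|/\delta)} + \sqrt{\lambda_t}B$) are illustrative assumptions.

```python
import numpy as np

class LorilRound:
    """Optimistic response selection for Algorithm 1 with an enumerable Y.
    `g` maps (y, s) to the known d-dimensional embedding g*(y, s)."""
    def __init__(self, d, lam=1.0, beta=1.0):
        self.beta = beta                       # plays the role of C*sqrt(log(t|F|/delta)) + sqrt(lam)*B
        self.Sigma_inv = np.eye(d) / lam       # (lambda I)^{-1}

    def bonus(self, phi):
        # elliptic bonus beta * ||g*(y, s)||_{Sigma^{-1}}
        return self.beta * np.sqrt(phi @ self.Sigma_inv @ phi)

    def select(self, f_hat_x, g, s, responses):
        # lines 5-6: argmax over y of estimated value + bonus
        scores = [f_hat_x @ g(y, s) + self.bonus(g(y, s)) for y in responses]
        return responses[int(np.argmax(scores))]

    def update(self, phi):
        # rank-one Sherman-Morrison update of Sigma^{-1} after observing g*(y_t, s_t)
        Sv = self.Sigma_inv @ phi
        self.Sigma_inv -= np.outer(Sv, Sv) / (1.0 + phi @ Sv)
```

In a full implementation, `f_hat_x` is the output $\hat{f}_t(x_t)$ of the MLE step (line 3) on the historical hindsight labels, and `beta` is set according to the bonus definition in line 5.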
5. Theoretical Analysis

Our main result shows that Algorithm 1 satisfies a sublinear regret bound in the realizable setting.

Assumption 1 (Realizability). The teacher model $P(x \mid y, s) = f^\star(x)^\top g^\star(y, s)$ satisfies $f^\star \in F$ for a known model class $F$. Moreover, all teacher models parametrized by feature maps $f \in F$ are valid distributions, i.e., $f(X)^\top g^\star(y, s) \in \Delta(X)$ for any $y \in Y$ and $s \in S$.

We also assume the feature maps in $F$ are bounded.

Assumption 2. There exists a constant $B > 0$ such that for any $f \in F$ we have $\sup_{x \in X} \|f(x)\| \leq B$ and $\sup_{y \in Y, s \in S} \|g^\star(y, s)\| \leq B$.

Our main theoretical result is a high-probability upper bound on the regret of Algorithm 1.

Theorem 3 (Regret bound of LORIL). When Assumption 1 and Assumption 2 hold and $\lambda_t = 1/t$, the regret of Algorithm 1 satisfies

$$\mathrm{Reg}(T) = O\Big( B\sqrt{Td \log(1 + TB)} + \sqrt{Td \log(T|F|/\delta) \log(1 + TB)} \Big)$$

with probability at least $1 - 3\delta$, for all $T \in \mathbb{N}$.

The analysis of Algorithm 1 in Theorem 3 is based on the principle of optimism. For any $(s, x, y) \in S \times X \times Y$, the quantity $\hat{f}_t(x)^\top g^\star(y, s) + b_t(y, s)$ can be understood as an estimator of the value of $P(x \mid y, s)$, where the corrective bonus $b_t(y, s)$ takes into account the accuracy of the empirical estimator $\hat{P}_t(x \mid y, s) = \hat{f}_t(x)^\top g^\star(y, s)$. By defining the bonus function $b_t : Y \times S \to \mathbb{R}$ as an appropriately scaled multiple of $\|g^\star(y, s)\|_{\hat{\Sigma}_t^{-1}}$, it overestimates $P(x \mid y, s)$. That is, for all $s \in S$, $x \in X$, $y \in Y$ and $t \in \mathbb{N}$,

$$P(x \mid y, s) \leq \hat{P}_t(x \mid y, s) + b_t(y, s) \qquad (5)$$

with probability at least $1 - 2\delta$. When Equation 5 is satisfied, the policy definition in line 5 immediately implies that

$$\max_{y \in Y} P(x_t \mid y, s_t) \leq \max_{y \in Y} \hat{P}_t(x_t \mid y, s_t) + b_t(y, s_t) = \hat{P}_t(x_t \mid y_t, s_t) + b_t(y_t, s_t). \qquad (6)$$

To prove Equation 5 we develop the following supporting result.

Lemma 1. When Assumption 1 and Assumption 2 hold, then with probability at least $1 - 2\delta$ we have

$$\sup_{x \in X} \|f^\star(x) - \hat{f}_t(x)\|_{\hat{\Sigma}_t} \leq C\sqrt{\log(t|F|/\delta)} + \sqrt{\lambda_t}\, B \qquad (7)$$

for all $t \in \mathbb{N}$ simultaneously.

This result provides a bound on the maximum error in the estimation of $f^\star(x)$ as measured by the data norm $\hat{\Sigma}_t$. It is crucial for bounding the error of the empirical models $\hat{P}_t(x \mid y, s)$. The detailed version of this result and its proof can be found in Lemma 2 in Appendix B. We denote by $E$ the event that Equation 7 holds. Under this event, we can upper bound the prediction error of the empirical model $\hat{P}_t(x \mid y, s) = \hat{f}_t(x)^\top g^\star(y, s)$ for all $(s, x, y) \in S \times X \times Y$:

$$\big|\hat{P}_t(x \mid y, s) - P(x \mid y, s)\big| = \big|(f^\star(x) - \hat{f}_t(x))^\top g^\star(y, s)\big| \overset{(i)}{\leq} \|f^\star(x) - \hat{f}_t(x)\|_{\hat{\Sigma}_t}\, \|g^\star(y, s)\|_{\hat{\Sigma}_t^{-1}} \overset{(ii)}{\leq} b_t(y, s), \qquad (8)$$

where (i) holds because for all $v, w \in \mathbb{R}^d$ and positive definite $\Sigma \in \mathbb{R}^{d \times d}$, the Cauchy–Schwarz inequality implies $\langle v, w \rangle = \langle \Sigma^{1/2} v, \Sigma^{-1/2} w \rangle \leq \|v\|_\Sigma \|w\|_{\Sigma^{-1}}$, and (ii) follows by upper bounding $\|f^\star(x) - \hat{f}_t(x)\|_{\hat{\Sigma}_t}$ using the right-hand side of Equation 7.

These are the necessary ingredients to finalize our sketch of Theorem 3. When $E$ holds, the following inequalities are satisfied:

$$\mathrm{Reg}(T) = \sum_{t=1}^T \big( P(x_t \mid \pi^\star(x_t), s_t) - P(x_t \mid \pi_t(x_t), s_t) \big) \overset{(a)}{\leq} \sum_{t=1}^T \big( \hat{f}_t(x_t)^\top g^\star(y_t, s_t) - f^\star(x_t)^\top g^\star(y_t, s_t) + b_t(y_t, s_t) \big) \overset{(b)}{\leq} \sum_{t=1}^T 2 b_t(y_t, s_t),$$

where (a) is a consequence of optimism (Equation 6) and (b) of the prediction-error bound in Equation 8. At this point we have upper bounded the regret by a sum of estimation errors along the features of the responses played by the algorithm at each time step. Finally, substituting the definition of the bonus terms and invoking a standard bound on the sum of inverse norms from the linear bandits literature (see, for example, Proposition 3 in Pacchiano et al. (2021) and Proposition 7 in Appendix C),

$$\sum_{t=1}^T b_t(y_t, s_t) \leq O\big(\sqrt{\log(T|F|/\delta)}\big) \sum_{t=1}^T \|g^\star(y_t, s_t)\|_{\hat{\Sigma}_t^{-1}} \leq O\Big(\sqrt{Td \log(T|F|/\delta) \log(1 + TB)}\Big).$$

This finalizes the proof sketch of Theorem 3. These results rely on the assumption that $g^\star$ is known. Removing that assumption yields a substantially harder problem, as it makes it more difficult to leverage the linear structure. Although this scenario can be dealt with by deriving algorithms and bounds depending on statistical capacity measures such as the eluder dimension (Russo & Van Roy, 2013) for the combined $f(x)^\top g(y, s)$ model class, we leave the derivation of a sharper analysis of this setting for future work.
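The final step above controls the sum of elliptic bonuses via the standard elliptical-potential argument (Proposition 7 in Appendix C). The short simulation below, with arbitrary unit-norm features and a fixed regularizer $\lambda = 1$, illustrates that the sum of inverse-norm terms grows like $\sqrt{Td \log T}$ rather than linearly in $T$; the printed reference envelope follows the form of the appendix bound up to constants and is an illustrative assumption, not a restatement of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, lam = 5000, 10, 1.0

Sigma = lam * np.eye(d)
total = 0.0
for _ in range(T):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                             # ||v|| <= 1, matching Assumption 2 up to scaling
    total += np.sqrt(v @ np.linalg.solve(Sigma, v))    # ||v||_{Sigma^{-1}} before the update
    Sigma += np.outer(v, v)                            # rank-one covariance update

envelope = np.sqrt(2 * T * d * np.log(1 + T / (lam * d)))   # ~ sqrt(Td log T) growth
print(f"sum of inverse norms: {total:.1f}  envelope: {envelope:.1f}  horizon T: {T}")
```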
6. Empirical Study

We evaluate LORIL in two settings. The first is a synthetic task that satisfies the low-rank teacher setting and all our assumptions, and is designed to provide a proof of concept of LORIL. The second is a grounded setting with real images and natural language instructions, where the teacher model is not low-rank. Our goal with these experiments is not to present challenging settings for exploration, but to show how the various components of LORIL can be implemented empirically. Our second experiment also evaluates whether insights from LORIL carry over to more realistic settings where our assumptions are violated.

6.1. Evaluating on a Synthetic Task

Environment. For a given intrinsic dimension $d$, instruction-space size $|X|$, and response-space size $|Y|$, we randomly initialize two matrices $F \in \mathbb{R}^{|X| \times d}$ and $G \in \mathbb{R}^{d \times |Y|}$. We ignore the context in this setting by defining $S$ as a singleton $\{s_0\}$. We define $X = [|X|]$ and $Y = [|Y|]$, so an instruction $x$ and a response $y$ are positive integers. We define these matrices by first initializing them with values sampled i.i.d. from a standard Gaussian distribution. We then take their exponent and divide by a temperature coefficient $\tau$. We then normalize $F$ and $G$ row-wise such that $FG \in \mathbb{R}^{|X| \times |Y|}$ is a stochastic matrix whose columns sum to 1. For a given $(x, y) \in X \times Y$, we view the matrix entry $(FG)_{xy}$ as the value of the teacher distribution $P(X = x \mid Y = y, S = s_0)$. We can view the $x$-th row of $F$ and the $y$-th column of $G$ as $f^\star(x)$ and $g^\star(y)$, respectively.

Baselines. We evaluate the following baselines. Random: the agent takes uniformly random actions. $\epsilon$-Greedy: the agent performs maximum-likelihood estimation on the historical data to learn an estimate $\hat{f}_t$, as in LORIL; however, unlike LORIL, exploration is not performed using the elliptic bonus but using $\epsilon$-greedy, where with probability $\epsilon$ a random action is taken and with the remaining probability we take the greedy action $\arg\max_{y \in Y} \hat{f}_t(x_t)^\top g^\star(y)$. Greedy: the same as $\epsilon$-Greedy with $\epsilon = 0$; it only exploits based on the historical data. We tune the hyperparameters $\lambda$ and $C$ for LORIL and $\epsilon$ for $\epsilon$-greedy using grid search.

[Figure 2. Comparison of LORIL against baselines on the controlled task (cumulative regret vs. time step for LORIL, $\epsilon$-Greedy, Greedy, and Random). We run each baseline 3 times and report the average. The shaded areas show the standard deviation.]

Model and Optimization. We model $f \in F$ as $f(x) = \big(\frac{\exp(\theta_{xi})}{\sum_{x' \in X} \exp(\theta_{x'i})}\big)_{i=1}^d$, where $(\theta_{xi})_{x \in X, i \in [d]}$ are the parameters that we train. For any $f \in F$, we can verify that $f(x)^\top g^\star(y)$ is a valid conditional distribution over $x$ given $y$. We perform maximum likelihood estimation using the Adam optimizer.

Results. Figure 2 shows cumulative regret over time steps for LORIL and the baselines. We ran each experiment 3 times with different seeds. We select hyperparameters for each algorithm based on the mean final regret. LORIL performs better than all baselines, achieving the best regret, which is 12.3% smaller than that of the next best baseline. The improvement over the greedy baseline shows that exploration helps, whereas the improvement over $\epsilon$-greedy shows that using the elliptic bonus for exploration yields lower regret.
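The synthetic low-rank teacher described above can be built in a few lines. The snippet below follows the Environment paragraph, with sizes and temperature taken from Table 1 in Appendix D; the exact placement of the temperature and the normalization in the released code may differ, so treat this as an illustrative construction rather than the exact one.

```python
import numpy as np

rng = np.random.default_rng(0)
num_x, num_y, d, tau = 2000, 10, 10, 0.75        # |X|, |Y|, d, temperature (Table 1, Appendix D)

# Low-rank teacher: Gaussian factors, exponentiated with a temperature, then
# rescaled so that each column of P = F G is a distribution over instructions.
F = np.exp(rng.standard_normal((num_x, d)) / tau)
G = np.exp(rng.standard_normal((d, num_y)) / tau)
P = F @ G
P /= P.sum(axis=0, keepdims=True)                # column y is P(. | Y = y, S = s0)
assert np.allclose(P.sum(axis=0), 1.0)
# The column rescaling can be absorbed into G, so rank(P) <= d still holds.

def hindsight_label(y):
    """Sample the teacher's hindsight instruction x' ~ P(. | Y = y, S = s0)."""
    return rng.choice(num_x, p=P[:, y])
```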
6.2. Evaluation on an Image Selection Task

We evaluate LORIL on an image selection task where the true model does not admit a low-rank decomposition. In reinforcement learning, the elliptic bonus has been found to be a helpful exploration device in real-world settings where the low-rank assumption does not hold (Henaff et al., 2022). Our goal in this subsection is to test whether a similar result holds in our setting.

[Figure 3. Results on the image selection task (cumulative regret vs. time step for LORIL, $\epsilon$-Greedy, Greedy, and Random). We run each baseline 5 times and report the average performance. The shaded areas show the standard deviation.]

Environment. The instruction space $X$ is natural language; for a given $x \in X$, we denote the $i$-th token by $x_i$. The agent has an action space with $|Y| = K$ actions. In each round, the world assigns an image of an object to each action, and the agent is given a natural language instruction describing the image that the agent should select. The agent has a non-trivial context $s \in S$ that contains the identity of the object assigned to each action in the given round. No two actions have the same image, but the image associated with each action can change across rounds. We sample an object by picking an object of a given type and a given color from a set of types and colors. The instruction space $X$ is in natural language, as in our motivating example in Figure 1. We use a set of templates to generate instructions describing a given image using the object type and color.

Model. We model a function class $H = \{h : Y \times S \to \Delta(X)\}$ using a deep neural network. Given an action $y$ and context $s$, we encode the image associated with the action with an encoding $g(y, s) \in \mathbb{R}^d$. We model $g$ as a 3-layer convolutional neural network with LeakyReLU activations. In this setting, we do not assume that the environment provides $g^\star$. Instead, we train $g$ using an autoencoder objective on a set of offline images sampled from the environment. Alternatively, we could have used a pre-trained image representation model such as ResNet (He et al., 2016) or CLIP (Radford et al., 2021). We use the encoding $g(y, s)$ to generate a distribution over texts $x = (x_1, \ldots, x_n)$ using a two-layer Gated Recurrent Unit (GRU). Specifically, we apply a fully connected layer to $g(y, s)$ to reshape it to an appropriate size and use it to initialize the hidden state of the GRU for all layers.
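A compact PyTorch sketch of the model just described is shown below. The layer widths, pooling, and vocabulary handling are illustrative assumptions rather than the paper's exact architecture; the essential structure is a small CNN encoder producing $g(y, s)$ and a two-layer GRU decoder whose initial hidden state is derived from that embedding.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """3-layer CNN producing the response embedding g(y, s) of dimension d."""
    def __init__(self, d=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, d)

    def forward(self, img):                       # img: (B, 3, H, W)
        return self.proj(self.conv(img))          # (B, d)

class InstructionModel(nn.Module):
    """GRU decoder for P(x | y, s), with g(y, s) mapped to the initial hidden state."""
    def __init__(self, vocab_size, d=64, hidden=128, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(d, layers * hidden)      # reshape g(y, s) into h_0 for all layers
        self.gru = nn.GRU(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)
        self.layers, self.hidden = layers, hidden

    def forward(self, g_ys, tokens):              # g_ys: (B, d); tokens: (B, n) token ids
        B = g_ys.shape[0]
        h0 = self.init_h(g_ys).view(B, self.layers, self.hidden).transpose(0, 1).contiguous()
        out, _ = self.gru(self.embed(tokens), h0)
        return self.out(out)                      # per-step logits over the vocabulary
```

Training then maximizes the log-likelihood of the teacher's hindsight instruction $x'_t$ given the embedding of the image chosen in round $t$, mirroring the MLE step (line 3) of Algorithm 1.

Results. Figure 3 shows the results. As in our previous experiment, LORIL performs better than the baselines, achieving 4.8% less regret than the next best baseline, even though the setting does not admit a low-rank structure. We also note that these tasks were not designed to present a challenging scenario for exploration, and consequently the gains relative to baselines are smaller.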
7. Related Work

Provably-efficient Interactive Learning. The ubiquitous nature of interactive learning has resulted in significant attention devoted to its theoretical understanding. Protocol 1 superficially resembles a contextual bandit problem, but stands in contrast with the scenario where the learner receives a (possibly noisy) reward signal after taking an action in a given context. The key difference is that while in the contextual bandit setting the feedback is an unbiased sample of the reward corresponding to the arm and context, in our setting the feedback is produced from a conditional distribution over instructions that does not immediately relate to the reward. Thus, it is not possible to immediately adapt a contextual bandit algorithm to obtain regret bounds for Protocol 1. There is a vast literature dedicated to developing sublinear-regret algorithms for contextual bandit problems. Early efforts to incorporate contextual information into bandit problems led to the development of algorithms such as LinUCB (Chu et al., 2011), OFUL (Abbasi-Yadkori et al., 2011), and Linear Thompson Sampling (Agrawal & Goyal, 2013) for the setting where there is a linear relationship between the context and the reward. A long line of work has also focused on guarantees for imitation learning (Ross et al., 2011; Rashidinejad et al., 2021) and policy optimization (Kearns & Singh, 2002; Auer & Ortner, 2006; Azar et al., 2017; Foster et al., 2021). More recently, there has been a focus on developing statistically efficient RL algorithms with function approximation (Misra et al., 2020; Jin et al., 2021; Foster et al., 2021). Our work focuses on provable learning similar to these methods and uses similar tools for analysis, but targets a novel learning setting with a different type of feedback than IL and RL.

Low-rank Interactive Learning. Low-rank models have been studied in bandit settings (Abbasi-Yadkori et al., 2011), contextual bandit settings (Chu et al., 2011), and in more general multi-step reinforcement learning (Jin et al., 2020; Agarwal et al., 2020). One of the appeals of low-rank models is that they generalize tabular MDPs and provide a way to study function approximation, which is standard in empirical studies. This is also our motivation for studying low-rank models. Further, low-rank models are among the most expressive settings for which both statistically and computationally efficient algorithms exist.

Learning using Hindsight Feedback. Several works have found it advantageous to use hindsight feedback to convert a failed example into a positive example by relabeling it with a different goal (or, in our case, instruction) (Andrychowicz et al., 2017; Li et al., 2020; Nair et al., 2018). These approaches typically solve goal-conditioned RL, where a failed trajectory is labeled with its final state as the goal. However, these approaches focus on empirical performance and do not provide regret bounds.

Instruction Following. The task of developing agents that can follow natural language instructions has received significant attention since the early days of AI (Winograd, 1972). Several approaches train these systems using imitation learning (Mei et al., 2016) and reinforcement learning (Misra et al., 2018; Hill et al., 2021). Training agents with hindsight instruction labeling has previously been explored for instruction following (Nguyen et al., 2021; Fried et al., 2018). The main focus of these works is on empirical performance, and they either provide no theoretical analysis or, in the case of Nguyen et al. (2021), provide only an asymptotic analysis. In contrast, we provide the first finite-sample regret bounds for learning from instruction labeling.

8. Conclusion

In this work we define a formal interactive learning setup for hindsight instruction and initiate its theoretical understanding. Among other things, we present a lower bound indicating that hindsight instruction learning in the general case can be statistically intractable, implying the necessity of imposing structural conditions for statistically efficient learning with hindsight feedback. We present an algorithm, LORIL, that has no regret when the underlying teacher distribution has low rank. The regret of LORIL scales as $\tilde{O}(\sqrt{T})$ with the horizon and depends only on the rank of the distribution, not on the size of the agent's response space or instruction space. We conclude with an experimental demonstration of LORIL in synthetic and grounded scenarios. This work represents a first exploration of the hindsight instruction setup, and therefore many exciting research directions remain open. Chief among them is to design provably efficient algorithms for hindsight instruction under less restrictive function approximation assumptions, and that are also computationally efficient.
We foresee that the algorithmic framework introduced by the Decision Estimation Coefficient literature (Foster et al., 2023; 2021) can serve as the basis of the development of algorithms for hindsight instruction that are both computationally tractable and statistically efficient and that can lead to practical impact in scenarios such as training language models and robotics. Impact Statement This paper presents an interactive learning algorithm that learns from hindsight instructions. The main contributions of this paper are theoretical, however, algorithmic principles from our work can be useful in various empirical studies. A key application can be in training embodied systems using feedback provided by a human or a language model. An important thing to keep in mind is ensuring safety and privacy of any human in the loop during the training process. Further, if a language model is used to generate hindsight instructions, then care must be taken to ensure that hallucinations and implicit bias in the model does not lead to undesired behavior in the embodied agent or robot. While these questions are important, they are orthogonal to the focus of our study. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. ar Xiv preprint ar Xiv:2303.08774, 2023. Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638 1646. PMLR, 2014. Agarwal, A., Kakade, S., Krishnamurthy, A., and Sun, W. Flambe: Structural complexity and representation learning of low rank mdps. Advances in neural information processing systems, 33:20095 20107, 2020. Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning, pp. 127 135. PMLR, 2013. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., Mc Grew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. Advances in neural information processing systems, 30, 2017. Auer, P. and Ortner, R. Logarithmic online regret bounds for undiscounted reinforcement learning. Advances in neural information processing systems, 19, 2006. Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263 272. PMLR, 2017. Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19 26. JMLR Workshop and Conference Proceedings, 2011. Blukis, V., Terme, Y., Niklasson, E., Knepper, R. A., and Artzi, Y. Learning to map natural language instructions to physical quadcopter control using simulated flight. In 3rd Annual Conference on Robot Learning, Co RL 2019, Osaka, Japan, October 30 - November 1, 2019, Proceedings, 2019. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877 1901, 2020. 
Provable Interactive Learning with Hindsight Instruction Feedback Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214. JMLR Workshop and Conference Proceedings, 2011. Foster, D. J., Kakade, S. M., Qian, J., and Rakhlin, A. The statistical complexity of interactive decision making. ar Xiv preprint ar Xiv:2112.13487, 2021. Foster, D. J., Golowich, N., and Han, Y. Tight guarantees for interactive decision making with the decision-estimation coefficient. ar Xiv preprint ar Xiv:2301.08215, 2023. Freedman, D. A. On tail probabilities for martingales. the Annals of Probability, pp. 100 118, 1975. Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., and Darrell, T. Speaker-follower models for visionand-language navigation. Advances in Neural Information Processing Systems, 31, 2018. Geer, S. A. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Henaff, M., Raileanu, R., Jiang, M., and Rockt aschel, T. Exploration via elliptical episodic bonuses. Advances in Neural Information Processing Systems, 35:37631 37646, 2022. Hill, F., Tieleman, O., von Glehn, T., Wong, N., Merzic, H., and Clark, S. Grounded language learning fast and slow. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=wp SWuz_hyq A. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137 2143. PMLR, 2020. Jin, C., Liu, Q., and Miryoosefi, S. Bellman eluder dimension: New rich classes of rl problems, and sampleefficient algorithms. Advances in neural information processing systems, 34:13406 13418, 2021. Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine learning, 49:209 232, 2002. Li, A., Pinto, L., and Abbeel, P. Generalized hindsight for reinforcement learning. Advances in neural information processing systems, 33:7754 7767, 2020. Mei, H., Bansal, M., and Walter, M. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016. Misra, D., Bennett, A., Blukis, V., Niklasson, E., Shatkhin, M., and Artzi, Y. Mapping instructions to actions in 3D environments with visual goal prediction. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018. Misra, D., Henaff, M., Krishnamurthy, A., and Langford, J. Kinematic state abstraction and provably efficient richobservation reinforcement learning. In International conference on machine learning, pp. 6961 6971. PMLR, 2020. Misra, D. K., Sung, J., Lee, K., and Saxena, A. Tell me dave: Context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research, 35(1-3):281 300, 2016. Myers, V., He, A. W., Fang, K., Walke, H. R., Hansen Estruch, P., Cheng, C.-A., Jalobeanu, M., Kolobov, A., Dragan, A., and Levine, S. 
Goal representations for instruction following: A semi-supervised language interface to control. In Conference on Robot Learning, pp. 3894 3908. PMLR, 2023. Nair, A., Mc Grew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 6292 6299. IEEE, 2018. Nguyen, K. X., Misra, D., Schapire, R., Dud ık, M., and Shafto, P. Interactive learning from activity description. In International Conference on Machine Learning, pp. 8096 8108. PMLR, 2021. Pacchiano, A., Ghavamzadeh, M., Bartlett, P., and Jiang, H. Stochastic bandits with linear constraints. In Banerjee, A. and Fukumizu, K. (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 2827 2835. PMLR, 13 15 Apr 2021. URL https://proceedings.mlr.press/ v130/pacchiano21a.html. Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. Neural computation, 3(1):88 97, 1991. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Provable Interactive Learning with Hindsight Instruction Feedback et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748 8763. PMLR, 2021. Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702 11716, 2021. Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627 635. JMLR Workshop and Conference Proceedings, 2011. Russo, D. and Van Roy, B. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013. Sekhari, A., Dann, C., Mohri, M., Mansour, Y., and Sridharan, K. Agnostic reinforcement learning with low-rank mdps and rich observations. Advances in Neural Information Processing Systems, 34:19033 19045, 2021. Sherman, J. Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix. Annals of mathematical statistics, 20(4):621, 1949. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge university press, 2019. Winograd, T. Understanding natural language. Cognitive psychology, 3(1):1 191, 1972. Zhang, T., Ren, T., Yang, M., Gonzalez, J., Schuurmans, D., and Dai, B. Making linear mdps practical via contrastive representation learning. In International Conference on Machine Learning, pp. 26447 26466. PMLR, 2022. Provable Interactive Learning with Hindsight Instruction Feedback A. Lower Bound Proofs Theorem 1. Let T 256 log(2e) and K 8e. For any algorithm, there is at least one stochastic world Wˆi such that Reg(T) 8 such that with probability at least 1/4e. Proof. Throughout the proof we ll use the notation ϵ = p K/T so that problem Wi has distribution P(X, S) = Uniform((A, so), (B, so)) and PWi(X|yj) = 1/2 + ϵ 1(j = i) (1 2 1(X = B)). 
Let s start by defining the empty problem W as P(X, S) = Uniform((A, so), (B, so)) P(A|yi) = P(B|yi) = 1/2, i Let s consider an arbitrary algorithm for A for learning with hindsight labeling and consider its interaction with problem W. We ll use the notation PW,A and EW,A to denote the measure and expectations induced by problem W and algorithm A. First, we consider algorithm A s interaction with problem W . Let s consider an arbitrary algorithm for A for learning with hindsight labeling and its interactions with W . Let ni(T) = PT t=1 1(yt = i) denote the (random) number of times the learner selected yt = i from time 1 to T. We ll use the notation ℓ1, , ℓK to denote the ordering of indices in [K] such that, EPW ,A [nℓ1(T)] EPW ,A [nℓK(T)] . Notice that for all j K/2 , EPW ,A [nℓ1(T)] 2T K , ℓ K/2 (9) Let s start by noting that for any Wi, the KL distance between PW ,A and PWi,A for i {ℓj}j [ K/2 ] satisfies the bound KL PW ,A PWi,A = EW ,A t=1 KL(PW (X |Yt) PWi(X |Yt)) t=1 KL(PW (X |Yt = i) PWi(X |Yt = i))1(Yt = i) = EW ,A [ni(T)] KL (Ber(1/2) Ber(1/2 ϵ)) Where (i) is implied by inequality 9 for all ℓj with j [ K/2 ]. Define Ei = n PT t=1 1(Yt = i, Xt = A) + 1(Yt = i, Xt = B) 7T 8 o . When interacting with world Wi the event Ei corresponds to the event where A makes the right decisions in at least 3T/4 time-steps. The complement event Ec i = {PT t=1 1(Yt = i, Xt = A) + 1(Yt = i, Xt = B) < 7T 8 } corresponds to the event where A makes the correct decisions in at most 7T/8 time-steps. By the Huber-Bretagnolle inequality, all i {ℓj}j [ K/2 ] satisfy PW ,A(Ei) + PWi,A(Ec i ) exp KL PW ,A PWi,A Provable Interactive Learning with Hindsight Instruction Feedback since ϵ = p K/T, we have PW ,A(Ei) + PWi,A(Ec i ) 1/e for all i {ℓj}j [ K/2 ]. Let n X(T) = PT t=1 1(Xt = X) denote the (random) number of times the learner selected Xt = X for X {A, B} from time 1 to T. We now define Egood = {n B(T) 5T 8 }. If T 256 log(2e) Proposition 4 applied with α = 1/8 and δ = 1/2e implies that, PW (Egood) 1 1/2e Define e Ei = Ei Egood. For all i {ℓj}j [ K/2 ], 1/e PW ,A(Ei) + PWi,A(Ec i ) = PW ,A(Ei Ec good) + PW ,A(e Ei) + PWi,A(Ec i ) PW ,A(Ec good) + PW ,A(e Ei) + PWi,A(Ec i ) 1/2e + PW ,A(e Ei) + PWi,A(Ec i ) And therefore PW ,A(e Ei) + PWi,A(Ec i ) 1/2e. Thus, X i {ℓj}j [ K/2 ] PW ,A(e Ei) + PWi,A(Ec i ) K/2 /2e (10) Notice that when e Ei it follows that PT t=1 1(Yt = i, Xt = A)+ 5T 8 PT t=1 1(Yt = i, Xt = A)+1(Yt = i, Xt = B) 7T 8 and therefore PT t=1 1(Yt = i, Xt = A) T 4 . This in turn implies that when e Ei holds then PT t=1 1(Yt = i, Xt = A) + 1(Xt = B) 3T Notice that e Ei e Ej = for all i = j. This is because for all j = i, when e Ei holds, t=1 1(Yt = j, Xt = A) + 1(Yt = j, Xt = B) t=1 1(Yt = i, Xt = A) + 1(Xt = B) 3T Since e Ei e Ej = the sum P i {ℓj}j [ K/2 ] PW ,A(e Ei) 1. Thus Equation 10 implies, i {ℓj}j [ K/2 ] PWi,A(Ec i ) K/2 /2e 1 And therefore there is an index ˆi {ℓj}j [ K/2 ] such that, PWˆi,A(Ec ˆi ) 1 where inequality (i) holds because K 8e. when Ec ˆi = {PT t=1 1(Yt = i, Xt = A) + 1(Yt = i, Xt = B) < 7T t=1 1(Yt = i, Xt = A) + 1(Yt = i, Xt = B) T also holds. When Ec ˆi is satisfied and therefore Equation 11 is satisfied, the regret can be lower bounded by ϵT/8 = KT) since ϵ = p K/T. We conclude that, with probability PWˆi,A(Ec ˆi ) 1 4e. Provable Interactive Learning with Hindsight Instruction Feedback Corollary 2. If the conditions of Lemma 1 hold then for any algorithm there exists at least one stochastic world Wˆi such that Reg(T) Ω( KT). 
Where, Reg(T) = EWˆi,A t=1 max y Y P(xt|y, so) P(xt|yt, so) Proof. Lemma 1 implies there exists at least one problem Wˆi such that Reg(T) 8 with probability at least 1/4e. Let s call this event E. Therefore since PT t=1 maxy Y P(Xt|y, so) P(Xt|Yt, so) 0 with probability one, Reg(T) = EWˆi,A t=1 max y Y P(Xt|y, so) P(Xt|Yt, so) Proposition 4. Let δ (0, 1), α (0, 1/2) and {Xi}T i=1 be T i.i.d. random variables sampled from Ber(1/2). It T 4 α2 log(1/δ) then PT i=1 Xi ( 1 2 + α)T with probability at least 1 δ. Similarly if T 4 α2 log(1/δ) then PT i=1 Xi ( 1 2 α)T with probability at least 1 δ. Proof. Let b S = PT i=1 Xi be the sum of the outcomes {Xi}i [T ]. Hoeffding inequality implies b S T/2 2 p with probability at least 1 δ. Thus, as long as 2 p T log(1/δ) αT, (i.e. 4 α2 log(1/δ) T) we have b S ( 1 The reverse inequality can be derived using the same argument applied to the inequality T/2 b S 2 p T log(1/δ) with probability at least 1 δ. B. Regret Bounds for LORIL in low-rank distribution setting In this section, we provide a regret bound for LORIL. We first enumerate the assumptions. Assumption 1. [Realizability] The teacher model P(x|y, s) = f (x) g (y, s) satisfies f F for a known model class F. Moreover, all teacher models parametrized by feature maps f F are valid distributions, i.e., f (X) g (y, s) (X), for any y Y and s S. Assumption 2. There exists a constant B > 0 such that for any f F we have supx X f(x) B and supy Y,s S g (y, s) B. Let us consider the tth round. The empirical covariance matrix is given by bΣt where l=1 g (yl, sl)g (yl, sl) + λt I. for a regularizer value λt > 0. It is easy to verify that bΣt is a positive definite matrix since λt I is a positive definite matrix and Ct = Pt 1 l=1 g (yl, sl)g (yl, sl) is a positive semidefinite matrix as it is a symmetric matrix and for any v Rd we have v Ctv = Pt 1 l=1 v g (yl, sl)g (yl, sl) v = Pt 1 l=1 g (yl, sl) v 2 2 0. As bΣt is positive definite its inverse bΣ 1 1 and exists and is also a positive definite matrix.3 Finally, it can be shown that one can also define the square root of the matrix Σ1/2 t and that of its inverse Σ 1/2 t and these are symmetric and positive definite as well. 3This is trivial to show as v Σ 1v = (v Σ 1)Σ(Σ 1v) > 0 if Σ 1v = 0 as Σ is positive definite, further Σ 1v = 0 v = 0. Therefore, if v = 0, then v Σ 1v > 0 and vice versa. Provable Interactive Learning with Hindsight Instruction Feedback Let b Pt(x | y, s) = ˆft(x) g (y, s) be the model estimated in round t by maximum likelihood estimation. Given any s, x, y S X Y, the following important inequality holds, P(x|y, s) b Pt(x|y, s) = f (x) ˆft(x) g (y, s) , = Σ1/2 t f (x) ˆft(x) Σ 1/2 t g (y, s) , r f (x) ˆft(x) Σ1/2 t Σ1/2 t f (x) ˆft(x) q g (y, s) Σ 1/2 t Σ 1/2 t g (y, s), = f (x) ˆft(x) Σt g (y, s) Σ 1 t , (12) where the second last step uses Cauchy-Schwarz inequality. This inequality allows us to bound the error in the estimated model for a given y in terms of the error based on historical data given by f (x) ˆft(x) Σt and the novelty of the given input g (y, s) Σ 1 t . We want to bound the two terms in RHS of Equation 12. First we ll prove the following result, Lemma 2. When Assumption 1 and Assumption 2, then with probability at least 1 2δ we have: f (x) bft(x) bΣt (32C)1/4p log(t|F|/δ) + 2 p for all t N simultaneously. 
sup x X f (x) bft(x) 2 bΣt = sup x X ℓ=1 (f (x) ˆft(x)) g (yℓ, sℓ) g (yℓ, sℓ) (f (x) ˆft(x)) + λt f (x) ˆft(x) 2 f (x) g (yℓ, sℓ) ˆft(x) g (yℓ, sℓ) 2 + λt f (x) ˆft(x) 2 ℓ=1 sup x X f (x) g (yℓ, sℓ) ˆft(x) g (yℓ, sℓ) 2 | {z } := Ut first term +λt sup x X f (x) ˆft(x) 2 2 | {z } := Vt second term We first bound the second term Vt as: f (x) ˆft(x) 2 2 = sup x X f (x) 2 2 + ˆft(x) 2 2 2f (x) ˆft(x) , sup x X f (x) 2 2 + sup x X 2 + 2 sup x X f (x) ˆft(x) , sup x X f (x) 2 2 + sup x X 2 + 2 sup x X f (x) 2 ˆft(x) 2 , where we use the assumption that supx X f(x) 2 B for any f F and that f , ˆft F and the second last step uses Cauchy-Schwarz inequality. We now bound the first term Ut. For a given f F we define f(x, yℓ, sℓ) = f (x) g (yℓ, sℓ) f(x) g (yℓ, sℓ) 2 and Xf ℓ= supx X f(x, yℓ, , sℓ), which allows us to write Ut = Pt 1 ℓ=1 X ˆ ft ℓ. We fix f F. We have Xf ℓ 0 by definition and as f (x) g (yℓ, sℓ), f(x) g (yℓ, sℓ) [0, 1], we also have Xf ℓ= sup x X f(x, yℓ, sℓ) = sup x X f (x) g (yℓ, sℓ) f(x) g (yℓ, sℓ) 2 1, (14) Provable Interactive Learning with Hindsight Instruction Feedback Let Pℓ (Y) be the marginal distribution over yℓconditioned on {xℓ , sℓ , yℓ , x ℓ }ℓ 1 ℓ =1 {xℓ, sℓ} then Xf ℓ 2 = Ey Pℓ f (x) g (y, sℓ) f(x) g (y, sℓ) 4 16Ey Pℓ h f ( ) g (y, sℓ) f( ) g (y, sℓ) 4 16Ey Pℓ h f ( ) g (y, sℓ) f( ) g (y, sℓ) 2 where we use the fact that 1, that TV distance between two distributions is equal to half of 1-norm distance between them, and that TV 1. Thus, using the anytime Freedman s inequality (see Lemma 6) and union bound over all f F, we get: Xf ℓ 2 + C log(t|F|/δ) ℓ=1 Ey Pℓ h f ( ) g (y, sℓ) f( ) g (y, sℓ) 2 i + C log(t|F|/δ) simultaneously for all t N and all f F with probability at least 1 δ. In particular by setting f = bft this implies, ℓ=1 X b ft ℓ 16η f ( ) g (y, sℓ) bft( ) g (y, sℓ) 2 + C log(t|F|/δ) with probability at least 1 δ. Finally, Proposition 6 implies that with probability at least 1 δ, f ( ) g (y, sℓ) ˆft( ) g (y, sℓ) 2 2 log(t|F|/δ) (16) for all t N. Therefore a union bound implies that with probability at least 1 2δ, combining the bounds of Equation 15 and Equation 16, ℓ=1 X b ft ℓ 32η log(t|F|/δ) + C log(t|F|/δ) simultaneously for all t N. Optimizing for η gives us η = q C 32, which gives us Ut 32C log(t|F|/δ). Plugging together the upper bounds for Ut and Vt we get: f (x) ˆft(x) 2 bΣt Ut + λt Vt 32C log(t|F|/δ) + 4λt B2, with probability 1 2δ for all t N. This gives us the desired bound as supx X f (x) ˆft(x) bΣt q 32C log(t|F|/δ) + 4λt B2 (32C)1/4p log(t|F|/δ) + 2 λt B where we use b for any a, b 0. From now on we ll define as E the event that Equation 13 holds for all t N. As established in Lemma 2, we have P(E) 1 2δ. From Lemma 2 we can also establish the following corollary. Corollary 5. If event E holds then we have: f (x) ˆft(x) g (y, s) (32C)1/4p log(t|F|/δ) + 2 p λt B g (y, s) bΣ 1 t for all s S, x X and y Y and all t N. Provable Interactive Learning with Hindsight Instruction Feedback Proof. Let the event E hold. Then the following sequence of inequalities holds, f (x) ˆft(x) g (y, s) (i) f (x) ˆft(x) bΣt g (y, s) bΣ 1 t , (ii) (32C)1/4p log(t|F|/δ) + 2 p λt B g (y, s) bΣ 1 t , where (i) holds by Inequality 12 and (ii) holds because of Lemma 2 and because we are assuming E holds. B.1. Proving Optimism for LORIL We next show that LORIL derives a useful optimism inequality that allows us to upper bound the true model probabilities P(x | y, s) = f (x) g (y, s) using the estimated model probabilities ˆft(x) g (y, s). Lemma 3. 
Lemma 3. If event $\mathcal{E}$ holds, then we have
$$f^*(x)^\top g^*(\pi^*(x, s), s) \leq \hat{f}_t(x)^\top g^*(\pi_t(x, s), s) + \Big((32C)^{1/4}\sqrt{\log(t|\mathcal{F}|/\delta)} + 2\sqrt{\lambda_t}\, B\Big)\left\|g^*(\pi_t(x, s), s)\right\|_{\hat\Sigma_t^{-1}} \quad (17)$$
for all $x \in \mathcal{X}$, $s \in \mathcal{S}$, and $t \in \mathbb{N}$, where $\pi_t$ is the policy in the $t$th round of LORIL and $\pi^*: (x, s) \mapsto \arg\max_{y \in \mathcal{Y}} P(x \mid y, s)$ is the optimal policy.

Proof. If event $\mathcal{E}$ holds, then for any $s \in \mathcal{S}$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $t \in \mathbb{N}$ we have, directly from Corollary 5,
$$f^*(x)^\top g^*(y, s) \leq \hat{f}_t(x)^\top g^*(y, s) + \Big((32C)^{1/4}\sqrt{\log(t|\mathcal{F}|/\delta)} + 2\sqrt{\lambda_t}\, B\Big)\left\|g^*(y, s)\right\|_{\hat\Sigma_t^{-1}}.$$
For any $s \in \mathcal{S}$, $x \in \mathcal{X}$, and $t \in \mathbb{N}$, this implies
$$f^*(x)^\top g^*(\pi^*(x, s), s) \leq \hat{f}_t(x)^\top g^*(\pi^*(x, s), s) + \Big((32C)^{1/4}\sqrt{\log(t|\mathcal{F}|/\delta)} + 2\sqrt{\lambda_t}\, B\Big)\left\|g^*(\pi^*(x, s), s)\right\|_{\hat\Sigma_t^{-1}}$$
$$\leq \hat{f}_t(x)^\top g^*(\pi_t(x, s), s) + \Big((32C)^{1/4}\sqrt{\log(t|\mathcal{F}|/\delta)} + 2\sqrt{\lambda_t}\, B\Big)\left\|g^*(\pi_t(x, s), s)\right\|_{\hat\Sigma_t^{-1}},$$
where the second inequality holds because of the definition of $\pi_t(x, s)$ as the maximizer of the right-hand side over $y \in \mathcal{Y}$.

B.2. Regret Upper Bound

We are now ready to upper bound the regret of LORIL (Algorithm 1). Recall that we define the regret for a given run of LORIL as
$$\mathrm{Reg}(T) = \sum_{t=1}^T \Big(P(x_t \mid \pi^*(x_t, s_t), s_t) - P(x_t \mid \pi_t(x_t, s_t), s_t)\Big).$$
The main result is the following theorem.

Theorem 3 (Regret bound of LORIL). When Assumptions 1 and 2 hold and $\lambda_t = 1/t$, the regret of Algorithm 1 satisfies
$$\mathrm{Reg}(T) = O\Big(B\sqrt{Td\log(1 + TB)} + \sqrt{Td\log(T|\mathcal{F}|/\delta)\log(1 + TB)}\Big)$$
with probability at least $1 - 3\delta$ for all $T \in \mathbb{N}$.

Proof. Fix $T \in \mathbb{N}$ and assume that event $\mathcal{E}$ holds. Then we can bound the regret as
$$\mathrm{Reg}(T) = \sum_{t=1}^T \Big(P(x_t \mid \pi^*(x_t, s_t), s_t) - P(x_t \mid \pi_t(x_t, s_t), s_t)\Big) = \sum_{t=1}^T \Big(f^*(x_t)^\top g^*(\pi^*(x_t, s_t), s_t) - f^*(x_t)^\top g^*(\pi_t(x_t, s_t), s_t)\Big)$$
$$\leq \sum_{t=1}^T \Big(\hat{f}_t(x_t)^\top g^*(\pi_t(x_t, s_t), s_t) + \Big((32C)^{1/4}\sqrt{\log(t|\mathcal{F}|/\delta)} + 2\sqrt{\lambda_t}\, B\Big)\left\|g^*(\pi_t(x_t, s_t), s_t)\right\|_{\hat\Sigma_t^{-1}} - f^*(x_t)^\top g^*(\pi_t(x_t, s_t), s_t)\Big),$$
where we use Lemma 3 in the last step. We also have
$$\hat{f}_t(x_t)^\top g^*(\pi_t(x_t, s_t), s_t) - f^*(x_t)^\top g^*(\pi_t(x_t, s_t), s_t) \leq \Big((32C)^{1/4}\sqrt{\log(t|\mathcal{F}|/\delta)} + 2\sqrt{\lambda_t}\, B\Big)\left\|g^*(\pi_t(x_t, s_t), s_t)\right\|_{\hat\Sigma_t^{-1}}$$
due to Corollary 5. Combining these two results, we get
$$\mathrm{Reg}(T) \leq \sum_{t=1}^T 2\Big((32C)^{1/4}\sqrt{\log(t|\mathcal{F}|/\delta)} + 2\sqrt{\lambda_t}\, B\Big)\left\|g^*(\pi_t(x_t, s_t), s_t)\right\|_{\hat\Sigma_t^{-1}} \quad (18)$$
$$\leq 2\sup_{t' \in [T]}\Big((32C)^{1/4}\sqrt{\log(t'|\mathcal{F}|/\delta)} + 2\sqrt{\lambda_{t'}}\, B\Big)\sum_{t=1}^T \left\|g^*(\pi_t(x_t, s_t), s_t)\right\|_{\hat\Sigma_t^{-1}}. \quad (19)$$
Let $\tilde\Sigma_t = \sum_{\ell=1}^{t-1} g^*(y_\ell, s_\ell)\, g^*(y_\ell, s_\ell)^\top + \lambda_T I$ for all $t \in [T]$. Further, let $D_t = \sum_{\ell=1}^{t-1} g^*(y_\ell, s_\ell)\, g^*(y_\ell, s_\ell)^\top$, which gives us $\tilde\Sigma_t = D_t + \lambda_T I$ and $\hat\Sigma_t = D_t + \lambda_t I$. This implies that for any $v \in \mathbb{R}^d$ we have
$$\|v\|_{\tilde\Sigma_t} = \sqrt{v^\top (D_t + \lambda_T I)v} = \sqrt{v^\top D_t v + \lambda_T \|v\|_2^2} \leq \sqrt{v^\top D_t v + \lambda_t \|v\|_2^2} = \sqrt{v^\top (D_t + \lambda_t I)v} = \|v\|_{\hat\Sigma_t}.$$
As $\hat\Sigma_t \succeq \tilde\Sigma_t \succ 0$ and $\hat\Sigma_t = \tilde\Sigma_t + (\lambda_t - \lambda_T)I$ with $\lambda_t - \lambda_T \geq 0$, it follows by Proposition 8 that $\|v\|_{\hat\Sigma_t^{-1}} \leq \|v\|_{\tilde\Sigma_t^{-1}}$. This implies
$$\sum_{t=1}^T \left\|g^*(\pi_t(x_t, s_t), s_t)\right\|_{\hat\Sigma_t^{-1}} \leq \sum_{t=1}^T \left\|g^*(\pi_t(x_t, s_t), s_t)\right\|_{\tilde\Sigma_t^{-1}}.$$
Finally, Proposition 7 implies
$$\sum_{t=1}^T \left\|g^*(y_t, s_t)\right\|_{\tilde\Sigma_t^{-1}} \leq \sqrt{2Td\log\left(1 + \frac{TB^2}{\lambda_T}\right)},$$
where $y_t = \pi_t(x_t, s_t)$ and we use $\sup_{y \in \mathcal{Y}, s \in \mathcal{S}} \|g^*(y, s)\|_2 \leq B$. Combining these, we get
$$\mathrm{Reg}(T) \leq 2\sup_{t' \in [T]}\Big((32C)^{1/4}\sqrt{\log(t'|\mathcal{F}|/\delta)} + 2\sqrt{\lambda_{t'}}\, B\Big)\sqrt{2Td\log\left(1 + \frac{TB^2}{\lambda_T}\right)}.$$
Using $\lambda_t = 1/t \leq 1$, we get
$$\mathrm{Reg}(T) \leq 2\Big((32C)^{1/4}\sqrt{\log(T|\mathcal{F}|/\delta)} + 2B\Big)\sqrt{2Td\log\left(1 + T^2B^2\right)} = O\Big(\sqrt{Td\log(T|\mathcal{F}|/\delta)\log(1 + TB)} + B\sqrt{Td\log(1 + TB)}\Big),$$
where we use $\log(1 + T^2B^2) \leq 2\log(1 + TB)$. This gives us $\mathrm{Reg}(T) = O\big(B\sqrt{Td\log(1 + TB)} + \sqrt{Td\log(T|\mathcal{F}|/\delta)\log(1 + TB)}\big)$.
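The elliptical potential term controlling $\sum_{t=1}^T \|g^*(y_t, s_t)\|_{\tilde\Sigma_t^{-1}}$ in the last step can be sanity-checked numerically. The following minimal Python sketch (with randomly generated bounded vectors, purely an illustrative assumption) accumulates the sum of inverse-covariance norms and compares it against a bound of the $\sqrt{2Td\log(1 + TL^2/\lambda)}$ form used above.

```python
import math
import torch

torch.manual_seed(0)
T, d, lam, L = 500, 10, 1.0, 1.0

Sigma = lam * torch.eye(d)          # Sigma_1 = lambda * I
total = 0.0
for t in range(T):
    v = torch.randn(d)
    v = L * v / v.norm()            # enforce ||v_t||_2 <= L
    total += torch.sqrt(v @ torch.linalg.inv(Sigma) @ v).item()  # ||v_t||_{Sigma_t^{-1}}
    Sigma += torch.outer(v, v)      # Sigma_{t+1} = Sigma_t + v_t v_t^T

bound = math.sqrt(2 * T * d * math.log(1 + T * L**2 / lam))
print(f"sum of norms = {total:.2f}, bound = {bound:.2f}")  # expect total <= bound
```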
C. Useful Lemmas

Lemma 4 (Hoeffding's inequality). Let $\{Y_\ell\}_{\ell=1}^t$ be a martingale difference sequence such that $Y_\ell \in [a_\ell, b_\ell]$ almost surely for some constants $a_\ell, b_\ell$, for all $\ell = 1, \ldots, t$. Then
$$\sum_{\ell=1}^t Y_\ell \leq \sqrt{\frac{1}{2}\sum_{\ell=1}^t (b_\ell - a_\ell)^2 \ln\frac{1}{\tilde\delta}}$$
with probability at least $1 - \tilde\delta$. See, for example, Corollary 2.20 from (Wainwright, 2019).

Our results rely on the following variant of Bernstein's inequality for martingales, also known as Freedman's inequality (Freedman, 1975), as stated in, e.g., (Agarwal et al., 2014; Beygelzimer et al., 2011).

Lemma 5 (Simplified Freedman's inequality). Let $X_1, \ldots, X_T$ be a bounded martingale difference sequence with $|X_\ell| \leq R$. For any $\delta' \in (0, 1)$ and $\eta \in (0, 1/R)$, with probability at least $1 - \delta'$,
$$\sum_{\ell=1}^T X_\ell \leq \eta \sum_{\ell=1}^T \mathbb{E}_\ell[X_\ell^2] + \frac{\log(1/\delta')}{\eta},$$
where $\mathbb{E}_\ell[\cdot]$ is the conditional expectation induced by conditioning on $X_1, \ldots, X_{\ell-1}$ (we use this notation for conditional expectations throughout this work).

Lemma 6 (Anytime Freedman). Let $\{X_t\}_{t=1}^\infty$ be a bounded martingale difference sequence with $|X_t| \leq R$ for all $t \in \mathbb{N}$. For any $\delta' \in (0, 1)$ and $\eta \in (0, 1/R)$, there exists a universal constant $C > 0$ such that, for all $t \in \mathbb{N}$ simultaneously, with probability at least $1 - \delta'$,
$$\sum_{\ell=1}^t X_\ell \leq \eta \sum_{\ell=1}^t \mathbb{E}_\ell[X_\ell^2] + \frac{C\log(t/\delta')}{\eta},$$
where $\mathbb{E}_\ell[\cdot]$ is the conditional expectation induced by conditioning on $X_1, \ldots, X_{\ell-1}$.

Proof. This result follows from Lemma 5. Fix a time index $t$ and define $\delta_t = \frac{\delta'}{12t^2}$. Lemma 5 implies that with probability at least $1 - \delta_t$,
$$\sum_{\ell=1}^t X_\ell \leq \eta \sum_{\ell=1}^t \mathbb{E}_\ell[X_\ell^2] + \frac{\log(1/\delta_t)}{\eta}.$$
A union bound implies that with probability at least $1 - \sum_{t=1}^\infty \delta_t \geq 1 - \delta'$,
$$\sum_{\ell=1}^t X_\ell \leq \eta \sum_{\ell=1}^t \mathbb{E}_\ell[X_\ell^2] + \frac{\log(12t^2/\delta')}{\eta} \overset{(i)}{\leq} \eta \sum_{\ell=1}^t \mathbb{E}_\ell[X_\ell^2] + \frac{C\log(t/\delta')}{\eta}$$
holds for all $t \in \mathbb{N}$. Inequality (i) holds because $\log(12t^2/\delta') = O(\log(t/\delta'))$.

The following maximum likelihood estimation bound is adapted from Theorem 21 of (Agarwal et al., 2020); see also (Geer, 2000).

Proposition 6 (MLE Bound). For any fixed $\delta \in (0, 1)$,
$$\sum_{\ell=1}^{t-1} \mathbb{E}_{y \sim P_\ell}\left[\big\|\hat{f}_t(\cdot)^\top g^*(y, s_\ell) - f^*(\cdot)^\top g^*(y, s_\ell)\big\|_{\mathrm{TV}}^2\right] \leq 2\ln(t|\mathcal{F}|/\delta)$$
for all $t \in \mathbb{N}$ simultaneously with probability at least $1 - \delta$, where $P_\ell \in \Delta(\mathcal{Y})$ is the distribution over $y_\ell$ conditioned on $\{x_{\ell'}, s_{\ell'}, y_{\ell'}, x'_{\ell'}\}_{\ell'=1}^{\ell-1} \cup \{x_\ell, s_\ell\}$.

Proof. This result is an immediate consequence of Theorem 21 from (Agarwal et al., 2020). We convert that result into an anytime statement by invoking it repeatedly with probability values $\delta_t = \frac{\delta}{3t^2}$ and then applying a union bound.

Proposition 7 (Proposition 3 from (Pacchiano et al., 2021)). For any sequence of vectors $v_1, \ldots, v_T \in \mathbb{R}^d$ satisfying $\|v_\ell\| \leq L$ for all $\ell \in \mathbb{N}$, let $\Sigma_t$ be the corresponding Gram matrix $\Sigma_t = \lambda I + \sum_{\ell=1}^{t-1} v_\ell v_\ell^\top$. Then for all $T \in \mathbb{N}$ we have
$$\sum_{\ell=1}^T \|v_\ell\|_{\Sigma_\ell^{-1}} \leq \sqrt{2Td\log\left(1 + \frac{TL^2}{\lambda}\right)}.$$

Proposition 8. Let $A \succ 0$ be a $d \times d$ positive definite matrix and let $\lambda' \geq 0$. Then for any $v \in \mathbb{R}^d$, $\|v\|_{A + \lambda' I} \geq \|v\|_A$ and $\|v\|_{(A + \lambda' I)^{-1}} \leq \|v\|_{A^{-1}}$.

Proof. Let $v_1, \ldots, v_d$ be an orthonormal basis of eigenvectors of $A$. The eigenvalues of $A$ are positive because the matrix is assumed to be positive definite; call $\mu_i > 0$ the eigenvalue associated with eigenvector $v_i$. Elementary linear algebra shows that
$$(A + \lambda' I)v_i = Av_i + \lambda' v_i = (\mu_i + \lambda')v_i,$$
thus the $v_i$ also form an orthonormal basis of eigenvectors for $A + \lambda' I$, with eigenvalues $\mu_i + \lambda'$. Thus,
$$\|v\|_A^2 = v^\top A v = \sum_{i=1}^d \mu_i \langle v, v_i\rangle^2 \leq \sum_{i=1}^d (\mu_i + \lambda')\langle v, v_i\rangle^2 = v^\top (A + \lambda' I)v = \|v\|_{A + \lambda' I}^2.$$
Notice that $A^{-1} = \sum_{i=1}^d \frac{1}{\mu_i} v_i v_i^\top$ and $(A + \lambda' I)^{-1} = \sum_{i=1}^d \frac{1}{\mu_i + \lambda'} v_i v_i^\top$. Therefore,
$$\|v\|_{A^{-1}}^2 = v^\top A^{-1} v = \sum_{i=1}^d \frac{1}{\mu_i}\langle v, v_i\rangle^2 \geq \sum_{i=1}^d \frac{1}{\mu_i + \lambda'}\langle v, v_i\rangle^2 = v^\top (A + \lambda' I)^{-1}v = \|v\|_{(A + \lambda' I)^{-1}}^2.$$
The result follows.

D. Experimental Details

We provide additional details of our experiments in this section.

D.1. Additional Details for the First Experiment on the Synthetic Task

We use almost the same implementation of LORIL as listed in Algorithm 1. The only change we make is that we use a simplified policy given by
$$\pi_t(x_t) = \arg\max_{y \in \mathcal{Y}} \hat{f}_t(x_t)^\top g^*(y, s_t) + k\left\|g^*(y, s_t)\right\|_{\hat\Sigma_t^{-1}}, \quad (22)$$
where $k$ is a single hyperparameter. We also use $\lambda_t = \lambda$, where $\lambda$ is a hyperparameter to be tuned. We use the hyperparameter values in Table 1.

$\epsilon$: grid search in [0.05, 0.1, 0.2, 0.3]
$\lambda$: grid search in [0.05, 0.1, 1.0]
$k$: grid search in [0.1, 1.0, 10.0]
optimization: Adam
learning rate: 0.001
temperature used in defining $\mathcal{F}$ and $\mathcal{G}$: 0.75
$|\mathcal{X}|$: 2000
$d$: 10
$|\mathcal{Y}|$: 10

Table 1. Hyperparameter values and grid search ranges for the first experiment.
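For completeness, the sketch below shows how the maximum likelihood update of $\hat{f}_t$ used by LORIL can be implemented at the scale of this synthetic setup. The embedding-table parameterization of $f$, the placement of the temperature, and the batch interface are illustrative assumptions consistent with Table 1, not the exact experimental code.

```python
import torch
import torch.nn.functional as F

num_x, d, temp = 2000, 10, 0.75            # |X|, d, and temperature from Table 1

# Hypothetical parameterization of f: one d-dimensional embedding per instruction x.
f_table = torch.nn.Parameter(torch.randn(num_x, d))
optimizer = torch.optim.Adam([f_table], lr=0.001)

def mle_step(x_hindsight: torch.Tensor, g_feats: torch.Tensor) -> float:
    """One maximum likelihood update on (hindsight instruction, response feature) pairs.

    x_hindsight: long tensor of hindsight instruction indices, shape (batch,)
    g_feats:     features g(y_l, s_l) of the generated responses, shape (batch, d)
    """
    logits = (g_feats @ f_table.T) / temp          # scores over all instructions x, shape (batch, |X|)
    loss = F.cross_entropy(logits, x_hindsight)    # negative log-likelihood of the hindsight instructions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```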
D.2. Additional Details for the Second Experiment on the Image Selection Task

We use the hyperparameter values shown in Table 2.

$\epsilon$: grid search in [0.05, 0.1, 0.2, 0.3]
$\lambda$: grid search in [0.05, 0.1, 1.0]
$k$: grid search in [0.1, 1.0, 10.0]
optimization: Adam
learning rate: 0.001
vocabulary size: 34
word embedding dimension: 10
GRU hidden dimension: 10
dimension of $g$ encoding: 16
number of layers in GRU: 2
possible templates: 10
object types: [square, rectangle, triangle, circle]
object colors: [red, blue, green, yellow, black, grey, black, cyan, orange]
$|\mathcal{Y}|$: 5

Table 2. Hyperparameter values and grid search ranges for the second experiment.

The list of templates is given below, where {object1} and {color1} are variables that are replaced by the object type and its color in a given image.

1. You are seeing a {object1} of color {color1}
2. The image contains a {object1} of color {color1}
3. There is a {color1} colored object of type {object1}
4. A {color1} {object1}
5. The object is a {object1} and its color is {color1}.
6. The image has a single {color1} colored {object1}.
7. You are seeing a {color1} colored {object1}.
8. There is a {color1} colored object.
9. You are seeing a {object1} in the image.
10. There is a {color1} colored {object1}.

Autoencoder. We use a 3-layer autoencoder with leaky ReLU activations. The first two layers of the encoder apply convolutions with 16 kernels of size 8 × 8 and stride 4. The last layer applies 16 kernels of size 4 × 4 with stride 2. The input is an RGB image of size 200 × 200 × 3. After applying the CNN encoder, we get a feature map of size 16 × 4 × 4. We flatten this feature and apply a fully connected layer to map it to a vector of size 64. We reshape this vector and pass it through a 3-layer deconvolutional network with leaky ReLU activations to predict an image of the same size as the input. The first layer of the decoder applies 16 ConvTranspose2d kernels of size 4 × 4 with stride 2 and output padding 1. The second layer applies 16 ConvTranspose2d kernels of size 8 × 8 with stride 4 and output padding 1. Finally, the last layer applies 16 ConvTranspose2d kernels of size 8 × 8 with stride 4 and no output padding. We train the autoencoder with the squared loss using the Adam optimizer, and clip gradients above a clipping value of 2.5. Finally, we model $g^*(y, s)$ by first applying the encoder to generate a feature map of size 16 × 4 × 4, and then summing over the first (channel) dimension and flattening the remaining tensor into a 16-dimensional vector.

Compute. We use an A2600 for all experiments. The entire set of experiments took 3 hours to finish. We used PyTorch to implement the code.
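For reference, a minimal PyTorch sketch of the encoder and of the resulting feature map $g(y, s)$ described above is given below. The padding choices and the 16 output channels in the last encoder layer are assumptions made to match the stated 16 × 4 × 4 feature map, so this is a sketch rather than the exact experimental code.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """CNN encoder producing a 16 x 4 x 4 feature map from a 200 x 200 RGB image."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.LeakyReLU(),   # 200 -> 49
            nn.Conv2d(16, 16, kernel_size=8, stride=4), nn.LeakyReLU(),  # 49 -> 11
            nn.Conv2d(16, 16, kernel_size=4, stride=2), nn.LeakyReLU(),  # 11 -> 4
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (batch, 3, 200, 200) -> feature map of shape (batch, 16, 4, 4)
        return self.conv(img)

def g_feature(encoder: ImageEncoder, img: torch.Tensor) -> torch.Tensor:
    """g(y, s): sum the feature map over channels and flatten it to a 16-dimensional vector."""
    feats = encoder(img)                  # (batch, 16, 4, 4)
    return feats.sum(dim=1).flatten(1)    # (batch, 16)
```

Feeding a (batch, 3, 200, 200) image tensor through g_feature yields a (batch, 16) tensor, matching the 16-dimensional $g$ encoding listed in Table 2.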