# coactive_learning__1cbc5692.pdf

Journal of Artiﬁcial Intelligence Research 53 (2015) 1-40 Submitted 08/14; published 05/15

Coactive Learning

Pannaga Shivaswamy pshivaswamy@linkedin.com Linked In Corporation 2029 Stierlin Ct Mountain View, CA 94043, USA

Thorsten Joachims tj@cs.cornell.edu Department of Computer Science Cornell University Ithaca, NY 14853, USA

We propose Coactive Learning as a model of interaction between a learning system and a human user, where both have the common goal of providing results of maximum utility to the user. Interactions in the Coactive Learning model take the following form: at each step, the system (e.g. search engine) receives a context (e.g. query) and predicts an object (e.g. ranking); the user responds by correcting the system if necessary, providing a slightly improved but not necessarily optimal object as feedback. We argue that such preference feedback can be inferred in large quantity from observable user behavior (e.g., clicks in web search), unlike the optimal feedback required in the expert model or the cardinal valuations required for bandit learning. Despite the relaxed requirements for the feedback, we show that it is possible to adapt many existing online learning algorithms to the coactive framework. In particular, we provide algorithms that achieve O(1/

T) average regret in terms of cardinal utility, even though the learning algorithm never observes cardinal utility values directly. We also provide an algorithm with O(log(T)/T) average regret in the case of λ-strongly convex loss functions. An extensive empirical study demonstrates the applicability of our model and algorithms on a movie recommendation task, as well as ranking for web search.

1. Introduction

In a wide range of systems in use today, the interaction between human and system takes the following form. The user issues a command (e.g. query) and receives a possibly structured result in response (e.g. ranking). The user then interacts with the results (e.g. clicks), thereby providing implicit feedback about the user s utility function. Here are three examples of such systems and their typical interaction patterns:

Web Search: In response to a query, a search engine presents the ranking [A, B, C, D, ...] and observes that the user clicks on documents B and D.

Movie Recommendation: An online service recommends movie A to a user. However, the user rents movie B after browsing the collection.

Machine Translation: An online machine translator is used to translate a wiki page from language A to B. The system observes some corrections the user makes to the translated text.

c 2015 AI Access Foundation. All rights reserved.

Shivaswamy & Joachims

In all the above examples, the user provides some feedback about the results of the system. However, the feedback is only an incremental improvement, not necessarily the optimal result. For example, from the clicks on the web search results we can infer that the user would have preferred the ranking [B, D, A, C, ...] over the one we presented. However, this is unlikely to be the best possible ranking. Similarly in the recommendation example, movie B was preferred over movie A, but there may have been even better movies that the user did not ﬁnd while browsing. And in the machine translation example, the corrected text need not be the best possible translation from language A to language B. In all three examples, the algorithm typically receives a slightly improved result from the user as feedback, but not necessarily the optimal prediction or any cardinal utilities. We conjecture that many other applications fall into this schema, ranging from news ﬁltering to personal robotics. In this paper, we propose Coactive Learning as a model of such system-user interactions. We formalize Coactive Learning as a general model of interaction between a learning system and its user, deﬁne a suitable notion of regret, and validate the key modeling assumption namely whether observable user behavior can provide valid feedback in our model in a user study for web search. The new model can be viewed as a cooperative learning process between system and user, where both parties aim to optimize utility but lack the means to achieve this goal on their own. Speciﬁcally, the (boundedly rational) user is computationally limited in maximizing utility over the space of alternatives, while the system is limited in how well it knows the user s utility function. The proposed online learning framework diﬀers signiﬁcantly from existing online learning models in terms of the observed feedback (see the related works section for a comparison). A strength of the proposed framework is that it is possible to derive a wide range of coactive learning algorithms by adapting existing online algorithms for convex optimization. We provide a template for Coactive Learning algorithms and then show several instances of this template in this paper, and in each case, we prove that the worst case analysis of the algorithm carries over from the conventional online learning framework to coactive learning despite the diﬀerences between the two models. In particular, in the cases of linear utility models and convex cost functions we show O(1/

T) regret bounds with a matching lower bound. We also show that the regret bound can be improved with a second order algorithm for strongly convex functions. The learning algorithms perform structured output prediction (see Bakir, Hofmann, Sch olkopf, Smola, Taskar, & Vishwanathan, 2007) and thus can be applied in a wide variety of problems. We study several interesting extensions of the framework using batch updates, expected feedback, and an exponentiated learning algorithm. Finally, we provide extensive empirical evaluations of our algorithms on a movie recommendation and a web search task, showing that the algorithms are highly eﬃcient and eﬀective in practical settings. The rest of this paper is organized as follows. We discuss related work in Section 2. In Section 3 we formally introduce the coactive learning model and also motivate the model with a real-world user study. We present the linear version of our algorithm along with several extensions in Section 4. In Section 5, we then detail a general schema for deriving coactive learning algorithms and their regret bounds. In particular, we derive an exponentiated gradient algorithm in Section 5.1, and we propose coactive learning algorithms for minimizing general convex losses and λ-strongly convex losses in Sections 5.2 and 5.3. An

Coactive Learning

empirical evaluation of the proposed framework and algorithms is done in Section 6 and we conclude in Section 7. We include most of our proofs in the Appendix.

2. Related Works

The Coactive Learning Model bridges the gap between two forms of feedback that have been well studied in online learning. On one side there is the multi-armed bandit model (e.g., Auer, Cesa-Bianchi, Freund, & Schapire, 2002b; Auer, Cesa-Bianchi, & Fischer, 2002a), where an algorithm chooses an action and observes the utility of (only) that action. On the other side, utilities of all possible actions are revealed in the case of learning with expert advice (e.g., Cesa-Bianchi & Lugosi, 2006a). Online convex optimization (Zinkevich, 2003; Hazan, Agarwal, & Kale, 2007) and online convex optimization in the bandit setting (Flaxman, Kalai, & Mc Mahan, 2005) are continuous relaxations of the expert and the bandit problems respectively. Our model, where information about two arms is revealed at each iteration (the one we presented and the one we receive as feedback from the user), sits between the expert and the bandit setting. Most closely related to Coactive Learning is the dueling bandits setting (Yue, Broder, Kleinberg, & Joachims, 2009; Yue & Joachims, 2009). The key diﬀerence is that both arms are chosen by the algorithm in the dueling bandits setting, whereas one of the arms is chosen by the user in the Coactive Learning setting. Our model allows contextual information like in contextual bandits (Langford & Zhang, 2007), however, the arms in our problem are structured objects such as rankings. A summary of how our framework compares with other existing frameworks is shown in Table 1. Other types of feedback have also been explored in the literature. For example, in the multi-class classiﬁcation problems, after the algorithm makes a prediction based on the context, the feedback received is only whether the prediction is correct or wrong as opposed to the actual label (Crammer & Gentile, 2011; Kakade, Shalev-Shwartz, & Tewari, 2008). This can be seen as observing partial feedback (as opposed to the actual cardinal feedback) in a bandit problem. As pointed out above, Coactive Learning algorithms and conventional online learning algorithms operate in diﬀerent types of environments. Coactive Learning algorithms present an object and observe another object as a feedback, while online convex learning algorithms pick a vector in each step and observe the gradient at that vector as feedback. Despite the contrast between online learning and Coactive Learning, two of the algorithms presented in this paper are closely related to those in the work of Zinkevich (2003) and Hazan et al. (2007). We show that it is possible to adapt the regret bounds of these algorithms to corresponding regrets bounds for Coactive Learning. At the heart of all our algorithms and analysis is the well-known idea (Polyak & Tsypkin, 1973) that the descent algorithms do not necessarily need to know the gradients, but that a vector with positive inner product with the gradient in expectation suﬃces. While feedback in Coactive Learning takes the form of a preference, it is diﬀerent from ordinal regression and ranking. Ordinal regression (e.g., Crammer & Singer, 2001) assumes training examples (x, y), where y is a rank. In the Coactive Learning model, absolute ranks are never revealed. More closely related is learning with pairs of examples (Herbrich, Graepel, & Obermayer, 2000; Freund, Iyer, Schapire, & Singer, 2003; Chu & Ghahramani, 2005), since it circumvents the need for absolute ranks; only relative orderings are required. Vari-

Shivaswamy & Joachims

Framework Algorithm Feedback Bandits pull an arm observe cardinal reward for the arm pulled Experts pull an arm observe cardinal rewards for all the arms Dueling Bandits pull two arms observe feedback on which one is better Coactive Learning pull an arm observe another arm which is better

Table 1: A comparison of diﬀerent online learning frameworks.

ants of such pairwise ranking algorithms have been applied to Natural Language Processing (Haddow, Arun, & Koehn, 2011; Zhang, Lei, Barzilay, Jaakkola, & Globerson, 2014) and image annotation (Weston, Bengio, & Usunier, 2011). However, existing approaches require an iid assumption and typically perform batch learning. Finally, there is a large body of work on ranking (see Liu, 2009). These approaches are diﬀerent from Coactive Learning as they require training data (x, y) where y is the optimal ranking for query x. However, we will draw upon structured prediction approaches for ranking problems in the design of our models. Coactive learning was ﬁrst proposed by Shivaswamy and Joachims (2012); this paper serves as a journal extension of that paper, adding a complete discussion of batch updates and expected feedback, the exponentiated gradient algorithm, the O(log(T)/T) algorithm for λ-strongly convex loss functions, and a substantially extended empirical evaluation. Since then, coactive learning has been applied to intrinsically diverse retrieval (Raman, Shivaswamy, & Joachims, 2012), learning ranking function from click feedback (Raman, Joachims, Shivaswamy, & Schnabel, 2013), optimizing social welfare (Raman & Joachims, 2013), personal robotics (Jain, Wojcik, Joachims, & Saxena, 2013), pattern discovery (Boley, Mampaey, Kang, Tokmakov, & Wrobel, 2013), robotic monitoring (Somers & Hollinger, 2014), and extended to allow approximate inference (Goetschalckx, Fern, & Tadepalli, 2014).

3. Coactive Learning Model

We now introduce coactive learning as a model of interaction (in rounds) between a learning system (e.g. search engine) and a human (search user) where both the human and learning algorithm have the same goal (of obtaining good results). At each round t, the learning algorithm observes a context xt X (e.g. a search query) and presents a structured object yt Y (e.g. a ranked list of URLs). The utility of yt Y to the user for context xt X is described by a utility function U(xt, yt), which is unknown to the learning algorithm. As feedback the human user returns an improved object yt Y (e.g. reordered list of URLs), i.e.,1

U(xt, yt) > U(xt, yt), (1)

when such an object yt exists. In fact, we will also allow violations of (1) when we formally model user feedback in Section 3.1. The process by which the user generates the feedback yt can be understood as an approximately utility-maximizing action, where the user is modeled as a boundedly rational

1. The improvements should be not just strict, but by a margin , this will be clear in Section 3.1.

Coactive Learning

agent. In particular, the user selects the feedback object yt by approximately maximizing utility over a user-deﬁned subset Yt of all possible Y.

yt = argmaxy YU(xt, y) (2)

This approximately and boundedly rational user may employ various tools (e.g., query reformulations, browsing) to construct the subset Y and to perform this search. Importantly, however, the feedback yt is typically not the optimal label which is deﬁned as,

y t := argmaxy YU(xt, y). (3)

In this way, Coactive Learning covers settings where the user cannot manually optimize the argmax over the full Y (e.g. produce the best possible ranking in web search), or has diﬃculty expressing a bandit-style cardinal rating U(xt, yt) for yt in a consistent manner. This puts our preference feedback yt in stark contrast to supervised learning approaches which require (xt, y t ). But even more importantly, our model implies that reliable preference feedback (1) can be derived from observable user behavior (i.e., clicks), as we will demonstrate in Section 3.2 for web search. We conjecture that similar feedback strategies exist in many application settings (e.g., Jain et al., 2013; Boley et al., 2013; Somers & Hollinger, 2014; Goetschalckx et al., 2014), especially when users can be assumed to act approximately and boundedly rational according to U. Despite the weak preference feedback, the aim of the algorithm is nevertheless to present objects with utility close to that of the optimal y t . Whenever, the algorithm presents an object yt under context xt, we say that it suﬀers a regret U(xt, y t ) U(xt, yt) at time step t. Formally, we consider the average regret suﬀered by the algorithm over T steps as follows:

t=1 (U(xt, y t ) U(xt, yt)) . (4)

The goal of the learning algorithm is to minimize REGT , thereby providing the human with predictions yt of high utility. Note, however, that a cardinal value of U is never observed by the learning algorithm, but U is only revealed ordinally through preferences (1).

3.1 Quantifying Preference Feedback Quality

To provide any theoretical guarantees about the regret of a learning algorithm in the coactive setting, we need to quantify the quality of the user feedback. Note that this quantiﬁcation is a tool for theoretical analysis, not a prerequisite or parameter to the algorithm. We quantify feedback quality by how much improvement y provides in utility space. In the simplest case, we say that user feedback is strictly α-informative when the following inequality is satisﬁed:

U(xt, yt) U(xt, yt) α(U(xt, y t ) U(xt, yt)). (5)

In the above inequality, α (0, 1] is an unknown parameter. Feedback is such that utility of yt is higher than that of yt by a fraction α of the maximum possible utility range U(xt, y t ) U(xt, yt). The term on the right hand side in the above inequality ensures that user feedback

Shivaswamy & Joachims

yt is not only better than yt, but also better by a margin α(U(xt, y t ) U(xt, yt)). Violations of the above feedback model are allowed by introducing slack variables ξt:2

U(xt, yt) U(xt, yt) = α(U(xt, y t ) U(xt, yt)) ξt. (6)

Note that the ξt are not restricted to be positive, but can be negative as well. We refer to the above feedback model as α-informative feedback. Note also that it is possible to express feedback of any quality using (6) with an appropriate value of ξt. Our regret bounds will contain ξt, quantifying to what extent the α-informative modeling assumption is violated. Finally, we will also consider an even weaker feedback model where a positive utility gain is only achieved in expectation over user actions:

Et[U(xt, yt) U(xt, yt)] = α(U(xt, y t ) U(xt, yt)) ξt. (7)

We refer to the above feedback as expected α-informative feedback. In the above equation, the expectation is over the user s choice of yt given yt under context xt (i.e., under a distribution Pxt[ yt|yt] which is dependent on xt). In the rest of this paper, we use a linear model for the utility function,

U(x, y) = w φ(x, y), (8)

where w RN is a parameter vector unknown both to the learning system and users and φ : X Y RN is a known joint feature map known to the system such that3

φ(x, y) ℓ2 R, (9)

for any x X and y Y. Note that both x and y can be structured objects.

3.2 User Study: Preferences from Clicks

We now validate that reliable preferences as speciﬁed in Equation (1) can indeed be inferred from implicit user behavior. In particular, we focus on preference feedback from clicks in web-search and draw upon data from a user study (Joachims, Granka, Pan, Hembrooke, Radlinski, & Gay, 2007). In this study, subjects (undergraduate students, n = 16) were asked to answer 10 identical questions 5 informational, 5 navigational using the Google search engine. All queries, result lists, and clicks were recorded. For each subject, queries were grouped into query chains by question4. On average, each query chain contained 2.2 queries and 1.8 clicks in the result lists. We use the following strategy to infer a ranking y from the user s clicks: prepend to the ranking y from the ﬁrst query of the chain all results that the user clicked throughout the whole query chain. To assess whether U(x, y) is indeed larger than U(x, y) as assumed in our learning model, we measure utility in terms of a standard measure of retrieval quality from Information Retrieval. We use DCG@10(x, y) = P10 i=1 r(x,y[i])

log i+1 , where r(x, y[i]) is the relevance score of the i-th document in ranking y (see Manning, Raghavan, & Sch utze,

2. Strictly speaking, the value of the slack variable depends on the choice of α and the deﬁnition of utility. However, for brevity, we do not explicitly show this dependence in the notation. 3. We make a slightly diﬀerent assumption in Section 5.1. 4. This was done manually, but can be automated with high accuracy (Jones & Klinkner, 2008).

Coactive Learning

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

-5 -4 -3 -2 -1 0 1 2 3 4 5

Cumulative Distribution Function

DCG(x,ybar)-DCG(x,y)

Normal Condition Swapped Condition Reversed Condition

All Conditions

Figure 1: Cumulative distribution of utility diﬀerences between presented ranking y and click-feedback ranking y in terms of DCG@10 for three experimental conditions and overall.

2008). To get ground-truth relevance assessments r(x, d), ﬁve independent human assessors (students) were asked to manually rank the set of results encountered during each query chain. The assessors were also given the true answers for navigational queries. We then linearly normalize the resulting ranks to a relative relevance score r(x, d) [0..5] for each document.

We can now evaluate whether the feedback ranking y is indeed better than the ranking y that was originally presented, i.e., DCG@10(x, y) > DCG@10(x, y). Figure 1 plots the Cumulative Distribution functions (CDFs) of DCG@10(x, y) DCG@10(x, y) for three experimental conditions, as well as the average over all conditions. All CDFs are shifted far to the right of 0, showing that preference feedback from our strategy is highly accurate and informative. Focusing ﬁrst on the average over all conditions, the utility diﬀerence is strictly positive on 60% of all queries, and strictly negative on only 10%. This imbalance is signiﬁcant (binomial sign test, p < 0.0001). Among the remaining 30% of cases where the DCG@10 diﬀerence is zero, 88% are due to y = y (i.e. click only on top 1 or no click). Note that a learning algorithm can easily detect those cases and may explicitly eliminate them as feedback. Overall, this shows that implicit feedback can indeed produce accurate preferences.

What remains to be shown is whether the reliability of the feedback is aﬀected by the quality of the current prediction, i.e., U(xt, yt). In the user study, some users actually received results for which retrieval quality was degraded on purpose. In particular, about one third of the subjects received Google s top 10 results in reverse order (condition reversed ) and another third received rankings with the top two positions swapped (condition swapped ). As Figure 1 shows, we ﬁnd that users provide accurate preferences across this substantial range of retrieval quality. Intuitively, a worse retrieval system may make it harder to ﬁnd good results, but it also makes an easier baseline yt to improve upon. This intuition is formally captured in our deﬁnition of α-informative feedback. The optimal value of the α vs. ξ trade-oﬀ, however, will likely depend on many application-speciﬁc factors, like user motivation, corpus properties, and query diﬃculty. In the following, we therefore present algorithms that do not require knowledge of α, theoretical bounds that hold for any value of α, and experiments that explore a large range of α.

Shivaswamy & Joachims

Algorithm 1 Preference Perceptron.

Initialize w1 0 for t = 1 to T do Observe xt Present yt argmaxy Yw t φ(xt, y) Obtain feedback yt Update: wt+1 wt + φ(xt, yt) φ(xt, yt) end for

4. The Preference Perceptron for Coactive Learning

In this section, we start by presenting and analyzing the most basic algorithm for the coactive learning model, which we call the Preference Perceptron (Algorithm 1). The Preference Perceptron maintains a weight vector wt which is initialized to 0. At each time step t, the algorithm observes the context xt and presents an object y that maximizes w t φ(xt, y). The algorithm then observes user feedback yt and the weight vector wt is updated in the direction φ(xt, yt) φ(xt, yt). Although the update of the preference perceptron appears similar to the standard perceptron for multi-class classiﬁcation problems, there are key diﬀerences. First, the standard perceptron algorithm requires the true label y as feedback, whereas much weaker feedback y suﬃces for our algorithm. Second, the standard analysis of the perceptron bounds the number of mistakes made by the algorithm based on margin and the radius of the examples. In contrast, our analysis bounds a diﬀerent regret that captures a graded notion of utility. Further, the standard perceptron mistake bound (Novikoﬀ, 1962) contains R2 w 2 while our bound in the following Theorem contains R w where R is as deﬁned in (9).

Theorem 1 The average regret of the preference perceptron algorithm can be upper bounded, for any α (0, 1] and for any w as follows:

t=1 ξt + 2R w

Proof First, consider w T+1 2, we have,

w T+1w T+1 = w T w T + 2w T (φ(x T , y T ) φ(x T , y T ))

+ (φ(x T , y T ) φ(x T , y T )) (φ(x T , y T ) φ(x T , y T )

w T w T + 4R2 4R2T.

On line one, we simply used our update rule from algorithm 1. On line two, we used the fact that w T (φ(x T , y T ) φ(x T , y T )) 0 from the choice of y T in Algorithm 1 and that φ(x, y) R. Further, from the update rule in algorithm 1, we have,

w T+1w = w T w + (φ(x T , y T ) φ(x T , y T )) w

t=1 (U(xt, yt) U(xt, yt)) . (11)

Coactive Learning

We now use the fact that w T+1w w w T+1 (Cauchy-Schwarz inequality), which implies T X

t=1 (U(xt, yt) U(xt, yt)) 2R

From the α-informative modeling of the user feedback in (6), we have

t=1 (U(xt, y t ) U(xt, yt))

from which the claimed result follows.

The ﬁrst term in the regret bound denotes the quality of feedback in terms of violation of the α-informative feedback. In particular, if the user feedback is strictly α-informative for some α, then all slack variables in (10) vanish and REGT = O(1/

T). It is trivial to design algorithms (with even better regret) under strict α-informative assumption when the cardinality of the context set X is ﬁnite. One of the interesting aspects of the above bound (Theorem 1) and the subsequent results is that we can minimize the regret even when the context xt is diﬀerent in every step. Thus, |X| could be inﬁnite and the regret bound still holds. We note that the bound in Theorem 1 holds for any w and α (0, 1]. The slacks have to be based on the corresponding α and w . Though user feedback is modeled via α-informative feedback, the algorithm itself does not require knowledge of α; α plays a role only in the analysis. So far, we have characterized user behavior in terms of deterministic feedback actions. However, if a bound on the expected regret suﬃces, the weaker model of Expected αInformative Feedback from Equation (7) is applicable.

Corollary 2 Under the expected α-informative feedback model, the expected regret (over user behavior distribution) of the preference perceptron algorithm can be upper bounded as follows:

E[REGT ] 1 αT

t=1 ξt + 2R w

The above corollary can be proved by following the argument of Theorem 1, but taking expectations over user feedback:

E[w T+1w T+1] = E[w T w T ] + E[2w T (φ(x T , y T ) φ(x T , y T ))]

+ ET [(φ(x T , y T ) φ(x T , y T )) (φ(x T , y T ) φ(x T , y T )]

E[w T w T ] + 4R2.

In the above, E denotes expectation over all user feedback yt given yt under the context xt. It follows that E[w T+1w T+1] 4TR2.

Shivaswamy & Joachims

Algorithm 2 Batch Preference Perceptron.

Initialize w1 0 l 1 s 0 for t = 1 to T do Observe xt Present yt argmaxy Yw l φ(xt, y) Obtain feedback yt if t == s + k then Update: wl+1 wl + Pt j=s φ(xj, yj) φ(xj, yj) l l + 1 s t end if end for

Applying Jensen s inequality on the concave function , we get:

E[w T w ] w E[ w T ] w q

E[w T w T ].

The corollary follows from the deﬁnition of expected α-informative feedback.

4.1 Lower Bound

We now show that the upper bound in Theorem 1 cannot be improved in general with respect to its scaling with T. In the following lemma, given any Coactive Learning algorithm, we construct a sequence of examples where, even with α = 1 feedback, the algorithm suﬀers an average regret of Ω(1/

Lemma 3 For any coactive learning algorithm A with linear utility, there exist xt, objects Y and w such that REGT of A in T steps is Ω(1/

Proof Consider a problem where Y = { 1, +1}, X = {x RT : x = 1}. Deﬁne the joint feature map as φ(x, y) = yx. Consider T contexts e1, . . . , e T such that ej has only the jth component equal to one and all the others equal to zero. Let y1, . . . y T be the sequence of outputs of A on contexts e1, . . . , e T . Construct w = [ y1/

T, . . . , y T /

T] , we have for this construction w = 1. Let the user feedback on the tth step be yt. With these choices, the user feedback is always α-informative with α = 1 since y t = yt. Yet, the regret of the algorithm is 1

T PT t=1(w φ(et, y t ) w φ(et, yt)) = Ω( 1

4.2 Batch Update

The Preference Perceptron as stated in Algorithm 1 makes an update after every iteration. In some applications, due to high volumes of feedback, it might not be possible to do an update that frequently. For such scenarios, it is natural to consider a variant of Algorithm 1 that makes an update every k iterations; the algorithm simply uses wt obtained from the

Coactive Learning

Algorithm 3 Generic Template for Coactive Learning Algorithms

Initialize w1 for t = 1 to T do Observe xt Present yt argmaxy Yw t φ(xt, y) Obtain feedback yt Perform an update on wt using the gradient of w (φ(xt, yt) φ(xt, yt)) to obtain wt+1. end for

previous update until the next update. The type of updates shown in Algorithm 2 are called mini-batch updates and have been used in distributed online optimization (Dekel, Gilad Bachrach, Shamir, & Xiao, 2012). The steps of the batch update algorithm are shown in Algorithm 2. It is easy to show the following regret bound for batch updates:

Lemma 4 The average regret of the batch preference perceptron algorithm can be upper bounded, for any α (0, 1] and for any w as follows:

t=1 ξt + 2R w

While this lemma implies that mini-batches slow down learning by a factor equal to the batch size, we will see in Section 6.2.3 that empirically convergence is substantially faster.

5. Deriving Algorithms for Coactive Learning

The Preference Perceptron and the regret it minimizes, as deﬁned in Eqn. (4), is only one point in the design space of diﬀerent regret deﬁnitions and learning algorithms for coactive learning. In this section, we will outline a general strategy for deriving coactive learning algorithms from existing algorithms for online optimization. Furthermore, we will demonstrate that more general notions of regret are meaningful and feasible in coactive learning, and derive coactive learning algorithms for general convex and λ-strongly convex losses. All coactive learning algorithms that we derive in this section follow the general template outlined in Algorithm 3. After initializing w1, in each iteration the context xt is observed and the algorithm presents yt by maximizing its current utility estimate represented by wt. Once the feedback yt is observed, the algorithm simply takes the gradient of w (φ(xt, yt) φ(xt, yt)) and uses an update from a standard online convex optimization algorithm to obtain wt+1 from wt. In each case, an upper bound on the regret of the proposed algorithm is derived by using the following strategy. First, we start with a notion of regret that is suited for coactive learning. We then upper bound this regret by ﬁrst reducing it to a form such that results from a standard online convex opimization regret bound can be applied. This gives a regret bound for the original coactive algorithm in turn. In each case, we use this template algorithm to derive a speciﬁc algorithm. However, we still provide a self-contained proof (in the appendix) clearly pointing out where we have used the regret bound from a corresponding online convex optimization algorithm.

Shivaswamy & Joachims

Algorithm 4 Exponentiated Preference Perceptron

Intialize wi 1 1

T for t = 1 to T do Observe xt Present yt argmaxy Yw t φ(xt, y) Obtain feedback yt wi t+1 wi t exp( η(φi(xt, yt) φi(xt, yt)))/Zt, where Zt is such that the weights add to one. end for

5.1 Exponentiated Preference Perceptron

To illustrate the generic strategy for deriving coactive learning algorithms, we ﬁrst derive an exponentiated gradient algorithm for coactive learning that can be used as an alternative to the Preference Perceptron. The exponentiated gradient algorithm inherits the ability to learn quickly for sparse weight vectors. Unlike the additive updates of the Preference Perceptron, the exponentiated gradient algorithm summarized in Algorithm 4 performs multiplicative updates. This exponentiated algorithm is closely related to the exponentiated algorithms for classiﬁcation (Kivinen & Warmuth, 1997). At the start, it initializes all weights uniformly. Each subsequent update step has a rate η associated with it. The rate depends on an upper bound on the ℓ norm of the features (i.e., φ( , ) ℓ S) and the time horizon T. After each multiplicative update, the weights are normalized to sum to one, and the steps of the algorithm repeat. Since the updates are multiplicative and the weights are initially positive, wt is guaranteed to remain in the positive orthant for this algorithm. We note that Algoithm 4 is assumed to know both T and S. There are standard techniques (see Cesa-Biachi & Lugosi, 2006b) to convert such an algorithm to not have dependence on T, however, such extensions are not the focus of this paper. We now provide a regret bound for Algorithm 4. While the regret bounds for Algorithm 1 and Algorithm 2 depended on the ℓ2 norm of the features, the bound for the exponentiated algorithm depends on the ℓ norm of the features.

Theorem 5 For any w RN such that w ℓ1 = 1, w 0, under (expected) αinformative feedback the average (expected) regret of the Exponentiated Preference Perceptron can be upper bounded as:

t=1 ξt + 2 log(m)S

E[REGT ] 1 αT

t=1 ξt + 2 log(m)S

where φ(x, y) ℓ S.

Coactive Learning

Proof We start with the regret of the coactive learning algorithm as deﬁned in (4):

t=1 (U(xt, y t ) U(xt, yt))

t=1 (U(xt, yt) U(xt, yt)) + 1

w φ(xt, yt) w φ(xt, yt) + 1

t=1 ξt (13)

In the above equation, we have used the deﬁnition of α-informative feedback as deﬁned in Eqn. (6). By viewing Algorithm 4 as an exponentiated online gradient descent algorithm, it is easy to derive the following regret bound using techniques initially introduced by Kivinen and Warmuth (1997),

t=1 (w t (φ(xt, yt) φ(xt, yt)))

t=1 (U(xt, yt) U(xt, yt)) + 2 log(N)S

Since we could not ﬁnd this speciﬁc bound in the literature, a self-contained proof is provided in Appendix A. In the proof, REGT is ﬁrst upper bounded in terms of the diﬀerence between KL(w||wt+1) and KL(w||wt). A telescoping argument is then used to get the above result. Observing that w t (φ(xt, yt) φ(xt, yt)) 0, we get,

t=1 (U(xt, yt) U(xt, yt)) 2 log(N)S

Combining (13) and (14), we obtain the average regret bound. The proof of the expected regret bound is analogous to that of the Preference Perceptron.

Like the result in Theorem 1, the above result (Theorem 5) also bounds the regret in terms of the noise in the feedback (ﬁrst term) and additional terms which converge to zero at the rate O(1/

T). The key diﬀerence to Theorem 1 is that the regret bound of the exponentiated algorithm scales logarithmically with the number of features, but with the ℓ1-norm of w, which can be advantageous if the optimal w is sparse.

5.2 Convex Preference Perceptron

Generalizing the deﬁnition of regret from Eqn. (4), we now allow that at every time step t, there is an (unknown) convex loss function ct : R R which determines the loss ct(U(xt, yt) U(xt, y t )) at time t based on the diﬀerence in utility between yt and the optimal y t . The functions ct are assumed to be non-increasing. Non-increasing assumption on ct is based on the intuition that the loss should be higher as U(xt, yt) is farther from U(xt, y t ). Further, sub-derivatives of the ct s are assumed to be bounded. Formally, c t(θ) [ G, 0] for all t and for all θ R where c t(θ) denotes the sub-derivative of ct( ) at θ. The vector w which determines the utility of yt under context xt is assumed from a closed and

Shivaswamy & Joachims

Algorithm 5 Convex Preference Perceptron.

Initialize w1 0 for t = 1 to T do Set ηt 1

t Observe xt Present yt argmaxy Yw t φ(xt, y) Obtain feedback yt Update: wt+1 wt + ηt(φ(xt, yt) φ(xt, yt)) Project: wt+1 arg minu B u wt+1 2

bounded convex set B whose diameter is denoted as |B|. In the case of convex losses, we consider the following notion of regret:

t=1 ct(U(xt, yt) U(xt, y t )) 1

t=1 ct (0) (15)

In the bound (16), ct(0) is the minimum possible convex loss since U(xt, yt) U(xt, y t ) can never be greater than zero by deﬁnition of y t . Hence the above regret compares the loss of our algorithm with the best loss that could be achieved with a convex loss. Note that, for the case ct(θ) = θ, the above deﬁnition of regret reduces to our earlier deﬁnition of regret in the linear case (Eqn. (4)). Algorithm 5 minimizes the average convex loss. There are two diﬀerences between this algorithm and Algorithm 1. First, there is a rate ηt associated with the update at time t. Second, after every update, the resulting vector wt+1 is projected back to the set B. Algorithm 5 is also closely related to the online convex optimization algorithm propsed by Zinkevich (2003). However, the online convex optimization algorithm assumes that the gradient of the loss (ct( )) is observed after each iteration. Our algorithm doesn t observe the gradient directly, but only observes an improved object yt after presenting an object yt. Our earlier regret bounds were expressed in terms of slack variables ξt. However, here and in the following section, our bounds will be expressed in terms of the clipped version of the slack variables deﬁned as ξ+ t := max(0, ξt).

Theorem 6 For the Convex Preference Perceptron under α-informative feedback, for nonincreasing convex losses ct( ) with bounded sub-derivative, we have, for any α (0, 1] and any w B,

t=1 ξ+ t + G

Similarly, under expected α-informative feedback, we have,

E[CREGT ] 2G

t=1 ξ+ t + G

Coactive Learning

The proof for the above Theorem is provided in the Appendix B. The idea of the proof is to ﬁrst divide the time steps into two types depending on the nature of the feedback. This allows us to upper bound CREGT in terms of PT t=1(wt w ) (φ(xt, yt) φ(xt, yt)). This term can further be upper bounded by following the argument from Zinkevich (2003) even in the Coactive Learning framework. From the deﬁnition of CREGT in Eqn. (15), the above theorem upper bounds the average convex loss via the minimum achievable loss and the quality of the feedback. Like the previous result (Theorem 1), under strict α-informative feedback, the average loss approaches the best achievable loss at O(1/

T), albeit with larger constant factors. In the case of the linear utility bounds in Theorem 1 and Theorem 5, it was suﬃcient to have the average of the slack variables be zero to achieve zero regret. However, in the case of convex losses, our upper bound on regret approaches zero only when the average of the clipped slack variables is zero.

5.3 Second-Order Preference Perceptron

For a particular class of convex functions, it turns out that we can give much stronger regret bounds than for general convex losses. The improvement for this special class of losses parallels improvements in online convex optimization from general convex losses (Zinkevich, 2003) to λ-strongly convex losses (Hazan et al., 2007).

Deﬁnition 7 A convex function f : D R is λ-strongly convex if for all points x and y in D, the following condition is satisﬁed for a ﬁxed λ > 0:

f(x) f(y) + f(x) (x y) λ

2 ||y x||2, (18)

where f(x) denotes a sub-derivative at x.

Algorithm 6 shows the Second-order Preference Perceptron for λ-strongly convex losses. Like our previous algorithms, the Second-order Preference Perceptron also maintains a weight vector wt, and the step of presenting yt based on a context xt is still the same as in our previous algorithms. However, in addition to the weight vector, it also maintains an additional matrix At which is constructed from the outer product of the vector φ(xt, yt) φ(xt, yt). The update step and the projection steps now involve both At as shown in Algorithm 6. Algorithm 6 is closely related to the online convex optimization algorithm proposed by Hazan et al. (2007). However, as we pointed out in the case of Algorithm 5, our algorithm only observes a user preference feedback after each step unlike online convex optimization algorithms which observe gradients. It is still possible to prove a regret bound for the λ-strongly convex case, and we have the following result.

Theorem 8 For the second order preference learning algorithm, for (expected) λ-strongly convex, non-increasing functions ct, with bounded sub-derivatives, we have,

CREGT γ 2Tα2

t=1 ξ+ t 2 + 2G

t=1 ξ+ t + Gϵ|B|

2Tγα log 4R2Tγ

ϵ + 1 , (19)

E[CREGT ] γ 2Tα2

t=1 ξ+2 t + 2G

t=1 ξ+ t + Gϵ|B|

2Tγα log 4R2Tγ

ϵ + 1 , (20)

Shivaswamy & Joachims

Algorithm 6 Second-order Preference Perceptron.

Intialize w1 0 A0 ϵI γ λ/G. for t = 1 to T do Observe xt Present yt argmaxy Yw t φ(xt, y) Obtain feedback yt At At 1 + γ[φ(xt, yt) φ(xt, yt)][φ(xt, yt) φ(xt, yt)]

Update: wt+1 wt + A 1 t [φ(xt, yt) φ(xt, yt)] Project: wt+1 = arg minw B( wt+1 w) At( wt+1 w) end for

where, ϵ > 0 is an initialization parameter, as shown in Algorithm 6.

We prove the above Theorem in the Appendix C. Like in the proof of Theorem 6, we divide time steps into two types. Starting with this, it is possible to upper bound CREGT to such a form that the resulting terms can be upper bounded using similar arguments as that for online strongly convex optimization (Hazan et al., 2007). When user feedback is strictly α-informative for some α and some w B, the ﬁrst two terms of the regret bound (19) result in an O(log T

T ) scaling with T. However, there is a linear dependence on the dimensionality of the joint feature map in the regret bound for the Second-order Preference Perceptron algorithm. Even though it appears like we need to invert the matrix At in the Second-order Preference Perceptron, this can be avoided since the updates on At are of rank one. By the Woodbury matrix inversion lemma, we have:

A 1 t = (At 1 + γ[φ(xt, yt) φ(xt, yt)][φ(xt, yt) φ(xt, yt)] ) ) 1

= A 1 t 1 A 1 t 1[φ(xt, yt) φ(xt, yt)][φ(xt, yt) φ(xt, yt)] A 1 t 1 1/(γ) + [φ(xt, yt) φ(xt, yt)] A 1 t 1[φ(xt, yt) φ(xt, yt)].

Thus, in practice, the Second-order Preference Perceptron can update both At and Bt in each iteration. Nevertheless, the projection step to obtain wt+1 involves solving a quadratically-constrained quadratic program where B is a ball of ﬁxed radius, which still takes O(N3) time. Hence, the Second-order Preference Perceptron is computationally more demanding than the Convex Preference Perceptron. As we show in the experiments, the Second-order Preference Perceptron might be still quite useful for low-noise data.

6. Experiments

We empirically evaluate our Coactive Learning algorithms on two real-world datasets. The two datasets diﬀer in the nature of prediction and feedback. On the ﬁrst dataset, the algorithms operate on structured objects (rankings) whereas on the second dataset, atomic items (movies) were presented and received as feedback.

Coactive Learning

6.1 Datasets And User Feedback Models

First, we provide a detailed description of the two datasets that were used in our experiments. Along with this, we provide the details of the strategies that we used on each dataset for generating user feedback.

6.1.1 Structured Feedback: Web-Search

Our ﬁrst dataset is a publicly available dataset from Yahoo! (Chapelle & Chang, 2011) for learning to rank in web-search. This dataset consists of query-url feature vectors (denoted as xq i for query q and URL i), each with a relevance rating rq i that ranges from zero (irrelevant) to four (perfectly relevant). To pose ranking as a structured prediction problem, we deﬁned our joint feature map as follows:

w φ(q, y) =

w xq yi log(i + 1). (21)

In the above equation, y denotes a ranking such that yi is the index of the URL which is placed at position i in the ranking. Thus, the above measure considers the top ﬁve URLs for a query q and computes a score based on graded relevance. Note that the above utility function deﬁned via the feature-map is analogous to DCG@5 (see, Manning et al., 2008) after replacing the relevance label with a linear prediction based on the features. For query qt at time step t, the Coactive Learning algorithms present the ranking yq t that maximizes w t φ(qt, y). Note that this merely amounts to sorting documents by the scores w t xqt i , which can be done very eﬃciently. The utility regret in Eqn. (4), based on the deﬁnition of utility w φ(q, y) is given by 1

T PT t=1 w (φ(qt, yqt ) φ(qt, yqt)). Here yqt

denotes the optimal ranking with respect to w , which we consider to be the best least squares ﬁt to the relevance labels from the features using the entire dataset. We obtain yqt from Eqn. 3, that is, yqt = argmaxy Yw φ(qt, y). In all our experiments, query ordering was randomly permuted twenty times and we report average and standard error of the results. We used the following two user models for generating simulated user feedback in our experiments. The ﬁrst feedback model is an idealized version of feedback whereas the second feedback is based directly on relevance labels that are available in the dataset:

Strict α-informative feedback: In this model, the user is assumed to provide strictly α-informative feedback at a given α value (i.e., slacks zero). Given the predicted ranking yt, the user would go down the list until she found ﬁve URLs such that, when placed at the top of the list, the resulting yt satisﬁed the strictly α-informative feedback condition w.r.t. the optimal w . This model assumes that the user has access to w hence it is an idealized feedback.

Noisy feedback at depth k: In this feedback model, given a ranking for a query, the user would go down the list inspecting the top k URLs (or all the URLs if the list is shorter) for a speciﬁed k value. Five URLs with the highest relevance labels (rq i ) are placed at the top ﬁve locations in the user feedback. Note that this produces noisy feedback since no linear model can perfectly ﬁt the relevance labels on this dataset.

Shivaswamy & Joachims

6.1.2 Item Feedback: Movie Recommendation

In contrast to the structured prediction problem in the previous dataset, we considered a second dataset with atomic predictions, namely movie recommendation. In each iteration, a movie is presented to the user, and the feedback consists of a single movie as well. We used the Movie Lens dataset from grouplense.org which consists of a million ratings over 3900 movies as rated by 6040 users. The movie ratings range from one to ﬁve. We randomly divided users into two equally sized sets. The ﬁrst set was used to obtain a feature vector xj for each movie j using the SVD embedding method for collaborative ﬁltering (see Bell & Koren, 2007, Eqn. (15)). The dimensionality of the feature vectors and the regularization parameters were chosen to optimize cross-validation accuracy on the ﬁrst dataset in terms of squared error. For the second set of users, we then considered the problem of recommending movies based on the movie features xj. This experiment setup simulates the task of recommending movies to a new user based on movie features from old users. For each user i in the second set, we found the best least squares approximation w T i xj to the user s utility functions on the available ratings. This enabled us to impute utility values for movies that were not explicitly rated by this user. Furthermore, it allowed us to measure regret for each user as 1

T PT t=1 w i (xt xt), which is the average diﬀerence in utility between the recommended movie xt and the best available movie xt . We denote the best available movie at time t by xt which is obtained from Eqn. 3. In this experiment, once a user gave a particular movie as feedback, both the recommended movie and the feedback movie were removed from the set of candidates for subsequent recommendations. In all the experiments we report (average) regret values averaged over all 3020 users in the test set. To simulate user behavior, we considered the following two feedback models on this dataset:

Strict α-informative feedback: As in the previous dataset, in this model, the user is assumed to provide strictly α-informative feedback at a given α value (i.e., slacks zero). Given the predicted movie yt, the user is assumed to watch the movie if it already has the highest rating in the remaining corpus of movies. If not, the user picks another movie from the corpus with lowest utility that still satisﬁes strict αinformative assumption. This model again assumes that the user has access to w , hence it is an idealized feedback.

Noisy feedback: In this feedback model, given a movie y, the user is assumed to have access to either the actual rating of the movies (when available) or is assumed to round the imputed rating to the nearest legal rating value. We used two sub-strategies by which the user provides feedback. In better feedback, the user provides y such that it has the smallest rating (actual rating or rounded rating) but strictly better rating than that of y. In best feedback, the user provides y such that it has the highest rating (actual rating or rounded rating) in the remaining corpus. There could be multiple movies satisfying the above criteria, and ties were broken uniformly at random among such movies. Note that this feedback model results in a noisy feedback due to rounding of movie ratings to discrete values.

Coactive Learning

6.2 Preference Perceptron

In the ﬁrst set of experiments, we analyze the empirical performance and scaling behavior of the basic Preference Perceptron Algorithm and its variants.

6.2.1 Strong Versus Weak Feedback

The goal of the ﬁrst experiment to explore how the regret of the algorithm changed with feedback quality. To get feedback at diﬀerent quality levels α, we used strict α-informative feedback for various α values.

10 0 10 1 10 2 10 3 10 4 0

avg. dcg regret

α = 0.1 α = 0.5 α = 1.0

10 0 10 1 10 2 10 3 0

avg. util regret

α = 0.1 α = 0.5 α = 1.0

Figure 2: Regret based on strict α-informative feedback for various α values for websearch (left) and movie recommendation (right).

Figure 2 shows the results of this experiment for three diﬀerent α values. Overall, regret is typically substantially reduced after only tens or hundreds of iterations. As expected, the regret for α = 1.0 is lower compared to the regret for lower α values. Note, however, that the diﬀerence between the two curves is much smaller than a factor of ten. Also note that the diﬀerences are less prominent in the case of web-search. This is because strictly α-informative feedback is also strictly β-informative feedback for any β α. So, in our user feedback model, we could be providing much stronger feedback than that required by a particular α value. As expected from the theoretical bounds, since the user feedback is based on a linear model with no noise, utility regret approaches zero in all the cases. Note that we show standard error in the plots, giving an indication of statistical signiﬁcance. In the left plots in Figure 2, the standard errors are high at lower iterations but become lower with more iterations. In some plots in the rest of the paper, the error bars are small and may be diﬃcult to visually identify. In the rest of this paper, for strict α-informative feedback, we consider α = 0.5 unless we explicitly mention otherwise.

6.2.2 Noisy Feedback

In the previous experiment, user feedback was based on actual utility values computed from the optimal w . We next study how regret changes with noisy feedback where user behavior

Shivaswamy & Joachims

does not follow a linear utility model. For the web-search dataset, we use noisy feedback at depths 10 and 25, and for the movie dataset we use noisy feedback with both the better and the best variant of it.

10 0 10 1 10 2 10 3 10 4 0

avg. util regret

depth=10 depth=25

10 0 10 1 10 2 10 3 0

avg. util regret

better best

Figure 3: Regret based on noisy feedback for web-search (left) and movie recommendation (right).

The results for this experiment are shown in Figure 3. The ﬁrst observation to make is that in the case of web-search, the regret values now do not converge to zero. Similarly, in the case of movie recommendation the regret values are higher than those in the previous experiment. These results are in line with our theory which shows regret converging to average slack variables when the user does not provide strict α informative feedback for any α. Interestingly, in the case of web-search the average regret is slightly higher when the user goes to greater depth in providing feedback. This is due to the fact that the relevance labels in this dataset are noisy when the user maximizes (noisy) utility over a larger set of URLs, the selection of the (true) utility maximizers becomes less reliable, which degrades user feedback quality. In the rest of this paper, for web-search we consider noisy feedback with depth=10. In the case of movie recommendation, we consider the better version of the noisy feedback unless we explicitly mention otherwise.

6.2.3 Batch Updates

In this section, we consider the Batch Preference Perceptron algorithm (Algorithm 2). Its regret bound from Section 4.2 scales by a factor

k under strict α-informative feedback, if the update is made only every k iterations of the algorithm. We now verify whether empirical performance scales as suggested by the bound. For both web-search and movies, we considered both strict α-informative feedback and noisy feedback. For both types of feedback, we use the Batch Perceptron Algorithm with various values of k and report the resulting average regret. The results of these experiments are shown in Figure 4 and Figure 5. As expected, as the value of k becomes smaller, regret converges faster. However, we observe that the

Coactive Learning

10 0 10 1 10 2 10 3 10 4 0

avg. util regret

k = 1 k = 10 k=20 k=50

10 0 10 1 10 2 10 3 0

avg. util regret

k=1 k=10 k=20 k=50

Figure 4: Regret versus time based on batch updates with strict α-informative feedback for web-search (left) and movie recommendation (right).

10 0 10 1 10 2 10 3 10 4 0.4

avg. util regret

k = 1 k = 10 k=20 k=50

10 0 10 1 10 2 10 3 0

avg. util regret

k=1 k=10 k=20 k=50

Figure 5: Regret versus time based on batch updates with noisy feedback for web-search (left) and movie recommendation (right).

empirical scaling with k is substantially better than the

k factor suggested by Lemma 4. These results show the feasibility of using Coactive Learning algorithms in systems where it might be impractical to do an update after every iteration.

6.2.4 Expected User Feedback

The user feedback was deterministic in our experiments so far. In this sub-section, we consider probabilistic feedback and study the behavior of the Preference Perceptron algorithm. Recall that we provided an upper bound on the expected regret for expected user feedback in Corollary 2. To provide α-informative feedback under expectation, we consider the following strategy. Given an object yt on context xt, the user would ﬁrst generate deterministic feedback yt

Shivaswamy & Joachims

following a strict α-informative feedback model (α = 0.5 for web-search and α = 1.0 for movie recommendation).5 In addition, we consider ﬁve randomly generated objects as feedback. We then put uniform probability mass over the randomly generated objects and remaining mass over the deterministic feedback such that the user feedback is still α-informative at α = 0.5 in expectation.

10 0 10 1 10 2 10 3 10 4 0

avg. util regret

Expct. feedback Det. feedback

10 0 10 1 10 2 10 3 0

avg. util regret

Expct. Feedback Det. Feedback

Figure 6: Expected feedback versus deterministic feedback on web-search (left) and movie recommendation (right).

The results for this experiment are shown in Figure 6. As a reference, we also plot the regret curve with deterministic α-informative feedback with α = 0.5. It can be seen that there is not much diﬀerence between deterministic and expected feedback at higher numbers of iterations. It can also be seen that the regret converges to zero even with α-informative feedback in expectation as suggested by Corollary 2.

6.2.5 Comparison with Ranking SVM

We now compare our algorithms against several baselines, starting with a conventional Ranking SVM (Joachims, 2002) that is repeatedly trained. At each iteration, the previous SVM model is used to present a ranking to the user (yqt svm). The user returns a ranking ( yqt svm) based on strict α-informative feedback in one experiment and based on noisy feedback in the other. The pairs of examples (qt, yqt svm) and (qt, yqt svm) are used as training pairs for the ranking SVM. Note that training a ranking SVM after each iteration would be prohibitive expensive, since it involves solving a quadratic program and cross-validating the regularization parameter C. Thus, we retrained the SVM whenever 10% more examples were added to the training set. The ﬁrst training was after the ﬁrst iteration with just one pair of examples (starting with a random yq1), and the C value was ﬁxed at 100 until there were 50 pairs of examples, when reliable cross-validation became possible. After there were more than 50 pairs in the training set, the C value was obtained via ﬁve-fold cross-

5. Note that, in the case of web-search, our user model can provide strictly β-informative where β larger than 0.5.

Coactive Learning

validation. Once the C value was determined, the SVM was trained on all the training examples available at that time. The same SVM model was then used to present rankings until the next retraining.

10 0 10 1 10 2 10 3 10 4 0

avg. util regret

SVM Pref. Perceptron

10 0 10 1 10 2 10 3 0

avg. util regret

SVM Pref. Perceptron

Figure 7: Preference Perceptron versus Ranking SVM with strict α-informative feedback on web-search (left) and movie recommendation (right).

10 0 10 1 10 2 10 3 10 4 0.4

avg. util regret

SVM Pref. Perceptron

10 0 10 1 10 2 10 3 0

avg. util regret

SVM Pref. Perceptron

Figure 8: Preference Perceptron versus Ranking SVM with noisy feedback on web-search (left) and movie recommendation (right).

The results of this experiment are shown in Figure 7 and Figure 8. In the case of strict α-informative feedback, the Preference Perceptron performed much better than the SVM for the movie recommendation, and comparably for web search. In the case of noisy feedback, the Preference Perceptron performs signiﬁcantly better than the SVM over most of the range on both the datasets. While it took around 20 minutes to run the Preference Perceptron experiment, it took about 20 hours to run the SVM experiment on

Shivaswamy & Joachims

web-dataset for each permutation of the dataset. Similary, on the movie recommendation task it took around 125 seconds to run the preference perceptron for each user while it took around 260 seconds to run the SVM for each user. These results show that the preference perceptron can perform on par or better than SVMs on these tasks at a fraction of the computational cost.

6.2.6 Comparison with Dueling Bandit

As a second baseline, we compare the Preference Perceptron algorithm with the dueling bandit approach of Yue and Joachims (2009). In each step, the dueling bandit algorithm makes a comparison between a vector w and a perturbed version of it w1 (in a random direction u such that w1 = w + γu). The results produced by these two weight vectors are assessed by the user through techniques such as interleaving (Radlinski, Kurup, & Joachims, 2008), providing a preference between w and w1. The preference feedback determines the update that the dueling bandits algorithm makes to w. If w is preferred, it is retained for the next round. If w1 is preferred, a small step of length δ is taken in the direction of perturbation u.

10 0 10 1 10 2 10 3 10 4 0

avg. util regret

Dueling Bandit Pref. Perceptron

10 0 10 1 10 2 10 3 10 4 0

avg. util regret

Dueling Bandit Pref. Perceptron

Figure 9: Preference Perceptron versus Dueling Bandit on web-search. The left plot is based on strict α-informative feedback, the right plot shows noisy feedback.

In our ﬁrst experiment on web-search, in each step, we ﬁrst obtained two ranked lists based on w and w1. The features used to obtain these ranked lists were identical to those used for Preference Perceptron. The two rankings were then interleaved. The interleaved ranking was presented to a user. In the ﬁrst experiment, the user provided strict αinformative feedback on the interleaved ranking. In the second experiment, the user provided noisy feedback. Depending on the feedback, we inferred which of the two rankings was preferred using the Team-Game method proposed by Radlinski et al. (2008). When w was preferred or when there was a tie, no step was taken. When w1 was preferred, a step of length δ was taken in the direction u. The regret of the dueling bandit algorithm was measured by considering the utility of the interleaved ranking. Unlike the Preference Perceptron algorithm, the dueling bandit algorithm has two parameters (γ and δ) that need

Coactive Learning

to be tuned. We considered 25 values for these parameters (5x5 grid) and simply chose the best parameter values of the dueling bandits algorithm in hindsight.

The results for this experiment are shown in Figure 9. Despite the advantage of setting the parameters to best possible values, it can be seen that dueling bandit algorithm performs signiﬁcantly worse compared to the preference perceptron algorithm by orders of magnitude. For example, the performance of the dueling bandit at around 28,000 iterations is matched by preference perceptron at less than 100 iterations with both types of feedback. This is not surprising, since the dueling bandit algorithm basically relies on random vectors to determine the direction in which a step needs to be taken. In the Coactive Learning model, the user feedback provides a (better than random) direction to guide the algorithm.

10 0 10 1 10 2 10 3 0

avg. util regret

Dueling Bandit Pref. Perceptron

10 0 10 1 10 2 10 3 0

avg. util regret

Dueling Bandit Pref. Perceptron

Figure 10: Preference Perceptron versus Dueling Bandit on movie recommendation. The left plot is based on utility values whereas the right plot shows results with rounded values.

Similarly, we also conducted a comparison with the dueling bandit algorithm on the movie recommendation dataset. However, unlike the web-search experiment, the dueling bandit model is somewhat unnatural on this dataset in our experimental setup, since interleaving two rankings is natural whereas interleaving two items is not. We therefore consider a diﬀerent setup. Two movies were obtained based on w and w1 for the dueling bandit algorithm. User feedback was to merely indicate which of these two movies has a higher rating. In the noisy case, user feedback was based on the actual rating or the rounded rating. In the noise-free case, user feedback was based on the utility values. In either case, the utility of dueling bandit was considered to be the average utility of the two movies selected for comparison.

The performance of the dueling bandit algorithm in this experiment is shown in Figure 10. For the Preference Perceptron algorithm, regret curves for strict α-informative feedback (α = 0.5) and better noisy feedback are also shows as reference. It can be seen that the dueling bandit algorithm again performs substantially worse compared to the Preference Perceptron algorithm.

Shivaswamy & Joachims

6.3 Exponentiated versus Additive Updates

In this experiment, we compare the exponentiated algorithm (Algorithm 4) with the additive Preference Perceptron algorithm. For the exponentiated algorithm, all the components of we must be non-negative.6 We obtained a non-negative we as follows:

[we ]i = min(0, [w ]i) 1 i m, max(0, [w ]i m) m + 1 i 2m. (22)

In the above equation, [we ]i denotes the ith component of we . Moreover, we also modiﬁed the joint feature map for the exponentiated algorithm as follows:

[φe(x, y)]i = +[φ(x, y)]i 1 i m [φ(x, y)]i m m + 1 i 2m (23)

With the above modiﬁcations, we will have only non-negative components and moreover, it is easy to verify that we φe(x, y) = w φ(x, y). This makes the regret of the exponentiated algorithm directly comparable with the regret of the additive algorithm. The exponentiated algorithm has a ﬁxed rate parameter η that inversely depends on the time horizon T. When T is large, η is small. In this situation, consider the update in Algorithm 4: wi t+1 wi t exp(η(φi(xt, yt) φi(xt, yt)))/Zt.

Since, η is small, we can approximate the exponential term in the above equation with a ﬁrst order approximation:

exp(η(φi(xt, yt) φi(xt, yt))) 1 + η(φi(xt, yt) φi(xt, yt)).

Thus the exponentiated updates resemble the updates of the additive algorithm up to a normalization factor. Despite the normalization factor, we empirically observed the behavior of the two algorithms to be nearly identical (though not exact). We thus empirically evaluated the exponentiated algorithm with a variable rate parameter ηt = 1 2S

t at time t. Note that this is an empirical result without formal theoretical guarantees for this variable rate. Results of this experiment are shown in Figure 11 and Figure 12 for strict α-informative feedback and noisy feedback respectively. It can be seen that the exponentiated algorithm tends to performs slightly better than the additive algorithm for small number of iterations. As the time horizon becomes large, the two algorithms seem to have comparable performance in most cases.

6.4 Minimizing Convex Losses

In this section, we empirically evaluate the Convex Preference Perceptron (Algorithm 5) and the Second-Order Preference Perceptron (Algorithm 6).

6. We put a superscript e to distinguish the joint feature map and w that we used in our experiments for the exponentiated algorithm.

Coactive Learning

10 0 10 1 10 2 10 3 10 4 0

avg. util regret

Exponentiated Pref. Perceptron

10 0 10 1 10 2 10 3 0

avg. util regret

Exponentiated Pref. Perceptron

Figure 11: Exponentiated versus Additive with strict α-informative feedback on websearch (left) and movie recommendation (right).

10 0 10 1 10 2 10 3 10 4 0

avg. util regret

Exponentiated Pref. Perceptron

10 0 10 1 10 2 10 3 0

avg. util regret

Exponentiated Pref. Perceptron

Figure 12: Exponentiated versus Additive with noisy feedback on web-search (left) and movie recommendation (right).

6.4.1 Convex Perceptron Versus Second-Order Algorithm

The regret bounds from Section 5 show that one can get lower regret for λ-strongly convex functions using a second-order algorithm, while the ﬁrst-order Convex Perceptron applies to general convex functions. In this section, we evaluate the relative performance of the ﬁrst-order and the second-order algorithms empirically. For this purpose, we considered the quadratic loss c(θ) = (θ M)2 where M is the largest utility value on any possible φ(x, y) with any w in a convex ball of radius w . It can be veriﬁed this loss function is λ-strongly convex. B was set to 100 for both the algorithms for both the datasets.

Shivaswamy & Joachims

10 0 10 1 10 2 10 3 10 4 0

Quad regret

Second Order Convex Perceptron

10 0 10 1 10 2 10 3 10 4 0

Util regret

Second Order Convex Perceptron

Figure 13: Cumulative regret of the convex perceptron and the second order convex perceptron for web-search.

10 0 10 1 10 2 10 3 0

Quad. regret

Second Order Convex Perceptron

10 0 10 1 10 2 10 3 0

Util regret

Second Order Convex Perceptron

Figure 14: Cumulative regret of the convex perceptron and the second order convex perceptron for movie recommendation.

In the ﬁrst set of experiments, we considered strict α-informative feedback. We ran both the second-order algorithm as well as the Convex Preference Perceptron algorithm 5. The γ value in the second order perceptron was simply set to one. We recorded the REGT and CREGT values for both the methods. Note that REGT corresponds to the utility regret as deﬁned in 4. Results of this experiment are shown in Figure 13 and Figure 14. To demonstrate the qualitative diﬀerence between the two algorithms, we plot cumulative regret (i.e. T REGT and T CREGT ) in these ﬁgures. The cumulative regret of the second-order algorithm is linear on a log-scale. This shows that the convergence of the regret is indeed

Coactive Learning

logarithmic, compared to the much slower convergence of the Convex Preference Perceptron. Interestingly, even the cumulative regret based on raw utility values empirically shows a similar behavior. This is purely an empirical result, since theoretically, O(log(T)/T) average regret holds only for strongly convex losses and the linear loss is not strongly convex.

10 6 10 4 10 2 10 0 10 2 10 4 10 3

Quad regret

weak@5000 strong@5000 weak@15000 strong@15000 weak@25000 strong@25000

10 6 10 4 10 2 10 0 10 2 10 4 10 2

Util. regret

weak@5000 strong@5000 weak@15000 strong@15000 weak@25000 strong@25000

Figure 15: Sensitivity of the second order preference perceptron algorithm to the parameter value γ.

10 4 10 2 10 0 10 2 10

Quad. regret

weak@5000 strong@5000 weak@15000 strong@15000 weak@25000 strong@25000

10 4 10 2 10 0 10 2 10 4 10 2.1

Util. regret

weak@5000 strong@5000 weak@15000 strong@15000 weak@25000 strong@25000

Figure 16: Sensitivity of the second order preference perceptron algorithm to the parameter value γ on movie recommendation.

In the previous experiment, we ﬁxed the γ value in the second-order algorithm to one. We now study the sensitivity of the second-order algorithm to the value of this parameter. Figures 15 and 16 show regret values after a given number of iterations when γ is swept over a range of values. The dotted lines show the performance of the Convex Preference Perceptron for comparison. In the case of web-search, there is a wide range of parameter

Shivaswamy & Joachims

values where the performance of the algorithm is good. As the parameter γ takes an extreme value on either side, the performance of the algorithm deteriorates. The range of suitable γ values is much broader for the web-search dataset than for movie recommendation. It is interesting to note that both the algorithms performed empirically best at γ = 1 among the values that were tried.

10 0 10 1 10 2 10 3 10 4 0

Quad regret

Second Order Convex Perceptron

10 0 10 1 10 2 10 3 10 4 0

Util regret

Second Order Convex Perceptron

Figure 17: Strong convex versus weak convex with noisy feedback on web-seach.

10 0 10 1 10 2 10 3 0

Quad. regret

Second Order Convex Perceptron

10 0 10 1 10 2 10 3 0

Util regret

Second Order Convex Perceptron

Figure 18: Strong convex versus weak convex with noisy feedback on movie recommendation.

We also tested the convex algorithms under noisy feedback. Both regret bounds contain the slack terms on the right hand side. Thus, when user feedback is not α-informative for any α, the regret bounds for the second-order algorithm and the ﬁrst-order algorithm are both dominated by the slack variables. The empirical performance of the two algorithms under noisy feedback are shown in Figures 17 and 18. In the case of web-search, the results for the second-order algorithm and the ﬁrst-order algorithm are nearly identi-

Coactive Learning

cal. However, in the case of movie recommendation, there is still some advantage to the second-order algorithm.

In summary, the second-order algorithm performs substantially superior under no-noise circumstances. In the presence of noise in the feedback, the two algorithms do not show drastically diﬀerent behaviors.

7. Conclusions

We proposed Coactive Learning as a new model of online learning with preferences that is especially suited for implicit user feedback. Unlike most supervised learning approaches, Coactive Learning algorithms do not require optimal labels, but merely (noisy) feedback that improves over the prediction. Our model, where no cardinal utilities are observed, sits between the experts and the bandits settings, and we argue that Coactive Learning is applicable to a wide range of systems that aim to optimize themselves based on observable user actions.

We provide several algorithms that provably optimize regret in the Coactive Learning framework, and we empirically validate the eﬀectiveness of the proposed framework on web-search ranking and movie recommendation datasets with simulations of both noisy and noise-free feedback. A recurring theme in this paper is that a wide variety of conventional online learning algorithms can be converted into Coactive Learning algorithms, despite the diﬀerences in the learning model itself, in the nature of feedback and in the notion of regret. We conjecture that many other online learning algorithms could similarly be converted to practically useful Coactive Learning algorithms.

The Coactive Learning model, the algorithms we proposed, and the ability to use weak feedback from observable user behavior oﬀer a wide range of opportunities for new learning approaches to application problems ranging from natural language processing and information retrieval to robotics. There are also several opportunities for further developing algorithms for the Coactive Learning model. For example, our algorithm for convex loss minimization assume only that the gradient of the convex losses are bounded. However, in most practical situations, the convex loss to be minimized is known apriori. It is an interesting research direction to study whether there are algorithms that can utilize the gradient of the loss to perform better either theoretically or empirically. Another question is whether better algorithms exist for special cases of the linear utility model. Our lower bound is based on an argument where the dimensionally of the joint feature maps grow with the given horizon T. When the dimensionality of the joint feature map is ﬁxed, an interesting research question is: are there algorithms with better regret than our proposed algorithms?

Acknowledgments

This work was funded in part under NSF awards IIS-0905467, IIS-1247637, and IIS-1142251. This was work was done when Pannaga Shivaswamy was a postdoctoral associate at Cornell University. We thank Peter Frazier, Bobby Kleinberg, Karthik Raman, Tobias Schnabel and

Shivaswamy & Joachims

Yisong Yue for helpful discussions. We also thank anonymous reviewers for their thoughtful comments on an earlier version of this paper.

Appendix A. Proof of Theorem 5

Proof We look at how the KL divergence between w and wt evolves,

KL(w||wt) KL(w||wt+1) =

i=1 wi log(wi t+1/wi t)

i=1 wi(η(φi(xt, yt) φi(xt, yt))) log(Zt)

= ηw (φ(xt, yt) φ(xt, yt)) log(Zt). (24)

On the second line, we pulled out log(Zt) from the sum since PN i=1 wi = 1. Now, consider the last term in the above equation. Denoting φi(xt, yt) φi(xt, yt) by iφt for brevity, we have, by deﬁnition,

log(Zt) = log

i=1 wi t exp(η iφt)

i=1 wi t(1 + η iφt + η2 iφt 2)

log 1 + ηw t φt + η2S2

ηw t φt + η2S2. (25)

On the second line we used the fact that exp(x) 1 + x + x2 for x 1. The rate η ensures that η( iφ) 1. On the last line, we used the fact that log(1 + x) x. Combing (24) and (25), we get,

(w wt) φt KL(w||wt) KL(w||wt+1)

Adding the above inequalities, we get:

t=1 (w wt) (φ(xt, yt) φ(xt, yt))

KL(w||wt) KL(w||wt+1)

Rearranging the above inequality, and substituting the value of η from Algorithm 4, we get:

t=1 (U(xt, yt) U(xt, yt))

t=1 w t (φ(xt, yt) φ(xt, yt)) + 2 log(N)S

Coactive Learning

In the above, we also used the fact that KL(w||w1) log(N) since w1 is initialized uniformly. Moreover, from H older s inequality, we obtained

w t φ(xt, yt) wt ℓ1 φ(xt, yt) ℓ S.

The above inequality along with α-informative feedback gives the claimed result.

Appendix B. Proof of Theorem 6

Proof First, we divide the set of time steps into two diﬀerent sets based on the nature of the feedback:

I := {t : U(xt, yt) U(xt, yt)) 0; 1 t T},

J := {t : U(xt, yt) U(xt, yt)) < 0; 1 t T}.

For brevity we denote φ(xt, a) φ(xt, b) by7 (a, b) in the rest of this proof. We start by considering the following term for a single time step t:

ct(U(xt, yt) U(xt, y t )) ct(0)

ct(U(xt, yt) U(xt, y t )) ct

w t (yt, yt)

w t (yt, yt)

(w wt) (yt, yt)

(w t (yt, yt) + ξ+ t w (yt, yt))G/α t I

(w t (yt, yt) + ξ+ t )G/α t J.

In the above inequalities, the second line follows from the fact that w t (yt, yt)

α 0 and ct( ) is non-increasing. The third line follows from α-informative feedback (Eqn. (6)). The fourth line follows since the function ct is convex.8 We obtain the ﬁrst term in the next inequality (in either case) since c t( ) [ G, 0] and w t (yt, yt) 0 from the choice of yt in the algorithm. The second terms (in either case) is obtained by the fact that ξtc t(w (yt, yt)) is upper bounded by ξ+ t G. This is the step in which the clipped version of the slack variables are needed in the proof. Finally, w (yt, yt) is either positive or negative depending on the feedback which leads to two diﬀerent cases depending on whether t I or t J.

7. Since the context xt will always be clear, we suppress this in our notation for brevity. 8. For any convex function f, f(y) f(x) (y x)f (y) where f (y) denotes a sub-derivative of f at y.

Shivaswamy & Joachims

Summing the above inequalities from 1 to T , we get:

t=1 ct(w (yt, y t ))

t=1 w t (yt, yt) + G

t I w (yt, yt)

t=1 (wt w ) (yt, yt) + G

t=1 ξ+ t + G

t J w (yt, yt). (26)

We obtained the last line above simply by adding and and subtracting G P t J w (yt, yt)/α on the right side of the previous inequality. From this point, we mostly follow the proof techniques from online convex optimization (Zinkevich, 2003). We now bound the ﬁrst term on the right hand side of (26). For this purpose, consider the following:

wt+1 w 2 = wt + ηt ( yt, yt) w 2

= wt w 2 + η2 t ( yt, yt) 2 + 2ηt(wt w ) ( yt, yt). (27)

Rearranging terms in the above equation, we get:

(wt w ) (yt, yt) = 1

2ηt wt w 2 1

2ηt wt+1 w 2 + ηt

2 ( yt, yt) 2

2ηt wt w 2 1

2ηt wt+1 w 2 + 2ηt R2

where, on the last line, we used the fact that wt+1 w 2 wt+1 w 2 since the wt+1 is just the projection of wt+1 to the convex set B (which contains the vector w ). We can bound the ﬁrst term in (26) using the following telescoping argument.

2ηt wt w 2 1

2ηt wt+1 w 2 + 2ηt R2

2η1 w1 w 2 +

2ηt 1 2ηt 1

wt w 2 + 2R2 T X

2ηt 1 2ηt 1

|B| + 2R2(2

2 |B| + 4R2

In the above, we obtained the second line by simply rearranging the terms in the expression above. On the third line, we used the boundedness property of the set B, as well as the fact PT t=1 ηt 2

T 1. The ﬁnal line follows from cancelling out terms and the fact that ηT = 1/

Coactive Learning

Now, consider the third term on the right hand side of (26):

α w (yt, y t ) + ξ+ t α ξ+ t α .

The ﬁrst inequality above follows from α-informative feedback. Whereas the second inequality follows from the fact w (yt, y t ) 0 from the deﬁnition of y t . Finally, the bound (16) follows from the trivial fact 0 G

α P t I ξ+ t .

To obtain the bound on the expected regret, consider the convex loss at step t conditioned on user behavior so far:

ct(w (yt, y t )) Etct

w t (yt, yt)

Et[w (yt, yt) ξt]

w t (yt, yt)

w (yt, yt) ξt

w t (yt, yt)

GEt[w t (yt, yt) + ξ+ t w (yt, yt)]/α t I

GEt[w t (yt, yt) + ξ+ t ]/α t J

where the second line follows from the deﬁnition of expected α-informative feedback and the third line follows from Jensen s inequality. We obtain the last line following an argument similar to that in the proof of Theorem 6. The bound follows from an expected version of (27).

Appendix C. Proof of Theorem 8

Proof First, we divide time steps into two diﬀerent sets based on the nature of feedback:

I := {t : U(xt, yt) U(xt, yt)) 0; 1 t T},

J := {t : U(xt, yt) U(xt, yt)) < 0; 1 t T}.

Shivaswamy & Joachims

We start by considering a single time step t, we have:

ct(w (yt, y t )) ct(0)

ct(w (yt, y t )) ct

w t (yt, yt)

w t (yt, yt)

(w wt) (yt, yt)

(w wt) (yt, yt)

G w t (yt, yt)

α + ξ+ t α w (yt, yt)

2 (w wt) (yt, yt)

α ξ+ t α 2 t I

G w t (yt, yt)

α + ξ+ t α γ

2 (w wt) (yt, yt)

α ξ+ t α 2 t J.

In the above inequalities, the second line follows from the fact that w t (yt, yt)

α 0 and ct( ) is non-increasing. The third line follows from the fact that the function ct( ) is non-increasing and the following inequality which follows from the deﬁnition of ξ+ t :

U(xt, yt) U(xt, yt) + α(U(xt, y t ) U(xt, yt)) ξ+ t .

The fourth line follows by strong convexity. The last line follows from a same line of reasoning as in the proof of Theorem 6. Now consider the last term in both the cases:

(w wt) (yt, yt)

(w wt) (yt, yt)

2α2 + γξ+ t (w wt) (yt, yt)

(w wt) (yt, yt)

2 + γξ+ t w (yt, yt)

α2 γξ+ t w t (yt, yt)

(w wt) (yt, yt)

2 + γξ+ t α

w (yt, y t ) + ξ+ t α

(w wt) (yt, yt)

2 + γ ξ+ t 2

In the above equations, the second and the third lines follow from simple algebraic expansion of the expression on the ﬁrst line. The fourth line follows from the deﬁnition of α-informative feedback and the fact that w t ( yt, yt) 0. The last line follows from the fact that w (yt, y t ) 0 from the deﬁnition of y t .

Coactive Learning

Now, summing the terms in (28) and then substituting the above bound, we get,

t=1 ct(w (yt, y t ))

w t (yt, yt)

α + ξ+ t α γ

(w wt) (yt, yt)

α + γ PT t=1 ξ+ t 2

(wt w ) (yt, yt)

(w wt) (yt, yt)

t J w (yt, yt) + γ PT t=1 ξ+ t 2

2α2 + G PT t=1 ξ+ t α

(wt w ) (yt, yt) γ

(wt w ) (yt, yt) 2 + γ PT t=1 ξ+ t 2

2α2 + 2G PT t=1 ξ+ t α .

In the above, we obtained the third inequality by adding and subtracting G P t J w (yt, yt)

α . To obtain the last line, we used the fact that 1/α2 1/α since α (0, 1]). Finally, we used an argument similar to that in the proof of theorem 6 to bound G

α P t J w (yt, yt) and obtained a factor of two with the sum of slacks term. From this point, we use arguments similar to those from online convex optimization with strongly convex losses (Hazan et al., 2007).

Next, we consider ( wt+1 w ) At( wt+1 w ) and express it interms of wt and At 1:

( wt+1 w ) At( wt+1 w )

=(wt A 1 t (yt, yt) w ) At(wt A 1 t (yt, yt) w )

=(wt w ) At(wt w ) + (yt, yt) A 1 t (yt, yt) 2(wt w ) (yt, yt)

=γ(wt w ) (yt, yt) (yt, yt) (wt w ) + (wt w ) At 1(wt w )

+ (yt, yt) A 1 t (yt, yt) + 2(w wt) (yt, yt)

Rearranging terms in the above equation, we get:

2(wt w ) (yt, yt) γ

2(wt w ) (yt, yt) (yt, yt) (wt w )

(wt w ) At 1(wt w ) (wt+1 w ) At(wt+1 w ) + (yt, yt) A 1 t (yt, yt).

Shivaswamy & Joachims

We now identify that the term on the left hand side in the inequality occurs in the expression that we would like to bound in (29). We therefore have,

(wt w ) (yt, yt) γ((wt w ) (yt, yt))2

(wt w ) At 1(wt w ) (wt+1 w ) At(wt+1 w ) + (yt, yt) A 1 t ( byt, yt)

(w1 w ) A0(w1 w ) +

t=1 (yt, yt) A 1 t (yt, yt)

γ log 4R2Tγ

In the above, we have used the fact that PT t=1 (yt, yt) A 1 t (yt, yt) N log(4R2T/ϵ+1), where N is the dimens ionality of φ(x, y) and R is an upper bound on the norm of the joint feature maps (i.e. φ(x, y) ℓ2 R. A proof of this fact can be found in Hazan et al. (2007).

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 235 256.

Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. (2002b). The non-stochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 48 77.

Bakir, G. H., Hofmann, T., Sch olkopf, B., Smola, A., Taskar, B., & Vishwanathan, S. (Eds.). (2007). Predicting Structured Data. The MIT Press.

Bell, R. M., & Koren, Y. (2007). Scalable collaborative ﬁltering with jointly derived neighborhood interpolation weights. In ICDM.

Boley, M., Mampaey, M., Kang, B., Tokmakov, P., & Wrobel, S. (2013). One click mining: Interactive local pattern discovery through implicit preference and performance learning. In Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics, pp. 27 35.

Cesa-Bianchi, N., & Lugosi, G. (2006a). Prediction, learning, and games. Cambridge University Press.

Cesa-Bianchi, N., & Lugosi, G. (2006b). Prediction, Learning, and Games. Cambridge University Press, Cambridge, UK.

Chapelle, O., & Chang, Y. (2011). Yahoo! learning to rank challenge overview. JMLR - Proceedings Track, 14, 1 24.

Chu, W., & Ghahramani, Z. (2005). Preference learning with gaussian processes. In ICML.

Crammer, K., & Singer, Y. (2001). Pranking with ranking. In NIPS.

Coactive Learning

Crammer, K., & Gentile, C. (2011). Multiclass classiﬁcation with bandit feedback using adaptive regularization. In Proceedings of the 28th International Conference on Machine Learning (ICML).

Dekel, O., Gilad-Bachrach, R., Shamir, O., & Xiao, L. (2012). Optimal distributed online prediction using mini-batches. JMLR, 13, 165202.

Flaxman, A., Kalai, A. T., & Mc Mahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA.

Freund, Y., Iyer, R. D., Schapire, R. E., & Singer, Y. (2003). An eﬃcient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933 969.

Goetschalckx, R., Fern, A., & Tadepalli, P. (2014). Coactive learning for locally optimal problem solving.. In Conference of the American Association for Artiﬁcial Intelligence (AAAI), pp. 1824 1830.

Haddow, B., Arun, A., & Koehn, P. (2011). Samplerank training for phrase-based machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 261 271, Edinburgh, Scotland. Association for Computational Linguistics.

Hazan, E., Agarwal, A., & Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3), 169 192.

Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classiﬁers. MIT Press.

Jain, A., Wojcik, B., Joachims, T., & Saxena, A. (2013). Learning trajectory preferences for manipulators via iterative improvement. In Neural Information Processing Systems (NIPS), pp. 575 583.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 133 142.

Joachims, T., Granka, L., Pan, B., Hembrooke, H., Radlinski, F., & Gay, G. (2007). Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems (TOIS), 25(2).

Jones, R., & Klinkner, K. (2008). Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In CIKM.

Kakade, S. M., Shalev-Shwartz, S., & Tewari, A. (2008). Eﬃcient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning (ICML).

Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient gradient descent for linear predictors. Journal of Information and Computation, 132(1), 1 64.

Langford, J., & Zhang, T. (2007). The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS.

Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3.

Manning, C., Raghavan, P., & Sch utze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Shivaswamy & Joachims

Novikoﬀ, A. (1962). On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, Vol. XII, pp. 615 622.

Polyak, B., & Tsypkin, Y. (1973). Pseudogradient adaptation and training algorithms. Automatic Remote Control, 12, 83 94.

Radlinski, F., Kurup, M., & Joachims, T. (2008). How does clickthrough data reﬂect retrieval quality?. In Conference on Information and Knowledge Management (CIKM).

Raman, K., & Joachims, T. (2013). Learning socially optimal information systems from egoistic users. In European Conference on Machine Learning (ECML), pp. 128 144.

Raman, K., Joachims, T., Shivaswamy, P., & Schnabel, T. (2013). Stable coactive learning via perturbation. In International Conference on Machine Learning (ICML), pp. 837 845.

Raman, K., Shivaswamy, P., & Joachims, T. (2012). Online learning to diversify from implicit feedback. In KDD.

Shivaswamy, P., & Joachims, T. (2012). Online structured prediction via coactive learning. In ICML.

Somers, T., & Hollinger, G. (2014). Coactive learning with a human expert for robotic monitoring. In RSS Workshop on Robotic Monitoring.

Weston, J., Bengio, S., & Usunier, N. (2011). Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artiﬁcial Intelligence (IJCAI).

Yue, Y., Broder, J., Kleinberg, R., & Joachims, T. (2009). The k-armed dueling bandits problem. In COLT.

Yue, Y., & Joachims, T. (2009). Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML.

Zhang, Y., Lei, T., Barzilay, R., Jaakkola, T., & Globerson, A. (2014). Steps to excellence: Simple inference with reﬁned scoring of dependency trees. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 197 207, Baltimore, Maryland. Association for Computational Linguistics.

Zinkevich, M. (2003). Online convex programming and generalized inﬁnitesimal gradient ascent. In ICML.