# Online Stochastic Linear Optimization under One-bit Feedback

Lijun Zhang (ZHANGLJ@LAMDA.NJU.EDU.CN), National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Tianbao Yang (TIANBAO-YANG@UIOWA.EDU), Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA
Rong Jin (JINRONG.JR@ALIBABA-INC.COM), Alibaba Group, Seattle, USA
Yichi Xiao (XIAOYC@LAMDA.NJU.EDU.CN), National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Zhi-Hua Zhou (ZHOUZH@LAMDA.NJU.EDU.CN), National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

In this paper, we study a special bandit setting of online stochastic linear optimization, where only one bit of information is revealed to the learner at each round. This problem has found many applications, including online advertisement and online recommendation. We assume the binary feedback is a random variable generated from the logit model, and aim to minimize the regret defined by the unknown linear function. Although the existing method for generalized linear bandits can be applied to our problem, its high computational cost makes it impractical for real-world applications. To address this challenge, we develop an efficient online learning algorithm by exploiting particular structures of the observation model. Specifically, we adopt online Newton step to estimate the unknown parameter and derive a tight confidence region based on the exponential concavity of the logistic loss. Our analysis shows that the proposed algorithm achieves a regret bound of $\widetilde{O}(d\sqrt{T})$, which matches the optimal result of stochastic linear bandits.

1. Introduction

Online learning with bandit feedback plays an important role in several industrial domains, such as ad placement, website optimization, and packet routing (Bubeck & Cesa-Bianchi, 2012). A canonical framework for studying this problem is the multi-armed bandit (MAB), which models the situation that a gambler must choose which of K slot machines to play (Robbins, 1952). In the basic stochastic MAB, each arm is assumed to deliver rewards that are drawn from a fixed but unknown distribution. The goal of the gambler is to minimize the regret, namely the difference between his expected cumulative reward and that of the best single arm in hindsight (Auer et al., 2002).

Although MAB is a powerful framework for modeling online decision problems, it becomes intractable when the number of arms is very large or even infinite. To address this challenge, various algorithms have been designed to exploit different structural properties of the reward function, such as Lipschitz continuity (Kleinberg et al., 2008) and convexity (Flaxman et al., 2005; Agarwal et al., 2013). Among them, stochastic linear bandits (SLB) have received considerable attention during the past decade (Auer, 2002; Dani et al., 2008; Abbasi-yadkori et al., 2011). In each round of SLB, the learner is asked to choose an action $x_t$ from a decision set $\mathcal{D} \subseteq \mathbb{R}^d$, and then observes $y_t$ such that

$$\mathbb{E}[y_t \mid x_t] = x_t^\top w_*, \qquad (1)$$

where $w_* \in \mathbb{R}^d$ is a vector of unknown parameters. The goal of the learner is to minimize the (pseudo-)regret

$$T \max_{x \in \mathcal{D}} x^\top w_* - \sum_{t=1}^{T} x_t^\top w_*. \qquad (2)$$

In this paper, we consider a special bandit setting of online linear optimization where the feedback $y_t$ only contains one bit of information. In particular, $y_t \in \{-1, +1\}$.
Our setting is motivated by the fact that in many real-world applications, such as online advertising and recommender systems, user feedback (e.g., click or not, like or dislike) is usually binary. Since the feedback is binary-valued, we assume it is generated according to the logit model (Hastie et al., 2009), i.e.,

$$\Pr[y_t \mid x_t] = \frac{1}{1 + \exp(-y_t x_t^\top w_*)}. \qquad (3)$$

Without loss of generality, suppose $+1$ is the preferred outcome. Then, it is natural to define the regret in terms of the expected number of times that $+1$ is observed, i.e.,

$$T \max_{x \in \mathcal{D}} \frac{\exp(x^\top w_*)}{1 + \exp(x^\top w_*)} - \sum_{t=1}^{T} \frac{\exp(x_t^\top w_*)}{1 + \exp(x_t^\top w_*)}. \qquad (4)$$

The observation model in (3) and the nonlinear regret in (4) can be treated as a special case of the Generalized Linear Bandit (GLB) (Filippi et al., 2010). However, the existing algorithm for GLB is inefficient in the sense that: i) it is not a truly online algorithm, since the whole learning history is stored in memory and used to estimate $w_*$; and ii) it is limited to the case that the number of arms is finite, because an upper bound for each arm needs to be calculated explicitly in each round.

The main contribution of this paper is an efficient online learning algorithm that effectively exploits particular structures of the logit model. Based on the analytical properties of the logistic function, we first show that the linear regret defined in (2) and the nonlinear regret in (4) only differ by a constant factor, and then focus on minimizing the former due to its simplicity. Similar to previous studies (Bubeck & Cesa-Bianchi, 2012), we follow the principle of optimism in the face of uncertainty to deal with the exploration-exploitation dilemma. The basic idea is to maintain a confidence region for $w_*$, and to choose an estimate from the confidence region and an action so that the linear reward is maximized. Thus, the problem reduces to the construction of a confidence region from one-bit feedback that satisfies (3). Based on the exponential concavity of the logistic loss, we propose to use a variant of the online Newton step (Hazan et al., 2007) to find the center of the confidence region, and derive its width by a rather technical analysis of the updating rule. Theoretical analysis shows that our algorithm achieves a regret bound of $\widetilde{O}(d\sqrt{T})$,¹ which matches the result for SLB (Dani et al., 2008). Furthermore, we provide several strategies to reduce the computational cost of the proposed algorithm.

¹We use the $\widetilde{O}$ notation to hide constant factors as well as polylogarithmic factors in d and T.

2. Related Work

The stochastic multi-armed bandit (MAB) (Robbins, 1952) has become the canonical formalism for studying the problem of decision-making under uncertainty. A long line of successive problems has been extensively studied in statistics (Berry & Fristedt, 1985) and computer science (Bubeck & Cesa-Bianchi, 2012).

2.1. Stochastic Multi-armed Bandits (MAB)

In their seminal paper, Lai & Robbins (1985) establish an asymptotic lower bound of order $K \log T$ for the expected cumulative regret over T periods, under the assumption that the expected rewards of the best and second-best arms are well separated. By making use of upper confidence bounds (UCB), they further construct policies which achieve the lower bound asymptotically. However, this initial algorithm is quite involved, because the computation of the UCB relies on the entire sequence of rewards obtained so far.
To address this limitation, Agrawal (1995) introduces a family of simpler policies that only need to calculate the sample mean of rewards, and the regret retains the optimal logarithmic behavior. A finite-time analysis of stochastic MAB is conducted by Auer et al. (2002). In particular, they propose a UCB-type algorithm based on the Chernoff-Hoeffding bound, and demonstrate that it achieves the optimal logarithmic regret uniformly over time.

2.2. Stochastic Linear Bandits (SLB)

SLB is first studied by Auer (2002), who considers the case that $\mathcal{D}$ is finite. Although an elegant UCB-type algorithm named LinRel is developed, he fails to bound its regret due to independence issues. Instead, he designs a complicated master algorithm which uses LinRel as a subroutine, and achieves a regret bound of $\widetilde{O}((\log|\mathcal{D}|)^{3/2}\sqrt{Td})$, where $|\mathcal{D}|$ is the number of feasible decisions. In a subsequent work, Dani et al. (2008) generalize LinRel slightly so that it can be applied in settings where $\mathcal{D}$ may be infinite. They refer to the new algorithm as ConfidenceBall₂, and show that it enjoys a bound of $\widetilde{O}(d\sqrt{T})$, which does not depend on the cardinality of $\mathcal{D}$. Later, Abbasi-yadkori et al. (2011) improve the theoretical analysis of ConfidenceBall₂ by employing tools from self-normalized processes. Specifically, the worst-case bound is improved by a logarithmic factor and the constant is also improved.

2.3. Generalized Linear Bandit (GLB)

Filippi et al. (2010) extend SLB to the nonlinear case based on the Generalized Linear Model framework of statistics. In the so-called GLB model, $y_t$ is assumed to satisfy $\mathbb{E}[y_t \mid x_t] = \mu(x_t^\top w_*)$, where $\mu: \mathbb{R} \mapsto \mathbb{R}$ is a certain link function. The regret is also defined in terms of $\mu(\cdot)$ and is given by

$$T \max_{x \in \mathcal{D}} \mu(x^\top w_*) - \sum_{t=1}^{T} \mu(x_t^\top w_*). \qquad (5)$$

Note that by setting $\mu(x) = \exp(x)/[1 + \exp(x)]$, the problem considered in this paper becomes a special case of GLB. A UCB-type algorithm has been proposed for GLB and achieves a regret bound of $\widetilde{O}(d\sqrt{T})$. Different from ConfidenceBall₂, which constructs a confidence region in the parameter space, the algorithm of Filippi et al. (2010) operates only in the reward space. However, the space and time complexities of that algorithm in the t-th iteration are $O(t)$ and $O(t + |\mathcal{D}|)$, respectively. The $O(t)$ factor comes from the fact that it needs to store the past action-feedback pairs $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$ and use all of them to estimate $w_*$. The $O(|\mathcal{D}|)$ factor is due to the fact that it needs to calculate an upper bound for each arm in order to decide the next action $x_t$. Although we can reduce the computational cost of GLB by replacing the batch optimization with online algorithms, such as online gradient descent (Zinkevich, 2003), the theoretical guarantee of the resulting algorithm remains to be developed. Furthermore, it is also unclear how to extend GLB to the case of an infinite number of arms.

2.4. Bandit Learning with One-bit Feedback

There are several other variants of bandit learning that also rely on one-bit feedback, such as multi-class bandits (Kakade et al., 2008; Chen et al., 2014) and K-armed dueling bandits (Yue et al., 2009; Ailon et al., 2014). For example, in multi-class bandits, the feedback is whether the predicted label is correct or not, and in K-armed dueling bandits, the feedback is the comparison between the rewards from two arms. However, none of them is designed for online linear optimization.
2.5. One-bit Compressive Sensing (CS)

Finally, we would like to discuss one closely related work in signal processing: one-bit compressive sensing (CS) (Boufounos & Baraniuk, 2008; Plan & Vershynin, 2013; Zhang et al., 2014). One-bit CS aims to recover a sparse vector $w_*$ from a set of one-bit measurements $\{y_i\}$, where $y_i$ is generated from $x_i^\top w_*$ according to a certain observation model such as (3). The main difference is that one-bit CS is studied in the batch setting with the goal of minimizing the recovery error, while our problem is studied in the online setting with the goal of minimizing the regret. Another difference is that $w_*$ needs to be sparse in one-bit CS, but could be dense in our case.

3. Online Learning for Logit Model (OL2M)

We first describe the proposed algorithm for online stochastic linear optimization given one-bit feedback, next compare it with existing methods, then state its theoretical guarantees, and finally discuss implementation issues.

3.1. The Algorithm

For a positive definite matrix $A \in \mathbb{R}^{d\times d}$, the weighted $\ell_2$-norm is defined by $\|x\|_A^2 = x^\top A x$. Without loss of generality, we assume the decision space $\mathcal{D}$ is contained in the unit ball, that is,

$$\|x\|_2 \le 1, \quad \forall x \in \mathcal{D}. \qquad (6)$$

We further assume the $\ell_2$-norm of $w_*$ is upper bounded by some constant R, which is known to the learner.

Our first observation is that the linear regret in (2) and the nonlinear regret in (4) only differ by a constant factor, as indicated below.

Lemma 1. Let $R_L$ and $R_N$ be the linear and nonlinear regrets in (2) and (4), respectively. We have

$$\frac{1}{2(1+\exp(R))} R_L \le R_N \le \frac{1}{4} R_L.$$

In the following, we will develop an efficient algorithm that minimizes the linear regret, which in turn minimizes the nonlinear regret as well.

The algorithm is motivated as follows. Suppose actions $x_1, \ldots, x_t$ have been submitted to the oracle, and let $y_1, \ldots, y_t$ be the one-bit feedback from the oracle. To approximate $w_*$, the most straightforward way is to find the maximum likelihood estimator by solving the following logistic regression problem

$$\min_{\|w\|_2 \le R} \; \frac{1}{t}\sum_{i=1}^{t} \log\big(1 + \exp(-y_i x_i^\top w)\big).$$

However, this approach does not scale well, since it requires the learner to store the entire learning history. Instead, we propose an online algorithm to find an approximate solution. The key observation is that the logistic loss

$$f_t(w) = \log\big(1 + \exp(-y_t x_t^\top w)\big)$$

is exponentially concave over a bounded domain (Hazan et al., 2014), which motivates us to apply a variant of the online Newton step (Hazan et al., 2007). Specifically, we propose to find an approximate solution $w_{t+1}$ by solving the following problem

$$w_{t+1} = \operatorname*{argmin}_{\|w\|_2 \le R} \; \frac{1}{2}\|w - w_t\|_{Z_{t+1}}^2 + (w - w_t)^\top \nabla f_t(w_t), \qquad (8)$$

where

$$Z_{t+1} = Z_t + \frac{\beta}{2} x_t x_t^\top, \qquad (9)$$

and $\beta$ is defined in (14). Although our updating rule is similar to the method in (Hazan et al., 2007), there also exist some differences. As indicated by (9), in our case $x_t x_t^\top$ is used to approximate the Hessian matrix, while Hazan et al. (2007) use $\nabla f_t(w_t)[\nabla f_t(w_t)]^\top$.

After a theoretical analysis, we are able to show that, with high probability,

$$w_* \in \mathcal{C}_{t+1} = \big\{ w : \|w - w_{t+1}\|_{Z_{t+1}}^2 \le \gamma_{t+1} \big\}, \qquad (10)$$

where the value of $\gamma_{t+1}$ is given in (12). Given the confidence region, we adopt the principle of optimism in the face of uncertainty, and the next action $x_{t+1}$ is given by

$$(x_{t+1}, \widehat{w}_{t+1}) = \operatorname*{argmax}_{x \in \mathcal{D},\, w \in \mathcal{C}_{t+1}} x^\top w. \qquad (11)$$

Algorithm 1 Online Learning for Logit Model (OL2M)
1: Input: regularization parameter λ
2: Z_1 = λI, w_1 = 0
3: for t = 1, 2, . . . do
4:     (x_t, ŵ_t) = argmax_{x ∈ D, w ∈ C_t} x^⊤ w
5:     Submit x_t and observe y_t ∈ {−1, +1}
6:     Solve the optimization problem in (8) to find w_{t+1}
7: end for
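To make step 6 concrete, below is a minimal Python sketch of one round of the update defined by (8), (9), and (14). It is only an illustration, not the authors' implementation: the constraint $\|w\|_2 \le R$ is enforced by a simple rescaling rather than the exact generalized projection with respect to $\|\cdot\|_{Z_{t+1}}$, and all class and function names are our own.

```python
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of f_t(w) = log(1 + exp(-y * x^T w))."""
    z = y * np.dot(x, w)
    return -y * x / (1.0 + np.exp(z))  # equals -y * x * sigmoid(-z)

class OL2MSketch:
    """One-round update of OL2M, following Eqs. (8), (9), and (14)."""

    def __init__(self, d, R=1.0, lam=1.0):
        self.R = R
        self.beta = 1.0 / (2.0 * (1.0 + np.exp(R)))  # Eq. (14)
        self.Z = lam * np.eye(d)                     # Z_1 = lambda * I
        self.w = np.zeros(d)                         # w_1 = 0

    def update(self, x, y):
        # Z_{t+1} = Z_t + (beta / 2) * x_t x_t^T, Eq. (9)
        self.Z += 0.5 * self.beta * np.outer(x, x)
        # Unconstrained minimizer of Eq. (8): w_t - Z_{t+1}^{-1} grad f_t(w_t)
        w_new = self.w - np.linalg.solve(self.Z, logistic_grad(self.w, x, y))
        # Crude stand-in for the generalized projection onto {||w||_2 <= R}
        norm = np.linalg.norm(w_new)
        if norm > self.R:
            w_new *= self.R / norm
        self.w = w_new
```

A complete implementation would replace the rescaling by the exact generalized projection and combine this update with the action selection in (11).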
At the beginning, we set $Z_1 = \lambda I$ and $w_1 = 0$. The above procedure is summarized in Algorithm 1, and is referred to as Online Learning for Logit Model (OL2M).

Since both ConfidenceBall₂ (Dani et al., 2008) and our OL2M are UCB-type algorithms, their overall frameworks are similar. The main difference lies in the construction of the confidence region and the related analysis. While ConfidenceBall₂ uses online least squares to update the center of the confidence region, OL2M resorts to the online Newton step. Due to the difference in the updating rule and the observation model, the self-normalized bound for vector-valued martingales (Abbasi-yadkori et al., 2011) cannot be applied here.

Although our observation model in (3) can be handled by the Generalized Linear Bandit (GLB) (Filippi et al., 2010), this paper differs from GLB in the following aspects. To estimate $w_*$, GLB needs to store the learning history and perform batch updating in each round; in contrast, the proposed OL2M performs online updating. While GLB only considers a finite number of arms, we allow the number of arms to be infinite. Our algorithm follows the learning framework of SLB; thus, existing techniques for speeding up SLB can also be used to accelerate our algorithm, which is discussed in Section 3.3.

3.2. Theoretical Guarantees

The main theoretical contribution of this paper is the following theorem regarding the confidence region of $w_*$ at each round.

Theorem 1. With a probability at least $1 - \delta$, we have

$$\|w_{t+1} - w_*\|_{Z_{t+1}}^2 \le \gamma_{t+1}, \quad \forall t > 0,$$

where

$$\gamma_{t+1} = \left(8R + \frac{8}{\beta}\right)\tau_t + \frac{2}{\beta}\log\frac{\det(Z_{t+1})}{\det(Z_1)} + \lambda R^2, \qquad (12)$$

$$\tau_t = \log\frac{2\lceil 2\log_2 t\rceil t^2}{\delta}, \qquad (13)$$

$$\beta = \frac{1}{2(1+\exp(R))}. \qquad (14)$$

The main idea is to analyze the growth of $\|w_{t+1} - w_*\|_{Z_{t+1}}^2$ by exploring the properties of the logistic loss (Lemmas 2 and 4) and concentration inequalities for martingales (Lemma 5). By a simple upper bound of $\log\det(Z_{t+1})/\det(Z_1)$, we can show that the width of the confidence region is $O(\sqrt{d\log t})$.

Corollary 2. We have

$$\log\frac{\det(Z_{t+1})}{\det(Z_1)} \le d\log\left(1 + \frac{\beta t}{2\lambda d}\right),$$

and thus $\gamma_{t+1} \le O(d\log t)$ for all $t > 0$.

Based on Theorem 1, we have the following regret bound for OL2M.

Theorem 3. With a probability at least $1 - \delta$, we have

$$T\max_{x\in\mathcal{D}} x^\top w_* - \sum_{t=1}^{T} x_t^\top w_* \le \sqrt{\frac{8\,\gamma_{T+1}\, T}{\beta}\log\frac{\det(Z_{T+1})}{\det(Z_1)}}$$

for all $T > 0$.

Combining with the upper bound in Corollary 2, the above theorem implies our algorithm achieves a regret bound of $\widetilde{O}(d\sqrt{T})$, which matches the bound for Stochastic Linear Bandits (Dani et al., 2008). One limitation of Theorem 3 is that the upper bound has an exponential dependence on R, which is an upper bound of $\|w_*\|_2$. That is because our algorithm is built upon the online Newton step (Hazan et al., 2007), the regret of which has such an undesirable dependence on R. From the recent studies on logistic regression (Bach & Moulines, 2013; Bach, 2014; Hazan et al., 2014), we conjecture that it is possible to obtain a polynomial dependence on R, but with a higher dependence on T. We will investigate this issue in the future.

3.3. Implementation Issues

The main computational cost of OL2M comes from (11), which is NP-hard in general (Dani et al., 2008). In the following, we discuss two strategies for reducing the computational cost. More results can be found in the supplementary material.

Finite Decision Set. If the decision set $\mathcal{D}$ is finite, (11) can be solved by computing an upper bound for each decision in $\mathcal{D}$. Specifically, we have

$$
\begin{aligned}
x_{t+1} &= \operatorname*{argmax}_{x\in\mathcal{D}} \; \max_{\|w - w_{t+1}\|_{Z_{t+1}}^2 \le \gamma_{t+1}} x^\top w \\
&= \operatorname*{argmax}_{x\in\mathcal{D}} \; \max_{\|z\|_{Z_{t+1}}^2 \le \gamma_{t+1}} x^\top (w_{t+1} + z) \\
&= \operatorname*{argmax}_{x\in\mathcal{D}} \; x^\top w_{t+1} + \sqrt{\gamma_{t+1}}\,\|x\|_{Z_{t+1}^{-1}}.
\end{aligned}
$$
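The following Python sketch implements this selection rule for a finite set of arms; the function name and array layout are illustrative choices of ours, not from the paper.

```python
import numpy as np

def select_arm(arms, w_center, Z, gamma):
    """Pick argmax_x  x^T w_{t+1} + sqrt(gamma_{t+1}) * ||x||_{Z_{t+1}^{-1}}.

    arms     : (n, d) array, one candidate action per row
    w_center : center w_{t+1} of the confidence region
    Z        : matrix Z_{t+1}
    gamma    : confidence width gamma_{t+1}
    """
    Z_inv = np.linalg.inv(Z)
    means = arms @ w_center
    # ||x||_{Z^{-1}} for every row x of arms
    widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, Z_inv, arms))
    return int(np.argmax(means + np.sqrt(gamma) * widths))
```

For n arms the selection costs O(nd²) per round once $Z_{t+1}^{-1}$ is available, and the inverse itself can be maintained with rank-one updates since $Z_{t+1}$ changes by a rank-one term each round.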
Optimization Over Ball. As mentioned by Dani et al. (2008), in the special case that $\mathcal{D}$ is the unit ball, (11) can be solved in time $O(\mathrm{poly}(d))$. Here, we provide an explanation using techniques from convex optimization. To this end, we rewrite the optimization problem in (11) as follows:

$$\max_{\|x\|_2 \le 1,\; \|w - w_{t+1}\|_{Z_{t+1}}^2 \le \gamma_{t+1}} x^\top w = \max_{\|w - w_{t+1}\|_{Z_{t+1}}^2 \le \gamma_{t+1}} \|w\|_2,$$

which is equivalent to

$$\min_{\|w - w_{t+1}\|_{Z_{t+1}}^2 \le \gamma_{t+1}} -\|w\|_2^2.$$

The above problem is an optimization problem with a quadratic objective and one quadratic inequality constraint, and it is well known that strong duality holds provided there exists a strictly feasible point (Boyd & Vandenberghe, 2004). Thus, we can solve its dual problem, which is convex and given by

$$\max_{\nu,\,\lambda}\; \nu \quad \text{s.t.}\quad \lambda \ge 0,\quad \begin{bmatrix} -I + \lambda Z_{t+1} & -\lambda Z_{t+1} w_{t+1} \\ -\lambda w_{t+1}^\top Z_{t+1} & \lambda\big(\|w_{t+1}\|_{Z_{t+1}}^2 - \gamma_{t+1}\big) - \nu \end{bmatrix} \succeq 0.$$

After obtaining the dual solution, we can get the primal solution based on the KKT conditions.

4. Analysis

Due to the limitation of space, we only prove Theorem 1. The omitted proofs are provided in the supplementary material.

4.1. Proof of Theorem 1

We begin with several lemmas that are central to our analysis. Although the application of the online Newton step (Hazan et al., 2007) in Algorithm 1 is motivated by the fact that $f_t(w)$ is exponentially concave over a bounded domain, our analysis is built upon a related but different property, namely that the logistic loss $\log(1 + \exp(x))$ is strongly convex over a bounded domain, from which we obtain the following lemma.

Lemma 2. Denote the ball of radius R by $\mathcal{B}_R$, i.e., $\mathcal{B}_R = \{w : \|w\|_2 \le R\}$. The following holds for $\beta \le \frac{1}{2(1+\exp(R))}$:

$$f_t(w_2) \ge f_t(w_1) + [\nabla f_t(w_1)]^\top (w_2 - w_1) + \frac{\beta}{2}\big((w_2 - w_1)^\top x_t\big)^2, \quad \forall w_1, w_2 \in \mathcal{B}_R.$$

Comparing Lemma 2 with Lemma 3 of Hazan et al. (2007), we can see that the quadratic term in our inequality does not depend on $y_t$. This independence allows us to simplify the subsequent analysis involving martingales.

Our second lemma is devoted to analyzing the property of the updating rule in (8).

Lemma 3. We have

$$\langle w_t - w_*, \nabla f_t(w_t)\rangle \le \frac{1}{2}\|\nabla f_t(w_t)\|_{Z_{t+1}^{-1}}^2 + \frac{\|w_t - w_*\|_{Z_{t+1}}^2}{2} - \frac{\|w_{t+1} - w_*\|_{Z_{t+1}}^2}{2}. \qquad (15)$$

For each function $f_t(\cdot)$, we denote its conditional expectation over $y_t$ by $\bar{f}_t(w)$, i.e.,

$$\bar{f}_t(w) = \mathbb{E}_{y_t}\big[\log\big(1 + \exp(-y_t x_t^\top w)\big)\big]. \qquad (16)$$

According to the Leibniz integral rule, we have

$$\nabla \bar{f}_t(w) = \mathbb{E}_{y_t}[\nabla f_t(w)]. \qquad (17)$$

Based on the property of the Kullback-Leibler divergence (Cover & Thomas, 2006), we obtain the following lemma.

Lemma 4. We have $\bar{f}_t(w) \ge \bar{f}_t(w_*)$ for all $w \in \mathbb{R}^d$.

Next, we introduce one inequality for bounding the weighted $\ell_2$-norm of the gradient:

$$\|\nabla f_t(w)\|_A^2 = \left(\frac{\exp(-y_t x_t^\top w)}{1 + \exp(-y_t x_t^\top w)}\right)^2 \|x_t\|_A^2 \le \|x_t\|_A^2, \quad \forall A \succeq 0,\; w \in \mathbb{R}^d. \qquad (18)$$

We continue the proof of Theorem 1 in the following. Our updating rule in (8) ensures $\|w_t\|_2 \le R$ for all $t > 0$. Combining with the assumption $\|w_*\|_2 \le R$, Lemma 2 implies

$$f_t(w_t) \le f_t(w_*) + [\nabla f_t(w_t)]^\top (w_t - w_*) - \frac{\beta}{2}\big((w_* - w_t)^\top x_t\big)^2. \qquad (19)$$

By taking expectation over $y_t$, (19) becomes

$$\bar{f}_t(w_t) \overset{(16),(17)}{\le} \bar{f}_t(w_*) + [\nabla \bar{f}_t(w_t)]^\top (w_t - w_*) - \frac{\beta}{2}\big((w_* - w_t)^\top x_t\big)^2.$$

Combining with Lemma 4, we have

$$
\begin{aligned}
0 &\le [\nabla \bar{f}_t(w_t)]^\top (w_t - w_*) - \frac{\beta}{2}\underbrace{\big((w_* - w_t)^\top x_t\big)^2}_{:=a_t} \\
&= [\nabla f_t(w_t)]^\top (w_t - w_*) - \frac{\beta}{2}a_t + \underbrace{[\nabla \bar{f}_t(w_t) - \nabla f_t(w_t)]^\top (w_t - w_*)}_{:=b_t} \\
&\overset{(15)}{\le} \frac{1}{2}\|\nabla f_t(w_t)\|_{Z_{t+1}^{-1}}^2 + \frac{\|w_t - w_*\|_{Z_{t+1}}^2}{2} - \frac{\|w_{t+1} - w_*\|_{Z_{t+1}}^2}{2} - \frac{\beta}{2}a_t + b_t \\
&\overset{(18)}{\le} \frac{1}{2}\underbrace{\|x_t\|_{Z_{t+1}^{-1}}^2}_{:=c_t} + \frac{\|w_t - w_*\|_{Z_{t+1}}^2}{2} - \frac{\|w_{t+1} - w_*\|_{Z_{t+1}}^2}{2} - \frac{\beta}{2}a_t + b_t \\
&\overset{(9)}{=} \frac{1}{2}c_t + \frac{\|w_t - w_*\|_{Z_t}^2}{2} + \frac{\beta}{4}\big(x_t^\top (w_t - w_*)\big)^2 - \frac{\|w_{t+1} - w_*\|_{Z_{t+1}}^2}{2} - \frac{\beta}{2}a_t + b_t \\
&= \frac{1}{2}c_t + \frac{\|w_t - w_*\|_{Z_t}^2}{2} - \frac{\|w_{t+1} - w_*\|_{Z_{t+1}}^2}{2} - \frac{\beta}{4}a_t + b_t.
\end{aligned}
$$

We thus have

$$\|w_{t+1} - w_*\|_{Z_{t+1}}^2 \le \|w_t - w_*\|_{Z_t}^2 - \frac{\beta}{2}a_t + 2b_t + c_t.$$

Summing the above inequality over iterations 1 to t, we obtain

$$\|w_{t+1} - w_*\|_{Z_{t+1}}^2 + \frac{\beta}{2}\sum_{i=1}^{t} a_i \le \|w_1 - w_*\|_{Z_1}^2 + 2\sum_{i=1}^{t} b_i + \sum_{i=1}^{t} c_i \le \lambda R^2 + 2\sum_{i=1}^{t} b_i + \sum_{i=1}^{t} c_i. \qquad (20)$$

Next, we discuss how to bound the summation of the martingale difference sequence $\sum_{i=1}^{t} b_i$.
To this end, we prove the following lemma, which is built upon Bernstein's inequality for martingales (Cesa-Bianchi & Lugosi, 2006) and the peeling technique (Bartlett et al., 2005).

Lemma 5. With a probability at least $1 - \delta$, we have

$$\sum_{i=1}^{t} b_i \le 2\sqrt{\tau_t\sum_{i=1}^{t} a_i} + \frac{2}{3}\cdot 4R\,\tau_t, \quad \forall t > 0,$$

where $\tau_t$ is defined in (13).

From Lemma 5 and the basic inequality $2\sqrt{xy} \le x + y$, with a probability at least $1 - \delta$ we have

$$\sum_{i=1}^{t} b_i \le \frac{\beta}{4}\sum_{i=1}^{t} a_i + \left(4R + \frac{4}{\beta}\right)\tau_t \qquad (21)$$

for all $t > 0$. Substituting (21) into (20), we obtain

$$\|w_{t+1} - w_*\|_{Z_{t+1}}^2 \le \lambda R^2 + 2\left(4R + \frac{4}{\beta}\right)\tau_t + \sum_{i=1}^{t} c_i. \qquad (22)$$

Finally, we show an upper bound for $\sum_{i=1}^{t} c_i$, which is a direct consequence of Lemma 12 of Hazan et al. (2007).

Lemma 6. We have

$$\sum_{i=1}^{t} \|x_i\|_{Z_{i+1}^{-1}}^2 \le \frac{2}{\beta}\log\frac{\det(Z_{t+1})}{\det(Z_1)}.$$

We complete the proof by combining (22) with the above lemma.

[Figure 1. Instantaneous regret of OL2M when D is the unit ball in R^10: (a) c = 0.001, (b) c = 0.02, (c) c = 0.2, (d) c = 1.]

5. Experiments

In this section, we present experimental results to demonstrate the effectiveness of the proposed algorithm.

5.1. Experimental Setting

We sample a point uniformly at random from the (d−1)-sphere as $w_*$, and each time the learner submits an action $x_t$, a one-bit feedback $y_t \in \{-1, +1\}$ is generated according to the logit model in (3). To apply our algorithm, we need to determine the values of two parameters: $\lambda$ and $\gamma_t$. $\lambda$ is introduced to make $Z_t$ invertible, and the performance of our algorithm is insensitive to its value; thus, we simply choose $\lambda = 1$ in the following. $\gamma_t$ is an essential parameter, which determines the width of the confidence region, and its value is tuned as $c\log\frac{\det(Z_t)}{\det(Z_1)}$ according to (12), where c is searched in the range $[10^{-3}, 1]$.

5.2. Experimental Results

In the first experiment, we choose the unit ball as the decision set, i.e., $\mathcal{D} = \{x : \|x\|_2 \le 1\} \subseteq \mathbb{R}^d$, which contains an infinite number of actions. As discussed in Section 3.3, in this case (11) can be cast as a convex optimization problem, which is then solved by the CVX package (Grant & Boyd, 2008; 2014). We first investigate how the instantaneous regret $x_*^\top w_* - x_t^\top w_*$, where $x_* = \operatorname{argmax}_{x\in\mathcal{D}} x^\top w_*$, varies with t during the learning process. The results for d = 10 with different settings of c are shown in Fig. 1. As can be seen, the instantaneous regret decreases overall, although it exhibits some local fluctuations. These fluctuations actually reflect the switches between exploitation and exploration. Generally speaking, valleys and peaks of the curve correspond to exploitation and exploration, respectively. The value of c determines the width of the confidence region, which in turn controls the exploitation-exploration trade-off. A small value of c favors exploitation, which may lock onto a suboptimal action because of too little exploration. For example, in Fig. 1(a) where c = 0.001, after $2\times 10^4$ rounds, the learner always submits a suboptimal action and suffers a constant instantaneous regret. On the other hand, a larger value of c favors exploration, which might result in a large regret because too much exploration prevents the algorithm from playing the optimal action. This phenomenon can also be observed in Fig. 1(d), where c = 1. From Fig. 1(b) and Fig. 1(c), we see that a good trade-off between exploitation and exploration is achieved when c = 0.02 or 0.2, for which the instantaneous regret approaches 0 gradually. The behavior of the instantaneous regret for d = 100 is similar and can be found in the supplementary material.
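For reference, the following is a minimal Python sketch of this synthetic setup: sampling $w_*$ from the unit sphere and drawing the one-bit feedback of Eq. (3). The function names and the fixed random seed are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_w_star(d):
    """Sample w_* uniformly at random from the (d-1)-sphere."""
    w = rng.standard_normal(d)
    return w / np.linalg.norm(w)

def one_bit_feedback(x, w_star):
    """Draw y in {-1, +1} from the logit model Pr[y = 1 | x] = sigmoid(x^T w_*)."""
    p = 1.0 / (1.0 + np.exp(-x @ w_star))
    return 1 if rng.random() < p else -1
```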
Next, we examine the $\widetilde{O}(d\sqrt{T})$ regret bound indicated by Theorem 3. Let $\mathrm{Regret}(t)$ be the regret up to round t, i.e., $\mathrm{Regret}(t) = \sum_{i=1}^{t}\big(x_*^\top w_* - x_i^\top w_*\big)$. If the learner achieves an $\widetilde{O}(d\sqrt{T})$ regret bound, the curve of $\mathrm{Regret}(t)/(d\sqrt{t})$ should increase at most polylogarithmically. Fig. 2 plots the curve of $\mathrm{Regret}(t)/(d\sqrt{t})$ with respect to t for d = 10 and 100. As can be seen, with a suitable choice of c, the curve indeed increases very slowly (e.g., d = 100 and c = 0.02), or even decreases slightly after certain rounds (e.g., d = 10 and c = 0.02).

[Figure 2. Regret(t)/(d√t) of OL2M when D is the unit ball in R^d: (a) d = 10, (b) d = 100, each with c ∈ {0.001, 0.02, 0.2, 1}.]

In the last experiment, we study the case that $\mathcal{D}$ is finite, so that the GLB algorithm (Filippi et al., 2010) can also be applied. In the experiments, the parameter of GLB is also manually tuned. The decision set $\mathcal{D} \subseteq \mathbb{R}^d$ is constructed by sampling 10d points uniformly at random from the (d−1)-sphere. In Fig. 3, we plot the regret of OL2M and GLB with respect to t. Note that in each round, GLB solves a logistic regression problem that utilizes the whole learning history to estimate $w_*$. Thus, it is not surprising that the regret of GLB is smaller than that of OL2M by a constant factor. On the other hand, OL2M performs online updating, which is more efficient when t is large.

[Figure 3. Regret of OL2M and GLB when D contains a finite number of actions: (a) d = 10 and |D| = 100, (b) d = 100 and |D| = 1000.]

6. Conclusions

In this paper, we consider the problem of online linear optimization under one-bit feedback. Under the assumption that the binary feedback is generated from the logit model, we develop a variant of the online Newton step to approximate the unknown vector, and discuss how to construct the confidence region theoretically. Given the confidence region, we choose the action that produces the maximal reward in each round. Theoretical analysis reveals that our algorithm achieves a regret bound of $\widetilde{O}(d\sqrt{T})$.

The current algorithm assumes that the one-bit feedback is generated from a logit model. In contrast, a much broader class of observation models is allowed in one-bit compressive sensing (Plan & Vershynin, 2013). In the future, we will investigate how to extend our algorithm to other observation models. Recent studies in online learning have shown that Thompson sampling is both competitive and efficient for addressing the exploration-exploitation dilemma (Chapelle & Li, 2011; Li, 2013). We leave the application of Thompson sampling to our problem for future work.

Acknowledgements

This work was partially supported by the NSFC (61333014), NSF (1463988, 1545995), and the Collaborative Innovation Center of Novel Software Technology and Industrialization of Nanjing University.

References

Abbasi-yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pp. 2312-2320, 2011.

Agarwal, Alekh, Foster, Dean P., Hsu, Daniel, Kakade, Sham M., and Rakhlin, Alexander. Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization, 23(1):213-240, 2013.

Agrawal, Rajeev. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078, 1995.
Ailon, Nir, Karnin, Zohar, and Joachims, Thorsten. Reducing dueling bandits to cardinal bandits. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Auer, Peter. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.

Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Bach, Francis. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15:595-627, 2014.

Bach, Francis and Moulines, Eric. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26, pp. 773-781, 2013.

Bartlett, Peter L., Bousquet, Olivier, and Mendelson, Shahar. Local Rademacher complexities. The Annals of Statistics, 33(4):1497-1537, 2005.

Berry, Donald A. and Fristedt, Bert. Bandit Problems: Sequential Allocation of Experiments. Monographs on Statistics and Applied Probability. Springer Netherlands, 1985.

Boufounos, Petros T. and Baraniuk, Richard G. 1-bit compressive sensing. In Proceedings of the 42nd Annual Conference on Information Sciences and Systems, pp. 16-21, 2008.

Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.

Bubeck, Sébastien and Cesa-Bianchi, Nicolò. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

Cesa-Bianchi, Nicolò and Lugosi, Gábor. Prediction, Learning, and Games. Cambridge University Press, 2006.

Chapelle, Olivier and Li, Lihong. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pp. 2249-2257, 2011.

Chen, Shang-Tse, Lin, Hsuan-Tien, and Lu, Chi-Jen. Boosting with online binary learners for the multiclass bandit problem. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Cover, Thomas M. and Thomas, Joy A. Elements of Information Theory. Wiley-Interscience, second edition, 2006.

Dani, Varsha, Hayes, Thomas P., and Kakade, Sham M. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 355-366, 2008.

Filippi, Sarah, Cappe, Olivier, Garivier, Aurélien, and Szepesvári, Csaba. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23, pp. 586-594, 2010.

Flaxman, Abraham D., Kalai, Adam Tauman, and McMahan, H. Brendan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385-394, 2005.

Grant, Michael and Boyd, Stephen. Graph implementations for nonsmooth convex programs. In Blondel, V., Boyd, S., and Kimura, H. (eds.), Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pp. 95-110. Springer-Verlag Limited, 2008.

Grant, Michael and Boyd, Stephen. CVX: Matlab software for disciplined convex programming, version 2.1, March 2014.

Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York, 2009.

Hazan, Elad, Agarwal, Amit, and Kale, Satyen. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.
Hazan, Elad, Koren, Tomer, and Levy, Kfir Y. Logistic regression: Tight bounds for stochastic and online optimization. In Proceedings of the 27th Conference on Learning Theory, pp. 197-209, 2014.

Kakade, Sham M., Shalev-Shwartz, Shai, and Tewari, Ambuj. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning, pp. 440-447, 2008.

Kleinberg, Robert, Slivkins, Aleksandrs, and Upfal, Eli. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 681-690, 2008.

Lai, T. L. and Robbins, Herbert. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

Li, Lihong. Generalized Thompson sampling for contextual bandits. arXiv e-prints, arXiv:1310.7163, 2013.

Plan, Yaniv and Vershynin, Roman. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482-494, 2013.

Robbins, Herbert. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.

Yue, Yisong, Broder, Josef, Kleinberg, Robert, and Joachims, Thorsten. The K-armed dueling bandits problem. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

Zhang, Lijun, Yi, Jinfeng, and Jin, Rong. Efficient algorithms for robust one-bit compressive sensing. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pp. 928-936, 2003.