# Bandit Phase Retrieval

Tor Lattimore (DeepMind, London, lattimore@deepmind.com) and Botao Hao (DeepMind, London, haobotao000@gmail.com)

We study a bandit version of phase retrieval where the learner chooses actions $(A_t)_{t=1}^n$ in the $d$-dimensional unit ball and the expected reward is $\langle A_t, \theta_\star\rangle^2$ with $\theta_\star \in \mathbb{R}^d$ an unknown parameter vector. We prove an upper bound on the minimax cumulative regret in this problem of $\tilde{O}(d\sqrt{n})$, which matches known lower bounds up to logarithmic factors and improves on the best known upper bound by a factor of $\sqrt{d}$. We also show that the minimax simple regret is $\tilde{\Theta}(d/\sqrt{n})$ and that this is only achievable by an adaptive algorithm. Our analysis shows that an apparently convincing heuristic for guessing lower bounds can be misleading and that uniform bounds on the information ratio for information-directed sampling [Russo and Van Roy, 2014] are not sufficient for optimal regret.

1 Introduction

We study an instantiation of the low-rank bandit problem [Jun et al., 2019] that in the statistical setting is called phase retrieval. Although this model is interesting in its own right, our main focus is on the curious information structure of this problem and how it impacts algorithm design choices. Notably, we were not able to prove optimal regret for standard approaches based on optimism, Thompson sampling or even information-directed sampling. Instead, our algorithm is a variant of explore-then-commit with an adaptive exploration phase that learns to gain information at a faster rate than what is achievable with non-adaptive exploration.

Problem setting. Let $\|\cdot\|$ be the standard euclidean norm, $\mathcal{B}^d_r = \{x \in \mathbb{R}^d : \|x\| \le r\}$ and $\mathbb{S}^{d-1}_r = \{x \in \mathbb{R}^d : \|x\| = r\}$. At the start of the game the environment secretly chooses a vector $\theta_\star \in \mathbb{S}^{d-1}_r$ with $r \in [0, 1]$ a constant that is known to the learner. The assumption that $r$ is known can be relaxed at essentially no cost (Section 8). The game then proceeds over $n$ rounds. In round $t$ the learner chooses an action $A_t \in \mathcal{B}^d_1$ and observes a reward

$$X_t = \langle A_t, \theta_\star\rangle^2 + \eta_t,$$

where $(\eta_t)_{t=1}^n$ is a sequence of independent standard Gaussian random variables. As is standard in bandit problems, the conditional law of $A_t$ should be chosen as a (measurable) function of the previous actions $(A_s)_{s=1}^{t-1}$ and rewards $(X_s)_{s=1}^{t-1}$ and possibly an exogenous source of randomness. The performance of a policy $\pi$ is measured in terms of the expected regret,

$$R_n(\pi, \theta_\star) = \mathbb{E}\left[\sum_{t=1}^n \left(\max_{a \in \mathcal{B}^d_1} \langle a, \theta_\star\rangle^2 - \langle A_t, \theta_\star\rangle^2\right)\right].$$

The minimax regret is $R^\star_n = \sup_{r \in [0,1]} \inf_\pi \sup_{\theta_\star \in \mathbb{S}^{d-1}_r} R_n(\pi, \theta_\star)$, where the infimum is over all policies. We also study the pure exploration setting, where at the end of the game the learner uses the observed data $(A_t)_{t=1}^n$ and $(X_t)_{t=1}^n$ to make a prediction $\hat{A}_\star \in \mathcal{B}^d_1$ of the optimal action. The simple regret of a policy $\pi$ is

$$r_n(\pi, \theta_\star) = \mathbb{E}\left[\max_{a \in \mathcal{B}^d_1} \langle a, \theta_\star\rangle^2 - \langle \hat{A}_\star, \theta_\star\rangle^2\right].$$

As expected, the minimax simple regret is $r^\star_n = \sup_{r \in [0,1]} \inf_\pi \sup_{\theta_\star \in \mathbb{S}^{d-1}_r} r_n(\pi, \theta_\star)$.
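To fix the interaction protocol, here is a minimal simulation sketch of the environment and the cumulative regret. The class and method names (`PhaseRetrievalBandit`, `play`) are our own and the snippet is only meant to mirror the definitions above, not to suggest an implementation used in the paper.

```python
import numpy as np

class PhaseRetrievalBandit:
    """Minimal sketch of the bandit phase retrieval environment:
    the reward is <A_t, theta>^2 plus standard Gaussian noise."""

    def __init__(self, theta, rng=None):
        self.theta = np.asarray(theta, dtype=float)   # hidden parameter with ||theta|| = r <= 1
        self.rng = rng or np.random.default_rng(0)
        self.regret = 0.0                             # cumulative (pseudo-)regret

    def play(self, action):
        action = np.asarray(action, dtype=float)
        assert np.linalg.norm(action) <= 1 + 1e-9     # actions live in the unit ball
        signal = float(action @ self.theta) ** 2
        # the best action is theta/||theta||, which earns reward ||theta||^2
        self.regret += float(self.theta @ self.theta) - signal
        return signal + self.rng.standard_normal()    # X_t = <A_t, theta>^2 + eta_t

# toy usage: uniformly random unit-sphere actions incur regret close to n * r^2 * (1 - 1/d)
d, n, r = 5, 1000, 1.0
rng = np.random.default_rng(1)
theta = rng.standard_normal(d)
theta *= r / np.linalg.norm(theta)
env = PhaseRetrievalBandit(theta, rng)
for _ in range(n):
    a = rng.standard_normal(d)
    env.play(a / np.linalg.norm(a))
print(env.regret)
```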
Contributions. Our main contribution is an upper bound on $R^\star_n$ that matches existing lower bounds up to logarithmic factors. For the simple regret we provide a near-optimal upper bound and a lower bound showing that non-adaptive policies must be at least a factor of $\Omega(\sqrt{d})$ suboptimal. In all of the following, const is a universal non-negative constant that may vary from one expression to the next.

Theorem 1. $R^\star_n \le \mathrm{const}\; d\sqrt{n\log(n)\log(d)}$.

A corollary of the proof of the lower bound by Kotłowski and Neu [2019] shows that the minimax regret is at least $\Omega(d\sqrt{n})$, so the minimax regret for bandit phase retrieval is now known up to logarithmic factors. For the simple regret we provide the following upper and lower bounds:

Theorem 2. $r^\star_n \le \mathrm{const}\; d\sqrt{\log(n)\log(d)/n}$.

Theorem 3. Assume that $n \ge d^3$ and $d \ge 8$. Then there exists an $r \in [0, 1]$ such that for all policies $\pi$ with $(A_t)_{t=1}^n$ independent of $(X_t)_{t=1}^n$,

$$\sup_{\theta \in \mathbb{S}^{d-1}_r} r_n(\pi, \theta) \ge \mathrm{const}\;\frac{d^{3/2}}{\sqrt{n}}.$$

We also show that worst-case bounds on the information ratio for information-directed sampling are not sufficient to achieve optimal regret. Our results suggest that the conjectured lower bounds for low-rank bandits [Jun et al., 2019, Lu et al., 2021] are not true and that existing upper bounds may be loose. The same phenomenon may explain the gap between upper and lower bounds for bandit principal component analysis [Kotłowski and Neu, 2019], as we discuss in Section 8.

Notation. The first $n$ integers are $[n] = \{1, 2, \ldots, n\}$ and the standard basis vectors in $\mathbb{R}^d$ are $e_1, \ldots, e_d$. The span of a collection of vectors is denoted by $\mathrm{span}(v_1, \ldots, v_m)$ and the orthogonal complement of a linear subspace $V \subseteq \mathbb{R}^d$ is $V^\perp = \{x \in \mathbb{R}^d : \langle x, y\rangle = 0 \text{ for all } y \in V\}$. The mutual information between random elements $X$ and $Y$ on the same probability space is $I(X; Y)$ and the relative entropy between probability measures $P$ and $Q$ on the same measurable space is $\mathrm{KL}(P, Q)$. The dimension of a set $A \subseteq \mathbb{R}^d$ is defined as the dimension of the affine hull of $A$.

2 Related work

The most related work is by Huang et al. [2021]. As happens surprisingly often, this is a case where multiple groups tackled the same problem and developed similar methods to arrive at approximately the same solution. Their work provides the same upper bound in the cumulative regret setting as we do, using an explore-then-commit principle and the noisy power method. They also generalise the setting to consider higher powers, for example when the reward is $\langle A_t, \theta_\star\rangle^3$. What is different is that we use a more ad-hoc (but still interesting) construction during the exploration period and our work is more focussed on the information-theoretic implications.

Phase retrieval. Phase retrieval is a classical problem in signal processing and statistics [Candès et al., 2015b,a, Cai et al., 2016, Chen and Candès, 2017, Chen et al., 2019, Sun et al., 2018]. These works are focused on learning $\theta_\star$ where the covariates $(A_t)_{t=1}^n$ are uncontrolled, either random or fixed design.

Linear bandits. Our problem can be written as a stochastic linear bandit by noticing that $\langle A_t, \theta_\star\rangle^2 = \langle A_t A_t^\top, \theta_\star\theta_\star^\top\rangle$, where the inner product between matrices on the right-hand side should be interpreted coordinate-wise and the action set is $\{aa^\top : a \in \mathcal{B}^d_1\}$. There is an enormous literature on stochastic linear bandits [Auer, 2002, Dani et al., 2008, Rusmevichientong and Tsitsiklis, 2010b, Chu et al., 2011, Abbasi-Yadkori et al., 2011]. This reduction immediately yields an upper bound on the minimax regret of $O(d^2\sqrt{n}\log(n))$.
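The identity behind this reduction, $\langle a, \theta_\star\rangle^2 = \langle aa^\top, \theta_\star\theta_\star^\top\rangle$, is easy to verify numerically. The following standalone sketch (ours, not code from the paper) checks the rank-one lifting on random inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
theta = rng.standard_normal(d)
a = rng.standard_normal(d)
a /= np.linalg.norm(a)                                # an action in the unit ball

quadratic_reward = (a @ theta) ** 2                   # <a, theta>^2
lifted_action = np.outer(a, a)                        # rank-1 matrix a a^T
lifted_param = np.outer(theta, theta)                 # rank-1 matrix theta theta^T
linear_reward = np.sum(lifted_action * lifted_param)  # coordinate-wise (Frobenius) inner product

assert np.isclose(quadratic_reward, linear_reward)
```

Because the lifted actions live in a $d^2$-dimensional space, the generic linear bandit analysis pays roughly an extra factor of $d$ relative to the bounds proved in this paper.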
Low-rank bandits. Low-rank bandits are a kind of linear bandit where the environment is determined by an unknown matrix and the actions of the learner are also matrices. Let $\mathcal{E} \subseteq \mathbb{R}^{d_1 \times d_2}$ and $\mathcal{A} \subseteq \mathbb{R}^{d_1 \times d_2}$. A low-rank bandit problem over $\mathcal{E}$ and with actions $\mathcal{A}$ is characterised by a matrix $\Theta_\star \in \mathcal{E}$. The learner plays actions $A_t \in \mathcal{A}$ and the reward is $X_t = \langle A_t, \Theta_\star\rangle + \eta_t$, where $\eta_t$ is noise and the inner product between matrices is interpreted coordinate-wise. So far this is nothing more than a complicated way of defining a linear bandit. The name comes from the fact that in general elements of $\mathcal{E}$ are assumed to be low-rank. The precise nature of the problem is determined by assumptions on $\mathcal{E}$ and the action set $\mathcal{A}$. Our setup is recovered by assuming that $\mathcal{E} = \{xx^\top : x \in \mathbb{S}^{d-1}_r\}$ and $\mathcal{A} = \{xx^\top : x \in \mathcal{B}^d_1\}$.

Jun et al. [2019] assume that $\mathcal{E}$ consists of rank-$p$ matrices and $\mathcal{A} = \{xy^\top : x \in \mathcal{X}, y \in \mathcal{Y}\}$ for some reasonably bounded sets $\mathcal{X} \subseteq \mathbb{R}^{d_1}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_2}$. They prove that the regret is bounded by $O((d_1 + d_2)^{3/2}\sqrt{pn})$. These results cannot be applied directly to the phase retrieval bandit because of the product assumption on the action set. Lu et al. [2021] retain the assumption that $\mathcal{E}$ consists of rank-$p$ matrices, but relax the product form of the action set (while also allowing for generalised linear models). Relying only on mild boundedness assumptions, they show that the regret can be bounded by $O((d_1 + d_2)^{3/2}\sqrt{pn})$. For the bandit phase retrieval problem, $d_1 = d_2 = d$ and $p = 1$, so this algorithm yields an upper bound on the regret for bandit phase retrieval of $O(d^{3/2}\sqrt{n})$. Both Jun et al. [2019] and Lu et al. [2021] conjecture that their upper bounds are optimal. Our results show that this is not true for this sub-problem, despite the fact that the heuristic argument used by these authors holds in this case, as we explain in Section 3. We summarize these comparisons in Table 1.

Table 1: For the low-rank bandits column, $p$ is the rank. We ignore logarithmic factors and universal constants. Note, the $d\sqrt{pn}$ lower bound derived by Lu et al. [2021] does not apply to bandit phase retrieval because it makes use of the richer structure of the more general model.

| | Bandit phase retrieval | Low-rank bandits | Pure exploration |
| --- | --- | --- | --- |
| Upper bounds | | | |
| Abbasi-Yadkori et al. [2011] | $O(d^2\sqrt{n})$ | $O(d^2\sqrt{n})$ | N/A |
| Jun et al. [2019], Lu et al. [2021] | $O(d^{3/2}\sqrt{n})$ | $O(d^{3/2}\sqrt{pn})$ | N/A |
| This work | $O(d\sqrt{n})$ | N/A | $O(d/\sqrt{n})$ |
| Lower bounds | | | |
| Lu et al. [2021] | N/A | $\Omega(d\sqrt{pn})$ | N/A |
| Kotłowski and Neu [2019] | $\Omega(d\sqrt{n})$ | N/A | N/A |
| This work | N/A | $\Omega(d\sqrt{n})$ | $\Omega(d/\sqrt{n})$ |
| This work (non-adaptive learning) | N/A | N/A | $\Omega(d^{3/2}/\sqrt{n})$ |

Some authors use a model where the noise is in the parameter rather than additive, which means the reward is $\langle A_t, \Theta_t\rangle$ with $(\Theta_t)_{t=1}^n$ an independent and identically distributed sequence of low-rank matrices (with unknown distribution). For example, Katariya et al. [2017b,a] and Trinh et al. [2020] assume that $\Theta_t$ is rank-1 almost surely and $\mathcal{A} = \{e_i e_j^\top : 1 \le i, j \le d\}$, which means the learner is trying to identify the largest entry in a matrix.

Adversarial setting. A similar problem has been studied in the adversarial framework by Kotłowski and Neu [2019]. They assume that $(\theta_t)_{t=1}^n$ is a sequence of vectors chosen in secret by an adversary at the start of the game and the learner observes $\langle A_t, \theta_t\rangle^2$. They design an algorithm for which the regret is at most $O(d\sqrt{n})$, while the best lower bound is $\Omega(\sqrt{dn})$.

3 Information-theoretic heuristics and information-directed sampling

Jun et al. [2019, Section 5] argue by comparing the signal-to-noise ratios between linear and low-rank bandits that the minimax regret for problems like bandit phase retrieval should be lower bounded by $\Omega(d^{3/2}\sqrt{n})$. We make this argument a little more formal and explain why it does not yield the right answer in this instance. Let $\lambda$ be the uniform (rotationally invariant) measure on $\mathbb{S}^{d-1}_r$ and suppose that $\theta_\star$ is sampled from $\lambda$. The learner takes an action $A \in \mathcal{B}^d_1$ and observes $X = \langle A, \theta_\star\rangle^2 + \eta$ with $\eta$ sampled from a standard Gaussian. What is the information gained by the learner? By symmetry, all actions on the unit sphere have the same information gain, so let us fix an arbitrary $A \in \mathbb{S}^{d-1}_1$. For $\theta \in \mathbb{S}^{d-1}_r$, let $P_\theta$ be a Gaussian with mean $\langle A, \theta\rangle^2$ and variance 1. Then,

$$I(\theta_\star; X) \le \int\int \mathrm{KL}(P_\theta, P_{\theta'})\, d\lambda(\theta)\, d\lambda(\theta') = \frac{1}{2}\int\int \left(\langle A, \theta\rangle^2 - \langle A, \theta'\rangle^2\right)^2 d\lambda(\theta)\, d\lambda(\theta') \le \frac{3r^4}{d^2 + 2d}, \qquad (1)$$

where the first inequality follows from convexity of the relative entropy. Note that $\frac{d}{2}\log(n)$ bits are needed to code $\theta_\star$ to reasonable accuracy. So if we presume that the rate of information learned stays the same throughout the learning process, then over $n$ rounds the learner can only obtain $O(nr^4/d^2)$ bits by Eq. (1). By setting $r^2 = d^{3/2}\sqrt{\log(n)/n}$ one could be led to believe that the learner cannot identify the optimal direction and the regret would be $\Omega(nr^2) = \Omega(d^{3/2}\sqrt{n\log(n)})$. The main point is that the rate of information accumulated by a careful learner increases over time.
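The scaling in Eq. (1) is easy to check numerically. The following short sketch (our own helper names, not code from the paper) estimates the double integral by Monte Carlo and compares it with the closed-form bound $3r^4/(d^2 + 2d)$:

```python
import numpy as np

def sample_sphere(d, r, size, rng):
    """Uniform samples from the sphere of radius r in R^d."""
    x = rng.standard_normal((size, d))
    return r * x / np.linalg.norm(x, axis=1, keepdims=True)

d, r, m = 20, 0.5, 200_000
rng = np.random.default_rng(0)
A = np.zeros(d)
A[0] = 1.0                       # by symmetry any unit-norm action will do

theta = sample_sphere(d, r, m, rng)
theta_prime = sample_sphere(d, r, m, rng)
# KL between two unit-variance Gaussians with means <A,theta>^2 and <A,theta'>^2
kl = 0.5 * ((theta @ A) ** 2 - (theta_prime @ A) ** 2) ** 2

print(kl.mean())                 # Monte Carlo estimate of the middle term of Eq. (1)
print(3 * r**4 / (d**2 + 2*d))   # right-hand side of Eq. (1)
```

The estimate should come out slightly below the closed-form bound, since the exact value of the middle term is $3r^4/(d^2 + 2d) - r^4/d^2$.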
Information-directed sampling. The combination of our upper bound and the observation above has an important implication. Suppose as above that $\theta_\star$ is sampled uniformly from $\mathbb{S}^{d-1}_r$ and let $A$ be a possibly randomised action. Since the learner cannot know the realisation of $\theta_\star$ initially, her expected regret for any action $a \in \mathcal{B}^d_1$ is

$$\Delta(a) = r^2 - \mathbb{E}[\langle a, \theta_\star\rangle^2] = r^2(1 - \|a\|^2/d) \ge r^2(1 - 1/d).$$

On the other hand, as outlined above, the information gain about $\theta_\star$ is about $r^4/d^2$. Together these results show that the information ratio is bounded by

$$\frac{\mathbb{E}[\Delta(A)]^2}{I(\theta_\star; X, A)} = \Theta(d^2).$$

Since the entropy of a suitable approximation of the optimal action is about $\frac{d}{2}\log(n)$, an application of the information-theoretic analysis by Russo and Van Roy [2014] suggests that the Bayesian regret can be bounded by $O(d^{3/2}\sqrt{n\log(n)})$, which is suboptimal. This time the problem is that we have used the worst-case bound on the information ratio, without taking into account the possibility that the information ratio might decrease over time. We should mention here that a decreasing information ratio was exploited by Devraj et al. [2021] in a recent analysis of Thompson sampling for finite-armed bandits, but there the gain was less dramatic (a logarithm of the number of arms) and no changes to the algorithm were required.

4 Algorithm for bandit phase retrieval

We start by showing that Theorem 1 holds if the learner is given an action that is constant-factor optimal. In the next section we explain how such an action can be identified with low regret. Our algorithm uses the explore-then-commit design principle, which is usually only sufficient for $O(n^{2/3})$ regret. The reason we are able to obtain $O(n^{1/2})$ regret is because of the curvature of the action set, a property that has been exploited in a similar way in a variety of settings [Rusmevichientong and Tsitsiklis, 2010a, Huang et al., 2017, Kirschner et al., 2020].

Theorem 4. Suppose the learner is given an action $\hat{A}_w \in \mathcal{B}^d_1$ such that $\langle \hat{A}_w, \theta_\star\rangle^2 \ge \gamma r^2$ for some universal constant $\gamma \in (0, 1]$. Then there exists a policy $\pi$ for which the regret is at most $R_n(\pi, \theta_\star) \le \mathrm{const}\; d\sqrt{n\log(n)}$.

Proof. By choosing the sign of $\theta_\star$, assume without loss of generality that $\langle \hat{A}_w, \theta_\star\rangle \ge r\sqrt{\gamma}$. Let $m = \lceil d\sqrt{n\log(n)}/r^2\rceil$ and let $\lambda \in (0, 1/2]$ be a constant, depending only on $\gamma$, for which $2(1-\lambda)\sqrt{\gamma} - 2\lambda \ge \sqrt{\gamma}$. If $m \ge n$, then the regret of any policy is upper bounded by $nr^2 \le \mathrm{const}\; d\sqrt{n\log(n)}$, so for the rest of the proof we assume that $m < n$. For the first $m$ rounds the policy cycles over the $2d$ actions $\{(1-\lambda)\hat{A}_w \pm \lambda e_k : k \in [d]\}$. The constrained least squares estimator of $\theta_\star$ based on the data collected over $m$ rounds is

$$\hat{\theta} = \arg\min\left\{L(\theta) : \theta \in \mathcal{B}^d_r \text{ and } \langle \hat{A}_w, \theta\rangle \ge r\sqrt{\gamma}\right\}, \qquad (2)$$

where $L(\theta) = \frac{1}{2}\sum_{t=1}^m (X_t - \langle A_t, \theta\rangle^2)^2$. For the remaining $n - m$ rounds the algorithm plays $A_t = \hat{A} = \hat{\theta}/\|\hat{\theta}\|$. Then, with probability at least $1 - 1/n$,

$$\beta \triangleq 9 + d\log(98m) \ge \sum_{t=1}^m \langle A_t, \hat{\theta} - \theta_\star\rangle^2\langle A_t, \hat{\theta} + \theta_\star\rangle^2 \qquad \text{(by Corollary 10)} \qquad (3)$$
$$\ge \gamma\|\theta_\star\|^2 \sum_{t=1}^m \langle A_t, \hat{\theta} - \theta_\star\rangle^2 \qquad (4)$$
$$\ge \frac{\gamma\lambda^2 m\|\theta_\star\|^2}{d}\,\|\hat{\theta} - \theta_\star\|^2, \qquad (5)$$

where in Eq. (4) we used that for some $k \in [d]$ and $\sigma \in \{-1, 1\}$,

$$\langle A_t, \hat{\theta} + \theta_\star\rangle = \langle (1-\lambda)\hat{A}_w + \sigma\lambda e_k, \hat{\theta} + \theta_\star\rangle \ge (1-\lambda)\langle \hat{A}_w, \hat{\theta} + \theta_\star\rangle - \lambda\|\hat{\theta} + \theta_\star\| \ge 2(1-\lambda)\sqrt{\gamma}\, r - 2\lambda r \ge \sqrt{\gamma}\,\|\theta_\star\| \quad \text{(by the definition of $\lambda$)}.$$

Eq. (5) follows because for any $k \in [d]$,

$$\sum_{\sigma \in \{-1, 1\}} \langle (1-\lambda)\hat{A}_w + \sigma\lambda e_k, \hat{\theta} - \theta_\star\rangle^2 = 2(1-\lambda)^2\langle \hat{A}_w, \hat{\theta} - \theta_\star\rangle^2 + 2\lambda^2\langle e_k, \hat{\theta} - \theta_\star\rangle^2 \ge 2\lambda^2\langle e_k, \hat{\theta} - \theta_\star\rangle^2,$$

which implies that $\sum_{t=1}^m \langle A_t, \hat{\theta} - \theta_\star\rangle^2 \ge \frac{m}{2d}\sum_{k=1}^d 2\lambda^2\langle e_k, \hat{\theta} - \theta_\star\rangle^2 = \frac{\lambda^2 m}{d}\|\hat{\theta} - \theta_\star\|^2$. Rearranging Eq. (5) and using the definition of $\beta$ shows that

$$\|\hat{\theta} - \theta_\star\|^2 \le \frac{d\beta}{\gamma\lambda^2 m\|\theta_\star\|^2} \le \frac{\mathrm{const}\; d^2\log(m)}{m r^2}.$$

Letting $a_\star = \theta_\star/\|\theta_\star\|$ be the optimal action, the regret is bounded by

$$R_n(\pi, \theta_\star) \le m\|\theta_\star\|^2 + n\,\mathbb{E}\left[\langle a_\star, \theta_\star\rangle^2 - \langle \hat{A}, \theta_\star\rangle^2\right] = m\|\theta_\star\|^2 + n\,\mathbb{E}\left[\langle a_\star - \hat{A}, \theta_\star\rangle\langle a_\star + \hat{A}, \theta_\star\rangle\right]$$
$$\le m\|\theta_\star\|^2 + 2n\|\theta_\star\|\,\mathbb{E}\left[\big|\langle a_\star - \hat{A}, \theta_\star\rangle\big|\right] \le m\|\theta_\star\|^2 + \mathrm{const}\; n\,\mathbb{E}\left[\|\hat{\theta} - \theta_\star\|^2\right] \qquad \text{(by Lemma 6)}$$
$$\le m\|\theta_\star\|^2 + \frac{\mathrm{const}\; n d^2\log(m)}{m r^2} \le \mathrm{const}\; d\sqrt{n\log(n)},$$

where the final inequality follows from the definition of $m$.
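The following sketch (our own illustration, reusing the hypothetical `PhaseRetrievalBandit` environment from the problem-setting sketch) mirrors the structure of this proof: cycle over the $2d$ exploration actions $(1-\lambda)\hat{A}_w \pm \lambda e_k$, fit the constrained least-squares objective from Eq. (2) with a generic local solver, and then commit to the estimated direction. The local solver is only a stand-in; the analysis above is for the exact constrained minimiser, which is a non-convex problem.

```python
import numpy as np
from scipy.optimize import minimize

def explore_then_commit(env, n, d, r, A_w, gamma, rng):
    """Sketch of the policy from Theorem 4: explore with (1-lam)*A_w +/- lam*e_k,
    fit a constrained least-squares estimate, then commit to its direction."""
    lam = 0.25                                   # any small constant suffices for the sketch
    m = min(n, int(np.ceil(d * np.sqrt(n * np.log(n)) / r**2)))
    actions, rewards = [], []
    for t in range(m):                           # exploration phase
        k, sign = (t // 2) % d, (-1) ** t
        a = (1 - lam) * A_w + sign * lam * np.eye(d)[k]
        actions.append(a)
        rewards.append(env.play(a))
    A, X = np.array(actions), np.array(rewards)

    def loss(theta):                             # L(theta) = 0.5 * sum_t (X_t - <A_t, theta>^2)^2
        return 0.5 * np.sum((X - (A @ theta) ** 2) ** 2)

    # constraints of Eq. (2); SLSQP is a local heuristic, not a global solver
    cons = [{"type": "ineq", "fun": lambda th: r**2 - th @ th},
            {"type": "ineq", "fun": lambda th: A_w @ th - r * np.sqrt(gamma)}]
    theta_hat = minimize(loss, r * A_w, constraints=cons, method="SLSQP").x

    A_hat = theta_hat / np.linalg.norm(theta_hat)
    for _ in range(n - m):                       # commit phase
        env.play(A_hat)
```

The curvature of the sphere is what makes this work: the regret of the committed action scales with $\|\hat{\theta} - \theta_\star\|^2$ rather than with $\|\hat{\theta} - \theta_\star\|$, which is why explore-then-commit achieves $O(\sqrt{n})$ rather than $O(n^{2/3})$ regret here.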
5 Finding a constant-factor optimal action

To establish Theorem 1 we show there exists an algorithm that interacts with the bandit for a random number of rounds and outputs an action $\hat{A}_w$ that with high probability satisfies $\langle \hat{A}_w, \theta_\star\rangle^2 \ge r^2/64$. Furthermore, the procedure suffers small regret in expectation.

Theorem 5. Let $T$ be the random number of rounds that Algorithm 1 interacts with the bandit, which cannot be more than $n$, and let $\hat{A}_w \in \mathbb{S}^{d-1}_1$ be its output. Then,

1. $\mathbb{E}[T] \le \mathrm{const}\;\frac{d^2}{r^4}\log(n)\log(d)$.
2. With probability at least $1 - 1/n$, either $T = n$ or $\langle \hat{A}_w, \theta_\star\rangle^2 \ge r^2/64$.

1: $E = \emptyset$; set the constants $\epsilon$, $\beta$ and $m$ for the first iteration
2: do
3: sample $v$ uniformly from $\mathbb{S}^{d-1}_1$
4: play $A_t = v$ for $m$ rounds and compute the average reward $\bar{X}$
5: loop until $\bar{X} \ge \epsilon^2$
6: $E = \{v\sqrt{\bar{X}}\}$
7: for $k = 2$ to $d$
8: $u = \sum_{w\in E} w\,/\,\|\sum_{w\in E} w\|$, $\beta = 9(\log(98) + 4\log(n))$ and $m = \big\lceil \mathrm{const}\; d\beta/(\epsilon^2\,\|\sum_{w\in E} w\|^2)\big\rceil$
9: do
10: sample $v$ uniformly from $\mathrm{span}(E)^\perp \cap \mathbb{S}^{d-1}_1$
11: play $A_t \in \{(u+v)/\sqrt{2},\ (u-v)/\sqrt{2},\ v\}$ for $3m$ rounds
12: find the least squares estimator $\hat{\theta}$ constrained to $\mathcal{B}^d_1$
13: loop until $\langle v, \hat{\theta}\rangle^2 \ge \epsilon^2$
14: $E = E \cup \{\langle v, \hat{\theta}\rangle v\}$
15: if $\sum_{w\in E}\|w\|^2 \ge r^2/16$ then break
16: end for
17: return $\hat{A}_w = \sum_{w\in E} w\,/\,\|\sum_{w\in E} w\|$

Algorithm 1: The procedure operates in $d$ iterations. The first iteration is implemented in Lines 1-6 and the remaining $d-1$ iterations in Lines 7-15. What is interesting about Algorithm 1 is that it uses what it has learned in early iterations to increase the statistical efficiency of its estimation.
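Algorithm 1's control flow can be sketched as follows. This is a loose illustration only: the choices of $\epsilon$ and $m$ below are simplified placeholders rather than the values used in the analysis, and the constrained least-squares estimator of Line 12 is replaced by a crude plug-in estimate of $\langle v, \theta_\star\rangle$ built from the three average rewards. It reuses the hypothetical `PhaseRetrievalBandit` environment from the earlier sketch.

```python
import numpy as np

def warm_start(env, n, d, r, rng):
    """Rough sketch of Algorithm 1's control flow (warm-start procedure)."""
    eps2 = r**2 / (2 * d)                        # placeholder accuracy level
    beta = 9 * (np.log(98) + 4 * np.log(n))
    m = int(np.ceil(beta / eps2**2))             # placeholder per-action sample size
    t = 0                                        # rounds used so far

    # first iteration: find any direction with non-trivial signal
    while True:
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        xbar = np.mean([env.play(v) for _ in range(m)])
        t += m
        if xbar >= eps2 or t >= n:
            break
    E = [v * np.sqrt(max(xbar, 0.0))]

    # later iterations: estimate <v, theta> for fresh orthogonal directions v,
    # reusing the already-identified direction u to boost the signal
    for _ in range(2, d + 1):
        if t >= n:
            break
        u = np.sum(E, axis=0)
        u /= np.linalg.norm(u)
        while True:
            B = np.array(E).T                                   # d x |E|
            g = rng.standard_normal(d)
            g -= B @ np.linalg.lstsq(B, g, rcond=None)[0]       # project out span(E)
            v = g / np.linalg.norm(g)                           # uniform on span(E)^perp sphere
            x_plus = np.mean([env.play((u + v) / np.sqrt(2)) for _ in range(m)])
            x_minus = np.mean([env.play((u - v) / np.sqrt(2)) for _ in range(m)])
            x_v = np.mean([env.play(v) for _ in range(m)])
            t += 3 * m
            uv_est = (x_plus - x_minus) / 2                     # estimates <u,theta><v,theta>
            u_est2 = max(x_plus + x_minus - x_v, 0.0)           # estimates <u,theta>^2
            v_est = uv_est / np.sqrt(u_est2) if u_est2 > 0 else 0.0
            if v_est**2 >= eps2 or t >= n:
                break
        E.append(v_est * v)
        if sum(float(w @ w) for w in E) >= r**2 / 16:
            break

    A_w = np.sum(E, axis=0)
    return A_w / np.linalg.norm(A_w), t
```

Even in this simplified form the key feature survives: because the known direction $u$ already carries a constant fraction of the signal, the product $\langle u, \theta_\star\rangle\langle v, \theta_\star\rangle$ can be estimated much faster than $\langle v, \theta_\star\rangle^2$ alone, which is where the adaptive speed-up over non-adaptive exploration comes from.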
Proof of Theorem 5. Note that the vectors $u$ and $v$ computed in each iteration are orthogonal, which means that $\|(u+v)/\sqrt{2}\| = \|(u-v)/\sqrt{2}\| = \|v\| = 1$. Hence the actions of the algorithm are always in $\mathcal{B}^d_1$. The main argument of the proof is based on an induction to show that with high probability, when the execution of the algorithm ends, there exists an $s \in \{-1, 1\}$ such that for all $w \in E$,

(a) $\|w\|^2 \ge \epsilon^2$; and
(b) $s\langle w, \theta_\star\rangle \in [\frac{1}{2}\|w\|^2,\, 2\|w\|^2]$.

We proceed in five steps. First, we prove that if the above holds and the algorithm halts before $n$ rounds are over, then the vector returned is a suitable approximation of $\theta_\star/\|\theta_\star\|$. Second, we upper bound the probability of certain bad events. In the third and fourth steps we prove the base case and induction step for (a) and (b). In the last step we bound the expected running time.

Step 1: Correctness. Suppose that (a) and (b) above hold and the algorithm halts at the end of iteration $k$. Then,

$$\langle \hat{A}_w, \theta_\star\rangle^2 = \frac{\big(\sum_{w\in E}\langle w, \theta_\star\rangle\big)^2}{\|\sum_{w\in E} w\|^2} \ge \frac{\big(\frac{1}{2}\sum_{w\in E}\|w\|^2\big)^2}{\sum_{w\in E}\|w\|^2} = \frac{1}{4}\sum_{w\in E}\|w\|^2 \ge \frac{r^2}{64},$$

where the first inequality follows from orthogonality of the $w \in E$ and (b) above. The second inequality follows from the stopping condition in Line 15 of Algorithm 1, (a) above and the definition of $\epsilon$. Part (2) of the theorem follows by showing that (a) and (b) above hold with probability at least $1 - 1/n$.

Step 2: Failure events. The algorithm computes some kind of estimator at the end of each do/loop. Since the algorithm cannot play more than $n$ actions, the number of estimators computed is naively upper bounded by $n$. A union bound over all estimates and the concentration bounds in Lemma 7 and Corollary 10 show that with probability at least $1 - 1/n$ the following both hold:

- For all $v$ sampled in the first iteration and corresponding average rewards $\bar{X}$,
$$\big|\bar{X} - \langle v, \theta_\star\rangle^2\big| \le \frac{\epsilon^2}{2}, \qquad (6)$$
where $m$ is defined in Line 1 of Algorithm 1.
- Let $D = (A_s)_s$ be the actions played in the inner loop of some iteration $k \ge 2$ and $\hat{\theta}$ be the corresponding least-squares estimator. Then,
$$\sum_{a\in D}\langle a, \hat{\theta} - \theta_\star\rangle^2\langle a, \hat{\theta} + \theta_\star\rangle^2 \le 9(\log(98) + 4\log(n)) \triangleq \beta. \qquad (7)$$

We assume both of the above hold in all relevant iterations for the remainder.

Step 3: Base case. The next step is to show that (a) and (b) hold with high probability after the first iteration. Consider the operation of the algorithm in the first inner loop. After sampling $v \in \mathbb{S}^{d-1}_1$, the algorithm plays $v$ for $m$ rounds and computes the average reward. Let $v$ be the last sampled action before the iteration halts and $w = v\sqrt{\bar{X}}$. By the stopping condition in Line 5, $\|w\|^2 = \bar{X} \ge \epsilon^2$. Without loss of generality, we choose the sign of $\theta_\star$ so that $\langle v, \theta_\star\rangle \ge 0$. Then by Eq. (6),
$$\langle w, \theta_\star\rangle = \sqrt{\bar{X}}\,\langle v, \theta_\star\rangle \in \big[\tfrac{1}{2}\|w\|^2,\, 2\|w\|^2\big].$$
This establishes the base case.

Step 4: Inductive step. Assume that (a) and (b) above hold for $E$ at the end of iteration $k$ and, without loss of generality, that $s = 1$. Let $u$ be the value computed in Line 8 of Algorithm 1. Then,
$$\langle u, \theta_\star\rangle = \frac{\sum_{w\in E}\langle w, \theta_\star\rangle}{\|\sum_{w\in E} w\|} \ge \frac{1}{2}\sqrt{\sum_{w\in E}\|w\|^2}.$$
Let $\mathcal{A} = \{(u+v)/\sqrt{2},\ (u-v)/\sqrt{2},\ v\}$, which is the set of actions played in the inner loop of iteration $k+1$ after sampling $v$ for the last time, and let $\hat{\theta}$ be the corresponding least-squares estimate. We consider two cases. First, if $\langle u, \hat{\theta} + \theta_\star\rangle \ge 2|\langle v, \hat{\theta} + \theta_\star\rangle|$, then by Eq. (7),
$$\beta \ge \sum_{a\in D}\langle a, \theta_\star - \hat{\theta}\rangle^2\langle a, \theta_\star + \hat{\theta}\rangle^2 \ge \frac{m}{16}\langle u+v, \theta_\star - \hat{\theta}\rangle^2\langle u, \theta_\star\rangle^2 + \frac{m}{16}\langle u-v, \theta_\star - \hat{\theta}\rangle^2\langle u, \theta_\star\rangle^2 \ge \frac{m}{8}\langle v, \theta_\star - \hat{\theta}\rangle^2\langle u, \theta_\star\rangle^2 \ge \frac{mkr^2}{16d}\langle v, \theta_\star - \hat{\theta}\rangle^2.$$
Rearranging shows that
$$\langle v, \theta_\star - \hat{\theta}\rangle^2 \le \frac{16d\beta}{mkr^2}. \qquad (8)$$
For the second case, $\langle u, \hat{\theta} + \theta_\star\rangle \le 2|\langle v, \hat{\theta} + \theta_\star\rangle|$. Then,
$$\beta \ge \sum_{a\in D}\langle a, \theta_\star - \hat{\theta}\rangle^2\langle a, \theta_\star + \hat{\theta}\rangle^2 \ge \frac{m}{4}\langle v, \theta_\star - \hat{\theta}\rangle^2\langle u, \hat{\theta} + \theta_\star\rangle^2 \ge \frac{mkr^2}{8d}\langle v, \theta_\star - \hat{\theta}\rangle^2,$$
and again Eq. (8) holds. Summarising, $\langle v, \hat{\theta}\rangle$ is an estimator of $\langle v, \theta_\star\rangle$ up to accuracy $\epsilon/2$. By the definition of the algorithm, the iteration only ends if $|\langle v, \hat{\theta}\rangle| \ge \epsilon$. Therefore, with $w = \langle v, \hat{\theta}\rangle v$, we have $\|w\|^2 = \langle v, \hat{\theta}\rangle^2 \ge \epsilon^2$. Furthermore,
$$\langle w, \theta_\star\rangle = \langle v, \hat{\theta}\rangle\langle v, \theta_\star\rangle \in \big[\|w\|^2/2,\, 2\|w\|^2\big].$$
Therefore if (a) and (b) hold for $E$ computed after iteration $k$, they also hold for $E$ computed after iteration $k+1$.

Step 5: Running time. The length of an iteration is determined by the corresponding value of $m$ and the number of samples of $v$. The former is an iteration-dependent constant, while the latter depends principally on how many samples are needed before $|\langle v, \theta_\star\rangle|$ is suitably large. The law of $v$ is the uniform distribution on $\mathbb{S}^{d-1}_1 \cap \mathrm{span}(E)^\perp$, which is the uniform distribution on a sphere of dimension $d - 1 - |E|$ embedded in $\mathbb{R}^d$. The squared norm of the projection of $\theta_\star$ onto $\mathrm{span}(E)^\perp$ is at least $3r^2/4$, where we used the induction hypothesis and the fact that the stopping condition in Line 15 has not yet triggered. Therefore, when $v$ is sampled uniformly from $\mathbb{S}^{d-1}_1 \cap \mathrm{span}(E)^\perp$, by Lemma 8,
$$\mathbb{P}\left(\langle v, \theta_\star\rangle^2 \ge \frac{3\epsilon^2}{2}\right) \ge \mathrm{const} > 0.$$
Furthermore, by the concentration analysis in the previous step, an iteration will end once a $v$ has been sampled for which $\langle v, \theta_\star\rangle^2 \ge 3\epsilon^2/2$. Hence, the expected number of times the algorithm samples $v$ per iteration is constant and a simple calculation using the definition of $m$ in Lines 1 and 8 shows that the expected number of rounds used by the algorithm is at most
$$\mathbb{E}[T] \le \mathrm{const}\;\frac{d^2}{r^4}\log(n)\log(d).$$
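A standard way to implement the sampling step in Line 10 (sketched here with our own helper name, not code from the paper) is to project a standard Gaussian onto $\mathrm{span}(E)^\perp$ and normalise; rotational invariance of the Gaussian makes the result uniform on the unit sphere of the complement.

```python
import numpy as np

def sample_orthogonal_sphere(E, d, rng):
    """Uniform sample from S^{d-1} intersected with span(E)^perp."""
    g = rng.standard_normal(d)
    if E:                                    # project out span(E)
        Q, _ = np.linalg.qr(np.array(E).T)   # orthonormal basis of span(E)
        g = g - Q @ (Q.T @ g)
    return g / np.linalg.norm(g)

rng = np.random.default_rng(0)
E = [np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0, 0.0])]
v = sample_orthogonal_sphere(E, 4, rng)
print(v, [float(v @ w) for w in E])          # inner products with E are ~0
```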
6 Proof of Theorem 1 and Theorem 2

Proof of Theorem 1. Run Algorithm 1 and, if it halts, feed the returned action to the input of the explore-then-commit algorithm analysed in Theorem 4 (with $\gamma = 1/64$). Algorithm 1 fails to return a suitable action with probability at most $1/n$, so the contribution of this event to the regret is negligible. By Theorem 5, the regret incurred by Algorithm 1 is bounded by

$$\mathbb{E}\left[\sum_{t=1}^T\left(r^2 - \langle A_t, \theta_\star\rangle^2\right)\right] \le \min\left\{nr^2,\ \mathrm{const}\;\frac{d^2}{r^2}\log(d)\log(n)\right\} \le \mathrm{const}\; d\sqrt{n\log(d)\log(n)}.$$

Combining this with the regret bound established in Theorem 4 yields the result.

Proof of Theorem 2. We use a standard reduction [Lattimore and Szepesvári, 2020, Chapter 33]. Let $\pi$ be the policy used in the proof of Theorem 1 with $\hat{A}_\star$ sampled uniformly from $(A_t)_{t=1}^n$. By Theorem 1,

$$r_n(\pi, \theta_\star) = \frac{1}{n}R_n(\pi, \theta_\star) \le \mathrm{const}\; d\sqrt{\frac{\log(n)\log(d)}{n}}.$$

7 Proof of Theorem 3

Let $\pi$ be a fixed policy and for $\theta \in \mathbb{R}^d$ let $P_\theta$ be the measure on the sequence of outcomes $H_n = (A_1, X_1, \ldots, A_n, X_n)$ induced by the interaction between $\pi$ and the phase retrieval model determined by $\theta$. Let $\mathbb{E}_\theta$ denote the expectation with respect to $P_\theta$. Let $r$ be a positive constant to be tuned subsequently and $\sigma$ be the uniform (Haar) measure on $\mathbb{S}^{d-1}_r$. Let $Q = \int P_\theta\, d\sigma(\theta)$ be the Bayesian mixture measure. For $\theta \in \mathbb{R}^d$, let $E_\theta$ be the event given by $\langle \hat{A}_\star, \theta\rangle^2 \ge \frac{3r^2}{4}$. By Fano's inequality [Gerchinovitz et al., 2020, Lemma 5],

$$\int P_\theta(E_\theta)\, d\sigma(\theta) \le \frac{\int \mathrm{KL}(P_\theta, Q)\, d\sigma(\theta) + \log(2)}{\log\big(1/\int Q(E_\theta)\, d\sigma(\theta)\big)}. \qquad (9)$$

We now bound the numerator and denominator in Eq. (9) to show that the right-hand side is at most $1/2$ and then complete the proof using the definition of the regret and $E_\theta$.

Step 1: Bounding the denominator in Eq. (9). By exchanging the order of integrals in the denominator of Eq. (9), it follows that

$$\int Q(E_\theta)\, d\sigma(\theta) = \int\left(\int \mathbb{1}_{E_\theta}\, d\sigma(\theta)\right) dQ. \qquad (10)$$

If $U$ is sampled uniformly from $\mathbb{S}^{d-1}_1$, then by a concentration bound for spherical measures [Dasgupta and Gupta, 2003, Lemma 2.2],

$$\mathbb{P}\left(d\langle U, e_1\rangle^2 \ge \delta\right) \le \exp(-\delta/4) \quad \text{for all } \delta > 6.$$

By scaling and rotating and choosing $\delta = 3d/4$, it follows that for any $\hat{A}_\star \in \mathcal{B}^d_1$,

$$\int \mathbb{1}\left(\langle \hat{A}_\star, \theta\rangle^2 \ge \tfrac{3r^2}{4}\right) d\sigma(\theta) \le \exp(-3d/16).$$

Therefore, by Eq. (10), $\int Q(E_\theta)\, d\sigma(\theta) \le \exp(-3d/16)$.

Step 2: Bounding the numerator in Eq. (9). By the convexity of the KL-divergence,

$$\int \mathrm{KL}(P_\theta, Q)\, d\sigma(\theta) \le \int\int \mathrm{KL}(P_\theta, P_{\theta'})\, d\sigma(\theta)\, d\sigma(\theta').$$

By the chain rule of the KL-divergence,

$$\mathrm{KL}(P_\theta, P_{\theta'}) = \mathbb{E}_\theta\left[\sum_{t=1}^n \mathrm{KL}\big(P_\theta(X_t = \cdot \mid A_t),\ P_{\theta'}(X_t = \cdot \mid A_t)\big)\right].$$

A straightforward computation leads to

$$\mathrm{KL}(P_\theta, P_{\theta'}) = \frac{1}{2}\,\mathbb{E}_\theta\left[\sum_{t=1}^n\left(\langle A_t, \theta\rangle^2 - \langle A_t, \theta'\rangle^2\right)^2\right].$$

Since $(A_t)_{t=1}^n$ are independent of $(X_t)_{t=1}^n$, we can interchange the expectation and integral such that

$$\int\int \mathrm{KL}(P_\theta, P_{\theta'})\, d\sigma(\theta)\, d\sigma(\theta') = \frac{1}{2}\,\mathbb{E}\left[\sum_{t=1}^n \int\int\left(\langle A_t, \theta\rangle^2 - \langle A_t, \theta'\rangle^2\right)^2 d\sigma(\theta)\, d\sigma(\theta')\right],$$

where the expectation is with respect to $(A_t)_{t=1}^n$, which does not depend on $\theta$ by assumption. When $\theta$ is uniformly distributed on $\mathbb{S}^{d-1}_r$ and $A \in \mathcal{B}^d_1$ is arbitrary,

$$\int \langle A, \theta\rangle^4\, d\sigma(\theta) = \frac{3r^4\|A\|^4}{d^2 + 2d} \quad\text{and}\quad \int \langle A, \theta\rangle^2\, d\sigma(\theta) = \frac{r^2\|A\|^2}{d}.$$

Therefore,

$$\int \mathrm{KL}(P_\theta, Q)\, d\sigma(\theta) \le \frac{3nr^4}{d^2 + 2d}.$$

Step 3: Lower bounding the regret. Let $r^2 = \sqrt{d^3/(32n)}$, which lies in $[0, 1]$ by the assumption on $n$ and $d$. Combining the previous two steps with Eq. (9) shows that $\int P_\theta(E_\theta)\, d\sigma(\theta) \le 1/2$. Therefore there exists a $\theta \in \mathbb{S}^{d-1}_r$ with $P_\theta(E_\theta) \le 1/2$, which implies that

$$r_n(\pi, \theta) = r^2 - \mathbb{E}_\theta\left[\langle \hat{A}_\star, \theta\rangle^2\right] \ge \frac{r^2}{8} \ge \mathrm{const}\;\frac{d^{3/2}}{\sqrt{n}}.$$

8 Discussion

Unknown radius. The assumption that $r = \|\theta_\star\|$ is known to the learner is easily relaxed by estimating $\|\theta_\star\|$. Note first that all our analysis holds with only trivial modifications if $r \in [\frac{1}{2}\|\theta_\star\|, \|\theta_\star\|]$. Next, if $A$ is sampled uniformly from $\mathbb{S}^{d-1}_1$ and $X = \langle A, \theta_\star\rangle^2 + \eta$ with $\eta$ a standard Gaussian, then

$$\mathbb{E}[X] = \frac{\|\theta_\star\|^2}{d} \quad\text{and}\quad \mathbb{V}[X] = 1 + \frac{2(d-1)\|\theta_\star\|^4}{d^3 + 2d^2} = \Theta(1).$$

Therefore $\|\theta_\star\|$ can be estimated to within an arbitrary multiplicative factor and at confidence level $1 - 1/n$ using $\mathrm{const}\;\frac{d^2}{\|\theta_\star\|^4}\log(n)$ interactions with the bandit.
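To illustrate, here is a minimal sketch (our own naming, reusing the hypothetical environment from the earlier sketches) of the plug-in estimator for $\|\theta_\star\|^2$ described above:

```python
import numpy as np

def estimate_radius_squared(env, d, m, rng):
    """Estimate ||theta||^2 by playing m uniformly random unit-sphere actions:
    each reward has mean ||theta||^2 / d, so d times the empirical mean is an
    unbiased estimate of ||theta||^2."""
    total = 0.0
    for _ in range(m):
        a = rng.standard_normal(d)
        a /= np.linalg.norm(a)
        total += env.play(a)
    return d * total / m
```

Since each reward has variance $\Theta(1)$, roughly $\mathrm{const}\; d^2\log(n)/\|\theta_\star\|^4$ such plays suffice for a constant multiplicative error with probability at least $1 - 1/n$, matching the count quoted above.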
Computation complexity. The only computational challenge is finding the least squares estimates, which is a non-convex optimisation problem. Candès et al. [2015b] proposed a Wirtinger flow algorithm that starts with a spectral initialization and then refines this initial estimate using a local update like gradient descent. The computational complexity of the Wirtinger flow algorithm with $\varepsilon$-accuracy is $O(nd^2\log(1/\varepsilon))$, where $n$ is the number of samples.

Adversarial setting. Kotłowski and Neu [2019] study the adversarial version of this problem, where the learner observes $\langle A_t, \theta_t\rangle^2$ and $(\theta_t)_{t=1}^n$ is an adversarially chosen sequence with $\theta_t \in \mathcal{B}^d_1$ for all $t$. They prove an upper bound of $R_n = O(d\sqrt{n\log(n)})$ and a lower bound of $\Omega(\sqrt{dn})$. Natural attempts at improving the lower bound all fail. We believe that the upper bound is loose, but proving this remains delicate. No warm starting procedure will work anymore because the information gained may be useless in the presence of a change point. New ideas are needed.

Rank-$p$. Perhaps the most natural open question is whether or not our analysis can be extended to the low-rank bandit problem without our particular assumptions on the action set and environment matrices.

Principled algorithms. Can optimism or information-directed sampling be made to work? The main challenge is to understand the sample paths of these algorithms before learning takes place.

References

Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.

P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.

S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.

T. Cai, X. Li, and Z. Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. The Annals of Statistics, 44(5):2221-2251, 2016.

E. J. Candès, Y. C. Eldar, T. Strohmer, and V. Voroninski. Phase retrieval via matrix completion. SIAM Review, 57(2):225-251, 2015a.

E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985-2007, 2015b.

Y. Chen and E. J. Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems. Communications on Pure and Applied Mathematics, 70(5):822-883, 2017.

Y. Chen, Y. Chi, J. Fan, and C. Ma. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, 176(1):5-37, 2019.

W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214, 2011.

V. Dani, T. Hayes, and S. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, 2008.

S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60-65, 2003.

A. M. Devraj, B. Van Roy, and K. Xu. A bit better? Quantifying information for bandit learning. arXiv preprint arXiv:2102.09488, 2021.

S. Gerchinovitz, P. Ménard, and G. Stoltz. Fano's inequality for random variables. Statistical Science, 35(2):178-201, 2020.

B. Huang, K. Huang, S. Kakade, J. Lee, Q. Lei, R. Wang, and J. Yang. Optimal gradient-based algorithms for non-concave bandit optimization. arXiv preprint arXiv:2107.04518, 2021.

R. Huang, T. Lattimore, A. György, and Cs. Szepesvári. Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities. Journal of Machine Learning Research, 18:1-31, 2017.

K.-S. Jun, R. Willett, S. Wright, and R. Nowak. Bilinear bandits with low-rank structure. In International Conference on Machine Learning, pages 3163-3172. PMLR, 2019.
S. Katariya, B. Kveton, Cs. Szepesvári, C. Vernade, and Z. Wen. Bernoulli rank-1 bandits for click feedback. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2001-2007, 2017a.

S. Katariya, B. Kveton, Cs. Szepesvári, C. Vernade, and Z. Wen. Stochastic rank-1 bandits. In Artificial Intelligence and Statistics, pages 392-401. PMLR, 2017b.

J. Kirschner, T. Lattimore, and A. Krause. Information directed sampling for linear partial monitoring. In Proceedings of the Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2328-2369. PMLR, 2020.

W. Kotłowski and G. Neu. Bandit principal component analysis. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1994-2024. PMLR, 2019.

T. Lattimore and Cs. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.

Y. Lu, A. Meisami, and A. Tewari. Low-rank generalized linear bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 460-468. PMLR, 2021.

P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, 2010a.

P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, 2010b.

D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pages 1583-1591. Curran Associates, Inc., 2014.

J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131-1198, 2018.

C. Trinh, E. Kaufmann, C. Vernade, and R. Combes. Solving Bernoulli rank-one bandits with unimodal Thompson sampling. In 31st International Conference on Algorithmic Learning Theory, pages 1-28, 2020.