# Quantum Boosting

Srinivasan Arunachalam (IBM Research, Yorktown Heights, USA) and Reevu Maity (Clarendon Laboratory, University of Oxford)

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract. Boosting is a technique that converts a weak, inaccurate machine learning algorithm into a strong, accurate learning algorithm. The AdaBoost algorithm of Freund and Schapire (for which they were awarded the Gödel Prize in 2003) is one of the most widely used boosting algorithms, with many applications in theory and practice. Suppose we have a $\gamma$-weak learner for a Boolean concept class $C$ that takes time $R(C)$; then the time complexity of AdaBoost scales as $\mathrm{VC}(C)\cdot\mathrm{poly}(R(C), 1/\gamma)$, where $\mathrm{VC}(C)$ is the VC dimension of $C$. In this paper, we show how quantum techniques can improve the time complexity of classical AdaBoost. To this end, suppose we have a $\gamma$-weak quantum learning algorithm for a Boolean concept class $C$ that takes time $Q(C)$; we introduce a quantum boosting algorithm whose complexity scales as $\sqrt{\mathrm{VC}(C)}\cdot\mathrm{poly}(Q(C), 1/\gamma)$, thereby achieving a quadratic quantum improvement over classical AdaBoost in terms of $\mathrm{VC}(C)$.

## 1. Introduction

In the last decade, machine learning (ML) has received tremendous attention due to its success in practice. Given the broad applications of ML, there has been a lot of interest in understanding which learning tasks admit a quantum speedup. In this direction, there has been a flurry of quantum algorithms for practically relevant machine learning tasks that theoretically promise either exponential or polynomial quantum speedups over classical computers. In the past, theoretical works on quantum machine learning (QML) have focused on developing efficient quantum algorithms with favourable quantum complexities for interesting learning problems. Recently, there have been efforts to understand the interplay between learning algorithms and noisy quantum devices.

The field of QML has given us algorithms for various quantum and classical learning tasks, such as: (i) quantum improvements to classical algorithms for practically motivated machine learning tasks such as perceptron learning (Kapoor et al., 2016), support vector machines (Rebentrost et al., 2013), kernel-based classifiers (Havlíček et al., 2019; Li et al., 2019), algorithms to compute gradients (Rebentrost et al., 2019; Gilyén et al., 2019), clustering (Kerenidis et al., 2019; Aïmeur et al., 2007), and linear algebra (Prakash, 2014); (ii) learnability of quantum objects (Rocchetto, 2018; Yoganathan, 2019; Aaronson, 2007) and shadow tomography of quantum states (Aaronson, 2018; Apeldoorn & Gilyén, 2019); (iii) a quantum framework for learning Boolean-valued concept classes (Bernstein & Vazirani, 1993; Bshouty & Jackson, 1999; Atıcı & Servedio, 2005; Arunachalam et al., 2019); (iv) quantum algorithms for optimization (Harrow et al., 2009; Apeldoorn et al., 2020; Chakrabarti et al., 2018); (v) quantum algorithms for machine learning based on generative models (Lloyd & Weedbrook, 2018; Gao et al., 2017).

While these results are promising and establish that quantum computers can indeed provide an improvement for interesting machine learning tasks, several challenges remain. One important question is whether the assumptions made in some quantum machine learning algorithms are practically feasible.
Recently, a couple of works (Chia et al., 2019; Jethwani et al., 2019) demonstrated that under certain assumptions QML algorithms can be dequantized, i.e., they showed the existence of efficient classical algorithms for machine learning tasks that were previously believed to provide exponential quantum speedups. In this paper we address another important question: suppose we implement a QML algorithm $A$ on a quantum device and, due to noise in the device, the performance of $A$ is weak, i.e., the output of $A$ is correct only on a slightly-better-than-half fraction of the inputs. Can we boost the performance of $A$ so that $A$'s output is correct on 99% of the inputs? The classical Adaptive Boosting algorithm (also referred to as AdaBoost) due to (Freund & Schapire, 1999) can be used immediately to convert a weak quantum learning algorithm into a strong one. In this paper, we provide a quantum boosting algorithm that quadratically improves upon the classical AdaBoost algorithm. Using our quantum boosting algorithm, not only can we convert a weak and inaccurate QML algorithm into a strong, accurate algorithm, but we can do it in time that is quadratically faster than classical boosting techniques.

## 2. Classical Boosting

We now briefly describe Valiant's (1984) Probably Approximately Correct (PAC) model of learning. Let $n\geq 1$ and let $C\subseteq\{c:\{0,1\}^n\to\{-1,1\}\}$ be a concept class. For $\gamma>0$, we say an algorithm $A$ $\gamma$-learns $C$ in the PAC model if, for every $c\in C$ and distribution $D:\{0,1\}^n\to[0,1]$, given examples $(x, c(x))$ where $x\sim D$, $A$ outputs $h:\{0,1\}^n\to\{-1,1\}$ such that $\Pr_{x\sim D}[h(x)=c(x)]\geq 1/2+\gamma$. In the quantum PAC model, we allow a quantum learner to possess a quantum computer and quantum examples $\sum_x\sqrt{D_x}\,|x, c(x)\rangle$. We call $\gamma$ the bias of the algorithm, i.e., $\gamma$ measures the advantage over random guessing. We say $A$ is a weak learner (resp. strong learner) if the bias $\gamma$ scales inverse-polynomially with $n$, i.e., $\gamma=1/\mathrm{poly}(n)$ (resp. $\gamma$ is a universal constant independent of $n$; for simplicity we let $\gamma=1/6$).

In the early 1990s, (Schapire, 1990; Freund, 1995; Freund & Schapire, 1999) came up with the beautiful boosting algorithm called AdaBoost, which efficiently solves the following problem: given a weak learner as a black box, can we use this black box to obtain a strong learner? The AdaBoost algorithm of Freund and Schapire was one of the few theoretical boosting algorithms simple enough to be extremely useful in practice, with applications ranging from game theory, statistics and optimization to vision, speech recognition and information geometry. Given the success of AdaBoost in theory and practice, Freund and Schapire won the Gödel Prize in 2003.

AdaBoost algorithm. We now give a sketch of the classical AdaBoost algorithm. Let $A$ be a weak PAC learning algorithm for $C$ that runs in time $R(C)$ and has bias $\gamma>0$, i.e., $A$ does slightly better than random guessing (think of $\gamma$ as inverse-polynomial in $n$). The goal of boosting is the following: for every unknown distribution $D:\{0,1\}^n\to[0,1]$ and unknown concept $c\in C$, construct a hypothesis $H$ that satisfies $\Pr_{x\sim D}[H(x)=c(x)]\geq 2/3$. The AdaBoost algorithm of Freund and Schapire produces such an $H$ by invoking $A$ polynomially many times. The algorithm works as follows: it first obtains $M$ labelled examples $S=\{(x_i, c(x_i)) : i\in[M]\}$ where $x_i\sim D$, and then runs an iterative procedure for $T$ steps (for some $M, T$ which we specify later). Let $D^1$ be the uniform distribution on $S$.
At the $t$-th step, AdaBoost defines a distribution $D^t$ depending on $D^{t-1}$ and invokes $A$ on the sample $S$ under the distribution $D^t$. Using the output hypothesis $h_t$ of the weak learner $A$, AdaBoost computes the weighted error $\varepsilon_t=\Pr_{x\sim D^t}[h_t(x)\neq c(x)]$, which is the probability that $h_t$ misclassifies a randomly selected training example drawn from the distribution $D^t$. The algorithm then uses $\varepsilon_t$ to compute a weight $\alpha_t=\frac{1}{2}\ln\big((1-\varepsilon_t)/\varepsilon_t\big)$ and updates the distribution $D^t$ to $D^{t+1}$ as follows:

$$D^{t+1}_x=\frac{D^t_x}{Z_t}\cdot\begin{cases} e^{-\alpha_t} & \text{if } h_t(x)=c(x)\\ e^{\alpha_t} & \text{otherwise,}\end{cases} \qquad (1)$$

where $Z_t=\sum_{x\in S} D^t_x\exp(-c(x)\,\alpha_t\, h_t(x))$. (This distribution update is an instance of the Multiplicative Weights Update (MWU) method; see (Arora et al., 2012) on how one can cast AdaBoost into the standard MWU framework.) After $T$ iterations, the algorithm outputs the hypothesis $H(x)=\mathrm{sign}\big(\sum_{t=1}^T\alpha_t h_t(x)\big)$, where $\alpha_t$ is the weight and $h_t$ the weak hypothesis computed in the $t$-th iteration. (A short code sketch of this loop is given at the end of Section 3.)

It remains to answer three important questions: (1) what is $T$, (2) what is $M$, and (3) why is $H$ a strong hypothesis? The punchline of AdaBoost is the following: by selecting the number of iterations $T=O(\log M)$, the hypothesis $H$ satisfies $H(x)=c(x)$ for every $x\in S$. However, note that this alone does not imply that $H$ is a strong hypothesis. Using a standard Hoeffding bound, Freund and Schapire showed the following: with high probability (over the samples in $S$), if the number of labelled examples $M$ is at least $O(\mathrm{VC}(C))$ (where $\mathrm{VC}(C)$ is a combinatorial dimension associated with $C$), then $H$ not only perfectly classifies every $x\in S$, but also satisfies $\Pr_{x\sim D}[H(x)=c(x)]\geq 2/3$. Hence $H$ is a strong hypothesis for the target concept $c$ under the unknown distribution $D$.

Theorem 2.1 (Schapire & Freund, 2012). Fix $\eta,\gamma>0$. Let $n\geq 1$ and let $C\subseteq\{c:\{0,1\}^n\to\{-1,1\}\}$ be a concept class. Let $D:\{0,1\}^n\to[0,1]$ be an unknown distribution. Let $A$ be a weak PAC algorithm that takes time $R(C)$ to learn $C$ with bias $\gamma$. Let $M$ be the smallest integer exceeding $\frac{\mathrm{VC}(C)}{\eta^2\gamma^2}\log\big(\mathrm{VC}(C)/\gamma^2\big)$. Suppose we run AdaBoost for $T\geq ((\log M)\log(1/\delta))/(2\gamma^2)$ rounds. Then with probability $\geq 1-\delta$ (over the randomness of the algorithm), we obtain a hypothesis $H$ that has zero training error and small generalization error $\Pr_{x\sim D}[H(x)\neq c(x)]\leq\eta$. Moreover, the time complexity of the classical AdaBoost algorithm is $\widetilde{O}\big(\frac{\mathrm{VC}(C)}{\eta^2}\cdot\frac{R(C)}{\gamma^4}\cdot\log(1/\delta)\big)$, where $\widetilde{O}(\cdot)$ hides poly-logarithmic factors in its argument.

## 3. Our results

The main contribution of this paper is a quantum algorithm that is quadratically faster in terms of $\mathrm{VC}(C)$ at obtaining a strong learner for the concept class $C$.

Theorem 3.1 (Informal). Let $n\geq 1$ and $C\subseteq\{c:\{0,1\}^n\to\{-1,1\}\}$. Let $A$ be a weak quantum PAC learning algorithm that takes time $Q(C)$ and has bias $\gamma$. Then the quantum complexity of converting $A$ into a strong PAC learning algorithm is $\widetilde{O}\big(\sqrt{\mathrm{VC}(C)}\cdot Q(C)^{3/2}\cdot n^2\cdot \mathrm{poly}(1/\gamma)\big)$.

We now make a few remarks regarding our main theorem:

1. Comparing our bound with the classical AdaBoost complexity, we obtain a quadratic improvement in terms of $\mathrm{VC}(C)$. Also observe that the time complexity $Q(C)$ of quantum PAC learning could be polynomially or even exponentially smaller than the classical PAC learning time complexity $R(C)$. (In (Arunachalam & Wolf, 2018), the authors prove that the sample complexities of classical and quantum PAC learning are the same up to constant factors, but there do exist concept classes, demonstrated by (Servedio & Gortler, 2004), for which there can be exponential separations in time complexity between quantum and classical learning.)

2. Although our dependence on $1/\gamma$ is worse than the classical complexity, we believe our complexity should be improvable using quantum techniques (and we leave it as an open question to improve the exponent of the factor $1/\gamma$). We remark that although the complexity of our quantum boosting algorithm is weaker than the classical complexity in terms of $1/\gamma=\mathrm{poly}(n)$, many concept classes have VC dimension that scales exponentially with $n$, in which case our quadratic improvement in terms of $\mathrm{VC}(C)$ beats the polynomial loss (in terms of $1/\gamma$) in the complexity of our quantum boosting algorithm.

3. There have been a few prior works (Neven et al., 2012; Schuld & Petruccione, 2018; Wang et al., 2019) which touch upon AdaBoost, but none of them rigorously prove that quantum techniques can improve boosting. As far as we are aware, ours is the first work that proves quantum algorithms can quadratically improve the complexity of classical AdaBoost. Given the importance of AdaBoost in classical machine learning, our quadratic quantum improvement could potentially have various applications in QML. (Although our quantum boosting algorithm uses quantum phase estimation as a subroutine, which is not implementable on near-term quantum computers, we leave it as an open question whether one could use variational techniques as proposed by (Peruzzo et al., 2014) in place of phase estimation.)
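For reference, the following is a minimal sketch of the classical AdaBoost loop of Section 2, against which the next sections compare our quantum algorithm. It is a simplified illustration, not the paper's implementation: the weak learner is treated as a black-box callable, and the function and variable names are ours.

```python
import numpy as np

def adaboost(S, weak_learner, T):
    """Classical AdaBoost sketch (Section 2).

    S            : list of (x, y) pairs with labels y in {-1, +1}.
    weak_learner : black box mapping (S, D) -> hypothesis h with h(x) in {-1, +1}.
    T            : number of boosting rounds, T = O(log M) in the analysis.
    Returns H with H(x) = sign(sum_t alpha_t * h_t(x)).
    """
    M = len(S)
    D = np.full(M, 1.0 / M)                  # D^1: uniform distribution over S
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(S, D)
        # Weighted error eps_t = Pr_{x ~ D^t}[h_t(x) != c(x)]
        eps = sum(D[i] for i, (x, y) in enumerate(S) if h(x) != y)
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)
        # Multiplicative-weights update, Eq. (1): down-weight correct, up-weight wrong
        D = np.array([D[i] * np.exp(-alpha * y * h(x)) for i, (x, y) in enumerate(S)])
        D /= D.sum()                          # normalize by Z_t
        hypotheses.append(h)
        alphas.append(alpha)
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1
```

The classical loop spends time $O(M)$ per round just to compute $\varepsilon_t$ exactly; this is precisely the step the quantum algorithm speeds up.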
### 3.1. Why quantum techniques do not trivially speed up AdaBoost

We now give a sketch of our quantum boosting algorithm. The quantum algorithm follows the structure of the classical AdaBoost algorithm. At a very high level, our quantum speedup is obtained by using quantum techniques to estimate the quantity $\varepsilon_t=\sum_{x\in S} D^t_x\,[h_t(x)\neq c(x)]$ quadratically faster than classical methods. In order to do so, one could use quantum algorithms for mean estimation, which, given a set of numbers $\alpha_1,\ldots,\alpha_M\in[0,1]$, produce an approximation of $\frac{1}{M}\sum_{i\in[M]}\alpha_i$ in time $\Theta(\sqrt{M})$ (Nayak & Wu, 1999; Brassard et al., 2011), omitting poly-logarithmic factors in the complexity, whereas classical methods would use time $\Theta(M)$. However, using the mean estimation subroutine to improve the time complexity of classical AdaBoost comes with various issues, which we discuss now.

1. Errors while computing the $\varepsilon_t$'s: Quantumly, the mean estimation subroutine approximates $\varepsilon_t$ up to an additive error $\delta$ in time $O(\sqrt{M}/\delta)$. Suppose we obtain $\varepsilon'_t$ satisfying $|\varepsilon'_t-\varepsilon_t|\leq\delta$. Recall that the distribution update in the $t$-th step of AdaBoost is given by

$$D^{t+1}_x=\frac{D^t_x}{Z_t}\cdot\begin{cases} e^{-\alpha_t} & \text{if } h_t(x)=c(x)\\ e^{\alpha_t} & \text{otherwise,}\end{cases} \qquad (2)$$

where $Z_t=\sum_{x\in S} D^t_x\exp(-c(x)\,\alpha_t\, h_t(x))$ and $\alpha_t=\frac12\ln((1-\varepsilon_t)/\varepsilon_t)$. Given an additive approximation $\varepsilon'_t$ of $\varepsilon_t$, first note that the approximate weights $\alpha'_t=\frac12\ln((1-\varepsilon'_t)/\varepsilon'_t)$ could be very far from $\alpha_t$. Moreover, it is not clear why the updated distribution $\widetilde{D}^{t+1}$ defined as $\widetilde{D}^{t+1}_x=\frac{1}{Z_t} D^t_x\exp(-\alpha'_t\, c(x)\, h_t(x))$ is even close to a distribution. Another possible way to update our distribution would be $\widetilde{D}^{t+1}_x=\frac{1}{Z'_t} D^t_x\exp(-\alpha'_t\, c(x)\, h_t(x))$, where $Z'_t=\sum_{x\in S} D^t_x\exp(-c(x)\,\alpha'_t\, h_t(x))$, so that by definition $\widetilde{D}^{t+1}$ is a distribution. However, in this case a quantum learner cannot compute $Z'_t$ exactly in time $o(M)$ but can only approximate $Z'_t$, and we face the same issue as above. (Note that computing $Z'_t$ would take time $O(M)$ even classically, despite having knowledge of $\alpha'_t$, since $Z'_t$ involves a summation of $M$ terms.)

2. Strong approximation of $\varepsilon_t$: One possible way to get around this would be to estimate $\varepsilon_t$ very well, so that one could potentially show that $\widetilde{D}^{t+1}$ is close to a distribution.
However, observe that if $\widetilde{D}^{t+1}$ is to be close to a distribution, then we require a $\delta=1/\sqrt{M}$-approximation of $\varepsilon_t$, and such a strong approximation increases the complexity from $O(\sqrt{M})$ to $O(M)$, which removes the entire quantum speedup.

3. Noisy inputs to a quantum learner: Let us further assume that we could spend time $O(M)$ as mentioned above to estimate $\varepsilon_t$ very well (instead of using classical techniques to compute $\varepsilon_t$). Suppose we obtain $\widetilde{D}^{t+1}$ which is close to a distribution. Recall that the input to a quantum learner should be copies of a quantum state $|\psi^t\rangle=\sum_x\sqrt{D^t_x}\,|x, c(x)\rangle$. However, we only have access to a quantum state $|\phi^t\rangle=\sum_x\sqrt{\widetilde{D}^t_x}\,|x, c(x)\rangle+|\chi^t\rangle$, where $|\chi^t\rangle$ is orthogonal to the first part of $|\phi^t\rangle$ (note that $|\phi^t\rangle$ is not a quantum state without the additional register $|\chi^t\rangle$, since $\sum_{x\in S}\widetilde{D}^t_x\leq 1$). Now it is unclear what the output of a quantum learner will be on input $|\phi^t\rangle$ instead of $|\psi^t\rangle$.

4. Why is the final hypothesis good: Assume for now that we are able to show that the learner, on input copies of $|\phi^t\rangle$, produces a weak hypothesis $h_t$ for the target concept $c$. Then, after $T$ steps of the quantum boosting algorithm, the final hypothesis is $H(x)=\mathrm{sign}\big(\sum_{t=1}^T\alpha'_t h_t(x)\big)$. It is not at all clear why $H$ should satisfy $H(x)=c(x)$ for even a constant fraction of the $x$'s in $S$. Observe that the analysis of classical AdaBoost crucially used that $H(x)=c(x)$ for most $x\in S$ in order to conclude that the generalization error is small, i.e., $\Pr_{x\sim D}[H(x)\neq c(x)]\leq 1/3$, where $D$ is an unknown distribution over $\{0,1\}^n$.

### 3.2. Quantum algorithm for boosting

In this paper, our main contribution is a quantum boosting algorithm that overcomes all the issues mentioned above. We now give more details of our quantum boosting algorithm. In order to avoid the issues mentioned in the previous section, our main technical contribution is the following: we provide a quantum algorithm that modifies the standard distribution update rule of classical AdaBoost in order to handle the approximations of the $\varepsilon_t$'s. Moreover, we show that the output of our modified quantum boosting algorithm has the same guarantees as classical AdaBoost.

We now discuss the important modification: the distribution update step. As mentioned before, classically one can compute the quantity $\varepsilon=\Pr_{x\sim D}[h(x)\neq c(x)]$ in time $O(M)$. Quantumly, we describe a subroutine that, for a fixed $\delta$, does the following: it outputs yes if $\varepsilon\geq\Omega\big((1-\delta)/(QT^2)\big)$ and no otherwise. In the yes instance, the algorithm also outputs an approximation $\varepsilon'$ that satisfies $|\varepsilon'-\varepsilon|\leq\delta\varepsilon'$, and in the no instance, the algorithm outputs an $\varepsilon'$ that satisfies $|\varepsilon'-\varepsilon|\leq 1/(QT^2)$. The essential point is that the subroutine takes time $O(\sqrt{M}/\delta)$. (The yes and no events of this subroutine occur with high probability; we omit this for simplicity of exposition.) The subroutine crucially uses the fact that, in the yes instance, the complexity of the standard quantum mean estimation algorithm (which, given $\alpha_1,\ldots,\alpha_M$, outputs a $\delta$-approximation of $\frac{1}{M}\sum_{i\in[M]}\alpha_i$) scales as $O(\sqrt{M}/\delta)$. However, in the no instance, obtaining a good multiplicative approximation of $\varepsilon$ using the quantum mean estimation algorithm could potentially take time $O(M)$. In this case, we do not need a good approximation of $\varepsilon$; instead we simply set $\varepsilon'=\tau=1/(QT^2)$. We will justify this shortly.
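Before describing the two update rules, the following stub illustrates the interface of this yes/no estimation subroutine as it is used by the boosting loop. It is a classical stand-in: it computes the weighted error exactly instead of via quantum mean estimation, so it has none of the $O(\sqrt{M}/\delta)$ speedup, and the function name, signature and exact threshold are ours, chosen for illustration only.

```python
import numpy as np

def estimate_weighted_error(D, mistakes, Q, T, delta):
    """Classical stand-in for the yes/no estimation subroutine of Section 3.2.

    D        : array of (sub-)normalized weights over the M training examples.
    mistakes : boolean array, mistakes[i] == True iff h_t(x_i) != c(x_i).
    Returns ("yes", eps') with |eps' - eps| <= delta * eps' when eps is above
    (roughly) the threshold tau = 1/(Q*T^2), and ("no", tau) otherwise.
    """
    tau = 1.0 / (Q * T**2)
    eps = float(np.sum(D[mistakes]))   # exact here; ~ sqrt(M)/delta quantumly
    if eps >= tau:
        # In the quantum subroutine, eps' is only a multiplicative delta-approximation.
        return "yes", eps
    else:
        # Below the threshold a good multiplicative estimate is too expensive;
        # the algorithm simply reports tau.
        return "no", tau
```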
Depending on whether we are in the yes instance or the no instance of the subroutine, we update the distribution differently. In the yes instance, we make a distribution update that resembles the standard AdaBoost update, using the approximation $\varepsilon'_t$ instead of $\varepsilon_t$. We let $Z_t=2\sqrt{\varepsilon'_t(1-\varepsilon'_t)}$, $\alpha'_t=\ln\sqrt{(1-\varepsilon'_t)/\varepsilon'_t}$ and update $\widetilde{D}^t_x$ as follows:

$$\widetilde{D}^{t+1}_x=\frac{\widetilde{D}^t_x}{(1+2\delta)Z_t}\cdot\begin{cases} e^{-\alpha'_t} & \text{if } h_t(x)=c(x)\\ e^{\alpha'_t} & \text{otherwise.}\end{cases} \qquad (3)$$

However, in the no instance, we cannot hope to get a good multiplicative approximation in time $O(\sqrt{M})$. In this case, we crucially observe that weaker approximations of the weights $\alpha'_t=\ln\sqrt{(1-\varepsilon'_t)/\varepsilon'_t}$ suffice in the hypothesis $H(x)=\mathrm{sign}\big(\sum_{t=1}^T\alpha'_t h_t(x)\big)$, i.e., a worse approximation of $\varepsilon_t$ still allows us to show that $H$ has small training error. Hence, in the no instance, we simply let $\varepsilon'_t=\tau$, $Z_t=2\sqrt{\tau(1-\tau)}$ and $\alpha'_t=\ln\sqrt{(1-\tau)/\tau}$, and update $\widetilde{D}^t_x$ as follows:

$$\widetilde{D}^{t+1}_x=\frac{\widetilde{D}^t_x}{\big(1+\frac{2}{QT^2}\big)Z_t}\cdot\begin{cases} \big(2-\frac{1}{QT^2}\big)\, e^{-\alpha'_t} & \text{if } h_t(x)=c(x)\\ \frac{1}{QT^2}\, e^{\alpha'_t} & \text{otherwise.}\end{cases}$$

Note that this update is not the standard boosting distribution update; it differs from it by assigning higher weights to the correctly classified training examples and lower weights to the misclassified ones.

In both cases of the distribution update, observe that $\widetilde{D}$ need not be a distribution. However, we are able to show that $\widetilde{D}$ is very close to a distribution; with some technical work we can argue that $\sum_{x\in S}\widetilde{D}_x\in[1-30\delta,\,1]$. This aspect is crucial because, in every iteration of the quantum boosting algorithm, we pass copies of $|\Phi'\rangle=\sum_x\sqrt{\widetilde{D}_x}\,|x, c(x)\rangle+|\chi\rangle$ to the quantum learner instead of the ideal quantum state $|\Phi\rangle=\sum_{x\in S}\sqrt{D_x}\,|x, c(x)\rangle$ (since our algorithm starts with many copies of $\sum_x\sqrt{D_x}\,|x, c(x)\rangle$, it also has access to these copies of $|\Phi\rangle$). (Again, we need $|\chi\rangle$ because $\widetilde{D}$ is not a distribution, so $\sum_x\sqrt{\widetilde{D}_x}\,|x, c(x)\rangle$ alone is not a valid quantum state.) A priori it is not clear what the output of the weak quantum learner will be on the input $|\Phi'\rangle$. However, we show that the state $|\Phi'\rangle$ is close to $|\Phi\rangle$; in particular, we show that $|\langle\Phi'|\Phi\rangle|\geq 1-\delta$. Suppose a weak quantum learner outputs a weak hypothesis $h$ when given copies of the state $|\Phi\rangle$ (with probability at least $1-1/T$); we show that the same quantum learner will output $h$ when given copies of the state $|\Phi'\rangle$ with probability at least $1-9/T$. By applying a union bound over the $T$ rounds of quantum boosting, we can bound the probability of obtaining a good hypothesis. Finally, after $T$ rounds, our quantum boosting algorithm outputs the hypothesis $H(x)=\mathrm{sign}\big(\sum_{t=1}^T\alpha'_t h_t(x)\big)$ for all $x\in\{0,1\}^n$.

It remains to show that the final hypothesis $H$ of the quantum boosting algorithm, with the modified distribution updates, has zero training error. We remark that the calculations proving this part are fairly involved; the analysis is inspired by the analysis of standard AdaBoost. Crucially, we use the structure of the modified distribution updates to show that $H$ has zero training error. In order to go from zero training error to small generalization error, we use the same ideas as in classical AdaBoost to show that, if the number of classical labelled examples $M$ is at least $O(\mathrm{VC}(C))$, then $H$ has generalization error at most $1/3$.

The overall time complexity of our quantum boosting algorithm is dominated by the subroutine estimating $\varepsilon$ in every iteration, which scales as $O(\sqrt{M})$; the remaining part of the quantum boosting algorithm involves invoking the weak quantum learner, which takes time $Q(C)$, and basic arithmetic operations. So the overall complexity of our quantum boosting algorithm scales as $\widetilde{O}\big(\sqrt{\mathrm{VC}(C)}\cdot n^2\cdot Q(C)^{3/2}\big)$, which is quadratically better than classical AdaBoost in terms of $\mathrm{VC}(C)$.
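A minimal numerical sketch of the two modified update rules (the yes case of Eq. (3) and the no case) is given below, written as a plain function over an array of weights. The quantum parts, i.e., how $\varepsilon'_t$ is obtained and how the weights are carried inside a quantum state, are not modelled, and the helper name is ours.

```python
import numpy as np

def modified_update(D_tilde, mistakes, eps_prime, flag, Q, T, delta):
    """Modified distribution update of Section 3.2.

    D_tilde   : current (sub-)normalized weights over the training set.
    mistakes  : boolean array, True where h_t misclassifies.
    eps_prime : estimate returned by the yes/no subroutine.
    flag      : "yes" or "no".
    Returns the new weights (close to, but not exactly, a distribution).
    """
    tau = 1.0 / (Q * T**2)
    if flag == "yes":
        Z = 2.0 * np.sqrt(eps_prime * (1.0 - eps_prime))
        alpha = np.log(np.sqrt((1.0 - eps_prime) / eps_prime))
        # Eq. (3): standard-looking update, normalized by (1 + 2*delta) * Z_t
        mult = np.where(mistakes, np.exp(alpha), np.exp(-alpha))
        return D_tilde * mult / ((1.0 + 2.0 * delta) * Z)
    else:
        Z = 2.0 * np.sqrt(tau * (1.0 - tau))
        alpha = np.log(np.sqrt((1.0 - tau) / tau))
        # No-case update: correctly classified examples get an extra factor (2 - tau),
        # misclassified ones only a factor tau.
        mult = np.where(mistakes, tau * np.exp(alpha), (2.0 - tau) * np.exp(-alpha))
        return D_tilde * mult / ((1.0 + 2.0 * tau) * Z)
```

Note that both branches return sub-normalized weights; the property $\sum_{x\in S}\widetilde{D}_x\in[1-30\delta,1]$ is something the paper proves about the algorithm, not something this sketch enforces.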
Application to classical AdaBoost. We remark that our main technical contribution, i.e., the modified distribution update rule, is also applicable to classical AdaBoost. In particular, suppose that in classical AdaBoost we obtain approximations $\varepsilon'_t$ instead of the exact weighted errors $\varepsilon_t$ in time $P$. Then our robust classical AdaBoost algorithm (i.e., AdaBoost with the modified distribution update) can still produce a hypothesis $H$ that has zero training error, and the complexity of such a robust classical AdaBoost algorithm will be proportional to $O(P)$. Clearly, $P$ could be much smaller than $M$ (the time taken by classical AdaBoost to compute $\varepsilon_t$ exactly), in which case the robust classical AdaBoost algorithm can be faster than standard classical AdaBoost. As far as we are aware, ours is the first work that considers approximating the weighted errors $\varepsilon$ (which could potentially be much faster than computing the $\varepsilon$'s exactly) and shows that changing the distribution update in AdaBoost still allows one to produce a strong hypothesis.

## 4. Preliminaries

Quantum information. In this paper, we assume familiarity with the following quantum information notation. Let $|0\rangle=\begin{pmatrix}1\\0\end{pmatrix}$ and $|1\rangle=\begin{pmatrix}0\\1\end{pmatrix}$ be the basis for $\mathbb{C}^2$, the space in which single qubits live. An arbitrary single-qubit state is a superposition of $|0\rangle$ and $|1\rangle$ and has the form $\alpha|0\rangle+\beta|1\rangle=\begin{pmatrix}\alpha\\\beta\end{pmatrix}$, where $\alpha,\beta\in\mathbb{C}$ and $|\alpha|^2+|\beta|^2=1$. Multi-qubit quantum states are obtained by taking tensor products of single-qubit quantum states. Overall, an arbitrary $n$-qubit quantum state $|\psi\rangle\in\mathbb{C}^{2^n}$ can be written as $|\psi\rangle=\sum_{x\in\{0,1\}^n}\alpha_x|x\rangle$, where $\alpha_x\in\mathbb{C}$ and $\sum_x|\alpha_x|^2=1$. A valid quantum operation on quantum states can be expressed as a unitary matrix $U$ (which satisfies $UU^\dagger=U^\dagger U=I$). An application of a unitary $U$ to the state $|\psi\rangle$ results in the quantum state $U|\psi\rangle$.

Quantum oracle access. We say $A$ is given query access to $c:\{0,1\}^n\to\{-1,1\}$ if $A$ can query $c$, i.e., $A$ can obtain $c(x)$ for $x$ of its choice. Similarly, we say $A$ has quantum query access to $c$ if $A$ can query $c$ in superposition, i.e., $A$ can perform the map $O_c:|x,b\rangle\to|x,c(x)\cdot b\rangle$ for every $x\in\{0,1\}^n$ and $b\in\{-1,1\}$. The query complexity of a quantum algorithm is the number of quantum queries made throughout the algorithm, and the time complexity of a quantum algorithm refers to the total number of gates involved in the quantum algorithm (i.e., the number of gates it takes to implement the various unitaries during the quantum algorithm) as well as the number of gates it takes to prepare quantum states.

Quantum subroutines. In this paper we use two quantum subroutines. The first, a quantum algorithm by (Brassard et al., 2011), estimates the mean of a set of numbers quadratically faster on a quantum computer than classical algorithms for mean estimation.

Theorem 4.1 (Mean estimation). Given a black box for the function $F:\{1,\ldots,N\}\to[0,1]$, there exists a quantum algorithm that, with probability at least $2/3$, computes an additive $\varepsilon$-approximation of $\frac{1}{N}\sum_{i=1}^N F(i)$ using $O(1/\varepsilon)$ evaluations of $F$.

Observe that classically estimating the mean $\frac{1}{N}\sum_{i=1}^N F(i)$ up to additive precision $\varepsilon$ takes $\Theta(1/\varepsilon^2)$ evaluations of $F$, so Theorem 4.1 gives a quadratic speedup over classical algorithms for estimating the mean.
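For contrast, here is the classical baseline against which Theorem 4.1 is compared: a Monte-Carlo estimator that needs on the order of $1/\varepsilon^2$ evaluations of $F$ for an additive $\varepsilon$-approximation by the Hoeffding bound. This is a sketch of the classical side only; it does not implement the quantum algorithm of Theorem 4.1, and the function name is ours.

```python
import math
import random

def classical_mean_estimate(F, N, eps, failure_prob=1/3):
    """Monte-Carlo mean estimation: additive eps-approximation of (1/N) * sum_i F(i),
    with probability >= 1 - failure_prob, using Theta(1/eps^2) evaluations of F
    (Hoeffding bound, since F maps into [0, 1])."""
    samples = math.ceil(math.log(2 / failure_prob) / (2 * eps**2))
    total = sum(F(random.randint(1, N)) for _ in range(samples))
    return total / samples
```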
The second subroutine we use is amplitude amplification by (Brassard et al., 2002), a well-known quantum subroutine which performs the following task: suppose we have a (classical) algorithm $A$ that outputs 1 with probability $p$ and 0 otherwise; then classically we need to repeat $A$ $\Theta(1/p)$ times in expectation before one of the repetitions of $A$ outputs a 1. Quantumly, amplitude amplification is a procedure that invokes $A$ and the inverse of $A$ (denoted $A^{-1}$) $O(1/\sqrt{p})$ times before outputting 1 with high probability, hence providing a quadratic quantum speedup over classical randomized algorithms.

Theorem 4.2 (Amplitude amplification). Suppose there exists a unitary $U$ on $n$ qubits that satisfies $U|0^n\rangle=\sqrt{a}\,|\psi_0\rangle+\sqrt{1-a}\,|\psi_1\rangle$ for an unknown $a>0$ and arbitrary orthogonal quantum states $|\psi_0\rangle,|\psi_1\rangle$. Then there exists a quantum algorithm that outputs $|\psi_0\rangle$ with high probability using an expected number $\Theta(1/\sqrt{a})$ of applications of $U$ and $U^{-1}$.

PAC learning. The Probably Approximately Correct (PAC) model of learning was introduced by (Valiant, 1984). A concept class $C$ is a collection of Boolean functions $c:\{0,1\}^n\to\{-1,1\}$, often referred to as concepts. In the PAC model, there is an unknown distribution $D:\{0,1\}^n\to[0,1]$ under which a learner needs to learn $C$, i.e., a learner $A$ is given labelled examples $(x, c(x))$ where $x\sim D$ and $c\in C$ is an unknown target concept (which the learner is trying to learn). The goal of $A$ is to output a hypothesis $h:\{0,1\}^n\to\{-1,1\}$, and we say that $A$ is an $(\eta,\delta)$-PAC learner for a concept class $C$ if it satisfies: for all $c\in C$ and distributions $D$, given access to labelled examples $(x, c(x))$, with probability $\geq 1-\delta$, $A$ outputs a hypothesis $h$ such that $\Pr_{x\sim D}[h(x)\neq c(x)]\leq\eta$. The sample complexity and time complexity of a learner are, respectively, the number of labelled examples and the number of bit-wise operations (i.e., time taken) that suffice to learn $C$ (under the hardest concept $c\in C$ and distribution $D$). In the quantum PAC model, a learner is a quantum algorithm given access to the quantum examples $\sum_x\sqrt{D_x}\,|x, c(x)\rangle$ and a quantum computer. The remaining aspects of quantum PAC learning are defined analogously to the classical PAC model. We now define what it means for an algorithm $A$ to be a strong or weak learner for a concept class $C$.

Definition 4.3 (Weak and strong learner). Let $n\geq 1$ and $C\subseteq\{c:\{0,1\}^n\to\{-1,1\}\}$. We say $A$ is a weak (resp. strong) learner for $C$ if it satisfies: for every $c\in C$ and distribution $D:\{0,1\}^n\to[0,1]$, given query access to $c$, $A$ can output a hypothesis $h$ such that $\Pr_{x\sim D}[h(x)=c(x)]\geq\frac12+\frac{1}{\mathrm{poly}(n)}$ (resp. $\Pr_{x\sim D}[h(x)=c(x)]\geq\frac23$).

Throughout, we will assume that we have classical or quantum query access to the hypothesis $h$, and will not assume an explicit truth-table description of $h$. Similarly, we say $h$ is a weak hypothesis (resp. strong hypothesis) if $\Pr_x[h(x)\neq c(x)]\leq 1/2-1/\mathrm{poly}(n)$ (resp. $\leq 1/6$).

We now define two misclassification errors. Let $S=\{(x_1,y_1),\ldots,(x_M,y_M)\}$, where $(x_i,y_i)\in\{0,1\}^n\times\{-1,1\}$ is drawn from a joint distribution $D:\{0,1\}^n\times\{-1,1\}\to[0,1]$. The training error of $h$ is defined as the error of $h$ on the training set $S$ and is given by $\frac{1}{M}\sum_{i=1}^M[h(x_i)\neq y_i]$. In order to quantify the goodness of the hypothesis $h$, the true error or generalization error of $h$ is defined as $\Pr_{(x,y)\sim D}[h(x)\neq y]$.
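These two error measures, for the weighted-vote hypotheses $H=\mathrm{sign}\big(\sum_t\alpha_t h_t\big)$ used throughout the paper, can be computed as in the small snippet below. The generalization error is estimated by sampling fresh labelled examples, and the function names are ours, for illustration.

```python
def weighted_vote(alphas, hypotheses):
    """H(x) = sign(sum_t alpha_t * h_t(x)), with sign(0) mapped to +1."""
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1

def training_error(H, S):
    """(1/M) * sum_i [H(x_i) != y_i] over the training set S."""
    return sum(1 for x, y in S if H(x) != y) / len(S)

def estimated_generalization_error(H, sample_example, num_samples=10_000):
    """Estimate Pr_{(x, y) ~ D}[H(x) != y] from fresh draws of sample_example()."""
    errors = 0
    for _ in range(num_samples):
        x, y = sample_example()
        errors += (H(x) != y)
    return errors / num_samples
```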
## 5. Quantum Boosting

In this section, we use quantum techniques to improve the complexity of AdaBoost. As in AdaBoost, we break our quantum boosting algorithm into two stages. Stage (1), reduce the training error: produce a hypothesis that does well on the training set. Stage (2), reduce the generalization error: we show that for a sufficiently large training set, the hypothesis output in Stage (1) not only has small training error, but also has small generalization error.

### 5.1. Quantum boosting: reducing training error

The bulk of the technical work in our quantum boosting algorithm lies in reducing the training error. We now state the main theorem for Stage (1) of our quantum algorithm.

Theorem 5.1. Let $\gamma>0$, $n\geq 1$, let $C\subseteq\{c:\{0,1\}^n\to\{-1,1\}\}$ be a concept class and let $D:\{0,1\}^n\to[0,1]$ be an unknown distribution. Let $A$ be a quantum algorithm that takes time $Q(C)$ to PAC learn $C$ with bias $\gamma$. Let $M$ be sufficiently large (we quantify "sufficiently large" in the next section, in particular in Theorem 5.5) and let $T=O((\log M)/\gamma^2)$. Given a training set $S=\{(x_i, c(x_i))\}_{i\in[M]}$ where $x_i\sim D$ and $c\in C$, the quantum boosting algorithm takes time $\widetilde{O}\big(\sqrt{M}\cdot n^2\cdot Q(C)^{3/2}\cdot T^5\big)$ and, with probability $\geq 2/3$, outputs a hypothesis $H$ that satisfies $H(x_i)=c(x_i)$ for every $i\in[M]$.

We now describe the quantum boosting algorithm in this theorem. Since the sample complexity of $A$ is at most its time complexity, we assume that it suffices to provide $A$ with $Q$ quantum examples. Our quantum algorithm is a $T$-round iterative algorithm similar to classical AdaBoost, and in each round it produces a distribution $\widetilde{D}$. In the $t$-th round, our quantum algorithm follows a three-step process:

1. Invoke the weak quantum learner $A$ to produce a weak hypothesis $h_t$ under an approximate distribution $\widetilde{D}^t$ over the training set $S$.

2. By making quantum queries to $h_t$, compute $\varepsilon'_t$, an approximation of $\tilde{\varepsilon}_t=\Pr_{x\sim\widetilde{D}^t}[h_t(x)\neq c(x)]$. We then use $\varepsilon'_t$ to update the distribution $\widetilde{D}^t$ to $\widetilde{D}^{t+1}$. In this step, we depart from standard AdaBoost.

3. Using $\varepsilon'_t$, compute a weight $\alpha'_t$. After $T$ steps, output the hypothesis $H(x)=\mathrm{sign}\big(\sum_{t=1}^T\alpha'_t h_t(x)\big)$.

Before describing our quantum algorithm, we state a lemma which we use to perform step (2) of the procedure above.

Lemma 5.2. Let $\delta=1/(10QT^2)$. There exists a procedure that outputs $\varepsilon'$ together with a flag in {yes, no} and satisfies the following with probability $\geq 1-10\delta/T$: if the output is $\{\varepsilon',\text{yes}\}$, then $|\tilde{\varepsilon}-\varepsilon'|\leq\delta\varepsilon'$; and if the output is $\{\varepsilon'=1/(QT^2),\text{no}\}$, then $|\tilde{\varepsilon}-\varepsilon'|\leq 1/(QT^2)$. The time complexity of the procedure is $\widetilde{O}(\sqrt{M}\,Q^{3/2}T^3 n^2)$.

We now describe our quantum boosting algorithm.

Algorithm 1: Quantum boosting algorithm

Input: quantum weak learner $A$ with time complexity $Q$, a training sample $S=\{(x_i, c(x_i))\}_{i\in[M]}$, where $x_i\sim D$ and $D:\{0,1\}^n\to[0,1]$ is an unknown distribution.

Initialize: let $\widetilde{D}^1=D^1$ be the uniform distribution on $S$. Let $h_0$ be the constant function (precisely, we let the query operation $O_{h_0}$ corresponding to $h_0$ be the identity map), $T=O((\log M)/\gamma^2)$, $\delta=1/(10QT^2)$ and $\varepsilon'_0=1/2$.

1: for $t=1$ to $T$ (assume quantum query access to $h_1,\ldots,h_{t-1}$ and knowledge of $\varepsilon'_1,\ldots,\varepsilon'_{t-1}$) do

2: Prepare $Q+1$ copies of $|\psi_1\rangle=\frac{1}{\sqrt{M}}\sum_{x\in S}|x, c(x), \widetilde{D}^1_x\rangle$. Let $|\Phi_1\rangle=|\psi_1\rangle$.

Phase (1): Obtaining hypothesis $h_t$.

3: Using quantum queries to $\{h_1,\ldots,h_{t-1}\}$ and knowledge of $\{\varepsilon'_1,\ldots,\varepsilon'_{t-1}\}$, prepare the state $\frac{1}{\sqrt{M}}\sum_{x\in S}|x, c(x), \widetilde{D}^t_x\rangle$.

4: Apply amplitude amplification (Theorem 4.2) to prepare $|\Phi_6\rangle=\sum_x\sqrt{\widetilde{D}^t_x}\,|x, c(x)\rangle+|\chi^t\rangle$.

5: Pass $|\Phi_6\rangle^{\otimes Q}$ to the quantum learner $A$ to obtain $h_t$.

Phase (2): Estimating the weighted error $\tilde{\varepsilon}_t$.

6: Using quantum queries to $h_t$, prepare $|\psi_5\rangle=\frac{1}{\sqrt{M}}\sum_{x\in S}|x, c(x), \widetilde{D}^t_x, [h_t(x)\neq c(x)]\rangle$.

7: Let $\tilde{\varepsilon}_t=\Pr_{x\sim\widetilde{D}^t}[h_t(x)\neq c(x)]$. Prepare $|\psi_6\rangle=\sqrt{(1-\tilde{\varepsilon}_t)/M}\,|\phi_0\rangle|0\rangle+\sqrt{\tilde{\varepsilon}_t/M}\,|\phi_1\rangle|1\rangle$.
8: Invoke Lemma 5.2 to estimate $\tilde{\varepsilon}_t$ by $\varepsilon'_t$.

Phase (3): Updating distributions.

9: If Lemma 5.2 outputs yes: let $Z_t=2\sqrt{\varepsilon'_t(1-\varepsilon'_t)}$ and $\alpha'_t=\frac12\ln\big((1-\varepsilon'_t)/\varepsilon'_t\big)$. Update

$$\widetilde{D}^{t+1}_x=\frac{\widetilde{D}^t_x}{(1+2\delta)Z_t}\cdot\begin{cases} e^{-\alpha'_t} & \text{if } h_t(x)=c(x)\\ e^{\alpha'_t} & \text{otherwise.}\end{cases}$$

10: If Lemma 5.2 outputs no: let $Z_t=2\sqrt{QT^2-1}/(QT^2)$ and $\alpha'_t=\ln\sqrt{QT^2-1}$. Update

$$\widetilde{D}^{t+1}_x=\frac{\widetilde{D}^t_x}{\big(1+2/(QT^2)\big)Z_t}\cdot\begin{cases} \big(2-1/(QT^2)\big)\, e^{-\alpha'_t} & \text{if } h_t(x)=c(x)\\ \big(1/(QT^2)\big)\, e^{\alpha'_t} & \text{otherwise.}\end{cases}$$

11: end for

Output: the hypothesis $H$ defined as $H(x)=\mathrm{sign}\big(\sum_{t=1}^T\alpha'_t h_t(x)\big)$ for all $x\in\{0,1\}^n$.

We do not discuss the individual steps of our quantum boosting algorithm here due to space constraints; more details are given in Section B of the supplementary material. We now make a couple of remarks. First, note that we use the notation $\widetilde{D}^t_x$ in the quantum boosting algorithm because $\{\widetilde{D}^t_x\}_x$ is not a distribution (which is also why we need the state $|\chi^t\rangle$ in step (4), since $\sum_x\sqrt{\widetilde{D}^t_x}\,|x, c(x)\rangle$ alone is not a valid quantum state); instead, we show that $\widetilde{D}^t$ is close to a distribution in the following sense.

Claim 5.3. Let $t\geq 1$ and let $\widetilde{D}^t:\{0,1\}^n\to[0,1]$ be defined as in steps (9), (10). Then $\sum_{x\in S}\widetilde{D}^t_x\in[1-30\delta,\,1]$.

We also show that not only are these distributions close, but the weighted training errors of the hypotheses under these distributions are also close.

Claim 5.4. Let $t\geq 1$, let $\tilde{\varepsilon}_t=\Pr_{x\sim\widetilde{D}^t}[h_t(x)\neq c(x)]$ be the weighted error corresponding to the approximate distribution $\widetilde{D}^t$, and let $\varepsilon_t=\Pr_{x\sim D^t}[h_t(x)\neq c(x)]$ correspond to the true distribution $D^t$. Then $|\tilde{\varepsilon}_t-\varepsilon_t|\leq 50\delta$.

We prove these claims, along with a few more facts (used to show that the quantum algorithm produces a strong $H$), in Section C of the supplementary material. Working with approximate distributions is an important difference between standard AdaBoost and our quantum boosting algorithm. In AdaBoost, one assumes that the $\tilde{\varepsilon}$'s can be computed exactly by spending time $O(M)$; quantumly, however, we can only approximate $\tilde{\varepsilon}$ by $\varepsilon'$ using the quantum mean estimation algorithm (Theorem 4.1) in time $O(\sqrt{M})$. Hence, using $\varepsilon'$ in Phase (3) results in a sub-normalized distribution in the quantum algorithm.

Second, as mentioned in the introduction, we differ from AdaBoost crucially in Phase (3). The quantum mean estimation algorithm gives a good approximation in time $O(\sqrt{M})$ only when $\tilde{\varepsilon}_t$ is "large", in which case we use the standard AdaBoost distribution update in step (9). In case $\tilde{\varepsilon}_t$ is "small", since we cannot hope to get a good approximation of $\tilde{\varepsilon}_t$ in time $O(\sqrt{M})$, we fix $\varepsilon'_t=1/(QT^2)$ and use a distribution update different from AdaBoost in step (10). The intuition for why fixing $\varepsilon'_t=1/(QT^2)$ suffices is that, when $\tilde{\varepsilon}_t$ is "small", worse approximations of $\alpha'_t=\frac12\ln\big((1-\tilde{\varepsilon}_t)/\tilde{\varepsilon}_t\big)$ are sufficient in the final output hypothesis $H=\mathrm{sign}\big(\sum_{t=1}^T\alpha'_t h_t\big)$. In particular, we show that using the $\alpha'_t$ obtained by fixing $\varepsilon'_t=1/(QT^2)$ (whenever $\tilde{\varepsilon}_t$ is "small") suffices to prove that the final hypothesis $H$ is a strong hypothesis. We make this rigorous in Section D of the supplementary material. We remark that Sections C and D of the supplementary material (which prove the correctness of the quantum boosting algorithm) are mathematically technical and constitute the non-trivial aspects of our quantum algorithm.
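To summarize the control flow of Algorithm 1, the following classical skeleton runs the $T$-round loop with stand-ins for the quantum pieces: the weak learner is an arbitrary black box, and the Lemma 5.2 estimator and the Phase (3) update are passed in as callables (for instance the classical stand-ins sketched in Section 3.2 above). None of the quantum state preparation or amplitude amplification is modelled, and all function names are ours.

```python
import numpy as np

def quantum_boosting_skeleton(S, weak_learner, estimate_weighted_error,
                              modified_update, Q, T):
    """Classical control-flow skeleton of Algorithm 1 (quantum steps stubbed out).

    S                       : list of (x, y) training pairs with y in {-1, +1}.
    weak_learner(S, D)      : stand-in for Phase (1): returns a hypothesis h_t
                              trained under weights D (quantumly: state prep +
                              amplitude amplification + the learner A).
    estimate_weighted_error : stand-in for Lemma 5.2 (Phase (2)).
    modified_update         : stand-in for the Phase (3) update rules.
    """
    M = len(S)
    delta = 1.0 / (10 * Q * T**2)
    D_tilde = np.full(M, 1.0 / M)            # uniform initial weights
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(S, D_tilde)                                   # Phase (1)
        mistakes = np.array([h(x) != y for x, y in S])
        flag, eps_prime = estimate_weighted_error(D_tilde, mistakes,
                                                  Q, T, delta)         # Phase (2)
        if flag == "yes":
            alpha = 0.5 * np.log((1 - eps_prime) / eps_prime)
        else:
            alpha = np.log(np.sqrt(Q * T**2 - 1.0))
        D_tilde = modified_update(D_tilde, mistakes, eps_prime,
                                  flag, Q, T, delta)                   # Phase (3)
        hypotheses.append(h)
        alphas.append(alpha)
    # Final hypothesis H(x) = sign(sum_t alpha'_t * h_t(x))
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1
```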
#### 5.1.1. Proof of correctness

We first bound the probability that the quantum boosting algorithm fails to output the strong hypothesis $H$. The first source of error is the amplitude amplification in step (4) of the boosting algorithm, which fails with probability at most $1/(3T)$. The second source of error is the quantum weak learner failing to output a weak hypothesis in step (5), which happens with probability at most $1/(3T)$. The third source of error is in estimating $\tilde{\varepsilon}_t$ in step (8); the probability of failure here is $O(1/(QT^3))$. Applying a union bound over the $T$ rounds and all the failure events, the overall probability of not outputting $H$ can be made an arbitrarily small constant (with a constant overhead in the complexity).

It remains to argue that the training error of $H$ is 0, i.e., $H(x)=c(x)$ for every $(x, c(x))\in S$. To analyze this we crucially use the structure of the modified distribution update in Phase (3). Proving that the final $H$ has zero training error departs from the standard AdaBoost convergence analysis and, due to space constraints, we defer it to Section D of the supplementary material. (For a more coherent exposition of the theorems alongside their proofs, we refer the reader to (Arunachalam & Maity, 2020).)

#### 5.1.2. Complexity of the algorithm

Our quantum algorithm begins with the state $\frac{1}{\sqrt{M}}\sum_{x\in S}|x, c(x)\rangle$, given access to $S=\{(x_i, c(x_i))\}_i$. Assuming that a quantum RAM can prepare the uniform superposition $\frac{1}{\sqrt{M}}\sum_{x\in S}|x, c(x)\rangle$ using $O(n\log M)$ gates, the time complexity of preparing the initial state $|\psi_1\rangle\otimes|\Phi_1\rangle^{\otimes Q}$ is $O(n\,Q\log M)$. We could also assume that the quantum learning algorithm is given uniform quantum examples $\frac{1}{\sqrt{M}}\sum_{x\in S}|x, c(x)\rangle$, in which case we do not need to assume a quantum RAM.

Given that the QRAM assumption has been controversial and seems strong in quantum machine learning, we make a couple of remarks: (i) our quantum boosting algorithm only requires a QRAM to prepare the uniform superposition over the classical data $S$ at the start of each iteration; moreover, our quantum algorithm does not use QRAM as an oracle for Grover-like algorithms, so the negative results of (Arunachalam et al., 2015) do not apply to our algorithm; (ii) we use the QRAM at the start of the $T=O(\log M)$ iterations of our algorithm to prepare $|\psi_0\rangle$, so even if the quantum time complexity of preparing $|\psi_0\rangle$ were $O(\sqrt{M})$, our complexity would increase only by an additive $O(\sqrt{M}\log M)$ term and we would still not lose our quantum speedup; (iii) of course, if QRAM is infeasible, we can instead assume that a quantum learner has access to uniform quantum examples or has quantum query access to the training examples in $S$ (i.e., can perform the map $|x, b\rangle\to|x, b\cdot c(x)\rangle$ for $x\in S$), and in both cases we do not need a QRAM.

In Phase (1) of the quantum algorithm, we first update the distribution registers from $\widetilde{D}^1$ to $\widetilde{D}^t$. This step involves $O(Qt)$ quantum queries to $\{h_1,\ldots,h_{t-1}\}$, which can be performed in time $O(Qt)$, and other arithmetic operations that can be performed in time $O(n^2Qt)$. We then perform amplitude amplification (Theorem 4.2) to prepare $|\Phi_6\rangle$, which takes time $O(n^2\sqrt{M}\,Qt)$ (in Section E of the supplementary material we make explicit the unitaries to which we apply amplification). Finally, we pass $Q$ copies of $|\Phi_6\rangle$ to the weak learner $A$, which outputs a hypothesis $h_t$ in time $Q$. Note that we require the quantum learning algorithm to output an oracle for $h_t$ rather than explicitly outputting a circuit for $h_t$. In Phase (2), the algorithm in steps (6), (7) performs a query as well as a quantum gate for a phase rotation in order to prepare $|\psi_6\rangle$, using $O(n)$ gates.
The next step is the mean estimation step (Theorem 4.1) used to compute $\varepsilon'_t$, which takes time $O(\sqrt{M}\,Q^{3/2}T^3\, t\, n^2)$ (again, in Section E we explicitly state the unitaries to which we apply Theorem 4.1). Finally, Phase (3) of our quantum boosting algorithm involves basic arithmetic operations, which take time $O(n^2)$. The total complexity of the algorithm thus scales as $\widetilde{O}\big(\sqrt{M}\cdot\mathrm{poly}(n, Q, T)\big)$.

### 5.2. Reducing generalization error

In the previous section we showed that our quantum boosting algorithm produces a hypothesis $H$ that perfectly classifies the training set $S=\{(x_i, c(x_i))\}_{i\in[M]}$, where $(x_i, c(x_i))$ was sampled according to the unknown $D$. Recall that the goal of our quantum boosting algorithm is to output a hypothesis $H:\{0,1\}^n\to\{-1,1\}$ that satisfies $\Pr_{x\sim D}[H(x)=c(x)]\geq 1-\eta$. We saw in Theorem 2.1 that, as long as the number of training examples $M$ is large enough, $H$ not only has zero training error but also has small generalization error. In particular, in Stage (2) of classical AdaBoost one simply uses Theorem 2.1 to argue that if the training error of $H$ is 0, then the generalization error of $H$ is at most $\eta$ as long as $M\geq O(\mathrm{VC}(C)/\eta^2)$. Using Theorem 2.1, we now prove our main theorem.

Theorem 5.5 (Complexity of quantum boosting). Fix $\eta>0$, $\gamma>0$. Let $n\geq 1$, let $C\subseteq\{c:\{0,1\}^n\to\{-1,1\}\}$ be a concept class and let $D:\{0,1\}^n\to[0,1]$ be an unknown distribution. Let $A$ be a weak quantum PAC algorithm that has bias $\gamma$ and takes time $Q(C)$. Suppose $M$ satisfies $M\geq\frac{\mathrm{VC}(C)}{\eta^2\gamma^2}\log\big(\mathrm{VC}(C)/\gamma^2\big)$, and suppose we run Algorithm 1. Then with probability $\geq 1-\delta$ (over the randomness of the algorithm), we obtain a hypothesis $H$ that has training error at most $1/10$ and generalization error $\Pr_{x\sim D}[H(x)\neq c(x)]\leq\eta+1/10$. Moreover, the time complexity of the quantum boosting algorithm is
$$T_Q=\widetilde{O}\!\left(\frac{\sqrt{\mathrm{VC}(C)}}{\eta}\cdot\frac{Q(C)^{3/2}\,n^2}{\gamma^{11}}\cdot\mathrm{polylog}(1/\delta)\right).$$

Picking $\eta=1/10$, we get that $H$ has generalization error at most $1/5$. Recall that the complexity of classical AdaBoost is $T_C=\widetilde{O}\big(\frac{\mathrm{VC}(C)}{\eta^2}\cdot\frac{R(C)}{\gamma^4}\cdot\log(1/\delta)\big)$. In comparison, $T_Q$ is quadratically better than $T_C$ in terms of the VC dimension of the concept class $C$ and in terms of $1/\eta$. Additionally, we could potentially have $Q(C)\ll R(C)$, since the quantum time complexity of a weak learner can be much smaller than the classical time complexity of learning, as shown by (Servedio & Gortler, 2004) (under complexity-theoretic assumptions).

Open questions. We conclude with a few interesting questions: (i) can we improve the polynomial dependence on $1/\gamma$ in the quantum complexity of boosting? (ii) can we use the quantum boosting algorithm to improve the complexities of various quantum algorithms that use classical AdaBoost on top of a weak quantum algorithm? (iii) are there practically relevant concept classes with large VC dimension for which our quantum boosting algorithm gives a large quantum speedup? (iv) could one replace the quantum phase estimation step in our quantum boosting algorithm by the variational techniques developed by (Peruzzo et al., 2014)?

Subsequent work. After our work was submitted to ICML 2020, (Hamoudi et al., 2020) posted a paper on arXiv proposing a quantum speedup for the Hedge algorithm of Freund and Schapire, which can be viewed as a boosting algorithm using the multiplicative weights method.

Acknowledgements. RM would like to thank Seth Lloyd for his support and guidance during the course of the project. RM's research was supported by Seth Lloyd in the Research Laboratory of Electronics at MIT.
Most of this work was done when SA was a postdoc at the Center for Theoretical Physics, MIT (funded by the MIT-IBM Watson AI Lab under the project "Machine Learning in Hilbert space") and visiting Henry Yuen at the University of Toronto. SA was also supported by the Army Research Laboratory and the Army Research Office under grant number W1975117. We thank Ashley Montanaro and Ronald de Wolf for clarifications regarding multiplicative phase estimation, and Ronald de Wolf for several comments that significantly improved the presentation of this paper. We thank Steve Hanneke and Nishant Mehta for various clarifications regarding classical boosting algorithms. We thank the anonymous reviewers of ICML 2020 for helpful suggestions on a preliminary version of this paper.

## References

Aaronson, S. The learnability of quantum states. Proceedings of the Royal Society of London, 463(2088), 2007. quant-ph/0608142.

Aaronson, S. Shadow tomography of quantum states. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 325-338, 2018.

Aïmeur, E., Brassard, G., and Gambs, S. Quantum clustering algorithms. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML), pp. 1-8, 2007.

Apeldoorn, J. v. and Gilyén, A. Improvements in quantum SDP-solving with applications. In 46th International Colloquium on Automata, Languages, and Programming (ICALP), pp. 99:1-99:15, 2019. arXiv:1804.05058.

Apeldoorn, J. v., Gilyén, A., Gribling, S., and Wolf, R. d. Convex optimization using quantum oracles. Quantum, 4:220, 2020. arXiv:1809.00643.

Arora, S., Hazan, E., and Kale, S. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.

Arunachalam, S. and Maity, R. Quantum boosting. arXiv:2002.05056, 2020.

Arunachalam, S. and Wolf, R. d. Optimal quantum sample complexity of learning algorithms. Journal of Machine Learning Research, 19(71):1-36, 2018. Earlier version in CCC'17. arXiv:1607.00932.

Arunachalam, S., Gheorghiu, V., Jochym-O'Connor, T., Mosca, M., and Srinivasan, P. V. On the robustness of bucket brigade quantum RAM. New Journal of Physics, 17(12):123010, 2015. arXiv:1502.03450.

Arunachalam, S., Chakraborty, S., Lee, T., Paraashar, M., and de Wolf, R. Two new results about quantum exact learning. In 46th International Colloquium on Automata, Languages, and Programming (ICALP), pp. 16:1-16:15, 2019. arXiv:1810.00481.

Atıcı, A. and Servedio, R. Improved bounds on quantum learning algorithms. Quantum Information Processing, 4(5):355-386, 2005. quant-ph/0411140.

Bernstein, E. and Vazirani, U. V. Quantum complexity theory. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pp. 11-20, 1993.

Brassard, G., Høyer, P., Mosca, M., and Tapp, A. Quantum amplitude amplification and estimation. Contemporary Mathematics, 305:53-74, 2002.

Brassard, G., Dupuis, F., Gambs, S., and Tapp, A. An optimal quantum algorithm to approximate the mean and its application for approximating the median of a set of points over an arbitrary distance, 2011. arXiv:1106.4267.

Bshouty, N. H. and Jackson, J. C. Learning DNF over the uniform distribution using a quantum example oracle. SIAM Journal on Computing, 28(3):1136-1153, 1999.

Chakrabarti, S., Childs, A. M., Li, T., and Wu, X. Quantum algorithms and lower bounds for convex optimization, 2018. arXiv:1809.01731.
Chia, N., Gilyén, A., Li, T., Lin, H., Tang, E., and Wang, C. Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning, 2019. arXiv:1910.06151.

Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995. Earlier version in COLT'90.

Freund, Y. and Schapire, R. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14:771-780, 1999.

Gao, X., Zhang, Z., and Duan, L. An efficient quantum algorithm for generative machine learning, 2017. arXiv:1711.02038.

Gilyén, A., Arunachalam, S., and Wiebe, N. Optimizing quantum optimization algorithms via faster quantum gradient computation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019.

Hamoudi, Y., Ray, M., Rebentrost, P., Santha, M., Wang, X., and Yang, S. Quantum algorithms for hedging and the sparsitron. arXiv:2002.06003, 2020.

Harrow, A. W., Hassidim, A., and Lloyd, S. Quantum algorithm for solving linear systems of equations. Physical Review Letters, 103:150502, 2009. arXiv:0811.3171v3.

Havlíček, V., Córcoles, A. D., Temme, K., Harrow, A. W., Kandala, A., Chow, J. M., and Gambetta, J. M. Supervised learning with quantum-enhanced feature spaces. Nature, 567(7747):209, 2019. arXiv:1804.11326.

Jethwani, D., Gall, F. L., and Singh, S. Quantum-inspired classical algorithms for singular value transformation, 2019. arXiv:1910.05699.

Kapoor, A., Wiebe, N., and Svore, K. Quantum perceptron models. In Proceedings of Neural Information Processing Systems 16, pp. 3999-4007, 2016. arXiv:1602.04799.

Kerenidis, I., Landman, J., Luongo, A., and Prakash, A. q-means: A quantum algorithm for unsupervised machine learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, pp. 4136-4146, 2019. arXiv:1812.03584.

Li, T., Chakrabarti, S., and Wu, X. Sublinear quantum algorithms for training linear and kernel-based classifiers. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3815-3824, 2019. arXiv:1904.02276.

Lloyd, S. and Weedbrook, C. Quantum generative adversarial learning. Physical Review Letters, 121(4):040502, 2018.

Nayak, A. and Wu, F. The quantum query complexity of approximating the median and related statistics. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, pp. 384-393, 1999.

Neven, H., Denchev, V. S., Rose, G., and Macready, W. G. QBoost: Large scale classifier training with adiabatic quantum optimization. In ACML, pp. 333-348, 2012.

Peruzzo, A., McClean, J., Shadbolt, P., Yung, M.-H., Zhou, X.-Q., Love, P. J., Aspuru-Guzik, A., and O'Brien, J. L. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5:4213, 2014. arXiv:1304.3061v1.

Prakash, A. Quantum algorithms for linear algebra and machine learning, 2014. PhD thesis.

Rebentrost, P., Mohseni, M., and Lloyd, S. Quantum support vector machine for big feature and big data classification, 2013. arXiv:1307.0471.

Rebentrost, P., Schuld, M., Wossnig, L., Petruccione, F., and Lloyd, S. Quantum gradient descent and Newton's method for constrained polynomial optimization. New Journal of Physics, 2019.

Rocchetto, A. Stabiliser states are efficiently PAC-learnable. Quantum Information & Computation, 18(7&8):541-552, 2018.

Schapire, R. and Freund, Y. Boosting: Foundations and Algorithms. MIT Press, 2012.

Schapire, R. E. The strength of weak learnability. Machine Learning, 5:197-227, 1990. Earlier version in FOCS'89.
Schuld, M. and Petruccione, F. Quantum ensembles of quantum classifiers. Scientific Reports, 8(1):2772, 2018.

Servedio, R. A. and Gortler, S. J. Equivalences and separations between quantum and classical learnability. SIAM Journal on Computing, 33(5):1067-1092, 2004.

Valiant, L. G. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

Wang, X., Ma, Y., Hsieh, M., and Yung, M. Quantum speedup in adaptive boosting of binary classification, 2019. arXiv:1902.00869.

Yoganathan, M. A condition under which classical simulability implies efficient state learnability, 2019. arXiv:1907.08163.