Quantum Perceptron Models

Nathan Wiebe, Microsoft Research, Redmond WA 98052, nawiebe@microsoft.com
Ashish Kapoor, Microsoft Research, Redmond WA 98052, akapoor@microsoft.com
Krysta M. Svore, Microsoft Research, Redmond WA 98052, ksvore@microsoft.com

Abstract

We demonstrate how quantum computation can provide non-trivial improvements in the computational and statistical complexity of the perceptron model. We develop two quantum algorithms for perceptron learning. The first algorithm exploits quantum information processing to determine a separating hyperplane using a number of steps sublinear in the number of data points $N$, namely $O(\sqrt{N})$. The second algorithm illustrates how the classical mistake bound of $O(1/\gamma^2)$ can be further improved to $O(1/\sqrt{\gamma})$ through quantum means, where $\gamma$ denotes the margin. Such improvements are achieved through the application of quantum amplitude amplification to the version space interpretation of the perceptron model.

1 Introduction

Quantum computation is an emerging technology that utilizes quantum effects to achieve significant, and in some cases exponential, speed-ups of algorithms over their classical counterparts. The growing importance of machine learning has in recent years led to a host of studies investigating the promise of quantum computers for machine learning [1, 2, 3, 4, 5, 6, 7, 8, 9]. While a number of important quantum speedups have been found, the majority of them come from replacing a classical subroutine with an equivalent, albeit faster, quantum algorithm. The true potential of quantum algorithms may therefore remain underexploited, since quantum algorithms have been constrained to follow the same methodology as traditional machine learning methods [10, 8, 9]. Here we consider an alternate approach: we devise a new machine learning algorithm that is tailored to the speedups that quantum computers can provide.

We illustrate our approach by focusing on perceptron training [11]. The perceptron is a fundamental building block for various machine learning models, including neural networks and support vector machines [12]. Unlike many other machine learning algorithms, tight bounds are known for the computational and statistical complexity of traditional perceptron training. Consequently, we are able to rigorously show distinct performance improvements that stem either from using quantum computers to accelerate traditional perceptron training or from devising a new form of perceptron training that aligns with the capabilities of quantum computers.

We provide two quantum approaches to perceptron training. The first approach focuses on the computational aspect of the problem: the proposed method quadratically reduces the scaling of the training complexity with respect to the number of training vectors. The second algorithm focuses on statistical efficiency. In particular, we start from the mistake bounds for traditional perceptron training and ask whether quantum computation lends any advantage. To this end, we propose an algorithm that quadratically improves the scaling of training with respect to the margin between the classes in the training data. The latter algorithm combines quantum amplitude amplification with the version space interpretation of the perceptron learning problem. Our approaches showcase the trade-offs that one can consider in developing quantum algorithms, and the ultimate advantages of performing learning tasks on a quantum computer.
[Figure 1: Version space and feature space views of classification. This figure is from [18].]

The rest of the paper is organized as follows: we first cover the background on perceptrons, version space, and Grover search. We then present our two quantum algorithms and provide an analysis of their computational and statistical efficiency before concluding.

2 Background

2.1 Perceptrons and Version Space

Given a set of $N$ separable training examples $\{\phi_1, \ldots, \phi_N\} \subset \mathbb{R}^D$ with corresponding labels $\{y_1, \ldots, y_N\}$, $y_i \in \{+1, -1\}$, the goal of perceptron learning is to recover a hyperplane $w$ that perfectly classifies the training set [11]. Formally, we want $w$ such that $y_i w^T \phi_i > 0$ for all $i$. There are various simple online algorithms that start with a random initialization of the hyperplane and make updates as they encounter more and more data [11, 13, 14, 15]; however, the rule that we consider for online perceptron training is, upon misclassifying a vector $(\phi, y)$, $w \leftarrow w + y\phi$.

A remarkable feature of the perceptron model is that upper bounds exist for the number of updates that need to be made during this training procedure. In particular, if the training data is composed of unit vectors $\phi_i \in \mathbb{R}^D$ that are separated by a margin of $\gamma$, then there are perceptron training algorithms that make at most $O(1/\gamma^2)$ mistakes [16], independent of the dimension of the training vectors. Similar bounds also exist when the data is not separable [17] and for other generalizations of perceptron training [13, 14, 15]. Note that in the worst case the algorithm will need to look at all points in the training set at least once; consequently, the computational complexity is $O(N)$. Our goal is to explore whether quantum procedures can provide improvements both in terms of computational complexity (better than $O(N)$) and statistical efficiency (better than $O(1/\gamma^2)$).

Instead of solely applying quantum constructs to the feature space, we also consider the version space interpretation of perceptrons, which leads to the improved scaling with $\gamma$. Formally, version space is defined as the set of all possible hyperplanes that perfectly separate the data: $\mathrm{VS} := \{w \mid y_i w^T \phi_i > 0 \text{ for all } i\}$. The traditional representation depicts training data as points in feature space and classifiers as hyperplanes. However, there exists a dual representation in which the hyperplanes are depicted as points and the data points are represented as hyperplanes that induce constraints on the feasible set of classifiers. Figure 1, which is borrowed from [18], illustrates the version space interpretation of perceptrons. Given three labeled data points in a 2D space, the dual space illustrates the set of normalized hyperplanes as a yellow ball with unit radius. The third dimension corresponds to the weights that multiply the two dimensions of the input data and the bias term. The planes represent the constraints imposed by observing the labeled data, as every labeled datum renders one half of the space infeasible. The version space is then the intersection of all the valid half-spaces. Naturally, classifiers including SVMs [12] and Bayes point machines [19] lie in the version space. We note that quantum constructs such as Grover search and amplitude amplification provide non-trivial speedups for the search task; this is the main reason why we resort to the version space interpretation.
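To ground the classical baseline before moving to the quantum constructs, here is a minimal Python sketch (ours, not from the paper) of the online update rule $w \leftarrow w + y\phi$ together with a version space membership check; `perceptron_train` and `in_version_space` are hypothetical helper names.

```python
import numpy as np

def perceptron_train(phis, ys, max_passes=1000):
    """Online perceptron: update w <- w + y*phi whenever (phi, y) is misclassified.

    phis: (N, D) array of unit training vectors; ys: (N,) array of labels in {-1, +1}.
    For data separated by margin gamma, Novikoff's theorem bounds the number of
    updates by O(1/gamma^2)."""
    N, D = phis.shape
    w = np.zeros(D)
    mistakes = 0
    for _ in range(max_passes):
        updated = False
        for phi, y in zip(phis, ys):
            if y * np.dot(w, phi) <= 0:      # misclassified (or on the boundary)
                w = w + y * phi              # perceptron update rule
                mistakes += 1
                updated = True
        if not updated:                      # a full pass with no mistakes: w separates the data
            return w, mistakes
    return w, mistakes

def in_version_space(w, phis, ys):
    """Check the version space condition y_i * w^T phi_i > 0 for all i."""
    return bool(np.all(ys * (phis @ w) > 0))
```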
We can use this formalism to simply pose the problem of determining the separating hyperplane as a search problem in the dual space. For example, given a set of candidate hyperplanes, our problem reduces to searching amongst the sample set for the classifier that successfully classifies the entire set. Therefore, training the perceptron is equivalent to finding any feasible point in the version space. We describe these quantum constructs in detail below.

[Figure 2: A geometric description of the action of $U_{\mathrm{grover}}$ on an initial state vector $\psi$.]

2.2 Grover's Search

Both quantum approaches introduced in this work, and their corresponding speed-ups, stem from a quantum subroutine called Grover's search [20, 21], which is a special case of a more general method referred to as amplitude amplification [22]. Rather than sampling from a probability distribution until a given marked element is found, the Grover search algorithm draws only one sample and then uses quantum operations to modify the distribution from which it sampled. The probability distribution is rotated, or more accurately the quantum state that yields the distribution is rotated, into one whose probability is sharply concentrated on the marked element. Once a sharply peaked distribution is identified, the marked item can be found using just one sample. In general, if the probability of finding such an element is known to be $a$, then amplitude amplification requires $O(\sqrt{1/a})$ operations to find the marked item with certainty.

While Grover's search is a quantum subroutine, it can in fact be understood using only geometric arguments. The only notions from quantum mechanics used are those of the quantum state vector and of Born's rule (measurement). A quantum state vector is a complex unit vector whose components have magnitudes equal to the square roots of the probabilities. In particular, if $\psi$ is a quantum state vector and $p$ is the corresponding probability distribution, then

$$p = \psi \circ \psi^*, \qquad (1)$$

where the unit column vector $\psi \in \mathbb{C}^n$ is the quantum state vector, $\circ$ is the Hadamard (pointwise) product, and $\psi^*$ is the complex conjugate of $\psi$. A quantum state can be measured such that, given a quantum state vector $\psi$ and a basis vector $w$, the probability of observing $w$ is $|\langle \psi, w \rangle|^2$, where $\langle \cdot, \cdot \rangle$ denotes the inner product.

We need to implement two unitary operations in order to perform the search algorithm:

$$U_{\mathrm{init}} = 2\psi\psi^\dagger - \mathbb{1}, \qquad U_{\mathrm{targ}} = \mathbb{1} - 2P. \qquad (2)$$

The operators $U_{\mathrm{init}}$ and $U_{\mathrm{targ}}$ can be interpreted geometrically as reflections within a two-dimensional space spanned by the vectors $\psi$ and $P\psi$. If we assume that $P\psi \neq 0$ and $P\psi \neq \psi$, then these two reflection operations can be used to rotate $\psi$ in the space $\mathrm{span}(\psi, P\psi)$. Specifically, this rotation is $U_{\mathrm{grover}} = U_{\mathrm{init}} U_{\mathrm{targ}}$; its action is illustrated in Figure 2. The angle between the vector $\psi$ and $P\psi/\|P\psi\|$ is $\pi/2 - \theta_a$, where $\theta_a := \sin^{-1}(|\langle \psi, P\psi/\|P\psi\| \rangle|)$. It then follows from elementary geometry and the rule for computing the probability distribution from a quantum state (known as Born's rule) that after $j$ iterations of Grover's algorithm the probability of finding a desirable outcome is

$$p(\psi \in \nu_{\mathrm{good}} \mid j) = \sin^2((2j + 1)\theta_a). \qquad (3)$$

It is then easy to see that if $\theta_a \ll 1$ and a probability of success greater than $1/4$ is desired, then $j \in O(1/\theta_a)$ suffices to find a marked outcome. This is quadratically faster than is possible from statistical sampling, which requires $O(1/\theta_a^2)$ samples on average.
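As a quick numerical illustration of Eq. (3) (our addition, not part of the paper), the following sketch assumes the marked fraction $a$ is known and compares the Grover iteration count, roughly $\pi/(4\theta_a) \in O(1/\sqrt{a})$, against the $O(1/a)$ samples that classical rejection sampling would need.

```python
import math

def grover_success_probability(a: float, j: int) -> float:
    """Probability of measuring a marked item after j Grover iterations,
    when a fraction `a` of the probability mass is initially on marked items."""
    theta_a = math.asin(math.sqrt(a))
    return math.sin((2 * j + 1) * theta_a) ** 2

def grover_iterations(a: float) -> int:
    """Iteration count ~ pi/(4*theta_a) = O(1/sqrt(a)) that nearly maximizes Eq. (3)."""
    theta_a = math.asin(math.sqrt(a))
    return max(0, round(math.pi / (4 * theta_a) - 0.5))

a = 1.0 / 1024                      # one marked item among 1024
j = grover_iterations(a)            # ~25 iterations, versus ~1000 classical samples
print(j, grover_success_probability(a, j))
```

For $a = 1/1024$ the sketch returns roughly 25 iterations with success probability close to one, whereas uniform sampling would need on the order of a thousand draws.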
Simple modifications to this algorithm allow it to be used in cases where $\theta_a$ is not known [21, 22].

3 Online quantum perceptron

Now that we have discussed Grover's search, we turn our attention to applying it to speed up online perceptron training. In order to do so, we first need to define the quantum model that we wish to use as our quantum analogue of perceptron training. While there are many ways of defining such a model, the following approach is perhaps the most direct.

Although the traditional feature space perceptron training algorithm is online [16], meaning that the training examples are provided to it one at a time in a streaming fashion, we deviate from this model slightly by instead requiring that the algorithm be fed training examples that are, in effect, sampled uniformly from the training set. This is a slightly weaker model, as it allows for the possibility that some training examples will be drawn multiple times. However, the ability to draw quantum states that are in a uniform superposition over all vectors in the training set enables quantum computing to provide advantages over classical methods that use either access model.

We assume without loss of generality that the training set consists of $N$ unit vectors, $\phi_1, \ldots, \phi_N$. We then define $\Phi_1, \ldots, \Phi_N$ to be basis vectors whose indices each coincide with a $(B+1)$-bit representation of the corresponding $(\phi_j, y_j)$, where $y_j \in \{-1, 1\}$ is the class assigned to $\phi_j$, and we let $\Phi_0$ be a fixed unit vector chosen to represent a blank memory register. We introduce the vectors $\Phi_j$ to make it clear that the quantum state vectors used to represent training vectors do not live in the same vector space as the training vectors themselves. We choose the quantum state vectors to occupy a larger space than the training vectors because, were the two spaces identical, the Heisenberg uncertainty principle would make it much more difficult for a quantum computer to compute the class that the perceptron assigns to a training vector. For example, the training vector $(\phi_j, y_j) = ([0, 0, 1, 0]^T, 1)$ can be encoded as the unsigned integer $00101_2 = 5$, which in turn can be represented by the unit vector $\Phi = [0, 0, 0, 0, 0, 1]^T$. More generally, if $\phi_j \in \mathbb{R}^D$ were a vector of floating point numbers, then a similar vector could be constructed by concatenating the binary representations of the $D$ floating point numbers that comprise it with $(y_j + 1)/2$ and expressing the bit string as an unsigned integer $Q$. The integer can then be expressed as a unit vector $\Phi$ with $[\Phi]_q = \delta_{q,Q}$. While encoding the training data as an exponentially long vector is inefficient on a classical computer, it is not on a quantum computer, because of the quantum computer's innate ability to store and manipulate exponentially large quantum state vectors.

Any machine learning algorithm, be it quantum or classical, needs a mechanism to access the training data. We assume that the data is accessed via an oracle that not only accesses the training data but also determines whether the data is misclassified. To clarify, let $\{u_j : j = 1, \ldots, N\}$ be an orthonormal basis of quantum state vectors that serve as addresses for the training vectors in the database. Given an input address for a training datum, the unitary operations $U$ and $U^\dagger$ allow the quantum computer to access the corresponding vector. Specifically, for all $j$,

$$U[u_j \otimes \Phi_0] = u_j \otimes \Phi_j, \qquad U^\dagger[u_j \otimes \Phi_j] = u_j \otimes \Phi_0. \qquad (4)$$

Given an input address vector $u_j$, the former corresponds to a database access and the latter inverts the database access.
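To make the basis-state encoding concrete, here is a small sketch (ours) for binary-valued features; the paper's more general floating-point encoding concatenates bit strings in the same way. The helper names `encode_example` and `basis_state` are hypothetical.

```python
import numpy as np

def encode_example(phi_bits, y):
    """Map a binary feature vector phi (entries in {0,1}) and label y in {-1,+1}
    to the unsigned integer Q obtained by concatenating the feature bits with (y+1)/2."""
    bits = list(phi_bits) + [(y + 1) // 2]
    Q = 0
    for b in bits:
        Q = (Q << 1) | int(b)
    return Q

def basis_state(Q, num_bits):
    """Unit vector Phi with [Phi]_q = delta_{q,Q} in a 2**num_bits dimensional space."""
    Phi = np.zeros(2 ** num_bits)
    Phi[Q] = 1.0
    return Phi

# The paper's example: phi = [0, 0, 1, 0], y = 1  ->  bit string 00101 = 5.
Q = encode_example([0, 0, 1, 0], 1)
Phi = basis_state(Q, num_bits=5)
print(Q, np.nonzero(Phi)[0])   # 5, [5]
```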
Note that because $U$ and $U^\dagger$ are linear operators, we have that $U \sum_{j=1}^N u_j \otimes \Phi_0 = \sum_j u_j \otimes \Phi_j$. A quantum computer can therefore access each training vector simultaneously using a single operation. The resultant vector is often called, in the physics literature, a quantum superposition of states, and this feature of linear transformations is referred to as quantum parallelism within quantum computing.

The next ingredient that we need is a method to test whether the perceptron correctly assigns a training vector addressed by a particular $u_j$. This process can be pictured as being performed by a unitary transformation that flips the sign of any basis vector that is misclassified. By linearity, a single application of this process flips the sign of any component of the quantum state vector that coincides with a misclassified training vector. It is therefore no more expensive than testing whether a given training vector is misclassified in a classical setting. We denote this operator, which depends on the perceptron weights $w$, by $\mathcal{F}_w$ and require that

$$\mathcal{F}_w[u_j \otimes \Phi_0] = (-1)^{f_w(\phi_j, y_j)}[u_j \otimes \Phi_0], \qquad (5)$$

where $f_w(\phi_j, y_j)$ is a Boolean function that is $1$ if and only if the perceptron with weights $w$ misclassifies training vector $\phi_j$. Since the classification step involves computing the dot products of finite size vectors, this process is efficient provided that the $\Phi_j$ are efficiently computable.

We apply $\mathcal{F}_w$ in the following way. Let $F_w$ be a unitary operation such that

$$F_w \Phi_j = (-1)^{f_w(\phi_j, y_j)} \Phi_j. \qquad (6)$$

$F_w$ is easy to implement on the quantum computer using a multiply controlled phase gate and a quantum implementation of the perceptron classification algorithm, $f_w$. We can then write

$$\mathcal{F}_w = U^\dagger (\mathbb{1} \otimes F_w) U. \qquad (7)$$

Classifying the data based on the phases (the minus signs) output by $F_w$ naturally leads to a very memory efficient training algorithm, because only one training vector is ever stored in memory during the implementation of $\mathcal{F}_w$ given in Eq. (7).

We can then use $\mathcal{F}_w$ to perform Grover's search algorithm, by taking $U_{\mathrm{targ}} = \mathcal{F}_w$ and $U_{\mathrm{init}} = 2\psi\psi^\dagger - \mathbb{1}$ with $\psi = \Psi := \frac{1}{\sqrt{N}}\sum_{j=1}^N u_j$, to seek out training vectors that the current perceptron model misclassifies. This leads to a quadratic reduction in the number of times that the training vectors need to be accessed by $\mathcal{F}_w$ or its classical analogue.

In the classical setting, the natural object to query is slightly different. The oracle that is usually assumed in online algorithms takes the form $U^c : \mathbb{Z} \mapsto \mathbb{C}^D$ where $U^c(j) = \phi_j$. We will assume that a similar function exists in both the classical and the quantum settings for simplicity. In both cases, we will consider the cost of a query to $U^c$ to be proportional to the cost of a query to $\mathcal{F}_w$.

We use these operations to implement a quantum search for training vectors that the perceptron misclassifies. This leads to a quadratic speedup relative to classical methods, as shown in the following theorem. It is also worth noting that our algorithm uses a slight variant on the Grover search algorithm to ensure that the runtime is finite.
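The paper's Algorithm 2 (given in the supplemental material) performs this search on a quantum register; as a stand-in, the following toy state-vector simulation (ours) mimics the effect of alternating the sign-flip oracle $\mathcal{F}_w$ with the reflection $2\Psi\Psi^\dagger - \mathbb{1}$ to locate a misclassified address, assuming the marked fraction is computed classically. A real implementation would instead use the Boyer et al. variant [21] when that fraction is unknown.

```python
import numpy as np

def grover_find_misclassified(w, phis, ys, rng=np.random.default_rng(0)):
    """Toy state-vector simulation: amplify the amplitudes of addresses j whose
    training vector (phi_j, y_j) is misclassified by w, then sample one address.
    Mirrors U_targ = F_w (sign flip on misclassified j) and U_init = 2|Psi><Psi| - 1."""
    N = len(ys)
    misclassified = (ys * (phis @ w)) <= 0           # f_w(phi_j, y_j) = 1
    if not misclassified.any():
        return None                                  # w already separates the data
    psi = np.full(N, 1 / np.sqrt(N))                 # uniform superposition Psi
    a = misclassified.mean()
    iterations = max(1, int(round(np.pi / (4 * np.arcsin(np.sqrt(a))))))
    state = psi.copy()
    for _ in range(iterations):
        state[misclassified] *= -1                   # oracle: flip marked signs
        state = 2 * psi * (psi @ state) - state      # reflection about Psi
    probs = state ** 2
    return int(rng.choice(N, p=probs / probs.sum())) # measure an address
```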
Theorem 1. Given a training set that consists of unit vectors $\Phi_1, \ldots, \Phi_N$ that are separated by a margin of $\gamma$ in feature space, the number of applications of $\mathcal{F}_w$ needed to infer a perceptron model, $w$, such that $P(\exists\, j : f_w(\phi_j) = 1) \le \epsilon$ using a quantum computer is $N_{\mathrm{quant}}$, where

$$\Omega(\sqrt{N}) \ni N_{\mathrm{quant}} \in O\!\left(\frac{\sqrt{N}}{\gamma^2}\log\!\left(\frac{1}{\epsilon\gamma^2}\right)\right),$$

whereas the number of queries to $f_w$ needed in the classical setting, $N_{\mathrm{class}}$, where the training vectors are found by sampling uniformly from the training data, is bounded by

$$\Omega(N) \ni N_{\mathrm{class}} \in O\!\left(\frac{N}{\gamma^2}\log\!\left(\frac{1}{\epsilon\gamma^2}\right)\right).$$

We assume in Theorem 1 that the training data in the classical case is accessed in a manner that is analogous to the sampling procedure used in the quantum setting. If instead the training data is supplied by a stream (as in the standard online model), then the upper bound changes to $N_{\mathrm{class}} \in O(N/\gamma^2)$ because all $N$ training vectors can be deterministically checked to see if they are correctly classified by the perceptron. A quantum advantage is therefore obtained if $N \gg \log^2(1/\epsilon\gamma^2)$.

In order to prove Theorem 1 we need two technical lemmas (proven in the supplemental material). The first bounds the complexity of the classical analogue of our training method:

Lemma 1. Given only the ability to sample uniformly from the training vectors, the number of queries to $f_w$ needed to find a training vector that the current perceptron model fails to classify correctly, or to conclude that no such example exists, with probability $1 - \epsilon\gamma^2$ is at most $O(N \log(1/\epsilon\gamma^2))$.

The second proves the correctness of our online quantum perceptron algorithm and bounds its complexity:

Lemma 2. Assuming that the training vectors $\{\phi_1, \ldots, \phi_N\}$ are unit vectors and that they are drawn from two classes separated by a margin of $\gamma$ in feature space, Algorithm 2 will either update the perceptron weights, or conclude that the current model provides a separating hyperplane between the two classes, using a number of queries to $\mathcal{F}_w$ that is bounded above by $O(\sqrt{N}\log(1/\epsilon\gamma^2))$, with probability of failure at most $\epsilon\gamma^2$.

Having stated these results, we can now provide the proof of Theorem 1.

Proof of Theorem 1. The upper bounds follow as direct consequences of Lemma 2 and Lemma 1. Novikoff's theorem [16, 17] states that the algorithms described in both lemmas must be applied at most $1/\gamma^2$ times before a separating hyperplane is found. However, either the classical or the quantum algorithm may fail to find a misclassified vector at each of the $O(1/\gamma^2)$ steps. The union bound states that the probability that this happens is at most the sum of the respective failure probabilities in each step. These probabilities are constrained to be at most $\gamma^2\epsilon$, which means that the total probability of failing to correctly find a mistake is at most $\epsilon$ if both algorithms are repeated $1/\gamma^2$ times (which is the worst case number of times that they need to be repeated).

The lower bound on the quantum query complexity follows from contradiction. Assume that there exists an algorithm that can train an arbitrary perceptron using $o(\sqrt{N})$ query operations. We now show that unstructured search with one marked element can be expressed as a perceptron training problem. Let $w$ be a known set of perceptron weights and assume that the perceptron misclassifies only one vector, $\phi_1$. Thus, if perceptron training succeeds, the value of $\phi_1$ can be extracted by comparing $w$ with the updated weights. This training problem is therefore equivalent to searching for a misclassified vector. Now let $\phi_j = [1 - F(j), F(j)]^T \otimes \chi_j$, where $\chi_j$ is a unit vector that represents the bit string $j$ and $F(j)$ is a Boolean function.
Assume that $F(0) = 1$ and $F(j) = 0$ if $j \neq 0$, which is without loss of generality equivalent to Grover's problem [20, 21]. Now assume that $\phi_j$ is assigned to class $2F(j) - 1$ and take $w = -[1, 0]^T \otimes \frac{1}{\sqrt{N}}\sum_j \chi_j$. This perceptron therefore misclassifies $\phi_0$ and no other vector in the training set. Updating the weights yields the misclassified vector, which in turn yields the value of $j$ such that $F(j) = 1$, and so Grover's search reduces to perceptron training. Since Grover's search reduces to perceptron training in the case of one marked item, the lower bound of $\Omega(\sqrt{N})$ queries for Grover's search [21] applies to perceptron training. Since we assumed that perceptron training needs $o(\sqrt{N})$ queries, this is a contradiction and the lower bound must be $\Omega(\sqrt{N})$.

We have assumed that in the classical setting the user only has access to the training vectors through an oracle that is promised to draw a uniform sample from $\{(\phi_1, y_1), \ldots, (\phi_N, y_N)\}$. Since we are counting the number of queries to $f_w$, it is clear that, in the worst case, the training vector on which the perceptron makes a mistake can be the last unique value sampled from this list. Thus the query complexity is $\Omega(N)$ classically.

4 Quantum version space perceptron

The strategy for our quantum version space training algorithm is to pose the problem of determining a separating hyperplane as search. Specifically, the idea is to first generate $K$ sample hyperplanes $w_1, \ldots, w_K$ from a spherical Gaussian distribution $\mathcal{N}(0, \mathbb{1})$. Given a large enough $K$, we are guaranteed to have at least one hyperplane amongst the samples that lies in the version space and perfectly separates the data. As discussed earlier, Grover's algorithm can provide a quadratic speedup over classical search; consequently, the efficiency of the algorithm is determined by $K$. Theorem 2 provides insight into how to determine the number of hyperplanes that must be sampled.

Theorem 2. Given a training set that consists of $D$-dimensional unit vectors $\Phi_1, \ldots, \Phi_N$ with labels $y_1, \ldots, y_N$ that are separated by a margin of $\gamma$ in feature space, a $D$-dimensional vector $w$ sampled from $\mathcal{N}(0, \mathbb{1})$ perfectly separates the data with probability $\Theta(\gamma)$.

The proof of this theorem is provided in the supplementary material. The consequence of Theorem 2, stated below, is that the expected number of samples $K$ required such that a separating hyperplane exists in the set only needs to scale as $O(1/\gamma)$. This is remarkable because, similar to Novikoff's theorem [16], the number of samples needed does not scale with $D$. Thus Theorem 2 implies that if amplitude amplification is used to boost the probability of finding a vector in the version space, then the resulting quantum algorithm will need only $O(1/\sqrt{\gamma})$ quantum steps on average.

Next we show how to use Grover's algorithm to search for a hyperplane that lies in the version space. Let us take $K = 2^\ell$ for a positive integer $\ell$. Given the sampled hyperplanes $w_1, \ldots, w_K$, we let $W_1, \ldots, W_K$ be vectors that encode a binary representation of these random perceptron vectors. In analogy to $\Phi_0$, we also define $W_0$ to be a vector that represents an empty data register. We define the unitary operator $V$ that generates these weights given an address vector $u_j$ as follows:

$$V[u_j \otimes W_0] = [u_j \otimes W_j]. \qquad (8)$$

In this context we can also think of the address vector, $u_j$, as representing a seed for a pseudorandom number generator that yields perceptron weights $W_j$. Also, let us define the classical analogue of $V$ to be $V^c$, which obeys $V^c(j) = w_j$.
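The following sketch (ours) makes Theorem 2 easy to check empirically: it draws $K$ hyperplanes from a spherical Gaussian and measures the fraction that land in the version space, which should scale roughly linearly with the margin $\gamma$ for separable data. The function name is hypothetical.

```python
import numpy as np

def sample_version_space_hit_rate(phis, ys, K=1000, rng=np.random.default_rng(1)):
    """Draw K hyperplanes w ~ N(0, I) and report the fraction that lie in the
    version space, i.e. satisfy y_i * w^T phi_i > 0 for all i (Theorem 2 says
    this fraction is Theta(gamma) for data separated by margin gamma)."""
    D = phis.shape[1]
    ws = rng.standard_normal((K, D))                 # K spherical Gaussian samples
    correct = (ys[None, :] * (ws @ phis.T)) > 0      # (K, N) sign checks
    in_vs = correct.all(axis=1)
    return in_vs.mean(), ws[in_vs]                   # hit rate and the separating w's
```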
Now, using $V$ (and applying the Hadamard transform [23]), we can prepare the quantum state

$$\Psi := \frac{1}{\sqrt{K}} \sum_{k=1}^{K} u_k \otimes W_k, \qquad (9)$$

which corresponds to a uniform distribution over the randomly chosen $w$. Now that we have defined the initial state, $\Psi$, for Grover's search, we need to define an oracle that marks the vectors inside the version space. Let us define the operator $\hat{F}_{\phi,y}$ via

$$\hat{F}_{\phi,y}[u_j \otimes W_0] = (-1)^{1 + f_{w_j}(\phi, y)}[u_j \otimes W_0]. \qquad (10)$$

This unitary operation looks at an address vector, $u_j$, computes the corresponding perceptron model $W_j$, flips the sign of any component of the quantum state vector that lies in the half space of version space specified by $\phi$, and then uncomputes $W_j$. This process can be realized using a quantum subroutine that computes $f_w$, an application of $V$ and $V^\dagger$, and the application of a conditional phase gate (a fundamental quantum operation usually denoted $Z$) [23].

The oracle $\hat{F}_{\phi,y}$ does not allow us to directly use Grover's search to rotate a quantum state vector that is outside the version space towards the version space boundary, because it effectively checks only one of the half-space inequalities that define the version space. It can, however, be used to build an operation, $\hat{G}$, that reflects about the version space:

$$\hat{G}[u_j \otimes W_0] = (-1)^{1 + (f_{w_j}(\phi_1, y_1) \vee \cdots \vee f_{w_j}(\phi_N, y_N))}[u_j \otimes W_0]. \qquad (11)$$

The operation $\hat{G}$ can be implemented using $2N$ applications of $\hat{F}_{\phi,y}$ as well as a sequence of $O(N)$ elementary quantum gates; hence we cost a query to $\hat{G}$ as $O(N)$ queries to $\hat{F}_{\phi,y}$. We use these components in our version space training algorithm to, in effect, amplify the margin between the two classes from $\gamma$ to $\sqrt{\gamma}$. We give the asymptotic scaling of this algorithm in the following theorem.

Theorem 3. Given a training set that consists of unit vectors $\Phi_1, \ldots, \Phi_N$ that are separated by a margin of $\gamma$ in feature space, the number of queries to $\hat{F}_{\phi,y}$ needed to infer, with probability at least $1 - \epsilon$, a perceptron model $w$ such that $w$ is in the version space, using a quantum computer, is $N_{\mathrm{quant}}$ where

$$N_{\mathrm{quant}} \in O\!\left(\frac{N}{\sqrt{\gamma}}\log^{3/2}\!\left(\frac{1}{\epsilon}\right)\right).$$

Proof. The proof of the theorem follows directly from bounds on $K$ and the validity of our version space training algorithm. It is clear from the previous discussion that the algorithm carries out Grover's search, but instead of searching for a $\phi$ that is misclassified, it searches for a $w$ in the version space. Its validity therefore follows from exactly the same steps as in the proof of Lemma 2, but with $N = K$. However, since the algorithm need not be repeated $1/\gamma^2$ times in this context, we can replace $\gamma$ with $1$ in the proof. Thus, if we wish to have a probability of failure of at most $\epsilon'$, then the number of queries made to $\hat{G}$ is in $O(\sqrt{K}\log(1/\epsilon'))$. This also guarantees that if any of the $K$ vectors are in the version space, then the probability of failing to find that vector is at most $\epsilon'$. Next, since one query to $\hat{G}$ is costed at $N$ queries to $\hat{F}_{\phi,y}$, the query complexity (in units of queries to $\hat{F}_{\phi,y}$) becomes $O(N\sqrt{K}\log(1/\epsilon'))$.

The only thing that remains is to bound the value of $K$ needed. The probability of finding a vector in the version space is $\Theta(\gamma)$ from Theorem 2. This means that there exists $\alpha > 0$ such that the probability of failing to find a vector in the version space $K$ times is at most $(1 - \alpha\gamma)^K \le e^{-\alpha\gamma K}$. Thus this probability is at most $\delta$ for $K \in \Omega\!\left(\frac{1}{\gamma}\log(1/\delta)\right)$. It then suffices to pick $K \in \Theta(\log(1/\delta)/\gamma)$ for the algorithm. The union bound implies that the probability that either none of the vectors lie in the version space or that Grover's search fails to find such an element is at most $\epsilon' + \delta \le \epsilon$.
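For intuition, here is a toy classical simulation (ours, not the paper's algorithm) of the version space search: it samples $K$ candidate hyperplanes, marks the addresses that $\hat{G}$ would flip (those lying in the version space), and amplitude-amplifies them before sampling an address; the coherent bookkeeping with $V$, $W_j$, and $\hat{F}_{\phi,y}$ is abstracted away.

```python
import numpy as np

def version_space_search(phis, ys, K=1024, rng=np.random.default_rng(2)):
    """Toy simulation of the version space perceptron: sample K = 2**l candidate
    hyperplanes w_k ~ N(0, I), mark the addresses k whose w_k separates the data
    (the role of G-hat), and amplitude-amplify the marked addresses before sampling."""
    D = phis.shape[1]
    ws = rng.standard_normal((K, D))
    marked = ((ys[None, :] * (ws @ phis.T)) > 0).all(axis=1)   # w_k in version space?
    if not marked.any():
        return None                                            # resample with larger K
    a = marked.mean()                                          # ~Theta(gamma) by Theorem 2
    psi = np.full(K, 1 / np.sqrt(K))
    state = psi.copy()
    for _ in range(max(1, int(round(np.pi / (4 * np.arcsin(np.sqrt(a))))))):
        state[marked] *= -1                                    # reflection by G-hat
        state = 2 * psi * (psi @ state) - state                # reflection about Psi
    probs = state ** 2
    k = int(rng.choice(K, p=probs / probs.sum()))
    return ws[k] if marked[k] else None                        # verify the sampled w_k
```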
Thus it suffices to pick $\epsilon' \in \Theta(\epsilon)$ and $\delta \in \Theta(\epsilon)$ to ensure that the total probability of failure is at most $\epsilon$. Therefore, the total number of queries made to $\hat{F}_{\phi,y}$ is in $O(N\log^{3/2}(1/\epsilon)/\sqrt{\gamma})$, as claimed.

The classical algorithm discussed previously has complexity $O(N\log(1/\epsilon)/\gamma)$, which follows from Theorem 2, since $K \in \Theta(\log(1/\epsilon)/\gamma)$ suffices to make the probability of not drawing an element of the version space at most $\epsilon$. This demonstrates a quantum advantage if $\frac{1}{\gamma} \gg \log(1/\epsilon)$, and illustrates that quantum computing can be used to boost the effective margins of the training data. Quantum models of perceptrons therefore not only provide advantages in terms of the number of vectors that need to be queried in the training process; they can also make the perceptron much more perceptive by making training less sensitive to small margins.

These performance improvements can also be viewed as mistake bounds for the version space perceptron. The inner loop in the version space algorithm attempts to sample from the version space and, once it draws a sample, tests it against the training vectors to see if it errs on any example. Since the inner loop is repeated $O(\sqrt{K}\log(1/\epsilon))$ times, the maximum number of misclassified vectors that arises from this training process is, from Theorem 2, $O\!\left(\frac{1}{\sqrt{\gamma}}\log^{3/2}(1/\epsilon)\right)$, which, for constant $\epsilon$, constitutes a quartic improvement over the standard mistake bound of $1/\gamma^2$ [16].

5 Conclusion

We have provided two distinct ways to look at quantum perceptron training that each afford different speedups relative to the other. The first provides a quadratic speedup with respect to the size of the training data. We further show that this algorithm is asymptotically optimal, in that a super-quadratic speedup would violate known lower bounds for quantum searching. The second provides a quadratic reduction in the scaling of the training time (as measured by the number of interactions with the training data) with the margin between the two classes. This latter result is especially interesting because it constitutes a quartic speedup relative to the typical perceptron training bounds that are usually seen in the literature.

Perhaps the most significant feature of our work is that it demonstrates that quantum computing can provide provable speedups for perceptron training, which is a foundational machine learning method. While our work gives two possible ways of viewing the perceptron model through the lens of quantum computing, other quantum variants of the perceptron model may exist. Seeking new models for perceptron learning that deviate from these classical approaches may not only provide a more complete understanding of what form learning takes within quantum systems, but may also lead to richer classes of quantum models that have no classical analogue and are not efficiently simulable on classical hardware. Such models may not only revolutionize quantum learning but also lead to a deeper understanding of the challenges and opportunities that the laws of physics place on our ability to learn.

References

[1] M. Lewenstein. Quantum perceptrons. Journal of Modern Optics, 41(12):2491–2501, 1994.
[2] Esma Aïmeur, Gilles Brassard, and Sébastien Gambs. Machine learning in a quantum world. In Advances in Artificial Intelligence, pages 431–442. Springer, 2006.
[3] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and unsupervised machine learning. arXiv preprint arXiv:1307.0411, 2013.
[4] Nathan Wiebe, Ashish Kapoor, and Krysta Svore. Quantum nearest-neighbor algorithms for machine learning. Quantum Information and Computation, 15:318–358, 2015.
[5] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis. Nature Physics, 10(9):631–633, 2014.
[6] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113(13):130503, 2014.
[7] Nathan Wiebe and Christopher Granade. Can small quantum systems learn? arXiv preprint arXiv:1512.03145, 2015.
[8] Nathan Wiebe, Ashish Kapoor, and Krysta M. Svore. Quantum deep learning. arXiv preprint arXiv:1412.3489, 2014.
[9] Mohammad H. Amin, Evgeny Andriyash, Jason Rolfe, Bohdan Kulchytskyy, and Roger Melko. Quantum Boltzmann machine. arXiv preprint arXiv:1601.02036, 2016.
[10] Silvano Garnerone, Paolo Zanardi, and Daniel A. Lidar. Adiabatic quantum algorithm for search engine ranking. Physical Review Letters, 108(23):230506, 2012.
[11] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[12] Johan A. K. Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
[13] Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-Taylor, and Jaz Kandola. The perceptron algorithm with uneven margins. In ICML, volume 2, pages 379–386, 2002.
[14] Claudio Gentile. A new approximate maximal margin classification algorithm. The Journal of Machine Learning Research, 2:213–242, 2002.
[15] Shai Shalev-Shwartz and Yoram Singer. A new perspective on an old perceptron algorithm. In Learning Theory, pages 264–278. Springer, 2005.
[16] Albert B. J. Novikoff. On convergence proofs for perceptrons. Technical report, DTIC Document, 1963.
[17] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.
[18] Thomas P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[19] Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines: Estimating the Bayes point in kernel space. In IJCAI Workshop on Support Vector Machines, pages 23–27, 1999.
[20] Lov K. Grover. A fast quantum mechanical algorithm for database search. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pages 212–219. ACM, 1996.
[21] Michel Boyer, Gilles Brassard, Peter Høyer, and Alain Tapp. Tight bounds on quantum searching. arXiv preprint quant-ph/9605034, 1996.
[22] Gilles Brassard, Peter Høyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification and estimation. Contemporary Mathematics, 305:53–74, 2002.
[23] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2010.