{"title": "Quantum Perceptron Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3999, "page_last": 4007, "abstract": "We demonstrate how quantum computation can provide non-trivial improvements in the computational and statistical complexity of the perceptron model. We develop two quantum algorithms for perceptron learning. The first algorithm exploits quantum information processing to determine a separating hyperplane using a number of steps sublinear in the number of data points $N$, namely $O(\\sqrt{N})$. The second algorithm illustrates how the classical mistake bound of $O(\\frac{1}{\\gamma^2})$ can be further improved to $O(\\frac{1}{\\sqrt{\\gamma}})$ through quantum means, where $\\gamma$ denotes the margin. Such improvements are achieved through the application of quantum amplitude amplification to the version space interpretation of the perceptron model.", "full_text": "Quantum Perceptron Models\n\nNathan Wiebe\n\nMicrosoft Research\nRedmond WA, 98052\n\nAshish Kapoor\n\nMicrosoft Research\nRedmond WA, 98052\n\nKrysta M Svore\nMicrosoft Research\nRedmond WA, 98052\n\nnawiebe@microsoft.com\n\nakapoor@microsoft.com\n\nksvore@microsoft.com\n\nAbstract\n\nWe demonstrate how quantum computation can provide non-trivial improvements\nin the computational and statistical complexity of the perceptron model. We\ndevelop two quantum algorithms for perceptron learning. 
The first algorithm exploits quantum information processing to determine a separating hyperplane using a number of steps sublinear in the number of data points N, namely O(√N). The second algorithm illustrates how the classical mistake bound of O(1/γ²) can be further improved to O(1/√γ) through quantum means, where γ denotes the margin. Such improvements are achieved through the application of quantum amplitude amplification to the version space interpretation of the perceptron model.

1 Introduction

Quantum computation is an emerging technology that utilizes quantum effects to achieve significant, and in some cases exponential, speed-ups of algorithms over their classical counterparts. The growing importance of machine learning has in recent years led to a host of studies that investigate the promise of quantum computers for machine learning [1, 2, 3, 4, 5, 6, 7, 8, 9].
While a number of important quantum speedups have been found, the majority of these speedups are due to replacing a classical subroutine with an equivalent, albeit faster, quantum algorithm. The true potential of quantum algorithms may therefore remain underexploited, since quantum algorithms have been constrained to follow the same methodology behind traditional machine learning methods [10, 8, 9]. Here we consider an alternate approach: we devise a new machine learning algorithm that is tailored to the speedups that quantum computers can provide.
We illustrate our approach by focusing on perceptron training [11]. The perceptron is a fundamental building block for various machine learning models, including neural networks and support vector machines [12]. Unlike many other machine learning algorithms, tight bounds are known for the computational and statistical complexity of traditional perceptron training.
Consequently, we are able to rigorously show different performance improvements that stem from either using quantum computers to improve traditional perceptron training or from devising a new form of perceptron training that aligns with the capabilities of quantum computers.
We provide two quantum approaches to perceptron training. The first approach focuses on the computational aspect of the problem, and the proposed method quadratically reduces the scaling of the complexity of training with respect to the number of training vectors. The second algorithm focuses on statistical efficiency. In particular, we use the mistake bounds for traditional perceptron training methods and ask whether quantum computation lends any advantages. To this end, we propose an algorithm that quadratically improves the scaling of the training algorithm with respect to the margin between the classes in the training data. The latter algorithm combines quantum amplitude amplification with the version space interpretation of the perceptron learning problem. Our approaches showcase the trade-offs that one can consider in developing quantum algorithms, and the ultimate advantages of performing learning tasks on a quantum computer.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: Version space and feature space views of classification. This figure is from [18].

The rest of the paper is organized as follows: we first cover the background on perceptrons, version space, and Grover search.
We then present our two quantum algorithms and provide analysis of their computational and statistical efficiency before concluding.

2 Background

2.1 Perceptrons and Version Space

Given a set of N separable training examples {φ1, . . . , φN} ⊂ ℝ^D with corresponding labels {y1, . . . , yN}, yi ∈ {+1, −1}, the goal of perceptron learning is to recover a hyperplane w that perfectly classifies the training set [11]. Formally, we want w such that yi · wᵀφi > 0 for all i. There are various simple online algorithms that start with a random initialization of the hyperplane and make updates as they encounter more and more data [11, 13, 14, 15]; however, the rule that we consider for online perceptron training is, upon misclassifying a vector (φ, y), w ← w + yφ.
A remarkable feature of the perceptron model is that upper bounds exist for the number of updates that need to be made during this training procedure. In particular, if the training data is composed of unit vectors, φi ∈ ℝ^D, that are separated by a margin of γ, then there are perceptron training algorithms that make at most O(1/γ²) mistakes [16], independent of the dimension of the training vectors. Similar bounds also exist when the data is not separable [17] and for other generalizations of perceptron training [13, 14, 15].
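The classical online update rule above (w ← w + yφ upon a mistake) can be sketched in a few lines. This is a toy illustration of the classical baseline the paper speeds up, with a made-up four-point data set; it is not the quantum algorithm:

```python
import random

def train_perceptron(data, max_updates=10_000):
    """Classical online perceptron: on a misclassified (phi, y), update w <- w + y*phi."""
    w = [0.0] * len(data[0][0])
    updates = 0
    while updates < max_updates:
        mistakes = [(phi, y) for phi, y in data
                    if y * sum(wi * xi for wi, xi in zip(w, phi)) <= 0]
        if not mistakes:
            break                         # w now satisfies y_i * <w, phi_i> > 0 for all i
        phi, y = random.choice(mistakes)  # models uniform sampling of training examples
        w = [wi + y * xi for wi, xi in zip(w, phi)]
        updates += 1
    return w, updates

# Illustrative data: two separable classes of unit vectors in the plane.
data = [((1.0, 0.0), 1), ((0.8, 0.6), 1), ((-1.0, 0.0), -1), ((-0.6, -0.8), -1)]
w, n_updates = train_perceptron(data)
```

Consistent with Novikoff's bound, the number of updates here stays well below 1/γ² for this wide-margin data set.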
Note that in the worst case the algorithm will need to look at all points in the training set at least once; consequently, the computational complexity will be O(N).
Our goal is to explore whether quantum procedures can provide improvements both in terms of computational complexity (better than O(N)) and statistical efficiency (better than O(1/γ²)). Instead of solely applying quantum constructs to the feature space, we also consider the version space interpretation of perceptrons, which leads to the improved scaling with γ.
Formally, version space is defined as the set of all hyperplanes that perfectly separate the data: VS := {w | yi · wᵀφi > 0 for all i}. The traditional representation depicts training data as points in feature space and classifiers as hyperplanes. However, there exists a dual representation in which the hyperplanes are depicted as points and the data points are represented as hyperplanes that induce constraints on the feasible set of classifiers. Figure 1, which is borrowed from [18], illustrates the version space interpretation of perceptrons. Given three labeled data points in a 2D space, the dual space illustrates the set of normalized hyperplanes as a yellow ball with unit radius. The third dimension corresponds to the weights that multiply the two dimensions of the input data and the bias term. The planes represent the constraints imposed by observing the labeled data, as every labeled datum renders one half of the space infeasible. The version space is then the intersection of all the valid half-spaces. Naturally, classifiers including SVMs [12] and Bayes point machines [19] lie in the version space.
We note that there are quantum constructs, such as Grover search and amplitude amplification, which provide non-trivial speedups for the search task.
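The version-space condition VS := {w | yi · wᵀφi > 0 for all i} amounts to a one-line membership test. A minimal sketch (the two-point data set is illustrative only):

```python
def in_version_space(w, data):
    """Version-space membership: y_i * <w, phi_i> > 0 for every labeled example."""
    return all(y * sum(wi * xi for wi, xi in zip(w, phi)) > 0 for phi, y in data)

# Illustrative data: two labeled points in the plane.
data = [((1.0, 0.0), 1), ((0.0, 1.0), -1)]
```

Here `in_version_space((1.0, -1.0), data)` holds, while the hyperplane w = (0.0, 1.0) violates the first constraint and lies outside the version space.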
This is the main reason why we resort to the version space interpretation. We can use this formalism to pose the problem of determining the separating hyperplane as a search problem in the dual space. For example, given a set of candidate hyperplanes, our problem reduces to searching amongst the sample set for the classifier that will successfully classify the entire set. Therefore training the perceptron is equivalent to finding any feasible point in the version space. We describe these quantum constructs in detail below.

Figure 2: A geometric description of the action of Ugrover on an initial state vector ψ.

2.2 Grover's Search

Both quantum approaches introduced in this work, and their corresponding speed-ups, stem from a quantum subroutine called Grover's search [20, 21], which is a special case of a more general method referred to as amplitude amplification [22]. Rather than sampling from a probability distribution until a given marked element is found, the Grover search algorithm draws only one sample and then uses quantum operations to modify the distribution from which it sampled. The probability distribution is rotated, or more accurately the quantum state that yields the distribution is rotated, into one whose probability is sharply concentrated on the marked element. Once a sharply peaked distribution is identified, the marked item can be found using just one sample. In general, if the probability of finding such an element is known to be a, then amplitude amplification requires O(√(1/a)) operations to find the marked item with certainty.
While Grover's search is a quantum subroutine, it can in fact be understood using only geometric arguments. The only notions from quantum mechanics used are those of the quantum state vector and Born's rule (measurement).
A quantum state vector is a complex unit vector whose components have magnitudes equal to the square roots of the probabilities. In particular, if ψ is a quantum state vector and p is the corresponding probability distribution, then

p = ψ† ∘ ψ, (1)

where the unit column vector ψ sits in the vector space ℂⁿ, ∘ is the Hadamard (pointwise) product, and † is the complex conjugate transpose. A quantum state can be measured such that, if we have a quantum state vector ψ and a basis vector w, then the probability of measuring ψ = w is |⟨ψ, w⟩|², where ⟨·,·⟩ denotes the inner product.
We need to implement two unitary operations in order to perform the search algorithm:

Uinit = 2ψψ† − 𝟙, Utarg = 𝟙 − 2P. (2)

The operators Uinit and Utarg can be interpreted geometrically as reflections within a two-dimensional space spanned by the vectors ψ and Pψ. If we assume that Pψ ≠ 0 and Pψ ≠ ψ, then these two reflection operations can be used to rotate ψ in the space span(ψ, Pψ). Specifically, this rotation is Ugrover = UinitUtarg. Its action is illustrated in Figure 2. The angle between the vector ψ and Pψ/‖Pψ‖ is π/2 − θa, where θa := sin⁻¹(|⟨ψ, Pψ/‖Pψ‖⟩|).
It then follows from elementary geometry and the rule for computing the probability distribution from a quantum state (Born's rule) that after j iterations of Grover's algorithm the probability of finding a desirable outcome is

p(ψ ∈ νgood | j) = sin²((2j + 1)θa). (3)

It is then easy to see that if θa ≪ 1 and a probability of success greater than 1/4 is desired, then j ∈ O(1/θa) suffices to find a marked outcome. This is quadratically faster than is possible from statistical sampling, which requires O(1/θa²) samples on average. Simple modifications to this algorithm allow it to be used in cases where θa is not known [21, 22].

3 Online quantum perceptron

Now that we have discussed Grover's search, we turn our attention to applying it to speed up online perceptron training. In order to do so, we first need to define the quantum model that we wish to use as our quantum analogue of perceptron training. While there are many ways of defining such a model, the following approach is perhaps the most direct. Although the traditional feature space perceptron training algorithm is online [16], meaning that the training examples are provided to it one at a time in a streaming fashion, we deviate from this model slightly by instead requiring that the algorithm be fed training examples that are, in effect, sampled uniformly from the training set. This is a slightly weaker model, as it allows for the possibility that some training examples will be drawn multiple times.
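The amplification schedule of Eq. (3) can be checked numerically. The sketch below assumes a marked-item probability a and uses the textbook choice of roughly π/(4θa) iterations; the constants are illustrative:

```python
import math

def grover_success_probability(a, j):
    """Eq. (3): p = sin^2((2j + 1) * theta_a) with theta_a = asin(sqrt(a))."""
    theta_a = math.asin(math.sqrt(a))
    return math.sin((2 * j + 1) * theta_a) ** 2

a = 1.0 / 1024.0                        # chance of the marked outcome under one sample
theta_a = math.asin(math.sqrt(a))
j_opt = round(math.pi / (4 * theta_a))  # ~pi/(4 theta_a) iterations, i.e. O(1/theta_a)
p = grover_success_probability(a, j_opt)
```

With j = 0 the expression reduces to sin²(θa) = a, and after roughly 25 iterations (versus on the order of 1000 classical samples) the success probability is driven close to 1.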
However, the ability to draw quantum states that are in a uniform superposition over all vectors in the training set enables quantum computing to provide advantages over classical methods that use either access model.
We assume without loss of generality that the training set consists of N unit vectors, φ1, . . . , φN. We then define Φ1, . . . , ΦN to be the basis vectors whose indices each coincide with a (B + 1)-bit representation of the corresponding (φj, yj), where yj ∈ {−1, 1} is the class assigned to φj, and we let Φ0 be a fixed unit vector that is chosen to represent a blank memory register.
We introduce the vectors Φj to make it clear that the quantum state vectors used to represent training vectors do not live in the same vector space as the training vectors themselves. We choose the quantum state vectors here to occupy a larger space than the training vectors because, if the training vectors were encoded directly as quantum state vectors, the Heisenberg uncertainty principle would make it much more difficult for a quantum computer to compute the class that the perceptron assigns to a training vector.
For example, the training vector (φj, yj) ≡ ([0, 0, 1, 0]ᵀ, 1) can be encoded as the unsigned integer 00101 ≡ 5, which in turn can be represented by the unit vector Φ = [0, 0, 0, 0, 0, 1]ᵀ. More generally, if φj ∈ ℝ^D were a vector of floating point numbers, then a similar vector could be constructed by concatenating the binary representations of the D floating point numbers that comprise it with (yj + 1)/2 and expressing the bit string as an unsigned integer, Q. The integer can then be expressed as a unit vector Φ with [Φ]q = δq,Q.
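The integer encoding described above can be sketched directly. `encode_example` and `one_hot` are hypothetical helper names introduced here for illustration; the example reproduces the paper's 00101 ≡ 5 case:

```python
def encode_example(phi_bits, y):
    """Concatenate the bits of phi with (y + 1)/2 and read the string as an unsigned integer Q."""
    bits = "".join(str(b) for b in phi_bits) + str((y + 1) // 2)
    return int(bits, 2)

def one_hot(q, length):
    """The unit vector Phi with [Phi]_q = delta_{q,Q}."""
    return [1 if i == q else 0 for i in range(length)]

q = encode_example([0, 0, 1, 0], 1)  # the paper's example: bit string 00101, i.e. 5
phi_vec = one_hot(q, 2 ** 5)         # one basis vector out of 2^(B+1) = 32
```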
While encoding the training data as an exponentially long vector is inefficient on a classical computer, it is not on a quantum computer because of the quantum computer's innate ability to store and manipulate exponentially large quantum state vectors.
Any machine learning algorithm, be it quantum or classical, needs a mechanism to access the training data. We assume that the data is accessed via an oracle that not only accesses the training data but also determines whether the data is misclassified. To clarify, let {uj : j = 1, . . . , N} be an orthonormal basis of quantum state vectors that serve as addresses for the training vectors in the database. Given an input address for a training datum, the unitary operations U and U† allow the quantum computer to access the corresponding vector. Specifically, for all j,

U[uj ⊗ Φ0] = uj ⊗ Φj, U†[uj ⊗ Φj] = uj ⊗ Φ0. (4)

Given an input address vector uj, the former corresponds to a database access and the latter inverts the database access.
Note that because U and U† are linear operators, we have that U ∑_{j=1}^N uj ⊗ Φ0 = ∑_j uj ⊗ Φj. A quantum computer can therefore access each training vector simultaneously using a single operation. The resultant vector is often called, in the physics literature, a quantum superposition of states, and this feature of linear transformations is referred to as quantum parallelism within quantum computing.
The next ingredient that we need is a method to test whether the perceptron correctly assigns a training vector addressed by a particular uj. This process can be pictured as being performed by a unitary transformation that flips the sign of any basis vector that is misclassified. By linearity, a single application of this process flips the sign of any component of the quantum state vector that coincides with a misclassified training vector.
It therefore is no more expensive than testing whether a given training vector is misclassified in a classical setting. We denote the operator, which depends on the perceptron weights w, by ℱw and require that

ℱw[uj ⊗ Φ0] = (−1)^{fw(φj, yj)}[uj ⊗ Φ0], (5)

where fw(φj, yj) is a Boolean function that is 1 if and only if the perceptron with weights w misclassifies training vector φj. Since the classification step involves computing dot products of finite-size vectors, this process is efficient given that the Φj are efficiently computable.
We apply ℱw in the following way. Let Fw be a unitary operation such that

FwΦj = (−1)^{fw(φj, yj)}Φj. (6)

Fw is easy to implement in the quantum computer using a multiply controlled phase gate and a quantum implementation of the perceptron classification algorithm, fw. We can then write

ℱw = U†(𝟙 ⊗ Fw)U. (7)

Classifying the data based on the phases (the minus signs) output by Fw naturally leads to a very memory-efficient training algorithm because only one training vector is ever stored in memory during the implementation of ℱw given in Eq. (7). We can then use ℱw to perform Grover's search algorithm, by taking Utarg = ℱw and Uinit = 2ψψ† − 𝟙 with ψ = Ψ := (1/√N) ∑_{j=1}^N uj, to seek out training vectors that the current perceptron model misclassifies. This leads to a quadratic reduction in the number of times that the training vectors need to be accessed by ℱw or its classical analogue.
In the classical setting, the natural object to query is slightly different. The oracle that is usually assumed in online algorithms takes the form U^c : ℤ ↦ ℂ^D, where U^c(j) = φj. We will assume that a similar function exists in both the classical and the quantum settings for simplicity.
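The effect of the Grover iteration built from the misclassification oracle of Eq. (5) can be simulated classically by tracking the amplitudes over the N address states: the oracle flips the sign of misclassified indices, and the diffusion step implements Uinit = 2ψψ† − 𝟙 for uniform ψ (each amplitude a maps to 2·mean − a). This is an illustrative sketch; N and the misclassified index are made up:

```python
import math

def grover_step(amps, misclassified):
    """One Grover iteration over address amplitudes: oracle sign flip on the
    misclassified indices, then reflection about the uniform superposition."""
    amps = [-a if j in misclassified else a for j, a in enumerate(amps)]
    mean = sum(amps) / len(amps)
    return [2 * mean - a for a in amps]

N = 64
misclassified = {17}                # suppose the current model errs only on vector 17
amps = [1 / math.sqrt(N)] * N       # uniform superposition over the addresses u_j
for _ in range(round(math.pi * math.sqrt(N) / 4)):   # O(sqrt(N)) iterations
    amps = grover_step(amps, misclassified)
p_hit = amps[17] ** 2               # Born-rule probability of sampling index 17
```

After O(√N) iterations the probability mass concentrates on the misclassified address, which is the quadratic saving over the O(N) classical scan.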
In both cases, we will consider the cost of a query to U^c to be proportional to the cost of a query to Fw.
We use these operations to implement a quantum search for training vectors that the perceptron misclassifies. This leads to a quadratic speedup relative to classical methods, as shown in the following theorem. It is also worth noting that our algorithm uses a slight variant of the Grover search algorithm to ensure that the runtime is finite.
Theorem 1. Given a training set that consists of unit vectors Φ1, . . . , ΦN that are separated by a margin of γ in feature space, the number of applications of Fw needed to infer a perceptron model, w, such that P(∃ j : fw(φj) = 1) ≤ ε using a quantum computer is Nquant, where

Ω(√N) ∋ Nquant ∈ O( (√N/γ²) log(1/(εγ²)) ),

whereas the number of queries to fw needed in the classical setting, Nclass, where the training vectors are found by sampling uniformly from the training data, is bounded by

Ω(N) ∋ Nclass ∈ O( (N/γ²) log(1/(εγ²)) ).

We assume in Theorem 1 that the training data in the classical case is accessed in a manner that is analogous to the sampling procedure used in the quantum setting. If instead the training data is supplied by a stream (as in the standard online model) then the upper bound changes to Nclass ∈ O(N/γ²), because all N training vectors can be deterministically checked to see if they are correctly classified by the perceptron. A quantum advantage is therefore obtained if N ≫ log²(1/(εγ²)).
In order to prove Theorem 1 we need two technical lemmas (proven in the supplemental material). The first bounds the complexity of the classical analogue of our training method:
Lemma 1.
Given only the ability to sample uniformly from the training vectors, the number of queries to fw needed to find a training vector that the current perceptron model fails to classify correctly, or to conclude that no such example exists, with probability 1 − εγ² is at most O(N log(1/(εγ²))).

The second proves the correctness of our online quantum perceptron algorithm and bounds its complexity:
Lemma 2. Assuming that the training vectors {φ1, . . . , φN} are unit vectors and that they are drawn from two classes separated by a margin of γ in feature space, Algorithm 2 will either update the perceptron weights, or conclude that the current model provides a separating hyperplane between the two classes, using a number of queries to Fw that is bounded above by O(√N log(1/(εγ²))), with probability of failure at most εγ².

After stating these results, we can now provide the proof of Theorem 1.

Proof of Theorem 1. The upper bounds follow as direct consequences of Lemma 2 and Lemma 1. Novikoff's theorem [16, 17] states that the algorithms described in both lemmas must be applied at most 1/γ² times before finding the result. However, either the classical or the quantum algorithm may fail to find a misclassified vector at each of the O(1/γ²) steps. The union bound states that the probability that this happens is at most the sum of the respective probabilities in each step.
These probabilities are constrained to be γ²ε, which means that the total probability of failing to correctly find a mistake is at most ε if both algorithms are repeated 1/γ² times (which is the worst-case number of times that they need to be repeated).
The lower bound on the quantum query complexity follows from contradiction. Assume that there exists an algorithm that can train an arbitrary perceptron using o(√N) query operations. We want to show that unstructured search with one marked element can be expressed as a perceptron training problem. Let w be a known set of perceptron weights and assume that the perceptron only misclassifies one vector, φ1. Thus if perceptron training succeeds then the value of the misclassified vector can be extracted from the updated weights. This training problem is therefore equivalent to searching for a misclassified vector. Now let φj = [1 ⊕ F(j), F(j)]ᵀ ⊗ χj, where χj is a unit vector that represents the bit string j and F(j) is a Boolean function. Assume that F(0) = 1 and F(j) = 0 if j ≠ 0, which is without loss of generality equivalent to Grover's problem [20, 21]. Now assume that φj is assigned to class 2F(j) − 1 and take w = [1/√2, 1/√2]ᵀ ⊗ (1/√N) ∑_j χj. This perceptron therefore misclassifies φ0 and no other vector in the training set. Updating the weights yields φ0, which in turn yields the value of j such that F(j) = 1, and so Grover's search reduces to perceptron training.
Since Grover's search reduces to perceptron training in the case of one marked item, the lower bound of Ω(√N) queries for Grover's search [21] applies to perceptron training. Since we assumed that perceptron training needs o(√N) queries, this is a contradiction and the lower bound must be Ω(√N).
We have assumed in the classical setting that the user only has access to the training vectors through an oracle that is promised to draw a uniform sample from {(φ1, y1), . . . , (φN, yN)}. Since we are counting the number of queries to fw, it is clear that in the worst possible case the training vector that the perceptron makes a mistake on can be the last unique value sampled from this list. Thus the query complexity is Ω(N) classically.

4 Quantum version space perceptron

The strategy for our quantum version space training algorithm is to pose the problem of determining a separating hyperplane as search. Specifically, the idea is to first generate K sample hyperplanes w1, . . . , wK from a spherical Gaussian distribution N(0, 𝟙). Given a large enough K, we are guaranteed to have at least one hyperplane amongst the samples that lies in the version space and perfectly separates the data. As discussed earlier, Grover's algorithm can provide a quadratic speedup over classical search; consequently, the efficiency of the algorithm is determined by K. Theorem 2 provides insight into how to determine the number of hyperplanes to be sampled.
Theorem 2. Given a training set that consists of D-dimensional unit vectors Φ1, . . . , ΦN with labels y1, . . . , yN that are separated by a margin of γ in feature space, a D-dimensional vector w sampled from N(0, 𝟙) perfectly separates the data with probability Θ(γ).

The proof of this theorem is provided in the supplementary material. The consequence of Theorem 2, stated below, is that the expected number of samples K required such that a separating hyperplane exists in the set only needs to scale as O(1/γ). This is remarkable because, similar to Novikoff's theorem [16], the number of samples needed does not scale with D.
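Theorem 2's Θ(γ) success probability can be probed with a small Monte Carlo sketch. The two-point data set below is illustrative only; its version space covers about half of all directions, so the estimate should sit near 1/2:

```python
import random

def separates(w, data):
    return all(y * sum(wi * xi for wi, xi in zip(w, phi)) > 0 for phi, y in data)

def estimate_success_probability(data, dim, trials=20_000):
    """Fraction of w ~ N(0, I) landing in the version space (Theorem 2: Theta(gamma))."""
    hits = sum(1 for _ in range(trials)
               if separates([random.gauss(0.0, 1.0) for _ in range(dim)], data))
    return hits / trials

random.seed(0)  # deterministic sketch
data = [((1.0, 0.0), 1), ((-1.0, 0.0), -1)]  # both constraints reduce to w_1 > 0
p_hat = estimate_success_probability(data, dim=2)
```

Shrinking the margin of the data set shrinks the estimated probability proportionally, matching the Θ(γ) scaling.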
Thus Theorem 2 implies that if amplitude amplification is used to boost the probability of finding a vector in the version space, then the resulting quantum algorithm will need only O(1/√γ) quantum steps on average.

Next we show how to use Grover's algorithm to search for a hyperplane that lies in the version space. Let us take K = 2^ℓ for a positive integer ℓ. Then, letting w1, . . . , wK be the sampled hyperplanes, we define W1, . . . , WK to be vectors that encode a binary representation of these random perceptron vectors. In analogy to Φ0, we also define W0 to be a vector that represents an empty data register. We define the unitary operator V that generates these weights given an address vector uj as follows:

V[uj ⊗ W0] = [uj ⊗ Wj]. (8)

In this context we can also think of the address vector, uj, as representing a seed for a pseudo-random number generator that yields the perceptron weights Wj.
Let us also define the classical analogue of V to be V^c, which obeys V^c(j) = wj. Now using V (and applying the Hadamard transform [23]) we can prepare the quantum state

Ψ := (1/√K) ∑_{k=1}^K uk ⊗ Wk, (9)

which corresponds to a uniform distribution over the randomly chosen w.
Now that we have defined the initial state, Ψ, for Grover's search, we need to define an oracle that marks the vectors inside the version space. Let us define the operator F̂φ,y via

F̂φ,y[uj ⊗ W0] = (−1)^{1+fwj(φ,y)}[uj ⊗ W0]. (10)

This unitary operation looks at an address vector, uj, computes the corresponding perceptron model Wj, flips the sign of any component of the quantum state vector that lies in the half-space of version space specified by φ, and then uncomputes Wj.
This process can be realized using a quantum subroutine that computes fw, an application of V and V†, and the application of a conditional phase gate (a fundamental quantum operation usually denoted Z) [23].
The oracle F̂φ,y does not allow us to directly use Grover's search to rotate a quantum state vector that is outside the version space towards the version space boundary, because it effectively only checks one of the half-space inequalities that define the version space. It can, however, be used to build an operation, Ĝ, that reflects about the version space:

Ĝ[uj ⊗ W0] = (−1)^{1+(fwj(φ1,y1) ∨ ··· ∨ fwj(φN,yN))}[uj ⊗ W0]. (11)

The operation Ĝ can be implemented using 2N applications of F̂φ,y as well as a sequence of O(N) elementary quantum gates; hence we cost a query to Ĝ as O(N) queries to F̂φ,y.
We use these components in our version space training algorithm to, in effect, amplify the margin between the two classes from γ to √γ. We give the asymptotic scaling of this algorithm in the following theorem.
Theorem 3. Given a training set that consists of unit vectors Φ1, . . . , ΦN that are separated by a margin of γ in feature space, the number of queries to F̂φ,y needed to infer, with probability at least 1 − ε, a perceptron model w such that w is in the version space using a quantum computer is Nquant, where

Nquant ∈ O( (N/√γ) log^{3/2}(1/ε) ).

Proof. The proof of the theorem follows directly from bounds on K and the validity of our version space training algorithm.
It is clear from previous discussions that the algorithm carries out Grover's search, but instead of searching for a φ that is misclassified, it searches for a w in version space. Its validity therefore follows by the same steps as in the proof of Lemma 2, but with N = K. However, since the algorithm need not be repeated 1/γ² times in this context, we can replace γ with 1 in the proof. Thus if we wish to have a probability of failure of at most ε′, then the number of queries made to Ĝ is in O(√K log(1/ε′)). This also guarantees that if any of the K vectors are in the version space, then the probability of failing to find that vector is at most ε′.
Next, since one query to Ĝ is costed at N queries to F̂φ,y, the query complexity (in units of queries to F̂φ,y) becomes O(N√K log(1/ε′)). The only thing that then remains is to bound the value of K needed.
The probability of finding a vector in the version space is Θ(γ) from Theorem 2. This means that there exists α > 0 such that the probability of failing to find a vector in the version space K times is at most (1 − αγ)^K ≤ e^{−αγK}. Thus this probability is at most δ for K ∈ Ω((1/γ) log(1/δ)). It then suffices to pick K ∈ Θ(log(1/δ)/γ) for the algorithm.
The union bound implies that the probability that either none of the vectors lie in the version space or that Grover's search fails to find such an element is at most ε′ + δ ≤ ε. Thus it suffices to pick ε′ ∈ Θ(ε) and δ ∈ Θ(ε) to ensure that the total probability is at most ε.
Therefore the total number of queries made to $\hat{F}_{\phi,y}$ is in $O(N\log^{3/2}(1/\epsilon)/\sqrt{\gamma})$, as claimed.

The classical algorithm discussed previously has complexity $O(N\log(1/\epsilon)/\gamma)$, which follows from the fact from Theorem 2 that $K \in \Theta(\log(1/\epsilon)/\gamma)$ suffices to make the probability of not drawing an element of the version space at most $\epsilon$. This demonstrates a quantum advantage if $\frac{1}{\gamma} \gg \log(1/\epsilon)$, and illustrates that quantum computing can be used to boost the effective margins of the training data. Quantum models of perceptrons therefore not only provide advantages in terms of the number of vectors that need to be queried in the training process; they can also make the perceptron much more perceptive by making training less sensitive to small margins.

These performance improvements can also be viewed as mistake bounds for the version space perceptron. The inner loop in the version space algorithm attempts to sample from the version space, and once it draws a sample it tests it against the training vectors to see if it errs on any example. Since the inner loop is repeated $O(\sqrt{K}\log(1/\epsilon))$ times, the maximum number of misclassified vectors that arises from this training process is, from Theorem 2, $O(\frac{1}{\sqrt{\gamma}}\log^{3/2}(1/\epsilon))$, which, for constant $\epsilon$, constitutes a quartic improvement over the standard mistake bound of $1/\gamma^2$ [16].

5 Conclusion

We have provided two distinct ways to look at quantum perceptron training that each afford different speedups relative to the other. The first provides a quadratic speedup with respect to the size of the training data. We further show that this algorithm is asymptotically optimal, in the sense that a super-quadratic speedup would violate known lower bounds for quantum searching.
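The quantum search primitive behind these lower bounds can be illustrated with a minimal classical statevector simulation of Grover's algorithm (a toy sketch with hypothetical sizes; the single marked index stands in for a misclassified vector or a version-space member, not for the full training algorithm):

```python
import math

def grover_success_probability(K: int, marked: int) -> float:
    """Plain statevector simulation of Grover's search over K items with
    one marked index, run for round(pi/4 * sqrt(K)) iterations.
    Returns the probability of measuring the marked item."""
    state = [1.0 / math.sqrt(K)] * K              # uniform superposition
    iterations = round(math.pi / 4 * math.sqrt(K))
    for _ in range(iterations):
        state[marked] = -state[marked]            # oracle: phase-flip the marked item
        mean = sum(state) / K                     # diffusion: reflect about the mean
        state = [2.0 * mean - a for a in state]
    return state[marked] ** 2

# Hypothetical sizes: 256 items, marked index chosen arbitrarily.
p = grover_success_probability(256, marked=7)
print(p)  # close to 1 after only ~13 oracle calls, versus ~256 classically
```

After roughly $\frac{\pi}{4}\sqrt{K}$ oracle calls the marked amplitude is near one; this quadratic scaling in the number of oracle queries is exactly what both of our algorithms exploit.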
The second provides a quadratic reduction in the scaling of the training time (as measured by the number of interactions with the training data) with the margin between the two classes. This latter result is especially interesting because it constitutes a quartic speedup relative to the typical perceptron training bounds seen in the literature.

Perhaps the most significant feature of our work is that it demonstrates that quantum computing can provide provable speedups for perceptron training, which is a foundational machine learning method. While our work gives two possible ways of viewing the perceptron model through the lens of quantum computing, other quantum variants of the perceptron model may exist. Seeking new models for perceptron learning that deviate from these classical approaches may not only provide a more complete understanding of what form learning takes within quantum systems, but may also lead to richer classes of quantum models that have no classical analogue and are not efficiently simulatable on classical hardware. Such models may not only revolutionize quantum learning but also lead to a deeper understanding of the challenges and opportunities that the laws of physics place on our ability to learn.

References

[1] M. Lewenstein. Quantum perceptrons. Journal of Modern Optics, 41(12):2491–2501, 1994.

[2] Esma Aïmeur, Gilles Brassard, and Sébastien Gambs. Machine learning in a quantum world. In Advances in Artificial Intelligence, pages 431–442. Springer, 2006.

[3] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and unsupervised machine learning. arXiv preprint arXiv:1307.0411, 2013.

[4] Nathan Wiebe, Ashish Kapoor, and Krysta Svore. Quantum nearest-neighbor algorithms for machine learning. Quantum Information and Computation, 15:318–358, 2015.

[5] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost.
Quantum principal component analysis. Nature Physics, 10(9):631–633, 2014.

[6] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113(13):130503, 2014.

[7] Nathan Wiebe and Christopher Granade. Can small quantum systems learn? arXiv preprint arXiv:1512.03145, 2015.

[8] Nathan Wiebe, Ashish Kapoor, and Krysta M. Svore. Quantum deep learning. arXiv preprint arXiv:1412.3489, 2014.

[9] Mohammad H. Amin, Evgeny Andriyash, Jason Rolfe, Bohdan Kulchytskyy, and Roger Melko. Quantum Boltzmann machine. arXiv preprint arXiv:1601.02036, 2016.

[10] Silvano Garnerone, Paolo Zanardi, and Daniel A. Lidar. Adiabatic quantum algorithm for search engine ranking. Physical Review Letters, 108(23):230506, 2012.

[11] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

[12] Johan A. K. Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

[13] Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-Taylor, and Jaz Kandola. The perceptron algorithm with uneven margins. In ICML, volume 2, pages 379–386, 2002.

[14] Claudio Gentile. A new approximate maximal margin classification algorithm. The Journal of Machine Learning Research, 2:213–242, 2002.

[15] Shai Shalev-Shwartz and Yoram Singer. A new perspective on an old perceptron algorithm. In Learning Theory, pages 264–278. Springer, 2005.

[16] Albert B. J. Novikoff. On convergence proofs for perceptrons. Technical report, DTIC Document, 1963.

[17] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.

[18] Thomas P. Minka. A family of algorithms for approximate Bayesian inference.
PhD thesis, Massachusetts Institute of Technology, 2001.

[19] Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines: Estimating the Bayes point in kernel space. In IJCAI Workshop on SVMs, pages 23–27, 1999.

[20] Lov K. Grover. A fast quantum mechanical algorithm for database search. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pages 212–219. ACM, 1996.

[21] Michel Boyer, Gilles Brassard, Peter Høyer, and Alain Tapp. Tight bounds on quantum searching. arXiv preprint quant-ph/9605034, 1996.

[22] Gilles Brassard, Peter Høyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification and estimation. Contemporary Mathematics, 305:53–74, 2002.

[23] Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2010.