{"title": "Efficient Approximation Algorithms for Strings Kernel Based Sequence Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 6935, "page_last": 6945, "abstract": "Sequence classification algorithms, such as SVM, require a definition of distance (similarity) measure between two sequences. A commonly used notion of similarity is the number of matches between k-mers (k-length subsequences) in the two sequences. Extending this definition, by considering two k-mers to match if their distance is at most m, yields better classification performance. This, however, makes the problem computationally much more complex. Known algorithms to compute this similarity have computational complexity that render them applicable only for small values of k and m. In this work, we develop novel techniques to efficiently and accurately estimate the pairwise similarity score, which enables us to use much larger values of k and m, and get higher predictive accuracy. This opens up a broad avenue of applying this classification approach to audio, images, and text sequences. Our algorithm achieves excellent approximation performance with theoretical guarantees. In the process we solve an open combinatorial problem, which was posed as a major hindrance to the scalability of existing solutions. 
We give analytical bounds on quality and runtime of our algorithm and report its empirical performance on real world biological and music sequence datasets.", "full_text": "Efficient Approximation Algorithms for String Kernel Based Sequence Classification

Muhammad Farhan (Department of Computer Science, School of Science and Engineering, Lahore University of Management Sciences, Lahore, Pakistan; 14030031@lums.edu.pk) · Juvaria Tariq (Department of Mathematics, School of Science and Engineering, Lahore University of Management Sciences, Lahore, Pakistan; jtariq@emory.edu) · Arif Zaman (Department of Computer Science, School of Science and Engineering, Lahore University of Management Sciences, Lahore, Pakistan; arifz@lums.edu.pk) · Mudassir Shabbir (Department of Computer Science, Information Technology University, Lahore, Pakistan; mudassir.shabbir@itu.edu.pk) · Imdad Ullah Khan (Department of Computer Science, School of Science and Engineering, Lahore University of Management Sciences, Lahore, Pakistan; imdad.khan@lums.edu.pk)

Abstract

Sequence classification algorithms, such as SVM, require a definition of distance (similarity) measure between two sequences. A commonly used notion of similarity is the number of matches between k-mers (k-length subsequences) in the two sequences. Extending this definition, by considering two k-mers to match if their distance is at most m, yields better classification performance. This, however, makes the problem computationally much more complex. Known algorithms to compute this similarity have computational complexity that renders them applicable only for small values of k and m. In this work, we develop novel techniques to efficiently and accurately estimate the pairwise similarity score, which enables us to use much larger values of k and m, and get higher predictive accuracy.
This opens up a broad avenue of applying this classification approach to audio, images, and text sequences. Our algorithm achieves excellent approximation performance with theoretical guarantees. In the process we solve an open combinatorial problem, which was posed as a major hindrance to the scalability of existing solutions. We give analytical bounds on quality and runtime of our algorithm and report its empirical performance on real world biological and music sequence datasets.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

Sequence classification is a fundamental task in pattern recognition, machine learning, and data mining, with numerous applications in bioinformatics, text mining, and natural language processing. Detecting protein homology (shared ancestry measured from the similarity of their amino acid sequences) and predicting protein folds (functional three dimensional structure) are essential tasks in bioinformatics. Sequence classification algorithms have been applied to both of these problems with great success [3, 10, 13, 18, 19, 20, 25]. Music data, a real valued signal when discretized using vector quantization of MFCC features, is another flavor of sequential data [26]. Sequence classification has been used for recognizing genres of music sequences with no annotation and for identifying artists from albums [12, 13, 14]. Text documents can also be considered as sequences of words from a language lexicon. Categorizing texts into classes based on their topics is another application domain of sequence classification [11, 15].

While general purpose classification methods may be applicable to sequence classification, the huge lengths of sequences, large alphabet sizes, and large scale datasets prove to be rather challenging for such techniques.
Furthermore, we cannot directly apply classification algorithms devised for vectors in metric spaces, because in almost all practical scenarios sequences have varying lengths, unless some mapping is done beforehand. In one of the more successful approaches, the variable-length sequences are represented as fixed dimensional feature vectors. A feature vector typically is the spectrum (counts) of all k-length substrings (k-mers) present exactly [18] or inexactly (with up to m mismatches) [19] within a sequence. A kernel function is then defined that takes as input a pair of feature vectors and returns a real-valued similarity score between the pair (typically the inner product of the respective spectra). The matrix of pairwise similarity scores (the kernel matrix) thus computed is used as input to a standard support vector machine (SVM) [5, 27] classifier, resulting in excellent classification performance in many applications [19]. In this setting k (the length of substrings used as bases of the feature map) and m (the mismatch parameter) are independent variables directly related to classification accuracy and time complexity of the algorithm. It has been established that using larger values of k and m improves classification performance [11, 13]. On the other hand, the runtime of kernel computation by the efficient trie-based algorithm [19, 24] is O(k^{m+1} |Σ|^m (|X| + |Y|)) for two sequences X and Y over alphabet Σ.

Computation of the mismatch kernel between two sequences X and Y reduces to the following two problems. i) Given two k-mers α and β that are at Hamming distance d from each other, determine the size of the intersection of the m-mismatch neighborhoods of α and β (k-mers that are at distance at most m from both of them). ii) For 0 ≤ d ≤ min{2m, k}, determine the number of pairs of k-mers (α, β) ∈ X × Y such that the Hamming distance between α and β is d.
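To make the feature-map idea and problem ii) concrete, here is a minimal Python sketch (illustrative names only, not the paper's implementation): the exact-match (m = 0) k-spectrum kernel as an inner product of k-mer counts, and the brute-force distance profile M_d that problem ii) asks for.

```python
from collections import Counter

def kmers(s, k):
    # all k-length substrings (k-mers) of s, with multiplicity
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def spectrum_kernel(x, y, k):
    # exact-match (m = 0) k-spectrum kernel: inner product of k-mer count vectors
    cx, cy = Counter(kmers(x, k)), Counter(kmers(y, k))
    return sum(cx[g] * cy[g] for g in cx)

def hamming(a, b):
    # Hamming distance between two equal-length strings
    return sum(ca != cb for ca, cb in zip(a, b))

def distance_profile(x, y, k):
    # M_d = number of k-mer pairs (one from each sequence) at Hamming distance
    # exactly d; this O(|X|·|Y|·k) double loop is the brute force that the
    # sampling scheme developed below is designed to avoid
    hist = Counter()
    for a in kmers(x, k):
        for b in kmers(y, k):
            hist[hamming(a, b)] += 1
    return hist
```

Note that M_0 coincides with the k-spectrum kernel value, since a pair at distance 0 is exactly an exact match.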
In the best known algorithm [13] the former problem is addressed by precomputing the intersection sizes, in constant time per value, for m ≤ 2 only. A sorting and enumeration based technique is proposed for the latter problem that has computational complexity O(2^k (|X| + |Y|)), which makes it applicable for moderately large values of k (though, of course, still limited to m ≤ 2).

In this paper, we completely resolve the combinatorial problem (problem i) for all values of m. We prove a closed form expression for the size of the intersection of m-mismatch neighborhoods that lets us precompute these values in O(m³) time (independent of |Σ|, k, and the lengths and number of sequences). For the latter problem we devise an efficient approximation scheme, inspired by the theory of locality sensitive hashing, to accurately estimate the number of k-mer pairs between the two sequences that are at distance d. Combining the above two, we design a polynomial time approximation algorithm for kernel computation. We provide probabilistic guarantees on the quality of our algorithm and analytical bounds on its runtime. Furthermore, we test our algorithm on several real world datasets with large values of k and m to demonstrate that we achieve excellent predictive performance. Note that string kernel based sequence classification was previously not feasible for this range of parameters.

2 Related Work

In the computational biology community, pairwise alignment similarity scores were traditionally used as the basis for classification, such as local and global alignment [5, 29]. String kernel based classification was introduced in [30, 9]. Extending this idea, [30] defined the gappy n-gram kernel and used it in conjunction with SVM [27] for text classification.
The main drawback of this approach is that the runtime for kernel evaluation depends quadratically on the lengths of the sequences.

An alternative model of string kernels represents sequences as fixed dimensional vectors of counts of occurrences of k-mers in them. These include the k-spectrum [18] and substring [28] kernels. This notion is extended to count inexact occurrences of patterns in sequences, as in the mismatch [19] and profile [10] kernels. In this transformed feature space SVM is used to learn class boundaries. This approach yields excellent classification accuracies [13], but the computational complexity of kernel evaluation remains a daunting challenge [11].

The exponential dimension (|Σ|^k) of the feature space for both the k-spectrum kernel and the k, m-mismatch kernel makes explicit transformation of strings computationally prohibitive. SVM does not require the feature vectors explicitly; it only uses pairwise dot products between them. A trie-based strategy to implicitly compute kernel values for pairs of sequences was proposed in [18] and [19]. A (k, m)-mismatch tree is introduced, which is a rooted |Σ|-ary tree of depth k, where each internal node has a child corresponding to each symbol in Σ and every leaf corresponds to a k-mer in Σ^k. The runtime for computing the k, m-mismatch kernel value between two sequences X and Y under this trie-based framework is O((|X| + |Y|) k^{m+1} |Σ|^m), where |X| and |Y| are the lengths of the sequences. This makes the algorithm feasible only for small alphabet sizes and a very small number of allowed mismatches.

The k-mer based kernel framework has been extended in several ways by defining different string kernels such as the restricted gappy kernel, substitution kernel, wildcard kernel [20], cluster kernel [32], sparse spatial kernel [12], abstraction-augmented kernel [16], and generalized similarity kernel [14]. For literature on large scale kernel learning and kernel
approximation see [34, 1, 7, 22, 23, 33] and the references therein.

3 Algorithm for Kernel Computation

In this section we formulate the problem, describe our algorithm, and analyze its runtime and quality.

k-spectrum and k, m-mismatch kernel: Given a sequence X over alphabet Σ, the k, m-mismatch spectrum of X is a |Σ|^k-dimensional vector, Φ_{k,m}(X), of the number of times each possible k-mer occurs in X with at most m mismatches. Formally,

    Φ_{k,m}(X) = (Φ_{k,m}(X)[γ])_{γ∈Σ^k} = ( Σ_{α∈X} I_m(α, γ) )_{γ∈Σ^k},    (1)

where I_m(α, γ) = 1 if α belongs to the set of k-mers that differ from γ by at most m mismatches, i.e. the Hamming distance between α and γ satisfies d(α, γ) ≤ m. Note that for m = 0 it is known as the k-spectrum of X. The k, m-mismatch kernel value for two sequences X and Y (the mismatch spectrum similarity score) [19] is defined as:

    K(X, Y | k, m) = ⟨Φ_{k,m}(X), Φ_{k,m}(Y)⟩ = Σ_{γ∈Σ^k} Φ_{k,m}(X)[γ] Φ_{k,m}(Y)[γ]
                   = Σ_{γ∈Σ^k} ( Σ_{α∈X} I_m(α, γ) )( Σ_{β∈Y} I_m(β, γ) ) = Σ_{α∈X} Σ_{β∈Y} Σ_{γ∈Σ^k} I_m(α, γ) I_m(β, γ).    (2)

For a k-mer α, let N_{k,m}(α) = {γ ∈ Σ^k : d(α, γ) ≤ m} be the m-mutational neighborhood of α. Then for a pair of sequences X and Y, the k, m-mismatch kernel given in eq. (2) can be equivalently computed as follows [13]:

    K(X, Y | k, m) = Σ_{α∈X} Σ_{β∈Y} Σ_{γ∈Σ^k} I_m(α, γ) I_m(β, γ) = Σ_{α∈X} Σ_{β∈Y} |N_{k,m}(α) ∩ N_{k,m}(β)| = Σ_{α∈X} Σ_{β∈Y} I_m(α, 
β),    (3)

where I_m(α, β) = |N_{k,m}(α) ∩ N_{k,m}(β)| is the size of the intersection of the m-mutational neighborhoods of α and β. We use the following two facts.

Fact 3.1. I_m(α, β), the size of the intersection of the m-mismatch neighborhoods of α and β, is a function of k, m, |Σ| and d(α, β), and is independent of the actual k-mers α and β or the actual positions where they differ. (See Section 3.1.)

Fact 3.2. If d(α, β) > 2m, then I_m(α, β) = 0.

In view of the above two facts we can rewrite the kernel value (3) as

    K(X, Y | k, m) = Σ_{α∈X} Σ_{β∈Y} I_m(α, β) = Σ_{i=0}^{min{2m,k}} M_i · I_i,    (4)

where I_i = I_m(α, β) when d(α, β) = i, and M_i is the number of pairs of k-mers (α, β) ∈ X × Y such that d(α, β) = i. Note that the bounds on the last summation follow from Fact 3.2 and the fact that the Hamming distance between two k-mers is at most k. Hence the problem of kernel evaluation is reduced to computing the M_i's and evaluating the I_i's.

3.1 Closed form for Intersection Size

Let N_{k,m}(α, β) be the intersection of the m-mismatch neighborhoods of α and β, i.e. N_{k,m}(α, β) = N_{k,m}(α) ∩ N_{k,m}(β). As defined earlier, |N_{k,m}(α, β)| = I_m(α, β). Let N_q(α) = {γ ∈ Σ^k : d(α, γ) = q} be the set of k-mers that differ from α in exactly q indices. Note that N_q(α) ∩ N_r(α) = ∅ for all q ≠ r. 
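Facts 3.1 and 3.2 can be checked directly by enumeration over a small alphabet. The following sketch (illustrative code, not the paper's implementation) computes I_m(α, β) by brute force; the value depends only on d(α, β), and vanishes when d(α, β) > 2m.

```python
from itertools import product

def neighborhood(alpha, sigma, m):
    # N_{k,m}(alpha): all k-mers over sigma within Hamming distance m of alpha
    k = len(alpha)
    return {g for g in map(''.join, product(sigma, repeat=k))
            if sum(a != b for a, b in zip(alpha, g)) <= m}

def intersection_size(alpha, beta, sigma, m):
    # I_m(alpha, beta) = |N_{k,m}(alpha) ∩ N_{k,m}(beta)|, by explicit enumeration
    return len(neighborhood(alpha, sigma, m) & neighborhood(beta, sigma, m))
```

For example, over Σ = {A, C, G, T} with k = 4 and m = 1, any pair at Hamming distance 1 yields the same intersection size (Fact 3.1), and any pair at distance 3 > 2m yields 0 (Fact 3.2). Enumeration costs |Σ|^k per neighborhood, which is exactly why the closed form of Section 3.1 is needed.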
Using this and defining n_qr(α, β) = |N_q(α) ∩ N_r(β)|, we get

    N_{k,m}(α, β) = ⋃_{q=0}^{m} ⋃_{r=0}^{m} ( N_q(α) ∩ N_r(β) )    and    I_m(α, β) = Σ_{q=0}^{m} Σ_{r=0}^{m} n_qr(α, β).

Hence we give a formula to compute n_ij(α, β). Let s = |Σ|.

Theorem 3.3. Given two k-mers α and β such that d(α, β) = d, we have that

    n_ij(α, β) = Σ_{t=0}^{⌊(i+j−d)/2⌋} C(k−d, t) (s−1)^t · C(d, i+j−2t−d) · C(2d−i−j+2t, d−(i−t)) · (s−2)^{i+j−2t−d}.

Proof. n_ij(α, β) can be interpreted as the number of ways to make i changes in α and j changes in β so as to obtain the same string. For clarity, we first deal with the case d(α, β) = 0, i.e. both strings are identical. We wish to find n_ij(α, β) = |N_i(α) ∩ N_j(β)|. It is clear that in this case i = j, otherwise making i and j changes to the same string will not result in the same string. Hence n_ii = C(k, i) (s−1)^i. Second, we consider α, β such that d(α, β) = k. Clearly k ≥ i and k ≥ j. Moreover, since the strings do not agree at any index, the character at every index has to be changed in at least one of α or β. This gives k ≤ i + j.

Now for a particular index p, α[p] and β[p] can go through any one of the following three changes. Let α[p] = x and β[p] = y. (I) Both α[p] and β[p] may change from x and y respectively to some common character z; let l1 be the count of indices going through this type of change. (II) α[p] changes from x to y; call the count of these l2. (III) β[p] changes from y to x; let this count be l3. 
It follows that

    i = l1 + l2,    j = l1 + l3,    l1 + l2 + l3 = k.

This results in l1 = i + j − k. Since l1 is the count of indices at which the characters of both strings change to a common third character, we have s − 2 character choices for each such index and C(k, i+j−k) possible combinations of indices for l1. From the remaining l2 + l3 = 2k − i − j indices, we choose l2 = k − j indices in C(2k−i−j, k−j) ways and change the characters at these indices of α to the characters of β at the respective indices. Finally, we are left with only the l3 remaining indices, and we change them according to the definition of l3. Thus the total number of strings we get after making i changes in α and j changes in β is

    C(k, i+j−k) · C(2k−i−j, k−j) · (s−2)^{i+j−k}.

Now we consider general strings α and β of length k with d(α, β) = d. Without loss of generality, assume that they differ in the first d indices. We parameterize the system in terms of the number of changes that occur in the last k − d indices of the strings, i.e. let t be the number of indices that go through a change in the last k − d indices. The number of possible such changes is

    C(k−d, t) (s−1)^t.    (5)

Let us call the first d-length substrings of the two strings α′ and β′. There are i − t characters to be changed in α′ and j − t in β′. 
As reasoned above, we have d ≤ (i − t) + (j − t), which implies t ≤ (i+j−d)/2. In this setup we get i − t = l1 + l2, j − t = l1 + l3, l1 + l2 + l3 = d, and l1 = (i − t) + (j − t) − d. We immediately get that for a fixed t, the total number of resultant strings after making i − t changes in α′ and j − t changes in β′ is

    C(d, (i−t)+(j−t)−d) · C(2d−(i−t)−(j−t), d−(i−t)) · (s−2)^{(i−t)+(j−t)−d}.    (6)

For a fixed t, every substring counted in (5) combined with every substring counted in (6) gives a required string obtained after i and j changes in α and β respectively. The statement of the theorem follows.

Corollary 3.4. The runtime of computing I_d is O(m³), independent of k and |Σ|.

This is so because, if d(α, β) = d, then I_d = Σ_{q=0}^{m} Σ_{r=0}^{m} n_qr(α, β), and each n_qr(α, β) can be computed in O(m) time.

3.2 Computing M_i

Recall that given two sequences X and Y, M_i is the number of pairs of k-mers (α, β) such that d(α, β) = i, where α ∈ X and β ∈ Y. Formally, the problem of computing M_i is as follows:

Problem 3.5. Given k, m, and two sets of k-mers S_X and S_Y (the sets of k-mers extracted from the sequences X and Y respectively) with |S_X| = n_X and |S_Y| = n_Y, compute

    M_i = |{(α, β) ∈ S_X × S_Y : d(α, β) = i}|    for 0 ≤ i ≤ min{2m, k}.

Note that the brute force approach to compute M_i requires O(n_X · n_Y · k) comparisons. Let Q_k(j) denote the set of all j-sets of {1, . . . , k} (subsets of indices of size j). For θ ∈ Q_k(j) and a k-mer α, let α|_θ be the j-mer obtained by selecting the characters at the j indices in θ. 
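The closed form of Theorem 3.3 translates directly into code. The sketch below (a Python rendering under the paper's notation; guards handle the terms whose binomial coefficients vanish) computes n_ij and, summed over q, r ≤ m as in Corollary 3.4, the intersection size I_d.

```python
from math import comb

def n_ij(i, j, d, k, s):
    # Theorem 3.3: |N_i(alpha) ∩ N_j(beta)| for k-mers alpha, beta over an
    # s-letter alphabet with d(alpha, beta) = d
    total = 0
    for t in range((i + j - d) // 2 + 1):  # empty range when i + j < d
        if 2 * d - i - j + 2 * t < 0 or d - (i - t) < 0:
            continue  # comb() rejects negative arguments; these terms are 0
        total += (comb(k - d, t) * (s - 1) ** t
                  * comb(d, i + j - 2 * t - d)
                  * comb(2 * d - i - j + 2 * t, d - (i - t))
                  * (s - 2) ** (i + j - 2 * t - d))
    return total

def intersection_closed_form(d, k, m, s):
    # I_d = sum of n_qr over 0 <= q, r <= m (Corollary 3.4): O(m^2) terms,
    # each a sum of O(m) products, so O(m^3) arithmetic steps in total
    return sum(n_ij(q, r, d, k, s)
               for q in range(m + 1) for r in range(m + 1))
```

For small parameters this agrees with brute-force enumeration of the two neighborhoods, while costing only O(m³) arithmetic operations instead of O(|Σ|^k) set construction.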
Let f_θ(X, Y) be the number of pairs of k-mers in S_X × S_Y defined as follows:

    f_θ(X, Y) = |{(α, β) ∈ S_X × S_Y : d(α|_θ, β|_θ) = 0}|.

We use the following important observations about f_θ.

Fact 3.6. For 0 ≤ i ≤ k and θ ∈ Q_k(k − i), if d(α|_θ, β|_θ) = 0, then d(α, β) ≤ i.

Fact 3.7. For 0 ≤ i ≤ k and θ ∈ Q_k(k − i), f_θ(X, Y) can be computed in O(kn log n) time.

This can be done by first lexicographically sorting the k-mers in each of S_X and S_Y by the indices in θ. The pairs in S_X × S_Y that are the same at the indices in θ can then be enumerated in one linear scan over the sorted lists. Let n = n_X + n_Y; the runtime of this computation is O(k(n + |Σ|)) if we use counting sort (as in [13]) or O(kn log n) for mergesort (since θ has O(k) indices). Since this procedure is repeated many times, we refer to it as the SORT-ENUMERATE subroutine. We define

    F_i(X, Y) = Σ_{θ∈Q_k(k−i)} f_θ(X, Y).    (7)

Lemma 3.8.

    F_i(X, Y) = Σ_{j=0}^{i} C(k−j, k−i) M_j.    (8)

Proof. Let (α, β) be a pair that contributes to M_j, i.e. d(α, β) = j. Then for every θ ∈ Q_k(k − i) that has all its indices within the k − j positions where α and β agree, the pair (α, β) is counted in f_θ(X, Y). The number of such θ's is C(k−j, k−i), hence M_j is counted C(k−j, k−i) times in F_i(X, Y), yielding the required equality.

Corollary 3.9. M_i can readily be computed as: M_i = F_i(X, Y) − Σ_{j=0}^{i−1} C(k−j, k−i) M_j.

By definition, F_i(X, Y) can be computed with C(k, k−i) = C(k, i) evaluations of f_θ. Let t = min{2m, k}. K(X, Y | k, m) can be evaluated by (4) after computing M_i (by (8)) and I_i (by Corollary 3.4) for 0 ≤ i ≤ t. 
The overall complexity of this strategy thus is

    O( Σ_{i=0}^{t} C(k, i) (k − i)(n log n + n) ) + O(n) = O(k · 2^{k−1} · (n log n)).

Algorithm 1: Approximate-Kernel(S_X, S_Y, k, m, ε, δ, B)
 1: I, M′ ← ZEROS(t + 1)
 2: σ ← ε · √δ
 3: Populate I using Corollary 3.4
 4: for i = 0 to t do
 5:     μ_F ← 0
 6:     iter ← 1
 7:     var_F ← ∞
 8:     while var_F > σ² ∧ iter < B do
 9:         θ ← RANDOM(C(k, k−i))                                   ▹ sample θ ∈ Q_k(k − i) uniformly at random
10:         μ_F ← (μ_F · (iter − 1) + SORT-ENUMERATE(S_X, S_Y, k, θ)) / iter    ▹ application of Fact 3.7
11:         var_F ← VARIANCE(μ_F, var_F, iter)                      ▹ compute online variance
12:         iter ← iter + 1
13:     F′[i] ← μ_F · C(k, k−i)
14:     M′[i] ← F′[i]
15:     for j = 0 to i − 1 do
16:         M′[i] ← M′[i] − C(k−j, k−i) · M′[j]                     ▹ application of Corollary 3.9
17: K′ ← SUMPRODUCT(M′, I)                                          ▹ applying Equation (4)
18: return K′

We give our algorithm to approximate K(X, Y | k, m), followed by its explanation and its analysis. Algorithm 1 takes ε, δ ∈ (0, 1) and B ∈ Z⁺ as input parameters; the first two control the accuracy of the estimate, while B is an upper bound on the sample size. We use (7) to estimate F_i = F_i(X, Y) with an online sampling algorithm, where we choose θ ∈ Q_k(k − i) uniformly at random and compute the online mean and variance of the estimate for F_i. We continue to sample until the variance falls below the threshold (σ² = ε²δ) or the sample size reaches the upper bound B. 
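The sampling step above is easy to prototype. The sketch below (illustrative Python: a fixed sample size stands in for the online variance test, and hash-based counting stands in for SORT-ENUMERATE) estimates F_i by averaging f_θ over random index sets θ and scaling by the population size |Q_k(k−i)| = C(k, k−i).

```python
import random
from collections import Counter
from math import comb

def f_theta(SX, SY, theta):
    # number of pairs (alpha, beta) in SX x SY that agree on the index set theta;
    # counting projections with a hash map plays the role of SORT-ENUMERATE
    proj = lambda a: tuple(a[p] for p in theta)
    cx = Counter(proj(a) for a in SX)
    cy = Counter(proj(b) for b in SY)
    return sum(cx[g] * cy[g] for g in cx)

def estimate_F(SX, SY, k, i, samples=300, rng=random):
    # Monte Carlo estimate of F_i = sum of f_theta over theta in Q_k(k - i):
    # sample theta uniformly, average, then scale by C(k, k - i)
    mu = 0.0
    for _ in range(samples):
        theta = rng.sample(range(k), k - i)
        mu += f_theta(SX, SY, theta)
    return (mu / samples) * comb(k, k - i)
```

Each sample is unbiased for F_i / C(k, k−i), which is the content of eq. (9); the M_i estimates then follow from Corollary 3.9 by subtracting the lower-order terms.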
We scale up our estimate by the population size and use it to compute the M′_i's (estimates of the M_i's) using Corollary 3.9. These M′_i's, together with the precomputed exact values of the I_i's, are used to compute our estimate, K′(X, Y | k, m, ε, δ, B), of the kernel value using (4). First we give an analytical bound on the runtime of Algorithm 1, then we provide guarantees on its performance.

Theorem 3.10. The runtime of Algorithm 1 is bounded above by O(k² n log n).

Proof. Observe that throughout the execution of the algorithm there are at most tB computations of f_θ, each of which by Fact 3.7 needs O(kn log n) time. Since B is an absolute constant and t ≤ k, we get that the total runtime of the algorithm is O(k² n log n). Note that in practice the while loop in line 8 is rarely executed for B iterations; the deviation is within the desired range much earlier.

Let K′ = K′(X, Y | k, m, ε, δ, B) be our estimate (the output of Algorithm 1) for K = K(X, Y | k, m).

Theorem 3.11. K′ is an unbiased estimator of the true kernel value, i.e. E(K′) = K.

Proof. For this we need the following result, whose proof is deferred.

Lemma 3.12. E(M′_i) = M_i.

By Line 17 of Algorithm 1, E(K′) = E( Σ_{i=0}^{t} I_i M′_i ). Using the fact that the I_i's are constants and Lemma 3.12, we get that

    E(K′) = Σ_{i=0}^{t} I_i E(M′_i) = Σ_{i=0}^{min{2m,k}} I_i M_i = K.

Theorem 3.13. For any 0 < ε, δ < 1, Algorithm 1 is an (εI_max, δ)-additive approximation algorithm, i.e. 
Pr(|K − K′| ≥ εI_max) < δ, where I_max = max_i{I_i}.

Note that these are very loose bounds; in practice we get an approximation far better than these bounds. Furthermore, though I_max could be large, it is only a fraction of one of the terms in the summation for the kernel value K(X, Y | k, m).

Proof. Let F′_i be our estimate for F_i(X, Y) = F_i. We use the following bound on the variance of K′, which is proved later.

Lemma 3.14. Var(K′) ≤ δ(ε · I_max)².

By Lemma 3.12 we have E(K′) = K, hence by Lemma 3.14, Pr[|K′ − K| ≥ εI_max] is equivalent to Pr[|K′ − E(K′)| ≥ (1/√δ) √Var(K′)]. By Chebyshev's inequality, this latter probability is at most δ. Therefore, Algorithm 1 is an (εI_max, δ)-additive approximation algorithm.

Proof. (Proof of Lemma 3.12) We prove it by induction on i. The base case (i = 0) is true as we compute M′[0] exactly, i.e. M′[0] = M[0]. Suppose E(M′_j) = M_j for 0 ≤ j ≤ i − 1. Let iter be the number of iterations for i; after execution of Line 10 we get

    F′[i] = μ_F · C(k, k−i) = ( Σ_{r=1}^{iter} f_{θ_r}(X, Y) / iter ) · C(k, k−i),

where θ_r is the random (k − i)-set chosen in the r-th iteration of the while loop. Since θ_r is chosen uniformly at random, we get that

    E(F′[i]) = E(μ_F) · C(k, k−i) = E(f_{θ_r}(X, Y)) · C(k, k−i) = ( F_i(X, Y) / C(k, k−i) ) · C(k, k−i) = F_i(X, Y).    (9)

After the loop on Line 15 is executed, we get that E(M′[i]) = F_i(X, Y) − Σ_{j=0}^{i−1} C(k−j, k−i) E(M′_j). 
Using E(M′_j) = M_j (the inductive hypothesis) in (8), we get that E(M′_i) = M_i.

Proof. (Proof of Lemma 3.14) After execution of the while loop in Algorithm 1, we have F′_i = Σ_{j=0}^{i} C(k−j, k−i) M′_j. We use the following fact, which follows from basic calculations.

Fact 3.15. Suppose X_0, . . . , X_t are random variables and let S = Σ_{i=0}^{t} a_i X_i, where a_0, . . . , a_t are constants. Then

    Var(S) = Σ_{i=0}^{t} a_i² Var(X_i) + 2 Σ_{i=0}^{t} Σ_{j=i+1}^{t} a_i a_j Cov(X_i, X_j).

Using Fact 3.15 and the definitions of I_max and σ, we get that

    Var(K′) = Σ_{i=0}^{t} I_i² Var(M′_i) + 2 Σ_{i=0}^{t} Σ_{j=i+1}^{t} I_i I_j Cov(M′_i, M′_j)
            ≤ I_max² [ Σ_{i=0}^{t} Var(M′_i) + 2 Σ_{i=0}^{t} Σ_{j=i+1}^{t} Cov(M′_i, M′_j) ] ≤ I_max² Var(F′_t) ≤ I_max² σ² = δ(ε · I_max)²,

where the last step uses the stopping condition of the while loop, Var(F′_t) ≤ σ². The second-to-last inequality follows from the following relation, derived from the definition of F′_t and Fact 3.15:

    Var(F′_t) = Σ_{i=0}^{t} C(k−i, k−t)² Var(M′_i) + 2 Σ_{i=0}^{t} Σ_{j=i+1}^{t} C(k−i, k−t) C(k−j, k−t) Cov(M′_i, M′_j).    (10)

4 Evaluation

We study the performance of our algorithm in terms of runtime, quality of kernel estimates, and predictive accuracies on standard benchmark sequence datasets (Table 1). For the range of parameters feasible for existing solutions, we generated kernel matrices both by the algorithm of [13] (exact) and by our algorithm (approximate). 
These experiments are performed on an Intel Xeon machine (8 cores, 2.1 GHz, 32 GB RAM) using the same experimental settings as in [13, 15, 17]. Since our algorithm is applicable for a significantly wider range of k and m, we also report classification performance with large k and m. For our algorithm we used B ∈ {300, 500} and σ ∈ {0.25, 0.5} with no significant difference in results, as implied by the theoretical analysis. In all reported results B = 300 and σ = 0.5. In order to perform comparisons, for a few combinations of parameters we generated exact kernel matrices of each dataset on a much more powerful machine (a cluster of 20 nodes, each having 24 CPUs with 2.5 GHz speed and 128 GB RAM). Sources for the datasets and source code are available at¹.

Table 1: Datasets description

Name               Task                         Classes   Seq.   Av. Len.   Evaluation
Ding-Dubchak [6]   protein fold recognition     27        694    169        10-fold CV
SCOP [4, 31]       protein homology detection   54        7329   308        54 binary class.
Music [21, 26]     music genre recognition      10        1000   2368       5-fold CV
Artist20 [8, 17]   artist identification        20        1413   9854       6-fold CV
ISMIR [17]         music genre recognition      6         729    10137      5-fold CV

Running Times: We report the difference in running times for kernel generation in Figure 1. Exact kernels are generated using code provided by the authors of [13, 14], for 8 ≤ k ≤ 16 and m = 2 only. We achieve significant speedups for large values of k (for k = 16 we get one order of magnitude gains in computational efficiency on all datasets). The running times for these algorithms are O(2^k n) and O(k² n log n), respectively. 
We can use larger values of k without an exponential penalty, which is visible in the fact that in all graphs, as k increases, the growth of the running time of the exact algorithm is linear (on the log-scale), while that of our algorithm tends to taper off.

Figure 1: Log scaled plot of running time of approximate and exact kernel generation for m = 2

Kernel Error Analysis: We show that despite the reduction in runtimes, we get excellent approximations of the kernel matrices. In Table 2 we report a point-to-point error analysis of the approximate kernel matrices. We compare our estimates with exact kernels for m = 2. For m > 2 we report statistical error analyses. More precisely, we evaluate differences with principal submatrices of the exact kernel matrix. These principal submatrices are selected by randomly sampling 50 sequences and computing their pairwise kernel values. We report errors for four datasets; the fifth one, not included for space reasons, showed no difference in error. 
From Table 2 it is evident that our empirical performance is significantly more precise than the theoretical bounds proved on the errors in our estimates.

1 https://github.com/mufarhan/sequence_class_NIPS_2017

Table 2: Mean absolute error (MAE) and root mean squared error (RMSE) of approximate kernels. For m > 2 we report average MAE and RMSE of three random principal submatrices of size 50 x 50

           Music Genre          ISMIR                Artist20             SCOP
(k, m)     RMSE      MAE        RMSE      MAE        RMSE      MAE        RMSE      MAE
(10, 2)    0         0          0         0          0         0          1.3E-6    9.0E-8
(12, 2)    0         0          0         0          0         0          1.4E-6    1.0E-8
(14, 2)    2.0E-8    0          2.0E-8    0          3.3E-8    1.3E-8     2.9E-6    1.3E-8
(16, 2)    1.3E-8    0          4.0E-8    3.3E-9     -         -          2.9E-6    1.0E-8
(12, 6)    1.97E-5   8.5E-7     -         -          -         -          2.4E-4    1.8E-5

Prediction Accuracies: We compare the outputs of SVM on the exact and approximate kernels using the publicly available SVM implementation LIBSVM [2]. We computed exact kernel matrices by a brute-force algorithm for a few combinations of parameters for each dataset on the much more powerful machine; generating these kernels took days, and we did so only to compare the classification performance of our algorithm with the exact one. We demonstrate that our predictive accuracies are sufficiently close to those obtained with exact kernels in Table 3 (bio-sequences) and Table 4 (music). The parameters used for reporting classification performance are chosen to maintain comparability with previous studies. Similarly, all measurements are made as in [13, 14]; for instance, for music genre classification we report results of 10-fold cross-validation (see Table 1).
For our algorithm we used B = 300 and σ = 0.5, and we report the average performance over three independent runs.

Table 3: Classification performance comparisons on SCOP (ROC) and Ding-Dubchak (Accuracy)

         SCOP                               Ding-Dubchak
         Exact            Approx            Exact      Approx
k, m     ROC     ROC50    ROC     ROC50     Accuracy   Accuracy
8, 2     88.09   38.60    88.05   38.71     34.01      31.65
10, 2    81.65   26.72    80.56   28.18     28.1       26.9
12, 2    71.31   11.04    66.93   23.27     27.23      26.66
14, 2    67.91   6.66     63.67   7.78      25.5       25.5
16, 2    64.45   5.76     61.64   6.89      25.94      25.03
10, 5    91.60   54.1     91.67   53.77     45.1       43.80
10, 7    90.27   48.44    90.30   48.18     58.21      57.20
12, 8    91.44   52.08    90.97   50.54     58.21      57.83

Table 4: Classification error comparisons on music datasets with exact and estimated kernels

        Music Genre                  ISMIR                        Artist20
k, m    Exact         Estimate      Exact         Estimate       Exact         Estimate
10, 2   61.30 ± 3.3   61.30 ± 3.3   54.32 ± 1.6   54.32 ± 1.6    82.10 ± 2.2   82.10 ± 2.2
14, 2   71.70 ± 3.0   71.70 ± 3.0   55.14 ± 1.1   55.14 ± 1.1    86.84 ± 1.8   86.84 ± 1.8
16, 2   73.90 ± 1.9   73.90 ± 1.9   54.73 ± 1.5   54.73 ± 1.5    87.56 ± 1.8   87.56 ± 1.8
10, 7   37.00 ± 3.5   37.00 ± 3.5   -             27.16 ± 1.6    55.75 ± 4.7   55.75 ± 4.7
12, 6   54.20 ± 2.7   54.13 ± 2.9   52.12 ± 2.0   52.08 ± 1.5    79.57 ± 2.4   80.00 ± 2.6
12, 8   43.70 ± 3.2   44.20 ± 3.2   47.03 ± 2.6   47.41 ± 2.4    -             67.57 ± 3.6

5 Conclusion

In this work we devised an efficient algorithm for the evaluation of string kernels based on inexact matching of subsequences (k-mers). We derived a closed-form expression for the size of the intersection of the m-mismatch neighborhoods of two k-mers. Another significant contribution of this work is a novel statistical estimate of the number of k-mer pairs at a fixed distance between two sequences.
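To make the estimated quantity concrete, the following toy sketch (not the paper's algorithm, and with a hypothetical function name) counts by brute force the k-mer pairs from two sequences at each Hamming distance; it is this distribution that our statistical estimate approximates without the quadratic enumeration:

```python
# Brute-force illustration of the quantity being estimated: the number of
# pairs of k-mers, one from each sequence, at each Hamming distance 0..k.
# The input strings are toy examples, not data from the paper.
from collections import Counter

def pairs_at_distance(X, Y, k):
    """Count k-mer pairs from X and Y at each Hamming distance."""
    xmers = [X[i:i + k] for i in range(len(X) - k + 1)]
    ymers = [Y[i:i + k] for i in range(len(Y) - k + 1)]
    counts = Counter()
    for a in xmers:
        for b in ymers:
            d = sum(c1 != c2 for c1, c2 in zip(a, b))  # Hamming distance
            counts[d] += 1
    return counts
```

In the paper's framework, the kernel value follows by weighting these per-distance counts with the closed-form m-mismatch neighborhood intersection sizes; the quadratic loop above is precisely the cost that the statistical estimate avoids.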
Although large values of the parameters k and m were known to yield better classification results, existing algorithms are not feasible even for moderately large values. Using the two results above, our algorithm efficiently approximates kernel matrices with probabilistic bounds on the accuracy. Evaluations on several challenging benchmark datasets for large k and m show that we achieve state-of-the-art classification performance, with an order of magnitude speedup over existing solutions.

References

[1] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. In International Conference on Machine Learning, ICML, pages 33-40, 2005.

[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

[3] J. Cheng and P. Baldi. A machine learning information retrieval approach to protein fold recognition. Bioinformatics, 22(12):1456-1463, 2006.

[4] L. Conte, B. Ailey, T. Hubbard, S. Brenner, A. Murzin, and C. Chothia. SCOP: A structural classification of proteins database. Nucleic Acids Research, 28(1):257-259, 2000.

[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[6] C. Ding and I. Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349-358, 2001.

[7] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153-2175, 2005.

[8] D. P. Ellis. Classifying music audio with timbral and chroma features. In ISMIR, volume 7, pages 339-340, 2007.

[9] D. Haussler. Convolution kernels on discrete structures.
Technical Report UCSC-CRL-99-10, University of California at Santa Cruz, 1999.

[10] R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. Journal of Bioinformatics and Computational Biology, 3(3):527-550, 2005.

[11] P. Kuksa. Scalable kernel methods and algorithms for general sequence analysis. PhD thesis, Department of Computer Science, Rutgers, The State University of New Jersey, 2011.

[12] P. Kuksa, P.-H. Huang, and V. Pavlovic. Fast protein homology and fold detection with sparse spatial sample kernels. In 19th International Conference on Pattern Recognition, ICPR, pages 1-4. IEEE, 2008.

[13] P. Kuksa, P.-H. Huang, and V. Pavlovic. Scalable algorithms for string kernels with inexact matching. In Advances in Neural Information Processing Systems, NIPS, pages 881-888. MIT Press, 2009.

[14] P. Kuksa, I. Khan, and V. Pavlovic. Generalized similarity kernels for efficient sequence classification. In SIAM International Conference on Data Mining, SDM, pages 873-882. SIAM, 2012.

[15] P. Kuksa and V. Pavlovic. Spatial representation for efficient sequence classification. In 20th International Conference on Pattern Recognition, ICPR, pages 3320-3323. IEEE, 2010.

[16] P. Kuksa, Y. Qi, B. Bai, R. Collobert, J. Weston, V. Pavlovic, and X. Ning. Semi-supervised abstraction-augmented string kernel for multi-level bio-relation extraction. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, ECML-PKDD, pages 128-144. Springer, 2010.

[17] P. P. Kuksa. Efficient multivariate sequence classification. CoRR, abs/1409.8211, 2014.

[18] C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, volume 7 of PSB, pages 566-575, 2002.

[19] C. Leslie, E.
Eskin, J. Weston, and W. Noble. Mismatch string kernels for SVM protein classification. In Advances in Neural Information Processing Systems, NIPS, pages 1441-1448. MIT Press, 2003.

[20] C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 5:1435-1455, 2004.

[21] T. Li, M. Ogihara, and Q. Li. A comparative study on content-based music genre classification. In 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM/SIGIR, pages 282-289. ACM, 2003.

[22] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, NIPS, pages 1177-1184, 2007.

[23] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, NIPS, pages 1313-1320, 2008.

[24] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[25] S. Sonnenburg, G. Rätsch, and B. Schölkopf. Large scale genomic sequence SVM classifiers. In 22nd International Conference on Machine Learning, ICML, pages 848-855. ACM, 2005.

[26] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293-302, 2002.

[27] V. Vapnik. Statistical Learning Theory, volume 1. Wiley, New York, 1998.

[28] S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In Advances in Neural Information Processing Systems, NIPS, pages 585-592, 2002.

[29] M. Waterman, J. Joyce, and M. Eggert. Computer alignment of sequences. In Phylogenetic Analysis of DNA Sequences, pages 59-72.

[30] C. Watkins. Dynamic alignment kernels.
In Advances in Large Margin Classifiers, pages 39-50. MIT Press, 1999.

[31] J. Weston, C. Leslie, E. Ie, D. Zhou, A. Elisseeff, and W. Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241-3247, 2005.

[32] J. Weston, C. Leslie, D. Zhou, A. Elisseeff, and W. Noble. Semi-supervised protein classification using cluster kernels. In Advances in Neural Information Processing Systems, NIPS, pages 595-602. MIT Press, 2004.

[33] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, NIPS, pages 661-667, 2000.

[34] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, NIPS, pages 476-484, 2012.