Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)
Eleazar Eskin, Jason Weston, William Noble, Christina Leslie
We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.
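To make the definition concrete, the following is a minimal brute-force sketch of a kernel of this kind, assuming the usual reading of the abstract: each k-length subsequence contributes a count to every k-length string that differs from it in at most m positions, and the kernel value is the inner product of the resulting count vectors. It does not use the mismatch tree described above, so it illustrates only what is computed, not how to compute it efficiently. The function names and the explicit `alphabet` parameter are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(kmer, m, alphabet):
    """Return all strings over `alphabet` within m mismatches of `kmer`."""
    neighbors = {kmer}
    for d in range(1, m + 1):
        # Choose which d positions may change, then try every letter there.
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                neighbors.add("".join(chars))
    return neighbors


def mismatch_features(seq, k, m, alphabet):
    """Count, for each k-length string, how many k-mers of `seq` lie within m mismatches of it."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for beta in mismatch_neighborhood(seq[i:i + k], m, alphabet):
            features[beta] += 1
    return features


def mismatch_kernel(x, y, k, m, alphabet):
    """Inner product of the two mismatch feature vectors (brute force)."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    # Counter returns 0 for missing keys, so unshared features drop out.
    return sum(count * fy[beta] for beta, count in fx.items())
```

With m = 0 this reduces to counting exactly shared k-length subsequences. For protein sequences the alphabet would be the 20 amino acid letters, and the neighborhood of each k-mer grows quickly with k and m, which is why an efficient data structure matters in practice.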