{"title": "Going Metric: Denoising Pairwise Data", "book": "Advances in Neural Information Processing Systems", "page_first": 841, "page_last": 848, "abstract": null, "full_text": "Going  Metric:  Denoising  Pairwise  Data \n\nVolker  Roth \n\nInformatik III,  University of Bonn \n\nJulian Laub \n\nFraunhofer FIRST.IDA \n\nRoemerstr 164,  53117 Bonn,  Germany \n\nKekulestr.  7,  12489 Berlin,  Germany \n\nroth\u00a9cs.uni-bonn.de \n\njlaub\u00a9first.fhg.de \n\nJoachim M.  Buhmann \n\nInformatik III,  University of Bonn \n\nRoemerstr 164,  53117 Bonn,  Germany \n\njb\u00a9cs.uni-bonn.de \n\nKlaus-Robert Miiller \nFraunhofer FIRST.IDA, \n12489 Berlin,  Germany, \nUniversity of Potsdam, \n\n14482 Potsdam, Germany \n\nklaus\u00a9first.fhg.de \n\nAbstract \n\nPairwise  data in  empirical  sciences  typically  violate  metricity,  ei(cid:173)\nther  due  to  noise  or  due  to  fallible  estimates,  and  therefore  are \nhard  to  analyze  by  conventional  machine  learning  technology.  In \nthis  paper  we  therefore  study  ways  to  work  around  this  problem. \nFirst,  we  present  an  alternative  embedding  to  multi-dimensional \nscaling  (MDS)  that  allows  us  to  apply  a  variety  of classical  ma(cid:173)\nchine learning and signal processing algorithms.  The class of pair(cid:173)\nwise grouping algorithms which share the shift-invariance property \nis  statistically  invariant  under  this  embedding  procedure,  leading \nto  identical assignments  of objects to clusters.  Based on this  new \nvectorial  representation,  denoising  methods  are  applied  in  a  sec(cid:173)\nond  step.  Both steps  provide a  theoretically  well  controlled setup \nto  translate  from  pairwise  data  to  the  respective  denoised  met(cid:173)\nric  representation.  We  demonstrate the practical usefulness of our \ntheoretical  reasoning by  discovering structure in  protein sequence \ndata bases, visibly improving performance upon existing automatic \nmethods. \n\n1 \n\nIntroduction \n\nUnsupervised grouping or  clustering  aims at extracting hidden structure from  data \n(see  e.g.  [5]).  However, for  several major applications,  e.g.  bioinformatics or imag(cid:173)\ning,  the data is  solely available as  scores of pairwise comparisons.  Pairwise data is \nin no  natural  way  related to the  common viewpoint  of objects  lying in  some \"well \nbehaved\"  space like  a  vector space.  Particularly, pairwise  data may violate the tri(cid:173)\nangular inequality.  Two  cases  should  be  distinguished:  (i)  The  triangle inequality \nmight  not  be satisfied  as  a  result  of noisy  measurements  (for  instance using  string \nalignment  algorithms  in  DNA  analysis).  (ii)  The  violation  might  be  an  intrinsic \nfeature  of the  data.  This  case,  for  instance,  applies  to  datasets  based  upon  some \nhuman judgment, e.g.  \"X likes Y,  Y  likes  Z  =I?  X  likes  Z\". \n\n\fSuch violations preclude the use of well established machine learning methods, which \ntypically have been formulated  for  metric data only.  This paper proposes an algo(cid:173)\nrithm  to  metricize  and  subsequently  de noise  pairwise  data.  It uses  the  so-called \nconstant shift embedding  (cf.  [14])  for  metrization, then constructs a  positive semi(cid:173)\ndefinite  matrix which  can in  sequel  be  used  for  denoising and  clustering  purposes. 
Regarding data mining or clustering purposes, the most important difference to classical MDS is the following: for the class of pairwise clustering cost functions sharing the shift-invariance property,^1 the metrization step is loss-free in the sense that the optimal assignments of objects to clusters remain unchanged.

The next section introduces techniques for metrizing, denoising and clustering pairwise data. This is followed by a section illustrating our methods on real-world data, namely bacterial GyrB amino acid sequences and sequences from the ProDom database, and by a brief discussion.

2 Proximity-based clustering and denoising

One of the most popular methods for grouping vectorial data is k-means clustering (see e.g. [1][5]). It derives a set of k prototype vectors which quantize the data set with minimal quantization error.

Partitioning proximity data is considered a much harder problem, since the inherent structure of n samples is hidden in n^2 pairwise relations. The pairwise proximities can violate the requirements of a distance measure, i.e. they may be non-symmetric and negative, and the triangle inequality does not necessarily hold. Thus, a loss-free embedding into a vector space is not possible, so that grouping problems of this kind cannot be directly transformed into a vectorial representation by means of classical embedding strategies such as multi-dimensional scaling (MDS [4]). Moreover, clustering the MDS-embedded data vectors in general yields partitionings different from those obtained by directly solving the pairwise problem, since embedding constraints might be in conflict with the clustering goal.

Let us start from a pairwise clustering loss function (see [12]) that combines the properties of additivity, scale- and shift-invariance, and statistical robustness:

$$H^{pc} = \sum_{\nu=1}^{k} \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} M_{i\nu} M_{j\nu} D_{ij}}{2\sum_{l=1}^{n} M_{l\nu}}, \qquad (1)$$

where the data are characterized by the matrix of pairwise dissimilarities $D_{ij}$. The assignments of objects to clusters are encoded in the binary stochastic matrix $M \in \{0,1\}^{n \times k}$ with $\sum_{\nu=1}^{k} M_{i\nu} = 1$. For such cost functions it can be shown [14] that there always exists a set of vectorial data representations--the constant shift embeddings--such that the grouping problem can be equivalently restated in terms of Euclidean distances between these vectors. In order to handle non-symmetric dissimilarities, it should be noticed that $H^{pc}$ is also invariant under the symmetrizing transformation $D_{ij} \leftarrow \frac{1}{2}(D_{ij} + D_{ji})$. In the following we will thus restrict ourselves to the case of symmetric dissimilarity matrices.

Theorem 2.1. [14] Given an arbitrary (possibly non-metric) $(n \times n)$ dissimilarity matrix $D$ with zero self-dissimilarities, there exists a transformed matrix $\tilde{D}$ such that
(i) the matrix $\tilde{D}$ can be interpreted as a matrix of squared Euclidean distances between a set of vectors $\{x_i\}_{i=1}^{n}$; $\tilde{D}$ is derived from $D$ by both symmetrizing and applying the constant shift embedding trick;
(ii) the original pairwise clustering problem is equivalent to a k-means problem in this vector space, in the sense that the optimal assignments of objects to clusters $\{M_{i\nu}\}$ are identical in both problems.

^1 The term shift-invariance means that the optimal assignments of objects to clusters are not influenced by constant additive shifts of the pairwise dissimilarities (excluding the self-dissimilarities, which are assumed to be zero).
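The shift-invariance property can be made concrete with a few lines of numpy. The sketch below (our own illustration, not code from the paper) evaluates the cost function (1) for a given assignment and verifies that a constant off-diagonal shift changes $H^{pc}$ only by an additive constant that does not depend on the particular partition, so the optimal assignment is unchanged.

```python
import numpy as np

def pairwise_cost(D, labels, k):
    """Shift-invariant pairwise clustering cost H^pc of eq. (1)."""
    H = 0.0
    for v in range(k):
        idx = np.where(labels == v)[0]
        if len(idx) > 0:
            H += D[np.ix_(idx, idx)].sum() / (2.0 * len(idx))
    return H

rng = np.random.default_rng(0)
n, k = 20, 3
A = rng.random((n, n))
D = (A + A.T) / 2.0               # random symmetric dissimilarities
np.fill_diagonal(D, 0.0)          # zero self-dissimilarities
labels = rng.integers(0, k, size=n)

D0 = 5.0
D_shift = D + D0 * (np.ones((n, n)) - np.eye(n))   # off-diagonal shift as in eq. (2)

# The shift adds the assignment-independent constant D0*(n - m)/2,
# where m is the number of non-empty clusters, so the ranking of
# partitions -- and hence the optimal assignment -- is unchanged.
print(pairwise_cost(D_shift, labels, k) - pairwise_cost(D, labels, k))
print(D0 * (n - np.unique(labels).size) / 2)       # same value
```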
A re-formulation of pairwise clustering as a k-means problem is clearly advantageous: (i) the availability of prototype vectors defines a generic rule for using the learned partitioning in a predictive sense; (ii) we can apply standard noise- and dimensionality-reduction methods in order to both stabilize the estimation procedure and speed up the grouping itself.

Constant shift embedding. Let $D = (D_{ij}) \in \mathbb{R}^{n \times n}$ be the matrix of pairwise squared dissimilarities between $n$ objects. For a generic noisy dataset $\sqrt{D_{ij}} \nleq \sqrt{D_{ik}} + \sqrt{D_{kj}}$, so that $\sqrt{D}$ is non-metric. Since $\sqrt{\cdot}$ is monotonically increasing, there exists a constant $D_0$ such that $\sqrt{D_{ij} + D_0} \leq \sqrt{D_{ik} + D_0} + \sqrt{D_{kj} + D_0}$ for all $i, j, k = 1, 2, \dots, n$. Let

$$\tilde{D} = D + D_0 (ee^T - I_n), \qquad (2)$$

where $e = (1, 1, \dots, 1)^T$ is an $n$-dimensional column vector and $I_n$ the identity matrix. This corresponds to a constant additive shift $\tilde{D}_{ij} = D_{ij} + D_0$ for all $i \neq j$. We look for the minimal constant shift $D_0$ such that $\tilde{D}$ satisfies the triangle inequality.

In order to make the main result clear, we first need to introduce the notion of a centralized matrix. Let $P$ be an arbitrary matrix and let $Q = I_n - \frac{1}{n} ee^T$. $Q$ is the projection matrix onto the orthogonal complement of $e$. Define the centralized $P$ by

$$P^c = QPQ. \qquad (3)$$

Let $D$ be fixed and let us decompose $D$ as follows:

$$D_{ij} = S_{ii} + S_{jj} - 2S_{ij}. \qquad (4)$$

This decomposition is motivated by the fact that if $D$ is a squared Euclidean distance between the vectorial data $x_i$, then $D_{ij} = \|x_i - x_j\|^2 = \|x_i\|^2 + \|x_j\|^2 - 2x_i^T x_j$. It follows from equation (4) that a constant off-diagonal shift on $D$ corresponds to a constant shift on the diagonal of $S$. $S$ is not fixed by the choice of $D$, since we may always change its diagonal elements, yet recover the same $D$. That is, any matrix of the form $(S_{ij} + \frac{1}{2}\Delta S_i + \frac{1}{2}\Delta S_j)$ gives the same distance $D$ as $S$, for arbitrary $\Delta S_i$'s. By simple algebra it can be shown that $S^c = -\frac{1}{2} D^c$, i.e. $S^c$ is unique. Furthermore, $D$ derives from a squared Euclidean distance if and only if $S^c$ is positive semi-definite [14]. Let $\tilde{S}^c = S^c - \lambda_n(S^c) I_n$, where $\lambda_n(\cdot)$ is the minimal eigenvalue of its argument. Then $\tilde{S}^c$ is positive semi-definite [14]. These are the main ingredients for proving the following:

Theorem 2.2 (Minimal $D_0$). [14] $D_0 = -2\lambda_n(S^c)$ is the minimal constant such that $\tilde{D} = D + D_0(ee^T - I_n)$ derives from a squared Euclidean distance.

All proofs can be found in [14]. We have thus shown that applying a large enough additive shift to the off-diagonal elements of $D$ results in a matrix $\tilde{S}^c$ that is positive semi-definite, and can thus be interpreted as a Gram matrix. This means that in some $(n-1)$-dimensional Euclidean space there exists a vector representation of the objects, summarized in the "design" matrix $X$ (the rows of $X$ are the feature vectors), such that $\tilde{S}^c = XX^T$.
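The construction above translates directly into a few lines of numpy. The following sketch (again our own illustration; variable names are ours) computes the minimal shift $D_0$ of Theorem 2.2 and the positive semi-definite matrix $\tilde{S}^c$ from a symmetric dissimilarity matrix with zero diagonal.

```python
import numpy as np

def constant_shift_embedding(D):
    """Constant shift embedding of a symmetric dissimilarity matrix D
    (zero self-dissimilarities). Returns the minimal shift D0, the
    positive semi-definite Gram matrix Sc_tilde, and the shifted
    squared-Euclidean distance matrix D_tilde."""
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n     # projection onto complement of e
    Sc = -0.5 * Q @ D @ Q                   # centralized matrix, S^c = -1/2 D^c
    lam_min = np.linalg.eigvalsh(Sc)[0]     # minimal eigenvalue lambda_n(S^c)
    D0 = -2.0 * lam_min                     # minimal shift (Theorem 2.2)
    Sc_tilde = Sc - lam_min * np.eye(n)     # PSD Gram matrix S~^c
    D_tilde = D + D0 * (np.ones((n, n)) - np.eye(n))   # eq. (2)
    return D0, Sc_tilde, D_tilde

# Usage on a small non-metric example of squared dissimilarities:
D = np.array([[0.0, 1.0, 16.0],
              [1.0, 0.0, 1.0],
              [16.0, 1.0, 0.0]])   # sqrt(16) > sqrt(1) + sqrt(1)
D0, Sc_tilde, D_tilde = constant_shift_embedding(D)
# All eigenvalues of Sc_tilde are >= 0 up to round-off, so D_tilde
# derives from points in a Euclidean space.
print(D0, np.linalg.eigvalsh(Sc_tilde).min() >= -1e-10)
```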
For the pairwise clustering cost function, the optimal assignments of objects to clusters are invariant under the constant-shift embedding procedure, according to Theorem 2.1. Hence, the grouping problem can be re-formulated as optimizing the classical k-means criterion in the embedding space.

In many applications, however, it is advantageous not to cluster in the full space but to insert a dimension-reduction step that serves the purpose of increasing efficiency and reducing noise. While it is unclear how to denoise the original pairwise object representations while respecting the additivity, scale- and shift-invariance, and statistical robustness properties of the clustering criterion, we can easily apply kernel PCA [16] to $\tilde{S}^c$ after the constant-shift embedding.

Denoising of pairwise data by constant shift embedding. For denoising we construct $\tilde{D}$, which derives from "real" points in a vector space, i.e. $\tilde{S}^c$ is positive semi-definite. In a first step, we briefly describe how these real points can be recovered by loss-free kernel PCA [16]:
(i) Calculate the centralized kernel matrix $S^c = -\frac{1}{2} QDQ$.
(ii) Decompose $S^c = V \Lambda V^T$, where $V = (v_1, \dots, v_n)$ contains the eigenvectors $v_i$ and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ the eigenvalues $\lambda_1 \geq \dots \geq \lambda_p > \lambda_{p+1} = 0 \geq \lambda_{p+2} \geq \dots \geq \lambda_n$.
(iii) Calculate the $n \times (n-2)$ mapping matrix $X^*_{n-2} = V^*_{n-2} (\Lambda^*_{n-2})^{1/2}$, where $V^*_{n-2} = (v_1, \dots, v_p, v_{p+2}, \dots, v_{n-1})$ and $\Lambda^*_{n-2} = \mathrm{diag}(\lambda_1 - \lambda_n, \dots, \lambda_p - \lambda_n, \lambda_{p+2} - \lambda_n, \dots, \lambda_{n-1} - \lambda_n)$ (these are the constantly shifted eigenvalues).
The rows of $X^*_{n-2}$ contain the vectors $\{x_i^*\}$ ($i = 1, 2, \dots, n$) in $(n-2)$-dimensional space whose mutual distances are given by $\tilde{D}$. When focusing on noise reduction, however, we are rather interested in approximate reconstructions of the "real" vectors. In the PCA framework, one usually discards the directions which correspond to small eigenvalues as noise (cf. [9]). We can thus obtain a representation in a space of reduced dimension (with the well-defined error of PCA reconstruction) by choosing $t < n - 2$ in step (iii) of the above algorithm:

$$X^*_t = V^*_t (\Lambda^*_t)^{1/2},$$

where $V^*_t$ consists of the first $t$ column vectors of $V^*_{n-2}$ and $\Lambda^*_t$ is the top $t \times t$ submatrix of $\Lambda^*_{n-2}$. The vectors in $\mathbb{R}^t$ then differ the least from the vectors in $\mathbb{R}^{n-2}$ in the sense of a quadratic error.

The advantages of this method in comparison to directly applying classical scaling via MDS are: (i) $t$ can be larger than the number $p$ of positive eigenvalues; (ii) the embedded vectors are the best least-squares-error approximation to the optimal vectors which preserve the grouping structure.

It should be noticed, however, that given the exactly reconstructed vectors in $\mathbb{R}^{n-2}$ found by loss-free kernel PCA, we could have also applied any other standard method for dimensionality reduction or visualization, such as projection pursuit [6], locally linear embedding (LLE) [15], Isomap [17] or self-organizing maps [8].
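Steps (i)-(iii), including the truncation to $t$ dimensions, can be sketched as follows in numpy (our own illustration; the paper gives no code). The index bookkeeping assumes exact arithmetic, where the centering direction yields the zero eigenvalue $\lambda_{p+1}$; with round-off, the eigenvalue closest to zero from above plays that role.

```python
import numpy as np

def denoised_embedding(D, t=None):
    """Loss-free kernel-PCA embedding of a symmetric dissimilarity
    matrix D (zero diagonal) after constant shift embedding; keeping
    only the first t columns gives the denoised representation."""
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    Sc = -0.5 * Q @ D @ Q                      # step (i)
    lam, V = np.linalg.eigh(Sc)                # ascending order
    lam, V = lam[::-1], V[:, ::-1]             # step (ii): descending
    lam_n = lam[-1]                            # minimal eigenvalue
    p = int(np.sum(lam > 1e-10))               # number of positive eigenvalues
    # step (iii): drop the zero eigenvalue of the centering direction
    # (index p in descending order) and the last eigenvalue lambda_n
    keep = list(range(p)) + list(range(p + 1, n - 1))
    lam_shift = lam[keep] - lam_n              # constantly shifted eigenvalues
    X = V[:, keep] * np.sqrt(np.maximum(lam_shift, 0.0))
    return X[:, :t] if t is not None else X    # rows are the embedded vectors

# With t=None, mutual squared distances of the rows of the result
# reproduce the shifted matrix D_tilde up to numerical precision.
```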
3 Application to protein sequences

3.1 Bacterial GyrB amino acid sequences

We first illustrate our denoising technique on the gyrase subunit B. The dataset consists of 84 amino acid sequences from five genera in Actinobacteria: 1: Corynebacterium, 2: Mycobacterium, 3: Gordonia, 4: Nocardia and 5: Rhodococcus. A detailed description can be found in [7]. This dataset was used in [18] for illustration of marginalized kernels. The authors hinted at the possibility of computing the distance matrix by using BLAST scores [2], noting, however, that these scores could not be converted into positive semi-definite kernels.

In our experiment, the sequences have been aligned by the Smith-Waterman algorithm [11], which yields pairwise alignment scores. Using constant shift embedding, a positive semi-definite kernel is obtained, leaving the cluster assignments unchanged for shift-invariant cost functions.

The important step is the denoising. Several projections to lower dimensions have been tested, and t = 5 turned out to be a good choice, eliminating the bulk of the noise while retaining the essential cluster structure.

Figure 1 shows the striking improvement of the distance matrix after denoising. On the left-hand side the ideal distance matrix is depicted, consisting solely of 0's (black) and 1's (white), reflecting the true cluster membership. In the middle and on the right, the original and the denoised distance matrices are shown, respectively. Denoising visibly accentuates the cluster structure in the pairwise data.

[Figure 1: three 84 x 84 distance matrices. Left: the ideal distance matrix reflecting the true cluster structure. Middle and right: the distance matrix before and after denoising.]

Since we have access to the true labels, we can quantitatively assess the improvement achieved by denoising. We performed standard k-means clustering, followed by a majority voting to match the cluster labeling. For the denoised data we obtained 3 misclassifications (3.61%), whereas we got 17 (20.48%) for the original data. This simple experiment corroborates the usefulness of our embedding and denoising strategy for pairwise data.
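A sketch of this evaluation protocol (our own illustration, here using scikit-learn's k-means; function names are ours) clusters the embedded vectors, relabels each cluster by the majority of the true labels it contains, and reports the misclassification rate.

```python
import numpy as np
from sklearn.cluster import KMeans

def majority_vote_error(X, y_true, k):
    """Cluster rows of X with k-means, relabel each cluster by the
    majority of the true labels it contains, and report the error."""
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    y_pred = np.empty_like(y_true)
    for c in range(k):
        members = clusters == c
        if members.any():
            # assign the most frequent true label within the cluster
            vals, counts = np.unique(y_true[members], return_counts=True)
            y_pred[members] = vals[np.argmax(counts)]
    return np.mean(y_pred != y_true)

# e.g. with X = denoised_embedding(D, t=5) from the sketch above and
# y_true the genus labels (1..5) of the 84 GyrB sequences:
# print(majority_vote_error(X, y_true, k=5))
```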
In order to fulfill the spirit of the theory of constant-shift embedding, the cost function of the data-mining algorithm applied after the embedding needs to be shift-invariant. We may, by the same token, go a step further and apply algorithms for which this condition does not hold. In doing so, however, we give up the mathematical tractability of the error.

To illustrate that denoised pairwise data can act as standalone quality data independent of the framework of algorithms based on shift-invariant cost functions (and in order to compare to the results obtained in [18]), a linear SVM is trained on 25% of the total data to mutually classify the genera pairs 3-4, 3-5 and 4-5. Genera 1 and 2 separate without error and have therefore been omitted. Model selection over the regularization parameter C has been performed by choosing the optimal value out of 10 equally spaced values from $[10^{-4}, 10^2]$. The results have been averaged over a 1000-fold sampling (cf. Table 1); the best value in each row is marked with an asterisk.

For the classification of genera 3-5 and 4-5 we obtain a substantial improvement by denoising. Interestingly, this is not the case for genera 3-4, which may be due to the elimination of discriminative features by the denoising procedure. The error is still significantly smaller than the error obtained with MCK2 and FK, in agreement with the superiority of a structure-preserving embedding of Smith-Waterman scores even when left undenoised: FK and MCK are kernels derived from a generative model, whereas the alignment scores are obtained from a matching algorithm specifically tuned for protein sequences, reflecting much better the underlying structure of protein data.

Genera   FK     MCK2   Undenoised   Denoised
3-4      10.4   8.48   5.06*        5.43
3-5      10.9   5.71   5.72         3.83*
4-5      23.1   11.6   7.55         3.17*

Table 1: Comparison of mean test error (%) of supervised classification of genera pairs by a linear SVM, with a training sample of 25% of the total sample. The results for MCK2 (Marginalized Count Kernel) and FK (Fisher Kernel) are obtained by kernel Fisher discriminant analysis, which compares favorably to the SVM in several benchmarks [18].

3.2 Clustering of ProDom sequences

The analysis described in this section aims at finding a partition of domain sequences from the ProDom database [3] that is meaningful w.r.t. structural similarity. In order to measure the quality of the grouping solution, we use the computed solution in a predictive way to assign group labels to SCOP sequences, which have been labeled by experts according to their structure [10]. The predicted labels are then compared with the "true" SCOP labels.

For demonstration purposes, we select the following subset of sequences from prodom2001.2.srs: among all sequences we choose those which are highly similar to at least one sequence contained in the first four folds of the SCOP database.^2 Between these sequences, we compute pairwise (length-corrected and standardized) Smith-Waterman alignment scores, summarized in the matrix $(S_{ij})$. These similarities are transformed into dissimilarities by setting $D_{ij} := S_{ii} + S_{jj} - 2S_{ij}$. The centralized score matrix $S^c = -\frac{1}{2} D^c$ possesses some highly negative eigenvalues, indicating that metric properties are violated. Applying the constant-shift embedding method, a valid Mercer kernel is derived, with an eigenvalue spectrum that shows only a few dominating components over a broad "noise" spectrum (see Figure 2). Extracting the first 16 leading principal components^3 leads to a vector representation of the sequences as points in $\mathbb{R}^{16}$. These points are then clustered by minimizing the k-means cost function within a deterministic annealing framework. The model order was selected by applying a resampling-based stability analysis, which has been demonstrated to be a suitable model-order selection criterion for unsupervised grouping problems in [13].

[Figure 2: (Partial) eigenvalue spectrum of the shifted score matrix. The data are projected onto the first 16 leading eigenvectors, whereas the remaining principal components are considered to be dominated by noise.]

^2 "Highly similar" here means that the highest alignment score exceeds a predefined threshold. The result is a subset of roughly 2700 ProDom domain sequences.
^3 Subsampling techniques or deflation can be used to reduce computational load for large-scale problems. We only used a subset of 800 randomly chosen proteins for estimating the 16 leading eigenvectors.
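The score-to-dissimilarity transform and the spectral diagnosis of metric violations used above are easily sketched in numpy (our own helpers, named by us):

```python
import numpy as np

def scores_to_dissimilarities(S):
    """Turn a symmetric similarity/score matrix S into dissimilarities
    via D_ij = S_ii + S_jj - 2 S_ij, as done for the ProDom scores."""
    s = np.diag(S)
    return s[:, None] + s[None, :] - 2.0 * S

def centralized_spectrum(D):
    """Eigenvalues (ascending) of S^c = -1/2 D^c. Strongly negative
    eigenvalues indicate that the data violate metric properties and
    a constant shift is needed to obtain a valid Mercer kernel."""
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.eigvalsh(-0.5 * Q @ D @ Q)
```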
In order to measure the quality of the grouping solution, all 1158 SCOP sequences from the first four folds are embedded into the 16-dimensional space. The predicted group structure on this test set is then compared with the true SCOP fold labels. Figure 3 shows both the predicted group membership of these sequences and their true SCOP fold label in the form of a bar diagram: the sequences are ordered by increasing group label (the lower horizontal bar) and compared with the true fold classification (upper bar). In order to quantify the results, the inferred clusters are re-labeled ("re-colored") according to the maximum number of correctly identifiable fold labels. This procedure allows us to correctly identify the fold label of roughly 94% of the SCOP sequences.

[Figure 3: Visualization of cluster membership of the 1158 SCOP sequences contained in folds 1-4, as a bar diagram. Upper bar: true SCOP fold label. Lower bar: predicted cluster membership (clusters 1, 2, 3, ..., re-labeled by majority voting); deviations between the bars are marked as errors.]

Despite this surprisingly high percentage, it is necessary to analyze the biological relevance of the inferred grouping solution more deeply. In order to check to what extent the above overall result is influenced by artefacts due to highly related (or even almost identical) SCOP sequences, we repeated the analysis on the subset of 128 SCOP sequences with less than 50% sequence identity (PDB-50). Predicting the group membership of these 128 sequences and using the same re-labeling approach, we can correctly identify 86% of the fold labels. This result demonstrates that we have not only found trivial groups of almost identical proteins, but that we have indeed extracted relevant structural information.
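The paper does not spell out how the held-out SCOP sequences are mapped into the 16-dimensional space. One standard reading is an out-of-sample kernel-PCA projection of the kind sketched below, followed by nearest-prototype assignment; this is our own reconstruction under that assumption (the constant shift drops out of the centered inner products, so it does not appear explicitly), and all names are ours.

```python
import numpy as np

def embed_new_points(D_train, d_new, V_t, lam_t):
    """Project new objects, given their squared dissimilarities d_new
    (m x n) to the n training objects, onto the t leading principal
    axes of the training embedding (V_t: n x t eigenvectors, lam_t:
    the corresponding shifted eigenvalues used for training)."""
    row_mean = D_train.mean(axis=1)        # (1/n) sum_j D_ij
    total_mean = D_train.mean()            # (1/n^2) sum_jk D_jk
    # centered inner products of the new points with the training points
    K = -0.5 * (d_new - d_new.mean(axis=1, keepdims=True)
                - row_mean[None, :] + total_mean)
    return K @ V_t / np.sqrt(lam_t)        # kernel-PCA projection

def predict_labels(Z_new, prototypes, proto_labels):
    """Assign each embedded point to the nearest k-means prototype."""
    d2 = ((Z_new[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return proto_labels[np.argmin(d2, axis=1)]
```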
4 Discussion and Conclusion

This paper provides two main contributions that are highly useful when analyzing pairwise data. First, we employ the concept of constant shift embedding to provide a metric representation of the data. For a certain class of grouping principles sharing a shift-invariance property, this embedding is distortion-less in the sense that it does not influence the optimal assignments of objects to groups. Given the metricized data, we can now use common signal (pre-)processing and denoising techniques that are typically only defined for vectorial data.

As we investigate the clustering of protein sequences from databases like GyrB and ProDom, we are given non-metric pairwise proximity information that is strongly deteriorated by the shortcomings of the available alignment procedures. Thus, it is important to apply denoising techniques to the data as a second step before running the actual clustering procedure. We find that the combination of these two processing steps is successful in unraveling protein structure, greatly improving over existing methods (as exemplified for GyrB and ProDom).

Future research will be dedicated to further evaluation of the proposed algorithm. We will also explore the perspectives it opens in any field handling pairwise data.

Acknowledgments. The GyrB amino acid sequences were offered by courtesy of the Identification and Classification of Bacteria (ICB) databank team [19]. The authors are partially supported by DFG grants # MU 987/1-1 and # BU 914/4-1.

References

[1] A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.
[2] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403-410, 1990.
[3] F. Corpet, F. Servant, J. Gouzy, and D. Kahn. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res., 28:267-269, 2000.
[4] T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman & Hall, London, 2001.
[5] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, second edition, 2001.
[6] P.J. Huber. Projection pursuit. The Annals of Statistics, pages 435-475, 1985.
[7] H. Kasai, A. Bairoch, K. Watanabe, K. Isono, and S. Harayama. Construction of the gyrB database for the identification and classification of bacteria. Genome Informatics, pages 13-21, 1998.
[8] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 1995.
[9] S. Mika, B. Schölkopf, A.J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 536-542. MIT Press, 1999.
[10] A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536-540, 1995.
[11] W.R. Pearson and D.J. Lipman. Improved tools for biological sequence analysis. Proc. Natl. Acad. Sci., 85:2444-2448, 1988.
[12] J. Puzicha, T. Hofmann, and J. Buhmann. A theory of proximity based clustering: Structure detection by optimization. Pattern Recognition, 33(4):617-634, 1999.
[13] V. Roth, M. Braun, T. Lange, and J. Buhmann. A resampling approach to cluster validation. In Computational Statistics-COMPSTAT'02, 2002. To appear.
[14] V. Roth, J. Laub, M. Kawanabe, and J.M. Buhmann. Optimal cluster preserving embedding of non-metric proximity data. Technical Report IAI-TR-2002-5, University of Bonn, 2002.
[15] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[16] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[17] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[18] K. Tsuda, T. Kin, and K. Asai. Marginalized kernels for biological sequences. Proc. ISMB, 2002, to appear. http://www.cbrc.jp/~tsuda/.
[19] K. Watanabe, J. Nelson, S. Harayama, and H. Kasai. ICB database: the gyrB database for identification and classification of bacteria. Nucleic Acids Res., 29:344-345, 2001.
", "award": [], "sourceid": 2215, "authors": [{"given_name": "Volker", "family_name": "Roth", "institution": null}, {"given_name": "Julian", "family_name": "Laub", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}