{"title": "Agnostic Classification of Markovian Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 471, "abstract": "", "full_text": "Agnostic Classification of Markovian \n\nSequences \n\nRan EI-Yaniv \n\nShai Fine \n\nNaftali Tishby* \n\nInstitute of Computer Science and Center for  Neural  Computation \n\nThe Hebrew  University \nJerusalem 91904, Israel \n\nE-Dlail:  {ranni,fshai,tishby}Ocs.huji.ac.il \n\nCategory:  Algorithms. \n\nAbstract \n\nClassification of finite sequences without explicit knowledge of their \nstatistical nature is  a  fundamental  problem  with  many  important \napplications.  We  propose  a  new  information  theoretic  approach \nto this  problem which is  based on the following  ingredients:  (i)  se(cid:173)\nquences are similar when they are likely to be generated by the same \nsource;  (ii) cross entropies can be estimated via \"universal compres(cid:173)\nsion\";  (iii)  Markovian  sequences  can  be  asymptotically-optimally \nmerged. \nWith these ingredients we  design a  method for  the classification of \ndiscrete sequences whenever they can be compressed.  We introduce \nthe method and illustrate its application for hierarchical clustering \nof languages and for  estimating similarities of protein sequences. \n\n1 \n\nIntrod uction \n\nWhile  the relationship  between  compression  (minimal description)  and supervised \nlearning  is  by  now  well  established,  no  such  connection  is  generally  accepted  for \nthe unsupervised case.  Unsupervised classification is  still largely based on ad-hock \ndistance  measures  with  often  no  explicit  statistical  justification.  This  is  particu(cid:173)\nlarly true for  unsupervised  classification of sequences  of discrete  symbols  which  is \nencountered in numerous important applications in machine learning and data min(cid:173)\ning,  such as text categorization, biological sequence modeling, and analysis of spike \ntrains. \n\nThe  emergence  of  \"universal\"  (Le.  asymptotically  distribution  independent)  se-\n\n\u00b7Corresponding author. \n\n\f466 \n\nR.  EI-Yaniv,  S.  FineandN. TlShby \n\nquence  compression  techniques  suggests  the existence  of  \"universal\"  classification \nmethods that make minimal  assumptions  about the statistical nature of the data. \nSuch  techniques  are potentially  more robust  and  appropriate for  real  world  appli(cid:173)\ncations. \nIn  this  paper we  introduce  a  specific  method that  utilizes  the connection  between \nuniversal  compression  and  unsupervised  classification of sequences.  Our  only  un(cid:173)\nderlying assumption is that the sequences can be approximated  (in the information \ntheoretic sense) by some finite order Markov sources.  There are three ingredients to \nour approach.  The first  is  the assertion that two sequences  are statistically similar \nif they are likely to be independently generated by the same source.  This likelihood \ncan then be estimated, given a typical sequence of the most likely joint source, using \nany good compression method for  the sequence samples.  The third ingredient  is  a \nnovel and simple randomized sequence merging algorithm which provably generates \na  typical sequence of the most likely joint source of the sequences,  under the above \nMarkovian approximation assumption. 
Our similarity measure is also motivated by the well known "two sample problem" [Leh59] of estimating the probability that two given samples are taken from the same distribution. In the i.i.d. (Bernoulli) case this problem was thoroughly investigated, and the optimal statistical test is given by the sum of the empirical cross entropies between the two samples and their most likely joint source. We argue that this measure can be extended to arbitrary order Markov sources and use it to construct and sample the most likely joint source.

The similarity measure and the statistical merging algorithm can be naturally combined into classification algorithms for sequences. Here we apply the method to hierarchical clustering of short text segments in 18 European languages and to evaluation of similarities of protein sequences. A complete analysis of the method, with further applications, will be presented elsewhere [EFT97].

2 Measuring the statistical similarity of sequences

Estimating the statistical similarity of two individual sequences is traditionally done by training a statistical model for each sequence and then measuring the likelihood of the other sequence under the model. Training a model entails an assumption about the nature of the noise in the data, and this is the rationale behind most "edit distance" measures, even when the noise model is not explicitly stated.

Estimating the log-likelihood of a sequence-sample over a discrete alphabet $\Sigma$ by a statistical model can be done through the Cross Entropy or Kullback-Leibler divergence [CT91] between the sample empirical distribution $p$ and the model distribution $q$, defined as:

$$D_{KL}(p \| q) = \sum_{\sigma \in \Sigma} p(\sigma) \log \frac{p(\sigma)}{q(\sigma)} . \qquad (1)$$

The KL-divergence, however, has some serious practical drawbacks. It is non-symmetric and unbounded unless $p$ is absolutely continuous with respect to the model distribution $q$ (i.e. $q(\sigma) = 0 \Rightarrow p(\sigma) = 0$). The KL-divergence is therefore highly sensitive to low probability events under $q$. Using the "empirical" (sample) distributions for both $p$ and $q$ can result in very unreliable estimates of the true divergence. Essentially, $D_{KL}(p \| q)$ measures the asymptotic coding inefficiency when coding the sample $p$ with an optimal code for the model distribution $q$.

The symmetric divergence, $D(p, q) = D_{KL}(p \| q) + D_{KL}(q \| p)$, suffers from similar sensitivity problems and lacks a clear statistical meaning.
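To make these drawbacks concrete, the following minimal Python sketch (ours, not part of the paper) computes the plug-in empirical KL-divergence of equation (1) from order-0 (unigram) character distributions; the helper names are hypothetical. It exhibits both the asymmetry and the blow-up on symbols that appear in one sample but not in the other, which is exactly what the bounded measure introduced below avoids.

    import math
    from collections import Counter

    def empirical_dist(seq):
        """Order-0 (unigram) empirical distribution of a sequence."""
        n = len(seq)
        return {sym: c / n for sym, c in Counter(seq).items()}

    def kl_divergence(p, q):
        """Plug-in estimate of D_KL(p || q) from equation (1), in bits."""
        total = 0.0
        for sym, ps in p.items():
            qs = q.get(sym, 0.0)
            if qs == 0.0:
                return float("inf")   # p is not absolutely continuous w.r.t. q
            total += ps * math.log2(ps / qs)
        return total

    if __name__ == "__main__":
        x, y = "abracadabra", "abracadabrz"
        px, py = empirical_dist(x), empirical_dist(y)
        print(kl_divergence(px, py))   # finite
        print(kl_divergence(py, px))   # inf: 'z' occurs in y but never in x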
2.1 The "two sample problem"

Direct Bayesian arguments, or alternately the method of types [CK81], suggest that the probability that there exists one source distribution $M$ for two independently drawn samples $x$ and $y$ [Leh59] is proportional to

$$\int d\mu(M)\, \Pr(x|M) \cdot \Pr(y|M) = \int d\mu(M)\, 2^{-\left(|x|\, D_{KL}(p_x \| M) + |y|\, D_{KL}(p_y \| M)\right)} , \qquad (2)$$

where $d\mu(M)$ is a prior density over all candidate distributions, $p_x$ and $p_y$ are the empirical (sample) distributions, and $|x|$ and $|y|$ are the corresponding sample sizes. For large enough samples this integral is dominated (for any non-vanishing prior) by the maximal exponent in the integrand, that is, by the most likely joint source of $x$ and $y$, $M_\lambda$, defined as

$$M_\lambda = \arg\min_{M'} \left\{ |x|\, D_{KL}(p_x \| M') + |y|\, D_{KL}(p_y \| M') \right\} , \qquad (3)$$

where $0 \le \lambda = |x| / (|x| + |y|) \le 1$ is the sample mixture ratio. The convexity of the KL-divergence guarantees that this minimum is unique and is given by

$$M_\lambda = \lambda p_x + (1 - \lambda) p_y ,$$

the $\lambda$-mixture of $p_x$ and $p_y$.

The similarity measure between two samples, $d(x, y)$, naturally follows as the minimal value of the above exponent, normalized by the total sample size. That is,

Definition 1  The similarity measure $d(x, y) = V_\lambda(p_x, p_y)$ of two samples $x$ and $y$, with empirical distributions $p_x$ and $p_y$ respectively, is defined as

$$d(x, y) = V_\lambda(p_x, p_y) = \lambda D_{KL}(p_x \| M_\lambda) + (1 - \lambda) D_{KL}(p_y \| M_\lambda) , \qquad (4)$$

where $M_\lambda$ is the $\lambda$-mixture of $p_x$ and $p_y$.

The function $V_\lambda(p, q)$ is an extension of the Jensen-Shannon divergence (see e.g. [Lin91]) and satisfies many useful analytic properties, such as symmetry and boundedness on both sides by the $L_1$-norm, in addition to its clear statistical meaning. See [Lin91, EFT97] for a more complete discussion of this measure.

2.2 Estimating the $V_\lambda$ similarity measure

The key component of our classification method is the estimation of $V_\lambda$ for individual finite sequences, without an explicit model distribution.

Since cross entropies, $D_{KL}$, express code-length differences, they can be estimated using any efficient compression algorithm for the two sequences. The existence of "universal" compression methods, such as the Lempel-Ziv algorithm (see e.g. [CT91]), which are provably asymptotically optimal for any sequence, gives us the means for asymptotically optimal estimation of $V_\lambda$, provided that we can obtain a typical sequence of the most-likely joint source, $M_\lambda$.

We apply an improvement [BE97] of the method of Ziv and Merhav [ZM93] for the estimation of the two cross-entropies using the Lempel-Ziv algorithm, given the two sample sequences. Notice that our estimation of $V_\lambda$ is only as good as the compression method used; namely, the closer the compression is to optimal, the better the estimate of the similarity measure.

It remains to show how a typical sequence of the most-likely joint source can be generated.
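Before turning to the Markovian case, here is a minimal sketch (ours; the paper's experiments instead estimate the cross-entropies via Lempel-Ziv compression) of Definition 1 for order-0 empirical character distributions. Because the mixture $M_\lambda$ dominates both $p_x$ and $p_y$, the measure stays finite and symmetric even on small samples.

    import math
    from collections import Counter

    def empirical_dist(seq):
        """Order-0 empirical distribution (same convention as the sketch above)."""
        n = len(seq)
        return {sym: c / n for sym, c in Counter(seq).items()}

    def kl(p, q):
        """D_KL(p || q) in bits; here q = M_lambda always covers the support of p."""
        return sum(ps * math.log2(ps / q[sym]) for sym, ps in p.items())

    def similarity(x, y):
        """d(x, y) of equation (4): the lambda-weighted Jensen-Shannon divergence."""
        px, py = empirical_dist(x), empirical_dist(y)
        lam = len(x) / (len(x) + len(y))              # sample mixture ratio
        support = set(px) | set(py)
        m = {s: lam * px.get(s, 0.0) + (1 - lam) * py.get(s, 0.0) for s in support}
        return lam * kl(px, m) + (1 - lam) * kl(py, m)

    if __name__ == "__main__":
        print(similarity("abracadabra", "abracadabrz"))   # small, finite, symmetric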
3 Joint Sources of Markovian Sequences

In this section we first explicitly generalize the notion of the joint statistical source to finite order Markov probability measures. We then identify the joint source of Markovian sequences and show how to construct a typical random sample of this source.

More precisely, let $x$ and $y$ be two sequences generated by Markov processes with distributions $P$ and $Q$, respectively. We present a novel algorithm for merging the two sequences by generating a typical sequence of an approximation to the most likely joint source of $x$ and $y$. The algorithm does not require the parameters of the true sources $P$ and $Q$, and the computation of the sequence is done directly from the sequence samples $x$ and $y$.

As before, $\Sigma$ denotes a finite alphabet, and $P$ and $Q$ denote two ergodic Markov sources over $\Sigma$ of orders $K_P$ and $K_Q$, respectively. By equation (3), the $\lambda$-mixture joint source $M_\lambda$ of $P$ and $Q$ is $M_\lambda = \arg\min_{M'} \lambda D_{KL}(P \| M') + (1 - \lambda) D_{KL}(Q \| M')$, where for sequences $D_{KL}(P \| M) = \limsup_{n \to \infty} \frac{1}{n} \sum_{x \in \Sigma^n} P(x) \log \frac{P(x)}{M(x)}$. The following theorem identifies the joint source of $P$ and $Q$.

Theorem 1  The unique $\lambda$-mixture joint source $M_\lambda$ of $P$ and $Q$, of order $K = \max\{K_P, K_Q\}$, is given by the following conditional distribution. For each $s \in \Sigma^K$, $a \in \Sigma$,

$$M_\lambda(a|s) = \frac{\lambda P(s)}{\lambda P(s) + (1 - \lambda) Q(s)}\, P(a|s) + \frac{(1 - \lambda) Q(s)}{\lambda P(s) + (1 - \lambda) Q(s)}\, Q(a|s) .$$

This distribution can be naturally extended to $n$ sources with priors $\lambda_1, \ldots, \lambda_n$.

3.1 The "sequence merging" algorithm

The above theorem can be easily translated into an algorithm. Figure 1 describes a randomized algorithm that generates, from the given sequences $x$ and $y$, an asymptotically typical sequence $z$ of the most likely joint source of $P$ and $Q$, as defined by Theorem 1.

Initialization:
  • $z[0]$ := a symbol chosen from $x$ with probability $\lambda$ or from $y$ with probability $1 - \lambda$
  • $i$ := 0

Loop: repeat until the approximation error is lower than a prescribed threshold
  • $s_x$ := maximal-length suffix of $z$ appearing somewhere in $x$
  • $s_y$ := maximal-length suffix of $z$ appearing somewhere in $y$
  • $A(\lambda, s_x, s_y) := \frac{\lambda \Pr_x(s_x)}{\lambda \Pr_x(s_x) + (1 - \lambda) \Pr_y(s_y)}$
  • $r$ := choose $x$ with probability $A(\lambda, s_x, s_y)$ or $y$ with probability $1 - A(\lambda, s_x, s_y)$
  • $r(s_r)$ := randomly choose one of the occurrences of $s_r$ in $r$
  • $z[i + 1]$ := the symbol appearing immediately after $r(s_r)$ in $r$
  • $i$ := $i + 1$
End Repeat

Figure 1: The most-likely joint source algorithm.

Notice that the algorithm is completely unparameterized; even the sequence alphabets, which may differ from one sequence to another, are not explicitly needed. The algorithm can be efficiently implemented by pre-computing suffix trees for the given sequences, and the merging algorithm is naturally generalizable to any number of sequences.
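The following Python sketch (ours) follows Figure 1 under several simplifying assumptions that we flag explicitly: naive substring search stands in for the suffix trees mentioned above, $\Pr_x(s)$ is taken to be the empirical frequency of the suffix $s$ in $x$, and the stopping rule is a fixed output length rather than an approximation-error threshold. It is meant as an illustration of the procedure, not as the authors' implementation.

    import random

    def merge_sequences(x, y, lam=None, length=200, rng=random):
        """Randomized sketch of the Figure 1 joint-source sampling procedure."""
        if lam is None:
            lam = len(x) / (len(x) + len(y))          # sample mixture ratio

        def freq(s, seq):
            """Empirical probability of the substring s in seq (our assumption)."""
            positions = max(len(seq) - len(s) + 1, 1)
            hits = sum(1 for i in range(len(seq) - len(s) + 1)
                       if seq[i:i + len(s)] == s)
            return hits / positions

        def longest_suffix_in(z, seq):
            """Longest suffix of z occurring somewhere in seq (possibly empty)."""
            for k in range(len(z), 0, -1):
                if z[-k:] in seq:
                    return z[-k:]
            return ""

        # Initialization: first symbol drawn from x with probability lam, else from y.
        z = rng.choice(x) if rng.random() < lam else rng.choice(y)

        while len(z) < length:
            sx, sy = longest_suffix_in(z, x), longest_suffix_in(z, y)
            wx, wy = lam * freq(sx, x), (1 - lam) * freq(sy, y)
            a = wx / (wx + wy)                         # A(lam, s_x, s_y)
            r, s = (x, sx) if rng.random() < a else (y, sy)
            # Occurrences of the chosen suffix that are followed by at least one symbol.
            starts = [i for i in range(len(r) - len(s)) if r[i:i + len(s)] == s]
            if not starts:                             # suffix occurs only at the very end
                z += rng.choice(r)
                continue
            i = rng.choice(starts)
            z += r[i + len(s)]                         # copy the next symbol
        return z

Run on the two book excerpts of the next section, a procedure of this kind produces text in the spirit of Figure 2.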
4 Applications

There are several possible applications of our sequence merging algorithm and similarity measure. Here we focus on three: the source merging problem, estimation of sequence similarity, and bottom-up sequence classification. These algorithms differ from most existing approaches in that they rely only on the sequence data itself, in the spirit of universal compression, without explicit modeling assumptions. Further details, analysis, and applications of the method will be presented elsewhere [EFT97].

4.1 Merging and synthesis of sequences

An immediate application of the source merging algorithm is the synthesis of typical sequences of the joint source from some given data sequences, without any access to an explicit model of the source.

To illustrate this point consider the sequence in Figure 2. This sequence was randomly generated, character by character, from two natural excerpts: a 47,655-character string from Dickens' Tale of Two Cities, and a 59,097-character string from Twain's The Prince and the Pauper.

  Do your way to her breast, and sent a treason's sword- and not empty.
  "I am particularly and when the stepped of his ovn commits place. No; yes, of course, and he passed behind that by turns ascended upon him, and my bone to touch it, less to say:
  In miness?"
  The books third time. There was but pastened her unave misg his ruined head than they had knovn to keep his saw whether think" The feet our grace he called offer information?
  'Remove thought, everyone! Guards!
                                                            [Twickens, 1997]

Figure 2: A typical excerpt of random text generated by the "joint source" of Dickens and Twain.

4.2 Pairwise similarity of proteins

The joint source algorithm, combined with the new similarity measure, provides a natural means for computing the similarity of sequences over any alphabet. In this section we illustrate this application(1) for the important case of protein sequences (sequences over the set of the 20 amino acids).

From a database of all known proteins we selected 6 different families and within each family we randomly chose 10 proteins. The families chosen are: Chaperonin, MHC1, Cytochrome, Kinase, Globin Alpha and Globin Beta. The pairwise distances between all 60 proteins were computed using our agnostic algorithm and are depicted in the 60x60 matrix of Figure 3. As can be seen, the algorithm succeeds in identifying the families (the success with the Kinase and Cytochrome families is more limited).

Figure 3: A 60x60 symmetric matrix representing the pairwise distances, as computed by our agnostic algorithm, between 60 proteins; each consecutive 10 belong to a different family (chaperonin, MHC I, cytochrome, kinase, globin alpha, globin beta). Darker gray represents higher similarity.

(1) The protein results presented here are part of an ongoing work with G. Yona and E. Ben-Sasson.

In another experiment we considered all 200 proteins of the Kinase family and computed the pairwise distances of these proteins using the agnostic algorithm. For comparison we computed the pairwise similarities of these sequences using the widely used Smith-Waterman algorithm (see e.g. [HH92]).(2) The resulting agnostic similarities, computed with no biological information whatsoever, are very similar to the Smith-Waterman similarities.(3) Furthermore, our agnostic measure discovered some biological similarities not detected by the Smith-Waterman method.

4.3 Agnostic classification of languages

The sample of the joint source of two sequences can be considered as their "average" or "centroid", capturing a mixture of their statistics. Averaging and measuring distance between objects are sufficient for most standard clustering algorithms, such as bottom-up greedy clustering, vector quantization (VQ), and clustering by deterministic annealing. Thus, our merging method and similarity measure can be directly applied to the classification of finite sequences via standard clustering algorithms, as sketched below.
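As a concrete illustration (ours, with hypothetical function names), the sketch below shows such a greedy bottom-up procedure: the only primitives it needs are a pairwise distance and a merge operation, for which the similarity() and merge_sequences() sketches above can stand in; the merged sample of the joint source plays the role of the cluster "centroid".

    def greedy_bottom_up(sequences, distance, merge):
        """Greedy agglomerative clustering driven by a pairwise distance d(x, y)
        and a sequence merge that acts as the cluster representative."""
        # Each cluster is (representative_sequence, tree); leaves are labeled 0..n-1.
        clusters = [(seq, i) for i, seq in enumerate(sequences)]
        while len(clusters) > 1:
            # Find the closest pair of cluster representatives.
            i, j = min(
                ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                key=lambda ab: distance(clusters[ab[0]][0], clusters[ab[1]][0]),
            )
            (xi, ti), (xj, tj) = clusters[i], clusters[j]
            merged = (merge(xi, xj), (ti, tj))   # joint-source sample represents the new cluster
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        return clusters[0][1]                     # nested pairs encode the hierarchy

    # Hypothetical usage: tree = greedy_bottom_up(excerpts, similarity, merge_sequences)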
Specifically,  we \ntook sixteen  random  excerpts  from  the following  Porto-Indo-European languages: \nAfrikaans,  Catalan,  Danish,  Dutch,  English,  Flemish,  French,  German,  Italian, \nLatin,  Norwegian,  Polish,  Portuguese,  Spanish,  Swedish  and  Welsh,  together with \n\n2we  applied the Smith-Waterman for  computing local-alignment  costs using  the state(cid:173)\n\nof-the-art  blosum62 biological  cost  matrix. \n\n3These  results  are  not  given  here  due  to  space  limitations  and  will  be  discussed \n\nelsewhere. \n\n\fAgnostic Classification of Markovian Sequences \n\n471 \n\ntwo artificial languages:  Esperanto and Klingon4. \n\nThe  resulting  hierarchical classification tree is  depicted  in  Figure 4.  This  entirely \nunsupervised  method, when applied to these short random excerpts, clearly agrees \nwith the \"standard\" philologic tree of these languages, both in terms of the grouping \nand the levels  of similarity  (depth  of the split)  of the languages  (the Polish-Welsh \n\"similarity\"  is  probably due to the specific transcription used). \n\nFigure 4:  Agnostic bottom-up greedy clustering of eighteen languages \n\nAcknowledgments \n\nWe  sincerely thank Ran Bachrach and Golan Yona for  helpful discussions.  We  also \nthank Sageev Oore for  many useful comments. \n\nReferences \n[BE97]  R.  Bachrach and  R.  EI-Yaniv,  An Improved Measure of Relative  Entropy \n\nBetween Individual Sequences,  unpublished manuscript. \n\n[CK81]  1.  Csiszar and J . Krorner.  Information Theory:  Coding Theorems for  Dis(cid:173)\n\ncrete Memoryless Systems,  Academic Press, New-York  1981. \n\n[CT91]  T.  M.  Cover  and  J.  A.  Thomas.  Elements  of Information  Theory,  John \n\nWiley  &  Sons, New-York  1991. \n\n[EFT97]  R.  EI-Yaniv,  S.  Fine  and  N.  Tishby.  Classifying  Markovian  Sources,  in \n\npreparations, 1997. \n\n[HH92]  S.  Henikoff and  J .  G.  Henikoff  (1992) .  Amino  acid  substitution  matrices \n\nfrom  protein blocks.  Proc. Natl.  Acad.  Sci.  USA  89, 10915-10919. \n\n[Leh59]  E.  L.  Lehmann. Testing Statistical Hypotheses,  John Wiley & Sons,  New(cid:173)\n\nYork  1959. \n\n[Lin91]  J.  Lin,  1991.  Divergence measures  based on the Shannon entropy.  IEEE \n\nTransactions  on  In/ormation  Theory, 37(1):145-15l. \n\n[ZM93]  J .  Ziv  and  N.  Merhav,  1993.  A  Measure  of Relative  Entropy  Between \nIndividual  Sequences  with  Application  to  Universal  Classification,  IEEE \nTransactions  on  In/ormation  Theory,  39(4). \n\n4Klingon is  a  synthetic language that was  invented for  the Star-Trek TV series. \n\n\f", "award": [], "sourceid": 1376, "authors": [{"given_name": "Ran", "family_name": "El-Yaniv", "institution": null}, {"given_name": "Shai", "family_name": "Fine", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}