{"title": "Adaptive Mixture of Probabilistic Transducers", "book": "Advances in Neural Information Processing Systems", "page_first": 381, "page_last": 387, "abstract": null, "full_text": "Adaptive Mixture of Probabilistic Transducers\n\nYoram Singer\n\nAT&T Bell Laboratories\nsinger@research.att.com\n\nAbstract\n\nWe introduce and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each model in the mixture. Theoretical analysis and comparative simulations indicate that the learning algorithm tracks the best model from an arbitrarily large (possibly infinite) pool of models. We also present an application of the model for inducing a noun phrase recognizer.\n\n1 Introduction\n\nSupervised learning of a probabilistic mapping between temporal sequences is an important goal in the analysis and classification of natural sequences, with a broad range of applications such as handwriting and speech recognition, natural language processing and DNA analysis. Research efforts in supervised learning of probabilistic mappings have been almost exclusively focused on estimating the parameters of a predefined model. For example, in [5] a second-order recurrent neural network was used to induce a finite state automaton that classifies input sequences, and in [1] an input-output HMM architecture was used for similar tasks.\n\nIn this paper we introduce and analyze an alternative approach based on a mixture model of a new subclass of probabilistic transducers, which we call suffix tree transducers. The mixture of experts architecture has proved to be a powerful approach both theoretically and experimentally. 
See [4, 8, 6, 10, 2, 7] for analyses and applications of mixture models from different perspectives such as connectionism, Bayesian inference and computational learning theory. By combining techniques used for compression [13] and unsupervised learning [12], we devise an online algorithm that efficiently updates the mixture weights and the parameters of all the possible models from an arbitrarily large (possibly infinite) pool of suffix tree transducers. Moreover, we apply the mixture estimation paradigm to the parameters of each model in the pool and thereby obtain an efficient estimate of the free parameters of each model. We present theoretical analysis, simulations and experiments with real data which show that the learning algorithm indeed tracks the best model in a growing pool of models, yielding an accurate approximation of the source. All proofs are omitted due to lack of space.\n\n2 Mixture of Suffix Tree Transducers\n\nLet $\\Sigma_{in}$ and $\\Sigma_{out}$ be two finite alphabets. A suffix tree transducer $T$ over $(\\Sigma_{in}, \\Sigma_{out})$ is a rooted, $|\\Sigma_{in}|$-ary tree where every internal node of $T$ has one child for each symbol in $\\Sigma_{in}$. The nodes of the tree are labeled by pairs $(s, \\gamma_s)$, where $s$ is the string associated with the path (sequence of symbols in $\\Sigma_{in}$) that leads from the root to that node, and $\\gamma_s : \\Sigma_{out} \\to [0,1]$ is the output probability function. A suffix tree transducer (stochastically) maps arbitrarily long input sequences over $\\Sigma_{in}$ to output sequences over $\\Sigma_{out}$ as follows. The probability that $T$ will output a string $y_1, y_2, \\ldots, y_n$ in $\\Sigma_{out}^*$ given an input string $x_1, x_2, \\ldots, x_n$ in $\\Sigma_{in}^*$, denoted by $P_T(y_1, y_2, \\ldots, y_n | x_1, x_2, \\ldots, x_n)$, is $\\prod_{k=1}^{n} \\gamma_{s^k}(y_k)$, where $s^1 = x_1$, and for $1 \\le j \\le n-1$, $s^j$ is the string labeling the deepest node reached by taking the path corresponding to $x_j, x_{j-1}, x_{j-2}, \\ldots$ 
starting at the root of $T$. A suffix tree transducer is therefore a probabilistic mapping that induces a measure over the possible output strings given an input string. Examples of suffix tree transducers are given in Fig. 1.\n\nFigure 1: A suffix tree transducer (left) over $(\\Sigma_{in}, \\Sigma_{out}) = (\\{0, 1\\}, \\{a, b, c\\})$ and two of its possible sub-models (subtrees). The strings labeling the nodes are the suffixes of the input string used to predict the output string. At each node there is an output probability function defined for each of the possible output symbols. For instance, using the suffix tree transducer depicted on the left, the probability of observing the symbol $b$ given that the input sequence is $\\ldots, 0, 1, 0$ is 0.1. The probability of the current output, when each transducer is associated with a weight (prior), is the weighted sum of the predictions of each transducer. For example, assume that the weights of the trees are 0.7 (left tree), 0.2 (middle), and 0.1 (right). Then the probability that the output $y_n = a$ given that $(x_{n-2}, x_{n-1}, x_n) = (0, 1, 0)$ is $0.7 \\cdot P_{T_1}(a|010) + 0.2 \\cdot P_{T_2}(a|10) + 0.1 \\cdot P_{T_3}(a|0) = 0.7 \\cdot 0.8 + 0.2 \\cdot 0.7 + 0.1 \\cdot 0.5 = 0.75$.\n\nGiven a suffix tree transducer $T$ we are interested in the prediction of the mixture of all possible subtrees of $T$. We associate with each subtree (including $T$) a weight which can be interpreted as its prior probability. We later show how the learning algorithm of a mixture of suffix tree transducers adapts these weights in accordance with the performance (the evidence in Bayesian terms) of each subtree on past observations. Direct calculation of the mixture probability is infeasible since there might be exponentially many such subtrees. However, the technique introduced in [13] can be generalized and applied to our setting. Let $T'$ be a subtree of $T$. 
Denote by $n_1$ the number of internal nodes of $T'$ and by $n_2$ the number of leaves of $T'$ which are not leaves of $T$. For example, $n_1 = 2$ and $n_2 = 1$ for the tree depicted on the right part of Fig. 1, assuming that $T$ is the tree depicted on the left part of the figure. The prior weight of a tree $T'$, denoted by $P_0(T')$, is defined to be $(1 - \\alpha)^{n_1} \\alpha^{n_2}$, where $\\alpha \\in (0, 1)$. Denote by $Sub(T)$ the set of all possible subtrees of $T$ including $T$ itself. It can be easily verified that this definition of the weights is a proper measure, i.e., $\\sum_{T' \\in Sub(T)} P_0(T') = 1$. This distribution over trees can be extended to unbounded trees, assuming that the largest tree is an infinite $|\\Sigma_{in}|$-ary suffix tree transducer, by using the following randomized recursive process. We start with a suffix tree that includes only the root node. With probability $\\alpha$ we stop the process and with probability $1 - \\alpha$ we add all the possible $|\\Sigma_{in}|$ sons of the node and continue the process recursively for each of the sons. Using this recursive prior over the suffix tree transducers, we can calculate the prediction of the mixture at step $n$ in time that is linear in $n$, as follows,\n\n$$\\alpha \\gamma_{\\epsilon}(y_n) + (1 - \\alpha) ( \\alpha \\gamma_{x_n}(y_n) + (1 - \\alpha) ( \\alpha \\gamma_{x_{n-1} x_n}(y_n) + (1 - \\alpha) \\cdots ) )$$\n\nTherefore, the prediction time of a single symbol is bounded by the maximal depth of $T$, or the length of the input sequence if $T$ is infinite. Denote by $\\tilde{\\gamma}_s(y_n)$ the prediction of the mixture of subtrees rooted at $s$, and let $Leaves(T)$ be the set of leaves of $T$. The above sum equals $\\tilde{\\gamma}_{\\epsilon}(y_n)$, and can be evaluated recursively as follows (see footnote 1),\n\n$$\\tilde{\\gamma}_s(y_n) = \\begin{cases} \\gamma_s(y_n) & s \\in Leaves(T) \\\\ \\alpha \\gamma_s(y_n) + (1 - \\alpha) \\tilde{\\gamma}_{(x_{n-|s|} s)}(y_n) & \\text{otherwise} \\end{cases} \\qquad (1)$$\n\nFor example, given that the input sequence is ... 
, 0, 1, 1, 0, then the probabilities of the mixtures of subtrees for the tree depicted on the left part of Fig. 1, for $y_n = b$ and given that $\\alpha = 1/2$, are $\\tilde{\\gamma}_{110}(b) = 0.4$, $\\tilde{\\gamma}_{10}(b) = 0.5 \\cdot \\gamma_{10}(b) + 0.5 \\cdot 0.4 = 0.3$, $\\tilde{\\gamma}_{0}(b) = 0.5 \\cdot \\gamma_{0}(b) + 0.5 \\cdot 0.3 = 0.25$, $\\tilde{\\gamma}_{\\epsilon}(b) = 0.5 \\cdot \\gamma_{\\epsilon}(b) + 0.5 \\cdot 0.25 = 0.25$.\n\n3 An Online Learning Algorithm\n\nWe now describe an efficient learning algorithm for a mixture of suffix tree transducers. The learning algorithm uses the recursive priors and the evidence to efficiently update the posterior weight of each possible subtree. In this section we assume that the output probability functions are known. Hence, we need to evaluate the following,\n\n$$\\sum_{T' \\in Sub(T)} P(y_n | T') P(T' | (x_1, y_1), \\ldots, (x_{n-1}, y_{n-1})) \\stackrel{def}{=} \\sum_{T' \\in Sub(T)} P(y_n | T') P_n(T') \\qquad (2)$$\n\nwhere $P_n(T')$ is the posterior weight of $T'$. Direct calculation of the above sum requires exponential time. However, using the idea of recursive calculation as in Equ. (1) we can efficiently calculate the prediction of the mixture. Similar to the definition of the recursive prior $\\alpha$, we define $q_n(s)$ to be the posterior weight of a node $s$ compared to the mixture of all nodes below $s$. We can compute the prediction of the mixture of suffix tree transducers rooted at $s$ by simply replacing the prior weight $\\alpha$ with the posterior weight, $q_{n-1}(s)$, as follows,\n\n$$\\tilde{\\gamma}_s(y_n) = \\begin{cases} \\gamma_s(y_n) & s \\in Leaves(T) \\\\ q_{n-1}(s) \\gamma_s(y_n) + (1 - q_{n-1}(s)) \\tilde{\\gamma}_{(x_{n-|s|} s)}(y_n) & \\text{otherwise} \\end{cases} \\qquad (3)$$\n\nIn order to update $q_n(s)$ we introduce one more variable, denoted by $r_n(s)$. Setting $r_0(s) = \\log(\\alpha / (1 - \\alpha))$ for all $s$, $r_n(s)$ is updated as follows,\n\n$$r_n(s) = r_{n-1}(s) + \\log(\\gamma_s(y_n)) - \\log(\\tilde{\\gamma}_{(x_{n-|s|} s)}(y_n)) . \\qquad (4)$$\n\nTherefore, $r_n(s)$ is the log-likelihood ratio between the prediction of $s$ and the prediction of the mixture of all nodes below $s$ in $T$. 
The new posterior weights $q_n(s)$ are calculated from $r_n(s)$,\n\n$$q_n(s) = 1 / (1 + e^{-r_n(s)}) . \\qquad (5)$$\n\nIn summary, for each new observation pair, we traverse the tree by following the path that corresponds to the input sequence $x_n x_{n-1} x_{n-2} \\ldots$ The predictions of each sub-mixture are calculated using Equ. (3). Given these predictions the posterior weights of each sub-mixture are updated using Equ. (4) and Equ. (5). Finally, the probability of $y_n$ induced by the whole mixture is the prediction propagated out of the root node, as stated by Lemma 3.1.\n\nLemma 3.1 $\\sum_{T' \\in Sub(T)} P(y_n | T') P_n(T') = \\tilde{\\gamma}_{\\epsilon}(y_n)$.\n\nLet $Loss_n(T)$ be the logarithmic loss (negative log-likelihood) of a suffix tree transducer $T$ after $n$ input-output pairs. That is, $Loss_n(T) = \\sum_{i=1}^{n} -\\log(P(y_i | T))$. Similarly, the loss of the mixture is defined to be $Loss_n^{mix} = \\sum_{i=1}^{n} -\\log(\\tilde{\\gamma}_{\\epsilon}(y_i))$. The advantage of using a mixture of suffix tree transducers over a single suffix tree is due to the robustness of the solution, in the sense that the prediction of the mixture is almost as good as the prediction of the best suffix tree in the mixture.\n\nFootnote 1: A similar derivation still holds even if there is a different prior $\\alpha_s$ at each node $s$ of $T$. For the sake of simplicity we assume that $\\alpha$ is constant.\n\nTheorem 1 Let $T$ be a (possibly infinite) suffix tree transducer, and let $(x_1, y_1), \\ldots, (x_n, y_n)$ be any possible sequence of input-output pairs. The loss of the mixture is at most $Loss_n(T') - \\log(P_0(T'))$, for each possible subtree $T'$. The running time of the algorithm is $D n$, where $D$ is the maximal depth of $T$, or $n^2$ when $T$ is infinite.\n\nThe proof is based on a technique introduced in [4]. Note that the additional loss is constant, hence the normalized additional loss per observation pair is $-\\log(P_0(T'))/n$, which decreases like $O(1/n)$. 
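
The online update of this section can be illustrated with a minimal Python sketch. It assumes, as the section does, that the output probability functions are known and strictly positive. The names Node, init_weights, posterior_weight, predict_and_update and build_example are hypothetical helpers introduced only for this illustration, and the deepest node reached on the path is treated as a leaf for the current step. The example tree is a single chain of nodes for the suffixes 0, 10 and 110, and its node probabilities for b (0.25, 0.2, 0.2, 0.4) are the values implied by the arithmetic of the worked example in Section 2.

```python
import math


class Node:
    '''One node of a suffix tree transducer (hypothetical helper class).'''

    def __init__(self, probs, children=None):
        self.probs = probs                # gamma_s: output symbol -> probability
        self.children = children or {}    # input symbol -> child node
        self.r = 0.0                      # log-likelihood ratio r_n(s)


def init_weights(node, alpha):
    '''Set r_0(s) = log(alpha / (1 - alpha)) at every node, so q_0(s) = alpha.'''
    node.r = math.log(alpha / (1.0 - alpha))
    for child in node.children.values():
        init_weights(child, alpha)


def posterior_weight(node):
    '''Equ. (5): q_n(s) = 1 / (1 + exp(-r_n(s))).'''
    return 1.0 / (1.0 + math.exp(-node.r))


def predict_and_update(root, inputs, y):
    '''Return the mixture probability of y, then update r along the path.

    inputs holds x_1..x_n in order; the path follows x_n, x_{n-1}, ...
    '''
    path = [root]
    for x in reversed(inputs):
        child = path[-1].children.get(x)
        if child is None:
            break
        path.append(child)
    # The deepest node reached predicts on its own, as a leaf does in Equ. (3).
    mix = path[-1].probs[y]
    for node in reversed(path[:-1]):
        q = posterior_weight(node)                   # q_{n-1}(s)
        own, below = node.probs[y], mix
        mix = q * own + (1.0 - q) * below            # Equ. (3)
        node.r += math.log(own) - math.log(below)    # Equ. (4)
    return mix


def build_example():
    '''Chain of nodes for the suffixes 0, 10 and 110, with alpha = 1/2.'''
    n110 = Node({'b': 0.4})
    n10 = Node({'b': 0.2}, {1: n110})
    n0 = Node({'b': 0.2}, {1: n10})
    root = Node({'b': 0.25}, {0: n0})
    init_weights(root, 0.5)
    return root, n10
```

Calling predict_and_update(root, [0, 1, 1, 0], 'b') on the tree from build_example reproduces the mixture prediction of the worked example and, as a side effect, moves weight at node 10 toward the deeper node 110, which predicted b better. Each call touches only the nodes on one root-to-leaf path, which is where the running time of D per observation comes from.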
\nGiven a long sequence of input-output pairs or many short sequences, the structure of the suffix tree transducer is inferred as well. This is done by updating the output functions, as described in the next section, while adding new branches to the tree whenever the suffix of the input sequence does not appear in the current tree. The update of the weights, the parameters, and the structure ends when the maximal depth is reached, or when the beginning of the input sequence is encountered.\n\n4 Parameter Estimation\n\nIn this section we describe how the output probability functions are estimated. Again, we devise an online scheme. Denote by $C_s^n(y)$ the number of times the output symbol $y$ was observed out of the $n$ times the node $s$ was visited. A commonly used estimator smoothes each count by adding a constant $\\zeta$ as follows,\n\n$$\\hat{\\gamma}_s(y) = \\frac{C_s^n(y) + \\zeta}{n + \\zeta |\\Sigma_{out}|} . \\qquad (6)$$\n\nThe special case of $\\zeta = 1/2$ is termed Laplace's modified rule of succession or the add-1/2 estimator. In [9], Krichevsky and Trofimov proved that the add-1/2 estimator, when applied sequentially, has a bounded logarithmic loss compared to the best (maximum-likelihood) estimator calculated after observing the entire input-output sequence. The additional loss of the estimator after $n$ observations is $\\frac{1}{2}(|\\Sigma_{out}| - 1) \\log(n) + |\\Sigma_{out}| - 1$. When the output alphabet $\\Sigma_{out}$ is rather small, we approximate $\\gamma_s(y)$ by $\\hat{\\gamma}_s(y)$ using Equ. (6) and increment the count of the corresponding symbol every time the node $s$ is visited. We predict by replacing $\\gamma$ with its estimate $\\hat{\\gamma}$ in Equ. (3). The loss of the mixture with estimated output probability functions, compared to any subtree $T'$ with known parameters, is now bounded as follows,\n\n$$Loss_n^{mix} \\le Loss_n(T') - \\log(P_0(T')) + \\frac{1}{2} |T'| (|\\Sigma_{out}| - 1) \\log(n / |T'|) + |T'| (|\\Sigma_{out}| - 1) ,$$\n\nwhere $|T'|$ is the number of leaves in $T'$. 
This bound is obtained by combining the bound on the prediction of the mixture from Thm. 1 with the loss of the smoothed estimator while applying Jensen's inequality [3].\n\nWhen $|\\Sigma_{out}|$ is fairly large or the sample size is fairly small, the smoothing of the output probabilities is too crude. However, in many real problems, only a small subset of the output alphabet is observed in a given context (a node in the tree). For example, when mapping phonemes to phones [11], for a given sequence of input phonemes the phones that can be pronounced are limited to a few possibilities. Therefore, we would like to devise an estimation scheme that statistically depends on the effective local alphabet and not on the whole alphabet. Such an estimation scheme can be devised by employing again a mixture of models, one model for each possible subset $\\Sigma'_{out}$ of $\\Sigma_{out}$. Although there are $2^{|\\Sigma_{out}|}$ subsets of $\\Sigma_{out}$, we next show that if the estimators depend only on the size of each subset then the whole mixture can be maintained in time linear in $|\\Sigma_{out}|$.\n\nDenote by $\\hat{\\gamma}_s^n(y \\,|\\, |\\Sigma'_{out}| = i)$ the estimate of $\\gamma_s(y)$ after $n$ observations given that the alphabet $\\Sigma'_{out}$ is of size $i$. Using the add-1/2 estimator, $\\hat{\\gamma}_s^n(y \\,|\\, |\\Sigma'_{out}| = i) = (C_s^n(y) + 1/2) / (n + i/2)$. Let $\\Sigma_{out}^n(s)$ be the set of different output symbols observed at node $s$, i.e.\n\n$$\\Sigma_{out}^n(s) = \\{ \\sigma \\mid \\sigma = y_{i_k}, \\; s = (x_{i_k - |s| + 1}, \\ldots, x_{i_k}), \\; 1 \\le k \\le n \\} ,$$\n\nand define $\\Sigma_{out}^0(s)$ to be the empty set. There are $\\binom{|\\Sigma_{out}| - |\\Sigma_{out}^n(s)|}{i - |\\Sigma_{out}^n(s)|}$ possible alphabets of size $i$. Thus, the prediction of the mixture of all possible subsets of $\\Sigma_{out}$ is,\n\n$$\\hat{\\gamma}_s^n(y) = \\sum_{j = |\\Sigma_{out}^n(s)|}^{|\\Sigma_{out}|} \\binom{|\\Sigma_{out}| - |\\Sigma_{out}^n(s)|}{j - |\\Sigma_{out}^n(s)|} \\, w_j^n \\, \\hat{\\gamma}_s^n(y \\,|\\, j) , \\qquad (7)$$\n\nwhere $w_i^n$ is the posterior probability of an alphabet of size $i$. 
Evaluation of this sum requires $O(|\\Sigma_{out}|)$ operations (and not $O(2^{|\\Sigma_{out}|})$). We can compute Equ. (7) in an online fashion as follows. Let\n\n$$\\tilde{\\gamma}_s^n(i) = \\binom{|\\Sigma_{out}| - |\\Sigma_{out}^n(s)|}{i - |\\Sigma_{out}^n(s)|} \\, w_i^0 \\prod_{k=1}^{n} \\hat{\\gamma}_s^{k-1}(y_{i_k} \\,|\\, i) . \\qquad (8)$$\n\nWithout loss of generality, let us assume a uniform prior for the possible alphabet sizes. Then,\n\n$$P_0(\\Sigma'_{out}) = P_0(|\\Sigma'_{out}| = i) \\equiv w_i^0 = 1 \\Big/ \\left( |\\Sigma_{out}| \\binom{|\\Sigma_{out}|}{i} \\right) .$$\n\nThus, for all $i$, $\\tilde{\\gamma}_s^0(i) = 1 / |\\Sigma_{out}|$. $\\tilde{\\gamma}_s^{n+1}(i)$ is updated from $\\tilde{\\gamma}_s^n(i)$ as follows,\n\n$$\\tilde{\\gamma}_s^{n+1}(i) = \\tilde{\\gamma}_s^n(i) \\times \\begin{cases} 0 & \\text{if } |\\Sigma_{out}^{n+1}(s)| > i \\\\ \\frac{C_s^n(y_{i_{n+1}}) + 1/2}{n + i/2} & \\text{if } |\\Sigma_{out}^{n+1}(s)| \\le i \\text{ and } y_{i_{n+1}} \\in \\Sigma_{out}^n(s) \\\\ \\frac{i - |\\Sigma_{out}^n(s)|}{|\\Sigma_{out}| - |\\Sigma_{out}^n(s)|} \\cdot \\frac{1/2}{n + i/2} & \\text{if } |\\Sigma_{out}^{n+1}(s)| \\le i \\text{ and } y_{i_{n+1}} \\notin \\Sigma_{out}^n(s) \\end{cases}$$\n\nInformally: If the number of different symbols observed so far exceeds a given size then all alphabets of this size are eliminated from the mixture by slashing their posterior probability to zero. Otherwise, if the next symbol was observed before, the output probability is the prediction of the add-1/2 estimator. Lastly, if the next symbol is entirely new, we need to sum the predictions of all the alphabets of size $i$ which agree on the first $|\\Sigma_{out}^n(s)|$ symbols and have $y_{i_{n+1}}$ as one of their $i - |\\Sigma_{out}^n(s)|$ (yet) unobserved symbols. Furthermore, we need to multiply by the a priori probability of observing $y_{i_{n+1}}$. Assuming a uniform prior over the unobserved symbols, this probability equals $1 / (|\\Sigma_{out}| - |\\Sigma_{out}^n(s)|)$. Applying Bayes rule again, the prediction of the mixture of all possible subsets of the output alphabet is,\n\n$$\\hat{\\gamma}_s^n(y_{i_{n+1}}) = \\sum_{i=1}^{|\\Sigma_{out}|} \\tilde{\\gamma}_s^{n+1}(i) \\Big/ \\sum_{i=1}^{|\\Sigma_{out}|} \\tilde{\\gamma}_s^n(i) . \\qquad (9)$$\n\nApplying the online mixture estimation technique twice, first for the structure and then for the parameters, yields an efficient and robust online algorithm. 
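
The per-node mixture over alphabet sizes of Equ. (8) and Equ. (9) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original implementation. The class AlphabetSizeMixture and its method names are hypothetical, the three-case update follows the informal description above, and the rescaling of the weights after every update is an added numerical safeguard that leaves the ratio in Equ. (9) unchanged.

```python
class AlphabetSizeMixture:
    '''Mixture over alphabet sizes at one node (hypothetical helper class).'''

    def __init__(self, alphabet_size):
        self.K = alphabet_size            # size of the full output alphabet
        self.n = 0                        # number of visits to the node
        self.counts = {}                  # counts of the symbols seen so far
        # weights[i] plays the role of the size-i term of Equ. (8); index 0 is unused.
        self.weights = [0.0] + [1.0 / alphabet_size] * alphabet_size

    def _next_weights(self, y):
        '''Apply the three-case update for a candidate next symbol y.'''
        m = len(self.counts)              # number of distinct symbols seen
        c = self.counts.get(y, 0)
        if c == 0 and m == self.K:
            raise ValueError('more distinct symbols than the alphabet size')
        new = [0.0] * (self.K + 1)
        for i in range(1, self.K + 1):
            if c > 0:
                # Seen symbol: add-1/2 prediction under an alphabet of size i.
                factor = (c + 0.5) / (self.n + i / 2.0)
            else:
                # New symbol: sizes that are now too small get factor zero.
                share = max(i - m, 0) / (self.K - m)
                factor = share * 0.5 / (self.n + i / 2.0)
            new[i] = self.weights[i] * factor
        return new

    def predict(self, y):
        '''Equ. (9): ratio of the summed weights after and before seeing y.'''
        return sum(self._next_weights(y)) / sum(self.weights)

    def update(self, y):
        new = self._next_weights(y)
        total = sum(new)
        self.weights = [w / total for w in new]   # rescaling added for stability
        self.counts[y] = self.counts.get(y, 0) + 1
        self.n += 1

    def size_posterior(self):
        '''Normalized weight of each alphabet size 1..K.'''
        total = sum(self.weights)
        return [w / total for w in self.weights[1:]]
```

With this sketch the predictions over the full alphabet sum to one at every step, and once five distinct symbols have been observed the weights of all sizes below five are exactly zero. Each prediction or update costs time linear in the alphabet size, as stated above.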
For a sample of size $n$, the time complexity of the algorithm is $D |\\Sigma_{out}| n$ (or $|\\Sigma_{out}| n^2$ if $T$ is infinite). The predictions of the adaptive mixture are almost as good as those of any suffix tree transducer with any set of parameters. The logarithmic loss of the mixture depends on the number of non-zero parameters as follows,\n\n$$Loss_n^{mix} \\le Loss_n(T') - \\log(P_0(T')) + \\frac{1}{2} I_{NZ} \\log(n) + O(|T'| |\\Sigma_{out}|) ,$$\n\nwhere $I_{NZ}$ is the number of non-zero parameters of the transducer $T'$. If $I_{NZ} \\ll |T'| |\\Sigma_{out}|$ then the performance of the above scheme, when employing a mixture model for the parameters as well, is significantly better than using the add-1/2 rule with the full alphabet.\n\n5 Evaluation and Applications\n\nIn this section we briefly present evaluation results of the model and its learning algorithm. We also discuss and present results obtained from learning the syntactic structure of noun phrases. We start with an evaluation of the estimation scheme for a multinomial source.\n\nIn order to check the convergence of a mixture model for a multinomial source, we simulated a source whose output symbols belong to an alphabet of size 10 and set the probabilities of observing any of the last five symbols to zero. Therefore, the actual alphabet is of size 5. The posterior probabilities for the sum of all possible subsets of $\\Sigma_{out}$ of size $i$ ($1 \\le i \\le 10$) were calculated after each iteration. The results are plotted on the left part of Fig. 2. The very first observations rule out alphabets of size lower than 5 by slashing their posterior probability to zero. After a few observations, the posterior probability is concentrated around the actual size, yielding an accurate online estimate of the multinomial source.\n\nThe simplicity of the learning algorithm and the online update scheme enable evaluation of the algorithm on millions of input-output pairs in a few minutes. 
For example, the average update time for a suffix tree transducer of maximal depth 10 when the output alphabet is of size 4 is about 0.2 milliseconds on a Silicon Graphics workstation. A typical result is shown in Fig. 2 on the right. In the example, $\\Sigma_{out} = \\Sigma_{in} = \\{1, 2, 3, 4\\}$. The description of the source is as follows. If $x_n \\ge 3$ then $y_n$ is uniformly distributed over $\\Sigma_{out}$; otherwise ($x_n \\le 2$) $y_n = x_{n-5}$ with probability 0.9 and $y_n = 4 - x_{n-5}$ with probability 0.1. The input sequence $x_1, x_2, \\ldots$ was created entirely at random. This source can be implemented by a sparse suffix tree transducer of maximal depth 5. Note that the actual size of the alphabet is only 2 at half of the leaves of the tree. We used a suffix tree transducer of maximal depth 20 to learn the source. The negative of the logarithm of the predictions (normalized per symbol) is shown for (a) the true source, (b) a mixture of suffix tree transducers and their parameters, (c) a mixture of only the possible suffix tree transducers (the parameters are estimated using the add-1/2 scheme), and (d) a single (overestimated) model of depth 8. Clearly, the mixture models converge to the entropy of the source much faster than the single model. Moreover, employing the mixture estimation technique twice results in an even faster convergence.\n\n[Figure 2, right panel. Horizontal axis: Number of Examples (50 to 500). Curves: (a) Source, (b) Mixture of Models and Parameters, (c) Mixture of Models, (d) Single Overestimated Model.]\n\nFigure 2: Left: Example of the convergence of the posterior probability of a mixture model for a multinomial source with a large number of possible outcomes when the actual number of observed symbols is small. Right: performance comparison of the predictions of a single model, two mixture models and the true underlying transducer.\n\nWe are currently exploring the applicative possibilities of the algorithm. Here we briefly discuss and demonstrate how to induce an English noun phrase recognizer. Recognizing noun phrases is an important task in automatic natural text processing, for applications such as information retrieval, translation tools and data extraction from texts. A common practice is to recognize noun phrases by first analyzing the text with a part-of-speech tagger, which assigns the appropriate part-of-speech (verb, noun, adjective etc.) for each word in context. Then, noun phrases are identified by manually defined regular expression patterns that are matched against the part-of-speech sequences. We took an alternative route by building a suffix tree transducer based on a labeled data set from the UPENN tree-bank corpus. 
We defined $\\Sigma_{in}$ to be the set of possible part-of-speech tags and set $\\Sigma_{out} = \\{0, 1\\}$, where the output symbol given its corresponding input symbol (the part-of-speech tag of the current word) is 1 iff the word is part of a noun phrase. We used over 250,000 marked tags and tested the performance on more than 37,000 tags. The test phase was performed by freezing the model structure, the mixture weights and the estimated parameters. The suffix tree transducer was of maximal depth 15, hence very long phrases can be statistically identified. By thresholding the output probability we classified the tags in the test data and found that less than 2.4% of the words were misclassified. A typical result is given in Table 1. We are currently investigating methods to incorporate linguistic knowledge into the model and its learning algorithm and compare the performance of the model with traditional techniques.\n\nSentence    POS tag  Class  Prediction\nTom         PNP      1      0.99\nSmith       PNP      1      0.99\n,           ,        0      0.01\ngroup       NN       1      0.98\nchief       NN       1      0.98\nexecutive   NN       1      0.98\nof          IN       0      0.02\nU.K.        PNP      1      0.99\nmetals      NNS      1      0.99\nand         CC       1      0.67\nindustrial  JJ       1      0.96\nmaterials   NNS      1      0.99\nmaker       NN       1      0.96\n,           ,        0      0.03\nwill        MD       0      0.03\nbecome      VB       0      0.01\nchairman    NN       1      0.81\n.           .        0      0.01\n\nTable 1: Extraction of noun phrases using a suffix tree transducer. In this typical example, two long noun phrases were identified correctly with high confidence.\n\nAcknowledgments\n\nThanks to Y. Bengio, Y. Freund, F. Pereira, D. Ron, R. Schapire, and N. Tishby for helpful discussions. The work on syntactic structure induction is done in collaboration with I. Dagan and S. Engelson. This work was done while the author was at the Hebrew University of Jerusalem. 
\nReferences\n\n[1] Y. Bengio and P. Frasconi. An input output HMM architecture. In NIPS-7, 1994.\n[2] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice. In STOC-24, 1993.\n[3] T.M. Cover and J.A. Thomas. Elements of information theory. Wiley, 1991.\n[4] A. DeSantis, G. Markowsky, and M.N. Wegman. Learning probabilistic prediction functions. In Proc. of the 1st Wksp. on Comp. Learning Theory, pages 312-328, 1988.\n[5] C.L. Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4:393-405, 1992.\n[6] D. Haussler and A. Barron. How well do Bayes methods work for on-line prediction of {+1, -1} values? In The 3rd NEC Symp. on Comput. and Cogn., 1993.\n[7] D.P. Helmbold and R.E. Schapire. Predicting nearly as well as the best pruning of a decision tree. In COLT-8, 1995.\n[8] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixture of local experts. Neural Computation, 3:79-87, 1991.\n[9] R.E. Krichevsky and V.K. Trofimov. The performance of universal encoding. IEEE Trans. on Inform. Theory, 1981.\n[10] N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.\n[11] M.D. Riley. A statistical model for generating pronunciation networks. In Proc. of IEEE Conf. on Acoustics, Speech and Signal Processing, pages 737-740, 1991.\n[12] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. In NIPS-6, 1993.\n[13] F.M.J. Willems, Y.M. Shtarkov, and T.J. Tjalkens. The context tree weighting method: Basic properties. IEEE Trans. Inform. Theory, 41(3):653-664, 1995.\n", "award": [], "sourceid": 1099, "authors": [{"given_name": "Yoram", "family_name": "Singer", "institution": null}]}