{"title": "Bayesian Inference of Regular Grammar and Markov Source Models", "book": "Advances in Neural Information Processing Systems", "page_first": 388, "page_last": 395, "abstract": null, "full_text": "388 \n\nSmith and Miller \n\nBayesian Inference of Regular Grammar \n\nand Markov Source Models \n\nKurt R. Smith and Michael I. Miller \n\nBiomedical Computer Laboratory \n\nand \n\nElectronic Signals and Systems Research Laboratory \n\nWashington University, SL  Louis. MO 63130 \n\nABSTRACT \n\nIn this paper we develop a Bayes criterion which includes the Rissanen \ncomplexity, for  inferring regular grammar models.  We develop two \nmethods for regular grammar Bayesian inference.  The fIrst method is \nbased  on  treating  the  regular  grammar as  a  I-dimensional  Markov \nsource, and the second is based on the combinatoric characteristics of \nthe regular grammar itself.  We apply the resulting Bayes criteria to a \nparticular example in order to show the efficiency of each method. \n\n1  MOTIVATION \n\nWe are interested in segmenting electron-microscope autoradiography (EMA) images by \nlearning representational models for the textures found in the EMA image.  In studying \nthis  problem,  we have recognized  that both  structural  and  statistical  features  may  be \nuseful for characterizing textures.  This has  motivated us to study the  source  modeling \nproblem  for both  structural sources and  statistical sources.  The statistical  sources  that \nwe have examined are the class of one and two-dimensional Markov sources (see [Smith \n1990] for a Bayesian treatment of Markov random field texture model inference), while \nthe  structural sources  that we are primarily interested  in  here are the class  of regular \ngrammars, which are important due to the role that grammatical constraints may play in \nthe development of structural features for texture representation. \n\n\fBayesian Inference of Regular Grammar and Markov Source Models \n\n389 \n\n2  MARKOV SOURCE INFERENCE \n\nOur primary interest here is the development of a complete Bayesian framework for the \nprocess of inferring  a  regular grammar from  a  training  sequence.  However,  we  have \nshown  previously  that there  exists a  I-D Markov  source which  generates  the regular \nlanguage defined via some regular grammar [Miller, 1988].  We can therefore develop a \ngeneralized Bayesian inference procedure over the class of I-D Markov sources which \nenables  us  to  learn  the  Markov  source corresponding to  the optimal regular grammar. \nWe begin our analysis by developing the general structure for Bayesian source modeling. \n\n2.1  BAYESIAN APPROACH TO SOURCE MODELING \n\nWe  state  the  Bayesian  approach  to  model  learning:  Given  a  set  of source  models \n{ ~, th,\u00b7 . \" 8M.I}  and the observation x, choose the source model a which  most  accurately \nrepresents  the  unknown source that  generated x.  This decision  is made  by  calculating \nBayes risk over the possible models  which produces a general decision criterion for the \nmodel  learning problem: \n\n{ max} log P(xt~) + log  Pj  . \n~8t \u2022. \u00b7.Bit\u00b7} \n\n(2.1) \n\nUnder the additional assumption that the apriori probabilities over the candidate models \nare equivalent, the decision criterion becomes \n\n(2.2) \n\nwhich  is  the  quantity  that  we  will  use  in  measuring  the  accuracy  of  a  model's \nrepresentation. 
\n\n2.2  STOCHASTIC COMPLEXITY AND  MODEL LEARNING \n\nIt is well known that when given finite data, Bayesian procedures of this kind which do \nnot have any prior on the models suffer from  the fundamental  limitation that they will \npredict models  of greater and  greater complexity.  This  has  led  others  to  introduce \npriors into the Bayes hypothesis testing procedure based on the complexity of the model \nbeing  tested  [Rissanen,  1986].  In  particular,  for  the  Markov  case the complexity  is \ndirectly  proportional  to  the  number of transition  probabilities of the particular model \nbeing  tested  with  the  prior exponentially  decreaSing  with  the  associated  complexity. \nWe now describe the inclusion of the complexity measure in greater detail. \n\nFollowing  Rissanen,  the basic  idea is  to  uncover the  model  which  assigns  maximum \nprobability to the observed data, while also being as simple as possible so as to require a \nsmall Kolmogorov description  length.  The complexity associated with a  model having \nk real parameters and a  likelihood with n independent samples,  is the  now well-known \n!Jog n  which  allows  us  to  express  the generalization of the  original  Bayes procedure \n2 \n(2.2) as the quantity \n\n\f390 \n\nSmith and Miller \n\n(2.3) \n\n\"-\n\nNote well  that a is the k9rdimensional parameter parameterizing model a. which must \nbe estimated from  the observed data %,..  An alternative view of (2.3)  is discovered by \nviewing the second term as the prior in  the Bayes model (2.1) where the prior is defined \nas \n\nP \n\nltl \u00b7 \n---.! 101  \" \n~= e  2 \n\n\u2022 \n\n(2.4) \n\n2.3  I-D l\\fARKOV SOURCE MODELING \n\nConsider that x\" is a  I-D n-Iength  string of symbols which  is  generated by  an  unknown \nfinite-state  Markov  source. \nIn  examining  (2.3),  we  recognize  that  for  I-D  Markov \n\nsources  log P(rl8;)  may  be  written  as  log n P9a(S(Xj)lS\"(Xj_l\u00bb  where  S(x.)  is  a  state \n\n.1 \n\nA \n\nfunction  which  evaluates  to  a  state  in  the  Markov  source  state  set S9;.  Using  this \nnotation, the Bayes hypothesis test for  I-D Markov sources may be expressed as: \n\nj-l \n\n(2.5) \n\nFor the general Markov source inference problem, we know only that the string x\" was \ngenerated by a  I-D Markov source, with the state set S9;  and  the  transition  probabilities \nP9a{StIS,). kJeS9a' unknown.  They must therefore be included in the inft\"rence procedure. \nTo include the complexity term for this case, we note that the number of parameters to \nbe estimated for model a is  simply  the number of entries in  the state-transition matrix \nP4, i.e. 19; = IS9;12.  Therefore for  I-D Markov sources, the generalized Bayes hypothesis \ntest including complexity may be stated as \n\nmta \n\n{ \n~9t, .. ,8M1 \n\n1 ,,\u00b71 \n} n L log Pel.S(Xj)IS(Xj-l\u00bb - ~g n. \n';-1 \n\n'\" \n\nISBJ2 \n2n \n\n(2.6) \n\nwhere we have divided the entire quantity by n in order to express the criterion in terms \nof bits pc7 symbol.  Note that a candidate Markov source model 8; is initially  specified \nby its ordez and corresponding state set  S Ba. \n\nThe procedure for  inferring  1-0 Markov  source  models can thus be stated as follows. \nGiven a sequence x\"  from  some  unknown  source,  consider candidate  Markov  source \nmodels  by  computing  the  state  function  S(x.)  (detemlined  by  the  candidate  model \norder) over the  entire  string x~  Enumerating  the  state  transitions  which  occur in  %,. 
\nprovides  an  estimate of the  state-transition  matrix  P,. which  is  then  used  to compute \n(2.6).  Now. the inferred Markov source becomes the ooe maximizing (2.6). \n\n'\" \n\n\fBayesian Inference of Regular Grammar and Markov Source Models \n\n391 \n\n3  REGULAR GRAMMAR INFERENCE \n\nAlthough  the Bayes criterion developed  for  I-D Markov  sources (2.6)  is  a  sufficient \nmodel  learning criterion  for  the class of regular grammars,  we will  now  show  that by \ntaking  advantage  of the apriori  knowledge  that  the  source  is  a  regular  grammar,  the \ninference procedure can be made much more efficient  This apriori knowledge brings a \nspecial  structure  to  the  regular  grammar  inference  problem  in  that  not  all  allowable \nsets  of Markov  probabilities  correspond  to  regular  grammars.  In  fact,  as  shown  in \n[Miller,  1988].  corresponding  to  each  regular  grammar  is  a  unique  set  of candidate \nprobabilities, implying  that  the  Bayesian  solution  which  takes  this  into account will  be \nfar more efficient.  We demonstrate  that now. \n\n3.1  BAYESIAN CRITERION l\"SING  GRAMMAR COMBINATORICS \n\nOur approach  is  to  use  the  combinatoric properties of the  regular grammar in  order to \ndevelop the optimal Bayes hypothesis test.  We begin by defining the regular grammar. \n\nDefinition:  A  regular  grammar G is  a quadruple (VN, VT, Ss,R) where VN, VT  are  finite \nsets  of non-terminal  symbols  (or  states)  and  tenninal  symbols  respectively,  Ss  is  the \nsentence  start  state,  and  R  is  a  finite  set  of  production  rules  consisting  of  the \ntransfonnation  of a  non-tenninal  symbol  to  either  a  terminal  followed  by  a  non(cid:173)\ntenninal, or a terminal alone, i.e .. \n\nIn the class of regular grammars that  we consider, we define the depth  of the language \nas  the  maximum  number of tenninal  symbols  which  make  up  a  nontenninal symbol. \nCorresponding to each regular grammar is an associated incidence matrix B  with the i,k,1t \nentry B i) equal  to  the  number of times  there  is  a  production  for some tenninal j  and \nnon-terminals  i.k of the  fonn  Si~Wpk.ER.  Also associated with each grammar Gi is \nthe set of all n-Iength strings produced by  the grammar, denoted as  the regular language \n%Il(Gi). \n\nNow  we make the quite reasonable assumption that no string in the language %Il(Gi) is \nmore or less probable apriori than any  other string in that language.  This indicates that \nall n-lengtb  strings  that can  be  generated  by  Gi are  equiprobable  with  a  probability \ndictated by the combinatorics of the language as \n\nP(XIlIGi) = I  1  I' \n\n%Il(Gi) \n\n(3.1) \n\nwhere I %Il(Gi) I denotes the number of n-Iength sequences in  the language which can be \ncomputed by considering the combinatorics of the language as follows: \n\n\f392 \n\nSmith and Miller \n\nwith  AGi  corresponding  to  the  largest  eigenvalue  of the  state-transition  matrix  BGI' \nThis  results  from  the  combinatoric  growth  rate being  detennined by  the  sum  of the \nentries  in  the  \"til  power  state-transition  matrix  Bo . .,  which  grows  as  the  largest \neigenvalue AGI of BGi [Blahut, 1987].  We can now write (3.1) in these tenns as \n\n(3.2) \n\nwhich expresses the probability of the sequence x\" in tenns of the combinatorics of Gi. \n\nWe  now  use  this  combinatoric  interpretation  of  the  probability  to  develop  Bayes \ndecision criterion over two candidate grammars.  
We now use this combinatoric interpretation of the probability to develop the Bayes decision criterion over two candidate grammars. Assume that there exists a finite space of sequences X, all of which may be generated by one of the two possible grammars \{G_0, G_1\}. Now by dividing this observation space X into two decision regions, X_0 (for G_0) and X_1 (for G_1), we can write the Bayes risk R in terms of the observation probabilities P(x^n \mid G_0), P(x^n \mid G_1):

    R = \sum_{x^n \in X_1} P(x^n \mid G_0) + \sum_{x^n \in X_0} P(x^n \mid G_1).        (3.3)

This implementation of the Bayes risk assumes that sequences from each grammar occur equiprobably a priori and that the cost of choosing the incorrect grammar is equal to 1. Now incorporating the combinatoric counting probabilities (3.2), we can rewrite (3.3) as

    R = \sum_{x^n \in X_1} \lambda_{G_0}^{-n} + \sum_{x^n \in X_0} \lambda_{G_1}^{-n},

which can be rewritten

    R = \frac{1}{2} + \sum_{x^n \in X_0} \left( \lambda_{G_1}^{-n} - \lambda_{G_0}^{-n} \right).        (3.4)

The risk is therefore minimized by choosing G_0 if \lambda_{G_1}^{-n} < \lambda_{G_0}^{-n} and G_1 if \lambda_{G_1}^{-n} > \lambda_{G_0}^{-n}. This establishes the likelihood ratio for the grammar inference problem:

    \frac{\lambda_{G_1}^{-n}}{\lambda_{G_0}^{-n}} \;\mathop{\gtrless}_{G_0}^{G_1}\; 1,

which can alternatively be expressed in terms of the log as

    \max_{G_0, G_1} \; -n \log \lambda_{G_i}.

Recognizing this as the maximum likelihood decision, this decision criterion is easily generalized to M hypotheses. Now, ignoring any complexity component, the generalized Bayes test for a regular grammar can be stated as

    \max_{G_i \in \{G_1, \ldots, G_M\}} \; -n \log \hat{\lambda}_{G_i},        (3.5)

where \hat{\lambda}_{G_i} is the largest eigenvalue of the estimated incidence matrix \hat{B}_{G_i} corresponding to grammar G_i, and \hat{B}_{G_i} is estimated from x^n.

The complexity factor to be included in this Bayesian criterion differs from the complexity term in (2.3) due to the fact that the parameters to be estimated are now the entries in the B_{G_i} matrix, which are strictly binary. From a description length interpretation, then, these parameters can be fully described using 1 bit per entry in B_{G_i}. The complexity term is thus simply |S_{G_i}|^2, which now allows us to write the Bayes inference criterion for regular grammars as

    \max_{G_i \in \{G_1, \ldots, G_M\}} \left[ -\log \hat{\lambda}_{G_i} - \frac{|S_{G_i}|^2}{n} \right],        (3.6)

in terms of bits per symbol. We can now state the algorithm for inferring grammars; a brief sketch of the procedure in code follows below.

Regular Grammar Inference Algorithm

1. Initialize the grammar depth to d = 1.

2. Compute |S_{G_i}| = |V_T|^d.

3. Using the state function S_d(x_j) corresponding to the current depth, compute the state transitions at all sites x_j in the observed sequence x^n in order to estimate the incidence matrix \hat{B}_{G_i} for the grammar currently being considered.

4. Compute \hat{\lambda}_{G_i} from \hat{B}_{G_i} (recall that this is the largest eigenvalue of \hat{B}_{G_i}).

5. Using \hat{\lambda}_{G_i} and |S_{G_i}|, compute (3.6); denote this as I_{G_i} = -\log \hat{\lambda}_{G_i} - |S_{G_i}|^2 / n.

6. Increase the grammar depth, d = d + 1, and go to step 2 (i.e. test another candidate grammar) until I_{G_i} ceases to increase.

The regular grammar of minimum depth which maximizes I_{G_i} (i.e. maximizes (3.6)) is then the optimal regular grammar source model for the given sequence x^n.
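The sketch below illustrates one way the algorithm above might be implemented, assuming a binary terminal alphabet, taking the depth-d state at a site to be the last d terminals (so that |S_G| = |V_T|^d, as in step 2), charging one bit per incidence-matrix entry normalized by n for the complexity term, and stopping as soon as the criterion fails to increase. These choices, and the function names, are assumptions for illustration rather than a definitive statement of the authors' implementation.

import itertools
import numpy as np

def grammar_criterion(x, d, alphabet=(0, 1)):
    """Evaluate criterion (3.6), in bits per symbol, for a candidate grammar of depth d."""
    n = len(x)
    states = list(itertools.product(alphabet, repeat=d))   # |S_G| = |V_T|**d states
    idx = {s: i for i, s in enumerate(states)}
    B_hat = np.zeros((len(states), len(states)))           # estimated incidence matrix
    for j in range(d, n):
        prev, curr = tuple(x[j - d:j]), tuple(x[j - d + 1:j + 1])
        B_hat[idx[prev], idx[curr]] = 1                     # entries are strictly binary
    lam = max(abs(np.linalg.eigvals(B_hat)))                # largest eigenvalue lambda_hat
    return -np.log2(lam) - len(states) ** 2 / n

def infer_depth(x, max_depth=5):
    """Steps 1-6: increase the depth until the criterion I_G ceases to increase."""
    best_d, best_score = None, -np.inf
    for d in range(1, max_depth + 1):
        score = grammar_criterion(x, d)
        if score <= best_score:
            break
        best_d, best_score = d, score
    return best_d

# Example (hypothetical training string):
# x = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0] * 8
# print(infer_depth(x))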
\n\n3.2  REGULAR GRAMMAR INFERENCE RESULTS \n\nTo compare the efficiency of the two Bayes criteria (2.6)  and (3.6), we will consider a \nregular grammar inference experiment  The regular grammar that we  will  attempt  to \nlearn, which we refer to as the 4-0,ls regular grammar, is a run-length constrained binary \n\n\f394 \n\nSmith and Miller \n\ngrammar which  disallows  4  consecutive  occurrences  of a  0  or 8  1.  Referring  to the \nregular grammar definition. we note that this regular grammar can be described by its \nincidence matrix \n\nB4.O,l \n\n000  I  o 0 \n100  1  o 0 \n010  1  o 0 \no 0  1  010 \no 0  1  001 \no 0  1  000 \n\nwhere the states corresponding to row and column indices are \n\nNote  that  this  regular  grammar  has  a  depth  equal  to  3  and  thus  the  corresponding \nMarkov source has an order equal to 3. \n\nThe inference experiment may  be described as follows.  Given a training set of length 16 \nstrings from the 4-0,ls language, we apply the Bayes criteria (2.6) and (3.6) in an attempt \nto  infer the  regular grammar in each case.  We compute the criteria for  five candidate \nmodels  of order/depth  1 through  5  (recall  that this defmes the  size of the  state  set for \nthe Markov source and the regular grammar, respectively). \n\nTreating  the  unknown  regular  grammar  as  a  Markov  source,  we  estimate  the \ncorresponding  state-transition  matrix  P and then compute the Bayes criterion according \nto (2.6) for each of the five candidate models.  We compute the criterion as a function of \nthe number of training samples for rach candidate model and plot the result in Figure la. \nSimilarly. we estimate the incidence matrix B and compute the Bayes criterion according \nto (3.6) for each of the five regular grammar candidate models. and plot the results as a \nfunction of the number of training samples in Figure lb. \n\n\"\" \n\n\"\" \n\nWe  compare  the  two  Bayesian  criteria  by  examining  Figures  18  and  lb.  Note  that \ncriterion  (3.6) discovers the correct regular grammar (depth = 3) after only 50 training \nsamples (Figure  Ib), while the equivalent Markov source (order = 3) is found only after \nalmost 500 training samples have been used in computing (2.6) (Figure la).  This points \nout that  a  much  more  efficient  inference  procedure  exists  for  regular grammars  by \ntaking advantage of the apriori grammar information (i.e. only the  depth and the binary \nincidence matrix B  must be estimated). whereas for  1-0 Markov sources. both the order \nand the real-valued state-transition matrix P must be estimated. \n\n\"\" \n\n\"\" \n\n4. CONCLUSION \n\nIn conclusion, we stress the importance of casting the source modeling problem within a \nBayesian  framework  which  incorporates  priors  based  on  the  model  complexity  and \nknown model attributes.  Using this approach, we have developed an efficient Bayesian \n\n\fBayesian Inference of Regular Grammar and Markov Source Models \n\n395 \n\n-0.8 \n\n-0.9 \n\n-1 \n\n\u2022 \n\n0 \n\n00 \n\n0  . .. ~ \n\u2022  \u2022\u2022 ....  -\n\u2022 \u2022 \n.... ~ .... ... \n0 \u2022 \u2022 \n\u2022 \n\u2022  \u2022 \u2022 \n* \n\n\u2022 \n\u2022  \u2022 \u2022 \u2022  \u2022 \n\u2022 \n\u2022 \n\u2022 \n\u2022 \n\"ij_~i()I()I(  )I()I()I(  x x  x Limit \n\n\u2022 \n\u2022 \n\u2022 \n\u2022 \n\nx \nx \nx \n\n0 \n\n0 \n\n-0.8 -\n\n-0.9 -\n\n-11-\n\n0  * '\" x>li<~ ......... \n*  x \n\no  0 \n\no \n\n*  x \n* \n\n, .. ' . \no \n\u2022\u2022\u2022\u2022 . . . . .  __ _ \n\u2022 \n.  
4 CONCLUSION

In conclusion, we stress the importance of casting the source modeling problem within a Bayesian framework which incorporates priors based on the model complexity and known model attributes. Using this approach, we have developed an efficient Bayesian framework for inferring regular grammars. This type of Bayesian model is potentially quite useful for the texture analysis and image segmentation problem, where a consistent framework is desired for considering both structural and statistical features in the texture/image representation.

Acknowledgements

This research was supported by the NSF via a Presidential Young Investigator Award ECE-8552518 and by the NIH via a DRR Grant RR-1380.

References

Blahut, R. E. (1987). Principles and Practice of Information Theory, Addison-Wesley Publishing Co., Reading, MA.

Miller, M. I., Roysam, B., Smith, K. R., and Udding, J. T. (1988). \"Mapping Rule-Based Regular Grammars to Gibbs Distributions\", AMS-IMS-SIAM Joint Conference on Spatial Statistics and Imaging, American Mathematical Society.

Rissanen, J. (1986). \"Stochastic Complexity and Modeling\", Annals of Statistics, 14, no. 3, pp. 1080-1100.

Smith, K. R., and Miller, M. I. (1990). \"A Bayesian Approach Incorporating Rissanen Complexity for Learning Markov Random Field Texture Models\", Proceedings of the Int. Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM.
", "award": [], "sourceid": 231, "authors": [{"given_name": "Kurt", "family_name": "Smith", "institution": null}, {"given_name": "Michael", "family_name": "Miller", "institution": null}]}