{"title": "Relative Density Nets: A New Way to Combine Backpropagation with HMM's", "book": "Advances in Neural Information Processing Systems", "page_first": 1149, "page_last": 1156, "abstract": null, "full_text": "Relative Density Nets: A New Way to Combine Backpropagation with HMM's\n\nAndrew D. Brown\nDepartment of Computer Science\nUniversity of Toronto\nToronto, Canada M5S 3G4\nandy@cs.utoronto.ca\n\nGeoffrey E. Hinton\nGatsby Unit, UCL\nLondon, UK WC1N 3AR\nhinton@gatsby.ucl.ac.uk\n\nAbstract\n\nLogistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's.\n\n1 Introduction\n\nA standard way of performing classification using a generative model is to divide the training cases into their respective classes and then train a set of class-conditional models. This unsupervised approach to classification is appealing for two reasons. It is possible to reduce overfitting, because the model learns the class-conditional input densities P(x|c) rather than the input-conditional class probabilities P(c|x). Also, provided that the model density is a good match to the underlying data density, the decision provided by a probabilistic model is Bayes optimal. The problem with this unsupervised approach to using probabilistic models for classification is that, for reasons of computational efficiency and analytical convenience, very simple generative models are typically used and the optimality of the procedure no longer holds. For this reason it is usually advantageous to train a classifier discriminatively.\n\nIn this paper we will look specifically at the problem of learning HMM's for classifying speech sequences. It is an application area where the assumption that the HMM is the correct generative model for the data is inaccurate and discriminative methods of training have been successful. The first section will give an overview of current methods of discriminatively training HMM classifiers. We will then introduce a new type of multi-layer backpropagation network which takes better advantage of the HMM's for discrimination. Finally, we present some simulations comparing the two methods.\n\nFigure 1: An Alphanet with one HMM per class. Each computes a score for the sequence and this feeds into a softmax output layer.\n\n2 Alphanets and Discriminative Learning\n\nThe unsupervised way of using an HMM for classifying a collection of sequences is to use the Baum-Welch algorithm [1] to fit one HMM per class. Then new sequences are classified by computing the probability of a sequence under each model and assigning it to the one with the highest probability. Speech recognition is one of the commonest applications of HMM's, but unfortunately an HMM is a poor model of the speech production process.\n\n
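As a concrete point of reference, the baseline generative classifier just described can be sketched in a few lines of Python. This is only an illustration under assumed names (not the code used in this paper); it relies on the third-party hmmlearn package, and train_seqs_by_class is a hypothetical mapping from a class label to a list of (T x D) arrays of feature frames.\n\n    # Fit one Gaussian-output HMM per class with Baum-Welch (EM), then classify a\n    # new sequence by the model under which it has the highest log-likelihood.\n    import numpy as np\n    from hmmlearn import hmm\n\n    def fit_class_hmms(train_seqs_by_class, n_states=4):\n        models = {}\n        for label, seqs in train_seqs_by_class.items():\n            X = np.concatenate(seqs)              # stack all frames for this class\n            lengths = [len(s) for s in seqs]      # sequence boundaries for Baum-Welch\n            m = hmm.GaussianHMM(n_components=n_states, covariance_type='diag', n_iter=20)\n            m.fit(X, lengths)                     # unsupervised fit on this class only\n            models[label] = m\n        return models\n\n    def classify(models, seq):\n        # Assign the sequence to the class whose HMM gives it the highest score.\n        scores = {label: m.score(seq) for label, m in models.items()}\n        return max(scores, key=scores.get)\n\n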
Because the HMM is such a poor model of speech production, speech researchers have looked at the possibility of improving the performance of an HMM classifier by using information from negative examples - examples drawn from classes other than the one which the HMM was meant to model. One way of doing this is to compute the mutual information between the class label and the data under the HMM density, and maximize that objective function [2].\n\nIt was later shown that this procedure could be viewed as a type of neural network (see Figure 1) in which the inputs to the network are the log-probability scores \\mathcal{L}(x_{1:T}|\\mathcal{H}_k) of the sequence under each hidden Markov model \\mathcal{H}_k [3]. In such a model there is one HMM per class, and the output is a softmax non-linearity:\n\n    p_k = \\frac{\\exp(\\mathcal{L}(x_{1:T}|\\mathcal{H}_k))}{\\sum_{k'} \\exp(\\mathcal{L}(x_{1:T}|\\mathcal{H}_{k'}))}    (1)\n\nTraining this model by maximizing the log probability of correct classification leads to a classifier which will perform better than an equivalent HMM model trained solely in an unsupervised manner. Such an architecture has been termed an \"Alphanet\" because it may be implemented as a recurrent neural network which mimics the forward pass of the forward-backward algorithm (the results of the forward pass are the probabilities of the hidden states conditioned on the past observations, or \"alphas\" in standard HMM terminology).\n\n3 Backpropagation Networks as Density Comparators\n\nA multi-layer feedforward network is usually thought of as a flexible non-linear regression model, but if it uses the logistic function non-linearity in the hidden layer, there is an interesting interpretation of the operation performed by each hidden unit. Given a mixture of two Gaussians where we know the component priors P(G_k) and the component densities P(x|G_k), the posterior probability that Gaussian G_0 generated an observation x is a logistic function whose argument is the negative log-odds of the two classes [4]. This can clearly be seen by rearranging the expression for the posterior:\n\n    P(\\mathcal{G}_0|x) = \\frac{P(x|\\mathcal{G}_0)P(\\mathcal{G}_0)}{P(x|\\mathcal{G}_0)P(\\mathcal{G}_0) + P(x|\\mathcal{G}_1)P(\\mathcal{G}_1)} = \\frac{1}{1 + \\exp\\{-\\log\\frac{P(x|\\mathcal{G}_0)}{P(x|\\mathcal{G}_1)} - \\log\\frac{P(\\mathcal{G}_0)}{P(\\mathcal{G}_1)}\\}}    (2)\n\nIf the class conditional densities in question are multivariate Gaussians\n\n    P(x|\\mathcal{G}_k) = |2\\pi\\Sigma|^{-\\frac{1}{2}} \\exp\\{-\\frac{1}{2}(x - \\mu_k)^T \\Sigma^{-1} (x - \\mu_k)\\}    (3)\n\nwith equal covariance matrices, \\Sigma, then the posterior class probability may be written in this familiar form:\n\n    P(\\mathcal{G}_0|x) = \\frac{1}{1 + \\exp\\{-(x^T w + b)\\}}    (4)\n\nwhere\n\n    w = \\Sigma^{-1}(\\mu_0 - \\mu_1)    (5)\n\n    b = \\frac{1}{2}\\mu_1^T \\Sigma^{-1} \\mu_1 - \\frac{1}{2}\\mu_0^T \\Sigma^{-1} \\mu_0 + \\log\\frac{P(\\mathcal{G}_0)}{P(\\mathcal{G}_1)}    (6)\n\nThus, the multi-layer perceptron can be viewed as computing pairwise posteriors between Gaussians in the input space, and then combining these in the output layer to compute a decision.\n\n4 A New Kind of Discriminative Net\n\nThis view of a feedforward network suggests variations in which other kinds of density models are used in place of Gaussians in the input space. In particular, instead of performing pairwise comparisons between Gaussians, the units in the first hidden layer can perform pairwise comparisons between the densities of an input sequence under M different HMM's. For a given sequence the log-probability under each HMM is computed, and the difference in log-probability is used as input to the logistic hidden unit (we take the time-averaged log-probability so that the scale of the inputs is independent of the length of the sequence).\n\n
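Before moving on, the Gaussian correspondence of Section 3 that motivates this construction is easy to check numerically. The following short NumPy/SciPy snippet is our own illustration of equations 2-6 (it is not from the original paper): it compares the posterior computed directly from Bayes' rule with the logistic form.\n\n    # Verify that, for two Gaussians with a shared covariance, the posterior\n    # P(G0|x) equals a logistic function of x with the weights of equations (5)-(6).\n    import numpy as np\n    from scipy.stats import multivariate_normal\n\n    rng = np.random.default_rng(0)\n    D = 3\n    mu0, mu1 = rng.normal(size=D), rng.normal(size=D)\n    A = rng.normal(size=(D, D))\n    Sigma = A @ A.T + np.eye(D)                      # shared, positive-definite covariance\n    prior0, prior1 = 0.4, 0.6\n    x = rng.normal(size=D)\n\n    p0 = multivariate_normal.pdf(x, mu0, Sigma) * prior0\n    p1 = multivariate_normal.pdf(x, mu1, Sigma) * prior1\n    posterior = p0 / (p0 + p1)                       # direct Bayes' rule\n\n    Sinv = np.linalg.inv(Sigma)\n    w = Sinv @ (mu0 - mu1)                                                       # equation (5)\n    b = 0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0) + np.log(prior0 / prior1)    # equation (6)\n    assert np.allclose(posterior, 1.0 / (1.0 + np.exp(-(x @ w + b))))\n\n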
Using the difference of log-probabilities in this way is equivalent to computing the posterior responsibility of one component in a mixture of two HMM's with equal prior probabilities. In order to maximally leverage the information captured by the HMM's we use one hidden unit for each of the \\binom{M}{2} unordered pairs of HMM's, so that all possible pairs are included. The output of hidden unit (mn) is given by\n\n    h_{(mn)} = \\sigma(\\mathcal{L}_m - \\mathcal{L}_n)    (7)\n\nwhere \\sigma(\\cdot) is the logistic function, \\mathcal{L}_m is the (time-averaged) log-probability of the sequence under the mth HMM, and (mn) indexes the set of all unordered pairs of HMM's. The results of this hidden layer computation are then combined using a fully connected layer of free weights, W, and finally passed through a softmax function to make the final decision:\n\n    a_k = \\sum_{(mn)} W_{(mn),k} h_{(mn)}    (8)\n\n    p_k = \\frac{\\exp(a_k)}{\\sum_{k'} \\exp(a_{k'})}    (9)\n\nwhere p_k is the value of the kth output unit. The resulting architecture is shown in Figure 2. Because each unit in the hidden layer takes as input the difference in log-probability of two HMM's, this can be thought of as a fixed layer of weights connecting each hidden unit to a pair of HMM's with weights of \u00b11.\n\nFigure 2: A multi-layer density net with HMM's in the input layer. The hidden layer units perform all pairwise comparisons between the HMM's.\n\nIn contrast to the Alphanet, which allocates one HMM to model each class, this network does not require a one-to-one alignment between models and classes, and it gets maximum discriminative benefit from the HMM's by comparing all pairs. Another benefit of this architecture is that it allows us to use more HMM's than there are classes. The unsupervised approach to training HMM classifiers is problematic because it depends on the assumption that a single HMM is a good model of the data and, in the case of speech, this is a poor assumption. Training the classifier discriminatively alleviates this drawback, and the multi-layer classifier goes even further in this direction by allowing many HMM's to be used to learn the decision boundaries between the classes. The intuition here is that many small HMM's can be a far more efficient way to characterize sequences than one big HMM. When many small HMM's cooperate to generate sequences, the mutual information between different parts of the generated sequences scales linearly with the number of HMM's and only logarithmically with the number of hidden nodes in each HMM [5].\n\n5 Derivative Updates for a Relative Density Network\n\nThe learning algorithm for an RDN is just the backpropagation algorithm applied to the network architecture defined in equations 7, 8 and 9. The output layer is a distribution over class memberships of the data point x_{1:T}, parameterized as a softmax function. We maximize the log-probability of the correct class (equivalently, we minimize the cross-entropy):\n\n    \\ell = \\sum_{k=1}^{K} t_k \\log p_k    (10)\n\nwhere p_k is the value of the kth output unit and t_k is an indicator variable which is equal to 1 if k is the true class and 0 otherwise.\n\n
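To make the forward pass concrete, the following NumPy sketch (our illustration, with hypothetical variable names) computes equations 7-10 given a vector of time-averaged log-likelihoods, one per HMM:\n\n    # Sketch of the RDN forward pass (equations 7-10). 'ell' holds the time-averaged\n    # log-likelihood of one input sequence under each of the M HMM's, and W holds\n    # the free weights from the M(M-1)/2 pairwise hidden units to the K outputs.\n    import numpy as np\n    from itertools import combinations\n\n    def rdn_forward(ell, W, targets=None):\n        M = len(ell)\n        pairs = list(combinations(range(M), 2))        # all unordered pairs (m, n)\n        h = np.array([1.0 / (1.0 + np.exp(-(ell[m] - ell[n]))) for m, n in pairs])   # eq. (7)\n        a = h @ W                                      # eq. (8); W has shape (len(pairs), K)\n        p = np.exp(a - a.max())\n        p /= p.sum()                                   # eq. (9), softmax outputs\n        if targets is None:\n            return p\n        return p, np.sum(targets * np.log(p))          # eq. (10), log-probability of the true class\n\nOnly W and the parameters of the HMM's themselves are learned; the \u00b11 pairwise connections are fixed.\n\n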
Taking derivatives of this expression with respect to the inputs of the output units yields\n\n    \\frac{\\partial \\ell}{\\partial a_k} = t_k - p_k    (11)\n\n    \\frac{\\partial \\ell}{\\partial W_{(mn),k}} = \\frac{\\partial \\ell}{\\partial a_k} \\frac{\\partial a_k}{\\partial W_{(mn),k}} = (t_k - p_k) h_{(mn)}    (12)\n\nThe derivative of the output of the (mn)th hidden unit with respect to the output of the ith HMM, \\mathcal{L}_i, is\n\n    \\frac{\\partial h_{(mn)}}{\\partial \\mathcal{L}_i} = \\sigma(\\mathcal{L}_m - \\mathcal{L}_n)(1 - \\sigma(\\mathcal{L}_m - \\mathcal{L}_n))(\\delta_{im} - \\delta_{in})    (13)\n\nwhere (\\delta_{im} - \\delta_{in}) is an indicator which equals +1 if i = m, -1 if i = n, and zero otherwise. This derivative can be chained with the derivatives backpropagated from the output to the hidden layer.\n\nFor the final step of the backpropagation procedure we need the derivative of the log-likelihood of each HMM with respect to its parameters. In the experiments we use HMM's with a single, axis-aligned, Gaussian output density per state. We use the following notation for the parameters:\n\n\u2022 A: a_{ij} is the transition probability from state i to state j\n\u2022 \\Pi: \\pi_i is the initial state prior\n\u2022 \\mu_i: mean vector for state i\n\u2022 v_i: vector of variances for state i\n\u2022 \\mathcal{H}: the set of HMM parameters {A, \\Pi, \\mu, v}\n\nWe also use the variable s_t to represent the state of the HMM at time t. We make use of the property of all latent variable density models that the derivative of the log-likelihood is equal to the expected derivative of the joint log-likelihood under the posterior distribution. For an HMM this means that:\n\n    \\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\mathcal{H}_i} = \\sum_{s_{1:T}} P(s_{1:T}|x_{1:T}, \\mathcal{H}) \\frac{\\partial}{\\partial \\mathcal{H}_i} \\log P(x_{1:T}, s_{1:T}|\\mathcal{H})    (14)\n\nThe expected joint log-likelihood of an HMM is:\n\n    \\langle \\log P(x_{1:T}, s_{1:T}|\\mathcal{H}) \\rangle = \\sum_i \\langle \\delta_{s_1,i} \\rangle \\log \\pi_i + \\sum_{t=2}^{T} \\sum_{i,j} \\langle \\delta_{s_t,j} \\delta_{s_{t-1},i} \\rangle \\log a_{ij} + \\sum_{t=1}^{T} \\sum_i \\langle \\delta_{s_t,i} \\rangle \\left[ -\\frac{1}{2} \\sum_d \\log v_{i,d} - \\frac{1}{2} \\sum_d (x_{t,d} - \\mu_{i,d})^2 / v_{i,d} \\right] + \\mathrm{const}    (15)\n\nwhere \\langle \\cdot \\rangle denotes expectations under the posterior distribution, and \\langle \\delta_{s_t,i} \\rangle and \\langle \\delta_{s_t,j} \\delta_{s_{t-1},i} \\rangle are the expected state occupancies and transitions under this distribution. All the necessary expectations are computed by the forward-backward algorithm. We could take derivatives with respect to this functional directly, but that would require doing constrained gradient descent on the probabilities and the variances. Instead, we reparameterize the model using a softmax basis for the probability vectors and an exponential basis for the variance parameters. This choice of basis allows us to do unconstrained optimization in the new basis. The new parameters are defined as follows:\n\n    \\pi_i = \\frac{\\exp(\\theta_i^{(\\pi)})}{\\sum_{i'} \\exp(\\theta_{i'}^{(\\pi)})}, \\qquad a_{ij} = \\frac{\\exp(\\theta_{ij}^{(a)})}{\\sum_{j'} \\exp(\\theta_{ij'}^{(a)})}, \\qquad v_{i,d} = \\exp(\\theta_{i,d}^{(v)})\n\n
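As a small illustration of this reparameterization (ours, with assumed array shapes: theta_pi of length K, theta_a of shape K x K and theta_v of shape K x D for a K-state HMM with D-dimensional observations), the mapping from the unconstrained parameters back to the probabilities and variances is:\n\n    # Softmax basis for the initial-state and transition probabilities, exponential\n    # basis for the variances. Subtracting the max before exponentiating is only a\n    # numerical-stability detail; the softmax is invariant to such shifts.\n    import numpy as np\n\n    def natural_parameters(theta_pi, theta_a, theta_v):\n        pi = np.exp(theta_pi - theta_pi.max())\n        pi /= pi.sum()                                      # initial state prior\n        A = np.exp(theta_a - theta_a.max(axis=1, keepdims=True))\n        A /= A.sum(axis=1, keepdims=True)                   # each row sums to one\n        v = np.exp(theta_v)                                 # strictly positive variances\n        return pi, A, v\n\n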
This choice of basis results in the following derivatives:\n\n    \\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\theta_{ij}^{(a)}} = \\sum_{t=2}^{T} \\left[ \\langle \\delta_{s_t,j} \\delta_{s_{t-1},i} \\rangle - \\langle \\delta_{s_{t-1},i} \\rangle a_{ij} \\right]    (16)\n\n    \\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\theta_{i}^{(\\pi)}} = \\langle \\delta_{s_1,i} \\rangle - \\pi_i    (17)\n\n    \\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\mu_{i,d}} = \\sum_{t=1}^{T} \\langle \\delta_{s_t,i} \\rangle (x_{t,d} - \\mu_{i,d}) / v_{i,d}    (18)\n\n    \\frac{\\partial \\mathcal{L}(x_{1:T}|\\mathcal{H})}{\\partial \\theta_{i,d}^{(v)}} = \\frac{1}{2} \\sum_{t=1}^{T} \\langle \\delta_{s_t,i} \\rangle \\left[ (x_{t,d} - \\mu_{i,d})^2 / v_{i,d} - 1 \\right]    (19)\n\nWhen chained with the error signal backpropagated from the output, these derivatives give us the direction in which to move the parameters of each HMM in order to increase the log probability of the correct classification of the sequence.\n\n6 Experiments\n\nTo evaluate the relative merits of the RDN, we compared it against an Alphanet on a speaker identification task. The data was taken from the CSLU \"Speaker Recognition\" corpus. It consisted of 12 speakers uttering phrases drawn from 6 different sequences of connected digits, recorded multiple times (48) over the course of 12 recording sessions. The data was pre-emphasized and Fourier transformed in 32ms frames at a frame rate of 10ms. It was then filtered using 24 bandpass, mel-frequency scaled filters. The log magnitude filter response was then used as the feature vector for the HMM's. This pre-processing reduced the data dimensionality while retaining its spectral structure.\n\nWhile mel-cepstral coefficients are typically recommended for use with axis-aligned Gaussians, they destroy the spectral structure of the data, and we would like to allow for the possibility that, of the many HMM's, some will specialize on particular sub-bands of the frequency domain. They can do this by treating the variance as a measure of the importance of a particular frequency band - using large variances for unimportant bands, and small ones for bands to which they pay particular attention.\n\nWe compared the RDN with an Alphanet and three other models which were implemented as controls. The first of these was a network with a similar architecture to the RDN (as shown in Figure 2), except that instead of fixed connections of \u00b11, the hidden units have a set of adaptable weights to all M of the HMM's. We refer to this network as a comparative density net (CDN). A second control experiment used an architecture similar to a CDN without the hidden layer, i.e. there is a single layer of adaptable weights directly connecting the HMM's with the softmax output units. We label this architecture a CDN-1. The CDN-1 differs from the Alphanet in that each softmax output unit has adaptable connections to the HMM's and we can vary the number of HMM's, whereas the Alphanet has just one HMM per class directly connected to each softmax output unit. Finally, we implemented a version of a network similar to an Alphanet, but using a mixture of Gaussians as the input density model (we refer to this as the MoGnet). The point of this comparison was to see if the HMM actually achieves a benefit from modelling the temporal aspects of the speaker recognition task.\n\n
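For concreteness, the acoustic front end described at the start of this section might be computed as in the following sketch. This is our own illustration, assuming the librosa package and an 8kHz sampling rate; neither the toolkit nor the sampling rate is specified in the paper, and the exact analysis settings used may differ.\n\n    # Pre-emphasis, short-time Fourier analysis with 32ms windows at a 10ms frame\n    # rate, a 24-channel mel-spaced filterbank, and log magnitudes.\n    import numpy as np\n    import librosa\n\n    def log_mel_features(wav_path, sr=8000, n_mels=24):\n        y, sr = librosa.load(wav_path, sr=sr)\n        y = librosa.effects.preemphasis(y)\n        n_fft = int(0.032 * sr)                 # 32ms analysis window\n        hop = int(0.010 * sr)                   # 10ms frame rate\n        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,\n                                           hop_length=hop, n_mels=n_mels)\n        return np.log(S + 1e-8).T               # one 24-dimensional feature vector per frame\n\n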
In each experiment an RDN constructed out of a set of M 4-state HMM's was compared to the four other networks, all matched to have the same number of free parameters, except for the MoGnet. In the case of the MoGnet, we used the same number of Gaussian mixture models as HMM's in the Alphanet, each with the same number of hidden states. Thus, it has fewer parameters, because it is lacking the transition probabilities of the HMM. We ran the experiment four times, with values of M of 12, 16, 20 and 24. For the Alphanet and MoGnet we varied the number of states in the HMM's and the Gaussian mixtures, respectively. For the CDN model we used the same number of 4-state HMM's as the RDN and varied the number of units in the hidden layer of the network. Since the CDN-1 network has no hidden units, we used the same number of HMM's as the RDN and varied the number of states in the HMM. The experiments were repeated 10 times with different training-test set splits. All the models were trained using 90 iterations of a conjugate gradient optimization procedure [6].\n\nFigure 3: Results of the experiments for an RDN with (a) 12, (b) 16, (c) 20 and (d) 24 HMM's.\n\n7 Results\n\nThe boxplot in Figure 3 shows the classification performance on the 10 runs in each of the 4 experiments. Comparing the Alphanet and the RDN, we see that the RDN consistently outperforms the Alphanet. In all four experiments the difference in their performance under a paired t-test was significant at the level p < 0.01. This indicates that, given a classification network with a fixed number of parameters, there is an advantage to using many small HMM's and all the pairwise information about an observed sequence, as opposed to using a network with a single large HMM per class.\n\nIn the third experiment, involving the MoGnet, we see that its performance is comparable to that of the Alphanet. This suggests that the HMM's ability to model the temporal structure of the data is not really necessary for the speaker classification task as we have set it up (if we had done text-dependent speaker identification, instead of multiple digit phrases, this might have made a difference). Nevertheless, the performance of both the Alphanet and the MoGnet is worse than that of the RDN.\n\nUnfortunately the CDN and CDN-1 networks perform much worse than we expected. While we expected these models to perform similarly to the RDN, it seems that the optimization procedure takes much longer with these models. 
This is probably because the small initial weights from the HMM's to the next layer severely attenuate the backpropagated error derivatives that are used to train the HMM's. As a result the CDN networks do not converge properly in the time allowed.\n\n8 Conclusions\n\nWe have introduced relative density networks, and shown that this method of discriminatively learning many small density models in place of a single density model per class has benefits in classification performance. In addition, there may be a small speed benefit to using many smaller HMM's compared to a few big ones. Computing the probability of a sequence under an HMM is of order O(TK^2), where T is the length of the sequence and K is the number of hidden states in the HMM. Thus, smaller HMM's can be evaluated faster. However, this is somewhat counterbalanced by the quadratic growth in the size of the hidden layer as M increases.\n\nAcknowledgments\n\nWe would like to thank John Bridle, Chris Williams, Radford Neal, Sam Roweis, Zoubin Ghahramani, and the anonymous reviewers for helpful comments.\n\nReferences\n\n[1] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, \"A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,\" The Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164-171, 1970.\n\n[2] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, \"Maximum mutual information estimation of hidden Markov model parameters for speech recognition,\" in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 49-53, 1986.\n\n[3] J. Bridle, \"Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters,\" in Advances in Neural Information Processing Systems (D. Touretzky, ed.), vol. 2, (San Mateo, CA), pp. 211-217, Morgan Kaufmann, 1990.\n\n[4] M. I. Jordan, \"Why the logistic function? A tutorial discussion on probabilities and neural networks,\" Computational Cognitive Science Technical Report 9503, Massachusetts Institute of Technology, August 1995.\n\n[5] A. D. Brown and G. E. Hinton, \"Products of hidden Markov models,\" in Proceedings of Artificial Intelligence and Statistics 2001 (T. Jaakkola and T. Richardson, eds.), pp. 3-11, Morgan Kaufmann, 2001.\n\n[6] C. E. Rasmussen, Evaluation of Gaussian Processes and other Methods for Non-Linear Regression. PhD thesis, University of Toronto, 1996. Matlab conjugate gradient code available from http://www.gatsby.ucl.ac.uk/~edward/code/.\n", "award": [], "sourceid": 2137, "authors": [{"given_name": "Andrew", "family_name": "Brown", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}