{"title": "An Alternative Model for Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 640, "abstract": null, "full_text": "An Alternative Model for  Mixtures of \n\nExperts \n\nDept.  of Computer Science,  The Chinese  University  of Hong  Kong \n\nShatin, Hong  Kong,  Emaillxu@cs.cuhk.hk \n\nLei  Xu \n\nMichael I.  Jordan \n\nDept.  of Brain and  Cognitive Sciences \n\nMIT \n\nCambridge, MA  02139 \n\nToronto,  M5S  lA4, Canada \n\nGeoffrey E.  Hinton \n\nDept.  of Computer Science \n\nUniversity of Toronto \n\nAbstract \n\nWe propose an alternative model for mixtures of experts which uses \na  different  parametric form  for  the gating network.  The  modified \nmodel is  trained by the EM  algorithm.  In comparison with earlier \nmodels-trained by either EM  or gradient ascent-there is no need \nto  select  a  learning  stepsize.  We  report  simulation experiments \nwhich  show  that  the  new  architecture  yields  faster  convergence. \nWe  also  apply the new  model to two  problem domains:  piecewise \nnonlinear function approximation and the combination of multiple \npreviously  trained classifiers. \n\n1 \n\nINTRODUCTION \n\nFor the  mixtures of experts architecture  (Jacobs,  Jordan,  Nowlan & Hinton,  1991), \nthe EM  algorithm decouples the learning process in a manner that fits well with the \nmodular structure and yields a considerably improved rate of convergence (Jordan & \nJacobs,  1994).  The favorable properties of EM  have also been shown by  theoretical \nanalyses  (Jordan & Xu,  in press;  Xu & Jordan,  1994). \nIt is  difficult  to  apply  EM  to some  parts  of the  mixtures  of experts  architecture \nbecause  of the  nonlinearity of softmax gating network.  This makes the  maximiza-\n\n\f634 \n\nLei Xu,  Michael!.  Jordan,  Geoffrey E.  Hinton \n\ntion  with  respect  to  the  parameters  in  gating network  nonlinear  and  analytically \nunsolvable even  for  the simplest generalized linear case.  Jordan and Jacobs  (1994) \nsuggested  a  double-loop approach  in  which  an inner loop of iteratively-reweighted \nleast  squares  (IRLS)  is  used  to perform  the nonlinear optimization.  However,  this \nrequires extra computation and the stepsize  must be chosen carefully  to guarantee \nthe convergence  of the inner loop. \nWe propose an alternative model for mixtures of experts which uses a different para(cid:173)\nmetric form for  the gating network.  This form is chosen  so that the maximization \nwith respect  to the  parameters of the gating  network can be  handled  analytically. \nThus,  a  single-loop  EM  can be  used,  and  no learning stepsize  is  required  to guar(cid:173)\nantee  convergence.  We  report  simulation experiments  which  show  that  the  new \narchitecture yields faster convergence.  We also apply the model to two problem do(cid:173)\nmains.  One is a  piecewise  nonlinear function  approximation problem with smooth \nblending of pieces specified  by polynomial, trigonometric, or other prespecified  ba(cid:173)\nsis  functions.  The other  is  to combine classifiers  developed  previously-a general \nproblem  with  a  variety  of applications  (Xu,  et  al.,  1991,  1992).  Xu  and  Jordan \n(1993)  proposed  to solve the problem by using the mixtures of experts architecture \nand suggested  an algorithm for  bypassing the difficulty caused  by  the softmax gat(cid:173)\ning  networks.  
Here, we show that the algorithm of Xu and Jordan (1993) can be regarded as a special case of the single-loop EM given in this paper and that the single-loop EM also provides a further improvement.

2 MIXTURES OF EXPERTS AND EM LEARNING

The mixtures of experts model is based on the following conditional mixture:

    P(y|x, \Theta) = \sum_{j=1}^{K} g_j(x, \nu) \, P(y|x, \theta_j),
    P(y|x, \theta_j) = \frac{1}{(2\pi)^{d/2} |\Gamma_j|^{1/2}} \exp\{ -\tfrac{1}{2} [y - f_j(x, w_j)]^T \Gamma_j^{-1} [y - f_j(x, w_j)] \},        (1)

where x ∈ R^n, Θ consists of ν and {θ_j}_1^K, and θ_j consists of w_j and Γ_j. The vector f_j(x, w_j) is the output of the j-th expert net. The scalar g_j(x, ν), j = 1, ..., K, is given by the softmax function:

    g_j(x, \nu) = e^{\beta_j(x, \nu)} / \sum_i e^{\beta_i(x, \nu)}.        (2)

In this equation, β_j(x, ν), j = 1, ..., K, are the outputs of the gating network.

The parameter Θ is estimated by Maximum Likelihood (ML), where the log likelihood is given by L = Σ_t ln P(y^(t)|x^(t), Θ). The ML estimate can be found iteratively using the EM algorithm as follows. Given the current estimate Θ^(k), each iteration consists of two steps.

(1) E-step. For each pair {x^(t), y^(t)}, compute h_j^(k)(y^(t)|x^(t)) = P(j|x^(t), y^(t)), and then form a set of objective functions:

    Q_j(\theta_j) = \sum_t h_j^{(k)}(y^{(t)}|x^{(t)}) \ln P(y^{(t)}|x^{(t)}, \theta_j),    j = 1, ..., K;
    Q^g(\nu) = \sum_t \sum_j h_j^{(k)}(y^{(t)}|x^{(t)}) \ln g_j(x^{(t)}, \nu).        (3)

(2) M-step. Find a new estimate Θ^(k+1) = {{θ_j^(k+1)}_{j=1}^K, ν^(k+1)} with:

    \theta_j^{(k+1)} = \arg\max_{\theta_j} Q_j(\theta_j),    j = 1, ..., K;        \nu^{(k+1)} = \arg\max_{\nu} Q^g(\nu).        (4)

In certain cases, for example when f_j(x, w_j) is linear in the parameters w_j, max_{θ_j} Q_j(θ_j) can be solved by solving ∂Q_j/∂θ_j = 0. When f_j(x, w_j) is nonlinear with respect to w_j, however, the maximization cannot be performed analytically. Moreover, due to the nonlinearity of the softmax, max_ν Q^g(ν) cannot be solved analytically in any case. There are two possibilities for attacking these nonlinear optimization problems. One is to use a conventional iterative optimization technique (e.g., gradient ascent) to perform one or more inner-loop iterations. The other is to simply find a new estimate such that Q_j(θ_j^(k+1)) ≥ Q_j(θ_j^(k)) and Q^g(ν^(k+1)) ≥ Q^g(ν^(k)). Usually, algorithms that perform a full maximization during the M step are referred to as "EM" algorithms, and algorithms that simply increase the Q function during the M step as "GEM" algorithms. In this paper we will further distinguish between EM algorithms requiring and not requiring an iterative inner loop by designating them as double-loop EM and single-loop EM, respectively.

Jordan and Jacobs (1994) considered the case of linear β_j(x, ν) = v_j^T [x, 1] with ν = [v_1, ..., v_K] and semi-linear f_j(w_j^T x) with nonlinear f_j(·). They proposed a double-loop EM algorithm that uses the IRLS method to implement the inner-loop iteration. For more general nonlinear β_j(x, ν) and f_j(x, θ_j), Jordan and Xu (in press) showed that an extended IRLS can be used for this inner loop. It can be shown that IRLS and the extension are equivalent to solving eq. (3) by the so-called Fisher scoring method.
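For concreteness, the following NumPy sketch (ours, not part of the original model specification) computes the gating probabilities of eq. (2) and the E-step responsibilities h_j^(k) for the simple case of scalar y and linear experts with a shared noise variance sigma2; the variable names and the shared-variance simplification are illustrative assumptions only.

import numpy as np

def softmax_gating(X, V):
    # Gating probabilities g_j(x, nu) of eq. (2), with linear scores beta_j(x, nu) = v_j^T [x, 1].
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias term
    scores = X1 @ V.T                                # (N, K) matrix of beta_j values
    scores -= scores.max(axis=1, keepdims=True)      # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def e_step(X, y, V, W, sigma2):
    # E-step responsibilities h_j^(k)(y|x) used in eq. (3), assuming scalar y,
    # linear experts f_j(x, w_j) = w_j^T [x, 1], and a shared noise variance sigma2.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    g = softmax_gating(X, V)                         # prior gates g_j(x, nu)
    means = X1 @ W.T                                 # expert predictions f_j(x, w_j)
    lik = np.exp(-0.5 * (y[:, None] - means) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    h = g * lik
    return h / h.sum(axis=1, keepdims=True)

With h fixed, maximizing Q_j(θ_j) for linear experts reduces to weighted least squares, but maximizing Q^g(ν) over the softmax parameters V has no closed form and requires an inner iterative routine such as IRLS; this is exactly the double-loop structure discussed above.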
3 A NEW GATING NET AND A SINGLE-LOOP EM

To sidestep the need for a nonlinear optimization routine in the inner loop of the EM algorithm, we propose the following modified gating network:

    g_j(x, \nu) = \alpha_j P(x|\nu_j) / \sum_i \alpha_i P(x|\nu_i),    \sum_j \alpha_j = 1,    \alpha_j \geq 0,
    P(x|\nu_j) = a_j(\nu_j)^{-1} b_j(x) \exp\{ c_j(\nu_j)^T t_j(x) \},        (5)

where ν = {α_j, ν_j, j = 1, ..., K}, t_j(x) is a vector of sufficient statistics, and the P(x|ν_j)'s are density functions from the exponential family. The most common example is the Gaussian:

    P(x|\nu_j) = \frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\{ -\tfrac{1}{2} (x - m_j)^T \Sigma_j^{-1} (x - m_j) \},        (6)

with ν_j = {m_j, Σ_j}. In eq. (5), g_j(x, ν) is actually the posterior probability P(j|x) that x is assigned to the partition corresponding to the j-th expert net, obtained from Bayes' rule:

    g_j(x, \nu) = P(j|x) = \alpha_j P(x|\nu_j) / P(x, \nu),    P(x, \nu) = \sum_i \alpha_i P(x|\nu_i).        (7)

Inserting this g_j(x, ν) into the model eq. (1), we get

    P(y|x, \Theta) = \sum_j \frac{\alpha_j P(x|\nu_j)}{P(x, \nu)} P(y|x, \theta_j).        (8)

If we do ML estimation directly on this P(y|x, Θ) and derive an EM algorithm, we again find that the maximization max_ν Q^g(ν) cannot be solved analytically. To avoid this difficulty, we rewrite eq. (8) as:

    P(y, x) = P(y|x, \Theta) P(x, \nu) = \sum_j \alpha_j P(x|\nu_j) P(y|x, \theta_j).        (9)

This suggests an asymmetrical representation for the joint density. We accordingly perform ML estimation based on L' = Σ_t ln P(y^(t), x^(t)) to determine the parameters α_j, ν_j, θ_j of the gating net and the expert nets. This can be done by the following EM algorithm:

(1) E-step. Compute

    h_j^{(k)}(y^{(t)}|x^{(t)}) = \frac{\alpha_j^{(k)} P(x^{(t)}|\nu_j^{(k)}) P(y^{(t)}|x^{(t)}, \theta_j^{(k)})}{\sum_i \alpha_i^{(k)} P(x^{(t)}|\nu_i^{(k)}) P(y^{(t)}|x^{(t)}, \theta_i^{(k)})}.        (10)

Then let Q_j(θ_j), j = 1, ..., K, be the same as given in eq. (3), and decompose Q^g(ν) further into

    Q_j^g(\nu_j) = \sum_t h_j^{(k)}(y^{(t)}|x^{(t)}) \ln P(x^{(t)}|\nu_j),    j = 1, ..., K;
    Q^{\alpha}(\alpha) = \sum_t \sum_j h_j^{(k)}(y^{(t)}|x^{(t)}) \ln \alpha_j,    with  \alpha = \{\alpha_1, ..., \alpha_K\}.        (11)

(2) M-step. Find a new estimate, for j = 1, ..., K,

    \theta_j^{(k+1)} = \arg\max_{\theta_j} Q_j(\theta_j),    \nu_j^{(k+1)} = \arg\max_{\nu_j} Q_j^g(\nu_j),
    \alpha^{(k+1)} = \arg\max_{\alpha} Q^{\alpha}(\alpha),    s.t.  \sum_j \alpha_j = 1.        (12)

The maximization for the expert nets is the same as in eq. (4). However, for the gating net the maximization now becomes analytically solvable as long as P(x|ν_j) is from the exponential family. That is, we have:

    \nu_j^{(k+1)} = \frac{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)}) \, t_j(x^{(t)})}{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)})},        \alpha_j^{(k+1)} = \frac{1}{N} \sum_t h_j^{(k)}(y^{(t)}|x^{(t)}).        (13)

In particular, when P(x|ν_j) is a Gaussian density, the update becomes:

    m_j^{(k+1)} = \frac{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)}) \, x^{(t)}}{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)})},        \Sigma_j^{(k+1)} = \frac{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)}) [x^{(t)} - m_j^{(k+1)}][x^{(t)} - m_j^{(k+1)}]^T}{\sum_t h_j^{(k)}(y^{(t)}|x^{(t)})}.        (14)

Two issues deserve to be emphasized further:

(1) The gating nets eq. (2) and eq. (5) become identical when β_j(x, ν) = ln α_j + ln b_j(x) + c_j(ν_j)^T t_j(x) - ln a_j(ν_j). In other words, the gating net in eq. (5) explicitly uses this function family instead of the function family defined by a multilayer feedforward network.

(2) It follows from eq. (9) that max ln P(y, x|Θ) = max [ln P(y|x, Θ) + ln P(x|ν)]. So the solution given by eqs. (10) through (14) is actually different from the one given by the original eqs. (3) and (4). The former tries to model both the mapping from x to y and the input x, while the latter models only the mapping from x to y. In fact, here we learn the parameters of the gating net and the expert nets via the asymmetrical representation eq. (9) of the joint density P(y, x), which includes P(y|x) implicitly. In the testing phase, however, the total output still follows eq. (8).
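As a concrete illustration of how eqs. (10) through (14) avoid any inner loop, the sketch below implements one iteration of this single-loop EM for the case of scalar y, linear experts with a shared noise variance, and Gaussian gating densities. It is a minimal sketch under those simplifying assumptions, which are ours and not part of the model above; every M-step quantity is obtained in closed form.

import numpy as np

def single_loop_em_step(X, y, alpha, m, S, W, sigma2):
    # One iteration of the single-loop EM of Section 3, assuming scalar y,
    # linear experts f_j(x, w_j) = w_j^T [x, 1] with shared noise variance sigma2,
    # and Gaussian gating densities P(x|nu_j) with means m[j] and covariances S[j].
    N, d = X.shape
    K = alpha.shape[0]
    X1 = np.hstack([X, np.ones((N, 1))])

    # E-step, eq. (10): h_j proportional to alpha_j * P(x|nu_j) * P(y|x, theta_j).
    h = np.zeros((N, K))
    for j in range(K):
        diff = X - m[j]
        px = np.exp(-0.5 * np.sum(diff @ np.linalg.inv(S[j]) * diff, axis=1)) \
             / np.sqrt((2 * np.pi) ** d * np.linalg.det(S[j]))
        py = np.exp(-0.5 * (y - X1 @ W[j]) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        h[:, j] = alpha[j] * px * py
    h /= h.sum(axis=1, keepdims=True)

    # M-step, eqs. (12)-(14): every update is available in closed form,
    # so no inner loop and no stepsize are needed.
    for j in range(K):
        hj = h[:, j]
        alpha[j] = hj.mean()                              # eq. (13)
        m[j] = hj @ X / hj.sum()                          # eq. (14), mean update
        diff = X - m[j]
        S[j] = (hj[:, None] * diff).T @ diff / hj.sum()   # eq. (14), covariance update
        A = X1 * hj[:, None]                              # expert update: weighted least squares
        W[j] = np.linalg.solve(X1.T @ A, A.T @ y)
    return alpha, m, S, W, h

Iterating this step increases the joint log likelihood L' = Σ_t ln P(y^(t), x^(t)); in the testing phase the prediction is formed according to eq. (8).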
4 PIECEWISE NONLINEAR APPROXIMATION

The simple form f_j(x, w_j) = w_j^T [x, 1] is not the only case to which single-loop EM applies. Whenever f_j(x, w_j) can be written in a form linear in the parameters,

    f_j(x, w_j) = w_j^T [\phi_{1,j}(x), ..., \phi_{m,j}(x), 1],        (15)

where the φ_{i,j}(x) are prespecified basis functions, max_{θ_j} Q_j(θ_j), j = 1, ..., K, in eq. (3) is still a weighted least squares problem that can be solved analytically. One useful special case is when the φ_{i,j}(x) are canonical polynomial terms x_1^{r_1} ... x_d^{r_d}, r_i ≥ 0; in this case, the mixture of experts model implements piecewise polynomial approximations. Another case is when φ_{i,j}(x) is a product of trigonometric terms of the form sin(jπx_l) and cos(jπx_l); in this case the mixture of experts implements piecewise trigonometric approximations. (See the sketch after Section 6 for an illustration of the polynomial case.)

5 COMBINING MULTIPLE CLASSIFIERS

Given pattern classes C_i, i = 1, ..., M, we consider classifiers e_j that for each input x produce an output P_j(y|x):

    P_j(y|x) = [p_j(1|x), ..., p_j(M|x)],    p_j(i|x) \geq 0,    \sum_i p_j(i|x) = 1.        (16)

The problem of Combining Multiple Classifiers (CMC) is to combine these P_j(y|x)'s to give a combined estimate of P(y|x). Xu and Jordan (1993) proposed to solve CMC problems by regarding the problem as a special example of the mixture density problem eq. (1), with the P_j(y|x)'s known and only the gating net g_j(x, ν) to be learned. In Xu and Jordan (1993), one problem encountered was again the nonlinearity of the softmax gating network, and an algorithm was proposed to avoid the difficulty.

Actually, the single-loop EM given by eq. (10) and eq. (13) can be directly used to solve the CMC problem. In particular, when P(x|ν_j) is Gaussian, eq. (13) becomes eq. (14). Assuming that α_1 = ... = α_K in eq. (7), eq. (10) becomes

    h_j^{(k)}(y^{(t)}|x^{(t)}) = P(x^{(t)}|\nu_j^{(k)}) P_j(y^{(t)}|x^{(t)}) / \sum_i P(x^{(t)}|\nu_i^{(k)}) P_i(y^{(t)}|x^{(t)}).

If we divide both the numerator and denominator by Σ_i P(x^(t)|ν_i^(k)), we get

    h_j^{(k)}(y^{(t)}|x^{(t)}) = g_j(x^{(t)}, \nu) P_j(y^{(t)}|x^{(t)}) / \sum_i g_i(x^{(t)}, \nu) P_i(y^{(t)}|x^{(t)}).

Comparing this equation with eq. (7a) in Xu and Jordan (1993), we can see that the two equations are actually the same. Despite the different notation, α_j(x) and P_j(y^(t)|x^(t)) in Xu and Jordan (1993) are the same as g_j(x, ν) and P_j(y^(t)|x^(t)) in Section 3. So the algorithm of Xu and Jordan (1993) is a special case of the single-loop EM given in Section 3.
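To make the connection concrete, the following sketch (ours; the function name, the random initialization scheme, and the small ridge term added for numerical stability are all illustrative assumptions) learns only the Gaussian gating densities with equal priors while holding the classifier outputs P_j(y|x) fixed, iterating the equal-prior form of eq. (10) above together with eq. (14):

import numpy as np

def combine_classifiers_em(X, Y_onehot, P_list, n_iter=20, seed=0):
    # Single-loop EM for combining K fixed classifiers (Section 5).
    # X: (N, d) inputs; Y_onehot: (N, M) one-hot class labels;
    # P_list: list of K arrays of shape (N, M), the fixed classifier outputs P_j(y|x).
    # Only the Gaussian gating densities are learned; priors are assumed equal.
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = len(P_list)
    # P_j(y^(t)|x^(t)) evaluated at the true label of each training pattern
    py = np.stack([(P * Y_onehot).sum(axis=1) for P in P_list], axis=1)   # (N, K)

    m = X[rng.choice(N, size=K, replace=False)].copy()      # initial gating means
    S = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)      # initial gating covariances

    for _ in range(n_iter):
        # E-step: eq. (10) with alpha_1 = ... = alpha_K.
        h = np.zeros((N, K))
        for j in range(K):
            diff = X - m[j]
            px = np.exp(-0.5 * np.sum(diff @ np.linalg.inv(S[j]) * diff, axis=1)) \
                 / np.sqrt((2 * np.pi) ** d * np.linalg.det(S[j]))
            h[:, j] = px * py[:, j]
        h /= h.sum(axis=1, keepdims=True)
        # M-step: closed-form Gaussian updates of eq. (14).
        for j in range(K):
            hj = h[:, j]
            m[j] = hj @ X / hj.sum()
            diff = X - m[j]
            S[j] = (hj[:, None] * diff).T @ diff / hj.sum() + 1e-6 * np.eye(d)
    return m, S

At test time the combined estimate is P(y|x) = Σ_j g_j(x, ν) P_j(y|x), with g_j(x, ν) computed from eq. (7).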
6 SIMULATION RESULTS

We compare the performance of the EM algorithm presented above with that of the mixtures of experts model presented by Jordan and Jacobs (1994). As shown in Fig. 1(a), we consider a mixture of experts model with K = 2. For the expert nets, each P(y|x, θ_j) is Gaussian, given by eq. (1) with linear f_j(x, w_j) = w_j^T [x, 1]. For the new gating net, each P(x|ν_j) in eq. (5) is Gaussian, given by eq. (6). For the old gating net eq. (2), β_1(x, ν) = 0 and β_2(x, ν) = ν^T [x, 1]. The learning speeds of the two are significantly different. The new algorithm takes k = 15 iterations for the log-likelihood to converge to the value of -1271.8. These iterations require about 1,351,383 MATLAB flops. For the old algorithm, we use the IRLS algorithm given in Jordan and Jacobs (1994) for the inner-loop iteration. In experiments, we found that it usually took a large number of iterations for the inner loop to converge. To save computation, we limit the maximum number of inner-loop iterations to T_max = 10. We found that this saved computation without obviously influencing the overall performance. From Fig. 1(b), we see that the outer loop converges in about 16 iterations. Each inner loop takes 290,498 flops and the entire process requires 5,312,695 flops. So we see that the new algorithm yields a speedup of about 4,648,608/1,441,475 = 3.9. Moreover, no external adjustment is needed to ensure the convergence of the new algorithm. For the old one, however, the direct use of IRLS can make the inner loop diverge, and we need to appropriately rescale the updating stepsize of IRLS.

Figs. 2(a) and (b) show the results of a simulation of a piecewise polynomial approximation problem using the approach described in Section 4. We consider a mixture of experts model with K = 2. For the expert nets, each P(y|x, θ_j) is Gaussian, given by eq. (1) with f_j(x, w_j) = w_{3,j} x^3 + w_{2,j} x^2 + w_{1,j} x + w_{0,j}. In the new gating net eq. (5), each P(x|ν_j) is again Gaussian, given by eq. (6). We see that the higher-order nonlinear regression has been fit quite well.

For multiple classifier combination, the problem and data are the same as in Xu and Jordan (1993). Table 1 shows the classification results. Com-old and Com-new denote the method given in Xu and Jordan (1993) and the method of Section 5, respectively. We see that both improve the classification rate of each individual network considerably and that Com-new improves on Com-old.

                    Classifier e1    Classifier e2    Com-old    Com-new
    Training set        89.9%            93.3%         98.6%      99.4%
    Testing set         89.2%            92.7%         98.0%      99.0%

    Table 1: A comparison of the correct classification rates.

Figure 1: (a) 1000 samples from y = a_1 x + a_2 + ε, a_1 = 0.8, a_2 = 0.4, x ∈ [-1, 1.5] with prior α_1 = 0.25, and y = a_1' x + a_2' + ε, a_1' = 0.8, a_2' = 2.4, x ∈ [1, 4] with prior α_2 = 0.75, where x is a uniform random variable and ε is Gaussian N(0, 0.3). The two lines through the clouds are the estimated models of the two expert nets. The fits obtained by the two learning algorithms are almost the same. (b) The evolution of the log-likelihood. The solid line is for the modified learning algorithm; the dotted line is for the original learning algorithm (the outer-loop iteration).

Figure 2: Piecewise 3rd-order polynomial approximation. (a) 1000 samples from y = a_1 x^3 + a_3 x + a_4 + ε, x ∈ [-1, 1.5] with prior α_1 = 0.4, and y = a_1' x^3 + a_3' x + a_4' + ε, x ∈ [1, 4] with prior α_2 = 0.6, where x is a uniform random variable and ε is Gaussian N(0, 0.15). The two curves through the clouds are the estimated models of the two expert nets. (b) The evolution of the log-likelihood.
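The piecewise cubic fit of Fig. 2 only requires changing the expert design matrix: with the basis expansion of eq. (15), the expert update in the M-step remains a weighted least squares problem. The sketch below (ours, for the scalar-input cubic case used in the simulation; the function names are illustrative) shows the two pieces that change:

import numpy as np

def cubic_design_matrix(x):
    # Basis expansion of eq. (15) for scalar input: phi(x) = [x^3, x^2, x, 1].
    x = np.asarray(x, dtype=float).ravel()
    return np.stack([x ** 3, x ** 2, x, np.ones_like(x)], axis=1)

def weighted_cubic_fit(x, y, h_j):
    # Closed-form expert update: the maximizer of Q_j(theta_j) in eq. (3) for a cubic
    # expert is the weighted least squares solution with responsibilities h_j as weights.
    Phi = cubic_design_matrix(x)
    A = Phi * h_j[:, None]
    return np.linalg.solve(Phi.T @ A, A.T @ y)   # w_j = (Phi^T H Phi)^{-1} Phi^T H y

Swapping trigonometric basis functions in place of the monomials gives the piecewise trigonometric case mentioned in Section 4, with no other change to the algorithm.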
7 REMARKS

Recently, Ghahramani and Jordan (1994) proposed solving function approximation problems by using a mixture of Gaussians to estimate the joint density of the input and output (see also Specht, 1991; Tresp et al., 1994). In the special case of linear f_j(x, w_j) = w_j^T [x, 1] and Gaussian P(x|ν_j) with equal priors, the method given in Section 3 provides the same result as Ghahramani and Jordan (1994), although the parameterizations of the two methods are different. However, the method of this paper also applies to nonlinear f_j(x, w_j) = w_j^T [φ_j(x), 1] for piecewise nonlinear approximation, or more generally to f_j(x, w_j) that is nonlinear with respect to w_j; it also applies to cases in which P(y, x|ν_j, θ_j) is not Gaussian, as well as to the case of combining multiple classifiers. Furthermore, the methods proposed in Sections 3 and 4 can also be extended to the hierarchical mixture of experts architecture (Jordan & Jacobs, 1994) so that single-loop EM can be used to facilitate its training.

References

Ghahramani, Z., & Jordan, M.I. (1994). Function approximation via density estimation using the EM approach. In Cowan, J.D., Tesauro, G., & Alspector, J. (Eds.), Advances in NIPS 6. San Mateo, CA: Morgan Kaufmann.

Jacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79-87.

Jordan, M.I., & Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.

Jordan, M.I., & Xu, L. (in press). Convergence results for the EM approach to mixtures-of-experts architectures. Neural Networks.

Specht, D. (1991). A general regression neural network. IEEE Trans. Neural Networks, 2, 568-576.

Tresp, V., Ahmad, S., & Neuneier, R. (1994). Training neural networks with deficient data. In Cowan, J.D., Tesauro, G., & Alspector, J. (Eds.), Advances in NIPS 6. San Mateo, CA: Morgan Kaufmann.

Xu, L., Krzyzak, A., & Suen, C.Y. (1991). Associative switch for combining multiple classifiers. Proc. of 1991 IJCNN, Vol. I. Seattle, 43-48.

Xu, L., Krzyzak, A., & Suen, C.Y. (1992). Several methods for combining multiple classifiers and their applications in handwritten character recognition. IEEE Trans. on SMC, Vol. SMC-22, 418-435.

Xu, L., & Jordan, M.I. (1993). EM learning on a generalized finite mixture model for combining multiple classifiers. Proceedings of World Congress on Neural Networks, Vol. IV. Portland, OR, 227-230.

Xu, L., & Jordan, M.I. (1994). On convergence properties of the EM algorithm for Gaussian mixtures. Submitted to Neural Computation.
", "award": [], "sourceid": 906, "authors": [{"given_name": "Lei", "family_name": "Xu", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}