{"title": "Forward-backward retraining of recurrent neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 743, "page_last": 749, "abstract": null, "full_text": "Forward-backward retraining of recurrent \n\nneural networks \n\nAndrew  Senior \u2022 \nTony Robinson \nCambridge University  Engineering Department \n\nTrumpington Street,  Cambridge, England \n\nAbstract \n\nThis  paper  describes  the  training  of  a  recurrent  neural  network \nas  the  letter  posterior  probability estimator for  a  hidden  Markov \nmodel,  off-line  handwriting recognition  system.  The network  esti(cid:173)\nmates  posterior  distributions for  each  of a  series  of frames  repre(cid:173)\nsenting  sections  of a  handwritten  word.  The  supervised  training \nalgorithm,  backpropagation through time,  requires  target outputs \nto  be  provided for  each  frame.  Three  methods for  deriving  these \ntargets  are  presented.  A  novel  method  based  upon  the  forward(cid:173)\nbackward  algorithm is  found  to  result  in  the  recognizer  with  the \nlowest  error  rate. \n\nIntroduction \n\n1 \nIn  the  field  of off-line  handwriting  recognition,  the  goal  is  to  read  a  handwritten \ndocument  and  produce  a  machine  transcription.  Such  a  system  could  be  used \nfor  a  variety  of purposes,  from  cheque  processing  and  postal  sorting  to  personal \ncorrespondence  reading for  the  blind or historical document reading.  In a  previous \npublication  (Senior  1994)  we  have  described  a  system  based  on a  recurrent  neural \nnetwork  (Robinson  1994) which  can transcribe  a  handwritten document. \n\nThe  recurrent  neural  network  is  used  to  estimate  posterior  probabilities for  char(cid:173)\nacter  classes,  given  frames  of data which  represent  the  handwritten  word.  These \nprobabilities are  combined in a  hidden  Markov model framework,  using the Viterbi \nalgorithm to find  the most probable state sequence. \n\nTo train the network,  a  series  of targets must be given.  This paper describes  three \nmethods that have been used to derive these  probabilities.  The first  is a  naive boot(cid:173)\nstrap method,  allocating equal lengths  to all characters,  used  to start  the training \nprocedure.  The second is a simple Viterbi-style segmentation method that assigns a \nsingle class label to each of the frames of data.  Such a scheme has been  used  before \nin  speech  recognition  using  recurrent  networks  (Robinson  1994).  This  representa(cid:173)\ntion, is found to inadequately represent some frames which can represent two letters, \nor the ligatures between letters.  Thus, by analogy with the forward-backward algo(cid:173)\nrithm (Rabiner and Juang 1986) for  HMM  speech  recognizers,  we  have developed a \n\n\u00b7Now  at IDM  T .J.Watson Research  Center, Yorktown Heights  NYI0598,  USA. \n\n\f744 \n\nA.  SENIOR, T. ROBINSON \n\nforward-backward method for  retraining the recurrent neural network.  This assigns \na  probability distribution across  the output classes  for  each frame of training data, \nand training on these 'soft labels' results in improved performance of the recognition \nsystem. \n\nThis paper  is  organized in four  sections.  The following section outlines the system \nin  which the  neural network is  used,  then section  3 describes  the recurrent  network \nin  more  detail.  
Section 4 explains the different methods of target estimation and presents the results of experiments before conclusions are presented in the final section.

2 System background

The recurrent network is the central part of the handwriting recognition system. The other parts are summarized here and described in more detail in another publication (Senior 1994). The first stage of processing converts the raw data into an invariant representation used as an input to the neural network. The network outputs are used to calculate word probabilities in a hidden Markov model.

First, the scanned page image is automatically segmented into words and then normalized. Normalization removes variations in the word appearance that do not affect its identity, such as rotation, scale, slant, slope and stroke thickness. The height of the letters forming the words is estimated, and magnification, shear and thinning transforms are applied, resulting in a more robust representation of the word. The normalized word is represented in a compact canonical form encoding both the shape and salient features. All those features falling within a narrow vertical strip across the word are termed a frame. The representation derived consists of around 80 values for each of the frames, denoted x_t. The T frames (x_1, ..., x_T) for a whole word are written x_1^T. Five frames would typically be enough to represent a single character. The recurrent network takes these frames sequentially and estimates the posterior character probability distribution given the data, P(\Lambda_i | x_1^t), for each of the letters a, ..., z, denoted \Lambda_0, ..., \Lambda_25. These posterior probabilities are scaled by the prior class probabilities, and are treated as the emission probabilities in a hidden Markov model.

A separate model is created for each word in the vocabulary, with one state per letter. Transitions are allowed only from a state to itself or to the next letter in the word. The set of states in the models is denoted Q = {q_1, ..., q_N} and the letter represented by q_i is given by L(q_i), L : Q -> {\Lambda_0, ..., \Lambda_25}.

Word error rates are presented for experiments on a single-writer task tested with a 1330-word vocabulary.¹ Statistical significance of the results is evaluated using Student's t-test, comparing word recognition rates taken from a number of networks trained under the same conditions but with different random initializations. The results of the t-test are written T(degrees of freedom), and the tabulated values t_significance(degrees of freedom).

¹The experimental data are available at ftp://svr-ftp.eng.cam.ac.uk/pub/data.
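As a concrete illustration of this construction, the sketch below builds the left-to-right word model and divides network posteriors by the class priors to obtain per-state emission scores. This is a minimal sketch, not the authors' implementation: the function names, the uniform 0.5 self-loop/advance transition probabilities, and the array shapes are our own assumptions.

    import numpy as np

    def word_model(word):
        # Left-to-right HMM for one vocabulary word: one state per letter,
        # transitions only from a state to itself or to the next letter.
        # The 0.5 transition probabilities are assumed for illustration.
        n = len(word)
        trans = np.zeros((n, n))
        for i in range(n):
            trans[i, i] = 0.5 if i + 1 < n else 1.0   # self loop
            if i + 1 < n:
                trans[i, i + 1] = 0.5                 # advance to the next letter
        letters = [ord(c) - ord('a') for c in word]   # class index L(q_i) of each state
        return trans, letters

    def emission_scores(posteriors, priors, letters):
        # Divide network posteriors P(Lambda_i | x_t), shape (T, 26), by the
        # class priors P(Lambda_i), shape (26,); the result is proportional to
        # the data likelihood P(x_t | Lambda_i) (the scaling is justified in
        # section 4.2 below).
        scaled = posteriors / priors
        return scaled[:, letters]                     # (T, N) scores, one column per state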
3 Recurrent networks

This section describes the recurrent error propagation network which has been used as the probability distribution estimator for the handwriting recognition system. Recurrent networks have been successfully applied to speech recognition (Robinson 1994) but have not previously been used for handwriting recognition, on-line or off-line. Here a left-to-right scanning process is adopted to map the frames of a word into a sequence, so adjacent frames are considered in consecutive instants.

A recurrent network is well suited to the recognition of patterns occurring in a time-series because series of arbitrary length can be processed, with the same processing being performed on each section of the input stream. Thus a letter 'a' can be recognized by the same process, wherever it occurs in a word. In addition, internal 'state' units are available to encode multi-frame context information so letters spread over several frames can be recognized.

[Figure 1: A schematic of the recurrent error propagation network, showing input frames arriving at the input units, feedback units connected through a unit time delay, and output units giving the character probabilities. For clarity only a few of the units and links are shown.]

The recurrent network architecture used here is a single layer of standard perceptrons with nonlinear activation functions. The output o_i of a unit i is a function of the inputs a_j and the network parameters, which are the weights of the links w_{ij} with a bias b_i:

    \sigma_i = b_i + \sum_k a_k w_{ik},    (1)
    o_i = f_i(\{\sigma_j\}).    (2)

The network is fully connected: each input is connected to every output. However, some of the input units receive no external input and are connected one-to-one to corresponding output units through a unit time-delay (figure 1). The remaining input units accept a single frame of parametrized input and the remaining 26 output units estimate letter probabilities for the 26 character classes. The feedback units have a standard sigmoid activation function (3), but the character outputs have a 'softmax' activation function (4):

    f_i(\sigma_i) = \frac{1}{1 + e^{-\sigma_i}},    (3)
    f_i(\{\sigma_j\}) = \frac{e^{\sigma_i}}{\sum_j e^{\sigma_j}}.    (4)

During recognition ('forward propagation'), the first frame is presented at the input and the feedback units are initialized to activations of 0.5. The outputs are calculated (equations 1 and 2) and read off for use in the Markov model. In the next iteration, the outputs of the feedback units are copied to the feedback inputs, and the next frame is presented to the inputs. Outputs are again calculated, and the cycle is repeated for each frame of input, with a probability distribution being generated for each frame.

To allow the network to assimilate context information, several frames of data are passed through the network before the probabilities for the first frame are read off, previous output probabilities being discarded. This input/output latency is maintained throughout the input sequence, with extra, empty frames of inputs being presented at the end to give probability distributions for the last frames of true inputs. A latency of two frames has been found to be most satisfactory in experiments to date.
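The forward propagation just described can be sketched as follows. This is an illustrative reading of equations (1)-(4) and the latency scheme, not the original code; the weight-matrix layout, variable names and dimensions are our own assumptions.

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max())        # eq. (4), shifted for numerical stability
        return e / e.sum()

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))   # eq. (3)

    def forward_propagate(frames, W_out, b_out, W_fb, b_fb, latency=2):
        # frames: (T, n_in) array of input frames x_t.
        # W_out maps [frame; feedback] to the 26 character outputs,
        # W_fb maps [frame; feedback] to the feedback units.
        # Returns a (T, 26) array of per-frame posterior estimates.
        T, n_in = frames.shape
        n_fb = W_fb.shape[0]
        feedback = np.full(n_fb, 0.5)              # feedback units start at 0.5
        # extra empty frames at the end give outputs for the last true frames
        padded = np.vstack([frames, np.zeros((latency, n_in))])
        probs = []
        for t, x in enumerate(padded):
            a = np.concatenate([x, feedback])      # full input vector (eq. 1)
            out = softmax(W_out @ a + b_out)       # character outputs (eq. 2, 4)
            feedback = sigmoid(W_fb @ a + b_fb)    # copied back at the next step
            if t >= latency:                       # discard the first `latency` outputs
                probs.append(out)
        return np.array(probs)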
3.1 Training

To be able to train the network, the target values \zeta_j(t) desired for the outputs o_j(x_t), j = 0, ..., 25, for frame x_t must be specified. The target specification is dealt with in the next section. It is the discrepancy between the actual outputs and these targets which makes up the objective function to be maximized by adjusting the internal weights of the network. The usual objective function is the mean squared error, but here the relative entropy, G, of the target and output distributions is used:

    G = -\sum_t \sum_j \zeta_j(t) \log \frac{\zeta_j(t)}{o_j(x_t)}.    (5)

At the end of a word, the errors between the network's outputs and the targets are propagated back using the generalized delta rule (Rumelhart et al. 1986) and changes to the network weights are calculated. The network at successive time steps is treated as adjacent layers of a multi-layer network. This process is generally known as 'back-propagation through time' (Werbos 1990). After processing T frames of data with an input/output latency, the network is equivalent to a (T + latency)-layer perceptron sharing weights between layers. For a detailed description of the training procedure, the reader is referred elsewhere (Rumelhart et al. 1986; Robinson 1994).

4 Target re-estimation

The data used for training are only labelled by word. That is, each image represents a single word, whose identity is known, but the frames representing that word are not labelled to indicate which part of the word they represent. To train the network, a label for each frame's identity must be provided. Labels are indicated by the state s_t in Q and the corresponding letter L(s_t) of which a frame x_t is part.

4.1 A simple solution

To bootstrap the network, a naive method was used, which simply divided the word up into sections of equal length, one for each letter in the word. Thus, for an N-letter word of T frames, x_1^T, the first letter was assumed to be represented by frames x_1^{T/N}, the next by x_{T/N+1}^{2T/N}, and so on. The segmentation is mapped into a set of targets as follows:

    \zeta_j(t) = 1 if L(s_t) = \Lambda_j, 0 otherwise.    (6)

Figure 2a shows such a segmentation for a single word. Each line, representing \zeta_j(t) for some j, has a broad peak for the frames representing letter \Lambda_j. Such a segmentation is inaccurate, but can be improved by adding prior knowledge. It is clear that some letters are generally longer than others, and some shorter. By weighting letters according to their a priori lengths it is possible to give a better, but still very simple, segmentation. The letters 'i' and 'l' are given a length of 1/2, and 'm' and 'w' a length of 3/2, relative to other letters. Thus in the word 'wig', the first half of the frames would be assigned the label 'w', the next sixth 'i' and the last third the label 'g'. While this segmentation is constructed with no regard for the data being segmented, it is found to provide a good initial approximation from which it is possible to train the network to recognize words, albeit with high error rates.
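The length-weighted bootstrap segmentation can be written down directly; a minimal sketch follows, in which the function name and table of prior lengths are our own, with 'i' and 'l' weighted 1/2 and 'm' and 'w' weighted 3/2 as above.

    import numpy as np

    PRIOR_LENGTH = {'i': 0.5, 'l': 0.5, 'm': 1.5, 'w': 1.5}   # all other letters: 1

    def bootstrap_targets(word, T):
        # Return a (T, 26) array of one-hot targets zeta_j(t) from the naive
        # length-weighted segmentation of section 4.1.
        lengths = np.array([PRIOR_LENGTH.get(c, 1.0) for c in word])
        bounds = np.rint(np.cumsum(lengths) / lengths.sum() * T).astype(int)
        targets = np.zeros((T, 26))
        start = 0
        for c, end in zip(word, bounds):
            targets[start:end, ord(c) - ord('a')] = 1.0       # eq. (6)
            start = end
        return targets

For example, bootstrap_targets('wig', 12) labels the first six frames 'w', the next two 'i' and the last four 'g', matching the half/sixth/third split described above.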
4.2 Viterbi re-estimation

Having trained the network to some accuracy, it can be used to calculate a good estimate of the probability of each frame belonging to any letter. The probability of any state sequence can then be calculated in the hidden Markov model, and the most likely state sequence through the correct word, S*, found using dynamic programming. This best state sequence S* represents a new segmentation giving a label for each frame. For a network which models the probability distributions well, this segmentation will be better than the automatic segmentation of section 4.1 since it takes the data into account. Finding the most probable state sequence S* is termed a forced alignment. Since only the correct word model need be considered, such an alignment is faster than the search through the whole lexicon that is required for recognition. Training on this automatic segmentation gives a better recognition rate, but still avoids the necessity of manually segmenting any of the database.

[Figure 2: Segmentations of the word 'butler'. Each line represents the target for one letter \Lambda_i and is high for frame t when s_t* = \Lambda_i. (a) is the equal-length segmentation discussed in section 4.1; (b) is a segmentation of an untrained network; (c) is the segmentation re-estimated with a trained network.]

Figure 2 shows two Viterbi segmentations of the word 'butler'. First, figure 2b shows the segmentation arrived at by taking the most likely state sequence before training the network. Since the emission probability distributions are random, there is nothing to distinguish between the state sequences, except slight variations due to initial asymmetry in the network, so a poor segmentation results. After training the network (2c), the durations deviate from the prior assumed durations to match the observed data. This re-estimated segmentation represents the data more accurately, so gives better targets towards which to train. A further improvement in recognition accuracy can be obtained by using the targets determined by the re-estimated segmentation. This cycle can be repeated until the segmentations do not change and performance ceases to improve. For speed, the network is not trained to convergence at each iteration.

It can be shown (Santini and Del Bimbo 1995) that, assuming that the network has enough parameters, the network outputs after convergence will approximate the posterior probabilities P(\Lambda_i | x_1^t). Further, the approximation P(\Lambda_i | x_1^t) ~ P(\Lambda_i | x_t) is made. The posteriors are scaled by the class priors P(\Lambda_i) (Bourlard and Morgan 1993), and these scaled posteriors are used in the hidden Markov model in place of data likelihoods since, by Bayes' rule,

    P(x_t | \Lambda_i) \propto \frac{P(\Lambda_i | x_t)}{P(\Lambda_i)}.    (7)

Table 1 shows word recognition error rates for three 80-unit networks trained towards fixed targets estimated by another network, and then retrained, re-estimating the targets at each iteration. The retraining improves the recognition performance (T(2) = 3.91, t_.95(2) = 2.92).

Table 1: Error rates for 3 networks with 80 units trained with fixed alignments, and retrained with re-estimated alignments.

    Training method    Error (%)
                       mean    s.d.
    Fixed targets      21.2    1.73
    Retraining         17.0    0.68
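The forced alignment described above is a small dynamic program over the word's left-to-right state topology. The sketch below is illustrative only: it assumes log-domain emission scores (e.g. the logarithm of the scaled posteriors of equation 7, restricted to the correct word's states) and a log transition matrix with -inf for disallowed transitions.

    import numpy as np

    def forced_alignment(log_emis, log_trans):
        # Viterbi alignment through one word model.
        # log_emis: (T, N) log emission scores for the word's N letter states;
        # log_trans: (N, N) log transition matrix (requires T >= N).
        # Returns the best state index for each frame.
        T, N = log_emis.shape
        delta = np.full((T, N), -np.inf)
        back = np.zeros((T, N), dtype=int)
        delta[0, 0] = log_emis[0, 0]                  # must start in the first letter
        for t in range(1, T):
            for j in range(N):
                scores = delta[t - 1] + log_trans[:, j]
                back[t, j] = np.argmax(scores)
                delta[t, j] = scores[back[t, j]] + log_emis[t, j]
        states = [N - 1]                              # must end in the last letter
        for t in range(T - 1, 0, -1):                 # trace the best path backwards
            states.append(back[t, states[-1]])
        return states[::-1]

The resulting state sequence can be converted to one-hot targets exactly as in equation (6), giving the re-estimated segmentation used for retraining.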
4.3 Forward-backward re-estimation

The system described above performs well and is the method used in previous recurrent network systems, but examining the speech recognition literature, a potential method of improvement can be seen. Viterbi frame alignment has so far been used to determine targets for training. This assigns one class to each frame, based on the most likely state sequence. A better approach might be to allow a distribution across all the classes indicating which are likely and which are not, avoiding a 'hard' classification at points where a frame may indeed represent more than one class (such as where slanting characters overlap), or none (as in a ligature). A 'soft' classification would give a more accurate portrayal of the frame identities.

Such a distribution, \gamma_p(t) = P(s_t = q_p | x_1^T, W), can be calculated with the forward-backward algorithm (Rabiner and Juang 1986). To obtain \gamma_p(t), the forward probabilities \alpha_p(t) = P(s_t = q_p, x_1^t) must be combined with the backward probabilities \beta_p(t) = P(x_{t+1}^T | s_t = q_p). The forward and backward probabilities are calculated recursively in the same manner:

    \alpha_r(t+1) = \sum_p \alpha_p(t) a_{p,r} P(x_{t+1} | L(q_r)),    (8)
    \beta_p(t-1) = \sum_r a_{p,r} P(x_t | L(q_r)) \beta_r(t).    (9)

Suitable initial distributions \alpha_r(0) = \pi_r and \beta_r(T+1) = \rho_r are chosen; e.g. \pi and \rho are one for respectively the first and last character in the word, and zero for the others. The likelihood of observing the data x_1^T and being in state q_p at time t is then given by:

    \xi_p(t) = \alpha_p(t) \beta_p(t).    (10)

Then the probabilities \gamma_p(t) of being in state q_p at time t are obtained by normalization and used as the targets \zeta_j(t) for the recurrent network character probability outputs:

    \gamma_p(t) = \frac{\xi_p(t)}{\sum_r \xi_r(t)},    (11)
    \zeta_j(t) = \sum_{p : L(q_p) = \Lambda_j} \gamma_p(t).    (12)
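Equations (8)-(12) translate directly into code. The following sketch is again illustrative rather than the authors' implementation: it works in plain probabilities and omits the per-frame rescaling usually needed to avoid underflow on long sequences.

    import numpy as np

    def forward_backward_targets(emis, trans, letters, n_classes=26):
        # emis: (T, N) emission scores P(x_t | L(q_p)) for the word's N states;
        # trans: (N, N) transition matrix a_{p,r}; letters: class index of each
        # state. Returns the (T, n_classes) soft targets zeta_j(t).
        T, N = emis.shape
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0, 0] = emis[0, 0]                      # pi: start in the first letter
        for t in range(1, T):                         # forward pass, eq. (8)
            alpha[t] = (alpha[t - 1] @ trans) * emis[t]
        beta[T - 1, N - 1] = 1.0                      # rho: end in the last letter
        for t in range(T - 2, -1, -1):                # backward pass, eq. (9)
            beta[t] = trans @ (emis[t + 1] * beta[t + 1])
        xi = alpha * beta                             # eq. (10)
        gamma = xi / xi.sum(axis=1, keepdims=True)    # eq. (11)
        targets = np.zeros((T, n_classes))
        for p, j in enumerate(letters):               # eq. (12): sum states per letter
            targets[:, j] += gamma[:, p]
        return targets

Note that several states of a word may share one letter (e.g. the two 't's in 'butler'), which is why equation (12) sums the state posteriors rather than copying them.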
Figure 3a shows the initial estimate of the class probabilities for a sample of the word 'butler'. The probabilities shown are those estimated by the forward-backward algorithm when using an untrained network, for which the P(x_t | s_t = q_p) will be independent of class. Despite the lack of information, the probability distributions can be seen to take reasonable shapes. The first frame must belong to the first letter, and the last frame must belong to the last letter, of course, but it can also be seen that half way through the word, the most likely letters are those in the middle of the word. Several class probabilities are non-zero at a time, reflecting the uncertainty caused since the network is untrained. Nevertheless, this limited information is enough to train a recurrent network, because as the network begins to approximate these probabilities, the segmentations become more definite. In contrast, using Viterbi segmentations from an untrained network, the most likely alignment can be very different from the true alignment (figure 2b). The segmentation is very definite though, and the network is trained towards the incorrect targets, reinforcing its error. Finally, a trained network gives a much more rigid segmentation (figure 3b), with most of the probabilities being zero or one, but with a boundary of uncertainty at the transitions between letters. This uncertainty, where a frame might truly represent parts of two letters, or a ligature between two, represents the data better. Just as with Viterbi training, the segmentations can be re-estimated after training, and retraining results in improved performance. The final probabilistic segmentation can be stored with the data and used when subsequent networks are trained on the same data. Training is then significantly quicker than when training towards the approximate bootstrap segmentations and re-estimating the targets.

[Figure 3: Forward-backward segmentations of the word 'butler'. (a) is the segmentation of an untrained network with a uniform class prior. (b) shows the segmentation after training.]

The better models obtained with the forward-backward algorithm give improved recognition results over a network trained with Viterbi alignments. The improvement is shown in table 2. It can be seen that the error rates for the networks trained with forward-backward targets are lower than those trained on Viterbi targets (T(2) = 5.24, t_.975(2) = 4.30).

Table 2: Error rates for networks with 80 units trained with Viterbi or Forward-Backward alignments.

    Training method      Error (%)
                         mean    s.d.
    Viterbi              17.0    0.68
    Forward-Backward     15.4    0.74

5 Conclusions

This paper has reviewed the training methods used for a recurrent network, applied to the problem of off-line handwriting recognition. Three methods of deriving target probabilities for the network have been described, and experiments conducted using all three. The third method is that of the forward-backward procedure, which has not previously been applied to recurrent neural network training. This method is found to improve the performance of the network, leading to reduced word error rates. Other improvements not detailed here (including duration models and stochastic language modelling) allow the error rate for this task to be brought below 10%.

Acknowledgments

The authors would like to thank Mike Hochberg for assistance in preparing this paper.

References

BOURLARD, H. and MORGAN, N. (1993) Connectionist Speech Recognition: A Hybrid Approach. Kluwer.
RABINER, L. R. and JUANG, B. H. (1986) An introduction to hidden Markov models. IEEE ASSP Magazine 3 (1): 4-16.
ROBINSON, A. (1994) The application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks.
RUMELHART, D. E., HINTON, G. E. and WILLIAMS, R. J. (1986) Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, ed. by D. E. Rumelhart and J. L. McClelland, volume 1, chapter 8, pp. 318-362. Bradford Books.
SANTINI, S. and DEL BIMBO, A. (1995) Recurrent neural networks can be trained to be maximum a posteriori probability classifiers. Neural Networks 8 (1): 25-29.
SENIOR, A. W. (1994) Off-line Cursive Handwriting Recognition using Recurrent Neural Networks. Cambridge University Engineering Department Ph.D. thesis. URL: ftp://svr-ftp.eng.cam.ac.uk/pub/reports/senior_thesis.ps.gz.
WERBOS, P. J. (1990) Backpropagation through time: What it does and how to do it. Proceedings of the IEEE 78: 1550-1560.
", "award": [], "sourceid": 1056, "authors": [{"given_name": "Andrew", "family_name": "Senior", "institution": null}, {"given_name": "Anthony", "family_name": "Robinson", "institution": null}]}