{"title": "A Solution for Missing Data in Recurrent Neural Networks with an Application to Blood Glucose Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 971, "page_last": 977, "abstract": "", "full_text": "A Solution for Missing Data in Recurrent Neural \nNetworks With an Application to Blood Glucose \n\nPrediction \n\nVolker Tresp and Thomas Briegel * \n\nSiemens AG \n\nCorporate Technology \n\nOtto-Hahn-Ring 6 \n\n81730 Miinchen, Germany \n\nAbstract \n\nWe  consider neural  network models for stochastic nonlinear dynamical \nsystems  where measurements  of the  variable of interest are only avail(cid:173)\nable at irregular intervals i.e.  most realizations are missing.  Difficulties \narise  since the solutions for  prediction and maximum  likelihood learn(cid:173)\ning with missing data lead  to  complex integrals, which even  for simple \ncases  cannot  be  solved  analytically.  In  this  paper  we  propose  a  spe(cid:173)\ncific  combination of a nonlinear recurrent  neural  predictive model  and \na  linear error model  which  leads  to  tractable prediction and  maximum \nlikelihood adaptation rules.  In particular,  the  recurrent  neural  network \ncan be trained using the real-time recurrent learning rule and  the linear \nerror model  can be trained by an EM  adaptation rule,  implemented us(cid:173)\ning forward-backward Kalman filter equations. The model is applied to \npredict the glucose/insulin metabolism of a diabetic patient where blood \nglucose measurements  are  only available a few  times a day  at  irregular \nintervals.  The new model shows considerable improvement with respect \nto both recurrent neural networks trained with teacher forcing or in a free \nrunning mode and various linear models. \n\n1 \n\nINTRODUCTION \n\nIn many physiological dynamical systems measurements are acquired at irregular intervals. \nConsider the case of blood glucose measurements of a diabetic who only measures blood \nglucose  levels  a few  times  a  day.  At  the  same  time physiological systems  are  typically \nhighly nonlinear and  stochastic  such that recurrent  neural  networks are  suitable models. \nTypically, such networks are either used purely free running in which the networks predic(cid:173)\ntions are iterated, or in a teacher forcing mode in which actual measurements are substituted \n\n\u2022 {volker.tresp, thomas.briegel} @mchp.siemens.de \n\n\f972 \n\nV.  Tresp  and T.  Briegel \n\nif available. In Section 2 we show that both approaches are problematic for highly stochas(cid:173)\ntic systems and if many realizations of the variable of interest are unknown. The traditional \nsolution is to use a stochastic model such  as  a nonlinear state space model.  The problem \nhere  is  that prediction and training missing data lead to  integrals which are  usually con(cid:173)\nsidered intractable (Lewis,  1986). Alternatively, state dependent linearizations are used for \nprediction and training, the most popular example being the extended Kalman filter.  In this \npaper we introduce a combination of a nonlinear recurrent neural predictive model and a \nlinear error model which leads to tractable prediction and maximum likelihood adaptation \nrules.  The recurrent neural network can  be used in all generality to  model the nonlinear \ndynamics of the system.  The only limitation is  that the error model is  linear which is  not \na major constraint in many applications.  
The first advantage of the proposed model is that for single-step or multiple-step prediction we obtain simple iteration rules which combine the output of the iterated neural network with a linear Kalman filter that updates the linear error model. The second advantage is that for maximum likelihood learning the recurrent neural network can be trained using the real-time recurrent learning rule (RTRL) and the linear error model can be trained by an EM adaptation rule, implemented using forward-backward Kalman filter equations. We apply our approach to model the glucose/insulin metabolism of a diabetic patient for whom blood glucose measurements are available only a few times a day at irregular intervals, and we compare results from our proposed model to recurrent neural networks trained and used in the free-running mode or in the teacher forcing mode, as well as to various linear models.

2 RECURRENT SYSTEMS WITH MISSING DATA

[Figure 1: two panels showing iterated predictions around a measurement at time t = 7, for teacher forcing and free running (left) and for a sensible response (right); see caption.]

Figure 1: A neural network predicts the next value of a time-series based on the latest two previous measurements (left). As long as no measurements are available (t = 1 to t = 6), the neural network is iterated (unfilled circles). In a free-running mode, the neural network would ignore the measurement at time t = 7 to predict the time-series at time t = 8. In a teacher forcing mode, it would substitute the measured value for one of the inputs and use the iterated value for the other (unknown) input. This appears to be suboptimal since our knowledge about the time-series at time t = 7 also provides us with information about the time-series at time t = 6; for example, the dotted circle might be a reasonable estimate. By using the iterated value for the unknown input, the prediction of the teacher forced system is not well defined and will in general lead to unsatisfactory results. A sensible response is shown on the right, where the first few predictions after the measurement are close to the measurement. This can be achieved by including a proper error model (see text).

Consider a deterministic nonlinear dynamical model of the form

    y_t = f_w(y_{t-1}, ..., y_{t-N}, u_t)

of order N, with input u_t, where f_w(.) is a neural network model with parameter vector w. Such a recurrent model is either used in a free-running mode, in which network predictions are fed back into the input of the neural network, or in a teacher forcing mode, in which measurements are substituted into the input of the neural network whenever they are available.

Figure 2: Left: The proposed architecture. Right: Linear impulse response.

Both modes can lead to undesirable results when many realizations are missing and when the system is highly stochastic; a minimal sketch of the two modes is given below.
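To make the two conventional modes concrete, the following minimal sketch (ours, not from the paper; the function f_w, the array conventions and the use of np.nan for missing values are assumptions made for the example) iterates a second-order model. Note that teacher forcing replaces only the fed-back value at a measured time step; the other (unknown) input keeps its iterated value, which is exactly the problem discussed next.

import numpy as np

def iterate(f_w, y_init, u_seq, z_obs, mode="free"):
    # y_init: the two most recent values, oldest first; z_obs: measurements with
    # np.nan where missing; mode: "free" (free running) or "teacher" (teacher forcing)
    y = list(y_init)
    preds = []
    for t, u in enumerate(u_seq):
        y_t = f_w(y[-1], y[-2], u)              # one-step prediction
        preds.append(y_t)
        if mode == "teacher" and np.isfinite(z_obs[t]):
            y_t = z_obs[t]                      # substitute the measurement ...
        y.append(y_t)                           # ... while y_{t-1} keeps its iterated value
    return np.array(preds)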
Figure 1 (left) shows that a free-running model essentially ignores the measurement for prediction, and that the teacher forced model substitutes the measured value but leaves the unknown states at their predicted values, which can also lead to undesirable responses. The traditional solution is to include a model of the error, which leads to nonlinear stochastic models, the simplest being

    y_t = f_w(y_{t-1}, ..., y_{t-N}, u_t) + \epsilon_t

where \epsilon_t is assumed to be additive uncorrelated zero-mean noise with probability density p_\epsilon(\epsilon) and represents unmodeled system dynamics. For prediction and learning with missing values we have to integrate over the unknowns, which leads to complex integrals which, for nonlinear models, have to be approximated, for example, using Monte Carlo integration.^1 In general, those integrals are computationally too expensive to solve and, in practice, one relies on locally linearized approximations of the nonlinearities, typically in the form of the extended Kalman filter. The extended Kalman filter is suboptimal and summarizes past data by an estimate of the means and the covariances of the variables involved (Lewis, 1986).

^1 For maximum likelihood learning of linear models we obtain EM equations which can be solved using forward-backward Kalman equations (see Appendix).

In this paper we pursue an alternative approach. Consider the model with state updates

    y*_t = f_w(y*_{t-1}, ..., y*_{t-N}, u_t)                                                   (1)

    x_t = \sum_{i=1}^{K} \theta_i x_{t-i} + \epsilon_t                                          (2)

    y_t = y*_t + x_t = f_w(y*_{t-1}, ..., y*_{t-N}, u_t) + \sum_{i=1}^{K} \theta_i x_{t-i} + \epsilon_t   (3)

and with measurement equation

    z_t = y_t + \delta_t                                                                        (4)

where \epsilon_t and \delta_t denote additive noise. The variable of interest y_t is now the sum of the deterministic response of the recurrent neural network y*_t and a linear system error model x_t (Figure 2). z_t is a noisy measurement of y_t. In particular, we are interested in the special cases that y_t can be measured with certainty (the variance of \delta_t is zero) or that a measurement is missing (the variance of \delta_t is infinite). The nice feature is now that y*_t can be considered a deterministic input to the state-space model consisting of equations (2)-(3). This means that for optimal one-step or multiple-step prediction, we can use the linear Kalman filter for equations (2)-(3) and measurement equation (4) by treating y*_t as deterministic input. Similarly, to train the parameters in the linear part of the system (i.e. {\theta_i}_{i=1}^{K}) we can use an EM adaptation rule, implemented using forward-backward Kalman filter equations (see the Appendix). The deterministic recurrent neural network is adapted with the residual error which cannot be explained by the linear model, i.e. target^{nn}_t = y^m_t - \hat{x}^{linear}_t, where y^m_t is a measurement of y_t at time t and \hat{x}^{linear}_t is the estimate of the linear error model. After the recurrent neural network is adapted, the linear model can be retrained using the residual error which cannot be explained by the neural network; then the neural network is retrained again, and so on, until no further improvement can be achieved.

The advantage of this approach is that all of the nonlinear interactions are modeled by a recurrent neural network which can be trained deterministically; a minimal sketch of the resulting prediction rule is given below.
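The following sketch (ours, not the authors' code; f_w, N = 2, the companion-form embedding of the AR(K) error model and all names are assumptions made for the example) implements this prediction rule: the network is iterated deterministically while a linear Kalman filter tracks the error model, treating y*_t as deterministic input. With the variance of \delta_t set to zero, a measurement collapses the error state onto z_t - y*_t.

import numpy as np

def predict(f_w, y_init, u_seq, z_obs, theta, q):
    # theta: AR coefficients (theta_1, ..., theta_K); q: variance of eps_t;
    # z_obs: measurements with np.nan where missing (variance of delta_t infinite)
    K = len(theta)
    Theta = np.zeros((K, K))                    # companion form of the AR(K) model
    Theta[0, :] = theta
    if K > 1:
        Theta[1:, :-1] = np.eye(K - 1)
    Q = np.zeros((K, K)); Q[0, 0] = q           # system noise enters the first component
    H = np.zeros(K); H[0] = 1.0                 # x_t is the first state component
    x, P = np.zeros(K), np.eye(K)               # error-model state estimate and covariance
    y_star, preds = list(y_init), []
    for t, u in enumerate(u_seq):
        ys = f_w(y_star[-1], y_star[-2], u)     # iterate the network, eq. (1) with N = 2
        y_star.append(ys)
        x, P = Theta @ x, Theta @ P @ Theta.T + Q        # time update, eq. (2)
        if np.isfinite(z_obs[t]):                        # measurement update, eq. (4)
            S = H @ P @ H                                # innovation variance
            gain = P @ H / S
            x = x + gain * (z_obs[t] - ys - H @ x)       # innovation: z_t - y*_t - x_t
            P = P - np.outer(gain, H) @ P
        preds.append(ys + x[0])                          # y_t = y*_t + x_t, eq. (3)
    return np.array(preds)

After a measurement, the AR dynamics pull subsequent predictions back toward the iterated network output, which produces the sensible response of Figure 1 (right).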
The linear model, in turn, accounts for the noise and can be trained using powerful learning algorithms for linear systems. The constraint that the error model must be linear is often not a major limitation.

3 BLOOD GLUCOSE PREDICTION OF A DIABETIC

The goal of this work is to develop a predictive model of the blood glucose of a person with type 1 diabetes mellitus. Such a model can have several useful applications in therapy: it can be used to warn a person of dangerous metabolic states, it can be used to make recommendations to optimize the person's therapy and, finally, it can be used in the design of a stabilizing control system for blood glucose regulation, a so-called "artificial beta cell" (Tresp, Moody and Delong, 1994). We want the model to be able to adapt using patient data collected under normal everyday conditions rather than the controlled conditions typical of a clinic. In a non-clinical setting, only a few blood glucose measurements per day are available.

Our data set consists of the protocol of a diabetic over a period of almost six months. During that time period, the times and dosages of insulin injections (basal insulin u^1_t and normal insulin u^2_t), the times and amounts of food intake (fast u^3_t, intermediate u^4_t and slow u^5_t carbohydrates), the times and durations of exercise (regular u^6_t or intense u^7_t) and the blood glucose level y_t (measured a few times a day) were recorded. The u^j_t, j = 1, ..., 7, are equal to zero except if there is an event, such as food intake, insulin injection or exercise. For our data set, the inputs u^j_t were recorded with 15-minute time resolution. We used the first 43 days for training the model (containing 312 measurements of the blood glucose) and the following 21 days for testing (containing 151 measurements of the blood glucose). This means that we have to deal with approximately 93% missing data during training.

The effects of insulin, food and exercise on the blood glucose are delayed and are approximated by linear response functions; v^j_t describes the effect of input u^j_t on the glucose. As an example, the response v^2_t of normal insulin u^2_t after injection is determined by the diffusion of the subcutaneously injected insulin into the blood stream and can be modeled by three first-order compartments in series or, as we have done, by a response function of the form v^2_t = \sum_\tau g_2(t - \tau) u^2_\tau with g_2(t) = a_2 t^2 e^{-b_2 t} (see Figure 2 for a typical impulse response). The functional mappings g_j(.) for the digestive tract and for exercise are less well known. In our experiments we followed other authors and used response functions of the above form.

The response functions g_j(.) describe the delayed effect of the inputs on the blood glucose. We assume that the functional form of g_j(.) is sufficient to capture the various delays of the inputs and can be tuned to the physiology of the patient by varying the parameters a_j, b_j. To be able to capture the highly nonlinear physiological interactions between the response functions v^j_t and the blood glucose level y_t, which is measured only a few times a day, we employ a neural network in combination with a linear error model as described in Section 2. In our experiments f_w(.) is a feedforward multi-layer perceptron with three hidden units; a short numerical sketch of the response-function preprocessing is given below.
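As an aside, here is a minimal numerical sketch of such a response function (ours; the parameter values are purely illustrative and would in practice be tuned to the patient):

import numpy as np

def response(u, a, b, dt=0.25):
    # v_t = sum_tau g(t - tau) u_tau with g(t) = a t^2 exp(-b t), t in hours
    t = np.arange(len(u)) * dt                  # 15-minute grid as in our data set
    g = a * t**2 * np.exp(-b * t)               # impulse response, cf. Figure 2 (right)
    return np.convolve(u, g)[: len(u)]          # causal convolution: only past inputs act

u2 = np.zeros(96); u2[32] = 10.0                # one day; a single insulin dose at 8:00
v2 = response(u2, a=1.0, b=1.5)                 # smooth, delayed effect on the glucose input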
The five inputs to the network were insulin (in^1_t = v^1_t + v^2_t), food (in^2_t = v^3_t + v^4_t + v^5_t), exercise (in^3_t = v^6_t + v^7_t) and the current and previous estimate of the blood glucose. To be specific, the second-order nonlinear neural network model is

    y*_t = y*_{t-1} + f_w(y*_{t-1}, y*_{t-2}, in^1_t, in^2_t, in^3_t).                         (5)

For the linear error model we also use a model of order 2,

    x_t = \theta_1 x_{t-1} + \theta_2 x_{t-2} + \epsilon_t.                                     (6)

Table 1 shows the explained variance on the test set for the different predictive models.^2 In the first experiment (RNN-FR) we estimate the blood glucose at time t as the output of the neural network, y_t = y*_t. The neural network is used in the free-running mode for training and prediction. We use RTRL to adapt both the weights of the neural network and all parameters in the response functions g_j(.). The RNN-FR model explains 14.1 percent of the variance. The RNN-TF model is identical to the previous experiment except that measurements are substituted whenever available. RNN-TF could explain more of the variance (18.8%). The reason for the better performance is, of course, that information about measurements of the blood glucose can be exploited.

The model RNN-LEM2 (error model of order 2) corresponds to the combination of the recurrent neural network and the linear error model as introduced in Section 2. Here, y_t = x_t + y*_t models the blood glucose and z_t = y_t + \delta_t is the measurement equation, where we set the variance of \delta_t to zero for a measurement of the blood glucose at time t and to infinity for missing values. For \epsilon_t we assume independent Gaussian noise. For prediction, equation (5) is iterated in the free-running mode. The blood glucose at time t is estimated using a linear Kalman filter, treating y*_t as deterministic input in the state-space model y_t = x_t + y*_t, z_t = y_t + \delta_t. We adapt the parameters in the linear error model (i.e. \theta_1, \theta_2 and the variance of \epsilon_t) using an EM adaptation rule, implemented using forward-backward Kalman filter equations (see Appendix). The parameters of the neural network are adapted using RTRL exactly as in the RNN-FR model, except that the target is now target^{nn}_t = y^m_t - \hat{x}^{linear}_t, where y^m_t is a measurement of y_t at time t and \hat{x}^{linear}_t is the estimate of the linear error model (based on the linear Kalman filter). The adaptation of the linear error model and the neural network are performed alternately until no significant further improvement in performance can be achieved.

As indicated in Table 1, the RNN-LEM2 model achieves the best prediction performance, with an explained variance of 44.9% (first-order error model RNN-LEM1: 43.7%). As a comparison, we show the performance of just the linear error model LEM (this model ignores all inputs), a linear model (LM-FR) without an error model trained with RTRL, and a linear model with an error model (LM-LEM). Interestingly, the linear error model, which does not see any of the inputs, can explain more variance (12.9%) than the LM-FR model (8.9%). The LM-LEM model, which can be considered a combination of both, can explain more than the sum of the individual explained variances (31.4%), which indicates that the combined training gives better performance than training both submodels individually. The explained-variance score used throughout can be computed as sketched below.
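For reference, the score reported in Table 1 can be computed as follows (our helper, assuming the mean of the available test measurements as the reference predictor):

import numpy as np

def explained_variance(y_meas, y_pred):
    # 100 * (1 - MSPE(model)/MSPE(mean)), evaluated only where measurements exist
    m = np.isfinite(y_meas)
    mspe_model = np.mean((y_meas[m] - y_pred[m]) ** 2)
    mspe_mean = np.mean((y_meas[m] - y_meas[m].mean()) ** 2)
    return 100.0 * (1.0 - mspe_model / mspe_mean)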
Note also that the nonlinear models (RNN-FR, RNN-TF, RNN-LEM) give considerably better results than their linear counterparts, confirming that the system is highly nonlinear.

Figure 3 (left) shows an example of the responses of some of the models. We see that the free-running neural network (dotted line) has relatively small amplitudes and cannot predict the three measurements very well. The RNN-TF model (dashed line) shows a better response to the measurements than the free-running network. The best prediction of all measurements is indeed achieved by the RNN-LEM model (continuous line).

Based on the linear iterated Kalman filter we can calculate the variance of the prediction. As shown in Figure 3 (right), the standard deviation is small right after a measurement is available and then converges to a constant value. Based on the prediction and the estimated variance, it will be possible to do a risk analysis for the diabetic (i.e. a warning of dangerous metabolic states).

^2 MSPE(model) is the mean squared prediction error of the model on the test set and MSPE(mean) is the mean squared prediction error of predicting the mean.

[Figure 3: two panels over time [hours], 0 to 7.5; left: blood glucose responses (approx. 150 to 240) of several models; right: standard deviation of the prediction error.]

Figure 3: Left: Responses of some models to three measurements. Note that the prediction of the first measurement is bad for all models but that the RNN-LEM model (continuous line) predicts the following measurements much better than both the RNN-FR (dotted) and the RNN-TF (dashed) model. Right: Standard deviation of the prediction error of RNN-LEM.

Table 1: Explained variance on the test set [in percent]: 100 * (1 - MSPE(model)/MSPE(mean))

    MODEL      %      MODEL        %
    mean       0      RNN-TF       18.8
    LM-FR      8.9    LM-LEM       31.4
    LEM        12.9   RNN-LEM1     43.7
    RNN-FR     14.1   RNN-LEM2     44.9

4 CONCLUSIONS

We introduced a combination of a nonlinear recurrent neural network and a linear error model. Applied to blood glucose prediction, it gave significantly better results than both recurrent neural networks alone and various linear models. Further work might lead to a predictive model which can be used by a diabetic on a daily basis. We believe that our results are very encouraging. We also expect that our specific model can find applications in other stochastic nonlinear systems in which measurements are only available at irregular intervals, such as in wastewater treatment, chemical process control and various physiological systems. Further work will include error models for the input measurements (for example, the number of food calories is typically estimated with great uncertainty).

Appendix: EM Adaptation Rules for Training the Linear Error Model

The model and observation equations of a general model are^3

    x_t = \Theta x_{t-1} + \epsilon_t,    z_t = M_t x_t + \delta_t,                             (7)

where \Theta is the K x K transition matrix of the order-K linear error model. A minimal numerical sketch of one complete EM iteration for this model is given directly below, followed by the formal recursions.
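The sketch (ours, not the authors' code) assumes a scalar observation (m = 1), Gaussian noise and a full-rank Q; a missing measurement is marked np.nan and handled by zeroing the observation vector, which is equivalent to an infinite observation variance.

import numpy as np

def em_step(z, Theta, Q, mu, Sigma, r=0.0):
    # z: length-n observations (np.nan = missing); r: observation noise variance
    n, K = len(z), Theta.shape[0]
    H = np.zeros(K); H[0] = 1.0                  # observe the first state component
    xp = np.zeros((n + 1, K)); Pp = np.zeros((n + 1, K, K))   # x_t^{t-1}, P_t^{t-1}
    xf = np.zeros((n + 1, K)); Pf = np.zeros((n + 1, K, K))   # x_t^t,     P_t^t
    xf[0], Pf[0] = mu, Sigma
    Hs = np.zeros((n + 1, K)); gain = np.zeros(K)
    for t in range(1, n + 1):                    # forward recursion (9)
        Hs[t] = H if np.isfinite(z[t - 1]) else 0.0           # M_t = 0 if missing
        xp[t] = Theta @ xf[t - 1]
        Pp[t] = Theta @ Pf[t - 1] @ Theta.T + Q
        S = Hs[t] @ Pp[t] @ Hs[t] + r            # innovation variance (scalar)
        gain = Pp[t] @ Hs[t] / S if S > 0 else np.zeros(K)
        innov = z[t - 1] - Hs[t] @ xp[t] if np.isfinite(z[t - 1]) else 0.0
        xf[t] = xp[t] + gain * innov
        Pf[t] = Pp[t] - np.outer(gain, Hs[t]) @ Pp[t]
    xs, Ps = xf.copy(), Pf.copy()                # backward recursion (10)
    J = np.zeros((n, K, K))
    Pl = np.zeros((n + 1, K, K))                 # lag-one covariances P_{t,t-1}^n
    Pl[n] = (np.eye(K) - np.outer(gain, Hs[n])) @ Theta @ Pf[n - 1]
    for t in range(n, 0, -1):
        J[t - 1] = Pf[t - 1] @ Theta.T @ np.linalg.inv(Pp[t])
        xs[t - 1] = xf[t - 1] + J[t - 1] @ (xs[t] - Theta @ xf[t - 1])
        Ps[t - 1] = Pf[t - 1] + J[t - 1] @ (Ps[t] - Pp[t]) @ J[t - 1].T
    for t in range(n, 1, -1):
        Pl[t - 1] = Pf[t - 1] @ J[t - 2].T + J[t - 1] @ (Pl[t] - Theta @ Pf[t - 1]) @ J[t - 2].T
    A = sum(Ps[t - 1] + np.outer(xs[t - 1], xs[t - 1]) for t in range(1, n + 1))   # M-step
    B = sum(Pl[t] + np.outer(xs[t], xs[t - 1]) for t in range(1, n + 1))
    C = sum(Ps[t] + np.outer(xs[t], xs[t]) for t in range(1, n + 1))
    Theta_new = B @ np.linalg.inv(A)             # Theta(r+1) = B A^{-1}
    Q_new = (C - Theta_new @ B.T) / n            # Q(r+1) = n^{-1}(C - B A^{-1} B^T)
    return Theta_new, Q_new, xs[0]               # mu(r+1) = x_0^n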
In equation (7), the K x 1 noise terms \epsilon_t are zero-mean uncorrelated normal vectors with common covariance matrix Q; \delta_t is an m-dimensional^4 zero-mean uncorrelated normal noise vector with covariance matrix R_t. Recall that we consider certain measurements and missing values as special cases of noisy measurements. The initial state of the system is assumed to be a normal vector with mean \mu and covariance \Sigma.

^3 Note that any linear system of order K can be transformed into a first-order linear system of dimension K.
^4 m indicates the dimension of the output of the time-series.

We describe the EM equations for maximizing the likelihood of the model. Define the estimated parameters at the (r+1)st iterate of EM as the values \mu, \Sigma, \Theta, Q which maximize

    G(\mu, \Sigma, \Theta, Q) = E_r(\log L | z_1, ..., z_n)                                     (8)

where \log L is the log-likelihood of the complete data x_0, x_1, ..., x_n, z_1, ..., z_n and E_r denotes the conditional expectation relative to a density containing the r-th iterate values \mu(r), \Sigma(r), \Theta(r) and Q(r). Recall that missing targets are modeled implicitly by the definition of M_t and R_t.

For calculating the conditional expectation defined in (8), the following set of recursions is used (applying standard Kalman filtering results; see Jazwinski, 1970). First, we use the forward recursion

    x_t^{t-1} = \Theta x_{t-1}^{t-1}
    P_t^{t-1} = \Theta P_{t-1}^{t-1} \Theta^T + Q
    K_t = P_t^{t-1} M_t^T (M_t P_t^{t-1} M_t^T + R_t)^{-1}                                      (9)
    x_t^t = x_t^{t-1} + K_t (z_t - M_t x_t^{t-1})
    P_t^t = P_t^{t-1} - K_t M_t P_t^{t-1}

where we take x_0^0 = \mu and P_0^0 = \Sigma. Next, we use the backward recursion

    J_{t-1} = P_{t-1}^{t-1} \Theta^T (P_t^{t-1})^{-1}
    x_{t-1}^n = x_{t-1}^{t-1} + J_{t-1} (x_t^n - \Theta x_{t-1}^{t-1})                          (10)
    P_{t-1}^n = P_{t-1}^{t-1} + J_{t-1} (P_t^n - P_t^{t-1}) J_{t-1}^T
    P_{t-1,t-2}^n = P_{t-1}^{t-1} J_{t-2}^T + J_{t-1} (P_{t,t-1}^n - \Theta P_{t-1}^{t-1}) J_{t-2}^T

with initialization P_{n,n-1}^n = (I - K_n M_n) \Theta P_{n-1}^{n-1}. One forward and one backward recursion complete the E-step of the EM algorithm.

To derive the M-step, first note that the conditional expectations in (8) yield the following expression:

    G = -1/2 \log|\Sigma| - 1/2 tr{ \Sigma^{-1} (P_0^n + (x_0^n - \mu)(x_0^n - \mu)^T) }
        - n/2 \log|Q| - 1/2 tr{ Q^{-1} (C - B \Theta^T - \Theta B^T + \Theta A \Theta^T) }
        - 1/2 \sum_{t=1}^{n} [ \log|R_t| + tr{ R_t^{-1} ((z_t - M_t x_t^n)(z_t - M_t x_t^n)^T + M_t P_t^n M_t^T) } ]   (11)

where tr{.} denotes the trace, A = \sum_{t=1}^{n} (P_{t-1}^n + x_{t-1}^n (x_{t-1}^n)^T), B = \sum_{t=1}^{n} (P_{t,t-1}^n + x_t^n (x_{t-1}^n)^T) and C = \sum_{t=1}^{n} (P_t^n + x_t^n (x_t^n)^T). Then \Theta(r+1) = B A^{-1} and Q(r+1) = n^{-1} (C - B A^{-1} B^T) maximize the log-likelihood expression (11); \mu(r+1) is set to x_0^n, and \Sigma may be fixed at some reasonable baseline level. The derivation of these equations can be found in (Shumway & Stoffer, 1981).

The E-step (forward and backward Kalman filter equations) and M-step are alternated repeatedly until convergence to obtain the EM solution.

References

Jazwinski, A. H. (1970) Stochastic Processes and Filtering Theory, Academic Press, N.Y.
Lewis, F. L. (1986) Optimal Estimation, John Wiley, N.Y.
Shumway, R. H. and Stoffer, D. S. (1981) Time Series Smoothing and Forecasting Using the EM Algorithm, Technical Report No. 27, Division of Statistics, UC Davis.
Tresp, V., Moody, J. and Delong, W.-R. (1994) Neural Modeling of Physiological Processes, in Computational Learning Theory and Natural Learning Systems 2, S. Hanson et al., eds., MIT Press.
\n\n\f", "award": [], "sourceid": 1348, "authors": [{"given_name": "Volker", "family_name": "Tresp", "institution": null}, {"given_name": "Thomas", "family_name": "Briegel", "institution": null}]}