{"title": "Probabilistic Anomaly Detection in Dynamic Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 825, "page_last": 832, "abstract": null, "full_text": "Probabilistic Anomaly Detection \n\n\u2022 In \n\nDynamic  Systems \n\nPadhraic Smyth \n\nJet Propulsion  Laboratory 238-420 \nCalifornia Institute of Technology \n\n4800  Oak Grove  Drive \nPasadena,  CA  91109 \n\nAbstract \n\nThis  paper  describes  probabilistic  methods  for  novelty  detection \nwhen  using  pattern  recognition  methods  for  fault  monitoring  of \ndynamic  systems.  The problem of novelty  detection  is  particular(cid:173)\nly  acute  when  prior  knowledge  and  training  data only  allow  one \nto  construct  an  incomplete  classification  model.  Allowance  must \nbe  made  in  model  design  so  that  the  classifier  will  be  robust  to \ndata generated  by  classes  not  included  in  the  training  phase.  For \ndiagnosis  applications  one  practical approach  is  to  construct  both \nan  input  density  model  and  a  discriminative  class  model.  Using \nBayes'  rule  and  prior  estimates  of the  relative  likelihood  of  data \nof known and unknown origin the resulting classification equations \nare  straightforward.  The  paper  describes  the  application  of this \nmethod  in  the  context  of hidden  Markov  models  for  online  fault \nmonitoring of large  ground  antennas for  spacecraft  tracking,  with \nparticular  application  to  the  detection  of  transient  behaviour  of \nunknown  origin. \n\n1  PROBLEM BACKGROUND \n\nConventional control-theoretic models for  fault  detection typically rely on an accu(cid:173)\nrate model ofthe plant being monitored (Patton, Frank, and Clark, 1989).  However, \nin  practice  it  common  that  no such  model  exists  for  complex  non-linear  systems. 
\nThe large ground antennas used by JPL's Deep Space Network (DSN) to track planetary spacecraft fall into this category. Quite detailed analytical models exist for the electromechanical pointing systems. However, these models are primarily used for determining gross system characteristics such as resonant frequencies; they are known to be a poor fit for fault detection purposes. \n\nFigure 1: Block diagram of a typical Deep Space Network downlink \n\nWe have previously described the application of adaptive pattern recognition methods to the problem of online health monitoring of DSN antennas (Smyth and Mellstrom, 1992; Smyth, in press). Rapid detection and identification of failures in the electromechanical antenna pointing systems is highly desirable in order to minimise antenna downtime and thus minimise telemetry data loss when communicating with remote spacecraft (see Figure 1). Fault detection based on manual monitoring of the various antenna sensors is neither reliable nor cost-effective. \n\nThe pattern-recognition monitoring system operates as follows. Sensor data such as motor current, position encoder, tachometer voltages, and so forth are synchronously sampled at 50Hz by a data acquisition system. The data are blocked off into disjoint windows (200 samples are used in practice) and various features (such as estimated autoregressive coefficients) are extracted; let the feature vector be θ. The features are fed into a classification model (every 4 seconds) which in turn provides posterior probability estimates of the m possible states of the system given the estimated features from that window, p(ω_i|θ). 
ω_1 corresponds to normal conditions; the other ω_i, 2 ≤ i ≤ m, correspond to known fault conditions. \n\nFinally, since the system has \"memory\" in the sense that it is more likely to remain in the current state than to change states, the posterior probabilities need to be correlated over time. This is achieved by a standard first-order hidden Markov model (HMM) which models the temporal state dependence. The hidden aspect of the model reflects the fact that while the features are directly observable, the underlying system states are not, i.e., they are in effect \"hidden.\" Hence, the purpose of the HMM is to provide a model from which the most likely sequence of system states can be inferred given the observed sequence of feature data. \n\nThe classifier portion of the model is trained using simulated hardware faults. The feed-forward neural network has been the model of choice for this application because of its discrimination ability, its posterior probability estimation properties (Richard and Lippmann, 1992; Miller, Goodman and Smyth, 1993) and its relatively simple implementation in software. It should be noted that unlike typical speech recognition HMM applications, the transition probabilities are not estimated from data but are designed into the system based on prior knowledge of the system mean time between failure (MTBF) and other specific knowledge of the system configuration (Smyth, in press). \n\n2  LIMITATIONS OF THE DISCRIMINATIVE MODEL \n\nThe model described above assumes that there are m known mutually exclusive and exhaustive states (or \"classes\") of the system, ω_1, ..., ω_m. 
The mutually exclusive assumption is reasonable in many applications where multiple simultaneous failures are highly unlikely. However, the exhaustive assumption is somewhat impractical. In particular, for fault detection in a complex system such as a large antenna, there are thousands of possible fault conditions which might occur. The probability of occurrence of any single condition is very small, but nonetheless there is a significant probability that at least one of these conditions will occur over some finite time. While the common faults can be directly modelled, it is not practical to assign model states to all the other minor faults which might occur. \n\nAs discussed in (Smyth and Mellstrom, 1992; Smyth, 1994), a discriminative model directly models p(ω_i|θ), the posterior probabilities of the classes given the feature data, and assumes that the classes ω_1, ..., ω_m are exhaustive. On the other hand, a generative model directly models the probability density function of the input data conditioned on each class, p(θ|ω_i), and then indirectly determines posterior class probabilities by application of Bayes' rule. Examples of generative classifiers include parametric models such as Gaussian classifiers and memory-based methods such as kernel density estimators. Generative models are by nature well suited to novelty detection, whereas discriminative models have no built-in mechanism for detecting data which are different from those on which the model was trained. However, there is a trade-off: because generative models typically are doing more modelling than just searching for a decision boundary, they can be less efficient (than discriminative methods) in their use of the data. For example, generative models typically scale poorly with input dimensionality for fixed training sample size. 
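This contrast can be made concrete with a minimal one-dimensional sketch (purely illustrative Gaussian class models, not the antenna features): the generative view exposes an outlier through its vanishing input density, while posteriors normalised to sum to one remain confidently assigned to a known class.

```python
import math

# Two known classes modelled as 1-D Gaussians (hypothetical parameters).
def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

classes = {'normal': (0.0, 1.0), 'fault': (5.0, 1.0)}

x = 20.0  # an outlier far from both training classes

# Generative view: the input density itself is tiny, so novelty is visible.
likelihoods = {c: gauss_pdf(x, mu, s) for c, (mu, s) in classes.items()}

# Discriminative-style view: posteriors are forced to sum to 1, so the
# outlier is still confidently assigned to the nearest known class.
total = sum(likelihoods.values())
posteriors = {c: p / total for c, p in likelihoods.items()}

print(likelihoods)  # both densities effectively zero
print(posteriors)   # yet 'fault' receives essentially all the posterior mass
```

The same effect occurs with a trained neural network classifier: its normalised outputs carry no signal that the input lies far from all training data.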
\n\n3  HYBRID MODELS \n\nA relatively simple and practical approach to the novelty detection problem is to use both a generative and a discriminative classifier (an idea originally suggested to the author by R. P. Lippmann). An extra \"(m+1)th\" state is added to the model to cover \"all other possible states\" not accounted for by the known m states. In this framework, the posterior estimates of the discriminative classifier are conditioned on the event that the data come from one of the m known classes. \n\nLet the symbol ω_{1,...,m} denote the event that the true system state is one of the known states, let ω_{m+1} be the unknown state, and let p(ω_{m+1}|θ) be the posterior probability that the system is in an unknown state given the data. Hence, one can estimate the posterior probability of individual known states as \n\np(ω_i|θ) = p_d(ω_i|θ, ω_{1,...,m}) (1 - p(ω_{m+1}|θ)),  1 ≤ i ≤ m,   (1) \n\nwhere p_d(ω_i|θ, ω_{1,...,m}) is the posterior probability estimate of state i as provided by a discriminative model, i.e., given that the system is in one of the known states. \n\nThe calculation of p(ω_{m+1}|θ) can be obtained via the usual application of Bayes' rule if p(θ|ω_{m+1}), p(ω_{m+1}), and p(θ|ω_{1,...,m}) are known: \n\np(ω_{m+1}|θ) = p(θ|ω_{m+1}) p(ω_{m+1}) / ( p(θ|ω_{m+1}) p(ω_{m+1}) + p(θ|ω_{1,...,m}) Σ_{i=1}^{m} p(ω_i) )   (2) \n\nSpecifying the prior density p(θ|ω_{m+1}), the distribution of the features conditioned on the occurrence of the unknown state, can be problematic. In practice we have used non-informative Bayesian priors for p(θ|ω_{m+1}) over a bounded space of feature values (details are available in a technical report (Smyth and Mellstrom, 1993)), although the choice of a prior density for data of unknown origin is basically ill-posed. 
The stronger the constraints which can be placed on the features, the narrower the resulting prior density and the better the ability of the overall model to detect novelty. If we only have very weak prior information, this will translate into a weaker criterion for accepting points which belong to the unknown category. The term p(ω_{m+1}) (in Equation (2)) must be chosen based on the designer's prior belief of how often the system will be in an unknown state - a practical choice is that the system is at least as likely to be in an unknown failure state as any of the known failure states. \n\nThe p(θ|ω_{1,...,m}) term in Equation (2) is provided directly by the generative model. Typically this can be a mixture of Gaussian component densities or a kernel density estimate over all of the training data (ignoring class labels). In practice, for simplicity of implementation we use a simple Gaussian mixture model. Furthermore, because of the aforementioned scaling problem with input dimensions, only a subset of relatively significant input features are used in the mixture model. A less heuristic approach to this aspect of the problem (with which we have not yet experimented) would be to use a method such as projection pursuit to project the data into a lower-dimensional subspace and perform the input density estimation in this space. The main point is that the generative model need not necessarily work in the full dimensional space of the input features. \n\nIntegration of Equations (1) and (2) into the hidden Markov model scheme is straightforward and is not derived here - the HMM now has an extra state, \"unknown.\" The choice of transition probabilities between the unknown and other states is once again a matter of design choice. 
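The combination rule of Equations (1) and (2) can be sketched numerically as follows; all priors, densities, and discriminative outputs below are illustrative placeholders rather than values from the antenna system.

```python
# Hybrid posterior computation, Equations (1) and (2).
m = 4
p_unknown_prior = 0.2          # p(omega_{m+1}): designer's prior on the unknown state
p_known_priors = [0.2] * m     # p(omega_i) for the m known states

p_x_given_unknown = 0.01       # p(theta|omega_{m+1}): flat prior over bounded features
p_x_given_known = 0.5          # p(theta|omega_{1..m}): from the generative mixture model

# Equation (2): posterior probability of the unknown state.
num = p_x_given_unknown * p_unknown_prior
den = num + p_x_given_known * sum(p_known_priors)
p_unknown = num / den

# Equation (1): scale the discriminative posteriors (conditioned on the
# state being one of the known m) by the probability the state is known.
p_d = [0.85, 0.05, 0.05, 0.05]  # hypothetical discriminative outputs, sum to 1
p_known_states = [pd_i * (1.0 - p_unknown) for pd_i in p_d]

# The m+1 posteriors form a proper distribution.
assert abs(p_unknown + sum(p_known_states) - 1.0) < 1e-9
```

When the generative density p(θ|ω_{1..m}) is large relative to the unknown-state prior density, p_unknown stays small and the discriminative outputs pass through nearly unchanged; when the observed features fall outside the support of the known classes, mass shifts to the unknown state.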
For the antenna application at least, many of the unknown states are believed to be relatively brief transient phenomena which last perhaps no longer than a few seconds; hence, the Markov matrix is designed to reflect these beliefs, since the expected duration of any state d[ω_i] (in units of sampling intervals) must obey \n\nd[ω_i] = 1 / (1 - p_ii)   (3) \n\nwhere p_ii is the self-transition probability of state ω_i. \n\n4  EXPERIMENTAL RESULTS \n\nFor illustrative purposes the experimental results from 2 particular models are compared. Each was applied to monitoring the servo pointing system of a DSN 34m antenna at Goldstone, California. The models were implemented within LabVIEW data acquisition software running in real-time on a Macintosh II computer at the antenna site. The models had previously been trained off-line on data collected some months earlier. 12 input features were used, consisting of estimated autoregressive coefficients and variance terms from each window of 200 samples of multichannel data. For both models a discriminative feedforward neural network model (with 8 hidden units, sigmoidal hidden and output activation functions) was trained (using conjugate-gradient optimization) to discriminate between a normal state and 3 known and commonly occurring fault states (failed tachometer, noisy tachometer, and amplifier short circuit - also known as \"compensation loss\"). The network output activations were normalised to sum to 1 in order to provide posterior class probability estimates. \n\nModel (a) used no HMM and assumed that the 4 known states are exhaustive, i.e., it just used the feedforward network. 
Model (b) used a HMM with 5 states, where a generative model (a Gaussian mixture model) and a flat prior (with bounds on the feature values) were used to determine the probability of the 5th state (as described by Equations (1) and (2)). The same neural network as in model (a) was used as a discriminator for the other 4 known states. The generative mixture model had 10 components and used only 2 of the 12 input features, the 2 which were judged to be the most sensitive to system change. The parameters of the HMM were designed according to the guidelines described earlier. Known fault states were assumed to be equally likely, with 1 hour MTBFs and with 1 hour mean duration. Unknown faults were assumed to have a 20 minute MTBF and a 10 second mean duration. Both HMMs used 5-step backwards smoothing, i.e., the probability estimates at any time n are based on all past data up to time n and future data up to time n + 5 (using a larger number of backward steps was found empirically to produce no effect on the estimates). \n\nFigures 2 (a) and (b) show each model's estimates (as a function of time) that the system is in the normal state. The experiment consisted of introducing known hardware faults into the system in a controlled manner after 15 minutes and 45 minutes, each of 15 minutes duration. \n\nModel (a)'s estimates are quite noisy and contain a significant number of potential false alarms (highly undesirable in an operational environment). Model (b) is much more stable due to the smoothing effect of the HMM. Nonetheless, we note that between the 8th and 10th minutes, there appear to be some possible false alarms: 
these data were classified into the unknown state (not shown). On later inspection it was found that large transients (of unknown origin) were in fact present in the original sensor data and that this was what the model had detected, confirming the classification provided by the model. It is worth pointing out that the model without a generative component (whether with or without the HMM) also detected a non-normal state at the same time, but incorrectly classified this state as one of the known fault states (these results are not shown). \n\nFigure 2: Estimated posterior probability of normal state (a) using no HMM and the exhaustive assumption (normal + 3 fault states), (b) using a HMM with a hybrid model (normal + 3 faults + other state). \n\nAlso not shown are the results from using a generative model alone, with no discriminative component. While its ability to detect unknown states was similar to that of the hybrid model, its ability to discriminate between known states was significantly worse than that of the hybrid model. 
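The duration relation of Equation (3) shows how mean state durations of the kind quoted above translate into designed self-transition probabilities; the sketch below assumes one HMM step per 4-second feature window, and the durations are the illustrative ones from this section.

```python
# Equation (3): d[omega_i] = 1 / (1 - p_ii), hence p_ii = 1 - 1/d,
# where d is the expected state duration in HMM steps.
def self_transition(mean_duration_windows):
    return 1.0 - 1.0 / mean_duration_windows

window_seconds = 4.0

# Known faults: 1 hour mean duration -> d = 900 windows.
p_known_fault = self_transition(3600.0 / window_seconds)

# Unknown transients: 10 second mean duration -> d = 2.5 windows.
p_unknown = self_transition(10.0 / window_seconds)

print(p_known_fault)  # close to 1: known faults persist once entered
print(p_unknown)      # 0.6: unknown transients die out within a few windows
```

This is why the designed Markov matrix suppresses brief excursions into the unknown state while allowing sustained known faults to be tracked.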
\n\nThe hybrid model has been empirically tested on a variety of other conditions where various \"known\" faults are omitted from the discriminative training step and then presented to the model during testing: in all cases, the anomalous unknown state was detected by the model, i.e., classified as a state which the model had not seen before. \n\n5  APPLICATION ISSUES \n\nThe model described here is currently being integrated into an interactive antenna health monitoring software tool for use by operations personnel at all new DSN antennas. The first such antenna is currently being built at the Goldstone (California) DSN site and is scheduled for delivery to DSN operations in late 1994. Similar antennas, also equipped with fault detectors of the general nature described here, will be constructed at the DSN ground station complexes in Spain and Australia in the 1995-96 time-frame. \n\nThe ability to detect previously unseen transient behaviour has important practical consequences: as well as being used to warn operators of servo problems in real-time, the model will also be used as a filter to a data logger to record interesting and anomalous servo data on a continuous basis. Hence, potentially novel system characteristics can be recorded for correlation with other antenna-related events (such as maser problems, receiver lock drop during RF feedback tracking, etc.) for later analysis to uncover the true cause of the anomaly. A long-term goal is to develop an algorithm which can automatically analyse the data which have been classified into the unknown state and extract distinct sub-classes which can be added as new explicit states to the HMM monitoring system in a dynamic fashion. 
\nStolcke and Omohundro (1993) have described an algorithm which dynamically creates a state model for HMMs for the case of discrete-valued features. The case of continuous-valued features is considerably more subtle and may not be solvable unless one makes significant prior assumptions regarding the nature of the data-generating mechanism. \n\n6  CONCLUSION \n\nA simple hybrid classifier was proposed for novelty detection within a probabilistic framework. Although presented in the context of hidden Markov models for fault detection, the proposed scheme is perfectly general for generic classification applications. For example, it would seem highly desirable that fielded automated medical diagnosis systems (such as the various neural network models which have been proposed in the literature) should always contain a \"novelty-detection\" component in order that novel data are identified and appropriately classified by the system. \n\nThe primary weakness of the methodology proposed in this paper is the necessity for prior knowledge in the form of densities for the feature values given the unknown state. The alternative approach is not to explicitly model the data from the unknown state but to use some form of thresholding on the input densities from the known states (Aitchison, Habbema, and Kay, 1977; Dubuisson and Masson, 1993). However, direct specification of threshold levels is itself problematic. In this sense, the specification of prior densities can be viewed as a method for automatically determining the appropriate thresholds (via Equation (2)). \n\nAs a final general comment, it is worth noting that online learning systems must use some form of novelty detection. 
Hence, hybrid generative-discriminative models (a simple form of which has been proposed here) may be a useful framework for modelling online learning. \n\nAcknowledgements \n\nThe author would like to thank Jeff Mellstrom, Paul Scholtz, and Nancy Xiao for assistance in data acquisition and analysis. The research described in this paper was performed at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration, and was supported in part by ARPA under grant number N00014-92-J-1860. \n\nReferences \n\nR. Patton, P. Frank, and R. Clark (eds.), Fault Diagnosis in Dynamic Systems: Theory and Application, New York, NY: Prentice Hall, 1989. \n\nP. Smyth and J. Mellstrom, 'Fault diagnosis of antenna pointing systems using hybrid neural networks and signal processing techniques,' in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, R. P. Lippmann (eds.), San Mateo, CA: Morgan Kaufmann, pp.667-674, 1992. \n\nP. Smyth, 'Hidden Markov models for fault detection in dynamic systems,' Pattern Recognition, vol.27, no.1, in press. \n\nM. D. Richard and R. P. Lippmann, 'Neural network classifiers estimate Bayesian a posteriori probabilities,' Neural Computation, 3(4), pp.461-483, 1992. \n\nJ. Miller, R. Goodman, and P. Smyth, 'On loss functions which minimize to conditional expected values and posterior probabilities,' IEEE Transactions on Information Theory, vol.39, no.4, pp.1404-1408, July 1993. \n\nP. Smyth, 'Probability density estimation and local basis function neural networks,' in Computational Learning Theory and Natural Learning Systems, T. Petsche, M. Kearns, S. Hanson, R. Rivest (eds.), Cambridge, MA: MIT Press, 1994. \n\nP. Smyth and J. 
Mellstrom, 'Failure detection in dynamic systems: model construction without fault training data,' Telecommunications and Data Acquisition Progress Report, vol.112, pp.37-49, Jet Propulsion Laboratory, Pasadena, CA, February 15th 1993. \n\nA. Stolcke and S. Omohundro, 'Hidden Markov model induction by Bayesian merging,' in Advances in Neural Information Processing Systems 5, C. L. Giles, S. J. Hanson and J. D. Cowan (eds.), San Mateo, CA: Morgan Kaufmann, pp.11-18, 1993. \n\nJ. Aitchison, J. D. F. Habbema, and J. W. Kay, 'A critical comparison of two methods of statistical discrimination,' Applied Statistics, vol.26, pp.15-25, 1977. \n\nB. Dubuisson and M. Masson, 'A statistical decision rule with incomplete knowledge about the classes,' Pattern Recognition, vol.26, no.1, pp.155-165, 1993. \n", "award": [], "sourceid": 805, "authors": [{"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}