{"title": "Graphical Models for Recognizing Human Interactions", "book": "Advances in Neural Information Processing Systems", "page_first": 924, "page_last": 930, "abstract": null, "full_text": "Graphical Models for Recognizing Human Interactions\n\nNuria M. Oliver, Barbara Rosario and Alex Pentland\n\nMedia Arts and Sciences Laboratory, MIT\n20 Ames Street, E15-384C\nCambridge, MA 02139\n\n{nuria, rosario, sandy}@media.mit.edu\n\nAbstract\n\nWe describe a real-time computer vision and machine learning system for modeling and recognizing human actions and interactions. Two different domains are explored: recognition of two-handed motions in the martial art 'Tai Chi', and multiple-person interactions in a visual surveillance task. Our system combines top-down with bottom-up information using a feedback loop, and is formulated with a Bayesian framework. Two different graphical models (HMMs and Coupled HMMs) are used for modeling both individual actions and multiple-agent interactions, and CHMMs are shown to work more efficiently and accurately for a given amount of training. Finally, to overcome the limited amounts of training data, we demonstrate that 'synthetic agents' (Alife-style agents) can be used to develop flexible prior models of the person-to-person interactions.\n\n1  INTRODUCTION\n\nWe describe a real-time computer vision and machine learning system for modeling and recognizing human behaviors in two different scenarios: (1) complex, two-handed action recognition in the martial art of Tai Chi and (2) detection and recognition of individual human behaviors and multiple-person interactions in a visual surveillance task. 
In the latter case, the system is particularly concerned with detecting when interactions between people occur, and classifying them.\n\nGraphical models, such as Hidden Markov Models (HMMs) [6] and Coupled Hidden Markov Models (CHMMs) [3, 2], seem appropriate for modeling and classifying human behaviors because they offer dynamic time warping, a well-understood training algorithm, and a clear Bayesian semantics for both individual (HMMs) and interacting or coupled (CHMMs) generative processes. A major problem with this data-driven statistical approach, especially when modeling rare or anomalous behaviors, is the limited number of training examples. A major emphasis of our work, therefore, is on efficient Bayesian integration of prior knowledge with evidence from data. We will show that for situations involving multiple independent (or partially independent) agents the Coupled HMM approach generates much better results than traditional HMM methods.\n\nIn addition, we have developed a synthetic agent or Alife modeling environment for building and training flexible a priori models of various behaviors using software agents. Simulation with these software agents yields synthetic data that can be used to train prior models. These prior models can then be used recursively in a Bayesian framework to fit real behavioral data.\n\nThis synthetic agent approach is a straightforward and flexible method for developing prior models, one that does not require strong analytical assumptions to be made about the form of the priors¹. In addition, it has allowed us to develop robust models even when there are only a few examples of some target behaviors. 
In our experiments we have found that by combining such synthetic priors with limited real data we can easily achieve very high accuracies in recognizing different human-to-human interactions.\n\nThe paper is structured as follows: section 2 presents an overview of the system; the statistical models used for behavior modeling and recognition are described in section 3. Section 4 contains experimental results in two different real situations. Finally, section 5 summarizes the main conclusions and our future lines of research.\n\n2  VISUAL INPUT\n\nWe have experimented using two different types of visual input. The first is a real-time, self-calibrating 3-D stereo blob tracker (used for the Tai Chi scenario) [1], and the second is a real-time blob-tracking system [5] (used in the visual surveillance task). In both cases an Extended Kalman filter (EKF) tracks the blobs' location, coarse shape, color pattern, and velocity. This information is represented as a low-dimensional, parametric probability distribution function (PDF) composed of a mixture of Gaussians, whose parameters (sufficient statistics and mixing weights for each of the components) are estimated using Expectation Maximization (EM).\n\nThis visual input module detects and tracks moving objects (body parts in Tai Chi and pedestrians in the visual surveillance task) and outputs a feature vector describing their motion, heading, and spatial relationship to all nearby moving objects. These output feature vectors constitute the temporally ordered stream of data input to our stochastic state-based behavior models. Both HMMs and CHMMs, with varying structures depending on the complexity of the behavior, are used for classifying the observed behaviors. 
\n\nBoth top-down and bottom-up flows of information are continuously managed and combined for each moving object within the scene. The Bayesian graphical models offer a mathematical framework for combining the observations (bottom-up) with complex behavioral priors (top-down) to provide expectations that will be fed back to the input visual system.\n\n3  VISUAL UNDERSTANDING VIA GRAPHICAL MODELS: HMMs and CHMMs\n\nStatistical directed acyclic graphs (DAGs) or probabilistic inference networks (PINs hereafter) can provide a computationally efficient solution to the problem of time series analysis and modeling. HMMs and some of their extensions, in particular CHMMs, can be viewed as a particular and simple case of temporal PIN or DAG. Graphically, Markov Models are often depicted 'rolled-out in time' as Probabilistic Inference Networks, such as in figure 1. PINs present important advantages that are relevant to our problem: they can handle incomplete data as well as uncertainty; they are trainable, and overfitting is easier to avoid; they encode causality in a natural way; there are algorithms for both doing prediction and probabilistic inference; they offer a framework for combining prior knowledge and data; and finally they are modular and parallelizable.\n\nTraditional HMMs offer a probabilistic framework for modeling processes that have structure in time. They offer clear Bayesian semantics, efficient algorithms for state and parameter estimation, and they automatically perform dynamic time warping. An HMM is essentially a quantization of a system's configuration space into a small number of discrete states, together with probabilities for transitions between states. A single finite discrete variable indexes the current state of the system. Any information about the history of the process needed for future inferences must be reflected in the current value of this state variable.\n\n¹ Note that our priors have the same form as our posteriors, namely, they are graphical models.\n\nFigure 1: Graphical representation of an HMM and a CHMM rolled-out in time (panels: Hidden Markov Model; Coupled Hidden Markov Model).\n\nHowever, many interesting real-life problems are composed of multiple interacting processes, and thus merit a compositional representation of two or more variables. This is typically the case for systems that have structure both in time and space. With a single state variable, Markov models are ill-suited to these problems. In order to model these interactions a more complex architecture is needed.\n\nExtensions to the basic Markov model generally increase the memory of the system (durational modeling), providing it with compositional state in time. We are interested in systems that have compositional state in space, e.g., more than one simultaneous state variable. It is well known that the exact solution of extensions of the basic HMM to 3 or more chains is intractable. In those cases approximation techniques are needed ([7, 4, 8, 9]). However, it is also known that there exists an exact solution for the case of 2 interacting chains, as is our case [7, 2].\n\nWe therefore use two Coupled Hidden Markov Models (CHMMs) for modeling two interacting processes, whether they are separate body parts or individual humans. 
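As a rough numerical illustration of coupling between two chains, the sketch below scores one pair of state sequences by combining, at each time step, each chain's own transition term, the cross-chain term from the other chain's previous state, and the output terms. This mirrors the structure of a two-chain CHMM posterior up to the normalizing constant P(O). The function name, the argument names and the array layout (rows index the previous state, columns the current state) are assumptions made for this example and are not taken from the original system.

```python
import numpy as np

def chmm_path_log_score(path_a, path_b, prior_a, prior_b,
                        within_a, within_b, b_to_a, a_to_b,
                        obs_ll_a, obs_ll_b):
    # Unnormalized log score of a joint state path for two coupled chains.
    # within_x[i, j]: P(chain x in state j at t | chain x in state i at t-1)
    # b_to_a[i, j]:   P(chain a in state j at t | chain b in state i at t-1)
    # a_to_b[i, j]:   P(chain b in state j at t | chain a in state i at t-1)
    # obs_ll_x[t, j]: log output probability of observation t in state j
    score = np.log(prior_a[path_a[0]]) + obs_ll_a[0, path_a[0]]
    score += np.log(prior_b[path_b[0]]) + obs_ll_b[0, path_b[0]]
    for t in range(1, len(path_a)):
        a_prev, a_cur = path_a[t - 1], path_a[t]
        b_prev, b_cur = path_b[t - 1], path_b[t]
        # Within-chain transitions.
        score += np.log(within_a[a_prev, a_cur]) + np.log(within_b[b_prev, b_cur])
        # Cross-chain (coupling) terms.
        score += np.log(b_to_a[b_prev, a_cur]) + np.log(a_to_b[a_prev, b_cur])
        # Output terms for both chains.
        score += obs_ll_a[t, a_cur] + obs_ll_b[t, b_cur]
    return float(score)

# Toy usage: two 2-state chains with made-up uniform parameters.
half = np.full((2, 2), 0.5)
prior = np.array([0.5, 0.5])
flat_ll = np.zeros((3, 2))
toy_score = chmm_path_log_score([0, 1, 1], [1, 1, 0], prior, prior,
                                half, half, half, half, flat_ll, flat_ll)
```

MAP decoding would search over all joint paths for the highest such score; the sketch only evaluates a single given path.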
\n\nIn this architecture, state chains are coupled via matrices of conditional probabilities modeling causal (temporal) influences between their hidden state variables. The graphical representation of CHMMs is shown in figure 1. From the graph it can be seen that for each chain, the state at time t depends on the state at time t - 1 in both chains. The influence of one chain on the other is through a causal link.\n\nIn this paper we compare performance of HMMs and CHMMs for maximum a posteriori (MAP) state estimation. We compute the most likely sequence of states S within a model given the observation sequence $O = \\{o_1, \\ldots, o_n\\}$. This most likely sequence is obtained by $S = \\arg\\max_S P(S|O)$.\n\nIn the case of HMMs the posterior state sequence probability P(S|O) is given by\n\n$$P(S|O) = P_{s_1}\\, p_{s_1}(o_1) \\prod_{t=2}^{T} p_{s_t}(o_t)\\, P_{s_t|s_{t-1}} \\tag{1}$$\n\nwhere $S = \\{a_1, \\ldots, a_N\\}$ is the set of discrete states and $s_t \\in S$ corresponds to the state at time t. $P_{i|j} \\equiv P_{s_t=a_i|s_{t-1}=a_j}$ is the state-to-state transition probability (i.e. the probability of being in state $a_i$ at time t given that the system was in state $a_j$ at time t - 1). In the following we will write them as $P_{s_t|s_{t-1}}$. $P_i \\equiv P_{s_1=a_i} = P_{s_1}$ are the prior probabilities for the initial state. Finally, $p_i(o_t) \\equiv p_{s_t=a_i}(o_t) = p_{s_t}(o_t)$ are the output probabilities for each state².\n\n² The output probability is the probability of observing $o_t$ given state $a_i$ at time t.\n\nFor CHMMs we need to introduce another set of probabilities, $P_{s_t|s'_{t-1}}$, which correspond to the probability of state $s_t$ at time t in one chain given that the other chain (denoted hereafter by superscript ') was in state $s'_{t-1}$ at time t - 1. These new probabilities express the causal influence (coupling) of one chain to the other. The posterior state probability for CHMMs is expressed as\n\n$$P(S|O) = \\frac{P_{s_1}\\, p_{s_1}(o_1)\\, P_{s'_1}\\, p_{s'_1}(o'_1)}{P(O)} \\times \\prod_{t=2}^{T} P_{s_t|s_{t-1}}\\, P_{s'_t|s'_{t-1}}\\, P_{s'_t|s_{t-1}}\\, P_{s_t|s'_{t-1}}\\, p_{s_t}(o_t)\\, p_{s'_t}(o'_t) \\tag{2}$$\n\nwhere $s_t, s'_t; o_t, o'_t$ denote states and observations for each of the Markov chains that compose the CHMMs.\n\nIn [2] a deterministic approximation for maximum a posteriori (MAP) state estimation is introduced. It enables fast classification and parameter estimation via EM, and also obtains an upper bound on the cross entropy with the full (combinatoric) posterior which can be minimized using a subspace that is linear in the number of state variables. An \"N-heads\" dynamic programming algorithm samples from the O(N) highest probability paths through a compacted state trellis, with complexity $O(T(CN)^2)$ for C chains of N states apiece observing T data points. The Cartesian product equivalent HMM would involve a combinatoric number of states, typically requiring $O(T N^{2C})$ computations. We are particularly interested in efficient, compact algorithms that can perform in real-time.\n\n4  EXPERIMENTAL RESULTS\n\nOur first experiment is with a version of Tai Chi Ch'uan (a Chinese martial and meditative art) that is practiced while sitting. Using our self-calibrating, 3-D stereo blob tracker [1], we obtained 3D hand tracking data for three Tai Chi gestures involving two, semi-independent arm motions: the left single whip, the left cobra, and the left brush knee. Figure 2 illustrates one of the gestures and the blob-tracking. 
A detailed description of this set of Tai Chi experimental results can be found in [3] and viewed at http://nuria.www.media.mit.edu/~nuria/chmm/taichi.html.\n\nFigure 2: Selected frames from 'left brush knee.'\n\nWe collected 52 sequences, roughly 17 of each gesture, and created a feature vector consisting of the 3-D (x, y, z) centroid (mean position) of each of the blobs that characterize the hands. The resulting six-dimensional time series was used for training both HMMs and CHMMs.\n\nWe used the best trained HMMs and CHMMs (using 10-fold cross-validation) to classify the full data set of 52 gestures. The Viterbi algorithm was used to find the maximum likelihood model for HMMs and CHMMs. Two-thirds of the testing data had not been seen in training, including gestures performed at varying speeds and from slightly different views. It can be seen from the classification accuracies, shown in table 1, that the CHMMs outperform the HMMs. This difference is not due to intrinsic modeling power, however; from earlier experiments we know that when a large number of training samples is available then HMMs can reach similar accuracies. We conclude thus that for data where there are two partially-independent processes (e.g., coordinated but not exactly linked), the CHMM method requires much less training to achieve a high classification accuracy.\n\nTable 1: Recognition accuracies for HMMs and CHMMs on Tai Chi gestures. The expressions in parentheses correspond to the number of parameters of the largest best-scoring model.\n\nRecognition Results on Tai Chi Gestures\n\n|          | Single HMMs       | Coupled HMMs (CHMMs) |\n| Accuracy | 69.2% (25+30+180) | 100% (27+18+54)      |\n\nTable 1 illustrates the source of this training advantage. The numbers in parentheses correspond to the number of degrees of freedom in the largest best-scoring model: state-to-state probabilities + output means + output covariances. The conventional HMM has a large number of covariance parameters because it has a 6-D output variable, whereas the CHMM architecture has two 3-D output variables. In consequence, due to their larger dimensionality HMMs need much more training data than equivalent CHMMs before yielding good generalization results.\n\nOur second experiment was with a pedestrian video surveillance task³; the goal was first to recognize typical pedestrian behaviors in an open plaza (e.g., walk from A to B, run from C to D), and second to recognize interactions between the pedestrians (e.g., person X greets person Y). The task is to reliably and robustly detect and track the pedestrians in the scene. We use in this case 2-D blob features for modeling each pedestrian. In our system one of the main cues for clustering the pixels into blobs is motion, because we have a static background with moving objects. To detect these moving objects we build an eigenspace that models the background. Depending on the dynamics of the background scene the system can adaptively relearn the eigenbackground to compensate for changes such as big shadows.\n\nThe trajectories of each blob are computed and saved into a dynamic track memory. Each trajectory has an associated first-order EKF that predicts the blob's position and velocity in the next frame. As before, the appearance of each blob is modeled by a Gaussian PDF in RGB color space, allowing us to handle occlusions. 
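The per-trajectory prediction of a blob's position and velocity in the next frame can be sketched with a linear constant-velocity Kalman prediction step. This is a simplification of the first-order EKF mentioned above, not the tracker's actual code: the state layout [x, y, vx, vy], the time step and the process-noise value are illustrative assumptions, and the measurement update is omitted.

```python
import numpy as np

def predict_constant_velocity(state, cov, dt=1.0, process_noise=1e-2):
    # state: [x, y, vx, vy] for one blob; cov: 4x4 state covariance.
    # F moves the position forward by velocity * dt and keeps velocity fixed.
    F = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    # Hypothetical isotropic process noise; a real tracker would tune this.
    Q = process_noise * np.eye(4)
    new_state = F @ state
    new_cov = F @ cov @ F.T + Q
    return new_state, new_cov

# Toy usage: a blob at the origin moving one pixel right and two pixels up per frame.
pred_state, pred_cov = predict_constant_velocity(
    np.array([0.0, 0.0, 1.0, 2.0]), np.eye(4))
```

In a full filter this prediction would be followed by a correction against the blob measured in the new frame.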
\n\nFigure 3: Typical image from pedestrian plaza. Background mean image, input image with blob bounding boxes, and blob segmentation image.\n\nThe behaviors we examine are generated by pedestrians walking in an open outdoor environment. Our goal is to develop a generic, compositional analysis of the observed behaviors in terms of states and transitions between states over time in such a manner that (1) the states correspond to our common sense notions of human behaviors, and (2) they are immediately applicable to a wide range of sites and viewing situations. Figure 3 shows a typical image for our pedestrian scenario, the pedestrians found, and the final segmentation. Two people (each modeled as its own generative process) may interact without wholly determining each other's behavior. Instead, each of them has its own internal dynamics and is influenced (either weakly or strongly) by others. The probabilities $P_{s_t|s'_{t-1}}$ and $P_{s'_t|s_{t-1}}$ from equation 2 describe this kind of interaction, and CHMMs are intended to model them in as efficient a manner as is possible.\n\nWe would like to have a system that will accurately interpret behaviors and interactions within almost any pedestrian scene with at most minimal training. As we have already mentioned, one critical problem is the generation of models that capture our prior knowledge about human behavior. To achieve this goal we have developed a modeling environment that uses synthetic agents to mimic pedestrian behavior in a virtual environment. The agents can be assigned different behaviors and they can interact with each other as well. Currently they can generate 5 different interacting behaviors and various kinds of individual behaviors (with no interaction). These behaviors are: following, meet and walk together (inter1); approach, meet and go on separately (inter2) or go on together (inter3); change direction in order to meet, approach, meet and continue together (inter4) or go on separately (inter5). The parameters of this virtual environment are modeled using data drawn from a 'generic' set of real scenes.\n\n³ Further information about this system can be found at http://www.vismod.www.media.mit.edu/~nuria/humanBehavior/humanBehavior.html\n\nBy training the models of the synthetic agents to have good generalization and invariance properties, we can obtain flexible prior models for use when learning the human behavior models from real scenes. Thus the synthetic prior models allow us to learn robust behavior models from a small number of real behavior examples. This capability is of special importance in a visual surveillance task, where typically the behaviors of greatest interest are also the rarest.\n\nTo test our behavior modeling in the pedestrian scenario, we first used the detection and tracking system previously described to obtain 2-D blob features for each person in several hours of video. More than 20 examples of following and the first two types of meeting behaviors were detected and processed.\n\nCHMMs were then used for modeling three different behaviors: following, meet and continue together, and meet and go on separately. Furthermore, an interaction versus no interaction detection test was also performed (HMMs performed so poorly at this task that their results are not reported). 
In addition to velocity, heading, and position, the feature vectors consisted of the derivative of the relative distance between two agents, their degree of alignment (dot product of their velocity vectors) and the magnitude of the difference in their velocity vectors.\n\nWe tested on this video data using models trained with two types of data: (1) 'Prior-only models', that is, models learned entirely from our synthetic-agents environment and then applied directly to the real data with no additional training or tuning of the parameters; and (2) 'Posterior models', or prior-plus-real data behavior models trained by starting with the prior-only model and then 'tuning' the models with data from this specific site, using eight examples of each type of interaction. Recognition accuracies for both these 'prior' and 'posterior' CHMMs are summarized in table 2. It is noteworthy that with only 8 training examples, the recognition accuracy on the remaining data could be raised to 100%. This demonstrates the ability to accomplish extremely rapid refinement of our behavior models from the initial a priori models.\n\nTable 2: Accuracies on real pedestrian data, (a) only a priori models, (b) posterior models (with site-specific training)\n\nAccuracy (%) on Real Pedestrian Data\n\n|                     | No-inter | Inter1 | Inter2 | Inter3 |\n| (a) Prior CHMMs     | 90.9     | 93.7   | 100    | 100    |\n| (b) Posterior CHMMs | 100      | 100    | 100    | 100    |\n\nIn a visual surveillance system the false alarm rate is often as important as the classification accuracy⁴. To analyze this aspect of our system's performance, we calculated the system's ROC curve. For accuracies of 95% the false alarm rate was less than 0.01. 
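The three relational features used in the pedestrian feature vectors (derivative of the relative distance, alignment, and magnitude of the velocity difference) can be sketched for two tracked agents as follows. The function name, the use of finite differences to estimate velocities from positions, and the unnormalized dot product for alignment are assumptions of this example; the paper does not spell out these details.

```python
import numpy as np

def pairwise_features(pos_a, pos_b, dt=1.0):
    # pos_a, pos_b: arrays of shape (T, 2) with the 2-D positions of two agents.
    pos_a = np.asarray(pos_a, dtype=float)
    pos_b = np.asarray(pos_b, dtype=float)
    # Finite-difference velocity estimates for each agent.
    vel_a = np.gradient(pos_a, dt, axis=0)
    vel_b = np.gradient(pos_b, dt, axis=0)
    # Relative distance between the agents and its time derivative.
    dist = np.linalg.norm(pos_a - pos_b, axis=1)
    d_dist = np.gradient(dist, dt)
    # Alignment: dot product of the two velocity vectors at each frame.
    alignment = np.sum(vel_a * vel_b, axis=1)
    # Magnitude of the difference between the velocity vectors.
    vel_diff = np.linalg.norm(vel_a - vel_b, axis=1)
    return np.column_stack([d_dist, alignment, vel_diff])

# Toy usage: two agents walking side by side at the same speed.
steps = np.arange(4.0)
walker_a = np.column_stack([steps, np.zeros(4)])
walker_b = np.column_stack([steps, np.ones(4)])
side_by_side = pairwise_features(walker_a, walker_b)
```

For agents walking in parallel the distance derivative and velocity difference stay at zero while the alignment stays positive, which is the kind of signature a following or walking-together model could pick up.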
\n\n⁴ In an ideal automatic surveillance system, all the targeted behaviors should be detected with a close-to-zero false alarm rate, so that we can reasonably alert a human operator to examine them further.\n\n5  SUMMARY, CONCLUSIONS AND FUTURE WORK\n\nIn this paper we have described a computer vision system and a mathematical modeling framework for recognizing different human behaviors and interactions in two different real domains: human actions in the martial art of Tai Chi and human interactions in a visual surveillance task. Our system combines top-down with bottom-up information in a closed feedback loop, with both components employing a statistical Bayesian approach.\n\nTwo different state-based statistical learning architectures, namely HMMs and CHMMs, have been proposed and compared for modeling behaviors and interactions. The superiority of the CHMM formulation has been demonstrated in terms of both training efficiency and classification accuracy. A synthetic agent training system has been created in order to develop flexible prior behavior models, and we have demonstrated the ability to use these prior models to accurately classify real behaviors with no additional training on real data. This fact is especially important, given the limited amount of training data available. 
\n\nFuture directions under current investigation include: extending our agent interactions to more than two interacting processes; developing a hierarchical system where complex behaviors are expressed in terms of simpler behaviors; automatic discovery and modeling of new behaviors (both structure and parameters); automatic determination of priors, their evaluation and interpretation; developing an attentional mechanism with a foveated camera along with a more detailed representation of the behaviors; evaluating the adaptability of off-line learned behavior structures to different real situations; and exploring a sampling approach for recognizing behaviors by sampling the interactions generated by our synthetic agents.\n\nAcknowledgments\n\nSincere thanks to Michael Jordan, Tony Jebara and Matthew Brand for their inestimable help.\n\nReferences\n\n1. A. Azarbayejani and A. Pentland. Real-time self-calibrating stereo person-tracker using 3-D shape estimation from blob features. In Proceedings, International Conference on Pattern Recognition, Vienna, August 1996. IEEE.\n\n2. M. Brand. Coupled hidden Markov models for modeling interacting processes. November 1996. Submitted to Neural Computation.\n\n3. M. Brand and N. Oliver. Coupled hidden Markov models for complex action recognition. In Proceedings of IEEE CVPR97, 1996.\n\n4. Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. In D. S. Touretzky, M. C. Mozer, and M. Hasselmo, editors, NIPS, volume 8, Cambridge, MA, 1996. MIT Press.\n\n5. N. Oliver, B. Rosario, and A. Pentland. Statistical modeling of human behaviors. To appear in Proceedings of CVPR98, Perception of Action Workshop, 1998.\n\n6. L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, 1989.\n\n7. L. K. Saul and M. I. Jordan. Boltzmann chains and hidden Markov models. In G. Tesauro, D. S. Touretzky, and T. Leen, editors, NIPS, volume 7, Cambridge, MA, 1995. MIT Press.\n\n8. P. Smyth, D. Heckerman, and M. Jordan. Probabilistic independence networks for hidden Markov probability models. AI memo 1565, MIT, Cambridge, MA, Feb 1996.\n\n9. C. Williams and G. E. Hinton. Mean field networks that learn to discriminate temporally distorted strings. In Proceedings, connectionist models summer school, pages 18-22, San Mateo, CA, 1990. Morgan Kaufmann.", "award": [], "sourceid": 1560, "authors": [{"given_name": "Nuria", "family_name": "Oliver", "institution": null}, {"given_name": "Barbara", "family_name": "Rosario", "institution": null}, {"given_name": "Alex", "family_name": "Pentland", "institution": null}]}